PartialExecuter: Reducing WebAssembly size by exploring all executions in LLVM (leaningtech.com)
295 points by apignotti on March 15, 2022 | 107 comments
WebAssembly is commonly used as part of web applications, and minimizing its size is especially important.

As part of the latest release of Cheerp, our C++ to WebAssembly/JavaScript compiler, we have introduced a powerful new LLVM optimization that aggressively reduces WebAssembly output size at compile time.

We have named this optimization 'PartialExecuter'; the key idea behind it is taking advantage of known function parameters to find inner code blocks that can never possibly be executed.

Such blocks can then be completely removed from the compiled output, significantly reducing its size.

What makes this pass more powerful than typical Dead Code Elimination is its ability to reason over all the possible executions that the code can take, while being robust to memory stores and side effects. Moreover, PartialExecuter can even reason over loads, as long as they refer to read-only memory. This latter capability is especially useful for dropping code from complex functions whose behavior depends on input strings (e.g. printf).
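
To give a flavour (a made-up snippet, not taken from the post): here every printf call site uses a constant format string with no floating-point conversions, so the float-formatting paths pulled in from libc can be proven unreachable and dropped.

    // Illustrative only: both printf call sites use constant format strings
    // with no %f/%g conversions, so the floating-point formatting code that
    // libc's printf would otherwise drag into the wasm module is provably dead.
    #include <cstdio>

    int main(int argc, char **argv) {
        std::printf("argument count: %d\n", argc);
        std::printf("program name: %s\n", argv[0]);
        return 0;
    }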

We think this work may be of interest to the HN community, and we welcome feedback and questions.

In-depth blog post: https://leaningtech.com/reducing-webassembly-size-by-explori...



This company has an x86-to-WASM compiler that lets you execute arbitrary binaries in the browser.

Also a JVM to WASM transpiler.

There's a fantastic Meetup presentation given by one of them where they show running a C++ multiplayer game with both client AND server running in a browser, using WebRTC as a networking polyfill. Really mindblowing:

https://youtu.be/7JUs4c99-mo?t=167


> This company has an x86-to-WASM compiler that lets you execute arbitrary binaries in the browser.

Direct link to our latest demo in case anybody would like to see this tech in action: https://webvm.io


> running a C++ multiplayer game with both client AND server running in a browser, using WebRTC as a networking polyfill

Ok, now you have my attention. Been waiting for that possibility for a while!


If you are interested, here is the article I wrote about that particular project: https://medium.com/leaningtech/porting-a-c-multiplayer-game-...


What would the impact be on battery life compared to native execution? Can one get it just as power efficient as native execution, only slower? For computationally less intensive applications browser portability might be an excellent tradeoff.

It gives a new meaning to the slogan "write once, run everywhere".


Generally things that run faster are more power efficient, because the CPU has to run for less time to do the same work. There might be a few odd exceptions here and there, e.g. certain AVX instructions can use a lot of power, but they're not common.


I've seen several student projects that try to find "more energy efficient" compilations through random searches of compiler flags etc., but they never get any results.

In a JIT language you might be able to save time by running the JIT less often though.


Have any of these attempts taken a profile-guided approach?


Those AVX instructions use more power at least in part because they massively better utilize the silicon, so you're making a tradeoff between speed and efficiency as per usual.


Sorry, but to be pedantic for a second: do they consume more energy, which is really what I care about for battery life? (Let's assume I'm committed to asking the machine to do whatever work it is that I asked it to do, so the required computation is constant, for discussion.) My understanding of AVX is that it consumes more power, yes — but the point the parent makes is a good one: if it finishes faster, where does the energy use come out? More power over less time can mean less energy. (Or more. Depends on the additional power vs. the time saved…)
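
A back-of-the-envelope sketch with made-up numbers (energy is just power integrated over time), showing how a higher-power but faster run can still use less energy:

    // Made-up numbers, purely to illustrate energy = power * time.
    #include <cstdio>

    int main() {
        const double scalar_watts = 15.0, scalar_seconds = 3.0; // slower, lower power
        const double avx_watts    = 30.0, avx_seconds    = 1.0; // faster, higher power
        std::printf("scalar: %.0f J\n", scalar_watts * scalar_seconds); // 45 J
        std::printf("avx:    %.0f J\n", avx_watts * avx_seconds);       // 30 J
        return 0;
    }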


1. This all depends massively on the actual workload and chip; a lot of people are basically winging it when it comes to AVX power consumption stories. The rub is also that, especially on laptops, you can get convexity in all the scaling problems, e.g. you do more work, good, but then the laptop throttles and stays throttled for a while, etc.

2. The original (real) power issues with the original/early AVX-512 desktop chips have basically gone away. It's not far off the timescale of looking back to bash AMD for Bulldozer.

I'd measure it for you on my machine but I don't have an accurate power-o-meter for my computer.


Users of computationally intensive applications aren't such efficiency freaks much of the time; they tend to accept "good enough" in that department, as in other areas. In addition to cross-platform portability, they appreciate outcomes that stem from development performance/productivity (timely app delivery, app correctness and robustness, low cost, features, adaptability to feature requests, usability, etc.).


At what point does the browser become an "os", what's next? Chrome hypervisor?


I'd be much more willing to execute some binary in my browser than directly on my OS. Sure, you could do a VM or so, but a browser tab is spun up much more quickly.

The absence of friction of just executing something quickly, and being able to share it with a single text string, is what makes this appealing to people, I reckon. I dunno why people seemingly get offended by this.


> At what point does the browser become an "os", what's next?

When it no longer requires an OS to run.


So ChromeOS


ChromeOS is Linux


It happened about a decade ago.


Not sure what you mean by hypervisor, but Microsoft does have... Microsoft Defender Application Guard, I believe it's called. It's a totally sandboxed version of Edge.


This sounds like ActiveX. Have the security issues related to ActiveX been solved or not? Because if they haven't, then it's just a reskin of ActiveX and malware authors are gonna have a field day.


HN is doing its typical “well, technically,” thing in the replies here, so I’ll give a better answer: yes, the problems that plagued ActiveX are gone now. WebAssembly hosts still have vulnerabilities from time to time, but they are not literally by design like they were with ActiveX. They’re typically complicated, and doing the kinds of things ActiveX modules could do completely ordinarily would require breaking many sandbox layers.

The discussion is also breaking into an unrelated branch about security within WebAssembly, so it’s worth pointing out that the security model of WebAssembly is that it should prevent the host machine from being attacked by code running in the sandbox, not that it makes code running in the sandbox any more secure. It doesn’t make code in the sandbox immune to most classes of errors; it just isolates those errors into runtime errors rather than exploitable bugs that can escalate privilege.

ActiveX, of course, did not deal with either security model, instead shrugging it off and just assuming that code signing would be a good enough deterrent to malware.

(Of course, the problems with this are manifold, including the fact that just because a module is trustworthy and not malicious does not mean it should be exposed to all users, and the fact that even if you trust the company behind the module, that doesn’t make the module any more guaranteed to be secure.)


There is a very big difference: ActiveX was native code that was literally running on your system with full access to system calls.

CheerpX is a Virtual Machine environment, it JIT compiles Wasm code from x86 binaries and it is fully sandboxed by the browser. It _cannot_ access your system even if it tried.


> It _cannot_ access your system even if it tried.

That’s the intent, at least. I’m sure we’ll get to see exploits that manage to do exactly this.


Perhaps, but if so, that would be a bug in the browser’s WebAssembly implementation, not CheerpX.


CheerpX looks amazing. Not blaming that project it in any way.

But rowhammer is still a thing.

There's a whole stack of abstractions that all _may be_ vulnerable. I'm sure CheerpX is very good, but there's no way to _know_ that all the dependencies, from the toolchain used to build it all the way down to the running environment, are actually bug free.

As a first line of investigation, I'd suspect CheerpX, just because so many eyes look at browser sandboxes. _shrug_ your mileage may vary.


You can Rowhammer from JavaScript already, so this really has nothing to do with CheerpX.


Very true. Let me try to restate.

I don't know how to prove the absence of a thing. I can only prove existence.

I was trying to highlight that every layer of abstraction has vulnerabilities all the way down to the hardware level.

I'm perfectly willing to accept that CheerpX has no known vulnerabilities.


It may have more to do with the idea that since CheerpX runs in the browser, the threat model of CheerpX is the same as the threat model of using your web browser to browse any other site.


It can still be exploited the Applets/Flash way, by forcing the internal memory to become corrupted and with it change its behaviour.


Right, you can corrupt memory and thereby alter behavior, but that memory is a managed array. You can't corrupt anything outside of the application's own memory. As you say, you can change behavior, and thereby do whatever you want with whatever the wasm application is allowed to access. Which is a very limited set of things.
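
A simplified mental model (not actual engine code): the guest's whole address space is one bounds-checked array owned by the host, so a corrupted "pointer" can only trap or land somewhere else inside that same array.

    // Simplified model of wasm linear memory, not real VM code.
    #include <cstdint>
    #include <stdexcept>
    #include <vector>

    struct LinearMemory {
        std::vector<uint8_t> bytes;  // the guest's entire "address space"

        uint8_t load(uint32_t addr) const {
            if (addr >= bytes.size())
                throw std::out_of_range("trap");  // out-of-bounds access traps
            return bytes[addr];  // worst case: garbage inside the sandbox,
        }                        // never host memory outside the vector
    };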

Likening it to Java applets or Flash is deceptive -- yes, you can still hack them and exploit vulnerabilities. But the scope of what you can do with such vulnerabilities is wildly, dramatically different. Even when sandboxed, Flash has an enormously wider attack surface to play with. WebAssembly has barely anything.

It's like the difference between patching a leak in your roof with a sponge vs tar paper. In theory, water could find a path through the tar paper.


Imagine a WASM module used to control security authentication in the browser, or controlling IoT devices in a factory, now that it is fashionable to run WASM outside of the browser.


Right, I agree that wasm being used to control nuclear launches is not fundamentally better than native sandboxed C++ code in terms of preventing unwanted nuclear launches.

But wasm being used to control the brightness pattern of a blinky light on the console of the machine that controls nuclear launches? That is fundamentally safer than native sandboxed C++ code being used to control the blinky light.

I'm guessing we don't actually disagree on anything here -- I also feel like people are making unwarranted assumptions that wasm gives you more safety than it actually does. (It reminds me of another incorrect assumption that seems to get made a lot, that running unsafe code in a VM means you don't need to worry that it'll escape to the host or other VMs on that host.)


> But wasm being used to control the brightness pattern of a blinky light on the console of the machine that controls nuclear launches? That is fundamentally safer than native sandboxed C++ code being used to control the blinky light.

How so? There are established techniques for developing safety-critical software, and they don't tend to rely on the assumption that a sophisticated JIT compiler is free of bugs.

Or do you mean untrusted code for controlling the blinky light?


If you're following the proper techniques for safety-critical software, then none of this matters; it makes no difference whether you're using wasm or straight C++.

So yes, untrusted code, but "untrusted" is ambiguous. There's Web-style untrusted code, where the code could be malicious. Then there's unverified code, where the author is presumed to have good intentions but you can't trust the correctness. WebAssembly is useful for both as long as your outputs are properly constrained (as in, it's the blinky light case, not the nuclear launch case.)

The distinctions aren't even 100%. Perhaps your adversary, shortly before launching their nukes, triggers the malware in your blinky light controller to blink at the exact frequency that you know will send the operator into a seizure, preventing them from triggering the counter-launch. You know, because you surreptitiously tested all the people a decade ago when they were applying for the nuclear launch controlling position, and did character assassination on all those who were not susceptible to blinky light seizures... ;-)


> If you're following the proper techniques for safety-critical software, then none of this matters; it makes no difference whether you're using wasm or straight C++.

Under what circumstances would it be appropriate to use WASM in a safety-critical system?

To my knowledge there's no WASM implementation intended or approved for use in safety-critical systems.

> WebAssembly is useful for both

I don't think so. I can't see a reason to let lives hang in the balance of a WASM implementation being free of bugs. Use an approved C/C++/Ada compiler and be done with it.

You're right that WASM is pretty robust against malicious code that aims to escape the sandbox. Same goes for JavaScript. Web browsers are of course very motivated to focus on the security of their language engines.

> Perhaps your adversary, shortly before launching their nukes, triggers the malware in your blinky light controller to blink at the exact frequency that you know will send the operator into a seizure

I don't see WASM having a place in the development of ultra-low-defect software. If you're serious about that kind of thing you use a language and framework like SPARK Ada.

I agree though that WASM may be useful in mitigating the consequences of undefined behaviour in buggy C/C++ code, in some circumstances. In this regard it's nothing new. C/C++ can be compiled to just about anything, including other sandbox-oriented languages like JavaScript. WASM just improves the performance. Linux has been 'ported' to JavaScript at least twice. [0][1]

[0] https://bellard.org/jslinux/tech.html

[1] https://medium.com/@retrage/lkl-js-running-linux-kernel-on-j...


Yes, it is those unwarranted assumptions that irk me, as then you see talks about how everything is "magical" with WASM, when it is just yet another bytecode format.


Right, the only 'magical' thing about WASM is that it has browser support. On the plus side, browsers have a comparatively good track-record as secure sandboxes. Far better than the JVM, say.


You could also imagine a native executable doing the same, which in most cases won’t even have the level of control flow integrity protection WASM will have. I’m not clear what superior alternative you’re comparing with.


Currently WASM has fewer security protections than native sandboxes, not more. Read the security section of the standard.


It has fewer mitigations than Chrome or the Linux kernel. It has much better CFI than the vast majority of actual C++ userspace applications.


Wasm has the same security risk as executing JavaScript in your browser, except with less risk of XSS type security issues because wasm modules are better encapsulated.


It literally cannot.


Assuming the VM can't be exploited/escaped somehow, which would be a first.


I advise some courses in pentesting.


That is the thing with fake security advertising of WebAssembly.

You can just as easily compile heartbleed into WebAssembly and call it a day.

Yes, it is sandboxed, but that's hardly any different from running an OS process with a sandbox locking down which OS syscalls are allowed.

Think of it this way: just because an OS container cannot exploit its host (minus hypervisor bugs), that doesn't mean that what is inside isn't subject to security flaws.


You keep repeating this, but it's simply not true.

The syscall surface is zero by default. WASI of course expands that by a lot, but sane WASI implementations require restrictive whitelisting of available resources, like specific files or directories that can be accessed (wasmtime-wasi does this well, for example). The whole WASI spec is aiming for capability-based security, including things like only providing access to pre-opened sockets.
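
As a sketch of the capability idea (hypothetical embedder code, not wasmtime's actual API; all names below are made up): the guest only ever receives handles to resources the host explicitly pre-opened, and everything else simply doesn't exist for it.

    // Hypothetical embedder sketch, NOT a real wasmtime/WASI API.
    #include <string>
    #include <vector>

    struct Preopen { std::string host_path; std::string guest_path; };

    struct SandboxConfig {
        std::vector<Preopen> dirs;    // explicit whitelist of directories
        bool inherit_stdout = false;  // everything not listed is denied
    };

    int main() {
        SandboxConfig cfg;
        cfg.dirs.push_back({"/srv/app/data", "/data"});  // only this directory
        cfg.inherit_stdout = true;
        // run_module("app.wasm", cfg);  // hypothetical runner: the module has no
        //                               // handle for anything outside /data
        return 0;
    }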

All production deployments of server side Wasm I have seen have very tight scoping.

Sandboxing on Linux, in contrast, is a complicated mess, with a myriad of partial and leaky solutions (SELinux, Apparmor, Docker, Snap, Flatpak, ...) .

Having a good cross platform toolkit is completely out of reach.

The only OS that has comparable sandboxing that is practical is Fuchsia.


Is this true of server side only, or is there a practical solution for locking down JS within the browser?


Wasm has no access to any browser APIs by default, so yes.

You can expose the required functionality via JavaScript bridges.

The current Rust (wasm-bindgen) and C/C++ (emscripten) tooling exposes the whole browser API by default, though, which isn't ideal.
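
For example, with emscripten you can keep the exposed surface explicit by declaring only the bridges you actually need; a minimal sketch (the function name here is just illustrative):

    // Minimal sketch of an explicit JavaScript bridge with emscripten.
    // Only bridges declared like this are reachable from the C++ side;
    // the wasm module itself has no direct access to the browser API.
    #include <emscripten.h>

    EM_JS(void, show_value, (int n), {
        console.log("value from wasm:", n);  // runs on the JavaScript side
    });

    int main() {
        show_value(42);
        return 0;
    }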


Darwin has quite good sandboxing, although how to actually use it isn't documented.


And I will keep repeating it ad aeternum, given the fake security sales pitch of WebAssembly; for God's sake, even the standard has a security section mentioning possible issues not addressed.

There is no magic in WASM, unless the only thing it does is warm up a CPU.


I'm happy to have a technical discussion, but your comments always stop at "Wasm bad", without countering the arguments.


I would be interested in the previously mentioned claims like “wasm can’t protect against rowhammer” -- while perhaps not an easily exploited vulnerability, and likely every current sandbox tech is vulnerable, is there anything specific to wasm that would help here?

Also, how does wasm relate to traditional sandboxes that “just” hijack system calls? Is there a difference?


So, a lot better than ActiveX, which didn't have effective sandboxing, on top of all the problems WebAssembly still has due to (lack of) memory protection etc.


This looks great! Please consider upstreaming this into LLVM; it would benefit many other users of LLVM. I'd love to see this used in Rust, for instance.

What kind of compilation performance do you see for how long this pass takes?

Do you apply this to all functions, or to all functions with certain properties, or to functions tagged some particular way?


Thanks a lot!

I also believe it would be cool to upstream this, we will have to sit down and do some planning.

Compilation time could improve (I was actually working on this today), but it's already in line with other optimizations, taking < 10% of the time spent doing optimizations on a big codebase we use as a benchmark.

Currently it's applied to all functions, since its runtime is anyhow roughly linear in the number of instructions a function has, but in more costly versions of this (that we have on paper but have yet to implement) some logic to filter functions in advance could be used.


When upstreaming it, it might make sense to give it two mechanisms: either process every function, or process only functions labeled to use it. Then, a frontend can experiment with things like automatically detecting which functions would benefit from it, using language-specific mechanisms. Or, worst-case, frontends can provide attributes to tag functions explicitly, and libraries can tag functions known to benefit from this.


My understanding is that CheerpX, by simulating a full userspace, can optimize all the way through system libraries, which are typically very general and have lots of code that is “dead” to your application. How would this approach fail for applications where the code tends to be more specific to the task?


Seeing through system libraries sounds problematic already, since some of them, like libc, violate aliasing when you can see their implementations, even in LLVM bitcode form.


I have to double check this, but the approach should be theoretically sound since we are quite strict about which functions are considered to have known call sites: the function has to be internal (as in no visibility from outside) + no indirect uses (so no saving a pointer to the function somewhere). This should be enough to solve the problem you are thinking about.
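
Roughly, the distinction looks like this (a hand-written illustration, not code from the pass): an internal function whose address never escapes has an enumerable set of call sites, while taking its address makes the incoming arguments unknowable.

    // Illustrative sketch of the "known call-sites" conditions.
    static int only_direct(int x) {  // internal linkage, address never taken:
        return x * 2;                // every call site (and argument) is visible
    }

    static int escapes(int x) { return x + 1; }

    int (*indirect)(int) = &escapes;  // address escapes: call sites are no longer
                                      // enumerable, so arguments can't be assumed

    int use() { return only_direct(21) + indirect(1); }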


That's actually a good point -- more aggressive optimizations that peek through libraries that are not typically visible to the compiler are generally going to break things like allocators. Although, I guess something must be compiling them anyway, so perhaps it might end up OK?


Consider integrating with PGO there, at least for getting most of the gains (assuming they are about performance, not binary size) with a fraction of the cost.


> This looks great! Please consider upstreaming this into LLVM; it would benefit many other users of LLVM. I'd love to see this used in Rust, for instance.

Or in straight C++ for that matter.

Though possibly the optimisation pass relies on specific properties of the target which don't exist for non-wasm? I didn't see anything in the writeup, but that doesn't mean they don't exist.


PartialExecuter happens at the LLVM IR level, and in theory it's fully generic. It has only been partially tested outside the Cheerp pipeline, though, so I would expect it to rely on some implicit assumptions about the kind of legalizations or lowering that are done beforehand.

Target-wise, all information is kept encoded in the IR, so that part is easier (the only strange thing that might happen is bumping some alignment to 8, but I expect every target to be able to handle the prescribed IR alignment).


I'd love it if this pass were submitted upstream as well, if possible. I think a lot of ecosystems could benefit from the awesome work you did!


How does this differ from existing LLVM interprocedural optimizations? Particularly value propagation sounds like it would handle a lot of these cases.

> You might have heard of dead-code elimination, an LLVM optimization pass that removes code proven as unreachable. I was actually interested in less-obvious situations, in particular blocks that are reachable on the control flow graph, but not when considering wider execution invariants.

I've got a trivial example here: https://gcc.godbolt.org/z/7EnPG5WM6 (noinline is added to foo to demonstrate that Clang is actually changing the number of arguments foo takes, and has inlined the arguments from bar and baz).

> Can we use information on the call-sites and the corresponding format strings to prove that some code paths, for example formatting for floating point numbers, are actually never taken?

Seems to already have that effect in Clang. Using Emscripten 3.1.3, compiling two different mains with just

    printf("Hello world %d\n", argc);
And

    printf("Hello world %d %f\n", argc, volatileFloatArg);
For the single int arg, the top 3 symbols by size are

    47.7%  2.65Ki -NAN(IND)%       0    printf_core
     7.4%     421 -NAN(IND)%       0    [section Data]
     6.6%     375 -NAN(IND)%       0    main
And for the one that added a float arg,

    34.5%  3.04Ki -NAN(IND)%       0    fmt_fp
    28.8%  2.54Ki -NAN(IND)%       0    printf_core
     5.1%     464 -NAN(IND)%       0    [section Data]
So when a float argument wasn't passed into printf, it did not include fmt_fp, a delegate that handles floating point for printf

One issue that can complicate this analysis is linking multiple libraries that reference standard library (or functions in other libraries). If you're linking together object code, without more IR context, LLVM is going to have to be more conservative. So I think if you link in a WASM object code file that references printf, it won't be able to perform all the IPO that will allow it to trim the CFG.


Take this other example: https://gcc.godbolt.org/z/nebP68Tx8

Here, by playing HumanCompiler, it should be possible to prove that the if condition never evaluates to true, so removing the if is safe.

This is an example of an optimization that PartialExecuter is able to do. Note that some combinations of other optimizations might also potentially be able to get to this result (say, adding a tag "is always a power of 2"), but doing this in a general way is what PartialExecuter does well.
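
Something in the spirit of that example (a hand-rolled sketch, not the exact godbolt code): every call site passes a power of two, so the fallback branch is provably dead even though it is reachable on the control flow graph.

    // Sketch in the spirit of the godbolt example above, not the exact code.
    static unsigned slow_generic_log2(unsigned x) {
        unsigned r = 0;
        while (x >>= 1) ++r;
        return r;
    }

    static unsigned log2_of(unsigned x) {
        if (x & (x - 1))                  // true only when x is NOT a power of two
            return slow_generic_log2(x);  // provably dead: both call sites below
                                          // pass powers of two
        return __builtin_ctz(x);
    }

    unsigned caller() { return log2_of(8u) + log2_of(64u); }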

Somewhat similarly, clang might be instructed to enable a check to avoid including printf_float, and somehow detect it and exclude it (this is what happens here), but this is hardly generalizable.


Interesting, looks like this is bit-arithmetic related, as even complex expressions seem to be handled fine, as long as there are no bit ops.


This is sweet! This is actually a very similar approach to how I deobfuscate Python bytecode: https://github.com/landaire/unfuck/blob/bfa164b4e261deffeb37...

My code is pretty messy, but I take the same exact approach of taking known function parameters, interpreting the instructions, and removing any condition and the instructions which built its arguments if it evaluates to a constant value. Even called it partial execution as well :p


Congrats, I will check it later & take inspiration!


Nice work. I love simple and "conservative brute force" techniques like this--brute force in that you run all the example call sites in the interpreter, conservative in that it bails out when things get hard (e.g. bail out if a single basic block is executed too many times).

I am wondering if this technique could be hybridized with abstract interpretation or partial evaluation, which are known techniques in compilers. In essence, this would become a type of concolic execution.


The step forward that I believe worked well here is that, given some actual paths (and since they are simply sequences of instructions, they can be fed to an interpreter), they are merged or forked while entering or exiting SCCs. This allows keeping a low count of actual visits while exploring a big enough set of paths to prove that it covers all possible executions.

On the second part, I have to do some thinking and come back, thanks!


I love all these techniques where you can find static info about a program by 'running' it at compile time.

Normally you run a program at runtime (obviously) at which point you have the full environment and inputs, so you can run the program fully.

However... you can also kind of "run" a program at compile time. You can do this using some known and some unknown values in the source code, so you can "partially" run it.

https://en.wikipedia.org/wiki/Partial_evaluation
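
The textbook toy example (not from the article): when the exponent of a power function is known at compile time, the loop and the branch on it can be evaluated away, leaving only the multiplications.

    // Toy illustration of partial evaluation: n is statically known, so the
    // recursion can be unfolded at compile time and only x * x * x remains.
    constexpr double power(double x, unsigned n) {
        return n == 0 ? 1.0 : x * power(x, n - 1);
    }

    double cube(double x) {
        return power(x, 3);  // specialized for n == 3
    }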

Or you can use abstract/pretend values instead of real values, so you can "abstractly" interpret it.

https://en.wikipedia.org/wiki/Abstract_interpretation

Running at compile time lets you learn things about the program at compile time, allowing advanced optimisations and error checking.

You might realise that some code can never execute, so you can remove it (as in the project above). Or you can learn that some code is incorrect and give an error at compile time...


Optimizations like this are incredible to me. In my compiler class in college, I was smitten by how different optimizations were performed, and we only ever worked with pretty basic constant propagation and dead code elimination. The work here is incredible. My respect to the entire team.


> My respect to the entire team.

Same!


Are these new techniques only applicable to WebAssembly? If so, why?


WebAssembly implies static linking (malloc / printf & all are part of the shipped module) and code size matters since it directly affects users (there might be delays in downloading big payloads). Both factors played a role in deciding to put work into optimizations like this.

There are some general gains to be had with this optimization, and probably the same ideas were already around for a while, but putting them together in this way was helped by thinking about WebAssembly-specific constraints.


Code size always matters, and while wasm may be an extreme case, hopefully this kind of benefit can contribute to shrinking other kinds of code targets.


Not necessarily, shared libraries are on the WebAssembly roadmap.


Yes, I wanted to be more nuanced but oversimplified. Thanks for mentioning this!


GCC and Clang aren't super aggressive with DCE (dead code elimination) because most of the time the complex cases for DCE have essentially no impact on run time performance, and may be expensive to implement (i.e. significantly increase compile times).

For example, suppose you have a library that is configured with some options struct that contains a bunch of flags/settings. The behavior of the library changes based on these flags. The compiler will see a bunch of branches like "if (opts.foo) { ... }" and will generate code for all of these branches. Now let's say you statically compile this library, and in practice your code only ever has one set of options enabled. In principle the compiler could figure out which branches can be eliminated based on the single instantiation of the opts struct in your code, and eliminate dead branches. But in practice neither Clang nor GCC will actually do this kind of DCE even at -O3, because it's simply not worth it. By the way, this kind of example is exactly the kind of DCE that the blog post is talking about and could be removed by the new DCE pass implemented by the author.
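
Concretely, something like this (an illustrative sketch, not from the post): the only Options value ever constructed has use_compression == false, so the whole compression branch is dead in this program, yet -O3 will typically keep it.

    // Illustrative sketch of the options-struct case described above.
    struct Options { bool use_compression; bool verbose; };

    static int compress(int x) { return x / 2; }  // stand-ins for library code
    static void log_value(int) {}

    static int process(const Options &opts, int payload) {
        if (opts.use_compression)          // never true in this program: the only
            payload = compress(payload);   // instantiation below sets it to false
        if (opts.verbose)
            log_value(payload);
        return payload;
    }

    int run(int payload) {
        Options opts{false, true};         // the single configuration actually used
        return process(opts, payload);
    }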

How big of an impact would this kind of DCE make on performance? Well the compiled binary size will be a bit smaller, which is kind of nice. But in practice this will have almost no impact on performance. Loading and mapping an ELF file is practically instantaneous even on huge executables. The branch predictor will predict all of the options branches that are hit repeatedly at close to 100%. Eliminating the branch entirely is in theory better than having a branch with a 100% hit rate, but hard to demonstrate in real world benchmarks for all but the most critical code paths. If there are large pieces of code in the executable that are unused they'll be mapped but won't even be page faulted during program execution. There are some kind of hand wavy arguments you can make about the extra code wasting space in the icache but again it would probably be difficult to actually demonstrate the impact even in microbenchmarks.

This isn't to say that there are no benefits to more expensive DCE passes. But they're generally extremely meager, so it's not worth increasing compile times for most applications. Wasm is an exception because compiled assets need to be transferred over the network and apparently it takes longer to load wasm code than it does to map an ELF executable. It's also worth noting that Clang and GCC do a lot of other types of simpler DCE, and these simpler DCE passes can be critical for performance, so I'm not trying to suggest that DCE entirely is worthless; just that the type of DCE presented here is less useful for traditional compilation.


At least, I hope these could be applied to other VMs like the JVM. I don't see any reason why it can't be done, but I'm not an expert at all.

From my understanding, the main issue is mostly that it is very complex / risky to do these optimizations, and the reward is deemed low for the JVM.

The main JVM target I can think of where it could probably be profitable is the Android subsystem, but why would you care to reduce OS / app size when phones getting bloated is very good for your business? You also sell phones, so it's quite useful to push users to renew their phones on a regular basis.


> We have named this optimization 'PartialExecuter', the key idea behind it being taking advantage of known function parameters to find inner code blocks that cannot ever be possibly executed.

I'm wondering, what exactly are the differences of this approach from "classical" partial evaluation? This seems to be a special case of it, unless I missed something.


I am not completely sure what "classical" partial evaluation is, but probably yes, this is somewhat a special case of it. I now have quite some material to read (see other links about partial evaluation or super-compilation).


How much does this optimization save in terms of executable size on realistic examples (i.e. a full-sized code base)?

I would naively expect that for most functions you know too little about their arguments or the state in which they will be executed to be able to tell that certain branches will definitely not be taken. Especially since any function that has some mildly interesting side-effects (like loading and storing from a pointer or a class/struct field, or even _calling into any function that does that_) would seem to foil the analysis and turn large chunks of code into "unknown reachability".

Rephrasing, I would expect the set of basic blocks that can be removed just by essentially repeated constant propagation of statically-known function arguments to be pretty small. But I would be happy to be proved wrong if that intuition is not right.

Printf is somewhat of an exception, since the format strings are usually known at compile time, meaning that a lot of the control flow can indeed be deduced by compile-time evaluation, but for most functions I would not expect that.


Have you guys looked the literature of supercompilation? This seems to be a special case of it.


All analytical solutions to optimization problems can be seen as special cases of the brute force search I guess. But SC is impractically slow for anything except a few instructions long sequences.

edit: actually I was thinking of superoptimizers, I guess it's a different concept.


Superoptimizers have gotten a lot better in the past few years. E.g. have a look at [1].

[1] https://arxiv.org/abs/1211.0557


Thanks, cool stuff.

This paper from the same project was also cool: https://raw.githubusercontent.com/StanfordPL/stoke/develop/d...

Quote from abstract: "For many applications, the best possible code is conditionally correct: the optimized kernel is equal to the code that it replaces only under certain preconditions on the kernel’s inputs. The main technical challenge in producing conditionally correct optimizations is in obtaining non-trivial and useful conditions and proving conditional equivalence formally in the presence of loops. We combine abstract interpretation, decision procedures, and testing to yield a verification strategy that can address both of these problems. This approach yields a superoptimizer for x86 that in our experiments produces binaries that are often multiple times faster than those produced by production compilers"


Supercompilation is not brute force search.


Hi, author of the post here, I am not familiar with this concept, do you have any pointers?


The idea is to evaluate the program at compile time and obtain a trace of the program that allows you to reconstruct a semantically equivalent program which has some desirable properties. In your case (and most cases in literature) it's done for optimization. Partial evaluation, which is a subset of supercompilation, is similar to what you are doing. See the wiki page: https://en.wikipedia.org/wiki/Partial_evaluation

Some literature:

https://dl.acm.org/doi/10.1145/5956.5957

https://ndmitchell.com/downloads/paper-rethinking_supercompi...


Supercompilation came to my mind as well!

Wasm + SC would be a killer deal.


Thanks for sharing, we ended up designing something very similar in Scala Native [1]. The use case we had for exploring all possible code paths was to reduce the overhead of virtual call dispatch (since very often virtual calls have very few targets in practice), which is extremely dominant and prohibitive in JVM languages unless optimized away. I hope to see your work upstream in LLVM.

[1]: https://scala-native.readthedocs.io/en/latest/blog/interflow...


Devirtualization is always plenty of fun + has lots of potential. Added to the list of articles to read.


Sounds somewhat similar to the closed-world assumption that graalvm makes (https://www.graalvm.org/22.0/reference-manual/native-image/), and the optimizations that assumption enables.


Please mainstream this in general. It would save a decent amount of RAM if it became a standard part of LLVM and would probably improve overall performance due to cache effects. Dead code sitting in RAM would be bad for cache locality.


It could save ram but presumably if applied wrong you could waste ram by having multiple copies of the code. So you'd probably want to profile guide this.


Dead code removal and deduplication are orthogonal aren't they? You can already turn off verbosity increasing things with -Os etc.


PGO already solves that if you can do hot/cold splitting, though I don't think that pass in LLVM is very maintained.


I don't see any benchmark of this approach ... how much code does it eliminate, and what's the performance improvement attributable to PartialExecuter alone?


I really hope this will be upstreamed :)


Is this something that could be exposed as a generic CLI tool to replace something like wasm-opt from Binaryen?


A subset of this could possibly be applied at the wasm-opt level, but consider that at the LLVM IR level there is more information to be leveraged (in particular, PartialExecuter uses information on which memory ranges are read-only).

In general, the whole concept behind Cheerp (C++ to WebAssembly + JavaScript compiler) is doing as much as possible at the LLVM IR level, the JavaScript side included, since it's easier and more powerful to do code transformations there.


Binaryen implemented partial evaluation at the beginning of this year https://github.com/WebAssembly/binaryen/pull/4438


Annotated article: https://smort.io/5e621c00-3c95-4807-b0be-488947a34d8a

For the first image to render correctly, please change the theme to light mode


So you highlighted a few sentences? What's the point, other than advertising your project?


I made highlights as well as formatting changes that I found insightful. Sharing it so that others can benefit from it too :)



