One of the interesting aspects of benchmarks is that they are usually designed and intended to run in isolation. E.g. if you benchmark a database system, you expect that to be the sole system running on your server machine, in control of all resources.
That's not true of software running on desktop systems or mobile phones - desktops usually run many concurrent tasks, so do phones to some degree, and then there's also the question of battery use.
That can create skewed incentives if the benchmark isn't carefully designed. E.g. you can usually make space/time tradeoffs regarding performance, so if your benchmark is solely measuring CPU time, it pays off to gobble up all possible RAM for even minor benefits. If your benchmark is only measuring wallclock time, it pays off to gobble up all the CPUs, even if the actual speedup from that is minor.
This can lead to software "winning" the benchmark with improvements that are actually detrimental to performance on end users' systems.
this commit[0] might be what's referenced, and this[1] looks to be the test that Google runs (not sure if this is run in CI, or only run ad hoc when performance tests[2] show issues).
Is this responding to something in the post? They mention sampling across different pages and using "in-the-field telemetry", and the sibling post mentions battery testing, so...
The post mentions that they can drastically reduce memory usage at some CPU time cost in the lite version. V8 is presumably doing the right thing for the end user here, given the in-the-field telemetry etc. It's an instance of carefully weighing different resource consumptions though, where simple benchmarks might drive you to prioritize CPU time at the cost of overall system responsiveness due to memory consumption.
When I saw the title of this article, I got really excited because I thought they were referring to a lighter "build".
IMHO, one of the biggest problems facing v8 right now is the build process. You need to download something like 30 gigs of artifacts, and building on windows is difficult - to say the least.
It's bad enough that the trusted postgres extension plv8 is considering changing its name to pljs and switching engines to something like QuickJS. [0]
One of the driving factors is that building and distributing v8 as a shared lib as part of a distro is incredibly difficult, and increasing numbers of distros are dropping it. This has downstream effects for anyone (like plv8) who links to it.[1]
Also, embedding it is super complex. Referenced in the above conversation is a discussion of how NodeJS had to create their own build process for v8. At this point, it's easier to use the NodeJS build process and the Node v8 API than it is to use v8 directly.
At the beginning of the article, they talk about building a "v8 light" for embedded application purposes, which was pretty exciting to me, but then they diverge and focus on memory optimization that's useful for all of v8. This is great work, no doubt, but since v8 is the most popular and best-tested JavaScript engine, I'd love to see a focus on ease of building and embedding.
I completely agree that the most difficult part of using V8 is the build process. In node we have three layers (!!!) of build tooling stacked together to insulate ourselves from it (gyp, ninja, and some extra python scripts), and it still requires constant effort to keep working. Deno just gave up and uses GN, but that requires some stupidly complex source layouts and yet more python scripts. Unfortunately, Google just doesn't care about making V8 work with external tooling, it's all about their chromium build process. And this is a real shame, because V8 has the best embedding API of any js engine I know of, it really is a joy to use.
After getting frustrated with v8's build process, I tried dropping all the sources (+ pregen files I had generated on osx) into a new visual studio project...
> At the beginning of the article, they talk about building a "v8 light" for embedded application purposes, which was pretty exciting to me, but then they diverge and focus on memory optimization that's useful for all of v8. This is great work, no doubt, but since v8 is the most popular and best-tested JavaScript engine, I'd love to see a focus on ease of building and embedding.
Would you ever consider an engine designed for that, like XS?[1] Or is it V8-or-nothing as far as you're concerned?
It's easy to imagine a configuration or use case that's common among the people reporting a bug but not exercised by the developers or their QA process. It's not uncommon.
It seems like such a common action, I'm surprised a non-Chromium developer would have to report it. I considered reporting it, but too much friction. I need to use a Google Account to report a bug? Why? Eh.
I think Firefox went through something similar many, many years ago: a project called MemShrink started after many users complained that Firefox was getting slower and more bloated, around Firefox 3-4 if I remember correctly.
It involved memory optimisation in every part of Firefox, mostly in SpiderMonkey.
I think Firefox needs another MemShrink initiative. After Electrolysis / e10s - at least in my experience - the browser uses more memory over time than Chrome.
And the "Fission" project (i.e. running each browsing origin in its own process) won't help in that regard.
To be fair, they are working intensively on reducing the overhead of having loads of separate content processes, but the target I have heard about is still "only" < 10 MB of overhead per content process. That still translates into 1 GB (!) of additional RAM for a benchmark browsing session with 100 separate origins - perhaps slightly more than the average user has open, but for power users it might not be that unlikely to hit, especially since it's not just tabs that need to be counted: every iframe that loads third-party content also has to run in a separate process.
I tolerate (apparently) a 6G memory leak with 50+ tabs and Tree Style Tabs. It was much faster when Mozilla accidentally remotely disabled extensions for a weekend.
That's literally how Flutter started. "What if we don't have to be backwards compatible? Let's delete the HTML parsing quirks... and expensive CSS selectors... and... all of the DOM... and JavaScript... and why do we have markup even..."
Roughly anything before HTML5/4.01 and XHTML renders in quirks mode.
AFAIK there are no legacy CSS properties. Only nonstandard (i.e. prefixed or experimental and never published in finished specs) properties have been deprecated.
It is rather interesting to me that V8 has accumulated so many micro-optimizations and special techniques over the years that it has now become feasible to just start cutting back on the optimizations for performance gains.
Besides providing more sensible defaults, all this gives developers who use V8 with Electron or NW.js the ability to tweak the engine's default behavior to suit their application's needs. That is always good.
That's not what I took away from this at all. They specifically say that Lite mode started out as a way to reduce memory consumption at the cost of performance. Execution time jumped 120%!
Then they figured out how to get the memory improvements without the performance hit. The only place where they actually removed optimizations was in generating stack traces, and that wasn't a gain in performance, it was just considered acceptable for that to get slower.
I did some work on a Python JIT in the past. The two biggest challenges were:
- Python is much, much more dynamic than Javascript. You can override just about anything in Python, including the meaning of accessing a property. You have overloaded operators (with pretty complex resolution rules), metaclasses, and more. And they're all used extensively. There are some Javascript equivalents to those things, but they either have fewer deoptimization cases or are features that aren't commonly used in practice (e.g. Proxy objects). (A quick sketch of this follows these two points.)
- Python has a ton of important libraries implemented as C extensions. These libraries tend to depend on undefined behavior of the CPython interpreter (e.g. destruction order which is more deterministic with ref counting) or do things that happen to work but are clearly not supposed to be done (e.g. defining a full Python object as a static variable).
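To make the first point concrete, here's a tiny sketch (my own illustration, not code from that JIT work) of the sort of dynamism a Python JIT has to guard against:

class Sneaky:
    def __getattr__(self, name):          # runs for any attribute that isn't found
        return f"made up {name} on the fly"

    def __add__(self, other):             # overloaded operator with arbitrary logic
        return 42

obj = Sneaky()
print(obj.whatever)                       # "made up whatever on the fly"
print(obj + 1)                            # 42

# Even operators can be rewritten on the class after the fact, so a compiler
# can never fully trust what `a + b` means without a guard:
Sneaky.__add__ = lambda self, other: "changed at runtime"
print(obj + 1)                            # "changed at runtime"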
I'd also point to economic incentives: nobody has had a reason to staff a 50-person project to build a Python JIT, given that it's cheaper to rewrite some or all of the application in C/C++/Rust/Go, whereas that's not an option in Javascriptland.
I completely agree with you and I'd argue that #2 and #3 are the two biggest reasons.
It is easy to forget the colossal amount of engineering resources that browser vendors have spent creating and rewriting their Javascript engines. And due to the nature of how their JITs work, all that work is tied down to the specific Javascript environment they were written for. (For example, you can't really reuse the v8 codebase to create a Python JIT)
And in Python's case a lot of the appeal of the language rests on the extensive library ecosystem, which has a significant number of extensions written in C. Generally speaking JIT compilers aren't very good at optimizing code that spends a lot of time inside or interacting with C extensions, even if we ignore the significant issues you mentioned regarding undefined behavior.
Compiled vs interpreted is not a useful distinction. Until fairly recently, when Ignition was introduced, V8 had no interpreter: all the JS was compiled by the first tier, “full codegen”.
The bigger difference is that the JVM is heavily optimized for performance after a long warmup and V8 needs to produce relatively fast code early during page loading.
Java being much more static certainly helps warmup time but ultimately doesn’t really affect final performance. LuaJIT can beat C in some cases once it has time to compile all traces needed.
> V8 needs to produce relatively fast code early during page loading
So for, e.g., backend JS programs, is it possible to ask v8 to take more time to optimize?
Static vs dynamic still matters if the program is doing "dynamic stuff", where "dynamic stuff" means anything that the JIT compiler is currently not able to optimize.
That's fair, but that's also stuff that you just can't really do in Java in general¹, so it's not useful for a comparison. The fact is that the vast majority of JavaScript code is pretty static and there's nothing preventing it from running as fast as Java other than man-decades of compiler engineering.
―
¹ Possibly excluding reflection. It's been a long time since I used the Java reflection APIs and I have no idea if you can do things like add class fields named after arbitrary strings at runtime. Even if you could, presumably this bails out of jitted code so the situation is basically the same as in JS.
Dynamic typing means you pay the cost of trace recording or profiling to collect the type info. The actual code performance should ultimately be the same. JIT can remove the overhead of dynamic dispatch and replace it with a fixed call and a guard, for example. This isn’t possible with dynamically loading C libraries.
> JIT can remove the overhead of dynamic dispatch and replace it with a fixed call and a guard, for example.
Only when the guard isn’t triggered constantly. With an actual type system you can remove many of these guards altogether instead of having them everywhere and falling back to the slow case when you get something unexpected.
There are actually two problems here: How to handle things being re-defined and how to handle an unexpected type after speculation.
In real high-performance VMs the guard for redefinition is effectively a single instruction, which a CPU can easily branch-predict and handle with out-of-order execution: https://chrisseaton.com/truffleruby/low-overhead-polling/
With unexpected types we can use LuaJIT as an example: The type speculation guard will be turned into a conditional branch to a side trace. The slow path quickly becomes another fast path.
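A rough, Python-flavored sketch of that shape (purely illustrative - a real JIT emits machine code and records traces, not Python functions): the speculated fast path sits behind a cheap type guard, and a guard failure branches to a generic slow path instead of aborting.

def generic_add(a, b):
    # Full dynamic dispatch: __add__/__radd__ resolution, coercion, etc.
    return a + b

def speculated_add(a, b):
    # Guard: profiling observed ints here, so speculate on that.
    if type(a) is int and type(b) is int:
        return a + b                 # fast path, trivially compiled to a few instructions
    # "Side exit": unexpected types fall back to the generic path.
    # In a tracing JIT this slow branch can itself become a new fast trace.
    return generic_add(a, b)

print(speculated_add(2, 3))          # takes the guarded fast path
print(speculated_add("a", "b"))      # guard fails, falls through to the slow path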
> For example, you can't really reuse the v8 codebase to create a Python JIT
Actually, come to think of it... V8 also runs WASM right? Right now I think WASM is missing a few features (like garbage collection) which Python would need to be efficiently compiled to WASM, but once those are solved...
At a conceptual level that would be no different than having a Python implementation targeting the JVM or CLR runtimes (aka Jython and IronPython).
The usual situation for these alternate implementations is that they make it easier to interact with other code that targets those runtimes, but that they do not speed up the average speed of the interpreter. The previously-mentioned compatibility and performance issues for C extensions also remain.
There was a brief point in time where IronPython was faster than CPython for several major benchmarks, precisely because of the better CLR JIT and GC, plus some smart tricks implemented in the "DLR". (It's almost sad that IronPython and the "DLR" have been left to grow so many weeds since that time.)
I definitely should have been clearer in my other comment. Alternative interpreters such as IronPython can be faster, but most of the time the speed stays in the same "order of magnitude" as the original C interpreter. On the other hand, a good JIT can deliver a 10x or greater speedup in the best-case scenarios, where it manages to get rid of the dynamic-typing overhead. (For subtle technical reasons, running the language interpreter on top of a jitting VM like the CLR is not enough: the underlying JIT has a hard time looking further than the IronPython interpreter itself and making optimizations at the "Python level".)
That's where some of the DLR magic filled in: it cached and optimized a lot of the "Python level" in a way that the CLR JIT could take advantage of. The DLR was briefly a huge bundle of hope for some really interesting business-logic caching. In a past life I did some really wild stuff with DLR caching for a complicated business workflow tool. It's dark matter in an enterprise, so I'm sure all of it is still running, but I'm not sure if the performance has kept up over time (and I have no way to ask, and probably don't care), as the CLR team declared "mission complete" on the DLR and maybe hasn't kept it quite as optimized since the IronPython heyday.
Thought experiment: what about transpiling Python to JS? https://github.com/QQuick/Transcrypt looks like a nice implementation, but their readme just talks about deployment to browsers — I'm curious about outside of the browser whether Transcrypt + Node might be more efficient than CPython.
(Not even just CPython, but really any dynamic language implementation.)
And then of course wasm could still be used for C extensions.
Calling asm.js a subset of Javascript is a bit of a stretch. It looks nothing like idiomatic Javascript, and was more of a low-level statically-typed language dressed up in Javascript clothing.
At some point they realized that representing this low-level code as Javascript text instead of as specially-designed bytecode added a significant amount of parsing and compilation overhead, which was one of the initial motivations for the creation of WebAssembly. If I had to sum up WASM in one sentence, it is that it is kind of like the JVM, except that its instruction set was designed specifically for running programs downloaded from the web. Special attention was paid to security and startup latency.
Erm... the JVM was also designed specifically for running programs downloaded from the web. Special attention was paid to security and startup latency. Javascript got its name because it was the only other language designed to be downloaded and executed in a browser!
That's the key. MicroPython is significantly less dynamic than full Python, and would be much easier (but still not easy) to write a fast JIT for. Unfortunately such a JIT wouldn't be very useful - MicroPython won't run much code that hasn't been written specifically for it. Without the dynamic features, MicroPython is essentially a different language from regular Python.
> I'd also point to economic incentives: nobody has had a reason to staff a 50-person project to build a Python JIT, given that it's cheaper to rewrite some or all of the application in C/C++/Rust/Go, whereas that's not an option in Javascriptland.
I think you have the first three points — dynamism, C extensions, and economics — right on the money. The fourth point I would add is that Python has a huge standard library. That's a very large surface area, all of which ends up needing optimization effort to get good performance across a wide variety of programs.
Besides the existence of __getattr__, __add__, etc. that other people mentioned, there's also:
- A Python runtime has to support threads + shared memory, while a JS one doesn't. JS programs are single-threaded (w/ workers). So in this sense writing a fast Python interpreter is harder.
- The Python/C API heavily constrains what a Python interpreter can do. There are several orders of magnitude more programs that use it than use v8's C++ API.
For example, reference counts are exposed with Py_INCREF/DECREF. That means it's much harder to use a different reclamation scheme like tracing garbage collection. There are thousands of methods in the API that expose all sorts of implementation details about CPython.
Of course PyPy doesn't support all of the API, but that's a major reason why it isn't as widely adopted as CPython.
- Python has multiple inheritance; JS doesn't
- In Python you can inherit from builtin types like list and dict (as of Python 2.2). In JS you can't.
- Python's dynamic type system is richer. typeof(x) in JS gives you a string; type(x) in Python gives you a type object which you can do more with. And common programs/frameworks make use of this introspection (a quick example below).
- Python has generators, Python 2 coroutines (send, yield from), and Python 3 coroutines (async/await).
In summary, it's a significantly bigger language with a bigger API surface area, and that makes it hard to implement and hard to optimize. As I learn more about CPython internals, I realize what an amazing project PyPy is. They are really fighting an uphill battle.
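A quick illustration of two of the points above (plain, standard Python, nothing exotic):

# type(x) yields a first-class type object you can inspect, call, and subclass,
# not just a string tag like JS's typeof.
x = 42
t = type(x)
print(t.__name__, t.__mro__)      # int (<class 'int'>, <class 'object'>)
print(isinstance(x, t), t("7"))   # True 7

# Subclassing a builtin such as dict is ordinary, supported Python.
class DefaultZeroDict(dict):
    def __missing__(self, key):   # hook invoked by dict's own lookup machinery
        return 0

d = DefaultZeroDict(a=1)
print(d["a"], d["nope"])          # 1 0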
I always find it ironic that the CMUCL Lisp compiler (upon which SBCL was based) was called 'the Python compiler', had machine-code generation in 1992, and that CMUCL sports native multithreading that is largely lock-free.
Hm that's a good question, not sure. Do Common Lisp implementations have a "core" that multiple inheritance and multiple dispatch can be desugared to? Or are those features "axiomatic" in the language?
If it's the former, I would say that optimizing a small core is easier than optimizing a big language. Python's core is 200-400K lines of C and there are a lot of nontrivial corners to get right.
I was surprised when looking at Racket's implementation that it's written much like CPython. IIRC it was more than 200K lines of C code. Some of that was libraries, but it's still quite big IMO. I would have thought that Racket, as a Scheme dialect, would have a smaller core.
AFAIK Racket is not significantly faster than Python; it's probably slower in many areas. Maybe it's just that SBCL put a focus on performance from the beginning?
(I looked at Racket since I heard they are moving to Chez Scheme, which also has a focus on performance.)
>Do Common Lisp implementations have a "core" that multiple inheritance and multiple dispatch can be desugared to?
Yes, that's the Meta Object Protocol. It isn't in the standard, yet most Lisp implementations have it, and now you can use it in a portable way as well.
>If it's the former, I would say that optimizing a small core is easier than optimizing a big language. Python's core is 200-400K lines of C
Common Lisp's "core" (that means, not including "batteries") is considerably more involved and complex than Python's. Creating a new CL implementation is a big deal.
Any future "super speed" Python efforts would probably do well to build on the amazing work that PyPy has done in teasing apart an optimization-friendly subset of the language in the form of RPython, and building the rest of it in that language.
Like, focus on further optimizing the RPython runtime rather than starting from scratch.
That doesn't really make sense -- there is no "RPython runtime". There is a PyPy runtime written in RPython.
RPython isn't something that's exposed to PyPy users. It's meant for writing interpreters that are then "meta-traced". It's not for writing applications.
It's also not a very well-defined language AFAIK. It used to change a lot and only existed within PyPy.
I'm pretty sure the PyPy developers said that RPython is a fairly unpleasant language to write programs in. It's meant to be meta-traceable and fast, not convenient. It's verbose, like writing C with Python syntax.
The PyPy interpreter is written in RPython, but is a full Python interpreter with a JIT. When you compile PyPy, it generates C files from RPython sources, which are then compiled with a normal C compiler into a standalone binary.
RPython is both a language (a very ill-defined subset of Python... pretty much defined as "the subset of Python accepted by the RPython compiler"), and a tool chain for building interpreters. One benefit of writing an interpreter in RPython is that, with a few hints about the interpreter loop, it can automatically generate a JIT.
Basically because it can be "meta-traced", and C can't (at least not easily).
The whole point of the PyPy project is to write a more "abstract" Python interpreter in Python.
VMs written in C force you to commit to a lot of implementation details, while PyPy is more abstract and flexible. There's another layer of indirection between the interpreter source and the actual interpreter/JIT compiler you run.
See PyPy's approach to virtual machine construction:

> Building implementations of general programming languages, in particular highly dynamic ones, using a classic direct coding approach, is typically a long-winded effort and produces a result that is tailored to a specific platform and where architectural decisions (e.g. about GC) are spread across the code in a pervasive and invasive way.
Normal Python and PyPy users should probably pretend that RPython doesn't exist. It's an implementation detail of PyPy. (It has been used by other experimental VMs, but it's not super popular.)
I don't think multiple inheritance is a performance issue. A class's resolution order is resolved when it's defined (using C3: https://en.wikipedia.org/wiki/C3_linearization), and after that it's only a matter of following it, like Javascript's prototype chain.
Objects and types are mutable after definition, but that's no more severe than what you can do in Javascript. Assigning to .__class__ is like assigning to .__proto__, and assigning to a class's .__bases__ is more or less like assigning to a prototype's .__proto__.
The resolution order is calculated when it's defined. It's calculated again whenever you assign to __bases__ (or a superclass's __bases__). But it's not calculated every time it's used, which means there's no significant performance penalty to multiple inheritance unless you're changing a class's bases very often.
Metaclasses can override the MRO calculation, which we can abuse to track when it's recalculated: https://pastebin.com/NdiA12Ce
Doing ordinary things with the class or its instances doesn't trigger any calculation related to multiple inheritance. You only pay for that during definition or redefinition. So there's no performance problem there compared to Javascript.
I do agree that basically everything is dynamic in Python. But some things are more dynamic than others.
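A small sketch of the recalculation behavior described above (my own illustration in the spirit of the linked snippet; TracingMeta is a made-up name):

class TracingMeta(type):
    def mro(cls):                                   # consulted whenever the MRO is (re)built
        print(f"computing MRO for {cls.__name__}")
        return super().mro()

class A(metaclass=TracingMeta): pass                # prints once, at definition
class B(metaclass=TracingMeta): pass
class C(A, B, metaclass=TracingMeta): pass          # prints once, at definition

print(C.__mro__)       # no print from mro(): the linearization is just a stored tuple
C.__bases__ = (B, A)   # prints again: recomputed only because the bases changed
print(C.__mro__)       # still no extra work on ordinary use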
Hm yeah I see what you mean. I don't know the details of how v8 deals with __proto__, but I can see in theory they are similar.
Though I think the general point that Python is a very large language does have a lot to do with its speed / optimizability. v8 is just a really huge codebase relative to the size of the language, and doing the same for Python would be a correspondingly larger amount of effort.
I don't know the details but v8 looks like it has several interpreters and compilers within it, and is around 1M lines of non-test code by my count!
v8 was written in the ES3 era. And I knew ES3 pretty well and Python 2.5-2.7 very well, which was contemporary. I'd guess at a minimum Python back then was a 2x bigger language, could be even 4x or more.
IIUC this is also true for PHP but HHVM has/had some interesting techniques to deal with it, like pairing up and cancelling out reference count operations, and bulk changing the reference count before taking a side exit or calling a C function.
V8 has the full resources of Google, not to mention Microsoft, Node.js, and every community that uses the V8 engine.
Basically V8 has some of the best engineers in the world being paid to work full-time on it, and it has resources from dozens of other high-profile companies.
Not knocking python in any way - V8 essentially just has more resources available. I highly recommend trying pypy if you're looking for a performance benefit and your code works with it.
Many of the techniques used in modern JS VMs like v8 or JavaScriptCore could totally be applied to a Python interpreter, and it wouldn't take 50 people. Someone just needs to invest the effort. The core techniques are a fast-start interpreter, a templatized baseline JIT, polymorphic inline caches, and runtime type information combined with higher JIT tiers that speculatively optimize for observed types, which allows many checks throughout the generated code to be replaced by side exits. (Also a good garbage collector, and escape analysis to avoid allocating temporaries).
I believe most of these could be applicable to Python. JavaScript has crazy levels of dynamism too, and the above methods are the short version of how you deal with it.
It seems like no one in the Python community has had the knowledge + motivation to try these approaches.
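As a toy, Python-level illustration of just one of those pieces - a polymorphic inline cache - here's a sketch (CallSiteCache is a made-up name; a real VM does this per call site in generated machine code):

class CallSiteCache:
    """Toy polymorphic inline cache for method calls at one call site."""
    MAX_ENTRIES = 4                          # beyond this it would go "megamorphic"

    def __init__(self, method_name):
        self.method_name = method_name
        self.entries = {}                    # receiver type -> function found on it

    def call(self, receiver, *args):
        tp = type(receiver)
        fn = self.entries.get(tp)            # the "guard": does a cached type match?
        if fn is None:                       # miss: generic lookup, then cache it
            fn = getattr(tp, self.method_name)
            if len(self.entries) < self.MAX_ENTRIES:
                self.entries[tp] = fn
        return fn(receiver, *args)

site = CallSiteCache("upper")
print(site.call("hello"))                    # slow lookup of str.upper, then cached
print(site.call("world"))                    # cache hit: no lookup, just the guard

# A real VM also invalidates these caches when classes are mutated; this sketch
# skips that, which is exactly where Python's extra dynamism adds more guards.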
Evan Phoenix (Rubinius) talked about applying the Self techniques of collecting type info using inline caches for CRuby in 2015.
I'm using part of this idea in my prototype tracing JIT for CRuby. With a tracing JIT it's much, much easier to implement basic escape analysis because the control flow is linear. One basically gets Partial Escape Analysis (the big deal from Graal) for free.
So far it's proving unreasonably effective to re-use the method lookup info from the inline caches and use the same invalidation mechanism.
The CRuby 2.6 JIT can't use this approach because the compilation pipeline is too slow to invalidate it with the method cache, but I'm using the Cranelift compiler from Mozilla.
Post Haswell, there's not a ton of value to baseline JIT, especially for dynamic languages. Mike Pall talked at length about how the highly optimized bytecode VM of LuaJIT 2.x was sometimes faster than the baseline-ish method JIT from LuaJIT 1.x, and that was before Haswell.
I think we'll see all the major JS engines remove baseline JIT over the next few years in favour of even more optimized bytecode interpreters.
> I'm using part of this idea in my prototype tracing JIT for CRuby. With a tracing JIT it's much, much easier to implement basic escape analysis because the control flow is linear. One basically gets Partial Escape Analysis (the big deal from Graal) for free.
Maybe Ruby is different. But with JavaScript, tracing JIT turned out to not be a winning strategy. Every engine that tried it eventually moved to a more traditional multi-tiered JIT (with OSR entry/exit).
> Post Haswell, there's not a ton of value to baseline JIT, especially for dynamic languages. Mike Pall talked at length about how the highly optimized bytecode VM of LuaJIT 2.x was sometimes faster than the baseline-ish method JIT from LuaJIT 1.x, and that was before Haswell.
I can't say definitively for other languages, but for the JavaScriptCore implementation of JavaScript, we are very aware of the performance value of all our JIT tiers, and the baseline JIT makes a significant difference. And yes, our interpreter is very optimized. We essentially have a CPU-specific interpreter loop using assembly code generated from a meta-language. Baseline JIT on top of that is still a big perf win (as are our two higher JIT tiers, DFG and FTL/B3).
Haswell can branch predict the indirect branch at the end of each bytecode instruction to dispatch the next bytecode instruction much better than previous generations.
In highly dynamic languages like JS, Ruby and Python it’s not even this which is the main source of branching anyway. It’s branching on the types for each opcode to handle all valid types.
There are a bunch of things: Python allows metaprogramming in a way that JS doesn't, which means you end up needing more guards (or conflating more guards); the Python ecosystem fairly heavily relies on CPython extension modules, and if you wish to remain compatible with them you're constrained in some ways, especially if you care about the performance of calling into/from them.
And of course money, lots of it. The amount of money invested in optimizing v8 is staggering -- Google brought Lars Bak out of retirement[1] to start v8, and that guy is no joke.
> Python allows metaprogramming in a way that JS doesn't, which means you end up needing more guards (or conflating more guards)
JS allows you to dynamically modify some of the scopes that names refer to, as well as changing the actual prototype chain itself. I'm not sure you can do such crazy things with Python classes/metaclasses.
Of course, for v8 in particular, doing any of this crazy manipulation tends to set off alarm klaxons that kick your code off every optimization path, but the language still permits it.
> the Python ecosystem fairly heavily relies on CPython extension modules, and if you wish to remain compatible with them you're constrained in some ways, especially if you care about performance of calling into/from them
And for JS, very low overhead of calling into the DOM APIs (written in C++) is a necessary feature for having competitive performance. Arguably more so than in Python, since the overhead of the FFI trampoline itself here is considered a bottleneck.
> dynamically modify some of the scopes that names refer to
You can do some fairly disgusting things to name resolution in class bodies, but names within functions are resolved statically nowadays.
> as well as changing the actual prototype chain itself
You can change a class's MRO, if that's the closest analogue.
class Foo:
    x = 'foo'

class Bar:
    x = 'bar'

class Baz(Foo):
    pass

print(Baz.x)            # foo
Baz.__bases__ = (Bar,)  # swap out the base class at runtime
print(Baz.x)            # bar
In Python you can also hook your own entire custom import system into importlib, or just arbitrarily change the meaning of the `import` statement by replacing builtins.__import__:
You can make your own class that inherits from types.ModuleType and use it to replace an existing module's class and add interesting new behaviors to its object:
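A minimal sketch of both tricks (illustrative only; noisy_import and LoudModule are my own names, not from any particular snippet):

import builtins
import sys
import types

# 1. Every `import x` in the process now goes through this wrapper.
_real_import = builtins.__import__

def noisy_import(name, *args, **kwargs):
    print(f"importing {name}")
    return _real_import(name, *args, **kwargs)

builtins.__import__ = noisy_import

# 2. Swap the class of a live module object to intercept failed attribute lookups.
class LoudModule(types.ModuleType):
    def __getattr__(self, attr):             # only reached when normal lookup fails
        raise AttributeError(f"{self.__name__} has no attribute {attr!r}")

import json                                  # prints "importing json"
sys.modules["json"].__class__ = LoudModule
print(json.loads("[1, 2]"))                  # existing attributes still work fine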
> JS allows you to dynamically modify some of the scopes that names refer to, as well as changing the actual prototype chain itself. I'm not sure you can do such crazy things with Python classes/metaclasses.
> Of course, for v8 in particular, doing any of this crazy manipulation tends to set off alarm klaxons that kick your code off every optimization path, but the language still permits it.
Most of the real badness in JS (direct eval and the with statement stand out above everything else here) can be statically detected; the fact in Python that you can fundamentally change operation of things already on the call stack through prodding at things via the `sys` module makes this an order of magnitude worse (and yes, guards and OSR in principle can be used here, but it's very easy to end up with a _lot_ of guards).
> And for JS, very low overhead of calling into the DOM APIs (written in C++) is a necessary feature for having competitive performance. Arguably more so than in Python, since the overhead of the FFI trampoline itself here is considered a bottleneck.
Oh yes, it's absolutely essential, but the definition is on a very different level: we might have an interface defined in WebIDL that must be exposed to JS in a certain way, but how that's implemented is an implementation detail (and there's nothing in the public API stopping a browser from changing how their JS VM represents strings, for example; the JS VMs themselves don't really have totally stable APIs). Whereas in Python, the C API is public and includes implementation details like refcounting, string representation, etc.
You can't change the inheritance as far as I know after creation without some hacks, but you can change the class that an instance refers to which can kinda sorta achieve the same thing. You can't necessarily add properties to a base class and have all of those reflect immediately unless you use some hackery with class properties.
Would there be a performance penalty (I’m guessing in cache coherency) in having an interpreter that’s really two interpreters in the same process, where modules that use the “strict subset” of the language (the part that doesn’t require the more advanced object-model, or any FFI preemption safety) run their code through a more minimal interpreter, and then whenever your code jumps into a module that requires those things, the interpreter itself jumps into a more-complete “fallback” interpreter? Sort of doing what profile-guided JIT optimization does, but without the need for JITing (and before JITing would even kick in), just instead using a little bit of static analysis during the interpreter’s source-parsing step.
I ask because I know that this is something hardware “interpreters” (CISC CPU microcode decoders) do, by detecting whether the stream of CISC opcodes in the decode pipeline entirely consist of some particular uarch, and then shunting decode to an optimized decode circuit for that uarch that doesn’t need to consider cases the uarch can’t encode. But, of course, unlike hardware, software interpreters have to try to fit in a CPU’s cache lines and stay branch-predicted, so there might not be a similar win.
(Tangent: I once considered writing a compiler that takes Ruby code, rewrites the modules using only a “strict subset” of it to another language, and then either has that language’s runtime host a Ruby interpreter for the fallback, or has the Ruby runtime call the optimized modules through its FFI. I never got far enough into this to determine the performance implications; the plan was actually to enable better concurrency by transpiling Rails web-apps into Phoenix ones, switching out the stack entirely at the framework level and keeping only the “app” code, so single-request performance wasn’t actually the top-level goal.)
Have you tried pypy? I got about 2x of C performance for a simulated annealing problem I was working on recently. Ultimately what I realized was that the clever python structures that made prototyping fast were inherently slow (dicts with tuple keys, etc). Once I ported it to C, then went back to python and used the same simple data structures, pypy was practically just as fast as C.
History suggests that, for a good approximation of the truth, the resources put into a language implementation (and, in particular, into JIT compiling VMs) strongly correlate with its performance. So V8 and HotSpot, for example, both have great performance -- and both have had large teams working on them for many years.
Interestingly, PyPy has pretty decent performance despite having had a much smaller team working on it, mostly part-time. An interesting thought experiment is whether similar resources put into PyPy -- or a PyPy-like system -- would achieve similar results. My best guess is "yes".
I can't speak to JSC, but at least SpiderMonkey has had memory optimizations that are equivalent to the ones described here (e.g. discarding cold-function bytecode) for a while... I agree that it would be interesting to have more competition, including more measurement, in this space.
This exact thing is why browser engine diversity is so important, and why even though I never used Edge, it's very disappointing to me that MS is killing it.
Somewhat off-topic, but is there an RSS feed available for the V8 blog? Their posts are always interesting to me, but I've searched a few times with no luck.
I wonder if something like this could be used to build an electron alternative (or modify the existing electron backend to use this engine), since memory usage is a major complaint for these applications
Well... Electron is powered by Chromium, but uses NodeJS, which is based on V8; V8 is also the default JS engine for Chromium. So in theory this would benefit Electron directly. I'm not sure why anybody would waste the engineering effort to recreate Electron if these changes will find their way into Electron anyway.
Edit: Didn't mean to make it sound personal on my second paragraph, but the rest of what I wrote still applies with the context of the article and the comment made.
Please edit personal swipes out of your HN comments.
Your comment broke the site guidelines and provoked an off-topic spat. Would you please review them and stick to them? They're all there for good reason, otherwise we'd have taken them out. Note that they include Assume good faith. That's the opposite of "I can't tell if you're trolling hard".
Dear lord people are getting pointlessly defensive and aggressive. Obviously this article is not referencing the full V8 engine, but a light version that may or may not make it into Chromium
You might want to re-read the HN guidelines
> When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."
Ok, but please don't respond to a bad comment by breaking the guidelines yourself, as in the first sentence above. I realize it's hard to do when someone has replied to you with a provocative swipe, but it's even more necessary in such cases.
> Obviously this article is not referencing the full V8 engine, but a light version that may or may not make it into Chromium
Obviously? Right from the second sentence, the article says:
> Initially this project was envisioned as a separate Lite mode of V8 specifically aimed at low-memory mobile devices or embedder use-cases that care more about reduced memory usage than throughput execution speed. However, in the process of this work, we realized that many of the memory optimizations we had made for this Lite mode could be brought over to regular V8 thereby benefiting all users of V8.
The article is about how many of the Lite mode optimizations were added over the last 7 v8 releases, resulting in an 18% reduction vs Lite mode's 22%. The people responding to you are making the case that the 18% reduction is basically as good.
Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."
> However, in the process of this work, we realized that many of the memory optimizations we had made for this Lite mode could be brought over to regular V8 thereby benefiting all users of V8.
What is it about the JS DOM that makes a DOM of N elements modelling a given app’s view, have a higher memory footprint than the “DOM” (view state) of a native graphics toolkit modelling an equivalent app’s view?
>What is it about the JS DOM that makes a DOM of N elements modelling a given app’s view, have a higher memory footprint than the “DOM” (view state) of a native graphics toolkit modelling an equivalent app’s view?
For starters, because the DOM is a very inefficient design of an app's view, primarily designed for text and simple forms, and with all kinds of extra crap bolted on. Until CSS Grid, there wasn't even a proper layout engine available, and people used styling primitives meant to float text for UI design...
A native UI engine can implement drawing a window with a button (the raw widgets, design wise) with a few lines of code to draw, two rectangles, some edge shading, and some text.
The DOM has thousands of lines for all kinds of contingencies for the same thing...
Part of it is just how feature complete browser compositing is. The other part is bloat due to how it was all implemented. HTML and CSS were never optimal representation for complex documents, and a ton of features have been added on top.
You misread my comment. I also suggested replacing the internal engine it uses with this lighter version to make it more performant. Modifying it to be more performant != getting rid of it.
> If you prefer watching a presentation over reading articles, then enjoy the video below! If not, skip the video and read on.
This is OT, but I think we can design a format with the best of both worlds. We can have the personal and narrated quality of videos/voiceovers, along with the skimmable/scannable/interactive quality of web content.
I agree, having a transcript of a video is very useful. I've done that with some of my own and other people's videos.
It takes a lot less time to skim over an illustrated transcript than to watch a video, and it lets the readers decide if they're interested enough in actually taking the time to watch the video. Plus it's search engine friendly, and lets you add more links and additional material.
I loved the body language in this classic Steve Jobs video so much that I was compelled to write a transcript with screen snapshots focusing on and transcribing all of his gestures (in parens). After reading the transcript, it's still interesting to watch the video, after you know what body language to look for!
“Focusing is about saying no.” -Steve Jobs, WWDC ‘97
As sad as it was, Steve Jobs was right to “put a bullet in OpenDoc’s head”. Jobs explained (and performed) his side of the story in this fascinating and classic WWDC’97 video: “Focusing is about saying no.”
If only HTML could have hyperlinks, embedded multimedia and textual content in a single file...
(In all seriousness, it's sad that YouTube has such a narrow focus — video files only — which helped it spread to many different devices, from small phones to TVs, but hinders interactivity)
This is a little off-topic, but I find the terminology used in software these days to be a little perplexing. Is "small", a perfectly adequate word to describe less memory usage, not buzzwordy/trendy enough?
"light" or "heavy" just reminds me of that classic story about the weight of software.
Why not allow the runtime to call the garbage collector (GC)? The GC is very lazy by default; it should be possible to make it collect all garbage. Currently v8 will just let the garbage grow because the GC is so lazy.
Probably should be re-titled to match the post as people are getting confused by the reference to V8 lite, which this post isn't directly about.
> However, in the process of this work, we realized that many of the memory optimizations we had made for this Lite mode could be brought over to regular V8 thereby benefiting all users of V8.
> ...we could achieve most of the memory savings of Lite mode with none of the performance impact by making V8 lazier.
Yes, the submitted title ("V8 lite (22% memory savings)") broke the site guideline which asks: "Please use the original title, unless it is misleading or linkbait; don't editorialize."
Doing this tends to skew discussions enormously, so please follow the guidelines!
This is great, but because you have achieved memory reduction by trading off speed, it would be nice to see charts that show processing time increases too.
> Lite mode launched in V8 version 7.3 and provides a 22% reduction in typical web page heap size compared to V8 version 7.1 by disabling code optimization, not allocating feedback vectors and performed aging of seldom executed bytecode (described below). This is a nice result for those applications that explicitly want to trade off performance for better memory usage. However in the process of doing this work we realized that we could achieve most of the memory savings of Lite mode with none of the performance impact by making V8 lazier.
I don’t know about “none”. Updating the age of the compiled representation of a function on every function entrance doesn’t strike me as cost-free. At a minimum that’s a store.
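For what it's worth, here's a toy sketch of the aging scheme being discussed (my own model of the idea from the article, not V8's actual implementation; the threshold value is invented). The per-entry cost the parent mentions is the `self.age = 0` store:

FLUSH_AGE = 3                                 # hypothetical: GC cycles before flushing

class LazilyCompiledFunction:
    def __init__(self, source):
        self.source = source                  # always kept; cheap
        self.bytecode = None                  # compiled on first call
        self.age = 0

    def call(self, env):
        if self.bytecode is None:             # first call, or bytecode was flushed
            self.bytecode = compile(self.source, "<toy>", "exec")
        self.age = 0                          # the store paid on every entry
        exec(self.bytecode, env)

    def on_gc(self):                          # run by the collector
        if self.bytecode is not None:
            self.age += 1
            if self.age >= FLUSH_AGE:
                self.bytecode = None          # seldom-used code gives its memory back

f = LazilyCompiledFunction("x = 1 + 1")
f.call({})                                    # compiles, then runs
for _ in range(3):
    f.on_gc()                                 # three idle GC cycles: bytecode flushed
f.call({})                                    # transparently recompiled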