Author here. Thank you all! Really glad to see this on the front page!
There is a second part that may be of interest: it implements a Python tracer that shows the side effects of each line as it is executed (the effect of all the various load and store instructions). The objective was to get something similar to the set -x built-in of the bash shell: https://github.com/MoserMichael/pyasmtool/blob/master/tracer...

And it's all part of this advanced Python course: https://github.com/MoserMichael/python-obj-system/blob/maste... (well, I am still working on it)
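Roughly, the set -x idea boils down to something like this with sys.settrace; a bare-bones sketch, not the full tracer from the repo:

```python
import sys

def trace_lines(frame, event, arg):
    # Print every executed line together with the current local variables,
    # roughly in the spirit of bash's set -x.
    if event == "line":
        print(f"{frame.f_code.co_name}:{frame.f_lineno} locals={frame.f_locals}")
    return trace_lines

def demo(n):
    total = 0
    for i in range(n):
        total += i
    return total

sys.settrace(trace_lines)
demo(3)
sys.settrace(None)
```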
And I am also looking for a job again ;-( I need a new job in April, so here is my LinkedIn profile. I also do C++ and Java/Scala. Available on-site in the Tel Aviv area, considering remote-only jobs anywhere else. https://www.linkedin.com/in/michael-moser-32211b1/ E-mail address is in my HN profile.
We're also in the Tel Aviv area (but remote is an option if you prefer) and we've been looking for someone like you who explains technical topics in simple terms.
We have some low-level stuff (e.g. I wrote a Python debugging tool for Kubernetes which injects debugpy into target processes using gdb [1])... but also a lot of higher-level stuff in our Python framework for k8s automation.
Tried to give you a shoutout on Twitter but you aren't on there. Tried to give you a shoutout on LinkedIn but it won't let me mention your profile. :) Good luck with your search!
An important thing that is a bit buried in this text: Python bytecode changes in every version, sometimes just a little, sometimes a lot. So Python 3.10's bytecode instruction set differs from 3.9's, and 3.11's will differ from 3.10's.
> I was surprised to learn that many bytecode instructions changed in minor releases of the runtime!
That's also how Dropbox used to obfuscate their client back when it was written in Python. They would ship only .pyc files, which are just bytecode, but they would shuffle the opcode numbering around, map multiple numbers to the same opcode, etc. Then they would also stream-encrypt the .pyc file and hide the key inside of it.
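For a sense of how little is pinned down: the opcode numbering is just a lookup table baked into the interpreter, visible from the stdlib (a quick illustration, not Dropbox's actual scheme):

```python
import opcode

# A patched CPython build can renumber this table, which makes its shipped
# .pyc files meaningless to a stock interpreter.
print(opcode.opmap["LOAD_CONST"])                 # name -> number
print(opcode.opname[opcode.opmap["LOAD_CONST"]])  # number -> name
```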
Though playing around with this offline is not exactly difficult either: you can just invoke `dis.dis(codestring)` in the interactive interpreter, or use `dis.dis` as a decorator when defining a function (it prints the bytecode as the function is defined, though note the decorated name then becomes None, since `dis.dis` returns None).
Sadly `-mdis` requires feeding a file by path or data through stdin, so for mucking around it’s not the best.
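For interactive mucking around, something like this is usually enough (a quick sketch):

```python
import dis

def add(a, b):
    return a + b

dis.dis(add)          # disassemble a function object
dis.dis("a + b * 2")  # or a source string, compiled on the fly
```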
> If you are upgrading or downgrading the python interpreter, then you probably should also delete all __pycache__ folders, these folders hold the binary files that hold the compiled bytecode instructions, but you can't be sure that these will work after a version change!
This is incorrect. Compiled bytecode files are versioned alongside the interpreter: the cache file name carries a version tag and the file itself starts with a version-specific magic number. When CPython finds a __pycache__/*.pyc file that doesn't match the running interpreter, it simply ignores it and recompiles from source, so stale caches won't cause any problems.
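You can see the versioning machinery from the stdlib (a small illustration):

```python
import importlib.util

# The version-specific bytes that every .pyc written by this interpreter starts with.
print(importlib.util.MAGIC_NUMBER)

# The cache path also embeds the version tag, e.g. __pycache__/example.cpython-310.pyc on 3.10,
# so different interpreter versions don't even share cache files.
print(importlib.util.cache_from_source("example.py"))
```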
> the stack is maintained separately per each function object
Can someone elaborate on this? Having separate stacks makes sense for coroutines, but does this mean that a normal Python function call allocates a private stack for that function?
All it means is that Python bytecode is stack-based: most instructions pop their arguments from, and push their results onto, an operand stack, in contrast with register-based VMs.
When implementing a VM it makes sense to store the call stack and the operand stack separately so that you don't have to mix types. You probably don't want to allow a function to uncontrollably modify operands in lower frames, as in most cases that would be either a bug or a vulnerability. Having a separate operand stack for each frame also makes any kind of analysis much easier. A call instruction can then be viewed as a fat instruction which pops some number of arguments and pushes a single result back.
Once you restrict cross-frame operand stack access, whether it's stored in a single array or in multiple arrays becomes an implementation detail. Many other VMs do more or less the same: the JVM, AVM2 (Flash), CIL (C#). It doesn't necessarily mean that the stacks are separate after JIT compilation, but from the perspective of the bytecode instructions the operand stacks are separate.
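You can see the per-frame operand stack reflected in CPython's code objects (a small illustration):

```python
import dis

def f(x):
    return x * 2 + 1

# Every code object records the maximum operand-stack depth its frame needs;
# the frame is given that much stack space when the function is called.
print(f.__code__.co_stacksize)

# LOAD_FAST/LOAD_CONST push operands, the binary ops pop two and push one back.
dis.dis(f)
```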
CPython compiles Python source code to bytecode, but it never compiles the bytecode to machine code. Instead it interprets the bytecode, reading one instruction at a time and dispatching through what is basically a giant switch statement with a case for every possible opcode.
A JIT would compile the bytecode to machine code and then run that directly (at least for frequently executed code paths). There is no "switch" anymore: each bytecode instruction has already been replaced by the corresponding machine code.
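To make the "giant switch" concrete, here is a toy interpreter loop in the same style (a sketch, nothing like the real ceval.c):

```python
def run(code):
    # Dispatch loop: look at each instruction and branch to the handler for it.
    stack = []
    for op, arg in code:
        if op == "LOAD_CONST":
            stack.append(arg)
        elif op == "BINARY_ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "RETURN_VALUE":
            return stack.pop()

print(run([("LOAD_CONST", 2), ("LOAD_CONST", 3),
           ("BINARY_ADD", None), ("RETURN_VALUE", None)]))  # -> 5
```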
> Is there any reason why official python doesn't have any JIT option?
A desire to keep the implementation simple and approachable (relatively speaking), as well as to avoid issues like performance cliffs.
Also, the C API has historically been extremely broad and has exposed what amount to implementation details, and keeping all of that working properly with a JIT is difficult (at least for anything but a simplistic, macro-ish JIT).
PyPy is not fully compatible with CPython: you won't always get the same behaviour, and the CPython C API is not guaranteed to be fully supported. So I'm not sure that having a JIT that is fully compatible is easy.
Most terms in language implementation are fuzzy, but just-in-time compilation most often refers to switching from interpreting bytecode to generating machine code (and running that) in specific spots, after having analyzed the currently running bytecode for a while.
"Classical" (again, every term is fuzzy) JIT compilers either do this machine code compilation after seeing a good candidate _entire function_ or a good candidate _section of code within a function_. Good candidates are often areas of code that are executed a large number of times and with consistent internals (e.g. iterating from 0 to 10000 with variables inside that have provably fixed types).
But there are infinite variations of JIT compilation.
In any case, CPython doesn't do that switch from bytecode to generated machine code. PyPy does, as do V8 and the JVM and so on.
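The "executed a large number of times" part is basically just counting; a toy sketch of the trigger only (the hard part, actually emitting machine code, is elided):

```python
import functools

HOT_THRESHOLD = 1000  # made-up number; real JITs tune this carefully

def count_calls(fn):
    # Toy illustration of hot-spot detection: count executions and flag the
    # function once it crosses a threshold. A real JIT would compile the
    # bytecode (or a trace through it) to machine code at that point.
    calls = 0
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        nonlocal calls
        calls += 1
        if calls == HOT_THRESHOLD:
            print(f"{fn.__name__} looks hot; a JIT would compile it now")
        return fn(*args, **kwargs)
    return wrapper
```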
Why didn't the Python committee opt for a compiled system (like PyPy) when they moved to the 3.0 series (and had to break backward compatibility anyway)?
You are right. I think that Python is trying to be as expressive and succinct as possible; a runtime like PyPy is very difficult to change, and it would therefore make it much more difficult to evolve the language.
Oh I see what you mean, I misread the sentence. Switching between language backends has nothing to do with compatibility. Yup makes sense. They can swap implementations at any time.
The point really is that PyPy has some compatibility issues with the C API, I think mostly because of the garbage collector. This has less to do with whether you compile or interpret the bytecode, yes.