Swap_8_and_9: A simple import can modify the Python interpreter (kenschutte.com)
152 points by ks2048 on Aug 9, 2023 | 41 comments



Related: one of my favorite StackExchange questions: "Write a program that makes 2+2=5"

https://codegolf.stackexchange.com/questions/28786/write-a-p...

Really shows you the skeletons hiding in some languages. My favorite is Haskell, which will happily do what you tell it to.


I legit burst out laughing at the C solution (https://codegolf.stackexchange.com/a/28889). Definitely got me.


The Haskell one is not that spectacular if you understand that operators are just functions in that language:

    λ> let 2+2=5 in 2+2
    5
This simply defines a new local function that happens to be called "+". It's no different than something like this in JavaScript:

   let Math = { sqrt: (x) => 42 };
   console.log(Math.sqrt(16));


This reminds me of an SO user that was eventually banned because he (or possibly she, we will never know) had a hobby of writing fairly detailed questions that looked totally innocent on the surface but got strange once you started digging into them. Those that suffered through would eventually realize the whole thing was masterfully crafted to waste your time. There was once an excellent collection of these but I can't find it now and wish for the life of me I could.


The Ruby versions are a bit underwhelming. This one is not 2+2=5, but it's much more underhanded:

https://youtube.com/watch?v=F3feRCr6S64

(recommend watching it by doing the four "tests" before going to the reveal)


Forth is probably cheating considering it has no restrictions when naming symbols.

  : 4 5 ;
  2 2 + .


Did you try this? This won't work.

What you did will make inputting a 4 yield a 5, but it won't change the behaviour of outputting a 4. (And it will only turn an input 4 into a 5 if your interpreter checks for words before checking for numbers which is not universally the case).


Try : + drop drop 5 ;


this is how javascript works


You don't even need C:

    ctypes.c_int.from_address(id(8)+24).value = 9


That works only on 64-bit systems. You need something like this instead of 24:

    ctypes.sizeof(ctypes.c_long) * 3
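
A read-only sketch of one way to avoid hard-coding the offset, assuming CPython: `int.__basicsize__` reports how many bytes precede the digit array in an int object, and `int.__itemsize__` the width of one digit. The helper name `first_digit` is made up for illustration. This leans on implementation details that may differ across versions and builds, and it only reads memory rather than writing to it.

```python
import ctypes

# CPython-only assumption: an int's digits start int.__basicsize__ bytes
# into the object, and each digit is int.__itemsize__ bytes wide.
digit_type = {2: ctypes.c_uint16, 4: ctypes.c_uint32}[int.__itemsize__]

def first_digit(n):
    """Read (not write) the lowest digit of a small positive int."""
    # In CPython, id(n) is the object's memory address.
    return digit_type.from_address(id(n) + int.__basicsize__).value

print(first_digit(8))
```

Swapping the read for an assignment to `.value` is what turns this into the trick from the article, so treat it as a curiosity only.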


good thinking I definitely need this useful and widely used code to be legible and platform independent


Is there really anything unexpected here? I would have thought that it is obvious that when importing some library, that library will execute code already at import time. And that you can do strange things, esp when messing with the underlying C structures, should also be clear.

Even in C itself, you can do the same in shared libraries. It's a very important functionality that the library can run some init code when it is being loaded. And you can do all kinds of strange things, modifying some random memory.


The CPython source is really a bit of a joy to look through. The lowest level stuff is really tough, but generally it's all relatively straightforward.

Though with the specializing compiler work, some stuff is... less obvious. But I still generally find it straightforward when I want to know some detail about how the language works.


You get a hint that Python's import is not a simple "add name to scope" when importing a builtin package opens a web browser:

    import antigravity
As others have mentioned, every line of code not in a function/class gets executed when imported, except if guarded with an `if __name__ == '__main__':` (only true when executing the script with `python xxx.py`). A related catch: functions' default arguments are also evaluated when imported the first time.

Try creating `fun.py`:

    def evil():
        print("Gotcha")
        return 1
    
    def abc(x = evil()):
        print(x)
Now: `python fun.py` or `python -c "import fun"`


How is that different from what most dynamic languages do? I can't think of one where you wouldn't be able to do the same.


And this is why optimizing Python code is really hard. When at runtime you can change almost any aspect of the language it's virtually impossible to give a semantics for Python code beyond "run it and find out".


While optimizing Python code is indeed really hard, this is not a good example of why.

It uses implementation-specific details which are outside of the scope of anything to do with Python semantics.

It's roughly equivalent to:

  #include <stdio.h>

  int nine = 9;

  int main(void) {
    printf("nine = %d\n", nine);
    return 0;
  }

  /* in a library */
  __attribute__((constructor))
  static void sneaky(void) {
    int *n = (int *) &nine;
    *n = 8;
  }
Your hyperbole simply isn't true, as demonstrated by the many static code analysis tools for Python. They can't handle all cases, certainly, but they demonstrate it's mostly possible to give semantics for Python code without running it.


I dunno, the fact that integers are by (often mutable) references has to make it really difficult for optimization.

You don't have to be "sneaky" for this to bite you with Python. Maybe it looks obvious when stated in a bare-bones fashion but this bug was not easy to track down in a larger code base:

   i = 1
   incr_by_1 = lambda x : x + i
   i = 4
   incr_by_4 = lambda x : x + i
i is a reference in both, so incr_by_1 and incr_by_4 are equivalent at this point. If anyone assigns to i, then their behaviour will change.

In most languages, integers are values so an optimiser has a chance to (for example) replace incr_by_1 by a single CPU increment instruction but can't do it here as the value "pointed to" by i needs to be fetched according to Python semantics.
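
A minimal sketch of the late-binding behaviour described above, plus one common workaround (binding the current value through a default argument). The names `incr_late` and `incr_bound` are made up for illustration.

```python
i = 1
incr_late = lambda x: x + i         # looks up i each time it is called
incr_bound = lambda x, i=i: x + i   # default argument captures the value 1 now
i = 4

print(incr_late(0))   # 4
print(incr_bound(0))  # 1
```

The default-argument form freezes the value at definition time, which is usually what the original code intended.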


Agreed! Optimizing Python code is indeed really hard, and the lack of const and ability to describe capture semantics don't help.

To be fair, the equivalent in C++ is:

  #include <iostream>
  int main() {
    int i;
  
    i = 1;
    auto incr_by_1 = [&](int x) {return x + i;};
    i = 4;
    auto incr_by_4 = [&](int x) {return x + i;};
    std::cout << incr_by_1(0) << " and " << incr_by_4(0) << std::endl;
    return 0;
  }
which prints "4 and 4". Replace the first [&] with [i] and it prints "1 and 4".

A Python implementation also can't replace incr_by_1 by a single CPU increment instruction because it doesn't know the type of x.

That's still a far cry from being unable to give semantics for Python code without running it.


Not really. Stuff like this is shown around from time to time as a massive "gotcha" kind of thing for a few languages, but it's really just the nature of boxed primitives and interning (=literal bindings).


Ok, though most of the time this isn't the kind of thing that changes.


Funny how things come around again and again.

Fortran had basically the same bug, fifty years ago. https://softwareengineering.stackexchange.com/a/254921


This isn't a bug, it's absolutely expected behaviour. The author has just dressed it up in a blog post to make something of it, but anyone who has written a python library will know that code that isn't in a function (including function default arguments) gets evaluated on import. You don't need a C-extension to do that part. Then he messes with some internals, which isn't surprising either since python's philosophy is very much "internals are available - caveat emptor but if you want to mess with them go nuts".


Side note: distutils is deprecated and will be removed in 3.12.

In this case, setuptools is a drop in replacement.


That’s clever, but illustrates something not widely appreciated:

When you import a module, Python executes it. For instance, `def` isn’t syntax that says “hey compiler, this is a function!” It’s a statement that’s executed at runtime to define a function. You can put any code you want at the top level of a module and it’ll get executed when the Python interpreter gets to that line.
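
A small sketch of `def` behaving as an ordinary statement: which body a name ends up with depends on which `def` statements actually run, and a later `def` simply rebinds the name. The function name `describe` is made up for illustration.

```python
import sys

# Each `def` is executed in order, so which one runs is decided at run time.
if sys.version_info >= (3, 0):
    def describe():
        return "defined in the first branch"
else:
    def describe():
        return "defined in the second branch"

before = describe()

# Executing another `def` rebinds the same name to a new function object.
def describe():
    return "redefined later"

after = describe()
print(before, "->", after)
```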


IMO Python imports behave like the bash source command.

This is why people use the `if __name__ == "__main__"` guard. Most people put it in all their scripts even if they don't know the reason why.

It's a feature not a bug IMO. You can use importing a .py file as a singleton hack. You can also use `importlib.reload` to re-load a module, which re-runs its top-level code and resets the names it defines.


Python imports collect the locals in the "module script" and store them in a module object, which is then made available to the module that ran `import`. That module object is cached, so reimporting the module another time will not run the code again.

`source` is much more primitive.
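
A sketch of the caching described above, assuming a throwaway module written to a temporary directory (the module name `import_cache_demo` is made up). The module body counts how many times it has run: a second import leaves the count alone, while `importlib.reload` runs the body again in the same namespace.

```python
import importlib
import sys
import tempfile
from pathlib import Path

MODULE_NAME = "import_cache_demo"  # hypothetical name for this demo
SOURCE = 'RUNS = globals().get("RUNS", 0) + 1\n'

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, MODULE_NAME + ".py").write_text(SOURCE)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        first = importlib.import_module(MODULE_NAME)   # runs the module body
        runs_after_first = first.RUNS
        second = importlib.import_module(MODULE_NAME)  # served from sys.modules
        runs_after_second = second.RUNS
        importlib.reload(first)                        # runs the body again
        runs_after_reload = first.RUNS
    finally:
        sys.path.remove(tmp)
        sys.modules.pop(MODULE_NAME, None)

print(first is second, runs_after_first, runs_after_second, runs_after_reload)
```

Because a reload re-executes the source inside the existing module object, names the source defines are rebound, but names added to the module from elsewhere stay in place.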


Python imports are much more principled than sourcing bash though. They are executed in a new namespace, and subsequent imports reference that namespace directly instead of re-evaluating the code.

C extensions don't significantly change matters because the module is still constructed by procedural C code.


It’s definitely a feature! Just one that’s often not understood. If you include a file in a C program, that code’s just sitting there until you call it (more or less, yada yada #define, etc.). If you import a Python module, it executes the code in it. That code is typically a set of statements that defines functions and classes, but it could be anything.


This reminds me of one of my internet white whales. It's very similar to this, but it goes through a ton of different ways that a single `import malevolent` or something along those lines can completely change how normal Python works. Mentioning it here in case anyone remembers it!


Piece of cake in FORTH:

    : REAL8 8 ;
    : 8 9 ;
    : 9 REAL8 ;


: 8 #9 ; : 9 #8 ;


Why is that array not const in the CPython code?


Because every Python object also contains a reference count (which needs to be modified whenever the object is passed around), a `const PyObject*` is effectively useless.
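
A quick sketch of why read-only objects are awkward in CPython: merely binding another name to an object changes the reference count stored inside it. `sys.getrefcount` is CPython-specific, and the absolute numbers it reports are not meaningful here, only the difference.

```python
import sys

obj = object()
before = sys.getrefcount(obj)  # includes the temporary reference held by the call
alias = obj                    # binding another name updates obj's stored refcount
after = sys.getrefcount(obj)

print(after - before)
```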


But we’re talking about an integer cache. It’s full of singletons; why do they need to have reference counts?


This special casing just had not been implemented yet. But as it is an interesting optimization, more so with multi-interpreter or no-GIL Python, the developers will actually introduce immortal objects in Python 3.12 to avoid counting references on some objects (PEP 683 has been accepted):

https://peps.python.org/pep-0683/


That wouldn't make a difference in this case, would it? I suppose it would mean that the code would be UB, but in practice it'd just work the same.


Would this affect other Python interpreters or just CPython?


CPython caches the small integers, and this is just grabbing a reference to 8 and 9 and then altering the value of the integer held in each cached reference.

The author could have skipped the C library and used the ctypes module to munge the bits.

There's no guarantee that any other version of Python would use the same caching or the same structure layout, and it certainly would not be able to link with the same C library.

So, yeah, it's specific.

A fun little adventure into how things work for the author, though.
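
A small sketch of how the small-integer cache shows up from plain Python, assuming CPython (where ints from -5 through 256 are currently preallocated and shared). This is an implementation detail rather than a language guarantee, so other interpreters may behave differently.

```python
# Build the ints at run time so the compiler cannot merge equal constants.
a = int("8")
b = int("8")
print(a is b)  # True on CPython: both names refer to the one cached 8

big_a = int("100000")
big_b = int("100000")
print(big_a == big_b, big_a is big_b)  # equal values, but separate objects on CPython
```

That shared object is why overwriting the cached 8 affects every later use of 8 in the same interpreter.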


This seems to be a C extension so will likely only work with CPython, but since Python is so dynamic you can do all sorts of weird things just using plain Python.



