Multicore Programming in PyPy and CPython (morepypy.blogspot.dk)
99 points by disgruntledphd2 on Sept 30, 2012 | hide | past | favorite | 51 comments



Although it adds another piece of software on top, I've been using Python with Redis a lot lately, and it's wonderful. Here's a 'base' for building the larger program. It doesn't make a single python instance multithreaded, but it easily allows me to connect other computers on my network to join in the processing fun.

    from multiprocessing import Process

    # readFile and mapper are defined elsewhere in the larger program
    filePath = "/home/user/file"
    p = Process(target=readFile, args=(filePath,))   # single reader avoids a seek storm
    p.start()

    numProcesses = 6
    for x in range(numProcesses):
        p = Process(target=mapper, args=())          # fan out worker processes
        p.start()
So I have a file reader to prevent a seek storm on a large file, I read something (like separations between tags that I want to process more), I put the binary data in a "ready:NAME" key, with "NAME" going into a "readyList", and things go wonderfully. It makes my data persistent without writing to files or pickling, and I can easily share data between process instances.
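Schematically, that handoff might look like the sketch below (hypothetical: `publish_chunk` and `claim_chunk` are names I'm making up, and the client `r` is duck-typed here; in practice it would be a redis-py connection):

```python
def publish_chunk(r, name, data):
    # stash the binary payload under its own key ...
    r.set("ready:" + name, data)
    # ... then announce the name so any worker, on any machine, can claim it
    r.rpush("readyList", name)

def claim_chunk(r):
    # blpop blocks until a name is available, so workers can join at any time
    _key, name = r.blpop("readyList")
    # (with real redis-py the popped name comes back as bytes and needs .decode())
    return r.get("ready:" + name)
```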

My question is - what advantage does multicore programming offer that my take wouldn't? I can import Process and chain things out over my 8 cores, but why would I do that if I can have persistent data with the ability to more easily add processing power when I want to?


well it allows you to share in-memory, mutable state between processes. if you have problems that don't need that then yes, parallelization is easy (the python-only version of the above would be to use multiprocessing). but maybe i've missed the point (seem to be making error after error at the moment...)

[edit: to clarify a bit more, the example given in the article linked is making each execution of a loop run in parallel, even if the loop affects some variable. so you would hope to get speedups in code that isn't written explicitly to be executed by multiple processes - without messaging etc. in particular, it guarantees safety while giving the chance (sometimes it doesn't work out) for speedups.]


Message passing is usually a "better" method of IPC anyway. The situations that genuinely require shared memory are rare.


Perhaps in your line of work, but in scientific computing, it's quite common. Scientific codes have parallelism at multiple levels, but loop-level parallelism is extremely common. Simply, you have large vectors or matrices that are operated on by a loop. The iterations of the loop are (mostly) independent, so you can use multiple threads to do them in parallel. Assuming shared memory, of course; the communication cost of transferring the vectors or matrices to another process over typical IPC mechanisms would kill the benefit of the parallelism.
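In Python terms, the shape of such a loop-split might look like the toy sketch below (illustrative only: in CPython the GIL prevents a real speedup unless the loop body releases it, as NumPy kernels do - which is exactly the thread's topic):

```python
from concurrent.futures import ThreadPoolExecutor

def scale(vec, out, lo, hi):
    # iterations are independent: each touches only out[i]
    for i in range(lo, hi):
        out[i] = 2.0 * vec[i]

vec = list(range(8))
out = [0.0] * len(vec)
mid = len(vec) // 2
with ThreadPoolExecutor(max_workers=2) as ex:
    ex.submit(scale, vec, out, 0, mid)         # first half
    ex.submit(scale, vec, out, mid, len(vec))  # second half
# the `with` block waits for both workers before continuing
```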

OpenMP (http://openmp.org/wp/) for C, C++ and Fortran is the most common means of exploiting such parallelism in the high performance computing world.


I'm curious about what field of scientific computing you work in. I am only familiar with MPI and hybrid MPI/OpenMP codes. I haven't come across any OpenMP only programs ever in chemistry or physics.

If you have a loop with independent iterations then you can use message passing or shared memory to approach your problem (as well as loop vectorization). The main problem with message passing is that for very large datasets you can run into memory limitations.


I don't, really. I do system software for high performance and distributed computing. The scientific applications are given to me, and are largely benchmarks. (Although, now, I don't do much actual HPC, but high-throughput, low-latency stream processing.)

I have a friend who has a galaxy simulation that uses OpenMP but not MPI. The reason is simple: she doesn't have the expertise to make it distributed. Slapping a few OpenMP directives on the most expensive loops is easier than figuring out how to make it distributed using message passing. How much parallelism you extract out of a program is often a function of how much time and effort you can invest into it. Some people get "good enough" performance improvements by scaling on a single node.


It's worth pointing out that those algorithms are also generally the ones that most benefit from being ported to a GPU, which removes the need for a shared-memory mechanism in the host-side language.


Yes, if the latency of transferring the data and results to and back from GPU memory is low compared to the speedup.


I'm also finding pass-by-value multiprocessing with Redis to be extremely effective for going wide on I/O-bound problems. It's just so easy ...


It would be good to first make it possible to write Python code that can execute concurrently using plain old locks-and-conditions-style synchronization. When that works efficiently, things like transactional memory can add value, but not before.

As for automatic mutual exclusion ("AME"), that sounds pretty far-fetched to me. I don't think there's a production-worthy implementation of anything like this out there; all parallel programming environments need some kind of annotations from the programmer to keep stuff in sync. Is anyone aware of any research into systems like this?

If anyone is interested in parallel programming features in languages, Haskell's parallel programming facilities should be worth looking at. Not because the language or functional programming would be some kind of magic medicine, but simply because there's been years of research and hard work put into making Haskell work in a concurrent environment.


So are you saying the current STM prototype should be put on hold until a "plain old locks" Python comes along?


Well you can do that in Jython, but as we can see, not so many people actually "need" this.


I think the main problem is that the vast majority of Python programs assume that things like dict lookups are atomic. With small locks (which would be really hard to do with CPython anyway), that guarantee is hard to keep.

The nicest thing about STM is that transaction start/stop map quite well to GIL acquire/release, so it should fit the common cases quite well.
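A toy illustration (not from the article) of why those atomicity assumptions matter: single dict reads and writes are atomic under the GIL, but a read-modify-write like `d[k] += 1` is not, so a fine-grained-locking CPython would need explicit guards all over existing code.

```python
import threading

# The load-add-store below is several bytecodes with a race window in
# between; the lock plays the role the GIL's coarse atomicity plays today.
counts = {"hits": 0}
lock = threading.Lock()

def bump(n):
    for _ in range(n):
        with lock:                # guards the load-add-store sequence
            counts["hits"] += 1

threads = [threading.Thread(target=bump, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counts["hits"] is now exactly 4000, regardless of interleaving
```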


Here's his EuroPython talk http://m.youtube.com/watch?v=pDkrkP0yf70

And here's another talk, about an alternative approach, by Robert Hancock: http://lanyrd.com/2011/pygotham/shppw


this gives a route for AME on STM in CPython via the LLVM. it assumes (afaict - it took me a while to grok the article and i'm still not sure i have it right) that the CPython GIL will be removed. but we already have a failed attempt to remove the GIL via LLVM (unladen swallow). so the obvious question is - how meaningful is that failure? does it mean that removing the GIL via LLVM is hard? how hard? because without removing the GIL, AME and STM are pointless, right?

[edit: ok, thanks for the reply. it turns out it was dropped from the roadmap for unladen swallow because the garbage collection library they thought would help turned out not to - see https://code.google.com/p/unladen-swallow/wiki/ProjectPlan#G... - so i guess this question is much less interesting. sorry.]

unrelated note / question. this is the first time i've heard that HTM is restricted to the cache(s) (although it makes a lot of sense in retrospect). how serious a limitation is that for other languages? is haskell planning to use it? what about clojure (JVM - guess not)? and what happens to HTM if the cache overflows?

AME: automatic mutual exclusion, STM: software transactional memory, HTM: hardware transactional memory, LLVM: itself [fixed, thanks], GIL: global interpreter lock, JVM: java virtual machine, uff.


Acronym nitpick: LLVM originally stood for low-level virtual machine, not little-language. It doesn't really have a connection to the little-language or domain-specific-language communities, but instead originates in a desire to provide a compiler target that was lower-level than generating C code, but less platform-specific than directly generating machine code (C-- is another project in that space). But it's since been renamed to just LLVM, a brand that doesn't officially stand for anything.


Unladen Swallow never failed to remove the GIL, that goal was never even attempted before the project fizzled out.


I find this incredibly interesting. I have a production server running a Flask app, performance is not stellar, and I've been meaning to try running it with PyPy and a web server, but I have yet to find a good resource for this. Does anyone have any tips? It seems kind of counterintuitive to see all these performance numbers with PyPy > CPython when I can't even run it in production.


It's strange that you have a problem with Flask in production. From what I've heard, there shouldn't be much of a problem with it. It should easily be able to handle 1,000 basic requests per second.

Have you done any performance analysis to figure out where the bottlenecks might be? Have you tried the flask-debugtoolbar (http://pypi.python.org/pypi/Flask-DebugToolbar)? How many requests do you get? What's the load average of the process? Is there an un-indexed SQL query you don't know about? In the most extreme case, have you put a debugging sleep(1) call in your code which you've forgotten to remove?

Without knowing more about where your Flask app is being bogged down, it's not that worthwhile to switch to pypy.


Are you sure your app is not db or I/O bound? I would suggest giving https://github.com/mgood/flask-debugtoolbar a try and getting some numbers before making a switch.


Can you elaborate on what is not performing well?


Haha wtf. After learning Haskell and seeing stuff like lparallel for CL, I just fundamentally can't pretend this is even worth learning. The GIL and the lack of foresight in designing Python are problems that are too fundamental to the language.

I love Python, I've used it for a long time, but using it for anything other than readable shell scripting is a waste of time. It's using the wrong tool for the job. If you are worrying about speed beyond algorithmic complexity, learn a different language.


> The GIL and the lack of foresight in designing Python are problems that are too fundamental to the language.

The GIL isn't part of the Python language, it's part of the implementation of CPython. Jython for example doesn't have a GIL.


Did you read the article? There are some interesting ideas being explored, really just theory for now. I don't see it as asking you to learn anything unless you want to participate in theorycrafting.

To me it seems like a waste of time to build crud websites in CL, but if people want to write articles about ways to make that experience better, i'm not going to write 'haha wtf' in response. (Yes, all analogies suck)


You rather lack imagination if you can't think of an appropriate domain for using Python other than "readable shell scripting".


Sometimes performance is unrelated to the ability to thread, so the GIL is not always a relevant issue.


Do you have examples of lack of foresight?


There are many examples of things that people now call "lack of foresight", but I think people quickly gloss over just how long Python has been around. It started in December 1989, when the first Pentium was still three years into the future. If Guido had tried to "foresight" his way to a language with an acceptable multicore solution at that point (on any hardware he could afford), the only possible result is total failure. I don't think there are many attributes the language currently possesses that were truly that avoidable. Most people's snap suggestions result in incurring costs to the language that it may not be able to recover from when it is young.

Personally, I think the truth is that Python is what it is: a very, very, very good OO language that has reached the mature stage of its life. It probably won't be able to make the leap to "true" multicore, and besides, even if it does, it isn't likely to be very successful anyhow, because it'll still be a very slow-but-powerful language.

(There's very little point in taking a language very near the bottom of the Shootout and adding 4 or even 8 way parallelism to it, when you could just rewrite the target hotspot in C and get the same performance on one core. The math just doesn't favor trying to add lots of "multicore" to such a slow language. Python's one of my favorite languages, but that doesn't mean I can't see where it is weak.)


PyPy (together with LuaJIT and TraceMonkey) has been banned from the shootout, partly because we complained about unfair rules. Looking at the shootout to assess the relative performance of languages when you cannot compare high-performance implementations of a language is a very, very bad idea. Looking at the shootout to compare would very likely be a bad idea anyhow, but right now it's just completely useless.


I was confining myself to CPython on the grounds that it is still what people complain about in terms of adding multicore to it. People don't complain about PyPy not having it since it's still developing, and it is not yet known whether PyPy will be fast enough to be worth worrying about multicore. To be honest I'm skeptical, but open to the possibility. But I rather suspect in the end that Python will forever be a slower language than the competition; all the dynamicness doesn't come for free.

Also the nicer primitives that IMHO the really-useful multicore languages are building on are either impossible (Haskell) or too late (Go/Erlang) for Python to deeply embrace due to massive legacy libraries and code. If you want to make threading "work" in Python, go nuts, it will bring much benefit, but Python will simply never be the first choice for tasks in which multicore performance and safety is the first or second priority, rather than the fifth or sixth. And that's fine. I'd really rather see Python become a better Python than see it become a crappy Go or something.


PyPy (on its good day) is within the 2x mark of equivalent C. We're planning on closing this gap, but we're also planning on making every day the good day (right now you kind of have to know what you're doing to hit the sweet spot). This might be an interesting read for you: https://www2.cisl.ucar.edu/sites/default/files/2cameron_spar...


Can you give details and/or a citation for the complaints and the banning? I use the shootout a lot to give me a rough idea of a specific language's properties, so if there are valid criticisms then I'd like to hear them, but your post doesn't really stand on its own to someone who doesn't know the backstory.


There is pypy-dev discussion about the subject that starts here: http://mail.python.org/pipermail/pypy-dev/2011-April/007139....

It's a long and boring read, but here you go.

As for the actual complaints, as far as I remember (this was a while ago):

* our versions got rejected because they perform slower on CPython (while being faster on PyPy) and because they're 'unobvious'. At the same time, the C implementations of a lot of things there are very unobvious.

* we were not allowed to use the array.array module, despite it being in the standard library.

* you can use the gmp library from C and CPython, but not from PyPy via ctypes

* C can use a bad random (fast, but with a really bad distribution), while PyPy cannot, because Python comes with a random in the stdlib

* custom malloc libraries for C are fine, but tuning JIT parameters per-benchmark is not.

And various complaints like this. After some discussion, Isaac just kicked us out altogether. Overall, my personal impression of the experience was that the shootout is there to showcase things that are predetermined, and we were trying to attack the status quo. This is obviously very personal; I don't have any actual data to back up this claim.


READER BEWARE!

>>our versions got rejected because they perform slower on CPython (while being faster on PyPy)<<

Maciej Fijalkowski knows that I asked him to contribute a PyPy version of n-body because the written-for-CPython version failed with PyPy.

Maciej Fijalkowski knows that his own PyPy version of n-body "performed slower on CPython (while being faster on PyPy)" -- AND WAS NOT REJECTED.

http://mail.python.org/pipermail/pypy-dev/2011-April/007177....

Justify your accusation or take it back!

>> you can use gmp library for C, CPython, but not PyPy via ctypes <<

Joe La Fata's pi-digits code worked first time on x86 and x64, on PyPy and CPython and Python3; and used ctypes to get to GMP -- AND WAS NOT REJECTED

http://anonscm.debian.org/viewvc/shootout/shootout/bench/pid...

Justify your accusation or take it back!

>> C can use a bad random (that's fast, but also gives a really bad distribution), while PyPy not because Python comes with a random in stdlib <<

The fasta task uses random numbers, and every fasta program (including C and Python) implements the same random function.

http://shootout.alioth.debian.org/u32/program.php?test=fasta...

http://shootout.alioth.debian.org/u32/program.php?test=fasta...

Justify your accusation or take it back!

----

Tell Maciej Fijalkowski to justify his accusations!


Look Isaac -- I don't care how things are now; PyPy got kicked out anyway, so I seriously don't care any more.

gmpy uses C extension, it's not using ctypes. We disagreed over a ctypes version at some point.

random - indeed it got fixed (and it's too late to edit a parent if you wonder)


>> we were not allowed to use array.array module, despite being in standard library. <<

2 April 2011, Joe LaFata contributed a faster for PyPy spectral-norm program. His program worked first time on x86 and x64, on PyPy and CPython and Python 3. His program was measured and published within 24 hours.

    from array import array
http://shootout.alioth.debian.org/u32/program.php?test=spect...

Justify your accusation or take it back!


>> gmpy uses C extension, it's not using ctypes. <<

    import ctypes
    from ctypes.util import find_library

    _libgmp = ctypes.CDLL(find_library("gmp"))

    class mpz_t_struct(ctypes.Structure):
        _fields_ = [("mp_alloc", ctypes.c_int),
                    ("mp_size", ctypes.c_int),
                    ("mp_d", ctypes.c_void_p)]


>> random - indeed it got fixed <<

Nonsense! The fasta task has ALWAYS required ALL the programs to implement the same random function.

(Use a different random function and the program output will fail the diff.)


>> so I seriously don't care any more <<

You don't seem to care that your accusations are demonstrably false.


This is a full writeup: http://alexgaynor.net/2011/apr/03/my-experience-computer-lan...

You're correct, I got my facts wrong. We were allowed to use ctypes for gmpy, but not for something else. We weren't allowed to submit a ctypes-using and array-using benchmark that didn't work on CPython due to a CPython bug.

The end result of us trying to be slightly better was that we ended up being kicked, and a few other people ended up being kicked as well, though they weren't really guilty.

Overall the experience was that the results are not "fair" and are guarded by arbitrary rules that are not really written anywhere. One of those rules is that only one arbitrary (typically the most popular, but not necessarily) implementation is allowed. This makes it much less trustworthy than it could actually be.


>> a full writeup <<

No.

The blog post tells us that Alex Gaynor confirmed "with some CPython core developers" that his program didn't work because of a bug in CPython.

But the blog post doesn't tell us that Alex Gaynor never said there was any problem with a CPython bug.

Alex Gaynor's blog post tells us that "It's also not possible to send any messages once your ticket has been marked as closed, meaning to dispute a decision you basically need to pray the maintainer reopens it for some reason."

But that's completely untrue! You can send messages when the ticket is marked closed! And you can open topics in the public forum! And you can click on a username and send email in 2 clicks.

There just wouldn't be any story to blog about, if Alex Gaynor admitted that he could easily have told me -- the bug is in CPython not in my program, so show my program -- but chose to say nothing.

>> arbitrary rules that are not really written anywhere ... an arbitrary ... implementation is only allowed <<

Have you even read the home page?

"There exist multiple implementations for some programming languages - different C++ compilers, different Java VMs - but those other language implementations are not shown here."


seriously, I don't think this is the correct medium - however you did refuse our solution using ctypes and you did refuse our solution using array at some point. Maybe you changed your mind later. This is one example:

http://mail.python.org/pipermail/pypy-dev/2011-April/007169....

You also kicked PyPy out around the same time the array benchmark was introduced (within days).


>> I don't think this is the correct medium <<

You chose to make false accusations here.

>> however you did refuse our solution using ctypes and you did refuse our solution using array at some point <<

You don't seem to know what program was supposedly refused for using array - so why do you continue to make that accusation?

Presumably by "our solution using ctypes" you mean Alex Gaynor's revcomp program - I'll say something about that where you linked to his blog post.

>> You also kicked PyPy out around the same time the array benchmark was introduced (within days). <<

No more than coincidence -- Joe LaFata contributed excellent PyPy pi-digits, spectral-norm, mandelbrot programs in one week; and they were all accepted.


Cool that you're linking to this mail; it shows yet another problem that I forgot about - that PyPy extensions are uncool, while GCC extensions are cool (say, a custom malloc library for GCBench - after all, the benchmark doesn't do anything, so an empty malloc would do just fine).


>>a rough idea of a specific language's properties<<

No, but maybe a rough idea of a specific language implementation's properties.

>> if there are valid criticisms <<

There are plenty ;-)

http://shootout.alioth.debian.org/dont-jump-to-conclusions.p...

Readers obviously jump to conclusions based on box plots and summary statistics -- when they really need to look at the individual tasks and look at the program source code, because there's so much variation.

I think there would need to be incredibly tight constraints on minimum program run-time, source code size, and memory use; before summary statistics could be taken at face value.


>> partly because we complained about unfair rules <<

That's not true.


I take your word for it - I genuinely don't know, the correlation is however hard to miss.


I also stopped making measurements for Java -Xint, LuaJIT, TraceMonkey, CPython, IronPython, and Ruby 1.8.7.

Post hoc ergo propter hoc is a well known fallacy.

>> I genuinely don't know <<

When you don't know, don't make accusations.


Haskell 1.0 was released in 1990.


The "History of Haskell" paper, a fascinating read in and of itself, is replete with admissions of lack of foresight. The reason Haskell has fared better than Python is probably only that it has a much bigger emphasis on trying new ideas in the reference implementation. The biggest problems with "Python" (GIL, FFI, standard library) are mostly issues with the reference implementation.

http://research.microsoft.com/en-us/um/people/simonpj/papers...


And where Python entered production usage in the late '90s, built up a huge community and a base of code, and continued building that base all the way to the present, Haskell didn't really begin that until around the mid-2000s. It may be an older language in calendar terms, but in terms of libraries and tooling support it's basically 5 years old or so.



