Optimize Python with closures (magnetic.com)
80 points by jaimebuelta on May 8, 2015 | 31 comments



If you're going to optimize with closures, why not go a step further and pull the conditional out of the closure? You could instead have a conditional in the function that produces the right closure based on mode rather than constantly re-checking mode every single time you call the closure.
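
A rough sketch of what I mean, with mode/categories as placeholder names in the spirit of the article's example (not its actual code), and assuming "whitelist" means keep requests that share a category and "blacklist" the inverse:

    def make_filter(mode, categories):
        # Decide once, up front, which behaviour is wanted, so the
        # returned closure never re-checks mode on each call.
        if mode == "whitelist":
            def filter_request(bid_request):
                return not categories.isdisjoint(bid_request["categories"])
        else:
            def filter_request(bid_request):
                return categories.isdisjoint(bid_request["categories"])
        return filter_request

    accept = make_filter("whitelist", {"sports", "news"})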

Also, maybe it would be worth discussing building temporary variables in the outer scope that resolve bound methods before the closure is called (i.e. my_list_append = my_list.append). There was no opportunity to do this in the example, but it's worth mentioning.
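
For instance (hypothetical code, just to illustrate the pre-bound-method idea):

    def collect_matches(bid_requests, categories):
        matches = []
        matches_append = matches.append  # resolve the bound method once
        for bid_request in bid_requests:
            if not categories.isdisjoint(bid_request["categories"]):
                matches_append(bid_request)  # only a local name lookup in the hot loop
        return matches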

Finally, dict access seems like a bad idea if you're this pressed for performance, especially since you only need one value -- maybe better just to encourage explicitly passing in a parameter.


This same oddity (checking a mode flag rather than returning one of two different implementations) shows up in their original code example for "Classes": that mode field and dictionary configuration should not "pass most code reviews" unless the people you are showing it to do not know much about object orientation... the problem they describe (having different implementations that provide the same interface) should have been handled via polymorphism.
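
Sketched from their description rather than their actual code, the polymorphic version would be something like:

    class WhitelistFilter(object):
        def __init__(self, categories):
            self.categories = categories

        def accepts(self, bid_request):
            return not self.categories.isdisjoint(bid_request["categories"])

    class BlacklistFilter(object):
        def __init__(self, categories):
            self.categories = categories

        def accepts(self, bid_request):
            return self.categories.isdisjoint(bid_request["categories"])

The caller constructs whichever class it needs once, and no mode flag is ever consulted per request.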

(And as they care about performance, and have tested PyPy, they are hopefully using it, and so the class-based solution is just as fast as the closure-based one, and is going to allow for better decoupling of filter modalities; though, as someone who does a lot of functional programming, the closure-based approach does resonate with me.)


As I mentioned in a comment on Jaime's response (https://wrongsideofmemphis.wordpress.com/2015/05/08/optimise...), we actually do quite a bit more than this in our production code. The blog post here is meant to demonstrate the point, not to show our exact production code.

In particular, I wanted all 3 examples to use nearly identical code inside the filter function, to isolate the differences in the benchmark results to just the ways of accessing data, and to show that closures are an easy way to gain some performance in hot spots (in CPython at least).


I get that you wanted to simplify the example, but the example you've written really just doesn't make much sense, so it's hard to understand your point.

The API in the example is really bad. Accepting a dictionary, only to require a specific key in the dictionary, is the worst of both worlds. But then if you use .get to access the dictionary instead of attribute access, you'll take on additional performance penalties, and other solutions will start to compete.
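
A quick, unscientific way to see the gap (illustrative names, numbers will vary by machine):

    import timeit

    setup = (
        "d = {'categories': {'sports'}}\n"
        "class Request(object):\n"
        "    def __init__(self, categories):\n"
        "        self.categories = categories\n"
        "r = Request({'sports'})"
    )
    print(timeit.timeit('d["categories"]', setup=setup))      # plain indexing
    print(timeit.timeit('d.get("categories")', setup=setup))  # extra method lookup + call
    print(timeit.timeit('r.categories', setup=setup))         # attribute access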


This was the first thing that jumped out at me too when I saw the code. I kept expecting them to optimize that; it would have been cleaner too. I was very surprised when they wrapped it all up leaving the if statement still inside the closure.


> If you're going to optimize with closures, why not go a step further and pull the conditional out of the closure? You could instead have a conditional in the function that produces the right closure based on mode rather than constantly re-checking mode every single time you call the closure.

I tend to try and do that in any language as soon as a single conditional means a change in behaviour in different locations, regardless of optimization concerns. It makes for (IMHO) more robust code and much easier testing (though you need more mocks).


Why not just use Cython[1]?

Optimizing pure Python is a waste of time. You end up with weird, unidiomatic code that it takes ages to come up with, because you're fighting the language. And in the end you hit a wall: a point at which the code can't go any faster.

If you just write Cython, you can easily reason about what the code is doing, and how you should write it to be more performant. Ultimately you can make the code run as fast as C, if necessary.

[1] http://cython.org/


Exactly my thoughts. If the code is slow after algorithmic optimizations, I would go directly for PyPy, Cython, Numba, Pythran, etc. In my opinion it makes little sense to optimize pure Python.


Errr, did the article change since you read it?

"At Magnetic, we’ve switched from running all of our performance-sensitive Python code from CPython to PyPy, so let’s consider whether these code optimizations are still appropriate... PyPy is incredibly fast, around an order of magnitude faster than CPython in this benchmark. Because PyPy performs more advanced optimizations than CPython, including many optimizations for classes and methods, the timings for the class vs. closure implementations are a statistical tie."


What I said is that I would go directly to PyPy and related tools, without first trying to optimize in pure Python like they did. The guys at Magnetic came to the same conclusion in the end: "Curiously, the function implementation is actually slower than the class approach in PyPy." I don't find this surprising anymore because it is what almost always happens in my code.


I think the techniques discussed here have value, pedagogical and otherwise. Perhaps one is working on a codebase without the luxury of moving to PyPy or similar (e.g., my current one, which uses some incompatible libraries, so the transition would be nontrivial).

The fact that attribute and method lookups are much slower than plain local variables, or that closures can speed things up, is something that we might not always be mindful of, and that new Python programmers might not think through.
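
You can see the difference directly in the bytecode; a small sketch with my own illustrative names (not the article's exact code):

    import dis

    class ClassFilter(object):
        def __init__(self, categories):
            self.categories = categories

        def check(self, bid_request):
            return not self.categories.isdisjoint(bid_request["categories"])

    def make_filter(categories):
        def check(bid_request):
            return not categories.isdisjoint(bid_request["categories"])
        return check

    dis.dis(ClassFilter.check)        # self.categories -> LOAD_FAST + LOAD_ATTR per call
    dis.dis(make_filter({"sports"}))  # categories -> LOAD_DEREF (closure cell), no attribute lookup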


Ah! That makes sense. Thank you.


People say Python is "slow", so it's impressive to me that:

   our application handles about 300,000 requests
   per second at peak volumes, and responds in
   under 10 milliseconds
Of course that is using PyPy instead of CPython.


(C)Python is slow, relative to almost any other language. But that hardly means that every task in Python will take 3 seconds to perform or something. If you do something small, Python will do it quickly. Another language may do it more quickly, but if it's fast enough, it's fast enough.

Where I get really frustrated with the 90s-era dynamic scripting languages is not when you are "filtering one list" or "serving JSON out of the DB", but when you're trying to do all kinds of CPU-type work in one page, with numerous (unavoidable) DB accesses and filtering and transforming etc etc. You can get into "seconds" surprisingly quickly, since it is, indeed, a fairly slow language.

It is likely that rewriting this code in C++ or something could get another 5-10x out of it even over PyPy, probably even more if we really start going crazy with optimizations, but at much larger investment of developer time. By contrast, twiddling with a couple of details and switching to PyPy is the sort of thing you can prototype in an afternoon and deploy in a week or two, even with solid testing procedures. At scale that's probably worth it (using 5-10 times fewer machines is generally a significant cost savings even in this "cloud" era), but they may not be to that scale yet. Perhaps they never will be. Who knows? Not me.


We are not using PyPy at the moment at the company I work for (https://www.demonware.net), and we are handling way more requests per second, also with very small response times (less than 10ms in most cases, but it depends).

The bottleneck is usually DB work, so PyPy is not a great help there.

Bottom line: the number of requests (scalability) and response times (performance) are more a matter of architecture than of using a particular language. And Python can actually be quite fast (fast enough).


Which p value is the 10 milliseconds? I would like to hear about the p99.99 latency of this platform. More on this:

https://www.youtube.com/watch?v=9MKY4KypBzg


Hi, I'm the author of the post. Our 95% latency is just shy of 10ms, and max latency around 100ms. Our monitoring tool pre-calculates the percentiles, so I don't have 99% or 99.99%, but my guess is that they're under or around 50ms. Too much more than that and we'd be hearing from our partners about timeout rates. We haven't thoroughly profiled the difference between "most" and "all" in terms of latency sources, but I'd guess that GC pauses account for some of it, and some requests are simply much more expensive for us to process than others.


Hi, I work for one of the partners. From what I can tell, your account manager should be able to get you this info, including the 99th percentile, if you are interested. This is for the round trip from our point of view, of course.

Spot checking, you've done better than you think :)

Exciting to see a Python implementation doing this!


That's always good to hear :) Drop me a line if you want to get in touch.


Nice one. Yes, GC pauses are usually the root cause of higher p99+ latency. Anyway, the 10ms-100ms range is pretty amazing by itself. With such huge throughput it is hard to measure latency accurately; you are kind of forced to use sampling, but it can still be a good representation of reality.


The bottleneck seems to be overuse of dictionaries instead of fixed structures. Attributes on the class object might be faster, especially under PyPy.
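
For example (hypothetical field names, not the article's code), a fixed-layout class instead of a dict:

    class BidRequest(object):
        # __slots__ gives instances a fixed layout instead of a per-instance dict
        __slots__ = ("categories", "user_id")

        def __init__(self, categories, user_id):
            self.categories = categories
            self.user_id = user_id

Callers would then access request.categories instead of request["categories"].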


I wrote a follow-up in my blog (too long to post here), optimising a little more: https://wrongsideofmemphis.wordpress.com/2015/05/08/optimise...


You could get a last slight bit of juice by localising the closed-over variables, binding them as default parameters, e.g.

    def whitelist_filter(bid_request):
        return not categories.isdisjoint(bid_request["categories"])
to

    def whitelist_filter(bid_request, categories=categories):
        return not categories.isdisjoint(bid_request["categories"])


Aside from the criticisms provided by others, there is one more optimisation which is regularly used in the stdlib: using default parameters to make closure and global values local.

Locals are the fastest lookup in CPython (by a fair bit), and since default parameters create local variables and are bound once at function creation, you can use them to alias both nonlocals and globals in order to improve their lookup time significantly:

    def func1():
        a = 42
        def func2(param, a=a, bool=bool):
            # a and bool are now locals of func2, bound once at creation
            return bool(param + a)
        return func2
Of course the gain depends on the exact number of lookups performed on these localised variables and how much other work the function does, but it exists in both PyPy and CPython, and for lookup-heavy functions (such as the one above) it can be well into the 10% range.
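
A rough way to measure the difference on your own machine (illustrative only, numbers will vary):

    import timeit

    setup = (
        "def plain(param):\n"
        "    return bool(param + 42)\n"
        "def localised(param, bool=bool):\n"
        "    return bool(param + 42)"
    )
    print(timeit.timeit("plain(1)", setup=setup))      # bool is a global/builtin lookup each call
    print(timeit.timeit("localised(1)", setup=setup))  # bool is a local, bound once at def time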


So the basic gist of this blog post was: Identify bottleneck, optimize bottleneck, throw out optimization because pypy.


Alternate title, given the article's conclusion: "Don't optimize Python with Closures"


If you're using PyPy.


why not use namedtuples (or plain tuples) instead of dictionaries?
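
For illustration (field names invented, not from the article):

    from collections import namedtuple

    BidRequest = namedtuple("BidRequest", ["categories", "user_id"])

    request = BidRequest(categories={"sports"}, user_id="abc123")
    print(request.categories)  # attribute access instead of a dict key lookup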


Dicts are the fastest dynamic structure out of the box. Tuples will bite you with immutability sooner or later.


I usually think of immutability in terms of saving myself from getting bitten :-)


Touché! That's a nice thought; however, string concatenation with the += operator inside a loop is the kind of bite I was thinking about.
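
For the record, the usual way around that particular bite (a generic example, nothing to do with the article's code):

    chunks = ["chunk"] * 10000

    # Repeated += keeps building brand-new strings, since strings are immutable.
    s = ""
    for chunk in chunks:
        s += chunk

    # join builds the result in one pass instead.
    s = "".join(chunks)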





