
It's somewhat domain-specific. Pure Python libraries have an easier time supporting new releases than libraries that rely on C APIs, and slower still are those that deal with less stable implementation details like bytecode (e.g. Numba). But it's definitely getting better and faster with every release.


It's much faster! There have been significant performance improvements since 3.8.


There's overhead in transferring data from the CPU to the GPU and back. I'm not sure how this works with integrated GPUs, though, insofar as RAM is shared.

In general, though, as I understand it (not a GPU programmer) you want to pass data to the GPU, have it do a lot of operations, and only then pass it back. Doing one tiny operation isn't worth it.
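
As a rough sketch of what that batching looks like in practice (assuming CuPy and a CUDA-capable GPU; none of this is from the original discussion):

    import numpy as np
    import cupy as cp  # assumes CuPy and a CUDA-capable GPU

    data = np.random.rand(10_000_000).astype(np.float32)

    # Wasteful: one tiny operation per transfer; the host<->device copies dominate.
    doubled = cp.asnumpy(cp.asarray(data) * 2)

    # Better: move the data once, chain many operations on the GPU,
    # and only copy the final result back.
    gpu = cp.asarray(data)          # host -> device
    gpu = cp.sqrt(gpu * 2 + 1)      # runs on the GPU
    gpu = cp.log1p(gpu)             # still on the GPU, no transfer
    result = cp.asnumpy(gpu)        # device -> host, once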


You can use Numba to speed up some Pandas calculations: https://pandas.pydata.org/docs/user_guide/enhancingperf.html...
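
For example, a minimal sketch of one of the options on that page (assuming a recent pandas with numba installed):

    import numpy as np
    import pandas as pd  # assumes a recent pandas with numba installed

    s = pd.Series(np.random.rand(1_000_000))

    # Default Cython-backed rolling mean
    baseline = s.rolling(1000).mean()

    # Same computation JIT-compiled with Numba
    # (the first call pays a one-time compilation cost)
    jitted = s.rolling(1000).mean(engine="numba")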


NumPy's default is that you iterate over the earlier dimensions first.
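
A quick illustration:

    import numpy as np

    a = np.arange(6).reshape(2, 3)  # default C (row-major) order

    # Iterating a 2-D array yields the first axis (rows) one at a time;
    # with the default layout each row is contiguous in memory.
    for row in a:
        print(row)      # [0 1 2], then [3 4 5]

    print(a.strides)    # (24, 8) for int64: the last axis is the cheap one to step along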

The slow code is likely at least partially slow due to branch misprediction (this is specific to my CPU, not true on CPUs with AVX-512), see https://pythonspeed.com/articles/speeding-up-numba/ where I use `perf stat` to get branch misprediction numbers on similar code.
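
This isn't the article's code, but it's the general shape being compared: a data-dependent branch in the hot loop versus a branchless formulation, which is the difference `perf stat -e branches,branch-misses` makes visible.

    from numba import njit
    import numpy as np

    @njit
    def clamp_branchy(values, threshold):
        # Data-dependent branch inside the hot loop: on random input
        # the branch predictor is wrong roughly half the time.
        out = np.empty_like(values)
        for i in range(values.shape[0]):
            if values[i] > threshold:
                out[i] = threshold
            else:
                out[i] = values[i]
        return out

    @njit
    def clamp_branchless(values, threshold):
        # Same result via min(), which the compiler can usually turn
        # into branch-free (and often SIMD-friendly) code.
        out = np.empty_like(values)
        for i in range(values.shape[0]):
            out[i] = min(values[i], threshold)
        return out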

With SIMD disabled there's also a clear difference in IPC, I believe.

The bigger picture, though, is that the goal of this article is not to demonstrate speeding up code; it's to ask what level of parallelism to choose given unchanging code. Obviously, all else being equal, you'll do better if you can make your code faster, but code does get deployed, and when it's deployed you need to choose parallelism levels regardless of how good the code is.


> when it's deployed you need to choose parallelism levels, regardless of how good the code is.

Yes, absolutely, exactly. That’s why it can be really helpful to pinpoint the cause of the slowdown, right? It might not matter at deployment time if you have an automated shmoo that calculates the optimal thread load, but knowing the actual causes might be critical to the process, and/or really help if you don’t do it automatically. (For one, it’s possible the conditions for the optimal thread load could change over the course of a run.)
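
Roughly what I mean by a shmoo, as a sketch; work() is just a stand-in for a real per-item task that releases the GIL (NumPy, Numba, etc.), since pure-Python work won't scale with threads anyway:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def work(item):
        # Placeholder for the real per-item processing (which should
        # release the GIL, e.g. NumPy/Numba code).
        return sum(i * i for i in range(10_000))

    def best_thread_count(items, candidates=(1, 2, 4, 8, 16, 32)):
        timings = {}
        for n in candidates:
            start = time.perf_counter()
            with ThreadPoolExecutor(max_workers=n) as pool:
                list(pool.map(work, items))
            timings[n] = time.perf_counter() - start
        return min(timings, key=timings.get), timings

    best, timings = best_thread_count(list(range(256)))
    print(best, timings)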


Neat! Unfortunately, at the moment it still ignores cgroups; it's just a wrapper around sched_getaffinity().

https://github.com/python/cpython/blob/6a69b80d1b1f3987fcec3...
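
A rough sketch of the difference (assuming cgroup v2; the path below is the usual mount point, not something the stdlib exposes):

    import os

    # What sched_getaffinity() reflects: the CPUs this process may run on.
    affinity_cpus = len(os.sched_getaffinity(0))

    # What it ignores: the cgroup v2 CPU quota (e.g. set by Docker/Kubernetes).
    def cgroup_cpu_limit(path="/sys/fs/cgroup/cpu.max"):
        try:
            quota, period = open(path).read().split()
        except (OSError, ValueError):
            return None
        if quota == "max":
            return None
        return int(quota) / int(period)  # e.g. "200000 100000" -> 2.0 CPUs

    print(affinity_cpus, cgroup_cpu_limit())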


True. I'm actually wondering if they might change the implementation for this, since the function name is pretty agnostic.


That's not it. I updated the article with an experiment that processes 5 items at a time. The fast function doing 5 images at a time is slower than the slow function doing 1 image at a time (24*5 > 90).

If your theory were correct, we would expect the optimal number of threads for the fast function processing 5 images at a time to be similar to that of the slow function processing 1 image at a time.

In fact, the optimal number of threads in this case (5 images at a time) was 20 for the slow function and 10 for the fast function, so essentially the same as the original setup.


Based on the graph, the fast function's runtime is really short. You might just be seeing the effects of efficiency vs. performance cores. A lower thread count makes most of the work run on performance cores, and task end times align more nicely. With a larger number of threads, the tasks on the performance cores complete first and you are left waiting for the tasks on the efficiency cores to finish, or context switches have to happen and tasks get moved between cores, which causes overhead.

You could try seeing what happens if you have 10 times more images when running the fast function.

Also, you have just 8 physical performance cores and 4 physical efficiency cores. The performance cores have hyper-threading, so each acts as 2 logical cores, but that doesn't mean they can actually execute 2 threads at maximum performance. If the processing tasks use the same parts of the core, the processor cannot run both threads at the same time and IPC will suffer. The slow task may use more varied parts of the core, which allows better IPC with hyper-threading. So that may also reduce the optimal thread count.


Hyper threading is SMT? I can never keep the branding names straight.


Yes, hyper-threading is Intel's branding for SMT.


Is it a caching thing? The slow version seems less cache efficient, so if it is waiting due to cache misses, that could create an opportunity for something else to get scheduled in.


I doubt it; the slow version uses division instead of bit shifting. My guess would be that the fast version saturated something like I/O or some non-CPU portion of the processor, and the division one was bottlenecked by the division logic in the processor.
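
The arithmetic part of that guess is easy to spot-check in isolation, e.g. with NumPy (this only measures division vs. shift, not the cache effects discussed elsewhere in the thread, and at this array size memory bandwidth may mask much of the gap):

    import timeit
    import numpy as np

    a = np.random.randint(0, 2**30, size=10_000_000, dtype=np.uint32)

    div = timeit.timeit(lambda: a // 256, number=50)
    shift = timeit.timeit(lambda: a >> 8, number=50)
    print(f"divide: {div:.3f}s  shift: {shift:.3f}s")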


Recalibrate how you feel about division and multiplication. It turns out integer division on new processors is a one-cycle operation (and has been for a while now). Most of the multi-cycle instructions nowadays are things like SIMD and encryption.


Which processor has 1-cycle latency for integer division? Even Apple Silicon, which has the most highly optimized implementation I am aware of, appears to have 2-cycle latency. Recent x86 cores are much worse, though greatly improved. Integer division is much faster than it used to be, but it's not single-cycle.

Also, most of those SIMD and encryption instructions can retire one per cycle on modern cores, but that isn't the same as latency.


You are correct; I mistakenly thought it was 1/cycle because I had previously remembered IMUL taking 30 cycles (and now it has a 1-cycle throughput).

Agner Fog reports anywhere from 4 to 30 cycles on semi-recent architectures.


2-cycle latency for division seems extremely unlikely. From the instruction tables, it seems that Firestorm has 7-9 cycle latency for SDIV (which is excellent). It also has an impressive 2-cycle reciprocal throughput.


Quite possible. I am not confident Apple Silicon has 2-cycle div latency, that seems improbably fast to me, but I had heard some reasonably well-sourced rumors to that effect. I have not measured it myself.

Even at somewhat higher latency it is still fast enough to not be worth optimizing around in most cases, which is great.


But it's iterating through the result vectors twice, so that's basically guaranteed to miss. Moving the threshold check into the loop above would at least eliminate that factor.

Maybe division vs bit shifting does play a factor, but it's hard to compare that while the cache behavior is so different.
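
A hypothetical sketch of the two shapes being compared (not the article's actual code):

    from numba import njit
    import numpy as np

    @njit
    def two_pass(pixels, threshold):
        # The result array is walked twice; by the second pass the
        # early entries may already have fallen out of cache.
        result = np.empty_like(pixels)
        for i in range(pixels.shape[0]):
            result[i] = pixels[i] // 2
        for i in range(result.shape[0]):
            if result[i] > threshold:
                result[i] = threshold
        return result

    @njit
    def one_pass(pixels, threshold):
        # The threshold is applied while the value is still in a register.
        result = np.empty_like(pixels)
        for i in range(pixels.shape[0]):
            value = pixels[i] // 2
            result[i] = value if value <= threshold else threshold
        return result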


Visiting NYC a few weeks ago the bike-based restaurant delivery people were interesting to see, not a thing where I live.

Relevant to "active time", they seemed to spend a lot of time during the day just waiting around...


Author here. The original article I was going to write was about using newer instruction sets, but then I discovered it doesn't even use the original SSE instructions by default, so I wrote this instead.

Eventually I'll write that other article; I've been wondering whether it's possible to have infrastructure that supports both modern and old CPUs in Python libraries without doing runtime dispatch at the C level, so this may involve some coding if I have time.


Yeah. And I don't mean this in a "no true Scotsman" way. I really have trouble coaxing out any kind of instruction-level parallelism without those.


There's presumably a reason they've spent the past 20 years adding additional instructions to CPUs, yeah :) And a large part of the Python ecosystem just ignores all of them. (NumPy has a bunch of SIMD with function-level dispatch, and they add more over time.)
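
(You can see which SIMD extensions a given NumPy build detected on the current machine with np.show_runtime(), available since NumPy 1.24.)

    import numpy as np

    # Prints build/runtime info, including which SIMD extensions
    # were found and not found on this CPU (NumPy >= 1.24).
    np.show_runtime()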


Author here: Note that this hasn't yet been updated for Cython 3, which does fix or improve some of these (but not the fundamental limitation that you're stuck with C or C++).


Pardon me, but your implementation is a strawman. Pick on this (which doesn't require Cython 3):

    from libcpp.vector cimport vector
    from libcpp.pair cimport pair

    cdef class PointVec:
        cdef vector[pair[float, float]] vec

        def __init__(self, points: list[tuple[float, float]]):
            self.vec = points

        def __repr__(self):
            result = ", ".join(f"({x}, {y})" for x, y in self.vec)
            return f"PointVec({result})"

        def __setitem__(
            self, index, point: tuple[float, float]
        ):
            cdef pair[float, float] *p = &self.vec.at(index)
            p.first = point[0]
            p.second = point[1]

        def __getitem__(self, index):
            return self.vec.at(index)


You can't disprove that a language is error-prone by providing a 20-line example that happens to be correct.


Nor can you prove that a language is error-prone by providing a 40-line example written in an antiquated style that deliberately avoids the safety features at one's disposal.


Yes you can, if you have to be very experienced or unreasonably thorough to know that all those safety features even exist.

Opt-in safety is clearly worse than safety that's on by default.


Okay, get specific. What in my implementation requires extensive experience, or where was I unreasonably thorough?


This assertion is representative of the myopic, one-dimensional thinking of Rustaceans. It only makes sense if the only thing you care about is safety.


C++ isn't what I would call a fundamental limitation.

