It's somewhat domain specific. Pure Python libraries have an easier time supporting new releases than libraries that rely on C APIs, and slower still are those that depend on less stable implementation details like bytecode (e.g. Numba). But it's definitely getting better and faster with every release.
There's overhead in transferring data from the CPU to the GPU and back. I'm not sure how this works with integrated GPUs, though, insofar as RAM is shared.
In general, though, as I understand it (not a GPU programmer), you want to pass data to the GPU, have it do a lot of operations, and only then pass it back. Doing one tiny operation at a time isn't worth it.
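Roughly this pattern, as a sketch assuming CuPy and a discrete GPU (the sizes and ops here are made up, and integrated/shared-memory GPUs may behave differently):

```python
import numpy as np
import cupy as cp

data = np.random.rand(10_000_000).astype(np.float32)

# Anti-pattern: round-trip to the GPU for every tiny operation.
out = data
for _ in range(3):
    out = cp.asnumpy(cp.asarray(out) * 1.01)  # host->device, one multiply, device->host

# Better: move the data once, do all the work on the device, copy back once.
gpu = cp.asarray(data)    # host -> device, once
for _ in range(3):
    gpu = gpu * 1.01      # stays on the GPU
out = cp.asnumpy(gpu)     # device -> host, once
```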
NumPy's default is C (row-major) order, so the last axis is contiguous in memory and you want to iterate over the earlier dimensions first.
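A quick way to see the default layout:

```python
# NumPy defaults to C (row-major) order: the last axis is contiguous,
# so the earlier axes belong in the outer loops.
import numpy as np

a = np.zeros((3, 4), dtype=np.uint8)
print(a.flags["C_CONTIGUOUS"])  # True by default
print(a.strides)                # (4, 1): axis 0 steps 4 bytes, axis 1 steps 1 byte
```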
The slow code is likely at least partially slow due to branch misprediction (this is specific to my CPU; it's not true on CPUs with AVX-512). See https://pythonspeed.com/articles/speeding-up-numba/ where I use `perf stat` to get branch misprediction numbers on similar code.
With SIMD disabled there's also a clear difference in IPC, I believe.
The bigger picture, though, is that the goal of this article is not to demonstrate speeding up code; it's to ask what level of parallelism to use given unchanging code. Obviously, all things being equal, you'll do better if you can make your code faster, but code does get deployed, and when it's deployed you need to choose parallelism levels, regardless of how good the code is.
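For concreteness, here's a rough sketch of how you might pick that level empirically at deployment time; `work` and `items` are placeholders, not the article's actual code:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def best_thread_count(work, items, candidates=(1, 2, 4, 8, 12, 16, 20, 24)):
    """Time one batch per candidate thread count and return the fastest."""
    timings = {}
    for n in candidates:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(work, items))   # run the whole batch at this thread count
        timings[n] = time.perf_counter() - start
    return min(timings, key=timings.get), timings
```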
> when it's deployed you need to choose parallelism levels, regardless of how good the code is.
Yes, absolutely, exactly. That's why it can be really helpful to pinpoint the cause of the slowdown, right? It might not matter at deployment time if you have an automated shmoo that calculates the optimal thread load, but knowing the actual causes might be critical to the process, and/or really help if you don't do it automatically. (For one, it's possible the conditions for optimal thread load could change over the course of a run.)
That's not it. I updated the article with an experiment of processing 5 items at a time. The fast function doing 5 images at a time is slower than the slow function doing 1 image at a time (24*5 > 90).
If your theory was correct, we would expect the optimal number of threads for the fast function processing 5 images at a time to be similar to that of the slow function processing 1 image at a time.
In fact, the optimal number of threads in this case (5 images at a time) was 20 for the slow function and 10 for the fast function, so essentially the same as in the original setup.
Based on the graph, the fast function's runtime is really short. You might just be seeing the effects of efficiency vs. performance cores. A lower thread count makes most of the work run on performance cores, and task end times align more nicely. With a larger number of threads, the tasks running on performance cores complete first and you are left waiting for the tasks on efficiency cores to finish, or context switches have to happen and tasks get moved between cores, which causes overhead.
You could try seeing what happens if you have 10 times more images when running the fast function.
Also, you have just 8 physical performance cores and 4 physical efficiency cores. The performance cores have hyper-threading, so they act as 2 logical cores each, but that doesn't mean they can actually execute 2 threads at maximum performance. If the processing tasks use the same parts of the processor core, the processor cannot run both threads at the same time and IPC will suffer. The slow task maybe uses more varied parts of the core, which allows better IPC with hyper-threading. So that may also reduce the optimal thread count.
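If you want to check the split on a given machine, something like this works (assuming the third-party psutil package is installed):

```python
import os
import psutil

print("logical cores: ", os.cpu_count())                   # e.g. 20 = 8P x 2 (HT) + 4E
print("physical cores:", psutil.cpu_count(logical=False))  # e.g. 12 = 8P + 4E
```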
Is it a caching thing? The slow version seems less cache efficient, so if it is waiting due to cache misses, that could create an opportunity for something else to get scheduled in.
I doubt it; the slow version uses division instead of bit shifting. My guess would be that the fast version saturated something like I/O or some non-CPU portion of the processor, and the division one was bottlenecked by the division logic in the processor.
Recalibrate how you feel about division and multiplication. It turns out integer division on new processors is a 1-cycle operation (and has been for a while now). Most of the multi-cycle instructions nowadays are things like SIMD and encryption.
Which processor has 1-cycle latency for integer division? Even Apple Silicon, which has the most highly optimized implementation I am aware of, appears to have 2-cycle latency. Recent x86 cores are much worse, though greatly improved. Integer division is much faster than it used to be, but not single-cycle.
Also, most of those SIMD and encryption instructions can retire one per cycle on modern cores, but that isn't the same as latency.
2-cycle latency for division seems extremely unlikely. From the instruction tables it seems that Firestorm has 7-9 cycles of latency for SDIV (which is excellent). It also has an impressive 2-cycle reciprocal throughput.
Quite possible. I am not confident Apple Silicon has 2-cycle div latency, that seems improbably fast to me, but I had heard some reasonably well-sourced rumors to that effect. I have not measured it myself.
Even at somewhat higher latency it is still fast enough to not be worth optimizing around in most cases, which is great.
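If anyone wants to poke at the division-vs-shift question from Python, a crude check could look like this; note it measures throughput of the whole vectorized call, not per-instruction latency, so it's only a rough indicator:

```python
import timeit
import numpy as np

arr = np.random.randint(0, 2**16, size=10_000_000, dtype=np.uint32)

div_time = timeit.timeit(lambda: arr // 16, number=50)    # integer division
shift_time = timeit.timeit(lambda: arr >> 4, number=50)   # equivalent right shift
print(f"division: {div_time:.3f}s  shift: {shift_time:.3f}s")
```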
But it's iterating through the result vectors twice, so that's basically guaranteed to miss. Moving the threshold check into the loop above would at least eliminate that factor.
Maybe division vs bit shifting does play a factor, but it's hard to compare that while the cache behavior is so different.
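Roughly what I mean, as a hypothetical sketch (the halving and the threshold constant are stand-ins, not the article's actual code):

```python
import numpy as np
from numba import njit

THRESHOLD = 128

@njit
def two_pass(values):
    result = np.empty_like(values)
    for i in range(values.size):      # first pass: compute
        result[i] = values[i] // 2
    count = 0
    for i in range(result.size):      # second pass re-reads result from cache/RAM
        if result[i] > THRESHOLD:
            count += 1
    return count

@njit
def fused(values):
    count = 0
    for i in range(values.size):      # single pass: test the value while it's still hot
        if values[i] // 2 > THRESHOLD:
            count += 1
    return count
```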
Author here. The original article I was going to write was about using newer instruction sets, but then I discovered it doesn't even use the original SSE instructions by default, so I wrote this instead.
Eventually I'll write that other article; I've been wondering if it's possible to have infrastructure to support both modern and old CPUs in Python libraries without doing runtime dispatch on the C level, so this may involve some coding if I have time.
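As a hypothetical sketch of what I mean (assuming the third-party py-cpuinfo package; the two implementations are just stand-ins for builds targeting different instruction sets):

```python
import cpuinfo  # third-party "py-cpuinfo" package

_FLAGS = set(cpuinfo.get_cpu_info().get("flags", []))

def _baseline(data):
    # stand-in for an extension built for the oldest supported CPUs
    return data * 2

def _avx2_build(data):
    # stand-in for the same extension compiled with AVX2 enabled
    return data * 2

# pick the implementation once, at import time, at the Python level
process = _avx2_build if "avx2" in _FLAGS else _baseline
```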
There's presumably a reason they've spent the past 20 years adding additional instructions to CPUs, yeah :) And a large part of the Python ecosystem just ignores all of them. (NumPy has a bunch of SIMD with function-level dispatch, and they add more over time.)
Author here: Note that this hasn't yet been updated for Cython 3, which does fix or improve some of these (but not the fundamental limitation that you're stuck with C or C++).
Nor can you prove that a language is error-prone by providing a 40-line example written in an antiquated style that deliberately avoids using the safety features at one's disposal.