I cannot overstate the importance of using a programming language that targets GPUs directly, like Futhark (https://github.com/diku-dk/futhark). It is a functional, declarative language where you can focus on the why, not the how. Just as with CPUs, which are incredibly complex, higher-level abstractions are very important.
If you were a pro GPU programmer and had 10 years, Futhark might be maybe 10x slower than what you could write by hand. But just as we do not write assembly when making critically fast software, most non-trivial things are easier to write in it.
Well, yes, but to be honest that code still has to be annotated with bounds, batch sizes, etc. In Futhark you need to know absolutely zero about GPUs.
How fast can Futhark be compared to a standard CUDA loop with a few arithmetic, load, and store operations? Basically, suppose you're doing simple gathering and scattering?
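For concreteness, the kind of kernel I have in mind is something like this (a minimal CUDA sketch; the names are just illustrative):

__global__ void gather(float* out, const float* in, const int* idx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];   // one indexed load, one store per thread
}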
I think it will always be slower than hand-optimized GPU code, just like assembly. But for most complex programs, I think the compiler is better than humans.
@arthas, the author, at some point made comparisons against other implementations and it was at most twice as slow, but often faster.
If you are writing some very important function, you may write it in assembly and it will be faster than, e.g., a C implementation (on the CPU). But how often do you do that? I think of CUDA as assembly for the GPU, since you have to know about batch sizes, special operations, and annotations. Futhark is like writing C or Java for the GPU (it actually compiles to CPUs as well), and it is just a much nicer experience. I think 99.9% of people will write faster code with it, because GPUs are simply so complex.
The main problem with GPUs is that most of the performance tricks that make them fast are extremely detailed.
Even 10 years ago, back in the OpenCL 1.0 days or just as CUDA was getting started, you'd need to use __shared__ memory to really benefit from GPU coding.
Over the past 10 years, GPU programming has become even more detailed: permute / bpermute, warp/wavefront-level voting (vote / ballot operations), warp/wavefront-level coordination (__activemask()), and of course most recently: BFloat16 matrix multiplication / sparse matrix multiplication.
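To give a flavour of how detailed this gets, here's a minimal sketch of a warp-level sum using the shuffle intrinsics (assumes a launch with 32 threads per block; the kernel name and layout are illustrative only):

__global__ void warpSum(const float* in, float* out) {
    float v = in[blockIdx.x * 32 + threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);  // exchange partial sums within the warp, no shared memory needed
    if (threadIdx.x == 0)
        out[blockIdx.x] = v;                            // lane 0 ends up with the warp's total
}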
---------
There's some degree of abstraction building going on inside the CUDA-world. You can use cooperative groups to abstract a lot of the wavefront-level stuff away, but not entirely yet. As such, even if you use cooperative groups, you end up needing a mental model of why these operations are efficient before you really benefit.
CUDA's CUB library abstracts a lot of operations away, removing the tedious exercise of writing your own prefix sum / scan whenever you dip down to this level... but CUB still requires the programmer to learn these tricks before it becomes useful.
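For example, here's a rough sketch of the usual CUB calling pattern for a device-wide exclusive scan (error handling omitted; the wrapper function and pointer names are mine):

#include <cub/cub.cuh>

void exclusive_sum(const int* d_in, int* d_out, int num_items) {
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    // First call: CUB only reports how much temporary storage it needs.
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_in, d_out, num_items);
    cudaMalloc(&d_temp, temp_bytes);
    // Second call: actually run the scan on the GPU.
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_in, d_out, num_items);
    cudaFree(d_temp);
}

You still have to know why you want a scan in the first place, but at least you don't write it yourself.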
---------
What we really want is a programming interface where "CPU" programmers can add custom code to GPU kernels that adds flexibility and loses very little speed... but doesn't require CPU programmers to spend lots of time learning GPU-programming tricks on their own.
And such an interface doesn't exist yet in any form. Outside of like... Tensorflow (except that only works for neural net programming, and not for general purpose programming).
-------
The speed thing is very important. C++ AMP, by Microsoft, was a pretty decent interface by 2012 standards (competitive with CUDA and OpenCL at the time), but C++ AMP was criticized as much slower than OpenCL/CUDA, so the programming community stopped using C++ AMP.
If you're writing GPU code that's slower than OpenCL / CUDA, then people will simply ask : why aren't you using OpenCL/CUDA ???
It's hard to sell "programmer convenience" and "simpler thinking process" to your boss.
Unfortunately, I don't see a "just a bit of magic here without learning much of anything new" interface coming, because it's all about strategizing the movement of data. This is not unique to GPUs; it's a universal problem across computing hardware. It's just made explicit in OpenCL/CUDA, as compared to most languages where you indirectly try to steer things the right way and the CPU does its best with whatever mess it gets.
Both C++AMP and Tensorflow abstracted away the movement of data behind relatively easy to use interfaces.
C++ AMP in particular was a generic interface built on two template<> classes: Array<> and ArrayView<>.
Array<> always existed on GPU-side (be it GPU#0 or GPU#1), while ArrayView<> was your abstraction to read/write into that Array<>.
Array<SomeClass> foo = createSomeArray();
ArrayView<SomeClass> bar = foo;
This could be CPU-side or GPU side code (foo could be on GPU#0, while bar could be GPU#1). While you can use C++ AMP code to manually manage when the copies happen, the idea is to simplify the logic down so that CPU-programmers didn't have to think as hard.
In this case, "bar" is bound to foo, and bar will remain in sync with foo through async programming behind the scenes. bar[25] = SomeClass(baz); will change bar in this context, but the Array<> foo will only update at the designated synchronization points (similar to CUDA Streams).
------
So it's a lot easier than writing cudaMemcpy() in the right spots, in the right order, in the right CUDA streams. But probably less efficient, and therefore going to be less popular than using CUDA directly.
There's also OpenMP offloading (supported in AMD's ROCm AOMP as well) and C++17 standard parallelism support (see the sketch below). Legate with cuNumeric goes in another direction, implementing a NumPy-compatible interface that is GPU-accelerated.
These also move past the limits of a pure SIMD abstraction by giving each thread its own instruction pointer, for the sake of guaranteed forward progress (which changes the programming model needed for the code to be functionally correct).
The underlying machine is still implemented as SIMT, so people should still be aware of the associated performance pitfalls. However, this allows for better compatibility and more code sharing than before.
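Here's a sketch of what the C++17 standard-parallelism route looks like; with a compiler that offloads it (e.g. nvc++ with -stdpar) the very same code can run on the GPU (the function is just illustrative):

#include <algorithm>
#include <execution>
#include <vector>

void scale(std::vector<float>& v) {
    // Plain standard C++; the execution policy lets the implementation
    // parallelise, or offload, the transform.
    std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                   [](float x) { return 2.0f * x; });
}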
Microsoft's problem with C++ AMP is that FXIL, as present in D3D11, was just too limited and had too many performance pitfalls, I'm afraid. If only it had gotten to see the light of day with D3D12 + DXIL later…
In a shared-virtual-memory situation, you'd probably have the data sitting on the CPU side, and then let SVM move it around as needed.
With a little bit more code, you can have...
Array<DataStructure> data = foo();
ArrayView<DataStructure> dataview = data;
bar(dataview);
baz(dataview);
In this case, the data exists on the GPU side (explicitly: all Array<> in C++ AMP are GPU-allocated). If foo, bar, and baz are all GPU functions, then the data never moves (no cudaMemcpy involved).
If bar is a CPU-function, you get the copy back-and-forth for correctness. But that still leaves room for you to convert bar from CPU-code into GPU-code without affecting this section of code... giving you the ability to smoothly improve performance as your project matures.
-------
CUDA / OpenCL default memory management is too granular.
Datastructure* data;
cudaMalloc(&data, count * sizeof(Datastructure));        // GPU allocation
Datastructure* localData = (Datastructure*)malloc(count * sizeof(Datastructure));
cudaMemcpy(localData, data, count * sizeof(Datastructure), cudaMemcpyDeviceToHost);
// ...or cudaMemcpyAsync with the right stream, events, and synchronization...
// ugggh... too much. Lots of steps to do anything, thinking about streams and async and signals.
-------
Finding the right compromise between performance and simplicity is key. CUDA/OpenCL have both something that's too simple (SVM / Unified Memory) and something that's too complicated (cudaMalloc + cudaMemcpy). The answer IMO lies somewhere in between.
> CUDA/OpenCL have both something that's too simple (SVM / Unified Memory)
It's sadly the only choice around for maximum code reuse… you then manually optimise the parts that need it later, guided by profiling, since developer time is quite limited.
On systems where the CPU and GPU are coherent, such as SoCs with the CPU and GPU on the same die, Infinity Fabric-linked GPUs, or NVLink-linked ones (POWER9/Grace), that option is also quite attractive.
The memory manager measures the frequency of access from the CPU and GPU to (try to) decide where a buffer should be stored, or whether it should be moved, dynamically.
However, on PCIe-linked GPUs, more reliance on heuristics is necessary to reach good performance levels.
You can also ask to have your memory allocation migrated to the GPU on demand with cudaMemPrefetchAsync (finally implemented in ROCm 4.5 too, which adds unified memory on AMD cards) before doing a kernel launch, for the data that you'll use.
The goal is to make regular cudaMalloc/hipMalloc unnecessary for good performance in a wide variety of scenarios. The explicit storage-location APIs will stay available for when more optimisation is necessary.
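A sketch of the unified-memory path described above (pointer and kernel names are illustrative, error handling omitted):

#include <cuda_runtime.h>

__global__ void step(float* data, int n);  // some kernel defined elsewhere

void run(int n) {
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));    // one allocation, visible to both CPU and GPU
    for (int i = 0; i < n; ++i) data[i] = 1.0f;     // touch it on the CPU first

    int device = 0;
    cudaGetDevice(&device);
    // Migrate the pages to the GPU before the launch instead of paying
    // for demand page faults inside the kernel.
    cudaMemPrefetchAsync(data, n * sizeof(float), device, 0);
    step<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();
    cudaFree(data);
}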
There were the '*' languages on the Connection Machine. In C*, for example, 'shapes', which were Cartesian arrays of values, were part of the type system.
Communication operations were explicit through the use of indirection on these indices. I doubt it's the case on a modern GPU, but aside from the router, performance on the CM was deterministic, so you really could understand what you were going to get from the source.
Completely agree with you: general parallel programming languages are a thing and really pretty valuable for trying to exploit all that machinery. Chapel is the only recent work I'm familiar with.
I think the thing is that the only reason to do general-purpose computing on a GPU is to get extra performance. So everyone who's using the device is going to want to know how to optimize, whereas a program where performance doesn't matter will just run on a CPU by default.
And GPUs are structured to move a whole lot of data to many processors, and I think doing that efficiently is an inherently harder problem than feeding memory to a few complex processors through a pipeline that's complex, yes, but designed to make things feel easy.
I found many small flaws with Julia for my work purposes, but then got into the habit of writing numeric inner loops in Rust and exporting them to Python with pyo3. Maybe I should look into writing some of that in Julia instead.