Documentation is lagging reality a bit; we'll probably fix that around the next llvm release. Some information is at https://libc.llvm.org/gpu/using.html
That GPU libc is mostly intended to bring things like fopen to openmp or cuda, but it turns out GPUs are totally usable as bare-metal embedded targets. You can read/write "host" memory, and given that plus a thread running on the host you can implement a syscall equivalent (e.g. https://dl.acm.org/doi/10.1145/3458744.3473357), and once you have syscall the doors are wide open. I particularly like mmap from GPU kernels.
Is there a way to directly use these developments today to write a reasonable subset of C/C++ for simpler use cases, portably (across the three major desktop platforms, at least) and without dealing with cumbersome non-portable APIs like OpenGL, OpenCL, DirectX, Metal or CUDA? By simpler use cases I mean basically doing some compute and showing the results on screen by just manipulating pixels in a buffer, like you would in a fragment/pixel shader. This doesn't require anything close to full libc functionality (let alone anything like the STL), but it would greatly improve the ergonomics for a lot of developers.
I'll describe what we've got, but fair warning that I don't know how the write-pixels-to-the-screen stuff works on GPUs. There are some instructions with weird names that I assume make sense in that context. Presumably one allocates memory and writes to it in some fashion.
LLVM libc is picking up capability over time, implemented similarly to the non-GPU architectures. The same tests run on x64 or the GPU, printing to stdout as they go. Hopefully standing up libc++ on top will work smoothly. It's encouraging that I sometimes struggle to remember whether a given test is currently running on the host or the GPU.
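To give a flavour of what such a test looks like, here's an illustrative smoke test (my own, not one of the actual LLVM libc test cases) where nothing in the code cares which target it was compiled for:

```cpp
#include <cstdio>
#include <cstring>

// A libc smoke test in the spirit described above: it exercises snprintf and
// reports on stdout, and compiles identically for x64 or, in principle, a
// GPU target with LLVM libc underneath.
int format_check() {
    char buf[32];
    int n = snprintf(buf, sizeof buf, "%d + %d = %d", 2, 2, 4);
    if (n != 9 || strcmp(buf, "2 + 2 = 4") != 0) {
        puts("FAIL: snprintf");
        return 1;
    }
    puts("PASS: snprintf");
    return 0;
}
```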
The data structure that libc uses to have x64 call a function on amdgpu, or to have amdgpu call a function on x64, is mostly a blob of shared memory plus careful atomic operations. It was originally general purpose and lived in a prototype-ish GitHub repo; it's currently specialised to libc. It should end up in the under-debate llvm/offload project, which will make it easily reusable again.
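A minimal sketch of that shared-memory handshake, with a std::thread standing in for the GPU side. The names, layout and three-state protocol here are mine, not the libc implementation's:

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// One mailbox in shared memory, owned alternately by client and server:
//   IDLE     -> client fills opcode/arg, publishes REQUEST
//   REQUEST  -> server services it, publishes RESPONSE
//   RESPONSE -> client reads result, returns the mailbox to IDLE
struct Mailbox {
    std::atomic<uint32_t> state{0};  // 0=IDLE, 1=REQUEST, 2=RESPONSE
    uint64_t opcode = 0;             // which "syscall" is wanted
    uint64_t arg = 0;
    uint64_t result = 0;
};

// Client side (the "GPU"): post a request, spin until serviced.
// The release store publishes opcode/arg; the acquire load makes the
// server's write to result visible before we read it.
uint64_t client_call(Mailbox& m, uint64_t op, uint64_t arg) {
    m.opcode = op;
    m.arg = arg;
    m.state.store(1, std::memory_order_release);
    while (m.state.load(std::memory_order_acquire) != 2) {}
    uint64_t r = m.result;
    m.state.store(0, std::memory_order_release);  // mailbox back to IDLE
    return r;
}

// Server side (host thread): spin for one request, service it, publish
// the result. Opcode 1 is a stand-in "syscall" that just adds one; a real
// server would dispatch to open/mmap/etc. here.
void server_step(Mailbox& m) {
    while (m.state.load(std::memory_order_acquire) != 1) {}
    m.result = (m.opcode == 1) ? m.arg + 1 : 0;
    m.state.store(2, std::memory_order_release);
}
```

The real thing has many such mailboxes (so many GPU lanes can have calls in flight) and careful attention to which memory is visible to both devices, but the acquire/release ownership dance is the core of it.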
This isn't quite decoupled from vendor stuff: the GPU driver needs to be running in the kernel somewhere. On nvptx we make a couple of calls into libcuda to launch main(); on amdgpu it's a couple of calls into libhsa. I did have an opencl loader implementation as well, but that has probably rotted; intel seems to be on that stack but isn't in llvm upstream.
A few GPU projects have noticed that implementing a cuda layer and a spirv layer and a hsa or hip layer and whatever others is quite annoying. Possibly all GPU projects have noticed that. We may get an llvm/offload library that successfully abstracts over those which would let people allocate memory, launch kernels, use arbitrary libc stuff and so forth running against that library.
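Purely to illustrate the shape such an abstraction might take (every name below is invented, and the eventual llvm/offload API will certainly differ):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// An invented sketch of a vendor-neutral offload interface: allocate memory,
// move data, launch kernels, without naming cuda/hsa/spirv anywhere.
struct Device {
    virtual ~Device() = default;
    virtual void* alloc(size_t bytes) = 0;
    virtual void release(void* p) = 0;
    virtual void memcpy_to(void* dst, const void* src, size_t n) = 0;
    virtual void launch(void (*kernel)(void*), void* args, uint32_t threads) = 0;
};

// A trivial backend that "launches" kernels on the host, useful for testing
// the interface shape; cuda and hsa backends would implement the same
// contract by calling into libcuda or libhsa.
struct HostDevice final : Device {
    void* alloc(size_t bytes) override { return std::malloc(bytes); }
    void release(void* p) override { std::free(p); }
    void memcpy_to(void* dst, const void* src, size_t n) override {
        std::memcpy(dst, src, n);
    }
    void launch(void (*kernel)(void*), void* args, uint32_t threads) override {
        for (uint32_t t = 0; t < threads; ++t) kernel(args);  // serial stand-in
    }
};
```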
That's all from the compute perspective. It's possible I should look up what sending numbers over HDMI actually involves. I believe the GPU is happy interleaving compute and graphics kernels, and I suspect they're very similar things in the implementation.
I’m cautiously optimistic for SYCL. The absurd level of abstraction is a bit alarming, but single source performance portability would be a godsend for library authors.
This is one area where I imagine C++ wannabe replacements like Rust having a very hard time taking over.
It took almost 20 years to move from GPU assembly (DX 9 timeframe) through shading languages to regular C, C++, Fortran and Python JITs.
There are some efforts with Java, .NET, Julia, Haskell, Chapel and Futhark, but they still trail the big four.
Currently, in terms of ecosystem, tooling and libraries, as far as I am aware, Rust trails those, and it isn't yet a presence at HPC/graphics conferences (Eurographics, SIGGRAPH).
This is one area where I imagine C++ wannabe replacements like Rust having a very hard time taking over.
I 100% agree. Although I have a keen interest in Rust, I can’t see it offering any unique value to the GPGPU or HPC space. Meanwhile C++ is gaining all sorts of support for HPC: the parallel STL algorithms, mdspan, std::simd, std::blas, executors (eventually), etc. Not to mention all of the development work happening outside of the ISO standard, e.g. CUDA/ROCm(HIP)/OpenACC/OpenCL/OpenMP/SYCL/Kokkos/RAJA and who knows what else.
C++ is going to be sitting tight in compute for a long time to come.
HPC researchers already employ techniques to detect memory corruption, hardware flaws, floating point errors, and so on. Maybe Rust could meaningfully reduce memory errors, but if that comes at the cost of bounds checking (or any other meaningful runtime overhead) they will have absolutely zero interest.
If you’re willing to deal with 5 layers of C++ TMP, then a library like Kokkos will let you abstract over those APIs, or at least some of them. Eventually, if or when SYCL is upstreamed into the llvm-project, it’ll be possible to do it with clang directly.