Compute shaders, which can draw points faster than the native rendering pipeline. Although I have to admit that WebGPU implements things so poorly and restrictively that this benefit ends up being fairly small. Storage buffers, which come along with compute shaders, are still fantastic from a dev-convenience point of view, since they allow implementing vertex pulling, which is much nicer to work with than vertex buffers.
For gaussian splatting, WebGPU is great since it allows implementing sorting via compute shaders. WebGL-based implementations sort on the CPU, which means the "correct" front-to-back blending order lags a few frames behind.
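As an illustration, here's a minimal sketch of such a GPU-side depth sort, written with CUDA/Thrust to keep it short (WebGPU implementations express the same idea as radix- or bitonic-sort compute passes; all names here are hypothetical):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <cstdint>

    // Re-sort splat indices by view-space depth entirely on the GPU each
    // frame, so the blend order never lags behind the camera.
    void sortSplatsByDepth(thrust::device_vector<float>& depths,
                           thrust::device_vector<uint32_t>& indices) {
        thrust::sort_by_key(depths.begin(), depths.end(), indices.begin());
    }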
But yeah, when you put it like that, it would have been much better if they had simply added compute shaders to WebGL, because other than that there really is no point in WebGPU.
Access to slightly more recent GPU features (e.g. WebGL2 is stuck on a feature set that was mainstream ca. 2008, while WebGPU is on a feature set that was mainstream ca. 2015).
GL programming only feels 'natural' if you've been following GL development closely since the late 1990s and have learned to accept all the design compromises made for the sake of backward compatibility. If you come from other 3D APIs and never touched GL before, it's one "WTF were they thinking" after another (just look at VAOs as an example of a really poorly designed GL feature).
While I would have designed a few things differently in WebGPU (especially around the binding model), it's still a much better API than WebGL2 from every angle.
The limited feature set of WebGPU is mostly to blame on Vulkan 1.0 drivers on Android devices, I guess, but unfortunately there's no realistic way to design a web 3D API while ignoring shitty Android phones.
It's not about feeling natural - I fully agree that OpenGL is a terrible and outdated API. It's about the completely overengineered and pointless complexity in Vulkan-like APIs and WebGPU. Render passes are entirely pointless complexity that should not exist. They're even optional in Vulkan nowadays, but still mandatory in WebGPU. Similarly, static bind groups are entirely pointless - now I've got to cache thousands of bind groups for my vertex and storage buffers. In Vulkan you can nowadays modify them, but not in WebGPU. I wish I could batch those buffers into a single one so I don't need to create thousands of bind groups, but that's also made needlessly cumbersome in WebGPU due to the requirement to use staging buffers. And since buffer sizes are fairly limited, I can't just create one that fits all, so I have to create multiple buffers anyway - might as well have a separate buffer per node. Virtual/sparse buffers would help single-buffer designs by growing as much as needed, but of course they don't exist in WebGPU either.
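To illustrate the bind group churn, here's a rough sketch of what this looks like through the webgpu.h C API (assuming one storage buffer per node and a fixed 'layout'; hypothetical names, not my actual code):

    // One immutable bind group per node buffer; with thousands of nodes
    // this means thousands of cached WGPUBindGroup objects.
    WGPUBindGroup bindGroups[NUM_NODES];
    for (int i = 0; i < NUM_NODES; ++i) {
        WGPUBindGroupEntry entry = {
            .binding = 0,
            .buffer  = nodeBuffers[i],
            .offset  = 0,
            .size    = nodeBufferSizes[i],
        };
        WGPUBindGroupDescriptor desc = {
            .layout     = layout,
            .entryCount = 1,
            .entries    = &entry,
        };
        bindGroups[i] = wgpuDeviceCreateBindGroup(device, &desc);
    }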
The one thing that WebGPU is doing better is that it does implicit syncing by default. The problem is that it provides no option for explicit syncing.
I mainly software-rasterize everything in CUDA nowadays, which makes the complexity of graphics APIs appear insane. CUDA allows you to get things done simply and easily, but it still has all the functionality to make things fast and powerful. The important part is that the latter is optional, so you can get things done quickly and still make them fast.
In CUDA, allocating a buffer and filling it with data is a simple cuMemAlloc and cuMemcpy. When calling a shader/kernel, I don't need bindings and descriptors, I simply pass a pointer to the data. Why would I need those anyway? The shader/kernel knows all about the data; the host doesn't need to know.
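A minimal sketch of that workflow (illustrative names; compile with nvcc and link against -lcuda for the driver-API calls):

    #include <cuda.h>          // driver API: cuMemAlloc, cuMemcpyHtoD, ...
    #include <cuda_runtime.h>
    #include <cstdint>

    // The kernel receives a plain pointer - no bindings, descriptors or
    // bind groups anywhere.
    __global__ void scale(float* data, int n, float f) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= f;
    }

    int main() {
        const int n = 1 << 20;
        float* host = new float[n];
        for (int i = 0; i < n; ++i) host[i] = float(i);

        cuInit(0);             // driver-API init
        cudaFree(0);           // make the runtime's primary context current

        // Allocating a buffer and filling it with data: two calls.
        CUdeviceptr dptr;
        cuMemAlloc(&dptr, n * sizeof(float));
        cuMemcpyHtoD(dptr, host, n * sizeof(float));

        // 'Binding' is just passing the pointer as a kernel argument.
        scale<<<(n + 255) / 256, 256>>>((float*)(uintptr_t)dptr, n, 2.0f);
        cudaDeviceSynchronize();

        cuMemcpyDtoH(host, dptr, n * sizeof(float));
        cuMemFree(dptr);
        delete[] host;
    }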
> Render passes are entirely pointless complexity that should not exist. They're even optional in Vulkan nowadays.
AFAIK Vulkan only eliminated pre-baked render pass objects (which were indeed pointless) and simply copied Metal's design of transient render passes, i.e. there are still 'render pass boundaries' between vkCmdBeginRendering() and vkCmdEndRendering(), and the VkRenderingInfo struct that's passed into the vkCmdBeginRendering() function (https://registry.khronos.org/vulkan/specs/latest/man/html/Vk...) is equivalent to Metal's MTLRenderPassDescriptor (https://developer.apple.com/documentation/metal/mtlrenderpas...).
E.g. even modern Vulkan still has render passes, they just didn't want to call those new functions 'Begin/EndRenderPass' for some reason ;) AFAIK the idea of render pass boundaries is quite essential for tiler GPUs.
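For example, a rough sketch of such a 'transient' pass boundary with Vulkan 1.3 dynamic rendering (cmdBuf, colorView and extent assumed to exist):

    VkRenderingAttachmentInfo colorAttachment = {
        .sType       = VK_STRUCTURE_TYPE_RENDERING_ATTACHMENT_INFO,
        .imageView   = colorView,
        .imageLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL,
        .loadOp      = VK_ATTACHMENT_LOAD_OP_CLEAR,
        .storeOp     = VK_ATTACHMENT_STORE_OP_STORE,
        .clearValue  = {.color = {.float32 = {0.0f, 0.0f, 0.0f, 1.0f}}},
    };
    VkRenderingInfo renderingInfo = {
        .sType                = VK_STRUCTURE_TYPE_RENDERING_INFO,
        .renderArea           = {.offset = {0, 0}, .extent = extent},
        .layerCount           = 1,
        .colorAttachmentCount = 1,
        .pColorAttachments    = &colorAttachment,
    };
    vkCmdBeginRendering(cmdBuf, &renderingInfo);   // pass boundary begins
    // ... draw calls ...
    vkCmdEndRendering(cmdBuf);                     // pass boundary ends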
WebGPU tries to copy Metal's render pass approach as closely as possible (e.g. it doesn't have pre-baked pass objects like Vulkan 1.0).
> The one thing that WebGPU is doing better is that it does implicit syncing by default.
AFAIK also mostly thanks to the 'transient render pass model'.
> Why would I need those anyway? The shader/kernel knows all about the data; the host doesn't need to know.
Because old GPUs are a thing, and those usually don't have a flexible enough hardware design to make rasterizing (or even vertex pulling) in compute shaders performant enough to compete with the traditional render pipeline.
> Similarly, static bind groups are entirely pointless
I agree, but AFAIK Vulkan 1.0's descriptor model is mostly to blame for the inflexible BindGroups design.
> but that's also made needlessly cumbersome in WebGPU due to the requirement to use staging buffers
Most modern 3D APIs also switched to staging buffers though, and I guess there's not much choice if you don't have unified memory.
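For reference, a rough sketch of that upload path through the webgpu.h C API (device, queue, storageBuf, data and size assumed to exist; field names may differ slightly between Dawn and wgpu-native):

    // Map a CPU-visible staging buffer, copy into it, then record a
    // GPU-side copy into the actual storage buffer.
    WGPUBufferDescriptor stagingDesc = {
        .usage            = WGPUBufferUsage_MapWrite | WGPUBufferUsage_CopySrc,
        .size             = size,
        .mappedAtCreation = true,
    };
    WGPUBuffer staging = wgpuDeviceCreateBuffer(device, &stagingDesc);
    memcpy(wgpuBufferGetMappedRange(staging, 0, size), data, size);
    wgpuBufferUnmap(staging);

    WGPUCommandEncoder enc = wgpuDeviceCreateCommandEncoder(device, NULL);
    wgpuCommandEncoderCopyBufferToBuffer(enc, staging, 0, storageBuf, 0, size);
    WGPUCommandBuffer cmd = wgpuCommandEncoderFinish(enc, NULL);
    wgpuQueueSubmit(queue, 1, &cmd);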
> AFAIK the idea of render pass boundaries is quite essential for tiler GPUs.
I've been told by a driver dev for a tiler GPU that they are, in fact, not essential - the driver picks that info up by itself by analyzing the command buffer.
> Most modern 3D APIs also switched to staging buffers though, and I guess there's not much choice if you don't have unified memory.
Well, I wouldn't know, since I switched to using CUDA as a graphics API. It's mostly nonsense-free, faster than the hardware pipeline for points, and about as fast for splats. Seeing how Nanite also software-rasterizes as a performance improvement, CUDA may even be great for triangles. I've only implemented a rudimentary triangle rasterizer so far, but it can draw 10 million small textured triangles per millisecond. Still working on larger triangles, but that's low priority since I focus on point clouds.
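For the curious, a rough sketch of the usual trick for this kind of software point rasterization (not my exact code): pack depth and color into one 64-bit value and use atomicMin per pixel so the closest point wins:

    #include <cstdint>

    struct Mat4 { float m[16]; };   // row-major view-projection matrix

    __device__ float4 transform(const Mat4& M, float3 p) {
        return make_float4(
            M.m[0]*p.x + M.m[1]*p.y + M.m[2]*p.z  + M.m[3],
            M.m[4]*p.x + M.m[5]*p.y + M.m[6]*p.z  + M.m[7],
            M.m[8]*p.x + M.m[9]*p.y + M.m[10]*p.z + M.m[11],
            M.m[12]*p.x + M.m[13]*p.y + M.m[14]*p.z + M.m[15]);
    }

    // One thread per point. 'framebuffer' holds (depth << 32 | color) per
    // pixel and must be cleared to 0xFFFFFFFFFFFFFFFF before each frame.
    __global__ void rasterizePoints(const float3* points, const uint32_t* colors,
                                    int n, Mat4 viewProj,
                                    unsigned long long* framebuffer,
                                    int width, int height) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float4 clip = transform(viewProj, points[i]);
        if (clip.w <= 0.0f) return;                   // behind the camera

        int px = (int)((clip.x / clip.w * 0.5f + 0.5f) * width);
        int py = (int)((clip.y / clip.w * 0.5f + 0.5f) * height);
        if (px < 0 || px >= width || py < 0 || py >= height) return;

        // Positive-float bit patterns sort like the floats themselves.
        uint32_t depth = __float_as_uint(clip.w);
        unsigned long long packed = ((unsigned long long)depth << 32) | colors[i];
        atomicMin(&framebuffer[py * width + px], packed);
    }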
In any case, I won't touch graphics APIs anymore until they make a clean break to remove the legacy nonsense. Allocating buffers should be a single line, providing data to shaders should be as simple as passing pointers, etc.
- Visualize other scan data such as gaussian splat data sets, or triangle meshes from photogrammetry
- Things like Google Earth, Cesium, or other 3D globe viewers.
It's a pretty big thing in geospatial sciences and industry.