> NVidia and AMD keep designing their cards with Microsoft for DirectX first, and Vulkan, eventually.
Not really. For instance NVIDIA released day 1 Vulkan extensions for their new raytracing and neural net tech (VK_NV_cluster_acceleration_structure, VK_NV_partitioned_tlas, VK_NV_cooperative_vector), as well as equivalent NVAPI extensions for DirectX12. Equal support, although DirectX12 is technically worse: you need to use NVAPI and rely on a prerelease version of DXC, because unlike Vulkan and SPIR-V, DirectX12 has no mechanism for vendor-specific extensions (for better or worse).
Meanwhile the APIs, both at the surface level and in how the driver implements them under the hood, are basically identical. So identical, in fact, that NVIDIA has the nvrhi project, which provides a thin wrapper over Vulkan/DirectX12 so that you can run on multiple platforms via one API.
An exception that doesn't change the rule. Where are the Vulkan extensions for DirectX neural shaders and RTX kit?
And that's just a more recent example; I don't feel like enumerating all of them since the DirectX 8 shader model introduction, or the collaboration with NVidia where Cg became the foundation of HLSL.
Exactly: proprietary APIs don't have the extension spaghetti of Khronos APIs, which always ends up out of control, hence the Vulkan 2025 roadmap plans.
Khronos got lucky that Google and Samsung decided to embrace Vulkan as the API for Android, that Valve did for their Steam Deck, and that IoT displays did too, basically.
Everywhere else it is middleware engines that support all major 3D APIs, with WebGPU also becoming middleware outside of the browser due to the ways of Vulkan.
> An exception that doesn't change the rule. Where are the Vulkan extensions for DirectX neural shaders and RTX kit?
DirectX "neural shaders" is literately the VK_NV_cooperative_vector extension I mentioned previously, which is actually easier to use in Vulkan at the moment since you don't need a custom prelease version of DXC. Same for all the RTX kit stuff, e.g. https://github.com/NVIDIA-RTX/RTXGI has both VK and DX12 support.
And how does that prove that NVidia didn't design it together with Microsoft first, as a DirectX prototype?
Additionally, Intel and AMD will naturally come up with their own extensions, if ever, followed by a common Khronos one. And that's not counting mobile GPUs in this extension frenzy.
So then we will have the pleasure of choosing between four extensions for a feature, depending on the card's vendor, with possibly incompatible semantics, as has happened so many times.
> And not only are we getting tile/block-level primitives and TileIR
As someone working on graphics programming, it always frustrates me to see so much investment in GPU APIs _for AI_, but almost nothing for GPU APIs for rendering.
Block level primitives would be great for graphics! PyTorch-like JIT kernels programmed from the CPU would be great for graphics! ...But there's no money to be made, so no one works on it.
And for some reason, GPU APIs for AI are treated like an entirely separate thing, rather than having one API used for AI and rendering.
> But the biggest problem I'm having is management of buffer space for intermediate objects
My advice for right now (barring new APIs), if you can get away with it, is to pre-allocate a large scratch buffer for as big of a workload as you will have over the program's life, and then have shaders virtually sub-allocate space within that buffer.
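Concretely, the "virtually sub-allocate" part can be as simple as a bump allocator over that one big buffer. A CUDA-flavored sketch (the same idea works with buffer atomics in HLSL/GLSL; all names here are made up):

    #include <cstdint>

    struct ScratchHeap {
        uint8_t*            base;      // the one big pre-allocated scratch buffer
        unsigned long long* cursor;    // bump offset shared by every thread
        unsigned long long  capacity;  // worst-case size you allocated up front
    };

    // Device-side sub-allocation: claim `bytes` out of the scratch buffer, or
    // return nullptr if the worst-case estimate turned out to be too small.
    __device__ void* scratch_alloc(ScratchHeap heap, unsigned long long bytes)
    {
        unsigned long long offset = atomicAdd(heap.cursor, bytes);
        if (offset + bytes > heap.capacity)
            return nullptr;            // out of scratch; the caller has to handle it
        return heap.base + offset;
    }

The host side only allocates base/capacity once at startup and resets *cursor to 0 between frames or passes; nothing gets freed or reallocated in the hot path.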
If you are willing to have lots of loops inside of a shader (not always possible due to Windows's 2-second TDR limit), you can write while(hits_array is not empty) style code, allowing your 1024-thread workgroup to keep "calling" all of the hits and recursively processing all of the rays efficiently.
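Here's roughly what that loop can look like, again as a CUDA sketch with Hit and the actual shading left as placeholders. Two buffers ping-pong between passes so the pops and pushes never race:

    struct Hit { int dummy; /* ray payload, surface data, bounce depth, ... */ };

    // One big block keeps draining hits until no new ones are produced.
    // buf_a holds the initial hits, buf_b receives follow-up work, and the two
    // swap roles each pass; __syncthreads() keeps the phases separated.
    __global__ void process_all_hits(Hit* buf_a, Hit* buf_b, int initial_count)
    {
        __shared__ int in_count, out_count;
        Hit* in  = buf_a;
        Hit* out = buf_b;
        if (threadIdx.x == 0) { in_count = initial_count; out_count = 0; }
        __syncthreads();

        while (in_count > 0) {                    // "while hits_array is not empty"
            for (int i = threadIdx.x; i < in_count; i += blockDim.x) {
                Hit h = in[i];
                // ...shade h; if it spawns follow-up work, push it:
                // int slot = atomicAdd(&out_count, 1);
                // out[slot] = follow_up;
                (void)h;
            }
            __syncthreads();                      // everyone is done with this pass
            Hit* tmp = in; in = out; out = tmp;   // swap the queues
            if (threadIdx.x == 0) { in_count = out_count; out_count = 0; }
            __syncthreads();
        }
    }

Launched as a single big block, e.g. process_all_hits<<<1, 1024>>>(hits, scratch_hits, num_hits), and with real pushes in place of the comments, this keeps chewing through work without round-tripping to the CPU.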
--------
The important tidbit is that this technique generalizes. If you have 5 functions that need to be "called" after your current processing, then it becomes:
    if (func1 needs to be called next) {
        push(func1, dataToContinue);
    } else if (func2 needs to be called next) {
        push(func2, dataToContinue);
    } else if (func3 needs to be called next) {
        push(func3, dataToContinue);
    } else if (func4 needs to be called next) {
        push(func4, dataToContinue);
    } else if (func5 needs to be called next) {
        push(func5, dataToContinue);
    }
Now of course we can't grow "too far"; GPUs can't handle divergence very well. But for "small" numbers of next-arrays and "small" amounts of divergence (i.e. I'm assuming that func1 is the most common here, like 80%+, so that the buffers remain full), this technique works.
If you have more divergence than that, then you need to think more carefully about how to continue. Maybe GPUs are a bad fit (ex: any HTTP server code will be awful on GPUs) and you're forced to use a CPU.
> Are you arguing for a better software abstraction, a different hardware abstraction or both?
I don't speak for Raph, but imo it seems like he was arguing for both, and I agree with him.
On the hardware side, GPUs have struggled with dynamic workloads at the API level (not, e.g., thread-level dynamism; that's a separate topic) for around a decade. Indirect commands gave you some of that, so at least the size of your data/workload could be variable even if the workloads themselves couldn't; then mesh shaders gave you a little more access to geometry processing, and finally work graphs and device generated commands let you have an actually dynamically defined workload (e.g. completely skipping dispatches for shading materials that weren't used on screen this frame). However it's still very early days, and the performance issues and lack of easy portability are problematic. See https://interplayoflight.wordpress.com/2024/09/09/an-introdu... for instance.
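For a concrete picture of what the indirect-command step bought you (Vulkan used as the example; cmd and count_buffer are assumed to already exist): an earlier compute pass writes a VkDispatchIndirectCommand, i.e. three uint32 workgroup counts {x, y, z}, into count_buffer, and the CPU just records

    vkCmdDispatchIndirect(cmd, count_buffer, /*offset=*/0);

so the size of the workload is decided on the GPU at execution time, but which shader runs is still fixed on the CPU. That's the gap that work graphs and device generated commands are trying to close.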
On the software side, shading languages have been garbage for far longer than the hardware has been a problem. It's only in the last year or two that a proper language server for writing shaders has even existed (Slang's LSP), to say nothing of the innumerable driver compiler bugs, the lack of well-defined semantics and a memory model until the last few years, or the fact that we're still manually dividing work into the correct cache-aware chunks.
You can. There are API extensions for persistently mapping memory, and it's up to you to ensure that you never write to a buffer at the same time the GPU is reading from it.
At least for Vulkan/DirectX12. Metal is often weird, I don't know what's available there.
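In Vulkan, for example, it's just a host-visible allocation that you map once and keep mapped for its whole lifetime. A fragment, assuming device, buffer, size, and a HOST_VISIBLE | HOST_COHERENT memory type index are already in hand (memory-requirements query and error handling omitted):

    VkMemoryAllocateInfo alloc_info = {
        .sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
        .allocationSize  = size,
        .memoryTypeIndex = host_visible_coherent_type, // via vkGetPhysicalDeviceMemoryProperties
    };
    VkDeviceMemory memory;
    vkAllocateMemory(device, &alloc_info, NULL, &memory);
    vkBindBufferMemory(device, buffer, memory, 0);

    void* mapped = NULL;
    vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &mapped);
    // `mapped` stays valid until you unmap or free the memory; write into it
    // every frame, but only into regions the GPU isn't currently reading from
    // (fences / frames-in-flight tracking are on you).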
Fast light transport is an incredibly hard problem to solve.
Raytracing (in its many forms) is one solution. Precomputing lightmaps, probes, occluder volumes, or other forms of precomputed visibility are another.
In the end it comes down to a combination of target hardware, art direction and requirements, and technical skill available for each game.
There's not going to be one general purpose renderer you can plug into anything, _and_ expect it to be fast, because there's no general solution to light transport and geometry processing that fits everyone's requirements. Precomputation doesn't work for dynamic scenes, and for large games it leads to issues with storage size and workflow slowdowns across teams. No precomputation at all requires extremely modern hardware and cutting-edge research, has stability issues, and despite all that is still very slow.
It's why game engines offer several different forms of lighting methods, each with as many downsides as they have upsides. Users are supposed to pick the one that best fits their game, and hope it's good enough. If it's not, you write something custom (if you have the skills for that, or can hire someone who can), or change your game to fit the technical constraints you have to live with.
> Nobody has a good solution to this yet. What does the renderer need to know from its caller? A first step I'm looking at is something where, for each light, the caller provides a lambda which can iterate through the objects in range of the light. That way, the renderer can get some info from the caller's spatial data structures. May or may not be a good idea. Too early to tell.
Some games may have their own acceleration structures. Some won't. Some will only have them on the GPU, not the CPU. Some will have an approximate structure used only for specialized tasks (culling, audio, lighting, physics, etc), and cannot be generalized to other tasks without becoming worse at their original task.
Fully generalized solutions will be slow but flexible, and fully specialized solutions will be fast but inflexible. Game design is all about making good tradeoffs.
The same argument could be made against Vulkan, or OpenGL, or even SQL databases. The whole NoSQL era was based on the concept that performance would be better with less generality in the database layer. Sometimes it helped. Sometimes trying to do database stuff with key/value stores made things worse.
I'm trying to find a reasonable medium. I have a hard scaling problem - big virtual world, dynamic content - and am trying to make that work well. If that works, many games with more structured content can use the same approach, even if it is overkill.
For primary visibility, you don't need more than 1 sample. All it is is a simple "send ray from camera, stop on first hit, done". No monte carlo needed, no noise.
On recent hardware, for some scenes, I've heard of primary visibility being faster to raytrace than rasterize.
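If it helps to see how little is involved, here's a toy but complete CUDA kernel making the point: one deterministic ray per pixel, stop at the first hit, nothing to accumulate and nothing to get noisy. The "scene" is a single hard-coded sphere standing in for whatever acceleration structure a real renderer would query:

    #include <cstdint>

    __global__ void primary_visibility(int width, int height, uint8_t* out_mask)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        // Pinhole camera at the origin looking down -z; exactly one ray per pixel.
        float u = (x + 0.5f) / width  * 2.0f - 1.0f;
        float v = (y + 0.5f) / height * 2.0f - 1.0f;
        float dx = u, dy = v, dz = -1.0f;
        float len = sqrtf(dx*dx + dy*dy + dz*dz);
        dx /= len; dy /= len; dz /= len;

        // Sphere at (0, 0, -3), radius 1: solve |o + t*d - c|^2 = r^2 for t.
        float ox = 0.0f, oy = 0.0f, oz = 3.0f;          // ray origin minus center
        float b = ox*dx + oy*dy + oz*dz;
        float c = ox*ox + oy*oy + oz*oz - 1.0f;
        float disc = b*b - c;

        // First hit or miss, and we're done; no sampling loop anywhere.
        out_mask[y * width + x] = (disc >= 0.0f && (-b - sqrtf(disc)) > 0.0f) ? 255 : 0;
    }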
The main reasons why games are currently using raster for primary visibility:
1. They already have a raster pipeline in their engine, have special geometry paths that only work in raster (e.g. Nanite), or want to support GPUs without any raytracing capability and need to ship a raster pipeline anyways, and so might as well just use raster for primary visibility.
2. Acceleration structure building and memory usage is a big, unsolved problem at the moment. Unlike with raster, there aren't existing solutions like LODs, streaming, compression, frustum/occlusion culling, etc. to keep memory and computation costs down. Not to mention that updating acceleration structures every time something moves or deforms is a really big cost. So games use low-resolution "proxy" meshes for raytraced lighting, and use their existing high-resolution meshes for rasterization of primary visibility. You can then apply your (relatively) low-quality lighting to your high-quality visibility and get a good overall image.
Nvidia's recent extensions and Blackwell hardware are changing the calculus though. Their partitioned TLAS extension lowers the acceleration structure build cost when moving objects around, their cluster BLAS extension allows for LOD/streaming solutions to keep memory usage down as well as cheaper deformation for things like skinned meshes since you don't have to rebuild the entire BLAS, and Blackwell has special compression for BLAS clusters to further reduce memory usage. I expect more games in the ~near future (remember games take 4+ years of development, and they have to account for people on low-end and older hardware) to move to raytracing primary visibility, and to ditch raster entirely.
Ray tracing refers to the act of tracing rays. You can use it for lighting, but also sound, visibility checks for enemy AI, etc.
Path tracing is a specific technique where you ray trace multiple bounces to compute lighting.
In recent games, "ray tracing" often means just using ray tracing for direct light shadows instead of shadow maps, raytraced ambient occlusion instead of screenspace AO, or raytraced 1-bounce of specular indirect lighting instead of screenspace reflections. "Path traced" often means raytraced direct lighting + 1-bounce of indirect lighting + a radiance cache to approximate multiple bounces. No game does _actual_ path tracing because it's prohibitively expensive.
I believe the "path tracing" you described here is actual path tracing, insofar as each sample is one "path" rather than one "ray", where a "path" does at least one bounce, which is equivalent to at least two rays per sample. Though I think the "old" path tracing algorithm was indeed very slow, because it sent out samples in random directions, whereas modern path tracing uses the ReSTIR algorithm, which does a form of importance sampling and is a lot faster.
The other significant part is that path tracing is independent of the number of light sources, which isn't the case for some of the classical ray traced effects you mention ("direct shadows" vs path traced direct lighting).