The A100 whitepaper "spoiled" a lot of these factoids already. (https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...)
The new bit seems to be the doubling of FP32 "CUDA cores" (I really hate that term: when Intel or AMD double their CPU pipelines it doesn't mean they're selling more cores, it means their cores got wider... anyway). A100 didn't have this feature (I assume A100 was 16 floating-point + 16 integer "CUDA cores" per SM partition like Turing. Correct me if I'm wrong.)
You don't need to read the whitepaper to understand that NVidia has really improved performance/cost here. The 3rd party benchmarks are out and the improved performance is well documented at this point.
The FP32 doubling is one of the most important bits here. But fortunately for programmers, this doesn't really change how you write your code. The compiler / PTX assembler will schedule your code at compile time to take best advantage of it.
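For example (a throwaway kernel of my own, nothing Ampere-specific in the source): the usual mix of integer index math and FP32 FMAs compiles exactly as before, and it's ptxas and the warp schedulers that decide which datapath each instruction lands on.

```cuda
#include <cuda_runtime.h>

// Ordinary saxpy: integer index math plus one FP32 FMA per element.
// The same source compiles unchanged on Ampere; the toolchain decides
// whether an instruction issues on the FP32-only datapath or the
// shared FP32/INT32 one.
__global__ void saxpy(int n, float a, const float* __restrict__ x,
                      float* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // INT32 work
    if (i < n)
        y[i] = fmaf(a, x[i], y[i]);                  // FP32 work
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    // Data left uninitialized; this is just a compile/launch sketch.
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```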
The other bit, the larger combined L1 / shared memory of 128 KB per SM, does affect programmers. GPU programmers have tight control over shared memory, and it is very useful for optimization purposes.
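A minimal sketch of what that looks like in practice. The exact per-block ceiling is architecture-dependent (I believe roughly 100 KB of the 128 KB combined L1/shared on GA10x), so the code queries it instead of hard-coding a number:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Per-block sum that stages data in dynamically sized shared memory.
// Assumes blockDim.x is a power of two.
__global__ void blockSum(const float* in, float* out, int n)
{
    extern __shared__ float tile[];          // sized at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

int main()
{
    // Kernels wanting more than the default 48 KB of dynamic shared
    // memory per block must opt in explicitly.
    int maxOptin = 0;
    cudaDeviceGetAttribute(&maxOptin,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, 0);
    printf("max opt-in dynamic shared memory per block: %d bytes\n", maxOptin);

    cudaFuncSetAttribute(blockSum,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         maxOptin);

    // ... allocate d_in / d_out, then launch with the third <<< >>>
    // argument (dynamic shared memory bytes) set as large as needed:
    // blockSum<<<grid, block, smemBytes>>>(d_in, d_out, n);
    return 0;
}
```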
----------
GDDR6X's improved memory bandwidth is also big. "Feeding the beast" with faster RAM is always a laudable goal, and sending two bits per pin per symbol via PAM4 signaling is a nifty trick.
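If the PAM4 part sounds abstract: with four voltage levels, one symbol on a pin encodes two bits instead of one, so a byte moves in four symbols rather than eight. A toy illustration of the concept only, not GDDR6X's actual line code (the real thing layers its own encoding on top):

```cuda
#include <cstdio>

int main()
{
    // Assumed mapping of the four levels to bit pairs, for illustration.
    const char* bits[4] = { "00", "01", "10", "11" };
    unsigned char byte = 0xB4;               // 1011 0100
    for (int sym = 3; sym >= 0; --sym) {     // 4 symbols carry one byte
        int level = (byte >> (2 * sym)) & 0x3;
        printf("symbol %d: level %d carries bits %s\n",
               3 - sym, level, bits[level]);
    }
    return 0;
}
```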
Sparse Tensor Cores were already implemented in A100, and don't seem to be new. If you haven't heard of the tech before, it's cool: basically hardware-accelerated sparse-matrix computations. A 4x4 FP16 matrix uses 32 bytes under normal conditions, but can be "compressed" to roughly half that if at least two out of every four consecutive values are zero (2:4 structured sparsity), keeping only the non-zero values plus small index metadata. NVidia Ampere supports hardware-accelerated matrix multiplications of these compressed "virtual" 4x4 FP16 matrices.
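A rough sketch of the compression idea. The layout below is my own illustration, not NVIDIA's on-chip metadata format: per group of four values, at most two may be non-zero, and you keep the surviving values plus a 2-bit position index for each.

```cuda
#include <cstdio>

struct CompressedGroup {
    float         value[2];   // non-zero values (FP16 on the real hardware)
    unsigned char index[2];   // positions 0..3 of those values in the group
};

// Compress one group of 4; assumes the 2:4 constraint already holds.
CompressedGroup compress4(const float in[4])
{
    CompressedGroup g = {};
    int kept = 0;
    for (unsigned char i = 0; i < 4 && kept < 2; ++i) {
        if (in[i] != 0.0f) {
            g.value[kept] = in[i];
            g.index[kept] = i;
            ++kept;
        }
    }
    return g;
}

int main()
{
    float group[4] = { 0.0f, 1.5f, 0.0f, -2.0f };
    CompressedGroup g = compress4(group);
    printf("kept %.1f at slot %d and %.1f at slot %d\n",
           g.value[0], g.index[0], g.value[1], g.index[1]);
    return 0;
}
```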
I swear that RTX I/O existed before in some other form. This isn't the first time I've heard of offloading PCIe storage transfers to the GPU. It's niche and I don't expect video games to use it (are M.2 SSDs popular enough to be assumed on the PC / laptop market yet?). But CUDA coders probably can control their hardware more carefully and benefit from such a feature.
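For what it's worth, the earlier form I'm thinking of is probably GPUDirect Storage on the datacenter/CUDA side, which reads a file straight into GPU memory without a bounce through a host buffer. A rough sketch with its cuFile API (file name and sizes made up, error handling omitted):

```cuda
#include <cuda_runtime.h>
#include <cufile.h>      // GPUDirect Storage (libcufile)
#include <fcntl.h>
#include <unistd.h>

int main()
{
    const size_t bytes = 64 << 20;    // 64 MiB, arbitrary
    void* devBuf = nullptr;
    cudaMalloc(&devBuf, bytes);

    cuFileDriverOpen();

    // GDS wants O_DIRECT so the read bypasses the page cache.
    int fd = open("data.bin", O_RDONLY | O_DIRECT);   // file name made up

    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);
    cuFileBufRegister(devBuf, bytes, 0);

    // DMA from the SSD into GPU memory, no staging copy in system RAM.
    ssize_t got = cuFileRead(fh, devBuf, bytes, /*file_offset=*/0,
                             /*devPtr_offset=*/0);
    (void)got;   // error handling omitted for brevity

    cuFileBufDeregister(devBuf);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    cudaFree(devBuf);
    return 0;
}
```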
RTX I/O is going to be a big feature, and games are likely some of the first consumer-facing software that will use it, because it is a standard feature for the next console generation. And AAA devs already support multiple performance profiles, feature-support fallbacks, etc. There's no reason they couldn't have the engine take advantage of RTX I/O when it exists, but otherwise fall back on an emulation layer of sorts.
In addition, I suspect the slice of the video game market that has a GPU with RTX I/O capability will also have a NVME SSD. Now, this is niche, but with that slice of the market also being the top-end performance tier, they're still going to be catered to by AAA devs.
Even without an NVMe drive you're better off with this, just by skipping system RAM altogether. But you're not going to be able to use it to stream game content back and forth at the snap of a finger (well, maybe that's a bit hyperbolic) the way the console makers have been saying they will.
> The FP32 doubling is one of the most important bits here. But fortunately for programmers, this doesn't really change how you write your code.
Early benchmarks are showing games under-performing quite a bit in the worst cases. The crux of the issue is that it's not /exactly/ a no-compromise doubling of FP32. Each SM partition has two datapaths: one is FP32-only, the other does either FP32 or INT32, so per clock you get either 2x FP32 or 1x FP32 + 1x INT32. So if your game or application has any significant number of INT32 operations scheduled, all of a sudden you're back near the FP32 throughput you had last generation, though you do get the benefit of parallel INT32 execution.
It's not uncommon for GPU workloads in games to run around 20% INT32 instructions, but alas that's enough to drop the FP32 performance quite a bit. I suspect Nvidia will probably separate out the INT32 and 2x FP32 units next time, and gradually move towards a ratio of hardware that better suits the usual workload split.
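Back-of-the-envelope, assuming ideal scheduling and an 80:20 FP32:INT32 instruction mix per SM partition: the 20 INT32 instructions tie up the shared datapath for 20 cycles (during which the FP32-only path still issues 20 FP32), and the remaining 60 FP32 issue two per clock over 30 more cycles. That's 80 FP32 in 50 cycles, i.e. 1.6 FP32/clock against the 2.0 peak. Still about 1.6x a Turing-style 1x FP32 + 1x INT32 partition, but short of the headline 2x.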
Given the lower amount of INT32 in game workloads, as you stated, I don't think separating the INT32 and FP32 hardware makes a lot of sense, because you can share a substantial amount of the hardware between the two, which overall leads to die-space savings.
On the contrary, "dark silicon" suggests that separating the FP32 and INT32 (now, in GA102/104, FP32 and INT32/FP32) datapaths at the cost of more die area currently makes excellent sense. (See also: tensor cores, ray-tracing cores.) Jensen Huang very briefly alluded to this when, during the GA102/104 announcement, he mentioned the end of Dennard scaling.
But GA102/GA104 doesn't have separate execution units for INT32 and FP32, because the INT32 units also do FP32. So I don't see how that shows that separating FP32 and INT32 hardware makes sense.
I've been thinking that's why we are seeing the true doubling in fully ray-traced titles like Quake and Minecraft, but not in more traditional rendering engines.
From my understanding, INT is often used for lookups, and I'd presume a lot of that is some sort of environment mapping, which adds some contention since INT issue is more limited and "steals" slots from the doubled FP path.
I think "parallel execution of fp32/int32" is kind of vaguely defined by them... Do they mean fp32/int32 instructions from the same thread (aka warp/wavefront) or from different threads? If it's the latter I'm pretty sure AMD GPUs have been doing it too.