
On the same basis, it would also help if you could provide a comparison between GPUs commonly used for ML: Tesla K80, P100, T4, V100, and A100. How has the architecture evolved to make the A100 significantly faster? Is it just the 80GB RAM, or is there more to it from an architecture standpoint?


> How has the architecture evolved to make the A100 significantly faster?

Oh, very much so: performance has improved by way more than an order of magnitude. For a deeper read, have a look at the "architecture white papers" for Kepler, Pascal, Volta/Turing, and Ampere:
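To get a feel for the gap, here's a back-of-envelope sketch of peak FP32 throughput (cores x clock x 2, counting a fused multiply-add as two FLOPs). The core counts and boost clocks are approximate published figures, so treat the exact numbers as assumptions:

```python
def peak_tflops(cuda_cores: int, boost_clock_ghz: float) -> float:
    """Theoretical peak FP32 TFLOPS, counting one FMA as 2 FLOPs."""
    return cuda_cores * boost_clock_ghz * 2 / 1000

# Tesla K80: two GK210 dies, 2496 CUDA cores each at roughly 0.875 GHz boost.
k80 = 2 * peak_tflops(2496, 0.875)   # ~8.7 TFLOPS across both dies

# A100: 6912 CUDA cores at roughly 1.41 GHz boost.
a100 = peak_tflops(6912, 1.41)       # ~19.5 TFLOPS

print(f"K80  ~{k80:.1f} TFLOPS FP32")
print(f"A100 ~{a100:.1f} TFLOPS FP32")
```

Note that plain FP32 only accounts for a ~2x difference; the order-of-magnitude gap comes from the tensor cores (the A100's quoted TF32/FP16 tensor throughput is far higher than its plain FP32 rate) plus much faster memory.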

https://duckduckgo.com/?t=ffab&q=NVIDIA+architecture+white+p...

or check out the archive of NVIDIA's parallel4all blog ... hmm, that's weird, it seems like they've retired it. They used to have really good blog posts explaining what's new in each architecture.

You could also have a look here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

for the table of various numeric sizes and limits which change with different architectures. But that's not a very useful resource in and of itself.


You may find this[0] helpful (note -- download link to a .PDF). It's the GA100 whitepaper.

[0]: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...


As a starter, the T4 is heavily optimized for low power consumption on inference tasks. IIRC it doesn't even require additional power beyond what the PCIe bus can provide, but it's basically useless for training, unlike the others.


One day I'll get my hands on both an A40 and an A100, and maybe I'll get an answer to the question: does the 5120-bit memory bus help that much? The A100 has fewer CUDA cores and around 1/4 more tensor cores, yet it seems to be the preferred 'compute' and 'AI training' option all around. What gives?
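The bus width tells a big part of the story on paper. A rough sketch of theoretical peak bandwidth (bus width in bytes times per-pin data rate; the per-pin rates below are approximate assumptions for HBM2 on the 40GB A100 and GDDR6 on the A40):

```python
def peak_bandwidth_gbs(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return bus_width_bits / 8 * data_rate_gbps

# A100 40GB: 5120-bit HBM2 bus at ~2.43 Gbps per pin.
a100_bw = peak_bandwidth_gbs(5120, 2.43)   # ~1555 GB/s

# A40: 384-bit GDDR6 bus at ~14.5 Gbps per pin.
a40_bw = peak_bandwidth_gbs(384, 14.5)     # ~696 GB/s

print(f"A100 ~{a100_bw:.0f} GB/s, A40 ~{a40_bw:.0f} GB/s")
```

If these assumed rates are roughly right, the A100 has more than double the A40's bandwidth, which matters a lot for training workloads that are memory-bound rather than compute-bound.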



