
Thanks a lot, always enjoy your posts on here and r/hardware! Do you have a hardcore introduction with even more detail / perhaps even with examples of implementations? :)

I find white papers quite good (although I admit there are many things I don't understand yet and constantly have to look up), but even these sometimes feel a bit general.



> Do you have a hardcore introduction with even more detail / perhaps even with examples of implementations?

Hard to beat the actual technical references when you want something hardcore!

The official CUDA documentation is good:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....

In particular, "Section 5: Performance Guidelines" (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index....) gives a lot of micro-architectural detail, including those nasty "bank conflicts" that people keep talking about. Honestly, the original documentation says it best, in just a few paragraphs on that particular matter:

> To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.

>

> However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.

See? It's really not that hard or unapproachable. Just read the original docs; it's all there.
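To make the quoted paragraphs concrete, here is a minimal CUDA sketch (my own, not from the docs) of the classic padded shared-memory transpose. With 32 four-byte banks, every element of a column in a 32x32 float tile lands in the same bank, so a column read is a 32-way conflict; padding each row by one element shifts successive rows into different banks.

    #define TILE 32

    // Launch with dim3 block(TILE, TILE): one thread per tile element.
    __global__ void transpose_padded(const float *in, float *out, int n) {
        // The +1 padding makes each row stride 33 words instead of 32, so the
        // 32 threads of a warp reading a column hit 32 distinct banks.
        // Drop the +1 and the column read below becomes a 32-way conflict.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // row write

        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;   // swap block offsets for the
        y = blockIdx.x * TILE + threadIdx.y;   // transposed output position
        if (x < n && y < n)
            out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read
    }

Nsight Compute reports shared-memory bank-conflict counters, so you can remove the padding and watch the n-way serialization from the quote show up directly in a profile.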

AMD's documentation is scattered to the winds, but the same information is out there. Your #1 performance reference is the ancient optimization guide from 2015. It's a bit dated, but it holds up: http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_...

Chapters 1 and 2 are still relevant to today's architectures (even RDNA, though some details have changed). Chapter 2, on GCN, applies to all AMD GPUs from the 7xxx series through the Rx 2xx, 3xx, 4xx, and 5xx series, Vega, and CDNA (aka MI100) architectures.

RDNA does not have as good an architectural guide. Start with the OpenCL optimization guide, then "update" your knowledge with the rather short RDNA guide: https://gpuopen.com/performance/

For assembly language details:

* CUDA's PTX is basically assembly language (see the short inline-PTX sketch after this list): https://docs.nvidia.com/cuda/parallel-thread-execution/index...

* PTX is a portable assembly language: NVIDIA continuously updates its GPUs, so the machine code underneath changes between generations. Volta was studied in depth in this paper: https://arxiv.org/abs/1804.06826. I'd suggest reading it AFTER you learn the basics of PTX.

* AMD publishes their assembly language for each generation. Vega is probably a good starting point: https://developer.amd.com/wp-content/resources/Vega_Shader_I...
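As promised above, here is a tiny sketch of inline PTX from CUDA C++ (my own example, not from any of the linked docs): it reads the %laneid special register, one of the simplest things to try while following the PTX ISA manual. Running nvcc -ptx file.cu will also dump the full PTX the compiler generates for any kernel, which is a good way to connect the manual to real code.

    #include <cstdio>

    __global__ void lane_id(unsigned *out) {
        unsigned lane;
        // Inline PTX: move the %laneid special register into a 32-bit register.
        // ("%%" escapes the % inside the inline-asm string.)
        asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
        out[threadIdx.x] = lane;
    }

    int main() {
        unsigned *d, h[32];
        cudaMalloc(&d, sizeof(h));
        lane_id<<<1, 32>>>(d);                 // a single warp: lanes 0..31
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        for (unsigned v : h) printf("%u ", v); // prints 0 1 2 ... 31
        printf("\n");
        cudaFree(d);
        return 0;
    }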


Thanks for these great links.


Wow... this is really great, thank you very, very, very much!



