A very good and reasonably approachable discussion of the pros and cons of how CUDA programming is actually realized in hardware. The explanation of how GPUs handle context switching is particularly thoughtful and enlightening. It took me a long time to figure this out a couple of months ago; a guide like this would have saved me a few nights.
I was surprised that the author never used the term CUDA, though. They discuss actual syntax from it, but don't mention the language (extension) by name once.
Probably because the hardware is language-agnostic. You would have to consider how the hardware works whether you were using GLSL (GL's shader language), HLSL (D3D's shader language), OpenCL (the compute sibling of GL/GLSL), DirectCompute (the compute sibling of D3D/HLSL), or CUDA.
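To make that concrete, here's a trivial sketch of my own (written in CUDA only because that's the article's syntax, and purely hypothetical); the comments note how each concept maps onto the other APIs, since the hardware underneath is the same.

```cuda
// A trivial kernel; every concept in it has a direct equivalent in the
// other APIs listed above.
__global__ void scale(float *data, float factor, int n)
{
    // Global thread index: blockIdx/blockDim/threadIdx here,
    // get_global_id(0) in OpenCL, SV_DispatchThreadID in HLSL compute,
    // gl_GlobalInvocationID in a GLSL compute shader.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard the last, partially filled block
        data[i] *= factor;
}

// Host side: a CUDA "block" is an OpenCL "work-group" / D3D "thread
// group"; the hardware schedules the work the same way regardless:
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
```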
Very nice article. In my limited experience with OpenCL programming, the most difficult thing is understanding how memory access patterns affect performance. It doesn't help that the same access pattern can behave differently on different platforms.
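For example (a hypothetical CUDA-flavored sketch of my own, not something from the article): these two kernels do the same arithmetic over a row-major n x n matrix, but only the second lets neighbouring threads read neighbouring addresses, which most GPUs reward heavily; how heavily varies by platform.

```cuda
// Both kernels reduce an n x n row-major matrix to a vector of per-line
// sums, with one thread per line.
__global__ void sum_rows(const float *a, float *out, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    float s = 0.0f;
    // At each loop step, neighbouring threads read addresses n floats
    // apart, so their loads cannot be combined into wide transactions.
    for (int j = 0; j < n; ++j)
        s += a[t * n + j];
    out[t] = s;
}

__global__ void sum_cols(const float *a, float *out, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    float s = 0.0f;
    // Here neighbouring threads read neighbouring addresses, so a warp's
    // loads coalesce; this version is usually much faster, but the exact
    // gap depends on the hardware's memory transaction size.
    for (int j = 0; j < n; ++j)
        s += a[j * n + t];
    out[t] = s;
}

// Launch with one thread per row/column, e.g.:
// sum_cols<<<(n + 255) / 256, 256>>>(d_a, d_out, n);
```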
I wonder if what's needed is a higher-level representation that can compile to the best access patterns for the given hardware. (And something that can try several access patterns for your problem and choose the most efficient one.) GPU programming is still quite new, so I guess it's bound to show up eventually.
Even if it couldn't handle every possible situation, such a tool would still be useful; you could always drop down to the CUDA/OpenCL level for the problems that are too difficult to express declaratively.
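Something like this crude harness is roughly what I have in mind, just much smarter (a hypothetical CUDA sketch; the two row-/column-wise variants are made up for illustration): launch each candidate access pattern, time it with CUDA events, and keep whichever is fastest on the GPU you actually have.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Two made-up variants of the same operation (scale every element of an
// n x n row-major matrix), differing only in access pattern.
__global__ void scale_rowwise(float *a, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    for (int j = 0; j < n; ++j)
        a[t * n + j] *= 2.0f;      // neighbouring threads stride n apart
}

__global__ void scale_colwise(float *a, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    for (int j = 0; j < n; ++j)
        a[j * n + t] *= 2.0f;      // neighbouring threads stay adjacent
}

int main()
{
    int n = 1024;
    float *d_a;
    cudaMalloc(&d_a, sizeof(float) * n * n);
    cudaMemset(d_a, 0, sizeof(float) * n * n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const char *names[2] = { "row-wise", "col-wise" };
    float ms[2];
    int blocks = (n + 255) / 256;

    // A real tool would warm up, average several runs, and also vary
    // block sizes; one timed launch per variant is enough for a sketch.
    for (int k = 0; k < 2; ++k) {
        cudaEventRecord(start);
        if (k == 0) scale_rowwise<<<blocks, 256>>>(d_a, n);
        else        scale_colwise<<<blocks, 256>>>(d_a, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms[k], start, stop);
        printf("%s: %.3f ms\n", names[k], ms[k]);
    }
    printf("picked %s on this GPU\n", ms[0] < ms[1] ? names[0] : names[1]);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_a);
    return 0;
}
```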