Evaluating AMD GPUs by their specs is not going to paint the full picture. Their drivers are a serious problem. I've managed to get ROCm mostly working on my system (ignoring all the notices about what is officially supported; the jammy debs from the official repo seem to work on Debian testing). The range of supported setups is so limited that it is quite easy to end up in a similarly unsupported situation.
I expect system lockups when doing any sort of model inference. From my experience over the last few years I assume it is driver bugs. Based on their rate of improvement they will probably get there around 2025, but their past performance has been so bad that I wouldn't recommend buying a card for machine learning until they've proven they're taking the situation seriously.
That said, in my opinion, buy AMD anyway if you need a GPU on Linux. Their open source drivers are a lot less hassle as long as you don't need BLAS.
In the data center, I think AMD is a lot more viable than most people think. MosaicML recently did a test and were able to swap MI250s with A100s basically seamlessly, within a single training run even, and ran into no issues: https://www.mosaicml.com/blog/amd-mi250
I think where most people have been getting into trouble is with trying to run on unsupported cards (e.g., *ALL* of AMD's consumer cards), or wanting to run on Windows. This is obviously a huge fail on AMD's part, since anyone who's tried to do anything with any of those consumer cards will just assume the data center cards are the same, but they're quite different. It doesn't help that I've never seen any CDNA2 card on sale/available in retail. How does AMD ever expect to get any adoption when no developers have hardware they can write code for? It's completely mental.
I got really excited until you said all of their consumer cards are out. That's even more infuriating - people have mammoth computing devices lying around and they can't make full use of them, because of drivers.
Not that drivers are simple to make, but still. It's like owning a Ferrari that works perfectly, but you can only drive north.
You can use the WebGPU backend in Tinygrad. It's working well in my test with an Nvidia 960 running inference (3D UNet). I don't know how well WebGPU is supported on AMD GPUs.
ROCm is not the only option; compute shaders are very reliable on all GPUs. And thanks to Valve's work on DXVK 2.0, modern Linux runs Windows D3D11 software just fine.
I dunno, are they? AMD should pay someone to put up some "how to multiply a 2x2 matrix on our GPU for the average programmer!" tutorials somewhere obvious. I saw a lot of GPU lockups before I gave up on trying and decided that it wasn't worth it. Maybe compute shaders were a thing I should have tried. To be honest, I don't know much about them, because my attempts in the space were shut down pretty hard by driver bugs linked to OpenCL and ROCm.
I thought it was just me for a while, but after watching George Hotz's famous meltdown trying to program on an AMD GPU I do wonder if they're underestimating the power of a few good public "how to use the damn thing" sessions. They've been pushing ROCm which would probably be great if it worked reliably.
CUDA has been the default tech for GPGPU in HPC and AI applications for more than a decade now. By now, people have found most of those driver bugs, and nVidia has fixed them.
Similarly, compute shaders are the only GPGPU tech used in videogames. Modern videogames have been using compute shaders for a decade now, in increasing volumes. For example, UE5 even renders triangle meshes with them [1].
However, OpenCL and ROCm are niche technologies. I've been hearing complaints about their driver quality for some time now. For obvious reasons, AMD and Intel prioritize driver bugs which affect modern videogames sold in many millions of copies over bugs which only affect a few people working on HPC, AI, or other niche GPGPU applications.
> they're underestimating the power of a few good public "how to use the damn thing" sessions
I agree the learning curve is steep, given the lack of good materials. For an introductory article, see [2]. Ignore the parts about D3D10 hardware; the article is old and D3D10 hardware is no longer relevant. Another one, with slightly more depth, is [3]. For an example of how to multiply large dense matrices with a compute shader, see [4], but that example is rather advanced because of the optimizations and because of the weird memory layout conventions inherited from the upstream project.
> For obvious reasons, AMD and Intel prioritize driver bugs which affect modern videogames sold in many millions of copies over bugs which only affect a few people working on HPC, AI, or other niche GPGPU applications.
If people could develop AI stuff on consumer cards, they'd then buy a ton of server-grade cards, or rent them via the usual hyperscalers or dedicated platforms, for the actual work.
This entire multi-million dollar (per training session) market is firmly in the hands of nVIDIA at the moment, and unless AMD seriously improves their offering, that won't change. nVIDIA got to where they are because they focused on getting developers up to speed fast and cheaply, and so the developers asked their employers to buy them the stuff they were already used to.
> If people could develop AI stuff on consumer cards
Technically people can already do that, by leveraging compute shaders in D3D, Vulkan, or WebGPU. I'm certain it's possible to implement a D3D or Vulkan backend for PyTorch, TensorFlow, and similar ML libraries. When I experimented with AI, I did similar stuff myself with C++, C#, and HLSL, and I found it wasn't terribly hard to do.
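For flavor, a backend for those libraries mostly boils down to shaders of this kind. Here's a minimal sketch in HLSL of an elementwise alpha * x + y kernel; the buffer names, register slots, and group size are just assumptions for the example, not code from any particular library:

    // result[i] = alpha * x[i] + y[i], one thread per element.
    cbuffer Params : register(b0)
    {
        uint elementCount;
        float alpha;
    };

    StructuredBuffer<float> x : register(t0);
    StructuredBuffer<float> y : register(t1);
    RWStructuredBuffer<float> result : register(u0);

    [numthreads(256, 1, 1)]
    void main(uint3 id : SV_DispatchThreadID)
    {
        if (id.x < elementCount)
            result[id.x] = alpha * x[id.x] + y[id.x];
    }

The host side just binds the three buffers and dispatches ceil(elementCount / 256) thread groups; that part is where most of the API boilerplate lives.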
However, PyTorch is made by Facebook and TensorFlow by Google. It seems most companies who maintain such libraries are only interested in cloud computing. Some of them, like Qualcomm, only care about their own proprietary hardware. None of them seems to care about desktops.
My theory is that AMD initially wanted to compete with nVidia in compute and was going to improve ROCm, but then they saw how much trouble nVidia had artificially keeping enterprise users from just buying consumer GPUs instead of the much more profitable enterprise GPUs. So they trashed that idea, to keep consumer GPUs from interfering with their very profitable enterprise GPU business for compute applications.
ROCm was designed and implemented for HPC. There's no cunning scheme to stop it working on gaming cards; there just isn't (wasn't?) much investment in making it work either.
I'm suggesting they gave up on investing in making it work when they realized good compute on their consumer cards would cannibalize their enterprise cards, like what nVidia is experiencing.
It disappoints me that DirectX remains one of the best GPU-compute solutions in practice right now. And Vulkan too I guess.
But it really is. That's the state of the market. The video game developers are the GPU programmers; they've hit DirectX11, DX12, and Vulkan with a wide variety of video games and have made that ecosystem very stable.
-------------
DX11 has 32-bit-only atomics, so I don't think it's a very serious solution in practice. Even 64-bit atomics (especially 64-bit CAS) are very limiting compared to the CPU world, where 128-bit CAS is needed to fix the obscure ABA problem.
DX and Vulkan just have... so much API-crap you need to wade through to even get Hello World / SAXPY up.
C++ AMP was wonderful back in 2014, but it too is stuck on DirectX11 and therefore in the 32-bit atomic world. And it hasn't had an update since then. Microsoft really should have kept investing in C++ AMP, IMO.
-------------
ROCm is fine if you get the hardware and if it remains supported. But I think in practice, people expect support longer than what AMD is willing to give.
> 32-bit-only atomics, so I don't think it's a very serious solution in practice
Yeah, I think I encountered that while porting a hash map from CUDA to HLSL.
However, I'm not sure that's necessarily a huge deal. It's probably not an issue for machine learning or BLAS stuff; these use cases don't need fine-grained thread synchronization.
For applications which would benefit from such synchronization, traditional lock-free techniques ported from the CPU (i.e. compare-and-swap atomics on global memory) can be slow due to the huge count of active threads on GPUs. I mean, it's obviously better than locks, but sometimes it's possible to do something better than CAS.
> so much API-crap you need to wade through to even get Hello World / SAXPY up.
I agree for D3D12 and especially Vulkan, but IMO D3D11 is not terribly bad. It has a steep learning curve, but the amount of boilerplate for simple apps is reasonable. Especially for ML or similar GPGPU stuff, which only needs a small subset of the API: compute shaders only, no textures, no render targets, no depth-stencil views, no input layouts, etc.
However, unlike simple apps, real-life ones often need a profiler and a queue depth limiter, which are relatively hard to implement on top of the underlying D3D queries. I think Microsoft should ship both in the Windows SDK.
Lock-free techniques are not "obviously better than locks".
Lock-free techniques offer a different trade-off than locks. They are a little faster than locks in the typical case, but this advantage is paid for by being much slower in the worst case (because they may need a very large number of retries to succeed). When a great number of threads contend for access, the worst case can become very frequent.
The best application for lock-free techniques is in read-only access to shared data. In this case they are almost always the best solution.
On the other hand, for write access to shared data, which is better, optimistic access control with lock-free techniques or deterministic serialization of accesses with locks, depends on the application; it cannot be said in general that one method or the other is preferable.
> Lock-free techniques are not "obviously better than locks".
On GPUs, they are. GPUs don't have any locks; locks can be emulated on top of these global memory atomics, but because the count of active threads is often in the thousands, the performance of that approach is much worse than lock-free techniques.
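To make the comparison concrete, here is a minimal sketch of the lock-free side in HLSL: an atomic float max implemented as a compare-and-swap retry loop on a raw buffer. The buffer name, byte offset parameter, and the [allow_uav_condition] hint are just how I'd write it for illustration, not code from anywhere in particular:

    RWByteAddressBuffer counters : register(u0);

    // Lock-free float max at byte offset `addr`, as a CAS retry loop.
    void atomicMaxFloat(uint addr, float value)
    {
        uint expected = counters.Load(addr);
        [allow_uav_condition]
        while (true)
        {
            if (asfloat(expected) >= value)
                return;                     // someone already stored a larger value
            uint original;
            counters.InterlockedCompareExchange(addr, expected, asuint(value), original);
            if (original == expected)
                return;                     // our CAS won
            expected = original;            // lost the race, retry with the fresh value
        }
    }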
> For applications which would benefit from such synchronization, traditional lock-free techniques ported from the CPU (i.e. compare-and-swap atomics on global memory) can be slow due to the huge count of active threads on GPUs. I mean, it's obviously better than locks, but sometimes it's possible to do something better than CAS.
I don't think traditional CAS can be optimized, but a fair number of atomic operations can be coalesced into a prefix sum. So, with regard to your latest post:
> On GPUs, they are. GPUs don't have any locks; locks can be emulated on top of these global memory atomics, but because the count of active threads is often in the thousands, the performance of that approach is much worse than lock-free techniques.
Those "thousands of atomic" operations can become coalessed 32-at-a-time (prefix-sum) and turned into just "dozens of atomics" in practice. Automatically mind you, by the compiler.
Don't discount the brute-force code because it's simple. Don't assume ~1000+ atomic operations will actually be physically executed as 1000+ atomics. The compiler can "fix" a lot of this code in practice.
Not always, but the compiler can fix it often enough that it's beneficial to write the simple brute-force "thousands-of-atomics" code and check the compiled output.
Probably never with "CAS", but its pretty often that "atomic_add" written in a brute force manner (tracking a parallel counter) will compile into a prefix-sum + 1x atomic from one lane, rather than execute as 32x atomics. And even if it is executed as 32x atomics, there are atomic-accelerators on the GPUs that may make the operation faster than you might think. You know, as long as it isn't a compare-and-swap loop.
Sure, compute shaders might work, but don’t you need rocBLAS, rocSPARSE, MIOpen, etc? Are people reinventing those in compute shaders in another package?
These things are nice to have, but you don’t actually need them.
It only takes 1-2 pages of HLSL to implement efficient matrix multiplication. It's not rocket science; the area is well researched and there are many good articles on how to implement these BLAS routines efficiently.
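To give a rough idea, here's a simplified, untuned sketch of such a shader: a tiled matrix multiply using groupshared memory, assuming square row-major matrices whose size is a multiple of the 16x16 tile. This is not the optimized code from [4]:

    // C = A * B, row-major square matrices, N a multiple of TILE.
    // Dispatched as (N / TILE, N / TILE, 1) thread groups.
    #define TILE 16

    cbuffer Params : register(b0) { uint N; };

    StructuredBuffer<float>   A : register(t0);
    StructuredBuffer<float>   B : register(t1);
    RWStructuredBuffer<float> C : register(u0);

    groupshared float tileA[TILE][TILE];
    groupshared float tileB[TILE][TILE];

    [numthreads(TILE, TILE, 1)]
    void main(uint3 tid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
    {
        float acc = 0.0;
        for (uint k = 0; k < N; k += TILE)
        {
            // Stage one tile of A and one tile of B in groupshared memory.
            tileA[tid.y][tid.x] = A[dtid.y * N + (k + tid.x)];
            tileB[tid.y][tid.x] = B[(k + tid.y) * N + dtid.x];
            GroupMemoryBarrierWithGroupSync();

            for (uint i = 0; i < TILE; i++)
                acc += tileA[tid.y][i] * tileB[i][tid.x];
            GroupMemoryBarrierWithGroupSync();
        }
        C[dtid.y * N + dtid.x] = acc;
    }

The production versions mostly differ in tile sizes, register blocking, and handling of non-multiple dimensions, not in the basic structure.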
Moreover, manually written compute shaders enable stuff missing from these off-the-shelf higher level libraries.
It’s easy to merge multiple compute operations into a single shader. When possible, this sometimes saves gigabytes of memory bandwidth (and therefore time) that these high-level libraries spend writing/reading temporary tensors.
It’s possible to re-shape immutable or rarely changing tensors into better memory layouts. Here's an example for CPU compute: https://stackoverflow.com/a/75567894/126995; the idea is equally good on GPUs.
It’s possible to use custom data formats, and they don’t require any hardware support. Upcasting BF16 to FP32 is 1 shader instruction (a left shift), and downcasting FP32 to BF16 only takes a few of them (for proper rounding); no hardware support necessary. You can pack quantized or sparse tensors into a single ByteAddressBuffer; again, nothing special is required from the hardware. You can implement custom compression formats for these tensors.
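For illustration, the BF16 conversions look roughly like this in HLSL (a sketch: rounding is round-to-nearest-even, and NaN is not special-cased):

    // One BF16 value stored in the low 16 bits of a uint.
    // Upcast: BF16 is the top 16 bits of an FP32, so a shift is enough.
    float bf16ToFloat(uint bf16)
    {
        return asfloat(bf16 << 16);
    }

    // Downcast with round-to-nearest-even: add a bias derived from the
    // lowest surviving mantissa bit, then truncate the mantissa.
    uint floatToBf16(float value)
    {
        uint bits = asuint(value);
        uint rounding = 0x7FFF + ((bits >> 16) & 1);
        return (bits + rounding) >> 16;
    }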
The bar for the AI ecosystem to get models running is git clone + <20 lines of Python.
If AMD can’t make that work on consumer GPUs, then the only option they have is to undercut Nvidia so deeply that it makes sense for the big players to hire a team to write AMD’s software for them.
> undercut Nvidia so deeply that it makes sense for the big players to hire a team
At least for some use cases, I think that has already happened. nVidia forbids using GeForce GPUs in data centers (in the EULA of the drivers); AMD allows it. The cost efficiency difference between AMD's high-end desktop GPUs and the nVidia GPUs which nVidia allows to be deployed in data centers is about an order of magnitude now. For example, an L40 card costs $8000-9000 and delivers performance similar to a $1000 AMD RX 7900 XTX.
For this reason, companies which run large models at scale are spending ridiculous amounts of money on compute, often by renting nVidia’s data-center-targeted GPUs from IaaS providers. OpenAI’s CEO once described the compute costs of running ChatGPT as “eye-watering”.
For companies like OpenAI, I think investing money in development of vendor-agnostic ML libraries makes a lot of sense in terms of ROI.
But you can't run most models on the consumer AMD GPUs, so even though AMD "allows" it, nobody except supercomputer clusters uses AMD GPUs for compute, because all the expensive data scientists you hired will bitch and moan until you get them something they can just run standard CUDA stuff on.
Different people estimate compute costs of ChatGPT to be between $100k and $700k per day. Compared to these numbers, data scientists aren't that expensive.
> just run standard CUDA stuff
I doubt data scientists have the skills to write CUDA, or any other low-level GPGPU code. It's relatively hard to do, and it takes years of software development experience to become proficient.
Pretty sure most of these people are only capable of using higher-level libraries like TensorFlow and PyTorch. For this reason, I don't think these data scientists need or care about standard CUDA stuff; they only need another backend for these Python libraries, which is a much easier problem to solve.
And one more thing. It could be that most of ChatGPT's costs are unrelated to the data scientists, and are instead caused by end users running inference. In that case, the data scientists won't even notice, because they will continue using these CUDA GPUs to develop new versions of their models.
The more relevant question is: which GPU is the OP using? The only officially ROCm-supported GPU available for retail purchase is the RDNA2-based Radeon Pro W6800 [1].
In practice it probably means that gfx1030 (Navi 21) GPUs should work (RX 6800-RX 6950), but again, it also means those cards (and every other card that AMD currently sells to individuals) are "unsupported."
Are you running the ROCm jobs on the same GPU as the system GUI? I use ROCm built from source on Debian with reasonable success, but I do remember GNOME crashing pretty reliably when trying to run compute tests on my laptop.
How are the Windows drivers for AMD? The OS shouldn't matter all that much if its primary role is to host or train models. As long as your code can run under the OS in question, it's fine.