Evaluating AMD GPUs by their specs is not going to paint the full picture. Their drivers are a serious problem. I've managed to get ROCm mostly working on my system (ignoring all the notices about what is officially supported; the jammy debs from the official repo seem to work on Debian testing). The range of supported setups is so limited that it is quite easy to end up in a similarly unsupported situation.
I expect system lockups when doing any sort of model inference. From my experience over the last few years I assume it is driver bugs. Based on their rate of improvement they will probably get there around 2025, but their past performance has been so bad that I wouldn't recommend buying a card for machine learning until they've proven they're taking the situation seriously.
That said, in my opinion, buy AMD anyway if you need a GPU on Linux. Their open source drivers are a lot less hassle as long as you don't need BLAS.
In the data center, I think AMD is a lot more viable than most people think. MosaicML recently did a test and were able to swap MI250s with A100s basically seamlessly, within a single training run even, and ran into no issues: https://www.mosaicml.com/blog/amd-mi250
I think where most people have been getting into trouble is with trying to run on unsupported cards (e.g., *ALL* of AMD's consumer cards), or wanting to run on Windows. This is obviously a huge fail on AMD's part, since anyone who's tried to do anything with any of those consumer cards will just assume the data center cards are the same, but they're quite different. It doesn't help that I've never seen any CDNA2 card on sale/available in retail. How does AMD ever expect to get any adoption when no developers have hardware they can write code for? It's completely mental.
I got really excited until you said all of their consumer cards are out. That's even more infuriating - people have mammoth computing devices lying around and they can't make full use of them, because of drivers.
Not that drivers are simple to make, but still. It's like owning a Ferrari that works perfectly, but you can only drive north.
You can use the WebGPU backend in Tinygrad. It's working well in my test with an Nvidia 960 running inference (3D UNet). I don't know how well WebGPU is supported on AMD GPUs.
ROCm is not the only option; compute shaders are very reliable on all GPUs. And thanks to Valve's work on DXVK 2.0, modern Linux runs Windows D3D11 software just fine.
I dunno, are they? AMD should pay someone to put up some "how to multiply a 2x2 matrix on our GPU for the average programmer!" tutorials somewhere obvious. I saw a lot of GPU lockups before I gave up on trying and decided that it wasn't worth it. Maybe compute shaders were a thing I should have tried. To be honest, I don't know much about them, because my attempts in the space were shut down pretty hard by driver bugs linked to OpenCL and ROCm.
I thought it was just me for a while, but after watching George Hotz's famous meltdown trying to program on an AMD GPU I do wonder if they're underestimating the power of a few good public "how to use the damn thing" sessions. They've been pushing ROCm which would probably be great if it worked reliably.
CUDA has been the default tech for GPGPU in HPC and AI applications for more than a decade now. By now, people have found most of those driver bugs, and nVidia has fixed them.
Similarly, compute shaders are the only GPGPU tech used in videogames. Modern videogames have been using compute shaders for a decade now, in increasing volumes. For example, UE5 even renders triangle meshes with them [1].
However, OpenCL and ROCm are niche technologies. I've been hearing complaints about their driver quality for some time now. For obvious reasons, AMD and Intel prioritize driver bugs which affect modern videogames sold in many millions of copies over bugs which only affect a few people working on HPC, AI, or other niche GPGPU applications.
> they're underestimating the power of a few good public "how to use the damn thing" sessions
I agree the learning curve is steep, given the lack of good materials. For an introductory article, see [2]. Ignore the parts about D3D10 hardware; the article is old and D3D10 hardware is no longer relevant. Another one, with slightly more depth, is [3]. For an example of how to multiply large dense matrices with a compute shader, see [4], but that example is rather advanced because of the optimizations and because of the weird memory layout conventions inherited from the upstream project.
> For obvious reasons, AMD and Intel prioritize driver bugs which affect modern videogames sold in many millions of copies over bugs which only affect a few people working on HPC, AI, or other niche GPGPU applications.
If people could develop AI stuff on consumer cards, they'd then buy a ton of server-grade cards, or rent them via the usual hyperscalers or dedicated platforms, for the actual work.
This entire multi-million dollar (per training session) market is firmly in the hands of nVIDIA at the moment, and unless AMD seriously improves their offering, that won't change. nVIDIA got to where they are because they focused on getting developers up to speed fast and cheaply, and so the developers asked their employers to buy them the stuff they were already used to.
> If people could develop AI stuff on consumer cards
Technically people can already do that, by leveraging compute shaders in D3D, Vulkan, or WebGPU. I'm certain it's possible to implement a D3D or Vulkan backend for PyTorch, TensorFlow, and similar ML libraries. When I experimented with AI, I did similar stuff myself with C++, C#, and HLSL, and I found it wasn't terribly hard to do.
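For flavor, a backend for those libraries mostly boils down to shaders of this kind. Here's a minimal sketch in HLSL of an elementwise alpha * x + y kernel; the buffer names, register slots, and group size are just assumptions for the example, not code from any particular library:

    // result[i] = alpha * x[i] + y[i], one thread per element.
    cbuffer Params : register(b0)
    {
        uint elementCount;
        float alpha;
    };

    StructuredBuffer<float> x : register(t0);
    StructuredBuffer<float> y : register(t1);
    RWStructuredBuffer<float> result : register(u0);

    [numthreads(256, 1, 1)]
    void main(uint3 id : SV_DispatchThreadID)
    {
        if (id.x < elementCount)
            result[id.x] = alpha * x[id.x] + y[id.x];
    }

The host side just binds the three buffers and dispatches ceil(elementCount / 256) thread groups; that part is where most of the API boilerplate lives.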
However, PyTorch is made by Facebook and TensorFlow by Google. It seems most companies who maintain such libraries are only interested in cloud computing. Some of them, like Qualcomm, only care about their own proprietary hardware. None of them seems to care about desktops.
My theory is that AMD initially wanted to compete with nVidia in compute and was going to improve ROCm, but then they saw how much trouble nVidia had artificially keeping enterprise users from just buying consumer GPUs instead of the much more profitable enterprise GPUs. So they trashed that idea, to keep consumer GPUs from interfering with their very profitable enterprise GPU business for compute applications.
ROCm was designed and implemented for HPC. There's no cunning scheme to stop it working on gaming cards; there just isn't (wasn't?) much investment in making it work either.
I'm suggesting they gave up on investing in making it work when they realized good compute on their consumer cards would cannibalize their enterprise cards, like what nVidia is experiencing.
It disappoints me that DirectX remains one of the best GPU-compute solutions in practice right now. And Vulkan too I guess.
But it really is. That's the state of the market. The video game developers are the GPU programmers; they've hit DirectX11, DX12, and Vulkan with a wide variety of video games and have made that ecosystem very stable.
-------------
DX11 has 32-bit-only atomics, so I don't think it's a very serious solution in practice. Even 64-bit atomics (especially 64-bit CAS) are very limiting compared to the CPU world, where 128-bit CAS is needed to fix the obscure ABA problem.
DX and Vulkan just have... so much API-crap you need to wade through to even get Hello World / SAXPY up.
C++ AMP was wonderful back in 2014, but it too is stuck on DirectX11 and therefore in the 32-bit atomic world. And it hasn't had an update since then. Microsoft really should have kept investing in C++ AMP, IMO.
-------------
ROCm is fine if you get the hardware and if it remains supported. But I think in practice, people expect support longer than what AMD is willing to give.
> 32-bit-only atomics, so I don't think it's a very serious solution in practice
Yeah, I think I encountered that while porting a hash map from CUDA to HLSL.
However, I'm not sure that's necessarily a huge deal. It's probably not an issue for machine learning or BLAS stuff; these use cases don't need fine-grained thread synchronization.
For applications which would benefit from such synchronization, traditional lock-free techniques ported from the CPU (i.e. compare-and-swap atomics on global memory) can be slow due to the huge count of active threads on GPUs. I mean, it's obviously better than locks, but sometimes it's possible to do something better than CAS.
> so much API-crap you need to wade through to even get Hello World / SAXPY up.
I agree for D3D12 and especially Vulkan, but IMO D3D11 is not terribly bad. It has a steep learning curve, but the amount of boilerplate for simple apps is reasonable. Especially for ML or similar GPGPU stuff, which only needs a small subset of the API: compute shaders only, no textures, no render targets, no depth-stencil views, no input layouts, etc.
However, unlike simple apps, real-life ones often need a profiler and a queue depth limiter, which are relatively hard to implement on top of the underlying D3D queries. I think Microsoft should ship both in the Windows SDK.
Lock-free techniques are not "obviously better than locks".
Lock-free techniques offer a different trade-off than locks. They are a little faster than locks in the typical case, but this advantage is paid for by being much slower in the worst case (because they may need a very large number of retries to succeed). When a great number of threads contend for access, the worst case can become very frequent.
The best application for lock-free techniques is in read-only access to shared data. In this case they are almost always the best solution.
On the other hand, for write access to shared data, which is better, optimistic access control with lock-free techniques or deterministic serialization of accesses with locks, depends on the application; it cannot be said in general that one method or the other is preferable.
> Lock-free techniques are not "obviously better than locks".
On GPUs, they are. GPUs don't have any locks; locks can be emulated on top of these global memory atomics, but because the count of active threads is often in the thousands, the performance of that approach is much worse than lock-free techniques.
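To make the comparison concrete, here is a minimal sketch of the lock-free side in HLSL: an atomic float max implemented as a compare-and-swap retry loop on a raw buffer. The buffer name, byte offset parameter, and the [allow_uav_condition] hint are just how I'd write it for illustration, not code from anywhere in particular:

    RWByteAddressBuffer counters : register(u0);

    // Lock-free float max at byte offset `addr`, as a CAS retry loop.
    void atomicMaxFloat(uint addr, float value)
    {
        uint expected = counters.Load(addr);
        [allow_uav_condition]
        while (true)
        {
            if (asfloat(expected) >= value)
                return;                     // someone already stored a larger value
            uint original;
            counters.InterlockedCompareExchange(addr, expected, asuint(value), original);
            if (original == expected)
                return;                     // our CAS won
            expected = original;            // lost the race, retry with the fresh value
        }
    }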
> For applications which would benefit from such synchronization, traditional lock-free techniques ported from the CPU (i.e. compare-and-swap atomics on global memory) can be slow due to the huge count of active threads on GPUs. I mean, it's obviously better than locks, but sometimes it's possible to do something better than CAS.
I don't think traditional CAS can be optimized, but a fair number of atomic operations can be coalesced into a prefix sum. So, with regard to your latest post:
> On GPUs, they are. GPUs don't have any locks; locks can be emulated on top of these global memory atomics, but because the count of active threads is often in the thousands, the performance of that approach is much worse than lock-free techniques.
Those "thousands of atomic" operations can become coalessed 32-at-a-time (prefix-sum) and turned into just "dozens of atomics" in practice. Automatically mind you, by the compiler.
Don't discount the brute-force code because it's simple. Don't assume ~1000+ atomic operations will actually be physically executed as 1000+ atomics. The compiler can "fix" a lot of this code in practice.
Not always, but the compiler can fix it often enough that it's beneficial to write the simple brute-force "thousands-of-atomics" code and check the compiled output.
Probably never with "CAS", but its pretty often that "atomic_add" written in a brute force manner (tracking a parallel counter) will compile into a prefix-sum + 1x atomic from one lane, rather than execute as 32x atomics. And even if it is executed as 32x atomics, there are atomic-accelerators on the GPUs that may make the operation faster than you might think. You know, as long as it isn't a compare-and-swap loop.
Sure, compute shaders might work, but don’t you need rocBLAS, rocSPARSE, MIOpen, etc? Are people reinventing those in compute shaders in another package?
These things are nice to have, but you don’t actually need them.
It only takes 1-2 pages of HLSL to implement efficient matrix multiplication. It's not rocket science; the area is well researched and there are many good articles on how to implement these BLAS routines efficiently.
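To give a rough idea, here's a simplified, untuned sketch of such a shader: a tiled matrix multiply using groupshared memory, assuming square row-major matrices whose size is a multiple of the 16x16 tile. This is not the optimized code from [4]:

    // C = A * B, row-major square matrices, N a multiple of TILE.
    // Dispatched as (N / TILE, N / TILE, 1) thread groups.
    #define TILE 16

    cbuffer Params : register(b0) { uint N; };

    StructuredBuffer<float>   A : register(t0);
    StructuredBuffer<float>   B : register(t1);
    RWStructuredBuffer<float> C : register(u0);

    groupshared float tileA[TILE][TILE];
    groupshared float tileB[TILE][TILE];

    [numthreads(TILE, TILE, 1)]
    void main(uint3 tid : SV_GroupThreadID, uint3 dtid : SV_DispatchThreadID)
    {
        float acc = 0.0;
        for (uint k = 0; k < N; k += TILE)
        {
            // Stage one tile of A and one tile of B in groupshared memory.
            tileA[tid.y][tid.x] = A[dtid.y * N + (k + tid.x)];
            tileB[tid.y][tid.x] = B[(k + tid.y) * N + dtid.x];
            GroupMemoryBarrierWithGroupSync();

            for (uint i = 0; i < TILE; i++)
                acc += tileA[tid.y][i] * tileB[i][tid.x];
            GroupMemoryBarrierWithGroupSync();
        }
        C[dtid.y * N + dtid.x] = acc;
    }

The production versions mostly differ in tile sizes, register blocking, and handling of non-multiple dimensions, not in the basic structure.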
Moreover, manually written compute shaders enable stuff missing from these off-the-shelf higher level libraries.
It’s easy to merge multiple compute operations into a single shader. When possible, this sometimes saves gigabytes of memory bandwidth (and therefore time) that these high-level libraries spend writing/reading temporary tensors.
It’s possible to re-shape immutable or rarely changing tensors into better memory layouts. Here's an example for CPU compute: https://stackoverflow.com/a/75567894/126995; the idea is equally good on GPUs.
It’s possible to use custom data formats, and they don’t require any hardware support. Upcasting BF16 to FP32 is 1 shader instruction (a left shift), and downcasting FP32 to BF16 only takes a few of them (for proper rounding); no hardware support necessary. You can pack quantized or sparse tensors into a single ByteAddressBuffer; again, nothing special is required from the hardware. You can implement custom compression formats for these tensors.
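For illustration, the BF16 conversions look roughly like this in HLSL (a sketch: rounding is round-to-nearest-even, and NaN is not special-cased):

    // One BF16 value stored in the low 16 bits of a uint.
    // Upcast: BF16 is the top 16 bits of an FP32, so a shift is enough.
    float bf16ToFloat(uint bf16)
    {
        return asfloat(bf16 << 16);
    }

    // Downcast with round-to-nearest-even: add a bias derived from the
    // lowest surviving mantissa bit, then truncate the mantissa.
    uint floatToBf16(float value)
    {
        uint bits = asuint(value);
        uint rounding = 0x7FFF + ((bits >> 16) & 1);
        return (bits + rounding) >> 16;
    }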
The bar for the AI ecosystem to get models running is git clone + <20 lines of Python.
If AMD can’t make that work on consumer GPUs, then the only option they have is to undercut Nvidia so deeply that it makes sense for the big players to hire a team to write AMD’s software for them.
> undercut Nvidia so deeply that it makes sense for the big players to hire a team
At least for some use cases, I think that has already happened. nVidia forbids using GeForce GPUs in data centers (in the EULA of the drivers); AMD allows it. The cost efficiency difference between AMD's high-end desktop GPUs and the nVidia GPUs which nVidia allows to be deployed in data centers is about an order of magnitude now. For example, an L40 card costs $8000-9000 and delivers performance similar to a $1000 AMD RX 7900 XTX.
For this reason, companies which run large models at scale are spending ridiculous amounts of money on compute, often by renting nVidia’s data-center-targeted GPUs from IaaS providers. OpenAI’s CEO once described the compute costs of running ChatGPT as “eye-watering”.
For companies like OpenAI, I think investing money in development of vendor-agnostic ML libraries makes a lot of sense in terms of ROI.
But you can't run most models on the consumer AMD GPUs, so even though AMD "allows" it, nobody except supercomputer clusters uses AMD GPUs for compute, because all the expensive data scientists you hired will bitch and moan until you get them something they can just run standard CUDA stuff on.
Different people estimate compute costs of ChatGPT to be between $100k and $700k per day. Compared to these numbers, data scientists aren't that expensive.
> just run standard CUDA stuff
I doubt data scientists have the skills to write CUDA, or any other low-level GPGPU code. It's relatively hard to do, and it takes years of software development experience to become proficient.
Pretty sure most of these people are only capable of using higher-level libraries like TensorFlow and PyTorch. For this reason, I don't think these data scientists need or care about standard CUDA stuff; they only need another backend for these Python libraries, which is a much easier problem to solve.
And one more thing. It could be that most of ChatGPT's costs are unrelated to the data scientists, and are instead caused by end users running inference. In that case, the data scientists won't even notice, because they will continue using these CUDA GPUs to develop new versions of their models.
The more relevant question is: which GPU is the OP using? The only officially ROCm-supported GPU available for retail purchase is the RDNA2-based Radeon Pro W6800 [1].
In practice it probably means that gfx1030 (Navi 21) GPUs should work (RX 6800-RX 6950), but again, it also means those cards (and every other card that AMD currently sells to individuals) are "unsupported."
Are you running the ROCm jobs on the same GPU as the system GUI? I use ROCm built from source on Debian with reasonable success, but I do remember GNOME crashing pretty reliably when trying to run compute tests on my laptop.
How are the Windows drivers for AMD? The OS shouldn't matter all that much if its primary role is to host or train models. As long as your code can run under the OS in question, it's fine.