Which GPU(s) to Get for Deep Learning (timdettmers.com)
223 points by snow_mac on July 26, 2023 | 128 comments


Evaluating AMD GPUs by their specs is not going to paint the full picture. Their drivers are a serious problem. I've managed to get ROCm mostly working on my system, ignoring all the notices about what is officially supported (the jammy debs from the official repo seem to work on Debian testing). The range of supported setups is limited, so it is quite easy to end up in a similarly unsupported situation.

I expect system lockups when doing any sort of model inference. From my experience over the last few years, I assume it is driver bugs. Based on their rate of improvement, they will probably get there around 2025, but their past performance has been so bad that I wouldn't recommend buying a card for machine learning until they've proven they're taking the situation seriously.

That said, in my opinion, buy AMD anyway if you need a GPU on Linux. Their open source drivers are a lot less hassle as long as you don't need BLAS.


In the data center, I think AMD is a lot more viable than most people think. MosaicML recently did a test and were able to swap MI250s with A100s basically seamlessly, within a single training run even, and ran into no issues: https://www.mosaicml.com/blog/amd-mi250

If you have an officially supported card https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h... and are using PyTorch, then you're pretty much good to go. Also, HIPify works pretty well these days.

I think where most people have been getting into trouble is in trying to run with unsupported cards (e.g., *ALL* of AMD's consumer cards), or wanting to run on Windows. This is obviously a huge fail on AMD's part, since anyone who's tried to do anything with any of those consumer cards will just assume the data center cards are the same, but they're quite different. It doesn't help that I've never seen any CDNA2 card on sale or available in retail. How does AMD ever expect to get any adoption when no developers have hardware they can write code against? It's completely mental.


I got really excited until you said all of their consumer cards are out. That's even more infuriating: people have mammoth computing devices lying around and can't make full use of them, because of drivers.

Not that drivers are simple to make, but still. It's like owning a Ferrari that works perfectly, but you can only drive north.


I think tinygrad is working on AMD and Snapdragon support.


You can use the WebGPU backend in tinygrad. It's working well in my tests with an Nvidia 960 running inference (UNet 3D). I don't know how well WebGPU is supported on AMD GPUs.


ROCm is not the only option; compute shaders are very reliable on all GPUs. And thanks to Valve's work on DXVK 2.0, modern Linux runs Windows D3D11 software just fine.

Here's an example: https://github.com/Const-me/Whisper/issues/42 BTW, there's a lot of BLAS in the compute shaders of that software.


I dunno, are they? AMD should pay someone to put up some "how to multiply a 2x2 matrix on our GPU, for the average programmer" tutorials somewhere obvious. I saw a lot of GPU lockups before I gave up on trying and decided that it wasn't worth it. Maybe compute shaders were a thing I should have tried. To be honest, I don't know much about them, because my attempts in the space were shut down pretty hard by driver bugs linked to OpenCL and ROCm.

I thought it was just me for a while, but after watching George Hotz's famous meltdown trying to program on an AMD GPU I do wonder if they're underestimating the power of a few good public "how to use the damn thing" sessions. They've been pushing ROCm which would probably be great if it worked reliably.


> driver bugs linked to OpenCL and ROCm

CUDA has been the default tech for GPGPU in HPC and AI applications for more than a decade now. By now, people have found most of these driver bugs, and nVidia has fixed them.

Similarly, compute shaders are the only GPGPU tech used in videogames. Modern videogames have been using compute shaders for a decade now, in increasing volumes. For example, UE5 even renders triangle meshes with them [1].

However, OpenCL and ROCm are niche technologies. I've been hearing complaints about their driver quality for some time now. For obvious reasons, AMD and Intel prioritize driver bugs which affect modern videogames sold in many millions of copies over bugs which only affect a few people working on HPC, AI, or other niche GPGPU applications.

> they're underestimating the power of a few good public "how to use the damn thing" sessions

I agree the learning curve is steep, given the lack of good materials. For an introductory article, see [2]; ignore the parts about D3D10 hardware, as the article is old and D3D10 hardware is no longer relevant. Another one, with slightly more depth, is [3]. For an example of how to multiply large dense matrices with a compute shader, see [4], but that example is rather advanced because of its optimizations, and because of weird memory layout conventions inherited from the upstream project.

[1] https://www.youtube.com/watch?v=TMorJX3Nj6U

[2] https://developer.download.nvidia.com/compute/DevZone/docs/h...

[3] https://github.com/jstoecker/dxcompute-docs/tree/main

[4] https://github.com/Const-me/Whisper/blob/master/ComputeShade...


> For obvious reasons, AMD and Intel prioritize driver bugs which affect modern videogames sold in many millions of copies over bugs which only affect a few people working on HPC, AI, or other niche GPGPU applications.

If people could develop AI stuff on consumer cards, they'd buy a ton of server-grade cards, or rent them via the usual hyperscalers or dedicated platforms for the actual work.

This entire, multi-million dollar (per training session) market is all firmly in the hands of nVIDIA at the moment, and unless AMD seriously improves their offering that won't change. nVIDIA got to the point they are because they focused on getting developers up to speed very fast and cheap, and so the developers asked their employers to get them the stuff they were already used to.


> If people could develop AI stuff on consumer cards

Technically people can already do that, by leveraging compute shaders in D3D, Vulkan, or WebGPU. I’m certain it’s possible to implement D3D or Vulkan backend for PyTorch, TensorFlow, and similar ML libraries. When I experimented with AI, I did similar stuff myself with C++, C#, and HLSL, and I found it wasn’t terribly hard to do.

However, PyTorch is made by Facebook, and TensorFlow by Google. It seems most companies who maintain such libraries are only interested in cloud computing. Some of them, like Qualcomm, only care about their own proprietary hardware. None of them seems to care about desktops.


My theory is that AMD initially wanted to compete with nVidia on compute and was going to improve ROCm, but then they saw how much trouble nVidia had artificially keeping enterprise users from just buying consumer GPUs instead of the much more profitable enterprise GPUs. So they trashed that idea, to keep consumer GPUs from interfering with their very profitable enterprise GPU business for compute applications.


ROCm was designed and implemented for HPC. There's no cunning scheme to stop it working on gaming cards; there just isn't (wasn't?) much investment in making it work, either.


I'm suggesting they gave up on investing in making it work when they realized that good compute on their consumer cards would cannibalize their enterprise cards, like what nVidia is experiencing.


It disappoints me that DirectX remains one of the best GPU-compute solutions in practice right now. And Vulkan too I guess.

But it really is. That's the state of the market. The video game artists are GPU-programmers, they've hit DirectX11, DX12, and Vulkan with a wide variety of video games and have turned that ecosystem very stable.

-------------

DX11 has 32-bit-only atomics; I don't think it's a very serious solution in practice. Even 64-bit atomics (especially 64-bit CAS) are already very limiting compared to the CPU world, where 128-bit CAS is needed to fix the obscure ABA problem.
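The ABA problem is easy to demonstrate. Here's a CPU-side Python sketch (the `AtomicCell` class and all names are invented for illustration; a `Lock` stands in for the hardware atomic) showing why a bare CAS can't detect an A -> B -> A change, while pairing the value with a version counter, which is what double-width CAS enables on CPUs, catches it:

```python
import threading

class AtomicCell:
    """Toy atomic cell; the lock stands in for a hardware CAS instruction."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def cas(self, expected, new):
        """Atomically set to `new` iff the current value equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

# Plain CAS: a stale thread can't tell "A -> B -> A" from "unchanged".
cell = AtomicCell("A")
snapshot = cell.load()          # a thread observes "A", then gets preempted
cell.cas("A", "B")              # meanwhile: A -> B
cell.cas("B", "A")              # ...and back: B -> A
print(cell.cas(snapshot, "C"))  # True: the stale CAS succeeds (the ABA bug)

# Versioned CAS: pair the value with a counter that only ever increases.
# Updating the pair atomically is why CPU code wants double-width (128-bit) CAS.
vcell = AtomicCell(("A", 0))
vsnap = vcell.load()                  # ("A", 0)
vcell.cas(("A", 0), ("B", 1))         # A -> B, version bumped
vcell.cas(("B", 1), ("A", 2))         # B -> A, version bumped again
print(vcell.cas(vsnap, ("C", 3)))     # False: the stale snapshot is rejected
```

With only 64-bit atomics available, a pointer already consumes the full width, leaving no room for the version counter.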

DX and Vulkan just have... so much API-crap you need to even get Hello World / SAXPY up.

C++ AMP was wonderful back in 2014, but it too is stuck in DirectX11, and therefore the 32-bit atomic world. And it hasn't had an update since then. Microsoft really should have kept investing in C++ AMP, IMO.

-------------

ROCm is fine if you get the hardware and if it remains supported. But I think in practice, people expect support longer than what AMD is willing to give.


> 32-bit-only atomics, I don't think it's a very serious solution in practice

Yeah, I think I encountered that while porting a hash map from CUDA to HLSL.

However, I'm not sure that's necessarily a huge deal. It's probably not an issue for machine learning or BLAS stuff; these use cases don't need fine-grained thread synchronization.

For applications which would benefit from such synchronization, traditional lock-free techniques ported from the CPU (i.e. compare-and-swap atomics on global memory) can be slow due to the huge count of active threads on GPUs. I mean, it's obviously better than locks, but sometimes it's possible to do something better than CAS.

> so much API-crap you need to even get Hello World / SAXPY up.

I agree for D3D12 and especially Vulkan, but IMO D3D11 is not terribly bad. It has a steep learning curve, but the amount of boilerplate for simple apps is IMO reasonable. Especially for ML or similar GPGPU stuff which only needs a small subset of the API: compute shaders only, no textures, no render targets, no depth-stencil views, no input layouts, etc.

However, unlike simple apps, real-life ones often need a profiler and a queue depth limiter, which are relatively hard to implement on top of the raw API queries. I think Microsoft should ship both in the Windows SDK.


Lock-free techniques are not "obviously better than locks".

Lock-free techniques offer a different trade-off than locks. Lock-free techniques are a little faster in the typical case than locks, but this advantage is paid for by being much slower in the worst case (because they may need a very large number of retries to succeed). In cases where a great number of threads contend for access, the worst case can be very frequent.

The best application for lock-free techniques is in read-only access to shared data. In this case they are almost always the best solution.

On the other hand, for write access to shared data, which one is better, between optimistic access control with lock-free techniques and deterministic serialization of the accesses with locks, depends on the application and it cannot be said in general that one method or the other is preferable.


> Lock-free techniques are not "obviously better than locks".

On GPUs, they are. GPUs don't have any locks, but locks can be emulated on top of these global memory atomics. Because the count of active threads is often in the thousands, the performance of that approach is much worse than with lock-free techniques.


I missed this earlier.

> For applications which would benefit from such synchronization, traditional lock-free techniques ported from the CPU (i.e. compare-and-swap atomics on global memory) can be slow due to the huge count of active threads on GPUs. I mean, it's obviously better than locks, but sometimes it's possible to do something better than CAS.

I don't think traditional CAS can be optimized. But a fair number of atomic operations seem to get coalesced into a prefix sum. So... with regards to your latest post:

> On GPUs, they are. GPUs don't have any locks, but locks can be emulated on top of these global memory atomics. Because the count of active threads is often in the thousands, the performance of that approach is much worse than with lock-free techniques.

Those "thousands of atomic operations" can become coalesced 32 at a time (via a prefix sum) and turned into just dozens of atomics in practice. Automatically, mind you, by the compiler.

Don't discount the brute-force code because it's simple. Don't assume ~1000+ atomic operations will actually be physically executed as 1000+ atomics. The compiler can "fix" a lot of this code in practice.

Not always, but the compiler can fix it often enough that it's beneficial to write the simple brute-force "thousands-of-atomics" code and check the compiled output.

Probably never with CAS, but it's pretty often that an atomic_add written in a brute-force manner (tracking a parallel counter) will compile into a prefix sum + one atomic from one lane, rather than execute as 32x atomics. And even if it is executed as 32x atomics, there are atomic accelerators on GPUs that may make the operation faster than you might think. You know, as long as it isn't a compare-and-swap loop.
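That coalescing transformation can be sketched in plain Python (an illustrative model, not real GPU code; the wavefront size, function names, and list-based "counter" are all assumptions for the sketch): 32 lanes each reserving a slot with atomic_add(counter, 1) is equivalent to one atomic_add of the active-lane count plus an exclusive prefix sum over the wavefront.

```python
# Model of wavefront-level atomic coalescing. Instead of each of the 32 lanes
# issuing atomic_add(counter, 1), the compiler can compute an exclusive prefix
# sum over the active-lane mask and issue ONE atomic for the whole wavefront.

def naive_reserve(counter, active_lanes):
    """One global atomic RMW per active lane (up to 32 atomics per wavefront)."""
    slots = {}
    for lane in active_lanes:
        slots[lane] = counter[0]   # each iteration is one atomic_add on real HW
        counter[0] += 1
    return slots

def coalesced_reserve(counter, active_lanes):
    """Exclusive prefix sum across the wavefront + a single atomic_add."""
    base = counter[0]
    counter[0] += len(active_lanes)       # the one atomic, issued by one lane
    slots = {}
    for offset, lane in enumerate(sorted(active_lanes)):
        slots[lane] = base + offset       # base + exclusive prefix sum of mask
    return slots

counter_a, counter_b = [100], [100]
lanes = [0, 3, 7, 12, 31]                 # only some lanes are active
print(naive_reserve(counter_a, lanes) == coalesced_reserve(counter_b, lanes),
      counter_a == counter_b)             # True True
```

Both versions hand every lane a unique slot and leave the counter in the same state; the coalesced one just touches global memory once per wavefront instead of once per lane.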


I think I've more or less decided that DX12 is gonna be what I focus on, for my hobbywork. After evaluating all options. Vulkan is a close 2nd.

Gobs of boilerplate code is annoying, but I honestly can live with it. The tooling available for DirectX is really good.


Sure, compute shaders might work, but don’t you need rocBLAS, rocSPARSE, MIOpen, etc? Are people reinventing those in compute shaders in another package?


These things are nice to have, but you don’t actually need them.

It only takes 1-2 pages of HLSL to implement efficient matrix multiplication. It's not rocket science; the area is well researched and there are many good articles on how to implement these BLAS routines efficiently.
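As an illustration of the structure such a shader typically has (not the actual HLSL from any particular project), here is a minimal Python/NumPy sketch of tiled matrix multiplication, where each "threadgroup" computes one output tile while streaming input tiles through what would be groupshared memory; the tile size is an assumption:

```python
import numpy as np

TILE = 4  # threadgroup tile size; real shaders typically use 16 or 32

def tiled_matmul(a, b):
    """C = A @ B with the tiling structure a compute-shader matmul uses:
    each 'threadgroup' produces one TILE x TILE block of C, streaming
    TILE-wide strips of A and B through (what would be) groupshared memory."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % TILE == n % TILE == k % TILE == 0
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, TILE):              # threadgroup grid, rows
        for j in range(0, n, TILE):          # threadgroup grid, columns
            acc = np.zeros((TILE, TILE), dtype=a.dtype)  # per-group accumulator
            for p in range(0, k, TILE):      # march over shared-memory tiles
                acc += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
            c[i:i+TILE, j:j+TILE] = acc
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8)).astype(np.float32)
b = rng.standard_normal((8, 8)).astype(np.float32)
print(np.allclose(tiled_matmul(a, b), a @ b, atol=1e-4))  # True
```

In an actual shader the two inner tile loads go into groupshared memory once and are then reused by every thread in the group, which is where the memory-bandwidth savings come from.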

Moreover, manually written compute shaders enable stuff missing from these off-the-shelf higher level libraries.

It's easy to merge multiple compute operations into a single shader. When possible, this sometimes saves the gigabytes of memory bandwidth (and therefore time) that these high-level libraries spend writing and reading temporary tensors.

It's possible to reshape immutable or rarely changing tensors into better memory layouts. Here's an example for CPU compute: https://stackoverflow.com/a/75567894/126995; the idea is equally good on GPUs.

It's possible to use custom data formats, and they don't require any hardware support. Upcasting BF16 to FP32 is one shader instruction (a left shift), and downcasting FP32 to BF16 only takes a few of them (for proper rounding); no hardware support necessary. You can pack quantized or sparse tensors into a single ByteAddressBuffer; again, nothing special is required from the hardware. You can even implement custom compression formats for these tensors.
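The BF16 claim is easy to verify on a CPU. A minimal NumPy sketch (bit-twiddling only; the bias constant is the standard round-to-nearest-even trick, and NaN/Inf handling is omitted):

```python
import numpy as np

def fp32_to_bf16_bits(x):
    """FP32 -> BF16 with round-to-nearest-even, returned as uint16 bit patterns.
    The 0x7FFF bias plus the low bit of the kept mantissa implements RNE."""
    bits = x.astype(np.float32).view(np.uint32)
    rounded = bits + np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return (rounded >> np.uint32(16)).astype(np.uint16)

def bf16_bits_to_fp32(h):
    """BF16 -> FP32 upcast: a single 16-bit left shift, as claimed above."""
    return (h.astype(np.uint32) << np.uint32(16)).view(np.float32)

x = np.array([1.0, 3.14159, -0.001, 65504.0], dtype=np.float32)
roundtrip = bf16_bits_to_fp32(fp32_to_bf16_bits(x))
# BF16 keeps 7 mantissa bits, so relative round-trip error stays below 2^-7
print(bool(np.all(np.abs(roundtrip - x) <= np.abs(x) / 128)))  # True
```

BF16 shares FP32's 8-bit exponent, which is why truncating or shifting the mantissa is all that's needed.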


The bar for the AI ecosystem to get models running is git clone + <20 lines of Python.

If AMD can’t make that work on consumer GPUs, then the only option they have is to undercut Nvidia so deeply that it makes sense for the big players to hire a team to write AMD’s software for them.


> undercut Nvidia so deeply that it makes sense for the big players to hire a team

At least for some use cases, I think that has already happened. nVidia forbids using GeForce GPUs in data centers (in the EULA of the drivers); AMD allows it. The cost efficiency difference between AMD's high-end desktop GPUs and the nVidia GPUs which are allowed in data centers is about an order of magnitude now. For example, an L40 card costs $8000-9000 and delivers performance similar to the $1000 AMD RX 7900 XTX.

For this reason, companies which run large models at scale are spending ridiculous amounts of money on compute, often by renting nVidia's data-center-targeted GPUs from IaaS providers. OpenAI's CEO once described the compute costs of running ChatGPT as "eye-watering".

For companies like OpenAI, I think investing money in development of vendor-agnostic ML libraries makes a lot of sense in terms of ROI.


But you can't run most models on the consumer AMD GPUs, so even though AMD "allows" it, nobody except supercomputer clusters uses AMD GPUs for compute, because all the expensive data scientists you hired will bitch and moan until you get them something they can just run standard CUDA stuff on.


> expensive data scientists

Different people estimate compute costs of ChatGPT to be between $100k and $700k per day. Compared to these numbers, data scientists aren't that expensive.

> just run standard CUDA stuff

I doubt data scientists have the skills to write CUDA, or any other low-level GPGPU code. It's relatively hard to do, and takes years of software development experience to become proficient.

Pretty sure most of these people are only capable of using higher-level libraries like TensorFlow and PyTorch. For this reason, I don't think these data scientists need or care about standard CUDA stuff; they only need another backend for these Python libraries. Which is a much easier problem to solve.

And one more thing. It could be that most of ChatGPT costs are unrelated to data scientists, and caused by end users running inference. In that case, the data scientists won't even notice, because they will continue using these CUDA GPUs to develop new versions of their models.


So everyone that wants to use GPUs to accelerate their compute needs to write a BLAS implementation in shaders... FFS that's not reasonable at all!

Sure, it's entirely possible, but there's a reason that high level libraries are the go-to for scientific compute.


What do you mean by drivers? The kernel ones? AMDGPU and KFD run out of the box and without problems in my use case so far.

I'd say, though, that the whole ROCm runtime is in a bit of a weird situation.

But if you run anything 5.15-ish or later, you don't need proprietary drivers.


The more relevant question is which GPU is the OP using? The only officially ROCm supported GPU available for retail purchase is the RDNA2-based Radeon Pro W6800. [1]

In practice it probably means that gfx1030 (Navi 21) GPUs should work (RX 6800-RX 6950), but it also means those cards (and every other card that AMD currently sells to individuals) are "unsupported."

[1] https://rocm.docs.amd.com/en/latest/release/gpu_os_support.h...


Are you running the ROCm jobs on the same GPU as the system GUI? I use ROCm built from source on Debian with reasonable success, but I do remember GNOME crashing pretty reliably when trying to run compute tests on my laptop.


How are the Windows drivers for AMD? OS shouldn't matter all that much if its primary role is to host or train models. As long as your code can run under the OS in question it's fine.


I hope RustiCL will become a viable alternative there.


Just as an FYI/additional data point, I bought a 3090 FE from Ebay a few months ago for £605 including delivery.

I've only just started using it for Llama running locally on my computer at home and I have to say... colour me impressed.

It generates the output slightly faster than reading speed so for me it works perfectly well.

The 24GB of VRAM should keep it relevant for a bit too and I can always buy another and NVLink them should the need arise.


> The 24GB of VRAM should keep it relevant for a bit too

If anything, I think models are going to shrink a bit, because assumptions about small models reaching capacity during training don't seem fully accurate in practice [0]. We're already starting to see some effects, like Phi-1 [1] (a 1.3B code model outperforming 15B+ models) and BTLM-3B-8K [2] (a 3B model outperforming 7B models).

[0]: https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...

[1]: https://arxiv.org/pdf/2306.11644.pdf

[2]: https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a...


We had a long phase of "models aren't good enough but get better if we make them bigger, let's see how far we can go". This year we finally reached "some models are pretty great, let's see if we can do the same with smaller models". I'm excited for where this will take us.


Is there any way to compute the "capacity" of a model? In theory, if it's encoding all data with 100% efficiency, I guess the data stored in the model should be something like 2^parameters count (weights + biases) ?


There's a theoretical, but impractical, way: for a given model, each possible set of weight/bias values yields a specific loss value when run against the full corpus. There's at least one set of weight values which minimizes it, for which the idealized bits-per-byte entropy can be computed.

That can be compared to what OpenAI’s scaling law paper[0] calls the “entropy of natural language”, which they estimate at about 0.57 bits per byte, based on the differing power law for data vs. compute. In my mind, that highlights more the imprecision of the approach than the information-theoretic content of language semantics: an omniscient being would predict things better, so the closest thing to true entropy should be computed from the list of matching text prefixes among all texts ever.

[0]: https://arxiv.org/pdf/2001.08361
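For a concrete sense of the unit, converting a model's cross-entropy loss into bits per byte is simple arithmetic. A small sketch with hypothetical numbers (the 1.6 nats/token loss and the 4 bytes/token tokenizer average are invented for illustration, not taken from the paper):

```python
import math

def bits_per_byte(loss_nats_per_token, bytes_per_token):
    """Convert a language model's cross-entropy loss (nats per token, the usual
    training objective) into bits per byte of raw text, the unit used for
    'entropy of natural language' estimates such as the ~0.57 bpb figure."""
    bits_per_token = loss_nats_per_token / math.log(2)  # nats -> bits
    return bits_per_token / bytes_per_token

# Hypothetical numbers: a loss of 1.6 nats/token with a tokenizer that
# averages 4 bytes of text per token.
print(round(bits_per_byte(1.6, 4.0), 3))  # 0.577
```

The idealized minimum-loss model described above would plug its (unreachable) loss into the same formula.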


Thanks for the explanation!

> should be computed from the list of matching text prefixes among all texts ever

I initially thought that value is pretty low (possible things you can say), but it's probably infinite. Even though, in practice, we don't say too many different things and use a very limited subset of the words in the dictionary.


Anyone with experience running two linked consumer GPUs want to chime in on how well this works in practice?


You get a fast link between the GPUs, which should help when you’ve got a model split between them.

However, that split isn't automatic. You can't expect to run a 40GB model on that, unless perhaps it's been designed for it (the way llama.cpp can split a model between the GPU and CPU, for instance).

What you can do without trouble is keep more models loaded, do more things at the same time, and occasionally run the same model at double speed if it batches well.


CUDA multi-GPU with NVLink is pretty well tested with shared memory space. You still want to use NCCL to optimize the allocation, but many CUDA-aware libraries (and their subsequent ML tools) are capable.


This is incorrect if you are talking about 3090 or 3090ti using nvlink.


You mean those would work like a virtual single GPU with 48GB vram?


No. But pytorch will automatically make use of both GPUs and a NVlink bridge if you use its model parallel and distributed data parallel approaches.


I think you need enterprise-grade cards to make it work. If I remember correctly, consumer cards with NVLink can't share resources to host a 40GB model in VRAM.


I bought a used 3090 FE from eBay for $600 too! Mine is missing the connector latch, but seems to be firmly inserted so I think fire risk is negligible.

I went with the 3090 because I wanted the most VRAM for the buck, and the price of new GPUs is insane. Most GPUs in the $500-1500 range, even the Quadros and A series, don’t have anywhere near 24GB of VRAM.


> It generates the output slightly faster than reading speed

For a 33B model? It should be much faster.

What stack are you running? Llama.cpp and exLlama are SOTA as far as I know.


I used Tim's guide to build a dual RTX 3090 PC, paying 2300€ in total by getting used components. It can run inference of Llama-65B 4bit quantized at more than 10tok/s.

Specs: 2x RTX 3090, NVLink Bridge, 128GB DDR4 3200 RAM, Ryzen 7 3700X, X570 SLI mainboard, 2TB M.2 NVMe SSD, air cooled mesh case.

Finding the 3-slot nvlink bridge is hard and it's usually expensive. I think it's not worth it in most cases. I managed to find a cheap used one. Cooling is also a challenge. The cards are 2.7 slots wide and the spacing is usually 3 slots, so there isn't much room. Some people are putting 3d printed shrouds on the back of the PC case to suck the air out of the cards with an extra external fan. Also limiting the power from 350W to 280W or so per card doesn't cost a lot of performance. The CPU is not limiting the performance at all, as long as you have 4 cores per GPU you're good.


My build is close to this. I purchased everything new except the 3090s, and I paid about $3000.

2x RTX 3090

128 GB DDR5

Intel core i9 600 series

Z790 Mainboard

I used Intel instead of AMD for the cpu, which pushed my prices higher... but I saved on the back side by skipping the NVLink Bridge.

Good to know I'm not missing much without the bridge, since I get about 13 tok/s on Llama-65B 4-bit if I push all layers onto the GPUs.


Managed to snatch a 3090 during the GPU shortage in 2020. Did a lot of training and mining, and got some of my results published; I think I gained much more than the cost of the hardware. Kinda miss the days of ETH mining. The 3090 is still a good card and I'm pretty sure your rig is going to serve you well.

ps: ~280W power limit is a good call, it won't heat up your room too much.


I hear a lot about CUDA and how bad ROCm is, etc., and I've been trying to understand what exactly CUDA is doing that is so special. Isn't the maths for neural networks mostly multiplying large arrays/tensors together? What magic is CUDA doing that is so difficult for other vendors to implement? Is it just lock-in, the type of operations that are available, some kind of magical performance advantage, or something else?


1. Driver stability

2. Works on more consumer grade cards

3. Ecosystem advantage (lots of software developed against an existing and well supported ecosystem)

I have a laptop with a mobile 2060 and a desktop with a top-of-the-line consumer 7900XTX. As of yet, the 7900XTX isn't officially supported (and I haven't bothered to go down the obnoxious rabbit hole to figure out how to compute on it). Meanwhile, I can load up CUDA.jl on my laptop in mere minutes with absolutely no fuss.

Edit: if there are any GPU gurus out there who are capable of working on AMDGPU.jl to make it work on cards like the 7900XTX out of the box and writing documentation/tutorials for it... start a Patreon. I bet you could fund some significant effort getting that up and running!


As of today, there is zero consumer card support from AMD. It is an option only if you have a PRO card.

"Formal support for RDNA 3-based GPUs on Linux is planned to begin rolling out this fall, starting with the 48GB Radeon PRO W7900 and the 24GB Radeon RX 7900 XTX, with additional cards and expanded capabilities to be released over time." [0]

[0] https://community.amd.com/t5/rocm/new-rocm-5-6-release-bring...


Right, which SUCKS. Everyone who wants to prototype on their existing gear before jumping into a big pro card purchase is stuck with Nvidia, and the availability/performance of the software stack shows it.


Or even just have some hands on time to get familiar with the flow, to dick around, to build skills, to teach etc, but you CANNOT DO THAT with AMD


Exactly


I'm asking at a lower level than this. CUDA presumably has a list of functionality for GPGPU stuff like tensors, loading data, splitting up training, and building pipelines of networks/attention that can efficiently fit neural networks to many sorts of data.

Why is it so difficult for other manufacturers to provide a compatibility layer? If Apple can make DirectX 12 work on Apple Silicon, surely AMD should be able to make CUDA (which has to be much simpler than DX12) work on their graphics cards? Are there some fundamental architectural differences that stop this from working?


There's nothing conceptually hard, but it's really a lot of work. In addition to the items you listed, there are the actual compute kernels (or a compiler to generate them), then porting frameworks over (PyTorch etc.), and then doing the level of testing, documentation, and ongoing maintenance needed to make an alternative platform a reasonable idea for end users. The pitch for buying NVIDIA hardware is that existing tools, example code, and third party research will more or less work and perform well out of the box.

Edit: Going back to your original question, the main thing that makes CUDA so special is NVIDIA has already poured billions of dollars into all of this infrastructure and credibly will keep doing so.


There might be intellectual property concerns with "directly" implementing CUDA, and the architectures are (as I understand it) a bit different. That doesn't explain why they don't support something with similar broad compatibility though, as the actual card capabilities are very similar.


Sure, AMD could write a CUDA emulator (if it was legal) for AMD GPUs, but if it's one tenth the performance, whats the point?


A compiler. It would look a lot like HIP and run at about the same performance as a CUDA implementation would.


There's no real reason it would need to be 1/10 the performance though, depending on the kernel.


Nvidia's software is also pretty atrocious, in my opinion. The output of various tools is cryptic, updates regularly result in a totally broken system, and things often stop working for no discernible reason. Nvidia GPUs are always the most finnicky part of a system.

A modern Linux system should have uptime measured in years with minimal effort. A modern Linux system with Nvidia GPUs will have uptime of weeks with a lot of fuss.

(I'm no expert, just someone who's managed a number of PCs and a few servers.)


Right, but they can get away with that because they have essentially no competition.

With that said, Pop!OS does a really nice job of handling the Nvidia software stack - I've been running it on the laptop mentioned above for several years with no issues (though I don't leave my machines on 24/7).


Everyone else does the work to make sure it runs on cuDNN, because they bought the hardware when it was the only reasonable solution, and if it works on anything else that's just a happy accident. So you'll spend weeks of your incredibly expensive engineering or researcher time fighting compatibility issues because you saved $1k by going with an AMD card. Your researchers/engineers conclude it's the only reasonable solution for now and build on nVidia.

It’s classic first mover advantage (plus just a better product / more resourcing to make it a better product honestly). I think you have to be a really massive scale to make the cost per card worth the cost per engineer math work out, unless AMD significantly closes the compatibility gap. But AMD’s job here is to fill a leaky bucket, because new CUDA code is being written every day, and they don’t seem serious about it.


Yup. Filling the bucket could be worth hundreds of billions though, maybe even trillions; seems like a sensible punt.


> isn't the maths for neural networks mostly multiplying large arrays/tensors together?

Yes, it's multiplying and adding matrices. That and mapping some simple function over an array.

Neural networks are only that.


It's the inter-GPU communication. Scatter and gather have much worse performance on AMD GPUs.


You can tell how NVIDIA dominates the market by the fact that their price/performance "curve" is almost a straight line.

In a competitive market, that line has distortions where one player tries to undercut the other.

There are no bargains because there is almost no competitive pressure, and so there is barely any distortion in that line.


I suppose this is one of the reasons (besides AMD dropping the ball) they aren't even trying to be competitive in the gaming market: they can sell the same mm² of silicon area for much more to AI startups:

"There's a full blown run on GPU compute on a level I think people do not fully comprehend right now. Holy cow.

I've talked to a lot of vendors in the last 7 days. It's crazy out there y'all. NVIDIA allegedly has sold out its whole supply through the year. So at this point, everyone is just maximizing their LTVs and NVIDIA is choosing who gets what as it fulfills the order queue." [0]

[0] https://twitter.com/Suhail/status/1683642991490269185


Suhail lacks wisdom.

Stop obsessing about cloud GPUs.

Go buy retail GPUs.

Make them work. Adapt your software. Whatever, stop whining about cloud GPUs just switch to retail.

Solve the barriers/limitations. This has been the essence of computing for 60 years. Stop being intimidated by Nvidia. Get your job done with what is available.

Buy AMD, buy Intel. Work out how to make your GPU thing work on them. Stop wringing your hands about how there’s no cloud GPUs when there’s a ton of cheap retail GPUs.

INNOVATE. Look beyond nvidia. Stop whining.

If you’ve bet your entire business on cloud GPUs then you’re a fool. Bet on retail GPUs.


>Stop obsessing about cloud GPUs.

>Go buy retail GPUs.

If you're doing this professionally, you know what you need and what your budget is.

If you're doing this personally, to simply learn? I did the math for myself and figured I could probably buy enough gpu credits for the cost of a 4090 build that it would probably serve me all the way through getting a phd.


>Stop wringing your hands about how there’s no cloud GPUs when there’s a ton of cheap retail GPUs.

Then watch your R&D investment go up in flames after a driver update blocks such workloads.


There's something to this.

But also... don't update drivers and accept you are missing out on updates that might also be useful.


> Solve the barriers/limitations

I don't think this is possible when you can't pool memory on the 40 series retail GPUs.


Ok, so make an enormous capital investment in tech that will be outdated in two years - if not already outdated, as you mentioned AMD and Intel. Then, I'll need to hire geniuses who can extract juice out of this hardware at a scale that not even Google, Amazon, Microsoft and others could.

Or, I rent just as much top performance hardware as I need, scaling as I go along, and worry about execution and implementation of my niche application instead. You can see why cloud is winning right now.


Except as the previous post points out, you can’t get cloud GPUs.


You can't buy GPUs to build your own cloud. You can access cloud GPUs via the ones that NVidia is selling to.


No it doesn't, and yes you can.


So Nvidia is going to pretty much corner the market for a long time? This bit I expected but was still sad to read. Surely we would benefit from competition. It would probably take a lot of investment from AMD to make that happen, I imagine.

> AMD GPUs are great in terms of pure silicon: Great FP16 performance, great memory bandwidth. However, their lack of Tensor Cores or the equivalent makes their deep learning performance poor compared to NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive.

Edit: what about Intel arc GPU? Any hope there?


> AMD GPUs are great in terms of pure silicon

This has pretty much always been true. AMD cards always had more FLOPS and ROPs and memory bandwidth than the competing nVidia cards which benchmark the same. Is that a pro for AMD? Uhhhh doesn't really sound like it.


That’s the one thing that I feel is a bit misleading in the article (to be fair, it was initially written years ago, and got rewritten a bit recently). FLOPS comparisons given in the wild are not always apple-to-apple (eg. not including Tensor cores for NVIDIA, but including V_DUAL_DOT2ACC_F32_F16 for AMD), while on the flip side, AMD’s WMMA should address the same goals as Tensor cores. I have an article on comparing the two: https://espadrine.github.io/blog/posts/recomputing-gpu-perfo...
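To make the apples-to-oranges problem concrete, here is a toy peak-FLOPS calculation (the unit count and clock are hypothetical, not taken from the article or either vendor's specs): the same chip yields very different headline numbers depending on whether you count plain vector FMAs, packed low-precision math, or matrix units.

```python
def peak_tflops(units, fma_per_unit_per_clock, clock_ghz):
    # each fused multiply-add counts as 2 floating-point operations
    return 2 * units * fma_per_unit_per_clock * clock_ghz / 1000

# same hypothetical 7680-unit chip at 2.31 GHz, three different "peak" numbers:
fp32_vector = peak_tflops(7680, 1, 2.31)  # plain FP32 FMA           -> ~35.5
fp16_packed = peak_tflops(7680, 2, 2.31)  # 2-wide packed FP16       -> ~71.0
fp16_matrix = peak_tflops(7680, 8, 2.31)  # matrix units, 8 FMA/clk  -> ~283.9
print(fp32_vector, fp16_packed, fp16_matrix)
```

So a spec sheet quoting the packed number against a competitor's matrix-unit number is off by a large factor before any benchmark is run.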


> It would probably take a lot of investment from AMD to make that happen, I imagine

Don't AMD deliberately gimp their consumer cards to prevent cannibalising the pro cards? I vaguely recall reading about that a while back.

That being the case, they have already done the R&D but they chose to use the tech on the higher-margin kit, thus preventing hobbyists from buying AMD.


A few years ago AMD split off their GPU architectures to CDNA (focused on data center compute) and RDNA (focused on rendering for gaming and workstations). This in itself is fine and what Nvidia was already doing, it makes sense to optimize silicon for each use case, but where AMD took a massive wrong turn is that they decided to stop supporting compute completely for their RDNA (and all legacy) cards.

I'm not sure exactly what AMD expected to happen when doing that, especially when Nvidia continues to support CUDA on basically every GPU they've ever made: https://developer.nvidia.com/cuda-gpus#compute (looks like back to a GeForce 9400 GT, released in 2008)


It's like they don't care about having a pipeline of programmers ready to use their hardware, and want to ignore most of the workstation market.


Sadly this is still a market segment in which a proprietary stack dominates. From the perspective of AMD, they could be looking at a situation in which they can either throw billions of dollars at a monopoly protected by intellectual property law, and probably fail, or take a Pareto principle approach and cover their usual niche.


App based on this post to help you decide what to buy: https://nanx.me/gpu/


TL;DR, your best option right now is the RTX 4090 with the budget picks being either a used RTX 3090 or a used RTX 3090 Ti.


I think there's one more axis: Frequency-of-use.

For occasional use, the major constraint isn't speed so much as which models fit. I tend to look at $/GB of VRAM as my major spec. Something like a 3060 12GB is an outlier for fitting sensible models while being cheap.

I don't mind waiting a minute instead of 15 seconds for some complex inference if I do it a few times per day. Or having training be slower if it comes up once every few months.


Hopefully the next generation of cards have high-VRAM variants.


"As for capacity, Samsung’s first GDDR7 chips are 16Gb, matching the existing density of today’s top GDDR6(X) chips. So memory capacities on final products will not be significantly different from today’s products, assuming identical memory bus widths. DRAM density growth as a whole has been slowing over the years due to scaling issues, and GDDR7 will not be immune to that."

Source: https://www.anandtech.com/show/18963/samsung-completes-initi...


I can buy a DDR5 64GB kit from Crucial for $160.

https://www.crucial.com/memory/ddr5/ct2k32g48c40u5

If a $1000 GPU came with that, it would blow everything else out-of-the-water for model size. Speed? No. Model size? Yes.

If it came with 320GB, I could run ChatGPT-grade LLMs. That's $800 worth of DDR5.

Instead, I get 24GB on the 3090 or 4090 for $2k.

A $3k LLM-capable card would not be a hard expense to justify.
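The arithmetic behind those figures, for the record:

```python
kit_price, kit_gb = 160, 64      # the Crucial DDR5 kit cited above
per_gb = kit_price / kit_gb
print(per_gb)        # 2.5 dollars per GB of DDR5
print(320 * per_gb)  # 800.0 -- the "$800 worth of DDR5" for 320GB
```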


I'm sticking with nVidia for now (currently a 3090 bought secondhand off eBay) as it is the most tested/supported by far, but it is great to see AMD making progress (finally) as some competition in this segment is desperately needed.


Any tips for getting one off ebay without getting screwed? I want to pull the trigger, but a bit scared.


It was my first purchase off eBay, so not sure I can advise much.

I just waited for a reputable seller to show up. I also limited myself to professional sales from the EU to avoid any potential import issues.

I guess there is always a risk involved, but probably more buying from a first time private profile with just one object listed, than from a business that sells every day with high rep.


ditto. Second hand graphics cards are such a wild west to me.


I bought a Radeon RX 6700XT (12GB) last year, primarily for playing games.

But after Stable Diffusion came out, I started to play around with it and was pleasantly surprised that the GPU could handle it!

The setup is a little messy, and Linux only.

For someone targeting AI, definitely pick an Nvidia card with 12+ GBs of VRAM.


You'll want lots of memory, so depends on your price point.

4090 ($1,600) > 3090 ($1300 new - $600 used) > 3060 ($300)

A used 3090 is the best value. Lots of models will need the 24GB of VRAM.
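Using these price points, the dollars-per-GB-of-VRAM metric mentioned upthread works out as follows (prices are the ones quoted in this comment, so treat them as a snapshot):

```python
cards = {
    "RTX 4090":        (1600, 24),  # (price USD, VRAM GB)
    "RTX 3090 (used)": (600, 24),
    "RTX 3060":        (300, 12),
}
for name, (price, vram_gb) in cards.items():
    print(f"{name}: ${price / vram_gb:.2f}/GB")
# the used 3090 and 3060 tie on $/GB; the 3090 wins on total capacity
```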


Trying to build a scalable home 4090 cluster but running into a lot of confusion...

Let's say

- I have a motherboard + cpu + other components and they've both got plenty of pcie lanes to spare, total this part draws 250W (incl the 25% extra wattage headroom)

- start off with one RTX 4090, TDP 450W, with headroom ~600W.

- I want to scale up by adding more 4090s over time, as many as my pcie lanes can support.

    1. How do I add more PSUs over time? 

    2. Recommended initial PSU wattage? Recommended wattage for each additional pair of 4090s?

    3. Recommended PSU brands and models for my use case?

    4. Is it better to use PCI gen5 spec-rated PSUs? ATX 3.0? 12vhpwr cables rather than the ordinary 8-pin cables? I've also read somewhere that power cables between different brands of PSUs are *not* interchangeable??

    5. Whenever I add an additional PSU, do I need to do something special to electrically isolate the PCIe slots?

    6. North American outlets are rated for ~15A * 120V. So roughly 1800W. I can just use one outlet per psu whenever it's under 1800W, right? For simplicity let's also ignore whatever load is on that particular electrical circuit.

Each GPU means another 600W. Let's say I want to add another PSU for every 2 4090s. I understand that to sync the bootup of multiple PSUs you need an add2psu adapter.

I understand the motherboard can provide ~75W for a pcie slot. I take it that the rest comes from the psu power cables. I've seen conflicting advice online - apparently miners use pcie x1 electrically isolated risers for additional power supplies, but also I've seen that it's fine as long as every input power cable for 1 gpu just comes from one psu, regardless of whether it's the one that powers the motherboard. Either way x1 risers is an unattractive option bc of bandwidth limitations.

pls help


1. You can pair normal atx PSUs for the motherboard/CPU and server PSUs for the GPUs using breakout boards.

2. You can power limit GPUs down to 250W and barely lose any performance depending on your use case, highly recommend it. So any PSU that can provide those is good.

3. HP 1200w power supplies are both plentiful and cheap on ebay. Even though they are rated at 1200w, because they are so cheap you're better off just running them at ~500w and buying multiples instead of overheating a single one. A nice benefit of running them at lower wattages is that the very loud tiny fan doesn't have to spin as hard and create a ton of noise.

4. Not needed, but having a single cable might be convenient, they are pretty expensive though.

5. You don't need to do anything special here, except if you add too many GPUs, the motherboard might have issues booting because the 75w per gpu draw is too much, but usually those motherboards will have an extra GPU power cable (like the ROMED8-2T) and some risers let you hook up the power cable directly to them so PCIe is only used for data transfer.

6. It's not the outlet, it's the circuit that matters. And keep in mind that whatever power wattage you set on the GPU, you need to account for ac/dc loss, so you need to add an additional ~10-15% to the usage.

If you power limit it to 250W, each additional GPU is essentially an extra ~280W or so. If you plan on having like 8 GPUs or more and you plan to run them 24/7, you're better off just calling a local colocation center and run it there, since they have much cheaper electricity cost, it comes out cheaper for you and you have all the benefits of being in a datacenter.
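Points 2 and 6 combine into a quick wall-power estimate. The 250W GPU cap (set on Linux with `sudo nvidia-smi -pl 250`), the 250W base system, and the ~12% AC/DC conversion loss below are the assumptions from this thread, not measured values:

```python
def wall_watts(n_gpus, gpu_limit_w=250, base_w=250, psu_loss=0.12):
    """Estimated AC draw: base system plus power-limited GPUs, plus PSU conversion loss."""
    dc_watts = base_w + n_gpus * gpu_limit_w
    return dc_watts * (1 + psu_loss)

for n in (2, 4, 8):
    print(f"{n} GPUs: ~{wall_watts(n):.0f} W at the wall")
# 8 power-limited GPUs (~2520 W) already exceed a single 15A/120V circuit
```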


> 6. North American outlets are rated for ~15A * 120V. So roughly 1800W. I can just use one outlet per psu whenever it's under 1800W, right? For simplicity let's also ignore whatever load is on that particular electrical circuit.

You're going to have a bad time with this assumption; typical non-kitchen household circuits in the U.S. are 15A for the circuit. Each outlet is usually limited to 15A, but the circuit breaker serving the entire circuit is almost certainly 15A as well; one outlet at maximum load will not leave capacity for another outlet on the same circuit to be simultaneously drawing maximum amperage.

Typical residential construction would have a 15A circuit for 1-2 rooms, often with a separate circuit for lighting. Some rooms, e.g. kitchens will have 20A circuits, and some houses may have been built with 20A circuits serving more outlets / rooms.


So those miner motherboards with the crap ton of PCIe x1 slots typically have a molex connector on the motherboard for each of those slots. Molex is famous for starting fires. I’m not sure I would ever go with a setup with molex connectors, but then I’m not sure you have another option. The issue is if they used PCIe power connectors instead, you often wouldn’t have enough of those left over for your GPU, so I get why they went with molex, it’s just a very old, and by modern standards crappy connector.

Combined with the ~1800W per 15A circuit restriction (I wouldn’t load the circuit to 100%, so really ~1600W) I’m not sure you can achieve what you’re going for.

If you’re really wanting to do this, consider adding a say 30A circuit near the circuit breaker of your home, usually the garage or basement and put the equipment there. I would get a dehumidifier in either location.


Have you read Tim's guide?


One, don't use a case. Look at how miners mounted their hardware on racks and take notes. Cheaper, better for temps, and the most efficient use of space.

Two, I recommend ignoring electricity cost and using all you can. If it's cheaper now than it ever will be, use it while it's cheap. If it will go down due to renewables, nuclear, etc in the future, it's good to buy up the GPUs while their price is artificially depressed from energy fears.

Third, go for server type PSUs and breakout boards. The server PSUs cant be beaten in watts for your dollar, and are extremely efficient.

Finally, consider scooping up some x79 and x99 Xeon boards from Chinese sellers. They're cheap as hell, have PCIe lanes out the wazoo, etc. This means you don't have to fool with as many mobos to run the same number of GPUs. If you go this route, don't get the bottom-of-the-barrel no-name motherboards. Machinist is a decent one.


There’s clearly demand to buy AI capable GPUs at the store at a low price.

But Nvidia's monopoly means they cripple their retail cards and push the AI stuff to data centers.

If only there were many manufacturers of AI hardware and software, there would be abundant cheap products at every level.

AMD and Intel don’t seem to be able to compete and there’s no sign that will change.

So AI is going to remain expensive and hard to get for a very long time.


I almost immediately became suspicious of the accuracy of this article when it said the "Nvidia RTX 40 Ampere series". Ampere was the architecture name for the RTX 30 series; Ada Lovelace is the architecture name for the RTX 40 series.


Probably just an accident. Tim Dettmers has been updating this post for years and it's a super valuable resource.


Any advice for mobile gpus? I'm interested in getting a laptop (preferably in the portable category). Obviously it's not going to be in 4090 territory, that's a tradeoff I'm willing to make.


Weird to leave out Apple. They seem to be the cheapest option to get a large amount of GPU memory.


4090 is now in high end PCs, with 24GB VRAM, that's what I'm going to buy.

Everyone talks about Nvidia GPUs and AMD MI250/MI300, where is Intel? Would love to have a 3rd player.


Consider the 3090, same memory but was way cheaper than the 4090 when I was looking, might be a good trade off if you don't really need the 40 speed boost.


The 3090 is still underpowered by quite a bit, though it does have 24GB.


Intel has Habana Gaudi2, which is an A100 competitor, but you can only access it on Intel’s developer cloud, apparently.


yes even MI300 from AMD is data center only just like A100 and H100.

I guess what Intel is missing is a competitive consumer GPU (Arc is far behind AMD and Nvidia cards), so it cannot establish a developer ecosystem, and its oneAPI is a hard sell for its AI plans.

Either make Arc (or whatever GPU) as good as Nvidia/AMD graphics cards, or at least ship plenty of capable AI compute accelerator sticks to stay in the game; otherwise there is no future in the AI era for Intel, sadly.


Raw performance rating for the RTX 3070 seems very weirdly placed in the chart. It's below the RTX 3060 Ti, which doesn't seem to make any sense.


I never tire of this. Tim is a wonderful no nonsense person. I love these posts and I love that it stays up to date.


Really a shame that the 4070ti doesn't have 16GB.

But I guess it's to be expected; Nvidia doesn't want to cannibalize the 4080.


Every level below the *100 series has some sort of limitation to give incentives to upgrade one or two levels.

It's hard to blame nvidia when nobody seems to be trying to compete with them on the low end of ML and DL.


nVidia has a 20 GB GPU with the same chip as 4070Ti, the model is RTX 4000 SFF.

One issue is price, it costs almost twice as much. Another one is memory bandwidth, RTX 4000 SFF only delivers 320 GB/second. That is much slower than 4070Ti (504 GB/second) and slightly faster than 4060Ti (288 GB/second). Also the clock frequencies are half of 4070Ti, so the compute performance is worse.


> RTX 4000 SFF

Max Power Consumption - 70W.

Huh?


The power efficiency of the RTX 4000 is awesome, but it costs performance.

The 4070 Ti runs at 2.3 GHz base clock / 2.6 GHz boost; the RTX 4000 SFF only runs at 1.3 GHz base / 1.6 GHz boost. For this reason, despite the chip being the same, the compute performance of the RTX 4000 is not particularly great: the 4070 Ti delivers up to 35.48 TFLOPS at base clock, the RTX 4000 only 19 TFLOPS.
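Those figures are recoverable from shader count and clock; the 7680-core count below is the publicly listed spec for this chip and should be treated as an assumption here:

```python
def fp32_tflops(shader_cores, clock_ghz):
    # one FMA per core per clock, counted as 2 floating-point operations
    return 2 * shader_cores * clock_ghz / 1000

print(round(fp32_tflops(7680, 2.31), 2))  # 35.48 -- 4070 Ti at base clock
print(round(fp32_tflops(7680, 1.3), 2))   # 19.97 -- RTX 4000 SFF at base clock
```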


Half the TFLOPS, but a quarter of the energy consumption (the 4070 Ti's max TDP is 285W).


For a compromise, how is the recently released 4060 Ti with 16GB of VRAM? It's about a third the price of a 4090.


Do local GPUs make sense? For the same price, can't you get a full year's worth of cloud GPU time?


Cloud GPU providers are running low on capacity at the moment as people frantically suck up capacity to hop on the AI bandwagon, raising worries about availability. So having guaranteed access is maybe one motivation for local GPUs. But for me the main reason to go local is more psychological. I've mostly used cloud compute up until now but whenever I'm paying an hourly cost (even a small one) there is a pressure to 'make it worthwhile' and I feel guilty when the GPU is sitting idle. This disincentivizes playing and experimentation, whereas when you can run things locally there is almost no friction for quickly trying something out.


Looking at the pricing, if you only spin those instances up when you need them, you can go a while before you break even. Otherwise it only takes a few months depending on the GPU.

I would imagine that someone really serious about training (or any other CUDA workload) uses both.
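A rough break-even sketch (the card price and hourly rental rate below are illustrative assumptions, not quotes from any provider):

```python
card_cost = 1600       # assumed 4090-class card, USD
cloud_rate = 1.20      # assumed $/hour for a comparable cloud GPU

break_even_hours = card_cost / cloud_rate
print(round(break_even_hours))      # ~1333 hours of rented compute
print(round(break_even_hours / 4))  # ~333 days at 4 hours/day of real use
```

If your utilization is spiky, the cloud side of this math wins easily; if the card runs hot 24/7, local hardware pays for itself in a couple of months.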


Having looked at the pricing of retail card vs cloud, I came to the conclusion I could probably buy enough cloud compute to complete a phd before I 'paid for' the cost of a 4090 build...


Buying a high-end gaming GPU also lets you do, well, high-end gaming, 3D and video renders, etc.

If you only care about ML stuff, sure, the calculation is different.


Omg that's a long read but very informative



