Evaluating AMD GPUs by their specs is not going to paint the full picture. Their drivers are a serious problem. I've managed to get ROCm mostly working on my system (ignoring all the notices about what is officially supported; the jammy debs from the official repo seem to work on Debian testing). The range of supported setups is limited, so it is quite easy to end up in a similar situation.
I expect system lockups when doing any sort of model inference. Based on my experience over the last few years, I assume these are driver bugs. At their rate of improvement they will probably get there around 2025, but their past performance has been so bad that I wouldn't recommend buying a card for machine learning until they've proven they're taking the situation seriously.
That said, in my opinion, buy AMD anyway if you need a GPU on Linux. Their open source drivers are a lot less hassle as long as you don't need BLAS.
In the data center, I think AMD is a lot more viable than most people think. MosaicML recently did a test and were able to swap MI250s with A100s basically seamlessly, within a single training run even, and ran into no issues: https://www.mosaicml.com/blog/amd-mi250
I think where most people have been getting into trouble is trying to run with unsupported cards (eg, *ALL* of AMD's consumer cards), or wanting to run on Windows. This is obviously a huge fail on AMD's part, since anyone who's tried to do anything with any of those consumer cards will just assume the data center cards are the same, but they're quite different. It doesn't help that I've never seen any CDNA2 card on sale or available in retail. How does AMD ever expect to get any adoption when no developers have hardware they can write code for? It's completely mental.
I got really excited until you said all of their consumer cards are out. That's even more infuriating - people have mammoth computing devices lying around and they can't make full use of them, because of drivers.
Not that drivers are simple to make, but still. It's like owning a Ferrari that works perfectly, but you can only drive north.
You can use the WebGPU backend in Tinygrad. It's working well in my tests with an Nvidia 960 running inference (UNet 3D). I don't know how well WebGPU is supported on AMD GPUs.
ROCm is not the only option; compute shaders are very reliable on all GPUs. And thanks to Valve's work on DXVK 2.0, modern Linux runs Windows D3D11 software just fine.
I dunno, are they? AMD should pay someone to put up some "how to multiply a 2x2 matrix on our GPU, for the average programmer!" tutorials somewhere obvious. I saw a lot of GPU lockups before I gave up and decided it wasn't worth it. Maybe compute shaders were the thing I should have tried. To be honest, I don't know much about them, because my attempts in the space were shut down pretty hard by driver bugs linked to OpenCL and ROCm.
I thought it was just me for a while, but after watching George Hotz's famous meltdown trying to program on an AMD GPU I do wonder if they're underestimating the power of a few good public "how to use the damn thing" sessions. They've been pushing ROCm which would probably be great if it worked reliably.
CUDA has been the default tech for GPGPU in HPC and AI applications for more than a decade now. By now, people have found most of these driver bugs, and nVidia has fixed them.
Similarly, compute shaders are the only tech for GPGPU used in videogames. Modern videogames have been using compute shaders for a decade now, in increasing volumes. For example, UE5 even renders triangle meshes with them [1].
However, OpenCL and ROCm are niche technologies. I've been hearing complaints about their driver quality for some time now. For obvious reasons, AMD and Intel prioritize driver bugs which affect modern videogames sold in many millions of copies over bugs which only affect a few people working on HPC, AI, or other niche GPGPU applications.
> they're underestimating the power of a few good public "how to use the damn thing" sessions
I agree the learning curve is steep, given the lack of good materials. For an introductory article, see [2]. Ignore the parts about D3D10 hardware; the article is old and D3D10 hardware is no longer relevant. Another one, with slightly more depth, is [3]. For an example of how to multiply large dense matrices with a compute shader, see [4], but that example is rather advanced because of optimizations, and because of weird memory layout conventions inherited from the upstream project.
> For obvious reasons, AMD and Intel prioritize driver bugs which affect modern videogames sold in many millions of copies over bugs which only affect a few people working on HPC, AI, or other niche GPGPU applications.
If people could develop AI stuff on consumer cards, they'd then buy a ton of server-grade cards, or rent them via the usual hyperscalers or dedicated platforms, for the actual work.
This entire multi-million-dollar (per training run) market is firmly in the hands of nVidia at the moment, and unless AMD seriously improves their offering, that won't change. nVidia got to where they are because they focused on getting developers up to speed fast and cheap, and so the developers asked their employers to get them the stuff they were already used to.
> If people could develop AI stuff on consumer cards
Technically people can already do that, by leveraging compute shaders in D3D, Vulkan, or WebGPU. I'm certain it's possible to implement a D3D or Vulkan backend for PyTorch, TensorFlow, and similar ML libraries. When I experimented with AI, I did similar stuff myself with C++, C#, and HLSL, and found it wasn't terribly hard to do.
However, PyTorch is made by Facebook and TensorFlow by Google. It seems most companies who maintain such libraries are only interested in cloud computing. Some of them, like Qualcomm, only care about their own proprietary hardware. None of them seems to care about desktops.
My theory is that AMD initially wanted to compete with nVidia in compute and was going to improve ROCm. But then they saw how much trouble nVidia had artificially keeping enterprise users from just buying consumer GPUs instead of the much more profitable enterprise ones, so they trashed that idea to keep consumer GPUs from interfering with their very profitable enterprise compute business.
ROCm was designed and implemented for HPC. There's no cunning scheme to stop it working on gaming cards; there just isn't (wasn't?) much investment in making it work either.
I'm suggesting they gave up on investing in making it work when they realized that good compute on their consumer cards would cannibalize their enterprise cards, like what nVidia is experiencing.
It disappoints me that DirectX remains one of the best GPU-compute solutions in practice right now. And Vulkan too I guess.
But it really is. That's the state of the market. Video game developers are GPU programmers: they've hit DirectX 11, DX12, and Vulkan with a wide variety of video games and have made that ecosystem very stable.
-------------
DX11 has 32-bit-only atomics; I don't think it's a very serious solution in practice. Even 64-bit atomics (especially 64-bit CAS) are very limiting compared to the CPU world, where 128-bit CAS is needed to fix the obscure ABA problem.
DX and Vulkan just have... so much API cruft you need just to get Hello World / SAXPY up.
C++ AMP was wonderful back in 2014, but it too is stuck in the DirectX 11, and therefore 32-bit atomic, world. And it hasn't had an update since then. Microsoft really should have kept investing in C++ AMP, IMO.
-------------
ROCm is fine if you get the hardware and if it remains supported. But I think in practice, people expect support longer than what AMD is willing to give.
> 32-bit-only atomics, I don't think it's a very serious solution in practice
Yeah, I think I encountered that while porting a hash map from CUDA to HLSL.
However, I'm not sure that's necessarily a huge deal. It's probably not an issue for machine learning or BLAS stuff; these use cases don't need fine-grained thread synchronization.
For applications which would benefit from such synchronization, traditional lock-free techniques ported from the CPU (i.e. compare-and-swap atomics on global memory) can be slow due to the huge number of active threads on GPUs. I mean, it's obviously better than locks, but sometimes it's possible to do something better than CAS.
> so much API cruft you need just to get Hello World / SAXPY up.
I agree for D3D12 and especially Vulkan, but IMO D3D11 is not terribly bad. It has a steep learning curve, but the amount of boilerplate for simple apps is reasonable, IMO. Especially for ML or similar GPGPU stuff, which only needs a small subset of the API: compute shaders only, no textures, no render targets, no depth-stencil views, no input layouts, etc.
However, unlike simple apps, real-life ones often need a profiler and a queue-depth limiter, both relatively hard to implement on top of the underlying queries. I think Microsoft should ship both in the Windows SDK.
Lock-free techniques are not "obviously better than locks".
Lock-free techniques offer a different trade-off than locks. They are a little faster than locks in the typical case, but this advantage is paid for by being much slower in the worst case (because they may need a very large number of retries to succeed). When a great number of threads contend for access, the worst case can be very frequent.
The best application for lock-free techniques is in read-only access to shared data. In this case they are almost always the best solution.
On the other hand, for write access to shared data, which one is better, between optimistic access control with lock-free techniques and deterministic serialization of the accesses with locks, depends on the application and it cannot be said in general that one method or the other is preferable.
> Lock-free techniques are not "obviously better than locks".
On GPUs, they are. GPUs don't have any locks; they can only be emulated on top of these global memory atomics. Because the number of active threads is often in the thousands, the performance of that approach is much worse than lock-free techniques.
> For applications which would benefit from such synchronization, traditional lock-free techniques ported from the CPU (i.e. compare-and-swap atomics on global memory) can be slow due to the huge number of active threads on GPUs. I mean, it's obviously better than locks, but sometimes it's possible to do something better than CAS.
I don't think traditional CAS can be optimized. But a fair number of atomic operations can be coalesced into a prefix sum. So, with regard to your latest post:
> On GPUs, they are. GPUs don't have any locks; they can only be emulated on top of these global memory atomics. Because the number of active threads is often in the thousands, the performance of that approach is much worse than lock-free techniques.
Those "thousands of atomic" operations can become coalesced 32-at-a-time (via a prefix sum) and turn into just "dozens of atomics" in practice. Automatically, mind you, by the compiler.
Don't discount the brute-force code because it's simple. Don't assume ~1000+ atomic operations will actually be physically executed as 1000+ atomics. The compiler can "fix" a lot of this code in practice.
Not always, but the compiler can fix it often enough that it's beneficial to write the simple brute-force "thousands-of-atomics" code and check the compiled output.
Probably never with CAS, but pretty often an "atomic_add" written in a brute-force manner (tracking a parallel counter) will compile into a prefix sum plus one atomic from a single lane, rather than execute as 32x atomics. And even if it is executed as 32x atomics, there are atomic accelerators on GPUs that may make the operation faster than you might think. You know, as long as it isn't a compare-and-swap loop.
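As a sketch of what that coalescing does, here is the transformation in plain Python. The lane count, the exclusive prefix sum, and the single-lane atomic are the standard pattern; the function names are mine for illustration, not any real compiler API:

```python
# One warp of 32 lanes each doing atomic_add(counter, inc) to reserve slots.
# Naive: 32 serialized global atomics. Coalesced: an exclusive prefix sum
# across lanes (a cheap cross-lane shuffle in hardware), ONE atomic add of
# the warp total from a single lane, then base + prefix gives each lane
# the same offset it would have gotten from its own atomic.
from itertools import accumulate

WARP_SIZE = 32

def naive_atomics(counter, increments):
    # one global atomic per lane
    offsets = []
    for inc in increments:
        offsets.append(counter)   # lane sees the pre-increment value
        counter += inc
    return counter, offsets

def coalesced_atomics(counter, increments):
    prefix = [0] + list(accumulate(increments))[:-1]  # exclusive prefix sum
    total = sum(increments)
    base = counter                # the single atomic add of the warp total
    counter += total
    offsets = [base + p for p in prefix]
    return counter, offsets

incs = [1] * WARP_SIZE            # every lane reserves one slot
assert naive_atomics(100, incs) == coalesced_atomics(100, incs)
```

Both paths produce identical counters and per-lane offsets; the coalesced path just touches global memory once per warp instead of once per lane.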
Sure, compute shaders might work, but don’t you need rocBLAS, rocSPARSE, MIOpen, etc? Are people reinventing those in compute shaders in another package?
These things are nice to have, but you don’t actually need them.
It only takes 1-2 pages of HLSL to implement efficient matrix multiplication. It's not rocket science; the area is well researched and there are many good articles on how to implement these BLAS routines efficiently.
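For intuition, the core idea of those short shader matmuls (process the matrices in tiles so each loaded tile is reused many times) can be sketched in plain Python. In a hypothetical HLSL version, each (i0, j0) block would be one thread group, with the tiles staged in groupshared memory; this sketch only shows the loop structure, not the performance:

```python
def matmul_tiled(A, B, n, tile=2):
    # C = A * B for n x n row-major matrices, processed in tile x tile blocks.
    # Each (i0, j0) block is independent; on a GPU it would be one thread
    # group, with the A/B tiles staged in fast groupshared memory.
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):       # march over the K dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        s = 0.0
                        for k in range(k0, min(k0 + tile, n)):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C
```

The tiling changes nothing about the result, only the memory access pattern, which is where nearly all the performance of the real shader version comes from.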
Moreover, manually written compute shaders enable stuff missing from these off-the-shelf higher level libraries.
It's easy to merge multiple compute operations into a single shader. When possible, this sometimes saves gigabytes of memory bandwidth (and therefore time) that these high-level libraries spend writing/reading temporary tensors.
It's possible to re-shape immutable or rarely changing tensors into better memory layouts. Here's an example for CPU compute: https://stackoverflow.com/a/75567894/126995 The idea is equally good on GPUs.
It's possible to use custom data formats, and they don't require any hardware support. Upcasting BF16 to FP32 is one shader instruction (a left shift), and downcasting FP32 to BF16 only takes a few of them (for proper rounding); no hardware support necessary. You can pack quantized or sparse tensors into a single ByteAddressBuffer; again, nothing special is required from the hardware. You can implement custom compression formats for these tensors.
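As a sketch of the BF16 claim: both directions are pure bit manipulation. In HLSL these would be a shift and an integer add on uints; here is the same logic in Python via `struct` (round-to-nearest-even on the downcast, as commonly done, is my assumption about "proper rounding"):

```python
import struct

def f32_bits(x):
    # reinterpret a float32 as its 32-bit pattern
    return struct.unpack('<I', struct.pack('<f', x))[0]

def bits_f32(b):
    return struct.unpack('<f', struct.pack('<I', b))[0]

def f32_to_bf16(x):
    # downcast: keep the top 16 bits, with round-to-nearest-even
    b = f32_bits(x)
    rounding = 0x7FFF + ((b >> 16) & 1)
    return ((b + rounding) >> 16) & 0xFFFF

def bf16_to_f32(h):
    # upcast: literally one 16-bit left shift
    return bits_f32((h & 0xFFFF) << 16)
```

Values exactly representable in BF16 (8 exponent bits, 7 mantissa bits) round-trip exactly; everything else lands within one BF16 ulp.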
The bar for the AI ecosystem to get models running is git clone + <20 lines of Python.
If AMD can’t make that work on consumer GPUs, then the only option they have is to undercut Nvidia so deeply that it makes sense for the big players to hire a team to write AMD’s software for them.
> undercut Nvidia so deeply that it makes sense for the big players to hire a team
At least for some use cases, I think that has already happened. nVidia forbids using GeForce GPUs in data centers (in the EULA of the drivers); AMD allows it. The cost-efficiency difference between AMD's high-end desktop GPUs and the nVidia GPUs which they allow to be deployed in data centers is about an order of magnitude now. For example, an L40 card costs $8000-9000 and delivers performance similar to the $1000 AMD RX 7900 XTX.
For this reason, companies which run large models at scale are spending ridiculous amounts of money on compute, often by renting nVidia's data-center-targeted GPUs from IaaS providers. OpenAI's CEO once described the compute costs of running ChatGPT as "eye watering".
For companies like OpenAI, I think investing in the development of vendor-agnostic ML libraries makes a lot of sense in terms of ROI.
But you can't run most models on the consumer AMD GPUs, so even though AMD "allows" it, nobody except supercomputer clusters uses AMD GPUs for compute, because all the expensive data scientists you hired will bitch and moan until you get them something they can just run standard CUDA stuff on.
Different people estimate compute costs of ChatGPT to be between $100k and $700k per day. Compared to these numbers, data scientists aren't that expensive.
> just run standard CUDA stuff
I doubt data scientists have the skills to write CUDA, or any other low-level GPGPU code. It's relatively hard to do, and it takes years of software development experience to become proficient.
I'm pretty sure most of these people are only capable of using higher-level libraries like TensorFlow and PyTorch. For this reason, I don't think these data scientists need or care about standard CUDA stuff; they only need another backend for these Python libraries, which is a much easier problem to solve.
And one more thing. It could be that most of ChatGPT costs are unrelated to data scientists, and caused by end users running inference. In that case, the data scientists won't even notice, because they will continue using these CUDA GPUs to develop new versions of their models.
The more relevant question is which GPU is the OP using? The only officially ROCm supported GPU available for retail purchase is the RDNA2-based Radeon Pro W6800. [1]
In practice it probably means that gfx1030 (Navi 21) GPUs should work (RX 6800-RX 6950), but again, it also means those cards (and every other card that AMD currently sells to individuals) are "unsupported."
Are you running the ROCm jobs on the same GPU as the system GUI? I use built-from-source ROCm on Debian with reasonable success, but I do remember GNOME crashing pretty reliably when trying to run compute tests on my laptop.
How are the Windows drivers for AMD? OS shouldn't matter all that much if its primary role is to host or train models. As long as your code can run under the OS in question it's fine.
> The 24GB of VRAM should keep it relevant for a bit too
If anything, I think models are going to shrink a bit, because assumptions about small models reaching capacity during training don't seem fully accurate in practice[0]. We're already starting to see some effects, like Phi-1[1] (a 1.3B code model outperforming 15B+ models) and BTLM-3B-8K[2] (a 3B model outperforming 7B models).
We had a long phase of "models aren't good enough but get better if we make them bigger, let's see how far we can go". This year we finally reached "some models are pretty great, let's see if we can do the same with smaller models". I'm excited for where this will take us.
Is there any way to compute the "capacity" of a model? In theory, if it's encoding all data with 100% efficiency, I guess the data stored in the model should be something like 2^(parameter count), counting weights and biases?
There's a theoretical, but impractical, way: for a given model, each possible set of weight/bias values yields a specific loss value when run against the full corpus. There's at least one set of weight values which minimizes it, for which the idealized bits-per-byte entropy can be computed.
That can be compared to what OpenAI’s scaling law paper[0] calls the “entropy of natural language”, which they estimate at about 0.57 bits per byte, based on the differing power law for data vs. compute. In my mind, that highlights more the imprecision of the approach than the information-theoretic content of language semantics: an omniscient being would predict things better, so the closest thing to true entropy should be computed from the list of matching text prefixes among all texts ever.
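For concreteness, converting a model's reported cross-entropy loss into that bits-per-byte number is a one-liner; the 2.2 nats/token loss and ~4 bytes of text per token below are hypothetical round numbers for illustration, not figures from the paper:

```python
import math

def bits_per_byte(loss_nats_per_token, bytes_per_token):
    # cross-entropy reported in nats/token -> bits/token -> bits/byte
    return loss_nats_per_token / math.log(2) / bytes_per_token

# hypothetical model: 2.2 nats/token, ~4 bytes of text per token
bpb = bits_per_byte(2.2, 4.0)   # ~0.79 bits/byte, above the ~0.57 estimate
```

Whatever the true numbers, the gap between a model's bits per byte and that estimated entropy floor is one way to read how far the model is from "capacity".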
> should be computed from the list of matching text prefixes among all texts ever
I initially thought that value is pretty low (possible things you can say), but it's probably infinite. Even though, in practice, we don't say too many different things and use a very limited subset of the words in the dictionary.
You get a fast link between the GPUs, which should help when you’ve got a model split between them.
However, that split isn't automatic. You can't expect to run a 40GB model on that, unless perhaps it's been designed for it, the way llama.cpp can split a model between the GPU and CPU, for instance.
What you can do without trouble is keep more models loaded, do more things at the same time, and occasionally run the same model at double speed if it batches well.
CUDA multi-GPU with NVLink is pretty well tested with shared memory space. You still want to use NCCL to optimize the allocation, but many CUDA-aware libraries (and their subsequent ML tools) are capable.
I think you need enterprise-grade cards to make it work. If I remember correctly, consumer cards with NVLink can't share resources to host a 40GB model in VRAM.
I bought a used 3090 FE from eBay for $600 too! Mine is missing the connector latch, but seems to be firmly inserted so I think fire risk is negligible.
I went with the 3090 because I wanted the most VRAM for the buck, and the price of new GPUs is insane. Most GPUs in the $500-1500 range, even the Quadros and A series, don’t have anywhere near 24GB of VRAM.
I used Tim's guide to build a dual RTX 3090 PC, paying 2300€ in total by getting used components. It can run inference of Llama-65B, 4-bit quantized, at more than 10 tok/s.
Finding the 3-slot nvlink bridge is hard and it's usually expensive. I think it's not worth it in most cases. I managed to find a cheap used one. Cooling is also a challenge. The cards are 2.7 slots wide and the spacing is usually 3 slots, so there isn't much room. Some people are putting 3d printed shrouds on the back of the PC case to suck the air out of the cards with an extra external fan. Also limiting the power from 350W to 280W or so per card doesn't cost a lot of performance. The CPU is not limiting the performance at all, as long as you have 4 cores per GPU you're good.
Managed to snatch a 3090 during the GPU shortage in 2020. Did a lot of training and mining, and got some of my results published; I think I gained much more than the cost of the hardware. Kinda miss the days of ETH mining. The 3090 is still a good card, and I'm pretty sure your rig is going to serve you well.
ps: ~280W power limit is a good call, it won't heat up your room too much.
I hear a lot about CUDA and how bad ROCm is etc. and I’ve been trying to understand what exactly CUDA is doing that is so special; isn’t the maths for neural networks mostly multiplying large arrays/tensors together? What magic is CUDA doing that is so different for other vendors to implement? Is it just lock-in, the type of operations that are available, some kind of magical performance advantage or something else that CUDA is doing?
3. Ecosystem advantage (lots of software developed against an existing and well supported ecosystem)
I have a laptop with a mobile 2060 and a desktop with a top-of-the-line consumer 7900XTX. As of yet, the 7900XTX isn't officially supported (and I haven't bothered to go down the obnoxious rabbit hole to figure out how to compute on it). Meanwhile, I can load up CUDA.jl on my laptop in mere minutes with absolutely no fuss.
Edit: if there are any GPU gurus out there who are capable of working on AMDGPU.jl to make it work on cards like the 7900XTX out of the box and writing documentation/tutorials for it... start a Patreon. I bet you could fund some significant effort getting that up and running!
As of today, there is zero consumer card support from AMD. It is an option only if you have a PRO card.
"Formal support for RDNA 3-based GPUs on Linux is planned to begin rolling out this fall, starting with the 48GB Radeon PRO W7900 and the 24GB Radeon RX 7900 XTX, with additional cards and expanded capabilities to be released over time." [0]
Right, which SUCKS. Everyone who wants to prototype on their existing gear before jumping into a big pro card purchase is stuck with Nvidia, and the availability/performance of the software stack shows it.
I'm asking at a lower level than this. CUDA presumably has a list of functionality for GPGPU stuff like tensors, loading data, splitting up training, and building pipelines of networks/attention stuff that can efficiently fit neural networks to many sorts of data.
Why is it so difficult for other manufacturers to provide a compatibility layer? If Apple can make DirectX 12 work on Apple Silicon, surely AMD should be able to make CUDA (which has to be much simpler than DX12) work on their graphics cards? Is there some fundamental architectural difference that stops this from working?
There's nothing conceptually hard but it's really a lot of work. In addition to the items you listed there's the actual compute kernels or compiler to generate those, and then porting frameworks over (PyTorch etc), and then doing the level of testing, documentation, and ongoing maintenance to make an alternative platform a reasonable idea for end users. The pitch for buying NVIDIA hardware is that existing tools, example code, and third party research will more or less work and perform well out of the box.
Edit: Going back to your original question, the main thing that makes CUDA so special is NVIDIA has already poured billions of dollars into all of this infrastructure and credibly will keep doing so.
There might be intellectual property concerns with "directly" implementing CUDA, and the architectures are (as I understand it) a bit different. That doesn't explain why they don't support something with similar broad compatibility though, as the actual card capabilities are very similar.
Nvidia's software is also pretty atrocious, in my opinion. The output of various tools is cryptic, updates regularly result in a totally broken system, and things often stop working for no discernible reason. Nvidia GPUs are always the most finicky part of a system.
A modern Linux system should have uptime measured in years with minimal effort. A modern Linux system with Nvidia GPUs will have uptime of weeks with a lot of fuss.
(I'm no expert, just someone who's managed a number of PCs and a few servers.)
Right, but they can get away with that because they have essentially no competition.
With that said, Pop!OS does a really nice job of handling the Nvidia software stack - I've been running it on the laptop mentioned above for several years with no issues (though I don't leave my machines on 24/7).
Everyone else does the work to make sure it runs on cuDNN, because they bought the hardware when it was the only reasonable solution, and if it works on anything else, that's just a happy accident. So you'll spend weeks of your incredibly expensive engineering or research time fighting compatibility issues because you saved $1k by going with an AMD card. Your researchers/engineers conclude it's the only reasonable solution for now and build on Nvidia.
It’s classic first mover advantage (plus just a better product / more resourcing to make it a better product honestly). I think you have to be a really massive scale to make the cost per card worth the cost per engineer math work out, unless AMD significantly closes the compatibility gap. But AMD’s job here is to fill a leaky bucket, because new CUDA code is being written every day, and they don’t seem serious about it.
I suppose this is one of the reasons (besides AMD dropping the ball) they aren't even trying to be competitive in the gaming market - they can sell the same mm2 silicon area for much more to AI startups:
"There's a full blown run on GPU compute on a level I think people do not fully comprehend right now. Holy cow.
I've talked to a lot of vendors in the last 7 days. It's crazy out there y'all. NVIDIA allegedly has sold out its whole supply through the year. So at this point, everyone is just maximizing their LTVs and NVIDIA is choosing who gets what as it fulfills the order queue." [0]
Make them work. Adapt your software. Whatever, stop whining about cloud GPUs just switch to retail.
Solve the barriers and limitations. This has been the essence of computing for 60 years. Stop being intimidated by Nvidia. Get your job done with what is available.
Buy AMD, buy Intel. Work out how to make your GPU thing work on them. Stop wringing your hands about how there’s no cloud GPUs when there’s a ton of cheap retail GPUs.
INNOVATE. Look beyond nvidia. Stop whining.
If you’ve bet your entire business on cloud GPUs then you’re a fool. Bet on retail GPUs.
If you're doing this professionally, you know what you need and what your budget is.
If you're doing this personally, to simply learn? I did the math for myself and figured I could probably buy enough gpu credits for the cost of a 4090 build that it would probably serve me all the way through getting a phd.
Ok, so make an enormous capital investment in tech that will be outdated in two years - if not already outdated, as you mentioned AMD and Intel. Then, I'll need to hire geniuses who can extract juice out of this hardware at a scale that not even Google, Amazon, Microsoft and others could.
Or, I rent just as much top performance hardware as I need, scaling as I go along, and worry about execution and implementation of my niche application instead. You can see why cloud is winning right now.
So Nvidia is going to pretty much corner the market for a long time? This bit I expected but was still sad to read. Surely we would benefit from competition. It would probably take a lot of investment from AMD to make that happen, I imagine.
> AMD GPUs are great in terms of pure silicon: Great FP16 performance, great memory bandwidth. However, their lack of Tensor Cores or the equivalent makes their deep learning performance poor compared to NVIDIA GPUs. Packed low-precision math does not cut it. Without this hardware feature, AMD GPUs will never be competitive.
This has pretty much always been true. AMD cards always had more FLOPS and ROPs and memory bandwidth than the competing nVidia cards which benchmark the same. Is that a pro for AMD? Uhhhh doesn't really sound like it.
That's the one thing that I feel is a bit misleading in the article (to be fair, it was initially written years ago and only got rewritten a bit recently). FLOPS comparisons given in the wild are not always apples-to-apples (e.g. not including Tensor cores for NVIDIA, but including V_DUAL_DOT2ACC_F32_F16 for AMD), while on the flip side, AMD's WMMA should address the same goals as Tensor cores. I have an article comparing the two: https://espadrine.github.io/blog/posts/recomputing-gpu-perfo...
> It would probably take a lot of investment from AMD to make that happen, I imagine
Don't AMD deliberately gimp their consumer cards to prevent cannibalising the pro cards? I vaguely recall reading about that a while back.
That being the case, they have already done the R&D but they chose to use the tech on the higher-margin kit, thus preventing hobbyists from buying AMD.
A few years ago AMD split off their GPU architectures to CDNA (focused on data center compute) and RDNA (focused on rendering for gaming and workstations). This in itself is fine and what Nvidia was already doing, it makes sense to optimize silicon for each use case, but where AMD took a massive wrong turn is that they decided to stop supporting compute completely for their RDNA (and all legacy) cards.
I'm not sure exactly what AMD expected to happen when doing that, especially when Nvidia continues to support CUDA on basically every GPU they've ever made: https://developer.nvidia.com/cuda-gpus#compute (looks like back to a GeForce 9400 GT, released in 2008)
Sadly this is still a market segment in which a proprietary stack dominates. From the perspective of AMD, they could be looking at a situation in which they can either throw billions of dollars at a monopoly protected by intellectual property law, and probably fail, or take a Pareto principle approach and cover their usual niche.
For occasional use, the major constraint isn't speed so much as which models fit. I tend to look at $/GB of VRAM as my major spec. Something like a 3060 12GB is an outlier, fitting sensible models while being cheap.
I don't mind waiting a minute instead of 15 seconds for some complex inference if I do it a few times per day. Or having training be slower if it comes up once every few months.
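The $/GB heuristic is trivial to compute. The prices below are rough ballpark figures I'm assuming for illustration (the used-3090 price matches one mentioned elsewhere in the thread), not quotes:

```python
# (price_usd, vram_gb); prices are illustrative guesses, not current quotes
cards = {
    "RTX 3060 12GB": (280, 12),
    "RTX 3090 24GB (used)": (600, 24),
    "RTX 4090 24GB": (1600, 24),
}

def dollars_per_gb(price, vram_gb):
    return price / vram_gb

# cheapest VRAM first; with these prices the 3060 and a used 3090 land
# near $23-25/GB, while the 4090 is near $67/GB
ranked = sorted(cards, key=lambda name: dollars_per_gb(*cards[name]))
```

Under these assumptions the budget cards cluster tightly on $/GB, which is why "most VRAM for the buck" picks tend to ignore raw speed entirely.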
"As for capacity, Samsung’s first GDDR7 chips are 16Gb, matching the existing density of today’s top GDDR6(X) chips. So memory capacities on final products will not be significantly different from today’s products, assuming identical memory bus widths. DRAM density growth as a whole has been slowing over the years due to scaling issues, and GDDR7 will not be immune to that."
I'm sticking with nVidia for now (currently a 3090 bought secondhand off eBay) as it is the most tested/supported by far, but it is great to see AMD (finally) making progress, as some competition in this segment is desperately needed.
It was my first purchase off eBay, so I'm not sure I can advise much.
I just waited for a reputable seller to show up. I also limited myself to professional sales from the EU to avoid any potential import issues.
I guess there is always a risk involved, but probably more so buying from a first-time private profile with just one item listed than from a business that sells every day and has high rep.
Trying to build a scalable home 4090 cluster but running into a lot of confusion...
Let's say
- I have a motherboard + CPU + other components with plenty of PCIe lanes to spare; in total this part draws 250W (incl. the 25% extra wattage headroom)
- I start off with one RTX 4090, TDP 450W, with headroom ~600W.
- I want to scale up by adding more 4090s over time, as many as my pcie lanes can support.
1. How do I add more PSUs over time?
2. Recommended initial PSU wattage? Recommended wattage for each additional pair of 4090s?
3. Recommended PSU brands and models for my use case?
4. Is it better to use PCI gen5 spec-rated PSUs? ATX 3.0? 12vhpwr cables rather than the ordinary 8-pin cables? I've also read somewhere that power cables between different brands of PSUs are *not* interchangeable??
5. Whenever I add an additional PSU, do I need to do something special to electrically isolate the PCIe slots?
6. North American outlets are rated for ~15A * 120V. So roughly 1800W. I can just use one outlet per psu whenever it's under 1800W, right? For simplicity let's also ignore whatever load is on that particular electrical circuit.
Each GPU means another 600W. Let's say I want to add another PSU for every two 4090s. I understand that to sync the boot-up of multiple PSUs you need an add2psu adapter.
I understand the motherboard can provide ~75W per PCIe slot; I take it the rest comes from the PSU power cables. I've seen conflicting advice online: miners apparently use electrically isolated PCIe x1 risers when adding power supplies, but I've also read that it's fine as long as every input power cable for a given GPU comes from a single PSU, regardless of whether that's the one powering the motherboard. Either way, x1 risers are an unattractive option because of the bandwidth limitations.
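To make the budgeting concrete, here's a rough back-of-envelope sketch using only the figures from this post (250W base system, ~600W per 4090 with headroom, ~75W per PCIe slot, one extra PSU per additional pair of cards) — these are the poster's assumptions, not measured values:

```python
def power_budget(num_gpus: int,
                 base_w: int = 250,   # motherboard + CPU incl. headroom
                 gpu_w: int = 600,    # 4090 TDP plus transient headroom
                 slot_w: int = 75) -> dict:  # max draw through the PCIe slot
    """Rough totals for the build described above (assumed figures)."""
    total = base_w + num_gpus * gpu_w
    # main PSU covers the board + first GPU; one extra PSU per further pair
    extra_psus = num_gpus // 2
    return {"total_w": total,
            "psus": 1 + extra_psus,
            "slot_draw_w": num_gpus * slot_w}

print(power_budget(1))  # {'total_w': 850, 'psus': 1, 'slot_draw_w': 75}
print(power_budget(4))  # {'total_w': 2650, 'psus': 3, 'slot_draw_w': 300}
```

Even at four GPUs the combined slot draw (300W) and total wall load (~2.7kW before conversion loss) show why both the motherboard's slot power and the household circuit become constraints.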
1. You can pair normal atx PSUs for the motherboard/CPU and server PSUs for the GPUs using breakout boards.
2. You can power limit GPUs down to 250W and barely lose any performance depending on your use case, highly recommend it. So any PSU that can provide those is good.
3. HP 1200W server power supplies are plentiful and cheap on eBay. Even though they're rated at 1200W, because they're so cheap you're better off running each at ~500W and buying several instead of overheating a single one. A nice side benefit of running them at lower wattage is that the very loud tiny fan doesn't have to spin as hard and create a ton of noise.
4. Not needed, but having a single cable might be convenient, they are pretty expensive though.
5. You don't need to do anything special here, except that if you add too many GPUs the motherboard may have trouble booting because the combined 75W-per-GPU slot draw is too much. Boards aimed at this (like the ROMED8-2T) usually have an extra GPU power connector, and some risers let you hook up the power cable directly to them so the PCIe slot is only used for data transfer.
6. It's not the outlet, it's the circuit that matters. And keep in mind that whatever power limit you set on the GPU, you need to account for AC/DC conversion loss, so add an extra ~10-15% to the wall usage.
If you power limit to 250W, each additional GPU adds roughly ~280W at the wall. If you plan on running 8 or more GPUs 24/7, you're better off calling a local colocation center and running them there: they have much cheaper electricity, it comes out cheaper for you, and you get all the benefits of being in a datacenter.
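The ~280W figure follows directly from adding conversion loss to the 250W power limit; a one-line check, assuming a PSU efficiency of around 88% (the exact figure varies with PSU and load):

```python
def wall_draw(dc_watts: float, efficiency: float = 0.88) -> float:
    """Watts drawn at the outlet for a given DC load (assumed PSU efficiency)."""
    return dc_watts / efficiency

print(round(wall_draw(250)))  # ~284 W at the wall per power-limited 4090
```

At 90% efficiency this drops to ~278W, so the "add ~10-15%" rule of thumb above brackets the realistic range.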
> 6. North American outlets are rated for ~15A * 120V. So roughly 1800W. I can just use one outlet per psu whenever it's under 1800W, right? For simplicity let's also ignore whatever load is on that particular electrical circuit.
You're going to have a bad time with this assumption: typical non-kitchen household circuits in the U.S. are 15A for the entire circuit. Each outlet is usually limited to 15A, but the breaker serving the whole circuit is almost certainly 15A as well, so one outlet at maximum load leaves no capacity for another outlet on the same circuit to simultaneously draw maximum amperage.
Typical residential construction has one 15A circuit per 1-2 rooms, often with a separate circuit for lighting. Some rooms, e.g. kitchens, will have 20A circuits, and some houses may have been built with 20A circuits serving more outlets/rooms.
So those miner motherboards with the crap ton of PCIe x1 slots typically have a molex connector on the motherboard for each of those slots. Molex is famous for starting fires. I’m not sure I would ever go with a setup with molex connectors, but then I’m not sure you have another option.
The issue is if they used PCIe power connectors instead, you often wouldn’t have enough of those left over for your GPU, so I get why they went with molex, it’s just a very old, and by modern standards crappy connector.
Combined with the ~1800W per 15A circuit restriction (I wouldn’t load the circuit to 100%, so really ~1600W) I’m not sure you can achieve what you’re going for.
If you’re really wanting to do this, consider adding a say 30A circuit near the circuit breaker of your home, usually the garage or basement and put the equipment there. I would get a dehumidifier in either location.
One, don't use a case. Look at how miners mounted their hardware on racks and take notes. Cheaper, better for temps, and the most efficient use of space.
Two, I recommend ignoring electricity cost and using all you can. If it's cheaper now than it ever will be, use it while it's cheap. If it will go down due to renewables, nuclear, etc in the future, it's good to buy up the GPUs while their price is artificially depressed from energy fears.
Three, go for server-type PSUs and breakout boards. Server PSUs can't be beaten in watts per dollar, and are extremely efficient.
Finally, consider scooping up some X79 and X99 Xeon boards from Chinese sellers. They're cheap as hell, have PCIe lanes out the wazoo, etc. This means you don't have to fool with as many mobos to run the same number of GPUs. If you go this route, don't get the bottom-of-the-barrel no-name motherboards; Machinist is a decent one.
I almost immediately became suspicious on the accuracy of this article when they said the "Nvidia RTX 40 Ampere series". Ampere was the architecture name for the RTX 30 series. Ada Lovelace is the architecture name for the RTX 40 series.
Any advice for mobile gpus? I'm interested in getting a laptop (preferably in the portable category). Obviously it's not going to be in 4090 territory, that's a tradeoff I'm willing to make.
Consider the 3090: same memory, but it was way cheaper than the 4090 when I was looking. It might be a good tradeoff if you don't really need the 40-series speed boost.
Yes, even the MI300 from AMD is data-center only, just like the A100 and H100.
I guess what Intel is missing is a competitive consumer GPU (Arc is far behind AMD and Nvidia cards), so it cannot establish a developer ecosystem, and its oneAPI is a hard sell for its AI plans.
Either make Arc (or whatever GPU) as good as Nvidia/AMD graphics cards, or at least ship plenty of capable AI compute accelerators to stay in the game, or there is no future for Intel in the AI era, sadly.
nVidia has a 20 GB GPU with the same chip as the 4070 Ti; the model is the RTX 4000 SFF.
One issue is price: it costs almost twice as much. Another is memory bandwidth: the RTX 4000 SFF only delivers 320 GB/second. That is much slower than the 4070 Ti (504 GB/second) and only slightly faster than the 4060 Ti (288 GB/second). The clock frequencies are also roughly half those of the 4070 Ti, so the compute performance is worse.
The power efficiency of the RTX 4000 is awesome, but it costs performance.
The 4070 Ti runs at a 2.3 GHz base clock / 2.6 GHz boost; the RTX 4000 SFF only runs at 1.3 GHz base / 1.6 GHz boost. For this reason, despite the chip being the same, the compute performance of the RTX 4000 is not particularly great: the 4070 Ti delivers up to 35.48 TFlops at base clock, the RTX 4000 only about 19 TFlops.
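Those TFlops figures follow from the usual peak-FP32 formula (cores × 2 ops/cycle for FMA × clock). A quick check, assuming the commonly published core counts — 7680 CUDA cores for the 4070 Ti, 6144 enabled on the RTX 4000 SFF (same die, fewer active units):

```python
def fp32_tflops(cuda_cores: int, clock_ghz: float) -> float:
    """Peak FP32 throughput: cores x 2 ops/cycle (fused multiply-add) x clock."""
    return cuda_cores * 2 * clock_ghz / 1000

print(round(fp32_tflops(7680, 2.31), 2))   # ~35.48 (4070 Ti at ~2.31 GHz base)
print(round(fp32_tflops(6144, 1.565), 2))  # ~19.23 (RTX 4000 SFF at boost)
```

This reproduces the 35.48 vs ~19 TFlops gap quoted above; the lower clock and the reduced active core count each cost the RTX 4000 SFF a chunk of throughput.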
Cloud GPU providers are running low on capacity at the moment as people frantically suck up capacity to hop on the AI bandwagon, raising worries about availability. So having guaranteed access is maybe one motivation for local GPUs. But for me the main reason to go local is more psychological. I've mostly used cloud compute up until now but whenever I'm paying an hourly cost (even a small one) there is a pressure to 'make it worthwhile' and I feel guilty when the GPU is sitting idle. This disincentivizes playing and experimentation, whereas when you can run things locally there is almost no friction for quickly trying something out.
Looking at the pricing, if you only spin those instances up when you need them, you can go a while before you break even. Otherwise it only takes a few months depending on the GPU.
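The break-even point is simple to estimate: hours of cloud rental whose cost equals the card's purchase price. A sketch with purely illustrative numbers (a used 3090 around $800 and a hypothetical ~$0.50/hr cloud GPU rate — actual prices vary a lot by provider and date):

```python
def break_even_hours(card_price: float, cloud_rate_per_hr: float) -> float:
    """Hours of cloud GPU rental that would cost as much as buying the card
    (ignores electricity, resale value, and price changes over time)."""
    return card_price / cloud_rate_per_hr

hours = break_even_hours(800, 0.50)
print(hours, "hours, i.e. about", round(hours / 24), "days of 24/7 use")
```

At those assumed prices that's 1600 hours: a couple of months of round-the-clock use, but years of occasional evening tinkering, which is why the "spin up only when needed" pattern pushes break-even out so far.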
I would imagine that someone really serious about training (or any other CUDA workload) uses both.
Having looked at the pricing of a retail card vs. cloud, I came to the conclusion that I could probably buy enough cloud compute to complete a PhD before I'd 'paid for' the cost of a 4090 build...