Nvidia could actually earn larger margins if inference compute requirements go down. It would let them manufacture more units of smaller GPUs using cheaper, lower-grade silicon, which would increase both the profit per unit sold and the number of units they can manufacture.
The industry has to find a way to separate itself from Nvidia's GPGPU technology if they want to stop being gouged. The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat.
>The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat
Google has, and they've built a much more cost-efficient (for them) system: the TPU. They even rent them out, and in terms of cost per unit of compute, TPUs are significantly cheaper than renting GPUs from the big cloud providers. Amazon has also tried to do something similar with Trainium chips, however their usefulness is more limited due to software issues (Amazon is much weaker at compiler development than Google, so Trainium software is quite slow and buggy).
You can do inference on almost any hardware; I do not see any edge for NVIDIA here.
I can download a DeepSeek 30B model and run inference at good speed on AMD GPUs and even on CPU. Apple silicon works fine too. I get >50 tokens/s on £300 AMD GPUs.
The main bottleneck appears to be memory, not processing power.
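A rough way to see that: in single-stream decoding, every model weight gets read from memory once per generated token, so memory bandwidth sets a hard ceiling on tokens/s. A quick sketch of the arithmetic (the model size and bandwidth figures are illustrative placeholders, not measurements of any particular card):

```python
# Back-of-the-envelope: single-stream decoding reads every weight once per
# generated token, so memory bandwidth caps the token rate.
# All numbers below are illustrative placeholders, not benchmarks.

def max_tokens_per_second(model_bytes_gb: float, mem_bandwidth_gb_s: float) -> float:
    """Rough ceiling on decode speed: bandwidth / bytes read per token."""
    return mem_bandwidth_gb_s / model_bytes_gb

# e.g. a ~30B model quantized to ~4 bits is roughly 15 GB of weights
print(max_tokens_per_second(15, 800))   # ~53 tok/s ceiling on a hypothetical 800 GB/s GPU
print(max_tokens_per_second(15, 100))   # ~6-7 tok/s ceiling on a hypothetical 100 GB/s CPU
```

Adding more compute doesn't move that ceiling; only more (or faster) memory does.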
1. The future of inference for ChatGPT-style direct consumer usage is on-device. Cloud-based inference is too gaping of a privacy hole in a world where some level of E2EE is rapidly becoming the default expectation for chat. It's not hard to imagine that the iPhone 50 may be able to comfortably run models that firmly surpass GPT-4o and o1. Similarly, for things like coding and any other creation of novel IP, there are obvious security benefits to keeping the inference local.
2. Going forward, the vast majority of inference will be performed by agents for process automation (both personal and business), rather than direct user interaction. For these use cases, centralized infrastructure will be the natural architecture. Even for cases where an end client device technically exists (e.g. Tesla-Optimus-style machines), there may be economy of scale advantages to offloading compute to the cloud.
In fact, I'm not sure how the "we will need tons of centralized inference infrastructure" argument works when Apple, with over 50% smartphone market share in the USA, has the totally opposite strategy focused on privacy: on-device inference.
Fundamentally it is more efficient to process a batch of tokens from multiple users/requests than processing them from a single user's request on device.
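A minimal sketch of why: the weight matrix gets read from memory once per layer no matter how many requests are in the batch, so the dominant memory traffic is amortized across users in a way a single on-device request can never match. (Sizes below are arbitrary placeholders.)

```python
import numpy as np

# Toy illustration: one linear layer applied to a batch of requests.
# The weights W are read once and reused for every row of the batch, which is
# why serving many users from shared hardware amortizes the memory traffic
# that dominates single-user, on-device decoding. Sizes are placeholders.
d_model, batch = 4096, 32
W = np.random.randn(d_model, d_model).astype(np.float32)        # layer weights
x_single = np.random.randn(1, d_model).astype(np.float32)       # one user's token
x_batch = np.random.randn(batch, d_model).astype(np.float32)    # 32 users' tokens

y_single = x_single @ W   # reads all of W to produce 1 output row
y_batch = x_batch @ W     # reads all of W once to produce 32 output rows

bytes_W = W.nbytes
print(f"weight bytes read per request, unbatched:      {bytes_W}")
print(f"weight bytes read per request, batch of {batch}: {bytes_W // batch}")
```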
Apple's strategy already failed. Their big bet on NPU hardware did not pay off at all, and right now it's effectively wasted silicon on every iDevice while the GPU does all the heavy inference work. Now they partner with OpenAI to handle their inference (and even that's not good enough in many cases[0]). The "centralized compute" lobby is being paid by Apple to do the work their devices cannot.
Until Apple or AMD unifies their GPU architectures and implements complex streaming multiprocessors, Nvidia will remain in a class of their own. Apple used to lead the charge on the foremost CUDA alternative too, but then they abandoned it to focus on proprietary standards instead. It's pretty easy to argue that Apple shot themselves in the foot with every opportunity they had to compete in good faith. And make no mistake: Apple could have competed with Nvidia if they weren't so stubborn about Linux support and putting smartphone GPUs in laptops and desktops.
People talk about Groq and Cerebras as competitors, but it seems to me their manufacturing process makes the availability of those chips extremely limited. You can call up Nvidia, order $10B worth of GPUs, and have them delivered the next week. You can't say the same for these specialty competitors.
> You can call up Nvidia and order $10B worth of GPUs and have them delivered the next week
Nvidia sold $14.5 billion of datacenter hardware in the third quarter of their fiscal 2024, and that led to severe supply constraints, with estimated lead times for H100s of up to 52 weeks in some places. So no, you can't; that $14.5 billion was clearly capped by their ability to supply, not by demand.
You're right, though, that Groq etc. can't deliver anywhere near the same volume now, but there's little reason to believe that will continue. There's no need for full GPUs for inference-only workloads, so competitors can enter the space with a tiny proportion of the functionality.
Their architecture means you buy them by the rack. Individual chips are useless, the magic happens when you set them up so each chip handles a subset of the model.
IOW, do you think Groq's 70B models run on 230MB of SRAM?
I didn't say the model is going to run on one chip, of course. A 70B model needs ~300 chips (weights only, at fp8, just like they run it; KV cache not included), and 670B would need ~3,000 chips. Racks or not, it's very hard to set up such a cluster for one model. There are reasons they still don't have the Llama 405B model.
The "reasons" are most likely that it's not cost-effective: at this point it's effectively a tech demo, and it only becomes cheap to run if you're actually going to use a decent portion of the capacity for a single model.
How many servers in one rack? Let's say 42. How many chips in one server? Let's say 8. That's 336 cards per rack - enough for the fp8 70B model's weights (and maybe the KV cache, if your requests aren't too long, but probably not really). You need 10 (!) racks to serve one (!) DeepSeek model's weights. There is also a massive amount of complexity that arises from operating so many nodes.
During the short time Groq hardware appeared on the market, it was going for $20K per card. That's $60 million (!) per one DeepSeek model. You need an absolutely crazy amount of load to justify those costs, and most likely you will need a massive number of additional nodes to handle the KV cache for those requests.
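Putting this thread's own figures together (230MB of SRAM per chip, 8 chips per server, 42 servers per rack, ~$20K per card - all rough numbers quoted above, not vendor specs):

```python
import math

# Back-of-the-envelope using the figures quoted in this thread;
# treat all of them as rough estimates, not vendor specifications.
SRAM_PER_CHIP_GB = 0.23       # ~230 MB of on-chip SRAM
CHIPS_PER_RACK = 42 * 8       # 42 servers/rack * 8 chips/server = 336
PRICE_PER_CHIP_USD = 20_000   # price briefly seen on the market, per the comment above

def chips_for_weights(params_billion: float, bytes_per_param: float = 1.0) -> int:
    """Chips needed just to hold the weights (fp8 ~= 1 byte/param);
    KV cache and activations would need more on top of this."""
    return math.ceil(params_billion * bytes_per_param / SRAM_PER_CHIP_GB)

for name, params in [("Llama 70B", 70), ("DeepSeek ~670B", 670)]:
    chips = chips_for_weights(params)
    racks = math.ceil(chips / CHIPS_PER_RACK)
    cost = chips * PRICE_PER_CHIP_USD
    print(f"{name}: ~{chips} chips, ~{racks} racks, ~${cost/1e6:.0f}M in cards")
```

Which lands right around the ~300 chips / ~3,000 chips / ~10 racks / ~$60M figures above.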
Yes, you need a crazy amount of load for them to make sense. But when you're seeing providers build out whole data centres at a cost of billions, there you have their market.
This is a market where several large Nvidia customers are designing their own chips (e.g. Meta, Amazon, Google) because they're at a scale where it makes sense to try.
Whether it's a market that lets Groq be successful remains to be seen.
No idea about Groq, but Cerebras might give you a similar timeline to Nvidia. Each of their wafers is worth 50x-100x H100s, so they need to make fewer of them in absolute units.
But for cooling, power, etc., Nvidia might have an advantage, as their ecosystem is huge and more "liquid" in a sense.
Nvidia (NVDA) generates revenue with hardware, but digs moats with software.
The CUDA moat is widely unappreciated and misunderstood. Dethroning Nvidia demands more than SOTA hardware.
OpenAI, Meta, Google, AWS, AMD, and others have long failed to eliminate the Nvidia tax.
Without diving into the gory details, the simple proof is that billions were spent on inference last year by some of the most sophisticated technology companies in the world.
They had the talent and the incentive to migrate, but didn't.
In particular, OpenAI spent $4 billion on inference, 33% more than on training, yet still ran on NVDA. Google owns leading chips and leading models, and could offer the tech talent to facilitate migrations, yet still cannot cross the CUDA moat and convince many inference customers to switch.
People are desperate to quit their NVDA-tine addiction, but they can't for now.
[Edited to include Google, even though Google owns the chips and the models; h/t @onlyrealcuzzo]
The CUDA moat is largely irrelevant for inference. The code needed for inference is small enough that there are, e.g., bare-metal CPU-only implementations. That isn't what's limiting people from moving fully off Nvidia for inference. And you'll note almost "everyone" in this game is in the process of developing their own chips.
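To make that concrete, here's a toy decode step in plain NumPy: a single attention head with its projections, which is roughly the kind of dense math the inference hot path boils down to. Shapes and weights are made up for illustration; it's a sketch of the idea, not any real model's code:

```python
import numpy as np

# Toy single-head attention decode step in plain NumPy, to illustrate that
# the inference hot path is a few matmuls plus elementwise ops: nothing here
# is tied to CUDA. Shapes and weights are placeholders, not a real model.
d = 256
rng = np.random.default_rng(0)
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)).astype(np.float32) * 0.02 for _ in range(4))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_step(x_new, k_cache, v_cache):
    """One decode step: attend from the newest token over the cached keys/values."""
    q = x_new @ Wq
    k_cache = np.vstack([k_cache, x_new @ Wk])
    v_cache = np.vstack([v_cache, x_new @ Wv])
    scores = softmax((q @ k_cache.T) / np.sqrt(d))
    return (scores @ v_cache) @ Wo, k_cache, v_cache

x = rng.standard_normal((1, d)).astype(np.float32)
k_cache = np.empty((0, d), dtype=np.float32)
v_cache = np.empty((0, d), dtype=np.float32)
out, k_cache, v_cache = attention_step(x, k_cache, v_cache)
print(out.shape, k_cache.shape)  # (1, 256) (1, 256)
```

Swap the matmul backend and it runs on a CPU, an AMD GPU, or anything else with a BLAS; the lock-in is in training and performance tuning, not in this code.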
My company recently switched from A100s to MI300s. I can confidently say that in my line of work, there is no CUDA moat. Onboarding took about a month, but afterwards everything was fine.
Alternatives exist, especially for mature and simple models. The point isn't that Nvidia has 100% market share, but rather that they command the most lucrative segment and none of these big spenders have found a way to quit their Nvidia addiction, despite concerted efforts to do so.
For instance, we experimented with AWS Inferentia briefly, but the value prop wasn't sufficient even for ~2022 computer vision models.
The calculus is even worse for SOTA LLMs.
The more you need to eke out performance gains and ship quickly, the more you depend on CUDA and the deeper the moat becomes.
Google was omitted because they own the hardware and the models, but in retrospect, they represent a proof point nearly as compelling as OpenAI. Thanks for the comment.
Google has leading models operating on leading hardware, backed by sophisticated tech talent who could facilitate migrations, yet Google still cannot leap over the CUDA moat and capture meaningful inference market share.
Yes, training plays a crucial role. This is where companies get shoehorned into the CUDA ecosystem, but if CUDA were not so intertwined with performance and reliability, customers could theoretically switch after training.
Both matter quite a bit. The first-mover advantage obviously rewards OEMs in a first-come, first-serve order, but CUDA itself isn't some light switch that OEMs can flick and get working overnight. Everyone would do it if it was easy, and even Google is struggling to find buy-in for their TPU pods and frameworks.
Short-term value has been dependent on how well Nvidia has responded to burgeoning demands. Long-term value is going to be predicated on the number of Nvidia alternatives that exist, and right now the number is still zero.
It's unclear why this drew downvotes, but to reiterate, the comment merely highlights historical facts about the CUDA moat and deliberately refrains from assertions about NVDA's long-term prospects or that the CUDA moat is unbreachable.
With mature models and minimal CUDA dependencies, migration can be justified, but this does not describe most of the LLM inference market today nor in the past.