Nvidia could actually earn larger margins if inference compute requirements go down. It would let them manufacture more units of smaller GPUs using cheaper, lower-grade silicon, which would increase both the profit per unit sold and the number of units they can manufacture.
The industry has to find a way to separate itself from Nvidia's GPGPU technology if they want to stop being gouged. The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat.
>The issue is that nobody, not Apple, not AMD, not Intel, has been treating Nvidia's hardware as a serious threat
Google has, and they've built a much more cost-efficient (for them) system: the TPU. They even rent them out, and in terms of cost per unit of compute, TPUs are significantly cheaper than renting GPUs from the big cloud providers. Amazon has also tried to do something similar with Trainium chips, however their usefulness is more limited due to software issues (Amazon is much weaker at compiler development than Google, so Trainium software is quite slow and buggy).
You can do inference on almost any hardware; I do not see any edge for NVIDIA here.
I can download a DeepSeek 30B model and run inference at good speed on AMD GPUs and even on CPU. Apple silicon works fine too. I get >50 tokens/s on £300 AMD GPUs.
The main bottleneck appears to be memory, not processing power.
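A rough way to see that: in single-stream decoding, every model weight gets read from memory once per generated token, so memory bandwidth sets a hard ceiling on tokens/s. A quick sketch of the arithmetic (the model size and bandwidth figures are illustrative placeholders, not measurements of any particular card):

```python
# Back-of-the-envelope: single-stream decoding reads every weight once per
# generated token, so memory bandwidth caps the token rate.
# All numbers below are illustrative placeholders, not benchmarks.

def max_tokens_per_second(model_bytes_gb: float, mem_bandwidth_gb_s: float) -> float:
    """Rough ceiling on decode speed: bandwidth / bytes read per token."""
    return mem_bandwidth_gb_s / model_bytes_gb

# e.g. a ~30B model quantized to ~4 bits is roughly 15 GB of weights
print(max_tokens_per_second(15, 800))   # ~53 tok/s ceiling on a hypothetical 800 GB/s GPU
print(max_tokens_per_second(15, 100))   # ~6-7 tok/s ceiling on a hypothetical 100 GB/s CPU
```

Adding more compute doesn't move that ceiling; only more (or faster) memory does.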
1. The future of inference for ChatGPT-style direct consumer usage is on-device. Cloud-based inference is too gaping of a privacy hole in a world where some level of E2EE is rapidly becoming the default expectation for chat. It's not hard to imagine that the iPhone 50 may be able to comfortably run models that firmly surpass GPT-4o and o1. Similarly, for things like coding and any other creation of novel IP, there are obvious security benefits to keeping the inference local.
2. Going forward, the vast majority of inference will be performed by agents for process automation (both personal and business), rather than direct user interaction. For these use cases, centralized infrastructure will be the natural architecture. Even for cases where an end client device technically exists (e.g. Tesla-Optimus-style machines), there may be economy of scale advantages to offloading compute to the cloud.
In fact, I'm not sure how the "we will need tons of centralized inference infrastructure" argument works when Apple, with over 50% smartphone market share in the USA, has the totally opposite strategy focused on privacy: on-device inference.
Fundamentally it is more efficient to process a batch of tokens from multiple users/requests than processing them from a single user's request on device.
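A minimal sketch of why: the weight matrix gets read from memory once per layer no matter how many requests are in the batch, so the dominant memory traffic is amortized across users in a way a single on-device request can never match. (Sizes below are arbitrary placeholders.)

```python
import numpy as np

# Toy illustration: one linear layer applied to a batch of requests.
# The weights W are read once and reused for every row of the batch, which is
# why serving many users from shared hardware amortizes the memory traffic
# that dominates single-user, on-device decoding. Sizes are placeholders.
d_model, batch = 4096, 32
W = np.random.randn(d_model, d_model).astype(np.float32)        # layer weights
x_single = np.random.randn(1, d_model).astype(np.float32)       # one user's token
x_batch = np.random.randn(batch, d_model).astype(np.float32)    # 32 users' tokens

y_single = x_single @ W   # reads all of W to produce 1 output row
y_batch = x_batch @ W     # reads all of W once to produce 32 output rows

bytes_W = W.nbytes
print(f"weight bytes read per request, unbatched:      {bytes_W}")
print(f"weight bytes read per request, batch of {batch}: {bytes_W // batch}")
```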
Apple's strategy already failed. Their big bet on NPU hardware did not pay off at all, and right now it's effectively wasted silicon on every iDevice while the GPU does all the heavy inference work. Now they partner with OpenAI to handle their inference (and even that's not good enough in many cases[0]). The "centralized compute" lobby is being paid by Apple to do the work their devices cannot.
Until Apple or AMD unifies their GPU architectures and implements complex streaming multiprocessors, Nvidia will remain in a class of their own. Apple used to lead the charge on the foremost CUDA alternative too, but then they abandoned it to focus on proprietary standards instead. It's pretty easy to argue that Apple shot themselves in the foot with every opportunity they had to compete in good faith. And make no mistake: Apple could have competed with Nvidia if they weren't so stubborn about Linux support and putting smartphone GPUs in laptops and desktops.
People talk about Groq and Cerebras as competitors, but it seems to me their manufacturing process makes the availability of those chips extremely limited. You can call up Nvidia, order $10B worth of GPUs, and have them delivered the next week. You can't say the same for these specialty competitors.
> You can call up Nvidia and order $10B worth of GPUs and have them delivered the next week
Nvidia sold $14.5 billion of datacenter hardware in the third quarter of their fiscal 2024, and that led to severe supply constraints, with estimated lead times for H100s of up to 52 weeks in some places. So no, you can't; that $14.5 billion was clearly capped by their ability to supply, not by demand.
You're right, though, that Groq etc. can't deliver anywhere near the same volume now, but there's little reason to believe that will continue. There's no need for full GPUs for inference-only workloads, so competitors can enter the space with a tiny proportion of the functionality.
Their architecture means you buy them by the rack. Individual chips are useless, the magic happens when you set them up so each chip handles a subset of the model.
IOW, do you think Groq's 70B models run on 230MB of SRAM?
I didn't say the model is going to run on one chip, of course. A 70B model needs ~300 chips (weights only, at fp8, just like they run it; KV cache not included), and 670B would need ~3,000 chips. Racks or not, it's very hard to set up such a cluster for one model. There are reasons they still don't have the Llama 405B model.
The "reasons" are most likely that it's not cost-effective: at this point it's effectively a tech demo, and it only becomes cheap to run if you're actually going to use a decent portion of the capacity for a single model.
How many servers in one rack? Let's say 42. How many chips in one server? Let's say 8. That's 336 cards per rack - enough for the fp8 70B model's weights (and maybe the KV cache, if your requests aren't too long, but probably not really). You need 10 (!) racks to serve one (!) DeepSeek model's weights. There is also a massive amount of complexity that arises from operating so many nodes.
During the short time Groq hardware appeared on the market, it was going for $20K per card. That's $60 million (!) per one DeepSeek model. You need an absolutely crazy amount of load to justify those costs, and most likely you will need a massive number of additional nodes to handle the KV cache for those requests.
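Putting this thread's own figures together (230MB of SRAM per chip, 8 chips per server, 42 servers per rack, ~$20K per card - all rough numbers quoted above, not vendor specs):

```python
import math

# Back-of-the-envelope using the figures quoted in this thread;
# treat all of them as rough estimates, not vendor specifications.
SRAM_PER_CHIP_GB = 0.23       # ~230 MB of on-chip SRAM
CHIPS_PER_RACK = 42 * 8       # 42 servers/rack * 8 chips/server = 336
PRICE_PER_CHIP_USD = 20_000   # price briefly seen on the market, per the comment above

def chips_for_weights(params_billion: float, bytes_per_param: float = 1.0) -> int:
    """Chips needed just to hold the weights (fp8 ~= 1 byte/param);
    KV cache and activations would need more on top of this."""
    return math.ceil(params_billion * bytes_per_param / SRAM_PER_CHIP_GB)

for name, params in [("Llama 70B", 70), ("DeepSeek ~670B", 670)]:
    chips = chips_for_weights(params)
    racks = math.ceil(chips / CHIPS_PER_RACK)
    cost = chips * PRICE_PER_CHIP_USD
    print(f"{name}: ~{chips} chips, ~{racks} racks, ~${cost/1e6:.0f}M in cards")
```

Which lands right around the ~300 chips / ~3,000 chips / ~10 racks / ~$60M figures above.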
Yes, you need a crazy amount of load for them to make sense. But when you're seeing providers build out whole data centres at a cost of billions, there you have their market.
This is a market where several large Nvidia customers are designing their own chips (e.g. Meta, Amazon, Google) because they're at a scale where it makes sense to try.
Whether it's a market that lets Groq be successful remains to be seen.
No idea about Groq, but Cerebras might give you a similar timeline to Nvidia. Each of their wafers is worth 50x-100x H100s, so they need to make fewer of them in absolute units.
But for cooling, power, etc., Nvidia might have an advantage, as their ecosystem is huge and more "liquid" in a sense.
Nvidia (NVDA) generates revenue with hardware, but digs moats with software.
The CUDA moat is widely unappreciated and misunderstood. Dethroning Nvidia demands more than SOTA hardware.
OpenAI, Meta, Google, AWS, AMD, and others have long failed to eliminate the Nvidia tax.
Without diving into the gory details, the simple proof is that billions were spent on inference last year by some of the most sophisticated technology companies in the world.
They had the talent and the incentive to migrate, but didn't.
In particular, OpenAI spent $4 billion on inference, 33% more than on training, yet still ran on NVDA. Google owns leading chips and leading models, and could offer the tech talent to facilitate migrations, yet still cannot cross the CUDA moat and convince many inference customers to switch.
People are desperate to quit their NVDA-tine addiction, but they can't for now.
[Edited to include Google, even though Google owns the chips and the models; h/t @onlyrealcuzzo]
The CUDA moat is largely irrelevant for inference. The code needed for inference is small enough that there are, e.g., bare-metal CPU-only implementations. That isn't what's limiting people from moving fully off Nvidia for inference. And you'll note almost "everyone" in this game is in the process of developing their own chips.
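To make that concrete, here's a toy decode step in plain NumPy: a single attention head with its projections, which is roughly the kind of dense math the inference hot path boils down to. Shapes and weights are made up for illustration; it's a sketch of the idea, not any real model's code:

```python
import numpy as np

# Toy single-head attention decode step in plain NumPy, to illustrate that
# the inference hot path is a few matmuls plus elementwise ops: nothing here
# is tied to CUDA. Shapes and weights are placeholders, not a real model.
d = 256
rng = np.random.default_rng(0)
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)).astype(np.float32) * 0.02 for _ in range(4))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_step(x_new, k_cache, v_cache):
    """One decode step: attend from the newest token over the cached keys/values."""
    q = x_new @ Wq
    k_cache = np.vstack([k_cache, x_new @ Wk])
    v_cache = np.vstack([v_cache, x_new @ Wv])
    scores = softmax((q @ k_cache.T) / np.sqrt(d))
    return (scores @ v_cache) @ Wo, k_cache, v_cache

x = rng.standard_normal((1, d)).astype(np.float32)
k_cache = np.empty((0, d), dtype=np.float32)
v_cache = np.empty((0, d), dtype=np.float32)
out, k_cache, v_cache = attention_step(x, k_cache, v_cache)
print(out.shape, k_cache.shape)  # (1, 256) (1, 256)
```

Swap the matmul backend and it runs on a CPU, an AMD GPU, or anything else with a BLAS; the lock-in is in training and performance tuning, not in this code.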
My company recently switched from A100s to MI300s. I can confidently say that in my line of work, there is no CUDA moat. Onboarding took about a month, but afterwards everything was fine.
Alternatives exist, especially for mature and simple models. The point isn't that Nvidia has 100% market share, but rather that they command the most lucrative segment and none of these big spenders have found a way to quit their Nvidia addiction, despite concerted efforts to do so.
For instance, we experimented with AWS Inferentia briefly, but the value prop wasn't sufficient even for ~2022 computer vision models.
The calculus is even worse for SOTA LLMs.
The more you need to eke out performance gains and ship quickly, the more you depend on CUDA and the deeper the moat becomes.
Google was omitted because they own the hardware and the models, but in retrospect, they represent a proof point nearly as compelling as OpenAI. Thanks for the comment.
Google has leading models operating on leading hardware, backed by sophisticated tech talent who could facilitate migrations, yet Google still cannot leap over the CUDA moat and capture meaningful inference market share.
Yes, training plays a crucial role. This is where companies get shoehorned into the CUDA ecosystem, but if CUDA were not so intertwined with performance and reliability, customers could theoretically switch after training.
Both matter quite a bit. The first-mover advantage obviously rewards OEMs in a first-come, first-serve order, but CUDA itself isn't some light switch that OEMs can flick and get working overnight. Everyone would do it if it was easy, and even Google is struggling to find buy-in for their TPU pods and frameworks.
Short-term value has been dependent on how well Nvidia has responded to burgeoning demands. Long-term value is going to be predicated on the number of Nvidia alternatives that exist, and right now the number is still zero.
It's unclear why this drew downvotes, but to reiterate, the comment merely highlights historical facts about the CUDA moat and deliberately refrains from assertions about NVDA's long-term prospects or that the CUDA moat is unbreachable.
With mature models and minimal CUDA dependencies, migration can be justified, but this does not describe most of the LLM inference market today nor in the past.