I didn’t say the model is going to run on one chip, of course. A 70B model needs ~300 chips (for the weights alone, in fp8, as they run it; KV cache not included), and a 670B model would need ~3000 chips. Racks or not, it’s very hard to set up a cluster like that for a single model. There are reasons they still don’t serve a Llama 405B model.
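For anyone who wants to sanity-check that figure, here’s a rough sketch of the arithmetic, assuming ~230 MB of on-chip SRAM per Groq LPU (the commonly cited spec) and fp8 weights at one byte per parameter; the exact numbers will shift with the real hardware details:

```python
import math

# Rough sketch of the chip-count arithmetic. Assumption: ~230 MB of on-chip
# SRAM per Groq LPU (the commonly cited spec), weights in fp8 (1 byte/param),
# no KV cache or activations counted.
SRAM_PER_CHIP_GB = 0.230

def chips_for_weights(params_billion: float, bytes_per_param: float = 1.0) -> int:
    """Chips needed just to hold the model weights in on-chip SRAM."""
    weight_gb = params_billion * bytes_per_param
    return math.ceil(weight_gb / SRAM_PER_CHIP_GB)

print(chips_for_weights(70))   # ~305 chips for a 70B model
print(chips_for_weights(670))  # ~2914 chips for a ~670B model
```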
The "reasons" are most likely because it's not cost-effective as what is effective at this point a tech demo, that first becomes cheap to run if you're actually going to use a decent portion of the capacity for a single model.
How many servers in one rack? Let's say 42. How many chips in one server? Let's say 8. That's 336 cards per rack, enough for the fp8 weights of a 70B model (and maybe the KV cache, if your requests aren't too long, but probably not). You need 10 (!) racks just to hold the weights of one (!) DeepSeek model. And a massive amount of complexity arises from operating that many nodes.
During the short period when Groq hardware was available on the market, it cost about $20K per card. That's $60M (!) for one DeepSeek model. You need an absolutely crazy amount of load to justify those costs, and, most likely, a massive number of additional nodes to handle the KV cache for those requests.
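A quick back-of-envelope along the same lines, using the illustrative numbers above (8 chips per server, 42 servers per rack, $20K per card — all assumptions from this thread, not official figures):

```python
import math

# Back-of-envelope rack count and hardware cost, using the illustrative numbers
# from the comment above: 8 chips/server, 42 servers/rack, $20K/card.
CHIPS_PER_SERVER = 8
SERVERS_PER_RACK = 42
CARD_PRICE_USD = 20_000

chips_needed = 2914  # fp8 weights of a ~670B model, from the earlier estimate

chips_per_rack = CHIPS_PER_SERVER * SERVERS_PER_RACK   # 336
racks = math.ceil(chips_needed / chips_per_rack)       # 9 racks for weights alone
hardware_cost = chips_needed * CARD_PRICE_USD          # ~$58M, i.e. on the order of $60M

print(f"{racks} racks, ${hardware_cost / 1e6:.0f}M in cards")
# Any KV-cache headroom pushes this toward ~10 racks, as noted above.
```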
Yes, you need a crazy amount of load for them to make sense. But when you see providers building out whole data centres at a cost of billions, that's their market.
This is a market where several large Nvidia customers are designing their own chips (e.g. Meta, Amazon, Google) because they're at a scale where it makes sense to try.
Whether it's a market that lets Groq be successful remains to be seen.