I think there are probably law firms/doctors' offices that would gladly pay ~3-4K euro a month to have this thing delivered and run truly "on-prem" to work with documents they can't risk leaking (patent filings, patient records, etc.).
For a company with 20-30 people, the legal and privacy protection is worth the small premium over using cloud providers.
Just a hunch though! That would have it paid off in 3-4 months?
Fair points, but the deal is still great because of the nuances of the RAM/VRAM.
The Blackwells are superior on paper, but there's some "Nvidia math" involved: when they report performance in press announcements, they don't usually mention the precision. Yes, the Blackwells are more than double the speed of the Hopper H100s, but that's comparing Blackwell FP4 against Hopper FP8 (the H100s can't do native FP4). That's great for certain workloads, but not the majority.
What's more interesting is the VRAM speed. The 6000 Pro has 96 GB of GPU memory and 1.8 TB/s bandwidth; the H100 has the same amount, but with HBM3 at 4.9 TB/s. That roughly 2.7x increase is very influential in the overall performance of the system.
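To put rough numbers on that: token generation is usually memory-bandwidth-bound, since each new token has to stream the active weights out of VRAM. A back-of-envelope sketch, where the 60 GB working-set figure is just an assumed example for a big quantized model, not a measurement:

    # Bandwidth-bound decode ceiling: tokens/s ~= bandwidth / bytes streamed per token.
    # The 60 GB working set is an assumed example, not a measured number.
    ASSUMED_ACTIVE_WEIGHTS_GB = 60.0
    for name, bandwidth_tb_s in [("6000 Pro (GDDR7)", 1.8), ("H100 (HBM3)", 4.9)]:
        tokens_per_s = bandwidth_tb_s * 1000.0 / ASSUMED_ACTIVE_WEIGHTS_GB
        print(f"{name}: ~{tokens_per_s:.0f} tok/s ceiling")

Same hypothetical model, ~2.7x the ceiling, purely from memory speed.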
Lastly, if it works, NVLink does 900 GB/s of bandwidth between the cards, about 5x what a pair of 6000 Pros could manage over PCIe 5.0. Big LLMs need well over the 96 GB on a single card, so this becomes the bottleneck.
The perf delta is smaller than I thought it'd be given the memory bandwidth difference. I guess it likely comes from Blackwell having native MXFP4, since GPT-OSS-120b has MXFP4 MoE layers.
The NVLink is definitely a strong point; I missed that detail. For LLM inference specifically it matters fairly little IIRC, but for training it might.
I'm downloading DeepSeek-V3.2-Speciale now at FP8 (reportedly Gold-medal performance in the 2025 International Mathematical Olympiad and International Olympiad in Informatics).
It will fit in system RAM, and since it's a mixture-of-experts model and the experts are not too large, I can at least run it. Tokens/second will be slower, but as system memory bandwidth is somewhere around 500-600 GB/s, it should feel OK.
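Quick sanity check on "should feel OK", with hedged numbers: if this checkpoint routes roughly like the other V3-family models (~37B active parameters per token, which I'm assuming rather than quoting from a spec sheet) and the weights are FP8, the bandwidth-bound ceiling looks like:

    # MoE decode from system RAM: bandwidth-bound estimate.
    # Assumptions, not verified for this exact checkpoint:
    #   ~37e9 active params/token (V3-family routing), FP8 (1 byte/param), ~550 GB/s RAM bandwidth.
    active_params = 37e9
    bytes_per_param = 1.0
    bandwidth_gb_s = 550.0
    gb_per_token = active_params * bytes_per_param / 1e9
    print(f"~{gb_per_token:.0f} GB/token -> ~{bandwidth_gb_s / gb_per_token:.1f} tok/s ceiling")

Low teens of tokens/second as an upper bound, so "usable but not snappy" sounds about right, and whatever you offload to VRAM only helps.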
Check out "--n-cpu-moe" in llama.cpp if you're not familiar. It lets you keep the MoE expert weights for a chosen number of layers in system memory while everything else (including the context cache and the parts of the model that every token touches) stays in VRAM. You can do something like "-c 131072 -ngl 99 --n-cpu-moe <tuned_amt>", where you tune the number down until you maximize VRAM usage without OOMing. See the sketch below for a rough way to pick a starting value.
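A minimal sketch for picking that starting value, assuming you can estimate per-layer expert size, the always-active weights, and your KV-cache budget from the GGUF; every number here is a made-up placeholder rather than a measurement of this model:

    # Rough starting point for --n-cpu-moe: how many layers' worth of expert weights
    # have to stay in system RAM so everything else fits in VRAM.
    # All sizes are hypothetical placeholders -- plug in your own model's numbers.
    vram_gb             = 96.0  # available VRAM budget
    dense_weights_gb    = 18.0  # attention/shared weights every token touches (assumed)
    kv_cache_gb         = 20.0  # KV cache at your chosen context length (assumed)
    overhead_gb         = 4.0   # CUDA context, activations, headroom
    expert_gb_per_layer = 9.0   # MoE expert weights per layer (assumed)
    n_moe_layers        = 61    # MoE layer count (assumed)

    free_for_experts = vram_gb - dense_weights_gb - kv_cache_gb - overhead_gb
    layers_on_gpu = max(0, min(n_moe_layers, int(free_for_experts // expert_gb_per_layer)))
    print(f"try: --n-cpu-moe {n_moe_layers - layers_on_gpu}")

In practice people just start with the value high (nearly everything on CPU) and walk it down until llama.cpp OOMs, then back off a notch.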
It was fun when the seller told me to come and look in the back of his dirty white van, because "the servers are in here". This was before I had seen the workshop etc.
Oh no, that's not right. 20 kg was in the original server case. With the aluminium frames and glass panel, it's more like 40 kg now... Shit, maybe I should take it off the Lack table...