
She even read the article! (she did not find it funny though ;) )


No, it was done over the course of weeks, and I'm not motivated enough to do the production work required for good quality videos.


Running LLMs directly might not be effective.

I think there are probably law firms/doctors' offices that would gladly pay ~3-4K euro a month to have this thing delivered and run truly "on-prem" to work with documents they can't risk leaking (patent filings, patient records, etc.).

For a company with 20-30 people, the legal and privacy protection is worth the small premium over using cloud providers.

Just a hunch though! At that rate it would pay for itself in 3-4 months?


Fair points, but the deal is still great because of the nuances of the RAM/VRAM.

The Blackwells are superior on paper, but there's some "Nvidia math" involved: when they report performance in press announcements, they don't usually mention the precision. Yes, the Blackwells are more than double the speed of the Hopper H100s, but that's comparing FP8 to FP4 (the H100s can't do native FP4). Yes, that's great for certain workloads, but not the majority.

What's more interesting is the VRAM speed. The 6000 Pro has 96 GB of GPU memory at 1.8 TB/s bandwidth; the H100 has the same amount, but with HBM3 at 4.9 TB/s. That ~2.7x increase is very influential in the overall performance of the system.

Lastly, if it works, the NVLink-C2C does 900 GB/s of bandwidth between the cards, so about 5x what a pair of 6000 Pros could do over PCIe 5.0. Big LLMs need well over the 96 GB on a single card, so this becomes the bottleneck.

E.g. here are benchmarks on the RTX 6000 Pro using the GPT-OSS-120B model, where it generates 145 tokens/sec, while I get 195 tokens/sec on the GH200: https://www.reddit.com/r/LocalLLaMA/comments/1mm7azs/openai_...
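
As a sanity check on how much that bandwidth gap can matter, here's a back-of-the-envelope sketch. The active-parameter count and bytes-per-parameter figure for GPT-OSS-120B are my assumptions, not measured values:

    # Bandwidth-bound ceiling for single-stream decode:
    #   tokens/s <= memory bandwidth / bytes streamed per token.
    GB = 1e9

    def ceiling_tok_s(bandwidth_gb_s, active_params_billion, bytes_per_param):
        """Upper bound if all active weights must be read once per token."""
        bytes_per_token = active_params_billion * 1e9 * bytes_per_param
        return bandwidth_gb_s * GB / bytes_per_token

    # Assumption: GPT-OSS-120B activates ~5B params/token at ~0.5 bytes/param (MXFP4).
    for name, bw_gb_s in [("RTX 6000 Pro, 1.8 TB/s", 1800),
                          ("GH200 HBM3, 4.9 TB/s", 4900)]:
        print(f"{name}: ceiling ~{ceiling_tok_s(bw_gb_s, 5, 0.5):.0f} tok/s")

The measured numbers above sit well below both ceilings, so raw bandwidth isn't the only factor at this model size.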


The perf delta is smaller than I thought it'd be given the memory bandwidth difference. I guess it likely comes from the Blackwell having native MXFP4, since GPT-OSS-120B has MXFP4 MoE layers.

The NVLink is definitely a strong point, I missed that detail. For LLM inference specifically it matters fairly little iirc, but for training it might.


I'm downloading DeepSeek-V3.2-Speciale now at FP8 (reportedly Gold-medal performance in the 2025 International Mathematical Olympiad and International Olympiad in Informatics).

It will fit in system RAM, and as it's a mixture-of-experts model and the experts are not too large, I can at least run it. Tokens-per-second will be slower, but since system memory bandwidth is somewhere around 500-600 GB/s, it should feel OK.
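
As a rough sanity check (the active-parameter count below is my assumption, not a published figure for V3.2), decode speed in a bandwidth-limited setup is roughly bounded by bandwidth divided by the active weights streamed per token:

    # Back-of-envelope ceiling for MoE decode from system RAM.
    # Assumed figures: ~37B active params/token (DeepSeek-V3-class MoE),
    # 1 byte/param at FP8, ~550 GB/s LPDDR5X bandwidth (per the estimate above).
    active_params = 37e9
    bytes_per_param = 1.0
    bandwidth_bytes_per_s = 550e9
    print(bandwidth_bytes_per_s / (active_params * bytes_per_param))  # ~15 tok/s
    # Real throughput will land below this once KV-cache reads, attention in
    # higher precision, and CPU<->GPU traffic are counted.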


Check out "--n-cpu-moe" in llama.cpp if you're not familiar. That allows you to force a certain number of experts to be kept in system memory while everything else (including context cache and the parts of the model that every token touches) is kept in VRAM. You can do something like "-c128k -ngl 99 --n-cpu-moe <tuned_amt>" where you find a number that allows you to maximize VRAM usage without OOMing.
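
If you want to automate that tuning, a quick sketch along these lines works. The binary name, model path, context size, and step size are placeholders, and it assumes a llama.cpp build recent enough to have --n-cpu-moe:

    # Hypothetical helper: find the smallest --n-cpu-moe value that still loads,
    # i.e. push as few expert layers as possible out to system RAM.
    import subprocess

    LLAMA_CLI = "./llama-cli"   # placeholder path to the llama.cpp CLI binary
    MODEL = "model.gguf"        # placeholder model path

    def fits(n_cpu_moe):
        """Run a tiny generation; a non-zero exit usually means it didn't fit."""
        cmd = [LLAMA_CLI, "-m", MODEL, "-c", "8192", "-ngl", "99",
               "--n-cpu-moe", str(n_cpu_moe), "-n", "8", "-p", "hi"]
        return subprocess.run(cmd, capture_output=True).returncode == 0

    n = 0
    while n < 256 and not fits(n):  # move experts to CPU in small steps until it loads
        n += 4
    print("use --n-cpu-moe", n)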


This is hard to say for sure.

I had 4x 4090s that I'd bought for about $2200 each in early 2023. I sold 3 of them to help pay for the GH200 and got $2K each.


Correct! I added an Nvidia T400 to the rig recently, as it gives me 4x DisplayPorts, and a whole extra 2 GB of VRAM!


https://looking-glass.io/ could be interesting


It was fun when the seller told me to come and look in the back of his dirty white van, because "the servers are in here". This was before I had seen the workshop etc.


The lengths someone will go just to have a graphics card and some ram nowadays smh


Oh no, that's not right. 20 kg was in the original server case. With the aluminium frames and glass panel, it's more like 40 kg now... Shit, maybe I should take it off the Lack table...


Makes sense. I'm so used to the naming I forgot it's not common knowledge. I hope the new title is clearer.

