I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
As the comments on reddit said, those numbers don’t make sense.
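For a rough sense of why batch size matters that much, here's a back-of-envelope sketch; the model size and bandwidth figures are my own assumptions, not anything from the original post:

```python
# Back-of-envelope decode throughput (all numbers assumed, not from the thread):
# a hypothetical 70B-parameter model at 8 bits/param served on 4x H100.
# At batch size 1, every generated token has to stream all the weights from
# HBM once, so tokens/s is roughly capped at bandwidth / model size.
# With batching, that weight traffic is shared across the whole batch, so
# aggregate tokens/s keeps climbing until compute or KV-cache space runs out.

MODEL_BYTES = 70e9          # assumed: 70B params * 1 byte/param (8-bit)
HBM_BW_PER_GPU = 3.3e12     # assumed: ~3.3 TB/s of HBM bandwidth per H100
NUM_GPUS = 4

aggregate_bw = HBM_BW_PER_GPU * NUM_GPUS

# Rough ceiling on single-sequence decode speed (ignores KV cache and overhead).
batch1_tok_s = aggregate_bw / MODEL_BYTES
print(f"batch=1 ceiling: ~{batch1_tok_s:.0f} tok/s")

# While still bandwidth-bound, aggregate throughput scales with batch size.
for batch in (1, 8, 32, 128):
    print(f"batch={batch:3d}: ~{batch1_tok_s * batch:,.0f} tok/s aggregate")
```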
> I’m not an expert on at-scale inference, but they surely can’t have been running at a batch size of more than 1 if they were getting performance that bad on 4xH100… and I’m not even sure how they were getting performance that low even at batch size 1. Batching is essential to serving large token volumes at scale.
That was my first thought as well, but from a quick search it looks like llama.cpp has a default batch size that's quite high (256 or 512, I don't remember exactly, which I find surprising for something that's mostly used by local users), so that shouldn't be the issue.
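For reference, the same knob shows up as `n_batch` in the Python bindings; as I understand it, it sets how many tokens are evaluated per forward pass, which mostly affects how fast a single request's prompt gets processed. A minimal sketch (model path and sizes are placeholders):

```python
# Sketch using the llama-cpp-python bindings (model path is a placeholder).
# n_batch = tokens evaluated per forward pass; in these bindings it mainly
# speeds up prompt processing for one request and does not by itself make
# multiple requests get decoded in parallel.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-q8_0.gguf",  # placeholder path
    n_ctx=4096,
    n_batch=512,        # the "batch size" in question
    n_gpu_layers=-1,    # offload all layers to the GPU(s)
)

out = llm.create_completion("Explain batching in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```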
> As the comments on reddit said, those numbers don’t make sense.
Sure, but that default batch size would only matter if the person in question was actually generating and measuring parallel requests, not just measuring the straight-line performance of sequential requests... and I have no confidence they were.
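To make that concrete, the difference only shows up if the benchmark actually issues requests concurrently. Something like the following would distinguish straight-line latency from batched throughput; the endpoint URL, model name, and response fields assume an OpenAI-compatible server (e.g. what llama-server exposes):

```python
# Rough throughput check: sequential vs. concurrent requests against an
# OpenAI-compatible completion endpoint (URL and model are placeholders).
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed endpoint
PAYLOAD = {
    "model": "placeholder-model",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 128,
}
N_REQUESTS = 32

def one_request() -> int:
    resp = requests.post(URL, json=PAYLOAD, timeout=300)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

def measure(concurrency: int) -> None:
    start = time.time()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        tokens = sum(pool.map(lambda _: one_request(), range(N_REQUESTS)))
    elapsed = time.time() - start
    print(f"concurrency={concurrency:3d}: {tokens / elapsed:.1f} tok/s aggregate")

measure(1)    # sequential: the server only ever sees one request at a time
measure(16)   # concurrent: gives the server something to actually batch
```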