
AFAIK this doesn't really work for interactive use, as LLMs process data serially. With the model split across cards, your request has to pass through every card for each token, one card at a time. That means a lot of PCIe traffic and hence latency. It's better than nothing, but only really useful if you can batch requests so that every GPU stays busy all the time, rather than just one working at a time.
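To put rough numbers on the point above, here is a minimal back-of-the-envelope sketch. It is a toy model, not a benchmark: it assumes the model is split layer-wise into equal stages (one per GPU), that each request can only have one token in flight at a time, and that each hop between cards costs a fixed amount of time. The function names and the example timings are made up for illustration.

```python
def pipeline_utilization(num_gpus: int, requests_in_flight: int) -> float:
    """Fraction of time each GPU is busy in an idealized layer-split pipeline.

    Assumption: equal compute time per stage, transfer time ignored, and each
    request occupies exactly one stage at a time (the next token cannot start
    until the previous one has left the last card).
    """
    if num_gpus < 1 or requests_in_flight < 1:
        raise ValueError("need at least one GPU and one request")
    # At most one stage per request can be active, and there are only
    # num_gpus stages to fill.
    return min(requests_in_flight, num_gpus) / num_gpus


def token_latency_ms(num_gpus: int, stage_ms: float, hop_ms: float) -> float:
    """Time for one token to traverse all stages, plus one hop between each pair."""
    if num_gpus < 1:
        raise ValueError("need at least one GPU")
    return num_gpus * stage_ms + (num_gpus - 1) * hop_ms


if __name__ == "__main__":
    # Hypothetical numbers: 4 cards, 5 ms of compute per stage, 0.5 ms per hop.
    print(pipeline_utilization(4, 1))   # single interactive user
    print(pipeline_utilization(4, 4))   # enough concurrent requests to fill the pipeline
    print(token_latency_ms(4, 5.0, 0.5))
```

Under these assumptions a single interactive request keeps each of 4 cards busy only a quarter of the time, while 4 or more concurrent requests can fill the pipeline. Per-token latency still includes every stage and every hop either way, which is why batching helps throughput but not the latency one user sees.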


Clearly I wasn't aware enough that a DNN is, by default, like a mesh. Makes sense that it's going to be bottlenecked by the tightest link. Thanks...



