
LLM inference is inherently a sequential problem. You can't speed it up by doing more in parallel. You can't generate the 101st token before you've generated the 100th.


Technically, I guess you can use speculative execution to speed it up, and in that way take a guess at what the 100th token will be and start on the 101st token at the same time? Though it probably has its own unforeseen challenges.

Everything is predictable with enough guesses.


People are pretty cagey about what they use in production, but yes, speculative sampling can offer massive speedups in inference.
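A minimal sketch of the idea, with hypothetical toy "models" standing in for the real draft and target networks: a cheap draft model proposes several tokens ahead, the expensive target model checks them, and the longest agreeing prefix is kept. The speedup comes from the fact that in a real system the verification step is one batched forward pass over all drafted positions.

```python
# Toy sketch of speculative decoding. draft_model and target_model are
# hypothetical stand-ins: deterministic functions mapping a context to
# the next token, not real LLMs.

def draft_model(context):
    # Cheap model: guesses the next token as (last + 1) mod 10.
    return (context[-1] + 1) % 10

def target_model(context):
    # Expensive model: same rule, except it emits 0 after a 7,
    # so the two models occasionally disagree.
    last = context[-1]
    return 0 if last == 7 else (last + 1) % 10

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < n_tokens:
        # 1. Draft k tokens autoregressively with the cheap model.
        drafts, ctx = [], list(out)
        for _ in range(k):
            t = draft_model(ctx)
            drafts.append(t)
            ctx.append(t)
        # 2. Verify with the target model. In a real system this is ONE
        #    batched forward pass over all k positions, which is where
        #    the parallelism (and the speedup) comes from.
        accepted, ctx = [], list(out)
        for t in drafts:
            want = target_model(ctx)
            if want == t:
                accepted.append(t)
                ctx.append(t)
            else:
                # First disagreement: keep the target's token and stop,
                # discarding the rest of the draft.
                accepted.append(want)
                break
        out.extend(accepted)
    return out[len(prompt):][:n_tokens]

print(speculative_decode([5], 6))  # → [6, 7, 0, 1, 2, 3]
```

When the draft model agrees often, most iterations accept several tokens per expensive verification step; in the worst case every draft is rejected and you fall back to one target-model token per step, no slower than plain autoregressive decoding.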


They’re using several hundred cards here. Clearly there is ‘something’ that can be done in parallel.



