Prefill is part of Inference. It's the first major step where you calculate all ...

dist-epoch · 2025-06-29T06:27:54 1751178474

Doesn't decode also need to stream in all of the model weights, making it very I/O heavy as well?

0xjunhao · 2025-06-30T13:26:37 1751289997

Yes, decoding is very I/O heavy. It has to stream all of the model weights from HBM for every token decoded. However, that cost is shared across the requests in the same batch, since one pass over the weights produces the next token for each of them. So if the system has more GPU memory to hold larger batches, the I/O cost per request goes down.
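To put rough numbers on the batching point, here is a minimal back-of-the-envelope sketch. The model size (70B parameters), precision (2 bytes per parameter), and HBM bandwidth (3.35 TB/s) are illustrative assumptions, not figures from this thread, and the function names are made up for the example. It only counts weight reads; KV-cache reads are per-request and do not get amortized this way.

```python
def weight_bytes(num_params: int, bytes_per_param: int) -> int:
    """Total bytes of weights streamed from HBM in one decode step."""
    return num_params * bytes_per_param


def weight_read_seconds(num_params: int, bytes_per_param: int,
                        hbm_bytes_per_s: float) -> float:
    """Lower bound on one decode step's time if weight reads dominate."""
    return weight_bytes(num_params, bytes_per_param) / hbm_bytes_per_s


def weight_bytes_per_request(num_params: int, bytes_per_param: int,
                             batch_size: int) -> float:
    """Weight traffic attributed to each request when the batch shares one pass."""
    return weight_bytes(num_params, bytes_per_param) / batch_size


if __name__ == "__main__":
    # Assumed, illustrative values (not from the discussion above).
    NUM_PARAMS = 70_000_000_000      # hypothetical 70B-parameter model
    BYTES_PER_PARAM = 2              # e.g. 16-bit weights
    HBM_BYTES_PER_S = 3.35e12        # assumed HBM bandwidth

    step_s = weight_read_seconds(NUM_PARAMS, BYTES_PER_PARAM, HBM_BYTES_PER_S)
    print(f"weight read per decode step: {step_s * 1e3:.1f} ms")

    for batch in (1, 8, 64):
        per_req = weight_bytes_per_request(NUM_PARAMS, BYTES_PER_PARAM, batch)
        print(f"batch={batch:3d}  weight bytes per request per token: {per_req / 1e9:.2f} GB")
```

The step time stays the same as the batch grows (under the assumption that weight reads dominate), so the per-request share of that I/O falls in proportion to batch size.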