
Tokenization is typically done on CPU and is rarely (if ever) a bottleneck for training or inference.

GPU kernels typically dominate the wall-clock time; the only exception might be very small models.

Thus the latency of tokenization can essentially be "hidden" by having the CPU prepare the next batch while the GPU finishes the current one.
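For example, in PyTorch this overlap is what you get from a DataLoader with worker processes: tokenization runs on the CPU in the workers while the GPU executes the current step. A minimal sketch below, assuming PyTorch and a HuggingFace tokenizer; the dataset, texts, and batch sizes are just illustrative.

    import torch
    from torch.utils.data import DataLoader, Dataset
    from transformers import AutoTokenizer

    class TextDataset(Dataset):
        def __init__(self, texts):
            self.texts = texts
        def __len__(self):
            return len(self.texts)
        def __getitem__(self, idx):
            return self.texts[idx]

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token

    def collate(batch):
        # CPU-side tokenization, executed inside the worker processes.
        return tokenizer(batch, return_tensors="pt", padding=True, truncation=True)

    loader = DataLoader(
        TextDataset(["some example text"] * 1024),
        batch_size=32,
        num_workers=4,     # workers tokenize upcoming batches in parallel
        pin_memory=True,   # enables faster, asynchronous host-to-device copies
        collate_fn=collate,
    )

    device = torch.device("cuda")
    for batch in loader:
        input_ids = batch["input_ids"].to(device, non_blocking=True)
        # ... GPU forward/backward here; workers keep tokenizing ahead.

As long as the per-batch tokenization time in the workers stays below the per-step GPU time, the GPU never waits on the tokenizer.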





