
How many epochs did you train for? 100k hours is not a lot for an LLM. Feels like the bitter lesson.


I trained for 1M steps (batch size 64, block size 2048), which is enough for the model to more or less converge.

It's also a tiny model by LLM standards, with 150M parameters. The goal wasn't really to reach state of the art but to show how the performance of a single language model architecture can differ vastly when you just change the tokenizer.
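A minimal sketch of what such a setup might look like as a nanoGPT-style config. The batch size, block size, step count, and the ~150M-parameter scale come from the comment above; the field names, layer count, widths, and vocab size are assumptions for illustration, not the author's actual code:

  from dataclasses import dataclass


  @dataclass
  class TrainConfig:
      batch_size: int = 64            # sequences per step (from the comment)
      block_size: int = 2048          # context length in tokens (from the comment)
      max_iters: int = 1_000_000      # total optimizer steps (from the comment)


  @dataclass
  class ModelConfig:
      n_layer: int = 12               # assumed depth/width that lands near ~150M params
      n_head: int = 16
      n_embd: int = 1024
      vocab_size: int = 32_000        # depends on whichever tokenizer is being tested


  def approx_param_counts(m: ModelConfig) -> tuple[int, int]:
      """Rough transformer parameter count: per-layer attention+MLP weights,
      plus the token embedding table (which scales with the tokenizer's vocab)."""
      core = m.n_layer * 12 * m.n_embd ** 2   # ~12*d^2 weights per transformer block
      emb = m.vocab_size * m.n_embd           # token embedding table
      return core, core + emb


  if __name__ == "__main__":
      train, model = TrainConfig(), ModelConfig()
      tokens_seen = train.max_iters * train.batch_size * train.block_size
      core, total = approx_param_counts(model)
      print(f"tokens seen: ~{tokens_seen / 1e9:.0f}B")    # ~131B tokens
      print(f"params: ~{core / 1e6:.0f}M core, ~{total / 1e6:.0f}M with embeddings")

At these settings each step covers 64 x 2048 = 131,072 tokens, so 1M steps works out to roughly 131B tokens seen; where exactly the parameter count lands also depends on the vocab size of the tokenizer being compared, since the embedding table scales with it.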


To get close to state of the art, how many parameters would be needed with your approach?



