
I train for 1M steps (batch size 64, block size 2048), which is enough for the model to more-or-less converge.

It's also a tiny model by LLM standards, with 150M parameters. The goal wasn't really to reach state of the art, but to show how the performance of a single language model architecture can differ vastly when you just change the tokenizer.
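For concreteness, roughly this kind of configuration (a minimal sketch in a nanoGPT-style Python setup; the class and field names and the exact layer/head/embedding sizes are illustrative assumptions, chosen only to land near ~150M parameters, not the actual code):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    max_steps: int = 1_000_000   # enough for the model to more-or-less converge
    batch_size: int = 64         # sequences per optimization step
    block_size: int = 2048       # context length in tokens

@dataclass
class ModelConfig:
    # Hypothetical GPT-2-small-scale dimensions; exact values are assumptions
    # picked only to land in the ~150M-parameter range mentioned above.
    n_layer: int = 12
    n_head: int = 12
    n_embd: int = 768
    vocab_size: int = 50_257     # this is the knob that changes with the tokenizer
```

Everything except the tokenizer (and hence the vocabulary size) stays fixed across runs, so any performance gap is attributable to tokenization.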



To get close to state of the art, how many parameters would be needed with your approach?



