OpenAI's tokenizer splits "chess" into "ch" and "ess". We could just split it into "c" "h" "e" "s" "s".
That is, the groups are encoding something the model doesn't have to learn.
This is not far from the "sight words" we teach kids.
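For anyone who wants to check the split themselves, here's a minimal sketch using the tiktoken library. It assumes the cl100k_base encoding; the exact pieces you get depend on which vocabulary you load, so treat "ch" + "ess" as something to verify rather than a given.

    # Compare how a BPE vocabulary groups "chess" versus raw characters.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # one of OpenAI's public encodings

    word = "chess"
    token_ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]

    print(pieces)      # e.g. ['ch', 'ess'] -- or a single token, vocabulary-dependent
    print(list(word))  # character-level alternative: ['c', 'h', 'e', 's', 's']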
Yup. Just let the actual ML git gud
Tokenization has no inherent modeling advantage; it's a workaround for practical limits on context window size and training cost.
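A rough illustration of the context-window point, again assuming tiktoken with cl100k_base (the exact ratio varies with the text):

    # The same sentence costs far more positions character-by-character
    # than it does under a BPE vocabulary.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    text = "The quick brown fox jumps over the lazy dog."

    bpe_len = len(enc.encode(text))
    char_len = len(text)

    print(f"BPE tokens: {bpe_len}, characters: {char_len}")
    # Character-level input is several times longer, so the same context window
    # holds several times less text, and attention cost grows with that length.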