Couldn't we just make every human readable character a token? OpenAI's tokenizer...

taeric · 2024-11-14T23:04:30 1731625470

This is just more tokens? And probably requires the model to learn about common groups. Consider, "ess" makes sense to see as a group. "Wss" does not.

That is, the groups are encoding something the model doesn't have to learn.

This is not far off from the "sight words" we teach kids.
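
To make the "ess" vs. "Wss" point concrete, here is a rough sketch that counts character trigrams in a tiny made-up corpus. The corpus and the helper name `trigram_counts` are assumptions for illustration; real tokenizers learn their groups from far larger data and with different procedures.

```python
from collections import Counter

# Tiny made-up corpus, for illustration only.
corpus = "the princess was restless but less careless than the countess"

def trigram_counts(text):
    """Count every 3-character window inside each whitespace-separated word."""
    counts = Counter()
    for word in text.split():
        for i in range(len(word) - 2):
            counts[word[i:i + 3]] += 1
    return counts

counts = trigram_counts(corpus)
# Counter returns 0 for keys it has never seen.
print(counts["ess"], counts["wss"])
```

A group that shows up often is a reasonable candidate for a single token; one that never shows up is not, and a character-level model would have to pick up that distinction on its own.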

TZubiri · 2024-11-15T01:51:51 1731635511

> This is just more tokens?

Yup. Just let the actual ML git gud

taeric · 2024-11-15T02:08:10 1731636490

So, put differently, this is just more expensive?

cco · 2024-11-15T00:32:28 1731630748

We can. Tokenization is literally just there to make the most of limited resources and fit as much text as possible into the context window.

It has no advantage beyond that; it just helps work around limitations in context windows and training.
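
A minimal sketch of the "space" argument, assuming a hypothetical toy vocabulary (`TOY_VOCAB` is made up here and is not OpenAI's actual vocabulary) and a simple greedy longest-match scheme rather than real BPE:

```python
# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"token", "ization", "ess", "less", "ness", "the", " "}

def char_tokenize(text):
    """One token per character."""
    return list(text)

def subword_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match; falls back to single characters."""
    max_len = max(len(v) for v in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

text = "the tokenization"
char_tokens = char_tokenize(text)
subword_tokens = subword_tokenize(text)
print(len(char_tokens), char_tokens)
print(len(subword_tokens), subword_tokens)
```

With this toy vocabulary the same string takes 16 character tokens but only 4 subword tokens, so the same context window holds roughly four times as much text.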

TZubiri · 2024-11-15T01:52:10 1731635530

I like this explanation

tchalla · 2024-11-14T23:04:58 1731625498

aka character-level language models, which have existed for a while now.