
It's interesting that the token length for Chinese is only slightly longer than for English. What does tokenization of an ideographic language look like, anyway? One token per ideograph? Something else?



Either one token per radical (as some minimalist proposals for CJK Unicode normalization suggested way back when), or just map everything to sense-annotated pinyin, i.e. what you type into a Chinese IME to get ideographs out (and which is also, I think, what Chinese text-to-speech engines do internally as an intermediate step).
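
For a rough feel of the unit counts involved, here is a minimal Python sketch comparing three granularities for a short Chinese string: one unit per ideograph (Unicode code point), one unit per UTF-8 byte (the raw material a byte-level tokenizer would start from), and one unit per pinyin syllable. The sample string and the `TOY_PINYIN` table are made up for illustration, and nothing here is claimed to be how any particular tokenizer actually works.

```python
# Hypothetical sample text: "你好世界" ("hello world").
text = "你好世界"

# Toy, hand-written pinyin table with tone numbers (an assumption for
# illustration only; a real system would need a full dictionary and
# some way to disambiguate characters with multiple readings).
TOY_PINYIN = {"你": "ni3", "好": "hao3", "世": "shi4", "界": "jie4"}


def granularities(s):
    """Return the unit sequences for s under three candidate schemes."""
    code_points = list(s)                 # one unit per ideograph
    utf8_bytes = list(s.encode("utf-8"))  # one unit per UTF-8 byte
    # One unit per syllable; characters missing from the toy table
    # are passed through unchanged.
    pinyin = [TOY_PINYIN.get(ch, ch) for ch in s]
    return code_points, utf8_bytes, pinyin


chars, raw_bytes, syllables = granularities(text)
print(len(chars), len(raw_bytes), len(syllables))  # 4 12 4
```

Each of these ideographs takes three bytes in UTF-8, so a scheme that works on raw bytes starts from three times as many units as one that works per character, before any merging is applied.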



