It's interesting that the token length for Chinese is only slightly longer than ...

derefr · on May 18, 2023

Either one token per radical (as some minimalist proposals for CJK Unicode normalization suggested way back when), or just map everything to sense-annotated pinyin, i.e. what you type into a Chinese IME to get ideographs out (and which is also, I think, what Chinese text-to-speech engines do internally as an intermediate step).
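
To make the two options concrete, here's a rough sketch of what each tokenization might look like. Everything in it is an assumption for illustration: the `COMPONENTS` and `SENSE_PINYIN` tables are tiny hand-written toys, the `#he`/`#she`/`#it` sense-tag format is made up, and a real system would need a full decomposition dictionary and an actual word-sense disambiguation step.

```python
# Hypothetical toy tables; not taken from any real tokenizer or dictionary.

# Option 1: split each character into its components ("radicals", loosely).
COMPONENTS = {
    "好": ["女", "子"],
    "明": ["日", "月"],
    "妈": ["女", "马"],
}

# Option 2: map each character to pinyin (tone as a digit) plus a sense tag.
# The three characters below are homophones (all "ta", first tone), so the
# sense tag is what keeps them distinct. The "#..." format is invented here.
SENSE_PINYIN = {
    "他": "ta1#he",
    "她": "ta1#she",
    "它": "ta1#it",
}


def tokenize_by_component(text):
    """Emit one token per component; unknown characters pass through whole."""
    tokens = []
    for ch in text:
        tokens.extend(COMPONENTS.get(ch, [ch]))
    return tokens


def tokenize_by_sense_pinyin(text):
    """Emit one sense-annotated pinyin token per character, if known."""
    return [SENSE_PINYIN.get(ch, ch) for ch in text]


if __name__ == "__main__":
    print(tokenize_by_component("明好"))
    print(tokenize_by_sense_pinyin("他她它"))
```

Note that the component version as written throws away how the components are laid out inside the character, so it is lossy unless you also emit some kind of layout token.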