It's interesting that the token length for Chinese is only slightly longer than for English. What does tokenization of an ideographic language look like, anyway? One token per ideograph? Something else?
Either one token per radical (as some minimalist proposals for CJK Unicode normalization suggested way back when), or just map everything to sense-annotated pinyin, i.e. what you type into a Chinese IME to get ideographs out (and which is also, I think, what Chinese text-to-speech engines do internally as an intermediate step).
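To make the granularities being discussed concrete, here is a rough Python sketch comparing one token per ideograph, one token per UTF-8 byte, and a pinyin mapping. It is a toy under stated assumptions, not how any particular tokenizer works: the `TOY_PINYIN` table and the function names are made up for illustration, and the table carries only tone numbers, not the sense annotation mentioned above.

```python
# Toy illustration only: this table is hand-written and hypothetical,
# not taken from any real tokenizer, IME, or text-to-speech engine.
TOY_PINYIN = {
    "中": "zhong1",
    "文": "wen2",
}


def per_ideograph(text):
    """One token per Unicode code point (i.e. per ideograph for CJK text)."""
    return list(text)


def per_utf8_byte(text):
    """One token per UTF-8 byte; common CJK ideographs take three bytes each."""
    return list(text.encode("utf-8"))


def per_pinyin(text, table=TOY_PINYIN):
    """Map each ideograph to a pinyin syllable; pass unknown characters through."""
    return [table.get(ch, ch) for ch in text]


sample = "中文"
print(per_ideograph(sample))
print(per_utf8_byte(sample))
print(per_pinyin(sample))
```

Plain pinyin is lossy, since many distinct ideographs share a syllable and tone (中, 钟 and 终 are all zhong1), which is presumably why the suggestion says sense-annotated pinyin rather than bare pinyin.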