But because of Han unification I all of a sudden DO need to know the language.
The same Unicode code point needs to be rendered differently for a user in Mainland China versus a user in Japan, or else the user may not be able to read the text! Even if the user can read the character, they are going to experience a degradation in reading speed and comprehension, and be generally frustrated. Not to mention that showing the wrong character is insensitive to the customer's culture, and if I pick just one set of characters and stick to it, I end up promoting (or at least being accused of promoting) cultural hegemony based on which character set I go with.
In what situations do you need to do this, but don't need to show any other data (dates and times, localized UI, user timezone, culturally appropriate fonts, RTLness) that involves knowing the user's languages and locale?
This can happen if the user is intentionally reading mixed-language text or text not in their computer's UI language, of course. In that case different CJK languages also have different preferred fonts, so having language tagging or just guessing is pretty important.
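As a rough illustration of the "just guessing" option, here is a minimal sketch that infers a language tag from script evidence alone. The function name `guess_cjk_language` and the default tag are my own assumptions, and the heuristic is deliberately crude: Japanese text written only in kanji will be guessed as Chinese, which is exactly why explicit language tagging is preferable when it is available.

```python
def guess_cjk_language(text, default="und"):
    """Guess a language tag from the scripts present in text (heuristic only)."""
    has_han = False
    for ch in text:
        cp = ord(ch)
        # Hiragana or Katakana letters strongly suggest Japanese.
        if 0x3041 <= cp <= 0x3096 or 0x30A1 <= cp <= 0x30FA:
            return "ja"
        # Hangul Jamo or Hangul Syllables suggest Korean.
        if 0x1100 <= cp <= 0x11FF or 0xAC00 <= cp <= 0xD7A3:
            return "ko"
        # CJK Unified Ideographs (main block) are shared by all three.
        if 0x4E00 <= cp <= 0x9FFF:
            has_han = True
    # Han characters with no kana or hangul: fall back to Chinese.
    return "zh" if has_han else default
```

Called as `guess_cjk_language(some_text)`, it gives you a tag to feed into font selection when nothing better is known.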
> In what situations do you need to do this, but don't need to show any other data (dates and times, localized UI, user timezone, culturally appropriate fonts, RTLness) that involves knowing the user's languages and locale?
For drawing a given glyph, there is normally a lookup into a font table that involves solely the string of Unicode code points coming in.
Except if any characters are in the CJK Unified Ideographs range. Then my function call suddenly has to jump out to read environment variables, which are hopefully set up correctly.
My code to do a lookup into a font file should not depend on the user's environment variables because of a space-saving optimization made two decades ago.
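To make the complaint concrete, here is a minimal sketch of a lookup where the language is an explicit argument rather than something read from the environment. The tables and glyph names are toy data I made up, not the contents of any real font. U+9AA8 (骨) is used because it is a commonly cited example of a character drawn differently in mainland China and Japan.

```python
# Toy data for illustration. The glyph names are invented.
DEFAULT_GLYPHS = {0x41: "A", 0x9AA8: "uni9AA8"}
LOCALIZED_GLYPHS = {
    (0x9AA8, "ja"): "uni9AA8.ja",
    (0x9AA8, "zh-Hans"): "uni9AA8.zhs",
}

def glyph_for(code_point, lang=None):
    """Return a glyph name for code_point.

    lang is passed in by the caller; nothing is read from the environment.
    """
    if lang is not None:
        localized = LOCALIZED_GLYPHS.get((code_point, lang))
        if localized is not None:
            return localized
    # No language given, or no localized form: use the default glyph.
    return DEFAULT_GLYPHS.get(code_point, ".notdef")
```

The point of the sketch is only that the code point alone no longer determines the result: `glyph_for(0x9AA8, "ja")` and `glyph_for(0x9AA8, "zh-Hans")` return different glyphs, while `glyph_for(0x41, "ja")` is unaffected.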
> For drawing a given glyph, there is normally a lookup into a font table that involves solely the string of Unicode code points coming in.
Why are you implementing OpenType? It's got working libraries already.
But if you are getting into that, glyphs in a font are stored by glyph ID (optionally with a "glyph name"), not by code point; the cmap table maps code points to glyphs. There are a bunch more steps than that:
- Font substitution: Find fonts that cover every character in the text. The order of your search list depends on the language.
- Text layout and line breaking: for best results, you don't want to line break in the middle of a word, and you need to place punctuation on the correct side of right-to-left sentences. I think both of these need dictionaries.
- Shaping: you have to read the GSUB tables and apply a bunch of expected features, like ligatures, automatic fractions, beginning-of-word special forms (see Zapfino), etc. This includes language-specific glyphs, but fonts can also just choose glyphs with a random number generator.
- Drawing the glyph. Remember not to draw each one individually, or a translucent line of overlapping characters (like in Indian languages) will look bad.
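The font substitution step above can be sketched roughly like this. The font names, coverage sets, and search orders are all hypothetical; a real implementation would read coverage from each font's cmap table and get the search order from the platform's font configuration.

```python
# Hypothetical fonts and the code points each one covers.
FONT_COVERAGE = {
    "LatinSans": set(range(0x20, 0x7F)),
    "SansJP": {0x9AA8, 0x3042},   # 骨, あ
    "SansSC": {0x9AA8, 0x4F60},   # 骨, 你
}

# Hypothetical per-language search order: the same fonts, different priority.
SEARCH_ORDER = {
    "ja": ["LatinSans", "SansJP", "SansSC"],
    "zh-Hans": ["LatinSans", "SansSC", "SansJP"],
}

def assign_fonts(text, lang):
    """Pair each character with the first font in the search order that covers it."""
    order = SEARCH_ORDER.get(lang, list(FONT_COVERAGE))
    result = []
    for ch in text:
        font = next((f for f in order if ord(ch) in FONT_COVERAGE[f]), None)
        result.append((ch, font))  # font is None when nothing covers ch
    return result
```

With these toy tables, 骨 is assigned to SansJP when the language is "ja" and to SansSC when it is "zh-Hans", while a character only one font covers lands in that font regardless of language.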
Sorry, Han glyphs render the same in Chinese and Japanese.
Regarding simplified versus traditional, no one is seriously unifying those.
There are some minor disagreements as to when a minor stylistic or historical variant deserves a separate glyph, but this isn't about rendering different glyphs in Chinese or Japanese. If Unicode is doing its job, no one should have difficulty reading unified Han characters in one font regardless of language.