
I'm fluent in Japanese and speak some Mandarin Chinese as well. These 3 characters are identical, not similar.

For a different example, 国 and 國 used to be the same character, but China and Japan (left) have both diverged from the traditional form still used in Taiwan (right). Unicode treats them as separate.
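To make the "separate" part concrete, here is a minimal Python sketch that prints the code point and Unicode name of each form. The helper name `describe` is made up for illustration; `unicodedata` is the standard library module.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

simplified_or_japanese = "\u56fd"  # 国, the form used in mainland China and Japan
traditional = "\u570b"             # 國, the form still used in Taiwan

print(describe(simplified_or_japanese))
print(describe(traditional))

# The two forms are different code points, so plain string comparison fails.
print(simplified_or_japanese == traditional)
```

Since the two forms have different code points, a plain-text search for one will not find the other unless the software maps between them itself.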

今 looks slightly different in traditional Chinese vs other languages. In traditional Chinese, the little straight line between the two angled lines is sloped, while it is horizontal in simplified Chinese, Japanese or Korean. Any reader of any of these languages would have no issue if the variant they are used to was replaced by the other one. They might think you have sloppy handwriting or an ugly font if they even notice, but that's about it. Unicode treats them as the same.




This seems again to be a perfect place for rendering rather than encoding. The English letter 'a' can be rendered as a ring with a tail (the way I handwrite), or a ring with a cap and a tail (the way the font usually renders). Both are the same letter, just rendered differently based on my (contextually sensitive) font.


I think not. I do want to be able to say both People's Republic of China 中华人民共和国 and Republic of China 中華民國 in the same text, and if I had to choose rendering either 国+华 or 國+華 then it wouldn't work.


Curious, as I'm not sure when that would actually happen in real life (in Chinese). Generally in mainland China, the ROC would always be rendered with 国, even officially [1]. And in Taiwan the PRC would be rendered with 國 [2].

It gets a bit weirder in Japanese where the word is distinctly not the same - one is a traditional version (proper noun) of the other and you could imagine a text using both (William vs Wilhelm vs Will).

[1] http://baike.baidu.com/view/2200.htm [2] https://www.google.com.tw/?gws_rd=ssl#q=%E4%B8%AD%E5%8D%8E%E...


That is true, mainland China writes "中华民国".

But I still do want to be able to write texts that are like this discussion: mainly in English, but contain fragments in Chinese, and so that I can use both the traditional and simplified characters.

And it also makes total sense to me that 日本 is Japan, both in Japanese and Chinese, using the exact same Unicode characters.


If it's up to font rendering, you would specify a language tag for each of those individually, and the renderer would pick the appropriate font (as you are writing each country's name in its own language).
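In HTML, for instance, that per-fragment tagging is what the `lang` attribute is for. A rough Python sketch, assuming HTML output; the helper `tag_lang` is hypothetical, and whether a given browser actually switches fonts per tag depends on the fonts installed:

```python
from html import escape

def tag_lang(text: str, lang: str) -> str:
    """Wrap text in a span carrying a language tag (hypothetical helper)."""
    return f'<span lang="{escape(lang, quote=True)}">{escape(text)}</span>'

# The same code points for "Japan", tagged once as Japanese and once as
# traditional Chinese; a renderer may choose a different font for each span.
japan_ja = tag_lang("\u65e5\u672c", "ja")            # 日本
japan_zh_hant = tag_lang("\u65e5\u672c", "zh-Hant")  # 日本

print(japan_ja)
print(japan_zh_hant)
```

The text inside both spans is byte-for-byte identical. Only the tag differs, which is the information that gets lost in plain text.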

Though that's far more difficult for the layman, and Markdown certainly doesn't seem to have any language markers.


Except you can still recognize the 'a' as 'a' no matter which way it is rendered.

Not so with Chinese characters. For instance, the simplified (飞) and traditional (飛) characters for "fly" look very different. Someone who only learned simplified may not recognize the traditional character as being the same.


Which is exactly why 飞 and 飛 are encoded separately. I don't see any problem with that.


Yes, but other characters that also look different are merged. Here's an example: http://www.tofugu.com/2012/04/04/the-sorry-state-of-japanese...

That's the character for "cold". If you showed me (a Chinese speaker) the Japanese or Korean variant, I would have no idea what it meant.


As far as I can see, Unicode has mostly settled down into a sort of "good enough" state: characters that have sufficiently different renderings have gotten separate "variant" codepoints for each rendering, while characters that are very similar (even if not completely identical as commonly written) are still only present as unified codepoints.

I've no idea if these variant codepoints are actually supposed to show up in user files, or are intended mainly for the use of font rendering systems, etc... the whole thing seems a bit of a mess, even if the information is technically present.

Judging from unicode.com, "冷" does seem to have separate codepoints: 冷 (chinese/unified), and 冷 (japanese z-variant). However, my browser renders both as similar characters. Similarly, on my phone, the same character gets input whether using a Chinese or a Japanese input method, and both get rendered using the Japanese rendering (it's a Japanese phone), which makes Chinese text look a little funny.

An interesting example is "晩" / "晚", which has one more stroke in the Japanese variant, but it's situated in a location which makes both variants look pretty much identical (and in small bitmapped fonts, they are identical). Nonetheless, Unicode includes codepoints for both...
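A quick way to check this kind of thing is to compare the two forms after Unicode normalization. A small Python sketch using the standard `unicodedata` module; the helper name `same_after_normalization` is made up for illustration:

```python
import unicodedata

def same_after_normalization(a: str, b: str, form: str = "NFKC") -> bool:
    """True if a and b become identical under the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

first_form = "\u6669"   # 晩
second_form = "\u665a"  # 晚

print(f"U+{ord(first_form):04X} vs U+{ord(second_form):04X}")

# Both are ordinary unified ideographs with no decomposition mapping,
# so even compatibility normalization keeps them apart.
print(same_after_normalization(first_form, second_form))

# For contrast: precomposed e-acute and 'e' + combining acute do normalize together.
print(same_after_normalization("\u00e9", "e\u0301"))
```

So as far as the standard is concerned these are two distinct characters, and any folding between them has to happen in application code.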


The fact that some characters are debatable does not change the fact that Han unification is a good idea. In a few cases you can disagree, but not unifying at all would be madness.

As for this particular character (cold), both variants are familiar to Japanese readers, with the one described as Japanese in your link being the one you'd typically see in print, while the other one is common in handwriting, and nobody in Japan would treat these two as different. From a Japanese point of view, this is definitely the kind of thing you change by switching fonts.

This PDF is the official list of basic Chinese characters, published by the Japanese government. Look at page 9: it shows both variants of this character in handwriting.

http://www.bunka.go.jp/kokugo_nihongo/pdf/jouyoukanjihyou_h2...

The fact that one of the variants is not familiar to Chinese readers complicates the issue, but there are at least reasonable grounds to argue that this is one character, not two.

I think it is possible to argue that Han unification was not done very well and that the Unicode Consortium made too many classification mistakes (although I personally think it is generally not that bad), but I don't think arguing that unification is a bad thing entirely has legs.


Encoding those as the same code point makes sense. On the other hand, the code point alone is obviously not enough to render the glyph. The Unicode Consortium seems not to care about the actual rendering part of the whole stack and is happy just defining the low-level bits. But then why do we have skin colour coding for emoji and no language coding for CJK glyphs? The entire thing is a mess, but heaping another pile of standards on top of it will make it even more of a mess, I'm afraid.



