
> between chars (or "runes" WTF?) and integers.

"Runes" were the original name, as implemented in Plan 9 by the same folks, for what the standards committee later decided to call by the relatively blasé term "Unicode codepoints"--and which are not quite the same thing as characters.

(In fact, I would say that the notion of a Unicode "character" is ambiguous to the point of uselessness--there are glyphs composed from several codepoints (base glyph + combining accents), which should be treated as one "character"; there are ligatures that hold single codepoints, but which semantically are multiple "characters"; there are stacking languages where one "character", representing a whole word, will be composed together from several codepoint "radicals"; while in other ideographic languages, each pre-composed idea-part is its own "character" and has its own codepoint; and so forth.)



> there are glyphs composed from several codepoints (base glyph + combining accents), which should be treated as one "character"

The solution is to use Normalization Form C (NFC), which combines a base character and its accents into a single precomposed codepoint wherever one exists.
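A minimal sketch of what NFC does, using Python's standard `unicodedata` module. The variable names are only illustrative. It also shows a sequence that stays at two codepoints because Unicode defines no precomposed form for it, so NFC alone does not guarantee one codepoint per "character".

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two codepoints, one visible glyph.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# NFC folds the pair into the single precomposed codepoint U+00E9.
composed_is_single = (composed == "\u00e9" and len(composed) == 1)

# 'q' + combining acute has no precomposed codepoint, so NFC leaves it as two.
no_precomposed = unicodedata.normalize("NFC", "q\u0301")
still_two = len(no_precomposed)
```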

> there are ligatures that hold single codepoints, but which semantically are multiple "characters"

OK, so use Normalization Form KC (NFKC) (which splits ligatures, and combines accents with characters).
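As a rough illustration, again with Python's standard `unicodedata` module: NFC leaves the "fi" ligature (U+FB01) alone, while NFKC replaces it with the two letters. The last lines show that NFKC is lossy in other ways too, which may or may not be what you want.

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI, a single codepoint

# NFC only applies canonical mappings, so the ligature survives unchanged.
after_nfc = unicodedata.normalize("NFC", ligature)

# NFKC also applies compatibility mappings, which expand the ligature to "fi".
after_nfkc = unicodedata.normalize("NFKC", ligature)

# Compatibility mappings discard formatting distinctions as well:
# SUPERSCRIPT TWO (U+00B2) becomes a plain digit 2.
superscript_two = unicodedata.normalize("NFKC", "\u00b2")
```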

You're right that the "length" of a Unicode string is very ambiguous. Arguably, you shouldn't be able to call "length" without supplying an argument that says what you are actually asking for.
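Here is a sketch of what such an API might look like in Python. The function `length` and its `unit` names are hypothetical, not from any real library. Counting user-perceived characters (grapheme clusters) would need a segmentation library, so it is left out here.

```python
def length(s: str, unit: str) -> int:
    """Return the length of s, measured in the unit the caller asks for."""
    if unit == "codepoints":
        # Python's str is a sequence of codepoints, so len() counts them.
        return len(s)
    if unit == "utf8_bytes":
        return len(s.encode("utf-8"))
    if unit == "utf16_units":
        # Each UTF-16 code unit is two bytes; the "-le" codec adds no BOM.
        return len(s.encode("utf-16-le")) // 2
    raise ValueError(f"unknown unit: {unit!r}")


# 'e' + combining acute accent, followed by an emoji outside the BMP.
sample = "e\u0301\U0001F600"
lengths = {u: length(sample, u) for u in ("codepoints", "utf8_bytes", "utf16_units")}
```

The same string gives three different answers, and none of them is the two "characters" a reader would see.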



