
I'd argue that you must use grapheme clusters for text editing and cursor position, because there are popular characters (like the ö you used as an example) which can be either one or two codepoints depending on the normalization choice. The difference is invisible to the user and should not matter to the user, so any editor should behave exactly the same for ö as U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS) and ö as a sequence of U+006F (LATIN SMALL LETTER O) and U+0308 (COMBINING DIAERESIS).
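To make that concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The variable and function names are mine, picked for illustration; it only shows that the two encodings differ as codepoint sequences and compare equal after normalization.

```python
import unicodedata

# The same user-visible "ö" in its two encodings.
precomposed = "\u00f6"   # U+00F6 LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"   # U+006F followed by U+0308 COMBINING DIAERESIS


def codepoints(text):
    """List the codepoints of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def same_user_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(codepoints(precomposed), codepoints(decomposed))
print(precomposed == decomposed)                # raw comparison sees two different strings
print(same_user_text(precomposed, decomposed))  # normalized comparison treats them as equal
```

An editor that counts codepoints sees one "character" in the first string and two in the second, which is exactly the difference the user should never notice.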

Furthermore, you shouldn't assume any relationship between how Unicode constructs a combined character from codepoints and how that character is typed. Even at the level of typing you're not typing Unicode codepoints: they're just a technical standard representation of "text at rest", and they do not define an input method. Depending on your language and device, a sequence of three or more keystrokes may be used to get a single codepoint, or a dedicated key on a keyboard or a virtual button may produce a combined character of multiple codepoints as a single unit. You definitely can't assume that the "last codepoint" corresponds to the "last user action", even if you're writing a text editor, because much of that can happen before your editor receives the input from e.g. OS keyboard layout code. Your editor won't know whether I input that ö from a dedicated key, a 'chord' of the 'o' key with a modifier, or a sequence of two keystrokes (and if so, whether 'o' was the first keystroke or the second, the opposite of how the Unicode codepoints are ordered).




> I'd argue that you must use grapheme clusters for text editing and cursor position

Korean packs syllables into Han-script-like squares, but they are unmistakably composed of alphabetic letters, and are both typed and erased that way (the latter may depend on system configuration), yet the NFC form has only a single codepoint per syllable (a fortiori a single grapheme cluster). Hebrew vowel markings are (reasonably) considered to be part of the grapheme cluster including their carrier letter but are nevertheless typed and deleted separately. In both of those cases, pressing backspace will erase less than pressing shift-left, backspace; that is, cursor movement and backspace boundaries are different.
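A rough Python sketch of the Korean case, assuming we emulate letter-wise erasing by decomposing only the last codepoint. `jamo_backspace` is a hypothetical helper of mine, not how any particular IME or editor implements it, and real systems often do this only while the syllable is still being composed.

```python
import unicodedata


def jamo_backspace(text):
    """Erase one letter (jamo) from the end of text instead of a whole syllable.

    Illustrative only: decompose the last codepoint with NFD, drop its final
    element, and recompose what is left with NFC.
    """
    if not text:
        return text
    head, last = text[:-1], text[-1]
    parts = unicodedata.normalize("NFD", last)
    return head + unicodedata.normalize("NFC", parts[:-1])


syllable = "\ud55c"  # 한 (HANGUL SYLLABLE HAN), one codepoint in NFC
letters = unicodedata.normalize("NFD", syllable)

print(len(syllable), len(letters))  # one codepoint composed, three jamo decomposed
print(jamo_backspace(syllable))     # the final consonant is gone, the syllable block remains
```

Selecting with shift-left and then deleting would remove the whole syllable block, while this backspace removes only its last letter, which is the mismatch between cursor boundaries and backspace boundaries.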

There are IIRC also scripts that will have a vowel both pronounced and encoded in the codepoint stream after the syllable-initial consonant but written before it; and ones where some parts of a syllable will enclose it. I don’t even want to think how cursor movement works there.

Overall, your suggestion will work for Latin, Cyrillic, Greek, and maybe other nonfancy scripts like Armenian, Ge’ez, or Georgian, but will absolutely crash and burn when used for others.


OK, I understand that the initial sentence is too strict; however, using codepoints for text editing and cursor position is even worse. Even in your example of Korean, the behavior would differ depending on how the same character is encoded (composed NFC or not), but it should be the same to the user. And obviously, if someone inputs a Latin character with a diacritic by pressing a modifier key before the base letter, then having backspace remove only the diacritic (since Unicode combining marks come after the base letter) would be just ridiculous.
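A small Python sketch of why codepoint-wise backspace is encoding-dependent. Both function names are made up for illustration, and `backspace_with_marks` is only a crude approximation that strips trailing combining marks together with their base; it is not the full grapheme cluster segmentation of UAX #29.

```python
import unicodedata


def backspace_codepoint(text):
    """Delete exactly one codepoint from the end."""
    return text[:-1]


def backspace_with_marks(text):
    """Delete the last base character together with any marks attached to it.

    Approximation: any codepoint whose general category starts with "M"
    (Mn, Mc, Me) is treated as belonging to the preceding base character.
    """
    i = len(text)
    while i > 0 and unicodedata.category(text[i - 1]).startswith("M"):
        i -= 1
    return text[:max(i - 1, 0)]


composed = "x\u00f6"     # "xö" with a precomposed ö
decomposed = "xo\u0308"  # "xö" with o + COMBINING DIAERESIS

# Codepoint-wise backspace gives different results for the same visible text.
print(repr(backspace_codepoint(composed)), repr(backspace_codepoint(decomposed)))
# Mark-aware backspace gives the same result for both encodings.
print(repr(backspace_with_marks(composed)), repr(backspace_with_marks(decomposed)))
```

Whether removing the whole letter is what the user wants is a separate question, but at least the result no longer depends on a normalization choice the user cannot see.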

Backspace in general seems to be a very difficult problem because of subtly incompatible expectations depending on the context: 'undo last input' when you're typing new text, and 'delete previous symbol' when you're editing existing text.



