I'd argue that you must use grapheme clusters for text editing and cursor positi...

mananaysiempre · on Oct 2, 2023

> I'd argue that you must use grapheme clusters for text editing and cursor position

Korean packs syllables into Han-script-like squares, but they are unmistakably composed of alphabetic letters, and are both typed and erased that way (the latter may depend on system configuration), yet the NFC form has only a single codepoint per syllable (a fortiori a single grapheme cluster). Hebrew vowel markings are (reasonably) considered to be part of the grapheme cluster including their carrier letter but are nevertheless typed and erased separately. In both of those cases, pressing backspace will erase less than pressing shift-left, backspace; that is, cursor movement and backspace boundaries are different.
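To make the codepoint side of this concrete, here is a minimal Python sketch using only the standard `unicodedata` module. The helper names (`codepoints`, `backspace_codepoint`) and the sample strings are illustrative assumptions, not anything a real editor or input method uses; it only shows what the encodings look like, not how any particular system implements backspace.

```python
import unicodedata

def codepoints(text):
    """List the code points of a string as U+XXXX labels (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

def backspace_codepoint(text):
    """Naive backspace: drop exactly one code point (hypothetical helper)."""
    return text[:-1]

# Korean: one precomposed syllable in NFC, three jamo letters in NFD.
han_nfc = "\uD55C"                                   # HANGUL SYLLABLE HAN
han_nfd = unicodedata.normalize("NFD", han_nfc)      # three conjoining jamo
print(codepoints(han_nfc))
print(codepoints(han_nfd))

# Deleting one jamo from the decomposed form and recomposing leaves a
# shorter syllable rather than nothing, i.e. letter-by-letter erasure.
after_jamo_delete = unicodedata.normalize("NFC", backspace_codepoint(han_nfd))
print(codepoints(after_jamo_delete))

# Hebrew: letter BET followed by the vowel point QAMATS (a combining mark).
bet_qamats = "\u05D1\u05B8"
print(codepoints(bet_qamats))
print(codepoints(backspace_codepoint(bet_qamats)))   # only the letter remains
```

The same visible syllable is one codepoint or three depending on normalization, while the Hebrew letter plus vowel point is two codepoints inside what segmentation rules treat as one cluster.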

There are IIRC also scripts that will have a vowel both pronounced and encoded in the codepoint stream after the syllable-initial consonant but written before it; and ones where some parts of a syllable will enclose it. I don’t even want to think about how cursor movement works there.
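The comment does not name the scripts, so the following is an assumption about which ones fit: Devanagari's vowel sign I is stored after its consonant but drawn to its left, and Tamil's vowel sign O canonically decomposes into two parts that sit on either side of the consonant. A small Python sketch with the standard `unicodedata` module shows the logical (stored) order only; the visual order depends on the font and shaping engine and is not something this code can demonstrate.

```python
import unicodedata

# Devanagari "ki": consonant KA, then VOWEL SIGN I in the codepoint stream.
# The vowel sign is rendered to the LEFT of the consonant.
ki = "\u0915\u093F"
ki_names = [unicodedata.name(ch) for ch in ki]
for ch in ki:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Tamil "ko": consonant KA plus VOWEL SIGN O. In NFD the vowel sign splits
# into two marks, drawn before and after the consonant respectively.
ko = "\u0B95\u0BCA"
ko_nfd = unicodedata.normalize("NFD", ko)
for ch in ko_nfd:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

In both cases the stored order is consonant first, so a cursor that follows codepoint order and a cursor that follows visual order would disagree about where "after the consonant" is.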

Overall, your suggestion will work for Latin, Cyrillic, Greek, and maybe other nonfancy scripts like Armenian, Ge’ez, or Georgian, but will absolutely crash and burn when used for others.

PeterisP · on Oct 2, 2023

OK, I understand that the initial sentence is too strict. However, using codepoints for text editing and cursor position is even worse. Even in your Korean example, behavior would differ depending on how the same character is encoded (precomposed NFC or decomposed), when it should be the same to the user. And obviously, if someone inputs a Latin letter with a diacritic by pressing a modifier key before the base letter, then having backspace remove only the diacritic (since Unicode combining marks come after the base letter) would be just ridiculous.
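A rough Python sketch of the Latin case, using only the standard `unicodedata` module. Both helpers are hypothetical, and `backspace_with_marks` is a deliberately crude stand-in for grapheme-cluster deletion: it only strips trailing characters with a nonzero canonical combining class plus one base character, which is not the full UAX #29 segmentation algorithm.

```python
import unicodedata

def backspace_codepoint(text):
    """Naive backspace: drop exactly one code point (hypothetical helper)."""
    return text[:-1]

def backspace_with_marks(text):
    """Drop trailing combining marks together with their base character.

    Crude approximation of cluster-wise deletion, NOT full UAX #29.
    """
    end = len(text)
    while end > 0 and unicodedata.combining(text[end - 1]) != 0:
        end -= 1
    return text[:max(end - 1, 0)]

cafe_nfc = "caf\u00E9"      # precomposed e with acute
cafe_nfd = "cafe\u0301"     # e followed by COMBINING ACUTE ACCENT

# Code-point backspace gives different results for the same visible text.
print(backspace_codepoint(cafe_nfc))   # caf
print(backspace_codepoint(cafe_nfd))   # cafe (only the accent is removed)

# Mark-aware backspace gives the same result for both encodings.
print(backspace_with_marks(cafe_nfc))  # caf
print(backspace_with_marks(cafe_nfd))  # caf
```

Note that the mark-aware version is exactly the behavior the parent comment objects to for Hebrew vowel points, so neither helper is right for every script.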

Backspace in general seems to be a very difficult problem because of subtly incompatible expectations depending on the context: it acts as 'undo last input' when you're typing new text, but as 'delete previous symbol' when you're editing existing text.