> Your language's length function is probably just returning the number of unicode codepoints in the string.
The article didn't say that!
"Number of Unicode code points" in a string is ambiguous: surrogates and astral characters are both code points, so it's unclear whether a surrogate pair counts as two code points or one. (It unambiguously counts as two UTF-16 code units and as one Unicode scalar value.)
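To make the ambiguity concrete, here is a rough Python sketch. The choice of U+1F926 (FACE PALM) and all variable names are mine, for illustration only. Python happens to let a str hold lone surrogate code points, which makes both counts easy to show side by side.

```python
# One astral character, U+1F926, chosen purely for illustration.
astral = "\U0001F926"

# Encode as UTF-16 (little-endian, no BOM) and count 16-bit code units.
astral_utf16_units = len(astral.encode("utf-16-le")) // 2

# len() on a Python str counts code points; in a well-formed string
# every one of them is a Unicode scalar value.
astral_scalar_values = len(astral)

# The same character written as its two surrogate code points.
# Python keeps them as two separate code points and does not merge them.
pair = "\ud83e\udd26"
pair_code_points = len(pair)

# Both spellings produce identical UTF-16 bytes once the surrogates
# are allowed through the encoder.
same_utf16 = pair.encode("utf-16-le", "surrogatepass") == astral.encode("utf-16-le")

print(astral_utf16_units, astral_scalar_values, pair_code_points, same_utf16)
```

So the same UTF-16 data can be described as one code point or as two, depending on whether you count before or after pairing up the surrogates. The code unit and scalar value counts don't have that problem.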
The article presented four kinds of programming language-reported string lengths:
1. Length is number of UTF-8 code units.
2. Length is number of UTF-16 code units.
3. Length is number of UTF-32 code units, which is the same as the number of Unicode scalar values.
4. Length is number of extended grapheme clusters.
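As a rough illustration, the first three lengths can be computed in Python with nothing but the standard library. The sample string below is the facepalm emoji sequence written out with escapes, and the variable names are my own. This is a sketch, not how any particular language implements its length function.

```python
# U+1F926 U+1F3FC U+200D U+2642 U+FE0F: face palm, skin tone modifier,
# zero width joiner, male sign, variation selector-16.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

# 1. UTF-8 code units are bytes.
facepalm_utf8_units = len(s.encode("utf-8"))

# 2. UTF-16 code units are 16-bit, so halve the byte count (no BOM with -le).
facepalm_utf16_units = len(s.encode("utf-16-le")) // 2

# 3. UTF-32 code units, i.e. Unicode scalar values; Python's len() counts these
#    for a well-formed string.
facepalm_scalar_values = len(s)

print(facepalm_utf8_units, facepalm_utf16_units, facepalm_scalar_values)
# prints: 17 7 5
```

The fourth kind, extended grapheme clusters, isn't available in Python's standard library as far as I know; you'd need a third-party segmentation library implementing UAX #29, and for this string the expected answer is 1.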