> Calling the respective methods to get a string's length, for example, will always return the length in characters.
Most languages return the length in _Unicode characters_, which is just the number of codepoints.
However, in most cases, the programmer actually wants the number of user-perceived characters, i.e. _Unicode grapheme clusters_.
UTF-32 has to be treated as a variable-length encoding in most cases, no different from UTF-16 - otherwise, you'd miscount even characters common in Western languages like 'ä' if it happens that the user used the decomposed form.
Even normalization doesn't help with that, as not all grapheme clusters can be composed into a single codepoint.
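To make that concrete, here's a minimal sketch in Perl6 (the language discussed below), assuming a current Rakudo where `Str.NFC` gives a codepoint-level view of the normalized string and `\x[...]` is a codepoint escape:

```perl6
# 'a' + COMBINING DIAERESIS: NFC composes this into the single codepoint U+00E4 ('ä')
say "a\x[0308]".NFC.elems;   # 1

# 'g' + COMBINING TILDE is one grapheme cluster, but Unicode has no precomposed
# codepoint for it, so it stays at two codepoints even after normalization
say "g\x[0303]".NFC.elems;   # 2
```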
Perl6 is an example of a language which does the right thing here: its string type has no length method - you have to be explicit about whether you want the number of bytes, codepoints, or grapheme clusters.
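In practice that looks something like this (a sketch, again assuming a current Rakudo; `.chars` counts grapheme clusters, `.codes` counts codepoints, and a byte count only exists relative to a concrete encoding):

```perl6
my $s = "g\x[0303]";          # 'g' + COMBINING TILDE: one user-perceived character

say $s.encode('utf8').bytes;  # 3 - bytes, only meaningful for a concrete encoding
say $s.codes;                 # 2 - codepoints
say $s.chars;                 # 1 - grapheme clusters
```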
To add some confusion back in, the language also provides a method which gets the number of 'characters', where the idea of what a character is can be configured at lexical scope (it defaults to grapheme cluster).
> To add some confusion back in, it also provides a method which gets the number of 'characters', where the idea of what a character is can be configured at lexical scope (it defaults to grapheme cluster).
That's actually pretty cool, as it lets a library configure itself for the representation which makes the most sense for it: a library which deals in storage or networking can configure for codepoint or byte lengths, whereas a UI library will use grapheme clusters for bounding box computations and the like.
Configuring it lexically also makes sense, as it avoids leaking that choice out (which a dynamically scoped configuration would).