> It's quite trivial if the language exposes an interface for iterating both clusters and points, and without such an interface the problem is much harder to notice
I assure you, 99% of people won't handle this correctly even if given a cluster-based interface (if they even bother using it). And this still doesn't address cutting a word in the middle, which in some languages breaks the display of the part that's left (or languages with no space-delimited word boundaries to cut on at all). So the preferred thing is still to use a library.
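
To make the cluster-vs-code-point distinction concrete, here's a rough Rust sketch (assuming the unicode-segmentation crate; the function names are just illustrative, not anything from the discussion above) of cutting by code points vs. by grapheme clusters:

    // Assumes the unicode-segmentation crate; names are illustrative.
    use unicode_segmentation::UnicodeSegmentation;

    // Code-point cut: can strip a combining mark or split an emoji sequence,
    // leaving a visually broken tail.
    fn truncate_points(s: &str, n: usize) -> String {
        s.chars().take(n).collect()
    }

    // Cluster cut: keeps whole extended grapheme clusters.
    fn truncate_clusters(s: &str, n: usize) -> String {
        s.graphemes(true).take(n).collect()
    }

    fn main() {
        let s = "cafe\u{301} \u{1F1E9}\u{1F1EA}"; // "café" with a combining acute, then a flag
        println!("{}", truncate_points(s, 4));   // "cafe" - the accent gets dropped
        println!("{}", truncate_clusters(s, 4)); // "café" - the accent stays with its base
    }

And neither version knows anything about word boundaries, which is exactly the part you'd want a proper segmentation library for.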
I don't think anyone in k would use UTF-16 via a character list of 2 chars per code unit; an integer list would work much nicer for that (and most k interpreters should be capable of storing such with 16-bit ints; there's still some preference for UTF-8 char lists, not least because those get pretty-printed as strings); and you'd probably have to convert at some I/O boundary anyway. Never mind the world being basically all-in on UTF-8.
Even if you have a string type that can be backed by either UTF-8 or UTF-16, you'll still need conversions between them at some point; you'd want the Windows API calls to take something like "str.asNullTerminatedUTF16Bytes()" (lest a UTF-8-encoded string make its way there), and you can trivially provide an equivalent for a plain byte list. And I highly doubt the conversion overhead would matter anywhere you actually need a UTF-16-only Windows API.
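
Roughly what such a boundary helper could look like in Rust (just a sketch; the helper name mirrors the hypothetical "asNullTerminatedUTF16Bytes" above and isn't any real API):

    // Sketch: push the UTF-16 conversion to the API boundary.
    fn to_null_terminated_utf16(s: &str) -> Vec<u16> {
        // encode_utf16 emits surrogate pairs for code points above U+FFFF
        s.encode_utf16().chain(std::iter::once(0)).collect()
    }

    fn main() {
        let wide = to_null_terminated_utf16("C:\\temp\\héllo.txt");
        // wide.as_ptr() is what a UTF-16-only Windows call (e.g. CreateFileW) would take;
        // the conversion is a single linear pass, negligible next to the syscall itself.
        assert_eq!(*wide.last().unwrap(), 0);
        println!("{} code units incl. terminator", wide.len());
    }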
I doubt all of those fancy operations you'll be doing have optimized impls for every format internally either, so there are internal conversions too. If anything, I'd imagine a unified internal representation would end up better, forcing the user to push conversions to the I/O boundaries and letting the implementation focus on optimizing a single type, instead of going back and forth internally or wasting time on multiple impls.
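
A rough sketch of that tradeoff (purely illustrative, not any real language's string internals): a dual-backed string either duplicates each operation per representation or converts internally before reusing one impl, whereas a single internal representation has one impl and converts only at the boundary:

    // Hypothetical dual-backed string; names are made up for illustration.
    enum Backing {
        Utf8(Vec<u8>),
        Utf16(Vec<u16>),
    }

    impl Backing {
        fn char_count(&self) -> usize {
            match self {
                // one impl per representation...
                Backing::Utf8(b) => std::str::from_utf8(b).map_or(0, |s| s.chars().count()),
                // ...or an internal conversion before reusing the UTF-8 path
                Backing::Utf16(u) => String::from_utf16_lossy(u).chars().count(),
            }
        }
    }

    fn main() {
        let a = Backing::Utf8("héllo".as_bytes().to_vec());
        let b = Backing::Utf16("héllo".encode_utf16().collect());
        assert_eq!(a.char_count(), b.char_count());
    }

Multiply that match by every string operation and the appeal of one internal format plus boundary conversions becomes clear.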