I think it's quite obvious that UTF-8 is the better choice over UTF-16 or UTF-32 for exchanging data (if just for the little/big endian mess alone, and that UTF-16 isn't a fixed-length encoding either).
From that perspective, keeping the data in UTF-8 for most of its lifetime, even once it's loaded into a program, and only converting "at the last minute" when talking to the underlying operating system APIs, makes a lot of sense, except for some very specific application types that do heavy text processing.
I'm gonna do little quotes, but I don't mean to be passive-aggressive. It's just that this stuff comes up all the time.
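To make the "last minute" part concrete, here's a minimal sketch in Rust (the actual wide-char OS call is omitted, since the point is only where the conversion happens):

```
// Minimal sketch of "UTF-8 internally, convert at the boundary": the program
// keeps &str/String (always UTF-8 in Rust) for the data's whole lifetime and
// only builds a NUL-terminated UTF-16 buffer right before a wide-char OS call.
fn to_wide(s: &str) -> Vec<u16> {
    s.encode_utf16()               // re-encode the code points as UTF-16 units
        .chain(std::iter::once(0)) // append the terminating NUL the OS expects
        .collect()
}

fn main() {
    let path = "C:\\temp\\héllo 😀.txt"; // stays UTF-8 until the very end
    let wide = to_wide(path);            // converted "at the last minute"
    // e.g. pass wide.as_ptr() to CreateFileW on Windows (not shown here)
    println!("{} UTF-8 bytes -> {} UTF-16 code units (incl. NUL)",
             path.len(), wide.len());
}
```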
> I think it's quite obvious that UTF-8 is the better choice over UTF-16 or UTF-32 for exchanging data (if just for the little/big endian mess alone...
This should be the responsibility of a string library internally, and if you're saving data to disk or sending it over the network, you should be serializing to a specific format. That format can be UTF-8, or it can be whatever, depending on your application's needs.
> and that UTF-16 isn't a fixed-length encoding either)
We should stop assuming any string data is a fixed-length encoding. This is actually a major disadvantage of UTF-8: because its ASCII subset looks like a fixed-width encoding, it invites exactly this conflation.
> keeping the data in UTF-8 for most of its lifetime, even once it's loaded into a program, and only converting "at the last minute" when talking to the underlying operating system APIs, makes a lot of sense, except for some very specific application types that do heavy text processing.
Well, you're essentially saying "I know your use case better than you do". It might be important to me not to blow space on UTF-8 (say, if my text is mostly CJK, which takes three bytes per character in UTF-8 versus two in UTF-16). But if my platform/libraries have bought into "UTF-8 everywhere" and don't give me knobs to configure the encoding, I have no recourse.
And that's the entire basis for this. The claim is essentially "having to mess with multiple encodings is worse than any application-specific benefit of being able to choose an encoding". I think that's... at best a claim that's impossible to substantiate and at worst pretty arrogant. Again, I don't mean you here, but this whole "UTF-8 everywhere" thing.
> We should stop assuming any string data is a fixed-length encoding. This is actually a major disadvantage of UTF-8: because its ASCII subset looks like a fixed-width encoding, it invites exactly this conflation.
Mistaking a variable-width encoding for a fixed-width one is specifically a UTF-16 problem. UTF-8 is so obviously not fixed-width that such an error could hardly happen by mistake: even before the widespread use of emoji, multibyte sequences were in no way a corner case for UTF-8 text. (For additional reference, compare the UTF-16 string APIs in Java/JavaScript/etc. with the UTF-8 ones in, say, Rust and Go, and see which ones let you easily split a string where you shouldn't be able to, or access "half-chars" through a datatype called "char".)
I mean, I think we're both in the realm of [citation needed] here. I would argue that people index into strings quite a lot; whether that's because we thought UCS-2 would be enough for anybody, or because UTF-8 == ASCII and "it's probably fine", is academic. The solution is the same either way: don't index into strings, and don't assume an encoding until you've validated it. That makes any "advantage" UTF-8 has disappear.
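To illustrate that difference in practice, here's a tiny Rust sketch (the string is arbitrary): byte offsets exist, but the API refuses to hand you half a character.

```
fn main() {
    let s = "naïve"; // 'ï' takes two bytes in UTF-8
    // Slicing as &s[..3] would panic at runtime, because byte 3 falls in the
    // middle of 'ï', so the mistake surfaces instead of silently truncating.
    assert!(!s.is_char_boundary(3));
    // The safe ways to walk the string hand back whole scalar values.
    for (i, c) in s.char_indices() {
        println!("byte offset {} -> {:?}", i, c);
    }
}
```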
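As a concrete version of "don't assume an encoding until you've validated", a minimal Rust sketch (the byte values are made up):

```
fn main() {
    // Bytes received from "somewhere": an encoded 😀 followed by a stray 0xff.
    let bytes: Vec<u8> = vec![0xf0, 0x9f, 0x98, 0x80, 0xff];
    // Validate before treating the bytes as text in any particular encoding.
    match String::from_utf8(bytes) {
        Ok(s) => println!("valid UTF-8: {s}"),
        Err(e) => println!("not valid UTF-8: {e}"),
    }
}
```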
If you really think no one made this mistake with UTF-8, just read up on Python 3.
The difference is that with UTF-8 you're much more likely to trip over those bugs in random testing. With UTF-16 you're likely to pass all your test cases if you didn't think to include a non-BMP character somewhere. Then someone feeds you an emoji character and you blow up.
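To make the BMP trap concrete, a small Rust sketch (the test strings are arbitrary):

```
fn main() {
    // Under UTF-16, everything in the Basic Multilingual Plane is a single
    // code unit, so per-code-unit logic sails through tests like these...
    assert_eq!("héllo".encode_utf16().count(), 5);
    assert_eq!("日本語".encode_utf16().count(), 3);
    // ...and only breaks once a non-BMP character like an emoji arrives as a
    // surrogate pair.
    assert_eq!("😀".encode_utf16().count(), 2);
    assert_eq!("😀".chars().count(), 1); // but it's one Unicode scalar value
    // Under UTF-8 the multi-unit case shows up in ordinary non-ASCII text, so
    // the same class of bug tends to get caught much earlier.
    assert_eq!("héllo".len(), 6); // 6 bytes, not 5
    println!("all assertions passed");
}
```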
Yeah, ASCII is such a powerful mental model that I think everyone working on Unicode made a lot of concessions just to win people over, no argument there. But I think we need to say we're done with that and move on to phase 2. Here's what I advocate:
- Encodings should be configurable. Programmers get to decide what format their strings are internally, users get to decide what encoding programs use when dealing with filenames or saving data to disk, etc. Defaults matter, and we should employ smarts, but we should never say "I know best" and remove those knobs.
- Engineers need to internalize that "strings" conceal mountains of complexity (because written language is complex), and default to using libraries to manage them; see the sketch below. We should start viewing manual string manipulation as an anti-pattern. There isn't an encoding out there that we can all standardize on that makes this untrue, again because written language is complex.
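As one sketch of what "reach for a library" buys you, assuming the unicode-segmentation crate (not part of Rust's standard library; the example string is made up):

```
// Even with a single agreed-upon encoding, user-perceived characters
// (grapheme clusters) don't map 1:1 to code points, so counting or slicing
// "by hand" goes wrong. A segmentation library handles the tables for you.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    // 'é' written as e + combining accent, plus a thumbs-up with a skin tone.
    let s = "e\u{0301}👍🏽";
    println!("bytes:          {}", s.len());                   // 11
    println!("scalar values:  {}", s.chars().count());         // 4
    println!("graphemes:      {}", s.graphemes(true).count()); // 2
}
```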