> not really a good reason to support UTF-8 over UTF-16
Of course there is: the fact that if you're dealing only with ASCII characters, it's backwards-compatible. Which is a nice convenience in a great number of situations programmers encounter.
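To make that concrete, here's a minimal Python sketch (the strings are just examples) showing that pure-ASCII text is byte-for-byte identical in ASCII and UTF-8, while UTF-16 isn't:

    # Pure-ASCII text encodes to the exact same bytes in ASCII and UTF-8,
    # so an encoding-naive (ASCII-only) tool reads a UTF-8 file unchanged.
    text = "plain old ASCII"
    assert text.encode("ascii") == text.encode("utf-8")

    # UTF-16 offers no such compatibility: every character takes 2 bytes,
    # plus a BOM, so the same tool sees NUL-riddled garbage.
    print(text.encode("utf-8"))   # b'plain old ASCII'
    print(text.encode("utf-16"))  # b'\xff\xfep\x00l\x00a\x00i\x00n\x00 \x00...'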
The minor efficiency details of an encoding aren't particularly relevant these days -- sure, UTF-16 is more compact for Chinese, but the average webpage usually has far more markup, CSS and JavaScript than text, and gzip-ing it on delivery will result in a similar payload regardless of the encoding you choose.
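And the gzip point is easy to sanity-check -- a rough Python sketch (the HTML snippet is invented; exact numbers will vary with real pages):

    import gzip

    # A markup-heavy snippet with a bit of Chinese text, standing in for a webpage.
    html = '<div class="post"><p lang="zh">你好，世界</p></div>' * 1000

    for enc in ("utf-8", "utf-16"):
        raw = html.encode(enc)
        print(enc, "raw:", len(raw), "gzipped:", len(gzip.compress(raw)))
    # The raw sizes differ substantially; the gzipped payloads land in the same ballpark.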
UTF-8's ASCII compatibility is an anti-feature; it's allowed us to continue to use systems that are encoding naive (in practice ASCII-only). It's no substitute for creating encoding-aware programs, libraries, and systems.
The vast majority of text is not in HTML or XML, and there's no reason you can't use Chinese characters in JavaScript besides (your strings and variable/class/component/file names will surely outpace your use of keywords).
It's not an anti-feature, it's a benefit that is a huge asset in the real world. For example, you can be on a legacy ASCII system, inspect a modern UTF-8 file, and if it's in a Latin language then it will still be readable as opposed to gibberish. Yes all modern tools should be (and these days generally are) encoding-aware, but in the real world we're stuck with a lot of legacy tools too.
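For example (a contrived Python sketch; the French phrase is just an illustration), here's what an encoding-naive Latin-1/ASCII viewer would show for the same text in each encoding:

    text = "Déjà vu: the café re-opened"

    # The UTF-8 bytes as seen by a naive Latin-1 viewer: only the accented
    # letters turn to mojibake, the ASCII bulk stays perfectly readable.
    print(text.encode("utf-8").decode("latin-1"))  # DÃ©jÃ  vu: the cafÃ© re-opened

    # The UTF-16 bytes are NUL-riddled gibberish in that same viewer.
    print(text.encode("utf-16"))  # b'\xff\xfeD\x00\xe9\x00j\x00\xe0\x00 \x00v\x00u\x00...'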
And of course the vast majority of transmitted digital text is in HTML and similar! What do you think it's in instead?
By sheer quantity of digital words consumed by the average person, it's news and social media delivered in browsers (HTML), followed by apps (still using HTML markup to a huge degree) and ebooks (ePub based on HTML). And of course plenty of JSON and XML wrapping too.
And of course you can use Chinese characters in JavaScript/JSON, but development teams are increasingly international and English is the de facto lingua franca.
That huge asset has become a liability. We always needed to become encoding-aware, but UTF-8's ASCII compatibility has let us delay it for decades, and caused exactly the confusion we're having in this debate right now. So many engineers have been foiled by putting off learning about encodings. Joel Spolsky wrote an article, Atwood wrote an article, Python made a backwards-incompatible change, etc. etc. etc.
To be honest, I'm just guessing about what formats text is stored in--I'll cop to it being very hard to prove. But my guess is the vast majority of text is in old binary formats, executables, log files, firmware, or in databases without markup. That's pretty much all your webpages right there.
n.b. JSON doesn't really fit the markup argument. The whole idea is that HTML is super noisy, and that noise costs 1 byte per character in UTF-8 versus 2 bytes in UTF-16. JSON isn't noisy, so the overhead is very low.
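Rough numbers, since it's easy to measure (a toy Python sketch with invented snippets):

    snippets = {
        # Markup-heavy: most characters are ASCII tags and attribute noise.
        "html": '<p class="msg"><span lang="zh">你好，世界</span></p>',
        # JSON: far less structural noise around the same payload.
        "json": '{"msg": "你好，世界"}',
    }

    for name, s in snippets.items():
        print(name, "utf-8:", len(s.encode("utf-8")), "bytes,",
              "utf-16:", len(s.encode("utf-16")), "bytes")
    # The ASCII markup dominates the HTML and costs twice as much in UTF-16;
    # JSON carries far less of that structural noise either way.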
You can't rewrite all existing legacy software to support encodings. You just can't. A backwards-compatible format was a huge catalyst for widely supporting Unicode in the first place. What exactly are we delaying for decades? Engineers everywhere use Unicode today for new software. The battle has been won, moving forwards.
And the vast majority of text isn't in computer code or even books. It's in the seemingly endless stream of content produced by journalists and social media each and every day, dwarfing executables, firmware, etc. And if it supports any kind of formatting (bold/italics etc.) -- which most does -- then it's virtually always stored in HTML or similar (XML). I mean, what are even the alternatives? Neither RTF nor Markdown come even close in terms of adoption.
> You can't rewrite all existing legacy software to support encodings. You just can't. A backwards-compatible format was a huge catalyst for widely supporting Unicode in the first place.
Totally agree.
> What exactly are we delaying for decades?
Learning how encodings work and using that knowledge to write encoding-aware software.
> Engineers everywhere use Unicode today for new software. The battle has been won, moving forwards.
They do, but they're frequently foiled by on-disk encodings, filenames, internal string formats, network data, etc. etc. etc. All this stuff is outlined in TFA.
> And the vast majority of text isn't in computer code or even books. It's in the seemingly endless stream of content produced by journalists and social media each and every day
I concede I'm not likely to convince you here, but like, do you think Twitter is storing markup in their persistence layer? I doubt it. And even if there is some formatting, we're talking about <b> here, not huge amounts of angle brackets.
But think about any car display. That's probably not markup. Think about ATMs. Log files. Bank records. Court records. Label makers. Airport signage. Road signage. University presses.
The reason most programmers use English in their source code has nothing to do with file size (for that there are JS minifiers) or supported encodings. It comes down to two things: English is the most used language in the industry, so if you want to cooperate with programmers from other parts of the world English is a good idea; and it frankly looks ugly to mix languages in the same file, so when the standard library is in English your source code will be too.
So since most source code is in English (and JS gets minified anyway), UTF-8 works perfectly there too.