It's not an anti-feature; it's a huge asset in the real world. For example, you can be on a legacy ASCII system, inspect a modern UTF-8 file, and if it's in a Latin language it will still be readable rather than gibberish. Yes, all modern tools should be (and these days generally are) encoding-aware, but in the real world we're stuck with a lot of legacy tools too.
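Here's a quick sketch of what I mean (Python, with a made-up string and a deliberately naive viewer): every ASCII byte in a UTF-8 stream *is* that ASCII character, so an ASCII-only tool still renders most Latin-language text readably, while UTF-16 mangles even plain English.

```python
# Sketch: how a naive ASCII-only viewer sees UTF-8 vs UTF-16 text.
# (Hypothetical example string; any Latin-language text behaves the same.)
text = "Café menu: crème brûlée"

def ascii_view(data: bytes) -> str:
    # Pass printable ASCII bytes through, render anything else as '?'.
    return "".join(chr(b) if 0x20 <= b < 0x7F else "?" for b in data)

print(ascii_view(text.encode("utf-8")))     # Caf?? menu: cr??me br??l??e
print(ascii_view(text.encode("utf-16-le"))) # C?a?f???... every char mangled
```

The UTF-8 version stays legible because only the accented characters fall outside ASCII; in UTF-16 every character carries a non-printable byte, so a legacy tool shows nothing useful.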
And of course the vast majority of transmitted digital text is in HTML and similar! What do you think it's in instead?
By sheer quantity of digital words consumed by the average person, it's news and social media delivered in browsers (HTML), followed by apps (still using HTML markup to a huge degree) and ebooks (ePub based on HTML). And of course plenty of JSON and XML wrapping too.
And of course you can use Chinese characters in JavaScript/JSON, but development teams are increasingly international and English is the de facto lingua franca.
That huge asset has become a liability. We always needed to become encoding-aware, but UTF-8's ASCII compatibility has let us delay it for decades, and it caused exactly the confusion we're debating right now. So many engineers have been foiled by putting off learning about encodings: Joel Spolsky wrote an article, Atwood wrote an article, Python made a backwards-incompatible change, etc. etc. etc.
To be honest, I'm just guessing about what formats text is stored in--I'll cop to it being very hard to prove. But my guess is the vast majority of text is in old binary formats, executables, log files, firmware, or in databases without markup. That last one is pretty much all your webpages right there.
n.b. JSON doesn't really fit the markup argument. The whole idea is that HTML is super noisy, and that noise costs 1 byte per character in UTF-8 versus 2 bytes in UTF-16. JSON isn't noisy, so the overhead is very low.
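To put rough numbers on it (Python again; both snippets are hypothetical, chosen only to show the ratios): ASCII markup is 1 byte/char in UTF-8 and 2 in UTF-16, while CJK content is 3 bytes in UTF-8 and 2 in UTF-16, so noisy HTML favors UTF-8 and low-noise JSON can even tip the other way.

```python
# Rough size comparison of markup noise under UTF-8 vs UTF-16.
samples = {
    "HTML": '<p class="greeting"><b>你好世界</b></p>',
    "JSON": '"你好世界"',
}
for label, s in samples.items():
    u8 = len(s.encode("utf-8"))
    u16 = len(s.encode("utf-16-le"))
    print(f"{label}: utf-8={u8} bytes, utf-16={u16} bytes")
# HTML: utf-8=43 bytes, utf-16=70 bytes  (markup noise dominates)
# JSON: utf-8=14 bytes, utf-16=12 bytes  (little noise, UTF-16 wins)
```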
You can't rewrite all existing legacy software to support encodings. You just can't. A backwards-compatible format was a huge catalyst for widely supporting Unicode in the first place. What exactly are we delaying for decades? Engineers everywhere use Unicode today for new software. The battle has been won, moving forwards.
And the vast majority of text isn't in computer code or even books. It's in the seemingly endless stream of content produced by journalists and social media each and every day, dwarfing executables, firmware, etc. And if it supports any kind of formatting (bold/italics etc.) -- which most does -- then it's virtually always stored in HTML or similar (XML). I mean, what are even the alternatives? Neither RTF nor Markdown comes even close in terms of adoption.
> You can't rewrite all existing legacy software to support encodings. You just can't. A backwards-compatible format was a huge catalyst for widely supporting Unicode in the first place.
Totally agree.
> What exactly are we delaying for decades?
Learning how encodings work and using that knowledge to write encoding-aware software.
> Engineers everywhere use Unicode today for new software. The battle has been won, moving forwards.
They do, but they're frequently foiled by on-disk encodings, filenames, internal string formats, network data, etc. etc. etc. All this stuff is outlined in TFA.
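As a sketch of the kind of foiling I mean (Python, made-up strings): bytes written as UTF-8 and read back as Latin-1 decode without any error at all, because Latin-1 maps every byte to some character--the corruption is silent.

```python
# Classic pitfall: data written as UTF-8, read back assuming Latin-1.
# Latin-1 accepts every byte value, so nothing raises -- silent mojibake.
original = "naïve café"
data = original.encode("utf-8")

print(data.decode("latin-1"))  # 'naÃ¯ve cafÃ©' -- mojibake, no exception
print(data.decode("utf-8"))    # 'naïve café'   -- correct, explicit decode
```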
> And the vast majority of text isn't in computer code or even books. It's in the seemingly endless stream of content produced by journalists and social media each and every day
I concede I'm not likely to convince you here, but like, do you think Twitter is storing markup in their persistence layer? I doubt it. And even if there is some formatting, we're talking about <b> here, not huge amounts of angle brackets.
But think about any car display. That's probably not markup. Think about ATMs. Log files. Bank records. Court records. Label makers. Airport signage. Road signage. University presses.