It's not an anti-feature; it's a huge asset in the real world. For example, you can be on a legacy ASCII system, inspect a modern UTF-8 file, and if it's in a Latin language it will still be readable rather than gibberish. Yes, all modern tools should be (and these days generally are) encoding-aware, but in the real world we're stuck with a lot of legacy tools too.
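Here's a quick sketch of what I mean (Python, with a made-up string and a deliberately naive viewer): every ASCII byte in a UTF-8 stream *is* that ASCII character, so an ASCII-only tool still renders most Latin-language text readably, while UTF-16 mangles even plain English.

```python
# Sketch: how a naive ASCII-only viewer sees UTF-8 vs UTF-16 text.
# (Hypothetical example string; any Latin-language text behaves the same.)
text = "Café menu: crème brûlée"

def ascii_view(data: bytes) -> str:
    # Pass printable ASCII bytes through, render anything else as '?'.
    return "".join(chr(b) if 0x20 <= b < 0x7F else "?" for b in data)

print(ascii_view(text.encode("utf-8")))     # Caf?? menu: cr??me br??l??e
print(ascii_view(text.encode("utf-16-le"))) # C?a?f???... every char mangled
```

The UTF-8 version stays legible because only the accented characters fall outside ASCII; in UTF-16 every character carries a non-printable byte, so a legacy tool shows nothing useful.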
And of course the vast majority of transmitted digital text is in HTML and similar! What do you think it's in instead?
By sheer quantity of digital words consumed by the average person, it's news and social media delivered in browsers (HTML), followed by apps (still using HTML markup to a huge degree) and ebooks (ePub based on HTML). And of course plenty of JSON and XML wrapping too.
And of course you can use Chinese characters in JavaScript/JSON, but development teams are increasingly international and English is the de facto lingua franca.
That huge asset has become a liability. We always needed to become encoding-aware, but UTF-8's ASCII compatibility has let us delay it for decades, and it caused exactly the confusion we're debating right now. So many engineers have been foiled by putting off learning about encodings: Joel Spolsky wrote an article, Atwood wrote an article, Python made a backwards-incompatible change, etc. etc. etc.
To be honest, I'm just guessing about what formats text is stored in--I'll cop to it being very hard to prove. But my guess is the vast majority of text is in old binary formats, executables, log files, firmware, or in databases without markup. That last one is pretty much all your webpages right there.
n.b. JSON doesn't really fit the markup argument. The whole idea is that HTML is super noisy, and that noise costs 1 byte per character in UTF-8 versus 2 bytes in UTF-16. JSON isn't noisy, so the overhead is very low.
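To put rough numbers on it (Python again; both snippets are hypothetical, chosen only to show the ratios): ASCII markup is 1 byte/char in UTF-8 and 2 in UTF-16, while CJK content is 3 bytes in UTF-8 and 2 in UTF-16, so noisy HTML favors UTF-8 and low-noise JSON can even tip the other way.

```python
# Rough size comparison of markup noise under UTF-8 vs UTF-16.
samples = {
    "HTML": '<p class="greeting"><b>你好世界</b></p>',
    "JSON": '"你好世界"',
}
for label, s in samples.items():
    u8 = len(s.encode("utf-8"))
    u16 = len(s.encode("utf-16-le"))
    print(f"{label}: utf-8={u8} bytes, utf-16={u16} bytes")
# HTML: utf-8=43 bytes, utf-16=70 bytes  (markup noise dominates)
# JSON: utf-8=14 bytes, utf-16=12 bytes  (little noise, UTF-16 wins)
```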
You can't rewrite all existing legacy software to support encodings. You just can't. A backwards-compatible format was a huge catalyst for widely supporting Unicode in the first place. What exactly are we delaying for decades? Engineers everywhere use Unicode today for new software. The battle has been won, moving forwards.
And the vast majority of text isn't in computer code or even books. It's in the seemingly endless stream of content produced by journalists and social media each and every day, dwarfing executables, firmware, etc. And if it supports any kind of formatting (bold/italics etc.) -- which most does -- then it's virtually always stored in HTML or similar (XML). I mean, what are even the alternatives? Neither RTF nor Markdown comes even close in terms of adoption.
> You can't rewrite all existing legacy software to support encodings. You just can't. A backwards-compatible format was a huge catalyst for widely supporting Unicode in the first place.
Totally agree.
> What exactly are we delaying for decades?
Learning how encodings work and using that knowledge to write encoding-aware software.
> Engineers everywhere use Unicode today for new software. The battle has been won, moving forwards.
They do, but they're frequently foiled by on-disk encodings, filenames, internal string formats, network data, etc. etc. etc. All this stuff is outlined in TFA.
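As a sketch of the kind of foiling I mean (Python, made-up strings): bytes written as UTF-8 and read back as Latin-1 decode without any error at all, because Latin-1 maps every byte to some character--the corruption is silent.

```python
# Classic pitfall: data written as UTF-8, read back assuming Latin-1.
# Latin-1 accepts every byte value, so nothing raises -- silent mojibake.
original = "naïve café"
data = original.encode("utf-8")

print(data.decode("latin-1"))  # 'naÃ¯ve cafÃ©' -- mojibake, no exception
print(data.decode("utf-8"))    # 'naïve café'   -- correct, explicit decode
```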
> And the vast majority of text isn't in computer code or even books. It's in the seemingly endless stream of content produced by journalists and social media each and every day
I concede I'm not likely to convince you here, but like, do you think Twitter is storing markup in their persistence layer? I doubt it. And even if there is some formatting, we're talking about <b> here, not huge amounts of angle brackets.
But think about any car display. That's probably not markup. Think about ATMs. Log files. Bank records. Court records. Label makers. Airport signage. Road signage. University presses.