If everyone chooses whatever encoding they like, then the charset being used has to be encoded somewhere. The problem is, there are lots of places where the charset isn't encoded (such as your filesystem). It's easy to miss that this is a problem, because almost all charsets are a strict superset of ASCII (UTF-7 and UTF-16 are the only exceptions among the top 99.99% of charsets by usage), so it's only when you hit your first non-ASCII characters that problems emerge.
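To make that concrete, here's a minimal Python sketch (the string is just an example): the bytes themselves don't record which charset produced them, and an all-ASCII string would decode identically under both guesses, so the wrong guess only surfaces at the first non-ASCII character.

```
# The same bytes under two charset guesses. Nothing in the byte
# stream says which guess is right.
data = "café".encode("utf-8")    # b'caf\xc3\xa9'

print(data.decode("utf-8"))      # café   (right guess)
print(data.decode("latin-1"))    # cafÃ©  (wrong guess: mojibake, no error raised)
```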
Unicode has its share of issues, but at this point, Unicode is the standard for dealing with text, and all i18n-aware code is going to be built on Unicode internally. The only safe way to handle text that has even the remotest chance of needing i18n is to work with charsets that support all of Unicode, and given its compatibility with ASCII, UTF-8 is the most reasonable one to pick.
If you want to insist on using KOI-8, or ISO-2022-JP, or ISO-8859-1, you're implicitly saying "fuck you" to 2/3 of the world's population since you can't support tasks as basic as "let me write my name" for them.
> If everyone chooses whatever encoding they like, then the charset being used has to be encoded somewhere.
This is gonna be the case for the foreseeable future, as you point out. Settling on one encoding only fixes this, like, 100 years from now. I'd prefer to build encoding-aware software that solves this problem now.
> given its compatibility with ASCII, UTF-8 is the most reasonable one to pick
This only makes sense if your system is ASCII in the first place, and if you can't build encoding-aware software. I think we can both agree that's essentially legacy ASCII software, so you don't get to choose anything anyway. And any system that interacts with it should be encoding-aware and still validate the encoding anyway, as though it might be Big5 or whatever. Assuming ASCII/UTF-8 is a bad idea, always and forever.
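For what it's worth, "validate the encoding anyway" can be cheap. A hypothetical sketch in Python (the read_text helper and its signature are mine, not from any particular library): decoding strictly turns a bad charset assumption into a loud error instead of silent garbage.

```
def read_text(raw: bytes, encoding: str = "utf-8") -> str:
    # Strict decoding (Python's default) raises UnicodeDecodeError on
    # malformed input rather than passing garbage downstream.
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        raise ValueError(f"input is not valid {encoding}: {err}") from None
```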
> If you want to insist on using KOI-8, or ISO-2022-JP, or ISO-8859-1, you're implicitly saying "fuck you" to 2/3 of the world's population since you can't support tasks as basic as "let me write my name" for them.
I'm not obligated to write software for every possible user at every point in time. It's perfectly acceptable for me to say, "I'm writing this program for my 1 friend who speaks Spanish" and have that be my requirements. But if I were to write software that had a hope of being broadly useful, UTF-8 everywhere doesn't get me there. I'd have to build it to be encoding-aware, and let my users configure the encoding(s) it uses.
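And "let my users configure the encoding(s)" can be as simple as a flag passed through to the decoder. A hypothetical Python sketch (the --encoding flag and its default are illustrative, not from any real tool):

```
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("file")
parser.add_argument("--encoding", default="utf-8",
                    help="charset of the input file (e.g. utf-8, cp1252, big5)")
args = parser.parse_args()

# open() decodes with whatever charset the user declared, and raises
# UnicodeDecodeError if the file doesn't match that declaration.
with open(args.file, encoding=args.encoding) as f:
    text = f.read()
```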
> But if I were to write software that had a hope of being broadly useful, UTF-8 everywhere doesn't get me there.
Actually, it does.
Right now, in 2020, if you're writing a new programming language, you can insist that the input files must be valid UTF-8 or it's a compiler error. If you're writing a localization tool, you can insist that the localization files be valid UTF-8 or it's an error. Even if you're writing a compiler for an existing language (e.g., C), it would not be unreasonable to say that the source file must be valid UTF-8 or it's an error--and let those not using UTF-8 right now handle it by converting their source code to use UTF-8. And this has been the case for a decade or so.
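As a sketch of that stance, assuming a hypothetical compiler front end (load_source is my name, not any real tool's): strict UTF-8 decoding makes a non-UTF-8 source file a hard error and pushes the one-time conversion onto whoever still has legacy-encoded files.

```
from pathlib import Path

def load_source(path: str) -> str:
    # "Valid UTF-8 or it's a compiler error": strict decoding fails
    # hard on any byte sequence that isn't well-formed UTF-8.
    try:
        return Path(path).read_text(encoding="utf-8")
    except UnicodeDecodeError as err:
        raise SystemExit(f"{path}: source is not valid UTF-8 ({err})")
```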
That's the point of UTF-8 everywhere: if you don't have legacy concerns [someone actively using a non-ASCII, non-UTF-8 charset that you have to support], force UTF-8 and be done with it. And if you do have legacy concerns, try to push people to using UTF-8 anyways (e.g., default to UTF-8).
I can't insist that other systems send my program UTF-8, or that my users' OSes use UTF-8 for filenames and file contents, or that data in databases uses UTF-8, or that the UTF-8 I might get is always valid. The end result of all these things you're raising is "you can't assume, you always have to check, so UTF-8 everywhere buys you nothing". Even if we did somehow get there, you'd still have to validate it.