Trying to figure out how to express this without making people mad at me. I think the conflation of Unicode with "plain text" might be a mistake. Don't get me wrong, Unicode serves an important purpose. But bumping the version from plain text 1.0 (ASCII) to plain text 2.0 (Unicode) introduced a ton of complexity, and there are cases where the abstractions start leaking (iterating over characters, for example).
With things like data archival, if I have a hard drive with the Library of Congress stored in ASCII, I need half a sheet of paper to understand how to decode it.
Whereas apparently UTF-8 requires 7k words just to explain why it's important. And that's not even looking at the spec.
Just to be crystal clear, I'm not advocating to not use Unicode, or even use it less. I'm just saying I think it maybe shouldn't count as plain text, since it looks a lot like a relatively complicated binary format to me.
Unicode is complicated because the languages it needs to handle are, alas, complicated. UTF-8 is super simple. It's a variable-length encoding for 21-bit unsigned integers. Wikipedia gives a handy table showing how it works:

    0xxxxxxx                             1 byte,  U+0000  to U+007F
    110xxxxx 10xxxxxx                    2 bytes, U+0080  to U+07FF
    1110xxxx 10xxxxxx 10xxxxxx           3 bytes, U+0800  to U+FFFF
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4 bytes, U+10000 to U+10FFFF
When I wrote a very primitive UTF-8 library, I really began to appreciate UTF-8's design. For example, the first byte tells you how many bytes the character requires. At first it was daunting, but once I put two and two together, it really opened up.
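That first-byte rule is small enough to show directly. A minimal sketch (my own toy version, not the library described here):

```python
def utf8_seq_len(first_byte: int) -> int:
    """Total length in bytes of the UTF-8 sequence starting with this byte."""
    if first_byte < 0x80:           # 0xxxxxxx: one-byte ASCII
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx: four-byte sequence
        return 4
    raise ValueError("continuation byte or invalid leading byte")

print(utf8_seq_len("€".encode("utf-8")[0]))  # € starts a 3-byte sequence: 3
```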
I am sure there are many aspects I am missing about UTF-8, but it is all reasonable in its design and implementation.
For reference, I was converting between code points and actual bytes, and also implemented strlen and strcmp (though for the latter, the standard library's plain byte-wise comparison apparently handles UTF-8 fine).
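A code-point strlen turns out to be a one-liner, because continuation bytes are exactly the bytes matching 10xxxxxx. A sketch of the idea (again my own illustration, not the code mentioned above):

```python
def utf8_strlen(data: bytes) -> int:
    # Each character contributes exactly one non-continuation byte,
    # and continuation bytes all match 10xxxxxx (i.e. b & 0xC0 == 0x80).
    return sum(1 for b in data if b & 0xC0 != 0x80)

print(utf8_strlen("héllo".encode("utf-8")))  # 5 code points in 6 bytes
```

And byte-wise strcmp really does work unchanged: UTF-8 was designed so that sorting the raw bytes gives the same order as sorting the code points.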
The self-synchronizing property is also very clever. If you start at an arbitrary byte, you can find the start of the next character by scanning forward a maximum of 3 bytes.
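Resynchronizing is just "skip continuation bytes". A minimal sketch of that scan:

```python
def next_char_start(data: bytes, i: int) -> int:
    """From any byte offset, step forward to the start of the next character."""
    i += 1
    while i < len(data) and data[i] & 0xC0 == 0x80:  # skip 10xxxxxx bytes
        i += 1
    return i

data = "a€b".encode("utf-8")  # b'a\xe2\x82\xacb'; € occupies offsets 1..3
print(next_char_start(data, 2))  # starting mid-€, the next character is at 4
```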
Yeah, this. I have a pat "Unicode Rant" that boils down to this essentially.
Having a catalog of standard numbers-to-glyphs (or symbols or whatever, little pictures humans use to communicate with) is awesome and useful (and all ASCII ever was), but trying to digitize all of human language is much, much more challenging.
But human language doesn't stop being "much much more challenging" if you decide not to engage.
Sometimes (and this can even be an admirable choice) in some specialist applications it's acceptable to decide you won't embrace the complexity of human language. But in a lot of the places where that's fine, we already did it with just the decimal digits (telephone numbers, UPC/EAN product codes), so we don't even need ASCII.
In most other places insisting upon ASCII is just an annoying limitation, it's annoying not being able to write your sister's name in the name of the JPEG file, regardless of whether her name is 林鳳嬌 or Jenny Smith, and it jumps out at you if the product you're using is OK with Jenny Smith but not 林鳳嬌.
You might think: well, OK, but there weren't these problems with ASCII, so the complexity is Unicode's fault. Think about Sarah O'Connor. That apostrophe will often break people's software without any help from Unicode.
Your sister's name doesn't render in my browser (stable Firefox on Linux 5.6). I'm sure I'm missing a fontpack or something. Again, I'm not saying ASCII is the solution, I'm saying Unicode is much more difficult to get right, and maybe we should call it something other than "plain text", since we already had a generally accepted meaning for that for many years. I'm usually in favor of making a new name for a thing rather than overloading an old name.
Firefox does full font fallback, so this means your system just isn't capable of rendering her name (which, yes, you might be able to fix by installing font packages). If you don't understand Han characters, that's an acceptable situation: the dotted boxes (which I assume rendered instead) alert you that there is something here you can't display properly. And if you know you couldn't understand it even if it were displayed, there's no need to bother.
It really is just plain text. Human writing systems were always this hard, and "for many years" what you had were separate independent understandings of what "plain text" means in different environments, which makes interoperability impossible. Unicode is mostly about having only one "plain text" rather than dozens.
It is not mandatory that your 80x25 terminal learn how to display Linear B; you can't read Linear B, you probably have no desire to learn how, and you have no interest in any text written in it. But Unicode means your computer agrees with everybody else's computer that it's Linear B, and not a bunch of symbols for drawing Space Invaders, or the manufacturer's logo. If you fix a typo in a document I wrote that has some Linear B in it, your computer doesn't replace the Linear B with question marks, or erase the document, since it knows what that text is even if you can't read it and it doesn't know how to display it.
But I'm not saying we shouldn't engage, I'm just pointing out that the catalog of lil pictures is the easy part of the task.
One way I put it is, imagine if one of the first-class outputs of the Unicode Consortium was standard libraries for different human languages for different computer languages.
As a person who comes from a country with non-ASCII alphabet, I strongly disagree. Since UTF-8 became de-facto standard everywhere, so many headaches went away.
That complexity comes from the fact that you are using non-ASCII characters. UTF-8 is a superset of standard ASCII: if you are using only standard ASCII characters, they're exactly the same thing, byte for byte.
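That's easy to verify directly; a quick Python check:

```python
text = "plain ASCII only"
# The ASCII and UTF-8 encodings of pure-ASCII text are byte-for-byte identical.
assert text.encode("ascii") == text.encode("utf-8")

# The moment a non-ASCII character appears, the encodings diverge.
print("café".encode("utf-8"))  # b'caf\xc3\xa9' (é takes two bytes)
```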
And you're naïve if you think ASCII suffices for English. I wouldn't give you ½¢ for an OS incapable of handling Unicode and UTF-8 even if you told me every language other than English were mysteriously destroyed. Going back to ASCII is 180° from what would enrich English-language text.
> You only need one sentence to explain why ASCII isn't sufficient
Nitpick: ASCII is sufficient when you consider that Base64, despite its 33% overhead from representing 6 bits with 8 bits, makes life easier for certain classes of software.
What I was alluding to is, I often convert any binary data, including text, to Base64 to avoid dealing with cross platform, cross language, cross format, cross storage, cross network data-handling. Only the layer that needs to deal with the blob's actual string representation needs to worry about encoding schemes that are outside the purview of the humble ASCII table.
Base64 encodes sextets. The mapping from octets to sextets is mostly settled for each group of three octets at a time, but inputs whose length isn't divisible by three need '=' padding, and that's where things get messier.
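Python's standard base64 module (which follows RFC 4648) shows what happens at each input length:

```python
import base64

print(base64.b64encode(b"abc"))  # b'YWJj' : 3 octets -> 4 sextets, no padding
print(base64.b64encode(b"ab"))   # b'YWI=' : 2 octets -> 3 sextets + 1 pad char
print(base64.b64encode(b"a"))    # b'YQ==' : 1 octet  -> 2 sextets + 2 pad chars
```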
ASCII is English, and limiting the rest of humanity's access to knowledge for the sake of a simpler encoding is just not an acceptable option. Someone needs to interpret those 7k words and write a (complicated?) program once so that billions can read in their own language? Sounds like an easy win to me.
Spoken, sure, but both the Arabic script and CJK ideographs are written in far more countries, by far more people, and for far longer in history than the ASCII set. Some of the oldest surviving great works of mathematics were written in Arabic, and some of the oldest surviving great works of poetry were written in Chinese, as just two easy and obvious examples of things worth preserving in "plain text".
Playing the devil's advocate here. I am not a native English speaker, I'm a French speaker, but I'm happy that English is kind of the default international language. It's a relatively simple language. I actually make fewer grammar mistakes in English than I do in my native language. I suppose it's probably not a politically correct thing to say (the English are the colonists, the invaders, the oppressors), but eh, maybe it's also kind of a nice thing for world peace if there is one relatively simple language that's accessible to everyone?
Go ahead and make nice libraries that support Unicode effectively, but I think it's fair game, for a small software development shop (or a one-person programming project), to support ASCII only for some basic software projects. Things are of course different when you're talking about governments providing essential services, etc.
I know almost no one who actually types the accented e, let alone the c with the cedilla. I scarcely ever see the degree symbol typed. Rather, I see facade, cafe, and "degrees".
That aside, the big problem with Unicode is not those characters; they're a simple two-byte extension that still obeys the simple bijective mapping of binary character <-> character on screen. Unicode in general doesn't. You have to deal with multiple code points representing one on-screen grapheme, which in turn may or may not translate into a single on-screen glyph; also bidirectional text, or even vertical text (see the recent post about Mongolian script). Unicode is still probably one of the better solutions possible, but there's a reason you don't see it everywhere: it means not just moving to wide chars but dealing with a text shaper, redoing your interfaces, and tons of other messy stuff. It's very easy for most people to look at that and ask why they'd bother if only a tiny percentage of users use, say, vertical text.
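The "multiple code points, one grapheme" issue is visible even from stdlib Python (full grapheme segmentation needs a third-party library; the stdlib only exposes normalization):

```python
import unicodedata

nfc = "\u00e9"                           # 'é' as one precomposed code point
nfd = unicodedata.normalize("NFD", nfc)  # 'e' plus combining acute (U+0301)

print(len(nfc), len(nfd))  # 1 2 : same grapheme on screen, different lengths
assert unicodedata.normalize("NFC", nfd) == nfc  # normalization round-trips
```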
The first point is just because of the keys on a keyboard.
I see many uses of "pounds" or "GBP" on HN. Anyone with the symbol on the keyboard (British and Irish obviously, plus several other European countries) types £. When people use a phone keyboard, and a long-press or symbol view shows $, £ and €, they can choose £.
Danish people use ½ and § (and £). These keys are labelled on the standard Danish Windows keyboard.
There's plenty of scope for implementing enough Unicode to support most Latin-like languages without going as far as supporting vertical or RTL text.
For some reason people seem to think that the only options are UTF-8 and ASCII. That choice never existed: there are hundreds of character encodings in use, and before Unicode nearly every writing system had its own character encoding that was incompatible with everything else.
You didn't say spoken by every person. Merely spoken in every country. Even the existence of tourists in a country would pass this incredibly low bar...
Of course ASCII is simpler than Unicode; it handles only 128 characters. If you restrict yourself to those characters, ASCII is byte-for-byte identical to UTF-8.
So yeah, maybe you shouldn't use characters 128+ for data archival, I doubt that's a good idea, but that's irrelevant to whether UTF-8 is plain text or not.
I think that sometimes it makes sense to enforce strict limitations early on (e.g. overly strict input validation). You can then remove such limitations in later versions of your software, after careful consideration and after inserting the necessary tests. The reverse usually doesn't work. If you didn't have those limitations early on, and your database is full of strings with characters that should never have been allowed in there, you will have a hard time cleaning up the mess.
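As a sketch of that approach (the names and the exact rule here are hypothetical, purely illustrative): start with a deliberately strict whitelist, because widening it later is a backwards-compatible change, while narrowing it after bad data is stored is not.

```python
import re

# Hypothetical v1 policy: lowercase ASCII start, then ASCII word characters,
# 3-32 chars total. Deliberately stricter than strictly necessary.
V1_USERNAME = re.compile(r"[a-z][a-z0-9_]{2,31}")

def is_valid_username(name: str) -> bool:
    # fullmatch anchors the pattern to the whole string.
    return V1_USERNAME.fullmatch(name) is not None

print(is_valid_username("jenny_smith"))  # True
print(is_valid_username("林鳳嬌"))        # False, until a later version allows it
```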
This seems especially true to me in the design of programming languages. If you have useless, badly thought out features in your programming language, people will begin to rely on them, and you will never be able to get rid of them... So start with a small language, and make it strict. Grow it gradually.
There are tens of thousands of characters in all the human scripts. If you're a librarian, scholar, researcher -- why would you not want to be able to use them seamlessly??
If there were a complicated tool that claimed it could do the job of every tool in history, or a simple tool focused on covering 99% of the work you do (and we lived on planet Earth), which would you choose?