Trying to figure out how to express this without making people mad at me. I think the conflation of Unicode with "plain text" might be a mistake. Don't get me wrong, Unicode serves an important purpose. But bumping the version from plain text 1.0 (ASCII) to plain text 2.0 (Unicode) introduced a ton of complexity, and there are cases where the abstractions start leaking (iterating over characters, for example).
With things like data archival, if I have a hard drive with the Library of Congress stored in ASCII, I need half a sheet of paper to understand how to decode it.
Whereas apparently UTF-8 requires 7k words just to explain why it's important. And that's not even looking at the spec.
Just to be crystal clear, I'm not advocating to not use Unicode, or even use it less. I'm just saying I think it maybe shouldn't count as plain text, since it looks a lot like a relatively complicated binary format to me.
Unicode is complicated because the languages it needs to handle are, alas, complicated. UTF-8 is super simple. It's a variable-length encoding for 21-bit unsigned integers. Wikipedia gives a handy table showing how it works:

    0xxxxxxx                             1 byte,  U+0000  to U+007F
    110xxxxx 10xxxxxx                    2 bytes, U+0080  to U+07FF
    1110xxxx 10xxxxxx 10xxxxxx           3 bytes, U+0800  to U+FFFF
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  4 bytes, U+10000 to U+10FFFF
When I wrote a very primitive UTF-8 library, I really began to appreciate UTF-8's design. For example, the first byte tells you how many bytes the character requires. At first it was daunting, but once I put two and two together, it really opened up.
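That first-byte rule is small enough to show directly. A minimal sketch (my own toy version, not the library described here):

```python
def utf8_seq_len(first_byte: int) -> int:
    """Total length in bytes of the UTF-8 sequence starting with this byte."""
    if first_byte < 0x80:           # 0xxxxxxx: one-byte ASCII
        return 1
    if first_byte >> 5 == 0b110:    # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:   # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:  # 11110xxx: four-byte sequence
        return 4
    raise ValueError("continuation byte or invalid leading byte")

print(utf8_seq_len("€".encode("utf-8")[0]))  # € starts a 3-byte sequence: 3
```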
I am sure there are many aspects I am missing about UTF-8, but it is all reasonable in its design and implementation.
For reference, I was converting between code points and actual bytes, and also implemented strlen and strcmp (though for the latter, the standard library's plain byte-wise comparison apparently handles UTF-8 fine).
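A code-point strlen turns out to be a one-liner, because continuation bytes are exactly the bytes matching 10xxxxxx. A sketch of the idea (again my own illustration, not the code mentioned above):

```python
def utf8_strlen(data: bytes) -> int:
    # Each character contributes exactly one non-continuation byte,
    # and continuation bytes all match 10xxxxxx (i.e. b & 0xC0 == 0x80).
    return sum(1 for b in data if b & 0xC0 != 0x80)

print(utf8_strlen("héllo".encode("utf-8")))  # 5 code points in 6 bytes
```

And byte-wise strcmp really does work unchanged: UTF-8 was designed so that sorting the raw bytes gives the same order as sorting the code points.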
The self-synchronizing property is also very clever. If you start at an arbitrary byte, you can find the start of the next character by scanning forward a maximum of 3 bytes.
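Resynchronizing is just "skip continuation bytes". A minimal sketch of that scan:

```python
def next_char_start(data: bytes, i: int) -> int:
    """From any byte offset, step forward to the start of the next character."""
    i += 1
    while i < len(data) and data[i] & 0xC0 == 0x80:  # skip 10xxxxxx bytes
        i += 1
    return i

data = "a€b".encode("utf-8")  # b'a\xe2\x82\xacb'; € occupies offsets 1..3
print(next_char_start(data, 2))  # starting mid-€, the next character is at 4
```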
Yeah, this. I have a pat "Unicode Rant" that boils down to this essentially.
Having a catalog of standard numbers-to-glyphs (or symbols or whatever, little pictures humans use to communicate with) is awesome and useful (and all ASCII ever was), but trying to digitize all of human language is much, much more challenging.
But human language doesn't stop being "much much more challenging" if you decide not to engage.
Sometimes (and this can even be an admirable choice) in some specialist applications it's acceptable to decide you won't embrace the complexity of human language. But in a lot of the places where that's fine, we already did it with just the decimal digits (telephone numbers, UPC/EAN product codes), so we don't even need ASCII.
In most other places insisting upon ASCII is just an annoying limitation, it's annoying not being able to write your sister's name in the name of the JPEG file, regardless of whether her name is 林鳳嬌 or Jenny Smith, and it jumps out at you if the product you're using is OK with Jenny Smith but not 林鳳嬌.
You might think: well, OK, but there weren't these problems with ASCII, so the complexity is Unicode's fault. Think about Sarah O'Connor. That apostrophe will often break people's software without any help from Unicode.
Your sister's name doesn't render in my browser (stable Firefox on Linux 5.6). I'm sure I'm missing a fontpack or something. Again, I'm not saying ASCII is the solution, I'm saying Unicode is much more difficult to get right, and maybe we should call it something other than "plain text", since we already had a generally accepted meaning for that for many years. I'm usually in favor of making a new name for a thing rather than overloading an old name.
Firefox does full font fallback, so this means your system just isn't capable of rendering her name (which, yes, you might be able to fix by installing font packages). If you don't understand Han characters, that's an acceptable situation: the dotted boxes (which I assume rendered instead) alert you that there is something here you can't display properly. And if you know you couldn't understand it even if it were displayed, there's no need to bother.
It really is just plain text. Human writing systems were always this hard, and "for many years" what you had were separate independent understandings of what "plain text" means in different environments, which makes interoperability impossible. Unicode is mostly about having only one "plain text" rather than dozens.
It is not mandatory that your 80x25 terminal learn how to display Linear B; you can't read Linear B, you probably have no desire to learn how, and you have no interest in any text written in it. But Unicode means your computer agrees with everybody else's computer that it's Linear B, and not a bunch of symbols for drawing Space Invaders, or the manufacturer's logo. If you fix a typo in a document I wrote that has some Linear B in it, your computer doesn't replace the Linear B with question marks, or erase the document, since it knows what that text is even if you can't read it and it doesn't know how to display it.
But I'm not saying we shouldn't engage, I'm just pointing out that the catalog of lil pictures is the easy part of the task.
One way I put it is, imagine if one of the first-class outputs of the Unicode Consortium was standard libraries for different human languages for different computer languages.
As a person who comes from a country with non-ASCII alphabet, I strongly disagree. Since UTF-8 became de-facto standard everywhere, so many headaches went away.
That complexity comes from the fact that you are using non-ASCII characters. UTF-8 is a superset of standard ASCII: if you are using only standard ASCII characters, they're exactly the same thing, byte for byte.
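That's easy to verify directly; a quick Python check:

```python
text = "plain ASCII only"
# The ASCII and UTF-8 encodings of pure-ASCII text are byte-for-byte identical.
assert text.encode("ascii") == text.encode("utf-8")

# The moment a non-ASCII character appears, the encodings diverge.
print("café".encode("utf-8"))  # b'caf\xc3\xa9' (é takes two bytes)
```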
And you're naïve if you think ASCII suffices for English. I wouldn't give you ½¢ for an OS incapable of handling Unicode and UTF-8 even if you told me every language other than English were mysteriously destroyed. Going back to ASCII is 180° from what would enrich English-language text.
> You only need one sentence to explain why ASCII isn't sufficient
Nitpick: ASCII is sufficient when you consider that Base64, despite its 33% overhead from representing 6 bits with 8 bits, makes life easier for certain classes of software.
What I was alluding to is, I often convert any binary data, including text, to Base64 to avoid dealing with cross platform, cross language, cross format, cross storage, cross network data-handling. Only the layer that needs to deal with the blob's actual string representation needs to worry about encoding schemes that are outside the purview of the humble ASCII table.
Base64 encodes sextets. The mapping from octets to sextets is mostly settled for each group of three octets at a time, but inputs whose length isn't divisible by three need '=' padding, and that's where things get messier.
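Python's standard base64 module (which follows RFC 4648) shows what happens at each input length:

```python
import base64

print(base64.b64encode(b"abc"))  # b'YWJj' : 3 octets -> 4 sextets, no padding
print(base64.b64encode(b"ab"))   # b'YWI=' : 2 octets -> 3 sextets + 1 pad char
print(base64.b64encode(b"a"))    # b'YQ==' : 1 octet  -> 2 sextets + 2 pad chars
```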
ASCII is English, and limiting the rest of humanity's access to knowledge for the sake of a simpler encoding is just not an acceptable option. Someone needs to interpret those 7k words and write a (complicated?) program once so that billions can read in their own language? Sounds like an easy win to me.
Spoken, sure, but both the Arabic script and CJK ideographs are written in far more countries, by far more people, and for far longer in history than the ASCII set. Some of the oldest surviving great works of mathematics were written in Arabic, and some of the oldest surviving great works of poetry were written in Chinese, as just two easy and obvious examples of things worth preserving in "plain text".
Playing the devil's advocate here. I am not a native English speaker, I'm a French speaker, but I'm happy that English is kind of the default international language. It's a relatively simple language. I actually make fewer grammar mistakes in English than I do in my native language. I suppose it's probably not a politically correct thing to say (the English are the colonists, the invaders, the oppressors), but eh, maybe it's also kind of a nice thing for world peace if there is one relatively simple language that's accessible to everyone?
Go ahead and make nice libraries that support Unicode effectively, but I think it's fair game, for a small software development shop (or a one-person programming project), to support ASCII only for some basic software projects. Things are of course different when you're talking about governments providing essential services, etc.
I know almost no one who actually types the accented e, let alone the c with the cedilla. I scarcely ever see the degree symbol typed. Rather, I see facade, cafe, and "degrees".
That aside, the big problem with Unicode is not those characters; they're a simple two-byte extension that still obeys the simple bijective mapping of binary character <-> character on screen. Unicode in general doesn't. You have to deal with multiple code points representing one on-screen grapheme, which in turn may or may not translate into a single on-screen glyph; also bidirectional text, or even vertical text (see the recent post about Mongolian script). Unicode is still probably one of the better solutions possible, but there's a reason you don't see it everywhere: it means not just moving to wide chars but dealing with a text shaper, redoing your interfaces, and tons of other messy stuff. It's very easy for most people to look at that and ask why they'd bother if only a tiny percentage of users use, say, vertical text.
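The "multiple code points, one grapheme" issue is visible even from stdlib Python (full grapheme segmentation needs a third-party library; the stdlib only exposes normalization):

```python
import unicodedata

nfc = "\u00e9"                           # 'é' as one precomposed code point
nfd = unicodedata.normalize("NFD", nfc)  # 'e' plus combining acute (U+0301)

print(len(nfc), len(nfd))  # 1 2 : same grapheme on screen, different lengths
assert unicodedata.normalize("NFC", nfd) == nfc  # normalization round-trips
```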
The first point is just because of the keys on a keyboard.
I see many uses of "pounds" or "GBP" on HN. Anyone with the symbol on the keyboard (British and Irish obviously, plus several other European countries) types £. When people use a phone keyboard, and a long-press or symbol view shows $, £ and €, they can choose £.
Danish people use ½ and § (and £). These keys are labelled on the standard Danish Windows keyboard.
There's plenty of scope for implementing enough Unicode to support most Latin-like languages without going as far as supporting vertical or RTL text.
For some reason people seem to think that the only options are UTF-8 and ASCII. That choice never existed: there are hundreds of character encodings in use, and before Unicode nearly every writing system had its own character encoding that was incompatible with everything else.
You didn't say spoken by every person. Merely spoken in every country. Even the existence of tourists in a country would pass this incredibly low bar...
Of course ASCII is simpler than Unicode; it handles only 128 characters. If you restrict yourself to those characters, ASCII is byte-for-byte identical to UTF-8.
So yeah, maybe you shouldn't use characters 128+ for data archival, I doubt that's a good idea, but that's irrelevant to whether UTF-8 is plain text or not.
I think that sometimes it makes sense to enforce strict limitations early on (e.g. overly strict input validation). You can then remove such limitations in later versions of your software, after careful consideration and after inserting the necessary tests. The reverse usually doesn't work. If you didn't have those limitations early on, and your database is full of strings with characters that should never have been allowed in there, you will have a hard time cleaning up the mess.
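As a sketch of that approach (the names and the exact rule here are hypothetical, purely illustrative): start with a deliberately strict whitelist, because widening it later is a backwards-compatible change, while narrowing it after bad data is stored is not.

```python
import re

# Hypothetical v1 policy: lowercase ASCII start, then ASCII word characters,
# 3-32 chars total. Deliberately stricter than strictly necessary.
V1_USERNAME = re.compile(r"[a-z][a-z0-9_]{2,31}")

def is_valid_username(name: str) -> bool:
    # fullmatch anchors the pattern to the whole string.
    return V1_USERNAME.fullmatch(name) is not None

print(is_valid_username("jenny_smith"))  # True
print(is_valid_username("林鳳嬌"))        # False, until a later version allows it
```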
This seems especially true to me in the design of programming languages. If you have useless, badly thought out features in your programming language, people will begin to rely on them, and you will never be able to get rid of them... So start with a small language, and make it strict. Grow it gradually.
There are tens of thousands of characters in all the human scripts. If you're a librarian, scholar, researcher -- why would you not want to be able to use them seamlessly??
If there were a complicated tool that claimed it could do the job of every tool in history, or a simple tool focused on covering 99% of the work you do (and we lived on planet Earth), which would you choose?