
Even more specifically, emoji paved the way for proper support of Unicode characters from beyond the Basic Multilingual Plane (BMP).

There are 16 of these supplementary planes. This first block of 65,536 characters is what you can encode with only two bytes (e.g. UTF-16), and it includes most of what anyone alive needs to encode their language adequately. For a long while, anything encoded beyond this block had only limited support, and plenty of bugs and limitations meant that using it was tricky (well, it worked fine in LaTeX of course, via xelatex for example). This was back in 2008/2009.

Characters encoded beyond the BMP, in planes 1 and 2, included things like esoteric CJKV additions (East Asian ideographs) not usually in daily use, but part of historic documents.

Then came the emoji additions (a core set is part of the BMP and came from Japanese telecom standards), and support is now ubiquitous. Using UTF-8 is a no-brainer for most applications, and a good thing that is, too!




The planes beyond the Basic Multilingual Plane are usually referred to as the "astral planes", which include things like Gothic, runes, alchemical symbols, Egyptian hieroglyphs, and emoji: https://justine.storage.googleapis.com/astralplanes.txt


The etymology of this is that Dungeons and Dragons has a "Prime Material Plane" and an "Astral Plane", where the Astral Plane connects the PMP to various "Outer Planes" made of ridiculous, not-oft-encountered stuff.

But whoever came up with this cute analogy got the analogy wrong — the higher Unicode planes are analogous to the "outer planes" themselves; while the "astral plane" would be some sort of glue allowing you to access these outer planes from within the BMP. Like... surrogate-pair characters! One could nickname the reserved surrogate-pair range in the BMP, the "astral projection" range ;)


"Astral plane" predates Dungeons and Dragons by centuries. Looking at old discussions, I couldn't find any evidence that Unicode's usage is connected with D&D.

Early discussion of "astral character" or "astral plane" for the Unicode supplementary planes at: https://unicode.org/mail-arch/unicode-ml/Archives-Old/UML024... Even earlier 1998 use: https://www.unicode.org/L2/L1998/98354.pdf


The term "astral plane" is older than D&D, and I would assume they took it from the more general usage, not the specific usage in D&D. https://en.wikipedia.org/wiki/Astral_plane


I’ve met several members of the Unicode standards committee. They’re nerds. The kind of nerds for whom “Astral Plane” is a multilayered joke. It’s not not about the general usage, but nor is it not about the D&D term.


> you can encode with only two bytes (e.g. UTF-16)

UTF-16 is variable width, not two bytes, and it can encode any Unicode character.


OP probably meant UCS-2.


> Characters encoded beyond the BMP, in planes 1 and 2, included things like esoteric CJKV additions (East Asian ideographs) not usually in daily use, but part of historic documents.

Unfortunately, this has been false for a long time. The BMP turned out not to be enough even by Unicode 3.0 [1], where the initial set of Unicode emoji (722 characters) would barely have fit in its undesignated area. As a result, many important characters, starting with larger sets like the CJKV additions and eventually almost everything by Unicode 6.0 [2], got allocated in the SMP and SIP instead. The HKSCS additions in the CJK Unified Ideographs Extension B block (U+20000..2A6FF) are notable examples.

[1] https://www.unicode.org/roadmaps/bmp/bmp-3-0.html

[2] https://www.unicode.org/roadmaps/bmp/bmp-6-0-0.html


CJK Unified Ideograph Extension B/C/D are all pretty exotic though. In normal daily use you won't encounter them, because input methods rarely offer them and people simply don't need them. These are important characters, but only a handful of them will ever be used by the average writer of Chinese or Japanese.

I was using some of these (from B and probably C) for very specific purposes at that time, and general support was a long way off in 2009 (although already good on GNU/Linux distributions).


You're mixing up UCS-2 and UTF-16.


To expand on this comment, UCS-2 defines a fixed-length, 2-byte encoding of Unicode. It can therefore only represent the first 65536 characters in the Basic Multilingual Plane (BMP).

UTF-16 allows representing characters outside of the BMP by using a reserved area to split a single codepoint into two surrogates that form a pair.

This makes UTF-16 complicated and in some ways worse than UTF-8: the encoding is longer for many typical texts, but is still not fixed-width. The bug you typically see is that codepoints outside of the BMP are munged when clipping the text to a certain length (or reversing it, but that doesn't happen in real systems generally.)
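To make the surrogate math concrete, here's a minimal Python sketch (the helper name is mine, purely for illustration) that splits a supplementary-plane code point into its pair and then reproduces the clipping bug described above:

    def to_surrogate_pair(cp):
        # Split a code point beyond the BMP into a UTF-16 surrogate pair.
        assert cp > 0xFFFF
        v = cp - 0x10000                      # 20 bits remain
        return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

    high, low = to_surrogate_pair(0x1F600)    # U+1F600 GRINNING FACE
    print(f"{high:04X} {low:04X}")             # D83D DE00

    # The clipping bug: cutting at a fixed number of 16-bit code units
    # can split a pair, leaving a lone (invalid) surrogate behind.
    data = "😀".encode("utf-16-le")            # 4 bytes = 2 code units
    clipped = data[:2]                         # keep only 1 code unit
    print(clipped.decode("utf-16-le", "replace"))  # U+FFFD, original lost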


The reason why some older mobile phones struggle with SMS containing emojis, instead of just displaying tofu in place of unsupported characters, is that there's no way to send emojis in accordance with the SMS standard: it defines the encoding to be UCS-2. In order to put emojis in an SMS, newer phones send the message as UTF-16 instead, technically violating the standard, which can break some parsers that only expect UCS-2 to be there.


UTF-16 is the worst of both worlds when compared to UTF-8 and UTF-32. The only reason it exists (and, unfortunately, remains prevalent) is that a number of popular technologies (Java, JavaScript, Windows) thought they were being smart when building their Unicode support on UCS-2, and now here we are.

Now, the issue of clipping or reversing strings is a problem not just because of encoding. It simply doesn't work even with UTF-32. You're going to end up cutting off combining characters, for example. Manipulating strings is very difficult, and software should never really try to do it unless the developers know what they are doing, and even then you need to use a library to help you do it.
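A hedged Python sketch of that point: even slicing by whole code points (which is what a UTF-32 view gives you) still tears apart grapheme clusters:

    s = "cafe\u0301"          # "café" written with a combining acute accent
    print(len(s))             # 5 code points, but only 4 visible characters
    print(s[:4])              # "cafe" -- the accent got clipped off
    # Normalizing first helps only when a precomposed form exists:
    import unicodedata
    print(unicodedata.normalize("NFC", s)[:4])   # "café" (U+00E9 composed)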


> thought they were being smart when building their Unicode support on UCS-2, and now here we are.

Not sure what you mean by 'being smart' when all of those were released before Unicode 2.0.


I said they thought they were smart. I'm not going to judge whether it actually was smart based on the situation then.

That said, UTF-8 was already 4 years old by the time Java came out. Surrogate pairs were added to Unicode in 1996, one year prior to the release of Java.

I joined Sun Microsystems around that time, and Unicode really wasn't a thing in the Solaris world for a few more years, so the fact that people weren't aggressively pushing good Unicode support at the time is understandable. People just didn't have much experience with it.


To that point, what are systems supposed to make of UTF-8 strings encoding codepoints in the surrogate pair range? Is that well-defined?

In other words, to what extent are surrogate pairs a UTF-16 thing, rather than a Unicode thing that exists to accommodate the UCS-2 -> UTF-16 transition?


Surrogates are technically a UTF-16 only thing. Realizing that sometimes they nevertheless escape out into the wild, WTF-8 defines a superset of UTF-8 that encodes them:

https://simonsapin.github.io/wtf-8/

To be clear, this is not an official Unicode spec. It's a hack (albeit a pretty natural and obvious one) to deal with systems that don't do Unicode quite right.
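For what it's worth, Python exposes this distinction directly: strict UTF-8 refuses lone surrogates, while the "surrogatepass" error handler emits the WTF-8-style bytes:

    lone = "\ud83d"                     # a lone high surrogate
    try:
        lone.encode("utf-8")            # strict UTF-8 rejects it
    except UnicodeEncodeError as e:
        print(e.reason)                 # "surrogates not allowed"

    wtf8 = lone.encode("utf-8", errors="surrogatepass")
    print(wtf8)                         # b'\xed\xa0\xbd'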

I recently came across some old code that narrows wchar_t to UCS-2 by zeroing out the high-order bytes. Even though my test was careful not to generate any surrogates in the input, they showed up in the output when a randomly generated code point like U+1DF7C was mangled into U+DF7C.

A corrupted value like that is not necessarily a great example of something you want to preserve, but it's the sort of thing that late 90s code assumed about Unicode.
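A tiny sketch of that mangling (in Python for brevity, though the original code was C):

    cp = 0x1DF7C                  # a random supplementary-plane code point
    narrowed = cp & 0xFFFF        # what naive wchar_t -> UCS-2 narrowing does
    print(f"U+{cp:05X} -> U+{narrowed:04X}")   # U+1DF7C -> U+DF7C
    print(0xD800 <= narrowed <= 0xDFFF)        # True: it landed in the
                                               # reserved surrogate range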


Specifically, filenames on Windows are not UTF-16 (or UCS-2) but rather WTF-16: like UTF-16, but with possibly unpaired surrogates. WTF-8 provides an 8-bit encoding for such filenames that matches UTF-8 wherever the original was valid UTF-16 while converting the rest in the most straightforward way possible, meaning you need less code to go from WTF-16 to WTF-8 than to go from UTF-16 to UTF-8 while rejecting invalid characters.


It's invalid according to the spec. They are permanently reserved code points for use in UTF-16.

> The definition of UTF-8 prohibits encoding character numbers between U+D800 and U+DFFF, which are reserved for use with the UTF-16 encoding form (as surrogate pairs) and do not directly represent characters.

They could be replaced by the replacement character to produce a valid string.


Nitpick: UCS-2 actually isn't fixed-length either, e.g. "ẍ̊" (small x + umlaut + ring above) is two code units (1E8D 030A) or possibly three (0078 0308 030A).


UCS-2 uses a fixed number of (16-bit) code units to represent a Unicode scalar value (code point). Of course, to represent a grapheme cluster, more than one code point may be needed, but that's true of Unicode in general.


> that's true of Unicode in general.

Yes, that was rather my point: if you're using a Unicode-based character encoding, you're going to have variable-width characters regardless, so you might as well use UTF-8.

> UCS-2 uses a fixed number of (16-bit) code units to represent a Unicode scalar value (code point).

Sure, but that's an implementation detail of the mapping from characters (at the application level) to bytes (at the physical(-ish) representation level).


> This first block of 65,536 characters

This is also a little confusing, since "block" already has a specific meaning in relation to Unicode.


UTF-8 is simple enough to implement and yet I've seen it done improperly more than once.

The problem with UTF-8 is that the density is really good for North America and Western Europe but drops off quite a bit for other languages, and you have to trade CPU for bandwidth (e.g. gzip) to do much about it.

Japan has several encodings that switch code pages (ISO-2022-JP does it with escape sequences; Shift JIS is the other one I can recall). As long as you don't switch too rapidly between kanji and borrowed words, they're more compact, but more complex to implement (I would say less so than implementing gzip, but if you aren't using zlib, one of the most portable libraries in existence, you have much bigger issues than character encoding).

UTF-8 takes 3 bytes for most of the first block (the BMP). Only the first 2,048 code points fit into 2 bytes or fewer, which mostly covers European alphabets.
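A quick Python check of those byte counts:

    for ch in ("a", "é", "あ", "😀"):   # ASCII, Latin-1, BMP kana, astral
        print(f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s)")
    # U+0061: 1, U+00E9: 2, U+3042: 3, U+1F600: 4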


Outside of embedded software this really isn't that much of a problem any more.

Taking a random Wikipedia page as a sample, I get 46kB (UTF-8) versus 35kB (Shift-JIS). A random Japanese text from Project Gutenberg is, in Shift-JIS, roughly ⅔ the size of the UTF-8 version.

Those are impressive enough numbers, but add just a single photograph to the Wikipedia page and it doesn't matter at all. Text is just pretty efficient, even if you use an encoding that supports every language in the world.
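You can reproduce the rough ratio with any Japanese string (this particular text is just an illustration):

    text = "日本語のテキスト" * 1000
    print(len(text.encode("utf-8")))       # 24000 bytes (3 per character)
    print(len(text.encode("shift_jis")))   # 16000 bytes (2 per character)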


First, that's because European languages have small alphabets. It's not like Chinese or Japanese with their many thousands of characters could have fit in those 2,048 spots anyways. So it makes sense to allocate the small common alphabets there.

Second, text is so comparatively tiny relative to photos, video, code, etc. that it really doesn't matter at all anyways.

Third, text is often zipped as well. It's often zipped over HTTP. It's zipped when it sits inside of an EPUB. It's zipped when it sits inside a Word document. You can even configure MySQL to zip text fields in a database. Basically, whenever space is an issue, you can fix it.

So it's hard to see how this is any problem in practice at all, when phones and computers mostly ship with 32 GB of SSD minimum.


The density drops off but it's still good density. It's not a problem. And the amount of CPU you need to do the bit shifts is negligible.

Shift JIS requires you to track extra context when you're actively using the text. That's going to take extra space. I bet that in most of the situations where Shift JIS meaningfully wins out, you could get more benefit by using a combination of UTF-8 and Zstd.
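As a hedged, toy illustration with zlib standing in for Zstd (which isn't in the standard library) and a made-up sample text; real prose compresses less dramatically, but the ordering holds:

    import zlib
    text = "吾輩は猫である。名前はまだ無い。" * 500
    utf8 = text.encode("utf-8")
    sjis = text.encode("shift_jis")
    print(len(utf8), len(sjis))            # raw: UTF-8 is ~1.5x larger
    print(len(zlib.compress(utf8)), len(zlib.compress(sjis)))
    # after compression the gap mostly disappears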



