
UTF-16 is garbage. Windows is stuck with it because it was too early an adopter of Unicode. Oh the irony. This may set Windows on a path to deprecating UTF-16 -- godspeed!



Another exciting development in moving beyond UTF-16: Microsoft is experimenting with adding a native UTF-8 string type to .NET alongside the existing UTF-16 string:

https://github.com/dotnet/coreclr/commits/feature/utf8string


I've just spent some time reading through the proposals, and it made for a fascinating read! It's really interesting to see the work and discussions that go into a seemingly simple feature like this.


Could you elaborate? I've been under the impression for most of my career that doubling a digit leads to huge benefits that I'm too comp-sci ignorant to understand.


Can I assume that's just subtle humor on your part?

If not:

UTF-16 is born of UCS-2 being a very poor encoding: it was limited to the Unicode BMP, which means 2^16 codepoints, but Unicode has many more codepoints than that, so users couldn't have the pile-of-poo emoticon. Something had to be done, and that something was to create a variable-length (in terms of 16-bit code units) encoding using a few then-unassigned codepoints in the BMP. The result yields only a sad, pathetic, measly ~1.1 million codepoints (21 bits' worth), and that's just not that much. Moreover, while many encodings play well with ASCII, UTF-16 doesn't.

Also, decomposed forms of Unicode characters necessarily involve multiple codepoints, thus multiple code units. Many programmers hate variable-length text encodings because they can't do a simple array indexing operation to find the nth character in a string, but with UTF-8, UTF-16, and just plain decomposition, that's a fact of life anyway.

If you're going to have a variable-length encoding, you might as well use UTF-8 and get all of its plays-nice-with-ASCII benefits. For Latin-mostly text UTF-8 is also more efficient than UTF-16, so there is a slight size benefit there too.
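
To make the plays-nice-with-ASCII and size points concrete, here is a rough Python sketch (just an illustration; the sample string and the pile-of-poo codepoint U+1F4A9 are arbitrary choices):

    # Plain ASCII: UTF-8 leaves the bytes untouched, UTF-16 doubles them.
    ascii_text = "hello, world"
    print(len(ascii_text.encode("utf-8")))      # 12
    print(len(ascii_text.encode("utf-16-le")))  # 24

    # A codepoint outside the BMP takes 4 bytes either way: four UTF-8
    # code units vs. a two-code-unit surrogate pair in UTF-16.
    poo = "\U0001F4A9"
    print(poo.encode("utf-8").hex())      # f09f92a9
    print(poo.encode("utf-16-le").hex())  # 3dd8a9dc (surrogates D83D DCA9, little-endian)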

Much of the rest of the non-Windows, non-ECMAScript world has settled on UTF-8, and that's a very very good thing.


No humor whatsoever. Thank you for the explanation! I'm an ops person who knows python/golang to a dangerous extent, and I've never gone out of my way to understand the reasoning behind the UTF encodings. Your post intrigued me and made me want to ask why you felt that way. This will make me sound horrendously ignorant to someone like you, but I'm here to learn.


UTF-16 has a bit of a funky design (using four byte/two code unit surrogate pairs to encode code points outside the basic multilingual plane) that ultimately restricts Unicode (if compatibility is to be maintained with UTF-16, at least) to 17 planes, or 2^20 code points (about 1 million).
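
Concretely, the surrogate-pair arithmetic looks like this (a hand-rolled Python sketch; U+10348 is just an arbitrary supplementary-plane codepoint chosen for illustration):

    # Encode a code point above the BMP as a surrogate pair, by hand.
    cp = 0x10348                      # any code point in U+10000..U+10FFFF
    offset = cp - 0x10000             # at most 20 bits of payload
    high = 0xD800 + (offset >> 10)    # lead surrogate
    low  = 0xDC00 + (offset & 0x3FF)  # trail surrogate
    print(hex(high), hex(low))        # 0xd800 0xdf48

    # Cross-check against Python's UTF-16 encoder (big-endian, no BOM).
    assert chr(cp).encode("utf-16-be").hex() == "d800df48"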

UTF-8 uses a variable length encoding that allows for more characters-- if restricted to four bytes, it allows for 2^21 total code points; it's designed to eventually allow for 2^31 code points, which works out to about 2 billion code points that can be expressed.

(Granted, this is all hypothetical-- Unicode isn't even close to filling all of the space that UTF-16 allows; there aren't enough known writing systems yet to be encoded to fill all of the remaining Unicode planes (3-13 of 17 are all still unassigned). But UTF-16's still nonstandard (most of the world's standardized on UTF-8) and kind of ugly, so the sooner it goes away, the better.)


Thank you, this was an incredibly understandable explanation.


That is a bit misleading to the point of error, on several points:

* Your timeline is backwards. UTF-8 was designed for a 31-bit code space. Far from that being its future, that is its past. In the 21st century it was explicitly reduced from 31-bit capable to 21 bits.

* UTF-16 is just as standard as UTF-8 is, it being standardized by the same people in the same places.

* 17 planes is 21 bits; it is 16 planes that is 20 bits.
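
For what it's worth, the plane arithmetic in that last point is easy to check (a quick Python sanity check, nothing more):

    # 16 planes of 2^16 code points each is exactly 2^20 code points...
    assert 16 * 0x10000 == 2**20
    # ...while the full 17-plane space tops out at U+10FFFF, which needs 21 bits.
    assert 17 * 0x10000 - 1 == 0x10FFFF
    assert (0x10FFFF).bit_length() == 21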


Joel Spolsky's essay "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is an excellent read:

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...


Wikipedia says: "UTF-8 requires 8, 16, 24 or 32 bits (one to four octets) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character"

https://en.wikipedia.org/wiki/Comparison_of_Unicode_encoding...
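
You can see those per-character byte counts directly; here is a small Python illustration (the four sample characters are arbitrary picks spanning the ranges):

    # Bytes per code point: UTF-8 uses 1-4, UTF-16 uses 2 or 4.
    for ch in ["A", "\u00E9", "\u20AC", "\U0001F600"]:  # A, é, €, an emoji
        print("U+%04X: utf-8=%d bytes, utf-16=%d bytes"
              % (ord(ch), len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))))
    # U+0041: utf-8=1 bytes, utf-16=2 bytes
    # U+00E9: utf-8=2 bytes, utf-16=2 bytes
    # U+20AC: utf-8=3 bytes, utf-16=2 bytes
    # U+1F600: utf-8=4 bytes, utf-16=4 bytes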


> I've been under the impression for most of my career that doubling a digit leads to huge benefits that I'm too comp-sci ignorant to understand.

I was confused about this for years, too. But it turns out it's just a problem of bad naming. Happens more in this industry than we'd like to admit.

As others explained, it boils down to UTF-16 being 16-bit, and UTF-8 being anything from 8- to 32-bit. It should have been named UTF-V (from "variable") or something, but here we are.


UTF-16 is a variable-length encoding using up to two code units, each of which is 16 bits wide.

UTF-8 is a variable-length encoding using up to four code units (though it used to be up to six, and could again be), each of which is 8 bits wide.

Both UTF-16 and UTF-8 are variable-length encodings!

UTF-32 is not variable-length, but even so, the way Unicode works a character like á (LATIN SMALL LETTER A WITH ACUTE) can be written in two different ways, one of which requires one codepoint and one of which requires two (regardless of encoding), while ṻ (LATIN SMALL LETTER U WITH MACRON AND DIAERESIS) can be written in up to five different ways requiring from one to three different codepoints (regardless of encoding).
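
For instance, here is a small Python illustration of the two spellings of á and the full decomposition of ṻ (using the stdlib unicodedata module):

    import unicodedata

    # One user-perceived character, two canonically equivalent spellings:
    # precomposed U+00E1 vs. 'a' followed by U+0301 COMBINING ACUTE ACCENT.
    a1 = "\u00E1"
    a2 = "a\u0301"
    print(len(a1), len(a2))                        # 1 2
    print(unicodedata.normalize("NFC", a2) == a1)  # True

    # U+1E7B fully decomposes to three codepoints: u + macron + diaeresis.
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u1E7B")])
    # ['0x75', '0x304', '0x308']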

Not every character has a one-codepoint representation in Unicode, or at least not every character has a canonically-pre-composed one-codepoint representation in Unicode.

Therefore, many characters in Unicode can be expected to be written in multiple codepoints regardless of encoding. Therefore all programmers dealing with text need to be prepared for being unable to do an O(1) array index operation to get at the nth character of a string.

(In UTF-32 you can do an O(1) array index operation to get to the nth codepoint, not character, but one is usually only ever interested in getting the nth character.)
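
To make that concrete: Python 3 strings index by codepoint, so they behave like the UTF-32 case described above (a tiny sketch, not from the parent comment):

    # Indexing gives you codepoints, not user-perceived characters.
    s = "a\u0301bc"          # "ábc" with the accent stored as a combining mark
    print(len(s))            # 4 (four codepoints for three visible characters)
    print(s[0])              # prints "a"; the combining accent lives at s[1]
    print(s[1] == "\u0301")  # True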


Thanks. It turns out I was even more confused than I thought I was.


For completeness' sake: the confusing part is that there is (or rather, used to be) a constant-length encoding that uses the exact same codepoints as UTF-16 but doesn't allow the variable-length extensions. That encoding is called UCS-2, and although it's deprecated, it is the reason why many people think UTF-16 is constant-length.


Not the same codepoints as UTF-16, but a subset. But yes, UCS-2 is the source of all this evil.



