[pedantic] "UTF-8 space"... UTF-8 code units space is quite limited by single by...

sctb · on Oct 6, 2016

We detached this subthread from https://news.ycombinator.com/item?id=12654930 and marked it off-topic.

dragonwriter · on Oct 6, 2016

> UTF-8 code units space is quite limited by single byte - 0...255

Yes, but no one said "UTF-8 code units space", and it's pretty clear that the "UTF-8 space" intended was the full space representable in UTF-8, not the space representable in a single UTF-8 code unit, so this is not only pedantic but also a non sequitur.

GauntletWizard · on Oct 6, 2016

Not true at all. Per the wiki article[1], "UTF-8 is a character encoding capable of encoding all possible characters, or code points, defined by Unicode"

It's a variable-length encoding that happens to coincide with ASCII for the first 128 characters (0 through 127). But if the leading bit of the byte is 1, it indicates that it's part of a multi-byte glyph. With this encoding you can represent the entirety of the Unicode space: the later bytes start with 10 to indicate they're continuation parts of the glyph, while the first byte starts with 11 and its following bits indicate how many bytes long the glyph is.
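To make the bit layout concrete, here is a rough Python sketch of a hand-rolled encoder for a single code point. The function name `utf8_encode` is made up for illustration, and the range checks follow the current definition of UTF-8 (nothing above 0x10FFFF, no surrogates). In real code you would just call `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (illustrative, not optimized)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx: identical to ASCII
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# U+20AC (euro sign) takes three bytes: one lead byte, two continuation bytes.
print(" ".join(f"{b:08b}" for b in utf8_encode(0x20AC)))
# prints: 11100010 10000010 10101100
```

The lead byte's run of 1 bits gives the sequence length, and every continuation byte carries six payload bits behind its `10` prefix.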

[1]https://en.wikipedia.org/wiki/UTF-8

c-smile · on Oct 6, 2016

Depends on what "space" means...

As for the UTF-8 encoding: it is a variable-length encoding capable of encoding any 32-bit number, from 0 to 0xFFFFFFFF, not just the current set of 21-bit Unicode code points.

As for "... indicates that it's part of a multi-byte glyph.": you are mixing up completely different entities here. Unicode has nothing to do with glyphs.

A glyph is an atomic component (an image) of a font's internal structure. A single character (a Unicode code point here) can be composed on screen from multiple glyphs.

wolf550e · on Oct 6, 2016

> As for the UTF-8 encoding: it is a variable-length encoding capable of encoding any 32-bit number, from 0 to 0xFFFFFFFF, not just the current set of 21-bit Unicode code points.

You're thinking of an older version of UTF-8, which allowed sequences of up to 6 bytes to encode a single code point. UTF-8 is now defined to disallow code point values above 0x10FFFF and code point values from 0xD800 to 0xDFFF (inclusive), so that it permits exactly the same values that UTF-16 can represent.
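If you want to see those restrictions in practice, here is a small sketch that leans on Python's built-in strict UTF-8 decoder. The helper name `is_valid_utf8` and the sample byte strings are my own picks for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "U+10FFFF (highest allowed)": b"\xf4\x8f\xbf\xbf",
    "U+110000 (one past the limit)": b"\xf4\x90\x80\x80",
    "U+D800 (surrogate)": b"\xed\xa0\x80",
    "old-style 5-byte sequence": b"\xf8\x88\x80\x80\x80",
}
for label, raw in samples.items():
    print(f"{label}: {is_valid_utf8(raw)}")
```

Only the first sample should be accepted. The other three follow the general bit pattern but fall outside what the current definition allows.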

https://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

https://tools.ietf.org/html/rfc3629

Dylan16807 · on Oct 8, 2016

Actually, the UTF-8 format maxes out at 31 bits, not 32.
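The arithmetic, sketched for the older layout that allowed 1 to 6 bytes (`payload_bits` is just a name I made up): the lead byte spends n bits on ones plus one bit on a zero, and each continuation byte carries 6 payload bits.

```python
def payload_bits(n_bytes: int) -> int:
    """Payload bits in an n-byte sequence of the original 1-6 byte layout."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("the original layout used 1 to 6 bytes")
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    lead_bits = 7 - n_bytes  # n ones and a zero use n + 1 of the 8 bits
    return lead_bits + 6 * (n_bytes - 1)


for n in range(1, 7):
    print(n, payload_bits(n))
# prints: 1 7, 2 11, 3 16, 4 21, 5 26, 6 31 (one pair per line)
```

The 4-byte form already covers 21 bits, which is more than enough for the 0x10FFFF ceiling.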

consto · on Oct 6, 2016

You are aware that UTF-8 is a variable width encoding, and not limited to ASCII, right?

wolf550e · on Oct 6, 2016

"Code unit" is not "code point". A "code unit" for UTF-8 is 1 octet. A "code unit" for UTF-16 is a 16-bit value. Both UTF-8 and UTF-16 can represent the entire space of Unicode code points by using multiple code units to represent a single code point.
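A quick way to see the difference, as a sketch in Python (the helper `count_units` is a made-up name; `len()` on a Python 3 `str` counts code points):

```python
def count_units(text: str) -> dict:
    """Count code points and the code units needed in UTF-8 and UTF-16."""
    return {
        "code_points": len(text),
        "utf8_units": len(text.encode("utf-8")),            # 1 unit = 1 octet
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 1 unit = 16 bits
    }


for ch in ("A", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", count_units(ch))
```

Each of the three strings is a single code point, but U+20AC needs three UTF-8 code units and U+1F600 needs four UTF-8 code units or two UTF-16 code units (a surrogate pair).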