What Every Programmer Absolutely, Positively Needs To Know About Encodings (2011) (kunststube.net)
175 points by subset on Feb 19, 2022 | 58 comments



Fun fact: GB 18030 is a Unicode Transformation Format.

Example: \N{THINKING FACE}\N{FACE WITH TEARS OF JOY}\N{FACE SCREAMING IN FEAR}\N{SMILING FACE WITH SMILING EYES AND THREE HEARTS}\N{PERSON DOING CARTWHEEL}\N{FACE WITH NO GOOD GESTURE}\N{ZERO WIDTH JOINER}\N{FEMALE SIGN}\N{VARIATION SELECTOR-16}\N{EYES}\N{ON WITH EXCLAMATION MARK WITH LEFT RIGHT ARROW ABOVE}\N{SQUARED COOL}\N{VARIATION SELECTOR-16}

In UTF-8:

  00000000: f09f a494 f09f 9882 f09f 98b1 f09f a5b0  ................
  00000010: f09f a4b8 f09f 9985 e280 8de2 9980 efb8  ................
  00000020: 8ff0 9f91 80f0 9f94 9bf0 9f86 92ef b88f  ................
In GB 18030:

  00000000: 9530 cd34 9439 fc38 9530 8335 9530 d636  .0.4.9.8.0.5.0.6
  00000010: 9530 d130 9530 8535 8136 a439 a1e2 8431  .0.0.0.5.6.9...1
  00000020: 8235 9439 cf38 9439 e537 9439 8b32 8431  .5.9.8.9.7.9.2.1
  00000030: 8235                                     .5
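
If you want to reproduce those dumps yourself, here's a minimal Python sketch (just the first two emoji to keep it short; the gb18030 codec ships with CPython's standard library):

    # Encode the same text as UTF-8 and as GB 18030 and compare the bytes.
    s = "\N{THINKING FACE}\N{FACE WITH TEARS OF JOY}"

    utf8 = s.encode("utf-8")
    gb = s.encode("gb18030")

    print(utf8.hex())   # f09fa494f09f9882 -- matches the first 8 bytes of the UTF-8 dump above
    print(gb.hex())     # 9530cd349439fc38 -- matches the first 8 bytes of the GB 18030 dump above

    # Both are full Unicode transformation formats, so the round trip is lossless.
    assert utf8.decode("utf-8") == gb.decode("gb18030") == s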


Which is carefully designed to work around existing code that only expects at-most-two-byte encodings, e.g. Windows's IsDBCSLeadByte(Ex). Normally a bad design for a new-ish encoding, but a reasonable one given that it's meant to be a superset of GBK, an already bad but widespread encoding.


What is a Unicode transformation format?


It's an encoding that encodes all of Unicode. The "UTF" in UTF-8, etc. stands for Unicode Transformation Format.


The world of text encodings is pretty insane, especially once you get into what seems like endless variations on multi-level encodings, like the bajillion different encodings layered on top of quwei/kuten-style CJK character sets.

I'm only scratching the surface right now, and just wrote a CP932 → Unicode lookup utility. https://pastebin.com/4PYmEjQZ

(For internal testing, forgive any sloppiness, but feel free to use the kuten table if you happen to have a niche project with one-way mapping. The tables themselves are facts and not creative expression, so should not be copyrightable, but I'm dedicating the project to the public domain anyway.)


Back in the 90s I had a 4-inch binder stuffed with every encoding registered with ECMA. I'll take UTF-8 with all of its warts over the mess that was the pre-Unicode encoding hell.


UTF-8 isn't actually the warty bit of Unicode. It's fairly reasonable[0], with the only downside being that you can't efficiently index strings by integer. If you use UTF-8 exclusively everywhere in your application, you will remain wart-free.

UTF-16 is the warty bit. Why? Because it was designed to be a fixed-width encoding, under the assumption that 65,536 code points would cover all languages we needed. Well, spoiler alert: this wasn't enough, even with the mess that was UniHan. So they carved out a chunk of those code points from the 16-bit part of Unicode, called them "surrogate code points", and used them to extend UTF-16 into a variable-width encoding. Thus, they could have 16 planes of 65,536 characters each. They weren't valid characters beforehand, so obviously we could just repurpose them as flags for "astral" characters outside the Basic Multilingual Plane. Simple, right?

Except that plenty of code to work with UTF-16 strings already existed, and assumed UTF-16 was a fixed-width encoding. If you chucked an astral character at it, it'd count it as two characters rather than one. It'd also happily split or rearrange astral characters into invalid sequences of code points. Furthermore, if you ask these systems for UTF-8, they'll happily encode the surrogates, leading to two nested layers of variable-width encodings.

Oh, and the two biggest users of UTF-16 were Windows NT and JavaScript. So we're absolutely stuck with fixed-length UTF-16 and the mangled WTF-8[1] encodings they will output until the end of time. Furthermore, if we ever do fill up the astral planes, we'll have to define superastral characters, which means having supersurrogate code points that will break modern UTF-16 again and make WTF-8 handling even more complicated.

[0] Unless you use MySQL, where someone decided to break UTF-8 early on and now there's "utf8mb4" to un-break it.

[1] UTF-8 which contains encoded UTF-16 surrogates, as defined in https://simonsapin.github.io/wtf-8/
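
To make the surrogate machinery concrete, here's a small Python sketch of my own (utf-16-le and the surrogatepass error handler are both standard-library features):

    import struct

    # U+1F602 FACE WITH TEARS OF JOY is outside the BMP, so UTF-16 needs a surrogate pair.
    ch = "\N{FACE WITH TEARS OF JOY}"              # one code point
    units = struct.unpack("<2H", ch.encode("utf-16-le"))
    print([hex(u) for u in units])                 # ['0xd83d', '0xde02'] -- high + low surrogate

    # Code that counts 16-bit units sees two "characters" where there is one code point.
    print(len(ch.encode("utf-16-le")) // 2)        # 2
    print(len(ch))                                 # 1

    # A lone surrogate isn't valid Unicode, so strict UTF-8 refuses to encode it...
    lone = "\ud83d"
    try:
        lone.encode("utf-8")
    except UnicodeEncodeError as e:
        print(e.reason)                            # surrogates not allowed

    # ...but "surrogatepass" emits the WTF-8-style bytes that naive UTF-16
    # converters produce: a 3-byte sequence for the surrogate itself.
    print(lone.encode("utf-8", "surrogatepass").hex())   # eda0bd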


You are completely right.

UTF-16 is either 2 or 4 bytes per code point (not per character), but a lot of older programs (and older programmers) treat it as a "2 bytes per code point" string.

There is a 2-bytes-per-code-point encoding; it's called UCS-2, so it's important to make sure that when something says UTF-16 it doesn't actually mean UCS-2.

I am not familiar enough with UCS-2 to know if it also supports multiple code points per character - but I expect it does - so the mental model of 2 bytes per character breaks down even there.


It depends on which UCS-2 you're talking about.

Originally, UCS-2 was variable-width, and then it was fixed-width, and then UTF-16 was proposed to make it variable-width again. The reason is that there used to be a different standard called UCCS, which was in competition with Unicode at one point before being withdrawn and replaced by it. The actual historical drama is very relevant to why UTF-16 is such a warty mess.

ISO proposed UCCS (then ISO 10646), where you had 31-bit code points. Each code point was organized by its bytes into "groups" (high byte), "planes", "rows", and "cells" (low byte). Furthermore, you couldn't have a group, plane, row, or cell that was an ASCII or ISO control code (00-1F/80-9F), so everything was in Group 20 and Plane 20. The standard then defined 4-, 2-, and 1-byte encodings of these 31-bit code points. These were UCS-4, UCS-2, and UTF-1[0].

UCS-2 as defined in UCCS very much supports multiple code points per character, because UCS-4 was supposed to be the fixed-width encoding. The extension mechanism is very different from UTF-16 surrogates and is actually the same as defined in ISO 2022. That is, when you wanted to use a character from a different group or plane, you would encode special code-switching words that switched all further characters in the string to that new group or plane. This is also why control codes were forbidden in UCCS code points, for the same reason that UTF-16 surrogates are forbidden in Unicode code points.

The underlying problem with UCS-2 (and, for that matter, UTF-1) was that it was not self-synchronizing. In other words, if I take a UCS-2 string and delete one byte from it, all the characters onwards flip their byte order. If I delete a word that's part of a code-switch, then all characters onwards will be misinterpreted in the wrong group or plane. This is a terrible property for text encodings where erroneous deletions or mutations can occur.
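
Here's a quick Python sketch of what self-synchronization buys you (the example strings are my own; errors='replace' is a standard codec error handler):

    text = "héllo wörld"

    # Delete one byte from a UTF-8 stream: only the damaged character is lost,
    # everything after it still decodes correctly.
    u8 = bytearray(text.encode("utf-8"))
    del u8[1]                                            # half of the 2-byte "é"
    print(bytes(u8).decode("utf-8", errors="replace"))   # h�llo wörld

    # Delete one byte from a UTF-16 stream: every later code unit shifts by a
    # byte, so the rest of the string decodes as garbage.
    u16 = bytearray(text.encode("utf-16-le"))
    del u16[2]
    print(bytes(u16).decode("utf-16-le", errors="replace"))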

A couple of things happened that killed UCCS:

- Software companies vehemently opposed 31-bit codepoints as wasteful and insisted on doing everything in 16-bits. They promoted a competing standard called Unicode which defined one 16-bit, fixed-width (and, thus, self-synchronizing) encoding just called "Unicode".

- When designing Plan 9[1], the designers of that OS realized they could make UTF-1 self-synchronizing, and proposed UTF-8 as an alternative encoding for UCCS. This also had the advantage of not requiring integer division to decode, which made it faster on contemporary processors.

ISO ultimately caved and withdrew the original version of ISO 10646. Unicode 1.1 was altered slightly to please ISO, but ISO 10646 was altered drastically to match Unicode. The resulting standard became ISO 10646-1:1993; and future versions of ISO 10646 are more or less identical to specific Unicode versions.

At this point our nomenclature changes, because in Unicode 2.0 they define or reference several encodings[2][3]:

- UCS-2: Fixed-width 16-bit words. At this point, UCS-2's code-switching mechanism had been withdrawn from ISO 10646, making it identical to just an array of Unicode 1.1 codepoints.

- UCS-4: Still specified by ISO 10646 to support 31-bit codepoints, but with the control code restriction withdrawn. Since they were now using group 0 and plane 0 exclusively, the Unicode 2.0 spec says to treat this as codepoints with extra zero padding. At some point in the future this would become UTF-32, restricted to the BMP plus the astral codepoints (U+0000 through U+10FFFF).

- UCS Transformation Format, 7-bit form; or UTF-7: Backwards compatibility for MIME, because e-mail is evil.

- UCS Transformation Format, 8-bit form; or UTF-8: Backwards compatibility for US-ASCII and UNIX filesystems. This is also referred to as "File System Safe UTF", "FSS-UTF", or "UTF-2", but caps characters at 4 bytes, prohibiting the use of 31-bit codepoints.

- UTF-16: The historical mess I mentioned in the previous comment.

Interestingly enough, several parts of the 2.0 spec differ as to whether or not "Unicode" means UCS-2 or UTF-16. In the UTF-8 appendix they specify 4-byte UTF-8 characters as having a "Unicode value" consisting of two UTF-16 surrogates. But in other parts of the spec, such as Table C-1, they list "UCS-2" and "Unicode" as equivalent. I have a feeling that surrogates were a last-minute compromise between ISO and Unicode, given that the UTF-16 spec warns against using existing private-use superastrals[4] already allocated in ISO 10646.

[0] Wikipedia calls this UTF-1 but I don't have access to the old versions of ISO 10646 to check if that standard actually called it that or UCS-1. The Unicode spec calls UTF-8 "UCS Transformation Format-8" so it's possible that the "UTF" nomenclature actually came from ISO first.

[1] Plan 9 is a really fascinating OS, because, while it wasn't very commercially successful, it wound up being highly influential. Things like /proc in Linux were ripped wholesale from Plan 9. It even had C extensions that the designers would later copy in Go (which they also made).

[2] https://www.unicode.org/versions/Unicode2.0.0/appA.pdf

[3] https://www.unicode.org/versions/Unicode2.0.0/appC.pdf

[4] Term I coined in the previous comment for characters beyond U+10FFFF. Encoding them in UTF-16 would require defining a second level of astral surrogate pairs, themselves being encoded as pairs of surrogates.


I recently implemented GB18030, which is a 1-, 2-, or 4-byte encoding with a giant code page covering the entirety of Unicode.
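
A quick way to see the 1/2/4-byte structure from Python (the sample characters are arbitrary; gb18030 is a standard-library codec):

    # ASCII stays 1 byte, anything already in GBK stays 2 bytes, and the rest
    # of Unicode (including astral characters) uses the 4-byte form.
    for ch in "A", "字", "\N{THINKING FACE}":
        encoded = ch.encode("gb18030")
        print(ch, len(encoded), encoded.hex())   # lengths 1, 2 and 4 respectively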


A detail skimmed over in the article, but one which has significant importance, is that:

"one code point in unicode does not necessarily map to one character on the screen."

A "character" can, and often does, get constructed from multiple code points.

This doesn't help an already complicated sorting issue (who knew "sort these names alphabetically" could be an ambiguous statement).
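
A minimal Python illustration of the gap between code points and what you see on screen (my own examples; unicodedata is in the standard library):

    import unicodedata

    # "é" written as base letter + combining accent: one thing on screen, two code points.
    s = "e\N{COMBINING ACUTE ACCENT}"
    print(s, len(s))                                    # é 2

    # The precomposed form is a single code point, and NFC normalization maps one to the other.
    nfc = unicodedata.normalize("NFC", s)
    print(nfc, len(nfc))                                # é 1
    print(unicodedata.normalize("NFD", nfc) == s)       # True

    # Emoji ZWJ sequences go much further: five code points, usually drawn as one glyph.
    family = "\N{MAN}\u200d\N{WOMAN}\u200d\N{GIRL}"
    print(len(family))                                  # 5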


> "one code point in unicode does not necessarily map to one character on the screen."

Also, importantly, some characters are not represented at all. For example, there is no codepoint or combination of characters that distinguishes a capital Turkish dotless I from its identical-looking but conceptually different sibling, the Latin capital I. Similarly for the capital Turkish dotted İ.

I find it extremely weird that two codepoints couldn't have been spared, yet we have gradations of skin tone in emoji. One can use composition to deal with the round-trip from i -> İ -> i (within a closed ecosystem), but even that fails when it comes to ı -> I -> ı.

A case in point is the product pages for my friend's book on Amazon. Compare "Sınır Ötesi" to "Sinir Ötesi". The former means "beyond borders" whereas the latter means "exceedingly irritating".

Yet, because there is no unambiguous representation of the Turkish I's, the rendering makes an assumption on the basis of the domain. Note that all other non-US ASCII letters involved are rendered correctly.

\c[PERSON FROWNING, ZERO WIDTH JOINER, PROGRAMMER]

[tr]: https://www.amazon.com.tr/Gezging%C3%B6z-S%C4%B1n%C4%B1r-T%C...

[us]: https://www.amazon.com/Gezging%C3%B6z-Sinir-T%C3%BCrkiye-Mir...
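
You can watch the ı -> I -> i round-trip failure with nothing more than Python's built-in, locale-blind case mappings (a small sketch of my own):

    # Default Unicode case mapping, with no locale information:
    print("ı".upper())                  # I  -- the dotless capital is plain ASCII I
    print("I".lower())                  # i  -- and lowercasing gives the dotted i back
    print("ı".upper().lower() == "ı")   # False: the round trip is lossy

    # The dotted capital İ at least lowercases to i + COMBINING DOT ABOVE,
    # which preserves the distinction, but only as two code points.
    print([hex(ord(c)) for c in "İ".lower()])   # ['0x69', '0x307']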


This is an outcome of the historic process which gave us modern Unicode.

The Turkish character set ISO 8859-9 was published in 1988, and it doesn't distinguish the Turkish capital dotless I from the Latin capital I, since why would it? It's a byte encoding designed specifically to accommodate Turkish.

Unicode has a principle of adopting whatever makes it easy to convert from the preferred legacy encoding to Unicode, and for Turkish that meant using the same code point for I in both Turkish and practically everywhere else, since I is in ASCII and the bulk of Latin-based character encodings are based on ASCII.

I point this out because it could indeed have been the case that they ended up with separate code points, which would have solved your problem.

But it wouldn't have solved the actual problem, which is that collation and casing are locale-specific according to Unicode, and indeed they must be. Example: in Dutch the letters 'ij' are one letter for collation purposes, as well as for casing; see https://en.wikipedia.org/wiki/IJsselstein for an example.

This is of course impossible to get correct if the application doesn't know it's dealing with Dutch, which it often can't know. The very same problem you have with Turkish.


>The very same problem you have with Turkish.

Not really the same. The brokenness of more complex text transformations (collation, alphabetization, upper-casing, lower-casing) seems a bit less catastrophic than breaking the fundamental ability to unambiguously draw the character.

If you don't think that's an important goal.... then what are "character sets" for anyway?


There is no such breakage at all. The assumption that casing and collation can happen without locale-specific information is invalid; Unicode isn't designed to prevent this, and we'd hate it if it were.

This may not be a solved problem in a given programming language's standard library, and/or one might see a bug such as the one GP was reporting, but neither of these conditions is inevitable. Here's how you do it in Swift:

    "i".uppercased(with: Locale(identifier: "tr_TR")) // returns "İ"
    "i".uppercased(with: Locale(identifier: "en_US")) // returns "I"
Same deal with "ij".uppercased and "nl_NL" locale.

It's not a bug that Unicode just threw its hands up and said that complex string operations are locale-specific. The more you learn about the various scripts it represents, the more you will appreciate that there is no other choice.


I don't disagree, actually. IMO "the brokenness of more complex text transformations" is fundamental, not a 'bug.'

Hopefully this makes what I meant a bit more clear.


That is baloney ... The fact is Unicode could have provided extra codepoints so that at least the possibility existed of correctly processing a mainly French document that includes passages in Turkish.

The current state of affairs categorically rules that out, contrary to the situation with any other language.


That brings up the philosophical problem - is the English word taco the same as the Spanish word taco, and if they aren't the same, are the characters the same, or should there be an "a" for every language?


Which semantic HTML at least attempts to solve with the `span lang` construct, although "taco" is maybe not the ideal example. Saying e.g. <span lang=fr>Jean</span> could in principle tell a screen reader to pronounce this name correctly, rather than using the equally correct English pronunciation, which is a homophone of "gene".

This is another question which doesn't belong in the Unicode character specification, but should be resolved when necessary at a higher level of abstraction.


Exactly. Unicode’s “job” is to provide a code point for every letter - which necessarily involves some trade-offs, such as whether a character is duplicated or not depending on the history of the region (which depends on all sorts of historical accidents, such as whether keyboards were even available).

People shouldn’t expect to know what language a Unicode string is in without external help, even if heuristics work often.


> Exactly. Unicode’s “job” is to provide a code point for every letter

And in this case Unicode fails catastrophically by not providing codepoints for the uppercase of "ı" and the lowercase of "İ" while potentially allowing for ethnically sensitive renderings of shit.


> This is an outcome of the historic process which gave us modern Unicode.

That historic process does not preclude introducing two extra codepoints to at least allow for the possibility of correctly encoding the upper case of "ı" and the lower case of "İ".


You're talking about a historic process using the present tense, forgivable in an ESL speaker, except clearly the intention of the sentence requires the present tense!

The past isn't something you can argue with, no more than the wind.


> who knew "sort these names alphabetically" could be an ambiguous statement

That already is ambiguous without considering text encodings or even computers.

Sort order depends on locale (example: “in Lithuanian, "y" is sorted between "i" and "k"”) and within some locales even on the goal of the sorting (example: “in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite.”)

(Both examples taken from https://unicode-org.github.io/icu/userguide/collation/, which has many more examples)

Also, on that “one code point in unicode does not necessarily map to one character on the screen”, I think it’s important to use consistent terminology. The screen doesn’t show characters but glyphs.
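
You can see the locale dependence directly from Python's standard locale module (a sketch; it only works if the named locales are actually installed on the machine, and the exact ordering comes from the system's collation data):

    import locale

    words = ["of", "öf", "offen", "öffnen"]

    # The same list sorts differently depending on which collation locale is active.
    for loc in ("de_DE.UTF-8", "sv_SE.UTF-8", "C"):
        locale.setlocale(locale.LC_COLLATE, loc)
        print(loc, sorted(words, key=locale.strxfrm))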


This is exactly my point. Growing up I thought alphabetical order was pretty sorted; then I grew up and discovered that not only are there more characters, but people don't even agree on the _ordering_ of characters.

Thus alphabetical order becomes ambiguous depending on where you are.

_then_ you discover some symbol sets / languages don't have an order at all. Oh the world is complicated. I wish I was 4 again.


And then you come across a dictionary or the like and discover that, yeah, sometimes things aren't sorted as plain strings; weird things are done.

Like "the" word, or person with some titles... While ofc. the title itself is sorted in that position...


> in German dictionaries, "öf" would come before "of".

I am only aware of “of” before “öf”, e.g. in Grimmʼs dictionary, the greatest dictionary of German itself, “offen” ‘open (adj.), openly’ comes before “öffen”, a variant of “öffnen” ‘to open’.

When I have to sort strings in German, I make a sortstring by throwing out diacritics (via Unicode NFKD, compatibility decomposition, then dropping the combining marks) and changing “ß” to “s”. Then I order by a tuple of this sortstring and the original string.
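
For reference, a minimal Python transcription of that scheme (my own sketch of the approach described above, not a full DIN 5007 implementation):

    import unicodedata

    def sortkey(s):
        # Decompose, drop the combining marks, fold ß to s, then pair with the original.
        decomposed = unicodedata.normalize("NFKD", s)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        return (stripped.replace("ß", "s"), s)

    words = ["öffen", "offen", "straße", "strasse"]
    print(sorted(words, key=sortkey))   # ['offen', 'öffen', 'straße', 'strasse']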


It's not skimmed over; the article has a whole paragraph entitled "Unicode To The Confusion" about that.

What you describe about sorting is better termed collation, and it is a separate set of problems. https://www.unicode.org/collation/

There can be multiple collations for the same encoding and even for the same language inside the same encoding.

PS: As a matter of fact, in some databases like PostgreSQL you can define a different collation for each column if needed.


That said, the Unicode standard does a great job of tackling that question and providing a framework that can cover all the various language standards (and even with a fully normalized Unicode string you still need to do things like handle ch as a single character for sorting in Czech or Spanish).


In Hungarian "dzs", "zs" and "sz" are so called compund letters, however most software treats them as separate characters, because it would be a pain in the ass do sorting the correct way. (It would either require changing the input method, or software would need to be aware of etymology.) It's only sorted the right way in paper dictionaries.


Actually, it just needs to be context sensitive. If I have an English document and I have an index which has an entry for Zsigmond, it would be placed between Zebra and Zygote. But, if I have a Hungarian document, then the correct ordering would be to have Zsigmond after Zebra and Zygote. Any correctly-implemented program that deals with sorting should not be assuming it can just sort on code point¹ and Hungarian is one of the standard collations that is part of the Unicode reference implementations so it shouldn't be a problem. The fact that a lot of software is badly-implemented is not an excuse for software to continue to be so.

I did a quick check and saw that I can specify Hungarian in MySQL as a collation for a table and I also know that it's available in Java and even JavaScript.

1. This sorting is wrong not just for Hungarian but for every language including English which would expect, e.g., naïve to be sorted between nag and nanny and not after nay.
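
If you want to try this outside a database, PyICU (a third-party binding to ICU, which bundles the CLDR collation tailorings) exposes it; a sketch, assuming the package is installed:

    import icu  # pip install PyICU

    words = ["Zebra", "Zsigmond", "Zygote"]

    # English collation: Zs* sorts inside the Z's, between Zebra and Zygote.
    en = icu.Collator.createInstance(icu.Locale("en_US"))
    print(sorted(words, key=en.getSortKey))

    # Hungarian tailoring: "zs" is its own letter, so Zsigmond should land after Zygote.
    hu = icu.Collator.createInstance(icu.Locale("hu_HU"))
    print(sorted(words, key=hu.getSortKey))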


In Spanish, "ch" and "ll" stopped being considered a single character for sorting purposes in 1994. In have some dictionaries printed before that date, and I find them quite confusing...


As someone who learned his Spanish before 1994 (and whose dictionary also predates that time), I was unaware of this until now.


As a Czech I would love "ch" to go fuck itself for that exact reason.


It’s only complex if your language does not support Unicode properly. In something like Rust or Python you would just ask for the first character rather than the first byte, and the language will work out what that is for you.


A 'character' meaning what? The first code point? The first non-combining code point? The first non-combining code point along with all associated combining code points? The first non-combining code point along with all associated combining code points modified to look like it would look in conjunction with surrounding non-combining code points along with their associated combining code points? The first displayable component of a code point? What are the following?

the second character of sœur (o or the ligature?)

the second character of حبيبي (the canonical form ب or the contextual form ﺒ ?)

the third character of есть (Cyrillic t with or without soft sign, which is always a separate code point and always displayed to the right but changes the sound?)

the first character of 실례합니다 (Korean phoneme or syllabic grapheme?)

the first character of ﷺ or ﷽ ?

The main issue isn't programming language support, it's ambiguity in the concept of "character" and conventions about how languages treat them. Imagine how the letter i would be treated if Unicode were invented by a Turk. The fundamental issue here is that human communication is deeply nuanced in a way that does not lend itself well to systematic encoding or fast/naive algorithms on that encoding. Even in the plain ASCII range it's impossible to know how to render a word like "ANSCHLUSS" in lower case (or how many 'characters' such a word would have) without knowledge of the language, country of origin, and time period in which the word was written.


> A 'character' meaning what? […] it's ambiguity in the concept of "character"

All of this is already standardised in detail in Unicode. There is no ambiguity. Read it some time to mend your ignorance; it answers all your questions with justification. I am only willing to spend time on the answers.

> the second character of sœur (o or the ligature?)

the letter œ, both character (C) and grapheme cluster (GC)

the letter o does not exist in this text (again, according to the standard)

> the second character of حبيبي (the canonical form ب or the contextual form ﺒ ?)

the letter ب, both C and GC

the letter ﺒ is a presentation form and does not exist in this text

> the third character of есть (Cyrillic t with or without soft sign

the letter т, both C and GC

the letter ь, both C and GC, is the 4th in the text

> the first character of 실례합니다 (Korean phoneme or syllabic grapheme?)

the letter 실, both C and GC

the text does not contain the jamo ᄉ

Unicode is not concerned with phonemes, only with writing

> the first character of ﷺ

is entirely itself, the letter ﷺ, both C and GC

> or ﷽ ?

is entirely itself, the symbol ﷽, both C and GC
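
(If you want to check these mechanically, the third-party regex module implements Unicode grapheme cluster segmentation via \X; a quick sketch:)

    import regex  # pip install regex; \X matches an extended grapheme cluster

    for text in ("sœur", "حبيبي", "есть", "실례합니다"):
        print(text, regex.findall(r"\X", text))
    # Each of these strings happens to be one code point per cluster, so the
    # output simply lists the letters, matching the answers above.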


[flagged]


> you are oversimplifying to the point of inaccuracy

No, what I wrote is an accurate reflection of Unicode.

It does not matter what Jean Dupont and Hong Gildong think, but what we agreed on as a standard.

> They do not exist as standalone code points in the binary form of the encoded text

This terminology is confused/conceptually wrong.

The binary form of encoded text consists of bytes. Decoded text consists of characters/grapheme clusters. A code point is the number corresponding to a character in the repertoire tables. By definition a code point cannot exist in encoded text.

The correct way to determine whether sœur contains œ or o is to use a standard compliant implementation. Perl has always been nothing short of excellent, so:

    $ perl -Mutf8 -E'say "sœur" =~ "œ" ? "yes" : "no"'
    yes
    $ perl -Mutf8 -E'say "sœur" =~ "o" ? "yes" : "no"'
    no
> don't let that anger control you

I'm not angry. Projecting much?


Software written in the way you suggest will fail in the real world, where, to beat a dead horse, "sœur" indisputably does contain the letter "o."

You seem to be arguing that the way Unicode treats language is the Right Answer, but that simply isn't the case. Real people speaking real human languages in the real world do not care whether the œ ligature is implemented in a character set or in the font or even what those terms mean. What they do care about is that they searched for "Lettre à ma sœur.docx" and your software failed to show them "Lettre à ma soeur.docx"

I choose not to write software that way. The right thing to do when dealing with human language depends on context, which is ambiguous. Unicode is a just a tool - indeed, one that might be helpful in the case of some ligatures by using NFKD - and nothing more. It certainly isn't an arbiter of what constitutes a character in a human language.

Have a great night.


    #!/usr/bin/env perl
    use utf8;
    use Unicode::Collate;

    my $uc = Unicode::Collate->new(
        normalization => undef, level => 1
    );
    printf "pos %d len %d\n",
        $uc->index('Lettre à ma sœur.docx', $_)
        for 'sœur', 'soeur', 'lettre   a ma-soeur _DOCX_';
    __END__
    pos 12 len 4
    pos 12 len 4
    pos 0 len 21
You have been wrong three times in a row now, just because you have an insufficient understanding of the standard.

Time to do what I said in the beginning and read it, or do you want to continue arguing against straw-men and pretending to be knowledgeable enough about the subject matter to talk about it authoritatively?


Ultimately I think it doesn’t really matter. Not all languages and character sets have orderings. You just order the Latin or closely related characters and numbers and you are ok for almost all situations. If you are a company or product which caters to more languages, you can work out what you do for those. But there is no need to work out how to order every single codepoint in Unicode because it isn’t very useful.


You've gone from arguing that it isn't complex if people have the right language, to redefining the problem as whatever was easy for you to do in your preferred language.

Just say "I don't have the faintest idea how to do this properly". It would be much less of a problem if people would admit they don't know how and need somebody else's help than it is when they insist it "doesn't really matter" as you have and produce garbage that doesn't work but insist that's just because everybody is holding it wrong.


That's not true for Rust; Rust generally supports iterating by bytes and by codepoints, but the grandparent is talking about grapheme clusters. You need an external library for that. The generally endorsed solution is to use the unicode-segmentation crate: https://crates.io/crates/unicode-segmentation


I'm not an expert, but your statement seems confusing and possibly misses the point of the parent.


It's about time that someone fixes Turkish.


According to the article, PHP can handle other encodings by just treating strings as byte sequences and not caring what the encoding is. Their example:

$string = "漢字";

But if you are using, say, UTF-8 and one of those Chinese characters has a byte with the value 34 (the ASCII value of "), then wouldn't the string terminate prematurely?

Edit: to answer my own question, a quote from Wikipedia: ASCII bytes do not occur when encoding non-ASCII code points into UTF-8.
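
You can convince yourself of that quickly in Python (an ad-hoc check, not a proof):

    # Every byte of a multi-byte UTF-8 sequence has the high bit set, so no byte
    # of an encoded non-ASCII character can collide with '"' (0x22) or any other
    # ASCII byte.
    b = "漢字".encode("utf-8")
    print(b.hex())                           # e6bca2e5ad97
    print(all(byte >= 0x80 for byte in b))   # True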


Also, the compiler might be treating the input file as UTF-8, while the semantics of the language may treat string literals as the sequence of bytes when encoded as UTF-8.




This is linked to in the second paragraph of the post


Should mention endianness as well, and EBCDIC. There are big-endian versions of UTF-*.


Not -8 there isn't. One of the many reasons to deprecate all other Unicode encodings whenever it's in one's power to do so.


PHP-centric, not mentioned in the title. Most of it is relevant to everybody, but it is jarring to run into stuff about PHP. Isn't that dead yet?

Of greater moment is that the article keeps talking about "characters", which is an undefined term in Unicode. Unicode offers you code points, code units, graphemes, grapheme clusters, and ... other things, none of which maps to the grouping of dots you see on your screen (and probably cannot imagine how to type in).

"Character" has outlived its sell-by date. Let it be retired and buried with dignity, but with a good thick slab of concrete on top.

It also fails to mention "expanded form" and "canonical form", and other ways that two completely different sequences of bits mean, at some level, the same text. Different forms are convenient for different things; there is a shortest possible representation nice for sending and storing, and a maximally decomposed representation that might be best for editing if you like adding and removing diereses ("umlauts") and accents piecemeal.

And it fails to mention WTF-8, a way to package up byte sequences that are not valid UTF-8, but may contain valid UTF-8 characters that you want to display in case they offer the poor human a clue as to what was intended. WTF-8 sequences often arise in file systems and databases that don't enforce any particular encoding, but just store whatever bytes the benighted programs users run provide as, e.g., names for files. You wish you could display them in sorted order. There had better be a way to point at such a name, because there is no way to type it. And you have to store it, because that is the only way to tell the OS which file you want to rename or delete. Deletion is tempting, but we can't always do that, can we?
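
For what it's worth, this is the problem Python's surrogateescape error handler (the mechanism behind os.fsdecode/os.fsencode on POSIX) addresses: undecodable filename bytes are smuggled through as lone surrogates so the exact bytes can always be handed back to the OS. A small sketch with made-up bytes:

    # Latin-1 bytes that are not valid UTF-8, e.g. a legacy file name.
    raw = b"r\xe9sum\xe9.txt"

    # The bad bytes become lone surrogates instead of raising an error...
    name = raw.decode("utf-8", "surrogateescape")
    print(repr(name))                                      # 'r\udce9sum\udce9.txt'

    # ...and the original byte string survives the round trip exactly.
    print(name.encode("utf-8", "surrogateescape") == raw)  # True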


78% of the websites you visit are written in PHP

https://w3techs.com/technologies/overview/programming_langua...


98.3% of all statistics are made up on the fly.

It is not possible to determine with certainty how a given set of HTML/CSS/JavaScript was generated. They admit to it in their FAQ: https://w3techs.com/faq

They also state which web sites they survey, which doesn't include the web site many of us use for most of our workday: our employer's, which very likely is made using either ASP.NET or Java and isn't available on the Internet.

None of which should be taken as an argument that PHP isn't still a major, if not dominant, language for creating dynamic web pages on the Internet. shudder


I can hope.


Come on, the PHP part is small and at the end, and most things said in this part apply in any language that works the same way, notably C and C++ which also don't care about what you put into their strings.


PHP is very much not dead. WordPress still seems to be popular. Facebook replaced its optimizing PHP VM with another one a few years ago. And most importantly, PHP is a low-friction, decent performance way to create simple dynamic websites because it was made for that.



