Joel Spolsky spoke against this exact statistics-based approach when he wrote about Unicode[1]:
> What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle.
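To make the quoted approach concrete, here's a toy sketch of frequency-based guessing. It's only an illustration: the candidate encodings, the reference sentences, and the overlap score are all invented for the example, and a real detector (like the one IE used) is far more sophisticated.

```python
# Toy sketch of the byte-frequency guessing described in the quote above.
# Candidate encodings, reference sentences, and scoring are invented for
# illustration; real detectors are far more sophisticated.
from collections import Counter

# Reference text per candidate encoding (a Russian sentence for the Cyrillic
# codepages, an accented French sentence for Latin-1).
CANDIDATES = {
    "koi8_r": "съешь ещё этих мягких французских булок",
    "cp1251": "съешь ещё этих мягких французских булок",
    "latin-1": "dès noël, où un zéphyr haï me vêt",
}

def high_byte_histogram(data: bytes) -> Counter:
    """Count occurrences of bytes in the 128-255 range (the 'national' letters)."""
    return Counter(b for b in data if b >= 0x80)

# Build a reference histogram for each encoding from its sample text.
REFERENCE = {enc: high_byte_histogram(text.encode(enc))
             for enc, text in CANDIDATES.items()}

def guess_encoding(data: bytes) -> str:
    """Return the candidate whose high-byte histogram best overlaps the input's."""
    observed = high_byte_histogram(data)
    def score(enc: str) -> int:
        return sum(min(observed[b], REFERENCE[enc][b]) for b in REFERENCE[enc])
    return max(REFERENCE, key=score)

mystery = "привет, мир".encode("koi8_r")   # pretend we don't know the encoding
print(guess_encoding(mystery))             # "koi8_r" -- but it is only a guess
```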
I don't think he was speaking against the statistics-based approach itself, just against Postel's Law in general.
Ideally people would see gibberish (or an error message) immediately if they don't provide an encoding; then they'll know something is wrong, figure it out, fix it, and never have the issue again.
But if we're in a situation where we already have lots and lots of text documents that don't have an encoding specified, and we believe it's not feasible to require everyone to fix that, then it's actually pretty amazing that we can often correctly guess the encoding.
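For what it's worth, this is essentially what libraries like Python's chardet (a port of Mozilla's universal charset detector) do today. A small example, assuming an unlabeled CP1251 Russian sample; the exact output is illustrative, not guaranteed:

```python
# chardet guesses an encoding from byte statistics and reports a confidence.
# The sample text and the printed result are illustrative only.
import chardet

raw = "Съешь же ещё этих мягких французских булок, да выпей чаю.".encode("cp1251")
guess = chardet.detect(raw)
print(guess)
# e.g. {'encoding': 'windows-1251', 'confidence': 0.87, 'language': 'Russian'}

# A guess is still only a guess: short or atypical input can be misdetected,
# which is exactly the failure mode described elsewhere in this thread.
if guess["encoding"]:
    text = raw.decode(guess["encoding"])
```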
There's the enca library (and CLI tool), which does exactly that. I used it often before UTF-8 became overwhelmingly dominant. The situation was especially dire with Russian text, where three single-byte encodings were quite widespread: KOI8-R, mostly found on Unix systems; CP866, used in DOS; and CP1251, used in Windows. What's worse, with Windows you sometimes had to deal with both CP866 and CP1251, because it included a DOS subsystem with its own separate codepage.
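As a quick illustration of why those three codepages are both detectable and easy to confuse, the same Russian word lands in a different slice of the 128–255 range in each (codec names are Python's spellings; this is just a demonstration, not how enca works internally):

```python
# The same word in three incompatible single-byte Cyrillic encodings.
word = "привет"
for codec in ("koi8_r", "cp866", "cp1251"):
    print(f"{codec:8} {word.encode(codec).hex(' ')}")

# koi8_r   d0 d2 c9 d7 c5 d4
# cp866    af e0 a8 a2 a5 e2
# cp1251   ef f0 e8 e2 e5 f2
```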
Exactly. I used this technique at Mozilla in 2010 when processing Firefox add-ons, and it misidentified the encoding of scripts pretty frequently. There's far less weirdly encoded text in the wild than there are false positives from statistics-based approaches.
1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...