Joel Spolsky spoke against this exact statistics-based approach when he wrote about Unicode[1]:
> What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It’s truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn’t exactly conform to the letter-frequency-distribution of their native language, and Internet Explorer decides it’s Korean and displays it thusly, proving, I think, the point that Postel’s Law about being “conservative in what you emit and liberal in what you accept” is quite frankly not a good engineering principle.
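To make the quoted approach concrete, here's a toy sketch of frequency-based guessing. It's only an illustration: the candidate encodings, the reference sentences, and the overlap score are all invented for the example, and a real detector (like the one IE used) is far more sophisticated.

```python
# Toy sketch of the byte-frequency guessing described in the quote above.
# Candidate encodings, reference sentences, and scoring are invented for
# illustration; real detectors are far more sophisticated.
from collections import Counter

# Reference text per candidate encoding (a Russian sentence for the Cyrillic
# codepages, an accented French sentence for Latin-1).
CANDIDATES = {
    "koi8_r": "съешь ещё этих мягких французских булок",
    "cp1251": "съешь ещё этих мягких французских булок",
    "latin-1": "dès noël, où un zéphyr haï me vêt",
}

def high_byte_histogram(data: bytes) -> Counter:
    """Count occurrences of bytes in the 128-255 range (the 'national' letters)."""
    return Counter(b for b in data if b >= 0x80)

# Build a reference histogram for each encoding from its sample text.
REFERENCE = {enc: high_byte_histogram(text.encode(enc))
             for enc, text in CANDIDATES.items()}

def guess_encoding(data: bytes) -> str:
    """Return the candidate whose high-byte histogram best overlaps the input's."""
    observed = high_byte_histogram(data)
    def score(enc: str) -> int:
        return sum(min(observed[b], REFERENCE[enc][b]) for b in REFERENCE[enc])
    return max(REFERENCE, key=score)

mystery = "привет, мир".encode("koi8_r")   # pretend we don't know the encoding
print(guess_encoding(mystery))             # "koi8_r" -- but it is only a guess
```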
I don't think he was speaking against the statistics-based approach itself, just against Postel's Law in general.
Ideally people would see gibberish (or an error message) immediately if they don't provide an encoding; then they'll know something is wrong, figure it out, fix it, and never have the issue again.
But if we're in a situation where we already have lots and lots of text documents that don't have an encoding specified, and we believe it's not feasible to require everyone to fix that, then it's actually pretty amazing that we can often correctly guess the encoding.
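For what it's worth, this is essentially what libraries like Python's chardet (a port of Mozilla's universal charset detector) do today. A small example, assuming an unlabeled CP1251 Russian sample; the exact output is illustrative, not guaranteed:

```python
# chardet guesses an encoding from byte statistics and reports a confidence.
# The sample text and the printed result are illustrative only.
import chardet

raw = "Съешь же ещё этих мягких французских булок, да выпей чаю.".encode("cp1251")
guess = chardet.detect(raw)
print(guess)
# e.g. {'encoding': 'windows-1251', 'confidence': 0.87, 'language': 'Russian'}

# A guess is still only a guess: short or atypical input can be misdetected,
# which is exactly the failure mode described elsewhere in this thread.
if guess["encoding"]:
    text = raw.decode(guess["encoding"])
```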
There's the enca library (and CLI tool), which does exactly that. I used it often before UTF-8 became overwhelmingly dominant. The situation was especially dire with Russian text, where three single-byte encodings were quite widespread: KOI8-R, mostly found on Unix systems; CP866, used in DOS; and CP1251, used in Windows. What's worse, with Windows you sometimes had to deal with both CP866 and CP1251, because it included a DOS subsystem with its own separate codepage.
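As a quick illustration of why those three codepages are both detectable and easy to confuse, the same Russian word lands in a different slice of the 128–255 range in each (codec names are Python's spellings; this is just a demonstration, not how enca works internally):

```python
# The same word in three incompatible single-byte Cyrillic encodings.
word = "привет"
for codec in ("koi8_r", "cp866", "cp1251"):
    print(f"{codec:8} {word.encode(codec).hex(' ')}")

# koi8_r   d0 d2 c9 d7 c5 d4
# cp866    af e0 a8 a2 a5 e2
# cp1251   ef f0 e8 e2 e5 f2
```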
Exactly. I used this technique at Mozilla in 2010 when processing Firefox add-ons, and it misidentified the encoding of scripts pretty frequently. There's far less weirdly encoded text in the wild than there are false positives from statistics-based approaches.
1: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...