If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding. But that should be a fallback, not the first thing you try.
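For illustration, a minimal sketch of that ordering in Python, assuming the third-party chardet package as the statistical detector (charset-normalizer works similarly):

```python
import chardet  # assumed third-party statistical detector

def decode_best_effort(raw: bytes) -> str:
    try:
        # Strict UTF-8 first: it rejects invalid byte sequences,
        # so a successful decode is a strong signal.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Only now fall back to (unreliable) statistical guessing.
        guess = chardet.detect(raw)
        return raw.decode(guess["encoding"] or "latin-1", errors="replace")
```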
> If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding.
You don't really even need to do that. There's only a handful of other encodings still in common use; just try each of them as fallbacks and see which one decodes without errors, and you'll handle the vast majority of what's not UTF-8.
(We recently did just that for a system that handles unreliable input; if I remember correctly, our fallback tries only 3 additional encodings before it gives up, and it's been working fine. A sketch of the pattern follows.)
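Something like this; note the specific fallback list here is my guess for illustration, not necessarily what the commenter's system uses:

```python
# Illustrative fallback cascade; the exact encodings are an assumption.
FALLBACKS = ["utf-8", "cp1252", "shift_jis"]

def decode_with_fallbacks(raw: bytes) -> str | None:
    for encoding in FALLBACKS:
        try:
            return raw.decode(encoding)  # strict: raises on invalid bytes
        except UnicodeDecodeError:
            continue
    return None  # give up
```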
The person you're replying to sort of addresses this, though not completely.
Since UTF-8 is a variable-length encoding, it somewhat naturally has some error detection built in. Fixed-length encodings don't really have that, and for some of them, any byte value, 0 to 255, in any position, is valid. (Some have a few byte values that are invalid or reserved, but the point still stands.)
So you could very easily pick a "next most common encoding" after UTF-8 fails, try it, find that it works (that is, no bytes are invalid in that encoding), but it turns out that's still not actually the correct encoding. The statistics-based approach will nearly always yield better results. Even a statistics-based approach that restricts you to a few possible encodings that you know are most likely will do better.
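A quick demonstration of why "decodes without errors" proves little for single-byte encodings: the byte 0xE4 is a valid character in both Latin-1 and KOI8-R, just a different one.

```python
data = b"\xe4"  # invalid as UTF-8 on its own

print(data.decode("latin-1"))  # 'ä' - decodes without error
print(data.decode("koi8_r"))   # 'Д' - also decodes without error
# Both "succeed", but at most one is what the author meant. Latin-1
# assigns a character to every byte 0-255, so it literally cannot fail.
```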
These are actually interpreted as the corresponding C1 control codes by Windows, so arguably not invalid in practice, just formally reserved to be reassigned to other characters in the future.
> Many charsets decode fine as UTF-8 as long as the message happens to fit in ASCII.
At which point the message is effectively ASCII. UTF-8 is a superset of ASCII, so "decoding" ASCII as UTF-8 is fine.
(Yes, I know there are some Japanese text encodings where 0x5c is decoded as "¥" instead of "\". But those characters are sometimes treated as backslashes even though they look like ¥ symbols, so handling them "correctly" is complicated.)
"Fun" fact: some video subtitle formats (ASS specifically) use "text{\b1}bold" to format things — but since they were primarily used to subtitle Japanese anime, this frequently became "text{¥b1}bold". Which is all good and well, except when those subtitles moved to UTF-8 they kept the ¥. So now you have to support ¥ (0xC2 0xA5) as a markup/control character in those subtitles.
I'm guessing they're thinking of Extended ASCII (the 8-bit family that's actually multiple different encodings, whose lower half is shared with 7-bit ASCII, so that part does fit in UTF-8, while the upper half likely won't if the message actually uses it).
The definition GP is using most likely refers to non-ASCII sequences that validly decode as UTF-8, because virtually every major charset in practical use has ASCII as a subset.
ASCII (or ISO646) has local variants [0] that replace certain characters, like $ with £, or \ with ¥. The latter is still in use in Japan. That’s why "ASCII" is sometimes clarified as "US-ASCII".
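For concreteness, the Japanese variant (the JIS X 0201 Roman set) differs from US-ASCII in exactly two code points; here as a small Python mapping for illustration:

```python
# JIS X 0201 Roman set vs. US-ASCII: only two positions differ.
JIS_X_0201_DIFFS = {
    0x5C: "¥",  # US-ASCII: \ (backslash)
    0x7E: "‾",  # US-ASCII: ~ (tilde); JIS uses an overline
}
```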
I'm skeptical. Any charset that uses bytes 128-255 as characters is unlikely to successfully decode as UTF-8. Are there really many other charsets that only use 0-127, or does most text just end up only using 0-127?
I think there are a bunch of encodings that just repurposed a few ASCII characters as different characters - someone on this page was giving the example of some Swedish encoding where {}| were replaced with three accented Swedish letters. There are probably a bunch of others. In those cases, the text will decode fine as UTF-8, but it will display the wrong thing.
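That would be ISO-646-SE, where 0x7B-0x7D ({|}) stand for äöå. A small illustration of the "decodes fine, displays wrong" failure mode; Python has no built-in codec for it, so the mapping below is written out by hand:

```python
# ISO-646-SE reassigns some ASCII punctuation to Swedish letters.
SE_MAP = str.maketrans({"[": "Ä", "\\": "Ö", "]": "Å",
                        "{": "ä", "|": "ö", "}": "å"})

raw = b"h|r h{r"               # ISO-646-SE bytes for "hör här"
text = raw.decode("utf-8")     # decodes without error (it's all 7-bit)...
print(text)                    # ...but prints 'h|r h{r' - wrong glyphs
print(text.translate(SE_MAP))  # 'hör här' once the variant is applied
```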
I meant that some messages will decode fine as UTF-8, but other messages may contain letters that don't fit in 7 bits. So simple testing, especially with English words, will show it working fine. But as soon as a non-7-bit character creeps in, it stops working.
It's garbage anyway, which you can (unreliably) guess from there being a Korean character in the middle of Chinese/Japanese kanji. (Kanji are not completely gone from Korean, but mostly.)
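A hedged sketch of that sanity check; the function and character ranges are mine, and as noted it's only a guess, not a reliable test:

```python
def hangul_mixed_with_kanji(s: str) -> bool:
    # Hypothetical heuristic: text containing both Hangul syllables
    # (U+AC00-U+D7A3) and CJK ideographs (U+4E00-U+9FFF) is often
    # mojibake, since modern Korean text rarely mixes in many kanji.
    has_hangul = any("\uac00" <= c <= "\ud7a3" for c in s)
    has_kanji = any("\u4e00" <= c <= "\u9fff" for c in s)
    return has_hangul and has_kanji
```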