
I'm just gonna assume UTF-8


I'm disappointed that the article doesn't discuss this in more detail. Most byte sequences are not valid UTF-8. If you can decode a message as UTF-8 with no errors, that is almost certainly the correct encoding to use; it's extremely unlikely that text in some other encoding just happened to be perfectly valid UTF-8. (The converse is not true; most 8-bit text encodings will happily decode UTF-8 byte sequences into nonsense strings like "ðŸš©".)

If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding. But that should be a fallback, not the first thing you try.
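
In Python terms the whole strategy is only a few lines (a rough sketch; chardet here just stands in for whichever statistical detector you prefer):

  # Sketch: trust a clean strict UTF-8 decode; only guess if that fails.
  import chardet  # stand-in for any statistical detector

  def decode_bytes(data: bytes) -> str:
      try:
          return data.decode("utf-8")  # strict by default
      except UnicodeDecodeError:
          guess = chardet.detect(data)["encoding"] or "latin-1"
          return data.decode(guess, errors="replace")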


> If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding.

Don't really even need to do that. There's only a handful of other encodings still in common use; just try each of them as a fallback and see which one decodes without errors, and you'll handle the vast majority of what's not UTF-8.

(We recently did just that for a system that handles unreliable input; if I remember right, our fallback tries only 3 additional encodings before giving up, and it's been working fine.)
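
The whole chain fits in a few lines of Python anyway (a sketch; the fallback list below is illustrative, not the one we actually shipped):

  # Sketch: try UTF-8 first, then a short list of likely legacy encodings.
  FALLBACK_ENCODINGS = ["utf-8", "cp1252", "shift_jis", "koi8-r"]  # illustrative

  def decode_lenient(data: bytes) -> str:
      for enc in FALLBACK_ENCODINGS:
          try:
              return data.decode(enc)
          except UnicodeDecodeError:
              continue
      return data.decode("latin-1")  # last resort: every byte is valid in latin-1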


The person you're replying to sort of addresses this, though not completely.

Since UTF-8 is a variable-length encoding, it somewhat naturally has some error detection built in. Fixed-length encodings don't really have that, and for some of them, any byte value, 0 to 255, in any position, is valid. (Some have a few byte values that are invalid or reserved, but the point still stands.)

So you could very easily pick a "next most common encoding" after UTF-8 fails, try it, find that it works (that is, no bytes are invalid in that encoding), but it turns out that's still not actually the correct encoding. The statistics-based approach will nearly always yield better results. Even a statistics-based approach that restricts you to a few possible encodings that you know are most likely will do better.
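
Concretely (Python; cp1251 vs latin-1 picked arbitrarily):

  # The same bytes "decode" without error in two single-byte encodings,
  # so the absence of a decode error proves nothing on its own.
  data = "привет".encode("cp1251")   # Russian "hello" in Windows-1251
  print(data.decode("cp1251"))       # привет  (correct)
  print(data.decode("latin-1"))      # ïðèâåò  (no error, but garbage)
  data.decode("utf-8")               # raises UnicodeDecodeError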


Unfortunately Windows code page 1252 has no invalid bytes, so it will always succeed. You'd better try that one last.


0x81, 0x8D, 0x8F, 0x90, and 0x9D are invalid.


These are actually interpreted as the corresponding C1 control codes by Windows, so arguably not invalid in practice, just formally reserved to be reassigned to other characters in the future.
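
For what it's worth, you can see both readings from Python: the strict cp1252 codec treats those bytes as errors, while latin-1 gives you the C1-control reading that (per the above) Windows uses in practice:

  b"\x81".decode("cp1252")   # UnicodeDecodeError: Python's cp1252 is strict
  b"\x81".decode("latin-1")  # '\x81', i.e. the C1 control code U+0081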


Not extremely unlikely. Many charsets decode fine as UTF-8 as long as the message happens to fit in ASCII.


> Many charsets decode fine as UTF-8 as long as the message happens to fit in ASCII.

At which point the message is effectively ASCII. UTF-8 is a superset of ASCII, so "decoding" ASCII as UTF-8 is fine.

(Yes, I know there are some Japanese text encodings where the byte 0x5C is displayed as "¥" instead of "\". But those bytes are often still treated as backslashes even though they look like ¥ signs, so handling them "correctly" is complicated.)


"Fun" fact: some video subtitle formats (ASS specifically) use "text{\b1}bold" to format things — but since they were primarily used to subtitle Japanese anime, this frequently became "text{¥b1}bold". Which is all good and well, except when those subtitles moved to UTF-8 they kept the ¥. So now you have to support ¥ (0xC2 0xA5) as a markup/control character in those subtitles.


I'm guessing they're thinking of "extended ASCII" (really a family of different 8-bit encodings whose lower half is shared with 7-bit ASCII; that lower half does fit in UTF-8, while the upper half likely won't decode as UTF-8 if the message actually uses it).


Yes. And some test messages will decode fine, and then suddenly one won't.


ISO-2022-JP (sometimes?) disguises itself perfectly as ASCII:

  $ echo は | iconv -t ISO-2022-JP | hd
  00000000  1b 24 42 24 4f 1b 28 42  0a                       |.$B$O.(B.|
  00000009
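
(1b 24 42 is ESC $ B, which switches to JIS X 0208; 24 4f is は in that set; 1b 28 42 is ESC ( B, which switches back to ASCII. Every byte stays below 0x80, so a naive "is this plain ASCII/UTF-8?" check passes.)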


The definition GP is using most likely refers to non-ASCII sequences that validly decode as UTF-8, because virtually every major charset in practical use has ASCII as a subset.


ASCII (or ISO646) has local variants [0] that replace certain characters, like $ with £, or \ with ¥. The latter is still in use in Japan. That’s why "ASCII" is sometimes clarified as "US-ASCII".

[0] https://en.wikipedia.org/wiki/ASCII#7-bit_codes


I'm skeptical. Any charset that uses bytes 128-255 for characters is unlikely to decode successfully as UTF-8. Are there really many other charsets that only use 0-127, or is it just that most text ends up only using 0-127?


such encodings are also UTF-8 then, are they not?


I think there are a bunch of encodings that just repurposed a few ASCII code points as different characters; someone on this page gave the example of a Swedish encoding where {, } and | were replaced with three accented Swedish letters. There are probably a bunch of others. In those cases the text will decode fine as UTF-8, but it will display the wrong thing.


A distinction without a difference?


I meant: some messages will decode fine as UTF-8, but other messages may contain letters which don't fit in 7 bits. So some simple testing, especially with English words, will appear to work fine. But as soon as a non-7-bit character creeps in, it stops working.


榥\ue0af侬펭懃䒥亷


And a good day to you too, my friend whose input I'm going to discard


It's garbage anyway, which you can (unreliably) guess from there being a Korean character in the middle of Chinese/Japanese kanji. (Kanji are not completely gone from Korean, but mostly.)


It's mojibake for "probably a bad idea" [in Chinese]


You can’t just do that! /s



