
I'm just gonna assume UTF-8


I'm disappointed that the article doesn't discuss this in more detail. Most byte sequences are not valid UTF-8. If you can decode a message as UTF-8 with no errors, that is almost certainly the correct encoding to use; it's extremely unlikely that text in some other encoding just happened to be perfectly valid UTF-8. (The converse is not true; most 8-bit text encodings will happily decode UTF-8 byte sequences into nonsense strings like "ðŸš©".)

If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding. But that should be a fallback, not the first thing you try.
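
In Python terms the whole strategy is only a few lines (a rough sketch; chardet here just stands in for whichever statistical detector you prefer):

  # Sketch: trust a clean strict UTF-8 decode; only guess if that fails.
  import chardet  # stand-in for any statistical detector

  def decode_bytes(data: bytes) -> str:
      try:
          return data.decode("utf-8")  # strict by default
      except UnicodeDecodeError:
          guess = chardet.detect(data)["encoding"] or "latin-1"
          return data.decode(guess, errors="replace")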


> If UTF-8 decoding fails, then it's time to pull out the fancy statistical tools to (unreliably) guess an encoding.

Don't really even need to do that. There's only a handful of other encodings still in common use; just try each of them as a fallback and see which one decodes without errors, and you'll handle the vast majority of what's not UTF-8.

(We recently did just that for a system that handles unreliable input; if I remember right, our fallback tries only 3 additional encodings before giving up, and it's been working fine.)
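
The whole chain fits in a few lines of Python anyway (a sketch; the fallback list below is illustrative, not the one we actually shipped):

  # Sketch: try UTF-8 first, then a short list of likely legacy encodings.
  FALLBACK_ENCODINGS = ["utf-8", "cp1252", "shift_jis", "koi8-r"]  # illustrative

  def decode_lenient(data: bytes) -> str:
      for enc in FALLBACK_ENCODINGS:
          try:
              return data.decode(enc)
          except UnicodeDecodeError:
              continue
      return data.decode("latin-1")  # last resort: every byte is valid in latin-1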


The person you're replying to sort of addresses this, though not completely.

Since UTF-8 is a variable-length encoding, it somewhat naturally has some error detection built in. Fixed-length encodings don't really have that, and for some of them, any byte value, 0 to 255, in any position, is valid. (Some have a few byte values that are invalid or reserved, but the point still stands.)

So you could very easily pick a "next most common encoding" after UTF-8 fails, try it, find that it works (that is, no bytes are invalid in that encoding), but it turns out that's still not actually the correct encoding. The statistics-based approach will nearly always yield better results. Even a statistics-based approach that restricts you to a few possible encodings that you know are most likely will do better.
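
Concretely (Python; cp1251 vs latin-1 picked arbitrarily):

  # The same bytes "decode" without error in two single-byte encodings,
  # so the absence of a decode error proves nothing on its own.
  data = "привет".encode("cp1251")   # Russian "hello" in Windows-1251
  print(data.decode("cp1251"))       # привет  (correct)
  print(data.decode("latin-1"))      # ïðèâåò  (no error, but garbage)
  data.decode("utf-8")               # raises UnicodeDecodeError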


Unfortunately Windows code page 1252 has no invalid bytes, so it will always succeed. You'd better try that one last.


0x81, 0x8D, 0x8F, 0x90, and 0x9D are invalid.


These are actually interpreted as the corresponding C1 control codes by Windows, so arguably not invalid in practice, just formally reserved to be reassigned to other characters in the future.
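
For what it's worth, you can see both readings from Python: the strict cp1252 codec treats those bytes as errors, while latin-1 gives you the C1-control reading that (per the above) Windows uses in practice:

  b"\x81".decode("cp1252")   # UnicodeDecodeError: Python's cp1252 is strict
  b"\x81".decode("latin-1")  # '\x81', i.e. the C1 control code U+0081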


Not extremely unlikely. Many charsets decode fine as UTF-8 as long as the message happens to fit in ASCII.


> Many charsets decode fine as UTF-8 as long as the message happens to fit in ASCII.

At which point the message is effectively ASCII. UTF-8 is a superset of ASCII, so "decoding" ASCII as UTF-8 is fine.

(Yes, I know there are some Japanese text encodings where the byte 0x5C is displayed as "¥" instead of "\". But those bytes are often still treated as backslashes even though they look like ¥ signs, so handling them "correctly" is complicated.)


"Fun" fact: some video subtitle formats (ASS specifically) use "text{\b1}bold" to format things — but since they were primarily used to subtitle Japanese anime, this frequently became "text{¥b1}bold". Which is all good and well, except when those subtitles moved to UTF-8 they kept the ¥. So now you have to support ¥ (0xC2 0xA5) as a markup/control character in those subtitles.


I'm guessing they're thinking of "extended ASCII" (really a family of different 8-bit encodings whose lower half is shared with 7-bit ASCII; that lower half does fit in UTF-8, while the upper half likely won't decode as UTF-8 if the message actually uses it).


Yes. And some test messages will decode fine, and then suddenly one won't.


ISO-2022-JP (sometimes?) disguises itself perfectly as ASCII:

  $ echo は | iconv -t ISO-2022-JP | hd
  00000000  1b 24 42 24 4f 1b 28 42  0a                       |.$B$O.(B.|
  00000009
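
(1b 24 42 is ESC $ B, which switches to JIS X 0208; 24 4f is は in that set; 1b 28 42 is ESC ( B, which switches back to ASCII. Every byte stays below 0x80, so a naive "is this plain ASCII/UTF-8?" check passes.)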


The definition GP is using most likely refers to non-ASCII sequences that validly decode as UTF-8, because virtually every major charset in practical use has ASCII as a subset.


ASCII (or ISO646) has local variants [0] that replace certain characters, like $ with £, or \ with ¥. The latter is still in use in Japan. That’s why "ASCII" is sometimes clarified as "US-ASCII".

[0] https://en.wikipedia.org/wiki/ASCII#7-bit_codes


I'm skeptical. Any charset that uses bytes 128-255 for characters is unlikely to decode successfully as UTF-8. Are there really many other charsets that only use 0-127, or is it just that most text ends up only using 0-127?


such encodings are also UTF-8 then, are they not?


I think there are a bunch of encodings that just repurposed a few ASCII code points as different characters; someone on this page gave the example of a Swedish encoding where {, } and | were replaced with three accented Swedish letters. There are probably a bunch of others. In those cases the text will decode fine as UTF-8, but it will display the wrong thing.


A distinction without a difference?


I meant: some messages will decode fine as UTF-8, but other messages may contain letters which don't fit in 7 bits. So some simple testing, especially with English words, will appear to work fine. But as soon as a non-7-bit character creeps in, it stops working.


榥\ue0af侬펭懃䒥亷


And a good day to you too, my friend whose input I'm going to discard


It's garbage anyway, which you can (unreliably) guess from there being a Korean character in the middle of Chinese/Japanese kanji. (Kanji are not completely gone from Korean, but mostly.)


It's mojibake for "probably a bad idea" [in Chinese]


You can’t just do that! /s



