
Shift-JIS specifically cannot be reliably detected: it can pass for valid UTF-8, yet make no sense once you look at the result of decoding it that way.


I don't think that's true. Looking at how it's encoded[0], it seems similar to many other country/language-specific encodings: bytes 0-127 are the control chars, Latin alphabet, and symbols, and are more-or-less ASCII, while bytes 128-255 represent characters specific to the language at hand.

The only way you'd successfully decode Shift-JIS as UTF-8 is if it is essentially just Latin-alphabet text (though the yen symbol would incorrectly display as a '\'). If it includes any non-trivial amount of Japanese, it'll fail to decode as UTF-8.
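
A quick way to check this, sketched in Python with the standard `shift_jis` and `utf-8` codecs. The helper name `decodes_as_utf8` and the sample strings are only illustrative, not from any particular library:

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 under strict decoding."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Pure ASCII text is byte-identical in both encodings, so it passes.
ascii_sjis = "Hello, world".encode("shift_jis")

# Hiragana are double-byte in Shift-JIS with lead byte 0x82, which is a
# UTF-8 continuation byte and can never start a UTF-8 sequence.
japanese_sjis = "こんにちは".encode("shift_jis")

print(decodes_as_utf8(ascii_sjis))     # ASCII-only input
print(decodes_as_utf8(japanese_sjis))  # ordinary Japanese input
```

This only covers one ordinary hiragana sample; it says nothing about text that is mostly half-width katakana.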

As for whether or not you can then (after it fails to decode as UTF-8) use statistical analysis to reliably figure out that it's in fact Shift-JIS, and not something else, I can't speak to that.

[0] https://en.wikipedia.org/wiki/Shift_JIS#Shift_JIS_byte_map


Do you have an example in mind? Looking at the Shift-JIS encoding tables, that seems unlikely to happen in a text of any nontrivial length; there's a small number of Shift-JIS sequences which would be valid as UTF-8, and any meaningfully long text is likely to stray outside that set.


I don't think it's fair to require "meaningfully long text", since the strings you deal with in programming can be of any arbitrary length.


Encoding detection is usually applied to a larger document, at the point it's ingested into an application. If you're applying it to short strings, something's not right -- where did those strings come from?


Take ID3 tags as an example. If you are mass-converting or sanitizing tag titles and similar metadata, the strings are often very short, sometimes only a single code point or character. You cannot rely on assumptions about the encoding, because so many people violate the specs and put whatever they want in there, which is the whole point of wanting to sanitize the info in the first place.
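
To make the short-string problem concrete, here is a minimal sketch that lists every encoding under which a byte string decodes without error. The function name `candidate_decodings`, the list of encodings, and the sample title bytes are assumptions chosen for illustration; a real tag sanitizer would need more than this:

```python
def candidate_decodings(data: bytes,
                        encodings=("utf-8", "shift_jis", "latin-1")) -> dict:
    """Map each encoding that strictly decodes `data` to the decoded text."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # not valid in this encoding, so it is not a candidate
    return results


# A hypothetical short tag title: "Caf" followed by the bytes C3 A9.
title = b"Caf\xc3\xa9"
for name, text in candidate_decodings(title).items():
    print(f"{name}: {text!r}")
```

All three encodings accept these bytes, so strict decoding alone cannot pick one; you would need context or statistics, and a string this short offers very little of either.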


Even then, I think it's likely that your average short byte sequence that's valid Shift-JIS would still not be valid UTF-8.


I disagree... the UTF-8 encodings of about half of the Latin-1 Supplement range are byte pairs that are also valid half-width katakana in Shift-JIS.
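
The size of that overlap can be counted directly. The sketch below assumes Python's `shift_jis` codec and the usual single-byte half-width katakana range 0xA1-0xDF; the variable names are illustrative:

```python
# Single-byte half-width katakana (and related punctuation) in Shift-JIS.
HALFWIDTH = range(0xA1, 0xE0)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8 under strict decoding."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Every two-byte string made of half-width katakana bytes.
pairs = [bytes([a, b]) for a in HALFWIDTH for b in HALFWIDTH]

# Those that are also a valid two-byte UTF-8 sequence.
ambiguous = [p for p in pairs if is_valid_utf8(p)]

# Of those, the ones that decode to a Latin-1 Supplement character.
latin1_supplement = [p for p in ambiguous
                     if 0x80 <= ord(p.decode("utf-8")) <= 0xFF]

print(len(pairs), len(ambiguous), len(latin1_supplement))

sample = b"\xc3\xa9"
print(sample.decode("shift_jis"), sample.decode("utf-8"))
```

By this count, 930 of the 3969 possible half-width katakana pairs are also valid UTF-8, and 62 of them land in the 128-code-point Latin-1 Supplement block. The rest of the 930 decode to code points up to U+07FF.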



