But any ideas on where the corpus for telling encodings apart should actually come from?
If you want a lot of fairly natural Japanese, try Wikipedia. If you want something a bit more comprehensive and less restricted, the "Balanced Corpus of Contemporary Written Japanese" is a thing that exists, but you might have to jump through a few hoops.
Thanks! I saw your retweet about it. Let me warn you that I haven't solved the CJK encoding problem, so Japanese developers are only going to benefit if their mojibake involves UTF-8.
The corpus I'm looking for needs to be messy and informal, unlike Wikipedia, as the ambiguous cases tend to be crazy emoticons. I guess I can't hope for more than Twitter with artificially increased mojibake.
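For what it's worth, "artificially increased mojibake" could be sketched roughly like this: take clean text, encode it as UTF-8, and decode the bytes with the wrong single-byte encoding. This is only a minimal illustration, not how any particular library does it. The function names `make_mojibake` and `undo_mojibake` are made up for this example, and Latin-1 is assumed as the wrong encoding because it maps every byte to a character, so the round trip never fails.

```python
def make_mojibake(text: str, wrong_encoding: str = "latin-1") -> str:
    """Simulate mojibake: UTF-8 bytes misread as a single-byte encoding.

    Hypothetical helper for illustration. Latin-1 is assumed by default
    because it can decode any byte value.
    """
    return text.encode("utf-8").decode(wrong_encoding)


def undo_mojibake(garbled: str, wrong_encoding: str = "latin-1") -> str:
    """Reverse the damage, assuming we know which wrong encoding was used."""
    return garbled.encode(wrong_encoding).decode("utf-8")


sample = "日本語 (╯°□°)╯"
garbled = make_mojibake(sample)

# The garbled form is longer, since each UTF-8 byte becomes one character.
print(len(sample), len(garbled))
# Reversing with the right guess recovers the original text.
print(undo_mojibake(garbled) == sample)
```

Running something like this over a pile of tweets would give labeled pairs of clean and damaged text, though the hard part (telling real mojibake from emoticons that merely look like it) still needs the messy corpus.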
That project is awesome, by the way.