*But, any ideas where the corpus for telling apart encodings should actually com...

rspeer · on Dec 8, 2015

Thanks! I saw your retweet about it. Let me warn you that I haven't solved the CJK encoding problem, so Japanese developers are only going to benefit if their mojibake involves UTF-8.

The corpus I'm looking for needs to be messy and informal, unlike Wikipedia, because the ambiguous cases tend to be crazy emoticons. I guess the best I can hope for is Twitter text with artificially increased mojibake.
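
To make "artificially increased mojibake" concrete, here is a rough sketch of one way it could be done: take clean text, encode it as UTF-8, and decode the bytes as Latin-1 for some fraction of the lines. The names `make_mojibake` and `corrupt_sample`, the `rate` parameter, and the choice of Latin-1 as the wrong encoding are all assumptions for illustration, not anything the library does.

```python
import random


def make_mojibake(text: str, wrong_encoding: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong encoding.

    Latin-1 is used by default because it maps every byte to a character,
    so the damage is reversible. (Hypothetical helper, for illustration.)
    """
    return text.encode("utf-8").decode(wrong_encoding)


def corrupt_sample(lines, rate=0.1, seed=0):
    """Return a copy of lines with roughly `rate` of them turned into mojibake.

    `rate` and `seed` are assumed parameters; a fixed seed keeps the
    synthetic corpus reproducible.
    """
    rng = random.Random(seed)
    return [make_mojibake(line) if rng.random() < rate else line for line in lines]


if __name__ == "__main__":
    tweets = ["café ☕", "(╯°□°)╯︵ ┻━┻", "plain ascii"]
    for line in corrupt_sample(tweets, rate=1.0):
        print(repr(line))
```

Pure ASCII lines come out unchanged, so only lines with non-ASCII characters (like the emoticons) would actually contribute garbled examples.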