Hacker News

But any ideas where the corpus for telling encodings apart should actually come from?

If you want a lot of fairly natural Japanese, try Wikipedia. If you want something a bit more comprehensive and less restricted, the "Balanced Corpus of Contemporary Written Japanese" is a thing that exists, but you might have to jump through a few hoops.

That project is awesome, by the way.



Thanks! I saw your retweet about it. Let me warn you that I haven't solved the CJK encoding problem, so Japanese developers are only going to benefit if their mojibake involves UTF-8.

The corpus I'm looking for needs to be messy and informal, unlike Wikipedia, as the ambiguous cases tend to be crazy emoticons. I guess I can't hope for more than Twitter with artificially increased mojibake.
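For what "artificially increased mojibake" could look like in practice, here is a minimal sketch, not the project's actual method. It assumes the common failure mode of UTF-8 bytes being decoded as a single-byte encoding; the names `CODEC_PAIRS`, `make_mojibake`, and `corrupt_corpus` are hypothetical and chosen only for illustration.

```python
import random

# Assumed (encode_as, decode_as) pairs: text written as UTF-8 but read back
# as a single-byte codec. This list is illustrative, not exhaustive.
CODEC_PAIRS = [
    ("utf-8", "latin-1"),
    ("utf-8", "cp1252"),
]


def make_mojibake(text, rng):
    """Encode text with one codec and decode it with another."""
    encode_as, decode_as = rng.choice(CODEC_PAIRS)
    raw = text.encode(encode_as)
    # cp1252 leaves a few byte values undefined, so substitute U+FFFD
    # for those instead of raising UnicodeDecodeError.
    return raw.decode(decode_as, errors="replace")


def corrupt_corpus(lines, rate, seed=0):
    """Return (possibly_garbled, original) pairs, garbling about `rate` of lines."""
    rng = random.Random(seed)
    pairs = []
    for line in lines:
        if rng.random() < rate:
            pairs.append((make_mojibake(line, rng), line))
        else:
            pairs.append((line, line))
    return pairs
```

Keeping the original next to each garbled line gives labeled pairs, which is what you would want for checking whether a fixer restores the right text. Pure-ASCII lines pass through unchanged, so the effective corruption rate on real tweets would be lower than `rate`.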



