
> This problem is worth thinking about (which is why a customer asked a team of CS researchers and linguists to think about it, including one barely competent programmer who nonetheless had reasonably good intuitions for what character distributions looked like), but it turns out to be much, much less hard than many people originally expected.

There are a couple related ideas that might make this more obvious in hindsight:

- You can determine the language of a substitution-ciphered text just from its frequency distribution.

- Imagine getting a page each of text in several different languages, say English, French, Portuguese, Polish, and Turkish. "Normalize" everything to ASCII characters, so laïcité turns into laicite. It will be trivially easy, by looking at any page in isolation, to determine which language is represented there (see the sketch after this list).
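
To make the idea concrete, here is a minimal sketch of frequency-based language identification along these lines. It is not anyone's production code, just an illustration: `guess_language`, the reference-text dictionary, and the sample-page variable names are all hypothetical, and you would supply your own page or so of text per language. For the substitution-cipher case, the key observation is that a monoalphabetic substitution only relabels symbols, so the *sorted* frequency profile is unchanged.

  from collections import Counter
  import math
  import string
  import unicodedata

  def normalize_ascii(text):
      # Strip diacritics so e.g. laïcité -> laicite, as in the example above.
      decomposed = unicodedata.normalize("NFKD", text)
      return "".join(c for c in decomposed if not unicodedata.combining(c))

  def char_profile(text):
      # Relative frequency of each letter a-z, ignoring everything else.
      letters = [c for c in normalize_ascii(text).lower() if c in string.ascii_lowercase]
      counts = Counter(letters)
      total = len(letters) or 1
      return [counts.get(c, 0) / total for c in string.ascii_lowercase]

  def cosine(p, q):
      dot = sum(a * b for a, b in zip(p, q))
      norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
      return dot / norm if norm else 0.0

  def guess_language(text, reference_texts, ciphered=False):
      # reference_texts: dict mapping a language name to a page or so of sample text.
      # If the input is under a monoalphabetic substitution cipher, the letters are
      # relabelled but the multiset of frequencies is not, so compare sorted profiles.
      key = (lambda t: sorted(char_profile(t), reverse=True)) if ciphered else char_profile
      target = key(text)
      scores = {lang: cosine(target, key(sample)) for lang, sample in reference_texts.items()}
      return max(scores, key=scores.get)

  # Hypothetical usage -- english_page, french_page, etc. are sample texts you supply:
  # guess_language(unknown_page, {"en": english_page, "fr": french_page, "pl": polish_page})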



Trivially easy for a human, maybe. In most cases it's pretty straightforward, but there are a few notoriously tricky language pairs that any automated solution is going to have trouble with: most notably Norwegian and Danish (and Swedish to a lesser extent), and Czech and Slovak. There are also some tricky cases in the Iberian area, where some dialects of Spanish are quite similar to Portuguese, and in the Balkans Croatian, Serbian and Bosnian are basically identical as well, although in some cases they can be distinguished based on the writing system used.
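
For the writing-system case, a rough sketch of what that distinction could look like (the function name is made up for illustration): Serbian is often written in Cyrillic, while Croatian and Bosnian use the Latin alphabet, so simply counting which script dominates separates those texts even though the languages themselves are nearly identical.

  import unicodedata

  def dominant_script(text):
      # Classify by which script's letters dominate, using Unicode character names
      # (e.g. "CYRILLIC SMALL LETTER A" vs "LATIN SMALL LETTER A").
      cyrillic = sum(1 for c in text if "CYRILLIC" in unicodedata.name(c, ""))
      latin = sum(1 for c in text if "LATIN" in unicodedata.name(c, ""))
      return "cyrillic" if cyrillic > latin else "latin"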



