
Probably just a big hashtable mapping word -> the number of times it's been seen, plus a hashset of all the words it hasn't seen yet. When a post comes in, you look up each of its words in the hashtable and increment the count; if the old count was 0, you also remove the word from the hashset.
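
Here's a rough sketch of that idea in Python, just to make it concrete. The class name `UnseenWordTracker`, the method `process_post`, and the regex tokenizer are my own assumptions for illustration; I have no idea how the real thing splits posts into words or stores its dictionary.

```python
import re


class UnseenWordTracker:
    """Hypothetical sketch: a count table plus a set of still-unseen words."""

    def __init__(self, vocabulary):
        # Every dictionary word starts with a count of 0 ...
        self.counts = {word.lower(): 0 for word in vocabulary}
        # ... and is therefore also in the "not seen yet" set.
        self.unseen = set(self.counts)

    def process_post(self, text):
        """Count the words in one post; return words seen for the first time."""
        newly_seen = []
        # Assumed tokenizer: lowercase runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in self.counts:
                continue  # not a dictionary word, ignore it
            if self.counts[word] == 0:
                # Old value was 0, so drop it from the unseen set.
                self.unseen.discard(word)
                newly_seen.append(word)
            self.counts[word] += 1
        return newly_seen


tracker = UnseenWordTracker(["apple", "banana", "cherry"])
first_time = tracker.process_post("Apple apple pie")
```

Both lookups are O(1) on average, so the cost per post is linear in the number of words in it.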

250k words at a generous 100 bytes per word is only 25MB of memory...


