I'm surprised at how normal some of the unseen words are. I expected them to all be archaic or niche, but many are pretty reasonable: 'congregant', 'definer', 'stereoscope'.
It's likely that the commenter has read less than 5 million posts' worth of text, though. So perhaps this still points to a lack of diversity in content.
You got me wondering. Supposing the average post is 10 words and a typical page of text is 250 words, that would only be ~50 pages of text a day over the last 10 years, which I don't think I manage - but over 20 years I'm probably in that window.
I noticed one of the cited Bluesky posts was entirely in French, so one might argue that technically it didn't find the English word "mouch", but rather a different French word that happens to be spelled the same. But trying to sort that out seems unrealistically challenging. "Mouch" is only in the dictionary as an alternative spelling of "mooch", so it's probably a pretty rare word to see in English anyway.
Bluesky lets you select the language your post is written in before posting, and it's attached as metadata to the skeet. I'd guess the backend only searches posts marked as English, but the dataset may not be 100% accurate, since some users forget to switch the language before posting.
I'm very curious as to how this works in the backend. I realize it uses Bluesky's firehose to get the posts, but I'm more curious on how it's checking whether a post contains any of the available words. Any guesses?
Hey! This is my site - it's not all that complex. I'm just using a SQLite db with two tables: one for stats, the other for all the words, which is just word | count | first use | last use | post.
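The author only names the columns, so here's a minimal sketch of what that two-table layout might look like - the table names, column types, and upsert logic are my assumptions, not the site's actual schema:

```python
import sqlite3

# Hypothetical reconstruction of the described schema: a stats table plus a
# words table with word | count | first use | last use | post.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stats (
    key   TEXT PRIMARY KEY,
    value INTEGER
);
CREATE TABLE words (
    word      TEXT PRIMARY KEY,
    count     INTEGER NOT NULL DEFAULT 0,
    first_use TEXT,   -- timestamp of the first sighting
    last_use  TEXT,   -- timestamp of the most recent sighting
    post      TEXT    -- the post that first used the word
);
""")

def record_sighting(word, timestamp, post):
    # Upsert: insert the word on first sighting, otherwise bump the count
    # and refresh last_use, keeping first_use and the original post.
    conn.execute("""
        INSERT INTO words (word, count, first_use, last_use, post)
        VALUES (?, 1, ?, ?, ?)
        ON CONFLICT(word) DO UPDATE SET
            count = count + 1,
            last_use = excluded.last_use
    """, (word, timestamp, timestamp, post))

record_sighting("congregant", "2025-01-01T00:00:00Z", "example post")
record_sighting("congregant", "2025-01-02T00:00:00Z", "another post")
row = conn.execute(
    "SELECT count, first_use FROM words WHERE word = ?",
    ("congregant",),
).fetchone()
print(row)  # (2, '2025-01-01T00:00:00Z')
```

The `ON CONFLICT ... DO UPDATE` form keeps the whole sighting update a single statement, which matters when you're applying it once per word per post.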
Using this - a combo of "covered enough" for the bit and easy to use.
Also, since I'm tracking every word (technically a better name for this project would be The Bluesky Corpus), all inflected forms count as different words, which aligns with my thinking.
You can probably fit all words under 10-15MB of memory, but memory optimisations are not even needed for 250k words...
Trie data structures are memory-efficient for storing such dictionaries (2-4x better than hashmaps), although not as fast for lookups. You could hash the top 1k most common words and check the rest using a trie.
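A toy sketch of that hybrid lookup: a plain set for the most common words, a trie for the long tail. The word lists here are stand-ins, and note that a dict-of-dicts trie in Python won't actually realize the memory savings - compact real tries use arrays or a DAWG - but it shows the lookup structure:

```python
# Hybrid lookup: check a small "hot" set first, fall back to a trie.
END = object()  # sentinel key marking end-of-word inside a trie node

def trie_insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def trie_contains(root, word):
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

common = {"the", "and", "right"}   # stand-in for the top-1k common words
trie = {}
for rare in ["congregant", "stereoscope", "mouch"]:
    trie_insert(trie, rare)

def in_dictionary(word):
    return word in common or trie_contains(trie, word)

print(in_dictionary("mouch"))    # True
print(in_dictionary("zyzzyva"))  # False
```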
The most CPU-intensive task here is tokenizing the text, but there are a ton of optimized options developed by the orgs working on LLMs.
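One caveat worth noting: LLM tokenizers emit subword pieces rather than dictionary words, so for matching against a wordlist, a simple regex tokenizer is probably closer to what's needed. A sketch, with the pattern being my own guess at reasonable word boundaries:

```python
import re

# Match runs of letters, optionally with one internal apostrophe
# (so "don't" stays one token). Lowercase for dictionary lookup.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?", re.IGNORECASE)

def tokenize(text):
    return [m.group(0).lower() for m in WORD_RE.finditer(text)]

print(tokenize("We just visited Wheal Martyn; nice scones!"))
# ['we', 'just', 'visited', 'wheal', 'martyn', 'nice', 'scones']
```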
I very much hope the backend uses one of Bluesky's Jetstream endpoints.
When you only subscribe to new posts, it provides a stream of around 20 Mbit/s last time I checked, while the full firehose was ~200 Mbit/s.
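For the curious, consuming Jetstream looks roughly like the sketch below. The endpoint URL and the event field names (`kind`, `commit`, `record`, `langs`) match Jetstream's JSON framing as I recall it, but treat them as assumptions to verify against the Jetstream docs:

```python
import json

# Assumed Jetstream endpoint, filtered to post records only.
JETSTREAM_URL = (
    "wss://jetstream2.us-east.bsky.network/subscribe"
    "?wantedCollections=app.bsky.feed.post"
)

def extract_text(raw):
    """Return the post text for newly created English posts, else None."""
    event = json.loads(raw)
    if event.get("kind") != "commit":
        return None
    commit = event.get("commit", {})
    if commit.get("operation") != "create":
        return None
    record = commit.get("record", {})
    if "en" not in record.get("langs", []):
        return None
    return record.get("text")

async def consume():
    # Requires the third-party `websockets` package; never called here.
    import websockets
    async with websockets.connect(JETSTREAM_URL) as ws:
        async for raw in ws:
            text = extract_text(raw)
            if text:
                print(text)

# Demo on a hand-built event rather than a live connection.
sample = json.dumps({
    "kind": "commit",
    "commit": {
        "operation": "create",
        "collection": "app.bsky.feed.post",
        "record": {"text": "hello world", "langs": ["en"]},
    },
})
print(extract_text(sample))  # hello world
```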
Probably just a big hashtable mapping word -> the number of times it's been seen, plus a hashset of all the words it hasn't seen yet. When a post comes in, you hash each word, look it up in the hashtable, increment the count, and if the old value was 0, remove the word from the hashset.
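That scheme fits in a few lines. A sketch, with a toy three-word dictionary standing in for the real ~250k-word list:

```python
# Counter for every dictionary word, plus a set of words not yet seen.
unseen = {"congregant", "definer", "stereoscope"}
counts = {w: 0 for w in unseen}

def process_post(text):
    for word in text.lower().split():
        if word in counts:
            if counts[word] == 0:
                unseen.discard(word)  # first sighting: no longer unseen
            counts[word] += 1

process_post("the congregant arrived early")
print(counts["congregant"], "congregant" in unseen)  # 1 False
```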
250k words at a generous 100 bytes per word is only 25MB of memory...
Maybe I'm being naive, but with only ~275k words to check against, this doesn't seem like a particularly hard problem: ingest a post, split it into words, check each word via some db, hashmap, etc., and update the metadata.
> We just visited wheal Martyn museum in Cornwall, nice scones and a waterwheel, they also have a lot of gutters, sluices and pipes and a bit of a fixation about China Clay. More importantly they appear to be unattached at the moment
Both "wheal" (kind of cheating, that should be Wheal and is a place name) and "sluices" were new to the dictionary.
Fascinating! I think it's really cool that this is possible, and at the same time kind of sad that the norm is slowly moving towards more locked-down APIs.
I checked out the author's other projects, and this is a common issue. For example, he has a "lean checker" for Bluesky that claims it is right-leaning simply because of all the people saying "That's right," "He was right," etc. None of the supposedly right-leaning posts were actually conservative in nature; they just used the word "right" to mean correct.
One, thank you for checking out my website. Two, that is the joke, 100% - at the time, people kept talking about how "left-leaning" bsky was, and that idea came to mind.
Not an answer to your question, but I suspect most people don't -- my bot (a pi searcher bot, of course) just runs on Jetstream, which is pretty lightweight and heavily compressed.