
The number of unique values inserted into the bloom filter goes up ~exponentially with n, so to keep the false-positive rate under control the bloom filter has to grow.
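The standard sizing formula makes the trade-off concrete: for a target false-positive rate p, the optimal filter needs about m = -n·ln(p)/(ln 2)² bits, i.e. the filter grows linearly in the number of distinct items stored. A small sketch (the item counts are illustrative, not from the article):

```python
import math

def bloom_size_bits(n_items: int, fp_rate: float) -> int:
    """Optimal Bloom filter size in bits for n_items at a target false-positive rate."""
    return math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)

# Ten times the distinct n-grams needs ten times the bits for the same 1% rate:
for n_items in (10**6, 10**8, 10**10):
    print(n_items, bloom_size_bits(n_items, 0.01))
```

So if the number of distinct n-grams explodes with n, the filter's memory footprint explodes right along with it.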


At a large enough n-gram size there would be very few collisions. Take this text, for example, and search for it in Google with quotes: it won't find anything matching exactly.

I tested this 6-gram "it won't find anything matching exactly", no match. Almost anything we write has never been said exactly like that before.
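Word-level n-gram extraction, as used in this kind of test, can be sketched in a few lines (a minimal version that ignores case and punctuation normalization):

```python
def word_ngrams(text: str, n: int) -> list[str]:
    """All word-level n-grams in text, as space-joined strings."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# The tested phrase is six words long, so it is itself a single 6-gram:
print(word_ngrams("it won't find anything matching exactly", 6))
# → ["it won't find anything matching exactly"]
```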


> it won't find anything matching exactly

This approach is probably inadequate. In my line of NLP research I find that many things have been said exactly the same way many, many times over.

You can try this out yourself by grouping and counting strings in the many publicly available BigQuery corpora, for various substring lengths and offsets, e.g. substring lengths of [0-16], [0-32], [0-64] at different offsets.
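The grouping-and-counting experiment can be sketched locally; the BigQuery corpora themselves are assumed, and a toy in-memory list of documents stands in here:

```python
from collections import Counter

def substring_counts(docs: list[str], offset: int, length: int) -> Counter:
    """Count exact substrings of a fixed length taken at a fixed offset in each document."""
    return Counter(d[offset:offset + length] for d in docs if len(d) >= offset + length)

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox sleeps all day",
    "a completely different sentence",
]
# The first two documents share their first 16 characters exactly:
print(substring_counts(docs, offset=0, length=16))
```

On a real corpus, counts greater than one at short lengths are common, which is the point: short exact matches recur far more than the "everything is unique" intuition suggests.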


Yes, and the fact that the number of unique phrases grows so quickly with n is exactly why the bloom filter needs to grow: otherwise the hashed n-grams would collide.



