
The number of unique values inserted into the bloom filter goes up ~exponentially with n, so to keep the false-positive rate under control the bloom filter has to grow.
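The standard sizing formula makes the trade-off concrete: for a target false-positive rate p, the optimal filter needs about m = -n·ln(p)/(ln 2)² bits, i.e. the filter grows linearly in the number of distinct items stored. A small sketch (the item counts are illustrative, not from the article):

```python
import math

def bloom_size_bits(n_items: int, fp_rate: float) -> int:
    """Optimal Bloom filter size in bits for n_items at a target false-positive rate."""
    return math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)

# Ten times the distinct n-grams needs ten times the bits for the same 1% rate:
for n_items in (10**6, 10**8, 10**10):
    print(n_items, bloom_size_bits(n_items, 0.01))
```

So if the number of distinct n-grams explodes with n, the filter's memory footprint explodes right along with it.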


At a large enough n-gram size there would be very few collisions. Take this text, for example, and search for it in Google with quotes: it won't find anything matching exactly.

I tested this 6-gram "it won't find anything matching exactly", no match. Almost anything we write has never been said exactly like that before.
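Word-level n-gram extraction, as used in this kind of test, can be sketched in a few lines (a minimal version that ignores case and punctuation normalization):

```python
def word_ngrams(text: str, n: int) -> list[str]:
    """All word-level n-grams in text, as space-joined strings."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# The tested phrase is six words long, so it is itself a single 6-gram:
print(word_ngrams("it won't find anything matching exactly", 6))
# → ["it won't find anything matching exactly"]
```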


> it won't find anything matching exactly

This approach is probably inadequate. In my line of NLP research I find that many things have been said exactly the same way many, many times over.

You can try this out yourself by grouping and counting strings in the many publicly available BigQuery corpora, for various substring lengths and offsets, e.g. substring lengths of [0-16], [0-32], [0-64] at different offsets.
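The grouping-and-counting experiment can be sketched locally; the BigQuery corpora themselves are assumed, and a toy in-memory list of documents stands in here:

```python
from collections import Counter

def substring_counts(docs: list[str], offset: int, length: int) -> Counter:
    """Count exact substrings of a fixed length taken at a fixed offset in each document."""
    return Counter(d[offset:offset + length] for d in docs if len(d) >= offset + length)

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox sleeps all day",
    "a completely different sentence",
]
# The first two documents share their first 16 characters exactly:
print(substring_counts(docs, offset=0, length=16))
```

On a real corpus, counts greater than one at short lengths are common, which is the point: short exact matches recur far more than the "everything is unique" intuition suggests.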


Yes, and the fact that the number of unique phrases grows so quickly with n is exactly why the bloom filter needs to grow: otherwise the hashed n-grams would collide.



