This is best for data with very few updates/modifications and a lot of queries. One example would be points of interest on a map, like restaurants on Google Maps. It's not as if those restaurants get updated/modified every single day, but you do have clients who, while using your restaurant finder app, query a lot of them within a given radius around their GPS coordinates.
I had such a project in the past. My client acquired this data: around 3 billion points of interest residing in a PostgreSQL DB that was around 900 GB. While the server was beefed up with 128 GB of RAM, that was not nearly enough to hold it all in memory, and responses took several seconds, way too long for young, impatient users, so my client was seeing a decline in users. I reorganized the data in PostgreSQL to fit a trie, and a query was now resolved in milliseconds. Accounting for all the other stuff on top, the app was now able to generate a response in under a second. Trie ftw.
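For the curious, one common way to make a trie fit GPS points (not necessarily what was done on that project) is to interleave the latitude/longitude bits into a geohash-style integer key, so that points sharing a key prefix fall in the same bounding box. A rough sketch, with made-up precision:

    # Hedged sketch, not the actual schema from that project: interleave lat/lon
    # bits into a geohash-like integer key; a shared prefix = a shared bounding box.
    def interleaved_key(lat, lon, bits=26):
        y = int((lat + 90.0) / 180.0 * (1 << bits))    # latitude  -> [0, 2^bits)
        x = int((lon + 180.0) / 360.0 * (1 << bits))   # longitude -> [0, 2^bits)
        key = 0
        for i in reversed(range(bits)):                # interleave, most significant bit first
            key = (key << 1) | ((y >> i) & 1)
            key = (key << 1) | ((x >> i) & 1)
        return key

    # A radius query then becomes a handful of range scans over the key prefixes
    # that cover the bounding box around the query point.
    print(interleaved_key(40.7580, -73.9855))          # an arbitrary example point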
Yes, they are. But those are general-purpose approaches. While a good approach, my implementation was custom-made for the client's needs, which allowed it to achieve better performance.
A related technique is generating an index using a machine-learned model (well explained here: https://blog.acolyer.org/2018/01/08/the-case-for-learned-ind...). In interpolation search, we just use a simple linear model. Kraska and co. use much more complex models to beat B-tree indexes. However, the Recursive Model Indexes (RMIs) that they published (https://github.com/learnedsystems/RMI) use Rust to generate a C++ model that is exported as a library, so it definitely doesn't seem ready for production yet!
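To make the "simple linear model" point concrete, here is a toy single-model version (a sketch, not anything from the RMI repo): fit a straight line from key to position over the sorted keys, predict, then fix up with a short local scan. A real RMI bounds the prediction error and searches only within that bound.

    # Toy one-model "learned index" (a sketch, not the RMI code): least-squares
    # line from key to position, then a local correction around the prediction.
    def fit_linear_index(keys):
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in keys)
        a = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys)) / var_k
        return a, mean_p - a * mean_k                     # slope, intercept

    def lookup(keys, key, model):
        a, b = model
        i = min(max(int(a * key + b), 0), len(keys) - 1)  # predicted position, clamped
        while i > 0 and keys[i] > key:                    # correct the prediction by
            i -= 1                                        # scanning left/right; a real RMI
        while i < len(keys) - 1 and keys[i] < key:        # binary-searches an error bound
            i += 1
        return i if keys[i] == key else -1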
One good thing about binary search is its memory access pattern.
The element in the middle of the array is tested every single time. Similarly, the elements at 25% and 75% of the array are tested very often, in 50% of searches each.
Modern computers have very slow memory compared to computation speed, and multi-level caches to compensate. Binary search's RAM access pattern, when the search is done many times, is friendly towards these multi-level caches. The elements at positions like 25%, 50%, 75% will stay in the L1D cache.
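A quick way to see this (a toy measurement, not anything from the article): count how often each index gets probed over many random binary searches; the midpoint and the quarter points dominate.

    # Toy demo: count how often each index is probed across many binary searches.
    import random
    from collections import Counter

    n = 1 << 20
    arr = list(range(n))              # sorted array; values equal indices for simplicity
    probes = Counter()

    def binary_search(x):
        lo, hi = 0, n - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            probes[mid] += 1          # record every index the search touches
            if arr[mid] < x:
                lo = mid + 1
            elif arr[mid] > x:
                hi = mid - 1
            else:
                return mid
        return -1

    for _ in range(100_000):
        binary_search(random.randrange(n))

    # The same handful of indices (50%, 25%, 75%, ...) absorb most of the probes,
    # which is why they stay resident in cache across searches.
    print(probes.most_common(7))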
It’s true that those places will get cached, but binary search tends to dance around the address space a bit too much with large arrays. That’s one of the advantages of interpolation search—provided the items are uniformly distributed.
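For reference, a minimal interpolation search looks something like this (a sketch of the textbook algorithm, not code from the article):

    # Minimal interpolation search: guess the position by linearly interpolating
    # between the endpoint values, assuming the keys are roughly uniform.
    def interpolation_search(arr, key):
        lo, hi = 0, len(arr) - 1
        while lo <= hi and arr[lo] <= key <= arr[hi]:
            if arr[hi] == arr[lo]:                        # all remaining values are equal
                return lo if arr[lo] == key else -1
            pos = lo + (key - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
            if arr[pos] == key:
                return pos
            if arr[pos] < key:
                lo = pos + 1
            else:
                hi = pos - 1
        return -1

On uniformly distributed keys this takes O(log log n) probes on average, but on skewed data it can degrade badly, which is what the fallback tricks further down the thread address.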
In college I played with a variation of this where you fix the size of the collection to a power of 2 (sprinkling in zeroes/repeated values if needed). Your initial guess for where to look is just the top bits of the searched-for value; then you estimate the number of slots you were off from the top bits of the _difference_ between the value you found and the one you want (clamping your next guess to the first/last item if you'd fall off the end otherwise). In the concrete example I was playing with, it worked to just stop there and do a linear search, but you could be cleverer about how long to iterate and when/how to bail out to something with a better worst case.
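If I'm reading that right, it would look roughly like this (an interpretation, assuming 32-bit unsigned keys and a table padded to size 2^k):

    # One reading of the scheme above: assumes 32-bit unsigned keys and
    # len(arr) == 2**k, padded with repeated/zero values as described.
    def top_bits_search(arr, key, k):
        n = 1 << k
        shift = 32 - k
        i = key >> shift                        # first guess: top k bits of the key
        diff = key - arr[i]
        step = diff >> shift if diff >= 0 else -((-diff) >> shift)
        i = min(max(i + step, 0), n - 1)        # second guess from the difference, clamped
        while i > 0 and arr[i] > key:           # then just finish with a linear scan
            i -= 1
        while i < n - 1 and arr[i] < key:
            i += 1
        return i if arr[i] == key else -1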
As Lemire says, the problem (a good "problem"!) is that hash tables and B-trees can be quite fast, are robust against different distributions, and you have them right at hand. It'd almost have to be some weird situation, like you're handed data in a format that kinda fits the requirements and you're looking for how to work with it in-place.
Neat trick. Possibly a bit brittle, since you have to maintain your knowledge of the distribution somewhere -- i.e. it's "state you have to update or things might break," which is always worrisome.
I wonder what the worst case performance is -- if your guess is completely wrong, will it still converge in no worse than O(log(N))? Do you guess only once, at the beginning, or do you keep guessing where to go based on the distribution?
If you only guess once, at the beginning, then there's probably no risk. But if you keep trying to aim for the wrong target every iteration, I wouldn't be surprised if you get pathological behavior.
It would be kind of fun to try to calculate or simulate how bad it can get if you keep making incorrect guesses.
> if your guess is completely wrong, will it still converge in no worse than O(log(N))?
No, but there's a simple but powerful technique you can use to ensure this from the outside. See if you can figure it out. (If you don't, look up introsort.)
This sounds like Brent's method for root finding! Also very similar is his method for univariate minimization. I ported this into Clojure recently from Python / FORTRAN... not quite literate programming, but well documented if you want to give it a peek. https://github.com/littleredcomputer/sicmutils/blob/master/s...
Yup that was the answer :-) interleaving the two lets you get the best of both worlds at the cost of a constant-factor penalty. This is a pretty clever & general-purpose meta-algorithm that works for pretty much any problem where you have competing algorithms with different strengths.
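A rough illustration of that interleaving idea (a sketch, not anyone's production code): alternate an interpolation probe with a plain binary probe, so the worst case stays O(log n) while the average case keeps interpolation's speed.

    # Sketch of the interleaving meta-algorithm: every other step is a binary
    # search step, which bounds the worst case at roughly 2x binary search.
    def interleaved_search(arr, key):
        lo, hi = 0, len(arr) - 1
        interpolate = True
        while lo <= hi and arr[lo] <= key <= arr[hi]:
            if interpolate and arr[hi] != arr[lo]:
                pos = lo + (key - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
            else:
                pos = (lo + hi) // 2            # the guaranteed-progress binary step
            interpolate = not interpolate
            if arr[pos] == key:
                return pos
            if arr[pos] < key:
                lo = pos + 1
            else:
                hi = pos - 1
        return -1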
Quitting and starting over as GGP mentioned is another option that also works for this problem, though it might not work as well for other problems where you can't put a nice bound on the number of steps.
FWIW there's also a very dumb (or smart, I guess, depending on your point of view) solution, which is to just run the two algorithms on 2 threads simultaneously and report the result of whichever finishes first. It might make sense for some problems. And in any case, this is basically equivalent to what you're doing on 1 CPU with the previous technique.
A strategy in use since the 1970s is to quit interpolation search when it starts picking values too close to the edge of the range, switching to binary search afterwards. That's a well known technique for finding the roots of polynomials.
>This being said, I am not aware of interpolation search being actually used productively in software today. If you have a reference to such an artefact, please share!
This is also my response to that note in the article: interpolation search has been well-known in the numerical methods world since ancient times. ;)
You don't want to pivot on the average, you want to pivot on the median. You will gain the most information with every step if 50% of the probability mass is above your guess and 50% is below.
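A toy illustration of why (my own example, not from the thread): with a skewed model of where the target sits, the median pivot splits the remaining probability mass 50/50 while the mean pivot does not, so the mean gains less information per probe.

    # Toy illustration: for a skewed position model, the median splits the
    # probability mass 50/50; the mean does not.
    import random, statistics

    random.seed(0)
    positions = [random.expovariate(1 / 100.0) for _ in range(100_000)]  # hypothetical skewed model

    for name, pivot in [("mean", statistics.fmean(positions)),
                        ("median", statistics.median(positions))]:
        below = sum(p < pivot for p in positions) / len(positions)
        print(f"{name} pivot: {below:.1%} of the mass below, {1 - below:.1%} above")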
It's often possible to do "better" by tuning something toward the data set. Finding something that is always better is hard.
One thing I have done is a double binary search.
Store prefixes and suffixes separately. A binary search of the prefixes identifies the suffix clump, which is then searched with a second binary search. This involves a slight increase in storage --- each prefix needs a suffix pointer.
But maybe I will now try an interpolation search on the suffixes.
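For concreteness, a minimal sketch of that double binary search as I understand it, with fixed-length string prefixes and Python's bisect standing in for the two binary searches:

    # Sketch of the prefix/suffix scheme as described above: unique prefixes,
    # each carrying a pointer to where its suffix clump starts.
    import bisect

    def build(keys, plen=4):
        keys = sorted(keys)
        prefixes, starts, suffixes = [], [], []
        for k in keys:
            p, s = k[:plen], k[plen:]
            if not prefixes or prefixes[-1] != p:
                prefixes.append(p)              # new prefix ...
                starts.append(len(suffixes))    # ... and its suffix pointer
            suffixes.append(s)
        starts.append(len(suffixes))            # sentinel: end of the last clump
        return prefixes, starts, suffixes

    def contains(index, key, plen=4):
        prefixes, starts, suffixes = index
        p, s = key[:plen], key[plen:]
        i = bisect.bisect_left(prefixes, p)             # binary search #1: prefixes
        if i == len(prefixes) or prefixes[i] != p:
            return False
        lo, hi = starts[i], starts[i + 1]               # the clump for this prefix
        j = bisect.bisect_left(suffixes, s, lo, hi)     # binary search #2: the clump
        return j < hi and suffixes[j] == s

Swapping the second bisect for an interpolation-style probe over the suffix clump would be the experiment mentioned above.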
https://en.wikipedia.org/wiki/Trie