This is best for data with very few updates/modifications and a lot of queries. One example would be points of interest on a map, like restaurants on Google Maps. It's not as if those restaurants get updated/modified every single day, but you do have clients who, while using your restaurant finder app, query a lot of them within a given radius around their GPS coordinates.
I had such a project in the past. My client acquired this data: around 3 billion points of interest residing in a PostgreSQL DB that was around 900 GB. While the server was beefed up with 128 GB of RAM, that was not nearly enough to hold it all in memory, and responses took several seconds, way too long for young, impatient users, so my client was seeing a decline in users. I reorganized the data in PostgreSQL to fit a trie, and a query was now resolved in milliseconds. Accounting for all the other stuff on top, the app was now able to generate a response in under a second. Trie ftw.
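For the curious, one common way to make a trie fit GPS points (not necessarily what was done on that project) is to interleave the latitude/longitude bits into a geohash-style integer key, so that points sharing a key prefix fall in the same bounding box. A rough sketch, with made-up precision:

    # Hedged sketch, not the actual schema from that project: interleave lat/lon
    # bits into a geohash-like integer key; a shared prefix = a shared bounding box.
    def interleaved_key(lat, lon, bits=26):
        y = int((lat + 90.0) / 180.0 * (1 << bits))    # latitude  -> [0, 2^bits)
        x = int((lon + 180.0) / 360.0 * (1 << bits))   # longitude -> [0, 2^bits)
        key = 0
        for i in reversed(range(bits)):                # interleave, most significant bit first
            key = (key << 1) | ((y >> i) & 1)
            key = (key << 1) | ((x >> i) & 1)
        return key

    # A radius query then becomes a handful of range scans over the key prefixes
    # that cover the bounding box around the query point.
    print(interleaved_key(40.7580, -73.9855))          # an arbitrary example point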
Yes, they are. But those are general-purpose approaches. While a good approach, my implementation was custom-made for the client's needs, which allowed it to achieve better performance.
A related technique is generating an index using a machine-learned model (well explained here: https://blog.acolyer.org/2018/01/08/the-case-for-learned-ind...). In interpolation search, we just use a simple linear model. Kraska and co. use much more complex models to beat B-tree indexes. However, the Recursive Model Indexes (RMIs) that they published (https://github.com/learnedsystems/RMI) use Rust to generate a C++ model that is exported as a library, so it definitely doesn't seem ready for production yet!
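To make the "simple linear model" point concrete, here is a toy single-model version (a sketch, not anything from the RMI repo): fit a straight line from key to position over the sorted keys, predict, then fix up with a short local scan. A real RMI bounds the prediction error and searches only within that bound.

    # Toy one-model "learned index" (a sketch, not the RMI code): least-squares
    # line from key to position, then a local correction around the prediction.
    def fit_linear_index(keys):
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        var_k = sum((k - mean_k) ** 2 for k in keys)
        a = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys)) / var_k
        return a, mean_p - a * mean_k                     # slope, intercept

    def lookup(keys, key, model):
        a, b = model
        i = min(max(int(a * key + b), 0), len(keys) - 1)  # predicted position, clamped
        while i > 0 and keys[i] > key:                    # correct the prediction by
            i -= 1                                        # scanning left/right; a real RMI
        while i < len(keys) - 1 and keys[i] < key:        # binary-searches an error bound
            i += 1
        return i if keys[i] == key else -1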
One good thing about binary search is its memory access pattern.
The element in the middle of the array is tested every single time. Similarly, the elements at 25% and 75% of the array are tested very often, in 50% of searches each.
Modern computers have very slow memory compared to computation speed, and multi-level caches to compensate. Binary search's RAM access pattern, when the search is done many times, is friendly towards these multi-level caches. The elements at positions like 25%, 50%, 75% will stay in the L1D cache.
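A quick way to see this (a toy measurement, not anything from the article): count how often each index gets probed over many random binary searches; the midpoint and the quarter points dominate.

    # Toy demo: count how often each index is probed across many binary searches.
    import random
    from collections import Counter

    n = 1 << 20
    arr = list(range(n))              # sorted array; values equal indices for simplicity
    probes = Counter()

    def binary_search(x):
        lo, hi = 0, n - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            probes[mid] += 1          # record every index the search touches
            if arr[mid] < x:
                lo = mid + 1
            elif arr[mid] > x:
                hi = mid - 1
            else:
                return mid
        return -1

    for _ in range(100_000):
        binary_search(random.randrange(n))

    # The same handful of indices (50%, 25%, 75%, ...) absorb most of the probes,
    # which is why they stay resident in cache across searches.
    print(probes.most_common(7))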
It’s true that those places will get cached, but binary search tends to dance around the address space a bit too much with large arrays. That’s one of the advantages of interpolation search—provided the items are uniformly distributed.
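For reference, a minimal interpolation search looks something like this (a sketch of the textbook algorithm, not code from the article):

    # Minimal interpolation search: guess the position by linearly interpolating
    # between the endpoint values, assuming the keys are roughly uniform.
    def interpolation_search(arr, key):
        lo, hi = 0, len(arr) - 1
        while lo <= hi and arr[lo] <= key <= arr[hi]:
            if arr[hi] == arr[lo]:                        # all remaining values are equal
                return lo if arr[lo] == key else -1
            pos = lo + (key - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
            if arr[pos] == key:
                return pos
            if arr[pos] < key:
                lo = pos + 1
            else:
                hi = pos - 1
        return -1

On uniformly distributed keys this takes O(log log n) probes on average, but on skewed data it can degrade badly, which is what the fallback tricks further down the thread address.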
In college I played with a variation of this where you fix the size of the collection to a power of 2 (sprinkling in zeroes/repeated values if needed). Your initial guess for where to look is just the top bits of the searched-for value; then you estimate the number of slots you were off from the top bits of the _difference_ between the value you found and the one you want (clamping your next guess to the first/last item if you'd fall off the end otherwise). In the concrete example I was playing with, it worked to just stop there and do a linear search, but you could be cleverer about how long to iterate and when/how to bail out to something with a better worst case.
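If I'm reading that right, it would look roughly like this (an interpretation, assuming 32-bit unsigned keys and a table padded to size 2^k):

    # One reading of the scheme above: assumes 32-bit unsigned keys and
    # len(arr) == 2**k, padded with repeated/zero values as described.
    def top_bits_search(arr, key, k):
        n = 1 << k
        shift = 32 - k
        i = key >> shift                        # first guess: top k bits of the key
        diff = key - arr[i]
        step = diff >> shift if diff >= 0 else -((-diff) >> shift)
        i = min(max(i + step, 0), n - 1)        # second guess from the difference, clamped
        while i > 0 and arr[i] > key:           # then just finish with a linear scan
            i -= 1
        while i < n - 1 and arr[i] < key:
            i += 1
        return i if arr[i] == key else -1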
As Lemire says, the problem (a good "problem"!) is that hash tables and B-trees can be quite fast, are robust against different distributions, and you have them right at hand. It'd almost have to be some weird situation, like you're handed data in a format that kinda fits the requirements and you're looking for how to work with it in-place.
Neat trick. Possibly a bit brittle, since you have to maintain your knowledge of the distribution somewhere -- i.e. it's "state you have to update or things might break," which is always worrisome.
I wonder what the worst case performance is -- if your guess is completely wrong, will it still converge in no worse than O(log(N))? Do you guess only once, at the beginning, or do you keep guessing where to go based on the distribution?
If you only guess once, at the beginning, then there's probably no risk. But if you keep trying to aim for the wrong target every iteration, I wouldn't be surprised if you get pathological behavior.
It would be kind of fun to try to calculate or simulate how bad it can get if you keep making incorrect guesses.
> if your guess is completely wrong, will it still converge in no worse than O(log(N))?
No, but there's a simple but powerful technique you can use to ensure this from the outside. See if you can figure it out. (If you don't, look up introsort.)
This sounds like Brent's method for root finding! Also very similar is his method for univariate minimization. I ported this into Clojure recently from Python / FORTRAN... not quite literate programming, but well documented if you want to give it a peek. https://github.com/littleredcomputer/sicmutils/blob/master/s...
Yup that was the answer :-) interleaving the two lets you get the best of both worlds at the cost of a constant-factor penalty. This is a pretty clever & general-purpose meta-algorithm that works for pretty much any problem where you have competing algorithms with different strengths.
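A rough illustration of that interleaving idea (a sketch, not anyone's production code): alternate an interpolation probe with a plain binary probe, so the worst case stays O(log n) while the average case keeps interpolation's speed.

    # Sketch of the interleaving meta-algorithm: every other step is a binary
    # search step, which bounds the worst case at roughly 2x binary search.
    def interleaved_search(arr, key):
        lo, hi = 0, len(arr) - 1
        interpolate = True
        while lo <= hi and arr[lo] <= key <= arr[hi]:
            if interpolate and arr[hi] != arr[lo]:
                pos = lo + (key - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
            else:
                pos = (lo + hi) // 2            # the guaranteed-progress binary step
            interpolate = not interpolate
            if arr[pos] == key:
                return pos
            if arr[pos] < key:
                lo = pos + 1
            else:
                hi = pos - 1
        return -1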
Quitting and starting over as GGP mentioned is another option that also works for this problem, though it might not work as well for other problems where you can't put a nice bound on the number of steps.
FWIW there's also a very dumb (or smart, I guess, depending on your point of view) solution, which is to just run the two algorithms on 2 threads simultaneously and report the result of whichever finishes first. It might make sense for some problems. And in any case, this is basically equivalent to what you're doing on 1 CPU with the previous technique.
A strategy in use since the 1970s is to quit interpolation search when it starts picking values too close to the edge of the range, switching to binary search afterwards. That's a well known technique for finding the roots of polynomials.
>This being said, I am not aware of interpolation search being actually used productively in software today. If you have a reference to such an artefact, please share!
This is also my response to that note in the article: interpolation search has been well-known in the numerical methods world since ancient times. ;)
You don't want to pivot on the average, you want to pivot on the median. You will gain the most information with every step if 50% of the probability mass is above your guess and 50% is below.
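A toy illustration of why (my own example, not from the thread): with a skewed model of where the target sits, the median pivot splits the remaining probability mass 50/50 while the mean pivot does not, so the mean gains less information per probe.

    # Toy illustration: for a skewed position model, the median splits the
    # probability mass 50/50; the mean does not.
    import random, statistics

    random.seed(0)
    positions = [random.expovariate(1 / 100.0) for _ in range(100_000)]  # hypothetical skewed model

    for name, pivot in [("mean", statistics.fmean(positions)),
                        ("median", statistics.median(positions))]:
        below = sum(p < pivot for p in positions) / len(positions)
        print(f"{name} pivot: {below:.1%} of the mass below, {1 - below:.1%} above")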
It's often possible to do "better" by tuning something toward the data set. Finding something that is always better is hard.
One thing I have done is a double binary search.
Store prefixes and suffixes separately. A binary search of the prefixes identifies the suffix clump, which is then searched with a second binary search. This involves a slight increase in storage --- each prefix needs a suffix pointer.
But maybe I will now try an interpolation search on the suffixes.
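For concreteness, a minimal sketch of that double binary search as I understand it, with fixed-length string prefixes and Python's bisect standing in for the two binary searches:

    # Sketch of the prefix/suffix scheme as described above: unique prefixes,
    # each carrying a pointer to where its suffix clump starts.
    import bisect

    def build(keys, plen=4):
        keys = sorted(keys)
        prefixes, starts, suffixes = [], [], []
        for k in keys:
            p, s = k[:plen], k[plen:]
            if not prefixes or prefixes[-1] != p:
                prefixes.append(p)              # new prefix ...
                starts.append(len(suffixes))    # ... and its suffix pointer
            suffixes.append(s)
        starts.append(len(suffixes))            # sentinel: end of the last clump
        return prefixes, starts, suffixes

    def contains(index, key, plen=4):
        prefixes, starts, suffixes = index
        p, s = key[:plen], key[plen:]
        i = bisect.bisect_left(prefixes, p)             # binary search #1: prefixes
        if i == len(prefixes) or prefixes[i] != p:
            return False
        lo, hi = starts[i], starts[i + 1]               # the clump for this prefix
        j = bisect.bisect_left(suffixes, s, lo, hi)     # binary search #2: the clump
        return j < hi and suffixes[j] == s

Swapping the second bisect for an interpolation-style probe over the suffix clump would be the experiment mentioned above.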
https://en.wikipedia.org/wiki/Trie