
For small enough tables, doing a full scan will be faster than using an index. This is also true for regular non-database applications, like checking whether an item is present in a small collection: it's faster to do a linear scan over a small vector than to do a lookup in a hash table or b-tree. With a linear scan (whether it's over data on disk or a data structure in memory) you don't have to do hashing or branching (except possibly to terminate the loop). With a binary search you basically get the worst possible case for the branch predictor, because if you're searching random values, whether you go right or left at each recursion level is basically random. This is true for in-memory data structures, and it's even more true on disk, since disks (even SSDs) are much better at sequential reads than at random accesses.
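To make the in-memory case concrete, here's a minimal sketch in Rust (the function names and the u32 key type are just for illustration) of the two shapes being compared; for a handful of elements the first one tends to win, and only a benchmark on your own data shows where they cross over:

    // Linear scan: one cheap, predictable comparison per element, walking
    // sequentially through memory.
    fn find_linear(haystack: &[u32], needle: u32) -> Option<usize> {
        haystack.iter().position(|&x| x == needle)
    }

    // Binary search: O(log n) comparisons, but on random queries each left/right
    // decision is a coin flip for the branch predictor, and each step jumps to a
    // different cache line.
    fn find_binary(sorted: &[u32], needle: u32) -> Option<usize> {
        sorted.binary_search(&needle).ok()
    }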

It's been a while since I looked at this, but I think MySQL and Postgres both take this into account: they'll let you add indexes to small tables, but will actually ignore them and do a full table scan for all queries if the table is small enough.



Yes, that's correct. If you run an EXPLAIN on a tiny table, as long as your stats are up to date, you'll see the index being ignored. In that case, it's there as insurance in case the table grows in the future.


Benchmarking this (for non-database applications) is really interesting because you get to find the crossover point, and it's usually in the hundreds of items. An algorithm that searches at most 50 items in a hash table, but does so across many hash tables, could be made much faster by using a linear search.
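A minimal harness for finding that crossover might look something like this (Rust; the sizes, key type, and lookup pattern are made up for illustration, and a real measurement should use a benchmarking crate such as criterion to guard against the optimizer):

    use std::collections::HashMap;
    use std::time::Instant;

    // Number of lookups per size; purely illustrative.
    const LOOKUPS: u64 = 200_000;

    fn main() {
        for n in [8u64, 32, 128, 512, 2048] {
            // n key/value pairs, and a fixed lookup order that hits all keys.
            let vec: Vec<(u64, u64)> = (0..n).map(|i| (i, i * 2)).collect();
            let map: HashMap<u64, u64> = vec.iter().copied().collect();
            let keys: Vec<u64> = (0..LOOKUPS).map(|i| (i * 7 + 3) % n).collect();

            // Linear scan over a small contiguous vector: predictable branches,
            // sequential memory access, no hashing.
            let t = Instant::now();
            let mut sum_linear = 0u64;
            for &k in &keys {
                sum_linear += vec.iter().find(|&&(key, _)| key == k).map_or(0, |&(_, v)| v);
            }
            let linear = t.elapsed();

            // Hash lookup: pay for hashing plus a less predictable memory access.
            let t = Instant::now();
            let mut sum_hash = 0u64;
            for &k in &keys {
                sum_hash += map.get(&k).copied().unwrap_or(0);
            }
            let hashed = t.elapsed();

            assert_eq!(sum_linear, sum_hash);
            println!("n = {n:4}: linear {linear:?}, hashmap {hashed:?}");
        }
    }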

Some of his examples aren't great, though. The cost of adding an index to a database you're already using, and which has a reasonably small amount of data, is low, and the benefit of shaving a few hundred milliseconds of latency for your users is underappreciated. A hundred million records is definitely past the point at which e.g. prefix search on a mobile device becomes painful.


> With a binary search you basically get the worst possible case for the branch predictor, because if you're searching random values, whether you go right or left at each recursion level is basically random.

I think folks writing databases know a thing or two about writing faster search and sort algorithms.


I'm not sure what you're trying to say, given the context of what you quoted. Binary search has optimal worst-case runtime complexity, but for small tables it is, on real computers, still slower overall than a linear scan, which effectively has the worst worst-case runtime complexity (unless you start to get pathological). This is because the constant in front of the runtime complexity, the one that O-notation explicitly ignores, starts to matter: branch prediction and cache locality come into play.

What exactly do you mean by "writing faster search and sort algorithms"?


> What exactly do you mean by "writing faster search and sort algorithms"?

- Branchless binary search: https://news.ycombinator.com/item?id=35737862 / https://news.ycombinator.com/item?id=14598098

- Static B-Trees for faster binary search: https://news.ycombinator.com/item?id=30376140

Presumably there's even more literature on the topic.
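For a flavour of the first technique, a branchless lower-bound might look roughly like this in Rust (a sketch of the general idea discussed in those threads, not code from any particular database):

    // Branchless binary search (lower_bound): the search range is halved each
    // step, and the base index is advanced using the result of a comparison
    // that the compiler can typically turn into a conditional move, so the loop
    // body has no hard-to-predict taken/not-taken branch.
    fn lower_bound(sorted: &[u32], target: u32) -> usize {
        let mut base = 0usize;
        let mut len = sorted.len();
        while len > 1 {
            let half = len / 2;
            // Usually compiled to a cmov rather than a conditional jump.
            base += (sorted[base + half - 1] < target) as usize * half;
            len -= half;
        }
        // Returns the index of the first element >= target (len() if none).
        base + (len == 1 && sorted[base] < target) as usize
    }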


Those are neat optimizations of the constant, but even with that, a linear scan can still be much faster under the right (and not unrealistic) circumstances.

But I see now you were referring to the branch predictor part specifically, so all good.


Which is why they use b-trees and not binary search.


Why do you think that? Seems like a hard problem.


All of this looks accurate, but it's worth contextualizing: this is an optimization that bears the most fruit in "local frames of reference" -- the timescales where linear scans beat index lookups are likely to be strictly dominated by the network latency of a round trip to and from the database. The conclusion is then that the optimization ~only optimizes for cases that effectively don't matter.


What do you think the threshold number is where scanning a list will outperform a hash table? 10 items? 100? 1000? For what it’s worth, for small data sets I agree with you. But I think it’s very hard to be calibrated correctly on what small means here.

Re: binary search, the biggest cost to scanning data in a database is loading that data from persistent storage. A few mispredicted branches aren’t going to matter much if it means you can do fewer reads.


> But I think it’s very hard to be calibrated correctly on what small means here.

It depends.

If you measure or otherwise determine the sizes of the L1/L2/L3 caches in the CPU, the relative speed and size of the main memory, and the relative speed and size of the local storage (in contrast with, for example, network-remote data), then it is a matter of performing some arithmetic and, just to be on the safe side, some benchmarks.

Any data set that fits into the CPU cache is definitely small, for a particular CPU.

If you want to decide on a number that applies to any and all hardware, then it goes from hard to basically impossible.
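As a rough worked example (with typical, not universal, cache sizes): a 32 KB L1 data cache holds about 8,000 four-byte keys, and a 256 KB L2 about 65,000, so a table of a few hundred small rows is "small" by a comfortable margin on most current CPUs, while a few million rows is already a main-memory (or disk) question.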


The Rust BTreeMap implementation is an example of this as well. It does a linear search within each node instead of a binary search, because it's actually faster in this case!

https://doc.rust-lang.org/std/collections/struct.BTreeMap.ht...
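As a toy illustration of why that works (this is not the std implementation, and the capacity constant here is made up): the keys of a node live in one small contiguous array, so the per-node search is a short, predictable loop over memory that is already in cache.

    // Hypothetical fixed-capacity node, roughly in the shape of a B-tree node.
    const CAPACITY: usize = 11; // illustrative; the real capacity is an internal detail

    struct Node {
        keys: [u32; CAPACITY],
        len: usize, // number of keys actually in use
    }

    impl Node {
        /// Index of the first key >= `target` (i.e. which slot/child to take),
        /// found with a plain linear scan rather than a binary search.
        fn search(&self, target: u32) -> usize {
            self.keys[..self.len]
                .iter()
                .position(|&k| k >= target)
                .unwrap_or(self.len)
        }
    }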


Did we both just watch the latest Crust of Rust, or did you just know that? If the latter, I’m impressed!


>Did we both just watch the latest Crust of Rust

Yes that's exactly how I knew about this!


Interesting. But the doc doesn't go as far as to say it's faster, just fast - any measurements for eg string keys?


I can't seem to find any benchmarks for this but Gankra did a long write up about Rust's btree implementation!

I did not read this entire thing so it's very possible it contains a benchmark somewhere but I couldn't find any when doing a quick skim.

https://faultlore.com/blah/rust-btree-case/


Yes, but when the table is that small, even if the index is slower it hardly matters, precisely because the table is so small.

Or in other words: add the index every single time, unless you have so many writes and such a large table that maintaining the index actually slows the writes down.

Leaving out an index on a small table never helps anything.


I am glad the query execution planner (mostly) takes the strain. When it doesn't, there is a 99% chance you need to promote an ORM-generated query to a human- (well, probably GPT-) generated explicit one.


It's more than that: PostgreSQL will choose to ignore the index on a per-query basis. If the query will return more than some percentage of the table, it's faster to just do the scan.



