I always think discussions like this should start with the following: A database with no indexes is slow. Finding a particular row will require a linear search. Adding an index on a column means that you're optimizing finding rows by the values of that column. Hence, an index is really a mapping from a particular column's value to the position in the db of that row (very likely an 8-byte integer that is the offset into the file of the row in question).
This all means we can implement indexes as B-trees where the keys are the values of a particular column and the value is the file offset of the row with that value. You could envision a simple db format where the indexes and the main row file are stored in separate files. In such a database you could drop an index simply by deleting the index's file (or add one by creating it). The main row file has all of the data, so indexes can be recreated if necessary (at some expense, of course).
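To make that concrete, here's a minimal sketch of the mapping described above, with a sorted list standing in for the B-tree (all names here are invented for illustration):

```python
import bisect

class ColumnIndex:
    """Maps a column's values to byte offsets in the main row file."""

    def __init__(self):
        self.keys = []      # sorted column values
        self.offsets = []   # offsets[i] is where the row for keys[i] starts

    def insert(self, value, offset):
        i = bisect.bisect_left(self.keys, value)
        self.keys.insert(i, value)
        self.offsets.insert(i, offset)

    def lookup(self, value):
        i = bisect.bisect_left(self.keys, value)
        if i < len(self.keys) and self.keys[i] == value:
            return self.offsets[i]   # seek() here in the row file
        return None
```

A real index would keep this in B-tree pages sized to disk blocks, but the contract is the same: column value in, file offset out.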
> A database with no indexes is slow. Finding a particular row will require a linear search.
The crux is understanding what data access patterns you will have and what indexes / data structures accelerate that access pattern. "Index = fast" is a painfully pernicious untrue meme. It's absolutely true for application tables with queries only touching a few rows. On the other hand, analytics queries touching a high proportion of rows, with joins on equality conditions (i.e. hash-joinable), aren't going to go any faster with an index.
I've seen devs shotgun indexes at tables to fix performance (I've done it myself too), but the real test of index understanding is when that doesn't work.
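To make the hash-join point concrete, here's a minimal sketch (rows and names invented): the join reads every row of both inputs exactly once, so there is no work for an index to skip.

```python
def hash_join(left, right, key):
    """Equality join: build a hash table on one input, probe with the other.
    Both inputs are scanned in full exactly once, regardless of any index."""
    table = {}
    for row in left:                         # build phase
        table.setdefault(row[key], []).append(row)
    for row in right:                        # probe phase
        for match in table.get(row[key], []):
            yield {**match, **row}

# list(hash_join([{"id": 1, "a": "x"}], [{"id": 1, "b": "y"}], "id"))
# -> [{"id": 1, "a": "x", "b": "y"}]
```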
> On the other hand, analytics queries touching a high proportion of rows, with joins on equality conditions (i.e. hash-joinable), aren't going to go any faster with an index.
> "Index = fast" is a painfully pernicious untrue meme.
I believe that a lack of internalization of that meme (regardless of how true it is) can be a cause of real trouble.
I was working on a team where the Java devs simply didn't bother to put indexes on tables because they were small (like 100 rows or so). When I (a JS dev) pestered them long enough that they finally did it, the whole app suddenly got super snappy, and they were very thankful, as it happened just as degrading performance was causing a lot of gloom.
If a table has 100 rows and those rows aren't huuuuge, there is no reason to put an index on it. The rows probably fit into one page or so and are all read together anyway.
Why are you mentioning that they were Java devs and that you are a JS (JavaScript?) dev? Does that give you any kind of... expertise in the matter?
In theory, with pages, caching and all, yes. In practice it made a colossal difference.
With correct indexes, the queries could be answered with an index lookup alone, without touching the table data at all. That was many times faster than a sequential scan of even that few rows.
I'm mentioning our respective roles to show that, in practice, people with nominally no expertise often make such decisions, and such "memes", even though they are not strictly true in all cases, can bring a lot of value in a practical setting.
"No inedex = slow" is a good heuristic for nearly all devs.
> With correct indexes, the queries could be answered with an index lookup alone, without touching the table data at all
That doesn't make things faster unless the rows contain so much data that they force many page reads. Otherwise, reading the whole index or reading all the rows takes the same time.
I think there is some information you are not providing that might make a difference - or it is actually a misunderstanding on your side, and if the queries really got faster, it's because of something else, such as a join.
If it didn't matter, no one would ever use binary search for an amount of data that fits in a single page. Linear search would always be preferred, as it is simpler.
The only thing I'm not providing is the exact sizes of the tables involved, other than that they were considered small. Some had 100 rows, but I can't rule out that some had 1000. I didn't say each of those tables fit in a single page, which I agree could be significant. I am completely sure that the difference was caused by adding indexes to small tables, nothing else. The improvement was achieved in a single day by a single person adding indexes where there were none, and he observed the improvement as he went from table to table. I was specifically thanked by him for pestering him about it; he was the informal tech lead for this project. The application was fairly large at that time and performance issues were everywhere. Also used everywhere were those small tables, containing stuff like lists of countries, languages, currencies, types of operations and such.
Depends on the data, and the index. A covering index would almost certainly be faster. But you said analytics, which implies large results from huge datasets, so it’s unlikely to fit into memory in the first place.
> The crux is understanding what data access patterns you will have and what indexes / data structures accelerate that access pattern
We have a winner. But when looking at SQL tables/Views/Stored Procedures, the data is also stored in order in memory: in effect you have a master database and files on disk, with sorted databases and files in memory for faster access.
No it’s not… if all you do is write to it. In fact, it’s the fastest possible database for such a case.
Indexes are pure redundancy - they contain data that is already in the base table and must be maintained during writes.
But they can make reads so much faster, if the data access pattern can utilize them. The key is to identify access patterns which justify the price of the index.
In some cases they can make things worse. It's worth remembering that the query optimizer looks not only at indexes but also at statistics and estimated operation costs. If your statistics are out of date and your criteria are not specific enough (e.g. they match 80% of rows), then an index is going to slow the query down: it needs to traverse the index to get the row IDs, fetch all the blocks containing them, read those, and filter out irrelevant rows. The query is probably going to be faster with a pure full table scan (thanks to linear reads).
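A back-of-the-envelope model of that trade-off (every constant here is an assumption, chosen only to show the shape of the curve):

```python
RANDOM_READ_MS = 0.1   # assumed cost of one random block read per matching row
SEQ_READ_MS = 0.01     # assumed cost of one sequential block read (readahead)
ROWS_PER_BLOCK = 100   # assumed rows per block

def index_path_ms(total_rows, selectivity):
    # traversing the index itself is cheap; the scattered row fetches dominate
    return total_rows * selectivity * RANDOM_READ_MS

def full_scan_ms(total_rows):
    return (total_rows / ROWS_PER_BLOCK) * SEQ_READ_MS

rows = 1_000_000
for sel in (0.001, 0.8):
    print(sel, index_path_ms(rows, sel), full_scan_ms(rows))
# sel 0.001 -> index ~100 ms  vs scan ~100 ms (roughly break-even)
# sel 0.8   -> index ~80,000 ms vs scan ~100 ms (the index loses badly)
```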
I can write to one database, replicate it (for example, by log shipping), and add an index only on the replica. This is not just being pedantic, this is a real-world pattern for some analytics solutions. You have a very high number of writes, and then build a reporting database at the end of every day, with all reads going to that database.
That is a very valid scenario. But when you do those things you already know the costs and benefits of the indexes, so you are not going to be harmed by the "no indexes = database slow" heuristic.
"database = fast" is way worse heuristic to believe in for the people that need heuristics to move on with what they are doing.
> And if you ever need to read anything even once database without indexes is slow
This may be true in most cases, but not all. It just depends on your access pattern. If all you do are full table scans (possibly feeding to hash-joins), you won't benefit from indexes at all!
The whole point is that indexes do not auto-magically improve performance with no downsides. If they did, we could just index all columns and call it a day!
True, but it's still relevant in the early stages of iterating on a project. I'm familiar with one team that started investing in database-level optimization months before even beginning to deprecate their `loadAllRowsFromTheLargestTable` RPC.
True, but an FK (in the child table) must reference a key (in the parent), and most databases won't let you create a key without the underlying index.
The other direction, however, is not a given: most DBs will let you create an FK on fields not covered by an index, so deleting or modifying a parent can benefit if you create such an index explicitly, because the database can check for the existence of children much faster (and avoid potentially locking the entire table). Again, the access pattern governs what indexes are needed: if you never delete/modify the parent, you may not need an index on the FK (unless you also have some queries which can use it, of course).
I've implemented a high-performance btree this way in the past, where each table and each index were separate files (with append-only writes for concurrency). It worked pretty well and wasn't hard to get right, but it had some downsides (in particular, the kernel seemed to struggle with all the paging).
Performance is always great until you have to hit disk. It's not uncommon to rely on mmap, at which point your disk access is sub-optimal vs. a hand-tailored buffer manager with strategies to improve disk reads.
The purpose of a B-tree is to optimize for when you are hitting the disk; you can't call that the struggle, that's when the B-tree sings (though you could consider extendible hashing).
My throughput was significantly higher than SQLite's (4x or so, if memory serves), but the kernel spent so much time swapping pages that the mouse cursor stuttered.
A custom page manager would have probably done the trick, but I don't have the technical chops to write one.
Note that the real primitive is "find nearest [with hint]", which has to be the #1 thing I miss in Python.
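For what it's worth, the closest stdlib approximation I know of is `bisect` over a sorted list, where the `lo` argument plays the role of the hint:

```python
import bisect

keys = [2, 3, 5, 7, 11, 13]   # a sorted list standing in for the tree

def find_nearest(keys, x, hint=0):
    i = bisect.bisect_left(keys, x, hint)  # first index with keys[i] >= x
    return i, (keys[i] if i < len(keys) else None)

i, k = find_nearest(keys, 6)       # (3, 7): nearest key >= 6
i, k = find_nearest(keys, 12, i)   # hinted: resume the search from index 3
print(i, k)                        # (5, 13)
```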
For B-trees it's only going to be a significant win if the chance of intersection is small, less than about `1/(nodes_per_block)`. For binary trees it's a much bigger win since that becomes 1/2 and binary trees are horrible on cache.
Hmm, can you efficiently intersect an arbitrary number of B-trees using the same idea as a heap-merge, but with a max-heap instead of a min-heap? You'd still have to iterate over all of them on a match, but as long as two input trees don't have an intersection, you don't have to look at the others at all... or does that become equivalent to just doing them in series?
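Here's a sketch of that idea over sorted lists, with galloping `bisect` searches standing in for B-tree cursors. It restarts the pass on a mismatch instead of keeping a real max-heap, but the access pattern is the point: as long as the first two inputs keep disagreeing, the later ones are never touched.

```python
import bisect

def intersect_sorted(lists):
    """Intersect sorted, de-duplicated lists by leapfrogging each one
    forward to the current maximum candidate value."""
    if not lists or any(not l for l in lists):
        return []
    pos = [0] * len(lists)
    out = []
    target = max(l[0] for l in lists)
    while True:
        matched = True
        for i, l in enumerate(lists):
            # gallop this list forward to the first element >= target
            pos[i] = bisect.bisect_left(l, target, pos[i])
            if pos[i] == len(l):
                return out                # one input exhausted: done
            if l[pos[i]] != target:
                target = l[pos[i]]        # raise the bar and start over;
                matched = False           # later lists aren't touched
                break
        if matched:
            out.append(target)
            pos[0] += 1                   # step past the match
            if pos[0] == len(lists[0]):
                return out
            target = lists[0][pos[0]]

print(intersect_sorted([[1, 3, 5, 9], [3, 4, 5], [0, 3, 5, 7]]))  # [3, 5]
```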
A nuance that is important here is that not all accesses are equal within the context of disk reads. B-trees are designed to minimize block reads, not memory operations.
I guess there are worst-case scenarios with evenly spaced intersections spread out exactly one per block, but in terms of block reads it fundamentally doesn't matter how you intersect two such trees: you'll still have to read all the blocks, and that is orders of magnitude slower than comparing the values within.
I think the tree structure can be considered cached in real world scenarios; not really relevant to the performance you'll get.
In an SSD, a write can only be done to a page that has already been erased. However, the unit of read/write operations is a page, while the unit of the erase operation is a block. That means that for a disk write, a naive implementation would need to read the whole block, erase the block, then write the updated data back to the block, which is unacceptable. Furthermore, blocks should wear out uniformly; otherwise, the SSD would lose capacity.
To tackle these problems, SSDs introduce a Flash Translation Layer (FTL), which helps build the illusion of a random-access device. To achieve this, the FTL employs an approach very similar to LSM trees: writes always go to new, already-erased pages, while garbage collection (GC) reclaims outdated data in the background. The FTL needs to keep a map from the user's logical addresses to physical addresses on the SSD, both in memory and persistently.
So, to answer your question: why are sequential writes faster than random writes on SSDs? Because the address map table is smaller, since new data is consecutive in larger chunks. Garbage collection is simpler, and only metadata needs to be updated. Erasing a block is required anyway.
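A toy model of that logical-to-physical remapping (every name and size here is invented, purely to show the mechanism):

```python
PAGES_PER_BLOCK = 4   # assumed: the erase unit is 4 pages

class ToyFTL:
    def __init__(self, num_blocks):
        self.l2p = {}                         # logical page -> physical page
        self.next_free = 0                    # next erased physical page
        self.capacity = num_blocks * PAGES_PER_BLOCK

    def write(self, logical_page):
        """Writes never happen in place: take the next erased page, update
        the map, and leave the old physical page behind as garbage for GC."""
        if self.next_free == self.capacity:
            raise RuntimeError("no erased pages left: GC must run first")
        garbage = self.l2p.get(logical_page)  # old version becomes garbage
        self.l2p[logical_page] = self.next_free
        self.next_free += 1
        return garbage

ftl = ToyFTL(num_blocks=2)
ftl.write(0); ftl.write(1)
print(ftl.write(0))   # rewriting logical page 0: physical page 0 is now garbage
print(ftl.l2p)        # {0: 2, 1: 1}
```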
In practice the difference between random and sequential reads is much smaller, with the caveat that all SSD reads are block operations at some level, so locally sequential access is what makes the big difference.
Though with mmap the OS may do speculative async readahead, depending on madvise.
Like sibling said, physical pages are typically much bigger than logical pages in SSDs. Also drives do prediction and sequential access is easy to predict.
Not just on SSD; it's faster in RAM too, for the same reasons (page-based access to caches). Basically, you should always use a B-tree whenever you need the functionality of an STL map.
It is faster in RAM, on SSD, and on HDD, for the same reason: reads are done in blocks. Even if you want a single byte, the system (the controller) will read an entire block, typically on the order of 100k bytes. So if your B-tree node is stored entirely in one block, all reads from that node will be fast.
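The arithmetic behind that, with assumed sizes:

```python
import math

BLOCK_SIZE = 4096   # assumed bytes per block/page
KEY_SIZE = 8        # assumed 64-bit keys
POINTER_SIZE = 8    # assumed child pointer / file offset size

# Size each B-tree node to one block and the fanout is huge, so lookups
# touch very few blocks even for enormous key counts.
fanout = BLOCK_SIZE // (KEY_SIZE + POINTER_SIZE)     # ~256 keys per node
height = math.ceil(math.log(1_000_000_000, fanout))  # levels for 1e9 keys
print(fanout, height)  # 256 4 -- a billion keys in about 4 block reads
```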
I'd assume that at some level of IO, blocks/pages/whole buffers are being read, as opposed to reading bytes one by one. So sequential access takes advantage of this, I suppose.
> It was invented over 40 years ago, yet it is still employed by the majority of modern databases.
I wonder how true this is for the top commercial engines (Oracle, MS, IBM, etc.) whose internals are closed-source and proprietary. Even a decade ago, my experience performance-testing Exadata implied some exotic magic at work, i.e. lookups were way faster than the expected O(log n). More recently, while testing SQL Server's ability to join hundreds of tables together, the performance was _way_ in excess of what I expected. I can't imagine these things have internals all that similar to, say, the B+Tree inside MySQL.
A lot of this comes down to query planners being really good at finding clever ways of doing scans and intersections of indexes, the tables themselves having indexes with a bunch of specialized representations, and the query execution doing very intelligent data traversal with partitioning or even multi-threading.
If you sit down and think carefully about your data, you can often make even a simple bare-bones B-tree perform fantastically for a query, well in excess of what you'd get out of MySQL or SQLite (which are already pretty fast).
I think the most important takeaway is that the old-school RDBMS products are probably more than enough for whatever you are trying to accomplish. Query planners in these are indistinguishable from magic, as anything that has been forged in the fires of a million production environments for a few decades should be.
I've been playing around with an idea that involves putting sql server at the heart of a game engine, and it is turning into one of the biggest rabbit holes I've ever explored. I thought latency/jitter would be more of a problem but it simply isn't.
On disk, SQL Server uses only b-trees, unless using the new ColumnStore format.
In memory during a query it can use temporary indexes of other types, primarily hash tables and bitmaps.
Its performance on ad-hoc complex queries is about as good as it gets, few if any other RDBMS can beat its performance, but under the hood it’s still mostly just doing joins on b-trees!
> ..under the hood it’s still mostly just doing joins on b-trees!
I could see the on-disk format needing to be simple and stable, but once the data's buffered, who knows what structures and algorithms these proprietary engines are using? You would need to have done some reverse engineering, or have hands-on details from the inside, which presumably comes with legal consequences for leaking them.
Amidst all these great discussions, I would like to point out that this article really helped me get my head around a B-tree and why it's a great optimization on top of the binary search tree. Thanks to the author!
Very poorly, is my understanding. There are various sequential UUID-like schemes that are more sortable because they're prefixed with bits of physical time. Off the top of my head: ULIDs, and also UUIDv7.
MySQL stores the PK with every secondary index, and uses it to retrieve the requested rows (unless the query is covered by the index). I’d think for most queries, this would result in a similar slowdown.
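Roughly this shape, with dicts standing in for the two B-trees (data invented): the PK is the "row pointer" that every secondary lookup goes through, which is also why a large random PK makes every secondary index bigger.

```python
clustered = {                                  # PK -> full row
    101: {"id": 101, "email": "a@example.com", "name": "Ann"},
    102: {"id": 102, "email": "b@example.com", "name": "Bob"},
}
email_idx = {"a@example.com": 101, "b@example.com": 102}  # value -> PK

def find_by_email(email):
    pk = email_idx.get(email)                  # lookup #1: secondary index
    if pk is None:
        return None
    return clustered.get(pk)                   # lookup #2: clustered (PK) index

print(find_by_email("b@example.com"))          # Bob's row, via two lookups
```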
Probably poorly by default, but you could use a hash of the UUID as a key (to try to spread the entropy more evenly), or key it off a suffix instead of a prefix, since IIRC that's where most of the entropy lives.
In practice, if you want good performance and scalability, it's important to select keys well.