I always think discussions like this should start with the following: A database with no indexes is slow. Finding a particular row will require a linear search. Adding an index on a column means that you're optimizing finding rows by the values of that column. Hence, an index is really a mapping from a particular column's value to the position in the db of that row (very likely an 8-byte integer that is the offset into the file of the row in question).
This all means we can implement indexes as B-trees where the keys are the values of a particular column and the value is the file offset of the row with that value. You could envision a simple db format where the indexes and the main row file are stored in separate files. In such a database you could drop an index simply by deleting the index's file (or add one by creating it). The main row file has all of the data, so indexes can be recreated if necessary (at some expense, of course).
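To make that concrete, here's a minimal sketch of the mapping described above, with a sorted list standing in for the B-tree (all names here are invented for illustration):

```python
import bisect

class ColumnIndex:
    """Maps a column's values to byte offsets in the main row file."""

    def __init__(self):
        self.keys = []      # sorted column values
        self.offsets = []   # offsets[i] is where the row for keys[i] starts

    def insert(self, value, offset):
        i = bisect.bisect_left(self.keys, value)
        self.keys.insert(i, value)
        self.offsets.insert(i, offset)

    def lookup(self, value):
        i = bisect.bisect_left(self.keys, value)
        if i < len(self.keys) and self.keys[i] == value:
            return self.offsets[i]   # seek() here in the row file
        return None
```

A real index would keep this in B-tree pages sized to disk blocks, but the contract is the same: column value in, file offset out.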
> A database with no indexes is slow. Finding a particular row will require a linear search.
The crux is understanding what data access patterns you will have and what indexes / data structures accelerate that access pattern. "Index = fast" is a painfully pernicious untrue meme. It's absolutely true for application tables with queries only touching a few rows. On the other hand, analytics queries touching a high proportion of rows, with joins on equality conditions (i.e. hash-joinable), aren't going to go any faster with an index.
I've seen devs shotgun indexes at tables to fix performance (I've done it myself too), but the real test of index understanding is when that doesn't work.
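To make the hash-join point concrete, here's a minimal sketch (rows and names invented): the join reads every row of both inputs exactly once, so there is no work for an index to skip.

```python
def hash_join(left, right, key):
    """Equality join: build a hash table on one input, probe with the other.
    Both inputs are scanned in full exactly once, regardless of any index."""
    table = {}
    for row in left:                         # build phase
        table.setdefault(row[key], []).append(row)
    for row in right:                        # probe phase
        for match in table.get(row[key], []):
            yield {**match, **row}

# list(hash_join([{"id": 1, "a": "x"}], [{"id": 1, "b": "y"}], "id"))
# -> [{"id": 1, "a": "x", "b": "y"}]
```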
> On the other hand, analytics queries touching a high proportion of rows, with joins on equality conditions (i.e. hash-joinable), aren't going to go any faster with an index.
> "Index = fast" is a painfully pernicious untrue meme.
I believe that a lack of internalization of that meme (regardless of how true it is) can be a cause of real trouble.
I was working on a team where the Java devs simply didn't bother to put indexes on tables because they were small (like 100 rows or so). When I (a JS dev) pestered them long enough that they finally did it, the whole app suddenly got super snappy, and they were very thankful, as it happened just as degrading performance was causing a lot of gloom.
If a table has 100 rows and those rows aren't huuuuge, there is no reason to put an index on it. The rows probably fit into one page or so and are all read together anyway.
Why are you mentioning that they were Java devs and that you are a JS (JavaScript?) dev? Does that give you any kind of... expertise in the matter?
In theory, with pages, caching and all, yes. In practice it made a colossal difference.
With correct indexes, the queries could be answered with an index lookup alone, without touching the table data at all. That was many times faster than a sequential scan of even that few rows.
I'm mentioning our respective roles to show that, in practice, people with nominally no expertise often make such decisions, and such "memes", even though they are not strictly true in all cases, can bring a lot of value in a practical setting.
"No inedex = slow" is a good heuristic for nearly all devs.
> With correct indexes, the queries could be answered with an index lookup alone, without touching the table data at all
That doesn't make things faster unless the rows contain so much data that they force many page reads. Otherwise, reading the whole index or reading all the rows takes the same time.
I think there is some information you are not providing that might make a difference - or it is actually a misunderstanding on your side, and if the queries really got faster, it's because of something else, such as a join.
If it didn't matter, no one would ever use binary search for an amount of data that fits in a single page. Linear search would always be preferred, as it is simpler.
The only thing I'm not providing is the exact sizes of the tables involved, other than that they were considered small. Some had 100 rows, but I can't rule out that some had 1000. I didn't say each of those tables fit in a single page, which I agree could be significant. I am completely sure that the difference was caused by adding indexes to small tables, nothing else. The improvement was achieved in a single day by a single person adding indexes where there were none, and he observed the improvement as he went from table to table. I was specifically thanked by him for pestering him about it; he was the informal tech lead for this project. The application was fairly large at that time and performance issues were everywhere. Also used everywhere were those small tables, containing stuff like lists of countries, languages, currencies, types of operations and such.
Depends on the data, and the index. A covering index would almost certainly be faster. But you said analytics, which implies large results from huge datasets, so it’s unlikely to fit into memory in the first place.
> The crux is understanding what data access patterns you will have and what indexes / data structures accelerate that access pattern
We have a winner. But when looking at SQL tables/Views/Stored Procedures, the data is also stored in order in memory: in effect you have a master database and files on disk, with sorted databases and files in memory for faster access.
No it’s not… if all you do is write to it. In fact, it’s the fastest possible database for such a case.
Indexes are pure redundancy - they contain data that is already in the base table and must be maintained during writes.
But they can make reads so much faster, if the data access pattern can utilize them. The key is to identify access patterns which justify the price of the index.
In some cases they can make things worse. It's worth remembering that the query optimizer looks not only at indexes but also at statistics and estimated operation costs. If your statistics are out of date and your criteria are not specific enough (e.g. they match 80% of rows), then an index is going to slow the query down: it needs to traverse the index to get the row IDs, fetch all the blocks containing them, read those, and filter out irrelevant rows. The query is probably going to be faster with a pure full table scan (thanks to linear reads).
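A back-of-the-envelope model of that trade-off (every constant here is an assumption, chosen only to show the shape of the curve):

```python
RANDOM_READ_MS = 0.1   # assumed cost of one random block read per matching row
SEQ_READ_MS = 0.01     # assumed cost of one sequential block read (readahead)
ROWS_PER_BLOCK = 100   # assumed rows per block

def index_path_ms(total_rows, selectivity):
    # traversing the index itself is cheap; the scattered row fetches dominate
    return total_rows * selectivity * RANDOM_READ_MS

def full_scan_ms(total_rows):
    return (total_rows / ROWS_PER_BLOCK) * SEQ_READ_MS

rows = 1_000_000
for sel in (0.001, 0.8):
    print(sel, index_path_ms(rows, sel), full_scan_ms(rows))
# sel 0.001 -> index ~100 ms  vs scan ~100 ms (roughly break-even)
# sel 0.8   -> index ~80,000 ms vs scan ~100 ms (the index loses badly)
```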
I can write to one database, replicate it (for example, by log shipping), and add an index only on the replica. This is not just being pedantic, this is a real-world pattern for some analytics solutions. You have a very high number of writes, and then build a reporting database at the end of every day, with all reads going to that database.
That is a very valid scenario. But when you do those things you already know the costs and benefits of the indexes, so you are not going to be harmed by the "no indexes = database slow" heuristic.
"database = fast" is way worse heuristic to believe in for the people that need heuristics to move on with what they are doing.
> And if you ever need to read anything even once database without indexes is slow
This may be true in most cases, but not all. It just depends on your access pattern. If all you do are full table scans (possibly feeding to hash-joins), you won't benefit from indexes at all!
The whole point is that indexes do not auto-magically improve performance with no downsides. If they did, we could just index all columns and call it a day!
True, but it's still relevant in the early stages of iterating on a project. I'm familiar with one team that started investing in database-level optimization months before even beginning to deprecate their `loadAllRowsFromTheLargestTable` RPC.
True, but an FK (in the child table) must reference a key (in the parent), and most databases won't let you create a key without the underlying index.
The other direction, however, is not a given: most DBs will let you create an FK on fields not covered by an index, so deleting or modifying a parent can benefit if you create such an index explicitly, because the database can check for the existence of children much faster (and avoid potentially locking the entire table). Again, the access pattern governs what indexes are needed: if you never delete/modify the parent, you may not need an index on the FK (unless you also have some queries which can use it, of course).
I've implemented a high-performance btree this way in the past, where each table and each index were separate files (with append-only writes for concurrency). It worked pretty well and wasn't hard to get right, but it had some downsides (in particular, the kernel seemed to struggle with all the paging).
Performance is always great until you have to hit disk. It's not uncommon to rely on mmap, at which point your disk access is sub-optimal vs. a hand-tailored buffer manager with strategies to improve disk reads.
The purpose of a B-tree is to optimize for when you are hitting the disk; you can't call that the struggle, that's when the B-tree sings (though you could consider extendible hashing).
My throughput was significantly higher than SQLite's (4x or so, if memory serves), but the kernel spent so much time swapping pages that the mouse cursor stuttered.
A custom page manager would have probably done the trick, but I don't have the technical chops to write one.
Note that the real primitive is "find nearest [with hint]", which has to be the #1 thing I miss in Python.
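For what it's worth, the closest stdlib approximation I know of is `bisect` over a sorted list, where the `lo` argument plays the role of the hint:

```python
import bisect

keys = [2, 3, 5, 7, 11, 13]   # a sorted list standing in for the tree

def find_nearest(keys, x, hint=0):
    i = bisect.bisect_left(keys, x, hint)  # first index with keys[i] >= x
    return i, (keys[i] if i < len(keys) else None)

i, k = find_nearest(keys, 6)       # (3, 7): nearest key >= 6
i, k = find_nearest(keys, 12, i)   # hinted: resume the search from index 3
print(i, k)                        # (5, 13)
```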
For B-trees it's only going to be a significant win if the chance of intersection is small, less than about `1/(nodes_per_block)`. For binary trees it's a much bigger win since that becomes 1/2 and binary trees are horrible on cache.
Hmm, can you efficiently intersect an arbitrary number of B-trees using the same idea as a heap-merge, but with a max-heap instead of a min-heap? You'd still have to iterate over all of them on a match, but as long as two input trees don't have an intersection, you don't have to look at the others at all... or does that become equivalent to just doing them in series?
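Here's a sketch of that idea over sorted lists, with galloping `bisect` searches standing in for B-tree cursors. It restarts the pass on a mismatch instead of keeping a real max-heap, but the access pattern is the point: as long as the first two inputs keep disagreeing, the later ones are never touched.

```python
import bisect

def intersect_sorted(lists):
    """Intersect sorted, de-duplicated lists by leapfrogging each one
    forward to the current maximum candidate value."""
    if not lists or any(not l for l in lists):
        return []
    pos = [0] * len(lists)
    out = []
    target = max(l[0] for l in lists)
    while True:
        matched = True
        for i, l in enumerate(lists):
            # gallop this list forward to the first element >= target
            pos[i] = bisect.bisect_left(l, target, pos[i])
            if pos[i] == len(l):
                return out                # one input exhausted: done
            if l[pos[i]] != target:
                target = l[pos[i]]        # raise the bar and start over;
                matched = False           # later lists aren't touched
                break
        if matched:
            out.append(target)
            pos[0] += 1                   # step past the match
            if pos[0] == len(lists[0]):
                return out
            target = lists[0][pos[0]]

print(intersect_sorted([[1, 3, 5, 9], [3, 4, 5], [0, 3, 5, 7]]))  # [3, 5]
```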
A nuance that is important here is that not all accesses are equal within the context of disk reads. B-trees are designed to minimize block reads, not memory operations.
I guess there are worst-case scenarios with evenly spaced intersections spread out exactly one per block, but in terms of block reads it fundamentally doesn't matter how you intersect two such trees: you'll still have to read all the blocks, and that is orders of magnitude slower than comparing the values within.
I think the tree structure can be considered cached in real world scenarios; not really relevant to the performance you'll get.
In an SSD, a write can only be done to a page that has already been erased. However, the unit of read/write operations is a page, while the unit of the erase operation is a block. That means that for a disk write, a naive implementation would need to read the whole block, erase the block, then write the updated data back to the block, which is unacceptable. Furthermore, blocks should wear out uniformly; otherwise, the SSD would lose capacity.
To tackle these problems, SSDs introduce a Flash Translation Layer (FTL), which helps build the illusion of a random-access device. To achieve this, the FTL employs an approach very similar to LSM trees: writes always go to new, already-erased pages, while garbage collection (GC) reclaims outdated data in the background. The FTL needs to keep a map from the user's logical addresses to physical addresses on the SSD, both in memory and persistently.
So, to answer your question: why are sequential writes faster than random writes on SSDs? Because the address map table is smaller, since new data is consecutive in larger chunks. Garbage collection is simpler, and only metadata needs to be updated. Erasing a block is required anyway.
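A toy model of that logical-to-physical remapping (every name and size here is invented, purely to show the mechanism):

```python
PAGES_PER_BLOCK = 4   # assumed: the erase unit is 4 pages

class ToyFTL:
    def __init__(self, num_blocks):
        self.l2p = {}                         # logical page -> physical page
        self.next_free = 0                    # next erased physical page
        self.capacity = num_blocks * PAGES_PER_BLOCK

    def write(self, logical_page):
        """Writes never happen in place: take the next erased page, update
        the map, and leave the old physical page behind as garbage for GC."""
        if self.next_free == self.capacity:
            raise RuntimeError("no erased pages left: GC must run first")
        garbage = self.l2p.get(logical_page)  # old version becomes garbage
        self.l2p[logical_page] = self.next_free
        self.next_free += 1
        return garbage

ftl = ToyFTL(num_blocks=2)
ftl.write(0); ftl.write(1)
print(ftl.write(0))   # rewriting logical page 0: physical page 0 is now garbage
print(ftl.l2p)        # {0: 2, 1: 1}
```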
In practice the difference between random and sequential reads is much smaller, with the caveat that all SSD reads are block operations at some level, so locally sequential access is what makes the big difference.
Though with mmap the OS may do speculative async readahead, depending on madvise.
Like sibling said, physical pages are typically much bigger than logical pages in SSDs. Also drives do prediction and sequential access is easy to predict.
Not just on SSD; it's faster in RAM too, for the same reasons (page-based access to caches). Basically, you should always use a B-tree whenever you need the functionality of an STL map.
It is faster in RAM, on SSD, and on HDD, for the same reason: reads are done in blocks. Even if you want a single byte, the system (the controller) will read an entire block, typically on the order of 100k bytes. So if your B-tree node is stored entirely in one block, all reads from that node will be fast.
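The arithmetic behind that, with assumed sizes:

```python
import math

BLOCK_SIZE = 4096   # assumed bytes per block/page
KEY_SIZE = 8        # assumed 64-bit keys
POINTER_SIZE = 8    # assumed child pointer / file offset size

# Size each B-tree node to one block and the fanout is huge, so lookups
# touch very few blocks even for enormous key counts.
fanout = BLOCK_SIZE // (KEY_SIZE + POINTER_SIZE)     # ~256 keys per node
height = math.ceil(math.log(1_000_000_000, fanout))  # levels for 1e9 keys
print(fanout, height)  # 256 4 -- a billion keys in about 4 block reads
```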
I'd assume that at some level of IO, blocks/pages/whole buffers are being read, as opposed to reading bytes one by one. So sequential access takes advantage of this, I suppose.
> It was invented over 40 years ago, yet it is still employed by the majority of modern databases.
I wonder how true this is for the top commercial engines (Oracle, MS, IBM, etc.) whose internals are closed-source and proprietary. Even a decade ago, my experience performance-testing Exadata implied some exotic magic at work, i.e. lookups were way faster than the expected O(log n). More recently, while testing SQL Server's ability to join hundreds of tables together, the performance was _way_ in excess of what I expected. I can't imagine these things have internals all that similar to, say, the B+Tree inside MySQL.
A lot of this comes down to query planners being really good at finding clever ways of doing scans and intersections of indexes, the tables themselves having indexes with a bunch of specialized representations, and the query execution doing very intelligent data traversal with partitioning or even multi-threading.
If you sit down and think carefully about your data, you can often make even a simple bare-bones B-tree perform fantastically for a query, well in excess of what you'd get out of MySQL or SQLite (which are already pretty fast).
I think the most important takeaway is that the old-school RDBMS products are probably more than enough for whatever you are trying to accomplish. Query planners in these are indistinguishable from magic, as anything that has been forged in the fires of a million production environments for a few decades should be.
I've been playing around with an idea that involves putting sql server at the heart of a game engine, and it is turning into one of the biggest rabbit holes I've ever explored. I thought latency/jitter would be more of a problem but it simply isn't.
On disk, SQL Server uses only b-trees, unless using the new ColumnStore format.
In memory during a query it can use temporary indexes of other types, primarily hash tables and bitmaps.
Its performance on ad-hoc complex queries is about as good as it gets, few if any other RDBMS can beat its performance, but under the hood it’s still mostly just doing joins on b-trees!
> ..under the hood it’s still mostly just doing joins on b-trees!
I could see the on-disk format needing to be simple and stable, but once the data's buffered, who knows what structures and algorithms these proprietary engines are using? You would need to have done some reverse engineering, or have hands-on details from the inside, which presumably comes with legal consequences for leaking them.
Amidst all these great discussions, I would like to point out that this article really helped me get my head around a B-tree and why it's a great optimization on top of the binary search tree. Thanks to the author!
Very poorly, is my understanding. There are various sequential UUID-like schemes that are more sortable because they're prefixed with bits of physical time. Off the top of my head: ULIDs, and also UUIDv7.
MySQL stores the PK with every secondary index, and uses it to retrieve the requested rows (unless the query is covered by the index). I’d think for most queries, this would result in a similar slowdown.
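Roughly this shape, with dicts standing in for the two B-trees (data invented): the PK is the "row pointer" that every secondary lookup goes through, which is also why a large random PK makes every secondary index bigger.

```python
clustered = {                                  # PK -> full row
    101: {"id": 101, "email": "a@example.com", "name": "Ann"},
    102: {"id": 102, "email": "b@example.com", "name": "Bob"},
}
email_idx = {"a@example.com": 101, "b@example.com": 102}  # value -> PK

def find_by_email(email):
    pk = email_idx.get(email)                  # lookup #1: secondary index
    if pk is None:
        return None
    return clustered.get(pk)                   # lookup #2: clustered (PK) index

print(find_by_email("b@example.com"))          # Bob's row, via two lookups
```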
Probably poorly by default, but you could use a hash of the UUID as a key (to try to spread the entropy more evenly), or key it off a suffix instead of a prefix, since IIRC that's where most of the entropy lives.
In practice, if you want good performance and scalability, it's important to select keys well.