> I also can't think of a reason this is the case given the underlying data structures.
Ditto.
I suspect that, like me, you are already familiar with the underlying data structures. I'm now wondering why they wouldn't use the B*Tree to search for thing_color. It's downright odd.
Granted, in your example, assuming there are only a few colours, it would not be a big win. That's because all the colours for one date would likely fit in one B*Tree node, which would be a leaf. When you search for a key in a single B*Tree node, you are forced to sequentially scan the sorted list of keys in the node because of the key compression they use. Maybe SQLite assumes your example is the typical shape in most data sets, and so the extra effort of skipping through the tree isn't worth it.
But in Scour that wasn't true, and it seems to me in indexes with lots of columns it would often not be true.
It's a flattened tree, and you've used the index to reach a point where you have multiple child nodes that meet the precondition thing_date < 17. Some, but not all, of these have thing_color = blue, but the important part is that they're non-consecutive (the index can only give you one point-of-entry, so the data has to be consecutive or you need to do some amount of scan).
I asked an LLM to give me an ASCII representation so it'll be easier to see what I mean. Consider the case where you want thing_date < 17 and thing_color = blue: the results that match are marked with an asterisk, but note that there's no way to get to them (ALL of them) directly:
                Root (internal)
             +-----------------+
             |  keys: 14   17  |
             +---+------+----+-+
                 |      |    |
                 v      v    v
 +----------------+  +----------------+  +----------------+
 |    Leaf P0     |  |    Leaf P1     |  |    Leaf P2     |
 |  (dates 10-13) |  |  (dates 14-16) |  |  (dates 17-20) |
 |----------------|  |----------------|  |----------------|
 | 10 | red   |101|  | 14 | blue  |141|* | 17 | green |171|
 | 11 | blue  |111|* | 14 | red   |142|  | 18 | blue  |181|
 | 12 | green |121|  | 15 | yellow|151|  | 19 | red   |191|
 | 13 | red   |131|  | 16 | blue  |161|* | 20 | green |201|
 +----------------+  +----------------+  +----------------+
Go back to my resumes example: you have the resumes sorted by application timestamp (chronologically) and then by position (alphabetically). You want to find all resumes received last Tuesday where the position was "Senior Architect". You literally can't skip to a point where the next n consecutive results will be the results you seek, because you can't temporarily re-order them: within the subset of resumes where date > monday and date < wednesday, there may be multiple for "Senior Architect", but they didn't arrive one-after-the-other, so they're interspersed with results you don't want. The correct thing to do here (if this is always the shape of the query you expect to get) would be to sort the resumes by (position, timestamp) instead, at which point you absolutely CAN jump to the position that guarantees the next results are (all) the results of the query ("Senior Architect", last Tuesday).
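A small runnable sketch of that re-keying, with made-up data: the position titles are from the example above, and timestamps 200-299 stand in for "last Tuesday":

```python
import bisect

# Toy data: (timestamp, position), i.e. sorted by arrival time.
resumes = [
    (110, "QA Lead"),
    (205, "Senior Architect"),
    (210, "Junior Dev"),
    (230, "Senior Architect"),
    (290, "QA Lead"),
    (305, "Senior Architect"),
]
# Sorted by time, the Tuesday "Senior Architect" rows (205 and 230)
# are interspersed with rows we don't want -- no single jump reaches
# all of them.

# Re-key by (position, timestamp): now they are contiguous.
by_pos = sorted((pos, ts) for ts, pos in resumes)
lo = bisect.bisect_left(by_pos, ("Senior Architect", 200))
hi = bisect.bisect_left(by_pos, ("Senior Architect", 300))
print(by_pos[lo:hi])  # [('Senior Architect', 205), ('Senior Architect', 230)]
```

One binary search to the start of the run, then read until the timestamp leaves the window: every match, nothing else.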
One important thing to keep in mind is that a SCAN is not the end of the world. You know your data and you know when it's ok (you expect most of the results to match and a few to be filtered out, in which case you're really not saving much time anyway) and when it's not (the scan will cover a huge range that contains only a few, widely interspersed results and you have no way of terminating early).
EDIT: In response to your exact question, however: note that with a covering index you might not end up hitting the table at all, even though the planner uses only the (thing_date) part of the (thing_date, thing_color) index! It can still return the results directly from the index - something it wouldn't have been able to do if the index contained only (thing_date).
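You can see this in SQLite's query plan. A sketch (the table name `things` and index name `idx` are made up here; the exact EXPLAIN QUERY PLAN wording varies by SQLite version, but "COVERING INDEX" shows up either way):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE things (thing_date INTEGER, thing_color TEXT)")
con.execute("CREATE INDEX idx ON things (thing_date, thing_color)")
con.executemany("INSERT INTO things VALUES (?, ?)",
                [(10, "red"), (11, "blue"), (14, "blue"), (18, "blue")])

# The query only touches indexed columns, so the index covers it:
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT thing_date, thing_color FROM things "
    "WHERE thing_date < 17 AND thing_color = 'blue'").fetchall()
for row in plan:
    print(row[3])  # e.g. "SEARCH things USING COVERING INDEX idx (thing_date<?)"
```

The plan detail mentions the covering index and never mentions the backing table's rows being fetched.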
The key thing is "an index is a flattened tree", and for all us devs who haven't thought about trees in many years or younger folks who might not yet know, that means it's conceptually a bunch of nested maps/objects/dictionaries where the keys at each "level" are the columns in that same "level" of the index. To use a little bit of python, here's a list of the raw maps in your ascii art DB:
Notice that since this index is built from nested maps, if we want to use it to find things, we HAVE to first do it by checking the keys in the "outermost" map, the 'date' column. It's a compound index, but its order matters. This 'order matters' property is true in our map-based index, and it's also true in our SQLite-based index. It's also true that because 'date < 17' is a range criterion, it has to check individual keys in the outermost map, which constitutes a SCAN of the index (smaller than a scan of the DB, but a scan nonetheless). To then find everything matching 'color = blue', it has to individually check all the color values in the result of the prior date SCAN in order to get down to only those with 'color = blue'. As you can see, this (date, color) index isn't super helpful for the query 'WHERE color = blue AND date < 17'. A query this index would be good for is one like 'WHERE color = red AND date = 14'. That would not require any scans; if we were writing application code, such a query with this index would be like calling `index[14]['red']`, which as we all know is super fast.
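A runnable sketch of both query shapes (the toy nested-map index is restated here so the snippet stands alone; rowids come from the ASCII art upthread):

```python
# The (date, color) nested-map index, rowids as innermost values.
index = {
    10: {"red": 101}, 11: {"blue": 111}, 12: {"green": 121},
    13: {"red": 131}, 14: {"blue": 141, "red": 142},
    15: {"yellow": 151}, 16: {"blue": 161}, 17: {"green": 171},
    18: {"blue": 181}, 19: {"red": 191}, 20: {"green": 201},
}

# 'WHERE date < 17 AND color = blue': SCAN the outer date keys,
# then check each inner map for 'blue' -- two rounds of checking.
rowids = [colors["blue"]
          for date, colors in index.items()
          if date < 17 and "blue" in colors]
print(rowids)  # [111, 141, 161]

# 'WHERE color = red AND date = 14': no scans, just two lookups.
print(index[14]["red"])  # 142
```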
A better index for this, which is a different index, comes from swapping the order of the columns when forming the index. Instead of (date, color), this better index would be (color, date). That index looks like this:
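Sketching it the same nested-map way (rowids carried over from the ASCII art upthread):

```python
# A (color, date) index: outer keys are colors, inner keys are
# dates, innermost values are rowids.
index = {
    "red":    {10: 101, 13: 131, 14: 142, 19: 191},
    "blue":   {11: 111, 14: 141, 16: 161, 18: 181},
    "green":  {12: 121, 17: 171, 20: 201},
    "yellow": {15: 151},
}
```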
Now with this new index, to fulfill the exact same query of 'WHERE color = blue AND date < 17', we can do an `index['blue']` lookup and then a single SCAN of that intermediary to find only those with `date < 17`. This still means our query does a SCAN, but it's scanning a smaller set of values (only 4 instead of at least 8 with the previous index), and we only do one scan instead of two.
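That lookup-then-scan, as a self-contained sketch (the (color, date) index is restated so it runs on its own):

```python
# The (color, date) nested-map index, rowids as innermost values.
index = {
    "red":    {10: 101, 13: 131, 14: 142, 19: 191},
    "blue":   {11: 111, 14: 141, 16: 161, 18: 181},
    "green":  {12: 121, 17: 171, 20: 201},
    "yellow": {15: 151},
}

# One direct lookup, then one scan over just 4 entries.
rowids = [rid for date, rid in index["blue"].items() if date < 17]
print(rowids)  # [111, 141, 161]
```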
Indexes in general are not flattened trees; they are just trees. Using a Python map as a mental model is fraught with peril since those are implemented as hash tables, which don't have ordering (which most database indexes need to support). So it's the wrong model.
For multidimensional indexes, you don't need to do anything fancy about nesting; you just need to have keys that have more than one component. To a first order, string concatenation is a good enough model. So in your case, here's what the index looks like:
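A sketch of that model using sorted tuples as stand-ins for concatenated keys (rowids taken from the ASCII art upthread):

```python
# Composite keys modeled as (date, color) tuples; tuple comparison
# behaves like comparing the concatenated key bytes, and the list
# is kept in key order like the leaves of a B-tree.
index = [
    ((10, "red"),    101), ((11, "blue"),  111),
    ((12, "green"),  121), ((13, "red"),   131),
    ((14, "blue"),   141), ((14, "red"),   142),
    ((15, "yellow"), 151), ((16, "blue"),  161),
    ((17, "green"),  171), ((18, "blue"),  181),
    ((19, "red"),    191), ((20, "green"), 201),
]

# 'date < 17' selects a contiguous prefix of the sorted keys...
prefix = [entry for entry in index if entry[0][0] < 17]
# ...but the 'color = blue' rows inside that prefix are interspersed,
# so they still have to be filtered out one by one.
blue = [rid for (d, c), rid in prefix if c == "blue"]
print(blue)  # [111, 141, 161]
```

This keeps the ordering property that the hash-map model loses, which is exactly why the range predicate maps to a contiguous key range while the second column does not.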
Yes, this is true. I chose a nested-maps example because, while it is inaccurate for actual DB implementations, it's very helpful for explaining the limits of index ordering and the limits that range queries have when interacting with indexes. It's helpful to use maps as a first explanation because they're one of the most used data structures that nearly all languages have built in (yes, I know, C, I'm talking about the other 9 of the top 10 languages), and many workaday developers and dabblers will use nearly _only_ lists and maps (and objects/classes/structs) in their language of choice. I could've used Python's OrderedDict, but I figured if I was going to stray away from "the most straightforward possible case using the data structures everyone is already using daily", I'd have been better off jumping straight to a custom tree, like you've done.
Given a covering index on (thing_date, thing_color)
I would think a scan would not be needed for the query:
select thing_date, thing_color where thing_date < date and thing_color = 'blue'
I also can't think of a reason this is the case given the underlying data structures.