For everyone interested in this, I would like to recommend the MIT Advanced Data Structures course taught by Prof. Erik Demaine (mentioned in the article). The part of the course about cache-oblivious data structures starts with this lecture:
Great suggestion -- Demaine's lecture videos were so illuminating for me when I studied cache-oblivious algos. Here's a corresponding paper describing the cache-oblivious Funnelsort algorithm: http://erikdemaine.org/papers/BRICS2002/paper.pdf
There are two things that make the lectures difficult to follow, IMO:
The first is that you need to be comfortable with the standard terminology and have a solid grasp of the simpler data structures (different kinds of trees, linked lists, graphs, and so on), along with the running times of their operations. Go watch Algorithms first and you'll be fine :)
The second is that the subject is really hard. During the course, most of the time we're learning about (close to) state of the art data structures, so don't expect to understand everything in one go.
If you want to understand it in depth, you'll probably have to spend more time on each lecture: think about how you'd solve the problem yourself, work through the proposed solution, go through the video slowly enough to catch the details, and try to implement at least some of the structures.
I used these lectures as a way to get comfortable with the field and get a broad overview of the kind of solutions that usually work, so I still have to go through most of Step 2.
Basically, there were first cache-aware algorithms, which are tuned to explicit cache parameters (block size, cache size) and other properties of a specific machine. "Cache-oblivious" algorithms were a refinement that achieve good cache behavior without knowing those parameters at all, so they work well across many cache sizes and every level of the memory hierarchy at once (see the sketch below).
All in all it's silly that the "cache-oblivious" term was the one that survived, because now "cache-unaware" and "cache-oblivious" mean opposite things in practice, even though oblivious and unaware are synonyms in the dictionary.
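To make that distinction concrete, here's a minimal sketch of my own (not from the article) using matrix transpose. The cache-aware version bakes in a TILE constant you would have to re-tune for every cache; the cache-oblivious version never mentions a cache size and simply recurses until the subproblem fits in whatever caches the machine happens to have:

    #include <algorithm>
    #include <cstddef>

    // Cache-AWARE: tiled transpose with an explicit tile size chosen to
    // fit one particular cache. Change the machine, re-tune TILE.
    constexpr std::size_t TILE = 32;  // hypothetical tuning parameter

    void transpose_aware(const double* src, double* dst, std::size_t n) {
        for (std::size_t bi = 0; bi < n; bi += TILE)
            for (std::size_t bj = 0; bj < n; bj += TILE)
                for (std::size_t i = bi; i < std::min(bi + TILE, n); ++i)
                    for (std::size_t j = bj; j < std::min(bj + TILE, n); ++j)
                        dst[j * n + i] = src[i * n + j];
    }

    // Cache-OBLIVIOUS: divide and conquer until the piece is tiny. No cache
    // parameter appears; the recursion automatically reaches sizes that fit
    // in L1, L2, TLB, ... whatever the hardware provides.
    void transpose_oblivious(const double* src, double* dst, std::size_t n,
                             std::size_t r, std::size_t c,
                             std::size_t rows, std::size_t cols) {
        if (rows * cols <= 64) {                        // small base case
            for (std::size_t i = r; i < r + rows; ++i)
                for (std::size_t j = c; j < c + cols; ++j)
                    dst[j * n + i] = src[i * n + j];
        } else if (rows >= cols) {                      // split the longer side
            transpose_oblivious(src, dst, n, r, c, rows / 2, cols);
            transpose_oblivious(src, dst, n, r + rows / 2, c, rows - rows / 2, cols);
        } else {
            transpose_oblivious(src, dst, n, r, c, rows, cols / 2);
            transpose_oblivious(src, dst, n, r, c + cols / 2, rows, cols - cols / 2);
        }
    }
    // Usage: transpose_oblivious(src, dst, n, 0, 0, n, n);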
Nope. Still sounds like "having no awareness of cache sizes and lacking a strategy for dealing with them."
No matter how you prefix it, 'oblivious' means you are missing something you are expected to be responsible for. It is a word expressing negative sentiment.
Another fun one is radix sort[1]. We used to use it for sorting alpha'd quads (think particle systems/UI/etc.) from front to back. It easily annihilates anything else in terms of raw performance for datasets under 10,000 elements with int/f32 keys, and it's stable(!) for identical keys (a minimal version is sketched below).
This talk[2] from Herb Sutter (starting at 22:30) is one of my favorite talks and does a great job of explaining why this is so important (and why dense arrays are still king in many areas).
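For anyone who hasn't written one: here is a rough sketch of the byte-wise LSD radix sort I mean (my own illustration, not the code we actually shipped). f32 depth keys are first remapped to u32s whose unsigned order matches the float order, then four stable counting passes sort a parallel array of quad indices:

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Map an IEEE-754 float to a uint32_t whose unsigned order matches the
    // float's numeric order (the standard trick for radix-sorting f32 keys).
    inline std::uint32_t float_key(float f) {
        std::uint32_t u;
        std::memcpy(&u, &f, sizeof u);
        return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
    }

    // Stable LSD radix sort: sorts 'indices' by 'keys', one byte per pass.
    // Equal keys keep their original relative order, which is what makes it
    // safe for quads that end up at identical depths.
    void radix_sort(std::vector<std::uint32_t>& keys,
                    std::vector<std::uint32_t>& indices) {
        const std::size_t n = keys.size();
        std::vector<std::uint32_t> tmp_keys(n), tmp_idx(n);
        for (int shift = 0; shift < 32; shift += 8) {
            std::array<std::size_t, 257> offset{};       // histogram, then prefix sums
            for (std::size_t i = 0; i < n; ++i)
                ++offset[((keys[i] >> shift) & 0xFF) + 1];
            for (int b = 0; b < 256; ++b)
                offset[b + 1] += offset[b];
            for (std::size_t i = 0; i < n; ++i) {        // stable scatter
                const std::size_t dst = offset[(keys[i] >> shift) & 0xFF]++;
                tmp_keys[dst] = keys[i];
                tmp_idx[dst]  = indices[i];
            }
            keys.swap(tmp_keys);
            indices.swap(tmp_idx);
        }
    }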
The article looks great, but a doubling of speed actually seems kind of disappointing (what I'd like to see would be something like a "big O" improvement).
I have been investigating cache-oblivious data structures for a while. They're impressive, complex, and a bit difficult to fathom: for a cache-oblivious tree, you have to consider both the relational structure created by the pointers and the layout of all the nodes in memory (a rough sketch of that layout follows below). The article seems to give a nice overview compared to the academic papers I've looked at.
The thing is, in most of the large applications I've tried to optimize, you could squeeze several speed doublings out of fairly easy inner-loop rewrites as soon as you started profiling the app. After that, you got more marginal results from optimizing this and that. Consider: if 25% of the time is spent retrieving data, halving the time spent on that cuts total runtime by only 12.5%. Is that 12.5% worth the complexity of implementing a cache-oblivious structure? Even if your application is a full database, I'm not sure the trade-offs are worth it.
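To illustrate the "two structures at once" point above, here's a rough sketch (mine, assuming a complete binary tree) of the van Emde Boas layout that cache-oblivious trees typically use: cut the tree at half its height, lay out the top subtree contiguously, then each bottom subtree, and apply the same rule recursively inside every piece.

    #include <cstddef>
    #include <vector>

    // Emit the nodes of a complete binary tree (heap/BFS numbering, root = 1)
    // in van Emde Boas order. A cache-oblivious tree stores its nodes in this
    // physical order and builds the parent/child links on top of it, so any
    // root-to-leaf walk touches only O(log_B N) cache blocks for every block
    // size B, without knowing B.
    void veb_order(std::size_t root, int height, std::vector<std::size_t>& out) {
        if (height == 1) {                       // single node: emit it
            out.push_back(root);
            return;
        }
        const int top_h = height / 2;            // height of the top piece
        const int bot_h = height - top_h;        // height of each bottom piece
        veb_order(root, top_h, out);             // top subtree first
        const std::size_t n_bottom = std::size_t{1} << top_h;
        for (std::size_t k = 0; k < n_bottom; ++k)
            veb_order(root * n_bottom + k, bot_h, out);  // then each bottom subtree
    }

    // Example: veb_order(1, 4, out) on a 15-node tree yields
    //   1 2 3 | 4 8 9 | 5 10 11 | 6 12 13 | 7 14 15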
The advantage of squeezing more performance out of data structures is that you can package them up in standard libraries and provide that performance for potentially thousands of applications out there. Inner-loop rewrites are great but they are tied to a specific application.
Cache-friendly algorithms tend to give on the order of a 10-50x speed improvement, so they're definitely worth it. The problem is that once you've built your app, you've already locked yourself into an architecture that can't be changed enough to take advantage of them.
Dig up some of the posts on "data oriented design" from the gamedev space if you want to see it done properly (a toy example of the idea is sketched below).
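The core of it, in a toy example of my own (hypothetical particle fields, not taken from any specific post): keep the fields your hot loop actually reads in contiguous arrays, rather than iterating over an array of fat objects, so every cache line you pull in is fully used.

    #include <cstddef>
    #include <vector>

    // Array-of-structs: updating positions drags every particle's unrelated
    // fields (color, lifetime, ...) through the cache along for the ride.
    struct ParticleAoS {
        float x, y, z;
        float vx, vy, vz;
        float r, g, b, a;
        float lifetime;
    };

    void update_aos(std::vector<ParticleAoS>& ps, float dt) {
        for (auto& p : ps) { p.x += p.vx * dt; p.y += p.vy * dt; p.z += p.vz * dt; }
    }

    // Struct-of-arrays (data-oriented): the update loop touches only the
    // arrays it needs, so fetched cache lines contain nothing but useful data.
    struct ParticlesSoA {
        std::vector<float> x, y, z;
        std::vector<float> vx, vy, vz;
        std::vector<float> r, g, b, a;
        std::vector<float> lifetime;
    };

    void update_soa(ParticlesSoA& ps, float dt) {
        const std::size_t n = ps.x.size();
        for (std::size_t i = 0; i < n; ++i) {
            ps.x[i] += ps.vx[i] * dt;
            ps.y[i] += ps.vy[i] * dt;
            ps.z[i] += ps.vz[i] * dt;
        }
    }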
It actually may be an asymptotic improvement, as you suggest: the difference gets bigger as the tree gets bigger. I just stopped benchmarking at about 300 MB of data, where the perf was almost 2x. If you ran it on a 1 GB tree it might be 4x.
Sure, maybe. The problem is that doesn't seem enough to necessarily compensate for the added complexity.
The thing is, a self-balancing binary tree is so much faster than the alternatives for large amounts of data that you really have to use it. Here, I'm not so sure.
Anyway, thinking about this led me to this discussion about fractal trees and LSM trees.
Another question is how this will play out if high-performance DBs move to GPUs, where cache levels and scan behavior would be quite different.
> Sure, maybe. The problem is that doesn't seem enough to necessarily compensate for the added complexity.
A 50% improvement could be massively significant if you're processing a huge dataset on thousands of nodes and it is still taking days/weeks/more of wall-clock time. Given the choice between waiting longer, buying more computing power, or using a more complex algorithm, the best option will depend a lot on the circumstances of the project. If it is a one-off, then waiting or paying for more power will likely win out. Of course, in some circumstances paying for more power won't help: you can only get so much out of each node, and for some workloads factors like interconnect throughput and latency might eat into the ROI significantly as the number of nodes grows.
In some cases a boost of 10% or less might be worth the extra complexity. And if the algorithm can be generalised and packaged as a library, you don't actually have to deal with that complexity each time the method is useful.
Dumb question: is it appropriate to describe fractal trees as an "extreme" version of a log-structured merge tree? At a glance (i.e. if you squint really hard), the buffer flushing process in a fractal tree seems kind of similar to the tablet merging process in LSM trees in that they both batch operations to amortize secondary memory access costs, except fractal trees generalize it to all operations.
Yes. Additionally, fractal trees have a predictable and compact memory structure, while LSM trees are more sparse.
The average case for queries is also better for the fractal structure, and the implementation is actually easier.
I'm not sure I really understood how the cache-oblivious solution works in this case, though; I can't get an intuition for how the code linked on GitHub relates to the description in the article.
One thing I've always wondered about patent grants in open source: the grant seems to say only that it's okay to use this library, not necessarily that you can implement the patented algorithm yourself.
Probably open sourcing and including a patent grant signal that the owner is benevolent and isn't likely to sue you for re-implementing their algorithm, but I don't think that's spelled out anywhere.
In the linked document they go out of their way to specify that the patent grant only applies to "THIS IMPLEMENTATION" of the algorithm. I'm not one hundred percent sure what that means with regard to reimplementing the algorithm.
https://www.youtube.com/watch?v=V3omVLzI0WE