
I think you might be missing the real performance characteristics of the hardware. The number of comparisons might be the same, but comparisons are only a proxy for the expensive part of a lookup: the latency of asking RAM for data.

If the whole tree is in CPU cache then the difference is likely noise, but if the tree is large enough that some of it lives in RAM, then the cost of copying it into cache scales not with the size but with the number of round trips from the CPU to RAM (unless the size is way out of proportion). RAM latency is generally an order of magnitude worse than CPU cache, and fast contemporary storage is generally an order of magnitude worse still.

There are many places you can find "Latency numbers every programmer needs to know" https://fullstackengineer.pro/blog/latency-numbers-programme...
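Roughly, per those tables: an L1 cache hit is on the order of 1ns, a main-memory reference ~100ns, and an NVMe random read tens of microseconds. The exact figures shift with each hardware generation, but the ratios between the tiers barely move.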

But this trend has held for about 30 years and seems fundamental to the economics of building hardware: CPUs will be fastest, cache slower, RAM slower still, and fast storage slower yet.

From your example: if it takes 100ns to fetch the node containing a~f, then the whole B-tree search for f might take 150ns (1 lookup from RAM + a few CPU instructions), but the strictly binary tree will need 6 lookups to RAM, and we can ignore the CPU time because it is so small. That could take 600ns (or 1300ns if each branch is a node holding only pointers, so the data has to be dereferenced too). A smart cache prefetcher might guess that those were pointers and pre-cache a and b once n was referenced, and might even cache things pointed to by loaded pointers, but even in that highly optimistic scenario it could still take 300ns to load everything.
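To make the two layouts concrete, here's a minimal C++ sketch (my own illustration, with a made-up fan-out of 8, not from any particular implementation):

    // Binary tree: every comparison chases a pointer, i.e. a dependent
    // load that can be a full RAM round trip.
    struct BinNode {
        int key;
        BinNode* left;
        BinNode* right;
    };

    const BinNode* find(const BinNode* n, int k) {
        while (n && n->key != k)
            n = (k < n->key) ? n->left : n->right;  // one load per level
        return n;
    }

    // Wide (B-tree-style) node: 8 sorted keys sit contiguously, so one
    // fetch pulls the whole node into cache and the comparisons then
    // run out of L1/registers.
    struct WideNode {
        int keys[8];
        WideNode* children[9];
    };

    int child_slot(const WideNode* n, int k) {
        int i = 0;
        while (i < 8 && k > n->keys[i]) ++i;  // cheap once node is cached
        return i;
    }

The comparison counts are similar; the number of RAM round trips is not.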

Consider filesystems on storage, which have had this same problem, only worse, for a long time. Tree structures with many children per node beat binary trees in proportion to the latency between the thing processing the tree and the place the tree is stored.

EDIT - And don't get me started on clever use of SIMD instructions. Imagine if a node in a tree had a binary layout matching an AVX register; then the comparisons looking for the searched item could check a whole node in a single instruction! I am curious whether there are any implementations that do this?

EDIT 2 - It does exist and I am late to the show, people have been doing this for like 15 years: https://stackoverflow.com/questions/20616605/using-simd-avx-...
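For anyone who wants to see the shape of it, here is a minimal sketch of the single-instruction node compare (my own illustration, not taken from that thread; assumes AVX2, 8 sorted 32-bit keys per node, and GCC/Clang builtins; compile with -mavx2):

    #include <immintrin.h>

    // Compare the search key against all 8 keys of a node at once.
    // Returns how many keys are less than k, which for a sorted node
    // is exactly the child slot to descend into.
    int node_rank(const int* keys, int k) {
        __m256i node = _mm256_loadu_si256((const __m256i*)keys);
        __m256i key  = _mm256_set1_epi32(k);
        __m256i lt   = _mm256_cmpgt_epi32(key, node);  // keys[i] < k
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(lt));
        return __builtin_popcount(mask);
    }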



Like the article, I'm only concerned with comparison counts, trying to explain why B-trees' comparison counts look like a balanced BST's, and why the match gets tighter and tighter as the number of children per node increases.

It's simply because the nodes themselves contain the equivalent of a balanced binary tree, in the form of a binary-searched array.

In the ultimate case, we set the number of children large enough to hold all the data, so there is only one B-tree node, which is binary-searched to find the element. Then we have the same number of comparisons as a well-balanced binary search tree.
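As a sketch of that equivalence (illustrative only, assuming sorted 32-bit keys): binary-searching one node's key array costs the same ~log2(M) comparisons that a balanced BST over those M keys would.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Binary search inside a node: ~log2(M) comparisons, the same as
    // a balanced BST built over the node's M keys.
    std::size_t slot(const std::vector<int>& keys, int k) {
        return std::lower_bound(keys.begin(), keys.end(), k) - keys.begin();
    }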


Where M is the amount of data per node and N is the total amount of data, each node you descend skips a proportion (M-1)/M of the remaining data, as long as N is large enough to require multiple levels of nesting.

So for practical sizes of M, as in a few cache lines, and sufficiently large datasets of size N, a binary tree eliminates 1/2 of the remaining data at each node checked, while a B-tree with M set to 8 skips 7/8ths at each node. Counting comparisons they are both in Log2(N) time, but counting node descents this hypothetical B-tree needs only Log8(N).

So yeah, if N and M are close then they are similar, but those aren't interesting cases: datasets that small are fast to iterate over even unsorted, and large unwieldy values of M are silly and impractical. Once N reaches thousands or millions of items, the savings from never touching down-tree data at all become real. Put another way, both average ~Log2(N) comparisons, but a binary tree also makes ~Log2(N) node visits while a B-tree makes only ~LogM(N), which is lower in most cases and so close in the edge cases as to not matter.
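To put illustrative numbers on that: with N = 1,000,000 and M = 8, both structures do about log2(1,000,000) ≈ 20 comparisons, but the binary tree makes ~20 dependent loads while the B-tree makes only log8(1,000,000) ≈ 7 node loads.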

But again, I say that counting comparisons is silly. On any modern hardware (the past 20 years) load time dominates comparison time, and B-trees simply issue fewer loads; memory complexity is the "interesting" part of this problem unless you are on a C64 or at some zany future company with hyper-fast RAM.


Just because it was first discovered 15 years ago does not mean it isn't still a good idea.

Keep on thinking. You never know what else you will come up with.


They measured a speedup. SIMD is a super common way to speed stuff up and good compilers do it automatically.

A lot of string comparisons use SIMD now. In C++, if you use std::string on GCC, it has done this for years. I bet Rust and Python (in the compiled runtime) are doing it too.

No sense leaving easy performance gains on the table.



