Tries (or prefix trees). We use them a lot at Pyroscope for compressing strings ...

NSMutableSet · on July 21, 2022

I wouldn't consider tries to be obscure tbh. They are the recommended solution for many leetcode-style interview problems involving string searching. I think anyone who has done serious interview prep has encountered them.

https://leetcode.com/discuss/general-discussion/931977/begin...

EddySchauHai · on July 22, 2022

I got stung by this for a test engineering role. Apparently implementing them from scratch is a pre-req to building out test infrastructure!

OJFord · on July 22, 2022

The starting example was Bloom filters.

NSMutableSet · on July 22, 2022

I'm not sure what the point of your post is. While bloom filters are heavily used in actual production code throughout the industry, it is very rare for anyone to need to code their own, or make changes to a prior legacy implementation.

Not all educational programs will cover bloom filters, and for those that do, there's no guarantee that the students will retain the information, and be able to recall it.

I don't know of any common interview question patterns where a bloom filter would be the only optimal solution. If it does share optimality with any other data structures, you would be better off choosing something else unless you are extremely confident in your ability to implement one and explain it clearly and in-depth, since the odds of your interviewer not being familiar with them are high in comparison to other solutions.

I only know what a bloom filter is because of HN, and previous submissions like this one that have made it to the front page throughout the years. It's nowhere near as common knowledge as tries, especially if you are sampling younger cohorts.

OJFord · on July 22, 2022

I just didn't think it was necessary to be judgemental/'gate-keep' about what's obscure or interesting 'enough' - better to let anyone share what's interested them and why.

To your points, personally I do fall in a 'younger cohort' I imagine; have come across bloom filters on HN way more frequently, and I dimly recall tries from university but I knew them as prefix trees. Without submissions like this one I wouldn't learn what a trie is/to make that alias in my head.

jefftk · on July 22, 2022

I was with you until the last seven words ;)

Trees were a huge part of CS practice and education historically, but have been replaced by hash-based methods in many cases. For example, in C++ std::map is generally a tree, while in a more recent language the standard Map data structure will be a hashmap. My impression is that the instruction time devoted to Bloom filters (and other hash-based methods) vs trees has shifted towards the former over the last ~20y.

sicp-enjoyer · on July 22, 2022

C++ STL uses trees because the generic requirement for them to work is a < operator. Designing generics around hashing is much more difficult, and most languages struggle with that.

Ironically, due to caches, sorting and then using algorithms that rely on order tend to be superior than most hashing implementations, even though they are theoretically worse due to log(n) factor. So in a way, C++ algorithms are actually more modern.

Sesse__ · on July 22, 2022

> Ironically, due to caches, sorting and then using algorithms that rely on order tend to be superior than most hashing implementations

This doesn't match my experience at all. C++ trees are not cache-friendly; they're pointer-chasing (and there's no arena implementation in the STL). Second, any sort of ordering structure (be it through a tree or through sorting + binary search) is notoriously prone to branch mispredictions. They look great in a microbenchmark if you search for the same element all the time and the trees are small, but in a practical application, it can be incredibly painful. Pretty much by design, every choice is 50–50 and unpredictable.

std::unordered_map from most STLs isn't a fantastic hash table, but in my experience it absolutely destroys std::map (and is also comfortably ahead of something like std::lower_bound on a sorted array). All of this is assuming large-ish N, of course; for small N (e.g. below 30, usually), the best by far is a simple unsorted array and just going through it linearly.

sicp-enjoyer · on July 22, 2022

> C++ trees are not cache-friendly

Agreed, they should use a B-tree to get cache locality and easy generics, but there is legacy code there.

I was referring to the performance of algorithms. For example `std::unique`, `std::lower_bounds`, etc. Many of these use sorted lists, whereas most other languages' standard libraries utilize hashing for these.

> is also comfortably ahead of something like std::lower_bound on a sorted array

I would be interested to learn more about when that's the case. But, it's also not very flexible. You can put an `int` in it, great. Can you put `std::pair<int, int>` in it? Does it work as well?

Sesse__ · on July 22, 2022

B-trees don't work well for CPU caches; 64b lines are typically too small to gain much. (OK, the M1 has 128b lines, but still.) And you still get the branch mispredicts, and a significant increase in code complexity, so hashing is still significantly better. (I've seen C++ projects where the main structure was a B+-tree because the lower levels could be residing on disk, but where they maintained a separate hash table in memory to skip the top first few B-tree levels!)

> I would be interested to learn more about when that's the case. But, it's also not very flexible. You can put an `int` in it, great. Can you put `std::pair<int, int>` in it? Does it work as well?

If by “it” you're talking about std::unordered_map (as key); yes, you can, but you'd have to supply your own hash function, because std::pair<A, B> does not have a std::hash specialization. (It's a bit sad, but the problem is highly specific to std::pair/std::tuple.) Likewise, you can use any arbitrary type you create yourself, as long as it has operator== and you've written a hash function for it.

Shorel · on July 23, 2022

With a modern Intel processor, it was much faster to go through a 10k integers array than sort it first and then do binary search.

The small N is not that small anymore, and the difference speed between stack and heap is bigger now than ever.

umanwizard · on July 22, 2022

To your point, C++ got hash structures in the standard library (unordered_map and unordered_set) in C++11 (so only about a decade ago).

adgjlsfhk1 · on July 22, 2022

a lot of this is that other than heaps, most tree based algorithms involve O(logn) random access pointer lookups which make them relatively slow for in memory data structures

ant6n · on July 22, 2022

Trie != tree

btschaegg · on July 22, 2022

Indeed, but it took a while for me to appreciate Tries.

Originally, i thought "nice idea", but I rarely ever encountered a problem where I really needed a prefix tree.

But while reading into HAMT and Clojure's datastructure implementations, it dawned on me that prefix trees aren't only useful for strings (or arbitrary byte sequences): Hashes in a dictionary, for example, are sequences of bits, too. And the design of a HAMT is exactly such that it doesn't concern itself with bytes but any blocks of N bits (determined by the branching factor; i.e. a factor of 32 implies 6-bit-blocks). And if you know the length of each sequence (hash), you know the maximum Trie depth, which also works for you, not against you.

That was rather eye opening for me at the time.

chrismarlow9 · on July 22, 2022

Even though a trie is pretty standard and expected (to be known) these days, I will vote for this. It was my first deep dive into more exotic data structures after an interview question about implementing T9 that stumped me many years ago.

https://github.com/Azeem112/tries-T9-Prediction

rockwotj · on July 22, 2022

Do you have any documentation about how tries are used in mongo indexes? Or could you point to the source?

petethepig · on July 22, 2022

There's not much about it online. They do mention it in the docs, they call this "prefix compression" [0], so you can search for that. This article describes it [1], although it is somewhat high level.

They only use this for indexes. It works really well for internal mongo ids (they all start with timestamps if I remember correctly). It also works well for compound indexes. Pro tip: for compound indexes always go from low-cardinality attribute to high-cardinality attribute to get the best results from prefix compression (it's also good for speed of queries).

[0] https://www.mongodb.com/docs/manual/reference/glossary/#std-...

[1] https://medium.com/swlh/mongodb-indexes-deep-dive-understand...