Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I'd expect a comparison with compact (or even succinct) constructions for arbitrary functions, like MWHC. Section 3.2 of https://vigna.di.unimi.it/ftp/papers/TheoryPracticeMonotone.... has a good overview.

Given a set S of arbitrary hashable values, it's possible to represent a function from S to r bits in |S|r + o(|S|) bits (keys outside S are mapped to random r-bit values). More practical construction hit ~1.23 |S|r, or even |S|(r + 1.23) bits. It should also be faster to evaluate than `r` bloom filter lookups for large datasets.

I think the main advantage of the bloom filter (or compressed bitmap) approach is it can be updated incrementally. MWHC-style representations are better suited to build once / read many workloads.



My understanding is that a perfect hash function maps elements elements to a unique integer (i.e., it's a one-to-one mapping). I think PHF data structures will also always return a value. So if you look up an element not in the constructed PHF, you'll always get a "false positive" value.

In contrast, a B-field lets you map a key to an arbitrary number of (typically non-unique) values. So I could map a million elements to "1", another million to "2", etc.

I'm not especially current (or fluent!) in that literature though, so would love pointers to anything that doesn't have the above constraints.


The MWHC construction represents minimal (monotone!) perfect hash functions as arbitrary functions to the ceil(log(n)) bits needed to store the rank... where the value happens to be the rank, but could be anything.


... meaning it is an "injective" function that maps unique key-value pairs, correct? Genuinely asking, I have glancing familiarity via their use in assembly algorithms but (a) don't have a formal math/CS background; and (b) haven't read any of the papers recently.


No, it doesn't have to be injective. In theory, the range can be any group. It's k bits in practice (with addition mod 2^k or xor as the group operator), but k need not have any relationship with `lg(|S|)`.


I think we're somewhat talking past one another -- in any case, we'll add more in the README on minimal perfect hash functions and the differences. In short, you'd need to also have a data structure (e.g., a Bloom filter) for checking if the key is in your MPHF and then also a mapping of 1..n MPHF values to your actual values.


There's a classic solution to detecting most missing entries: make your value a pair of a signature and the actual value. m signature bits result in a 2^-m false match rate for keys not in the input map.

Again, the MWHC construction does not need to map the hashed keys to ranks, they can map to the values directly.




Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: