
Throwing cores at the problem with `pytest-xdist` (e.g., `pytest -n auto`) is typically the lowest-hanging fruit, but you still hit all the paper cuts the authors mention -- collection, DB fixtures, import time, etc.

And further optimization is really hard once the CI plumbing starts to dominate. For example, the last Warehouse `test` job I checked had 43s of GitHub Actions overhead for 51s of pytest execution time (half the test action time, and approaching 100% overhead).

Disclosure: I've been tinkering on a side project trying to provide 90% of these pytest optimizations automatically, while also getting "time-to-first-test-failure" down to ~10 seconds (via warm runners, container snapshotting, etc.). Email in profile if anyone would like to swap notes.


> PTHash and other minimum perfect hash functions return an arbitrary value if the query key did not exist when building the MPHF, so they can be a lot smaller. B-field can identify query keys that don't exist in the set (with high probability?).

Yes, exactly.

> What I'm wondering is why the Kraken2 probabilistic hash table doesn't work.

I just skimmed the paper again (it's been a while since a close reading), but my refreshed understanding is:

* Like the B-field, the Kraken2 hash table also has false positives.

* When multiple hashed keys (k-mers) collide in the Kraken2 hash table, it has to store a single "reduced" value for those key-value pairs. While there's an elegant reduction for the specific problem of taxonomic classification (the lowest common ancestor; see the sketch after this list), it still results in a loss of specificity. There's a similar issue with "indeterminate" results in the B-field, but that rate can be reduced to ~0 with secondary arrays.

* The original Kraken2 paper describes using 17 bits for taxonomic IDs (~131K unique values). I don't know offhand how many tax IDs current Kraken2 DB builds use, but the error rate climbs significantly as you shift bits from the key to the value (e.g., to represent >=2^20 values, see Fig. S4). I don't have a good sense of the performance and other engineering tradeoffs of simply extending the hash code beyond 32 bits, nor of the data structure overhead beyond those >32 bits/pair.
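
To make the LCA reduction concrete, here's a toy sketch -- my own illustration, not Kraken2's actual code -- assuming a parent-pointer taxonomy where the root is its own parent:

    // Toy illustration: when two k-mers collide in the table, store the
    // lowest common ancestor of their taxon IDs instead of either one.
    fn lca(parent: &[u32], a: u32, b: u32) -> u32 {
        use std::collections::HashSet;
        // Collect a's ancestors up to the root (the root is its own parent).
        let mut ancestors = HashSet::new();
        let mut x = a;
        loop {
            ancestors.insert(x);
            if parent[x as usize] == x { break; }
            x = parent[x as usize];
        }
        // Walk up from b until we hit one of a's ancestors.
        let mut y = b;
        while !ancestors.contains(&y) {
            y = parent[y as usize];
        }
        y
    }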

So, for a metagenomics classifier specifically, there are some subtle tradeoffs -- but honestly, database quality and the classification algorithm likely matter a lot more than the marginal FP rates of either data structure. We just happen to have arrived at this solution.

For other applications, my sense is a B-field is generally going to be much more flexible (e.g., supporting arbitrary keys vs. a specific fixed-length encoding), but of course it depends on the specifics.


My understanding is that a perfect hash function maps each element to a unique integer (i.e., it's a one-to-one mapping). I think PHF data structures will also always return a value, so if you look up an element not in the constructed PHF, you'll always get a "false positive" value.

In contrast, a B-field lets you map a key to an arbitrary number of (typically non-unique) values. So I could map a million elements to "1", another million to "2", etc.

I'm not especially current (or fluent!) in that literature though, so would love pointers to anything that doesn't have the above constraints.


The MWHC construction represents minimal (monotone!) perfect hash functions as arbitrary functions to the ceil(log2(n)) bits needed to store the rank... where the value happens to be the rank, but could be anything.


... meaning it is an "injective" function that maps unique key-value pairs, correct? Genuinely asking -- I have glancing familiarity via their use in assembly algorithms, but (a) I don't have a formal math/CS background; and (b) I haven't read any of the papers recently.


No, it doesn't have to be injective. In theory, the range can be any group. It's k bits in practice (with addition mod 2^k or xor as the group operator), but k need not have any relationship with `lg(|S|)`.
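
For anyone following along, the lookup side of an MWHC-style function is tiny. A minimal sketch (using std's DefaultHasher as a stand-in for a proper hash family, and eliding the peeling-based construction that solves for the table entries):

    // MWHC-style lookup: the value for a key is the xor of three table
    // cells chosen by three hash functions. Construction (not shown)
    // assigns cell contents so every inserted key xors to its value.
    fn lookup(table: &[u64], key: &[u8]) -> u64 {
        use std::collections::hash_map::DefaultHasher;
        use std::hash::{Hash, Hasher};
        let idx = |seed: u64| -> usize {
            let mut h = DefaultHasher::new();
            seed.hash(&mut h);
            key.hash(&mut h);
            (h.finish() as usize) % table.len()
        };
        table[idx(1)] ^ table[idx(2)] ^ table[idx(3)]
    }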


I think we're somewhat talking past one another -- in any case, we'll add more to the README on minimal perfect hash functions and the differences. In short, you'd also need a data structure (e.g., a Bloom filter) for checking whether a key is in your MPHF, and then a mapping from the 1..n MPHF values to your actual values.
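
Roughly, that composition looks like the following sketch (hypothetical stand-in types, not a real crate API):

    // Sketch of the composition above: a membership filter guards the
    // MPHF, and the MPHF's rank indexes into a dense value array.
    struct MphfMap<V> {
        filter: fn(&[u8]) -> bool, // stand-in for a Bloom filter query
        mphf: fn(&[u8]) -> usize,  // stand-in for an MPHF: key -> rank in 0..n
        values: Vec<V>,            // rank -> actual value
    }

    impl<V: Clone> MphfMap<V> {
        fn get(&self, key: &[u8]) -> Option<V> {
            // Keys outside the build set are (mostly) rejected here;
            // Bloom false positives slip through and return a bogus value.
            if !(self.filter)(key) {
                return None;
            }
            Some(self.values[(self.mphf)(key)].clone())
        }
    }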


There's a classic solution for detecting most missing entries: make the stored value a pair of a signature and the actual value. m signature bits yield a 2^-m false match rate for keys not in the input map.
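
A minimal sketch of that check (my illustration; it assumes the (signature, value) pair has already been fetched via the MPHF, and that sig_bits <= 32):

    // Verify an m-bit signature stored alongside the value. A key absent
    // from the build set matches a slot's signature with probability 2^-m.
    fn check(sig_bits: u32, stored: (u32, u64), key: &[u8]) -> Option<u64> {
        use std::collections::hash_map::DefaultHasher;
        use std::hash::{Hash, Hasher};
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        let mask = ((1u64 << sig_bits) - 1) as u32; // assumes sig_bits <= 32
        let sig = (h.finish() as u32) & mask;
        if sig == stored.0 { Some(stored.1) } else { None }
    }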

Again, the MWHC construction does not need to map the hashed keys to ranks; it can map them to the values directly.


Thank you! The "Space Requirements" section in the README has a few examples, and your comment has made me realize our (micro-)benchmark link in the README is broken.

We'll get that fixed and maybe find the time to do a larger post with some benchmarks on both space/time tradeoffs and overall performance vs. other data structures.


> So it’s for cases where you have any key but associated with one of only (preferably few) discrete values

We use it for a case with ~1 million unique values, but it's certainly more space-efficient for cases where you have tens, hundreds, or thousands of values. The "Space Requirements" section has a few examples: https://github.com/onecodex/rust-bfield?tab=readme-ov-file#s... (e.g., you can store a key-value pair drawn from 32 distinct values in ~27 bits of space at a 0.1% false positive rate).

> all the docs say “designed for in-memory lookups”

We use mmap for persistence, as our use case is largely a build-once, read-many one. As a practical matter, the data structure involves lots of random access, so it's better suited to in-memory use from a speed POV.

> fyi, you use temp::temp_file() but never actually use the result, instead using the hard-coded /tmp path

Thank you -- we've opened an issue and we'll fix it!


Sure, but I wouldn’t expect the API to force you to use an mmap when a slice of bytes would accomplish the same thing when unpersisted (and the user could choose to persist via a different mechanism if you have a .into() method that decays self into a Vec<u8>/Box<[u8]>/etc.).

If I were to design this library, I would internally use an enum { Mapped(mmap), Direct(Box<[u8]>) } or, better yet, delegate access and serialization/persistence to a trait, so the type becomes BField<Impl> where the impl trait provides as_slice() and load()/save().

This way you abstract over the OS internals, provide a pure implementation for testing or no_std, and probably improve your codegen a bit.
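
Something like this untested sketch (illustrative only; the real crate's internals may differ):

    // Storage behind a trait, so mmap-backed and heap-backed variants
    // share one lookup path.
    trait Storage {
        fn as_slice(&self) -> &[u8];
    }

    struct Heap(Box<[u8]>);

    impl Storage for Heap {
        fn as_slice(&self) -> &[u8] { &self.0 }
    }

    // A Mapped(memmap2::Mmap) variant would impl Storage the same way
    // (Mmap derefs to [u8]), with load()/save() handling persistence.
    struct BField<S: Storage> {
        bits: S,
    }

    impl<S: Storage> BField<S> {
        fn get_bit(&self, i: usize) -> bool {
            (self.bits.as_slice()[i / 8] >> (i % 8)) & 1 == 1
        }
    }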


Ah, we need to clarify the language! The B-field will always return the correct value for an inserted key.

False positives are only returned for keys that have not been inserted (this is akin to a Bloom filter falsely reporting that a key is in the set).


I second that. Stating "The B-field will always return the correct or Indeterminate value for an inserted key." before listing the classes of errors would clarify it a lot.


Yes, you can manage the error rate by controlling the overall size of the allocated bit array and several other parameters. There's a (slightly obtuse) section on parameter selection here: https://github.com/onecodex/rust-bfield?tab=readme-ov-file#p...

And a Jupyter notebook example here: https://github.com/onecodex/rust-bfield/blob/main/docs/noteb...

We do need a better "smart parameter selection" method for instantiating a B-field on-the-fly.


I think those are both good examples of where you can manage the cost of a false positive.

In genomics, we're using this to map a DNA substring (or "k-mer") to a value. We can tolerate a very low error rate for those individual substrings, especially since any erroneous values will be random (vs. having the same or correlated values). So, with some simple threshold-based filtering, our false positive problem goes away.
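
For illustration (not our production code), the filtering can be as simple as requiring a minimum number of k-mer "votes" per value:

    use std::collections::HashMap;

    // Count how many k-mers voted for each value and drop values below a
    // minimum count, so rare, random false-positive hits wash out.
    fn filter_hits(hits: &[u32], min_count: usize) -> Vec<u32> {
        let mut counts: HashMap<u32, usize> = HashMap::new();
        for &v in hits {
            *counts.entry(v).or_insert(0) += 1;
        }
        counts.into_iter()
            .filter(|&(_, c)| c >= min_count)
            .map(|(v, _)| v)
            .collect()
    }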

Again, you'll never get the incorrect value for a key in the B-field, only for a key not in the B-field (which can return a false positive with a low, tunable error rate).


Yes, that makes sense.

Out of curiosity, what do you think about the comparison to posting lists (aka bitmap indexes)?

Some searching shows ML folks are also interested in compressing KV caches for models, so if your technique is applicable there, you can probably find infinite funding :P


A bitmap index requires a bit per unique value IIRC (plus the key and some amount of overhead). So for ~32 unique values you're already at 4 bytes per key-value pair, 40 bytes per pair for 320 values, etc.

In comparison, a B-field will let you store 32 distinct values at ~3.4 bytes (27 bits) per key-value pair at a 0.1% FP rate.


Yes, I'm not claiming that they're going to perform as well, just that they're sort of similar in the space of problems.

There are different techniques for compressing them, mostly around sorting the keys and, e.g., run-length encoding the values.
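
e.g., a quick sketch of run-length encoding a key-sorted value column:

    // Collapse runs of identical values into (value, run_length) pairs;
    // sorting by key first tends to create long runs to collapse.
    fn rle(values: &[u32]) -> Vec<(u32, u32)> {
        let mut out: Vec<(u32, u32)> = Vec::new();
        for &v in values {
            match out.last_mut() {
                Some((val, n)) if *val == v => *n += 1,
                _ => out.push((v, 1)),
            }
        }
        out
    }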


We’re more focused on data visualization vs. general “chat with your data”, but we’re building in this space: https://minard.ai

Upload or import a CSV, Excel file, Google Sheet, etc. We also just launched Postgres and Snowflake connectors today!


It looks like that link isn’t showing up in the header on mobile in portrait mode. Thanks for flagging!

We’ll get that fixed, but in the meantime the direct login link is https://minard.ai/accounts/login/

Note that the site is not mobile-optimized at the moment, so you’ll want to be on desktop or at least a tablet to actually play around with it.


Oh cool! Thank you and noted!

