PTHash and other minimum perfect hash functions return an arbitrary value if the query key did not exist when building the MPHF, so they can be a lot smaller. B-field can identify query keys that don't exist in the set (with high probability?).
What I'm wondering is why the Kraken2 probabilistic hash table doesn't work. It uses 32 bits per element in an open-addressing hash table. For 1 billion k-mers and 19 bits for the value, the remaining 32 - 19 = 13 bits of the key hash can be stored alongside the value, helping disambiguate hash collisions. If the table is sized at 1.25x the number of elements (load factor 0.8), then that's 10^9 * 4 bytes * 1.25 = 5 GB total, better than ~7 GB. Also, this is only one cache miss (plus linear probing that can be SIMD accelerated) per lookup.
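To make the bit accounting concrete, here's a rough sketch of what I mean by packing a 13-bit fingerprint next to a 19-bit value in one 32-bit cell (my own illustration, not Kraken2's actual code):

```rust
// Rough sketch (not Kraken2's actual code): pack a 13-bit key fingerprint
// and a 19-bit value into a single 32-bit open-addressing cell.
const VALUE_BITS: u32 = 19;
const FP_BITS: u32 = 32 - VALUE_BITS; // 13 bits left over for the key hash
const VALUE_MASK: u32 = (1 << VALUE_BITS) - 1;

fn pack(key_hash: u64, value: u32) -> u32 {
    // Use the top FP_BITS of the 64-bit key hash as the fingerprint.
    let fingerprint = (key_hash >> (64 - FP_BITS)) as u32;
    (fingerprint << VALUE_BITS) | (value & VALUE_MASK)
}

fn lookup_cell(cell: u32, key_hash: u64) -> Option<u32> {
    let fingerprint = (key_hash >> (64 - FP_BITS)) as u32;
    if cell >> VALUE_BITS == fingerprint {
        Some(cell & VALUE_MASK) // fingerprint matches: probably the right key
    } else {
        None // definitely a different key; keep probing
    }
}
```

A fingerprint mismatch during probing means the cell definitely holds a different key; only colliding keys that also share the 13-bit fingerprint (roughly a 1 in 2^13 chance) slip through as false positives.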
> PTHash and other minimum perfect hash functions return an arbitrary value if the query key did not exist when building the MPHF, so they can be a lot smaller. B-field can identify query keys that don't exist in the set (with high probability?).
Yes, exactly.
> What I'm wondering is why the Kraken2 probabilistic hash table doesn't work.
I just skimmed the paper again (it's been a while since a close reading), but my refreshed understanding is:
* Like the B-field, there are also false positives.
* When multiple hashed keys (k-mers) collide in the Kraken2 hash table, it has to store a "reduced" value for those key-value pairs. While there's an elegant way to handle this for taxonomic classification specifically (store the lowest common ancestor), it still results in a loss of specificity. There's a similar issue with "indeterminate" results in the B-field, but this rate can be reduced to ~0 with secondary arrays.
* The original Kraken2 paper describes using 17 bits for taxonomic IDs (~131K unique values). I don't know offhand how many tax IDs current Kraken2 DB builds use, but the error rate climbs significantly as you shift bits from the key to the value (e.g., to represent >=2^20 values; see Fig. S4). I don't have a good sense of the performance and other engineering tradeoffs of simply extending the hash cell beyond 32 bits, and I also don't know what the data structure overhead is beyond those 32 bits per pair.
So, for a metagenomics classifier specifically, there are some subtle tradeoffs, but honestly database quality and the classification algorithm likely matter a lot more than the marginal FP rates of either data structure -- we just happen to have arrived at this solution.
For other applications, my sense is a B-field is generally going to be much more flexible (e.g., supporting arbitrary keys vs. a specific fixed-length encoding) but of course it depends on the specifics.
Adversarial attack research is a super interesting field, but unfortunately I feel that a lot of papers are just incremental attack or defense improvements, like a cat-and-mouse game. I originally did some research on 3D point cloud attacks, but later stopped because making super successful attacks (e.g., attacks with higher success rates than all previous techniques on some very specific task) doesn't really help us understand that much more about neural nets; it's just optimizing a metric for publishing papers. This kind of research is quite common, even at top conferences.
Despite this, we recently made a 1-minute explainer video introducing adversarial attacks on neural nets as a submission for the Veritasium contest: https://youtu.be/hNuhdf-fL_g Give it a watch!
Yes, this will uwuify your text at high speeds. It reached 2.3 GB/s on my 8-core MacBook Pro while uwu-ing the first 1 GB of English Wikipedia. This Rust command-line tool takes advantage of SSE4.1 SIMD vectorization and multithreading (exploit all your CPU cores for this!) to be almost as fast as simply copying a file. Installing it is simply cargo install uwuify, assuming you already have Rust installed. It is on crates.io too: https://crates.io/crates/uwuify
In case you don't know what uwu'd text looks like, here's an example:
hey... i think i w-weawwy wuv you. (⑅˘꒳˘) d-do you want a headpat?
For binary trees, indexing can be done by storing the subtree size at each node and doing a sort of binary search. Not sure how fast this is for B-trees, which have more than two children per node, though.
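Here's a rough sketch of that idea for a plain binary tree (my own illustration, not any particular library's code): each node caches its subtree size, so the k-th smallest element can be found by comparing k against the left subtree's size on the way down.

```rust
// Sketch of rank-based indexing in a binary search tree where each node
// caches its subtree size (an "order-statistic tree").
struct Node<T> {
    value: T,
    size: usize, // number of nodes in this subtree, including self
    left: Option<Box<Node<T>>>,
    right: Option<Box<Node<T>>>,
}

fn size<T>(node: &Option<Box<Node<T>>>) -> usize {
    node.as_ref().map_or(0, |n| n.size)
}

/// Return the k-th smallest element (0-indexed), or None if k is out of range.
fn select<T>(mut node: &Option<Box<Node<T>>>, mut k: usize) -> Option<&T> {
    while let Some(n) = node {
        let left_size = size(&n.left);
        if k < left_size {
            node = &n.left;        // k-th element is in the left subtree
        } else if k == left_size {
            return Some(&n.value); // this node is exactly the k-th element
        } else {
            k -= left_size + 1;    // skip the left subtree and this node
            node = &n.right;
        }
    }
    None
}
```

For a B-tree, the same idea would store one size per child and scan the child array at each node; whether that stays fast at higher fan-out is the open question above.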
I'm a high school student who is quite new to Rust, and I am liking it so far. I learned a lot from this project, and I hope it can be a good learning resource for others.
In addition to this, I am also working on a SIMD edit distance library in Rust.
High school is out, so I am learning SIMD instruction sets like AVX2 and SSE and using them to speed up Hamming/Levenshtein distance calculations in Rust. Preliminary testing shows a 20x speedup using vectorized SIMD operations! The end goal is a full Rust library of edit distance routines.
Though I probably won't implement different weighting schemes, I currently have alignment traceback and searching features (allowing "free shifts" of the pattern string).
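To give a flavor of the SIMD part, here's a minimal AVX2 Hamming distance sketch (just an illustration of the general technique, not the actual library code): compare 32 bytes per instruction and count mismatches from the resulting bitmask.

```rust
// Minimal AVX2 Hamming distance sketch (illustration only, not the library's
// actual code): compare 32 bytes at a time and count mismatching positions.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn hamming_avx2(a: &[u8], b: &[u8]) -> u32 {
    use std::arch::x86_64::*;
    assert_eq!(a.len(), b.len());
    let mut dist = 0u32;
    let mut i = 0;
    // Vectorized part: 32 bytes per iteration.
    while i + 32 <= a.len() {
        let va = _mm256_loadu_si256(a.as_ptr().add(i) as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr().add(i) as *const __m256i);
        // 0xFF where the bytes are equal, 0x00 where they differ.
        let eq = _mm256_cmpeq_epi8(va, vb);
        // One bit per byte; a 0 bit marks a mismatch.
        let mask = _mm256_movemask_epi8(eq) as u32;
        dist += (!mask).count_ones();
        i += 32;
    }
    // Scalar tail for the remaining < 32 bytes.
    dist += a[i..].iter().zip(&b[i..]).filter(|(x, y)| x != y).count() as u32;
    dist
}
```

Levenshtein needs more care because neighboring DP cells depend on each other, which is where the anti-diagonal layout discussed below comes in.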
I took a look at the code and read the paper. It seems that they directly calculate the entire 2D DP array, but use SIMD so that each cell contains multiple values, one for each query string. My approach uses anti-diagonals instead, so it is fast for one-vs-one comparisons rather than handling multiple query strings.
Regardless, my goal was to learn some SIMD and Rust (first time for both), so I did not read many background papers.
One thing to keep in mind is that memory locality is very important for SIMD; gathering a diagonal vector out of a standard row-major 2D DP grid is going to lead to a lot of cache misses. Just something else to learn about.
Since I am storing the entire DP matrix as flattened diagonal vectors, I don't think there will be many cache misses. Each diagonal only depends on the previous two diagonals, and each diagonal is stored contiguously in memory.
The problem with handling diagonals is that indexing cells and comparing characters along a diagonal becomes complex. Dealing with this without many branches (fewer branch mispredictions) is the hard part.
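For anyone curious, here's a plain scalar sketch of the anti-diagonal recurrence (just to show the dependency structure, not the actual library code); each diagonal only needs the previous two, which is what makes the flattened-diagonal layout cache friendly.

```rust
// Scalar sketch of anti-diagonal Levenshtein DP (illustration only).
// Cells with i + j == d form diagonal d; each diagonal depends only on
// diagonals d-1 and d-2, so three buffers suffice.
fn levenshtein_antidiagonal(a: &[u8], b: &[u8]) -> usize {
    let (m, n) = (a.len(), b.len());
    // Index each diagonal buffer directly by the row i (wastes a little
    // space but keeps the indexing simple).
    let mut prev2 = vec![0usize; m + 1]; // diagonal d - 2
    let mut prev = vec![0usize; m + 1];  // diagonal d - 1
    let mut cur = vec![0usize; m + 1];   // diagonal d

    for d in 0..=(m + n) {
        let i_lo = d.saturating_sub(n);
        let i_hi = d.min(m);
        for i in i_lo..=i_hi {
            let j = d - i;
            cur[i] = if i == 0 {
                j // first row: dp[0][j] = j
            } else if j == 0 {
                i // first column: dp[i][0] = i
            } else {
                let sub = if a[i - 1] == b[j - 1] { 0 } else { 1 };
                (prev[i] + 1)                // insertion:    dp[i][j-1] + 1
                    .min(prev[i - 1] + 1)    // deletion:     dp[i-1][j] + 1
                    .min(prev2[i - 1] + sub) // substitution: dp[i-1][j-1] + cost
            };
        }
        // Rotate buffers: d-1 becomes d-2, d becomes d-1.
        std::mem::swap(&mut prev2, &mut prev);
        std::mem::swap(&mut prev, &mut cur);
    }
    prev[m] // after the last rotation, dp[m][n] lives in `prev`
}
```

A vectorized version would replace the inner loop with SIMD min operations over chunks of a diagonal; the boundary handling is exactly the branchy part mentioned above.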
I am a high school student, and this is a published paper that I wrote. If you want to read a shorter blog version of my work, please take a look at my blog: https://blog.liudaniel.com/n-grams-BK-trees. I enjoy working on bringing computer science algorithms to other fields like biology. I have also worked on algorithms for machine learning security.
The general problem that this paper addressed is grouping similar DNA/RNA sequences based on something known as a Unique Molecular Identifier, and then collapsing those groups into consensus sequences. This helps estimate the number of unique sequences while efficiently accounting for substitution errors in sequencing or PCR amplification.