sparsehash is still more memory efficient, but it is quite slow in comparison. In practice it also doesn't reach the ultra-low space overheads claimed by its documentation: it allocates memory blocks of many different sizes, so the dominant space overhead becomes internal fragmentation in the allocator. With sparsehash on jemalloc's default size classes (spaced roughly a factor of 1.2 apart), the ratio of memory used to useful data varies from about 1.3 for small values down to about 1.1 for large values.
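To make the fragmentation concrete: every allocation gets rounded up to the next size class, and with classes spaced geometrically the rounding waste is what shows up as overhead. Here's a rough model of such a grid (an approximation for illustration, not jemalloc's exact class table):

```cpp
#include <cstdio>
#include <cstddef>
#include <initializer_list>

// Rough model of a jemalloc-style size-class grid: above 16 bytes, classes
// are spaced by a quarter of the current power-of-two range (factor ~1.19),
// with a minimum spacing of 16. This approximates, but is not, jemalloc's
// real class table.
static std::size_t round_to_class(std::size_t n) {
  if (n <= 16) return 16;
  std::size_t k = 0;
  while ((std::size_t{8} << (k + 1)) < n) ++k;   // now 8<<k < n <= 8<<(k+1)
  std::size_t step = (std::size_t{8} << k) / 4;  // quarter-range spacing
  if (step < 16) step = 16;
  return ((n + step - 1) / step) * step;         // round up to next class
}

int main() {
  // Internal fragmentation = bytes handed back / bytes requested.
  for (std::size_t req : {40, 72, 100, 1000, 5000, 40000}) {
    std::size_t alloc = round_to_class(req);
    std::printf("request %6zu -> class %6zu  (%.2fx)\n",
                req, alloc, double(alloc) / double(req));
  }
}
```

Where a given allocation lands in that grid determines how much is wasted, anywhere from roughly nothing up to about 20% for that block.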
I recall that the same approach was suggested for Java's HashMap too, but it was found to be insufficient: with careful analysis, an attacker could still figure out how to cause hash collisions.
I haven't seen those discussions, but is it possible that they were trying to rehash the results of Object.hashCode rather than pass all of the bytes of the original value through a new hash algorithm?
AFAIK, they augmented the hash function with a random seed, which didn't work out. See: http://emboss.github.io/blog/2012/12/14/breaking-murmur-hash... It appears that SipHash would be enough (for now) to prevent hash poisoning. Most open-addressing hash implementations don't yet use SipHash, though.
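For what it's worth, keying the hasher is easy in any map that takes a custom hash functor. A minimal C++ sketch of the wiring; the FNV-style loop below is only a structural stand-in (and is not collision resistant), a real deployment would call a SipHash implementation over the same bytes with the same secret key:

```cpp
#include <cstdint>
#include <random>
#include <string>
#include <unordered_map>
#include <iostream>

// Keyed hasher: the whole input is run through a hash parameterized by a
// secret 128-bit key chosen at process start. A real implementation would
// use SipHash here; the keyed FNV below is only a stand-in for the shape.
struct KeyedStringHash {
  static const std::uint64_t k0, k1;  // secret key, fixed per process

  std::size_t operator()(const std::string& s) const {
    // Stand-in for siphash(k0, k1, s.data(), s.size()).
    std::uint64_t h = 0xcbf29ce484222325ULL ^ k0;
    for (unsigned char c : s) {
      h ^= c;
      h *= 0x100000001b3ULL;
    }
    return static_cast<std::size_t>(h ^ k1);
  }
};

static std::uint64_t random_u64() {
  std::random_device rd;
  return (std::uint64_t(rd()) << 32) | rd();
}
const std::uint64_t KeyedStringHash::k0 = random_u64();
const std::uint64_t KeyedStringHash::k1 = random_u64();

int main() {
  // Any map that accepts a Hasher template parameter can be keyed this way.
  std::unordered_map<std::string, int, KeyedStringHash> m;
  m["hello"] = 1;
  std::cout << m["hello"] << "\n";
}
```

The same hasher parameter slots into the open-addressing maps discussed here as well, since they take a Hash template argument in the same position as the standard containers.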
It would be nice to have a deeper understanding of how/why this works, but in practice it seems fine. Under a continuous insert and erase workload the number of chunks with an overflow fluctuates a bit but stays low enough to keep probe lengths short, so F14 never needs to rehash to clean them up. Batched adds and then removes might let you bias a bit toward having longer probes for the remaining keys, but it never gets too bad. A truly adversarial workload could make things pretty bad, but they could also just insert lots of keys that all have the same hash code.
https://github.com/rust-lang/hashbrown is a port of Google's SwissTable algorithm (abseil flat_hash_map). It also uses SIMD filtering and has the potential to be quite good.
Which has been recently upstreamed to the standard library, replacing the old hash map implementation. (This sped up rustc a noticeable amount, incidentally.)
Is `hashbrown::HashMap` now completely ported, i.e. identical to `std::collections::HashMap`? I just compared their performance (on yesterday's beta channel) on a hash map with over 200k keys and didn't notice any relevant speedup.
Non-SIMD algorithms work better in some cases, for example if you look up the same key over and over (so the control flow is predictable) or when all of the data is in the L1. The SIMD approach tends to be more robust inside a larger program where there's less chance for branch prediction and caching to help you.
I'm one of the authors. Seeing the Google presentation at CppCon convinced us that the potential wins were worth the effort to productionize, but the similarity in the designs is a case of convergent evolution. Since then Google and Facebook have collaborated on a microbenchmark (https://github.com/google/hashtable-benchmarks/), which shows that the algorithms have slightly different tradeoffs but neither dominates the other.
If there is metadata available at every slot then you can track overflows rather than tombstones. These let you terminate the search earlier (you don't need to probe until you find an empty slot) and can be cleaned up on erase, because they are a property of the keys currently in the map rather than of keys that are no longer in it.
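Here's a toy linear-probing sketch of that idea with a per-slot overflow counter. F14 itself keeps an overflow count per 14-slot chunk, next to the tag bytes it filters with SIMD, so treat this as the shape of the technique rather than its real layout:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <vector>
#include <cassert>
#include <iostream>

// Toy linear-probing set: overflow_[i] counts how many keys currently in the
// table probed past slot i on their way in. No tombstones are ever written.
// Toy assumptions: fixed capacity, never resizes, never gets close to full.
class OverflowCountSet {
 public:
  explicit OverflowCountSet(std::size_t capacity)
      : slots_(capacity), overflow_(capacity, 0) {}

  bool insert(const std::string& key) {
    if (contains(key)) return false;
    std::size_t i = home(key);
    while (slots_[i].has_value()) {
      ++overflow_[i];                       // key will live past this slot
      i = next(i);
    }
    slots_[i] = key;
    return true;
  }

  bool contains(const std::string& key) const {
    std::size_t i = home(key);
    while (true) {
      if (slots_[i] && *slots_[i] == key) return true;
      if (overflow_[i] == 0) return false;  // nothing ever probed past here,
      i = next(i);                          // so we can stop before an empty
    }
  }

  bool erase(const std::string& key) {
    std::size_t i = home(key);
    while (true) {
      if (slots_[i] && *slots_[i] == key) {
        slots_[i].reset();                  // no tombstone needed
        // Undo the overflow counts this key contributed on insert.
        for (std::size_t j = home(key); j != i; j = next(j)) --overflow_[j];
        return true;
      }
      if (overflow_[i] == 0) return false;
      i = next(i);
    }
  }

 private:
  std::size_t home(const std::string& key) const {
    return std::hash<std::string>{}(key) % slots_.size();
  }
  std::size_t next(std::size_t i) const { return (i + 1) % slots_.size(); }

  std::vector<std::optional<std::string>> slots_;
  std::vector<std::uint32_t> overflow_;
};

int main() {
  OverflowCountSet s(16);
  s.insert("a"); s.insert("b"); s.insert("c");
  s.erase("b");                             // counters cleaned up on erase
  assert(s.contains("a") && !s.contains("b") && s.contains("c"));
  std::cout << "ok\n";
}
```

The lookup can stop as soon as it reaches a slot whose counter is zero, even if that slot is occupied by some other key, and erase restores the counters along the probe path, so nothing accumulates the way tombstones do.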
Sharding your product so that no cross-shard links are possible is robust and scalable, but it requires you to understand your final product very well at design time. If you start down that path and then bolt on some cross-shard data, the end result is almost always worse than if you had planned for it from the start.