Hashing is so fast that you can hand-wave it away as zero cost relative to the time taken to read such a large amount of data. Also, you only have to do it once for the whole input, which means that it's O(n) time where 'n' is the gigabytes of passwords you have.
Sorting is going to need about O(n log n) time even if it's entirely in memory, and if it has to spool to disk it'll take much longer than the hashing step.
PS: I just realised that 2 billion passwords is not actually that much data -- at 20 bytes per hash that's only 40 GB -- well within the range of what's "easy" to sort in-memory by simply allocating an array of hashes that size and calling a standard library sort function.
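Roughly what that looks like in Rust, as a sketch only: the blake3 crate (linked at the end of the thread) stands in for whatever hash you prefer, the file name is made up, and I truncate to 20 bytes per hash to match the 40 GB figure:

    // Sketch: hash newline-delimited passwords and sort the hashes in memory.
    // Assumes the `blake3` crate; "passwords.txt" and the 20-byte truncation
    // are illustrative choices, not taken from the article.
    use std::fs::File;
    use std::io::{BufRead, BufReader};

    fn main() -> std::io::Result<()> {
        let reader = BufReader::new(File::open("passwords.txt")?);

        // One pass over the input: the O(n) hashing step.
        let mut hashes: Vec<[u8; 20]> = Vec::new();
        for line in reader.lines() {
            let full = blake3::hash(line?.as_bytes());
            let mut h = [0u8; 20];
            h.copy_from_slice(&full.as_bytes()[..20]);
            hashes.push(h);
        }

        // Fixed-size byte arrays compare lexicographically, so the standard
        // library sort is all you need: the O(n log n) step.
        hashes.sort_unstable();

        // ...write `hashes` out as one contiguous binary blob...
        Ok(())
    }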
What other algorithms have you used?
I'm really interested in big data streams.
I would like to hear not only successful solutions, but also failed ones.
Have you tried using Bloom filters?
Is it possible to merge shards using the Min-Heap algorithm?
Algorithm choice depends on what you're optimising for. The discussion a few years ago was about dozens of small web servers handling a large volume of password-change traffic (10K/sec!) that needed a cheap centralised service for checking passwords against a "known bad" list. On a cloud hosting platform, the optimal solution is a blob store of sorted binary hashes plus a small (~1 MB) precomputed index held in memory in each web server, which lets you query the main store with 1 or at most 2 reads. That's an optimal "one round trip" request, and the per-server overhead at the front end is negligible.
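To make the shape of that concrete, here's a sketch of the query side, with everything hypothetical except the idea: the blob is assumed to hold 20-byte hashes sorted lexicographically, the in-memory index maps each 2-byte hash prefix to its byte range in the blob (64K u64 offsets is roughly 0.5 MB), and the local file seek stands in for a ranged GET against the blob store:

    use std::fs::File;
    use std::io::{Read, Seek, SeekFrom};

    const HASH_LEN: usize = 20;

    // Precomputed in-memory index: offsets[p] is where hashes whose first two
    // bytes equal p start in the sorted blob; 65_537 entries (one sentinel at
    // the end holding the blob length), built in a single pass over the blob.
    struct PrefixIndex {
        offsets: Vec<u64>,
    }

    impl PrefixIndex {
        fn range(&self, hash: &[u8; HASH_LEN]) -> (u64, u64) {
            let p = ((hash[0] as usize) << 8) | hash[1] as usize;
            (self.offsets[p], self.offsets[p + 1])
        }
    }

    // One ranged read against the blob per lookup (a local seek here; in the
    // cloud this would be a ranged GET), then a scan of the slice that came
    // back. The slice is sorted, so a binary search works too.
    fn is_known_bad(
        blob: &mut File,
        index: &PrefixIndex,
        hash: &[u8; HASH_LEN],
    ) -> std::io::Result<bool> {
        let (start, end) = index.range(hash);
        let mut buf = vec![0u8; (end - start) as usize];
        blob.seek(SeekFrom::Start(start))?;
        blob.read_exact(&mut buf)?;
        Ok(buf.chunks_exact(HASH_LEN).any(|c| c == &hash[..]))
    }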
However, that approach assumes a slowly changing list where you don't mind a delay of maybe a few hours to merge in a new list. Large password leaks are infrequent, so this works fine for this use case.
A two-layer approach with a small frequently updated hash set on top of the large infrequently built sorted list is more generally applicable to a wider range of problems.
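As a sketch, with made-up names, the lookup just consults the small layer first:

    use std::collections::HashSet;

    // Two layers: a small, frequently refreshed set of newly reported hashes
    // in front of the big, infrequently rebuilt sorted list. In the setup
    // described above the big list would live in blob storage rather than a
    // local Vec; this is just the shape of the check.
    struct BadPasswordChecker {
        recent: HashSet<[u8; 20]>, // small, updated every few minutes
        sorted: Vec<[u8; 20]>,     // large, rebuilt when a new leak is merged in
    }

    impl BadPasswordChecker {
        fn is_known_bad(&self, hash: &[u8; 20]) -> bool {
            self.recent.contains(hash) || self.sorted.binary_search(hash).is_ok()
        }
    }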
Bloom filters are probabilistic, and aren't much faster to create than a sorted list. They also require 'k' random reads to test, where k is a small constant such as 3 to 5. If the filter is large (gigabytes), then you can't efficiently load it into many front-end web servers. If you query it over the network as a blob, you need 3-5x the I/O operations. You can wrap it in an API service that holds it in-memory, but that's still much more complex than simply storing and querying a blob in S3 or Azure Storage directly.
"Clever" algorithms like min-heaps or whatever are likely not worth the trouble. My decade-old PC can sort 2 billion hashes in about 30 seconds using the Rust "Rayon" crate. There are cloud VMs available for about $2/hr that have about a terabyte of RAM that could sort any reasonable sized list (10s of billions) in a few minutes at most.
The original article mentions a week of 80 vCores of SQL Hyperscale, which is about $6,000 at PAYG rates!
Sure, developer time is expensive, blah blah blah, but waiting a week ain't cheap either, and these days an AI can bang out the code for a password file hashing and sorting utility in a minute or two.
https://github.com/BLAKE3-team/BLAKE3