Glad to hear that :D One of my main goals was to show the entire progression of how I approached the problem, instead of just saying "here's the final result".
(replying with more details, because I think it's pretty cool, and I'll definitely try this out)
The linked item points to https://blog.mro.name/2022/08/pwned-diy/, where the author converts the password list to a CDB (Constant DataBase) file, a simple, fast lookup database format created by D. J. Bernstein.
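For reference, cdb buckets records with D. J. Bernstein's classic string hash (start at 5381, then h = ((h << 5) + h) ^ byte, truncated to 32 bits), and the cdb docs say a successful lookup normally takes just two disk accesses. A quick sketch of that hash in Python (just the hash, not the full file format):

```python
def cdb_hash(key: bytes) -> int:
    """The string hash cdb uses to pick a bucket:
    h = 5381; h = ((h << 5) + h) ^ byte, kept to 32 bits."""
    h = 5381
    for b in key:
        h = ((h << 5) + h ^ b) & 0xFFFFFFFF
    return h
```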
You can make Python binary search super fast if you use mmap. Here's a version I had lying around; it's probably correct:
import os
import mmap

def do_mmap(path):
    fd = os.open(path, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)
    os.lseek(fd, 0, os.SEEK_SET)
    m = mmap.mmap(fd, size, prot=mmap.PROT_READ)
    return m, size, fd

class Searcher:
    def __init__(self, file):
        self.file = file
        self.map, self.size, self.fd = do_mmap(file)

    def close(self):
        self.map.close()
        os.close(self.fd)

    def find_newline(self):
        # skip the (possibly partial) current line and return the
        # offset of the first byte of the next one
        self.map.readline()
        return self.map.tell()

    def binary_search(self, q):
        start = 0
        end = self.size
        # this can get stuck with start = xxx and end = xxx+1,
        # probably from the \r\n, hence the -2
        while start < end - 2:
            mid = start + (end - start) // 2
            self.map.seek(mid)
            pos = self.find_newline()
            if pos > end:
                break
            line = self.map.readline()
            if q < line:
                end = mid
            elif q > line:
                start = mid
        # seek back to start before scanning: the last probe may have
        # left the file position past the matching lines
        self.map.seek(start)
        if start:
            self.find_newline()
        # skip the few remaining lines that sort before the query...
        while True:
            line = self.map.readline()
            if not line or line >= q:
                break
        # ...then yield every consecutive line matching the prefix
        while line.startswith(q):
            yield line
            line = self.map.readline()

if __name__ == "__main__":
    import sys
    import time
    q = sys.argv[1]
    s = Searcher("pwned-passwords-sha1-ordered-by-hash-v6.txt")
    ss = time.perf_counter()
    for x in s.binary_search(q.upper().encode()):
        print(x)
    ee = time.perf_counter()
    print(ee - ss)
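For anyone trying this at home: the file is keyed by uppercase hex SHA-1 (one HASH:count per line), so to check an actual password you'd build the query roughly like this (hedged sketch, name is made up):

```python
import hashlib

def pwned_query(password: str) -> bytes:
    """Uppercase hex SHA-1 digest of the password, matching how the
    pwned-passwords file is keyed. Pass the result to binary_search."""
    return hashlib.sha1(password.encode()).hexdigest().upper().encode()
```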
I actually did try this, but I got impatient and stopped it after 30 minutes or so, with a ~20G database file.
For some reason, after that, `ls` in that directory (or even shell completion) would freeze the shell entirely, even after a reboot. I eventually managed to delete the file with `rm db.sqlite` (no completion allowed), but even that took around 2 minutes.
I might try again with WAL enabled (based on my shell history, I also deleted a -journal file).
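In case it's useful: with the standard sqlite3 module, WAL is just a pragma at connect time, and batching inserts into one transaction per chunk (instead of one per row) is usually the bigger win for bulk loads. A sketch, with a made-up schema and a dummy row:

```python
import sqlite3

def open_for_bulk_load(path):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")    # write-ahead log, no -journal file
    conn.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs; safe enough with WAL
    return conn

conn = open_for_bulk_load("db.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS hashes (hash TEXT PRIMARY KEY, count INTEGER)")
with conn:  # one transaction per batch, not per row
    conn.executemany(
        "INSERT OR IGNORE INTO hashes VALUES (?, ?)",
        [("0000000000000000000000000000000000000000", 1)],  # dummy row
    )
```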
(author here) Hey, that's actually a great idea. I remember reading about a fast grep or wc implementation that did parallel reads.
Off the top of my head, I don't really know how to do that (aside from the naive way of using threads); I'm guessing readv[1] might be useful. Will definitely look into it.
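One way the threaded version can avoid the naive shared-file-position problem: os.pread reads at an explicit offset without touching the fd's position, so each worker can own a byte range independently. A toy sketch of parallel line counting along those lines (chunk sizes and worker count are arbitrary; POSIX only):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def count_newlines_parallel(path, workers=4):
    """Split the file into byte ranges; each worker reads its range
    with os.pread, which takes an offset and never moves the fd's
    shared file position."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        chunk = max(1, -(-size // workers))  # ceil division

        def count(offset):
            total = 0
            end = min(offset + chunk, size)
            while offset < end:
                buf = os.pread(fd, min(1 << 20, end - offset), offset)
                if not buf:
                    break
                total += buf.count(b"\n")
                offset += len(buf)
            return total

        with ThreadPoolExecutor(max_workers=workers) as ex:
            return sum(ex.map(count, range(0, size, chunk)))
    finally:
        os.close(fd)
```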
http://web.archive.org/web/20170913234634/https://medium.com...
http://webcache.googleusercontent.com/search?q=cache:https:/...