Glad to hear that :D One of my main goals was to show the entire progression of how I approached the problem, instead of just saying "here's the final result".
(replying with more details, because I think it's pretty cool, and I'll definitely try this out)
The linked item points to https://blog.mro.name/2022/08/pwned-diy/, where the author converts the password list to a CDB (Constant DataBase) file, a simple, fast lookup database format created by D. J. Bernstein.
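For reference, cdb buckets records with D. J. Bernstein's classic string hash (start at 5381, then h = ((h << 5) + h) ^ byte, truncated to 32 bits), and the cdb docs say a successful lookup normally takes just two disk accesses. A quick sketch of that hash in Python (just the hash, not the full file format):

```python
def cdb_hash(key: bytes) -> int:
    """The string hash cdb uses to pick a bucket:
    h = 5381; h = ((h << 5) + h) ^ byte, kept to 32 bits."""
    h = 5381
    for b in key:
        h = ((h << 5) + h ^ b) & 0xFFFFFFFF
    return h
```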
You can make Python binary search super fast if you use mmap. Here's a version I had lying around; it's probably correct:
import os
import mmap

def do_mmap(path):
    fd = os.open(path, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)
    os.lseek(fd, 0, os.SEEK_SET)
    m = mmap.mmap(fd, size, prot=mmap.PROT_READ)
    return m, size, fd

class Searcher:
    def __init__(self, file):
        self.file = file
        self.map, self.size, self.fd = do_mmap(file)

    def close(self):
        self.map.close()
        os.close(self.fd)

    def find_newline(self):
        # skip the (possibly partial) current line and return the
        # offset of the first byte of the next one
        self.map.readline()
        return self.map.tell()

    def binary_search(self, q):
        start = 0
        end = self.size
        # this can get stuck with start = xxx and end = xxx+1,
        # probably from the \r\n, hence the -2
        while start < end - 2:
            mid = start + (end - start) // 2
            self.map.seek(mid)
            pos = self.find_newline()
            if pos > end:
                break
            line = self.map.readline()
            if q < line:
                end = mid
            elif q > line:
                start = mid
        # seek back to start before scanning: the last probe may have
        # left the file position past the matching lines
        self.map.seek(start)
        if start:
            self.find_newline()
        # skip the few remaining lines that sort before the query...
        while True:
            line = self.map.readline()
            if not line or line >= q:
                break
        # ...then yield every consecutive line matching the prefix
        while line.startswith(q):
            yield line
            line = self.map.readline()

if __name__ == "__main__":
    import sys
    import time
    q = sys.argv[1]
    s = Searcher("pwned-passwords-sha1-ordered-by-hash-v6.txt")
    ss = time.perf_counter()
    for x in s.binary_search(q.upper().encode()):
        print(x)
    ee = time.perf_counter()
    print(ee - ss)
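For anyone trying this at home: the file is keyed by uppercase hex SHA-1 (one HASH:count per line), so to check an actual password you'd build the query roughly like this (hedged sketch, name is made up):

```python
import hashlib

def pwned_query(password: str) -> bytes:
    """Uppercase hex SHA-1 digest of the password, matching how the
    pwned-passwords file is keyed. Pass the result to binary_search."""
    return hashlib.sha1(password.encode()).hexdigest().upper().encode()
```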
I actually did try this, but I got impatient and stopped it after 30 minutes or so, with a ~20G database file.
For some reason, after that, `ls` in that directory (or even shell completion) would freeze the shell entirely, even after a reboot. I eventually managed to delete the file with `rm db.sqlite` (no completion allowed), but even that took around 2 minutes.
I might try again with WAL enabled (based on my shell history, I also deleted a -journal file).
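In case it's useful: with the standard sqlite3 module, WAL is just a pragma at connect time, and batching inserts into one transaction per chunk (instead of one per row) is usually the bigger win for bulk loads. A sketch, with a made-up schema and a dummy row:

```python
import sqlite3

def open_for_bulk_load(path):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")    # write-ahead log, no -journal file
    conn.execute("PRAGMA synchronous=NORMAL")  # fewer fsyncs; safe enough with WAL
    return conn

conn = open_for_bulk_load("db.sqlite")
conn.execute("CREATE TABLE IF NOT EXISTS hashes (hash TEXT PRIMARY KEY, count INTEGER)")
with conn:  # one transaction per batch, not per row
    conn.executemany(
        "INSERT OR IGNORE INTO hashes VALUES (?, ?)",
        [("0000000000000000000000000000000000000000", 1)],  # dummy row
    )
```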
(author here) Hey, that's actually a great idea. I remember reading about a fast grep or wc implementation that did parallel reads.
Off the top of my head, I don't really know how to do that (aside from the naive way of using threads); I'm guessing readv[1] might be useful. Will definitely look into it.
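One way the threaded version can avoid the naive shared-file-position problem: os.pread reads at an explicit offset without touching the fd's position, so each worker can own a byte range independently. A toy sketch of parallel line counting along those lines (chunk sizes and worker count are arbitrary; POSIX only):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def count_newlines_parallel(path, workers=4):
    """Split the file into byte ranges; each worker reads its range
    with os.pread, which takes an offset and never moves the fd's
    shared file position."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        chunk = max(1, -(-size // workers))  # ceil division

        def count(offset):
            total = 0
            end = min(offset + chunk, size)
            while offset < end:
                buf = os.pread(fd, min(1 << 20, end - offset), offset)
                if not buf:
                    break
                total += buf.count(b"\n")
                offset += len(buf)
            return total

        with ThreadPoolExecutor(max_workers=workers) as ex:
            return sum(ex.map(count, range(0, size, chunk)))
    finally:
        os.close(fd)
```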
http://web.archive.org/web/20170913234634/https://medium.com...
http://webcache.googleusercontent.com/search?q=cache:https:/...