
This seems like a great step forward, but it's still a bunch of ad-hoc rules. While the ruleset is definitely well-put-together and fairly comprehensive, it still doesn't seem like the most accurate measure.

It seems like password strength basically boils down to:

1) imagine the space of all possible passwords

2) put them in order from most to least likely (123456 would be at the top, some giant 64-character random monster at the bottom)

3a) if you're malicious, use this list to begin cracking

3b) if you're securing something, use this list to measure strength

An ideal password strength measurer would simply return the approximate rank of your password.
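
A minimal sketch of that idea, assuming a hypothetical passwords.txt with one password per line, sorted from most to least common; strength is just log2 of the rank:

    import math

    def strength_bits(password: str, ranked_path: str = "passwords.txt") -> float:
        """log2 of the password's rank in a most-to-least-likely ordering.

        ranked_path is a hypothetical one-password-per-line file, sorted
        from most common ('123456') to least common. Rank 1 means zero
        bits of strength; anything not on the list ranks below the end.
        """
        rank = 0
        with open(ranked_path, encoding="utf-8") as f:
            for rank, line in enumerate(f, start=1):
                if line.rstrip("\n") == password:
                    return math.log2(rank)
        return math.log2(rank + 1)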



That's probably the wrong way to think about it, and might -- as it does in this case -- lead to a ridiculously oversized password-guessing implementation which tries to do too much fancy business.

The most obvious way to do password strength checking would not (I don't think) let you "use this list to begin cracking", but would instead estimate the Kolmogorov complexity of the password, as a proxy for its entropy.

That sounds daunting, but it's actually pretty simple in principle: append the password to a couple of concatenated dictionaries plus popular password files, and see how much the whole thing compresses with your favorite zipping algorithm. Compare that against how much 'password1' adds, because you know that's the first one they try, so it has rank 2^0 = 1, i.e. effectively zero bits. If the zipping algorithm is good, it will pick up most of these tricks automatically, directly from the 'bad password lists'.
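
A minimal sketch of that in Python, with zlib standing in for the "favorite zipping algorithm" (the corpus is whatever dictionaries and leaked-password lists you concatenate; note that deflate only looks back 32 KiB, so only the tail of the corpus is actually visible to the appended password):

    import zlib

    def compressed_delta(password: str, corpus: bytes) -> int:
        """Extra bytes needed to deflate corpus + newline + password,
        versus deflating the corpus alone."""
        base = len(zlib.compress(corpus, 9))
        with_pw = len(zlib.compress(corpus + b"\n" + password.encode(), 9))
        return with_pw - base

Strength is then read off as compressed_delta(candidate, corpus) - compressed_delta('password1', corpus).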

I would say a little more: you can probably "steal" the dictionary from one zipping and force a zipping algorithm to use it as a preset dictionary. If so, the dictionary needed shrinks to 32 KiB (the size of deflate's sliding window, I believe) rather than the 600 KB that the above script requires. I don't know how much effort it takes to get zlib-with-a-static-preset running in JS, but then again, I don't know how much time it took zxcvbn to reach its final form either.
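
Deflate calls this a "preset dictionary", and Python's zlib already exposes it as the zdict argument to compressobj; presumably a zlib-with-a-static-preset JS port would expose the same thing. A sketch:

    import zlib

    def preset_dict_cost(password: str, preset: bytes) -> int:
        """Compressed size of the password alone, with the corpus supplied
        as a deflate preset dictionary rather than recompressed each time."""
        # deflate can only reference the last 32 KiB of a preset
        # dictionary (its sliding window), so ship just that much.
        c = zlib.compressobj(level=9, zdict=preset[-32768:])
        return len(c.compress(password.encode()) + c.flush())

As before, you'd compare the cost of a candidate against the cost of 'password1'.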

Using /usr/share/dict/american-english as my dictionary is a bit crap because it does not speak l33t, but it works for "correcthorsebatterystaple". XKCD estimates 44 bits = 5.5 bytes; gzip --best estimates 7 bytes, maybe more if we had more sequences ending in '1' to better compress 'password1'. (Some extra bits are to be expected purely from the diversity of password-guessing strategies; 'switch one character to l33t, switch two characters to l33t, end with a number' each offer a couple of extra bits, which XKCD ignores in order to establish a lower bound.)
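
Roughly what that measurement looks like using the compressed_delta sketch from upthread (exact byte counts will vary with the zlib/gzip version and the dictionary contents):

    # hypothetical reproduction of the gzip --best experiment
    corpus = open("/usr/share/dict/american-english", "rb").read()
    base = compressed_delta("password1", corpus)
    xkcd = compressed_delta("correcthorsebatterystaple", corpus)
    print(xkcd - base)  # reported above as ~7 bytes, i.e. ~56 bits, vs XKCD's 44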


I thought of and tried this already.

gzip does a terrible job at this. Long passwords that are in the dictionary come out looking much stronger than short passwords that aren't. For example, correcthorsebatterystaple increased the compressed size by much more than ablekindpagequite, despite the former being in the dictionary verbatim and the latter not.

lzma does a much better job, but all the implementations I could find like to pad things out to 4-byte boundaries, giving you poor resolution.
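
For what it's worth, a sketch of that comparison; using lzma.FORMAT_ALONE (the pre-xz container) may dodge the padding, on the assumption that the 4-byte alignment comes from the .xz container rather than the LZMA stream itself:

    import lzma
    import zlib

    def delta(compress, corpus: bytes, password: str) -> int:
        """Compressed-size increase from appending the password to the corpus."""
        return len(compress(corpus + b"\n" + password.encode())) - len(compress(corpus))

    deflate = lambda data: zlib.compress(data, 9)
    # FORMAT_ALONE is the legacy .lzma container; unlike .xz it shouldn't
    # pad the stream to 4-byte boundaries.
    lzma_alone = lambda data: lzma.compress(data, format=lzma.FORMAT_ALONE, preset=9)

    corpus = open("/usr/share/dict/american-english", "rb").read()
    for pw in ("correcthorsebatterystaple", "ablekindpagequite"):
        print(pw, delta(deflate, corpus, pw), delta(lzma_alone, corpus, pw))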


I assume, since you mentioned Kolmogorov complexity, that you have an understanding of how compression works.

Why would you add the layer of indirection of running a compression algorithm over something, when it has to measure the same thing you're trying to get at?

Given a character '1' at the beginning of a password, how likely is it that '2' is the next character? Compression tools answer this question, but then they go a different direction with the application of their answer.

Also, if your password guesser guesses anything but '123456' as its first guess, it's suboptimal, since that really is the most frequent password.
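
Asked directly, that question is just a character-level language model trained on leaked passwords: estimate P(next char | previous char) and sum -log2 along the password. A minimal bigram sketch (the training file, the add-one smoothing, and the alphabet size of 100 are all assumptions):

    import math
    from collections import Counter

    def train_bigrams(path: str):
        """Count character bigrams over a one-password-per-line leak file."""
        counts, totals = Counter(), Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                pw = "^" + line.rstrip("\n")  # '^' marks start-of-password
                for a, b in zip(pw, pw[1:]):
                    counts[a, b] += 1
                    totals[a] += 1
        return counts, totals

    def bigram_bits(password: str, counts, totals, alphabet_size: int = 100) -> float:
        """-log2 probability of the password under the bigram model,
        with add-one smoothing for transitions never seen in training."""
        pw = "^" + password
        return sum(
            -math.log2((counts[a, b] + 1) / (totals[a] + alphabet_size))
            for a, b in zip(pw, pw[1:])
        )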


The only reason I'd add the layer of indirection is "they've already done it for me, but they didn't expose the API." I would also be interested in training a Bayes classifier or a neural network, but again, those aren't as easy as appending the two passwords to the end of some dictionary copies and feeding them to gzip.



