Huh. Of all my old blog posts, this is the last one I expected to randomly resurface at the top of Hacker News.
There were a lot of other articles made using this 10M Password dataset at the time it was originally released, which the dataset author aggregated into a subreddit (https://www.reddit.com/r/10millionpasswords/). WPEngine, for example, has a much more comprehensive writeup with ad-hoc looks at specific passwords (http://wpengine.com/unmasked/).
The site is static, hosted on GitHub Pages and generated via Jekyll, backed by Cloudflare for extra HN-proofing.
As of this comment, there are 150-170 concurrent users on the site, with about 120 of them (~80%) from HN. Although I do have the data, I am hesitant to do a write-up since I would need to correlate traffic to the rank of a submission on HN, which I do not have in retrospect. (For example, a post at #1 can get 300 concurrent users while this post at #3 gets only 150. Posts at #20-30 are lucky to get 50 concurrents. For further reference, note that Reddit posts which hit the front page of a default like /r/dataisbeautiful can get 1,000 concurrents.)
EDIT: When this post dropped to #4, traffic immediately dropped to 100-110 concurrents.
Man, you have to do a post about traffic patterns to your website with whatever data you have, it's way too interesting. Leave out the rank correlation part, and share whatever data you have available. Please!
My 2013 post "Which Universities Produce the Most Successful Startup Founders?" is one of my posts that received the most notoriety at the time: http://minimaxir.com/2013/07/alma-mater-data/
However, there are a number of data fidelity issues which would get me torn apart in the HN comments nowadays.
The distribution of 1-digit numbers is simple: when sites require a digit, everybody appends '1' to their usual password. The exponentially declining frequency of subsequent digits is because when passwords 'expire' folks just add 1. The short lifetime of site usage results in that decline. Just thinking out loud.
Technically it describes none of these. Benford's Law only describes collections of leading digits. The charts in the article are just exponential distributions.
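For anyone who wants to check this themselves, here is a minimal sketch in Python. Benford's Law predicts a share of log10(1 + 1/d) for leading digit d, and the second function tallies the observed leading digits in a list of digit strings. The function names are made up for illustration, and leading zeros are skipped as an assumption (a string like "007" counts as leading digit 7).

```python
import math
from collections import Counter

def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_shares(numbers):
    """Observed share of each leading digit 1-9 in a list of digit strings.

    Assumption: leading zeros are ignored, and all-zero or empty strings
    are dropped because they have no leading digit in Benford's sense.
    """
    leads = [s.lstrip("0")[0] for s in numbers if s.strip("0")]
    if not leads:
        return {d: 0.0 for d in range(1, 10)}
    counts = Counter(leads)
    return {d: counts.get(str(d), 0) / len(leads) for d in range(1, 10)}
```

Comparing the two dictionaries digit by digit would show whether a set of numbers follows the leading-digit law at all, which is a different question from how often each whole number appears.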
The problem with this type of analysis is that it treats the 10million passwords as if they are representative of all passwords. A more descriptive title would be:
"Analyzing the Patterns of Numbers in 10 Million passwords that were not randomly selected from an unknown number of accounts"
One of the first cracking rules in john is to append a "1" to a dictionary word. "123" is one of the few multi-digit strings that john appends in the default ruleset. Furthermore, the first 5 million passwords were used to generate a character frequency database for cracking the second 5 million.
How did you "avoid" sample bias? How many of the passwords come from databases that were dumped in cleartext or cracked with 100% success? Meaning every account on that system was included in cleartext or 100% of the passwords from a dump were cracked.
The reason I ask is that the dataset you analyzed does not make this claim:
I'm always struck by the uncredited similarities of stuff there to other sources, like the pin grid, found in this paper published earlier in 2012 than the blag there:
DataGenetics have really cool and interesting articles. I've been following them for a couple of years now... been asking for a "Bin and Order Packing" post.
I'm not sure if this comment is deeply sarcastic and insightful on a number of topics, or just hopelessly naive. I'm leaning towards sarcastic and insightful, and it's impressive.
I think brute force password crackers could be made much more efficient by using machine learning or manually written rules to exploit how people choose passwords.
Even if you force users to pick a password of at least 8 characters with upper and lower case letters, numbers and special characters, I suspect the real entropy is much lower than the theoretical.
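To put rough numbers on that gap, here is a back-of-the-envelope sketch in Python. The first function is the theoretical figure for characters chosen uniformly at random. The second adds up the bits for a password built from a fixed pattern. The pattern in the usage note (a 10,000-word dictionary, two digits, one of 10 symbols) is purely an assumption for illustration, not something measured from the dataset.

```python
import math

def uniform_entropy_bits(alphabet_size, length):
    """Entropy in bits if every character is picked uniformly at random."""
    return length * math.log2(alphabet_size)

def pattern_entropy_bits(choices_per_slot):
    """Entropy in bits if each slot is picked uniformly from its own pool.

    choices_per_slot is a list of pool sizes, e.g. [10000, 100, 10] for
    word + two digits + symbol (hypothetical pool sizes).
    """
    return sum(math.log2(n) for n in choices_per_slot)
```

Under those assumptions, uniform_entropy_bits(95, 8) comes out at about 52.6 bits for 8 random printable ASCII characters, while pattern_entropy_bits([10000, 100, 10]) is about 23.3 bits, even though the resulting password passes the same complexity rules.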
When it comes to visualisation of number distributions, I always recall The Secret Lives of Numbers [0] applet by Golan Levin from 2002. Haven't seen anything comparable ever since. It's so pleasant to browse through the data that I'm tempted to try to get the Java applet runtime working again. (At least we can enjoy some screenshots [1])
I'm surprised 69 is third instead of first. I'm even more surprised the author is surprised it's near the top.
When I first looked at a password database I actually laughed out loud at how many 69 there were. I don't know, there is something funny about 'Yaris69' or 'Puppy69', although it's probably used ironically these days.
The peaks at 6 and 8 digits per password are probably due to the fact that dates can be represented as DDMMYY and DDMMYYYY respectively, rather than implying that humans are better at remembering an even number of digits.
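One way to sanity-check the date hypothesis is to count how many 6- and 8-digit strings are valid calendar dates in day-month-year order. A minimal sketch in Python follows. The function names are hypothetical, and two-digit years are mapped to 2000-2099 as an assumption so that 29 February is accepted whenever it could be valid.

```python
from datetime import date

def looks_like_ddmm_date(digits):
    """True if a 6- or 8-digit string is a valid DDMMYY or DDMMYYYY date."""
    if not (digits.isascii() and digits.isdigit()) or len(digits) not in (6, 8):
        return False
    day, month, year = int(digits[:2]), int(digits[2:4]), int(digits[4:])
    if len(digits) == 6:
        year += 2000  # assumption: century chosen only to validate the day
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True

def date_like_share(digit_strings):
    """Share of the given strings that pass looks_like_ddmm_date."""
    if not digit_strings:
        return 0.0
    hits = sum(looks_like_ddmm_date(s) for s in digit_strings)
    return hits / len(digit_strings)
```

This only gives an upper bound: strings like "121212" pass the check without necessarily being dates, and MMDDYY or YYYYMMDD orderings would need their own variants.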
An interesting peak in the '7XX' subset is '786', which is an important number for Muslims. [1] I also noticed mild peaks at '258' and '852', which are vertical sequences on a numeric keypad (in the 4-digit PIN dataset there was a distinct peak at '2580' as well), and another at '951' for the diagonal sequence.
Reminds me about the tidbit about "strong password" rules, like one each of small letter, capital letter, digit or symbol. Like: "Password2016". Really strong. It's even longer than 8 letters.
All charts in this post were made using R/ggplot2. (The code was not open sourced in this case because the code for this post is a mess. I have revised my process since)