Analyzing the Patterns of Numbers in 10M Passwords (2015)

minimaxir · on Oct 13, 2016

Huh. Of all my old blog posts, this is the last one I expected to randomly resurface at the top of Hacker News.

There were a lot of other articles made using this 10M Password dataset at the time it was originally released, which the dataset author aggregated into a subreddit (https://www.reddit.com/r/10millionpasswords/). WPEngine, for example, has a much more comprehensive writeup with ad-hoc looks at specific passwords (http://wpengine.com/unmasked/).

anondon · on Oct 13, 2016

Off topic, but where do you host your website and what is your tech stack?

Would it be possible to do a blog post about traffic patterns from HN? E.g., hits vs. time since post, hits vs. day of post.

minimaxir · on Oct 13, 2016

The site is static, hosted on GitHub Pages and generated via Jekyll, backed by Cloudflare for extra HN-proofing.

As of this comment, there are 150-170 concurrent users on the site, with about 120 of them (~80%) from HN. Although I do have the data, I am hesitant to do a write up since I would need to correlate traffic to the rank of a submission on HN, which I do not have in retrospect. (For example, a post at #1 can get 300 concurrent users while this post at #3 only 150. Posts in #20-30 are lucky to get 50 concurrents. For further reference, note that Reddit posts which hit the front page of a default like /r/dataisbeautiful can get 1,000 concurrents.)

EDIT: When this post dropped to #4, traffic immediately dropped to 100-110 concurrents.

anondon · on Oct 14, 2016

Man, you have to do a post about traffic patterns to your website with whatever data you have, it's way too interesting. Leave out the rank correlation part, and share whatever data you have available. Please!

robocaptain · on Oct 14, 2016

Out of curiosity (and because I'd love to read more), which of your old blog posts would you MOST expect to randomly resurface at the top of HN?

minimaxir · on Oct 14, 2016

My 2013 post "Which Universities Produce the Most Successful Startup Founders?" is one of the posts that received the most notoriety at the time: http://minimaxir.com/2013/07/alma-mater-data/

However, there are a number of data fidelity issues which would get me torn apart in the HN comments nowadays.

JoeAltmaier · on Oct 13, 2016

The distribution of 1-digit numbers is simple: when sites require a digit, everybody appends '1' to their usual password. The exponentially declining frequency of subsequent digits is because when passwords 'expire' folks just add 1. The short lifetime of site usage results in that decline. Just thinking out loud.

markild · on Oct 13, 2016

Looks like a few of the patterns in his analysis have a tendency towards Benford's Law[1]

[1]:https://en.wikipedia.org/wiki/Benford%27s_law

nateberkopec · on Oct 13, 2016

Technically it describes none of these. Benford's Law only describes collections of leading digits. The charts in the article are just exponential distributions.
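
To make the distinction concrete: Benford's Law predicts that a leading digit d shows up with probability log10(1 + 1/d). The following is a minimal Python sketch that compares that prediction against observed leading digits. The function names are made up for this example, and the sample data (powers of 2) is purely illustrative; it is not taken from the password dump.

```python
import math
from collections import Counter

def benford_expected():
    """Expected share of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_shares(numbers):
    """Observed share of each leading digit among positive integers."""
    counts = Counter(int(str(n)[0]) for n in numbers if n > 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

expected = benford_expected()
# Illustrative data only: powers of 2 are a well-known Benford-like sequence.
observed = leading_digit_shares([2 ** k for k in range(1, 200)])

for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

Running the same comparison on the leading digits of numbers extracted from passwords would be one way to check whether the charts really follow Benford or are just decaying curves.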

markild · on Oct 13, 2016

Yeah. Reading a bit more into it, I think you're right.

amelius · on Oct 13, 2016

The article says:

> Distributions that would not be expected to obey Benford's Law:

> ...

> Where numbers are influenced by human thought: e.g. prices set by psychological thresholds

dfc · on Oct 13, 2016

The problem with this type of analysis is that it treats the 10million passwords as if they are representative of all passwords. A more descriptive title would be:

"Analyzing the Patterns of Numbers in 10 Million passwords that were not randomly selected from an unknown number of accounts"

One of the first cracking rules in john is to append a "1" to a dictionary word. "123" is one of the few multi-digit strings that john appends in the default ruleset. Furthermore, the first 5 million passwords were used to generate a character frequency database for cracking the second 5 million.
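
As a rough illustration of why that matters for the sample, here is a Python sketch of a suffix-appending rule. This is not John the Ripper's rule syntax or its actual default ruleset; the function name, the word list, and the suffix tuple are assumptions chosen to mirror the "1" and "123" examples above.

```python
def candidates(words, suffixes=("1", "123")):
    """Yield each dictionary word, then the word with each suffix appended."""
    for word in words:
        yield word
        for suffix in suffixes:
            yield word + suffix

# Hypothetical two-word dictionary, for illustration only.
for guess in candidates(["password", "monkey"]):
    print(guess)
```

Any hashed password that matches one of these guesses gets cracked and ends up in a dump, so passwords ending in "1" or "123" would be over-represented relative to passwords the rules never try.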

aWeighThrown · on Oct 14, 2016

  the first cracking 
  rules in john

And by John, you mean the "John The Ripper" program.

https://en.wikipedia.org/wiki/John_the_Ripper

minimaxir · on Oct 13, 2016

The 10M dump was collected from a wide variety of sources to avoid sampling bias.

dfc · on Oct 13, 2016

How did you "avoid" sample bias? How many of the passwords come from databases that were dumped in cleartext or cracked with 100% success? Meaning every account on that system was included in cleartext or 100% of the passwords from a dump were cracked.

The reason I ask is that the dataset you analyzed does not make this claim:

"Now not all of these passwords are plaintext. Many dumps include passwords in a hashed format that requires you to crack them yourself." https://xato.net/a-glimpse-into-the-world-of-internet-passwo...

kijin · on Oct 13, 2016

DataGenetics did a similar analysis with four-digit numbers in leaked passwords and PINs. The article contains lots of cool visualizations.

http://www.datagenetics.com/blog/september32012/

maxerickson · on Oct 13, 2016

I'm always struck by the uncredited similarities of stuff there to other sources, like the pin grid, found in this paper published earlier in 2012 than the blag there:

https://www.cl.cam.ac.uk/~rja14/Papers/BPA12-FC-banking_pin_...

iask · on Oct 13, 2016

DataGenetics have really cool and interesting articles. I've been following them for a couple of years now...been asking for a "Bin and Order Packin" post.

d--b · on Oct 13, 2016

Notable fact: '69' makes it as '3rd most used combination of 2 numbers in passwords'.

AznHisoka · on Oct 13, 2016

I assume because most users were born in 1969?

wccrawford · on Oct 13, 2016

I'm not sure if this comment is deeply sarcastic and insightful on a number of topics, or just hopelessly naive. I'm leaning towards sarcastic and insightful, and it's impressive.

ssully · on Oct 13, 2016

This reminds me of my mother wishing her neighbor happy birthday on April 20th because his Wifi network name had 420 at the end of it.

wnevets · on Oct 13, 2016

Assuming you're not joking, it's probably because of the sex position.

gruez · on Oct 14, 2016

nah, they really liked the apollo 11 moon landing

TorKlingberg · on Oct 13, 2016

I think brute force password crackers could be made much more efficient by using machine learning or manually written rules to exploit how people choose passwords.

Even if you force users to pick a password of at least 8 characters with upper and lower case letters, numbers and special characters, I suspect the real entropy is much lower than the theoretical.
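
A back-of-the-envelope sketch of that gap, in Python. The 94-character alphabet (printable ASCII without space) and the "six-letter word from a 10,000-word list, plus one digit, plus one of 32 symbols" pattern are assumptions picked for illustration, not measurements from the dataset.

```python
import math

def theoretical_entropy_bits(length, alphabet_size):
    """Entropy in bits if every character is chosen uniformly at random."""
    return length * math.log2(alphabet_size)

# Assumed alphabet: 26 lower + 26 upper + 10 digits + 32 symbols = 94.
full = theoretical_entropy_bits(8, 94)

# Hypothetical human pattern: capitalized word from a 10,000-word list,
# followed by one digit and one of 32 symbols (still 8 characters).
pattern = math.log2(10_000 * 10 * 32)

print(f"uniform random: {full:.1f} bits")
print(f"word+digit+symbol: {pattern:.1f} bits")
```

Under these assumptions the uniform case is a bit over 52 bits while the patterned case is under 22 bits, which is the kind of difference a rule-based or learned guesser would exploit.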

e12e · on Oct 13, 2016

There were a couple of talks about this at password^12:

Like:

http://passwords12.at.ifi.uio.no/Kirsi_Helkala/

http://passwords12.at.ifi.uio.no/Markus_Duermuth_Password_Se...

But it's a whole conference about passwords... so not sure if I found the presentation I had in mind...:

http://passwords12.at.ifi.uio.no/

And btw, registration is now open for password^16 in Germany in December: https://passwordscon.org/

PwdRsch · on Oct 14, 2016

Some of the latest research on this technique:

Fast, Lean, and Accurate: Modeling Password Guessability Using Neural Networks https://www.ece.cmu.edu/~lbauer/papers/2016/usenixsec2016-ne...

andromeduck · on Oct 13, 2016

That has been done for more than a decade now.

myfonj · on Oct 13, 2016

When it comes to visualising the distribution of numbers, I always recall the Secret Lives of Numbers [0] applet by Golan Levin from 2002. I haven't seen anything comparable since. The data was so pleasant to browse that I'm tempted to try getting the Java applet runtime working again. (At least we can enjoy some screenshots [1].)

[0] http://www.flong.com/projects/slon/ [1] https://www.flickr.com/photos/golanlevin/sets/72157594388612...

Coincoin · on Oct 13, 2016

I'm surprised 69 is third instead of first. I'm even more surprised the author is surprised it's near the top.

When I first looked at a password database I actually laughed out loud at how many 69s there were. I don't know, there is something funny about 'Yaris69' or 'Puppy69', although it's probably used ironically these days.

lwander · on Oct 13, 2016

The peaks at 6 and 8 digits per password are probably due to the fact that dates can be represented as DDMMYY and DDMMYYYY respectively, rather than implying that humans are better at remembering an even number of digits.
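
One way to test this hypothesis would be to count how many 6- and 8-digit strings actually parse as dates. A minimal Python sketch follows; the function name is invented for this example, and it only checks the DDMMYY and DDMMYYYY orderings mentioned above (MMDDYY and other layouts would need their own format strings).

```python
from datetime import datetime

def parses_as_date(digits):
    """Return True if a 6- or 8-digit string is a valid DDMMYY or DDMMYYYY date."""
    fmt = {6: "%d%m%y", 8: "%d%m%Y"}.get(len(digits))
    if fmt is None or not (digits.isascii() and digits.isdigit()):
        return False
    try:
        datetime.strptime(digits, fmt)
        return True
    except ValueError:
        # Raised for impossible dates such as day 99 or 30 February.
        return False

# Made-up samples, not taken from the dataset.
for sample in ["131016", "13101986", "999999", "300216"]:
    print(sample, parses_as_date(sample))
```

If the share of date-like strings among 6- and 8-digit numbers is far above what random digits would give, that supports the date explanation.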

grkvlt · on Oct 13, 2016

An interesting peak in the '7XX' subset is '786', which is an important number for Muslims. [1] I also noticed mild peaks at '258' and '852', which are vertical sequences on a numeric keypad (in the 4-digit PIN dataset there was a distinct peak at '2580' as well), as well as another at '951' for the diagonal sequence.

[1] http://islam.stackexchange.com/questions/799/what-does-786-m...

OJFord · on Oct 13, 2016

There's a comment there [0] asking for more graphs including the distribution for password managers that randomly generate passwords... erm...

[0]: http://minimaxir.com/2015/02/password-numbers/#comment-18765...

blakep · on Oct 13, 2016

Looks like this guy is doing some serious campaigning for his password manager, take a look at his previous comments:

https://disqus.com/by/disqus_OIqfE7dCZb/

e12e · on Oct 13, 2016

Reminds me of the tidbit about "strong password" rules, like one each of small letter, capital letter, digit or symbol. Like: "Password2016". Really strong. It's even longer than 8 letters.

HeyLaughingBoy · on Oct 13, 2016

"E12e likes 2016" is probably even stronger and easier to remember

e12e · on Oct 14, 2016

The point is that "Password2016" will often score as "strong" (enough) while it really isn't.
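
A small Python sketch of that failure mode. Both functions are hypothetical: the first mimics a typical composition rule (length plus character classes), and the second is a deliberately crude "word followed by a year" heuristic. Neither is taken from any real strength meter.

```python
import re
import string

def passes_composition_rule(password, min_length=8):
    """Naive rule: long enough, with a lowercase letter, an uppercase letter,
    and at least one digit or symbol."""
    has_lower = any(c in string.ascii_lowercase for c in password)
    has_upper = any(c in string.ascii_uppercase for c in password)
    has_digit_or_symbol = any(
        c in string.digits + string.punctuation for c in password
    )
    return (len(password) >= min_length
            and has_lower and has_upper and has_digit_or_symbol)

def looks_like_word_plus_year(password):
    """Crude heuristic: letters only, followed by a year 1900-2099."""
    return re.fullmatch(r"[A-Za-z]+(19|20)\d{2}", password) is not None

print(passes_composition_rule("Password2016"))    # True
print(looks_like_word_plus_year("Password2016"))  # True
```

The password satisfies the composition rule while also matching one of the most guessable patterns, which is why rule-based scoring alone says little about real strength.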

social_quotient · on Oct 13, 2016

Slightly off topic, what tool/lib did you use to make the charts?

minimaxir · on Oct 13, 2016

All charts in this post were made using R/ggplot2. (The code was not open sourced in this case because the code for this post is a mess. I have revised my process since)

dfc · on Oct 13, 2016

I am not the author but I imagine this was not included as a joke: "All charts were made using R and ggplot2."