We just need to get away from remembering passwords. Something like Mozilla Persona would have been great, but since it didn't catch on, we at least have KeePass (and on Mac/Linux, the KeePassX 2 alpha works quite well).
These cutesy little passwords might be fairly strong and easy to remember, but it hardly matters because having to remember them results in reuse and an upper bound on the possible entropy. The human mind has more important things to do than remembering thousands of bits of entropy (spread among all your passwords, you should have at least that much)... we made computers to remember this sort of stuff for us.
If these passwords aren't meant to be remembered, but put in KeePass, then there's no reason not to up the entropy by randomizing at the character granularity.
KeePass makes it much easier: you have it generate 300-bit passwords (or whatever entropy you feel is enough), and then you can focus all your security efforts on that one database file. Having a "single point of failure/password" doesn't have to be bad, because it lets you guard more closely against that single point.
The words are separated (and, I guess, terminated) by a random symbol (of which there are apparently 32 possibilities) or, randomly, a random two-digit number: log(50000^4 * (100+32)^4) / log(2) ≈ 90.6.
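That arithmetic is easy to sanity-check, assuming the stated 50,000-word list, 32 symbols, and 100 two-digit numbers:

```python
import math

words = 50_000          # size of the word list
separators = 32 + 100   # 32 symbols plus the two-digit numbers 00-99

# four words and four separators, each chosen independently and uniformly
entropy_bits = 4 * math.log2(words) + 4 * math.log2(separators)
print(round(entropy_bits, 1))  # → 90.6
```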
That means you only have 7 entities to remember, which is supposedly what humans can handle easily. (Though I understand now that this is stale knowledge.)
The number refers to the capacity of working memory and has indeed been revised over time. It's a complicated issue, but 4 units of information is now considered a more accurate approximation. But if you're remembering a password, you're dealing with long-term memory, which doesn't have this limitation.
Long-term memory usually accepts most information from short-term memory, not direct sensory input, so it is easier to memorize things if you break them down into chunks first.
Can you offer a link explaining how these measurements are done and what exactly they mean?
Reading the Wikipedia article about entropy (in computing) I stumbled across this[1]:
" Around 2011, two of the random devices were dropped and linked into a single source as it could produce hundreds of megabytes per second of high quality random data on an average system. "
What does "high quality random data" mean in this context? I mean, 'random' has a precise definition: something either 'is random' (e.g. distance of next prime :-P, kill Riemann) or is 'not random' (distance of next prime if Riemann's ζ(s) is solved).
In the particular case of a series of n equally probable, independent outcomes, the entropy is given as H = -log(1/n) = log(n), measured in bits. For example, the entropy of a fair die is -log(1/6) ≈ 2.58 bits per throw.
In this case, the random event is words chosen from a word list. Four words are chosen from fifty thousand, with each word having equal probability of being chosen. So the entropy (measure of randomness) is - log 1/50,000 = 15.6 bits per word, or 62.4 bits per four-word combination. (The script also adds random numbers or symbols, to add up to 90 bits.)
Ugh, the other replies aren't quite getting at the crux of the matter. A keyed function f : {0, 1}^k × {0, 1}^n -> {0, 1}^m is considered pseudo-random if, when keyed with a uniformly random K, there is no way of distinguishing its output from the output of a truly random function F : {0, 1}^n -> {0, 1}^m in polynomial time (it's a little more complicated than that, but that's the gist).
In modern computers, the basic principle is that one generates a random seed (through heat noise, keyboard movements, mouse movements, user interactions with the computer etc) and then uses it as the key for a pseudo-random function such that the next output of the function cannot be predicted given all of the previous. One can simply iterate if one so chose (first n bits output = f_K(0), next f_K(1), ...).
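As a toy sketch of that counter-mode idea (not a vetted construction; HMAC-SHA256 stands in for the keyed pseudo-random function, and the hard-coded seed is purely for demonstration):

```python
import hashlib
import hmac

def prg_stream(key: bytes, blocks: int) -> bytes:
    """Expand a key into a pseudo-random stream: f_K(0) || f_K(1) || ..."""
    out = b""
    for counter in range(blocks):
        out += hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
    return out

# In practice the key would come from an entropy pool (e.g. os.urandom),
# gathered from the noise sources described above.
stream = prg_stream(b"seed-from-entropy-pool", 4)
print(len(stream))  # 4 blocks of 32 bytes → 128
```

Real systems use vetted designs (e.g. the DRBGs in NIST SP 800-90A, or the ChaCha20-based generators in modern kernels) rather than anything ad hoc like this.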
In this case, he's simply working from the fact that if one knows that their target is using this scheme, one knows that there are at most 50000^4 word combinations the target could be using. If there are 90 bits of entropy, that means that there are 2^90 passwords the target could be using. What makes the difference here is that this program also outputs a special character or a 2-digit number between each word, meaning that we have 50000^4 * (100+32)^4 > 2^90. It's kinda messy, and probably not that easy to remember.
In general, with passwords, one wants about 80 bits of entropy - that's high enough to be incredibly difficult to crack even when hashed insecurely. Once we start using proper algorithms like bcrypt it's even harder.
To give a comparison, a password output by this is roughly as hard to crack as a 15 character random password encoded in ASCII, or if unicode is allowed, a 6 character password in that encoding.
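Rough arithmetic behind that comparison, assuming ~95 printable ASCII characters (the Unicode figure depends heavily on which code points you count, so it's left out):

```python
import math

bits = 90            # entropy of the generated passphrase
alphabet = 95        # printable ASCII characters

equivalent_len = bits / math.log2(alphabet)
print(round(equivalent_len, 1))  # → 13.7
```

So ~14 random ASCII characters is the closer match, making 15 a slightly generous comparison.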
"Distance of next prime" is not random, it's entirely deterministic. It's a well-defined function. Just because we don't have an analytical solution for it doesn't make it random.
As for "high quality", some random variables are less predictable than others. A weighted coin that comes up heads 51% of the time is still random (not deterministic), but more predictable than a fair coin.
I don't like calling non-uniform distributions 'predictable'.
Think of it like this: a coin which produces heads 99/100 times is not predictable, since you cannot predict when the next tails is coming; you just know that tails are rarer.
As for the distance-to-the-next-prime thing, you are just plain wrong, in the sense that he is talking about a seeded PRNG, i.e. the starting prime is a seed and you don't get to know which one it is, out of all the integers...
That's randomness in maths; we're not talking maths here. In the real world, 'random' means a value that's hard to predict. For example, a die throw is a high-quality source of randomness in normal circumstances.
A low quality random source would depend on some variables that make it easier to predict the next random number (perhaps just statistically). The worst quality random source would be pseudo-random numbers, where the value of each next random number directly follows from the previous value and perhaps some secret second value.
It might be distributed like a true random number, but eventually the secret could be guessed and the numbers would become predictable.
A simple explanation is that "randomness" here is related to the (logarithm of the) minimum average time [1] you would take to guess the password correctly: H = sum(p_i * log(1/p_i)). Since the random number generator supposedly gives a uniform distribution over a set of n words, this simplifies to H = n * (1/n) * log(n) = log(n).
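A quick numerical check that the uniform case collapses to log(n):

```python
import math

def shannon_entropy(probs):
    """H = sum(p_i * log2(1/p_i)), in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

n = 50_000
uniform = [1 / n] * n
print(round(shannon_entropy(uniform), 2))  # ≈ log2(50000) ≈ 15.61
```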
[1] Your average time is at least t* = (1/2) * 2^H, so log(t*) ≥ H - 1.
I've become quite good at remembering these. I can usually remember them even if I go a month or two between usages. It's hard to explain, but it seems to require a different memorization effort that somehow sticks better into my long-term memory than with random dictionary words.
You just prompted me to read the source code :-) gpw.c has this comment at the top:
/* GPW - Generate pronounceable passwords
This program uses statistics on the frequency of three-letter sequences
in your dictionary to generate passwords. The statistics are in trigram.h,
generated there by the program loadtris. Use different dictionaries
and you'll get different statistics.
This program can generate every word in the dictionary, and a lot of
non-words. It won't generate a bunch of other non-words, call them the
unpronounceable ones, containing letter combinations found nowhere in
the dictionary. My rough estimate is that if there are 10^6 words,
then there are about 10^9 pronounceables, out of a total population
of 10^11 8-character strings. I base this on running the program a lot
and looking for real words in its output.. they are very rare, on the
order of one in a thousand.
...
I'm not really in a position to properly figure this out... for a start, it's commonly claimed that there are 1M words used in English; however, my local /usr/share/dict/words has only 100K entries (and even a lot of those seem redundant).
Assuming my gpw binary was "trained" on a 1M wordlist, and that my algorithm averages three 8-char strings per passphrase... and we take the comments at face-value, i.e. 10^9 possibilities per 8-char string generated.
I'm going to pretend I can competently use M·log2(N) to estimate that three such strings might make up 3·log2(10^9) ≈ 90 bits of entropy. Throw in a 0-9999 number, let's pretend that's worth 13-ish bits, let's call it 100 bits?
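Spelling that estimate out (every figure is a guess from the comments above: 10^9 pronounceable 8-char strings, three strings per passphrase, one 0-9999 number):

```python
import math

per_string = math.log2(1e9)        # ≈ 29.9 bits per pronounceable 8-char string
three_strings = 3 * per_string     # ≈ 89.7 bits
number = math.log2(10_000)         # ≈ 13.3 bits for a 0-9999 number
print(round(three_strings + number))  # → 103
```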
I haven't factored in the capitalizations yet, and I'm not sure how much the random placement of the numbers and the random lengths of each "word" are worth.
IANACG (I am not a crypto guy) but that seems like a huge number of bits, so this is likely completely wrong (I'd start by going back and properly analyzing how many combinations my gpw binary can actually produce in an n-char string - that claim of 10^9 combinations from a 10^6 wordlist needs testing).
Your little script performs amazingly well on my Mac, coming in at about 165 ms, quite a bit faster than I expected. If your script is too fast for you, you could consider reading the dictionary into a bash array instead of using sed. My /usr/share/dict/words contains 234,936 words.
As a point of comparison, here is a Python version that clocks in around 107 ms:
#!/usr/bin/env python
from random import SystemRandom as r
with open('/usr/share/dict/words', 'rb') as f:
    lines = f.readlines()
print ' '.join(r().choice(lines).rstrip() for x in xrange(4))
However, I don't know how well SystemRandom compares to arc4random in terms of crypto.
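For what it's worth, on Python 3 the `secrets` module wraps the same OS CSPRNG that `SystemRandom` uses and makes the intent explicit; a minimal sketch (the tiny inline word list just stands in for /usr/share/dict/words):

```python
#!/usr/bin/env python3
import secrets

def passphrase(words, n=4):
    # secrets.choice draws from the OS CSPRNG, like random.SystemRandom
    return ' '.join(secrets.choice(words) for _ in range(n))

# e.g. words = open('/usr/share/dict/words').read().split()
sample_words = ['correct', 'horse', 'battery', 'staple']
print(passphrase(sample_words))
```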
If you're set on using arc4random it's no big deal to use ctypes to invoke it from Python. Bash is quick to bang out but it's funny how your parent needs to compile a C program then invoke it N times per bash script run. Using Python for this is an obvious choice.
Yes, thanks for pointing that out - I wrote a slightly faster version below. I took your advice and went with a single invocation of the C program, modifying it to take an argument. I'm satisfied!
> cat arc4random-v2.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    if (!(argc == 1 || argc == 2)) {
        printf("usage: %s [n_output]\n", argv[0]);
    } else {
        int n_iter = (argc == 2) ? atoi(argv[1]) : 1;
        for (int i = 0; i < n_iter; i++) {
            printf("%u\n", arc4random());
        }
    }
    return 0;
}
> ./arc4random 3
4282823399
333796506
762478899
> cat pwgen-v2.sh
#!/usr/bin/env bash
dict=/usr/share/dict/words
n=4 # of words to include in passphrase
nl=$(wc -l $dict | sed 's_[ \t]*\([0-9]*\).*_\1_')
fmt_flag() { echo $[$1 % $nl] | sed -n -e 's_^_-e _' -e 's_$_p_p' ; }
flags=$(./arc4random $n | while read -r i; do fmt_flag "$i"; done)
words=$(sed -n $flags < $dict)
echo $words
> time ./pwgen.sh
liparite evade retiringly spacious
real 0m0.268s
user 0m0.251s
sys 0m0.015s
> time ./pwgen-v2.sh
latherin prideling thumbscrew unraveled
real 0m0.082s
user 0m0.072s
sys 0m0.013s
EDIT: note that the single call to sed prints the matched words in dictionary order, so the results are alphabetical/dictionary-dependent... (!) oops
Could you describe a practical attack against this password generator based on how it actually works?
Specifically what information you would need to be in possession of in order to exploit it, and then a description of how you would - again, in practice - exploit that, inside a practical timeframe. Thanks.
> there probably isn't a realistic attack against Python's default MT random based on the normal use case of a random word selector run "offline".
What does Python use as an entropy source for its PRNG? Even though MTs work pretty well if you can't look at any of their outputs (i.e. you're "offline"), if the initial state space is small enough you can just crawl the entire thing.
A PRNG is only as random as its seed / starting state. Let's say it gets this state from the current time, in milliseconds.
Now, let's assume the attacker knows, based on your "Time joined" statistic for some account, a roughly 2 minute window in which you generated your password (it's very common for a website to show time joined at the granularity of a second through an api; passwords are rarely generated more than a minute beforehand if you use a generator + keepass). 2 minutes is 120,000 milliseconds... or in other words, they can seed the RNG with all 120,000 "time-in-milliseconds" values, run the program, and record the output. They now only have to guess 120,000 possible passwords, not, say, 1e16 (assuming four words from a dictionary of 10k words). That's a massive speedup, on the order of 1e11.
I hope that all made sense... the basic idea is just if the attacker can make any sort of estimate about the state of the PRNG, he can just test passwords generated from states like that, not all possible passwords.
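A toy version of that attack against a time-seeded, non-cryptographic generator (all names and the 4-words-from-a-10k-dictionary setup are invented for illustration; Python's `random.Random` plays the part of the weak PRNG):

```python
import random

DICT = [f"word{i}" for i in range(10_000)]  # stand-in dictionary

def weak_generate(seed_ms):
    """Victim's flawed generator, seeded from a millisecond timestamp."""
    rng = random.Random(seed_ms)
    return ' '.join(rng.choice(DICT) for _ in range(4))

# The victim generated a password at some unknown millisecond...
secret_seed = 1_700_000_057_123
target = weak_generate(secret_seed)

# ...but the attacker knows a 2-minute window: 120,000 candidate seeds,
# instead of 10_000**4 = 1e16 possible four-word combinations.
window_start = 1_700_000_000_000
recovered = None
for seed in range(window_start, window_start + 120_000):
    if weak_generate(seed) == target:
        recovered = seed
        break
print("recovered seed:", recovered)
```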
You'll notice, of course, that this vulnerability relies on the other party being able to discern some information about the state. This doesn't have to be from guessing the seed directly (as was the case in the example above). It can also happen by observing multiple outputs and guessing at what seed could produce them both, or noticing a bias.
The last one, the chance for a bias, is quite possibly what you're asking about. "How could a weakness in the randomness be a problem at all in this case?". A weakness in randomness means that some combinations of words will show up more often than others, by definition. It makes the password weaker because the attacker can discover this bias and then guess more probable passwords first.
That he has it or can get it, yes. Even if he doesn't have it, there are a few dictionaries (/usr/share/dict/words on various popular distros would be a good start) he could try and only guess the probable options in both of them.
Furthermore, it's possible he could figure out which dictionary you were using by other means. If a site that stored your password in plaintext was compromised, that would give him 4, 5, however many words in your dictionary, and using a similar attack on this flawed PRNG as discussed above, he could narrow down the possible positions of those words, thereby figuring out which dictionary is most probable.
The assumption that the attacker can get your word list in this sort of password scheme is actually implicit in how the entropy is calculated; it's calculated assuming the attacker is stringing together words from the same known source in the same general format.
I'm not familiar with Python's non-secure PRNG implementation. Usually the problem comes down to the facts that (a) these PRNGs have small seed/state spaces, so you can check the entire state space quickly, and (b) they have non-uniform outputs, so you can do probabilistic attacks.
Here is what I happen to have sitting around. (I'm not recommending it; there are much more secure ones. But it's short, and uses a crypto-random generator.)
I made one in Ruby a few weeks ago. Please don't do what ghshephard does below and blindly import random. It's actually quite important that you use a secure random generator, or your entropy will be drastically reduced.
My solution is here, don't use it unless smart people have confirmed that it works:
Also smart people: please let me know if I've done anything stupid here, I actually use passwords generated by this tool.
For people who don't enjoy reading Ruby: my tool reads in a dictionary file, drops words that are too short (because they don't carry enough absolute entropy) and words that are too long (because they're harder to remember; for Dutch anyway, your mileage may vary in your own language).
You can vary the number of words you want in your passphrase, and the tool will print how much entropy the password has if the attacker knows you're using this scheme and this dictionary file, but not the random values spit out by the generator to pick the words.
It will also spit out a cheesy ballpark estimate of how long it might take an attacker to crack the password with reasonable effort; you can tweak what "reasonable effort" means for your context. What currently counts as a reasonable attack depends heavily on the hashing scheme employed.
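That kind of ballpark takes only a couple of lines to reproduce; the 10^12 guesses/second figure is an arbitrary stand-in for "reasonable effort" against a fast hash:

```python
entropy_bits = 90
guesses_per_second = 1e12   # assumed attacker capability

# on average the attacker searches half the space before hitting the password
seconds = 2 ** entropy_bits / 2 / guesses_per_second
years = seconds / (3600 * 24 * 365)
print(f"about {years:.1e} years")  # → about 2.0e+07 years
```

A memory-hard scheme like bcrypt or scrypt drops guesses_per_second by many orders of magnitude, which is the point of the last sentence above.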
Oulipo was a silly 'organisation' invented by some writers in Paris after the Second World War. The members included mathematics lecturers and writers interested in producing texts according to routines or with constraints. Georges Perec walked the streets of Paris systematically for some months according to a routine derived from a double latin square, and then wrote about what was happening at each location at a specified time.
I was alluding to the idea of easily remembered rules for selecting characters from a text known to the person.