We just need to get away from remembering passwords. Something like Mozilla Persona would have been great, but since it didn't catch on, we at least have KeePass (and on Mac/Linux, the KeePassX 2 alpha works quite well).
These cutesy little passwords might be fairly strong and easy to remember, but it hardly matters because having to remember them results in reuse and an upper bound on the possible entropy. The human mind has more important things to do than remembering thousands of bits of entropy (spread among all your passwords, you should have at least that much)... we made computers to remember this sort of stuff for us.
If these passwords aren't meant to be remembered, but put in KeePass, then there's no reason not to up the entropy by randomizing at the character granularity.
KeePass makes it much easier: you have it generate 300-bit passwords (or whatever entropy you feel is enough), and then you can focus all your security efforts on that one database file. Having a "single point of failure/password" doesn't have to be bad, because it lets you guard more closely against that single point.
The words are separated (and, I guess, terminated) by a random symbol (of which there are apparently 32 possibilities) or, randomly, a random two-digit number: log(50000^4 * (100+32)^4) / log(2) ≈ 90.6.
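That arithmetic is easy to sanity-check, assuming the stated 50,000-word list, 32 symbols, and 100 two-digit numbers:

```python
import math

words = 50_000          # size of the word list
separators = 32 + 100   # 32 symbols plus the two-digit numbers 00-99

# four words and four separators, each chosen independently and uniformly
entropy_bits = 4 * math.log2(words) + 4 * math.log2(separators)
print(round(entropy_bits, 1))  # → 90.6
```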
That means you only have 7 entities to remember, which is supposedly what humans can handle easily. (Though I understand now that this is stale knowledge.)
The number refers to the capacity of working memory and has indeed been revised over time. It's a complicated issue, but 4 units of information is now considered a more accurate approximation. But if you're remembering a password, you're dealing with long-term memory, which doesn't have this limitation.
Long-term memory usually accepts most information from short-term memory, not direct sensory input, so it is easier to memorize things if you break them down into chunks first.
Can you offer a link explaining how these measurements are done and what exactly they mean?
Reading the Wikipedia article about entropy (in computing) I stumbled across this[1]:
" Around 2011, two of the random devices were dropped and linked into a single source as it could produce hundreds of megabytes per second of high quality random data on an average system. "
What does "high quality random data" mean in this context? I mean, 'random' has a precise definition: something either 'is random' (e.g. distance of next prime :-P, kill Riemann) or is 'not random' (distance of next prime if Riemann's ζ(s) is solved).
In the particular case of a series of n equally probable, independent outcomes, the entropy is given as H = -log(1/n) = log(n), measured in bits. For example, the entropy of a fair die is -log(1/6) ≈ 2.58 bits per throw.
In this case, the random event is words chosen from a word list. Four words are chosen from fifty thousand, with each word having equal probability of being chosen. So the entropy (measure of randomness) is - log 1/50,000 = 15.6 bits per word, or 62.4 bits per four-word combination. (The script also adds random numbers or symbols, to add up to 90 bits.)
Ugh, the other replies aren't quite getting at the crux of the matter. A keyed function f : {0, 1}^k × {0, 1}^n -> {0, 1}^m is considered pseudo-random if, when keyed with a uniformly random K, there is no way of distinguishing its output from the output of a truly random function F : {0, 1}^n -> {0, 1}^m in polynomial time (it's a little more complicated than that, but that's the gist).
In modern computers, the basic principle is that one generates a random seed (through heat noise, keyboard movements, mouse movements, user interactions with the computer etc) and then uses it as the key for a pseudo-random function such that the next output of the function cannot be predicted given all of the previous. One can simply iterate if one so chose (first n bits output = f_K(0), next f_K(1), ...).
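As a toy sketch of that counter-mode idea (not a vetted construction; HMAC-SHA256 stands in for the keyed pseudo-random function, and the hard-coded seed is purely for demonstration):

```python
import hashlib
import hmac

def prg_stream(key: bytes, blocks: int) -> bytes:
    """Expand a key into a pseudo-random stream: f_K(0) || f_K(1) || ..."""
    out = b""
    for counter in range(blocks):
        out += hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
    return out

# In practice the key would come from an entropy pool (e.g. os.urandom),
# gathered from the noise sources described above.
stream = prg_stream(b"seed-from-entropy-pool", 4)
print(len(stream))  # 4 blocks of 32 bytes → 128
```

Real systems use vetted designs (e.g. the DRBGs in NIST SP 800-90A, or the ChaCha20-based generators in modern kernels) rather than anything ad hoc like this.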
In this case, he's simply working from the fact that if one knows that their target is using this scheme, one knows that there are at most 50000^4 word combinations the target could be using. If there are 90 bits of entropy, that means that there are 2^90 passwords the target could be using. What makes the difference here is that this program also outputs a special character or a 2-digit number between each word, meaning that we have 50000^4 * (100+32)^4 > 2^90. It's kinda messy, and probably not that easy to remember.
In general, with passwords, one wants about 80 bits of entropy - that's high enough to be incredibly difficult to crack even when hashed insecurely. Once we start using proper algorithms like bcrypt it's even harder.
To give a comparison, a password output by this is roughly as hard to crack as a 15 character random password encoded in ASCII, or if unicode is allowed, a 6 character password in that encoding.
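Rough arithmetic behind that comparison, assuming ~95 printable ASCII characters (the Unicode figure depends heavily on which code points you count, so it's left out):

```python
import math

bits = 90            # entropy of the generated passphrase
alphabet = 95        # printable ASCII characters

equivalent_len = bits / math.log2(alphabet)
print(round(equivalent_len, 1))  # → 13.7
```

So ~14 random ASCII characters is the closer match, making 15 a slightly generous comparison.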
"Distance of next prime" is not random, it's entirely deterministic. It's a well-defined function. Just because we don't have an analytical solution for it doesn't make it random.
As for "high quality", some random variables are less predictable than others. A weighted coin that comes up heads 51% of the time is still random (not deterministic), but more predictable than a fair coin.
I don't like calling non-uniform distributions 'predictable'.
Think of it like this: a coin which produces heads 99/100 times is not predictable, since you cannot predict when the next tails is coming; you just know that tails are rarer.
As for the distance-to-the-next-prime thing, you are just plain wrong, in the sense that he is talking about a seeded PRNG, i.e. the starting prime is a seed and you don't get to know which one it is, out of all the integers...
That's randomness in maths; we're not talking maths here. In the real world, 'random' means a value that's hard to predict. For example, a die throw is a high-quality source of randomness in normal circumstances.
A low quality random source would depend on some variables that make it easier to predict the next random number (perhaps just statistically). The worst quality random source would be pseudo-random numbers, where the value of each next random number directly follows from the previous value and perhaps some secret second value.
It might be distributed like a true random number, but eventually the secret could be guessed and the numbers would become predictable.
A simple explanation is that "randomness" here is related to the (logarithm of the) minimum average time [1] you would take to guess the password correctly: H = sum(p_i * log(1/p_i)). Since the random number generator supposedly gives a uniform distribution over a set of n words, this simplifies to H = n * (1/n) * log(n) = log(n).
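A quick numerical check that the uniform case collapses to log(n):

```python
import math

def shannon_entropy(probs):
    """H = sum(p_i * log2(1/p_i)), in bits."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

n = 50_000
uniform = [1 / n] * n
print(round(shannon_entropy(uniform), 2))  # ≈ log2(50000) ≈ 15.61
```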
[1] Your average time is at least t* = (1/2) * 2^H, so log(t*) ≥ H - 1.
I've become quite good at remembering these. I can usually remember them even if I go a month or two between usages. It's hard to explain, but it seems to require a different memorization effort that somehow sticks better into my long-term memory than with random dictionary words.
You just prompted me to read the source code :-) gpw.c has this comment at the top:
/* GPW - Generate pronounceable passwords
This program uses statistics on the frequency of three-letter sequences
in your dictionary to generate passwords. The statistics are in trigram.h,
generated there by the program loadtris. Use different dictionaries
and you'll get different statistics.
This program can generate every word in the dictionary, and a lot of
non-words. It won't generate a bunch of other non-words, call them the
unpronounceable ones, containing letter combinations found nowhere in
the dictionary. My rough estimate is that if there are 10^6 words,
then there are about 10^9 pronounceables, out of a total population
of 10^11 8-character strings. I base this on running the program a lot
and looking for real words in its output.. they are very rare, on the
order of one in a thousand.
...
I'm not really in a position to properly figure this out... for a start, it's commonly claimed that there are 1M words used in English; however, my local /usr/share/dict/words has only 100K entries (and even a lot of those seem redundant).
Assuming my gpw binary was "trained" on a 1M wordlist, and that my algorithm averages three 8-char strings per passphrase... and we take the comments at face-value, i.e. 10^9 possibilities per 8-char string generated.
I'm going to pretend I can competently use M·log2(N) to estimate that three such strings might make up 3·log2(10^9) ≈ 90 bits of entropy. Throw in a 0-9999 number, let's pretend that's worth 13-ish bits, let's call it 100 bits?
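Spelling that estimate out (every figure is a guess from the comments above: 10^9 pronounceable 8-char strings, three strings per passphrase, one 0-9999 number):

```python
import math

per_string = math.log2(1e9)        # ≈ 29.9 bits per pronounceable 8-char string
three_strings = 3 * per_string     # ≈ 89.7 bits
number = math.log2(10_000)         # ≈ 13.3 bits for a 0-9999 number
print(round(three_strings + number))  # → 103
```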
I haven't factored in the capitalizations yet, and I'm not sure how much the random placement of the numbers and the random lengths of each "word" are worth.
IANACG (I am not a crypto guy) but that seems like a huge number of bits, so this is likely completely wrong (I'd start by going back and properly analyzing how many combinations my gpw binary can actually produce in an n-char string - that claim of 10^9 combinations from a 10^6 wordlist needs testing).
Your little script performs amazingly well on my Mac, coming in at about 165 ms, quite a bit faster than I expected. If your script is too fast for you, you could consider reading the dictionary into a bash array instead of using sed. My /usr/share/dict/words contains 234,936 words.
As a point of comparison, here is a Python version that clocks in around 107 ms:
#!/usr/bin/env python
from random import SystemRandom as r
with open('/usr/share/dict/words', 'rb') as f:
    lines = f.readlines()
print ' '.join(r().choice(lines).rstrip() for x in xrange(4))
However, I don't know how well SystemRandom compares to arc4random in terms of crypto.
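For what it's worth, on Python 3 the `secrets` module wraps the same OS CSPRNG that `SystemRandom` uses and makes the intent explicit; a minimal sketch (the tiny inline word list just stands in for /usr/share/dict/words):

```python
#!/usr/bin/env python3
import secrets

def passphrase(words, n=4):
    # secrets.choice draws from the OS CSPRNG, like random.SystemRandom
    return ' '.join(secrets.choice(words) for _ in range(n))

# e.g. words = open('/usr/share/dict/words').read().split()
sample_words = ['correct', 'horse', 'battery', 'staple']
print(passphrase(sample_words))
```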
If you're set on using arc4random it's no big deal to use ctypes to invoke it from Python. Bash is quick to bang out but it's funny how your parent needs to compile a C program then invoke it N times per bash script run. Using Python for this is an obvious choice.
Yes, thanks for pointing that out - I wrote a slightly faster version below. I took your advice and went with a single invocation of the C program, modifying it to take an argument. I'm satisfied!
> cat arc4random-v2.c
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    if (!(argc == 1 || argc == 2)) {
        printf("usage: %s [n_output]\n", argv[0]);
    } else {
        int n_iter = (argc == 2) ? atoi(argv[1]) : 1;
        for (int i = 0; i < n_iter; i++) {
            printf("%u\n", arc4random());
        }
    }
    return 0;
}
> ./arc4random 3
4282823399
333796506
762478899
> cat pwgen-v2.sh
#!/usr/bin/env bash
dict=/usr/share/dict/words
n=4 # of words to include in passphrase
nl=$(wc -l $dict | sed 's_[ \t]*\([0-9]*\).*_\1_')
fmt_flag() { echo $[$1 % $nl] | sed -n -e 's_^_-e _' -e 's_$_p_p' ; }
flags=$(./arc4random $n | while read -r i; do fmt_flag "$i"; done)
words=$(sed -n $flags < $dict)
echo $words
> time ./pwgen.sh
liparite evade retiringly spacious
real 0m0.268s
user 0m0.251s
sys 0m0.015s
> time ./pwgen-v2.sh
latherin prideling thumbscrew unraveled
real 0m0.082s
user 0m0.072s
sys 0m0.013s
EDIT: note that the single call to sed prints the matched words in dictionary order, so the results are alphabetical/dictionary-dependent... (!) oops
Could you describe a practical attack against this password generator based on how it actually works?
Specifically what information you would need to be in possession of in order to exploit it, and then a description of how you would - again, in practice - exploit that, inside a practical timeframe. Thanks.
> there probably isn't a realistic attack against Python's default MT random based on the normal use case of a random word selector run "offline".
What does Python use as an entropy source for its PRNG? Even though MTs work pretty well if you can't look at any of their outputs (i.e. you're "offline"), if the initial state space is small enough you can just crawl the entire thing.
A PRNG is only as random as its seed / starting state. Let's say it gets this state from the current time, in milliseconds.
Now, let's assume the attacker knows, based on your "Time joined" statistic for some account, a roughly 2 minute window in which you generated your password (it's very common for a website to show time joined at the granularity of a second through an api; passwords are rarely generated more than a minute beforehand if you use a generator + keepass). 2 minutes is 120,000 milliseconds... or in other words, they can seed the RNG with all 120,000 "time-in-milliseconds" values, run the program, and record the output. They now only have to guess 120,000 possible passwords, not, say, 1e16 (assuming four words from a dictionary of 10k words). That's a massive speedup, on the order of 1e11.
I hope that all made sense... the basic idea is just if the attacker can make any sort of estimate about the state of the PRNG, he can just test passwords generated from states like that, not all possible passwords.
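A toy version of that attack against a time-seeded, non-cryptographic generator (all names and the 4-words-from-a-10k-dictionary setup are invented for illustration; Python's `random.Random` plays the part of the weak PRNG):

```python
import random

DICT = [f"word{i}" for i in range(10_000)]  # stand-in dictionary

def weak_generate(seed_ms):
    """Victim's flawed generator, seeded from a millisecond timestamp."""
    rng = random.Random(seed_ms)
    return ' '.join(rng.choice(DICT) for _ in range(4))

# The victim generated a password at some unknown millisecond...
secret_seed = 1_700_000_057_123
target = weak_generate(secret_seed)

# ...but the attacker knows a 2-minute window: 120,000 candidate seeds,
# instead of 10_000**4 = 1e16 possible four-word combinations.
window_start = 1_700_000_000_000
recovered = None
for seed in range(window_start, window_start + 120_000):
    if weak_generate(seed) == target:
        recovered = seed
        break
print("recovered seed:", recovered)
```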
You'll notice, of course, that this vulnerability relies on the other party being able to discern some information about the state. This doesn't have to be from guessing the seed directly (as was the case in the example above). It can also happen by observing multiple outputs and guessing at what seed could produce them both, or noticing a bias.
The last one, the chance for a bias, is quite possibly what you're asking about. "How could a weakness in the randomness be a problem at all in this case?". A weakness in randomness means that some combinations of words will show up more often than others, by definition. It makes the password weaker because the attacker can discover this bias and then guess more probable passwords first.
That he has it or can get it, yes. Even if he doesn't have it, there are a few dictionaries (/usr/share/dict/words on various popular distros would be a good start) he could try and only guess the probable options in both of them.
Furthermore, it's possible he could figure out which dictionary you were using by other means. If a site that stored your password in plaintext was compromised, that would give him 4, 5, however many words in your dictionary, and using a similar attack on this flawed PRNG as discussed above, he could narrow down the possible positions of those words, thereby figuring out which dictionary is most probable.
The assumption that the attacker can get your word list in this sort of password scheme is actually implicit in how the entropy is calculated; it's calculated assuming the attacker is stringing together words from the same known source in the same general format.
I'm not familiar with Python's non-secure PRNG implementation. Usually the problem comes down to the facts that (a) these PRNGs have small seed/state spaces, so you can check the entire state space quickly, and (b) they have non-uniform outputs, so you can do probabilistic attacks.
Here is what I happen to have sitting around. (I'm not recommending it; there are much more secure ones. But it's short, and uses a crypto-random generator.)
I made one in Ruby a few weeks ago. Please don't do what ghshephard does below and blindly import random. It's actually quite important that you use a secure random generator, or your entropy will be drastically reduced.
My solution is here, don't use it unless smart people have confirmed that it works:
Also smart people: please let me know if I've done anything stupid here, I actually use passwords generated by this tool.
For people who don't enjoy reading Ruby: my tool reads in a dictionary file, drops words that are too short (because they don't carry enough absolute entropy) and words that are too long (because they're harder to remember; for Dutch anyway, your mileage may vary in your own language).
You can vary the number of words you want in your passphrase, and the tool will print how much entropy the password has if the attacker knows you're using this scheme and this dictionary file, but not the random values spit out by the generator to pick the words.
It will also spit out a cheesy ballpark estimate of how long it might take an attacker to crack the password with reasonable effort; you can tweak what "reasonable effort" means for your context. What currently counts as a reasonable attack depends heavily on the hashing scheme employed.
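That kind of ballpark takes only a couple of lines to reproduce; the 10^12 guesses/second figure is an arbitrary stand-in for "reasonable effort" against a fast hash:

```python
entropy_bits = 90
guesses_per_second = 1e12   # assumed attacker capability

# on average the attacker searches half the space before hitting the password
seconds = 2 ** entropy_bits / 2 / guesses_per_second
years = seconds / (3600 * 24 * 365)
print(f"about {years:.1e} years")  # → about 2.0e+07 years
```

A memory-hard scheme like bcrypt or scrypt drops guesses_per_second by many orders of magnitude, which is the point of the last sentence above.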
Oulipo was a silly 'organisation' invented by some writers in Paris after the Second World War. The members included mathematics lecturers and writers interested in producing texts according to routines or with constraints. Georges Perec walked the streets of Paris systematically for some months according to a routine derived from a double latin square, and then wrote about what was happening at each location at a specified time.
I was alluding to the idea of easily remembered rules for selecting characters from a text known to the person.