If you are aware of it, you make sure your fake data follows Benford's law. Otherwise, your fake data will be 'random', because people assume that is how it is supposed to look.
Predict and then create a histogram of the leading digits of the file sizes of the non-zero-length files on your computer.
[SPOILER: when I did this once I found sharp peaks around digits that weren't 1. You are likely to see this if you have a large number of files around a particular size, examples: a) whatever your digital camera typically produces; b) whatever size your software encodes a typical song into. These files violate the assumption that you are sampling sizes over a wide range of sizes. After excluding these files I observed Benford's law quite closely on the remainder.]
That spike for 4 is due to the default directory size of 4096 (my experiment included directories as well as files). The information was pulled from 503,444 files and directories.
I had to use a different stat command on my Linux system. This worked for me:
find . -type f -exec stat -c %s {} \; | cut -c 1 | sort | uniq -c
Note that I exclude directories to avoid the size 4096 bias.
I ran it in my "project" directory and found that 38% of my file sizes begin with "1". That directory includes Perl source code files, input data files, and automatically generated output files.
After the digit "1", the percentages for the remaining digits ranged from 3% to 9% with no obvious bias that I could see.
Limiting to just files is a good idea, and if we employ our good friend awk it cuts the time down significantly. This one should work for both OSX and Linux.
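Something along these lines should do it (a sketch; it assumes ls -l reports the byte size in field 5, which holds on both OS X and Linux):

  find . -type f -exec ls -l {} + |
    awk '{ count[substr($5, 1, 1)]++ }
         END { for (d = 1; d <= 9; d++) print d, count[d] + 0 }'

The {} + form batches the filenames into a few ls invocations instead of forking a process per file, which is where most of the speedup comes from.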
Now I'm piping that into Perl to convert the counts to percentages. If I figure out a one-liner for that I'll let you know.
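In the meantime, the same step in awk rather than Perl might look something like this (skipping zero-length files):

  find . -type f -exec ls -l {} + |
    awk '$5 > 0 { count[substr($5, 1, 1)]++; total++ }
         END { for (d = 1; d <= 9; d++) printf "%d: %.1f%%\n", d, 100 * count[d] / total }'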
Next I'll be tempted to write a module for generating "realistic" (Benford-compliant) random numbers using this concise specification from HN contributor "shrughes":
"Data whose logarithm is uniformly distributed does [follow Benford's Law]."
Interesting, but can't this be explained by the distribution of file sizes? There will generally be many small files but fewer and fewer larger files, so there will be more 1k files than 2k files, more 2k than 3k, more files between 10k and 19k than between 20k and 29k, more between 100k and 199k than between 200k and 299k, and so on.
Nice. The "law" got the rough shape of the distribution, but (unless you have a very small filesystem) these numbers are statistically significantly different.
Your files must have been made up! ...or you have a nice demonstration of why people shouldn't read too much precision into Benford's law. "30.1%": 3 significant figures, really?
"Explaining Benford's Law" may have been a more accurate title, but "Why 30.1% of numbers start with 1" brought this neat article to a larger audience. Taken too far, though, that kind of thing can ruin social news sites. Headlines have always been a gray area.
Unfortunately, "Explaining Benford's Law", which is the actual title of the article, is a complete misnomer, as the article doesn't actually explain the first thing about Benford's law. It just restates Benford's law and gives examples. There is no explanation or derivation at all.
The supposed explanation here is not very good. It amounts to this: "To get those first digits, you took the actual numbers and scaled them all by powers of 10 to get values between 1 and 10. That's kinda logarithmic, and Benford's law is kinda logarithmic, so it's no surprise that the results end up obeying Benford's law. All you really need, kinda, is for the probability distribution to span several powers of 10." This is hand-wavy and, not to put too fine a point on it, wrong. For instance, suppose we generate random numbers uniformly distributed between 1 and 1000000; they will not obey Benford's law.
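That counterexample is easy to check numerically; here is a quick simulation (just a sketch), in which every leading digit comes out near 1/9, i.e. about 11.1%, rather than the 30.1% Benford's law assigns to 1:

  awk 'BEGIN {
    srand()
    n = 100000
    for (i = 0; i < n; i++) count[substr(int(1 + rand() * 999999), 1, 1)]++
    for (d = 1; d <= 9; d++) printf "%d: %.1f%%\n", d, 100 * count[d] / n
  }'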
The author also claims that looking at the Fourier transform of the probability distribution is key to understanding what's going on. But the full extent of his Fourier-based analysis is this: Consider the probability distribution function for log_10(data). Then Benford's law holds if this is constant (editorial note: it cannot in fact be constant) and holds roughly if it's roughly constant. That happens, kinda, when the probability distribution is very broad (editorial note: no, not really; see the example above). What, you didn't see anything about Fourier transforms there? Well, that's because the Fourier stuff is really almost all window-dressing.
For a brief account of Benford's law and related matters written by someone with a better grasp of what's going on, you could turn to http://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-... whose author is one of the best mathematicians currently living and also a very good expositor.
Benford's law is not universal in the same sense that the Pareto and Gaussian distributions are not universal: blatantly so, yet people believe in them and treat them as universal anyway, often several different "universal" distributions held by the same people!
Maybe they're not universal, but each is a strong attractor of models given a relatively small number of commonly satisfied assumptions.
In particular, dimensioned measurements must represent the same relationships between data under many different scales, which leads to logarithmic sampling.
Without further information to suggest other trends, these laws are great starting points.
If by "must" you mean "often tend to", and by "relationships" you mean "ratios", then I agree; it's a very useful model in the absence of other information. Wonderful, in fact.
Still, sociologically, working scientists tend to believe in these models as if they were hard rules, to the extent of constructing bridges, rockets, nuclear reactors, nationwide health recommendations and global financial systems without fundamentally understanding why each distribution might arise, and why it might fail to explain real phenomena.
Without taking the time to read the rest of the book, I am suspicious of the author's claims that he has really solved the problem. Consider:
- Wolfram Mathworld references this topic and claims Benford's Law was put on a rigorous footing in 1998 (http://mathworld.wolfram.com/BenfordsLaw.html), but this is not even mentioned in the book.
- the author makes "straw-man" type claims that (unnamed) prominent mathematicians view Benford's law as "paranormal". Also see the last two paragraphs on the first page of the original article, where the author dismisses the idea of a "universal distribution", which is used at MathWorld to give a heuristic derivation (suppose some rule governs this distribution -> apply scale invariance -> derive needed properties); it seems like he misunderstood this.
- the author claims that he is the first to have solved this mystery, but doesn't reference any literature since 1976.
- he claims on a blog (http://www.dsprelated.com/showarticle/55.php) that he tried to publish in journals, but was rejected because mathematicians weren't interested. He then published in a textbook, not even in some sort of paper. His "proof" is really long, uses very elementary mathematics and unnecessary computer programs, and refers back to other parts of his book (so that I don't want to actually try to parse the whole thing and see if I believe it).
After reading the chapter and skimming Hill's paper, I think I can put your suspicions to rest.
Ted Hill's papers on Benford's law are from 1995, and this chapter is from 1997. Wolfram's date is incorrect, although the secondary phenomenon, that data sampled from distributions which are themselves chosen at random follows the logarithmic (Benford) distribution, was proved at that later time.
You misunderstood the 'straw-man'. The author points out, in the conclusion of that paragraph, that all of these pseudo-scientific or grandiose explanations were nonsense.
The 'proof' is really an explanation that the phenomenon lends itself to easier analysis when viewed in terms of Fourier transforms. The computer program is there to show the reader the 'repeatedly divide by ten' action that is implicitly going on when we map from the unbounded domain to a small bounded one.
It's a nice explanation, really, but since the problem was solved years earlier and this analysis doesn't come with a new application, I can see why it wasn't accepted for publication in a pure math journal.
Ok, so let's say you have a data set that consists of items that tend to cap at 2000. Already, over half of all possible values begin with 1 (1, 10-19, 100-199, and 1000-1999: 1,111 of the 2,000).
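A quick check of that arithmetic, counting the leading digits of 1 through 2000:

  awk 'BEGIN {
    for (x = 1; x <= 2000; x++) count[substr(x, 1, 1)]++
    for (d = 1; d <= 9; d++) printf "%d: %4d (%.1f%%)\n", d, count[d], 100 * count[d] / 2000
  }'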
I think 1 is special, in any base, because it's the first digit used when a new digit gets added. If you're talking about quantities that vary easily by, say, thousands, once it crosses the 10,000 threshold the first digit changes much more slowly.
There's more to think about here for sure, though.
Suppose you vary the unit of measure. If the unit of measure is picked from a Benford-compliant distribution (one whose log is uniform modulo 1), then the leading digit of any fixed value you give me, once expressed in that unit, will follow Benford's law.
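A sketch of that claim, with an arbitrary fixed value (42 here, purely for illustration) rescaled by a unit whose log is drawn uniformly; the leading digits of the rescaled value come out on Benford's curve:

  awk -v value=42 'BEGIN {
    srand()
    n = 100000
    for (i = 0; i < n; i++) {
      unit = 10 ^ (6 * rand())          # unit of measure with log-uniform (Benford-compliant) mantissa
      m = log(value * unit) / log(10)   # log10 of the rescaled value
      count[int(10 ^ (m - int(m)))]++   # its leading digit
    }
    for (d = 1; d <= 9; d++) printf "%d: %.1f%%\n", d, 100 * count[d] / n
  }'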
It works in any base; you can extend Benford's law to other bases easily: in base b the predicted share for leading digit d is log_b(1 + 1/d).
Try it on a set of numbers in base 10, then convert them to base 16 and check the percentages you get. They still follow the same pattern, but of course there are more slots and the individual percentages are lower because of that.
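Here's a sketch of that check in base 16, reusing the kind of log-uniform samples from above and comparing against log base 16 of (1 + 1/d); the counts come out close to the base-16 version of the law:

  awk 'BEGIN {
    srand()
    n = 100000
    for (i = 0; i < n; i++) {
      x = int(10 ^ (6 * rand()))               # Benford-ish sample, as in the earlier sketch
      count[substr(sprintf("%x", x), 1, 1)]++  # leading digit of x written in base 16
    }
    for (j = 1; j <= 15; j++) {
      h = sprintf("%x", j)
      printf "%s: %.1f%% (base-16 Benford: %.1f%%)\n", h, 100 * count[h] / n, 100 * log(1 + 1/j) / log(16)
    }
  }'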
For example, the histogram in Fig. 34-2b was generated by taking a large number of samples from a computer random number generator. These particular numbers follow a normal distribution with a mean of five and a standard deviation of three.
The author refers to the distribution of values in the text. That graph is the distribution of leading digits of those values. It doesn't follow that the first digits would also be a normal distribution.
I have a feeling that would actually show a Gaussian distribution, perhaps with a peak in the lower digits (1-10 or so). So, you would see more numbers start with a 4 or 5 than a 1, or it would at least lack the logarithmic distribution that Benford's law has.
Simple trick: come up with "random" numbers, from your own head. Make a list of them. They're not random, but part of the charm here is noticing that even you're affected by Benford's Law.
Now, pair them off, in order, and write down the products: you'll notice a large share of them start with 1!
(of course you'll actively avoid this if you're expecting it)
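A rough simulation of the pairing trick, using uniform random picks as a crude stand-in for numbers people make up; the products lean toward a leading 1 far more than the raw picks do:

  awk 'BEGIN {
    srand()
    n = 100000
    for (i = 0; i < n; i++) {
      a = 1 + rand() * 99              # crude stand-in for a number someone "made up"
      b = 1 + rand() * 99
      m = log(a * b) / log(10)         # log10 of the product
      count[int(10 ^ (m - int(m)))]++  # leading digit of the product
    }
    for (d = 1; d <= 9; d++) printf "%d: %.1f%%\n", d, 100 * count[d] / n
  }'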
Though a fun article, this post serves better as an introduction to the book from which it is an extract (already posted today: http://news.ycombinator.com/item?id=1076122)
There's a Radiolab show (called "Numbers") where they talk about this. Quite cool, though they don't go into the reason behind it, which is a little unsatisfying.
Linear data gets measured on a logarithmic scale (..., 0.1, 1, 10, 100, ...).
If you have ever looked at a logarithmic scale ( http://www.ieer.org/log.gif ), you know that the distance between 1 and 2 is much wider than the distance between 2 and 3, and so on. So it is no surprise that if you feed linear data through a logarithmic (non-linear) filter, the numbers will follow the pattern of the logarithmic scale.
http://falkenblog.blogspot.com/2008/12/benfords-law-catches-...