If you are aware of it, you make sure your fake data follows Benford's law. Otherwise, your fake data will be 'random', because people assume that is how it is supposed to look.
Predict and then create a histogram of the leading digits of the file sizes of the non-zero-length files on your computer.
[SPOILER: when I did this once I found sharp peaks around digits that weren't 1. You are likely to see this if you have a large number of files around a particular size, examples: a) whatever your digital camera typically produces; b) whatever size your software encodes a typical song into. These files violate the assumption that you are sampling sizes over a wide range of sizes. After excluding these files I observed Benford's law quite closely on the remainder.]
That spike for 4 is due to the default directory size of 4096 (my experiment included directories as well as files). The information was pulled from 503,444 files and directories.
I had to use a different stat command on my Linux system. This worked for me:
find . -type f -exec stat -c %s {} \; | cut -c 1 | sort | uniq -c
Note that I exclude directories to avoid the size 4096 bias.
I ran it in my "project" directory and found that 38% of my file sizes begin with "1". That directory includes Perl source code files, input data files, and automatically generated output files.
After the digit "1", the percentages for the remaining digits ranged from 3% to 9% with no obvious bias that I could see.
Limiting to just files is a good idea, and if we employ our good friend awk it cuts the time down significantly. This one should work for both OSX and Linux.
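Something along these lines should do it (a sketch; it assumes ls -l reports the byte size in field 5, which holds on both OS X and Linux):

  find . -type f -exec ls -l {} + |
    awk '{ count[substr($5, 1, 1)]++ }
         END { for (d = 1; d <= 9; d++) print d, count[d] + 0 }'

The {} + form batches the filenames into a few ls invocations instead of forking a process per file, which is where most of the speedup comes from.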
Now I'm piping that into Perl to convert the counts to percentages. If I figure out a one-liner for that I'll let you know.
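In the meantime, the same step in awk rather than Perl might look something like this (skipping zero-length files):

  find . -type f -exec ls -l {} + |
    awk '$5 > 0 { count[substr($5, 1, 1)]++; total++ }
         END { for (d = 1; d <= 9; d++) printf "%d: %.1f%%\n", d, 100 * count[d] / total }'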
Next I'll be tempted to write a module for generating "realistic" (Benford-compliant) random numbers using this concise specification from HN contributor "shrughes":
"Data whose logarithm is uniformly distributed does [follow Benford's Law]."
Interesting, but can't this be explained by the distribution of file sizes? There will generally be many small files but fewer and fewer larger files, so there will be more 1k files than 2k files, more 2k than 3k, more files between 10k and 19k than between 20k and 29k, more between 100k and 199k than between 200k and 299k, and so on.
Nice. The "law" got the rough shape of the distribution, but (unless you have a very small filesystem) these numbers are statistically significantly different.
Your files must have been made up! ...or you have a nice demonstration of why people shouldn't read too much precision into Benford's law. "30.1%": 3 significant figures, really?
"Explaining Benford's Law" may have been a more accurate title, but "Why 30.1% of numbers start with 1" brought this neat article to a larger audience. Taken too far, though, that kind of thing can ruin social news sites. Headlines have always been a gray area.
Unfortunately, "Explaining Benford's Law", which is the actual title of the article, is a complete misnomer, as the article doesn't actually explain the first thing about Benford's law. It just restates Benford's law and gives examples. There is no explanation or derivation at all.
The supposed explanation here is not very good. It amounts to this: "To get those first digits, you took the actual numbers and scaled them all by powers of 10 to get values between 1 and 10. That's kinda logarithmic, and Benford's law is kinda logarithmic, so it's no surprise that the results end up obeying Benford's law. All you really need, kinda, is for the probability distribution to span several powers of 10." This is hand-wavy and, not to put too fine a point on it, wrong. For instance, suppose we generate random numbers uniformly distributed between 1 and 1000000; they will not obey Benford's law.
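That counterexample is easy to check numerically; here is a quick simulation (just a sketch), in which every leading digit comes out near 1/9, i.e. about 11.1%, rather than the 30.1% Benford's law assigns to 1:

  awk 'BEGIN {
    srand()
    n = 100000
    for (i = 0; i < n; i++) count[substr(int(1 + rand() * 999999), 1, 1)]++
    for (d = 1; d <= 9; d++) printf "%d: %.1f%%\n", d, 100 * count[d] / n
  }'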
The author also claims that looking at the Fourier transform of the probability distribution is key to understanding what's going on. But the full extent of his Fourier-based analysis is this: Consider the probability distribution function for log_10(data). Then Benford's law holds if this is constant (editorial note: it cannot in fact be constant) and holds roughly if it's roughly constant. That happens, kinda, when the probability distribution is very broad (editorial note: no, not really; see the example above). What, you didn't see anything about Fourier transforms there? Well, that's because the Fourier stuff is really almost all window-dressing.
For a brief account of Benford's law and related matters written by someone with a better grasp of what's going on, you could turn to http://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-... whose author is one of the best mathematicians currently living and also a very good expositor.
Benford's law is not universal in the same sense that the Pareto and Gaussian distributions are not universal: blatantly so, yet people believe in them and treat them as universal anyway, often several different "universal" distributions held by the same people!
Maybe they're not universal, but each is a strong attractor of models given a relatively small number of commonly satisfied assumptions.
In particular, dimensioned measurements must represent the same relationships between data under many different scales, which leads to logarithmic sampling.
Without further information to suggest other trends, these laws are great starting points.
If by "must" you mean "often tend to", and by "relationships" you mean "ratios", then I agree; it's a very useful model in the absence of other information. Wonderful, in fact.
Still, sociologically, working scientists tend to believe in these models as if they were hard rules, to the extent of constructing bridges, rockets, nuclear reactors, nationwide health recommendations and global financial systems without fundamentally understanding why each distribution might arise, and why it might fail to explain real phenomena.
Without taking the time to read the rest of the book, I am suspicious of the author's claims that he has really solved the problem. Consider:
- Wolfram Mathworld references this topic and claims Benford's Law was put on a rigorous footing in 1998 (http://mathworld.wolfram.com/BenfordsLaw.html), but this is not even mentioned in the book.
- the author makes "straw-man" type claims that (unnamed) prominent mathematicians view Benford's law as "paranormal". Also see the last two paragraphs on the first page of the original article, where the author dismisses the idea of a "universal distribution", which is used at MathWorld to give a heuristic derivation (suppose some rule governs this distribution -> apply scale invariance -> derive needed properties); it seems like he misunderstood this.
- the author claims that he is the first to have solved this mystery, but doesn't reference any literature since 1976.
- he claims on a blog (http://www.dsprelated.com/showarticle/55.php) that he tried to publish in journals, but was rejected because mathematicians weren't interested. He then published in a textbook, not even in some sort of paper. His "proof" is really long, uses very elementary mathematics and unnecessary computer programs, and refers back to other parts of his book (so that I don't want to actually try to parse the whole thing and see if I believe it).
After reading the chapter and skimming Hill's paper, I think I can put your suspicions to rest.
Ted Hill's papers on Benford's law are from 1995, and this chapter is from 1997. Wolfram's date is incorrect, although the secondary phenomenon, that data sampled from distributions which are themselves chosen at random follows the logarithmic (Benford) distribution, was proved at that later time.
You misunderstood the 'straw-man'. The author points out, in the conclusion of that paragraph, that all of these pseudo-scientific or grandiose explanations were nonsense.
The 'proof' is really an explanation that the phenomenon lends itself to easier analysis when viewed in terms of Fourier transforms. The computer program is there to show the reader the 'repeatedly divide by ten' action that is implicitly going on when we map from the unbounded domain to a small bounded one.
It's a nice explanation, really, but since the problem was solved years earlier and this analysis doesn't come with a new application, I can see why it wasn't accepted for publication in a pure math journal.
Ok, so let's say you have a data set that consists of items that tend to cap at 2000. Already, over half of all possible values begin with 1 (1, 10-19, 100-199, and 1000-1999: 1,111 of the 2,000).
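A quick check of that arithmetic, counting the leading digits of 1 through 2000:

  awk 'BEGIN {
    for (x = 1; x <= 2000; x++) count[substr(x, 1, 1)]++
    for (d = 1; d <= 9; d++) printf "%d: %4d (%.1f%%)\n", d, count[d], 100 * count[d] / 2000
  }'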
I think 1 is special, in any base, because it's the first digit used when a new digit gets added. If you're talking about quantities that vary easily by, say, thousands, once it crosses the 10,000 threshold the first digit changes much more slowly.
There's more to think about here for sure, though.
Suppose you vary the unit of measure. If the unit of measure is picked from a Benford-compliant distribution (one whose log is uniform modulo 1), then the leading digit of any fixed value you give me, once expressed in that unit, will follow Benford's law.
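A sketch of that claim, with an arbitrary fixed value (42 here, purely for illustration) rescaled by a unit whose log is drawn uniformly; the leading digits of the rescaled value come out on Benford's curve:

  awk -v value=42 'BEGIN {
    srand()
    n = 100000
    for (i = 0; i < n; i++) {
      unit = 10 ^ (6 * rand())          # unit of measure with log-uniform (Benford-compliant) mantissa
      m = log(value * unit) / log(10)   # log10 of the rescaled value
      count[int(10 ^ (m - int(m)))]++   # its leading digit
    }
    for (d = 1; d <= 9; d++) printf "%d: %.1f%%\n", d, 100 * count[d] / n
  }'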
It works in any base; you can extend Benford's law to other bases easily: in base b the predicted share for leading digit d is log_b(1 + 1/d).
Try it on a set of numbers in base 10, then convert them to base 16 and check the percentages you get. They still follow the same pattern, but of course there are more slots and the individual percentages are lower because of that.
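Here's a sketch of that check in base 16, reusing the kind of log-uniform samples from above and comparing against log base 16 of (1 + 1/d); the counts come out close to the base-16 version of the law:

  awk 'BEGIN {
    srand()
    n = 100000
    for (i = 0; i < n; i++) {
      x = int(10 ^ (6 * rand()))               # Benford-ish sample, as in the earlier sketch
      count[substr(sprintf("%x", x), 1, 1)]++  # leading digit of x written in base 16
    }
    for (j = 1; j <= 15; j++) {
      h = sprintf("%x", j)
      printf "%s: %.1f%% (base-16 Benford: %.1f%%)\n", h, 100 * count[h] / n, 100 * log(1 + 1/j) / log(16)
    }
  }'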
For example, the histogram in Fig. 34-2b was generated by taking a large number of samples from a computer random number generator. These particular numbers follow a normal distribution with a mean of five and a standard deviation of three.
The author refers to the distribution of values in the text. That graph is the distribution of leading digits of those values. It doesn't follow that the first digits would also be a normal distribution.
I have a feeling that would actually show a Gaussian distribution, perhaps with a peak in the lower digits (1-10 or so). So, you would see more numbers start with a 4 or 5 than a 1, or it would at least lack the logarithmic distribution that Benford's law has.
Simple trick: come up with "random" numbers, from your own head. Make a list of them. They're not random, but part of the charm here is noticing that even you're affected by Benford's Law.
Now, pair them off, in order, and write down the products: you'll notice a large share of them start with 1!
(of course you'll actively avoid this if you're expecting it)
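A rough simulation of the pairing trick, using uniform random picks as a crude stand-in for numbers people make up; the products lean toward a leading 1 far more than the raw picks do:

  awk 'BEGIN {
    srand()
    n = 100000
    for (i = 0; i < n; i++) {
      a = 1 + rand() * 99              # crude stand-in for a number someone "made up"
      b = 1 + rand() * 99
      m = log(a * b) / log(10)         # log10 of the product
      count[int(10 ^ (m - int(m)))]++  # leading digit of the product
    }
    for (d = 1; d <= 9; d++) printf "%d: %.1f%%\n", d, 100 * count[d] / n
  }'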
Though a fun article, this post serves better as an introduction to the book from which it is an extract (already posted today: http://news.ycombinator.com/item?id=1076122)
There's a Radiolab show (called "Numbers") where they talk about this. Quite cool, though they don't go into the reason behind it, which is a little unsatisfying.
Linear data gets measured on a logarithmic scale (..., 0.1, 1, 10, 100, ...).
If you have ever looked at a logarithmic scale ( http://www.ieer.org/log.gif ), you know that the distance between 1 and 2 is much wider than the distance between 2 and 3, and so on. So it is no surprise that if you feed linear data through a logarithmic (non-linear) filter, the numbers will follow the pattern of the logarithmic scale.
http://falkenblog.blogspot.com/2008/12/benfords-law-catches-...