Statistics Done Wrong – The woefully complete guide (refsmmat.com)
331 points by bowyakka on Nov 5, 2013 | 70 comments


Hey everyone, I'm the author of this guide. It's come full circle -- I posted it a week ago in a "what are you working on?" Ask HN post, someone posted it to Metafilter and reddit, and it made its way to Boing Boing and Daily Kos before coming back here.

I'm currently working on expanding the guide to book length, and considering options for publication (self-publishing, commercial publishers, etc.). It seems like a broad spectrum of people find it useful. I'd appreciate any suggestions from the HN crowd.

(A few folks have already emailed me with tips and suggestions. Thanks!)

(Also, I'm sure glad I added that email signup a couple weeks ago)


As a scientist I think you are addressing a very important problem with this book. I've taken two statistics classes, one at the graduate level, and even I am plagued with doubt as to whether the statistics I've used have all been applied and interpreted "correctly". That said, I think the recent spate of "a majority of science publications are wrong" stories is incredible hyperbole. Is it the raw data that is wrong (fabricated)? The main conclusions? One or two minor side points? What if the broad strokes are right but the statistics are sloppy?

People also need to realize that while the Discussion and Conclusion sections of publications may often read like statements of truth, they're usually just a huge lump of spinoff hypotheses in prose form. Despite my frequent frustrations with the ways science could be better, the overall arrow of progress points in the right direction. Science isn't a process where the goal is to ensure that 100% of what gets published is correct, but one whereby previous assertions can be refuted and corrected.

Edit:

To be more specific, I think the statement in your Introduction is overly critical: "The problem isn’t fraud but poor statistical education – poor enough that some scientists conclude that most published research findings are probably false". I would change it to say: "conclude that most published research findings contain (significant) errors", or something along those lines.


I based that statement on John Ioannidis's famous paper, "Why Most Published Research Findings Are False." It's open-access:

http://www.plosmedicine.org/article/info:doi/10.1371/journal...

He's drawn some criticism for the paper, and perhaps things aren't as bad as he makes it seem, but it is true that someone has suggested most findings are false.

I may tone down the Introduction slightly.


If anything, Ioannidis's paper would hugely understate the problem, because he was only looking at the percentage of papers that couldn't be replicated. But just because a result can be replicated doesn't mean that the study is actually correct. In fact, the vast majority of wrong papers are likely very replicable, since most wrong papers are the result of bad methodology (or other process-related issues) rather than fudged data.


There is also a big divide between statistical rigor in "science" research and "medical" research. In their defense, I think it's just extremely difficult for most medical research studies to get the kind of randomization or sample size (N) needed for reliable statistics.

Also regarding John Ioannidis's essay (not paper):

First, he uses the blanket term "research" in his meta-analysis (or at least examples) but his work seems focused primarily on medical research studies. Second, I'm not sure he clearly defines what it means to be "False", or for "most" published research to be "false".

Let's say there is clearly a right and a wrong answer to a question, and up until yesterday, publications A, B and C had concluded the wrong answer. But someone releases a newer, more rigorous finding D that refutes A, B and C conclusively and chooses the correct answer. I wouldn't consider this particular field to be 75% wrong after the publication of D. (Though it could accurately have been described as close to 0% conclusive before D.) For any particular line of inquiry, the quality of research in the area seems like it should be weighted strongly toward the best exemplar of that body of work, and not its average.


"a majority of science publications are wrong"

As a scientist, I think this is probably correct. In my experience, a great majority of publications draw improper statistical conclusions, and I believe many of these are wrong in substance.

"What if the broad strokes are right but the statistics are sloppy?"

Publishing statements as statements of truth when they are improperly or falsely backed up would be better described as politics than science.


I take the view that "a majority of science publications are wrong" is a purposefully misleading and sensationalistic take, even though it may be technically true. IMO, only the leading edge of known science should factor into such studies, and I think that is probably not "mostly wrong". After all, even if there has only been 1 rock-solid publication in favor of a round Earth, preceded by 99 publications in support of a flat Earth, I wouldn't call that field 99% wrong.


If, say, you find that your results are not statistically significant at a reasonable p-level, you don't get to claim your conclusion is right in 'broad strokes', with some 'significant errors' and a bit of 'sloppy statistics', but broadly true, right?

Nope, instead it's not statistically significant, so there's no reason to think it wasn't just random chance that made it 'close' to statistically significant.

But if your results are statistically validated, but only because you used statistics incorrectly, you've crossed the line to broadly-true-but-with-sloppiness?

Nope.

There is room for qualitative research. I like qualitative research. But qualitative research has, rightly, a different sort of impact.

If you are claiming your research is quantitative, then you need to live and die by good statistics. That's the implicit promise of providing the statistics in the first place, otherwise why do statistical calculations at all?


What if someone claims their p=0.05 but their control group wasn't quite as representative as they assumed and their p is really something more like 0.1?

Or what if one of the experiments in a paper is well-designed and well-executed and supports a hypothesis with very high certainty but some of the other experiments were sloppy or botched? Should the conclusions of the entire paper be labeled as "wrong"?

I find it pretty funny when critiques on statistical rigor in science arrive at language with words such as "mostly" and "wrong".


That said, I think the recent spate of "a majority of science publications are wrong" stories is incredible hyperbole. Is it the raw data that is wrong (fabricated)?

This is only a good working assumption of some (open access) journals and of papers (co-)authored exclusively by nationals of some countries. That's a lot of papers.

The main conclusions? One or two minor side points? What if the broad strokes are right but the statistics are sloppy?

If the main conclusions are right but the statistics are sloppy the paper is true, not false.

My confidence in what Ioannidis published went up significantly on learning that epidemiology is mostly bullshit[0] and "Bayer halts nearly two-thirds of its target-validation projects because in-house experimental findings fail to match up with published literature claims, finds a first-of-a-kind analysis on data irreproducibility."[1]

I hope the author of the textbook does not listen to you.

[0]http://lesswrong.com/lw/72f/why_epidemiology_will_not_correc...

[1]http://blogs.nature.com/news/2011/09/reliability_of_new_drug...


To be accurate, the paper should then come to the conclusion that "a majority of published work is inconclusive and fails in the proper application of the statistics used to support its claims" instead of "a majority of science publications are wrong". The second is just pure sensationalist trolling.

I'm not arguing against the work itself, or against more rigorous application of statistics. I'm just arguing against sensationalistic and inflammatory language. Anyone who practices science in a particular field for any length of time will have a pretty good idea of what work is "good" and "bad". Certainly they are smart enough to ignore previous work that has been refuted and/or retracted, and it's not really fair for this previous work to contribute to assessments of what % of the field is "wrong".


I like how you present various ways that you can make mistakes with statistics.

One thing that is missing is a Summary/Checklist chapter that tells you what you SHOULD do in a few common scenarios to avoid all the mistakes presented in the previous chapters. I know it's not that simple, and it depends a lot on how you are testing and what you're actually trying to achieve, but a few examples wouldn't hurt.

For example: I have two sets of measurements, and I want to know:

* is there a statistically significant difference between them?
* if yes, how large is the difference?

A somewhat simplistic way I'd do that is a two-sample t-test for question #1, and computing statistics for the difference between the samples (mean, median, confidence intervals) for question #2. But doing just that, I might've already committed some of the mistakes that your site warns about; for example, I completely disregarded the power of the test.
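
For concreteness, here's a minimal sketch of that workflow in Python (assuming scipy and statsmodels; the data and the target effect size below are invented for illustration, not taken from any real study):

    import numpy as np
    from scipy import stats
    from statsmodels.stats.power import TTestIndPower

    # Two hypothetical sets of measurements; replace with your own data.
    rng = np.random.default_rng(0)
    a = rng.normal(10.0, 2.0, size=40)
    b = rng.normal(11.0, 2.0, size=40)

    # Question 1: is there a statistically significant difference?
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test

    # Question 2: how large is the difference?
    diff = b.mean() - a.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    ci = (diff - 1.96 * se, diff + 1.96 * se)  # approximate 95% confidence interval

    # The easy step to skip: power for the smallest effect you actually care about.
    d = 0.5  # assumed Cohen's d of practical interest
    power = TTestIndPower().power(effect_size=d, nobs1=len(a), alpha=0.05)

    print(t_stat, p_value, diff, ci, power)

Checking the power against the effect size you care about up front (rather than the one you happen to observe) is exactly the step I admitted to skipping above.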

FWIW, I like parts of this book, which focuses on statistics in the domain of computer systems and networks, although it is rather too long: http://perfeval.epfl.ch/


My draft (twice as long as what's online) has a feature like this at the end of each chapter. It needs more work to be useful, but it goes a long way to making the advice in the book actionable.


Is there any way to print the entire book as a document? Do you accept donations?


http://sphinx-doc.org/latest/builders.html

Sphinx can output LaTeX for printing to PDF. We'd just need the source.

BTW, excellent refresher on statistical screw-ups; I had forgotten a quarter of these (and never learned the rest).


Not yet. I intend to get it published so you can buy a printed copy, but that hasn't happened.

I don't accept donations, but sign up with your email and I'll let you know when you can pay me for a copy.


You might want to take a look at Leanpub (https://leanpub.com) if you haven't already. Seems like it may suit you.


Unfortunately, Leanpub exclusively uses Markdown, which doesn't support things like using a BibTeX bibliography for references. I also don't think there's a way to customize the design of the print book, apart from cover images and such.

I'm writing using reStructuredText and Sphinx because they give me a bibliography, an index, full-text search, cross-referencing, custom environments (e.g. boxes for examples, tips, etc.), and all sorts of other cool features. For example, if I need matplotlib plots, I can embed the code in my documents and Sphinx will generate them.

I'd immediately put the book on Leanpub if I could get some of these features. A Leanpub built on a customized Sphinx would be incredibly useful for people writing textbooks.


This is the exact reason I'm moving away from Markdown as my principal note-taking tool towards reStructuredText.


I've been reading this on and off for the last day. One random UX suggestion: don't use pure black for your text; use something close to it (e.g. #333).

There are a million other UX tips that you can probably get from a real expert, but the black one I noticed.


And another suggestion: http://en.wikipedia.org/wiki/Simpson's_paradox

... might be nice to add to your list.


Now if you could just send a copy of the book to every newspaper in the US...

I would donate to that goal.


Awesome, thanks for writing this, and thanks to whoever posted it here.


If I was a billionaire, I would set up some sort of screening lab for scientific/academic/research papers. There would be a statistics division for evaluating the application of statistical methods being used; a replication division for checking that experiments do actually replicate; and a corruption division for investigating suspicious influences on the research. It would be tempting to then generate some sort of credibility rating for each institution based on the papers they're publishing, but that would probably invite too much trouble, so best just to publish the results and leave it at that.

Arguably this would be a greater benefit to humanity than all the millions poured charitably into cancer research etc.


Something like that idea has actually already been the inspiration for at least one startup: MetaMed (http://en.wikipedia.org/wiki/MetaMed, http://nymag.com/health/bestdoctors/2013/metamed-personalize...) does meta-level analysis of the medical literature to determine which treatments seem effective for rare conditions, taking into account the sample size, statistical methodology, funding sources, etc. of each study.

Of course, medicine might be unique as a domain in which individuals are willing to pay vast sums of money to obtain slightly more trustworthy research conclusions, and the profit motive has obvious conflicts with "benefit to humanity" (if someone pays you to research a treatment for their disease, do you post the findings when done? Or hold them privately for the next person with the same problem?). But maybe there are other domains in which the market could support a (non-billionaire's) project for better-validated research.


MetaMed, as far as I know, does basically just customized literature reviews; it isn't doing anything I'd recognize as 'meta-level analysis' like the work done by the Cochrane Collaboration or using meta-analytic techniques to directly estimate the reliability of existing medical treatments or beliefs.


Something along those lines: Elizabeth Iorns's Reproducibility Initiative. [1] And an opinion piece which gives some context. [2]

[1] https://www.scienceexchange.com/reproducibility

[2] http://www.newscientist.com/article/mg21528826.000-is-medica...


Doesn't the Cochrane Collaboration already do something similar?

They perform meta-analyses of studies and discuss the validity of their statistical methods. http://summaries.cochrane.org/

Read about it in the book Bad Science by Ben Goldacre.


SIGMOD (a database research conference) has set up a reproducibility committee [1]. Their goal is to ensure that the results can be reproduced by someone from the outside. If they succeed, you get an additional label for your graphs saying "Approved by the SIGMOD reproducibility committee."

Notably, this is easier in computer science, as you don't need to wait for hundreds of patients with a certain condition to turn up.

[1] http://www.sigmod.org/2012/reproducibility.shtml


I think you'd get very depressed just by the statistics, let alone the reproduction. Especially if you included journals of econometrics.


You know what's even more depressing? When I read the Cochrane summaries of health studies, most of the medicines and treatments are no better than placebos!


Fuck yes, I would love to help do something like that. I'm not a statistician though, so I'm probably not very qualified.


As a graduate student in the life sciences, I was required to take a course on ethical conduct of science. This gave me the tools to find ethical solutions to complex issues like advisor relations, plagiarism, authorship, etc. We were also taught to keep good notes and use ethical data management practices - don't throw out data, use the proper tests, etc. Unfortunately, we weren't really taught how to do statistics "the right way." It seems like this is equally important to ethical conduct of science. Ignorance is no excuse for using bad statistical practices - it's still unethical. By the way, this is at (what is considered to be) one of the best academic institutions in the world.


> Unfortunately, we weren't really taught how to do statistics "the right way."

Learning the right way takes a lot of work; there are a lot of ways to analyse things, each one right or wrong in different situations. (Even teaching something as "simple" as the correct interpretation of a p-value is hard.)


I'm sure it does, but they don't have a problem assigning other required classes, such as a one-hour-a-week communication of science class. One hour a week for a year is enough to cover a lot of material.


One of the many challenges in science is that there is no publication outlet for experiments that just didn't pan out. If you do an experiment and don't find statistical significance, there aren't many journals that want to publish your work. That alone helps contribute to a bias toward publishing results that might have been found by chance. If 20 independent researchers test the same hypothesis, and there is no real effect, 1 might find statistical significance. That 1 researcher will get published. The 19 just move on.
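
A quick back-of-the-envelope check of that claim, using the conventional alpha = 0.05 threshold (my assumption, not a number from any particular study):

    alpha = 0.05
    n_labs = 20
    expected_false_positives = alpha * n_labs   # = 1.0, i.e. about one lab
    p_at_least_one = 1 - (1 - alpha) ** n_labs  # ~= 0.64
    print(expected_false_positives, p_at_least_one)

So even with everyone doing the statistics correctly, publishing only the significant result gives the literature a positive finding roughly two times out of three.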


They are working on it. There was an effort at Oxford to track perinatal trials, started in the 1980s. It looks like universal tracking hasn't happened yet, but various major players (PLOS, Centre for Evidence-Based Medicine) want to expand the brief to cover all clinical trials:

http://www.alltrials.net/about/

http://www.cochrane.org/about-us/history


For many types of clinical trials, pre-registration and publication of results through ClinicalTrials.gov is required by the FDA. I think it's been five or ten years now. Unfortunately, compliance isn't great -- something like 80% of studies registered on ClinicalTrials.gov never have results published there. 20% of registered studies never have results published anywhere.


There's Figshare <http://figshare.com/> to publish negative data.


This is called the "file drawer effect".


Norvig's "Warning Signs in Experimental Design and Interpretation" is also worth reading and covers the higher level problem of bad research and results. Including mentioning bad statistics.

http://norvig.com/experiment-design.html


Quite a few years ago I devised an ambitious method to achieve significance while sitting through another braindead thesis presentation (psychology):

If you are interested in the difference of a metrically scaled quantity between two groups, do the following:

1.) Add 4-5 plausible control variables that you do not document in advance (questionnaire, sex, age...).

2.) Write an R script that helps you do the following: whenever you have tested a person, add that person's result to your dataset and run:

* a t-test
* a u-test
* an ordinal logistic regression over some possible bucket combinations.

3.) Do this over all permutations of the control variables. Have the script ring a loud bell when significance is achieved so that data collection is stopped immediately. An added bonus is that you will likely get a significant result with a small n, which enables you to do a reverse power analysis.

Now you can report that your theoretical research implied a strong effect size, so you chose an appropriately small n, which, as expected, yielded a significant result ;)
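
A minimal Python stand-in for that script (the original R code isn't shown, so this only simulates the "peek after every subject and stop at p < 0.05" part, with two identical groups and no real effect):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_sims, start_n, max_n = 2000, 5, 100
    false_positives = 0

    for _ in range(n_sims):
        a = list(rng.normal(0, 1, start_n))
        b = list(rng.normal(0, 1, start_n))  # same distribution: no real effect
        for _ in range(start_n, max_n):
            if stats.ttest_ind(a, b).pvalue < 0.05:  # peek after every new subject
                false_positives += 1
                break
            a.append(rng.normal(0, 1))
            b.append(rng.normal(0, 1))

    print(false_positives / n_sims)  # well above the nominal 5%

Layer the undocumented control variables and the three different tests on top of that, and the nominal 5% false positive rate becomes pure fiction.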


XKCD did it first:

http://xkcd.com/882/


One thing that constantly saddens me about statistics is that a large amount of energy is expended using it almost correctly to "prove" something that was already the gut feeling. Even unbiased practitioners can be led astray [1], but standards on how not to intentionally lie with statistics are very useful.

[1] http://euri.ca/2012/youre-probably-polluting-your-statistics...


There's no way to tell whether or not that "gut feel" is accurate without proof. Often it's right, but occasionally it's very, very wrong (cancer risk and Bayes' theorem provide a good illustration: http://betterexplained.com/articles/an-intuitive-and-short-e...). Consequently it's still worthwhile proving things even when they're seemingly obvious.
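
For anyone who doesn't click through, here's the shape of that calculation (these particular numbers are invented, not taken from the linked article):

    prior = 0.01           # 1% of people have the disease
    sensitivity = 0.90     # P(positive test | disease)
    false_positive = 0.09  # P(positive test | no disease)

    p_positive = prior * sensitivity + (1 - prior) * false_positive
    p_disease_given_positive = prior * sensitivity / p_positive
    print(p_disease_given_positive)  # ~0.09, where gut feel usually says ~0.9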


I think his point was that people seem to "prove" common sense statistically all of the time - but when doing so make a lot of thoughtless assumptions about representativeness, significance, definitions, etc. stemming from the unspoken assumption of a particular outcome being inevitable.

Or maybe I'm projecting?


He said he's saddened by people using statistics "almost" correctly to prove what was already "gut feel". I can't quite tell whether his real concern is what you thought he was saying (i.e., wasted effort), or whether it's the "almost" (but not quite correctly) part, i.e., that researchers use statistics to wrongly prove the gut feel. If it's the latter, then I think a big part of the problem is that people don't understand statistics well enough, not that they intentionally misuse it.


"Consequently it's still worthwhile proving things even when they're seemly obvious."

Not true. It's all about risk vs. payoff. Some things are low-risk enough that we can go by gut; for others we need more evidence. It's all about tuning for false positives and negatives.

EDIT: Added quote of what I was responding to.


One of the functions of the prior in Bayesian analysis is to incorporate this "gut feel" into your calculations. Given a strong prior belief and weak data (i.e. not much data), your belief will strongly influence the posterior. As you collect more data, your belief will be increasingly overridden by reality.
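
A tiny Beta-Binomial sketch of that point (the prior and the data are both invented): with little data the posterior mean stays near the prior, and with a lot of data it ends up near the observed rate.

    # Strong prior belief that a success rate is around 0.75: Beta(30, 10).
    prior_alpha, prior_beta = 30, 10

    for successes, trials in [(2, 10), (200, 1000)]:  # observed rate = 0.2 in both
        post_alpha = prior_alpha + successes
        post_beta = prior_beta + (trials - successes)
        print(trials, post_alpha / (post_alpha + post_beta))
    # n = 10   -> posterior mean ~0.64 (the prior dominates)
    # n = 1000 -> posterior mean ~0.22 (the data dominate)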


I see the author of this interesting site is active in this thread. You may already know about this, but for onlookers I will mention that Uri Simonsohn and his colleagues

http://opim.wharton.upenn.edu/~uws/

have published a lot of interesting papers advising psychology researchers how to avoid statistical errors (and also how to detect statistical errors, up to and including fraud, by using statistical techniques on published data).


Thanks. I had seen some of his work, but browsing his list of publications I found a few more interesting papers. I've already worked one into my draft.


One way to do statistics less wrong is to move from statistical testing to statistical modelling. This is what we are trying to support with BayesHive at https://bayeshive.com

Other ways of doing this include JAGS (http://mcmc-jags.sourceforge.net/) and Stan (http://mc-stan.org/)

The advantage of statistical modelling is that it makes your assumptions very explicit, and there is more of an emphasis on effect size estimation and less on reaching arbitrary significance thresholds.


BayesHive is very interesting! I couldn't find any details on pricing, though?


We're thinking about it. Everything is free for the moment and we will keep a free tier for most data analysis needs.


I like that he references Huff's "How to lie with statistics" in the first sentence of the intro. That was the book that came to mind when I saw the subject. Also reminds me of the Twain quote, "There are three types of lies: Lies, Damned Lies, and Statistics."

But despite this, statistics done well are very powerful.


With respect to that Twain/Disraeli quote, my friend who is a professor of statistics tells me that he cannot go to a party and say what he does for a living without someone repeating it smirkingly.


Isn't that why the name "Data Science" was invented?


Data science sure sounds sexier...ish.

The irony of people who use the "damned lies and statistics" quote snidely is that the "statistics" part is not referring to the field of Statistics but to the plural noun "statistics", which of course are easily abused. The field of Statistics is all about NOT abusing statistics.


...ish.

Yeah, it always makes me wonder when they have to put the "science" in the name. "Computation" seems so much more timeless and elegant than "computer science", for instance. It's almost like "Democratic Republic" for nations.


You know, lots of fields are named that way. It's just that some of them are less obvious, because both the part of the name identifying the domain and the part saying it's the science (or "study", but with the same intended meaning) are in Latin instead of plain English.


It gets worse with Political Science, or perhaps the broader Social Sciences.


Exactly. If you're doing statistics for yourself, it's good to know the tricks so that you don't fool yourself by mistake. Many times people use statistics to support their positions, rather than to make up their minds. If done right, data science or statistics is about the latter.


What is puzzling to me is that many of the statistical errors showing up in the science literature are well understood. The problem is not all the junk science that is being generated but that the current tools and culture are not readily naming and shaming these awful studies. Just as we have basic standards in other fields, such as GAAP in finance, why can't we have an agreed-upon standard for the collection and analysis of scientific data?


If you want to see truly egregious uses of statistics, take a look at any paper on diet or nutrition. Be prepared to be angry.

At this point, if someone published a study stating that we needed to eat not to die, I'd be skeptical of it.


You might enjoy this:

Schoenfeld, J. D., & Ioannidis, J. P. A. (2013). Is everything we eat associated with cancer? A systematic cookbook review. American Journal of Clinical Nutrition, 97(1), 127–134. doi:10.3945/ajcn.112.047142

They did a review of cookbook ingredients and found that most of them had studies showing they increased your risk of cancer, while also having studies showing they decrease your risk of cancer.

I think bacon was a notable exception -- everyone agreed that it increases your cancer risk.


The greatest problem in statistical analysis is throwing out observations that do not fit the bill. All analyses should be thoroughly documented, with postmortems.


Whenever there is discussion about statistics' role in science (sometimes even going as far as claiming that science is statistics), I always remember this:

http://en.wikipedia.org/wiki/Oil_drop_experiment#Fraud_alleg...


More revealing than the fraud allegations is the next section, which discusses how later results were pulled toward his and other earlier experiments' values for years, delaying our arrival at a more precise measurement. It reads as though it wasn't so much malice as self-doubt that led to the scientists' actions.


That was an excellent read. Thank you. I'll admit I'm often reluctant to read too much into the data I deal with daily (web analytics), as I'm unsure of how to measure its significance accurately. I'm going to dive in and learn more about this.


"Statistical significance does not mean your result has any practical significance."



