
I have been in discussions about this with one of my friends working in academic materials research. It's amazing how much work is done today by scientists at universities writing code without even the most basic software development tools.

I'm talking about opening their code in Notepad, 'versioning' files by sending around zip files with numbers manually appended to the file name, etc.

This doesn't even begin to scratch the surface of the 'reproducible results' problem. Often, the software I've seen is 'rough', to be kind. Most of the time it's not even possible to get the software running (it's missing some very specific library, or it depends on changes to a dependency that were never distributed), or it's built for one super-specific environment and makes huge assumptions about the system it runs on. This same software produces results that end up being published in journals.

If any of these places had money to spend, I think there could be a valuable business in teaching science types how to better manage their software. It's really unfortunate that, outside of a few core libraries (numpy, etc.), the default is for each researcher to rebuild the components they need.
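Even the most basic git workflow would be a big step up from numbered zip files. A minimal sketch of what that looks like (the file names and messages here are made up for illustration):

```shell
# One-time setup: turn the project directory into a repository
git init analysis-code
cd analysis-code

# First-time setup on a new machine: tell git who you are
git config user.name  "A. Researcher"
git config user.email "researcher@example.edu"

# Instead of emailing analysis_v3_final2.zip around, commit a snapshot
echo "print('results')" > simulate.py
git add simulate.py
git commit -m "Fix boundary condition in simulate.py"

# The commit history replaces the numbered zip files
git log --oneline

# Tag the exact state used for a paper so it can be recovered later
git tag paper-submission
```

The tag is the important part for reproducibility: `git checkout paper-submission` gets you back the exact code that produced the published numbers.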

I'm surprised that only 11% of results are reproducible; that's lower than I'd have expected. I agree we don't want to optimize for reproducibility alone, but clearly there is a problem here that needs to be addressed.




> It's amazing how much work is done today by scientists at universities writing code without even the most basic software development tools.

I agree 100%. I recently quit my PhD, so I still know a lot of people on the front lines of science. One of these friends recently asked me to help with a coding issue, so they gave me an SSH login to their group's server. I logged in and started reading the source.

It was all Fortran, with comments throughout like "C A major bug was present in all versions of this program dated prior to 1993." What bug, and of what significance for past results? Unknowable. As far as I can tell from the comments, the software has been hacked on intermittently by various people of various skill since at least 1985 without ever using source control or even starting a basic CHANGELOG describing the program's evolution. The README is a copy/paste of some old emails about the project. There are no tests.

So even though computer modeling projects should, in theory, be highly reproducible... it often seems like researchers are not taking the necessary steps to know what state their codebase was in at the time certain results were obtained.


This is an entirely different issue from code: code mostly does the same thing when you run it twice. There's no such guarantee in biology. A cancer cell line growing in one lab may behave differently from descendants of those cells in a different lab. This may be due to slight differences in timing between feeding the cells and running the experiments, stochastic responses built into the biology, slight variations between batches of input materials for the cells, mutations accumulating in the genomes as the cell line grows, or even one cell line being mistaken for another.

Reproducibility of software is a truly trivial problem in comparison.


Also, sometimes just doing the experiment is extremely hard. I know a guy who only half-jokingly claims he got his Ph.D. on one brain cell. He spent a couple of years building a setup to measure the electrical activity of neurons, and 'had' one cell for half an hour or so (you stick an electrode in a cell, hope it doesn't die in the process, and then hope your subject animal stays perfectly subdued and that no external vibrations make your electrode move, losing contact with the cell or killing it).

Reproducible? Many people could do it, if they made the effort, but how long it would take is anybody's guess.

Experiments like that require a lot of Fingerspitzengefühl from those performing them. Worse, that doesn't readily translate between labs. For example, an experimental setup in a small lab might force an experimenter into a body posture that makes his hand vibrate less while doing the experiment. If he isn't aware of that advantage, he won't be able to repeat his experiment in a better lab. (I also know guys who jokingly claimed they got their best results with a slight hangover; there may have been some truth to that.)


Oh, I agree. Biological experiment reproducibility is an incredibly hard problem. You are probably right that it is 'trivial' by comparison, in the same way that landing on Mars is trivial compared to reaching Alpha Centauri.



Have you seen: http://matt.might.net/articles/crapl/

"Generally, academic software is stapled together on a tight deadline; an expert user has to coerce it into running; and it's not pretty code. Academic code is about "proof of concept." These rough edges make academics reluctant to release their software. But, that doesn't mean they shouldn't.

Most open source licenses (1) require source and modifications to be shared with binaries, and (2) absolve authors of legal liability.

An open source license for academics has additional needs: (1) it should require that source and modifications used to validate scientific claims be released with those claims; and (2) more importantly, it should absolve authors of shame, embarrassment and ridicule for ugly code."


I think that's what the folks at Software Carpentry [0] are trying to do. I went on one of their courses, and you're taught the basics of writing good software, version control and databases (SQLite). I've frequently recommended it to fellow scientists.

[0] http://software-carpentry.org/


This is great! Thanks for sharing.


Recent article on git and reproducibility in science: http://www.scfbm.org/content/8/1/7

It is badly needed.


That article says "Data are ideal for managing with Git."

At one point I tried using git to manage my data. The problem is, I frequently have thousands of files and gigabytes of data, and git just does not handle that well.[1]

I even tried building a git repo containing just the history of PDB snapshots. The PDB is updated frequently, and I have run into many cases where an analysis of a structure was done in a paper three years ago, but the structure has been updated and changed since then, so the paper made no sense until I thought to look at the history of changes to the structure. Unfortunately, git could not handle this at all: building the repo took days, and the resulting repo was unbearably slow to use.

Git would probably work well for storing the data used by most bench scientists, but for a computational chemist puking up gigabytes of data weekly on a single project, it is sadly horrible for handling the history of your data.

[1] http://osdir.com/ml/git/2009-05/msg00051.html


You might find git-annex useful:

http://git-annex.branchable.com/
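As I understand the docs, git-annex keeps large file content out of git's object store: git tracks only small pointer files (symlinks to content-addressed keys), so history operations stay fast even with gigabytes of data. A rough sketch of the workflow, with made-up file names and a hypothetical remote called `backup-server`:

```shell
# A repo where git tracks metadata and git-annex tracks the
# multi-gigabyte trajectory files themselves
git init md-data
cd md-data
git annex init "workstation"

# Large files go into the annex, not the git object store;
# git records only a symlink to a content-addressed key
git annex add trajectory_run1.dcd
git commit -m "Add run 1 trajectory"

# Content can live on several machines and move on demand
# (assumes a git remote named backup-server has been configured)
git annex copy trajectory_run1.dcd --to backup-server

# Free local space while keeping the file in history; this only
# succeeds once git-annex can verify another copy exists
git annex drop trajectory_run1.dcd
```

That last property is the relevant one here: the full history of which data existed when is preserved in git, without git itself ever having to diff or pack the gigabyte-sized files.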


As someone who, fresh out of high school, coded for a widely published astrophysicist at a major government research institution, I can confirm that I had no idea what I was doing.



