
I love these topics.

~ 2007, working in a large bioinformatics group with our own very powerful cluster, mainly used for protein folding. Example job: fold every protein from a predicted coding region in a given genome. I was mostly doing graph analysis on metabolic and genetic networks though, and writing everything in Perl.

I had a research deadline coming up in a month, but I was also about to go on a hunting trip and be incommunicado for two weeks. I had to kick off a large job (about 75,000 total tasks), but I figured that, spread over our 8,000-node cluster (GPFS storage, set up for us by IBM), it would be okay. I kicked off the jobs as I walked out the door for the woods.

Except I had been doing all my testing of those jobs locally, and my Perl environment was configured slightly differently on the cluster, so while each node churned through billions of iterations it was writing the same warning to STDOUT, over and over. It filled up the disks everywhere and caused an epic I/O traffic jam that crashed every single long-running protein folding job. The disk-space issues caused some interesting edge cases, and it was basically a few days before the cluster would function properly again without losing data or crashing jobs. The best part was that I was totally unreachable, so no one could vent their ire, and I returned happy and well-rested to an overworked office brimming with fermented ill-will. And I didn't get my own calculations done either, so I missed the deadline.

Lessons learned:

1) PRODUCTION != DEVELOPMENT ever ever ever ever

2) Big jobs should be preceded by small but qualitatively identical test jobs (see the sketch after this list)

3) Don't launch any multi-day builds on a Friday

4) Know what your resource consumption will mean for your colleagues in the best and worst cases

5) Make sure any bad code you've written has been aired out before you go on vacation

6) Don't use Perl when what you really needed was Hadoop
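A minimal sketch of what lesson 2 could look like in practice: run one qualitatively identical task in the real cluster environment, check how much diagnostic output it produces, and only then submit the full batch. The file names, the run_task.pl wrapper, and the qsub call here are illustrative assumptions, not the original poster's code.

    #!/usr/bin/env perl
    # Hypothetical pre-flight check before submitting a large batch.
    use strict;
    use warnings;

    my $task   = './run_task.pl';      # hypothetical single-task wrapper
    my $sample = 'sample_input.dat';   # small but representative input

    # Run one task in the same environment the batch will use,
    # capturing STDOUT and STDERR separately.
    system("$task $sample > smoke.out 2> smoke.err") == 0
        or die "smoke test exited non-zero: $?\n";

    # A single noisy task becomes a disk-filling flood at 75,000 tasks,
    # so refuse to submit the batch if the test already produced warnings.
    my $err_bytes = (-s 'smoke.err') // 0;
    die "smoke test wrote $err_bytes bytes of diagnostics; investigate first\n"
        if $err_bytes > 10_000;

    # Only now hand the full batch to the scheduler (placeholder command).
    system('qsub', 'submit_all_tasks.sh') == 0
        or die "batch submission failed: $?\n";

The point is that the smoke test runs under the cluster's Perl configuration, so an environment-specific warning shows up once in smoke.err instead of billions of times across every node.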



Nice. I once needed to do reciprocal blast for the complete genomes of about 300 bacterial species. That's on the order of half a billion queries, but the work was embarrassingly parallel, and each discrete job only took about 90 seconds. I wrote a little shell script to kick them off on the cluster, and went home.

I woke up the next morning to several inbox screens' worth of messages from angry people I didn't know, demanding explanations for what I did to their jobs and their cluster. I don't think I have ever biked to the lab faster.

After multiple rounds of palm-drenching emails with the cluster sysadmins and the computational mathematics group PI (and my own boss agonizingly cc'ed), we determined the cause. The cluster sysadmins, lacking imagination for the destructive naivete of their users, had not foreseen that anyone would want to submit more than 10^4 jobs at once. That broke the scheduler, preventing other people from running jobs and me from canceling them. Meanwhile the blast jobs blew past the disk quota, leading to a Hellerian impasse where I somehow lacked the space to delete files so I could create space. I still don't fully understand it.

I believe it took a full day to get the cluster back online.
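One common way to avoid breaking a scheduler with sheer job count is to batch many short tasks into a bounded number of submissions. Here is a minimal sketch of that idea in Perl; the 5,000-job cap, the manifest files, the run_chunk.sh script, and the qsub usage are assumptions for illustration, not what actually ran on that cluster.

    #!/usr/bin/env perl
    # Sketch: group many ~90-second work units into a few thousand
    # scheduler jobs, staying under a known job-count limit.
    use strict;
    use warnings;
    use List::Util qw(min);

    my @pairs    = map { "pair_$_" } 1 .. 500_000;  # stand-in for BLAST query pairs
    my $MAX_JOBS = 5_000;                            # stay well under the 10^4 limit
    my $per_job  = int(@pairs / $MAX_JOBS) + 1;      # work units per scheduler job

    for (my $i = 0; $i < @pairs; $i += $per_job) {
        my $last  = min($i + $per_job - 1, $#pairs);
        my @chunk = @pairs[$i .. $last];

        # Write one manifest per job; the job script loops over its
        # manifest instead of the scheduler seeing every short task.
        my $manifest = "chunk_$i.txt";
        open my $fh, '>', $manifest or die "cannot write $manifest: $!";
        print {$fh} "$_\n" for @chunk;
        close $fh;

        system('qsub', 'run_chunk.sh', $manifest) == 0
            or die "qsub failed for $manifest: $?\n";
    }

Each scheduler job then works through its own manifest, so the scheduler sees a few thousand longer jobs rather than hundreds of thousands of 90-second ones, and a per-job disk quota is easier to reason about up front.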


My team did something similar once. We pushed a version to UAT with all the dev debug logging still turned on. It filled up a Solaris disk so badly the SA had to get a bus to the data centre and fix it in person.



