
I love these topics.

~ 2007, working in a large bioinformatics group with our own very powerful cluster, mainly used for protein folding. Example job: fold every protein from a predicted coding region in a given genome. I was mostly doing graph analysis on metabolic and genetic networks though, and writing everything in Perl.

I had a research deadline coming up in a month, but I was also about to go on a hunting trip and be incommunicado for two weeks. I had to kick off a large job (about 75,000 total tasks), but I figured that, spread over our 8,000-node cluster (GPFS storage, set up for us by IBM), it would be okay. I kicked off the jobs as I walked out the door for the woods.

Except I had been doing all my testing of those jobs locally, and my Perl environment was configured slightly differently on the cluster, so while each node churned through billions of iterations it was writing the same warning to STDOUT, over and over. It filled up the disks everywhere and caused an epic I/O traffic jam that crashed every single long-running protein folding job. The disk-space issues caused some interesting edge cases, and it was basically a few days before the cluster would function properly again without losing data or crashing jobs. The best part was that I was totally unreachable, so no one could vent their ire, and I returned happy and well-rested to an overworked office brimming with fermented ill-will. And I didn't get my own calculations done either, so I missed the deadline.

Lessons learned:

1) PRODUCTION != DEVELOPMENT ever ever ever ever

2) Big jobs should be preceded by small but qualitatively identical test jobs (see the sketch after this list)

3) Don't launch any multi-day builds on a Friday

4) Know what your resource consumption will mean for your colleagues in the best and worst cases

5) Make sure any bad code you've written has been aired out before you go on vacation

6) Don't use Perl when what you really needed was Hadoop
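A minimal sketch of what lesson 2 could look like in practice: run one qualitatively identical task in the real cluster environment, check how much diagnostic output it produces, and only then submit the full batch. The file names, the run_task.pl wrapper, and the qsub call here are illustrative assumptions, not the original poster's code.

    #!/usr/bin/env perl
    # Hypothetical pre-flight check before submitting a large batch.
    use strict;
    use warnings;

    my $task   = './run_task.pl';      # hypothetical single-task wrapper
    my $sample = 'sample_input.dat';   # small but representative input

    # Run one task in the same environment the batch will use,
    # capturing STDOUT and STDERR separately.
    system("$task $sample > smoke.out 2> smoke.err") == 0
        or die "smoke test exited non-zero: $?\n";

    # A single noisy task becomes a disk-filling flood at 75,000 tasks,
    # so refuse to submit the batch if the test already produced warnings.
    my $err_bytes = (-s 'smoke.err') // 0;
    die "smoke test wrote $err_bytes bytes of diagnostics; investigate first\n"
        if $err_bytes > 10_000;

    # Only now hand the full batch to the scheduler (placeholder command).
    system('qsub', 'submit_all_tasks.sh') == 0
        or die "batch submission failed: $?\n";

The point is that the smoke test runs under the cluster's Perl configuration, so an environment-specific warning shows up once in smoke.err instead of billions of times across every node.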



Nice. I once needed to do reciprocal blast for the complete genomes of about 300 bacterial species. That's on the order of half a billion queries, but the work was embarrassingly parallel, and each discrete job only took about 90 seconds. I wrote a little shell script to kick them off on the cluster, and went home.

I woke up the next morning to several inbox screens' worth of messages from angry people I didn't know, demanding explanations for what I did to their jobs and their cluster. I don't think I have ever biked to the lab faster.

After multiple rounds of palm-drenching emails with the cluster sysadmins and the computational mathematics group PI (and my own boss agonizingly cc'ed), we determined the cause. The cluster sysadmins, lacking imagination for the destructive naivete of their users, had not foreseen that anyone would want to submit more than 10^4 jobs at once. That broke the scheduler, preventing other people from running jobs and me from canceling them. Meanwhile the blast jobs blew past the disk quota, leading to a Hellerian impasse where I somehow lacked the space to delete files so I could create space. I still don't fully understand it.

I believe it took a full day to get the cluster back online.
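One common way to avoid breaking a scheduler with sheer job count is to batch many short tasks into a bounded number of submissions. Here is a minimal sketch of that idea in Perl; the 5,000-job cap, the manifest files, the run_chunk.sh script, and the qsub usage are assumptions for illustration, not what actually ran on that cluster.

    #!/usr/bin/env perl
    # Sketch: group many ~90-second work units into a few thousand
    # scheduler jobs, staying under a known job-count limit.
    use strict;
    use warnings;
    use List::Util qw(min);

    my @pairs    = map { "pair_$_" } 1 .. 500_000;  # stand-in for BLAST query pairs
    my $MAX_JOBS = 5_000;                            # stay well under the 10^4 limit
    my $per_job  = int(@pairs / $MAX_JOBS) + 1;      # work units per scheduler job

    for (my $i = 0; $i < @pairs; $i += $per_job) {
        my $last  = min($i + $per_job - 1, $#pairs);
        my @chunk = @pairs[$i .. $last];

        # Write one manifest per job; the job script loops over its
        # manifest instead of the scheduler seeing every short task.
        my $manifest = "chunk_$i.txt";
        open my $fh, '>', $manifest or die "cannot write $manifest: $!";
        print {$fh} "$_\n" for @chunk;
        close $fh;

        system('qsub', 'run_chunk.sh', $manifest) == 0
            or die "qsub failed for $manifest: $?\n";
    }

Each scheduler job then works through its own manifest, so the scheduler sees a few thousand longer jobs rather than hundreds of thousands of 90-second ones, and a per-job disk quota is easier to reason about up front.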


My team did something similar once. We pushed a version to UAT with all the dev debug logging still turned on. It filled up a Solaris disk so badly the SA had to get a bus to the data centre and fix it in person.



