Everywhere I've ever worked I've encountered some variation of this. One recent customer had some MySQL problems which took down the sole MediaWiki server holding all their IT documentation. Not only did they not have a standby, but not a single backup or dump had ever been done.
These days what I find even more astonishing is how even experienced professionals make the mistake of thinking they don't need separate backups because their system has some integral disk redundancy (e.g. http://www.joyent.com/joyeurblog/2008/01/22/bingodisk-and-st... ).
Another true story: A while ago I was DBA on a very important database (as in, the company would likely go out of business if it was lost). The way it was backed up was to quiesce the standby, take a snapshot of its filesystems on the SAN, then send those to tape while the standby was reenabled. Sounds good. But what had happened was that the sysadmin, to speed it up, was running several parallel backups, each one of which had a hard-coded list of files.
So as soon as one new datafile was added to the database (I estimate, a week after that regime was put in place) all subsequent backups were invalid. We were in that situation for over a year and everyone believed that the database was being properly backed up. We would even occasionally restore a file (picked from the backupset, hmm) and block-verify it. I only discovered it when I was upgrading the system and I fixed it real quick (dynamically generating n equally-sized lists of files is trivial!). Hard coding the list of files was stupid, sure, and betrayed a fundamental lack of understanding about how the system worked - but it was also stupid that no-one ever checked, and that betrayed a lack of understanding of how organizations work.
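For the curious, a minimal sketch of what I mean by "dynamically generating n equally-sized lists": ask the database for its current datafiles at backup time and greedily split them into n groups of roughly equal total size. The names here (partition_by_size, n_streams) are just illustrative, not from any particular backup tool.

    import os

    def partition_by_size(paths, n_streams):
        """Greedily assign each file to the currently smallest group
        so the n backup streams end up roughly the same size."""
        groups = [[] for _ in range(n_streams)]
        totals = [0] * n_streams
        # Largest files first makes the greedy split balance well.
        for path in sorted(paths, key=os.path.getsize, reverse=True):
            i = totals.index(min(totals))
            groups[i].append(path)
            totals[i] += os.path.getsize(path)
        return groups

    # 'paths' should come from the database's own catalogue of current
    # datafiles at backup time, never from a static list in a script.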
The lesson I took away from that was, you have to have people who have visibility of things end-to-end.
This is why we blacklist databases/tables that are not to be backed up and back up everything else. This makes sure any new database is automatically included in the backups.
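Roughly, the idea looks like this (a sketch assuming a MySQL server with mysql/mysqldump on the path; the blacklist entries and the output path are placeholders):

    import subprocess

    # Databases we deliberately do NOT back up; everything else gets dumped.
    BLACKLIST = {"information_schema", "performance_schema", "sys", "scratch"}

    def list_databases():
        out = subprocess.run(
            ["mysql", "-N", "-B", "-e", "SHOW DATABASES"],
            capture_output=True, text=True, check=True)
        return out.stdout.split()

    for db in list_databases():
        if db in BLACKLIST:
            continue
        with open(f"/backups/{db}.sql", "w") as dump:
            subprocess.run(["mysqldump", "--single-transaction", db],
                           stdout=dump, check=True)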
About 12 years ago, when I was still fairly new to Linux, I screwed up enough to require a reformat of the HD. I discovered afterward that my backups couldn't be read, the last 3 MONTHS' worth of backups. I still don't know what was wrong, I was just making tarballs of my home directory. Always check occasionally to make sure your backups are actually readable.
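Even something as dumb as this, run periodically, would have caught it (a sketch for tarball backups; the path is a placeholder, and it only proves the archive structure is readable, not that the contents are what you want):

    import tarfile

    def backup_is_readable(path):
        try:
            with tarfile.open(path) as tar:
                # getmembers() walks the whole archive, which forces the
                # compressed stream to be read and catches truncation.
                return len(tar.getmembers()) > 0
        except (tarfile.TarError, OSError):
            return False

    print(backup_is_readable("/backups/home-latest.tar.gz"))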
A meta-lesson of many of these lessons was "if you can avoid it, don't work with difficult people at crummy jobs". Lesson #8 in particular smelled of an environment where people were actively seeking to be offended and unhappy (the smell is oh-so-depressingly familiar).
Agreed. A couple of companies ago I knew someone who complained to HR that her manager was bullying her. What happened? Well, he knew she was afraid of heights, but still relocated her desk (along with everyone else in her team) from the 1st to the 5th floor...
Accountability, or confessing when you make a mistake, is the mark of a professional. I didn't learn that lesson right away, but I'm glad I finally did. Your manager or customers might not be happy that you made the mistake, but they'll always respect and appreciate you for being accountable.
Getting caught trying to cover up a big mistake is usually a resume-changing event. As trite as it may sound, the cover-up is usually worse than the crime.
The backup thing is critical... I once had a client whose tape backup drive died. He was out of town, I was busy with other clients and it was a Friday. I said let's deal with it on Monday. Monday comes around and the secretary calls me and says the server is locked up. I walk her through a hard reboot... then hard disk DOA, as in "no bootable drive found", no partition, no files. My heart sank. Even worse, the tape drive that died had not written a usable backup in weeks even though the logs reported no errors. The best tape we had was 2 months old, not much use but better than nothing. Fortunately, the data recovery firm was able to get the data off the old hard disk, but it cost the company about $3K.
Another good rule... if you inherit a new customer you need to "trust but verify" the previous tech's work. I warn my new clients about the added costs of checking and learning a new setup. I learned this one the hard way when a new client had a decent tape backup program in place but I never did a test restore. One day they needed some deleted files restored and I was shocked to learn that the previous tech had inadvertently selected the option to back up the directory structure only! No files in piles of tapes, just folders. Yikes!
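These days the minimal test-restore check I'd automate looks something like this (a sketch for tarball backups; with tapes you'd restore through the tape software first, and the archive path here is a placeholder). It would have caught the "folders only, no files" failure immediately.

    import os
    import tarfile
    import tempfile

    def restore_contains_files(archive_path):
        """Extract into a scratch directory and confirm there are real
        files in it, not just an empty directory tree."""
        with tempfile.TemporaryDirectory() as scratch:
            with tarfile.open(archive_path) as tar:
                tar.extractall(scratch)
            for _dirpath, _dirs, files in os.walk(scratch):
                if files:
                    return True
        return False

    print(restore_contains_files("/backups/client-latest.tar.gz"))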
I have shot myself in the foot on more than one occasion by failing to document a procedure or configuration I was sure I would remember.
Boy does that ever strike a nerve. Mainly in my own code.
I'll write a function with just a little bit of complexity, get it working perfectly, and build all kinds of stuff on top of it. Three months later I need to add something to it, but I don't understand what I wrote.
I'm the biggest stickler on variable naming and self-documenting code and yet, I still keep doing it to myself. Oh well, always room for improvement.
The worst one I did: With this product, the VP of product distributes product keys. I had recently had an HDD crash and reinstalled and had customer-facing deadlines. He was not around and I needed to make some progress, so I fiddled the keycheck code and bypassed it. I finished my work and committed the whole workspace... including the hack... fortunately this was caught in QA.
Now, that is not the worst impact on my life... in my earlier years I made it too much of a point to get my real opinion across.
CNET used to do something sneaky whereby they used domain.com.com on many of their domains to inflate their unique visitor counts. That way, when somebody typed http://mistake.com.com, they'd be counted as a unique visitor to the com.com TLD which hosted all CNET sites. This way, they could tell advertisers that they reached 99% of the Internet via their com.com domain. It was really annoying to type in http://www.cnet.com, and be redirected to http://cnet.com.com...
Apparently, they've stopped doing this. Now, typing in, say, http://mistake.com.com ends up doing a search for "mistake" on CNET's search engine, search.com.