I did that when I first came to Silicon Valley. I went to work for a company that operated a mainframe data center that was large for its day. (Not IBM; UNIVAC.) Each time the operating system crashed, which it did several times a day, a "panic dump" was produced: a stack of paper about an inch thick, with a summary and stack backtrace at the top, followed by a full listing of the contents of memory.
There were two stacks of these, six feet high, waiting for me.
It took me most of a year to work through the pile, finding out why each crash had occurred by tracking pointers through memory with pencil and colored marker and comparing this with paper listings of the operating system. Then I'd code a fix for that problem, test it (usually around 2 AM, when I could take over a mainframe), and nervously put it into production on one mainframe. Slowly, the piles of crash dumps got shorter and the mean time to failure went from hours to weeks.
There were very few meetings, and nobody interfered. They were just happy to see the crash dump pile shrink and the uptime increase.
After a few years of this, by which time the systems would stay up for months, I got a job in R&D at another company and got out of maintenance programming and into theory.
Neat. For some reason I find it satisfying to read an example like that, where there's real, tangible work being done with immediate and obvious benefits.
The weird thing about tackling these seemingly Goliath-type problems is that getting started takes a lot of patience, and it can be demotivating to see no progress or only a trickle of it. But once you can see the gears starting to crank, it is very rewarding.