Hacker News
Crash-only software: More than meets the eye (2006) (lwn.net)
59 points by zdw on May 4, 2022 | 16 comments


As is tradition when talking about "let it crash" or "crash-only software", Joe Armstrong's 2003 thesis "Making reliable distributed systems in the presence of software errors" [0] is required reading. Sections 4.3-4.4 cover "let it crash".

[0]: https://erlang.org/download/armstrong_thesis_2003.pdf


The first thing I thought about was Erlang and the BEAM VM as well.


RIP


I remember we had an "underdesk" VAX running some proprietary software for scanning and cataloging drill samples in a metallurgy office at an open pit mine in Montana back in the 90s. The thing just worked. Power regularly went off because shovels were being moved (power was cut at the substation to move them). The ground was constantly shaking due to blasting in the pit. It was an extremely dusty environment. The only time I ever had to work on it was when the vendor sent out a new hard drive (they had detected some anomalies over a dial-up connection). Just beautifully functioning hardware and software.


How would you know if it was not working correctly?


I feel like people don't go deep enough into how to write 'crash only software' in these discussions. Like what are the options?

1. write ahead log before you do side effects/idempotent side effects

2. double writes to disk to prevent torn writes

3. checksums to make sure we don't make bad decisions based on bad data

4. redundancy/anti-entropy/other distributed system patterns which attempt to obviate the need to be overly concerned with a single process crashing

5. self-healing patterns when bad data is found

anyone have any other ideas?
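
Items 1 and 3 can be combined in a few lines. Below is a minimal sketch, not a production design: every record is checksummed and fsynced to an append-only log before anything else happens, and recovery replays the log and stops at the first record whose checksum fails (a torn or corrupt tail). The names `append_record` and `recover` and the "checksum, space, JSON" line format are assumptions made up for this illustration.

```python
import json
import os
import zlib


def append_record(path, record):
    """Append one checksummed record and fsync it before any side effect."""
    payload = json.dumps(record, sort_keys=True)
    checksum = zlib.crc32(payload.encode("utf-8"))
    with open(path, "a", encoding="utf-8") as log:
        log.write(f"{checksum:08x} {payload}\n")
        log.flush()
        os.fsync(log.fileno())  # make the record durable before continuing


def recover(path):
    """Rebuild state by replaying the log; stop at the first bad record."""
    state = {}
    if not os.path.exists(path):
        return state
    with open(path, "r", encoding="utf-8", errors="replace") as log:
        for line in log:
            checksum_hex, _, payload = line.rstrip("\n").partition(" ")
            try:
                ok = int(checksum_hex, 16) == zlib.crc32(payload.encode("utf-8"))
            except ValueError:
                ok = False
            if not ok:
                break  # torn or corrupt tail: ignore it and everything after it
            record = json.loads(payload)
            # Setting a key is idempotent, so replaying the log twice is harmless.
            state[record["key"]] = record["value"]
    return state
```

Startup and crash recovery are the same code path here: the process always calls `recover` first, whether the previous run exited cleanly or was killed mid-write.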


> There, exception-handling code is scattered pervasively throughout the system, and much of it cannot be executed in any practical test process.

Systematically testing that software is crash-proof isn’t exactly easier though.

Regarding exceptions, one technique, if you can hook the throw sites, is to increment a counter anytime a throwing condition could occur, and in the tests indiscriminately throw an exception when the counter reaches N, and repeat that for N=1…MAX.
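
A rough sketch of that counter technique, assuming the throw sites can be hooked by an explicit call: `FaultInjector`, `InjectedFault`, `transfer`, and `exhaustively_inject` are hypothetical names invented for this example, and `transfer` stands in for whatever operation is under test.

```python
class InjectedFault(Exception):
    """Raised by the test harness at the N-th potential throw site."""


class FaultInjector:
    def __init__(self, fail_at=None):
        self.fail_at = fail_at  # None disables injection
        self.count = 0

    def maybe_fail(self):
        """Call this at every place where a real exception could occur."""
        self.count += 1
        if self.fail_at is not None and self.count == self.fail_at:
            raise InjectedFault(f"injected at site {self.count}")


def transfer(accounts, src, dst, amount, faults):
    """Hypothetical operation under test; it must stay consistent on failure."""
    faults.maybe_fail()
    new_src = accounts[src] - amount
    faults.maybe_fail()
    new_dst = accounts[dst] + amount
    faults.maybe_fail()
    # Commit both balances together, only after every fallible step has passed.
    accounts.update({src: new_src, dst: new_dst})


def exhaustively_inject(make_state, operation, invariant):
    """Run once cleanly to count the sites (MAX), then fail at N = 1..MAX."""
    probe = FaultInjector()
    operation(make_state(), probe)
    max_sites = probe.count
    for n in range(1, max_sites + 1):
        state = make_state()
        try:
            operation(state, FaultInjector(fail_at=n))
        except InjectedFault:
            pass
        assert invariant(state), f"invariant broken when failing at site {n}"
    return max_sites


sites = exhaustively_inject(
    lambda: {"alice": 100, "bob": 0},
    lambda accounts, faults: transfer(accounts, "alice", "bob", 30, faults),
    lambda accounts: sum(accounts.values()) == 100,
)
```

The clean probe run is what determines MAX, so the loop needs no hand-maintained list of throw sites. An operation that debits one account before a fallible step and credits the other afterwards would trip the invariant assertion at the corresponding N.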


In C++, anyway, the same code -- destructors -- gets run for throws as for normal exits, so almost everything is tested all the time in normal operation.


I read this back then, and still remember it - an interesting piece, making you "think different".

In case someone wants to search for more recent stuff, the author later changed her name: https://en.wikipedia.org/wiki/Valerie_Aurora


A painful data loss on my journaled, HFS-formatted flash drive reminded me that you shouldn't believe the crash-safe promise.

Especially when it crashes during a recovery or when there is encryption involved.


The point is not that crashes are always safe, every time. The point is that if you intentionally crash things regularly, you have to make crashes safer than if they only happen inadvertently.

So more concretely, if system A crashes 1 % of the time, half of crashes might cause corruption. If system B crashes 100 % of the time, the developers of B will have to face the corruption issues and might reduce corruptions to affect only one crash in every thousand.

In the end, system A has a corruption rate of 0.5 %, whereas system B has a corruption rate of only 0.1 % – despite crashing more often.

So the point is changing the incentives around dealing with crashy situations in a way that ultimately results in higher stability. It's saying "It will crash anyway, so we have to deal with it. How can we force people to deal with it? We can crash it always."
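
The numbers in the example above are just the crash probability multiplied by the chance that a crash corrupts data. A tiny sketch, using the made-up figures from this comment (`corruption_rate` is a name invented here):

```python
def corruption_rate(crash_probability, corruption_given_crash):
    """Overall chance of corruption = P(crash) * P(corruption | crash)."""
    return crash_probability * corruption_given_crash


# System A: crashes 1 % of the time, half of those crashes corrupt data.
system_a = corruption_rate(0.01, 0.5)

# System B: crashes every time, but only one crash in a thousand corrupts data.
system_b = corruption_rate(1.0, 0.001)
```

With these inputs `system_a` comes out at about 0.005 (0.5 %) and `system_b` at about 0.001 (0.1 %), so B corrupts data five times less often despite crashing a hundred times more often.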


This is why, when it's urgent, I'm willing to turn off my laptop with Alt-SysRq-s-u-o (that's sync-unmount-poweroff).


Sorry for the lay question, but does this mean you’re unmounting your primary drive and powering off your system, as opposed to the shutdown (graceful) and poweroff (less graceful) shell commands?


It's a way of sending commands to safely power down a Linux (maybe *nix?) system even when it is in an otherwise unresponsive state. I'm pretty sure it won't help with a kernel panic, but if the system is otherwise frozen, these [0] may work.

The current usual advice for a safe shutdown is REISUB, with the mnemonic 'Reboot Even If System Utterly Broken', but the old mnemonic a lot of folks will still know is 'Raising Skinny Elephants Is Utterly Boring'. The old sequence seems to have been replaced because it flushes data to disk before killing off processes, potentially leaving data from still-running processes unflushed.

0 - https://en.m.wikipedia.org/wiki/Magic_SysRq_key


Thanks to the author who took the time to bring up the question that has been bothering me for years, especially whenever poor program logic messed up time-critical processes. And the shutdown process in Linux is very messed up.

For example, I've disabled sleep-on-lid-close, but nevertheless, when I initiate the shutdown and close the laptop, I may come back in an hour, open up the lid, and only then see "Shutting down, bye. Blink." If the battery hasn't died, that is. Turns out the laptop falls "asleep" in the middle of shutdown and, when resurrected, continues to run the few remaining shutdown procedures.

And a less fun but no less weird case: if you happen to mount some SMB shares like "mount -t cifs //..." (even by IP), gods save you from pulling the plugs! Because if you do what I do, press the power button and happily unplug the mouse, audio, and ethernet, you're in a bit of a situation: "Unmounting /mnt/myfreakinhomeserver/"... for 10 minutes or so. And until you plug the cable back in and pray it hooks up as it was (because who knows if the auth still works), Linux won't shut down. Na-ah. Not a chance.

But, unfortunately, one can't just hit CTRL+S (save) on everything and pull the plug. Things need to be saved, states maintained, registers written, telemetry sent (yeah, right :) hello, ms-canonical). And I fully understand that the OSI model works against us in my SMB share case: the FS layer probably doesn't know it's a network share, and the network subsystem may not know that we're shutting down and may drop everything immediately, etc. etc.

Well, what can we do... was what I kept asking myself until today. Thanks for the food for thought! (fixed grammar)


Author has since changed her name to Valerie Aurora, and (last I heard) runs Ada Lovelace Day festivities.



