How “let it fail” leads to simpler code (yiming.dev)
199 points by _benj on July 16, 2022 | 163 comments


I was a believer in the "my code should never crash, no matter what" school of thought until I shipped a Dreamcast game with an out-of-date opening cutscene.

It was an in-engine opening cutscene which was very nearly final; the file we shipped was about two or three weeks out of date compared with the version that should have gone on the disc (it had one missing shape key on a character's face at the end of a shot, and a couple of other missing elements). My code was wrangling the whole animation, doing all the stuff which our at-the-time-primitive animation system couldn't do itself (animating texture coordinates, etc.). And my code was just silently handling all the errors it ran into, so we never even noticed that anything was wrong.

The difference was subtle enough that in the twelve years since the game was released, nobody but the original animator has ever noticed and mentioned it to me (and that, years after release). But that one experience, and knowing how much worse it could have been, was enough to convince me that "crash early and crash loudly with as much detail as possible" is by far the better strategy. At least for entertainment products. And doubly so for entertainment products which can't be patched after release.

(for clarity, this screw-up was 100% my fault. The animators had made the final changes to the cutscene data files in plenty of time for inclusion in the final build, I just somehow didn't import the changed data files into the game when I made the matching changes to the code side, and then my code didn't throw any errors to tell me or anyone else on the project that anything was wrong.)


Or even better: crash early and loudly during development, and never crash in the released version.

You don't want your released game to crash in level 11 if the player happens to look behind the wrong lightpole because a texture is missing, but you do want to notice that in development.


I remember back when I messed with D3D (around version 8?), I was surprised to learn that this is how its Debug build mode works. In Debug you can completely screw up your pipeline and scene handling and... everything will work just fine. Or appear to. Switch to a Release build and all those failures are suddenly very obvious - which is almost exactly backwards from what you'd want.


Back in those days when we couldn’t patch games post-release, our team felt it was much too dangerous to change anything for the release, for fear of the code layout changing and exposing some bug which had previously been harmless and undetected by QA, so we would typically leave all of our debugging tools and runtime checks enabled in the final release builds of the game. It was just safer that way.

But with that said, we didn’t generally crash the game due to a missing texture, even during development, as that’s a super common problem which would have impacted development too much; instead, we just drew anything which used a missing texture in max-saturation pink and green instead, alternating between the two colors once per second, to make sure it’d be super visible to anyone looking at the game during development.

We did usually change that behaviour for our final releases, rendering missing textures in alternating black/near-black instead of pink/green, as that was deemed a relatively safe change.


One of the pieces of software I'm most proud of is a service to manage the dynamic part of our infrastructure. It uses control theory and "let it fail" to great effect.

The service reads the state of the system and applies changes to converge toward a configured policy. If it encounters an error, it doesn't try to handle or fix it; it just fails and logs a fine-grained metric, plus a general error metric.

The system fails all the time at this scale, but heals itself pretty quickly. In over 1 year of operation it hasn't caused a single incident, and it has survived all outages.


This is exactly why I think all the discussions about the importance of error handling paths (and the aversion developers have to exceptions) are usually overblown.

The most successful, and common, error handling strategy is to log and abandon the whole operation, cleaning up everything the operation left around. If you have one process per operation, this is often very well captured by doing exit() at the place of the error. If you don't, then exceptions are the best approximation of this pattern - much better than result types or error codes, which litter your code with irrelevant error handling details.
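
A minimal sketch of that pattern in Python (do_the_work and error_response are hypothetical helpers): one top-level handler logs and abandons the whole operation, so nothing further down needs its own error-handling logic.

    import logging

    log = logging.getLogger("worker")

    def handle_request(request):
        try:
            return do_the_work(request)      # happy path only (hypothetical helper)
        except Exception:
            # One place that abandons the whole operation: log it, let any
            # with/finally blocks unwind, and report a generic failure.
            log.exception("request failed, abandoning")
            return error_response(500)       # hypothetical helper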


Funny that you mention cleanup. Our service doesn't clean up in any circumstance.

Originally we had a cleanup operation in case of errors. But then we found that those could fail as well. As it turned out, we needed a catch-all way to clean up resources if anything else fails.

The solution is simple. The main service never cleans up, and a secondary service (I like to call it The Reaper) just cleans up orphaned resources. This keeps both services simpler and more resilient.
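
As a hedged illustration (the infra client API here is hypothetical, not the actual service), the Reaper can be little more than a loop that deletes anything no policy claims:

    import logging
    import time

    log = logging.getLogger("reaper")

    def reap_once(infra):
        """One pass: delete any resource that no current policy owns."""
        owned = infra.ids_required_by_policies()    # hypothetical client call
        for resource in infra.list_resources():     # hypothetical client call
            if resource.id not in owned:
                try:
                    infra.delete(resource.id)        # deletes are idempotent
                except Exception:
                    # No cleanup-of-the-cleanup: log it, the next pass retries.
                    log.exception("failed to reap %s", resource.id)

    def main(infra, interval_seconds=60):
        while True:
            reap_once(infra)
            time.sleep(interval_seconds)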

Of course this pattern works in our particular circumstances. In other domains it might lead to resource leaks and such, so apply your best judgement.


Sure, if you can avoid cleaning up, that's great. The biggest potential problem I know of are client-side TCP connections, which aren't very easy to clean up after the fact (you'd have to know the client port, and craft a RST packet from that port), and tend to have really long timeouts by default.


Garbage collector?


These discussions happen when people don’t give consideration to the fact that reliability is an architectural concern and error handling is part of that.

There’s certainly a minimum of error handling that has to be done in order for code to be considered generally correct, but a lot also depends on the reliability requirements.

Sometimes it’s just inappropriate to abort and this may deeply change the architecture of a program, including by making hard demands on the toolchain, hardware and OS.


Completely agree. Even with "fail fast", there are many levels where this makes sense - for example, you may not want to let an entire server application crash just because one request couldn't be successfully served; but you may still design it to crash and restart for other conditions, rather than trying to recover from more serious errors.


That needs a few conditions to be acceptable:

- an isolated process whose failure doesn’t cascade into other parts’ failures

- as the parent mentioned, where and what failed needs to be super clear

- people are available to react to the error in a timely fashion, so a rerun will succeed

Fail any of the above, and you’ll need extensive and probably complex error handling that can at least help the system work in a degraded state until the error source is handled.

In my experience, it’s really the last part that motivates engineers to try to deal with the maximum number of error cases automatically instead of having to deal with them weekend after weekend.


I believe one of the assumptions is that the failed process can automatically restart (e.g. using systemd, Kubernetes, hypervisor policies, top-level retries) - so that transient errors recover automatically, and at worst cost some performance or tiny bits of lost work (e.g. the setting an end user just hit apply on doesn't get applied, so they have to click again).


Yes, that’s really where it gets funny.

In k8s, if I remember correctly, the default retry policy for failures involves incremental backoff, which means that if your transient error lasts 20 minutes your next retry might be hours away. That’s fine if your system is OK with it; otherwise jobs will need to “succeed” even when they fail, and so handle errors as gracefully as possible.

Same actually for the user input one: you need to tell your user it’s a recoverable error and not just throw a random “oopsy” message, which means at least some handling of the error to come clean at the end of the tunnel.

My take is, errors are complicated. It’s nice when a script can just die at the first error and not care about how or what happens from there, but that’s such a niche case.


Totally agree. Now, I’ve actually found functional programming - specifically, Either<Error, Value> - to be of great use in helping me focus only on the happy path.


Yes, with some kind of monadic-like control flow (either actual monads or even Rust's ? operator), those can also achieve this workflow pretty well.

Edit to add: I still think exceptions are better in practice, as you also get a stack trace when the failure happens, whereas Either and ? don't really help track down the error unless you add code to create a manual "stack trace".


In rust, you can use the excellent anyhow crate https://docs.rs/anyhow/latest/anyhow/ . It has various ways to add context to an error, and will automatically attach a stack trace with the backtrace feature.

Explicit or implicit panic of course also attaches a backtrace. It can also be caught, although that is a can of worms. So panicking is the closest thing Rust has to exceptions - somewhat similar to java.lang.Error on the JVM. https://docs.oracle.com/javase/7/docs/api/java/lang/Error.ht...

With anyhow, error handling in rust really is quite pleasant.


From my experience, much of this comes down to the design of the system: was some state modified prior to an error occurring? If so, can it easily be rolled back? If not, why not?


Yes, please give us more info about using control theory and how one might think about building such a system.


This is how Kubernetes works in many ways. Crash a pod and the control loop inside the ReplicaSet will create a new pod for you. Scaling nodes is based on similar principles of desired vs actual values.


My guess would be add assertions everywhere instead of throwing exceptions.


Why not throw exceptions, and just never use try/catch? That way, all exceptions are uncaught and should terminate the program, in a way that takes advantage of the programming language's native error reporting facilities.


assertion failures terminate the program immediately.

exceptions usually trigger cleanup code.

If your cleanup code is mostly closing files and clearing memory, then it's useless because the OS will do that for your crashed program anyway.


I don't know of a way to test this behavior (I mainly code C++ and unit test with Google Test). One could spawn a process and capture the output and return value, but that sounds a bit heavy for just testing if your error handling still works as intended.


These kinds of systems are not always appropriate, but when they are, they work wonderfully.

Our use case was to build a service to manage the dynamic part of our infrastructure. These are infra pieces that are created/deleted/modified on the fly according to some policies, instead of being defined statically as code. The implementation is simply a Lambda function that runs every minute, loads a policy, compares it to the current state of the system, and then creates/deletes/modifies resources as needed.

I am currently in the process of writing a talk that I will deliver to the rest of my org. This will help me crystallize my thoughts, but here are some pointers on why I think it worked in this case:

* The service is stateless. On each run it just loads a policy, compares it to the current state of the system, and acts accordingly. This avoids handling complicated state or coordinating executions. In theory two policies could contradict each other, but in practice we partition our policies in such a way that overlap is not possible.

* Operations are idempotent. This is one of the reasons the system converges to a desired state. This makes the service resilient to both failures and eventual consistency.

* Deviation from policies doesn't affect correctness. We are fortunate that our system is not directly customer facing. Deviation from policies affects only performance. The system can work for several minutes (if not hours) outside the policy band without consequences. This would probably be a critical blocker for most production systems.
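
A rough sketch of what one such stateless, idempotent sense/act run might look like (hypothetical parse_policy and infra client, not the actual Lambda code):

    def run_once(infra, raw_policy: str) -> None:
        policy = parse_policy(raw_policy)     # hypothetical parser; raises on bad input
        current = infra.current_capacity()    # sense: read the real state of the system
        if current < policy.desired_capacity:
            infra.add_capacity(policy.desired_capacity - current)      # act (idempotent)
        elif current > policy.desired_capacity:
            infra.remove_capacity(current - policy.desired_capacity)   # act (idempotent)
        # No error handling: if anything above raises, this run fails, an error
        # metric gets emitted by the runtime, and the next run converges again.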

Besides control theory and "let it fail", I had the chance to play with other cool concepts while implementing this service. Here are some of them:

* Parse, don't validate/anti-corruption layer: The service downloads and parses a policy at the beginning of each run. If parsing fails, it errors out. Otherwise, it passes the policy object to the rest of the execution. This makes the system easy to test, and avoids the anti-pattern of peppering your code with instructions reading input, just to find mid-execution that the policy was invalid.

* Pluggable policies: The main body of the service is a very simple sense/act loop. For the actors, we use a strategy pattern, where policies can choose what strategy to use. This approach has helped us to introduce new behavior with minimal code changes.

* Typescript as configuration language: This service replaces an older, less flexible one. A major pain of the old service is that policies were defined as Jinja templates over plain text files. This became unmaintainable as the number and complexity of policies grew. Our new service defines policies in Typescript. Policies are statically typed, and we use regular programming constructs (functions, loops, variables...) to build them at compile time. The output is still a plain JSON file.
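
As a hedged sketch of how the "parse, don't validate" and pluggable-policy points above can fit together (the policy shape and names are hypothetical):

    from dataclasses import dataclass
    from typing import Protocol

    class ScalingStrategy(Protocol):
        def desired_capacity(self, current_load: float) -> int: ...

    @dataclass(frozen=True)
    class StepScaling:
        step: int
        def desired_capacity(self, current_load: float) -> int:
            return self.step * max(1, round(current_load))

    @dataclass(frozen=True)
    class Policy:
        name: str
        strategy: ScalingStrategy    # strategy is chosen when the policy is parsed

    def parse_policy(raw: dict) -> Policy:
        # Fail here, at the boundary, if the input is invalid; the rest of the
        # run only ever sees a well-formed Policy object.
        if "name" not in raw or "step" not in raw:
            raise ValueError(f"invalid policy: {raw!r}")
        return Policy(name=raw["name"], strategy=StepScaling(step=int(raw["step"])))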

Hope that helps.


This approach is very similar to how Kubernetes custom resource reconciliation works (and Kubernetes in general, but custom resources are the way you can bring your own logic there).

In Kubernetes you can define your own types, Custom Resources (basically JSONs with schema) and deploy "operators" - services that should handle these new types. Every time you create or modify your custom resource, the operator is triggered and it should "reconcile" your resource.

Now this reconciliation process is stateless. It doesn't know what exactly changed in your resource, so it should just go through the list of all the things that it needs to do (create or remove pods, services, configmaps, etc.) and if something is not right (e.g. a missing service), try to make it right or fail. In any case, the output should be written in the custom resource's .status section.

There's no active waiting - if the operator sees that some other resource is not ready yet (a required pod is still starting), it should just mark your resource as not ready and finish. If the pod state changes, the next reconciliation will notice it. It should do as much as it can to bring reality to the expectation, but not more.

If implemented correctly, this is surprisingly resilient. The idempotent nature of the reconciliation loop makes it perfect for error handling. For instance, your reconciliation may fail because some pod is not running correctly. It's nothing that your operator can fix. But if the pod auto-heals (maybe the network connectivity was restored or an external service is available again), the operator will auto-heal as well, without manual intervention. The next reconciliation loop will just see the pod is available again and carry on.


this is the type of thing I'd love to see code / a post about implementing


This is just kubernetes right? Declarative desired state model. Containers created and destroyed to get there. Crashes happen, metrics are incremented, load balancers route around the crashing pod until they recover (or are replaced), etc.


> fails and logs

What if your logging code was written with the same philosophy?


In another comment I mention that our service has some properties which make this a great solution. It doesn't always work, but when it does, it's awesome.

To your point, Erlang has the concept of supervisor trees and handling of errors. Similarly, our supervisor is the Lambda runtime. If everything goes wrong, the runtime will emit an error metric.

But what if Lambda itself fails? Or Cloudwatch? or any other supporting service? of course this is possible. But probably at that point the system is so fucked up that the fate of our little service is of little relevance. At least we know from past outages that when normal operations resume, the system will correct itself.


A corollary or generalized interpretation of this approach (and someone please specify if there’s a formal term for this) is: “fail locally, and immediately.”

What I mean is that once something unexpected happens your code should ideally fail in that step itself.

The simplest, most common example I’ve seen with Python programmers is when they pass around dicts as arguments in complex code bases. Methods expect various keys to be present, and often methods also have fail-safe defaults if some keys are absent. The defaults are written for the specification, sure, but often they also tolerate unexpected exceptions that happened upstream.

Now when an unexpected exception happens, your program fails somewhere else and the stacktrace is useless. The only way to figure out what went wrong is to debug it line by line.

With Python there’s still no elegant solution. I’m now trying to ensure all my methods are typed, and I use dataclasses and pydantic classes to type and group these parameters, but there are still opportunities for these “fail later” errors. Solutions and suggestions would be appreciated!


>Solutions and suggestions would be appreciated!

Ban the usage of default values or default parameters anywhere outside of top-level / public facing functions. Plus assert everything all the time.

I've gotten into arguments with other developers over it but I'll take the inconvenience in developing now over tearing hair out over bugs later, anytime.


>Plus assert everything all the time.

This is where static checking comes in. Static tests should fail if it's assumed (and not asserted) that a key exists.


dicts are just a little too easy to use. You just smear it down, pass it around, and you're in business. If you really want to shoot yourself in the foot, also modify its structure here and there along the way, it's just so convenient. Who needs all that hassle of declaring a data class for each little thing?

It took me a little too long to realize that a data class represents a contract about the structure of your data, meaning that no matter how many calls deep you are passing it around, you will always know its structure without having to trace it back to the origin, and that's a powerful thing.
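
For instance, a minimal sketch of that contract (hypothetical fields) versus an anonymous dict:

    from dataclasses import dataclass

    # Instead of passing {"user_id": ..., "email": ..., "retries": ...} around:
    @dataclass(frozen=True)
    class NotificationJob:
        user_id: int
        email: str
        retries: int = 0    # the default lives in exactly one place

    def send(job: NotificationJob) -> None:
        # Ten calls deep, the structure is still guaranteed: a missing field
        # fails loudly at construction time, not somewhere downstream.
        ...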


I worked on a team once where a couple of co-workers were doing this and more in what is possibly the worst Python codebase I’ve had the misfortune of seeing.

Highlights included:

* DIY “json” logging function that did some obscene string concat work every time it was called and abused global vars; it also didn’t output valid JSON. Suggestions to just use a normal logging library were aggressively disregarded.

* dozens of functions, all of which indirectly mutated this extremely nested dictionary of data. They all had slightly different names, and none of them took this dictionary as a parameter, they just abused global vars. All of them would do these insane checks to ensure that the specific keys they were looking for existed

* none of it was in git properly; the 2 data engineers writing it passed the code back and forth using a google drive.

* instead of importing functions from the Python files they wrote, they’d invoke the functions by shelling out, calling Python <other file.py>, string-interpolating the values, and then waiting for completion by watching for a file of a specific name to be written to the file system.

Oh yeah and when they decided they wanted parallelism, instead of doing the sane thing and using something like joblib or multiprocessing to make stuff easy, they’d just shell out and invoke more Python processes via xargs…


Coming back from TypeScript to Python, I found that most recent (3.10+) typing annotation shorthands are pretty succinct, and running mypy at all times really helps cut down on runtime errors.

My recipe:

— annotate variables, attributes, arguments and return values;

— run a good type linter (we use mypy) at all times;

— never pass around generic dictionaries: use dataclasses[0], TypedDicts, etc. instead.

That way you define a subclass inheriting from, say, TypedDict and declare that your function only takes that subclass. After that, you’ll get a loud error if you pass any dictionary that doesn’t match the spec (missing keys, wrong values, etc.)—ideally, right in your IDE.

(To reiterate, this would be a pointless exercise if you don’t lint all the time; most IDEs support this.)

[0] You can additionally use them with Pydantic, which can validate data at runtime at a cost of some performance overhead.
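
For example, a hedged sketch of that recipe with a TypedDict (hypothetical fields) that mypy checks at every call site:

    from typing import TypedDict

    class UserPayload(TypedDict):
        id: int
        email: str

    def welcome(user: UserPayload) -> str:
        return f"Hello {user['email']}"    # a typo in the key here is a mypy error

    welcome({"id": 1, "email": "a@example.com"})   # OK
    # welcome({"id": 1})   # mypy error: missing "email" (and a KeyError at runtime)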


>Solutions and suggestions would be appreciated!

Use a language with strong typing?


Yeah, that one is indeed obvious. That's what everyone advocating for static typing (which is what you mean, as somebody already commented) has been shouting all along.

The problems coming from dynamic typing are just unnecessary, and the only thing that made Python palatable for me again was mypy, which introduces static typing into Python. But it's not the default, so it still only really works for small things.


IMO mypy is severely lacking compared to what TypeScript added to JavaScript. I'd always pick TS over Python given the choice.


We use pytype as well, but nothing works at runtime, right?

I implemented a “check_type” decorator I use in many places but am starting to think I should decorate all the methods in my code with this decorator while building it or something.
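
A hedged sketch of what such a decorator might look like (it only checks plain, non-generic annotations):

    import functools
    from typing import get_type_hints

    def check_types(func):
        """Raise TypeError at the call site when an argument doesn't match its annotation."""
        hints = get_type_hints(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            passed = dict(zip(func.__code__.co_varnames, args), **kwargs)
            for name, value in passed.items():
                expected = hints.get(name)
                # Only plain classes are checked; generics like list[int] are skipped.
                if isinstance(expected, type) and not isinstance(value, expected):
                    raise TypeError(f"{name} should be {expected.__name__}, got {type(value).__name__}")
            return func(*args, **kwargs)
        return wrapper

    @check_types
    def greet(name: str, times: int) -> str:
        return name * times

    greet("hi", 3)      # fine
    # greet("hi", "3")  # TypeError here, not three stack frames later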


Python is strongly typed. You want statically typed. (Instead of duck typed / dynamically typed)


Can you guess what this code does?

    class foo:
        pass
    
    obj = foo()
    obj.bar = "I thought Python was strongly typed?"
    print(obj.bar)
And even better:

    class foo:
        a = 42
    obj = foo()
    print(obj.a)
    del foo.a
    print(obj.a)
Whatever your opinion on what the imprecise sentence "strongly typed language" should mean, these are definitely not features of one.


Yes, I can guess what the code does. But can you guess what this code will do? 1 + "1"

Contrast Python (a strongly typed language):

    >>> 1 + "1"

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      TypeError: unsupported operand type(s) for +: 'int' and 'str'

    >>> [] + 1
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      TypeError: can only concatenate list (not "int") to list

With Javascript (a weakly typed language):

    1 + "1"
    "11"

    [] + 1
    "1"


> With Javascript (a weakly typed language):

I'm always wary of these, because you can define that as strongly typed if it's the operation which is defined to perform the conversion internally, which IIRC is how it works in javascript.

For instance the first example will do the exact same thing in Java, because addition between a string and a non-string is defined as converting the non-string to a string then concatenating.

The second operation is not defined like that in Java, but in theory you could have a universal toNumber protocol and define the addition of a non-number and a number as converting the non-number to a number and then adding.


I'm wary too.

Stand upon a language and look down. You get to raw physics as you go down. All the abstractions are a useful reconception, not the reality. Stand upon the language and look up. You see all the unrealized programs that can be built atop it. Focus in on the programs of a particular type: those that implement a language within the language. Spot one in particular - the one that uses say `@property` `isinstance` and `raise` and `TypeError` to always preserve type safety in every situation a person cares about.

So what I am concerned about is the behavior of the finite set of elements provided by the language and their properties. I can make claims about these concrete things - the addition operator in one language rejects by type but in another it doesn't. But I'm quickly overwhelmed by infinities when I try to do more.


> which IIRC is how it works in javascript.

Yes, when objects are involved it's internally translated to:

    ([]).toString() + 1
This can be shown by changing the default implementation:

    > Array.prototype.toString = function() { return 'Boo!'; }
    > [] + 1;
    "Boo!1" 
Changing the prototype for Number doesn't work so I assume there's something slightly different going on there.


> Changing the prototype for Number doesn't work so I assume there's something slightly different going on there.

The answer is that addition first checks if either operand has a "primitive value" which is string-typed, if so it's a string concatenation, otherwise it's a numerical addition, at which point it converts both operands to numbers and adds them.

The primitive value of a `Number` is a `number`, so changing `Number.prototype.toString` has no effect (it's not even called). However if you set `Number.prototype[Symbol.toPrimitive]` then you can influence the rest of the process. Still won't affect an addition of primitive `number` values but:

    > Number.prototype[Symbol.toPrimitive] = function(hint) { return String(this.valueOf()) }
    > new Number(4) + 2
    < "42"
    > 4 + new Number(2)
    < "42"
[numeric binops]: https://262.ecma-international.org/13.0/#sec-applystringornu...

[numeric conversion]: https://262.ecma-international.org/13.0/#sec-tonumeric


So your bar for strong typing is that some type conversions are not made implicitly. That's a pretty low bar.


I consulted Wikipedia and other sources and placed the term they used to categorize these languages beside the language name; it wasn't my bar - I use the collection of symbols I use in the positions I use because others chose to do so. I shared the trivial creation of type errors in Python to help you notice why others oppose your definition and wish you to adopt more precise terms. I don't disagree with you that there is imprecision - comparative language is always with respect to a reference and without a reference specification it is meaningless. The deeper point is that you can't complain about imprecision while failing to do contrast. It would be like complaining that weight 100 doesn't tell you whether someone has a weight. Of course it doesn't. Units needs to be included for measures for them to meaningful. That doesn't mean the concept of units is unsound. It means you need to specify weight more thoroughly to avoid ambiguity.

Really think about what you said and it compiles to something akin to:

   type_strength(sample(python_programs, size=1, heuristic=representative_of_claim)) < undefined
In contrast my statement compiles to something like:

   type_strength(sample(python_program, size=2, heuristic=representative_of_claim_simple_programs_first)) > type_strength(sample(javascript_program, size=2, heuristic=representative_of_claim_simple_programs_first))
There are a bunch of problems with my approach. We can pick it apart endlessly. It is positivism. It is low sample size. The heuristic is biased. That I reported it was subject to bias. `type_strength` isn't well defined. Neither is `>`. It is terrible in so many respects.


> 1 + "1"

Did someone say PHP?


PHP has a strict mode.


JS's typing is weaker than Python's, but Python still has a relatively weak type system - particularly in old-style classes as GP was showing (class foo(object)-style classes fix those problems, or at least some of them) EDIT: I was wrong about new-style classes fixing this. Other dynamic languages are stronger than both - for example Common Lisp.


There are no old-style classes in Python 3.


Oops, you're right, and the problem isn't fixed by new-style classes. I must have misremembered something.


No need to guess, IDE is flashing bright red and mypy screaming main.py:5: error: "foo" has no attribute "bar". One could still say fuckit and run it anyway, but why would you take that risk. This would never get through to production.


It's not your IDE's features that determine if the language is strongly typed or not.


Well, it's not very strongly typed if you use dicts with default values for everything.


FWIW “strong” typing is a colloquial term and lacks any precise meaning.


In general, I think using real classes with a single central definition instead of raw manually created dicts is the solution.

A thorough test suite is also needed, of course.


As someone working with an extremely large Python codebase, early on we made the call to never allow dictionaries as arguments to functions (with exceptions for if the dictionary is truly arbitrary and only gets logged/persisted for human reading). We rely heavily on type annotations and dataclasses. Type system weaknesses aside, the system is rather maintainable despite its size, complexity, and domain.


I've seen a lot of new developers shocked by this approach, which surprises me a little. They seem to think that it's up to the application to handle all errors, even those of the programmer(s). This, of course, is unreasonable since it would essentially require knowing all the bugs in advance. :-)


I'm a big fan of the "crash early" strategy. I write in Swift primarily, and if I suspect a state is impossible to reach, I'll add a fatalError() so that in development, if it turns out I'm wrong, I spot it right away. (Something I learned from another dev I worked with, who was very productive.)

Unfortunately, a lot of other devs hate to see that your code may actually crash and start asking questions about what scenario could cause it, and whether maybe there's a more gentle way to get out of the error. So, I'll often back down and start having softer error handling, but on the whole it does complicate things further as the errors cascade, and now you have to reason about handling combinations of errors that have a low likelihood of happening. So, to me, just having an early crash is way better.


Yeah. My pet peeve is `guard let … else { return }` instead of force-unwrapping. Like, why do people think that silently swallowing an error is better than crashing loudly and clearly?


Same here... especially in server-based code it makes no sense to not fail early, even on the slightest issues. If you have proper logging / notifications, your code will be more robust.

Had to deal with the same issue as you.. other devs and managers don't like those errors.. but it makes things fragile and more difficult to troubleshoot.


> if I suspect a state is impossible to reach, I'll add a fatalError()

Does Swift have assert statements? If so, is there a reason you chose this method instead?


Yes, Swift has assert statements. I tend to use them a lot as well, but in shipping code, they don't terminate the app. There are still some places where I'd prefer to terminate the app early rather than continue on.

To be clear, I tend to use assert statements more than fatalErrors.


It's a common mistake in code written by junior developers to only code the happy path. It leads to a very brittle system. A good example is a web application that needs a websocket open. What happens if you run such an application on a mobile phone and you temporarily lose connectivity and this happens multiple times as people walk around town because real world connectivity just isn't perfect? And also, they put their phone in their pocket and it goes to sleep. These are not user errors but expected, normal behavior.

Basically the happy path is that this simply never happens. You open a websocket and listen for incoming messages and process them. The actual situation is that you open a websocket and some time later it dies, and then you simply attempt to reopen it until it succeeds and resume processing messages. The app has several states: connected, connecting, and not connected, and it should transition from one to the other depending on what happens.
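
A hedged sketch of that state machine as a reconnect loop (Python-flavoured, with a hypothetical open_websocket helper; the same shape applies to a browser client):

    import asyncio
    import logging

    log = logging.getLogger("ws")

    async def run_connection(url: str, handle_message) -> None:
        """Cycle through connecting -> connected -> not connected, forever."""
        backoff = 1
        while True:
            try:
                ws = await open_websocket(url)       # hypothetical connect helper
                backoff = 1                           # connected: reset the backoff
                async for message in ws:              # process until the socket dies
                    handle_message(message)
            except ConnectionError:
                # Expected, not exceptional: radio dropped, phone went to sleep, etc.
                log.info("connection lost, retrying in %ss", backoff)
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 60)            # back off, but keep trying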

Our frontend people struggled a lot with this exact issue. They only thought of the happy path and simply ignored any form of expected failure. So the first version of the app worked great for a while until it just stopped working. The fix: "just reload the app" was of course not really acceptable. All that was needed was a little defensive coding: assume this call will sometimes fail and simply try again when that happens. Then also handle the case where retrying will also fail because actually the request is wrong (input validation) and the error is the system telling you that it is wrong. If you don't have any code that handles that, you are going to have a very flaky UX.


I was on a team for a short while (Java programmers) and their frontend code was really overly "careful". For example, they would always check if a method existed, before calling it.

    var o = new SomeObject();

    if (o.computeSomething != null && o.computeSomething != undefined) {
       o.computeSomething(...);
    }
Their reasoning was that in JavaScript (with the old syntax) you just add functions to the prototype, so you could forget to do it or mistype it.

    SomeObject.prototype.computeSomethinnn = function () ...
I was sort of tripping over myself in objections to what they were doing:

* you shouldn't check for null or undefined, but rather do `o.computeSomething instanceof Function`

* there's no need to do `!= null` and `!= undefined` because `!=` (as opposed to `!==`) actually checks for both

* you shouldn't do the check at all because if you actually mistype the function name all you're doing is hiding the error. Failing sooner is better.

* a missing method should be picked up in the unit tests (but they didn't have any tests at all because "our system is too complex to be tested automatically")

* probably some others...

That team really hated JavaScript and their code showed it.

BTW, the indentation above is not wrong... they did indent by 3 spaces. I read a story about 3 space indents on thedailywtf.com and thought that it was clearly made up... after this team I believe it.


I set tabwidth to 3 in my editor. I like the way it looks. But of course the whole point of tabs is that nobody else has to suffer for my esoteric choice.

It also helped when tutoring new python students -- when they mixed space-indented code they copied from the internet with tab-indented code they copied from the internet, they'd get all sorts of fun errors. Setting the tabwidth to an even number sometimes allows them to hide. 3, though, really makes them stick out.


Sounds like hilarious passive aggressive behavior from java developers forced to interact with that devil's language JavaScript against their will.


It sounds like a pretty Java thing to do considering the prevalence of `null` and null-checks in the language. It's always interesting to see the habits that programmers bring from their main language(s) to ones they're picking up, especially when they're under pressure to deliver so they can't learn to program idiomatically.


Picked up by unit tests? How about some kind of system which can tell you if the method exists or not, and even possibly correct your typos, before you run the code!


Completely fair, but that's often not built into dynamic languages. My main criticism is with their nonsensical approach to dealing with the limitations of JavaScript.

As far as I know, ESLint can't detect missing methods, and they weren't even using a linter. TypeScript can, but they weren't using that.


they did indent by 3 spaces

Probably a compromise between 2 and 4?


I suggested exactly that as a joke over a decade ago, then decided to try it out. Ended up I really liked it, and still use it for all my personal code.

I do stick to 4 spaces at work though.


Well, there's software that can cause some degree of harm. For example through servos controlling something physical. While you still probably can't catch all of the issues, you damn better try as hard as you can within reason.

I'd also wish for similar rigor from people developing whatever filesystems my data is on. :-)

Fail fast is generally a good idea, if you can do it safely.


If you can't fail safely, you better review your entire architecture.

Software fails. You can make failures rarer, but you can't make them go away. You have to deal with it; it's not optional.


It's all really about risk management. Things can (and will) go wrong, and it doesn't only apply to software.

This involves a lot of thinking and collecting information about potential risks and evaluating their probability and severity.

Then you just mitigate the worst risks, probability times severity (other factors are also possible). Some residual risk always remains.


I think the idea is that there are error recovery semantics that:

1. Determine the last sane state of the system, and work forward from there. (Read the servo position and try to go from there)

2. Have a "recovery" routine to reset the system. (Take all positions to "zero")

3. Just stop. (Yes, I know this can be bad). And ask a human for help.


If feasible, electromechanical methods are good.


> I'd also wish for similar rigor from people developing whatever filesystens my data is on. :-)

Stable storage is a key factor in making this philosophy work. [1]

[1] https://qconlondon.com/london-2012/qconlondon.com/dl/qcon-lo...


"Litter the code with aborts and test the ever-loving hell out of it" is more or less the strategy we use with flight software.


I don't agree with this approach. Say you have a network service that relies on other network services. It is not difficult to write those such that they know to back off / retry when something disappears.

It's extremely useful in a lot of situations: if you do work on a laptop that gets regularly unplugged, having running test services that know to reconnect makes your life easier. In production, having things automatically reconnect means a lot less restarting of services once whatever root cause problem is corrected. Just shrugging and giving up ends up being a lot more work in the end.

I like to tell junior developers to catch everything they can, and handle it or die as nicely as possible. Of course you can't plan for everything, but you can write around network and disk issues and issue warnings in a way that makes the root cause more obvious. That involves catching errors.


What you're describing are "known" states; the idea behind "let it fail" is that you shouldn't write code that exhaustively handles every single potential outcome, just the ones that are part of your code's path in general use.

Definitely write code to handle network issues. Don't write code to handle random bitflips, garbage coming back from the service you're connecting to, or OOM errors. Just let those fail.

Do not catch everything you can. That's the whole point of "let it fail". An app crashing is totally fine and expected behavior, in a lot of cases (of course if it's not fine, e.g. someone dies, don't do that but if you're working on that kind of software and taking advice from me, you're super duper screwed).


Adding to that, even stuff like OOM errors _can_ be known states. It's not unreasonable for stuff like "one database per machine" to be able to adapt to the available memory. The point of "let it fail" is _just_ to drop the outcomes that aren't part of your code's path in general use.


> Don't write code to handle random bitflips

It depends what you're doing. There's no fixed threshold for "errors that you should handle" so smackeyacky is right - handle the errors you can (but don't spend an inordinate amount of time handling very unlikely errors).

Bitflips are not very unlikely on huge systems so you need to handle them.

In my experience trying to distinguish between "expected" or "normal" errors and "unexpected" or "exceptional" errors is pointless and impossible. It's better to think about the likelihood of errors.


Handle errors you should.

Don’t handle errors that aren’t relevant to your specific code.


One of the best protocol families is the email protocols! They say: call back later, the line is down, give me an hour. Eventual consistency is better than failed calls.

When my email went down last week, I wasn't worried. Eventually all messages would arrive.

I like that system.


I'd imagine retrying/reconnecting is compatible with the general "let it fail" approach. If you just sent a message/request to an actor/server and it still hasn't responded after 5 seconds, you can send another.

It wouldn't matter whether that actor/server died from a regular error or a "let it fail" error, the retrying would still work the same.


> If you just sent a message/request to an actor/server and it still hasn't responded after 5 seconds, you can send another.

This depends on the message. I hope amazon doesn't just send another message if my transaction didn't complete in 5 seconds.

I think, like all pieces of wisdom, sometimes it's OK to let it fail, and sometimes it's OK to handle the errors. If anyone ever tells me to always do X or never do X, it's typically not sound advice. The one thing we can always count on is generic advice failing sometimes :)

(And even in this article, you shouldn't always try to handle known errors and never try to handle unknown errors, there will be exceptions)


> I hope amazon doesn't just send another message if my transaction didn't complete in 5 seconds.

I do, but only while also associating a stable nonce to the transaction.


While the article focuses on the programming side, the other side is BEAM's links and supervisors, which are what really allow this.

Letting BEAM handle that stuff the way it is designed to will probably do a better job than your junior devs, and of course frees them up to write useful stuff instead.


Retrying is ideally handled at a single place though.

If the original client is going to retry on failure, including timeouts, any intermediate retries are likely to result in significant multiplication during outages, and that makes for a more difficult recovery.

It's also easy to miss reporting on intermediate retries, and then your system is running poorly and you don't know it.

Having things automatically reconnect is separate from automatic retries of individual requests.


I've said this often specifically in the context of golang, but while you're right that retries and similar are a common case, they are fairly similar to the 'expected' error case in the article, and can almost always be handled at precisely the place where you raise the error.

In Python this is

    @retry.retry(exceptions=(RpcException,), tries=5, backoff=2, jitter=1)
    def my_external_rpc_call(*args, **kwargs):
        ...
And RpcException will only be raised beyond this if the backend is unreachable for ~30 seconds.

Similarly, RPC frameworks can abstract over this entirely: gRPC (and presumably others) allows you to configure the retry policy per RPC service or method, and have it reflected everywhere it is used, without writing wrappers[0].

Which really all is to say, once you have solid libraries that handle retries of operations that are known to be error prone (file IO, network IO, things that could lock/block, etc.) you pretty quickly get into "any error implies we're totally boned".

[0]: https://github.com/grpc/grpc-go/blob/f601dfac73c9/examples/f...


A warning about this: retries should only be done at boundaries. And it's important to know whether e.g. the http or API library already implements retries, not to mention which errors should be retried. I have seen at least one codebase where the retries were completely out of hand.

In short, I've found retrying well is harder than it looks.


This is important. I’ve seen a case where retries were happening in the service mesh, the http client library, and the application code.


I struggle to find the correct descriptor for a counter-example, wherein You Really Want Success for the process as a whole, but it is acceptable for a sliver of it to fail, in the context of ETL.

I have an ETL I am told (I switched jobs) that is still working, from 2008. It was built to be a tank, and I also did another forbidden thing: Pokemon Exception Handling. It's a guideline, not a law of physics, and it is fine to resort to a general error catch when you really don't know every possible error (and let's be honest, if you have enough libraries in the mix, some surprises will happen) and you want the other 99.999% of the data to go through. Yes, this one little thing didn't load, and let's log that, let's examine that and figure out how to prevent that going forward, but overall, the rest of the program must continue.
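
A hedged sketch of the shape (load_one is a hypothetical loader): the catch-all sits around each record rather than around the whole batch, so one bad row can't stop the other 99.999%:

    import logging

    log = logging.getLogger("etl")

    def load_batch(records, load_one):
        """Load every record we can; log and skip the ones we can't."""
        failures = 0
        for record in records:
            try:
                load_one(record)
            except Exception:            # deliberate Pokemon catch: gotta catch 'em all
                failures += 1
                log.exception("record failed, continuing: %r", record)
        log.info("batch done, %d failures", failures)
        return failures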

How did it get so tanklike? Every time a little bit failed and it got logged, I figured out what went wrong, fixed it, and then tried to generalize to a class of similar errors. After a while, I got into Things I Was Told Would Never Happen in the data we ingested, and programmed for when "never" happened. Reader, "never" came a little sooner than expected.

Anyway, I largely agree with the idea but there are places where you want the exact opposite, and I think it is important to look for those places lest this heuristic become so stiff it can lose utility.


Funny thing, I had the same experience. I also built a robust ETL. It was for ingesting and manipulating financial data from 30+ different banks, and I also did Pokemon Exception Handling to make it robust.

In general my philosophy is: I don't want to wake up at 4am unless it's urgent. What can I do to gracefully handle failures to achieve that goal?


Exactly, thank you. At the time I was transitioning from a jack of all trades person to strictly a programmer. I despise getting called in the middle of the night.

My feeling is that, if you have a foundational business process like this, it should be designed to be maintained if there is a serious problem, and it ought to keep working. I know, haha, "the Internet is a series of tubes", but I really wanted this thing to be like a chunk of very uninteresting ductwork that just moves air from one place to another: it should just do its job with as little fanfare as possible.


For all the hate that Java tends to get, the language natively supports this distinction between:

* Expected errors - Checked Exceptions

* Unexpected errors - Unchecked Exceptions

Idiomatic Java also makes heavy use of asserts, e.g. using the Guava Preconditions library.


Alas, the problem with java, which I say as a begrudging long time java developer, is that "supports this distinction" is a theoretical benefit that is seldom used in practice. Checked and unchecked exceptions get so thoroughly abused and twisted into byzantine contraptions that any distinction, if value were to be gained from it, is completely destroyed by the common free form usage throughout the ecosystem.

The precondition thing, while indeed common, drives me sorta insane. I think it's a pattern java folks need to move on from. You've got this lovely type system (used loosely). If you need a precondition because you've got some fundamental invariant in the system, doing it at runtime rather than encoding it into the type system is such a missed opportunity. If I try to do something inherently wrong, I don't want the code to even compile!


What kind of precondition and what kind of example do you have for the typing? My primary precondition is null checks, which is unavoidable


This blog post really captures the core of where null checking should go and how to capture that you've already vetted this field for correctness in the type system so that the rest of your code never has to worry about it -- and further, cannot because the types don't allow!: https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-va...

This is echoed in an amazing book called Domain Modeling Made Functional, which radically changed how I thought about what a type system is and what it can actually do for us if we lean on it correctly (even a relatively crummy one like Java's!).


The problem is that while Java the language itself does support that distinction, a lot of built-in stuff really messes it up. For example, exceptions from closing a file are unexpected, but are an IOException which is checked anyway. Also, even the support that is in the language isn't first-class; e.g., lack of exception polymorphism.


I think that's a symptom of the fact that the distinction is really artificial at the language level anyway. Whether something is expected or not is a function of the requirements. Even OutOfMemory can be expected and handled in certain types of applications (esp. since it gets thrown for things like file handles rather than true memory). And then there are all kinds of cases where routine exceptions like file not found are in fact, unexpected errors (as discussed in TFA).

Perhaps some sort of language level solution could have been found (eg: have explicit interfaces to mark exceptions as expected or unexpected and then exceptions are assigned that using generics or something), but that ship has sailed long ago.


This is right, therefore, in most cases, a library should throw a checked exception, and the caller should decide whether it is an expected error and either handle it or rethrow it, or it is unexpected and rethrow a RuntimeException.


Unchecked vs Checked is one of the things I like least about Java. Programmers tend to make everything Unchecked because it leads to easier code for API users at the cost of correctness/error handling.


Modern Java should not produce a lot of checked exceptions. Unfortunately, a large part of the standard library is 25 years old and still full of things that throw checked exceptions. If you use something like Spring or Quarkus, you'll not find a lot of those.

Kotlin improved on Java by treating all exceptions as unchecked, including those from Java code. This was intentional and based on the observation that checked exceptions in Java were simply a mistake. Modern Java frameworks don't tend to use them for this reason. Kotlin fixed several other language design mistakes in Java, and it's a reason it is used as a drop-in replacement for Java in a lot of places. It also makes what Guava and Lombok do for Java completely redundant - all part of the language and standard library. Android, Spring, Quarkus, etc. all become nicer to deal with when you swap out Java for Kotlin. I find dealing with Java code to be very awkward these days. I used it for years and it just looks so ugly, clumsy, and verbose to me now.

The most common catch block in Java is e.printStackTrace() because that's what your IDE will insert. That's stupid code. And replacing it with a logger.error(e) is only marginally better. Idiomatic Java is actually re-throwing exceptions as RuntimeExceptions so your framework can handle them for you in a central place and show a nice not-found page or bad-request page (or the dreaded "we f*ked up" internal server error page). That too is stupid code to write, and with Kotlin, re-throwing exceptions is not really a thing. Why would you? Either you handle the exception or it just bubbles up to a place where it is handled or not. If you want people to deal with exceptions, you wrap them with a Result<T> in Kotlin. Java has a similar thing called Optional, but it is mostly just used to dodge null pointer exceptions, which in Kotlin are rare because it has nullable types. And of course it does not actually contain the original exception.


It's not clear to me that checked exceptions are actually a mistake, rather than just developers getting annoyed at their compiler forcing them to handle errors.

Just fyi, Scala predates Kotlin in not enforcing checked exceptions.


Checked exceptions require the implementation to distinguish between expected and unexpected errors. But as pointed out in the article, whether an error is expected or unexpected is more a function of the use case than the implementation.


Expected errors that need to pass through a lambda - unchecked exceptions.

Our good friend UncheckedIOException.


Something I learned from working on aviation systems is that when a system enters an unknown state, it must be disabled and locked out.

In software, this is known as an assertion failure. When the assert trips, the program is, by definition, in an unknown state. A program cannot reasonably be allowed to continue in an unknown state - it may launch nuclear missiles. The only thing to be done is exit directly, do not pass Go, do not collect $200.
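
A tiny illustration of the difference (hypothetical values): an assert here is an invariant check, not input validation, and tripping it means the program should stop rather than keep acting on bad data.

    def set_flap_angle(angle_degrees: float) -> None:
        # Invariant, not input validation: flap commands are generated internally,
        # so a value outside this (hypothetical) range means the program itself is
        # in an unknown state and must stop, not "correct" the value and carry on.
        assert -40.0 <= angle_degrees <= 40.0, f"impossible flap angle {angle_degrees}"
        ...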


Thanks for posting this. I have worked on non critical flight software and thought that this philosophy might work well.

I wonder how easy the certification is for such software? For work I might have to write DO-178 code in the future.


I use it in the software I write. I should do a presentation sometime about how the aviation industry should be influencing software development.


I would be very interested in that!


What if the plane is mid flight?


Engage the backup. Everything flight critical is dual.


It won’t be in mid-flight for much longer.


I like this mindset.


I think that the "let it fail" approach is often inevitable, even when we try to use Result<T, E>.

Often, we see an "unknown" variant in the error enum, as a catch-all for a library's unexpected errors. Then anyone who calls it must also have an "unknown" variant in their error enum, and anyone who calls them, and so on.

In the end, this "unknown" variant is similar to a panic, in that there are very few reasonable reactions to it: log it, cancel the request, return a 500, perhaps retry.

For this reason, I often recommend people to just use assertions and panics.


While everything you said is correct, there are still significant advantages to the 'result' method.

Sometimes you want to return 200 even if most of the backends fail. Sometimes one part may want to retry based on any error.

Even aside from this, disallowing exceptions leads to a very predictable control flow, and makes program state able to be expressed in the type system, which is useful for many reasons on its own.

While yes, it's often just like an exception or panic, I'll take that over exceptions in my code any day


I realize I was ambiguous; I didn't mean to say "just use assertions and panics", I meant "just use assertions and panics for unexpected errors", my apologies.

I wouldn't recommend someone only use Result<T, E> and never panic. If we do, then anything that might indirectly be invalid, such as a map lookup or an array index, will have ? operators on it, often every line of some functions. In the end, our control flow is just as unpredictable as if we just used panics and our signal is lost in the noise.

For this reason, I think a blend of Result<T, E> and panics/assertions is really the way to go.


>ignore unexpected exceptions

Isn't it the way it already is in practice, not something specific to Erlang? If an exception is unexpected, usually there won't be an exception handler for it, otherwise a developer pretty much expected it. Developers are generally lazy so in my practice the default is usually to let it fail, and there's usually going to be an exception handler that does something other than logging and quitting only if there's a serious reason to do so.

Maybe a more useful distinction could rather be "business logic errors" vs. everything else ("infrastructure errors", "programming errors" and "input validation errors"). Business logic should clearly define what should be done when an error happens, to avoid inconsistent state. But infrastructure-level errors or programming errors, you can't do much about them, other than log and/or retry.


> If an exception is unexpected, usually there won't be an exception handler for it

That honestly depends on how the language and program are written. Python is a great and horrible example, where you can handle any exception, even ones that were just created by the program:

    try:
        crash_hard_here()
    except: # by default (and unfortunately) will catch *everything*
        pass # and this is one of the worst offenders inside an except, to just outright ignore the exception and continue as if nothing happened and to not even log it.
I cannot tell you the amount of production code where I've seen catch-all exception handlers; they are the lazy way to make sure something will not "crash", even though much worse things can happen now.


This catch and ignore seems to be straight out of the (sadly not real) Visual Basic design patterns book :-)


In C++ for example there's an exception hierarchy and several classes of errors similar to what you enumerated, so a handler might catch a specific exception as a more generic one. There's also catch (...), which catches all exceptions.

This is fine, as one has the choice of exposing error details to outside code or not.


Other frameworks like express (nodejs) or actix (Rust) also don't crash if you "throw" in a request handler, so this doesn't sound very exciting to me. The interesting question for me is how retries are handled after an error occurs. For example, if the error happens in an http request handler, does the request still fail with a 500, or is it magically retried by Erlang while keeping the request hanging? For internal service calls, how do retries work, i.e. how can I configure that a request is retried after a failure? I guess Erlang does this and this is the power behind it?

The example of a missing file doesn't seem very good, since it's a problem that probably isn't solved by waiting. A better example is probably a busy DB that is temporarily unreachable.


The Erlang VM is built around message passing, so in the case of a dropped database connection your application code would pass a message to the database API asking for the results of a SQL query. The database API is currently trying to reconnect, so it’s not processing that queue of messages, but once it gets there it’ll pick up the message, run the query, and then send one back to the process that asked. This is all largely transparent to your application code, beyond being able to set some preferences in your return message handler around things like how long you’re willing to wait.

The whole “let it crash” thing comes from Erlang’s process supervision - in practice the DB API isn’t actually retrying. It’s continually failing to connect and if it can’t the process just crashes. The supervisor then notices and starts a new process in its place, this continues until either a process successfully starts or the configured retry count is hit. If the retry count is exceeded then the supervisor crashes, either taking the entire application with it, or more commonly being restarted itself by the next supervisor up the tree.
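
Very roughly, and nothing like how BEAM actually implements it, the shape of a supervisor can be sketched with plain threads in Rust (names invented):

    use std::thread;

    fn supervise(max_restarts: u32, worker: fn()) {
        for attempt in 1..=max_restarts {
            match thread::spawn(worker).join() {
                Ok(()) => return, // worker finished normally, nothing to do
                Err(_) => eprintln!("worker crashed (attempt {attempt}), restarting"),
            }
        }
        // restart budget exhausted: fail ourselves so the next level up can react
        panic!("too many restarts, escalating");
    }

The real thing adds restart strategies and time windows, and escalation happens through the supervision tree rather than a bare panic, but the gist is the same.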


The point is that Erlang (the BEAM VM) follows this pattern everywhere, not just for web requests. It’s baked into the language and runtime, and that’s infinitely more powerful and customizable than not crashing in a request handler.


I spent a good part of this week overhauling a microservice where most functions were a giant try/catch & would maybe throw a new error. Just getting rid of the try/catches & letting the code fail has been a huge help in seeing what is going wrong as the code executes.

I also am delighted to see the idea of expected errors here. Another thing I've been doing for a long time is tagging errors with an expected = true property when it's something we expect to see, like, oh, we went to get this oauth token but the credentials were wrong. Expectedness shows up in the logs now & we can see much more clearly where there are real problems.
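
For illustration, the same idea as a tiny Rust-flavoured sketch (field and function names invented):

    struct TaggedError {
        message: String,
        expected: bool, // true for failures we anticipate, e.g. bad credentials
    }

    fn log_error(err: &TaggedError) {
        let level = if err.expected { "WARN" } else { "ERROR" };
        eprintln!("[{level}] {}", err.message);
    }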


The article doesn’t seem to look at how resources are cleaned up when a BEAM process crashes. https://elixirforum.com/t/understanding-the-advantages-of-le... says “All resources are owned by a process in Erlang, and the VM guarantees clean-up of resources once the process dies”. My Google-fu failed me when I searched for more details about Erlang process cleanup of resources, or how to register cleanup actions (e.g. delete some temporary file on crash).


I’d take a look at the terminate callback for Elixir GenServers.

https://hexdocs.pm/elixir/1.12/GenServer.html#c:terminate/2


I am not sure what you mean by that. Which resources do you mean? Everything happening in your program is running inside the Erlang VM, even file writing... And each process inside the VM does its own garbage collection.

The only way I could see the VM itself crashing is if it can't allocate memory, then it all crashes.


Let’s assume we are using Linux.

Firstly, an Erlang VM process is not a Linux child process: “Erlang processes are lightweight, operate in (memory) isolation from other processes, and are scheduled by Erlang's Virtual Machine (VM). The creation time of process is very low, the memory footprint of a just spawned process is very small, and a single Erlang VM can have millions of processes running.”.

An Erlang program can use kernel resources: objects related to file descriptors, filesystem data, the process namespace, and signal handlers; kernel timers, semaphores, memory allocations (e.g. for FFI), sockets, DMA, etcetera.

I presume the Erlang library code tracks resources (such as file access function tracking file descriptors per Erlang process), so they can be cleaned up if we “let-it-fail”. I also presume there is a canonical way to register to clean up resources - the equivalent to atexit()[1].

[1] https://man7.org/linux/man-pages/man3/atexit.3.html


I'm not surprised that wasn't easy to find. All BEAM processes have their own heap, stack, process dictionary, as well as links and monitors and the message queue etc, including a list of owned ports; when a process dies, BEAM goes through all of that to clean things up.

Many BEAM terms [1] are easy to clean up; you don't have to do anything special to get rid of a number, or an atom, or a tuple or a list. Some binaries are heap binaries, stored in the process's own heap, so they're easy; other binaries are RefC (short for reference counted) binaries, where a ProcBin stored in the process's memory references a binary in the global (per node) binary heap; when those are cleaned up, the global binary's reference count needs to be decremented and if it's now zero, it needs to be cleaned up too.

Ports are how file descriptors are generally interfaced with. I haven't done much with port drivers, but documentation for the port driver stop callback [2] says it will be called when the port is closed explicitly or if the port owner is terminated.

Another way to interface with things outside of BEAM is through NIF resources [3], a Native Implemented Function can call enif_alloc_resource to allocate memory and pass the resource back to BEAM code where it can be used as with any other term. When the last reference to a NIF resource is garbage collected (which could be at process termination, or otherwise), its destructor is called, and external resources can be cleaned up at that point. NIF resources aren't strictly owned by one process, if you send to a process on the local node, the underlying object won't be destructed until it has been garbage collected from all processes. Prior to OTP-20.0, NIF resources sent to another node or otherwise serialized would be indistinguishable from an empty binary <<>>, but since that release, serialized NIF resources can be unserialized into a reference to the same resource, but only if it hasn't been destructed already.

If you wanted to build your own cleanup action in pure BEAM code, you would need to spawn_monitor (or spawn_link, perhaps) a new process, which would trap errors and clean up if the original process died or you otherwise got a cleanup message. Of course, if the code in that process crashed, you wouldn't get your cleanup. OTOH, if the code in your port driver or NIF crashes, that will bring down the whole BEAM node.

Unfortunately, with a quick look, I can't find an example of handling your very real, and reasonably simple use case of a temporary file, automatically deleted on process exit, but it wouldn't be too difficult to build. Of course, there's the question of what to do if the open fails, or the delete fails --- which comes back to the original topic ;)

[1] https://www.erlang.org/doc/reference_manual/data_types.html [2] https://www.erlang.org/doc/man/driver_entry.html#stop [3] https://www.erlang.org/doc/man/erl_nif.html#resource_objects


For our liveview project, a lot of the bugs we find are edge cases in the pattern match. We find the bug in appsignal, build another arity match and go on with our day. It's pretty cool.

I've been working in Elixir exclusively since 2016. I do think a lot of the Let It Fail is just marketing from Elixir (and BEAM) but there is a lot of truth in it. In reality you will most definitely not write everything under an explicit supervisor. You will just see errors in function clause matches and add another arity.


The origin of Let it Crash dates to "Making reliable distributed systems in the presence of sodware errors", Joe Armstrong, 2003: https://erlang.org/download/armstrong_thesis_2003.pdf, section 4.4.


Ha. From now on, we should refer to badly written services as "sodware". Sounds like something The Register would use a lot.


I know. I'm saying for my projects, I never had to reach that level of robustness and distribution. I imagine the vast majority of Elixir/Phoenix projects are the same, especially with the language gaining traction and new projects being spun up with the language.


> We find the bug in appsignal, build another arity match and go on with our day.

Failing (as crashing is now termed ;) immediately when the data didn't match the pattern is exactly the let it fail approach. If the data doesn't meet the expectations, there's nothing to do but crash. Maybe you've got a nice supervision tree, maybe not, but crashing immediately where things didn't match expectations usually gives you the right place to start looking; maybe it was some reasonable data, so you just handle it. Maybe it is unreasonable, so you need to look at where it came from, but usually (not always, of course) you just got the data and are pattern matching it, so you know where it came from too.


> In reality you will most definitely not write everything under an explicit supervisor.

That's the point of the supervisor _tree_. Certainly not every process will have its own supervisor, but all processes should be linked to other processes, which are linked to other processes, and at some point you have a process that is quite fundamental to the application and has a supervisor.

As long as there's a supervisor somewhere on that tree, the whole subtree will be restarted, hopefully in a non-erroneous state, and the application will continue on its merry way.


The article mentions Erlang, a functional language, which makes for an interesting contrast: functional languages are all about mapping out all behavior so that no undefined behavior can exist, and they basically force you to consider every possibility (of course that doesn't account for stuff like network errors).

Wouldn't the same scheme be better suited for a procedural language, with deliberately dirty code full of gotchas?

I write a ton of hacky scripts; the last time I needed to rewrite 1000s of XMLs I wrote a crude regex replace for it. It worked 99% of the time, and I fixed up the rest manually. This sort of thinking, where a subprocess might fail for whatever reason, including sloppy code, but the whole process keeps trucking on, would be perfect for this paradigm.

Additionally, this would open the path for stuff like trivial hot code replacement, since in this system a subprocess that crashes every time, like an invalid program, would just be handled by the system.


In a new language, I'd like to see exceptions being allowed in pure code, but prohibited in non-pure code. (Non-pure here meaning code with side effects.)

In pure code, an exception could essentially be passed up, and transformed into an error return value at the point where it's called by non-pure code.
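
Rust can already sketch that kind of boundary with catch_unwind (just an illustration of the idea, not a recommendation, and the names are invented):

    use std::panic;

    // the "pure" part may throw (panic)
    fn pure_parse(input: &str) -> u32 {
        input.parse().expect("input was not a number")
    }

    // the non-pure boundary turns that into an ordinary error value
    fn parse_at_boundary(input: &str) -> Result<u32, String> {
        panic::catch_unwind(|| pure_parse(input))
            .map_err(|_| format!("computation failed for {input:?}"))
    }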


So then you make a pure function `throw_exception` and now you can throw from impure code.


In non-pure code, we would want to force the programmer to handle the exception.

In this case, wherever `throw_exception` is called, the programmer would have to handle the exception, and either refactor the return value to indicate an error, change the return value to an Optional (and silently fail by returning an empty Optional), or cause/trigger a side effect (like terminating the thread/program early, with an exit code).


I'd be very interested to see non-BEAM approaches to enabling this - I kind of end up in the same pattern thanks to "expected? Return an Error<E>. Unexpected? Throw." However, the supervising part is then difficult.

How do people approach this in Python? NodeJS? Rust? .NET?


If you’re doing web stuff, frameworks will basically have this built in. Typically the framework’s own request handler will have something that catches all uncaught exceptions within the scope of that request and returns a 500 response. So within the context of a single request you can usually adopt a “let it crash” philosophy. In fact, it’s something most people seem to do intuitively.
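
Under the hood that usually amounts to something like this sketch (Rust here, and the names are just for illustration):

    fn dispatch(handler: fn(&str) -> String, request: &str) -> (u16, String) {
        match std::panic::catch_unwind(|| handler(request)) {
            Ok(body) => (200, body),
            // only this request dies; the server loop keeps serving
            Err(_) => (500, "internal server error".to_string()),
        }
    }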


If you can stomach kubernetes, you get this for free for all languages. Just panic!() or die(), and you'll get a fresh pod in a few seconds.

Outside of k8s I try to use the lang's preferred tool. Python -> supervisord, Node -> pm2, etc.


You can take the same patterns and apply them; it just takes a little more rigor, since BEAM/OTP has this built in.

My favorite is hardware (or software) watchdog timers. Very simple to implement and surprisingly effective.
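
A minimal software-watchdog sketch in Rust, just to show the shape (names invented; it assumes the main loop bumps the returned counter every healthy iteration):

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::time::Duration;

    // Returns a heartbeat counter; the main loop must bump it regularly.
    fn spawn_watchdog(timeout: Duration) -> Arc<AtomicU64> {
        let beats = Arc::new(AtomicU64::new(0));
        let seen = Arc::clone(&beats);
        std::thread::spawn(move || {
            let mut last = seen.load(Ordering::Relaxed);
            loop {
                std::thread::sleep(timeout);
                let now = seen.load(Ordering::Relaxed);
                if now == last {
                    // no heartbeat within the timeout: die loudly so whatever
                    // supervises the process (systemd, k8s, ...) restarts it
                    eprintln!("watchdog expired, aborting");
                    std::process::abort();
                }
                last = now;
            }
        });
        beats
    }

The healthy path is just beats.fetch_add(1, Ordering::Relaxed) once per loop iteration.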


It's common for me in PHP land to call die("Error! $e"); when something really bad has happened in a request and I really don't want anything else to happen.


That's not without its drawbacks, however, especially when you consider tools like Swoole and Laravel Octane that keep your application in a long-running process; if you use die you actually kill one of the server workers instead of just terminating the execution of the request.


Yeah nice, PHP's request scoping is a really good fit for this.


This is the way. Exception handling is often one of the worst aspects of a production codebase, especially since it is typically added late. Though error handling strategies benefit from careful design, they are usually added piecemeal. Making errors louder and more problematic is the best way to get them the attention they deserve.

This is not a new concept and it seems to be one of the core components of the Erlang Weltanschauung. It can be generalized further to systems as the principle of "Crash-Only Software," as advanced in this classic paper: https://dslab.epfl.ch/pubs/crashonly.pdf


I've heard this expressed as "write brittle code", and I'm a strong advocate for it. Looking up a user by id and getting no results? Rather than either passing null up the stack, or wrapping null with an optional.empty, throw an exception! It's the client code's problem if it somehow got hold of an id that doesn't exist. (Yes, ymmv depending on the system, e.g. if you're dealing with eventual consistency then maybe do something different.)

As the article says, this of course doesn't mean that you shouldn't handle user errors, or even known system errors.
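
For illustration, a brittle lookup sketched in Rust (names invented):

    use std::collections::HashMap;

    fn get_user<'a>(users: &'a HashMap<u64, String>, id: u64) -> &'a String {
        users.get(&id)
            .unwrap_or_else(|| panic!("no user with id {id}: caller passed a stale or bogus id"))
    }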


I think the distinction between expected and unexpected errors can easily fall through the cracks, and writing code in a way that an unexpected error doesn’t break everything is quite powerful.

Golang makes it easy to ignore errors that can be ignored, and defer/recover provide a way to implement “let it fail”.

There’s even an implementation of supervisor trees for Go [0] :)

[0] https://github.com/thejerf/suture


This would be my go to for anything _supervisor_ in golang: https://github.com/asynkron/protoactor-go#supervision.


One of the issues I take with this mentality is that companies are always looking for excuses to sell bad software. As an engineer, I really don't enjoy maintaining crappy SaaS, particularly if they are built with weakly typed, dynamic languages.

If something is obviously prone to fail you should empower people to do something about it, because usually "letting it fail" will lead to your users switching to your competitor's offering.


Dataflow offers an interesting variation of not polluting the happy path with error handling: just do nothing.

How so? Well, the happy path passes data to the next filter. In the case of an error, simply do not do that, and the next filter will be none the wiser.

Optionally, log the error, possibly with enough information that the "logger" could retry the operation if appropriate. But as others have pointed out, there often isn't much that can be done.
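
A sketch of that in Rust (invented names): each stage only forwards values it could actually produce, so an error becomes silence plus an optional log line.

    fn pipeline(lines: Vec<&str>) -> Vec<u32> {
        lines.into_iter()
            .filter_map(|line| match line.parse::<u32>() {
                Ok(n) => Some(n), // happy path: hand the value downstream
                Err(e) => {
                    eprintln!("skipping {line:?}: {e}"); // optional logging
                    None // error path: simply pass nothing on
                }
            })
            .map(|n| n * 2) // the next filter never sees the failures
            .collect()
    }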


I believe this style of coding is known as “Design by contract”. Basically assigning a range and domain to each function.


That's why Erlang is cool, to me.


the best TLDR I've seen of this philosophy is - if you find yourself writing all over your codebase

    try { ... } catch (e) { console.log(e) }

then you should probably just "let it fail" since you can't actually handle the error


That construction adds a layer of "let it fail." For example you might want to have your web server be able to mark a connection as failed instead of only being able to mark the entire server as having failed.


That's impressive!



