
The most crucial thing that I've seen over the years is that most developers are simply afraid of bringing the application down on bugs.

They conflate error handling with writing code for bugs, and this leads to a proliferation of issues, including second/third/etc. degree issues where the code fails later because it already encountered a BUG but execution was allowed to continue.

What do I mean in practice? Practical example:

I program mostly in C and C++, and I often see code like this:

   if (some_pointer) { ... }
and the context of the code is such that some_pointer being a NULL pointer is in fact not allowed and is thus a BUG. The right thing to do would be to ABORT the process execution immediately, but instead the programmer turned this into a logical condition (probably because they were taught to check their pointers).

This has the side effect that:

  - The precondition that some_pointer may not be null is now lost. Reading the code, it looks like this condition IS allowed.
  - The code is allowed to continue after it has logically bugged out. Your 1+1 = 2 premise no longer holds. This will lead to second-order bugs later on, because the BUG let the program continue executing in a buggy condition. False reporting is also likely.
The better way to write this code is:

  ASSERT(some_pointer); 
where ASSERT is an unconditional check that will always (regardless of your build config) abort your process gracefully and produce a) a stack trace and b) a core dump file.
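
For illustration, a minimal sketch of what such a macro can look like (the name and the exact reporting are up to the codebase, not taken from any particular library):

  #include <cstdio>
  #include <cstdlib>

  // Active in every build configuration, unlike the standard assert(),
  // which is compiled out when NDEBUG is defined. abort() raises SIGABRT,
  // which produces a core dump where core dumps are enabled.
  #define ASSERT(cond)                                              \
      do {                                                          \
          if (!(cond)) {                                            \
              std::fprintf(stderr, "ASSERT failed: %s at %s:%d\n",  \
                           #cond, __FILE__, __LINE__);              \
              std::abort();                                         \
          }                                                         \
      } while (0)

  struct Widget { int value = 0; };

  void process(Widget* some_pointer) {
      ASSERT(some_pointer);        // a null pointer here is a BUG, not a condition
      some_pointer->value += 1;    // safe to dereference past this point
  }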

My advice is:

If your environment is such that you can immediately abort your process when you hit a BUG, do so. In the long run this will help with post-mortem diagnosis and fixing of bugs, and will result in a more robust, better-quality code base.



If you're validating parameters that originate from your program (messages, user input, events, etc), ASSERT and ASSERT often. If you're handling parameters that originate from somewhere else (response from server, request from client, loading a file, etc) - you model every possible version of the data and handle all valid and invalid states.

Why? When you or your coworkers are adding code, the stricter you make your code, the fewer permutations you have to test, the fewer bugs you will have. But, you can't enforce an invariant on a data source that you don't control.
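
A sketch of that split (C++ here for concreteness; the message format and its bounds are invented):

  #include <cassert>
  #include <cstdint>
  #include <optional>
  #include <vector>

  struct Message { std::uint8_t type = 0; std::vector<std::uint8_t> payload; };

  // Internal caller: the rest of the program guarantees msg.type is already
  // valid, so a violation here is a BUG and gets asserted.
  void dispatch(const Message& msg) {
      assert(msg.type <= 0x04 && "caller violated the message-type invariant");
      // ... route to the handler for msg.type ...
  }

  // External data: bytes off the wire can be anything, so every invalid shape
  // is modelled and mapped to an error value instead of being asserted.
  std::optional<Message> parse(const std::vector<std::uint8_t>& wire) {
      if (wire.empty()) return std::nullopt;      // truncated frame
      if (wire[0] > 0x04) return std::nullopt;    // unknown message type
      Message msg;
      msg.type = wire[0];
      msg.payload.assign(wire.begin() + 1, wire.end());
      return msg;
  }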


Yes, of course. The key here is to understand the difference between BUGS and logical (error) conditions.

If I write an image processing application, failing to process a .png image when:

  - the user doesn't have permission to access the file
  - the file is actually not a file
  - the file is actually not an image
  - the file contains a corrupt image
  etc.
are all logical conditions that the application needs to be able to handle.
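
Sketched in code (the result type and the simplistic errno handling are illustrative, not a real decoder):

  #include <cerrno>
  #include <cstdio>
  #include <cstring>
  #include <string>

  enum class LoadResult { Ok, NoPermission, NotAFile, NotAnImage, CorruptImage };

  // Each condition above is an expected outcome with its own result value;
  // none of them is a reason to assert or abort.
  LoadResult load_png(const std::string& path) {
      std::FILE* f = std::fopen(path.c_str(), "rb");
      if (!f)
          return errno == EACCES ? LoadResult::NoPermission : LoadResult::NotAFile;

      unsigned char magic[8] = {};
      std::size_t n = std::fread(magic, 1, sizeof magic, f);
      std::fclose(f);

      static const unsigned char png_sig[8] = {0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};
      if (n < sizeof magic || std::memcmp(magic, png_sig, sizeof png_sig) != 0)
          return LoadResult::NotAnImage;

      // ... decode; a truncated or inconsistent stream yields CorruptImage ...
      return LoadResult::Ok;
  }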

The difference is that from the software-correctness perspective none of these are errors. In the software they're just logical conditions; they are only errors to the USER.

BUGS are errors in the software.

(People often get confused because the term "error" without more context doesn't adequately distinguish between an error condition experienced by the user when using the software and errors in the program itself.)


> But, you can't enforce an invariant on a data source that you don't control.

This is obvious.


Please, no!

I worked with a 3rd party library that had this mentality: "a bug is a bug, so the assert fails, the code is now in an unknown state, and thus the right thing to do is to ABORT the process execution immediately". Oh my.

Just do "if (pointer)" and, when that fails, error out from the smallest context possible that applies to that pointer, and nothing more than that. I.e. the real BEST thing to do is to abort the current connection. To skip the current file with an error. To fail writing that piece of memory. Whatever. But never abort the whole process (unless maybe in debug builds).
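
Something along these lines (the connection and packet types are stand-ins):

  #include <cstdio>

  struct Packet { int kind = 0; };

  struct Connection {
      int fd = -1;
      void close_with_error(const char* reason) {
          std::fprintf(stderr, "closing connection %d: %s\n", fd, reason);
          // ... close(fd), notify the peer, release per-connection buffers ...
      }
  };

  // A violated precondition while servicing one connection tears down only
  // that connection; every other call on the server keeps running.
  void handle_packet(Connection& conn, const Packet* pkt) {
      if (!pkt) {
          conn.close_with_error("internal error (null packet)");
          return;
      }
      // ... normal processing of *pkt ...
  }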

The end result of this library was that we had a WebRTC server handling 100s of simultaneous video calls, and then when a single new user tripped up during connection and went through a bogus code path, the library would decide "oh something is not as I expected so I'll abort, of course!" and the whole production server was brought down with it.

That kind of behavior does not help achieving high production quality and providing robust and reliable services.

We ended up removing the library's runtime assertions, which meant that connections that hit bugs in the library code would just end up failing with an error somewhere else, which could be used to simply discard the attempt and try again. All in all, the numbers showed it was a huge positive for the stability of the service.


I think the original comment was more directed at scenarios where the precondition failure is let slide, NOTHING happens, and that's not the desired behaviour, so you have a bug. Such code exists too often and it's just poor quality.

If there's anything I've learnt about error handling, it's that it must be approached with very careful consideration of what you want the app to do (conceptually) when the error occurs. Sometimes that's crash the program, sometimes it's throw an error, sometimes it's log and move on. The issue comes when devs don't want to think about this, in which case the simplest solution is to absorb any error and forget about it.


Agree with your assumption, but then I also know that developers often read recommendations and convert them into dogma, ignoring the specifics for when those recommendations should be applied.

Early in my career I taught programmers, and I was horrified to come back to one client and find they had followed my advice, but not where it applied, meaning that what they did was actively harmful.

Ever since then, I've realized that when developers make a recommendation they need to be very explicit about when it applies; otherwise they are doing more harm than good, and hopefully someone will publicly challenge the recommendation to bring its unstated qualifications to light.


So the library was full of bugs.

(IMHO) The right thing to do would be to

  - Fix the actual bugs
  - Provide supervision and isolation as protection and resiliency mechanisms.
What you've now done is essentially just hammer it quiet, sweeping all the issues under the rug and pretending everything is great while the library is actually in an ill-defined state. How is that possibly any better? You're probably experiencing hard-to-detect bugs, occasional runtime corruption and all the fun stuff now, and you'll likely never be able to fix any of it :)
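
On the supervision point, a POSIX-flavoured sketch (heavily simplified; a real deployment would more likely lean on systemd, a container restart policy, or an Erlang-style supervisor):

  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>
  #include <cstdio>

  // The real server loop; it is free to ASSERT and abort when it detects a bug.
  void run_worker() {
      // ... accept connections, process calls ...
  }

  int main() {
      for (;;) {
          pid_t pid = fork();
          if (pid < 0) return 1;         // could not spawn a worker at all
          if (pid == 0) {                // child: do the actual work
              run_worker();
              _exit(0);
          }
          int status = 0;
          waitpid(pid, &status, 0);      // parent: block until the worker dies
          std::fprintf(stderr, "worker %d exited (status %d), restarting\n",
                       (int)pid, status);
          sleep(1);                      // crude backoff before respawning
      }
  }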

"That kind of behavior does not help achieving high production quality and providing robust and reliable services."

It absolutely does.

When bugs are obvious and caught early, they're easier to fix, and this leads to a higher-quality, more robust and reliable service.

Pretending that everything is good is never the solution.


I agree with you. But with a product trying to gain a good name and its first loyal customers, a crash that brings down the complete service is a no-go, no matter the excellence in software development that one seeks theoretically.

Bugs won't be fixed either if the early customers fly away.

In practice, ignoring the bugs meant better numbers. Only the user connections affected by those bugs would fail, which a retry system then transparently solved. This is how the library should have worked to begin with.

Bugs are always going to exist; it's the first law of software that there is no such thing as software without bugs. So no, I don't believe aborting a whole application is ever acceptable behavior for a library. Do the if (pointer) else return error, not the assert(pointer) else abort.


I'm sure you made the best judgement call given your circumstances. Of course, from a business perspective software quality is irrelevant; it's best achieved by letting the PR/marketing team take care of it.

However, the problem is that once you go down this route of allowing buggy/incorrect code to continue running, you cannot reason about your program anymore. You cannot make any smart decisions when the program is allowed to continue after hitting a bug.

If I call a function foobar() and, let's assume, it's buggy but it continues and leaves my program in a bad state, what should I do? How can I determine that the result foobar() produced is garbage?

So maybe foobar() returns a bool/success/flag value that indicates that it bugged out. But then what? Maybe I want to log this as an error, but what if the logging function also has bugs in it? Maybe the logging function didn't work as intended because I made a bug when calling it while trying to deal with the bug that happened inside foobar(). How do I propagate this error correctly without introducing more bugs to my callers, who then must all do something about it?

The fact is most programmers can barely get the "happy path" right. Even normal logical error conditions ("file not found") cause plenty of software to fail because it cannot do proper error handling and propagation. So if you let incorrect code keep on executing, nothing good will ever come of it, and there's zero chance anyone will be able to write correct code on top of incorrect code.

The point is, once you start writing code to "deal with" incorrect logic, it's like trying to do math after a division by zero. None of the rules apply, none of the logic applies. Your program state is random garbage.

All these problems disappear when you adopt the simple rule: you don't write code for bugs. You simply abort.

Mind you, from the library perspective, what is your bug might not be a bug from my perspective. In other words, if I provide a library to random 3rd parties, I can assume they will use it wrong; their buggy calls are what I must expect, so I return some error value, etc. But if I'm calling code that I wrote from my own code, I don't write a single line of code for bugs. I simply assert and abort.

And to your last point about bugs always existing: yes, I agree, and the best we can do is to squash them as soon as possible and make them loud and clear and as easy to debug as possible (i.e. a direct callstack / core dump). Not doing this does not fix them but simply makes them harder to fix.


Oh my indeed.

> That kind of behavior does not help achieving high production quality and providing robust and reliable services.

Right, so since it's the messenger (assertion) that brought news of defeat in battle, you kill the messenger instead of trying to win the next battle.

Your problem is that the entire business seems to hinge on a single point of failure: if the server binary dies, you lose customers. This has nothing to do with what some library does or does not do; it's just a horribly designed system.

That crash should have made you all stop and think real hard about what you're doing because clearly it's not working:

1. Don't write bugs in the first place. No I'm not kidding. Your tests clearly suck so fix them, make sure you do proper PR reviews, make sure to leverage your type system to the fullest, use all available tooling, etc.

2. Design your system to be resilient to the server binary dying, because I assure you, there's nothing you can do to prevent it from happening for one reason or another.


Any sane person will throw an exception, which, when unhandled, will crash the program.


I'm a big fan of assertions and rigorous preconditions, but there are times when a failure of some invariant in a minor subsystem should not be allowed to crash the entire process, especially if the context makes it easy to return an error.

In our project (the language server for Go) we have gotten tremendous value from telemetry: return an error, but report home the 1-bit fact that the assertion has failed. Often that fact is enough to figure out why; other times it is necessary to refine the assertion into two or more (in a later release) to get another bit or two of information about the nature of the failure.
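
The shape of that idea, sketched here in C++ with invented names (the Go project's actual mechanism will differ):

  #include <cstdio>
  #include <string>

  // Stand-in for a real telemetry hook: in production this would bump a counter
  // that gets reported home, carrying only the 1-bit "this assertion failed" fact.
  void report_counter(const std::string& name) {
      std::fprintf(stderr, "telemetry: %s\n", name.c_str());
  }

  // Check an invariant without aborting: the caller gets the result back and
  // can turn the failure into an ordinary error for this one request.
  bool check(bool ok, const char* assertion_id) {
      if (!ok) report_counter(std::string("assert_failed/") + assertion_id);
      return ok;
  }

  // Usage sketch:
  //   if (!check(node != nullptr, "nil_node_in_walk")) return internal_error();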


> The code is allowed to continue after it has logically bugged out.

I'm a big fan of asserting preconditions and making it clear that we are getting into a bad place. I would rather dig through Sentry for an AssertionError than propagate a bad state and have to fix mangled data after the fact. If the AssertionError means that we mishandled valid user input, no problem, we'll go fix it.

A few times in my career I've had to ask, "okay, how long has this bug been quietly mangling user data?" and it's not a fun place to be.

Side note: I've never understood the convention of removing asserts in production builds. It seems like removing the seatbelts from the car before the race just to save a few pounds.


> Side note: I've never understood the convention of removing asserts in production builds. It seems like removing the seatbelts from the car before the race just to save a few pounds.

Once upon a time, computers were slow and every cycle mattered. Assertions were compiled out of the build by necessity. Better a crash once in a while than the program hardly running because it was so slow.
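
That behaviour is still the default for the standard assert(): it compiles to nothing whenever NDEBUG is defined, which release builds typically set. For example:

  #include <cassert>

  // Build with -DNDEBUG (the usual release setting) and the assert below
  // disappears entirely; build without it and a false condition prints a
  // diagnostic and calls abort().
  int divide(int a, int b) {
      assert(b != 0 && "caller must not pass a zero divisor");
      return a / b;
  }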


Case in point: When I was stuck inside a "big ball of perl" codebase that heavily used assertions for method input validation, I generated a flame graph of where time was spent in the codebase and it turned out it was assertions all the way down. Since only a small percentage of inputs came from external/unvalidated sources (user input etc) it was fine to remove the vast majority of them outside of the development environment. So we turned them into no-ops in prod and had a significant performance improvement.


Wouldn’t happen to have been a big healthcare company in Boston whose assert-anything function to validate that functions were called with correct signatures was called AssertFields?


As it happens, it was at such a place exactly as you describe. Hello fellow traveler, our paths may have crossed.


Like everything in life, it depends.

If this is some inconsequential part of the codebase, it might be better to limp on than to completely stop anyone, user or fellow dev, from running the app at all.

Said another way, graceful degradation is a thing.


I think this is precisely why exceptions are a particularly good model for, well, exceptional situations.

They let you install barriers, and you can safely fail up until that point, disallowing the program from entering cursed states, while the user can still be returned a readable error message.

In fact, I would be interested in more research into transactions/transactional memory.


How do you gracefully degrade when your program is in a buggy state and you no longer know what data is valid, what is garbage, and what conditions hold?

If I asked you to write a function that takes a chunk of customer JSON data, but told you that the data was produced/processed by some code that is buggy and might have corrupted it, and your job is to write a function that works on that data, how would you do it?

Now your answer is likely to be "just checksum it", but what if I told you that the functions that compute the checksums sometimes go off the rails in buggy branches and produce incorrect checksums?

Then what?

In a sane world your software is always in a well-defined state. This means buggy conditions cannot be allowed to keep executing. If you don't honor this, you have no chance of a correct program.


Contrary to people's dislike of OOP, I think it pretty well solves the problem.

You have objects, and calling a method on one may fail with an exception. If the method throws an exception, it is itself responsible for leaving behind a sane state, but due to encapsulation that is a feasible task.

(Of course global state may still end up in illegal states, but if the program architecture is carefully designed and implemented it can be largely mitigated)


Why not bring down the entire server if you detect an error condition in your application? You build things in a way where a job or request has isolated resources, and if you detect an error, you abort the job and free those resources, but continue processing other jobs. Operating systems do this through processes with different memory maps. Applications can do it through things like arenas or garbage collection.
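
A sketch of the arena flavour (C++17 pmr; the Job type and the bug check are invented):

  #include <cstdio>
  #include <memory_resource>
  #include <stdexcept>
  #include <vector>

  struct Job { int id = 0; };

  // All of a job's scratch allocations come from its arena; a detected bug throws.
  void run(const Job& job, std::pmr::memory_resource& arena) {
      std::pmr::vector<int> scratch(&arena);
      scratch.push_back(job.id);
      if (job.id < 0) throw std::logic_error("negative id should be impossible");
      // ... real work ...
  }

  // The error aborts one job; its resources are released as a unit and the
  // loop carries on with the next job.
  void process_all(const std::vector<Job>& jobs) {
      for (const Job& job : jobs) {
          std::pmr::monotonic_buffer_resource arena;
          try {
              run(job, arena);
          } catch (const std::exception& e) {
              std::fprintf(stderr, "job %d aborted: %s\n", job.id, e.what());
          }
      }   // each iteration ends by freeing everything the job allocated
  }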


It may be okay in a server, but (for example) not in a desktop application. The issue, then, is that most code lives (or should live) in library-like modules that are agnostic of which kind of application context they are running in. In other words, you can’t just abort in library code, because the library might be used in application contexts for which this is not acceptable. And arguably almost all important code should be a library.

Exception mechanisms let the calling context control how to proceed. Deferring to that control and doing some cleanup during stack unwinding virtually never causes serious issues in practice.


What I meant was that if you follow the logic of "computer is in an unknown state. Stop processing everything", then why not continue that to the entire server (operating system, hypervisor, etc.)? Obviously it's not okay in almost any context. Instead, assuming you have something more complicated than a CLI script that's going to immediately exit anyway, you should be handling those sorts of conditions and allowing your event loop/main thread to continue.


I think the issue is that bringing the application down might mean cutting short concurrent ongoing requests, especially requests that will result in data mutation of some sort.

Otherwise, some situations simply don't warrant a full shutdown, and it might be okay to run the application in degraded mode.


"I think the issue is that bringing the application down might mean cutting short concurrent ongoing requests, especially requests that will result in data mutation of some sort."

Yes, but what is worse is silently corrupting the data or the state because the program kept running in a buggy state.


This is a false choice.


If you don't know why a thing that's supposed to never be null ended up being null, you don't know what the state of your app is.

If you don't know what the state of your app is, how do you prevent data corruption or logical errors in further execution?


> If you don't know what the state of your app is, how do you prevent data corruption or logical errors in further execution?

There are a lot of patterns for this. It's perfectly fine and often desirable to scope the blast radius of an error short of crashing everything.

OSes shouldn't crash because a process had an error. Servers shouldn't crash because a request had an error. Missing textures shouldn't crash your game. Cars shouldn't crash because the infotainment system had an error.


If you can actually isolate state well enough, and code every isolated component in a way that assumes that all state external to it is untrusted, sure.

How often do you see code written this way?


This is basically all code I've worked on. You have a parsing/validation layer that passes data to your logic layer. I could imagine it working less well for something like a game where your state lives longer than 2 ms and an external database is too slow, but for application servers that manipulate database entries or whatever it's completely normal.
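
In any language the shape is roughly the same; a small sketch (C++ here only for concreteness, with invented domain types):

  #include <optional>
  #include <string>

  // By convention Orders are only ever built via parse() below, so the logic
  // layer can rely on these invariants holding.
  struct Order {
      std::string customer_id;   // non-empty
      int quantity;              // > 0
  };

  // Validation layer: untrusted input in, either a well-formed Order or nothing.
  std::optional<Order> parse(const std::string& customer_id, int quantity) {
      if (customer_id.empty() || quantity <= 0) return std::nullopt;
      return Order{customer_id, quantity};
  }

  // Logic layer: it never sees invalid data, so it has far fewer cases to test.
  void place_order(const Order& order) {
      // ... issue database commands using order.customer_id / order.quantity ...
      (void)order;
  }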

In most real-world application programming languages (i.e. not C and C++), you don't really have the ability to access arbitrary memory, so if you know you never gave task B a reference to task A or its resources, then you know task B couldn't possibly interfere with task A. It's not dissimilar to two processes being unable to interfere with each other when they have different logical address spaces. If B does something odd, you just abort it and continue with A. In something like an application server, it is completely normal for requests to have minimal shared state internal to the application (e.g. a connection pool might be the only shared object, and has a relatively small boundary that doesn't allow its clients to directly manipulate its own internals).


You can "drop" the request that fails instead of crashing the whole app (and dropping all the other requests too).


Sure. You wouldn't want a webserver to crash if someone sends a malformed request.

I'd have to think long and hard about each individual case of running in degraded mode though. Sometimes that's appropriate: an OS kernel should keep going if someone unplugs a keyboard. Other times it's not: it may be better for a database to fail than to return the wrong set of rows because of a storage error.


That's exactly what the attacker wants you to do after their exploit runs: ignore the warning signs.


You don't ignore it. You track errors. What you don't do is crash the server for all users, giving an attacker an easy way to DoS you.


A DoS might be the better option vs. say, data exfiltration.


Most bugs aren't going to create any risk for data exfiltration. In most real application servers (which are very rarely written in C or C++ these days), requests are almost completely isolated from each other except to the extent that they interact with a database. If you detect a bug in one request, you just abort the one request, and there's likely no way it could affect others.

This is part of why something like Rust is usable at all; in the real world a lot of logic has straightforward, linear lifecycles. To the extent that it doesn't, you can push the long-lived state into something like an external database, and now your application has straightforward lifecycles again where the goal of a task is to produce commands to manipulate the database and then exit.


Sure, but i was talking about an individual process. If you don't know what state it's in, you simply can't trust it to run anymore. That's all.


Except you usually can because the state isn't completely unknown. You might not expect some field in a structure to be null, but you still know for example that there's no way for one request to have a reference to another, so you just abort the one request and continue.


No, if you have been compromised, you cannot make these assumptions.


And what does a DoS attacker want you to do? Crash the whole service, denying it to everyone else?


That is a valid tradeoff in many situations, yes.


> If you don't know what the state of your app is, how do you prevent data corruption or logical errors in further execution?

Even worse, you might be in an unknown state because someone is trying to exploit a vulnerability.


If you crash then you've handed them a denial of service vulnerability.


That's an issue handled higher up the stack with process isolation etc. It's still not ok to continue running a process that is in an unknown state.


I don't agree with any of this.

First of all, this results in unintelligible errors. Linux is famous for abysmal error reporting, where no matter what the problem really is, you get something like ENOENT, no context, no explanation. Errors need to propagate upwards and allow the handling code to reinterpret them in the context of the work it was doing. Otherwise, for the user they are either meaningless or dangerous.

Secondly, any particular function that encounters an unexpected condition doesn't have a "moral right" to terminate the entire program (who knows how many layers there are on top of what this particular function does?). Perhaps the fact that a function cannot handle a particular condition is entirely expected, and the level above this function is completely prepared to deal with the problem: insufficient permissions to access the file -- ask the user to elevate permissions; configuration file missing from ~/.config? -- perhaps it's in /etc/; cannot navigate to a URL? -- perhaps the user needs to connect to a Wi-Fi network. And so on.
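
For instance (paths and file names invented), that kind of recovery can be as simple as:

  #include <fstream>
  #include <iterator>
  #include <optional>
  #include <string>

  // The callee only reports "couldn't read it"; what that means is decided higher up.
  std::optional<std::string> read_file(const std::string& path) {
      std::ifstream in(path);
      if (!in) return std::nullopt;
      return std::string(std::istreambuf_iterator<char>(in),
                         std::istreambuf_iterator<char>());
  }

  // The level above is entirely prepared for the failure: a missing user config
  // is not fatal, it just means "fall back to the system-wide copy".
  std::optional<std::string> load_config(const std::string& home) {
      if (auto cfg = read_file(home + "/.config/app.conf")) return cfg;
      return read_file("/etc/app.conf");
  }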

What I do see in practice is that programmers are usually incapable of describing errors in a useful way, and are very reluctant to write code that automates error recovery, even if it's entirely within reach. I think the reason for this is that the acceptance criteria for code usually emphasize the "good path", and because multiple bad things can usually happen down the "bad path", it becomes cumbersome and tiresome to describe and respond to them, and then it's seldom done.


yup. we have definitely all gotten an ENOENT or EIO before with no context.


Absolutely. An intermediate path in Go is to recover any panics on your goroutines: in this case a nil dereference panic may cause the death of the goroutine but not the whole application.

An example where this can be useful is in HTTP request handling: a single request might fail but the others can keep going -- but there are plenty of other use cases too.

The panic recovery code can log the failure for further investigation, and in the HTTP case, for example, probably return a 500 to the caller if wanted.

There are of course plenty of valid reasons not to take an approach like this too, but in some circumstances it can be useful.


This is often called offensive programming.


Hot damn, I never heard of this term before but yeah that's exactly what it is.

TIL, thanks.


Paradoxically, it is still a subset of defensive programming.


the best subset


"MIT v. Berkeley - Worse is Better" => https://blog.codinghorror.com/worse-is-better/

"Fail Fast / Let it Crash" => https://erlang.org/pipermail/erlang-questions/2003-March/007...

...you're in good company. :-)


I largely agree. If it comes to pass that the precondition fails, there's a bug somewhere and this code just hides it. At the very least, that should go to an error log that someone actually sees.

I'm writing a Rust project right now where I deliberately put almost no error handling in the core of the code apart from the bits accepting user input. In Rust speak, I use .unwrap() all over the place when fetching a mandatory row from the DB or loading config files or opening a network connection to listen on or writing to stdout. If any of those things fail, there's not a thing I can plausibly do to recover from it in this context. I suppose I could write code like

  if let Ok(cfg) = load_config() {
      println!("Loaded the config without failing!");
      Ok(cfg)
  } else {
      eprintln!("Oh no! Couldn't load the config file!");
      Err("Couldn't load the config file")
  }
and make the program exit if it returns an error, but that's just adding noise around:

  return load_config().unwrap();
The only advantage is that the error message is more gentle, at the expense of adding a bunch of code and potentially hiding the underlying error message the user could have used to fix the problem.

I think Python also gets that right, where it's common to raise exceptions when exceptional things happen, and only ever handle the exceptions you can actually do something about. In 99.999% of projects, what are you actually going to do at the application level to properly deal with an OOM or disk full error? Nothing. It's almost always better to just crash and let the OS / daemon manager / top level event loop log that something bad happened and schedule a retry.


The whole story is 3-fold. We have

  - errors in the software itself, aka BUGS
  - logical conditions that are an expected part of the program's execution flow and state. Some of these might be error conditions, but only for the *user*. In other words, they're not errors in the software itself.
  - unexpected failures where none of the above applies: typically when some OS resource allocation fails (memory, socket, mutex, etc.) and the reason is not that the programmer called the API wrong.

In the first category we're dealing with BUGS, and when I advocate asserting and terminating the process, that only really applies to BUG conditions. If you let an application continue in a buggy state, you cannot logically reason about it anymore.

The logical conditions are the typical cases, for example "file not found" or whatever. The user tries to use the software but there's a problem. The application needs to deal with these, but from the software-correctness perspective there's no error; the error is only what the user perceives. When your browser prints "404" or "no internet connection", the software works correctly. The error exists only from the user's perspective.

Finally, the last category is those unexpected situations where something that should not fail does. It is quite tricky to get these right. Is straight-up exiting the right choice? Maybe the system will have more resources later if you just back off and try again. Personally, in C++ projects my strategy is to employ exceptions and let the callstack unwind to the UI level, inform the user, and then just return to the event loop. Of course the real trick is to keep the program state such that it's in some well-defined state and not in a BUGGY state ;-)
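
A minimal sketch of that last strategy (the event type and UI hooks are stand-ins):

  #include <cstdio>
  #include <exception>
  #include <stdexcept>

  struct Event { int type = 0; };

  // Stand-ins for the real UI plumbing.
  bool next_event(Event& ev) { ev.type = 1; return false; }   // real loop blocks on the UI queue
  void show_error_dialog(const char* what) { std::fprintf(stderr, "error: %s\n", what); }

  // May throw std::bad_alloc, std::system_error, etc. when a resource allocation fails.
  void handle(const Event& ev) {
      if (ev.type == 0) throw std::runtime_error("could not allocate resource");
      // ... normal handling ...
  }

  void event_loop() {
      Event ev;
      while (next_event(ev)) {
          try {
              handle(ev);
          } catch (const std::exception& e) {
              show_error_dialog(e.what());   // inform the user, stay in a well-defined state
          }
      }
  }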


When a process is used to serve multiple requests, I don't think you need to let the whole process terminate just because there is a bug in dealing with a single request. Just because we cannot reason about the current request does not mean the only way to get to a clean state for the other requests is to terminate the whole process.


That sounds about right to me. Worry about the things you can fix and don't worry about the things outside your control.


Makes sense. Better to unwrap via .expect("msg"), though.


That's a good callout, but I do that if and when I can add extra meaningful context.

From a user's POV, "I already know what file not found means. You don't have to explain it to me again in your own words."


The thing I wish more error messages did was tell me exactly which file was not found.


This is something really annoying about simple error codes. Sure they're lightweight but how the hell am I supposed to know the problem with my input when all the error information I get is "The parameter is incorrect"? I've actually had cases where I disassembled Windows system libraries to track down the exact validation that was failing.


Asserts are only available in debug compile mode.



