Rust doesn't solve the CrowdStrike outage

blissofbeing · 2024-07-23T16:35:54 1721752554

So the argument is that yes Rust would have helped mitigate this specific bug, but bugs exist in all languages so therefore it doesn't matter the language.

I see the logic, but I don't think anyone is saying using Rust would make it bug free, people are just correctly pointing out Rust would have helped this bug, or in other words there is the potential for less bugs with Rust.

So yes, and.

vb-8448 · 2024-07-23T16:43:37 1721753017

I would argue that the whole point of the post is: nothing in the world can save you from bad practices.

There are at least 3 bad practices here:

- releasing of untested SW/config files

- bypassing of final users' rollout policies

- unconditional faith of final users in crowdstrike

Sakos · 2024-07-23T17:23:08 1721755388

> The above are all things that could (and sould!) be done to reduce the chances of a misbehavior happening, but we must accept that the code bug was just the specific trigger this time around and a different trigger could have had similarly nefarious consequences. The root cause behind the outage lies in the process to get the configuration change shipped to the world.

> Now, SRE 101 (or DevOps or whatever you want to call it) says that configuration changes must be staged for slow and controlled deployment, and validated at every step. Those changes should first be validated in a very small scale before being pushed to the world later on, and every push should be incremental.

Unfortunately, the article is sort of burying the lede until half-way through until it makes some decent points.

We should be using safer languages, but also 1) how is it possible that CrowdStrike can push a content update globally to all clients with no option for their customers to delay it for testing and 2) why doesn't CrowdStrike have internal testing before deployment?

SoftTalker · 2024-07-23T17:00:06 1721754006

And "this bug" being one of the most common types of bugs.

smsm42 · 2024-07-23T18:00:38 1721757638

> people are just correctly pointing out Rust would have helped this bug

So would a unit test or a fuzzer. But it's obvious this does not solve the actual problem, only a particular instance of it. It should be as obvious that mentioning Rust doesn't too. Kernel module in Rust may not have buffer overflow bugs (even that is not 100% certain, but let's assume for a minute) but it doesn't mean it's safe - or even significantly less unsafe.

fleventynine · 2024-07-23T16:46:07 1721753167

The way I see it, there are a limited number of ways that a bare-metal component with no heap can "catastrophically fail":

1. "Undefined behavior" due to typical memory bugs: code written in 100% safe Rust cannot trigger this behavior.

2. "Undefined behavior" due to a stack overflow: this can happen in safe Rust, but you can provably guard against it by banning recursion and dynamic dispatch, and doing static stack depth analysis on your program.

3. Panic/abort: naively written bare-metal Rust typically has many potential calls to panic!() due to bounds checks, integer overflows, .unwrap(), etc, and if these get called the system will typically fail catastrophically. I like to solve this by having CI fail if the optimized binary contains ANY call-sites to the panic handler. This forces the developer to handle these edge cases with idiomatic error handling before they can merge their code (but can cause some development pain/brittleness when the optimizer heuristics change).

4. Infinite loops: in most bare metal systems, if a section of code doesn't complete within some reasonable time, a watchdog will fire and the system will crash. Ideally there would be some static analysis that could tell me whether some particular code is capable of exceeding the allotted time bounds, but I'm not aware of such tooling, and need to resort to hacky solutions such as fuzzing and handling resets "gracefully" at runtime.

cdchn · 2024-07-23T17:00:07 1721754007

>having CI fail if the optimized binary contains ANY call-sites to the panic handler.

How do you do that?

fleventynine · 2024-07-23T17:08:16 1721754496

Have your panic handler call an extern "C" function that doesn't exist, and the linker will complain if there's any calls to that function that weren't optimized out. Alternately, you can look for the symbol in the output elf file and fail if it's there, which is a bit more graceful as panicking binaries can still be produced during development.

woodruffw · 2024-07-23T17:02:17 1721754137

You can use something like `no_panic`[1].

You could also do it statically, post-build, by looking for calls to the panic handler.

[1]: https://docs.rs/no-panic/latest/no_panic/

bpicolo · 2024-07-23T17:10:55 1721754655

Does annotating `main` with no_panic do the trick? How much does that tend to affect compile times, do you know?

woodruffw · 2024-07-23T17:18:13 1721755093

> Does annotating `main` with no_panic do the trick?

I've never tried that, so maybe :-)

> How much does that tend to affect compile times, do you know?

IME, not very much -- there might be a small amount of overhead, but the "trick" behind the check is to fail the linker with a symbol error if the panic code is inserted into a function. So the compiler itself should be roughly as fast, and the error path in the linker is really the only thing that should change too much.

noxs · 2024-07-23T16:40:50 1721752850

Meanwhile the author also says this

> Rust would have prevented this incident irks me.

That's exactly everyone's point.

and

> Think of… an innocent buggy unwrap() call if the code actually used Rust.

Sure but would you not just ban usages of unwrap with static lint for a kernel program?

This is more or less like clickbait to me.

uecker · 2024-07-23T17:04:18 1721754258

The point is that merely using Rust would not automatically prevent this problem. Of course you could use Rust and forbid all unwrap(). But then question is why you could not forbid all unchecked dereferences (e.g. ensured with static analysis) in C/C++. The problem with C/C++ is not that this is impossible, it is that this is not done.

nindalf · 2024-07-23T17:27:50 1721755670

Anything is possible in theory. In practice Rust codebases have high standards because it is easy to set up. For instance, most production users use clippy, which has a lint that will tell you if you're using unwrap() outside of test code. Clippy is so easy to set up, that you'd be silly not to. And this in addition to all the checks the compiler does for you for free.

To answer your question - it's much easier to fall into the pit of success when using Rust.

Sure C++ users could set up ASAN, UBSAN etc., but it's extra effort and many people don't bother.

capitainenemo · 2024-07-23T18:02:28 1721757748

And based on what I've read over the years here on HN those compile and run-time tools still are limited by the language and have known failure modes that rust checks would not have.

uecker · 2024-07-23T18:26:23 1721759183

I am not really convinced. I could easily see Rust in the wild be full of unwrap, unsafe, etc.. if written by people that do not bother but are forced to use Rust.

And then, what is even the right behavior for a kernel driver that encounters an unexpected problem related to input? In the interest of security you might want it to crash the machine instead of continuing. I think panic is the default in Rust for many error cases - and I think this is the right approach, but it would have led to exactly the same result in this case.

nindalf · 2024-07-23T21:43:40 1721771020

Yes in theory a person could write poor Rust code. But in practice the quality of popular Rust libraries is pretty high.

Linus says you never crash in the kernel, which is why Rust code merged into Linux won’t crash.

smabie · 2024-07-23T17:10:03 1721754603

Author says Rust wouldn't solve the crowd strike outage but then literally admits that it would have prevented it?

This article makes no sense

steveklabnik · 2024-07-23T17:54:55 1721757295

They're basically making an argument of "talking about the programming language involved is missing the forest for the trees." Bugs are inevitable, regardless of language, and so the real culprit here is a lack of QA/validation before deploy.

Which like, sure, it's a good point in general, but it's also worth discussing if certain technologies would make your QA team's lives easier. Both "would this tool catch this bug" and "are there structural problems that led to a bug slipping through into production" are valid questions.

In my mind, the author is making the same mistake he accuses others of making, just from the opposite direction.

kaycey2022 · 2024-07-24T03:34:59 1721792099

QA aren't some magic wizards that can catch all bugs. They might never hit the memory corruption scenarios that Rust optimizes for. I think you need everything - a good high level language, experienced devs who know that domain, and good practises surrounding each release. There is no good reason for a billion dollar corp to skimp on these.

wolrah · 2024-07-23T19:16:39 1721762199

> Author says Rust wouldn't solve the crowd strike outage but then literally admits that it would have prevented it?

If a person who got claymored by a Takata airbag in a car crash had taken a different route the crash wouldn't have happened and the airbag explosion would have been prevented, but the actual problem still exists and is waiting for some other thing to trigger it.

That's the point the article is making, yes Rust would probably have prevented this specific exact incident but the actual problem that led to this is Crowdstrike's apparent lack of testing and staged release processes for these channel files which could just as easily have triggered some other bug that Rust would not necessarily have had any impact on.

CodeWriter23 · 2024-07-23T17:18:19 1721755099

IMO in 20 years we’ll be discussing new classes of problems arising from the abstraction methodologies inherent in languages like Rust, Swift etc.

gdcbe · 2024-07-23T17:44:14 1721756654

Invalid pointers and related bugs are older then 20 years…

Languages like Rust don’t solve all problems and nobody is hoping that the research and progress towards better tooling stops there. Even many people working on Rust give plenty of ideas on what future languages can do better or different.

You don’t need to wait 20 years, those discussions are already happening…

So uh… I’m confused on what exactly you are trying to communicate here?

CodeWriter23 · 2024-07-23T18:39:01 1721759941

I can tell.

Knowing how your machine works is the only answer. It’s funny you cite the age of pointer issues as some kind of proof. Pointers themselves are abstractions. I’m not saying avoid abstractions or code everything in assembler, but if you want to be a strong practitioner in this field you better be able to disassemble any code and understand what is happening. And alter your practices accordingly.

kkfx · 2024-07-23T20:27:43 1721766463

Declarative systems like NixOS or Guix system solve, since a reboot in an old generation is quick and done, mandatory FLOSS instead of black boxes and pull-based upgrades instead of vendor-push-based ones do solve.

A bug can always happen, being able to decide WHEN and HOW to risk and revert anything with ease it's what's needed. Unfortunately most of the world, many technicians included so far largely ignore any advancement over early '80s tech still in use today, before was zfs "a rampant layer violation" with btrfs and stratis as a good example of reactionary tech choices and monsters resulting from them, than it's about declarative systems vs the "container mania", a re-edition of full-stack-virtualization on x86 previous mania ALL because some big players profit (alone) from such designs...

mamcx · 2024-07-23T17:38:33 1721756313

Better tools do not solve structural human problems.

But

Is a sign of better management if there are better tools in use.

ie: Good devs/managers will pick better tools/methodologies/practices because well, there are better devs, right? And if there are not, is a part of the path to become one to do it.

*p.d: However, sometimes by mistake or maybe design, better tools/environment cause people to do better...*

chris_wot · 2024-07-23T16:59:31 1721753971

I would have thought eBPF is the thing that might have prevented the CrowdStrike issue.

trueismywork · 2024-07-23T17:09:01 1721754541

They triggered a bug in ebpf implementation. Quite impressive actually

jacobr1 · 2024-07-23T17:57:52 1721757472

Wasn't this an issue in Windows? I know there is an ebpf compatibly layer of sorts - but wasn't aware that was being used by crowdstrike.

capitainenemo · 2024-07-23T18:16:04 1721758564

Yeah, I think they are confusing the (far more limited) issue with RedHat that CrowdStrike triggered last month, with the Windows issue. Based on https://news.ycombinator.com/item?id=41033579 Windows does not have ebpf yet and even once they do have it, it will not have as complete of coverage as in Linux for a while, so would not necessarily prevent this.

sapiogram · 2024-07-23T18:16:47 1721758607

What? Source?

craxdevil7 · 2024-07-23T22:38:16 1721774296

there are single points of failure that will continue existing in our digital space. Rust is definitely memory safe, the issue here comes to failed process and bad practices. https://medium.com/@confusedcyberwarrior/our-fragile-digital...

kelnos · 2024-07-23T17:11:30 1721754690

Um, the article, within the first few paragraphs, admits that Rust would have actually prevented this particular outage.

But we shouldn't point that out because there could be other bugs that Rust would not have prevented?

What a completely garbage argument with a premise that starts out with a bald-faced lie. Flagging and moving on.

0xCAP · 2024-07-23T20:22:38 1721766158

The problem was not the bug. The problem was companies having no disaster recovery systems in place.

keybored · 2024-07-23T17:28:06 1721755686

Who is even claiming that?

> Look, I like Rust. I really, really do, and I agree with the premise that memory-unsafe languages like C++ should not be used anymore. But claiming that Rust would have prevented the massive outage that the world went through last Friday is misleading and actively harmful to Rust’s evangelism.

A wild counter-claim appears. Where’s the original claim? Maybe this is just based on the ephemeral hot-take microblogging mill which is not supposed to make sense out of context or even half a week after the original submission. But I would really appreciate a citation for such “people are saying” claims.

Here’s what I saw on HN:

- People complaining about Windows; Linux would have handled it better

- People complaining about how it ran as a kernel module or whatever

I literally saw no one complain about the language it was written in. And I tend to notice that since I’m such a programming language armchairer.

> Having CrowdStrike written in Rust would have minimized the chances of the outage happening, but not resolved the root cause that allowed the outage to happen in the first place. Thus, it irks me to see various folks blanket-claiming that Rust is the answer. It’s not, and pushing this agenda hurts Rust’s adoption more than it helps: C++ experts can understand the root cause and see that this claim is misleading, causing further divide in the systems programming world.

I resent this concern trolling over “why can’t we get along?” That the Rust community is especially tech-evangelic is a myth that simply won’t die, thanks in part to these unsourced claims about the bravado Rustaceans versus the wise C++ experts which result in “further divide” (three unsourced claims thus far). Look at any up-and-coming language. You won’t have to look far in order to find the wide-eyed enthusiasts, trust me.

> And this is where some Rust enthusiasts will zero in and say “Ah-HAH! We got you, fools. If the code had been written in Rust, this bug would not have existed!” And, you know what, that’s literally true: this specific bug would not have happened.

Don’t you love when the enthusiasts do that? Allegedly.

> And, you know, there are many more C++ developers that work in kernel space than Rust developers know kernel internals (oops, another citation need). So, naturally, a large portion of C++ developers can smell the rubbish in this claim. Which is unfortunate because this increases animosity between the two communities, which goes against the goal of converting folks to safe languages. Rust folks know that Rust can definitely help make the situation better, but C++ folks cannot get bought into it because the arguments they hear don’t resonate with them.

Here we see a further development in the animosity between the (seasoned/expert/kernel) C++ developers and the Aha-Gotcha Rust enthusiasts. Why can’t we just get along?

unethical_ban · 2024-07-23T16:41:33 1721752893

Gonna tear into the criticism of the product for a moment.

>Falcon is a product typically installed on corporate machines so that the security team can detect and mitigate threats in real time (while monitoring the actions of their employees).

Infosec here.

Some of the proposed "real time behavioral analytics" tools I have heard about are creepy and draconian, but this is a malware detection software. "Monitoring actions of employees" is not what I understand the tool to be used for.

Second, the privacy absolutists on HN have yelled at me before for defending MITM devices like web proxies and DNS filters for "degrading the security of TLS". If we don't do security on the network, then the only place to do it is on the host. Which means, in a perfect (security) world, the endpoint has some kind of intercept layer looking at all incoming/outgoing traffic, anyway.

In other words: This product AFAICT isn't human-behavior analytics, and endpoint security is the future of infosec as TLS gets better so it isn't "whether this is good software for companies to have" it's "is this version of the software reliable and functional".

unethical_ban · 2024-07-24T01:47:22 1721785642

I will entertain any counterpoint to my comment. It's relevant because it's in the article. I left the analysis of the author's takes to those who know the language better.

chris_wot · 2024-07-23T17:01:02 1721754062

I’m not sure why you have been downvoted so severely, this seems like a reasonable take on the issue.

unethical_ban · 2024-07-23T19:20:13 1721762413

Maybe the term "privacy absolutists" annoyed some people, or Steve's point.

steveklabnik · 2024-07-23T17:06:41 1721754401

If I had to guess (I didn’t downvote) it’s because this article isn’t about the product, so it feels off topic.

That said HN also tends to dislike meta commentary about voting patterns (it’s discouraged in the site guidelines) so you and I may get downvoted too.