The moment I read 'it is a content update that causes the BSOD, deleting it solves the problem', I was immediately willing to bet a hundred quid (for the non-British, that's £100) that it was a combination of said bad binary data and a poorly-written parser that didn't error out correctly upon reading invalid data (in this case, one that read an array of pointers and didn't verify that each of them was both non-null and pointed to valid data/code).
In the past ten years or so of doing somewhat serious computing and zero cybersecurity whatsoever, here is what I've concluded; feel free to disagree.
Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.
This includes everything from: decompression algorithms; font outline readers; image, video, and audio parsers; video game data parsers; XML and HTML parsers; the various certificate/signature/key parsers in OpenSSL (and derivatives); and now, this CrowdStrike content parser in its EDR program.
That wager stands, by the way, and I'm happy to up the ante by £50 to account for my second theory.
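To make that theory concrete, here's a rough C sketch of the failure mode I'm betting on. The header layout, magic value, and function name are all invented for illustration - whether the real channel-file format looks anything like this is pure speculation - but the missing checks are the point:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical on-disk layout: a count followed by offsets into the file. */
    struct channel_header {
        uint32_t magic;
        uint32_t count;
        uint32_t offsets[];   /* 'count' offsets to records elsewhere in the file */
    };

    /* Defensive version: validate everything before dereferencing anything. */
    static int parse_channel(const uint8_t *buf, size_t len)
    {
        const struct channel_header *hdr = (const struct channel_header *)buf;

        if (len < sizeof(*hdr) || hdr->magic != 0xC0FFEE01u)
            return -1;                              /* truncated, or not our file at all */
        if (hdr->count > (len - sizeof(*hdr)) / sizeof(uint32_t))
            return -1;                              /* offset table runs past the buffer */

        for (uint32_t i = 0; i < hdr->count; i++) {
            uint32_t off = hdr->offsets[i];
            if (off == 0 || off >= len)             /* null or out-of-bounds "pointer" */
                return -1;                          /* reject the whole file; don't crash */
            /* ... only now is buf + off safe to read from ... */
        }
        return 0;
    }

The buggy version is the same function with the three `if`s deleted - it works perfectly right up until someone ships a file full of zeroes.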
There are at least five different things that went wrong simultaneously.
1. Poorly written code in the kernel module crashed the whole OS and kept trying to parse the corrupted files on every boot, causing a boot loop, instead of handling the error gracefully and deleting the files or marking them as corrupt.
2. Either the corrupted files slipped through internal testing, or there is no internal testing.
3. Individual settings for when to apply such updates were apparently ignored. It's unclear whether this was a glitch or standard practice. Either way I consider it a bug (it's just a matter of whether it's a software bug or a bug in their procedures).
4. This was pushed out everywhere simultaneously instead of staggered to limit any potential damage.
5. Whatever caused the corruption in the first place, which is anyone's guess.
Number 4 continues to be the most surprising bit to me. I could not fathom having a process that involves deploying to 8.5 million remote machines simultaneously.
Bugs in code I can almost always understand and forgive, even the ones that seem like they’d be obvious with hindsight. But this is just an egregious lack of the most basic rollout standards.
For me, number 1 is the worst of the bunch. You should always expect that there will be bugs in processes, input files, etc… the fact that their code wasn’t robust enough to recognize a corrupted file and not crash is inexcusable. Especially in kernel code that is so widely deployed.
If any one of the five points above hadn’t happened, this event would have been avoided. However, if number 1 had been addressed - any of the others could have happened (or all at the same time) and it would have been fine.
I understand that we should assume that bugs will be present anywhere, which is why staggered deployments are also important. If there had been staggered deployments, the damage would still have happened, but it would have been localized. I think security people would argue against a staged deployment though, as if it were discovered what the new definitions protected against, an exploit could be developed quickly to put those servers that aren't in the "canary" group at risk. (At least in theory; I can't see how staggering deployment over a 6-12 hour window would have been that risky.)
They're all terrible, but I agree #1 is particularly egregious for a company ostensibly dedicated to security. A simple fuzz tester would have caught this type of bug, so they clearly don't perform even a minimal amount of testing on their code.
Totally agree. Not only would a coverage-guided fuzzer catch this, they should also be adding every single file they send out to the corpus of that automated fuzz testing, so they get somewhat increased coverage on their parser.
There may not be out-of-the-box fuzzers that test device drivers, so you hoist all the parser code, build it into a stand-alone application, and fuzz that.
Likely this is a form of technical debt, since I can understand not doing all of this on day #1 when you have 5 customers, but at some point, as you scale up, you need to change the way you look at risk.
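To sketch the "hoist the parser into a stand-alone target" idea in code: if the parsing logic can be compiled outside the driver, a clang libFuzzer entry point is only a few lines. (parse_channel here is a stand-in for whatever the real routine is called - purely illustrative, not CrowdStrike's actual code.)

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical: the same parsing routine the driver uses, built for user space. */
    int parse_channel(const uint8_t *buf, size_t len);

    /* libFuzzer calls this with generated inputs; any crash or sanitizer report is a finding. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        parse_channel(data, size);   /* must never crash, whatever the bytes are */
        return 0;
    }

Build with something like `clang -g -fsanitize=fuzzer,address harness.c parser.c`, seed the corpus with every channel file ever shipped, and leave it running in CI.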
I disagree. It has to be 4: something will always go wrong, so you have to deliver in cohorts.
That goes equally whether it was a Windows Update rolled out in one motion that broke the Falcon agent/driver, or whether it was CrowdStrike's own update. There is almost no excuse for a global rollout without telemetry checks, whether it's security agent updates or OS patches.
It might be the worst mistake, but number 1 was always going to happen sometime.
And even testing can't be trusted 100%, because writing code that does the right thing and code that tests things correctly are about equally hard, they just aren't always hard simultaneously.
You admit that bugs are inevitable and then claim a bug free parser as the most important bullet. That seems flawed to me. It would certainly be nice, but is that achievable?
Policy changes seem more reliable and would catch other, as of yet unknown classes of bugs.
This shouldn't be an either-or situation; you do all of the above. A simple validating parser in the client would be easy to write and would have easily caught a null payload.
What looks especially bad for Crowdstrike is how many things (relatively simple things) had to fail in order for this to slip through. It's like walking into Fort Knox, grabbing a gold bar, and walking out unimpeded. A complete systemic failure.
Surely, CrowdStrike's safety posture for update rollouts is in serious need of improvement. No argument there.
But is there any responsibility for the clients consuming the data to have verified these updates prior to taking them in production? I haven't worn the sysadmin hat in a while now, but back when I was responsible for the upkeep of many thousands of machines, we'd never have blindly consumed updates without at least a basic smoke test in a production-adjacent UAT type environment. Core OS updates, firmware updates, third party software, whatever -- all of it would get at least some cursory smoke testing before allowing it to hit production.
On the other hand, given EDR's real-world purpose and the speed at which novel attacks propagate, there's probably a compelling argument for always taking the latest definition/signature updates as soon as they're available, even in your production environments.
I'm certainly not saying that CrowdStrike did nothing wrong here, that's clearly not the case. But if conventional wisdom says that you should kick the tires on the latest batch of OS updates from Microsoft in a test environment, maybe that same rationale should apply to EDR agents?
> But is there any responsibility for the clients consuming the data to have verified these updates prior to taking them in production
In the boolean sense, yes. United Airlines (for example) is ultimately responsible for their own production uptime, so any change they apply without validation is a risk vector.
In pragmatic terms, it's a bit fuzzier. Does CrowdStrike provide any practical way for customers to validate, canary-deploy, etc. changes before applying them to production? And not just changes with type=important, but all changes? From what I understand, the answer to that question is no, at least for the type=channel-update change that triggered this outage. In which case I think the blame ultimately falls almost entirely on CrowdStrike.
"In which case I think the blame ultimately falls almost entirely on CrowdStrike"
I would say on the client for buying into CrowdStrike.
And also the client for having no contingencies and just accepting a vendor pinky-swear as meaningful.
CrowdStrike failed at their responsibilities too, I just mean that so did everyone else.
When you cede your own responsibilities to someone else and don't have that backed up with contractually enforced liability to make you whole when they fuck up, and also don't provide your own contingency so it doesn't really matter what some vendor does, that's on you. That's 100% entirely on you and it doesn't matter if a million other people also did the same utterly thoughtless and lazy thing.
> I would say on the client for buying into CrowdStrike.
I understand this perspective but I think it misses the forest for the trees. You have to evaluate this kind of stuff in context. Purity tests play well on tech message boards, where nobody has any accountability to any kind of business requirements, but basically no real-world organization operates that way, so it's all a bit irrelevant.
> When you cede your own responsibilities to someone else ...
This framing is a bit naive, I think. It isn't a boolean. Everything is about risk management, cost/benefit analysis.
> From what I understand, the answer to that question is no, at least for the type=channel-update change that triggered this outage. In which case I think the blame ultimately falls almost entirely on CrowdStrike.
Honestly, it hadn't even occurred to me that software like this, marketed at enterprise customers, wouldn't have this kind of control already available. It seems like such an obvious thing for any big organization to insist on that I just took it for granted that it existed.
It seems nuts to me too - MS Defender has this out of the box. From looking at sysadmins on reddit, it seems that CS has a tiered update mechanism, but didn’t use it for this change.
>Arguably United airlines shouldn't have chosen a product they can't test updates of, though maybe there are no good options.
I used to work with regional parks and recreation departments, and they would not approve any updates that did not go through the UAT environments that we had set up. All updates had to be deployed to their UAT environment and thoroughly tested before going to their production environment.
I get that this is slightly different, but I'd imagine airlines, banks, and hospitals would have far stricter UAT policies to keep a single vendor from kneecapping operations.
Checks out - my company had lots of issues on Friday afternoon, and when it first happened I wondered who on Earth decided to roll out updates to prod systems on Friday afternoon.
Yeah, one of the major problems seems to be CrowdStrike's assumption that channel files are benign, which isn't true if there's a bug in your code that only gets triggered by the right virus definition.
I don't know how you could assert that this is impossible, hence channel files should be treated as code.
I think point 3 of the grand parent indicates admins were not given an opportunity to test this.
My company had a lot of Azure vms impacted by this and I'm not sure who the admin was who should have tested it. Microsoft? I don't think we have anything to do with crowdstrike software on our vms. ( I think - I'm sure I'll find out this week.)
Edit: I just learned the Azure central region failure wasn't related to the larger event - and we weren't impacted by the CrowdStrike issue - I didn't know they were two different things. So the second part of my comment is irrelevant.
Oh, I'd missed point #3 somehow. If individual consumers weren't even given the opportunity to test this, whether by policy or by bug, then ... yeesh. Even worse than I'd thought.
Exactly which team owns the testing is probably left up to each individual company to determine. But ultimately, if you have a team of admins supporting the production deployment of the machines that enable your business, then someone's responsible for ensuring the availability of those machines. Given how impactful this CrowdStrike incident was, maybe these kinds of third-party auto-update postures need to be reviewed and potentially brought back into the fold of admin-reviewed updates.
It's not an option. While the admins at the customer have the ability to control when/how revisions of the client software go out (and thus can, and generally do, do their own testing, can decide to stay one rev back by default, etc.), there is no control over updates to the kind of update/definition files that were the primary cause here.
Which is also why you see every single customer affected - what you are suggesting is simply not an available option for them at present.
At least for now - I imagine that some kind of staggered/slowed/ringed option will have to be implemented in the future if they want to retain customers.
They probably don't get to claim agile story points until the ticket is in a finished state. And they probably have a culture where vanity metrics like "velocity" are prioritized.
My understanding is that the culture (as reported by some customers) is quite aggressive and pushy. They are quite vocal when customers don't turn on automatic updates.
It makes sense in a way - given their fast-growth strategy (from nowhere to top 3) and their desire to "do things differently" as the iconoclast upstarts that redefine the industry.
This is one BSOD on Windows 10. I saw another kernel panic on a specific Linux distro.
What else?
One thing that is funny is that quite a few of their competitors are taking this opportunity to shit on them via Twitter and by marketing themselves as better than CrowdStrike.
Twitter, with all its issues, apparently has a feature to counter fake news by showing crowd-sourced sentiment to debunk it; in this case, Twitter users showed how many times CrowdStrike's competitors have themselves BSOD'd Windows.
Bottom line is this: there is absolutely no good reason for not doing rolling updates. Do a few and make sure they are OK. Keep rolling out in groups. This approach alone would've meant that this event was of marginal impact to most of the public, as sysadmins would've had the opportunity to halt further updates and work on remediating their first group (typically non-critical servers). Rolling out to everything all at once is just bad practice, period.
Customers don’t always have a choice here. They could be restricted by compliance programs (PCI, et al) and be required under those terms to have auto updates on.
Compliance also has to share some of the blame here, if best practices (local testing) aren’t allowed to be followed in the name of “security”.
This needs to keep being repeated anytime someone wants to blame the company.
Many don’t have a choice; a lot of compliance is doing X to satisfy a checkbox, and you don’t have a lot of flexibility in that, or you may not be able to do things like process credit cards, which is kinda unacceptable depending on your company. (Note: I didn’t say all.)
CrowdStrike automatic update happens to satisfy some of those checkboxes.
Oh, the games I have to play with story points that have personal performance metrics attached to them. Splitting tickets to span sprints so there aren’t holes in some dude’s “effort” because they didn’t complete some task they committed to.
I never thought such stories were real until I encountered them…
I worked at one of the big ones and we always shipped live to all consumer devices at the same time. But this was for a popular suite of products that generate a lot of consumer demand, so we had a rigorous QA process to make sure this wouldn't be a problem. As I was typing this, it occurred to me that zero people would have cared if this update had been staggered, making it pretty silly not to.
Malware signature updates are supposed to be deployed ASAP, because every minute may count when a new attack is spreading. The mistake may have been to apply that policy indiscriminately.
A lot of snarky replies to this comment, but the reality is that if you were selling an anti-virus, identified a malicious virus, and then chose not to update millions of your machines with that virus’s signature, you’d also be in the wrong.
I’m not saying don’t update? I’m talking about rolling the update over the course of a short amount of time, like under an hour. With the ability to stop the rollout.
On the other hand, the diseases vaccines prevent don't have almost-instantaneous propagation; that's why they are effective at containing propagation.
As an example, reaction time is paramount to counter many kinds of attacks - that's why blocklists are so popular, and AS blackholing is a viable option.
> But this is just an egregious lack of the most basic rollout standards.
Agreed. It's crazy that the top tech companies enforce this in a biblical fashion, despite all sorts of pressure to ship and all that. Crowdstrike went YOLO at a global scale.
And here I thought shipping a new version on the app store was scary.
Is there anything we can take from other professions/tradecraft/unions/legislation to ensure shops can’t skip the basic best practices we are aware of in the industry like staged rollouts? How do we set incentives to prevent this? Seriously the App Store was raking in $$ from us for years with no support for staged rollouts and no other options.
I wonder if there's a concern that staggering the malware signatures would open them up to lawsuits if somebody was hacked in between other customers getting the data and them getting the data.
> I wonder if there's a concern that staggering the malware signatures would open them up to lawsuits if somebody was hacked in between other customers getting the data and them getting the data.
I'd assume that sort of thing would be covered in the EULA and contract -- but even if it weren't, it seems like allowing customers to define their own definition update strategy would give them a pretty compelling avenue to claim non-liability. If CrowdStrike can credibly claim "hey, we made the definitions available, you chose to wait for 2 weeks to apply them, that's on you", then it becomes much less of a concern.
Zero effort to fuzz test the parser too. I mean, we know how to harden parsers against bugs and attacks, and any semi-competent fuzzer would have caught such a trivial bug.
You are seriously overestimating the engineering practises at these companies. I have worked in "enterprise security" previously, though not at this scale. In a previous life I worked with one of the engineering leaders currently at CrowdStrike.
I'll bet you this company has some arbitrary unit-test coverage requirements for PRs, which developers game by mocking the heck out of dependencies. I am sure they have some vanity SonarQube integration to ensure great "code quality". This likely also went through manual QA.
However I am sure the topic of fuzz testing would not have come up once. These companies sell checkbox compliance, and they themselves develop their software the same way. Checking all the "quality engineering" boxes with very little regards for long term engineering initiatives that would provide real value.
And I am not trying to kick Crowdstrike when they are down. It's the state of any software company run by suits with myopic vision. Their engineering blogs and their codebases are poles apart.
AV software needs kernel privileges to have access to everything it needs to inspect, but the actual inspection of that data should be done with no privileges.
I think most AV companies now have a helper process to do that.
If you successfully exploit the helper process, the worst damage you ought to be able to do is falsely find files to be clean.
> ...the worst damage you ought to be able to do is...
Ought. But it depends on the way the communication with the main process is done. I wouldn't be surprised if the main process trusts the output from the parser just a tiny bit too much.
Competent fuzzers don't just use random bytes, they systematically explore the state-space of the target program. If there's a crash state to be found by feeding in a file full of null bytes, it's probably going to be found quickly.
A fun example is that if you point AFL at a JPEG parser, it will eventually "learn" to produce valid JPEG files as test cases, without ever having been told what a JPEG file is supposed to look like. https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...
AFL is really "magical". It finds bugs very quickly and with little effort on our part except to leave it running and look at the results occasionally. We use it to fuzz test a variety of file formats and network interfaces, including QEMU image parsing, nbdkit, libnbd, hivex. We also use clang's libfuzzer with QEMU which is another good fuzzing solution. There's really no excuse for CrowdStrike not to have been using fuzzing.
Instrumented fuzzing (like AFL and friends) tweaks the input to traverse unseen code paths in the target, so they're super quick to find stuff like "heyyyyy, nobody is actually checking if this offset is in bounds before loading from that address".
I wonder if it was pushed anywhere that didn't crash, as an extension of "It works on my machine. Ship it!"
I've built a couple of kernel drivers over the years and what I know is that ".sys" files are to the kernel as ".dll" files are to user-space programs in that the ones with code in them run only after they are loaded and a desired function is run (assuming boilerplate initialization code is good).
I've never made a data-only .sys file, but I don't see why someone couldn't. In that case, I'd guess that no one ever checked it was correct, and the service/program that loads it didn't do any verification either -- why would it, the developers of said service/program would tend to trust their own data .sys file would be valid, never thinking they'd release a broken file or consider that files sometimes get corrupted -- another failure mode waiting to happen on some unfortunate soul's computer.
The file extension is `sys` by convention; there's nothing magical about it, and it's not handled in any special way by the OS. In the case of CrowdStrike, there seems to be some confusion as to why they use this file extension, since it's only supposed to be a config/data file used by the real kernel driver.
Thanks. I understand that '.sys' is a naming convention. I'd guess that they used it because those config/data files are used by their kernel driver, and so makes kernel vs user-space files easier to distinguish.
Number 4 is what everyone will fixate on, but I have the biggest problem with number 1. Anything like this sort of file should have (1) validation on all its pointers and (2) probably >2 layers of checksumming/signing. They should generally expect these files to get corrupted in transit once in a while, but they didn't seem to plan for anything other than exactly perfect communication between their intent and their kernel driver.
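Even something as blunt as a length-plus-CRC check in front of the parser turns "boot loop" into "update rejected, error logged". A rough sketch using zlib's crc32 (the trailer layout is invented, and in reality you'd want a real signature on top of this):

    #include <stdint.h>
    #include <string.h>
    #include <zlib.h>

    /* Hypothetical convention: the file ends with a 4-byte CRC32 of everything before it. */
    static int channel_file_looks_sane(const uint8_t *buf, size_t len)
    {
        if (len < 4)
            return 0;                               /* too short to even hold the trailer */

        uint32_t stored;
        memcpy(&stored, buf + len - 4, sizeof(stored));   /* host byte order, for brevity */

        uLong computed = crc32(0L, Z_NULL, 0);
        computed = crc32(computed, buf, (uInt)(len - 4));

        return (uint32_t)computed == stored;        /* mismatch => corrupted, refuse to parse */
    }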
I think you mean: In theory, files don't get corrupted in transit with TCP. In theory, they also don't get corrupted when stored in memory or on disks either.
The only reason any of these things don't cause issues in practice is checksums and error correcting codes.
No, I mean that TCP checks for corruption in transit and has the packets resent in that case. I guess you could be running a buggy implementation, but that seems unlikely with how ubiquitous TCP is.
Errors can show up any time, and usually show up between the parts that checksums correct. On the wire, TCP protects you with a (weak) checksum. Off the wire, your computer and filesystem can still screw things up. Even CPU bugs can do this.
There is a story out that the problem was introduced in a post processing step after testing. That makes more sense than that there was no testing. If true it means they thought they’d tested the update, but actually hadn’t.
Of all of these, I think #3 leaves CrowdStrike the most exposed, legally. Companies with robust update and config management protocols got burned by this as well, including places like hospitals and others with mission-critical systems where config management is more strictly enforced.
If the crowdstrike selloff continues, I'm betting this will be why.
(There's a chance I'll make trading decisions based on this rationale in the next 72 hours, though I'm not certain yet)
Thing is, as far as I can see, deploying this database update to a Windows machine will result promptly and unconditionally in a BSOD. That implies that this update was tried on exactly zero machines before it was shipped.
The bug can't have "slipped through internal testing"; it would have failed immediately on any machine it was loaded on.
I’d also maybe add another one on the Windows end:
6) some form of sandboxing/error handling/api changes to make it possible to write safer kernel modules (not sure if it already exists and was just not used). It seems like the design could be better if a bad kernel module can cause a boot loop in the OS…
It’s a tough problem, because you also don’t want the system to start without the CrowdStrike protection. Or more generally, a kernel driver is supposedly installed for a reason, and presumably you don’t want to keep the system running if it doesn’t work. So the alternative would be to shut down the system upon detection of the faulty driver without rebooting, which wouldn’t be much of an improvement in the present case.
I can imagine better defaults. Assuming the threat vector is malicious programs running in userspace (probably malicious programs in kernel space is game over anyway right?), then you could simply boot into safe mode or something instead of crashlooping.
One of the problems with this outage was that you couldn’t even boot into safe mode without having the bit locker recovery key.
You don’t want to boot into safe mode with networking enabled if the software that is supposed to detect attacks from the network isn’t running. Safe mode doesn’t protect you from malicious code in userspace, it only “protects” you from faulty drivers. Safe mode is for troubleshooting system components, not for increasing security.
I don’t know the exact reasoning why safe mode requires the BitLocker recovery key, but presumably not doing so would open up an attack vector defeating the BitLocker protection.
No. Not in production yet. But that should solve this problem once it's available for any company that uses it (and I believe CrowdStrike is heavily involved with it).
Since the issue manifested at 04:09 UTC, which is 11pm where CrowdStrike's HQ is, I would guess someone was working late at night and skipped the proper process so they could get the update done and go to bed.
They probably considered it low risk, had done similar things hundreds of times before, etc.
Wild that anyone would consider anything in the “critical path” low risk.
I would bet that they just don’t do rolling releases normally since it never caused issues before.
Their sales pitch is being the first to apply patches for any virus. I think it makes sense to try to push as quickly as possible when speed of updates is core to your sales pitch.
> 2. Either the corrupted files slipped through internal testing, or there is no internal testing.
This is the most interesting question to me because it doesn't seem like there is an obviously guessable answer. It seem very unlikely to me that a company like CrowdStrike pushes out updates of any kind without doing some sort of testing, but the widespread nature of the outage would also seem to suggest any sort of testing setup should have caught the issue. Unless it's somehow possible for CrowdStrike to test an update that was different than what was deployed, it's not obvious what went wrong here.
Because nowadays nobody knows that you're supposed to test the actual bits you're going to ship, not whatever random crap comes out of someone's build script, run in a different place and at a different time, that's supposed to be the same as what you'll ship.
6. Companies using CS have no testing to verify that new updates won't break anything.
At any SWE job I've worked over my entire career, nothing was deployed with new versions of dependencies without testing them against a staging environment first.
> Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures.
For the record, the top 25 common weaknesses for 2023 are listed at:
Deserialization of Untrusted Data (CWE-502) was number fifteen. Number one was Out-of-bounds Write (CWE-787), and Use After Free (CWE-416) was number four.
CWEs that have been in every list since they started doing this (2019):
Yup. Almost all of them are various flavors of fucking up a parser or misusing it (in particular, all the injection cases are typically caused by writing stupid code that glues strings together instead of doing proper parsing).
That's not parsing, that's the inverse of parsing. It's taking untrusted data and injecting it into a string that will later be parsed into code without treating the data as untrusted and adapting accordingly. It's compiling, of a sort.
Parsing is the reverse—taking an untrusted string (or binary string) that is meant to be code and converting it into a data structure.
Both are the result of taking untrusted data and assuming it'll look like what you expect, but both are not parsing issues.
> It's taking untrusted data and injecting it into a string that will later be parsed into code without treating the data as untrusted and adapting accordingly.
Which is precisely why parsing should've been used here instead. The correct way to do this is to work at the level after parsing, not before it. "SELECT * FROM foo WHERE bar LIKE ${untrusted input}" is dumb. Parsing the query with a placeholder in it, replacing it as an abstract node in the parsed form with data, and then serializing to string if needed to be sent elsewhere, is the correct way to do it, and is immune to injection attacks.
For SQL we tend to use prepared statements as the answer, which probably do some parsing under the hood but that's not visible to the programmer. I'd raise a lot of questions if I saw someone breaking out a parser to handle a SQL injection risk.
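Concretely, with SQLite's C API (table and column names lifted from the example above, otherwise invented), the untrusted value never touches the SQL text at all:

    #include <sqlite3.h>

    /* The '?' placeholder is part of the parsed statement; the value is bound separately
       and can never be reinterpreted as SQL, no matter what it contains. */
    static int lookup(sqlite3 *db, const char *untrusted_input)
    {
        sqlite3_stmt *stmt = NULL;
        int rc = sqlite3_prepare_v2(db, "SELECT * FROM foo WHERE bar LIKE ?", -1, &stmt, NULL);
        if (rc != SQLITE_OK)
            return rc;

        sqlite3_bind_text(stmt, 1, untrusted_input, -1, SQLITE_TRANSIENT);

        while (sqlite3_step(stmt) == SQLITE_ROW) {
            /* ... read columns with sqlite3_column_*() ... */
        }
        return sqlite3_finalize(stmt);
    }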
That's because prepared statements were developed before the understanding of langsec was mature enough. They provide a very simple API, but it's at (or above) the right level - you just get to use special symbols to mark "this node will be provided separately", and provide it separately, while the API makes sure it's correctly integrated into the whole according to the rules of the language.
(Probably one other factor is that SQL was designed in a peculiar way, for "readability to non-programmers", which tends to result in languages that don't map well to simple data structures. Still, there are tools that let you construct a tree and will generate valid SQL from that.)
HTML is a better example, because it's inherently tree-structured, and trees tend to be convenient to work with in code. There it's more obvious when you're crossing from dumb string to parsed representation, and then back.
The same thing applies to HTML, though: I would shudder if I saw a parser implemented for most HTML injection prevention. The correct answer in almost all cases is to escape the HTML using the language's standard library or the web framework's tooling.
The only situation where a parser makes sense over simple escaping routines is if you actually intended to accept a subset of the language that you're injecting into rather than plain text, in which case you'll need more than just a parser to ensure you don't have anything dangerous—you'd need to do a lot of error-prone analysis of the AST afterward as well.
Or, you just use the DOM API to manipulate the structure. You don't implement a parser because one is already provided by the tooling - you use it to go from known-valid text to a data structure (here, DOM), and do your operations there.
You shouldn't do "escaping" and string concatenation. That's just parsing and unparsing while cutting corners, which is how you get injection bugs.
> The only situation where a parser makes sense over simple escaping routines is if you actually intended to accept a subset of the language that you're injecting into rather than plain text
And that's exactly what you're doing. With escaping, you're taking a serialized form of some data, and splice into it some other data, massaged in a way you hope will make it always parse to string when something parses this later. It's going to eventually bite you; not necessarily with XSS - web template breakage is another common occurrence.
Working in string space is tricky, dangerous, and dumb - parsing, working on the parsed representation, and unparsing at the end, is how you do it correctly and safely.
(Another way to put it: plaintext is a wire format; you don't work in it if the data is structured.)
Note that the API may look like you're doing text - see JSX - but it internally goes through a parsing stage, and makes it impossible for you to do stupid things that break or transform the program, like working in string space lets you.
If you don't want your users to produce HTML, then why would you use the DOM API to parse their text into an HTML data structure? Then you'd have code that's capable of producing <script> tags or who knows what else from untrusted user input and you now have to explicitly filter out tag types. Alternatively you can implement the middle bit of a compiler and map nodes to a new, safe data structure that you spit out at the end, but in the scenario we're discussing the user input was supposed to be unstructured text. HTML content is in most cases a malicious edge case, not expected data.
If you instead escape the user-provided unstructured text by replacing the very well-known set of special characters that could create tags, you know your users cannot produce active code, only text nodes.
It's the principle of least power: if you don't need users to access anything other than unstructured text then why feed their input into a parser that produces a data structure that represents code? Make illegal states unrepresentable by just escaping the text nodes as they're saved!
The problem isn't with what the user can do, but with what your code can. If you bork your escaping, which is context-dependent, then user data can turn into arbitrary HTML, complete with script tags. If you keep an abstract tree representation, and add the user-provided data by passing it verbatim to a "set text content" method on a node, then there's no possible way the user input can break it. That is exactly what it means to make illegal states unrepresentable!
Working on the data structures after parsing makes it impossible to accidentally break the structure itself. Like, maybe your string escaping is perfect, but if you do:
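(hypothetical snippet - sanitizedString is correctly escaped, container and stealCookies are placeholders, and the bug is an unquoted attribute in the template:)

    const html = `<img class=avatar src=${sanitizedString}>`;
    container.innerHTML = html;
    // sanitizedString = "x.png onerror=stealCookies()" injects an attribute
    // without ever needing a character that escaping would touch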
Then you're still vulnerable to trivial errors in your template breaking the structure and creating an exploitable vulnerability, despite the $sanitizedString being correct. If you instead work at parsed level and do:
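(again hypothetical, same placeholder names - the user data goes in via textContent, so it can only ever become a text node:)

    const node = document.createElement('span');
    node.textContent = untrustedUserInput;   // verbatim; the DOM cannot reinterpret this as markup
    container.appendChild(node);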
Then there's just no way this can break (except bugs in the HTML parser and DOM API in general, which are much less likely to exist, and much easier to find and fix).
This is not parsing the user input, this is letting the native API escape the input for you, which is exactly what I'm advocating for. See my note above:
> escape the HTML using the language's standard library or the web framework's tooling.
This is what parsing the user input would look like with the DOM API:
    const newDiv = document.createElement('div');
    newDiv.innerHTML = untrustedUserInput;
    // Do some work to attempt to sanitize the new HTML elements
    document.body.appendChild(newDiv);
To me this is definitively a Bad Idea™, and I thought this was what you were advocating for.
What you actually proposed is just escaping the HTML, not parsing user input, with the only twist being that you prefer to inject user input into your templating system imperatively with something resembling the DOM API instead of declaratively with something resembling JSX. That's fine, but not relevant to the question of what method we use to sanitize the untrusted input that we're injecting. On that front it sounds like we're in agreement that parsing user input is a terrible idea.
Surely many of these originate from deserialization of untrusted data (e.g., trusting a supplied length). It’s probably documented but I’m passively curious how they disambiguate these cases.
That’s entirely my point. If a vulnerability happens due to writing out of bounds during untrusted deserialization, which category would you file it under?
“Deserialization of untrusted data” isn’t even a security bug like an out of bounds write is. Every meaningful program deserializes external input. It’s a common area where bugs occur, but it’s not a type of bug in and of itself. Every bug in that category “belongs” in a more proximate category.
> Approximately 100% of CVEs, crashes, bugs, [...], deserialising binary data
I'd make that 98%. Outside of rounding errors in the margins, the remaining two percent is made up of logic bugs, configuration errors, bad defaults, and outright insecure design choices.
> Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.
I wouldn't blame imperative programming.
Eg Rust is imperative, and pretty good at telling you off when you forgot a case in your switch.
By contrast, the variant of Scheme I used twenty years ago was functional, but didn't have checks for covering all cases. (And Haskell's GHC didn't have that check turned on by default a few years ago. Not sure if they've changed that.)
I can't decide what's more damning. The fact that there was effectively no error/failure handling or this:
> Note "channel updates ...bypassed client's staging controls and was rolled out to everyone regardless"
> A few IT folks who had set the CS policy to ignore latest version confirmed this was, ya, bypassed, as this was "content" update (vs. a version update)
If your content updates can break clients, they should not be able to bypass staging controls or policies.
This is going to be what most customers did not realize. I'm sure Crowdstrike assured them that content updates were completely safe "it's not a change to the software" etc.
The way I understand it, the policies the users can configure are about "agent versions". I don't think there's a setting for "content versions" that you can toggle.
Maybe there isn't a switch that says "content version", but from the end user's perspective it is a new version. Whether it was a content change or just a fix for a typo in documentation (say), the change being pushed is different from what currently exists. And for the end user, the configuration implies that they get a chance to decide whether or not to accept any new change being pushed.
Looking at how this whole thing is pasted together, there's probably a regex engine in one of those sys files somewhere that was doing the "parsing"...
> reach for a parser generator framework and fuzz your program
I agree to the second but disagree on the first. Parser generator frameworks produce a lot of code that is hard to read and understand and they don't necessarily do a better job of error handling than you would. A hand-written recursive descent parser will usually be more legible, will clearly line up with the grammar that you're supposed to be parsing, and will be easier to add better error handling to.
Once you're aware of the risks of a bad parser you're halfway there. Write a parser with proper parsing theory in mind and in a language that forces you to handle all cases. Then fuzz the program, turn bad inputs that turn up into permanent regression tests, and write your own tests with your knowledge of the inner workings of your parser in mind.
This isn't like rolling your own crypto because the alternative isn't a battle-tested open source library, it's a framework that generates a brand new library that only you will use and maintain. If you're going to end up with a bespoke library anyway, you ought to understand it well.
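For what it's worth, "legible recursive descent with explicit error handling" can be tiny. A sketch in C for a toy grammar of digits and '+' (obviously not a real file format, and overflow is ignored for brevity):

    #include <ctype.h>

    /* Grammar: expr := number ('+' number)*  -- every rule reports failure instead of guessing. */
    struct cursor { const char *p; const char *end; };

    static int parse_number(struct cursor *c, long *out)
    {
        if (c->p == c->end || !isdigit((unsigned char)*c->p))
            return 0;                               /* expected a digit, got something else */
        long v = 0;
        while (c->p != c->end && isdigit((unsigned char)*c->p))
            v = v * 10 + (*c->p++ - '0');
        *out = v;
        return 1;
    }

    static int parse_expr(struct cursor *c, long *out)
    {
        long acc, rhs;
        if (!parse_number(c, &acc))
            return 0;
        while (c->p != c->end && *c->p == '+') {
            c->p++;
            if (!parse_number(c, &rhs))
                return 0;                           /* dangling '+': an error, not a shrug */
            acc += rhs;
        }
        *out = acc;
        return 1;
    }

Each function maps onto one rule of the grammar, and the failure paths are right there in front of you - which is exactly what you want when you go to fuzz it.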
This problem has a promising solution, WUFFS, "a memory-safe programming language (and a standard library written in that language) for Wrangling Untrusted File Formats Safely."
No bet. There are two failures here. (1) Failing to check the data for validity, and (2) Failing to handle an error gracefully.
Both of these are undergraduate-level techniques. Heck, they are covered in most first-semester programming courses. Either of these failures is inexcusable in a professional product, much less one that is running with kernel-level privileges.
Bet: CrowdStrike has outsourced much of its development work.
> Either of these failures is inexcusable in a professional product
Don’t we have those kinds of failures in almost every professional product? I’ve been working in the industry for over a decade, and at every single company we had those bugs. The only difference was that none of those companies were developing kernel modules or whatever. Simple SaaS. And no, none of the bugs were outsourced (the companies I worked for hired only locals and people within ±2 hour time zones).
He probably means work was sent offshore to offices with cheaper labor that's less skilled or less invested in delivering quality work. Though there's no proof of that yet; people just like to throw the blame on offshoring whenever $BIG_CORP fucks up, as if all programmers in the US are John Carmack and can never cause catastrophic fuckups with their code or processes.
Not everyone in the US might be Carmack, but it's ridiculously nearsighted to assert that cultural differences don't play into people's desire and ability to Do It Right.
It's not cultural differences that make the difference in output quality, it's pay and the quality standards set by the team/management, which are also mostly a function of pay, since underpaid and unhappy developers tend not to care at all beyond doing the bare minimum to not get fired (#notmyjob, the lying-flat movement, etc.).
You think everyone writing code in the US would give two shits about the quality of their output if they see the CEO pocketing another private jet while they can barely make big-city rent?
Hell, even well paid devs at top companies in the US can be careless and lazy if their company doesn't care about quality. Have you seen some of the vulnerabilities and bugs that make it into the Android source code and on Pixel devices? And guess what, that code was written by well paid developers in the US, hired at Google leetcode standards, yet would give far-east sweatshops a run for their money in terms of carelessness. It's what you get when you have a high barrier of entry but a low barrier of output quality where devs just care about "rest and vest".
I was talking about outsourcing (and not necessarily offshoring). Too many companies like CrowdStrike are run by managers who think that management, sales, and marketing are the important activities. Software development is just an unpleasant expense that needs to be minimized. Hence: outsourcing.
That said, I have had some experience with classic offshoring. Cultural differences make a huge difference!
My experience with "typical" programmers from India, China, et al is that they do exactly what they are told. Their boss makes the design decisions down to the last detail, and the "programmers" are little more than typists. I specifically remember one sweatshop where the boss looped continually among the desks, giving each person very specific instructions of what they were to do next. The individual programmers implemented his instructions literally, with zero thought and zero knowledge of the big picture.
Even if the boss was good enough to actually keep the big picture of a dozen simultaneous activities in his head, his non-thinking minions certainly made mistakes. I have no idea how this all got integrated and tested, and I probably don't want to know.
>That said, I have had some experience with classic offshoring. Cultural differences make a huge difference!
Sure but there's no proof yet that was the case here. That's just masive speculations based on anecdotes on your side. There's plenty of offshore devs that can run rings around western devs.
Staff trained at outsourcers have a different type of focus. My experience is more operational, and usually the training for those guys is about restoration to hit SLA, period. Makes root cause harder to ID sometimes.
It doesn’t mean ‘Murica better, just that the origin story of staff matter, especially if you don’t have good processes around things like rca.
Western slacker movements never came close to deadma or the dedicated indifference in the face of samsara. You seem to have a lot of experience with the former and little of the latter two, but what do I know.
Offshoring and outsourcing are very different things. It would also be very hard to talk about offshoring at a company claiming to provide services in 170 countries.
It's probably just the common US-centric bias that external development teams, particularly those overseas, may deliver subpar software quality. This notion is often veiled under seemingly intellectual critiques to avoid overt xenophobic rhetoric like "They're taking our jobs!".
Alternatively, there might be a general assumption that lower development costs equate to inferior quality, which is a flawed yet prevalent human bias.
>Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.
People are target fixating too much. Sure, this parser crashed and caused the system to go down. But in an alternative universe they push a definition file that rejects every openat() or connect() syscall. Your system is now equally as dead, except it probably won't even have the grace to restart.
The whole concept of "we fuck with the system in the kernel based on data downloaded from the internet" is just not very sound or safe.
So, I also have near-zero cybersecurity expertise (I took an online intro course on cryptography out of curiosity) and no expertise in writing kernel modules actually, but why, if ever, would you parse an array of pointers...in a file...instead of any other way of serializing data that doesn't include hardcoded array offsets in an on-disk file...
Even ignoring this particular failure, catastrophic as it was, this was a bad design asking to be exploited by criminals.
Performance, I assume. Right now it may look like the wrong tradeoff, but every day in between incidents like this we're instead complaining that software is slow.
Of course it doesn't have to be either/or; you can have fast + secure, but it costs a lot more to design, develop, maintain and validate. What you can't have is a "why don't they just" simple and obvious solution that makes it cheap without making it either less secure, less performant, or both.
Given all the other mishaps in this story, it is very well possible that the software is insecure (we know that), slow, and also still very expensive. There's a limit to how high you can push the triangle, but there's no bottom to how bad it can get.
I'm curious, how else would you store direct memory offsets? No matter how you store/transmit them, eventually you're going to need those same offsets.
The problem wasn't storing raw memory offsets, it was not having some way to validate the data at runtime.
In this case, the direct memory addresses are literally needed.
The addresses aren't being generated internal to the program, so there are no "handles". They are referencing external data by design.
That's like saying "you shouldn't use a hard-coded volatile pointer to reference a hardware device". No, you literally need to do that sometimes; especially in embedded software.
> I'm happy to up the ante by £50 to account for my second theory
What's that, three pints in a pub inside the M25? :P
Completely agree with this sentiment though, we've known that handling of binary data in memory unsafe languages has been risky for yonks. At the very least, fuzzing should've been employed here to try and detect these sorts of issues. More fundamentally though, where was their QA? These "channel files" just went out of the door without any idea as to their validity? Was there no continuous integration check to just .. ensure they parsed with the same parser as was deployed to the endpoints? And why were the channel files not deployed gradually?
FWIW, before someone brings up JSON, GP's bet only makes sense when "binary" includes parsing text as well. In fact, most notorious software bugs are related to misuse of textual formats like SQL or JS.
"human programmers forget to account for edge cases"
Which is precisely the rationale which led to Standard Operating Procedures and Best Practices (much like any other Sector of business has developed).
I submit to you, respectfully, that a corporation shall never rise to a $75 Billion Market Cap without a bullet-proof adherence to such, and thus, this "event" should be properly characterized and viewed as a very suspicious anomaly, at the least
> combination of said bad binary data and a poorly-written parser that didn't error out correctly upon reading invalid data
By now, if you write any parser that deals with any outside data and don't fuzz the heck out of it, you are willfully negligent. Fuzzers are pretty easy to use, automatic and would likely catch any such problem pretty soon. So, did they fuzz and got very very unlucky or do they just like to live dangerously?
More or less. Binary parsers are the easiest place to find exploits because of how hard it is to do correctly. Bounds checks, overflow checks, pointer checks, etc. Especially when the data format is complicated.
Yeah, even if you are only parsing "safe" inputs such as ones you created yourself. Other bugs and sometimes even truly random events can corrupt data.
Hmmm. Most common problems these days are certificate related I would have thought. Binary data transfers are pretty rare in an age of base64 json bloat
There are plenty of binary serialisation protocols out there, many proprietary - maybe you’ll stuff that base64’d in a json container for transit, but you’re still dealing with a binary decoder.
Bypassing the discussion of whether one actually needs rootkit-powered endpoint surveillance software such as CS, perhaps an open-source solution would be a killer way to move this whole sector to more ethical standards. So the main tool would be open source, it would be transparent what exactly it does and that it is free of backdoors or really bad bugs, and it could be audited by the public. On the other hand, it could still be a business model to supply malware signatures as a security team feeding this system.
I'd say no. Kolide is one such attempt, and their practices, and how it's used in companies, are as insidious as those from a proprietary product. As a user, it gives me no assurance that an open source surveillance rootkit is better tested and developed, or that it has my best interests in mind.
The problem is the entire category of surveillance software. It should not exist. Companies that use it don't understand security, and don't trust their employees. They're not good places to work at.
"And aren’t they kinda right to not trust their employees if they employ 50,000 people with different skills and intentions?"
Yes, in a 50k employee company, the CEO won't know every single employee and be able to vouch for their skills and intentions.
But in a non-dysfunctional company, you have a hierarchy of trust, where each management level knows and trusts the people above and below them. You also have siloed data, where people have access to the specific things they need to do their jobs. And you have disaster mitigation mechanisms for when things go wrong.
Having worked in companies of different sizes and with different trust cultures, I do think that problems start to arise when you add things like individual monitoring and control. You're basically telling people that you don't trust them, which makes them see their employer in an adversarial role, which actually makes them start to behave less trustworthy, which further diminishes trust across the company, harms collaboration, and eventually harms productivity and security.
Setting aside the possibility of deploying an EDR like Crowdstrike just being a box ticking exercise for compliance or insurance purposes, can something like an EDR be used not because of a lack of trust but a desire to protect the environment?
A user doesn’t have to do anything wrong for the computer to become compromised, or even if they do, being able to limit the blast radius and lock down the computer or at least after the fact have collected the data to be able to identify what went wrong seems important.
How would you secure a network of computers without an agent that can do anti-virus, detect anomalies, and remediate them? That is to say, how would you manage to secure it without doing something that has monitoring and lockdown capabilities? In your words, signaling that you do not trust the users?
This. From all the comments I've seen in the multiple posts and threads about the incident, this simple fact seems to be the least discussed. How else do you protect a complex IT environment with thousands of assets, in the form of servers and workstations, without some kind of endpoint protection? Sure, solutions like CrowdStrike et al are box-checking and risk-transferring exercises in one sense, but they actually work as intended when it comes to protecting endpoints from novel malware and TTPs. As long as they don't botch their own software, that is :D
> How else to protect a complex IT environment with thousands of assets in form of servers and workstations, without some kind of endpoint protection?
There is no straightforward answer to this question. Assuming that your infrastructure is "secure" because you deployed an EDR solution is wrong. It only gives you a false sense of security.
The reality is that security takes a lot of effort from everyone involved, and it starts by educating people. There is no quick bandaid solution to these problems, and, as with anything in IT, any approach has tradeoffs. In this case, and particularly after the recent events, it's evident that an EDR system is as much of a liability as it is an asset—perhaps even more so. You give away control of your systems to a 3rd party, and expect them to work flawlessly 100% of the time. The alarming thing is how much this particular vendor was trusted with critical parts of our civil infrastructure. It not only exposes us to operational failures due to negligence, but to attacks from actors who will seek to exploit that 3rd party.
I totally agree. In my current work environment, we do deploy EDR but it is primarily for assets critical for delivering our main service to customers. Ironically, this incident caused them all to be unavailable and there is for sure a lesson to be learned here!
It is not considered a silver bullet by the security team, rather a last-resort detection mechanism for suspicious behavior (for example if the network segmentation or access control fails, or someone manages to get a foothold by other means). It also helps them identify which employees need more training, as they keep downloading random executables from the web.
Absolutely, training is key. Alas, managers don't seem to want their employees spending time on anything other than delivering profit and so the training courses are zipped through just to mark them as completed.
Personally, I don't know how to solve that problem.
It is a good question. Is there a possibility of fundamentally fixing software/hardware to eliminate the vectors that malware exploits to gain a foothold at all? E.g. not storing the return address on the stack, or not letting it be manipulated by the callee? Memory-bounds enforcement, either statically at compile time or with the help of hardware, to prevent writing past memory that isn't yours? (Not asking about the feasibility of coexisting with or migrating from the current world, just about the possibility of fundamentally solving this at all...)
Economic drivers spring to mind, possibly connected with civil or criminal liability in some cases.
But this will be the work of at least two human generations; our tools and work practices are woefully inadequate, so even if the pointy-haired bosses come to fear imprisonment for gratuitous failure, and the grasping, greedy investors come to fear the destruction of their "hard earned" capital, it's not going to be done in the snap of our fingers, not least because the people occupying the technology industry - and this is an overgeneralisation, but I'm pretty angry so I'm going to let it stand - Just Don't Care Enough.
If we cared, it would be nigh on impossible for my granny to get tricked into popping her Windows desktop by opening an attachment in her email client.
It wouldn't be possible to sell (or buy!) cloud services for which we don't get security data in real time, and a signal about what our vendor advises us to do if worst comes to worst.
Yet we don't apply total surveillance to people. The reason isn't just ethics and the US constitution, but also that it's just not possible without destroying society. Perhaps the same applies to computer systems.
I think it doesn't. I think that the kind of security the likes of CrowdStrike promise is fundamentally impossible to have, and pursuing it is a fool's errand.
I disagree. You seem to start from a premise that all people are honest, except those that aren't, and that you never work with or meet dishonest people unless the employer sets itself up in an adversarial role?
As the other reply to your comment said: the world is not 'fair' or 'honest', that's just a lie told to children. Apart from genuinely evil people, there are unlimited variables that dictate people's behavior. Culture, personality, nutrition, financial situation, mood, stress, bully coworkers, intrinsic values, etc etc. To think people are all fair and honest "unless" is a really harmful worldview to have, and in my opinion the reason a lot of bad things are allowed to happen and continue (throughout all of society, not just work).
Zero-trust in IT is just the digitized version of "trust is earned". In computers you can be more crude and direct about it, but it should be the same for social connections and interactions.
> You seem to start from a premise that all people are honest
You have to start with that premise otherwise organizations and society fail. Every hour of every day, even people in high security organizations have opportunities to betray the trust bestowed on them. Software and processes are about keeping honest people honest. The dishonest ones you cannot do much about, beyond hoping you limit the damage they can cause.
If everyone is treated as dishonest then there will eventually be an organizational breakdown. Creativity, high productivity, etc... do not work in a low/zero trust environment.
That’s a lie we tell children so they think the world is fair.
A Marxist reading would suggest alienation, but a more modern one would realize that it is a bit more than that: to enable modern business practices (both good and bad!) we designed systems of management to remove or reduce trust and accountability in the org, yet maintain results similar to those of a world more in line with the one you believe is possible.
A security professional though would tell you that even in such a world, you can not expect even the most diligent folks to be able to identify all risks (e.g. phishing became so good, even professionals can’t always discern the real from fake), or practice perfect opsec (which probably requires one to be a psychopath).
Security is a process not a product. Anyone selling you security as a product is scamming you.
These endpoint security companies latch onto people making decisions, those people want security and these software vendors promise to make the process as easy as possible. No need to change the way a company operates, just buy our stuff and you're good. That's the scam.
Truthfully, it must be practically infeasible to transform security practices of a large company overnight. Most of the time they buy into these products because they're chasing a security certification (ISO 27001, SOC2, etc.), and by just deploying this to their entire fleet they get to sidestep the actually difficult part.
The irony is that at the end of this they're not any more "secure" than they were before, but since they have the certification, their customers trust that they are. It's security theater 101.
Whether you morally agree with surveillance software's purpose is not the same question as whether a particular piece of surveillance software works well or not.
I would imagine an open source version of crowdstrike would not have had such a bad outcome.
I disagree with the concept of surveillance altogether. Computer users should be educated about security, given control of their devices, and trusted that they will do the right thing. If a company can't do that, that's a sign that they don't have good security practices to begin with, and don't do a good job at hiring and training.
The only reason this kind of software is used is so that companies can tick a certification checkbox that gives the appearance of running a tight ship.
I realize it's the easy way out, and possibly the only practical solution for a large corporation, but then this type of issue is unavoidable. Whether the product is free or proprietary makes no difference.
Most people do not understand, or care to understand, what "security" means.
You highlight training as a control. Training is expensive - to reduce cost and enhance effectiveness, how do you focus training on those that need it without any method of identifying those who do things in insecure ways?
Additionally, I would say a major function of these systems is not surveillance at all - they are preventive controls meant to stop compromise of your systems.
Overall, your comment strikes me as naive and not based on operational experience.
This type of software is notorious for severely degrading employees' ability to do their jobs, occasionally preventing it entirely. It's a main reason why "shadow IT" is a thing - bullshit IT restrictions and endpoint security malware can't reach third-party SaaS' servers.
This is to say, there are costs and threats caused by deploying these systems too, and they should be considered when making security decisions.
Explain exactly how any AV prevents a user from checking e-mails and opening Word?
In the years I spent doing IT at that level, every time, every single time I got a request for admin privileges to be granted to a user or for software to be installed on an endpoint, we already had a solution in place for exactly what the user wanted, installed and tested on their workstation, covered in onboarding, and they had simply "forgotten".
Just like the users whose passwords I had to reset every Monday because they forgot them. It's an irritation, but that doesn't mean they didn't do their jobs well. They met all performance expectations; they just needed to be hand-held with technology.
The real world isn't black and white and this isn't Reddit.
> Explain exactly how any AV prevents a user from checking e-mails and opening Word?
For example by doing continuous scans that consume so much CPU the machine stays thermally throttled at all times.
(Yes, really. I've seen a colleague raising a ticket about AV making it near-impossible to do dev work, to which IT replied the company will reimburse them for a cooling pad for the laptop, and closed the issue as solved.)
The problem is so bad that Microsoft, despite Defender being by far the lightest and least bullshit AV solution, created "dev drive", a designated drive that's excluded by design from Defender scanning, as a blatant workaround for corporate policies preventing users and admins from setting custom Defender exclusions. Before that, your only alternative was to run WSL2 or a regular VM, which are opaque to AVs, but that tends to be restricted by corporate too, because "sekhurity".
And yes, people in these situations invent workarounds, such as VMs, unauthorized third-party SaaS, or using personal devices, because at the end of the day, the work still needs to be done. So all those security measures do is reduce actual security.
Most AV and EDR solutions support exceptions, either on specific assets or fleets of assets. You can make exceptions for some employees (for example developers or IT) while keeping (sane) defaults for everybody else. Exceptions are usually applied on file paths, executable image names, file hashes, signature certificates or the complete asset. It sounds like people are applying these solutions wrong, which of course has a negative outcome for everybody and builds distrust.
In theory, those solutions could be used right. In practice, they never are.
People making decisions about purchasing, deploying and configuring those systems are separated by many layers from rank-and-file employees. The impact on business downstream is diffuse and doesn't affect them directly, while the direct incentives they have are not aligned with the overall business operations. The top doesn't feel the damage this is doing, and the bottom has no way of communicating it in a way that will be heard.
It does build distrust, but not necessarily in the sense that "company thinks I'm a potential criminal" - rather, just the mundane expectation that work will continue to get more difficult to perform with every new announcement from the security team.
I'm going to just echo my sibling comment here. This seems like a management issue. If IT wouldn't help, it was up to your management to intervene and say that it needed to be addressed.
Also, I'm unsure I've ever seen an AV even come close to stressing a machine I would spec for dev work. Likely it was misconfigured for the use case, but I've been there and definitely understand the other side of the coin. Sometimes a beer or pizza with someone high up at IT gets you much further than barking. We all live in a society with other people.
I would also hazard a guess that the dev drive is more a matter of just making it easier for IT to do the right thing, requested by IT departments more than likely. I personally have my entire dev tree excluded from AV, purely because of false positives on binaries and just unnecessary scans because the files change content so regularly. That can be annoying to do with group policy if where that data is stored isn't mandated, and then you have engineers who would be babies about "I really want my data in %USERPROFILE%/documents instead of %USERPROFILE%/source". Now IT can much more easily just say that the Microsoft-blessed solution is X and you need to use it.
Regarding WSL, if it's needed for your job then go for it and have your manager put in a request. However, if you are only doing it to circumvent IT restrictions, well, don't expect anyone to play nice.
On the personal devices note: if there's company data on your device, it and all its contents can be subpoenaed in a court case. You really want that? Keep work and personal separate, it really is better for all parties involved.
> sometimes a beer or pizza with someone high up at IT gets you much further than barking. We all live in a society with other people.
That's true, but it gets tricky in a large multinational, when the rules are set by some team in a different country, whose responsibilities are to the corporate HQ, and the IT department of the merged-in company I worked for has zero authority on the issue. I tried, I've also sent tickets up the chain, they all got politely ignored.
From the POV of all the regular employees, it looks like this: there are some annoying restrictions here and there, and you learn how to navigate the CPU-eating AV scans; you adapt and learn how to do your work. Then one day, some sneaky group policy update kills one of your workarounds, and you notice this by observing that compilation takes 5x as long as it used to, and git operations take 20x as long as they should. You find a way to deal (goodbye small commits). Then one day, you get an e-mail from corporate IT saying that they just partnered with ESET or CrowdStrike or ZScaler or whatnot, and they'll be deploying the new software to everyone. Then they do, and everything goes to shit, and you need to triple every estimate from now on, as the new software noticeably slows down everything across the board. You think to yourself, at least corporate gave you top-of-the-line laptops with powerful CPUs and an absurd amount of RAM; too bad for sales and managers who are likely using much weaker machines. And then you realize that sales and management were doing half their work in random third-party SaaS, and there is an ongoing process to reluctantly in-house some of the shadow IT that's been going on.
Fortunately for me, in my various corporate jobs, I've always managed to cope by using Ubuntu VMs or (later) WSL2, and that this always managed to stay "in the clear" with company security rules. Even if it meant I had to figure out some nasty hacks to operate Windows compilers from inside Linux, or to stop the newest and bestest corporate VPN from blackholing all network traffic to/from WSL2 (was worth it, at least my work wasn't disrupted by the Docker Desktop licensing fiasco...). I never had to use personal devices, and I learned long ago to keep firm separation between private and work hardware, but for many people, this is a fuzzy boundary.
There was one job where corporate installed a blatant keylogger on everyone's machines, and for a while, with our office IT's and our manager's blessing, our team managed to stave it off - and keep local admin rights - by conveniently forgetting to sign the relevant consent forms. The bad taste this left was a major factor in me quitting that job a few months later, though.
Anyway, the point of these stories is, I've experienced first-hand how security in medium and large enterprises impacts day-to-day work. I fought both alongside and against IT departments over these issues. I know that most of the time, from the corporate HQ's perspective, it's difficult to quantify the impact of various security practices on everyone's day-to-day work (and I briefly worked in cybersecurity, so I also know it isn't even obvious to those people that this should be considered!). I also know that large organizations can eat a lot of inefficiency without noticing it, because at that size, they have huge inertia. Corporate may not notice the work slowing down 2x across the board, when it's still completing million-dollar contracts on time (negotiated accordingly). It just really sucks to work in this environment; the inefficiency has a way of touching your soul.
EDIT:
The worst is the learned helplessness. One day, you get fed up with Git taking 2+ minutes to make a goddamn commit, and you whine a bit on the team channel. You hope someone will point out you're just stupid and holding it wrong, but no - you get a couple of people saying "yeah, that's how it is", and one saying "yeah, I tried to get IT to fix that; they told me a cooling stand for the laptop should speed things up a bit". You eventually learn that security people just don't care, or can't care, and you can only try to survive it.
(And then you go through several mandatory cybersecurity trainings, and then you discover a dumb SQL injection bug in a new flagship project after 2 hours of playing with it, and start questioning your own sanity.)
Look I'm not disagreeing with you that it sucks. I just know I've been on the other side of the fence and people like to throw shade at IT when they themselves are just trying to do their jobs.
And let's see if we can agree that corporate multinationals are probably a bad thing, or at least micromanaging from the stratosphere when you cannot see how your decision affects things. That however is likely a management antipattern, and if it is really negatively affecting your mental health but you are still meeting performance expectations, I'm not against you making a decision to walk.
Sometimes the only way to solve those problems is to cause turnover and make management look twice, and a lot of the time one key person leaving can cause an exodus that will force change.
Not being negative here, sometimes you are just in a toxic relationship and need to get out.
I don't have first-hand experience with Kolide, as I refused to install it when it was pushed upon everyone in a company I worked for.
Complaints voiced by others included false positives (flagging something as a threat when it wasn't, or alerting that a system wasn't in place when it was), being too intrusive and affecting their workflow, and privacy concerns (reading and reporting all files, web browsing history, etc.). There were others I'm not remembering, as I mostly tried to stay away from the discussion, but it was generally disliked by the (mostly technical) workforce. Everyone just accepted it as the company deemed it necessary to secure some enterprise customers.
Also, Kolide's whole spiel about "honest security"[1] reeks of PR mumbo jumbo whose only purpose is to distance themselves from other "bad" solutions in the same space, when in reality they're not much different. It's built by Facebook alumni, after all, and relies on FB software (osquery).
I think some of the information here is misleading and a bit unfair.
> being too intrusive and affecting their workflow
Kolide is a reporting tool, it doesn't for example remove files or put them in quarantine. You also cannot execute commands remotely like in Crowdstrike. As you mentioned, it's based on osquery which makes it possible to query machine information using SQL. Usually, Kolide is configured to send a Slack message or email if there is a finding, which I guess can be seen as intrusive but IMO not very.
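To make the osquery model concrete: machine state is exposed as SQL tables, and an agent schedules queries against them. Here is a rough, hypothetical sketch that just shells out to the osqueryi CLI (assuming it's installed); the queries are arbitrary examples, not anything Kolide actually ships:

```python
import json
import subprocess

def run_osquery(sql: str):
    # osqueryi evaluates a one-off query; --json asks for machine-readable
    # output instead of the default table view.
    out = subprocess.run(
        ["osqueryi", "--json", sql],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

# The kind of posture check a Kolide-style agent might run on a schedule:
print(run_osquery("SELECT name, version FROM os_version;"))
print(run_osquery("SELECT name, pid FROM processes LIMIT 5;"))
```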
> reading and reporting all files
It does not read and report all files as far as I know, but I think it's possible to make SQL queries to read specific files. But all files or file names aren't stored in Kolide or anything like that. And that live query feature is audited (end users can see all queries run against their machines) and can be disabled by administrators.
> web browsing history
This is not directly possible as far as I know - maybe via a file read query - but it's not something built in out of the box by default. And again, custom queries are transparent to users and can be disabled.
> Kolide's whole spiel about "honest security"[1] reeks of PR mumbo jumbo whose only purpose is to distance themselves from other "bad" solutions in the same space
While it's definitely a PR thing, they might still believe in it and practice what they preach. To me it sounds like a good thing to differentiate oneself from bad actors.
Kolide gives users full transparency of what data is collected via their Privacy Center, and they allow end users to make decisions about what to do about findings (if anything) rather than enforcing them.
> It's built by Facebook alumni, after all, and relies on FB software (osquery).
For example, React and Semgrep are also built by Facebook/Facebook alumni, but I don't really see the relevance other than some ad hominem.
Full disclosure: No association with Kolide, just a happy user.
I concede that I may be unreasonably biased against Kolide because of the type of software it is, but I think you're minimizing some of these issues. My memory may be vague on the specifics, but there were certainly many complaints in the areas I mentioned in the company I worked at.
That said, since Kolide/osquery is a very flexible product, the complaints might not have been directed at the product itself, but at how it was configured by the security department as well. There are definitely some growing pains until the company finds the right balance of features that everyone finds acceptable.
Re: intrusiveness, it doesn't matter that Kolide is a report-only tool. Although it's also possible to install extensions[1,2] that give it deeper control over the system.
The problem is that the policies it enforces can negatively affect people's workflow. For example, forcing screen locking after a short period of inactivity has dubious security benefits if I'm working from a trusted environment like my home, yet it's highly disruptive. (No, the solution is not to track my location, or give me a setting I have to manage...) Forcing automatic system updates is also disruptive, since I want to update and reboot at my own schedule. Things like this add up, and the combination of all of them is equivalent to working in a babyproofed environment where I'm constantly monitored and nagged about issues that don't take any nuance into account, and at the end of the day do not improve security in the slightest.
Re: web browsing history, I do remember one engineer looking into this and noticing that Kolide read their browser's profile files, and coming up with a way to read the contents of the history data in SQLite files. But I am very vague on the details, so I won't claim that this is something that Kolide enables by default. osquery developers are clearly against this kind of use case[3]. It is concerning that the product can, in theory, be exploited to do this. It's also technically possible to pull any file from endpoints[4], so even if this is not directly possible, it could easily be done outside of Kolide/osquery itself.
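For what it's worth, the underlying worry is easy to demonstrate with nothing more than a file-read primitive, because browser history is just a SQLite database on disk. A hypothetical sketch (Chrome on Linux; the path and schema are assumptions that vary by browser, platform, and version, and this is not something Kolide ships):

```python
import sqlite3
from pathlib import Path

# Assumed location of Chrome's history database on Linux.
history_db = Path.home() / ".config/google-chrome/Default/History"

# Open read-only so we don't interfere with a running browser.
con = sqlite3.connect(f"file:{history_db}?mode=ro&immutable=1", uri=True)
for url, title in con.execute("SELECT url, title FROM urls LIMIT 5"):
    print(url, title)
```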
> Kolide gives users full transparency of what data is collected via their Privacy Center
Honestly, why should I trust what that says? Facebook and Google also have privacy policies, yet have been caught violating their users' privacy numerous times. Trust is earned, not assumed based on "trust me, bro" statements.
> For example, React and Semgrep are also built by Facebook/Facebook alumni, but I don't really see the relevance other than some ad hominem.
Facebook has historically abused their users' privacy, and even has a Wikipedia article about it.[5] In the context of an EDR system, ensuring trust from users and handling their data with the utmost care w.r.t. their privacy are paramount. Actually, it's a bit silly that Kolide/osquery is so vocal in favor of preserving user privacy, when this goes against working with employer-owned devices where employee privacy is definitely not expected. In any case, the fact that this product is made by people who worked at a company built by exploiting its users is very relevant considering the type of software it is. React and Semgrep have an entirely different purpose.
> For example, forcing screen locking after a short period of inactivity has dubious security benefits if I'm working from a trusted environment like my home, yet it's highly disruptive.
There is a better alternative too. Make it fair game for coworkers to send an invitation for a beer, from the forgetful worker's machine, to the whole company/department. It works wonders.
If your company is large enough, you can't really trust your employees. Do you really think Google can trust that not a single one of its employees will do something stupid, or even be actively malicious?
Limit their abilities using OS features? Have the vendor fix security issues rather than a third party incompetently slapping on a band-aid?
It's like you let one company build your office building and then bring in another contractor to randomly add walls and have others removed while having never looked at the blueprints and then one day "whoopsie, that was a supporting wall I guess".
Why is it not just completely normal but even expected that an OS vendor can't build an OS properly, or that the admins can't properly configure it, but instead you need to install a bunch of crap that fucks around with OS internals in batshit crazy ways? I guess because it has a nice dashboard somewhere that says "you're protected". Checkbox software.
The sensor basically monitors everything that's happening on the system and then uses heuristics and knowledge of known attack vectors and behaviour to, for example, lock compromised systems down. Take a fileless malware that connects to a C&C and then begins to upload all local documents and stored passwords, then slowly enumerates every service the employee has access to for vulnerabilities.
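Purely to illustrate the idea (a real sensor hooks kernel events rather than polling), a toy user-mode version of that kind of behavioural rule might look like the sketch below, using the third-party psutil library and made-up thresholds:

```python
import psutil

OPEN_FILE_THRESHOLD = 200  # arbitrary: "reading an unusual number of documents"

def flag_suspicious():
    flagged = []
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            open_files = proc.open_files()
            conns = proc.connections(kind="inet")
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
        # Crude rule: lots of open files plus at least one remote connection.
        has_remote_peer = any(c.raddr for c in conns)
        if len(open_files) > OPEN_FILE_THRESHOLD and has_remote_peer:
            flagged.append((proc.info["pid"], proc.info["name"]))
    return flagged

if __name__ == "__main__":
    for pid, name in flag_suspicious():
        print(f"possible exfiltration behaviour: pid={pid} name={name}")
```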
If you manage a fleet of tens of thousands of systems and you need to protect against well funded organized crime? Employees running malicious code under their user is a given and can't be prevented. Buying crowdstrike sensor doesn't seem like such a bad idea to me. What would you do instead?
As said, limit the user's abilities as much as possible with features of the OS and software in use. Maybe, if you want those other metrics, use a firewall - but not a TLS-breaking virus-scanning abomination that has all the same problems, rather a simple one that can warn you about unusual traffic patterns. If someone from accounting starts uploading a lot of data, or connects to Google Cloud when you don't use any of their products, that should stand out as odd.
If we're talking about organized crime, I'm not convinced CrowdStrike in particular doesn't actually enlarge the attack surface. So what did we have as the cause: a malformed binary ruleset that the parser, running with kernel privileges, choked on, crashing the system. Because of course the parsing needs to happen in kernel space and not in a sandboxed process. That's enough for me to make assumptions about the quality of the rest of the software, and to answer the question regarding attack surface.
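To spell out what "handle a malformed ruleset gracefully" means in practice, here's a minimal sketch (the record format is invented; it's the pattern that matters): validate every length against the buffer before touching the data, and reject the file instead of crashing.

```python
import struct

class MalformedContent(ValueError):
    """Raised when a content file fails validation."""

def parse_records(blob: bytes):
    records = []
    offset = 0
    while offset < len(blob):
        if offset + 4 > len(blob):
            raise MalformedContent("truncated length field")
        (length,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        # Bounds check before trusting the length read from the file.
        if length == 0 or offset + length > len(blob):
            raise MalformedContent("record length out of bounds")
        records.append(blob[offset:offset + length])
        offset += length
    return records

try:
    parse_records(b"\xff\xff\xff\xff")  # bogus length, like a corrupt channel file
except MalformedContent as err:
    print("rejected content update:", err)  # quarantine and report; don't bluescreen
```

A driver could follow the same principle, but as noted above, the saner place to do this parsing is an unprivileged, sandboxed process that hands only validated structures to the kernel.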
Before this incident nobody ever really looked at this product at all from a security standpoint, maybe because it is (supposed to be) a security product and thus cannot have any flaws. But it seems security researchers all over the planet are now looking at this thing and having a field day.
Bill Gates sent that infamous email in the early 2000s, I think after Code Red and Nimda hit the world, saying that security should be made the number-one priority for Windows. As much as I dislike Windows for various reasons, I think overall Microsoft does a rather good job on this front. Maybe it's time the companies behind these security products started taking security seriously too?
> Before this incident nobody ever really looked at this product at all from a security standpoint
If you only knew how absurd of a statement that is. But in any case, there are just too many threats network IDS/IPS solutions won't help you with; any decent C2 will make it trivial to circumvent them. You can't limit the permissions of your employees to the point of being effective against such attacks while still letting them do their jobs.
> If you only knew how absurd of a statement that is.
You don't seem to know either, since you don't elaborate on this. As said, people are picking this apart on Twitter and Mastodon right now. Give it a week or two and I bet we'll see a couple of CVEs from this.
For the rest of your post you seem to ignore the argument regarding attack surface, as well as the fact that there are companies not using this kind of software and apparently doing fine. But I guess we can just claim they are fully infiltrated and just don't know because they don't use crowdstrike. Are you working for crowdstrike by any chance?
But sure, at the end of the day you're just gonna weigh the damage this outage did to your bottom line and the frequency you expect this to happen with, against a potential hack - however you even come up with the numbers here, maybe crowdstrike salespeople will help you out - and maybe tell yourself it's still worth it.
In a sense the secure platform already exists. You use web apps as much as possible. You store data in cloud storage. You restrict local file access and execute permissions. Authenticate using passkeys.
The trouble is that people still need local file access, and use network file shares. You have hundreds of apps used by a handful of users that need to run locally. And a few intranet apps that are mission critical and have dubious security. That creates the necessity for wrapping users in firewalls, vpns, tls interception, end point security etc. And the less well it all works the more you need to fill the gaps.
Next you'll be saying "I dont need an immune system..."
Fun fact: an attacker only needs to steal credentials from the home directory to jump into a company's AWS account where all the juicy customer data lives, so there are reasons we want this control.
Frankly I'd like to see the smart people complaining help write better solutions rather than hinder.
There are lots of variants of this. Wazuh, Velociraptor, etc. They have several problems. One is that user-mode EDR is just not very efficient and effective, and kernel mode requires Microsoft driver signing. There are some hoops for that, and I don't know how hard they are, but I don't know of any of these products that seems to be jumping through them.
The other issue is that detection engineering is really expensive, so the detections that are included with CrowdStrike out of the box are your problem if you're using a free product. From a cost perspective you're not getting off a lot cheaper and trying to sell open source and a detection engineer's salary to a CISO who can just buy CrowdStrike instead is understandably a pretty tough sell. Or it was until this weekend, anyway.
It sounds really interesting. But the one thing it does not do is scan for viruses/malware, although this could be implemented using GRR, I guess. How does Google mitigate malware threats in-house?
> By-passing the discussion whether one actually needs root kit powered endpoint surveillance software such as CS perhaps an open-source solution would be a killer to move this whole sector to more ethical standards.
As a red teamer developing malware for my team to evade EDR solutions we come across, I can tell you that EDR systems are essential. The phrase "root kit powered endpoint surveillance" is a mischaracterization, often fueled by misconceptions from the gaming community. These tools provide essential protection against sophisticated threats, and they catch them. Without them, my job would be 90% easier when doing a test where Windows boxes are included.
> So the main tool would be open source and it would be transparent what it does exactly and that it is free of backdoors or really bad bugs.
Open-source EDR solutions, like OpenEDR [1], exist but are outdated and offer poor telemetry. Assembling various GitHub POCs that exist for production EDR is impractical and insecure.
The EDR sensor itself becomes the targeted thing. As a threat actor, the EDR is the only thing in your way most of the time. Open sourcing them increases the risk of attackers contributing malicious code to slow down development or introduce vulnerabilities. It becomes a nightmare for development, as you can't be sure who is on the other side of the pull request. TAs will do everything to slow down the development of a security sensor. It is a very adversarial atmosphere.
> On the other hand it could still be a business model to supply malware signatures as a security team feeding this system.
It is actually the other way around. Open-source malware heuristic rules do exist, such as Elastic Security's detection rules [2]. Elastic also provides EDR solutions that include kernel drivers and is, in my experience, the harder one to bypass. Again, please make an EDR without drivers for Windows, it makes my job easier.
> It could be audited by the public.
The EDR sensors already do get "audited" by security researchers and the threat actors themselves. Reverse engineering and debugging the EDR sensors to spot weaknesses that can be "abused." If I spot things like the EDR just plainly accepting kernel mode shellcode and executing it, I will, of course, publicly disclose that. EDR sensors are under a lot of scrutiny.
> Open sourcing them increases the risk of attackers contributing malicious code to slow down development or introduce vulnerabilities.
This is such a tired non-sequitur argument; there is no evidence whatsoever to back up the claim that the risk is actually higher for open source versus closed source.
I can just as easily argue that a state or non-state actor could buy[1], bribe or simply threaten someone to get weak code into a proprietary system, without users having any means to ever find out. On the other hand, it is always easier (easier, not easy) to discover compromise in open source, like it happened with xz[2], and to verify such reports independently.
If there is no proof that compromise is less likely with closed source, and it is far easier to discover it in open source, the logical conclusion is simply that open source is better for security libraries.
Funding open-source defensive security infrastructure, freely available for everyone to use, with even 1/100th of the NSA's budget (which is effectively only offensive) would improve info-security enormously for everyone - not just against nation-state actors, but also against scammers etc. Instead we get companies like CS that have an enormous vested interest in making sure that never happens, and that try to scare the rest of us into believing open source is bad for security.
I could see an open source solution with "private" or vendor specific definition files. But I think I'd disagree with the statement that open sourcing everything wouldn't cause any problem. Engineering isn't necessarily about peer reviewed studies, it's about empirical observations and applying the engineering method (which can be complemented by a more scientific one but shouldn't be confused for it). It's clear that this type of stuff is a game of cat and mouse. Attackers search for any possible vulnerability, bypass etc. It does make sense that exposing one side's machinery will make it easier for the other side to see how it works. A good example of that is how active hackers are at finding different ways to bypass Windows Defender by using certain types of Office file formats, or certain combinations of file conversions to execute code. Exposing the code would just make all of those immediately visible to everyone.
Eventually that's something that gets exposed anyways, but I think the crucial part is timing and being a few steps ahead in the cat and mouse game.
Otherwise I'm not sure what kind of proof would even be meaningful here.
I actually agree there is no intrinsic advantage in having this piece of software be open source - closed teams tend to have a more contained collaborator "blast radius", and you don't have 500 forks with patches that may modify behaviour in subtle ways and that are somehow conflated with the original project.
On the other hand, anyone serious about malware development already has "the actual source code", for either defensive or offensive operations.
Open source doesn't mean the bazaar; plenty of projects have cathedral-style development.
The bazaar works absolutely fine for security. The Linux kernel is one project which does this, and all security infrastructure uses it one way or another. Across tens of thousands of patches and forks, the intentionally-submitted subtle bug/vulnerability scenario has not once been discovered in 30 years.
There seems to be a lot of misconceptions in this thread what open source is or can do. Most of my points have been made by people much better than me for decades now.
I feel that having the solution open sourced isn't bad from a code security standpoint, but rather that it is simply not economically viable. To my knowledge most of the major open source technologies are currently funded by FAANG, purely because they're needed by them to conduct business, and the moment it becomes inconvenient for them to support a project they fork it or develop their own, see Terraform/Redis...
I also cannot get behind a government funding model, purely because it will simply become a design-by-committee nightmare, because this isn't flashy tech. Just see how many private companies have beaten NASA to market in a pretty well funded and very flashy industry. The very government you want to fund these solutions is currently running on private companies' infrastructure for all its IT needs.
Yes, open sourcing is definitely amazing and, if executed well, will be better, just like communism.
Plenty of fundamental research and development happens in academia fairly effectively.
Government has to fund it, not run it, just like any other grant works today. The existing foundations and non-profits like Apache, or even mixed ones like Mozilla, are fairly capable of handling the grants.
Expecting private companies or dedicated volunteers to maintain mission-critical libraries like xz, as we are doing now, is not a viable option.
> The phrase "root kit powered endpoint surveillance" is a mischaracterization, often fueled by misconceptions from the gaming community.
How exactly is this a mischaracterization? Technically these EDR tools are identical to kernel-level anticheat, and they are identical to rootkits, because fundamentally they're all the same thing, just with a different owner. If you disagree it would be nice if you explained why.
As for open source EDRs becoming the target, this is just as true of closed source EDR. Cortex for example was hilariously easy to exploit for years and years until someone was nice enough to tell them as much. This event from CrowdStrike means that it's probably just as true here.
The fact that the EDR is 90% of the work of attacking a Windows network isn't a sign that we should continue using EDRs. It means that nothing privileged should be in a Windows network. This isn't that complicated, I've administered such a network where everything important was on Linux while end users could run Windows clients, and if anything it's easier than doing a modern Windows/AD deployment. Good luck pivoting from one computer to another when they're completely isolated through a Linux server you have no credentials for. No endpoint should have any credentials that are valid anywhere except on the endpoint itself and no two endpoints should be talking to each other directly: this is in fact not very restrictive to end users and completely shuts down lateral movement - it's a far better solution than convoluted and insecure EDR schemes that claim to provide zero-trust but fundamentally can't, while following this simple rule actually provides you zero-trust.
Look at it this way - if you (and other redteamers) can economically get past EDR systems for the cost of a pentest, what do you think competent hackers with economies of scale and million dollar payouts can do? For now there's enough systems without EDRs that many just won't bother, but as it spread more they will just be exploited more. This is true as well of the technical analogue in kernel anticheat, which you and I can bypass in a couple days of work.
Where we are is that we're using EDRs as a patch over a fundamentally insecure security model in a misguided attempt to keep the convenience that insecurity brings.
People don't go around complaining that Microsoft Defender is "rootkit powered endpoint surveillance". Its intent is to protect the system.
There is a lot more suspicion around kernel-level anti-cheat software developed by the likes of Epic Games, due to their ownership, than there is around CrowdStrike or Microsoft.
People don't complain about kernel code from Microsoft because Microsoft wrote the kernel. You don't have a choice but to trust Microsoft with that.
People have been complaining about rootkit powered antimalware for a long time. It didn't start with CrowdStrike: there was a whole debacle about it in the Windows XP days when Microsoft stopped antiviruses from patching the kernel.
The value CrowdStrike provides is the maintenance of the signature database, and being able to monitor attack campaigns worldwide. That takes a fair amount of resources that an open source project wouldn’t have. It’s a bit more complicated than a basic hash lookup program.
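To illustrate the gap: the "basic hash lookup program" part really is trivial, something like the sketch below (the hash list is a placeholder). The expensive part is curating, updating, and distributing that list, plus the behavioural detections layered on top, which is essentially the service being sold.

```python
import hashlib
from pathlib import Path

# Placeholder "signature database" -- in reality this is huge, constantly
# updated, and curated by a paid research team.
KNOWN_BAD_SHA256 = {
    "0" * 64,  # fake hash, stands in for a real malware sample
}

def scan(root: str) -> None:
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
        except OSError:
            continue  # unreadable file; a real scanner would log this
        if digest in KNOWN_BAD_SHA256:
            print(f"known-bad file: {path}")

scan("/tmp")  # example target directory
```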
Crowdstrike is a gun. A tool. But not the silver bullet. Or training to be able to fire it accurately under pressure at the werewolf.
You can very easily shoot your own foot off instead of slaying the monster, use the wrong ammunition to be effective, or in this case a poorly crafted gun can explode in your hand when you are holding it.
DAT-style content updates and signature-based prevention are very archaic. Directly loading content into memory and a hard-coded list of threats? I was honestly shocked that CS was still doing DAT-style updates in an age of ML and real-time threat feeds. There are a number of vendors who've offered ML-based, real-time alternatives for almost a decade. We use one. We have to run updates a couple of times a year.
There are no "ethical standards" to move to. Nobody should be able to usurp control of our computers. That should simply be declared illegal. Creating contractual obligations that require people to cede control of their computers should also be prohibited. Anything that does this is malware and malware does not become justified or "ethical" when some corporation does it. Open source malware is still malware.
What does “our computer” mean when it is not owned by you, but issued to you to perform a task with by your employer? Does that also apply to the operator at a switchboard in a nuclear missile launch facility?
Does the switchboard in a nuclear missile launch facility run Crowdstrike? I picture it as a high quality analog circuit board that does 1 thing and 1 thing only. No way to run anything else.
Globally networked personal computers were a kind of cultural revolution against the setting you describe. Everyone had their own private compute and compute time, and everyone could share their own opinion. Computers became our personal extensions. This is what IBM, Atari, Commodore, Be, Microsoft and Apple (and later desktop Linux) sold. Now, given this ideology, can a company own my limbs? If not, they can't own my computers.
> What does “our computer” mean when it is not owned by you, but issued to you to perform a task with by your employer?
Well, presuming that:
1. the employee is issued a computer, that they have possession of even if not ownership (i.e. they bring the computer home with them, etc.)
2. and the employee is required to perform creative/intellectual labor activities on this computer — implying that they do things like connecting their online accounts to this computer; installing software on this computer (whether themselves or by asking IT to do it); doing general web-browsing on this computer; etc.
3. and where the extent of their job duties blurs the line between "work" and "not work" (most salaried intellectual-labor jobs are like this) such that the employee basically "lives in" this computer, even when not at work...
4. ...to the point that the employee could reasonably conclude that it'd be silly for them to maintain a separate "personal" computer — and so would potentially sell any such devices (if they owned any), leaving them dependent on this employer-issued computer for all their computing needs...
...then I would argue that, by the same chain of reasoning as in the GP post, employers should not be legally permitted to “issue” employees such devices.
Instead, the employer should either purchase such equipment for the employee, giving it to them permanently as a taxable benefit; or they should require that the employee purchase it themselves, and recompense them for doing so.
Cyberpunk analogy: imagine you are a brain in a vat. Should your employer be able to purchase an arbitrary android body for you; make you use it while at work; and stuff it full of monitoring and DRM? No, that'd be awful.
Same analogy, but with the veil stripped off: imagine you are paraplegic. Should your employer be allowed to issue you an arbitrary specific wheelchair, and require you to use it at work, and then monitor everything you do with it / limit what you can do with it because it’s “theirs”? No, that’d be ridiculous. And humanity already knows that — employers already can't do that, in any country with even a shred of awareness about accessibility devices. The employer — or very much more likely, the employer's insurance provider — just buys the person the chair. And then it's the employee's chair.
And yes, by exactly the same logic, this also means that issuing an employee a company car should be illegal — at least in cases where the employee lives in a non-walkable area, and doesn't already have another car (that they could afford to keep + maintain + insure); and/or where their commute is long enough that they'd do most non-employment-related car-requiring things around work and thus use their company car. Just buy them a car. (Or, if you're worried they might run away with it, then lease-to-own them a car — i.e. where their "equity in the car" is in the form of options that vest over time, right alongside any equity they have in the company itself.)
> Does that also apply to the operator at a switchboard…
Actually, no! Because an operator of a switchboard is not a “user” of the computer that powers the switchboard, in the same sense that a regular person sitting at a workstation is a "user" of the workstation.
The system in this case is a “kiosk computer”, and the operator is performing a prescribed domain-specific function through a limited UX they’re locked into by said system. The operator of a nuclear power plant is akin to a customer ordering food from a fast-food kiosk — just providing slightly more mission-critical inputs. (Or, for a maybe better analogy: they're akin to a transit security officer using one of those scanner kiosk-handhelds to check people's tickets.)
If the "computer" the nuclear-plant operator was operating, exposed a purely electromechanical UX rather than a digital one — switches and knobs and LEDs rather than screens and keyboards[1] — then nothing about the operator's workflow would change. Which means that the operator isn't truly computing with the computer; they're just interacting with an interface that happens to be a computer.
[1] ...which, in fact, "modern" nuclear plants are. The UX for a nuclear power plant control-center has not changed much since the 1960s; the sort of "just make it a touchscreen"-ification that has infected e.g. automotive has thankfully not made its way into these more mission-critical systems yet. (I believe it's all computers under the hood now, but those computers are GPIO-relayed up to panels with lots and lots of analogue controls. Or maybe those panels are USB HID devices these days; I dunno, I'm not a nuclear control-systems engineer.)
Anyway, in the general case, you can recognize these "the operator is just interacting with an interface, not computing on a computer" cases because:
• The machine has separate system administrators who log onto it frequently — less like a workstation, more like a server.
• The machine is never allowed to run anything other than the kiosk app (which might be some kind of custom launcher providing several kiosk apps, but where these are all business-domain specific apps, with none of them being general-purpose "use this device as a computer" apps.)
• The machine is set up to use domain login rather than local login, and keeps no local per-user state; or, more often, the machine is configured to auto-login to an "app user" account (in modern Windows, this would be a Mandatory User Profile) — and then the actual user authentication mechanism is built into the kiosk app itself.
• Hopefully, the machine is using an embedded version of the OS, which has had all general-purpose software stripped out of it to remove vulnerability surface.
> the employee could reasonably conclude that it'd be silly for them to maintain a separate "personal" computer — and so would potentially sell any such devices
What a bizarre leap of logic. Can FedEx employees reasonably sell their non-uniform clothes? Just because the employer in this scenario didn't 100% lock down the computer (which is a good thing, because the alternative would be incredibly annoying for day-to-day work) doesn't mean the employee can treat it as their own. Even from the privacy perspective, it would be pretty silly. Are you going to use the employer-provided computer to apply to your next job?
People do do it, though. Especially poor people, who might not use their personal computers very often.
Also, many people don't own a separate "personal" computer in the first place. Especially, again, poor people. (I know many people who, if needing to use "a PC" for something, would go to a public library to use the computers there.)
Not every job is a software dev position in the Bay Area, where everyone has enough disposable income to have a pile of old technology laying around. Many jobs for which you might be issued a work laptop still might not pay enough to get you above the poverty line. McDonald's managers are issued work laptops, for instance.
(Also, disregarding economic class for a moment: in the modern day, most people who aren't in tech solve most of their computing problems by owning a smartphone, and so are unlikely to have a full PC at home. But their phone can't do everything, so if they have a work computer they happen to be sat in front of for hours each day — whether one issued to them, or a fixed workstation at work — then they'll default to doing their rare personal "productivity" tasks on that work computer. And yes, this does include updating their CV!)
---
Maybe you can see it more clearly with the case of company cars.
People sometimes don't own any other car (that actually works) until they get issued a company car; so they end up using their company car for everything. (Think especially: tradespeople using their company-logo-branded work box-truck for everything. Where I live, every third vehicle in any parking lot is one of those.)
And people — especially poorer people — also often sell their personal vehicle when they are issued a company car, because this 1. releases them from the need to pay a lease + insurance on that vehicle, and 2. gets them possibly tens of thousands of dollars in a lump sum (that they don't need to immediately reinvest into another car, because they can now rely on the company car.)
The point is that if you do do it, it's on you to understand the limitations of using someone else's property. Just like the difference between rented vs owned housing.
There are also fairly obvious differences between work-issued computers and all of your other analogies:
1. A car (and presumably the cyberpunk android body) is much more expensive than a computer, so the downside of owning both a personal and a work one is much higher.
2. A chair or a wheelchair doesn't need security monitoring because it's a chair (I guess you could come up with an incredibly convoluted scenario where it would make sense to put GPS tracking in a wheelchair, but come on).
> just buys the person the chair. And then it's the employee's chair.
It's not because there's a law against loaning chairs, it's because the chair is likely customized for a specific person and can't be reused. Or if you're talking about WFH scenarios, they just don't want to bother with return shipping.
No, it's the difference between owned housing vs renting from a landlord who is also your boss in a company town, where the landlord has a vested interest in e.g. preventing you from using your apartment to also do work for a competitor.
Which is, again, a situation so shitty that we've outlawed it entirely! And then also imposed further regulations on regular, non-employer landlords, about what kinds of conditions they can impose on tenants. (E.g. in most jurisdictions, your landlord can't restrict you from having guests stay the night in your room.)
Tenants' rights are actually a great analogy for what I'm talking about here. A company-issued laptop is very much like an apartment, in that you're "living in it" (literally and figuratively, respectively), and that you therefore should deserve certain rights to autonomous possession/use, privacy, freedom from restriction/compromise in use, etc.
While you don't literally own an apartment you're renting, the law tries to, as much as possible, give tenants the rights of someone who does own that property; and to restrict the set of legal justifications that a landlord can use to punish someone for exercising those (temporary) rights over their property.
IMHO having the equivalent of "tenants' rights" for something like a laptop is silly, because that'd be a lot of additional legal edifice for not-much gain. But, unlike with real-estate rental, it'd actually be quite practical to just make the "tenancy" case of company IT equipment use impossible/illegal — forcing employers to do something else instead — something that doesn't force employees into the sort of legal area that would make "tenants' rights" considerations applicable in the first place.
No, that would be more like sleeping at the office (purely because of employee preferences, not because the employer forces you to or anything like that) and complaining about security cameras.
Tangent — a question you didn't ask, but I'll pretend you did:
> If employers allowed employees to "bring their own devices", and then didn't force said employees to run MDM software on those devices, then how in the world could the employer guarantee the integrity of any line-of-business software the employee must run on the device; impose controls to stop PII + customer-shared data + trade secrets from being leaked outside the domain; and so forth?
My answer to that question: it's safe to say that most people in the modern day are fine with the compromise that your device might be 100% yours most of the time; but, when necessary — when you decide it to be so — 99% yours, 1% someone else's.
For example, anti-cheat software in online games.
The anti-cheat logic in online games is this little nugget of code that runs on a little sub-computer within your computer (Intel SGX or equivalent). This sub-computer acts as a "black box" — it's something the root user of the PC can't introspect or tamper with. However:
• Whenever you're not playing a game, the anti-cheat software isn't loaded. So most of the time, your computer is entirely yours.
• You get to decide when to play an online game, and you are explicitly aware of doing so.
• When you are playing an online game, most of your computer — the CPU's "application cores", and 99% of the RAM — is still 100% under your control. The anti-cheat software isn't actually a rootkit (despite what some people say); it can't affect any app that doesn't explicitly hook into it.
• In a brute-force sense, you still "control" the little sub-computer as well — in that you can force it to stop running whatever it's running whenever you want. SGX and the like aren't like Intel's Management Engine (which really could be used by a state actor to plant a non-removable "ring -3" rootkit on your PC); instead, SGX is more like a TPM, or an FPGA: it's something that's ultimately controlled by the CPU from ring 0, just with a very circumscribed API that doesn't give the CPU the ability to "get in the way" of a workload once the CPU has deployed that workload to it, other than by shutting that workload off.
As much as people like Richard Stallman might freak out at the above design, it really isn't the same thing as your employer having root on your wheelchair. It's more like how someone in a wheelchair knows that if they get on a plane, then they're not allowed to wheel their own wheelchair around on the plane, and a flight attendant will instead be doing that for them.
How does that translate to employer MDM software?
Well, there's no clear translation currently, because we're currently in a paradigm that favors employer-issued devices.
But here's what we could do:
• Modern PCs are powerful enough that anything a corporation wants you to do, can be done in a corporation-issued VM that runs on the computer.
• The employer could then require the installation of an integrity-verification extension (essentially "anti-cheat for VMs") that ensures that the VM itself, and the hypervisor software that runs it, and the host kernel the hypervisor is running on top of, all haven't been tampered with. (If any of them were, then the extension wouldn't be able to sign a remote-attestation packet, and the employer's server in turn wouldn't return a decryption key for the VM, so the VM wouldn't start.) A rough sketch of that handshake follows the list below.
• The employer could feel free to MDM the VM guest kernel — but they likely wouldn't need to, as they could instead just lock it down in much-more-severe ways (the sorts of approaches you use to lock down a server! or a kiosk computer!) that would make a general-purpose PC next-to-useless, but which would be fine in the context of a VM running only line-of-business software. (Remember, all your general-purpose "personal computer" software would be running outside the VM. Web browsing? Outside the VM. The VM is just for interacting with Intranet apps, reading secure email, etc.)
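Here's that measure-then-release-the-key handshake as a purely conceptual sketch, with the hardware-backed signing faked using an HMAC. None of this corresponds to a real MDM or attestation API; the keys and measurement values are invented.

```python
import hashlib
import hmac

# Hypothetical values: in reality the key lives in a TPM/SGX-style element and
# the measurement covers the VM image, hypervisor, and host kernel.
DEVICE_KEY = b"per-device-secret-provisioned-at-enrollment"
KNOWN_GOOD = hashlib.sha256(b"vm-image|hypervisor|host-kernel").hexdigest()
VM_DISK_KEY = b"vm-disk-decryption-key"

def client_attest(stack_blob: bytes):
    # Measure the software stack and prove the measurement with the device key.
    measurement = hashlib.sha256(stack_blob).hexdigest()
    proof = hmac.new(DEVICE_KEY, measurement.encode(), hashlib.sha256).hexdigest()
    return measurement, proof

def server_release_key(measurement: str, proof: str):
    expected = hmac.new(DEVICE_KEY, measurement.encode(), hashlib.sha256).hexdigest()
    if hmac.compare_digest(proof, expected) and measurement == KNOWN_GOOD:
        return VM_DISK_KEY  # untampered stack: the work VM may boot
    return None             # tampered or unknown stack: refuse the key

measurement, proof = client_attest(b"vm-image|hypervisor|host-kernel")
assert server_release_key(measurement, proof) == VM_DISK_KEY
```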
There you go. An anti-cheat rootkit so ineptly coded it serves as literal privilege escalation as a service. Can we stop normalizing this stuff already?
My computer is my computer, and your computer is your computer.
The game company owns their servers, not my computer. If their game runs on my machine, then cheating is my prerogative. It is quite literally an exercise of my computer freedom if I decide to change the game's state to give myself infinite health or see through walls or whatever. It's not their business what software I run on my computer. I can do whatever I want.
It's my machine. I am the god of this domain. The game doesn't get to protect itself from me. It will bend to my will if I so decide. It doesn't have a choice in the matter. Anything that strips me of this divine power should be straight up illegal. I don't care what the consequences are for corporations, they should not get to usurp me. They don't get to create little extraterritorial islands in our domains where they have higher power and control than we do.
I don't try to own their servers and mess with the code running on them. They owe me the exact same respect in return.
> If their game runs on my machine, then cheating is my prerogative.
Sure.
However, due to the nature of how these games work, cheating cannot be prevented serverside only.
So, if you want to play the game, you have to agree to install the anti-cheat because it's the only way to actually stop cheating.
The only other alternative is to sell a separate category of gaming machines where users wouldn't have access to install cheats, using something like the TPM to enforce it.
I don't have to agree to a thing. They're the ones who should have to accept our freedom. We're not about to sacrifice our power and freedom for the sake of preventing cheating in video games. Not only are we going to play the games, we're going to impose some of our terms and conditions on these things.
Yes, that is why the owners of the computers (corps) use these tools - to maintain control over their hardware (and IP accessible on it). The end user is not the customer or user here.
Oh stop it. It’s not your machine, it’s your employer’s machine. You’re the user of the machine. You’re cargo-culting some ideological take that doesn’t apply here at all.
> It’s not your machine, it’s your employer’s machine.
Agreed. I'm fine with this, as long as the employer also accepts that I will never use a personal device for work, that I will never use a minute of personal time for work, and that my productivity is significantly affected by working on devices and systems provided and configured by the employer. This knife cuts both ways.
If only that were possible. Luckily for my employer, I end up thinking about problems to be solved during my off hours like when I'm sleeping and in the shower. Then again, I also think about non-work life problems sitting at my desk when I'm supposed to be working, so (hopefully) it evens out.
I don't think it's possible either. But the moment my employer forces me to install a surveillance rootkit on the machine I use for work—regardless of who owns the machine—any trust that existed in the relationship is broken. And trust is paramount, even in professional settings.
If you don't already have an antivirus on your work machine, you're in an extremely small minority. As a consultant with projects that go about a week, I've experienced the onboarding process of over a hundred orgs first hand. They almost all hand out a Windows laptop, and every single Windows laptop had an AV on it. It's considered negligent not to have some AV solution in the corporate world. And these days, almost all the fancy AVs live in the kernel.
Setting aside the question of whether these security tools are effective at their stated goal, what does this have to do with trust at all? Does the existence of a bank vault break the trust between the bank and the tellers? What is the mechanism that would prevent your computer from getting infected by a 0-day if only your employer trusted you?
> Does the existence of a bank vault break the trust between the bank and the tellers?
That's a strange analogy, since the vault is meant to safeguard customer assets from the public, not from bank employees. Besides, the vault doesn't make the teller's job more difficult.
> What is the mechanism that would prevent your computer from getting infected by a 0-day if only your employer trusted you?
There isn't one. What my employer does is trust that I take care of their assets and follow good security practices to the best of my abilities. Making me install monitoring software is an explicit admission that they don't trust me to do this, and with that they also break my trust in them.
You mean like AV software is meant to safeguard the computer from malware? I'm sure banks have a lot of annoying security-related processes that make tellers' jobs more difficult.
My experience is that in these workplaces where EDR is enforced on all devices used for work, your hypothetical is true (i.e. you are not expected to work on devices not provided by your employer - on the contrary, that is most likely forbidden).
But how come they didn't catch it in the testing deployments? What was the difference that caused it to happen when they deployed to the outside world? I find it hard to believe that they didn't test it before deployment. I also think companies should all have a testing environment before deploying 3rd-party components. I mean, we all install packages during development that fail or cause problems, but nobody thinks it is a good idea to do that directly in their production environment before testing, so how is this different?
My guess -- there are two separate pipelines: one for code changes and one for data files.
Pipeline 1 --
Code updates to their software are treated as material changes that require non-production and canary testing before global roll-out of a new "Version".
Pipeline 2 --
Content / channel updates are handled differently -- via a separate pipeline -- because only new malware signatures and the like are distributed via this route. The new files are just data files -- they are supposed to be in a standard format and only read, not "executed".
This pipeline itself must have been tested originally and found to be working satisfactorily -- but inside the pipeline there is no "test" stage that verifies the integrity of the data file so generated, nor, more importantly, a check that this new data file works without errors when deployed to the latest versions of the software in use.
The agent software that reads these daily channel files must have been "thoroughly" tested (as part of pipeline 1) for all conceivable data file sizes and simulated contents before deployment. (any invalid data files should simply be rejected with an error ... "obviously")
But the exact scenario here -- possibly caused by a broken pipeline in the second path (pipeline 2) -- created invalid data files with some quirks. And THAT specific scenario was not imagined or tested in the software-version dev-test-deploy pipeline (pipeline 1).
If this is true --
The lesson, obviously, is that even for "data"-only distributions and roll-outs, however standardized and stable their pipelines may be, testing is still an essential part before large-scale roll-outs. It will increase cost and add latency, sure, but we have to live with it. (Similar to how people pay for "security" software in the first place.)
Same lesson for enterprise customers as well -- test new distributions on non-production within your IT setup, or have a canary deployment in place before allowing full roll-outs into production fleets.
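As a rough illustration of the missing "test stage" in pipeline 2, here is a sketch of a pre-release gate for a candidate data file. The real channel-file format isn't public, so these checks are placeholders; the point is only that the file gets exercised (ideally by the exact parser build that ships in the agent, on a throwaway machine) before it is promoted to the rollout step.

#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical pre-release gate for the "data only" pipeline. The checks are
// stand-ins; a real gate would also load the file with the shipping parser.
bool candidate_is_releasable(const std::vector<unsigned char>& bytes) {
    if (bytes.size() < 16) {
        std::fprintf(stderr, "refusing release: file too small\n");
        return false;
    }
    // A file of nothing but zero bytes is clearly not a valid rule set.
    if (std::all_of(bytes.begin(), bytes.end(),
                    [](unsigned char b) { return b == 0; })) {
        std::fprintf(stderr, "refusing release: file is all zeros\n");
        return false;
    }
    // ...then feed the file to the exact agent build customers run, in a VM,
    // and only promote it to the staged/global rollout if that survives.
    return true;
}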
It was mentioned in one of the HN threads that the update was pushed overriding the settings customers had [1]. What recourse can any customer have in such a case?
But the problem here is that the code runs in kernel mode. As such, any data that it may consume should have been tested with the same care as the code itself, which has never been the case in this industry.
> I find it hard to believe that they didn't test it before deployment.
I’m not sure why you find that hard to believe - based on the (admittedly fairly limited) evidence we have right now, it’s highly unlikely that this deployment was tested much, if at all. It seems much more likely to me that they were playing fast and loose with definition updates to meet some arbitrary SLAs[1] on zero-day prevention, and it finally caught up with them. Much more likely than somehow every single real-world PC running their software being affected while their test machines were somehow all impervious.
[1] When my company was considering getting into endpoint security and network anomaly detection, we were required on multiple occasions by multiple potential clients to provide a 4-hour SLA on a wide number of CVE types and severities. That would mean 24/7 on-call security engineers and a sub-4-hour definition creation and deployment. Yes, that 4 hours was for the deployment being available on 100% of the targets. Good luck writing and deploying a high-quality definition for a zero day in 4 hours, let alone running it through a test pipeline, let alone writing new tests to actually cover it. We very quickly noped out of the space, because that was considered “normal” (at least to the potential clients we were discussing). It wouldn’t shock me if CS was working in roughly the same way here.
This whole f*up was a failure of management and processes at Crowdstrike. "Intern Steve" pushing faulty code to production on a Friday is only a couple of cm of the tip of an enormous iceberg.
I wrote this in another thread already, but the fuck-up was both at CrowdStrike (they borked a release) and, more importantly, at their customers. Shit happens even with the best testing in the world.
You do not deploy anything, ever on your entire production fleet at the same time and you do not buy software that does that. It's madness and we're not talking about small companies with tiny IT departments here.
That’s a tricky one. CrowdStrike is cybersecurity. Wait until the first customer complains that they were hit by WannaCry v2 because CrowdStrike wanted to wait a few days after they updated a canary fleet.
The problem here is that this type of update (a content update) should never be able to cause this however badly it goes. In case the software receives a bad content update, it should fail back to the last known good content update (potentially with a warning fired off to CS, the user, or someone else about the failed update).
In principle, updates that could go wrong and cause this kind of issue should absolutely be deployed slowly, but per my understanding, that’s already the practice for non-content updates at CrowdStrike.
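A minimal sketch of that "fall back to the last known good content update" behaviour, with invented names standing in for whatever the agent actually does when it ingests a channel file:

#include <cstdio>
#include <vector>

// Hypothetical types: RuleSet is whatever the agent builds out of a channel
// file, and try_load_rules is a stand-in for its (hopefully defensive) parser.
struct RuleSet { std::vector<unsigned char> raw; };

bool try_load_rules(const std::vector<unsigned char>& bytes, RuleSet* out) {
    if (bytes.empty()) return false; // placeholder validation; the real parser goes here
    out->raw = bytes;
    return true;
}

RuleSet apply_content_update(const std::vector<unsigned char>& new_bytes,
                             const RuleSet& last_known_good) {
    RuleSet fresh;
    if (!try_load_rules(new_bytes, &fresh)) {
        // Bad update: keep protecting with the previous rules and phone home,
        // instead of taking the machine down.
        std::fprintf(stderr, "content update rejected; keeping last known good\n");
        return last_known_good;
    }
    return fresh;
}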
Windows updates are also cybersecurity, but the customer has (had?) a choice in how to roll those out (with Intune nowadays?). The customer should decide when to update; they own the fleet, not the vendor!
You do not know if a content update will screw you over and mark all the files of your company as malware. The "It should never happen" situations are the thing you need to prepare for, the reason we talk about security as an onion, the reason we still do staggered production releases with baking times even after tests and QA have passed...
"But it's cybersecurity" is not a justification. I know that security departments and IT departments and companies in general love dropping the "responsibility" part on someone else, but in the end of the day the thing getting screwed over is the company fleet. You should retain control and make sure things work properly, the fact those billion dollar revenue companies are unable to do so is a joke. A terrible one, since IT underpins everything nowadays.
It is a justification, just not necessarily one you agree with.
Companies choose to work with Crowdstrike. One of the reasons they do that is 'hands-off' administration: let a trusted partner do it for you. There are absolutely risks of doing it this way. But there are also risks of doing it the other way.
The difference is, if you hand over to Crowdstrike, you’re not on your own if something goes wrong. If you manage it yourself, you’ve only got yourself working on the problem if something goes wrong.
Or worse, something goes wrong and your vendor says “yes, we knew about this issue and released the fix in the patch last Tuesday. Only 5% of your fleet took the patch? Oh. Sounds like your IT guys have got a lot of work on their hands to fix the remaining 95% then!”.
Sorry, this is untrue. Enterprises have SOCs and on-calls; if there is a high risk they can do at least minimal testing (which would have found this issue, as it has a 100% BSOD rate) and then a fleet rollout. It would have been rolled out by Friday evening in this case without crashing hundreds of thousands of servers.
The CS customer has decided to offload the responsibility of its fleet to CS. In my opinion that's bullshit and negligence (it doesn't mean I don't understand why they did it), particularly at the scale of some of the customers :)
Disagree with the part where you put the onus on the customer. As has been mentioned in another HN thread [1], this update was pushed ignoring whatever settings the customer had configured. The original mistake of the customer, if any, was that they didn't read this in the fine print of the contract (if this point about updates was explicitly mentioned in the contract).
1. https://news.ycombinator.com/item?id=41003390
Only highlighting that "best practice" of cybersecurity is, charitably, total bullshit; less charitably, a racket. This is apparent if you look at the costs to the day-to-day ability of employees to do work, but maybe it'll be more apparent now that people got killed because of it.
You'd think that the software would sit in a kind of sandbox so that it couldn't nuke the whole device but only itself. It's crazy that this is possible.
The software basically works as a kernel module as far as I understand, I don’t think there’s a good way to separate that from the OS while still allowing it to have the capabilities it needs to have to surveil all other processes.
And even then, you wouldn’t want the system to continue running if the security software crashes. Such a crash might indicate a successful security breach.
> You do not deploy anything, ever on your entire production fleet at the same time and you do not buy software that does that
I am sympathetic to that, but it's only possible if both policy and staffing allow.
for policy, there are lots of places that demand CVEs be patched within x hours depending on severity. A lot of times, that policy comes from the payment integration systems provider/third party.
However you are also dependent on programs you install not auto-updating. Now, most have an option to flip that off, but it's not always 100% effective.
> I am sympathetic to that, but it's only possible if both policy and staffing allow.
We are not talking about small companies here. We're talking about massive billion-dollar-revenue enterprises with enormous IT teams and in some cases multiple NOCs and SOCs and probably thousands of consultants all around, at minimum.
I find it hard to be sympathetic to this complete disregard of ownership just to ship responsibility somewhere else (because that is the real need at the end of the day, let's not joke around). I can understand it, sure, and I can believe - to a point - that someone did a risk calculation (possibility of a CrowdStrike upgrade killing all systems vs. a hack if we don't patch a CVE in <4h), but it's still madness from a reliability standpoint.
> for policy, there are lots of places that demand CVEs be patched within x hours depending on severity.
I'm pretty sure that when leadership needs to choose between production being down for an unspecified amount of time and taking the risk of delaying the patching (by hours, in this case), they will choose the delay.
Partners and payment integration providers can be reasoned with, contracts are not code. A BSOD you cannot talk away.
Sure, leadership is also now saying "but we were doing the same thing as everyone else, the consultants told us to, and how could we have known this random software with root on every machine we own could kill us?!" to cover their asses. The problem is solved already, since it impacted everyone, and they're not the ones spending their weekend hammering systems back to life.
> However you are also dependent on programs you install not auto-updating. Now, most have an option to flip that off, but it's not always 100% effective.
You choose what to install on your systems, and you have the option to refuse to engage with companies that don't provide such options. If you don't, you accept the risk.
Oh absolutely. There’s many levels of failure here. A few that I see as being likely:
- Lack of testing of a deployment
- Lack of required procedures to validate a deployment
- Engineering management prioritizing release pace over stability/testing
- Management prioritizing tech debt/pentests/etc far too low
- Sales/etc promising fast turnarounds that can’t be feasibly met while following proper standards
- Lack of top-down company culture of security and stability first, which should be a must for any security company
This outage wasn’t caused only by “the intern pushing release.” It was caused by a poor company culture (read: incorrect direction from the top) resulting in a lack of testing of the program code, lack of testing environment for deployments, lack of formal deployment process, and someone messing up a definition file that was caught by 0 other employees or automated systems.
I can't speak to its veracity but there's a screenshot making its way around in which Crowdstrike discouraged sites from testing due to the urgency of the update.
I don’t work with CS products atm, but my experience with a big CS deployment was exactly like this. They were openly quite hostile to any suggestion of testing their products; we were frequently rebuked for running our prod sensors on version n-1. I talked about it a bit in this comment.
It’s kind of hard to pitch “zero-day prevention” if you suggest people roll out definitions slowly, over the course of days/weeks. Thus making it a lot harder to charge to the moon for your service.
Now, if these sorts of things were battle tested before release, and had an (ideally decade-plus-long) history of stability with well-documented processes to ensure that stability, you can more easily make the argument that it’s worth it. None of those things are close to true though (and more than likely will never be for any AV/endpoint solution), so it is very hard to justify this sort of configuration.
Detecting system crashes would be hard. You could try logging and comparing timestamps on agent startups and see if the difference is 5 minutes or less. Buggy kernel drivers crash Windows hard and fast.
Store something like an `attemptingUpdate` flag before updating, and remove it if the update was successful. Upon system startup, if the flag is present, revert to the previous config and mark the new config bad.
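A small sketch of that flag-file idea, assuming the agent is able to restore its previous channel files; the path and the two helper functions are made up, and only the ordering of operations matters:

#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Invented path; in reality this would live wherever the agent keeps state.
const fs::path kFlag = "C:/ProgramData/agent/attempting_update";

void revert_to_previous_config() { /* assumed: restore the last good channel files */ }
void apply_new_config()          { /* assumed: swap in the freshly downloaded files */ }

void on_startup() {
    if (fs::exists(kFlag)) {
        // We crashed (or got rebooted) mid-update last time: roll back and
        // treat the new config as bad so we don't boot-loop on it.
        revert_to_previous_config();
        fs::remove(kFlag);
    }
}

void on_update() {
    std::ofstream(kFlag).put('1'); // set the flag *before* touching anything
    apply_new_config();            // if this takes the box down, on_startup() reverts
    fs::remove(kFlag);             // update survived: clear the flag
}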
One possible explanation could be automated testing deployments for definitions updates that don't run the current version of the definition consumer, and the old one they do run is unaffected.
Even on hn, comments advocating engineering excellence or just quality in general are frequently looked down on, which probably also tells you a lot about the wider world.
This is why we can’t have nice things, but maybe we just don’t want them anyway? “Mistakes will be made” is way less true if you actually put the effort in to prevent them, but I am beginning to think this has become code for quiet-quitters to telegraph a “I want to get paid for no effort and sympathize with others who feel the same” sentiment and appear compassionate and grimly realistic all at the same time.
Yes, billion-dollar companies are going to make mistakes, but almost always because of cost cutting, willful ignorance, or negligence. If average people are apologizing for them and excusing that, there has to be some reason that it’s good for them.
If it is a standard production rocket, I agree. If it is a first of kind or even third of kind launch, celebrating the lessons learned from a failure is a healthy attitude. This production software is not the same thing at all.
SpaceX celebrating when their rocket blows up after hitting a certain milestone is like us devs celebrating when the branch with that big new feature only fails a few tests. Did it pass? No. Are you satisfied with it as a first try? Probably.
I find it hard to believe they didn't do any testing. I wonder if they tested the virus signatures against the engine, but didn't check the final release artefact (the .sys file) and the bug was somehow introduced in the packaging step.
This would have been poor, but to have released it with no testing would have been the most staggering negligence.
The thing I don't understand about all of this is something else, much less technical and much more important.
Why was the blast radius so huge?
I have deployed much less important services much more slowly with automatic monitoring and rollback in place.
You first deploy to beta, where you don't get customer traffic; then, if everything goes right, to a small part of your fleet, slowly increasing the percentage of hosts that receive the update.
This would have stopped the issue immediately, and somehow I thought it was common practice...
It wasn't a software update. It was a signature database update. It's supposed to roll out as fast as possible. When you learn about a new virus, it's already in the wild, so every minute counts. You don't want to delay the update for a day just to find out that your servers were breached 20 hours ago.
We can see clearly now that this is a stupid approach. Viruses don't move that fast.
This situation is akin to the immune system overreacting and melting the patient in response to a papercut. This sometimes happens, but it's considered a serious medical condition, and I believe the treatment is to nuke someone's immune system entirely with hard radiation, and reinstall a less aggressive copy. Take from that analogy what you want.
Yes they do? And it’s more akin to a shared immune system than a single organism.
In this case, it’s not like viruses move fast relative to the total population of machines, but within the population of machines being targeted they do move fast.
Still, better to let them spread a bit and deal with the localized damage than risk nuking everything. There is such a thing as treatment that's very effective, but not used because of a low probability risk of terminal damage.
[SQL] Slammer spread incredibly quickly, even though the vulnerability was patched in the prior year.
> As it began spreading throughout the Internet, it doubled in size every 8.5 seconds. It infected more than 90 percent of vulnerable hosts within 10 minutes.
Worms are not technically viruses, but they can have similar impacts/perform similar tasks on an infected host.
Also keep in mind 8.5 million is likely the count of machines fully impacted, and does not count the machines that were impacted but were able to be automatically recovered.
> Also keep in mind 8.5 million is likely the count of machines fully impacted, and does not count the machines that were impacted but were able to be automatically recovered.
Do you have evidence of this? Please bring sources with you.
Can you explain why you find this idea of fast moving viruses so improbable? Just from the way the internet works, I wouldn’t be surprised if every reachable host could be infected in a few hours if the virus can infect a machine in a short time (a few seconds) and would then begin infecting other machines. Why is that so hard to imagine?
Proper firewalling for one. "Every reachable host" should be a fairly small set, ideally an empty set, when you're on the outside looking in.
And operating systems aren't that bad anymore. You don't have services out of the box opening ports on all the interfaces, no firewalls, accepting connections from everywhere, and using well-known default (or no) credentials.
Even stuff like the recent OpenSSH bug that is remotely exploitable and grants root access wasn't anything close to this kind of disaster because (a) most computers are not running SSH servers on the public internet (b) the exploit is rather difficult to actually execute. Eventually it might not be, but that gives people a bit of breathing space to react.
Most cyberattacks use old, unpatched vulnerabilities against unprotected systems combined with social engineering to get the payload past the network boundary. If you are within a pretty broad window of "up to date" on your OS and antivirus updates, you are pretty safe.
Microsoft puts the count at 8.5 million computers. So, percentage-wise, the MyDoom virus in 2004 infected a far greater share of computers in a month, which in the context of internet penetration, availability, and speeds in 2004 (40kb/s average, 450kb/s fastest) was about as fast as it could have spread. So it might as well have been 70 minutes, given that downloading a 50mb file on dial-up would take way longer than 70 minutes.
To the smart people below:
It’s clear to everyone that 70 minutes is not 1 month. The point is that it’s not a fair comparison: it would simply not have been possible to infect that many computers in 70 minutes; the internet infrastructure just wasn’t there.
It’s like saying “the Spanish flu didn’t do that much damage because there were fewer people on the planet” - it’s a meaningless absolute comparison, whereas the relative comparison is what matters.
There's also orders of magnitude more machines today than 20 years ago -- so it should be easier to infect more machines now than before, and yet no one can cite a virus that moved as quickly and did as much damage as what CrowdStrike did through gross negligence.
Computer security as a whole has improved, whilst the complexity of interconnected systems has exponentially increased.
This has made the barrier to entry for malware higher, and so means we no longer have the same historic examples of large scale worms targeting consumer machines that we used to.
At the same time the financial rewards for finding and exploiting a vulnerability within an organisations complex stack have greatly increased. The rewards are coupled to the time it takes to execute on the vulnerability.
This leads to what we have today: localised, and often specialised attacks against valuable targets that are executed as fast as possible in order to minimise the chance a target has to respond or the vulnerability they are exploiting to be burned.
Of course the “smart people below” must know this, so it’s unclear why they are pretending to be dumb.
> This leads to what we have today: localised, and often specialised attacks against valuable targets that are executed as fast as possible in order to minimise the chance a target has to respond or the vulnerability they are exploiting to be burned.
Yup, exactly that.
So what I'm saying is: it's beyond idiotic to combat this with a kernel-level backdoor managed by one entity and deployed across half the Internet. If anyone manages to breach that, they have a way to make their attack much simpler and much less localized (though they're unlikely to be prepared to capitalize on that). A fuckup on the defense side, on the other hand, can kill everything everywhere all at once. Which is what just happened.
It's a "cure" for disease that happens to both boost the potency of the disease, and, once in blue moon, randomly kills the patient for no reason.
The fact is that this does help organisations. Definitely not all of the orgs that buy Crowdstrike, but rapid defence against evolving threats is a valuable thing for companies.
So, individually it’s good for a company. But as a whole, and as currently implemented, it’s not good for everyone.
However that doesn’t matter. Because individually it’s a benefit.
Which is why I'm hoping that this incident will make both security professionals and regulators reconsider the idea of endpoint security as it's currently done, and that there will be some cultural and regulatory pushback. Maybe this will incentivize people to come up with other ideas on how to secure systems and companies, that don't look like a police state on steroids.
But you’re conflating a few different things here. The regulations don’t say “you must use a fragile kernel module that runs the risk of boot-locking” do they?
The underlying fault in this drama is Microsoft - third party code shouldn’t be able to have the impact it did, regardless of how it is loaded or what it does. Their commitment to supporting legacy interfaces has shot them in the foot here.
If HP pushed a dodgy printer driver (and if those still lived in the kernel) that nuked tens of millions of machines, would you be out here saying “regulators and security professionals need to re-consider printers”?
Microsoft will shit bricks, start to do something to isolate kernel modules, Crowdstrike will be the first shining user of this, and life will go on.
You’re not displaying a pattern of healthy behavior by creating numerous new accounts to try and provoke an argument on such a stupid point, without contributing anything of substance to the discussion.
ILOVEYOU is a pretty decent contender, although the Internet was smaller back then and it didn't "crash" computers, it did different damage. Computer viruses and worms can spread extremely quickly.
> infected millions of Windows computers worldwide within a few hours of its release
It’s quite unclear about what your point/agenda is here. Are you truly this unfamiliar with the topic? If so, why comment, and if not, then why comment?
It takes about 1 search and 2 clicks to find an article posted less than 24 hours after the initial infection, quoting 2.5 million infected machines.
No, they're in the wrong. They didn't test adequately, regardless of their motive for not doing so. Obviously reality is not backing up your theory there.
No, but there are impenetrable barriers. 0-days in particular are usually very specific and affect few systems directly, but even the broader ones aren't usually followed by a blanket attack that pwns everything and steals all the data or monies. Just about the only way to achieve this kind of blast radius is to have a kernel-level backdoor installed in every other computer on the planet - which is exactly what those endpoint "security" systems are.
It’s quite impressive really — crowdstrike were deploying a content update to all of their servers to warn them of the “nothing but nulls, anti-crowdstrike virus”
Their precognitive intelligence suggested that a world wide attack was only moments away. The same precognitive system showed that the virus was so totally incapacitating that the only safe response was to incapacitate the server.
Knowing that the virus was capable of taking down every crowdstrike server, they didn’t waste time trying it on a subset of servers.
Surely there is a happy medium between zero (nil, none, nada, zilch) staging and 24 hours of rolling updates? A single 30-second-or-so VM test would have revealed this issue.
There should have been a test catching the error before rollout. However, this doesn’t require a staged rollout as suggested by the GP comment (testing the update on some customers, who would still be hosed in that case); it only requires executing the test before the rollout.
Because the kernel needs to parse the data in some way, and that parser apparently was broken enough. Whether it could be done in a more resilient manner, I don't know; you need to remember that an antivirus works in a hostile environment and can't necessarily trust userspace, so they probably need to verify signatures and parse the payload in kernel space.
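For illustration, this is the general shape of "validate before you dereference" that such a kernel-side parser needs when reading an untrusted blob containing an offset table. The actual channel-file format is unknown, so the layout here (a 4-byte magic, a count, then offsets) is purely invented:

#include <cstdint>
#include <cstring>
#include <vector>

struct ParsedRules {
    std::vector<uint32_t> rule_offsets;
};

// Reject anything that doesn't fit, instead of walking off the end of the
// buffer and treating whatever happens to be there as a pointer or offset.
bool parse_rules(const uint8_t* data, size_t size, ParsedRules* out) {
    if (size < 8) return false;                       // header: magic + count
    if (std::memcmp(data, "RULE", 4) != 0) return false;

    uint32_t count;
    std::memcpy(&count, data + 4, 4);
    if (count > (size - 8) / 4) return false;         // count can't possibly fit in the file

    out->rule_offsets.resize(count);
    std::memcpy(out->rule_offsets.data(), data + 8, static_cast<size_t>(count) * 4);

    for (uint32_t off : out->rule_offsets) {
        if (off >= size) return false;                // every offset must stay in-bounds
    }
    return true;                                      // caller may now follow the offsets
}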
Yup. If they were delaying the update to half of their customers for 24 hours, and in those 24 hours some of their customers got hacked by a zero day, say leading to ransomware, the comment threads would be demanding their heads for that!
Sure. And if someone showed up here with a story about how they got attacked and ransomwared enterprise-wide in the however many several hours that they were waiting for their turn to rollout, what do you think HN response would be?
Hmm, maybe you could have companies pay more to be in the first rollout group? That'd go over well too.
True, there would be comments blaming CS for not doing a faster rollout. But there would also be some comments empathizing with CS's viewpoint and pointing out the conflicting compromise between velocity and correctness. Even now I think the comments wouldn't have been unequivocally critical of CS if the hosts affected were only a particular variant of Windows (say the issue was only seen on a version of Windows 10 that was two updates behind); there would have been some emphasizing the thorniness of the problem and sympathizing with CS.
It doesn't matter what kind of update it was: signature, content, etc. The only thing that matters is whether the update has the potential to disrupt the user's normal activity (let alone bricking the host); if yes, ensure it either works or has a staged rollout with a remediation plan.
It's answered in the post (in the thread) as well. But for comparison, when I worked for an AV vendor we pushed maybe 4 updates a day to a much bigger customer base (if the numbers reported by MS are true).
It was a long time ago and I wasn't as involved with this, so I don't know with certainty what was used and how. We had multiple channels for major product versions + beta customers on the most recent one. On top of these we could stage different CDN regions.
There were different types of data you could update and those might have been treated differently (e.g. simple file signatures vs definitions for heuristics).
One thing I am surprised no one has been discussing is the role Microsoft have played in this and how they set the stage for the CrowdStrike outage through a lack of incentive (profit, competition) to make Windows resilient to this sort of situation.
While they were not directly responsible for the bug that caused the crashes, Microsoft does hold an effective monopoly position over workstation computing space (I'd consider this as infrastructure at this point) and therefore have a duty of care to ensure the security/reliability and capabilities of their product.
Without competition, Microsoft have been asleep at the wheel on innovations to Windows - some of which could have prevented this outage.
For example, CrowdStrike runs in user space on macOS and Linux - does Windows not provide the capabilities needed to run CrowdStrike in user space?
What about innovations in application sandboxing which could mitigate the need for level of control CrowdStrike requires?
The fact is, Microsoft is largely uncontested in holding the keys to the world's computing infrastructure, and they have virtually no oversight.
Windows has fallen from making over 80% of Microsoft's revenue to 10% today - there is nothing wrong with being a private company chasing money - but when your product is critical to the operation of hospitals, airlines, critical infrastructure, you can't be out there tickling your undercarriage on AI assistants and advertisements to increase the product's profitability.
IMO Microsoft have dropped the ball on their duty of care to consumers and CrowdStrike is a symptom of that. Governments need to seriously consider encouraging competition in the desktop workspace market. That, or regulate Microsoft's Windows product.
So the grandparent poster has a fundamental misunderstanding of how Windows works, and why CrowdStrike has a kernel driver in the first place.
Microsoft has long desired to kick AV vendors out of kernel space and has even attempted to do so prior, however because of its dominant position in the market, it is unable to do so. I was at MS when an iteration of this effort was underway, and the EU said no.
See, Windows is a highly regulated OS today, and making a change like kicking out AV vendors from the kernel runs afoul of antitrust laws.
Microsoft also has ELAM: https://learn.microsoft.com/en-us/windows-hardware/drivers/i... which is a rootkit / bootkit defensive mechanism. A defect in the definition files (as noted in the twitter thread) is what caused the crash in an ELAM driver. CrowdStrike obviously was not following the required process for ELAM drivers.
All good points, I might have been slightly over-impassioned and under-informed in my original rant (though still salty at Microsoft's assault on the usability of Windows).
My understanding was that CrowdStrike breaking on Debian was actually the motivation for them moving to user-space on Linux. I'm surprised that, assuming they have the capability to do so, they haven't done the same on Windows.
I don’t run CrowdStrike and to the best of my knowledge haven’t had it installed on one of my systems (something similar ran on my machine at the last corporate job I had), so correct me if I’m wrong.
It seems great pains are made to ensure the CS driver is installed first _and_ cannot be uninstalled (presumably the remote monitor will notice) or tampered with (signed driver).
Then the driver goes and loads unsigned data files that can be arbitrarily deleted by end users? Can these files also be arbitrarily added by end users to get the driver to behave in ways that it shouldn’t? What prevents a malicious actor from writing a malicious data file and starting another cascade of failing machines or worse, getting kernel privileges?
These files cannot be deleted or modified by the user, even with admin privs. That would make it trivial to disable the antivirus. It's only possible by mounting the file system in a different OS, which is typically prevented by Bitlocker.
Not in the BitLocker configurations I've seen over the last few days. The file is deletable as a local administrator in safe mode without the BitLocker recovery key in at least some configurations.
Do these customers of CrowdStrike even have a say in these updates going out, or do they all just bend over and let CrowdStrike have full RCE on every machine in their enterprise?
I sure hope the certificate authorities and other crypto folks get to keep that stuff off their systems at least.
I don't know if there's a way to outsource ongoing endpoint security to a third party like Crowdstrike without giving them RCE (and ring 0 too) on all endpoints to be secured. Having Crowdstrike automate that part is kind of the point of their product.
Does anybody know if these “channel files” are signed and verified by the CS driver? Because if not, that seems like a gaping hole for a ring 0 rootkit. Yeah, you need privileges to install the channel files, but once you have it you can hide yourself much deeper in the system. If the channel files can cause a segfault, they can probably do more.
Any input for something that runs at such high privilege should be at least integrity checked. That’s the basics.
And the fact that you can simply delete these channel files suggests there isn’t even an anti-tamper mechanism.
This is a pretty brief 'analysis'. The poster traces back one stack frame in assembler, it basically amounts to just reading out a stack dump from gdb. It's a good starting point I guess.
These "channel files" sound like they could be used to execute arbitrary code... Would be a big embarrassment if it shows up in KDU as a provider...
(This is just an early guess from looking at some of the csagent in ida decompiler, haven't validated that all the sanity checks can be bypassed as these channel files appear to have some kind of signature attached to them.)
A 'channel file' is a file interpreted by their signature detection system. How far is this from a bytecode compiled domain specific language? Javascript anyone?
eBPF, much the same thing, is actually thought through and well designed. If it weren't, it would be easy to crash Linux.
This is what they do and they are doing badly. I bet it's just shit on shit under the hood, developed by somewhat competent engineers, all gone or promoted to management.
Oddly enough, there was an issue last month with CrowdStrike and RHEL 9 kernel where they were triggering a kernel panic when attempting to load a bpf program from their newer bpf sensor. One of the workarounds was to switch to their kernel driver mode.
This was obviously a bug in the RHEL kernel, because even if the bpf program was bunk it should not cause the kernel to panic. However, it's almost like CrowdStrike does zero testing of their software and looks at their end users as Test/QA.
The kernel update in question was released as part of a RHEL point release (9.3 or 9.4, I forget which).
I’m not sure how much early warning RH gives to folks when a kernel change comes in via a point release. Looking at https://www.redhat.com/en/blog/upcoming-improvements-red-hat..., it seems like it’s changing for 9.5. I hope CrowdStrike will be able to start testing against those beta kernels.
It was 9.4. I don’t think any amount of heads up will make a difference considering it took them like 3+ years to notice that E4S streams were a thing. Most of these security vendors tend to treat Linux as the red headed step child and do the least.. With that said, after the recent event it would seem that CrowdStrike treats all OSes as red headed step children lol
It's really difficult to evaluate the risk the CrowdStrike system imposed. Was this a confluence of improbable events or an inevitable disaster waiting to happen?
Some still-open questions in my mind:
- was the broken rule in the config file (C-00000291-...32.sys) human authored and reviewed or machine-generated?
- was the config file syntactically or semantically invalid according to its spec?
- what is the intended failure mode of the kernel driver that encounters an invalid config (presumably it's not "go into a boot loop")?
- what automated testing was done on both the file going out and the kernel driver code? Where would we have expected to catch this bug?
- what release strategy, if any, was in place to limit the blast radius of a bug? Was there a bug in the release gates or were there simply no release gates?
Given what we know so far, it seems much more likely that this was a "disaster waiting to happen" but I still think there's a lot more to know. I look forward to the public post-mortem.
Would any of these, or even a collection of these, resolving in some direction make it highly improbable that it'll happen again?
Seems to me that 3rd-party code, running in the kernel, on parsed inputs, that can be remotely updated is enough to be a disaster waiting to happen (gestures breezily at Friday).
That's, in the Taleb parlance, a Fat Tony argument, but barring it being a cosmic ray causing an uncorrected bit flip during deploy, I don't think there's room to call it anything but "a disaster waiting to happen".
Yes, if CrowdStrike was following industry best practices and this happened, it would teach us something novel about industry practices that we could learn from and use to reduce the risk of a similar scale outage happening again.
If they weren't following these practices, this is kind of a boring incident with not much to be learned, despite how dramatic the scale is. Practices like staged rollout of changes exist precisely because we've learned these lessons before.
Well, kernel code is kernel code, and kernel code in general takes input from outside the kernel. An audio driver takes audio data, a video driver might take drawing instructions, a file system interacts with files, etc. Microsoft, and others, have been releasing kernel code since forever and for the most part, not crashlooping their entire install base.
My Tesla remote updates ... hmph.
It doesn't feel like this is inherently impossible. It feels more like not enough design/process to mitigate the risks.
The kernel driver could have a data check on the channel file and fail gracefully (ignore the bad file) instead of BSODing.
This code is executed only once, during driver initialization, so it shouldn't add much overhead, but it would greatly improve reliability against a broken channel file.
This is going to code as radical, but I always assumed it was derivable from bog-standard first principles that would fit in any economics class I sat in for my 40 credits:
the natural cost of these bits we sell is zero, so in the long run, if the bar is "just write a good & tested kernel driver", there will always be one more subsequent market entrant who will go too cheap on engineering. Then, they touch the hot wire and burn down the establishment.
That doesn't mean capitalism bad, but it does mean I expect only Microsoft is capable of writing and maintaining this type of software in the long run.
Ex. The dentist and dental hygienist were asking me who was attacking Microsoft on Friday, and they were not going to get through to the subtleties of 3rd-party kernel driver release gating strategy.
MS has a very strong incentive to fix this. I don't know how they will. But I love when incentives align and assume they always will, in the long run.
To answer some of my questions based on the "Preliminary Post Incident Review", the config file was indeed invalid, it was only checked with a (buggy) validator, and then released to the whole world at once. Never was the config file ever tested with the actual software that would read it in an actual environment like the customer machines that got this update.
They don't say why it was invalid or really what the file is, but it seems like it is some kind of relatively complex set of rules that are evaluated by the kernel module. Presumably they are manually authored and reviewed and it seems possible the bug was missed in review because this was a relatively new type of rule.
So this isn't a case of an incident that slipped through a rigorous testing and release process following industry best practice, but rather a "disaster waiting to happen". Further, CrowdStrike CEO George Kurtz should have known better, considering an analogous incident happened under his watch as CTO of McAfee in 2010.
The glaring question is how and why it was rolled out everywhere all at once?
Many corporations have pretty strict rules on system update scheduling so as to ensure business continuity in case of situations like this but all of those were completely circumvented and we had fully synchronised global failure. It really does not seem like business as usual situation.
> The glaring question is how and why it was rolled out everywhere all at once?
Because the point of these updates is to be rolled out quickly and globally. It wasn't a system/driver update, but a data file update: think antivirus signature file. (Yes, I know it can get complicated, and that AV signatures can be dynamic... not the point here.)
Why those data updates skipped validity testing at the source is another question, and one that CrowdStrike better be prepared to answer; but the tempo of redistribution can't be changed.
A customer should be able to test an update, whether a signature file or literally any kind of update, before rolling it out to production systems. Anything else is madness. Being "vulnerable" for an extra few hours carries less risk than auto-updates (of any kind) on production systems. As we've seen here. If you can point to hard evidence to the contrary, where many companies were saved just in time because of a signature update and would have been exploited if they'd waited a few hours, I'd love to read about it. It would have to have happened on a rather large scale for all of the instances combined to have had a larger positive impact than this single instance.
Is it realistic that there's a threat actor that will be attacking every computer on the whole planet at once?
I can understand that it's most practical to update everyone when pushing an update to protect a few actively under attack but I can also imagine policies where that isn't how it's done, while still getting urgent updates to those under attack.
Which CrowdStrike gets to bypass because they claim to be an antivirus and malware detection platform - at least, this is what the executives they've wined and dined into the purchase contracts have been told. The update schedule is independently controlled by CrowdStrike, rather than by a system admin, I believe.
From the article on The Verge it seems that this kind of update is downloaded automatically even if you disable automatic updates. So those users who took this kind of issue seriously would have thought that everything was configured correctly to not automatically update.
It seems like a none-of-the-above situation, because each of those should have really minimized the chances of something like this happening. But this is pure speculation. Even the most perfect organization engineering culture can still have one thing get through... (Wasn't there some Linux incident a little while back, though?)
Quality starts with good design, good people, etc. the process parts come much after that. I'd like to think that if you do this "right" then this sort of stuff simply can't happen.
If we have organization/culture/engineering/process issues then we're likely not going to get an in-depth public post-mortem. I'd love to get one, just for all of us to learn from it. Let's see. Given the cost/impact, having something like the Challenger investigation with some smart uninvolved people would be good.
You do remember Solarwinds right? This is an obvious high value target, so it is reasonable to entertain malicious causes.
Given the number of systems infected, if you could push code that rebooted every client into a compromised state you’d still have run of some % of the lot until it was halted. That time window could be invaluable.
Now, imagine if you screw up the code and just boot loop everything.
I’d say business wise it’s better for crowd strike to let people think it’s an own-goal.
The truth may be mundane but a hack is as reasonable a theory as “oops we pushed boot loop code to world+dog”.
> The truth may be mundane but a hack is as reasonable a theory as “oops we pushed boot loop code to world+dog”.
No it's not. There are many signs that point to this being a mistake. There are very few that point to it being a hack. You can't just go "oh it being a hack is one of the options therefore it is also something worth considering".
In a world of complex systems, a "confluence of improbable events" is the same thing as "a disaster waiting to happen". It's the Swiss cheese model of failure.
Has anyone looked into their terms and conditions? Usually any resulting damage from software malfunctioning is excluded. Only the software itself being unavailable may be an SLA breach.
Typically there would also be some clauses where CS is the only one that is allowed to determine an SLA breach, SLA breaches only result in future licence credits no cash, and if you disagree it's limited to mandatory arbitration...
The biggest impact is probably only their reputation taking a huge hit. Losing some customers over this and making it harder to win future business.
No big company is going to agree to the terms and conditions that are listed on their website, they'll have their own schedules for indemnification that CS would agree to, not the other way around. Those 300 of the Fortune 500 companies are going to rip CS apart.
They will still need to hire lawyers to prove this. Thousands of litigants. I am sure there is some tort which is not covered by the arbitration agreement that would give a plaintiff standing, no?
Commenter on stack exchange had an interesting counter:
In some jurisdictions, any attempt to sidestep consumer law may be interpreted by the courts as conspiracy, which can prove more serious than merely accepting the original penalties.
They really are delusional, as a security person crowdstrike was overvalued before this event, and to everyone in tech this shows how bad their engineering practices are.
but they are able to insert themselves into this many enterprise machines! So regardless of your security credentials, they made good business decisions.
On the other hand, this may open the veil for a lot of companies to dump them.
This reminds me of the vulnerability that hit jwt tokens a few years ago, when you could set the 'alg' to 'none'.
Surely CrowdStrike encrypts and signs their channel files, and I'm wondering if a file full of 0's inadvertently signaled to the validating software that a 'null' or 'none' encryption algo was being used.
This could imply the file full of zeros is just fine, as the null encryption passes, because it's not encrypted.
That could explain why it tried to reference the null memory location, because the null encryption file full of zeroes just forced it to run to memory location zero.
The risk is, if this is true, then their channel-file verification system is critically exposed by being able to load malicious channel files through disabled encryption.
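For what it's worth, the generic "alg: none" pitfall alluded to above looks roughly like this in miniature. This is not a claim about how CrowdStrike's verifier works (nobody outside the company knows that); it only shows why a zeroed header that decodes to a "no algorithm" value is dangerous if "none" is treated as "nothing to check":

#include <vector>

enum class Alg { kNone = 0, kEd25519 = 1 }; // a zeroed header decodes to kNone

// Placeholder standing in for a real signature verification routine.
bool signature_ok(const std::vector<unsigned char>&, const std::vector<unsigned char>&, Alg) {
    return false;
}

// BUGGY: "none" is interpreted as "nothing to verify", so an unsigned
// (or all-zero) blob sails straight through.
bool verify_buggy(Alg alg, const std::vector<unsigned char>& payload,
                  const std::vector<unsigned char>& sig) {
    if (alg == Alg::kNone) return true; // <-- the hole
    return signature_ok(payload, sig, alg);
}

// SAFER: allow-list the algorithms you accept; reject "none" and anything unknown.
bool verify_strict(Alg alg, const std::vector<unsigned char>& payload,
                   const std::vector<unsigned char>& sig) {
    if (alg != Alg::kEd25519) return false;
    return signature_ok(payload, sig, alg);
}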
The only thing I know about crowdstrike is they hired a large percentage of the underperforming engineers we fired at multiple companies I’ve worked at
You'd use WinDBG today. It allows you to do remote kernel debugging over a network. This also includes running Windows in a virtual machine, and debugging it through the private network connection.
SoftIce predates me, but when I was doing filesystem filter driver work, the tool of choice was WinDbg. Been out of the trade for a bit, but it looks to still be in use. We had it set up between a couple of VMs on VMware.
They're referencing an (in)famous video of a drunk/drugged/tired Orson Welles attempting to do a commercial; his line is "Ahhh, the... French... champagne has always been celebrated for its excellence..."
I don't think there's anything more to the inclusion of "French" in their comment beyond it being in the original line.
lol, I’ve lost count of how many CI systems I’ve seen that are essentially no-ops, letting through all errors, because somewhere there was a bash script without set -o errexit.
In my experience, testing data and config is very rare in the whole industry.
Feeding software corrupted config files or corrupted content from its own database often makes software crash. Most often this content is "trusted" to be "correct".
The question I have is, why doesn't Windows have a way to allow booting still without the faulting kernel module?
I know there's safe mode, but that's the nuclear option, and safe mode isn't really "usable".
Couldn't a lot of this have been avoided if Windows could just retry its boot after a BSOD without the faulting module, and then they could push out a new module with a fix shortly after?
Try solving some crackme's. They're binary executables of various difficulty (with rated difficulty), where the goal ranges from finding a hardcoded password to making a keygen to patching the executable. They used to be more popular, but I'm guessing you can still find tutorials on how to get started and solve a simple one.
Take this with a grain of salt as I’m not an SME, but there is a need for volunteers on reverse-engineering projects such as the Zelda decompilation projects[1]. This would probably give you some level of exposure, particularly if you have an interest in videogames.
Writing your own simple programs and debugging/disassembling them is a solid option. Windbg and Ida are good tools to start with. Reading a disassembly is a lot easier than coding in assembly, and once you know what things like function calls and switch statements, etc. look like you can get a feel for what the original program was doing.
You can compile your own hello world and look at the executable with x64dbg. Press space on any instruction and you can assemble your own instruction in its place (optionally filling the leftover bytes with NOPs).
First you need to learn assembly; second, you can start by downloading Ghidra and decompiling some simple things you use to see what they do.
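Building on the suggestions above, a tiny program like this one is a reasonable first target: compile it with optimizations off so the mapping stays obvious, then open the binary in x64dbg, Ghidra, or IDA and match each construct to its disassembly.

#include <cstdio>

// Small enough to map every source line to the disassembly, but it still
// contains a call, a switch (often lowered to a jump table), and a loop.
int classify(int x) {
    switch (x % 4) {
        case 0:  return 10;
        case 1:  return 20;
        case 2:  return 30;
        default: return 40;
    }
}

int main() {
    int total = 0;
    for (int i = 0; i < 8; ++i) {
        total += classify(i);   // a plain call you can step into
    }
    std::printf("%d\n", total); // library call: watch the calling convention
    return 0;
}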
I wonder what privilege level this service runs at. If it's less than ring 0, i think some blame needs to go to Windows itself. If it's ring 0, did it really need to be that high??
Surely an OS doesn't have to go completely kaput due to one service crashing.
It's not a service, it's a driver. "Anti"malware drivers typically run with a lot of permissions to allow spying on all processes. Driver failures likely mean the kernel state is borked as well, so Windows errs on the side of caution and halts.
I am genuinely curious what their CI process that passed this looks like, as well as if they're doing any sort of dogfooding or manual QA? Are changes just CI/CD'd out to production right away?
No, this is kernel space, and so while all addresses are 'virtual', an unmapped address is an address that hasn't been mapped in the page tables. Normally critical kernel drivers and data are marked as non-pageable (note: the Linux kernel doesn't page out kernel memory; the NT kernel does, a legacy of when it was first written and the memory constraints of the time). So if a driver needs to access pageable data it must not be part of the storage flow (and CrowdStrike is almost certainly part of it), and it must be at the correct IRQL (the interrupt request level; anything at or above dispatch, AKA the scheduler's level, has severe restrictions on what can happen there).
So no an unmapped address is a completely different BSOD, usually PAGE_FAULT_IN_UNPAGED_AREA which is a very bad sign
PAGE_FAULT_IN_NONPAGED_AREA[1]... was the BSOD that occurred in this case. That's basically the first sign that it was a bad pointer dereference in the first place.
(DRIVER_)IRQL_NOT_LESS_OR_EQUAL[2][3] is not this case, but it's probably one of the most common reasons drivers crash the system generally. Like you said it's basically attempting to access pageable memory at a time that paging isn't allowed (i.e. when at DISPATCH_LEVEL or higher).
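As a rough sketch of that IRQL rule (my own toy illustration of the standard WDK pattern, not anything from the CrowdStrike driver; PAGED_CODE, KeGetCurrentIrql and DISPATCH_LEVEL are the stock WDK names):

#include <ntddk.h>

/* Pageable data may only be touched below DISPATCH_LEVEL: at or above it,
   a page fault cannot be serviced and the system bugchecks. */
VOID TouchPageableBuffer(PUCHAR Buffer, SIZE_T Length)
{
    PAGED_CODE();   /* asserts IRQL <= APC_LEVEL on checked builds */

    if (Buffer == NULL || Length == 0)
        return;

    if (KeGetCurrentIrql() >= DISPATCH_LEVEL)
        return;     /* refuse rather than risk an IRQL_NOT_LESS_OR_EQUAL bugcheck */

    volatile UCHAR first = Buffer[0];   /* a fault here can be serviced safely */
    (void)first;
}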
R8 is 0x9c in that example, which is somewhat typical for null+offset, but in the twitter thread it's 0xffff9c8e0000008a.
So the actual bug is further back. It's not a null pointer dereference, but it somehow results in the mov r8, [rax+r11*8] instruction reading random data (could be anything) into r8, which then gets used as a pointer.
As an example to illustrate the sibling comments’ explanations:
char *array = NULL;
int pos = 0x9C;
int a = array[pos]; // equivalent to *(array + 0x9C) - dereferencing NULL+0x9C, which is just address 0x9C
This will segfault (or equivalent) due to reading invalid memory at address 0x9C. Most people would call array[pos] a null pointer dereference casually, even though it’s actually a 0x9C pointer dereference, because there’s very little effective difference between them.
Now, whether this case was actually something like this (dereferencing some element of a null array pointer) or something like type confusion (value 0x9C was supposed to be loaded into an int, or char, or some other non-pointer type) isn’t clear to me. But I haven’t dug into it really, someone smarter than me could probably figure out which it is.
What we are witnessing quite starkly in this thread is that the majority of HN commenters are the kinds of people exposed to anti-woke/DEI culture warriors on Twitter.
0x9c (156 dec) is still a very small number, all things considered. To me that sounds like attempting to access an offset from null - for instance, using a null pointer to a struct type, and trying to access one of its member fields.
It is pretty common for null pointers to structures to have members dereferenced at small offsets, and people usually consider those null dereferences despite not literally being 0. (However, the assembly generated in this case does not match that access pattern, and in fact there was an explicit null check before the dereference.)
Such an invalid access of a very small address probably does result from a nullptr error:
#include <cstdio>

struct BigObject {
    char stuff[0x9c]; // random fields occupying the first 0x9c bytes
    int field;        // lives at offset 0x9c
};

int main() {
    BigObject* object = nullptr;
    printf("%d\n", object->field); // reads from nullptr + 0x9c
}
That will result in "Attempt to read from address 0x9c". Just because it's not trying to read from literal address 0x0 doesn't mean it's not a nullptr error.
In every real world implementation anyone cares about, it's zero. Also I believe it is defined to compare equal to zero in the standard, but don't quote me on that.
> Also I believe it is defined to compare equal to zero in the standard, but don't quote me on that.
That's true for the literal constant 0. For 0 in a variable it is not necessarily true. Basically when a literal 0 is assigned to a pointer or compared to a pointer the compiler takes that 0 to mean whatever bit pattern represents the null pointer on the target system.
What? If you have a null pointer to a class, and try to reference the member that starts 156 bytes from the start of the class, you'll dereference 0x9c (0 + 156).
I found Windows confusing. In Linux speak, was this some kind of kernel module thing that CS installed? It's all I can think of for why the machines BSODed.
It was a binary data file (supposedly invalid) that caused the actual CS driver component to BSOD. However, they used the ".sys" suffix to make it look just like a driver, supposedly so that Windows would protect it from being deleted by a malicious actor. AFAIU.
Windows filesystem protection doesn't rely upon the filename, but on the location.
They could have named their files "foo.cfg", "foo.dat", "foo.bla" and been equally protected.
The use of ".sys" here is probably related to the fact it is used by their system driver. I don't think anybody was trying to pretend the files there are system drivers themselves, and a quick look at the exports/disassembly would make that apparent anyway.
'Analysis' of the null pointer is completely missing the point. The simple fact of the matter is they didn't do anywhere near enough testing before pushing the files out. Auto-update comes with big responsibility; this was criminally reckless.
The other issue is that they push to everyone. As someone whose last job had a million boxes in the wild, and who was very aware that bricking them all would kill the company, we would NEVER push to them all at once: we'd push to a few "friends and family" (i.e. practice each release on ourselves first), then do a few % of the customer base and wait for problems, then maybe 10%, wait again, then the rest.
Of course we didn't have any third party loading code into our boxes outside our control (and we ran Linux).
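To make the shape of that staging concrete - purely a toy sketch with invented ring sizes and bake times, not any company's actual policy:

#include <stdbool.h>
#include <stdio.h>

struct ring { const char *name; double fleet_fraction; int bake_hours; };

static const struct ring rings[] = {
    { "internal dogfood", 0.001, 24 },
    { "early adopters",   0.01,  12 },
    { "10% of customers", 0.10,  12 },
    { "everyone else",    1.00,   0 },
};

/* Stand-in for real telemetry; in practice this would query crash/BSOD rates. */
static bool healthy_after(const struct ring *r) { (void)r; return true; }

int main(void)
{
    for (size_t i = 0; i < sizeof rings / sizeof rings[0]; i++) {
        printf("deploying to %s (%.1f%% of fleet), baking for %d h\n",
               rings[i].name, rings[i].fleet_fraction * 100.0, rings[i].bake_hours);
        if (!healthy_after(&rings[i])) {
            printf("halting rollout and rolling back\n");
            return 1;
        }
    }
    return 0;
}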
I'm not overly familiar with CrowdStrike's processes, but I assume they are long-running. If it's all loaded into memory, e.g. a config, I can't see how you'd get any performance gain at all. It just seems lazy.
The girl on the supermarket checkout said she hoped her computer wouldn't be affected. I knowingly laughed and said, "You probably don't have it on your own computer unless you're a bank."
She said, "I installed it before for my cybersecurity course but I think it was just a trial"
Imagine if Microsoft sold you a secure operating system, like Apple does. A staggering portion of the existing cybersecurity industry would be irrelevant if that ever happened.
Most enterprises these days also run stuff like Crowdstrike (or literally Crowdstrike) on their macOS deployments. Similarly Windows these days is bundled with OS-level antivirus which is sufficient for non-enterprise users.
Not in the security industry, but my take is that basically the desktop OS permissions and security model is wrong for a lot of these devices, but there is no alternative that is suitable or that companies are willing to invest in. Probably many of the highest-profile affected machines (airport terminals, signage, medical systems, etc.) should just resemble a phone/iPad/Chromebook in terms of security/trust, but for historical/cost/practical reasons are Windows PCs with Crowdstrike.
CrowdStrike uses eBPF on Linux and System Extensions on macOS, neither of which needs kernel-level presence. Microsoft should move towards offering these kinds of solutions to make AV and EDR more resilient on Windows devices, without jeopardising system integrity and availability.
What really blew my mind about this story is learning that a single company (CrowdStrike) has the power to push random kernel code to a large part of the world's IT infrastructure, at any time, at their will.
Correct me if I'm wrong but isn't kernel-level access essentially God Mode on every computer their software is installed on? Including spying on the entire memory, running any code, deleting data, installing ransomware? This feels like an insane amount of power concentrated into the hands of a single entity, on the level of a nuclear submarine. Wouldn't that make them a prime target for all sorts of nation-state actors?
This time the damage was (likely) unintentional and no data was lost (save for lost BitLocker keys), but were we really all this time one compromised employee away from the largest-ever ransomware attack, or even worse?
It's not perfectly clear yet if CrowdStrike is able to push executable code via those updates. It looks like they updated some definition files and not the kernel driver itself.
But the kernel driver obviously contains some bugs, so it's possible that those definition updates can inject code. There might be a bug inside the driver that allows code execution (it happens all the time that some file parsing code can be tricked into executing parts of the data). I'm not sure, but I guess a lot of kernel memory is not fully protected by NX bits.
I still have the gut feeling that this incident was connected to some kind of attack. Maybe a distraction from another attack while everyone is busy fixing all the clients. During this incident security measures were surely lowered: lists of BitLocker keys printed out for service technicians to fix the systems. Even the fix itself was to remove part of the CrowdStrike protection. I would really like to know what was inside the C-00000291*.sys file before the update replaced it with all zeros. Maybe it was a cleanup job to remove something concerning that went wrong. But Hanlon's razor tells me not to trust my gut: "Never attribute to malice that which is adequately explained by stupidity."
For what it's worth, I 10000% agree with your gut feeling. Mine is a gut feeling too, so I didn't mention it on HN; we typically don't talk about these kinds of gut feelings because of the directions they become speculative in (plus the razor). But what you wrote is exactly what is in my head, fwiw.
Data was lost in the knock on effects of this, I assure you.
> largest-ever ransomware attack
A ransomware attack would be a terrible use of this power. A terrorist attack or cover while a country invades another country is a more appropriate scale of potential damage here. Perhaps even worse.
This is the mini existential crisis I have randomly. The attack area for a modern IT computer is mind bogglingly massive. Computers are pulling and executing code from a vast array of “trusted” sources without a sandbox. If any one of those “trusted” sources are compromised (package managers, cdns, OS updates, security software updates, just app updates in general, even specific utilities like xz) then you’re absolutely screwed.
It’s hard not to be a little nihilistic about security.
Well, kernel agents and drivers are not uncommon; however, anyone doing anything at scale with something touching the kernel typically understands the system they're implementing it on well. That aside, I gather from skimming around (so I might be wrong here) that people were implementing this because of a business case, not a technical case. I read it's mostly used to achieve compliance (I think via shifted liability). So I think it was probably too easy for this to happen, and so it happened: someone in the bizniz dept said "if we run this software we are compliant with whatever, enabling XYZ multiple of new revenue, clear business case!!!" and the tech people probably went "bizniz people want this, the bizniz case is clear, this seems like a relatively advanced business who know what they're doing, it doesn't really do much on my system and I'm mostly deploying it to innocuous edge-user systems, so seems fine shrug" - and then a bad push happened, and lots and lots of IT departments had had the same convo aforementioned.
Could be wrong here so if anyone knows better and can correct me...plz do!
> implementing this because of a business case not a technical case
There are some certification requirements to do pentests/red teaming, and those security folk will all tell them to install an EDR, so they picked CrowdStrike - but the security people have a very valid technical case for that recommendation.
It doesn't shift liability to CrowdStrike; that's not how this works. In this specific case they are very likely liable due to gross negligence, but that is different.
The OS vendors themselves (Microsoft, Apple, all the linux distros) have this power as well via their automatic update channels. As do many others who have automatically-updating applications. So it's not a single company, it's many companies.
That's true; I suppose it doesn't feel as bad because they're much larger companies and more in the public eye. It's still scary to think about the amount of power they wield.
"What really blew my mind about this story is learning that a single company (CrowdStrike) has the power to push random kernel code to a large part of the world's IT infrastructure, at any time, at their will."
Isn't that every antivirus software and game anticheat?
It is a well known fact that these companies who hold huge sway over the world's IT landscape are commonly infiltrated at the top levels by intelligence agents.
I see a paradox that the null bytes are "not related" to the current situation and yet deleting the file seems to cure the issue. Perhaps the CS official statement that "This is not related to null bytes contained within Channel File 291 or any other Channel File." is poorly worded.
My opinion is that CS is trying to say the null bytes themselves aren't the actual root cause of the issue, but merely a trigger for the actual root cause, which is that CSAgent.sys has a problem where malformed input vectors can cause it to crash. Well designed programs should error out gracefully for foreseeable errors, like corrupted config files.
If we interpret that quoted sentence such that "this" is referring to "the logical error", and that "the logical error" is the error in CSAgent.sys that causes it to crash upon reading a bad channel file, then that statement makes sense.
This is a bit of a stretch, but so far my impression with CS corporate communication regarding this issue has been nothing but abject chaos, so this is totally on-brand for them.
> My opinion is that CS is trying to say the null bytes themselves aren't the actual root cause of the issue, but merely a trigger for the actual root cause,
My opinion is they say "unrelated" because they are trying to say unrelated - and hence no, this was not a trigger.
It seems really scary to me that CrowdStrike is able to push updates in real time to most of their customers' systems. I don't know of any other system that provides a similar method to inject code at kernel level. Not even Windows updates, as they always roll out with some delay and not to all computers at the same time.
If you wanted to attack high-profile systems, CrowdStrike would be one of the best possible targets.
The amount of self-pwning that goes on in both corporate and personal devices these days is insane. The number of games that want you to install kernel-level anti-cheat is astounding. The number of companies that have centralized remote surveillance and control of all devices, where access to this is through a great number of sloppily managed accounts, is beyond spooky.
Exactly. It's ridiculous to open up all or most of a company's systems to such a single point of failure. We install redundant PSUs, backup networks, generators, and many more things. But one single automatic update can bring down all systems within minutes. Without any redundancy.
I mean centralized control of devices is great for the far more common occurrence of Bob from accounting leaving his laptop on the train with his password on post-it note stuck to the screen.
Absolutely, there are many reasons for why it's useful and helps keep the IT department smaller. However, there could be a little more paranoia around how access is managed, which is possible to do without severely impacting the usability of the tool and without making work unnecessarily difficult.
The scarier thought I've had -- if a black hat had discovered this crash case, could it have been turned into a widely deployed code execution vulnerability?
Start a story for them: "and then, the hackers managed to install a rootkit which runs in kernel mode. The rootkit has sophisticated C2 mechanism with configuration files pretending to be drivers suffixed with .sys extensions. And then, they used that to prevent hospitals and 911 systems around the world from working, resulting in delayed emergency responses, injuries, possibly deaths".
After they cuss the hackers under their breath exclaiming something like: "they should be locked up in jail for the rest of their lives!...", tell them that's exactly what happened, but CS were the hackers, and maybe they should reconsider mandating installing that crap everywhere.
I mean, kernel-level access does provide features not accessible in userspace. Is it also overused when other solutions exist? You bet.
Most people don't need this stuff. Just keeping shit up to date - no, not on the nightly build branch, but like installing Windows updates at least a day or two after they come out. Or maybe regular antivirus scans.
But let's be honest, your kernel drivers are useless if your employees fall for phishing or social engineering. See, then it's not malware, it's an authorized user on the system... just copying data onto a USB drive, or a rogue employee taking your customer list to your competition. That fancy-pants kernel driver might be really good at stopping sophisticated threats, and I'm sure the marketing majors at any company cram products full of buzzwords. But remember, you can't fix incompetent or malicious employees unless you're taking steps to prevent it.
What's more likely: some foreign government hacking Kohl's? Or a script kiddie social-engineering some poor worker by pretending to be the support desk?
Not here to shit on this product; it has its place and it obviously does a good job... (heard it's expensive, but most XDR/EDR is)
Seems like we are learning how vulnerable certain things are once again. As a fellow security fellow, I must say that Jia Tan must be so envious that he couldn't have this level of market impact.
To trigger the crash, you need to write a bad file into C:\Windows\System32\drivers\CrowdStrike\
You need Administrator permissions to write a file there, which means you already have code execution permissions, and don't need an exploit.
The only people who can trigger it over network are CrowdStrike themselves... Or a malicious entity inside their system who controls both their update signing keys, and the update endpoint.
Anyone know if the updates use outbound HTTPS requests? If so, those companies that have crappy TLS terminating outbound proxies are looking juicy. And if they aren't pinning certs or using CAA, I'm sure a $5 wrench[1] could convince one of the lesser certificate authorities to sign a cert for whatever domain they're using.
Even if the HTTPS channel is compromised with a man-in-the-middle attack, the attacker shouldn't be able to craft a valid update unless they also compromised CrowdStrike's keys.
However, the fact that this update apparently managed to bypass any internal testing or staging release channels makes me question how good CrowdStrike's procedures are about securing those update keys.
Depends when/how the signature is checked. I could imagine a signature being embedded in the file itself, or the file could be partially parsed before the signature is checked.
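As a sketch of that ordering concern (verify_signature and parse_channel_file are hypothetical stand-ins of my own, not CrowdStrike's real functions): the safe pattern is to authenticate the raw bytes before any of them ever reach the parser.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; stubbed out here purely so the sketch compiles. */
static bool verify_signature(const unsigned char *buf, size_t len)   { (void)buf; (void)len; return false; }
static bool parse_channel_file(const unsigned char *buf, size_t len) { (void)buf; (void)len; return false; }

bool load_update(const unsigned char *buf, size_t len)
{
    /* Authenticate the whole blob first, so a tampered or truncated file
       never reaches the far more complex (and fragile) parsing code. */
    if (!verify_signature(buf, len))
        return false;
    return parse_channel_file(buf, len);
}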
It's wild to me that it's so normal to install software like this on critical infrastructure, but questions about how they do code signing is a closely guarded/obfuscated secret.
Though, I prefer to give people benefit of doubt for this type of thing. IMO, the level of incompetence to parse a binary file before checking the signature is significantly higher (or at least different) than simply pushing out a bad update (even if the latter produces a much more spectacular result).
Besides, we don't need to speculate.
We have the driver. We have the signature files [1]. Because of the publicity, I bet thousands of people are throwing it into binary RE tools right now, and if they are doing something as stupid as parsing a binary file before checking its signature (or not checking a signature at all), I'm sure we will hear about it.
We can't see how it was signed, because that happens on CrowdStrike's infrastructure, but checking the signature verification code is trivial.
Kind of a side tangent, but I'm currently (begrudgingly) working on a project with a Fortune 20 company that involves a complicated mess of PKI management, custom (read: non-standard) certificates, a variety of management/logging/debugging keys, and (critically) code signing. It's taken me months of pulling teeth just to get details about the hierarchy and how the PKI is supposed to work from my own coworkers in a different department (who are in charge of the project), let alone from the client. I still have absolutely zero idea how they perform code signing, how it's validated, or how I can test that the non-standard certificates can validate this black-box code signing process. So yeah, companies really don't like sharing details about code signing.
This wasn't a code update, just a configuration update. Maybe they don't put config updates through QA at all, assuming they are safe.
It's possible that QA is different enough from production (for example debug builds, or signature checking disabled) that it didn't detect this bug.
Might be an ordering issue, and that they tested applying update A then update B, but pushed out update B first.
The fact that it instantly went out to all channels is interesting. Maybe they tested it for the beta channel it was meant for (and it worked, because that version of the driver knew how to cope with that config) but then accidentally pushed it out to all channels, and the older versions had no idea what to do with it.
Or maybe they thought they were only sending it to their QA systems but pushed the wrong button and sent it out everywhere.
That's assuming they don't do cert pinning. Moreover, despite all the evil things you can supposedly do with a $5 wrench, I'm not aware of any documented cases of this sort of attack happening. The closest we've seen are misissuances seemingly caused by buggy code.
If you have a privilege escalation vulnerability, there are worse things you can do: just make the system unbootable by destroying the boot sector/EFI partition and overwriting system files. No more rebooting into safe mode and no more deleting a single file to fix the boot.
This would probably be classified as a terrorist attack, and frankly it's just a matter of time until we get one some day. A small dedicated team could pull it off. It just so happens that the people with the skills currently either opt for cyber criminality (crypto lockers and such), work for a state actor (think Stuxnet), or play defense in a cybersecurity firm.
Microsoft has leaked keys that weren't used for code signing. I've been on the receiving end of this actually, when someone from the Microsoft Active Protections Program accidentally sent me the program's email private key.
Microsoft has been tricked into signing bad code themselves, just like Apple, Google, and everyone else who does centralized review and signing.
Microsoft has had certificates forged, basically, through MD5 collisions. Trail of Bits did a good write-up of this years ago.
But I can't think of a case of Microsoft losing control of a code signing key. What are you referring to?
The hard part is the deploying. Yes, if you can get control of the CrowdStrike deployment machinery, you can do whatever you want on hundreds of millions of machines. But you don't need any vulnerabilities in the deployed CrowdStrike software for that, only in the deployment servers.
Call me crazy but that is a real worry for me, and has been for a while. How long until we see some large corporate software have their deployment process hijacked, and have it affect a ton of computers that auto-update?
One of the most dangerous versions of this IMO is someone who compromises a NPM/Pypi package that's widely used as a dependency. If you can make it so that the original developer doesn't know you've compromised their accounts (spear-phished SIM swap + email compromise while the target is traveling, for instance, or simply compromising the developer themselves), you don't need every downstream user to manually update - you just need enough projects that aren't properly configured with lockfiles, and you've got code execution on a huge number of servers.
I'm hopeful that the fallout from Crowdstrike will be a larger emphasis on software BOM risk - when your systems regularly phone home for updates, you're at the mercy of the weakest link in that chain, and that applies to CI/CD and end user devices alike.
As always, a relevant xkcd[1]. I would not be surprised if the answer to “how many machines can be compromised in 24 hours by threatening one person” was less than 8 figures. If you can find the right person, probably 9+.
I mean, isn't that roughly the SolarWinds story? There is no real shortage of supply-chain incidents in the last few years. The reality is we are all mostly okay with that tradeoff.
I had that same one. If loading a file crashed the kernel module, could it have been exploitable? Or was there a different exploitable bug in there?
Did any nation states/other groups have 0-days on this?
Did this event reveal something known to the public, or did this screw up accidentally protect us from someone finding + exploiting this in the future?
Meta conversation: X hid both of these responses under "Show Probable Spam", even though both were pretty valid, with one even getting a reply from the creator.
I just don't understand how they still have users.
Relatedly, it's crazy to me how many people still get their news from X. I mean serious people, not just Joe Schmoe.
The probable spam thing was nuts to me too. My guess was it's maybe trying to detect users with lower engagement. Like people who aren't moving the investigation forward but are trying to follow it and be in the discussion.
One of the things to keep in mind is that Twitter had most of these misfeatures before Musk bought it.
The basic problem is, no moderation results in a deluge of spam and algorithmic moderation is hot garbage that can only filter out the bulk of the spam by also filtering out like half of the legitimate comments. Human moderation is prohibitively expensive unless you want to hire Mechanical Turk-level moderators and not give them enough time to do a good job, in which case you're back to hot garbage.
Nobody really knows how to solve it outside of the knob everybody knows about that can improve the false negative rate at the expense of the false positive rate or vice versa. Do you want less ham or more spam?
I agree the problem is hard from a technical level.
The problem is also getting significantly worse because it's trivial to generate entire pages of inorganic content with LLMs.
The backstories of inorganic accounts are also much more convincing now that they can be generated by LLMs. Before LLMs, backstories all focused on a small handful of topics (e.g. sports, games) because humans had to generate them from playbooks of best practices. Now they can be into almost anything.
I use X solely for the AI discussions and I actively curate who I follow, but where is there a better platform to join in conversations with the top 500 people in a particular field?
I always assumed that the reason legit answers often fall under "Show probable spam" is because of the inevitable reports coming in on controversial topics. It seems like the community notes feature works well most of the time.
If bad spam detection were such a big issue for a social platform, YouTube wouldn't be used by anyone ;). In fact it's even worse on YouTube: the same pattern of accounts with weird profile pictures copy-pasting an existing comment as-is and posting it, across thousands of videos, and it's been going on for a year now. It's actually so basic that I really wonder if there's some other secret sauce to those bots that makes them undetectable.
Well if it's just the comments, I think a lot of people just don't read those. In fact, it's a fair bit of effort just to read the descriptions with the YouTube app on some devices (e.g. smart TVs), and it's really not worth the effort to read the comments when users can just move on to the next video.
I don't necessarily think that's true anymore. YouTube comments are important to the algorithm so creators are more and more active in the comment section, and the comments in general have been a lot more alive and often add a lot of context or info for some type of videos. YouTube has also started giving the comments a lot more visibility in the layout (more than say, the video description). But you're probably right w.r.t platforms like TVs.
Before this wave of insane bot spam, the comments had started to be so much better than what they used to be (low-effort boomer spam). In fact, I think they were much better than the absolute cringy mess that comments on dedicated forums like Reddit have turned into.
I'd go so far to say that almost all responses that I see under "probable spam" are legitimate. Meanwhile real spam is everywhere in replies, and most ads are dropshipped crap and crypto scams with community notes. It's far worse than it's ever been before.
I believe that is dependent on your account settings. I block all comments from accounts that do not have a verified phone number, for example, and those get dropped into that bucket.
There’s literally not a better alternative and nobody seems to be earnestly trying to fill that gap. Threads is boomer chat with an instagram requirement. Every Mastodon instance is slow beyond reason and it’s still confusing to regular users in terms of how it works. And is Bluesky still invite only? Honestly haven’t heard about it in a long time.
Mastodon is a PERFECT replacement. But it'll never win because there isn't a business propping it up and there is inherent complexity, mixed with the biggest problem, cost.
No one wants to pay for anything, and that's the true root of every issue around this. People complain YouTube has ads, but won't buy Premium. People hate Elon and Twitter but won't take even an ounce of temporary inconvenience to try and solve it.
Threads exists, and I'm happy they integrate with ActivityPub, which should give us the best of both worlds. Why don't people use Threads? It's a little more popular outside the US, but personally I think the "algorithm" pushes a lot of engagement-bait nonsense.
> No one wants to pay for anything, and that's the true root of every issue around this. People complain YouTube has ads, but won't buy Premium.
Perhaps if buying into a service guaranteed that they would not be sold out then there would be more engagement. When someone signs up it is pretty much a rock-hard guarantee that their personal information will be marketed and sold to any entity with the money and interest to buy it - paying customers, free-loaders, etc.
When someone chooses to buy your app or SaaS then they should be excluded from the list of users that you sell or trade between "business partners".
When paying for a service guarantees that you're selling all details of your engagement with that service to unrelated business entities you have a disincentive to pay.
People are wising up to all this PII harvesting and those clowns who sold everyone out need to find a different model or quit bitching when real people choose to avoid their "services" since most of these things are not necessary for people to enjoy life anyway. They are distractions.
EDIT: This is not intended as a personal attack on you but is instead a general observation from the perspective of someone who does not use or pay for any apps or SaaS services and who actively avoids handing out accurate personal information when the opportunity arises.
In my experience, Mastodon is nice until you want to partake in discussions. To do so, you need an account.
With an account you can engage in civilized discussions. Some people don't agree with you, and you don't agree with some people. That's fine, maybe you'll learn something new. It's a discussion.
And then, suddenly, a secret court convenes and kills your account just like that; no reason will be given, no recourse will be available, admins won't reply, and you can do two things: go away for good, or try again on a different server.
I'm happy with a read-only Mastodon via a web interface.
But read-write? Never again, I probably don't have the correct ideology for it.
All the people I know that are still active on Twitter because they need to be "informed" are constantly sending me alarmist "news" that breaks on Twitter that, far more often than not, turns out to be wrong.
> Every Mastodon instance is slow beyond reason and it’s still confusing to regular users in terms of how it works.
I'll concede the confusing part but all the major Mastodon servers I interact with regularly are pretty quick so I'm not sure where that part comes from.
It is not so bad with Mastodon, but a lot of fedi software gets slower the longer it's been running. "Akkoma rot" is the one that's most typically talked about, but the universe of Misskey forks experiences the same problems, and Mastodon can sometimes absolutely crunch to a halt on 4GB of RAM even for a single-user instance.
Maybe the experience varies depending on where the user is located. Users near Mastodon servers (possibly on the US East or West Coast) may not feel the slowness as much as users in other parts of the world. I see noticeably slower response times when I use Mastodon from my location (Korea).
I think a lot of people use Hetzner. I notice slowness, especially with media, in Hong Kong. A workaround I've found is to use VPNs that seem to route over networks with better peering with local ISPs.
It is the best internet social feed for me as well. I use Pro a lot for following different communities, and there is nothing today that comes close to being on the edge of change online.
Some people don't jump on every fad out there. Most of the people who miss out on fads quickly realize that they aren't losing out on much simply because fads are so ephemeral. As far as I can tell, this is normal (though different people will come to that realization at different stages of their life).
While a fad (in this context) depends upon a company maintaining a product, the act of maintaining a product is not a measure of how long the fad lasts. Take Facebook, the product. I'm fairly certain that it is long past its peak as a communications tool between family, friends, and colleagues. Facebook, the company, remains relevant for other reasons.
As for ChatGPT, I'm sure time will prove it is a fad. That doesn't mean that LLMs are a fad (though it is too early to tell).
Sadly enough the "average" instagram user doesn't use threads. It's just a weird subset of them that use it, and imo it's not the subset that makes Instagram great lol. (It's a lot of pre 2021 twitter refugees, and that's an incredibly obnoxious and self centered crowd in my experience)