I wrote this in another thread already, but the fuck-up was both at CrowdStrike (they borked a release) and, more importantly, at their customers. Shit happens even with the best testing in the world.

You do not deploy anything, ever, on your entire production fleet at the same time, and you do not buy software that does that. It's madness, and we're not talking about small companies with tiny IT departments here.




That’s a tricky one. CrowdStrike is cybersecurity. Wait until the first customer complains that they were hit by WannaCry v2 because CrowdStrike wanted to wait a few days after they updated a canary fleet.

The problem here is that this type of update (a content update) should never be able to cause this, no matter how badly it goes. If the software receives a bad content update, it should fall back to the last known good content update (potentially with a warning fired off to CS, the user, or someone else about the failed update); something like the sketch below.

In principle, updates that could go wrong and cause this kind of issue should absolutely be deployed slowly, but per my understanding, that’s already the practice for non-content updates at CrowdStrike.
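
Illustrative only: a minimal sketch of the "fall back to last known good" idea from the paragraph above. Every name, path, and the JSON format here is invented; this is not how CrowdStrike's sensor actually handles content, just the general shape of the pattern.

    import json
    import logging
    import shutil
    from pathlib import Path

    CONTENT_DIR = Path("/var/lib/sensor/content")      # hypothetical layout
    CANDIDATE = CONTENT_DIR / "candidate.json"         # freshly downloaded content update
    LAST_KNOWN_GOOD = CONTENT_DIR / "lkg.json"         # last update that loaded cleanly

    log = logging.getLogger("content-loader")

    def validate(path: Path) -> dict:
        # Parse and sanity-check a content file; raise on anything malformed.
        data = json.loads(path.read_text())
        if not isinstance(data.get("rules"), list):
            raise ValueError("content file is missing a 'rules' list")
        return data

    def load_content() -> dict:
        # Try the fresh update first; on any failure, fall back to the last
        # known good copy and raise an alert instead of taking the host down.
        try:
            data = validate(CANDIDATE)
            shutil.copy2(CANDIDATE, LAST_KNOWN_GOOD)   # promote candidate to LKG
            return data
        except Exception:
            log.exception("bad content update, falling back to last known good")
            return validate(LAST_KNOWN_GOOD)

The point is only that a bad content file should degrade to "keep using yesterday's definitions and complain", not "take the machine down".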


Windows updates are also cybersecurity, but the customer has (had?) a choice in how to roll those out (with Intune nowadays?). The customer should decide when to update; they own the fleet, not the vendor!

You do not know if a content update will screw you over and mark all of your company's files as malware. The "it should never happen" situations are exactly the things you need to prepare for; they're the reason we talk about security as an onion, the reason we still do staggered production releases with baking times (see the sketch below) even after tests and QA have passed...

"But it's cybersecurity" is not a justification. I know that security departments and IT departments and companies in general love dropping the "responsibility" part on someone else, but in the end of the day the thing getting screwed over is the company fleet. You should retain control and make sure things work properly, the fact those billion dollar revenue companies are unable to do so is a joke. A terrible one, since IT underpins everything nowadays.


It is a justification, just not necessarily one you agree with.

Companies choose to work with Crowdstrike. One of the reasons they do that is ‘hands-off’ administration: let a trusted partner do it for you. There are absolutely risks of doing it this way. But there are also risks of doing it the other way.

The difference is, if you hand over to Crowdstrike, you’re not on your own if something goes wrong. If you manage it yourself, you’ve only got yourself working on the problem if something goes wrong.

Or worse, something goes wrong and your vendor says “yes, we knew about this issue and released the fix in the patch last Tuesday. Only 5% of your fleet took the patch? Oh. Sounds like your IT guys have got a lot of work on their hands to fix the remaining 95% then!”.


> The customer should decide when to update, they own the fleet not the vendor!

The CS customer has decided to update whenever CS says, 24/7. The alternative is to arrive on Monday morning to an infected fleet.


Sorry, this is untrue. Enterprises have SOCs and on-calls; if there is a high risk, they can do at least minimal testing (which would have found this issue, as it has a 100% BSOD rate) and then a fleet rollout, something like the canary gate sketched below. In this case it would have been rolled out by Friday evening without crashing hundreds of thousands of servers.

The CS customer has decided to offload the responsibility of its fleet to CS. In my opinion that's bullshit and negligence (it doesn't mean I don't understand why they did it), particularly at the scale of some of the customers :)
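
To be concrete about "minimal testing": even a crude canary gate like this sketch would have caught an update with a 100% BSOD rate before a fleet-wide rollout. The hostnames, the RDP-port liveness probe, and the soak time are all hypothetical.

    import socket
    import time

    CANARIES = ["canary-01.corp.example", "canary-02.corp.example"]  # hypothetical hosts

    def still_alive(host, port=3389, timeout=5.0):
        # Crude liveness probe: can we still reach RDP on the box?
        # A machine stuck in a BSOD boot loop fails this immediately.
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def canary_gate(apply_update, soak_seconds=1800):
        # Apply the update to the canaries only, let it soak, then confirm they
        # are still standing before a fleet-wide rollout is allowed to proceed.
        for host in CANARIES:
            apply_update(host)
        time.sleep(soak_seconds)
        return all(still_alive(h) for h in CANARIES)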


> they can do at least minimal testing (which would have found this issue as it has a 100% bsod rate)

Incorrect, I believe, given they did not and could not get advance sight of the offending forced update.


I doubt CrowdStrike had done any testing of the update.


Apparently CrowdStrike bypassed clients' staging areas with this update.

Source: https://x.com/patrickwardle/status/1814367918425079934


Disagree with the part where you put the onus on the customer. As has been mentioned in another HN thread [1], this update was pushed ignoring whatever settings the customer had configured. The original mistake of the customer, if any, was that they didn't read this in the fine print of the contract (if this point about updates was explicitly mentioned in the contract at all).

[1] https://news.ycombinator.com/item?id=41003390


> you do not buy software that does that

Note how the incident disproportionately affected highly regulated industries, where businesses don't get the choice to screw "best practice".


It only highlights that cybersecurity "best practice" is, charitably, total bullshit; less charitably, a racket. This is already apparent if you look at the cost to employees' day-to-day ability to do work, but maybe it'll be more apparent now that people got killed because of it.


It’s absolutely a racket.


You'd think the software would sit in some kind of sandbox so that it could only nuke itself, not the whole device. It's crazy that this is possible.


The software basically works as a kernel module, as far as I understand; I don’t think there’s a good way to separate that from the OS while still giving it the capabilities it needs to surveil all other processes.


And even then, you wouldn’t want the system to continue running if the security software crashes. Such a crash might indicate a successful security breach.


Something like eBPF.


> You do not deploy anything, ever on your entire production fleet at the same time and you do not buy software that does that

I am sympathetic to that, but it's only possible if both policy and staffing allow.

for policy, there are lots of places that demand CVEs be patched within x hours depending on severity. A lot of times, that policy comes from the payment integration systems provider/third party.

However, you are also dependent on the programs you install not auto-updating. Now, most have an option to flip that off, but it's not always 100% effective.


> I am sympathetic to that, but it's only possible if both policy and staffing allow.

We are not talking about small companies here. We're talking about massive billion-dollar-revenue enterprises with enormous IT teams and, in some cases, multiple NOCs and SOCs and probably thousands of consultants all around at minimum.

I find it hard to be sympathetic to this complete disregard of ownership just to ship responsibility somewhere else (because that's what this is really about at the end of the day, let's not joke around). I can understand it, sure, and I can believe - to a point - that someone did a risk calculation (possibility of a CrowdStrike upgrade killing all systems vs. a hack if we don't patch a CVE in <4h), but it's still madness from a reliability standpoint.

> for policy, there are lots of places that demand CVEs be patched within x hours depending on severity.

I'm pretty sure that when leadership has to choose between production being down for an unspecified amount of time and taking the risk of delaying the patching (by hours, in this case), they will choose the delay. Partners and payment integration providers can be reasoned with; contracts are not code. A BSOD you cannot talk away.

Sure, leadership is also now saying "but we were doing the same thing as everyone else, the consultants told us to, and how could we have known this random software with root on every machine we own could kill us?!" to cover their asses. The problem is solved already, since it impacted everyone, and they're not the ones spending their weekend hammering systems back to life.

> However, you are also dependent on the programs you install not auto-updating. Now, most have an option to flip that off, but it's not always 100% effective.

You choose what to install on your systems, and you have the option to refuse to engage with companies that don't provide such options. If you don't, you accept the risk.


> You do not deploy anything, ever on your entire production fleet at the same time

And if an attacker does??


Shit might happen even with the best testing, but with decent testing it would not have been this serious.



