The moment I read 'it is a content update that causes the BSOD, deleting it solves the problem', I was immediately willing to bet a hundred quid (for the non-British, that's £100) that it was a combination of said bad binary data and a poorly-written parser that didn't error out correctly upon reading invalid data (in this case, one that read an array of pointers and didn't verify that each of them was both non-null and pointed to valid data/code).
In the past ten years or so of doing somewhat serious computing and zero cybersecurity whatsoever, here is what I've concluded; feel free to disagree.
Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.
This includes everything from: decompression algorithms; font outline readers; image, video, and audio parsers; video game data parsers; XML and HTML parsers; the various certificate/signature/key parsers in OpenSSL (and derivatives); and now, this CrowdStrike content parser in its EDR program.
That wager stands, by the way, and I'm happy to up the ante by £50 to account for my second theory.
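To make that theory concrete, here's a rough C sketch of the failure mode I'm betting on. The header layout, magic value, and function name are all invented for illustration - whether the real channel-file format looks anything like this is pure speculation - but the missing checks are the point:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical on-disk layout: a count followed by offsets into the file. */
    struct channel_header {
        uint32_t magic;
        uint32_t count;
        uint32_t offsets[];   /* 'count' offsets to records elsewhere in the file */
    };

    /* Defensive version: validate everything before dereferencing anything. */
    static int parse_channel(const uint8_t *buf, size_t len)
    {
        const struct channel_header *hdr = (const struct channel_header *)buf;

        if (len < sizeof(*hdr) || hdr->magic != 0xC0FFEE01u)
            return -1;                              /* truncated, or not our file at all */
        if (hdr->count > (len - sizeof(*hdr)) / sizeof(uint32_t))
            return -1;                              /* offset table runs past the buffer */

        for (uint32_t i = 0; i < hdr->count; i++) {
            uint32_t off = hdr->offsets[i];
            if (off == 0 || off >= len)             /* null or out-of-bounds "pointer" */
                return -1;                          /* reject the whole file; don't crash */
            /* ... only now is buf + off safe to read from ... */
        }
        return 0;
    }

The buggy version is the same function with the three `if`s deleted - it works perfectly right up until someone ships a file full of zeroes.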
There are at least five different things that went wrong simultaneously.
1. Poorly written code in the kernel module crashed the whole OS and kept trying to parse the corrupted files on every boot, causing a boot loop, instead of handling the error gracefully and deleting the files or marking them as corrupt.
2. Either the corrupted files slipped through internal testing, or there is no internal testing.
3. Individual settings for when to apply such updates were apparently ignored. It's unclear whether this was a glitch or standard practice. Either way I consider it a bug (it's just a matter of whether it's a software bug or a bug in their procedures).
4. This was pushed out everywhere simultaneously instead of staggered to limit any potential damage.
5. Whatever caused the corruption in the first place, which is anyone's guess.
Number 4 continues to be the most surprising bit to me. I could not fathom having a process that involves deploying to 8.5 million remote machines simultaneously.
Bugs in code I can almost always understand and forgive, even the ones that seem like they’d be obvious with hindsight. But this is just an egregious lack of the most basic rollout standards.
For me, number 1 is the worst of the bunch. You should always expect that there will be bugs in processes, input files, etc… the fact that their code wasn’t robust enough to recognize a corrupted file and not crash is inexcusable. Especially in kernel code that is so widely deployed.
If any one of the five points above hadn’t happened, this event would have been avoided. However, if number 1 had been addressed - any of the others could have happened (or all at the same time) and it would have been fine.
I understand that we should assume that bugs will be present anywhere, which is why staggered deployments are also important. If there had been staggered deployments, the damage would still have happened, but it would have been localized. I think security people would argue against a staged deployment though, as if it were discovered what the new definitions protected against, an exploit could be developed quickly to put those servers that aren't in the "canary" group at risk. (At least in theory; I can't see how staggering deployment over a 6-12 hour window would have been that risky.)
They're all terrible, but I agree #1 is particularly egregious for a company ostensibly dedicated to security. A simple fuzz tester would have caught this type of bug, so they clearly don't perform even a minimal amount of testing on their code.
Totally agree. Not only would a coverage-guided fuzzer catch this, they should also be adding every single file they send out to the corpus of that automated fuzz testing, so they get somewhat increased coverage on their parser.
There may not be out-of-the-box fuzzers that test device drivers, so you hoist all the parser code, build it into a stand-alone application, and fuzz that.
Likely this is a form of technical debt, since I can understand not doing all of this on day #1 when you have 5 customers, but at some point, as you scale up, you need to change the way you look at risk.
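To sketch the "hoist the parser into a stand-alone target" idea in code: if the parsing logic can be compiled outside the driver, a clang libFuzzer entry point is only a few lines. (parse_channel here is a stand-in for whatever the real routine is called - purely illustrative, not CrowdStrike's actual code.)

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical: the same parsing routine the driver uses, built for user space. */
    int parse_channel(const uint8_t *buf, size_t len);

    /* libFuzzer calls this with generated inputs; any crash or sanitizer report is a finding. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        parse_channel(data, size);   /* must never crash, whatever the bytes are */
        return 0;
    }

Build with something like `clang -g -fsanitize=fuzzer,address harness.c parser.c`, seed the corpus with every channel file ever shipped, and leave it running in CI.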
I disagree. It has to be 4: something will always go wrong, so you have to deliver in cohorts.
That goes equally whether it was a Windows Update rolled out in one motion that broke the Falcon agent/driver, or whether it was CrowdStrike's own update. There is almost no excuse for a global rollout without telemetry checks, whether it's security agent updates or OS patches.
It might be the worst mistake, but number 1 was always going to happen sometime.
And even testing can't be trusted 100%, because writing code that does the right thing and code that tests things correctly are about equally hard, they just aren't always hard simultaneously.
You admit that bugs are inevitable and then claim a bug free parser as the most important bullet. That seems flawed to me. It would certainly be nice, but is that achievable?
Policy changes seem more reliable and would catch other, as of yet unknown classes of bugs.
This shouldn't be an either-or situation; you do all of the above. A simple validating parser in the client would be easy to write and would have easily caught a null payload.
What looks especially bad for Crowdstrike is how many things (relatively simple things) had to fail in order for this to slip through. It's like walking into Fort Knox, grabbing a gold bar, and walking out unimpeded. A complete systemic failure.
Surely, CrowdStrike's safety posture for update rollouts is in serious need of improvement. No argument there.
But is there any responsibility for the clients consuming the data to have verified these updates prior to taking them in production? I haven't worn the sysadmin hat in a while now, but back when I was responsible for the upkeep of many thousands of machines, we'd never have blindly consumed updates without at least a basic smoke test in a production-adjacent UAT type environment. Core OS updates, firmware updates, third party software, whatever -- all of it would get at least some cursory smoke testing before allowing it to hit production.
On the other hand, given EDR's real-world purpose and the speed at which novel attacks propagate, there's probably a compelling argument for always taking the latest definition/signature updates as soon as they're available, even in your production environments.
I'm certainly not saying that CrowdStrike did nothing wrong here, that's clearly not the case. But if conventional wisdom says that you should kick the tires on the latest batch of OS updates from Microsoft in a test environment, maybe that same rationale should apply to EDR agents?
> But is there any responsibility for the clients consuming the data to have verified these updates prior to taking them in production
In the boolean sense, yes. United Airlines (for example) is ultimately responsible for their own production uptime, so any change they apply without validation is a risk vector.
In pragmatic terms, it's a bit fuzzier. Does CrowdStrike provide any practical way for customers to validate, canary-deploy, etc. changes before applying them to production? And not just changes with type=important, but all changes? From what I understand, the answer to that question is no, at least for the type=channel-update change that triggered this outage. In which case I think the blame ultimately falls almost entirely on CrowdStrike.
"In which case I think the blame ultimately falls almost entirely on CrowdStrike"
I would say on the client for buying into CrowdStrike.
And also the client for having no contingencies and just accepting a vendor pinky-swear as meaningful.
CrowdStrike failed at their responsibilities too, I just mean that so did everyone else.
When you cede your own responsibilities to someone else and don't have that backed up with contractually enforced liability to make you whole when they fuck up, and also don't provide your own contingency so it doesn't really matter what some vendor does, that's on you. That's 100% entirely on you and it doesn't matter if a million other people also did the same utterly thoughtless and lazy thing.
> I would say on the client for buying into CrowdStrike.
I understand this perspective but I think it misses the forest for the trees. You have to evaluate this kind of stuff in context. Purity tests play well on tech message boards, where nobody has any accountability to any kind of business requirements, but basically no real-world organization operates that way, so it's all a bit irrelevant.
> When you cede your own responsibilities to someone else ...
This framing is a bit naive, I think. It isn't a boolean. Everything is about risk management, cost/benefit analysis.
> From what I understand, the answer to that question is no, at least for the type=channel-update change that triggered this outage. In which case I think the blame ultimately falls almost entirely on CrowdStrike.
Honestly, it hadn't even occurred to me that software like this, marketed at enterprise customers, wouldn't have this kind of control already available. It seems like such an obvious thing for any big organization to insist on that I just took it for granted that it existed.
It seems nuts to me too - MS Defender has this out of the box. From looking at sysadmins on reddit, it seems that CS has a tiered update mechanism, but didn’t use it for this change.
>Arguably United airlines shouldn't have chosen a product they can't test updates of, though maybe there are no good options.
I used to work with regional parks and recreation departments, and they would not approve any updates that did not go through the UAT environments that we had set up. All updates had to be deployed to their UAT environment and thoroughly tested before going to their production environment.
I get that this is slightly different, but I'd imagine airlines, banks, and hospitals would have far stricter UAT policies to keep a single vendor from kneecapping operations.
Checks out - my company had lots of issues on Friday afternoon, and when it first happened I wondered who on Earth decided to roll out updates to prod systems on Friday afternoon.
Yeah, one of the major problems seems to be CrowdStrike's assumption that channel files are benign, which isn't true if there's a bug in your code that only gets triggered by the right virus definition.
I don't know how you could assert that this is impossible, hence channel files should be treated as code.
I think point 3 of the grand parent indicates admins were not given an opportunity to test this.
My company had a lot of Azure vms impacted by this and I'm not sure who the admin was who should have tested it. Microsoft? I don't think we have anything to do with crowdstrike software on our vms. ( I think - I'm sure I'll find out this week.)
Edit: I just learned the Azure central region failure wasn't related to the larger event - and we weren't impacted by the CrowdStrike issue - I didn't know they were two different things. So the second part of my comment is irrelevant.
Oh, I'd missed point #3 somehow. If individual consumers weren't even given the opportunity to test this, whether by policy or by bug, then ... yeesh. Even worse than I'd thought.
Exactly which team owns the testing is probably left up to each individual company to determine. But ultimately, if you have a team of admins supporting the production deployment of the machines that enable your business, then someone's responsible for ensuring the availability of those machines. Given how impactful this CrowdStrike incident was, maybe these kinds of third-party auto-update postures need to be reviewed and potentially brought back into the fold of admin-reviewed updates.
It's not an option. While the admins at the customer have the ability to control when/how revisions of the client software go out (and thus can, and generally do, do their own testing, can decide to stay one rev back by default, etc.), there is no control over updates to the kind of update/definition files that were the primary cause here.
Which is also why you see every single customer affected - what you are suggesting is simply not an available option for them at present.
At least for now - I imagine that some kind of staggered/slowed/ringed option will have to be implemented in the future if they want to retain customers.
They probably don't get to claim agile story points until the ticket is in a finished state. And they probably have a culture where vanity metrics like "velocity" are prioritized.
My understanding is that the culture (as reported by some customers) is quite aggressive and pushy. They are quite vocal when customers don't turn on automatic updates.
It makes sense in a way - given their fast-growth strategy (from nowhere to top 3) and their desire to "do things differently" as the iconoclast upstarts that redefine the industry.
This is one BSOD on Windows 10. I saw another kernel panic on a specific Linux distro.
What else?
One thing that is funny is that quite a few of their competitors are taking this opportunity to shit on them via Twitter and by marketing themselves as better than CrowdStrike.
Twitter, with all its issues, apparently has a feature to counter fake news by showing crowd-sourced sentiment to debunk it; in this case, Twitter users showed how many times CrowdStrike's competitors have themselves BSOD'd Windows.
Bottom line is this: there is absolutely no good reason for not doing rolling updates. Do a few and make sure they are OK. Keep rolling out in groups. This approach alone would've meant that this event was of marginal impact to most of the public, as sysadmins would've had the opportunity to halt further updates and work on remediating their first group (typically non-critical servers). Rolling out to everything all at once is just bad practice, period.
Customers don’t always have a choice here. They could be restricted by compliance programs (PCI, et al) and be required under those terms to have auto updates on.
Compliance also has to share some of the blame here, if best practices (local testing) aren’t allowed to be followed in the name of “security”.
This needs to keep being repeated anytime someone wants to blame the company.
Many don’t have a choice; a lot of compliance is doing X to satisfy a checkbox, and you don’t have a lot of flexibility in that, or you may not be able to do things like process credit cards, which is kinda unacceptable depending on your company. (Note: I didn’t say all.)
CrowdStrike automatic update happens to satisfy some of those checkboxes.
Oh, the games I have to play with story points that have personal performance metrics attached to them. Splitting tickets to span sprints so there aren’t holes in some dude’s “effort” because they didn’t complete some task they committed to.
I never thought such stories were real until I encountered them…
I worked at one of the big ones and we always shipped live to all consumer devices at the same time. But this was for a popular suite of products that generate a lot of consumer demand, so we had a rigorous QA process to make sure this wouldn't be a problem. As I was typing this, it occurred to me that zero people would have cared if this update had been staggered, making it pretty silly not to.
Malware signature updates are supposed to be deployed ASAP, because every minute may count when a new attack is spreading. The mistake may have been to apply that policy indiscriminately.
A lot of snarky replies to this comment, but the reality is that if you were selling an anti-virus, identified a malicious virus, and then chose not to update millions of your machines with that virus’s signature, you’d also be in the wrong.
I’m not saying don’t update? I’m talking about rolling the update over the course of a short amount of time, like under an hour. With the ability to stop the rollout.
On the other hand, the diseases vaccines prevent don't have almost-instantaneous propagation; that's why they are effective at containing propagation.
As an example, reaction time is paramount to counter many kinds of attacks - that's why blocklists are so popular, and AS blackholing is a viable option.
> But this is just an egregious lack of the most basic rollout standards.
Agreed. It's crazy that the top tech companies enforce this in a biblical fashion, despite all sorts of pressure to ship and all that. Crowdstrike went YOLO at a global scale.
And here I thought shipping a new version on the app store was scary.
Is there anything we can take from other professions/tradecraft/unions/legislation to ensure shops can’t skip the basic best practices we are aware of in the industry like staged rollouts? How do we set incentives to prevent this? Seriously the App Store was raking in $$ from us for years with no support for staged rollouts and no other options.
I wonder if there's a concern that staggering the malware signatures would open them up to lawsuits if somebody was hacked in between other customers getting the data and them getting the data.
> I wonder if there's a concern that staggering the malware signatures would open them up to lawsuits if somebody was hacked in between other customers getting the data and them getting the data.
I'd assume that sort of thing would be covered in the EULA and contract -- but even if it weren't, it seems like allowing customers to define their own definition update strategy would give them a pretty compelling avenue to claim non-liability. If CrowdStrike can credibly claim "hey, we made the definitions available, you chose to wait for 2 weeks to apply them, that's on you", then it becomes much less of a concern.
Zero effort to fuzz test the parser too. I mean, we know how to harden parsers against bugs and attacks, and any semi-competent fuzzer would have caught such a trivial bug.
You are seriously overestimating the engineering practises at these companies. I have worked in "enterprise security" previously, though not at this scale. In a previous life I worked with one of the engineering leaders currently at CrowdStrike.
I'll bet you this company has some arbitrary unit-test coverage requirements for PRs, which developers game by mocking the heck out of dependencies. I am sure they have some vanity SonarQube integration to ensure great "code quality". This likely also went through manual QA.
However I am sure the topic of fuzz testing would not have come up once. These companies sell checkbox compliance, and they themselves develop their software the same way. Checking all the "quality engineering" boxes with very little regards for long term engineering initiatives that would provide real value.
And I am not trying to kick Crowdstrike when they are down. It's the state of any software company run by suits with myopic vision. Their engineering blogs and their codebases are poles apart.
AV software needs kernel privileges to have access to everything it needs to inspect, but the actual inspection of that data should be done with no privileges.
I think most AV companies now have a helper process to do that.
If you successfully exploit the helper process, the worst damage you ought to be able to do is falsely find files to be clean.
> ...the worst damage you ought to be able to do is...
Ought. But it depends on the way the communication with the main process is done. I wouldn't be surprised if the main process trusts the output from the parser just a tiny bit too much.
Competent fuzzers don't just use random bytes, they systematically explore the state-space of the target program. If there's a crash state to be found by feeding in a file full of null bytes, it's probably going to be found quickly.
A fun example is that if you point AFL at a JPEG parser, it will eventually "learn" to produce valid JPEG files as test cases, without ever having been told what a JPEG file is supposed to look like. https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...
AFL is really "magical". It finds bugs very quickly and with little effort on our part except to leave it running and look at the results occasionally. We use it to fuzz test a variety of file formats and network interfaces, including QEMU image parsing, nbdkit, libnbd, hivex. We also use clang's libfuzzer with QEMU which is another good fuzzing solution. There's really no excuse for CrowdStrike not to have been using fuzzing.
Instrumented fuzzing (like AFL and friends) tweaks the input to traverse unseen code paths in the target, so they're super quick to find stuff like "heyyyyy, nobody is actually checking if this offset is in bounds before loading from that address".
I wonder if it was pushed anywhere that didn't crash, as an extension of "It works on my machine. Ship it!"
I've built a couple of kernel drivers over the years and what I know is that ".sys" files are to the kernel as ".dll" files are to user-space programs in that the ones with code in them run only after they are loaded and a desired function is run (assuming boilerplate initialization code is good).
I've never made a data-only .sys file, but I don't see why someone couldn't. In that case, I'd guess that no one ever checked it was correct, and the service/program that loads it didn't do any verification either -- why would it, the developers of said service/program would tend to trust their own data .sys file would be valid, never thinking they'd release a broken file or consider that files sometimes get corrupted -- another failure mode waiting to happen on some unfortunate soul's computer.
The file extension is `sys` by convention; there's nothing magical about it, and it's not handled in any special way by the OS. In the case of CrowdStrike, there seems to be some confusion as to why they use this file extension, since it's only supposed to be a config/data file used by the real kernel driver.
Thanks. I understand that '.sys' is a naming convention. I'd guess that they used it because those config/data files are used by their kernel driver, and so makes kernel vs user-space files easier to distinguish.
Number 4 is what everyone will fixate on, but I have the biggest problem with number 1. Anything like this sort of file should have (1) validation on all its pointers and (2) probably >2 layers of checksumming/signing. They should generally expect these files to get corrupted in transit once in a while, but they didn't seem to plan for anything other than exactly perfect communication between their intent and their kernel driver.
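Even something as blunt as a length-plus-CRC check in front of the parser turns "boot loop" into "update rejected, error logged". A rough sketch using zlib's crc32 (the trailer layout is invented, and in reality you'd want a real signature on top of this):

    #include <stdint.h>
    #include <string.h>
    #include <zlib.h>

    /* Hypothetical convention: the file ends with a 4-byte CRC32 of everything before it. */
    static int channel_file_looks_sane(const uint8_t *buf, size_t len)
    {
        if (len < 4)
            return 0;                               /* too short to even hold the trailer */

        uint32_t stored;
        memcpy(&stored, buf + len - 4, sizeof(stored));   /* host byte order, for brevity */

        uLong computed = crc32(0L, Z_NULL, 0);
        computed = crc32(computed, buf, (uInt)(len - 4));

        return (uint32_t)computed == stored;        /* mismatch => corrupted, refuse to parse */
    }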
I think you mean: In theory, files don't get corrupted in transit with TCP. In theory, they also don't get corrupted when stored in memory or on disks either.
The only reason any of these things don't cause issues in practice is checksums and error correcting codes.
No, I mean that TCP checks for corruption in transit and has the packets resent in that case. I guess you could be running a buggy implementation, but that seems unlikely with how ubiquitous TCP is.
Errors can show up any time, and usually show up between the parts that checksums correct. On the wire, TCP protects you with a (weak) checksum. Off the wire, your computer and filesystem can still screw things up. Even CPU bugs can do this.
There is a story out that the problem was introduced in a post processing step after testing. That makes more sense than that there was no testing. If true it means they thought they’d tested the update, but actually hadn’t.
Of all of these, I think #3 leaves CrowdStrike the most exposed, legally. Companies with robust update and config management protocols got burned by this as well, including places like hospitals and others with mission-critical systems where config management is more strictly enforced.
If the crowdstrike selloff continues, I'm betting this will be why.
(There's a chance I'll make trading decisions based on this rationale in the next 72 hours, though I'm not certain yet)
Thing is, as far as I can see, deploying this database update to a Windows machine will result promptly and unconditionally in a BSOD. That implies that this update was tried on exactly zero machines before it was shipped.
The bug can't have "slipped through internal testing"; it would have failed immediately on any machine it was loaded on.
I’d also maybe add another one on the Windows end:
6) some form of sandboxing/error handling/api changes to make it possible to write safer kernel modules (not sure if it already exists and was just not used). It seems like the design could be better if a bad kernel module can cause a boot loop in the OS…
It’s a tough problem, because you also don’t want the system to start without the CrowdStrike protection. Or more generally, a kernel driver is supposedly installed for a reason, and presumably you don’t want to keep the system running if it doesn’t work. So the alternative would be to shut down the system upon detection of the faulty driver without rebooting, which wouldn’t be much of an improvement in the present case.
I can imagine better defaults. Assuming the threat vector is malicious programs running in userspace (probably malicious programs in kernel space is game over anyway right?), then you could simply boot into safe mode or something instead of crashlooping.
One of the problems with this outage was that you couldn’t even boot into safe mode without having the bit locker recovery key.
You don’t want to boot into safe mode with networking enabled if the software that is supposed to detect attacks from the network isn’t running. Safe mode doesn’t protect you from malicious code in userspace, it only “protects” you from faulty drivers. Safe mode is for troubleshooting system components, not for increasing security.
I don’t know the exact reasoning why safe mode requires the BitLocker recovery key, but presumably not doing so would open up an attack vector defeating the BitLocker protection.
No. Not in production yet. But that should solve this problem once it's available for any company that uses it (and I believe CrowdStrike is heavily involved with it).
Since the issue manifested at 04:09 UTC, which is 11pm where CrowdStrike's HQ is, I would guess someone was working late at night and skipped the proper process so they could get the update done and go to bed.
They probably considered it low risk, had done similar things hundreds of times before, etc.
Wild that anyone would consider anything in the “critical path” low risk.
I would bet that they just don’t do rolling releases normally since it never caused issues before.
Their sales pitch is being the first to apply patches for any virus. I think it makes sense to try to push as quickly as possible when speed of updates is core to your sales pitch.
> 2. Either the corrupted files slipped through internal testing, or there is no internal testing.
This is the most interesting question to me because it doesn't seem like there is an obviously guessable answer. It seem very unlikely to me that a company like CrowdStrike pushes out updates of any kind without doing some sort of testing, but the widespread nature of the outage would also seem to suggest any sort of testing setup should have caught the issue. Unless it's somehow possible for CrowdStrike to test an update that was different than what was deployed, it's not obvious what went wrong here.
Because nowadays nobody knows that you're supposed to test the actual bits you're going to ship, not whatever random crap comes out of someone's build script, run in a different place and at a different time, that's supposed to be the same as what you'll ship.
6. Companies using CS have no testing to verify that new updates won't break anything.
At any SWE job I've worked over my entire career, nothing was deployed with new versions of dependencies without testing them against a staging environment first.
> Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures.
For the record, the top 25 common weaknesses for 2023 are listed at:
Deserialization of Untrusted Data (CWE-502) was number fifteen. Number one was Out-of-bounds Write (CWE-787), and Use After Free (CWE-416) was number four.
CWEs that have been in every list since they started doing this (2019):
Yup. Almost all of them are various flavors of fucking up a parser or misusing it (in particular, all the injection cases are typically caused by writing stupid code that glues strings together instead of doing proper parsing).
That's not parsing, that's the inverse of parsing. It's taking untrusted data and injecting it into a string that will later be parsed into code without treating the data as untrusted and adapting accordingly. It's compiling, of a sort.
Parsing is the reverse—taking an untrusted string (or binary string) that is meant to be code and converting it into a data structure.
Both are the result of taking untrusted data and assuming it'll look like what you expect, but both are not parsing issues.
> It's taking untrusted data and injecting it into a string that will later be parsed into code without treating the data as untrusted and adapting accordingly.
Which is precisely why parsing should've been used here instead. The correct way to do this is to work at the level after parsing, not before it. "SELECT * FROM foo WHERE bar LIKE ${untrusted input}" is dumb. Parsing the query with a placeholder in it, replacing it as an abstract node in the parsed form with data, and then serializing to string if needed to be sent elsewhere, is the correct way to do it, and is immune to injection attacks.
For SQL we tend to use prepared statements as the answer, which probably do some parsing under the hood but that's not visible to the programmer. I'd raise a lot of questions if I saw someone breaking out a parser to handle a SQL injection risk.
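Concretely, with SQLite's C API (table and column names lifted from the example above, otherwise invented), the untrusted value never touches the SQL text at all:

    #include <sqlite3.h>

    /* The '?' placeholder is part of the parsed statement; the value is bound separately
       and can never be reinterpreted as SQL, no matter what it contains. */
    static int lookup(sqlite3 *db, const char *untrusted_input)
    {
        sqlite3_stmt *stmt = NULL;
        int rc = sqlite3_prepare_v2(db, "SELECT * FROM foo WHERE bar LIKE ?", -1, &stmt, NULL);
        if (rc != SQLITE_OK)
            return rc;

        sqlite3_bind_text(stmt, 1, untrusted_input, -1, SQLITE_TRANSIENT);

        while (sqlite3_step(stmt) == SQLITE_ROW) {
            /* ... read columns with sqlite3_column_*() ... */
        }
        return sqlite3_finalize(stmt);
    }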
That's because prepared statements were developed before the understanding of langsec was mature enough. They provide a very simple API, but it's at (or above) the right level - you just get to use special symbols to mark "this node will be provided separately", and provide it separately, while the API makes sure it's correctly integrated into the whole according to the rules of the language.
(Probably one other factor is that SQL was designed in a peculiar way, for "readability to non-programmers", which tends to result in languages that don't map well to simple data structures. Still, there are tools that let you construct a tree and will generate valid SQL from that.)
HTML is a better example, because it's inherently tree-structured, and trees tend to be convenient to work with in code. There it's more obvious when you're crossing from dumb string to parsed representation, and then back.
The same thing applies to HTML, though: I would shudder if I saw a parser implemented for most HTML injection prevention. The correct answer in almost all cases is to escape the HTML using the language's standard library or the web framework's tooling.
The only situation where a parser makes sense over simple escaping routines is if you actually intended to accept a subset of the language that you're injecting into rather than plain text, in which case you'll need more than just a parser to ensure you don't have anything dangerous—you'd need to do a lot of error-prone analysis of the AST afterward as well.
Or, you just use the DOM API to manipulate the structure. You don't implement a parser because one is already provided by the tooling - you use it to go from known-valid text to a data structure (here, DOM), and do your operations there.
You shouldn't do "escaping" and string concatenation. That's just parsing and unparsing while cutting corners, which is how you get injection bugs.
> The only situation where a parser makes sense over simple escaping routines is if you actually intended to accept a subset of the language that you're injecting into rather than plain text
And that's exactly what you're doing. With escaping, you're taking a serialized form of some data, and splice into it some other data, massaged in a way you hope will make it always parse to string when something parses this later. It's going to eventually bite you; not necessarily with XSS - web template breakage is another common occurrence.
Working in string space is tricky, dangerous, and dumb - parsing, working on the parsed representation, and unparsing at the end, is how you do it correctly and safely.
(Another way to put it: plaintext is a wire format; you don't work in it if the data is structured.)
Note that the API may look like you're doing text - see JSX - but it internally goes through a parsing stage, and makes it impossible for you to do stupid things that break or transform the program, like working in string space lets you.
If you don't want your users to produce HTML, then why would you use the DOM API to parse their text into an HTML data structure? Then you'd have code that's capable of producing <script> tags or who knows what else from untrusted user input and you now have to explicitly filter out tag types. Alternatively you can implement the middle bit of a compiler and map nodes to a new, safe data structure that you spit out at the end, but in the scenario we're discussing the user input was supposed to be unstructured text. HTML content is in most cases a malicious edge case, not expected data.
If you instead escape the user-provided unstructured text by replacing the very well-known set of special characters that could create tags, you know your users cannot produce active code, only text nodes.
It's the principle of least power: if you don't need users to access anything other than unstructured text then why feed their input into a parser that produces a data structure that represents code? Make illegal states unrepresentable by just escaping the text nodes as they're saved!
The problem isn't with what the user can do, but with what your code can. If you bork your escaping, which is context-dependent, then user data can turn into arbitrary HTML, complete with script tags. If you keep an abstract tree representation, and add the user-provided data by passing it verbatim to a "set text content" method on a node, then there's no possible way the user input can break it. That is exactly what it means to make illegal states unrepresentable!
Working on the data structures after parsing makes it impossible to accidentally break the structure itself. Like, maybe your string escaping is perfect, but if you do:
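(hypothetical snippet - sanitizedString is correctly escaped, container and stealCookies are placeholders, and the bug is an unquoted attribute in the template:)

    const html = `<img class=avatar src=${sanitizedString}>`;
    container.innerHTML = html;
    // sanitizedString = "x.png onerror=stealCookies()" injects an attribute
    // without ever needing a character that escaping would touch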
Then you're still vulnerable to trivial errors in your template breaking the structure and creating an exploitable vulnerability, despite the $sanitizedString being correct. If you instead work at parsed level and do:
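(again hypothetical, same placeholder names - the user data goes in via textContent, so it can only ever become a text node:)

    const node = document.createElement('span');
    node.textContent = untrustedUserInput;   // verbatim; the DOM cannot reinterpret this as markup
    container.appendChild(node);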
Then there's just no way this can break (except bugs in the HTML parser and DOM API in general, which are much less likely to exist, and much easier to find and fix).
This is not parsing the user input, this is letting the native API escape the input for you, which is exactly what I'm advocating for. See my note above:
> escape the HTML using the language's standard library or the web framework's tooling.
This is what parsing the user input would look like with the DOM API:
    const newDiv = document.createElement('div');
    newDiv.innerHTML = untrustedUserInput;
    // Do some work to attempt to sanitize the new HTML elements
    document.body.appendChild(newDiv);
To me this is definitively a Bad Idea™, and I thought this was what you were advocating for.
What you actually proposed is just escaping the HTML, not parsing user input, with the only twist being that you prefer to inject user input into your templating system imperatively with something resembling the DOM API instead of declaratively with something resembling JSX. That's fine, but not relevant to the question of what method we use to sanitize the untrusted input that we're injecting. On that front it sounds like we're in agreement that parsing user input is a terrible idea.
Surely many of these originate from deserialization of untrusted data (e.g., trusting a supplied length). It’s probably documented but I’m passively curious how they disambiguate these cases.
That’s entirely my point. If a vulnerability happens due to writing out of bounds during untrusted deserialization, which category would you file it under?
“Deserialization of untrusted data” isn’t even a security bug like an out of bounds write is. Every meaningful program deserializes external input. It’s a common area where bugs occur, but it’s not a type of bug in and of itself. Every bug in that category “belongs” in a more proximate category.
> Approximately 100% of CVEs, crashes, bugs, [...], deserialising binary data
I'd make that 98%. Outside of rounding errors in the margins, the remaining two percent is made up of logic bugs, configuration errors, bad defaults, and outright insecure design choices.
> Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.
I wouldn't blame imperative programming.
Eg Rust is imperative, and pretty good at telling you off when you forgot a case in your switch.
By contrast, the variant of Scheme I used twenty years ago was functional, but didn't have checks for covering all cases. (And Haskell's GHC didn't have that check turned on by default a few years ago. Not sure if they've changed that.)
I can't decide what's more damning. The fact that there was effectively no error/failure handling or this:
> Note "channel updates ...bypassed client's staging controls and was rolled out to everyone regardless"
> A few IT folks who had set the CS policy to ignore latest version confirmed this was, ya, bypassed, as this was "content" update (vs. a version update)
If your content updates can break clients, they should not be able to bypass staging controls or policies.
This is going to be what most customers did not realize. I'm sure Crowdstrike assured them that content updates were completely safe "it's not a change to the software" etc.
The way I understand it, the policies the users can configure are about "agent versions". I don't think there's a setting for "content versions" that you can toggle.
Maybe there isn't a switch that says "content version", but from the end user's perspective it is a new version. Whether it was a content change or just a fix for a typo in documentation (say), the change being pushed is different from what currently exists. And for the end user, the configuration implies that they get a chance to decide whether or not to accept any new change being pushed.
Looking at how this whole thing is pasted together, there's probably a regex engine in one of those sys files somewhere that was doing the "parsing"...
> reach for a parser generator framework and fuzz your program
I agree to the second but disagree on the first. Parser generator frameworks produce a lot of code that is hard to read and understand and they don't necessarily do a better job of error handling than you would. A hand-written recursive descent parser will usually be more legible, will clearly line up with the grammar that you're supposed to be parsing, and will be easier to add better error handling to.
Once you're aware of the risks of a bad parser you're halfway there. Write a parser with proper parsing theory in mind and in a language that forces you to handle all cases. Then fuzz the program, turn bad inputs that turn up into permanent regression tests, and write your own tests with your knowledge of the inner workings of your parser in mind.
This isn't like rolling your own crypto because the alternative isn't a battle-tested open source library, it's a framework that generates a brand new library that only you will use and maintain. If you're going to end up with a bespoke library anyway, you ought to understand it well.
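For what it's worth, "legible recursive descent with explicit error handling" can be tiny. A sketch in C for a toy grammar of digits and '+' (obviously not a real file format, and overflow is ignored for brevity):

    #include <ctype.h>

    /* Grammar: expr := number ('+' number)*  -- every rule reports failure instead of guessing. */
    struct cursor { const char *p; const char *end; };

    static int parse_number(struct cursor *c, long *out)
    {
        if (c->p == c->end || !isdigit((unsigned char)*c->p))
            return 0;                               /* expected a digit, got something else */
        long v = 0;
        while (c->p != c->end && isdigit((unsigned char)*c->p))
            v = v * 10 + (*c->p++ - '0');
        *out = v;
        return 1;
    }

    static int parse_expr(struct cursor *c, long *out)
    {
        long acc, rhs;
        if (!parse_number(c, &acc))
            return 0;
        while (c->p != c->end && *c->p == '+') {
            c->p++;
            if (!parse_number(c, &rhs))
                return 0;                           /* dangling '+': an error, not a shrug */
            acc += rhs;
        }
        *out = acc;
        return 1;
    }

Each function maps onto one rule of the grammar, and the failure paths are right there in front of you - which is exactly what you want when you go to fuzz it.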
This problem has a promising solution, WUFFS, "a memory-safe programming language (and a standard library written in that language) for Wrangling Untrusted File Formats Safely."
No bet. There are two failures here. (1) Failing to check the data for validity, and (2) Failing to handle an error gracefully.
Both of these are undergraduate-level techniques. Heck, they are covered in most first-semester programming courses. Either of these failures is inexcusable in a professional product, much less one that is running with kernel-level privileges.
Bet: CrowdStrike has outsourced much of its development work.
> Either of these failures is inexcusable in a professional product
Don’t we have those kinds of failures in almost every professional product? I’ve been working in the industry for over a decade, and at every single company we had those bugs. The only difference was that none of those companies were developing kernel modules or whatever. Simple SaaS. And no, none of the bugs were outsourced (the companies I worked for hired only locals and people within ±2 hour time zones).
He probably means work was sent offshore to offices with cheaper labor that's less skilled or less invested in delivering quality work. Though there's no proof of that yet; people just like to throw the blame on offshoring whenever $BIG_CORP fucks up, as if all programmers in the US are John Carmack and can never cause catastrophic fuckups with their code or processes.
Not everyone in the US might be Carmack, but it's ridiculously nearsighted to assert that cultural differences don't play into people's desire and ability to Do It Right.
It's not cultural differences that make the difference in output quality, it's pay and the quality standards set by the team/management, which are also mostly a function of pay, since underpaid and unhappy developers tend not to care at all beyond doing the bare minimum to not get fired (#notmyjob, the lying-flat movement, etc.).
You think everyone writing code in the US would give two shits about the quality of their output if they see the CEO pocketing another private jet while they can barely make big-city rent?
Hell, even well paid devs at top companies in the US can be careless and lazy if their company doesn't care about quality. Have you seen some of the vulnerabilities and bugs that make it into the Android source code and on Pixel devices? And guess what, that code was written by well paid developers in the US, hired at Google leetcode standards, yet would give far-east sweatshops a run for their money in terms of carelessness. It's what you get when you have a high barrier of entry but a low barrier of output quality where devs just care about "rest and vest".
I was talking about outsourcing (and not necessarily offshoring). Too many companies like CrowdStrike are run by managers who think that management, sales, and marketing are the important activities. Software development is just an unpleasant expense that needs to be minimized. Hence: outsourcing.
That said, I have had some experience with classic offshoring. Cultural differences make a huge difference!
My experience with "typical" programmers from India, China, et al is that they do exactly what they are told. Their boss makes the design decisions down to the last detail, and the "programmers" are little more than typists. I specifically remember one sweatshop where the boss looped continually among the desks, giving each person very specific instructions of what they were to do next. The individual programmers implemented his instructions literally, with zero thought and zero knowledge of the big picture.
Even if the boss was good enough to actually keep the big picture of a dozen simultaneous activities in his head, his non-thinking minions certainly made mistakes. I have no idea how this all got integrated and tested, and I probably don't want to know.
>That said, I have had some experience with classic offshoring. Cultural differences make a huge difference!
Sure but there's no proof yet that was the case here. That's just masive speculations based on anecdotes on your side. There's plenty of offshore devs that can run rings around western devs.
Staff trained at outsourcers have a different type of focus. My experience is more operational, and usually the training for those guys is about restoration to hit SLA, period. Makes root cause harder to ID sometimes.
It doesn’t mean ‘Murica better, just that the origin story of staff matter, especially if you don’t have good processes around things like rca.
Western slacker movements never came close to deadma or the dedicated indifference in the face of samsara. You seem to have a lot of experience with the former and little of the latter two, but what do I know.
Offshoring and outsourcing are very different things. It would also be very hard to talk about offshoring at a company claiming to provide services in 170 countries.
It's probably just the common US-centric bias that external development teams, particularly those overseas, may deliver subpar software quality. This notion is often veiled under seemingly intellectual critiques to avoid overt xenophobic rhetoric like "They're taking our jobs!".
Alternatively, there might be a general assumption that lower development costs equate to inferior quality, which is a flawed yet prevalent human bias.
>Approximately 100% of CVEs, crashes, bugs, slowdowns, and pain points of computing have to do with various forms of deserialising binary data back into machine-readable data structures. All because a) human programmers forget to account for edge cases, and b) imperative programming languages allow us to do so.
People are target fixating too much. Sure, this parser crashed and caused the system to go down. But in an alternative universe they push a definition file that rejects every openat() or connect() syscall. Your system is now equally as dead, except it probably won't even have the grace to restart.
The whole concept of "we fuck with the system in the kernel based on data downloaded from the internet" is just not very sound or safe.
So, I also have near-zero cybersecurity expertise (I took an online intro course on cryptography out of curiosity) and no expertise in writing kernel modules actually, but why, if ever, would you parse an array of pointers...in a file...instead of any other way of serializing data that doesn't include hardcoded array offsets in an on-disk file...
Even ignoring this particular failure, catastrophic as it was, this was a bad design asking to be exploited by criminals.
Performance, I assume. Right now it may look like the wrong tradeoff, but every day in between incidents like this we're instead complaining that software is slow.
Of course it doesn't have to be either/or; you can have fast + secure, but it costs a lot more to design, develop, maintain and validate. What you can't have is a "why don't they just" simple and obvious solution that makes it cheap without making it either less secure, less performant, or both.
Given all the other mishaps in this story, it is very well possible that the software is insecure (we know that), slow, and also still very expensive. There's a limit to how high you can push the triangle, but there's no bottom to how bad it can get.
I'm curious, how else would you store direct memory offsets? No matter how you store/transmit them, eventually you're going to need those same offsets.
The problem wasn't storing raw memory offsets, it was not having some way to validate the data at runtime.
In this case, the direct memory addresses are literally needed.
The addresses aren't being generated internal to the program, so there are no "handles". They are referencing external data by design.
That's like saying "you shouldn't use a hard-coded volatile pointer to reference a hardware device". No, you literally need to do that sometimes; especially in embedded software.
> I'm happy to up the ante by £50 to account for my second theory
What's that, three pints in a pub inside the M25? :P
Completely agree with this sentiment though, we've known that handling of binary data in memory unsafe languages has been risky for yonks. At the very least, fuzzing should've been employed here to try and detect these sorts of issues. More fundamentally though, where was their QA? These "channel files" just went out of the door without any idea as to their validity? Was there no continuous integration check to just .. ensure they parsed with the same parser as was deployed to the endpoints? And why were the channel files not deployed gradually?
FWIW, before someone brings up JSON, GP's bet only makes sense when "binary" includes parsing text as well. In fact, most notorious software bugs are related to misuse of textual formats like SQL or JS.
"human programmers forget to account for edge cases"
Which is precisely the rationale which led to Standard Operating Procedures and Best Practices (much like any other Sector of business has developed).
I submit to you, respectfully, that a corporation shall never rise to a $75 Billion Market Cap without a bullet-proof adherence to such, and thus, this "event" should be properly characterized and viewed as a very suspicious anomaly, at the least
> combination of said bad binary data and a poorly-written parser that didn't error out correctly upon reading invalid data
By now, if you write any parser that deals with any outside data and don't fuzz the heck out of it, you are willfully negligent. Fuzzers are pretty easy to use, automatic and would likely catch any such problem pretty soon. So, did they fuzz and got very very unlucky or do they just like to live dangerously?
More or less. Binary parsers are the easiest place to find exploits because of how hard it is to do correctly. Bounds checks, overflow checks, pointer checks, etc. Especially when the data format is complicated.
Yeah, even if you are only parsing "safe" inputs such as ones you created yourself. Other bugs and sometimes even truly random events can corrupt data.
Hmmm. Most common problems these days are certificate related I would have thought. Binary data transfers are pretty rare in an age of base64 json bloat
There are plenty of binary serialisation protocols out there, many proprietary - maybe you’ll stuff that base64’d in a json container for transit, but you’re still dealing with a binary decoder.
Bypassing the discussion of whether one actually needs rootkit-powered endpoint surveillance software such as CS, perhaps an open-source solution would be a killer way to move this whole sector to more ethical standards. So the main tool would be open source, it would be transparent what exactly it does and that it is free of backdoors or really bad bugs, and it could be audited by the public. On the other hand, it could still be a business model to supply malware signatures as a security team feeding this system.
I'd say no. Kolide is one such attempt, and their practices, and how it's used in companies, are as insidious as those from a proprietary product. As a user, it gives me no assurance that an open source surveillance rootkit is better tested and developed, or that it has my best interests in mind.
The problem is the entire category of surveillance software. It should not exist. Companies that use it don't understand security, and don't trust their employees. They're not good places to work at.
"And aren’t they kinda right to not trust their employees if they employ 50,000 people with different skills and intentions?"
Yes, in a 50k employee company, the CEO won't know every single employee and be able to vouch for their skills and intentions.
But in a non-dysfunctional company, you have a hierarchy of trust, where each management level knows and trusts the people above and below them. You also have siloed data, where people have access to the specific things they need to do their jobs. And you have disaster mitigation mechanisms for when things go wrong.
Having worked in companies of different sizes and with different trust cultures, I do think that problems start to arise when you add things like individual monitoring and control. You're basically telling people that you don't trust them, which makes them see their employer in an adversarial role, which actually makes them start to behave less trustworthy, which further diminishes trust across the company, harms collaboration, and eventually harms productivity and security.
Setting aside the possibility of deploying an EDR like Crowdstrike just being a box ticking exercise for compliance or insurance purposes, can something like an EDR be used not because of a lack of trust but a desire to protect the environment?
A user doesn’t have to do anything wrong for the computer to become compromised, or even if they do, being able to limit the blast radius and lock down the computer or at least after the fact have collected the data to be able to identify what went wrong seems important.
How would you secure a network of computers without an agent that can do anti-virus, detect anomalies, and remediate them? That is to say, how would you manage to secure it without doing something that has monitoring and lockdown capabilities? In your words, signaling that you do not trust the users?
This. From all the comments I've seen in the multiple posts and threads about the incident, this simple fact seems to be the least discussed. How else do you protect a complex IT environment with thousands of assets, in the form of servers and workstations, without some kind of endpoint protection? Sure, solutions like CrowdStrike et al are box-checking and risk-transferring exercises in one sense, but they actually work as intended when it comes to protecting endpoints from novel malware and TTPs. As long as they don't botch their own software, that is :D
> How else to protect a complex IT environment with thousands of assets in form of servers and workstations, without some kind of endpoint protection?
There is no straightforward answer to this question. Assuming that your infrastructure is "secure" because you deployed an EDR solution is wrong. It only gives you a false sense of security.
The reality is that security takes a lot of effort from everyone involved, and it starts by educating people. There is no quick bandaid solution to these problems, and, as with anything in IT, any approach has tradeoffs. In this case, and particularly after the recent events, it's evident that an EDR system is as much of a liability as it is an asset—perhaps even more so. You give away control of your systems to a 3rd party, and expect them to work flawlessly 100% of the time. The alarming thing is how much this particular vendor was trusted with critical parts of our civil infrastructure. It not only exposes us to operational failures due to negligence, but to attacks from actors who will seek to exploit that 3rd party.
I totally agree. In my current work environment, we do deploy EDR but it is primarily for assets critical for delivering our main service to customers. Ironically, this incident caused them all to be unavailable and there is for sure a lesson to be learned here!
It is not considered a silver bullet by the security team, rather a last-resort detection mechanism for suspicious behavior (for example if the network segmentation or access control fails, or someone manages to get a foothold by other means). It also helps them identify which employees need more training, as they keep downloading random executables from the web.
Absolutely, training is key. Alas, managers don't seem to want their employees spending time on anything other than delivering profit and so the training courses are zipped through just to mark them as completed.
Personally, I don't know how to solve that problem.
It is a good question. Is there a possibility of fundamentally fixing software/hardware to eliminate the vectors that malware exploits to gain a foothold at all? E.g. not storing the return address on the stack, or not letting it be manipulated by the callee? Memory-bounds enforcement, either statically at compile time or with the help of hardware, to prevent writing past memory that isn't yours? (Not asking about the feasibility of coexisting with or migrating from the current world, just about the possibility of fundamentally solving this at all...)
Economic drivers spring to mind, possibly connected with civil or criminal liability in some cases.
But this will be the work of at least two human generations; our tools and work practices are woefully inadequate, so even if the pointy-haired bosses come to fear imprisonment for gratuitous failure, and the grasping, greedy investors come to fear the destruction of their "hard earned" capital, it's not going to be done in the snap of our fingers, not least because the people occupying the technology industry - and this is an overgeneralisation, but I'm pretty angry so I'm going to let it stand - Just Don't Care Enough.
If we cared, it would be nigh on impossible for my granny to get tricked into popping her Windows desktop by opening an attachment in her email client.
It wouldn't be possible to sell (or buy!) cloud services for which we don't get security data in real time, and a signal about what our vendor advises us to do if worst comes to worst.
Yet we don't apply total surveillance to people. The reason isn't just ethics and the US constitution, but also that it's just not possible without destroying society. Perhaps the same applies to computer systems.
I think it doesn't. I think that the kind of security the likes of CrowdStrike promise is fundamentally impossible to have, and pursuing it is a fool's errand.
I disagree. You seem to start from a premise that all people are honest, except those that aren't, and that you never work with or meet dishonest people unless the employer sets itself up in an adversarial role?
As the other reply to your comment said: the world is not 'fair' or 'honest', that's just a lie told to children. Apart from genuinely evil people, there are unlimited variables that dictate people's behavior. Culture, personality, nutrition, financial situation, mood, stress, bully coworkers, intrinsic values, etc etc. To think people are all fair and honest "unless" is a really harmful worldview to have, and in my opinion the reason a lot of bad things are allowed to happen and continue (throughout all of society, not just work).
Zero-trust in IT is just the digitized version of "trust is earned". In computers you can be more crude and direct about it, but it should be the same for social connections and interactions.
> You seem to start from a premise that all people are honest
You have to start with that premise otherwise organizations and society fail. Every hour of every day, even people in high security organizations have opportunities to betray the trust bestowed on them. Software and processes are about keeping honest people honest. The dishonest ones you cannot do much about, beyond hoping you limit the damage they can cause.
If everyone is treated as dishonest then there will eventually be an organizational breakdown. Creativity, high productivity, etc... do not work in a low/zero trust environment.
That’s a lie we tell children so they think the world is fair.
A Marxist reading would suggest alienation, but a more modern one would realize that it is a bit more than that: to enable modern business practices (both good and bad!) we designed systems of management to remove or reduce trust and accountability in the org, yet maintain results similar to those of a world more in line with the one you believe is possible.
A security professional though would tell you that even in such a world, you can not expect even the most diligent folks to be able to identify all risks (e.g. phishing became so good, even professionals can’t always discern the real from fake), or practice perfect opsec (which probably requires one to be a psychopath).
Security is a process not a product. Anyone selling you security as a product is scamming you.
These endpoint security companies latch onto people making decisions, those people want security and these software vendors promise to make the process as easy as possible. No need to change the way a company operates, just buy our stuff and you're good. That's the scam.
Truthfully, it must be practically infeasible to transform security practices of a large company overnight. Most of the time they buy into these products because they're chasing a security certification (ISO 27001, SOC2, etc.), and by just deploying this to their entire fleet they get to sidestep the actually difficult part.
The irony is that at the end of this they're not any more "secure" than they were before, but since they have the certification, their customers trust that they are. It's security theater 101.
Whether you morally agree with surveillance software's purpose is not the same question as whether a particular piece of surveillance software works well or not.
I would imagine an open source version of crowdstrike would not have had such a bad outcome.
I disagree with the concept of surveillance altogether. Computer users should be educated about security, given control of their devices, and trusted that they will do the right thing. If a company can't do that, that's a sign that they don't have good security practices to begin with, and don't do a good job at hiring and training.
The only reason this kind of software is used is so that companies can tick a certification checkbox that gives the appearance of running a tight ship.
I realize it's the easy way out, and possibly the only practical solution for a large corporation, but then this type of issue is unavoidable. Whether the product is free or proprietary makes no difference.
Most people do not understand, or care to understand, what "security" means.
You highlight training as a control. Training is expensive - to reduce cost and enhance effectiveness, how do you focus training on those that need it without any method of identifying those who do things in insecure ways?
Additionally, I would say a major function of these systems is not surveillance at all - they are preventive controls meant to stop compromise of your systems.
Overall, your comment strikes me as naive and not based on operational experience.
This type of software is notorious for severely degrading employees' ability to do their jobs, occasionally preventing it entirely. It's a main reason why "shadow IT" is a thing - bullshit IT restrictions and endpoint security malware can't reach third-party SaaS' servers.
This is to say, there are costs and threats caused by deploying these systems too, and they should be considered when making security decisions.
Explain exactly how any AV prevents a user from checking e-mails and opening Word?
In the years I spent doing IT at that level, every time, every single time I got a request for admin privileges to be granted to a user or for software to be installed on an endpoint, we already had a solution in place for exactly what the user wanted, installed and tested on their workstation, covered in onboarding, and they had simply "forgotten".
Just like the users whose passwords I had to reset every Monday because they forgot them. It's an irritation, but that doesn't mean they didn't do their jobs well. They met all performance expectations; they just needed to be hand-held with technology.
The real world isn't black and white and this isn't Reddit.
> Explain exactly how any AV prevents a user from checking e-mails and opening Word?
For example by doing continuous scans that consume so much CPU the machine stays thermally throttled at all times.
(Yes, really. I've seen a colleague raising a ticket about AV making it near-impossible to do dev work, to which IT replied the company will reimburse them for a cooling pad for the laptop, and closed the issue as solved.)
The problem is so bad that Microsoft, despite Defender being by far the lightest and least bullshit AV solution, created "dev drive", a designated drive that's excluded by design from Defender scanning, as a blatant workaround for corporate policies preventing users and admins from setting custom Defender exclusions. Before that, your only alternative was to run WSL2 or a regular VM, which are opaque to AVs, but that tends to be restricted by corporate too, because "sekhurity".
And yes, people in these situations invent workarounds, such as VMs, unauthorized third-party SaaS, or using personal devices, because at the end of the day, the work still needs to be done. So all those security measures do is reduce actual security.
Most AV and EDR solutions support exceptions, either on specific assets or fleets of assets. You can make exceptions for some employees (for example developers or IT) while keeping (sane) defaults for everybody else. Exceptions are usually applied on file paths, executable image names, file hashes, signature certificates or the complete asset. It sounds like people are applying these solutions wrong, which of course has a negative outcome for everybody and builds distrust.
In theory, those solutions could be used right. In practice, they never are.
People making decisions about purchasing, deploying and configuring those systems are separated by many layers from rank-and-file employees. The impact on business downstream is diffuse and doesn't affect them directly, while the direct incentives they have are not aligned with the overall business operations. The top doesn't feel the damage this is doing, and the bottom has no way of communicating it in a way that will be heard.
It does build distrust, but not necessarily in the sense that "company thinks I'm a potential criminal" - rather, just the mundane expectation that work will continue to get more difficult to perform with every new announcement from the security team.
I'm going to just echo my sibling comment here. This seems like a management issue. If IT wouldn't help, it was up to your management to intervene and say that it needed to be addressed.
Also, I'm unsure I've ever seen an AV even come close to stressing a machine I would spec for dev work. Likely it was misconfigured for the use case, but I've been there and definitely understand the other side of the coin. Sometimes a beer or pizza with someone high up at IT gets you much further than barking. We all live in a society with other people.
I would also hazard a guess that the dev drive is more a matter of just making it easier for IT to do the right thing, requested by IT departments more than likely. I personally have my entire dev tree excluded from AV, purely because of false positives on binaries and just unnecessary scans because the files change content so regularly. That can be annoying to do with group policy if where that data is stored isn't mandated, and then you have engineers who would be babies about "I really want my data in %USERPROFILE%/documents instead of %USERPROFILE%/source". Now IT can much more easily just say that the Microsoft-blessed solution is X and you need to use it.
Regarding WSL, if it's needed for your job then go for it and have your manager put in a request. However, if you are only doing it to circumvent IT restrictions, well, don't expect anyone to play nice.
On the personal devices note: if there's company data on your device, it and all its contents can be subpoenaed in a court case. You really want that? Keep work and personal separate, it really is better for all parties involved.
> sometimes a beer or pizza with someone high up at IT gets you much further than barking. We all live in a society with other people.
That's true, but it gets tricky in a large multinational, when the rules are set by some team in a different country, whose responsibilities are to the corporate HQ, and the IT department of the merged-in company I worked for has zero authority on the issue. I tried, I've also sent tickets up the chain, they all got politely ignored.
From the POV of all the regular employees, it looks like this: there are some annoying restrictions here and there, and you learn how to navigate the CPU-eating AV scans; you adapt and learn how to do your work. Then one day, some sneaky group policy update kills one of your workarounds, and you notice this by observing that compilation takes 5x as long as it used to, and git operations take 20x as long as they should. You find a way to deal (goodbye small commits). Then one day, you get an e-mail from corporate IT saying that they just partnered with ESET or CrowdStrike or ZScaler or whatnot, and they'll be deploying the new software to everyone. Then they do, and everything goes to shit, and you need to triple every estimate from now on, as the new software noticeably slows down everything across the board. You think to yourself, at least corporate gave you top-of-the-line laptops with powerful CPUs and an absurd amount of RAM; too bad for sales and managers who are likely using much weaker machines. And then you realize that sales and management were doing half their work in random third-party SaaS, and there is an ongoing process to reluctantly in-house some of the shadow IT that's been going on.
Fortunately for me, in my various corporate jobs, I've always managed to cope by using Ubuntu VMs or (later) WSL2, and that this always managed to stay "in the clear" with company security rules. Even if it meant I had to figure out some nasty hacks to operate Windows compilers from inside Linux, or to stop the newest and bestest corporate VPN from blackholing all network traffic to/from WSL2 (was worth it, at least my work wasn't disrupted by the Docker Desktop licensing fiasco...). I never had to use personal devices, and I learned long ago to keep firm separation between private and work hardware, but for many people, this is a fuzzy boundary.
There was one job where corporate installed a blatant keylogger on everyone's machines, and for a while, with our office IT's and our manager's blessing, our team managed to stave it off - and keep local admin rights - by conveniently forgetting to sign the relevant consent forms. The bad taste this left was a major factor in me quitting that job a few months later, though.
Anyway, the point of these stories is, I've experienced first-hand how security in medium and large enterprises impacts day-to-day work. I fought both alongside and against IT departments over these issues. I know that most of the time, from the corporate HQ's perspective, it's difficult to quantify the impact of various security practices on everyone's day-to-day work (and I briefly worked in cybersecurity, so I also know it isn't even obvious to those people that this should be considered!). I also know that large organizations can eat a lot of inefficiency without noticing it, because at that size, they have huge inertia. Corporate may not notice the work slowing down 2x across the board, when it's still completing million-dollar contracts on time (negotiated accordingly). It just really sucks to work in this environment; the inefficiency has a way of touching your soul.
EDIT:
The worst is the learned helplessness. One day, you get fed up with Git taking 2+ minutes to make a goddamn commit, and you whine a bit on the team channel. You hope someone will point out you're just stupid and holding it wrong, but no - you get a couple of people saying "yeah, that's how it is", and one saying "yeah, I tried to get IT to fix that; they told me a cooling stand for the laptop should speed things up a bit". You eventually learn that security people just don't care, or can't care, and you can only try to survive it.
(And then you go through several mandatory cybersecurity trainings, and then you discover a dumb SQL injection bug in a new flagship project after 2 hours of playing with it, and start questioning your own sanity.)
Look I'm not disagreeing with you that it sucks. I just know I've been on the other side of the fence and people like to throw shade at IT when they themselves are just trying to do their jobs.
And let's see if we can agree that corporate multinationals are probably a bad thing, or at least micromanaging from the stratosphere when you cannot see how your decision affects things. That however is likely a management antipattern, and if it is really negatively affecting your mental health but you are still meeting performance expectations, I'm not against you making a decision to walk.
Sometimes the only way to solve those problems is to cause turnover and make management look twice, and a lot of the time one key person leaving can cause an exodus that will force change.
Not being negative here, sometimes you are just in a toxic relationship and need to get out.
I don't have first-hand experience with Kolide, as I refused to install it when it was pushed upon everyone in a company I worked for.
Complaints voiced by others included false positives (flagging something as a threat when it wasn't, or alerting that a system wasn't in place when it was), being too intrusive and affecting their workflow, and privacy concerns (reading and reporting all files, web browsing history, etc.). There were others I'm not remembering, as I mostly tried to stay away from the discussion, but it was generally disliked by the (mostly technical) workforce. Everyone just accepted it as the company deemed it necessary to secure some enterprise customers.
Also, Kolide's whole spiel about "honest security"[1] reeks of PR mumbo jumbo whose only purpose is to distance themselves from other "bad" solutions in the same space, when in reality they're not much different. It's built by Facebook alumni, after all, and relies on FB software (osquery).
I think some of the information here is misleading and a bit unfair.
> being too intrusive and affecting their workflow
Kolide is a reporting tool, it doesn't for example remove files or put them in quarantine. You also cannot execute commands remotely like in Crowdstrike. As you mentioned, it's based on osquery which makes it possible to query machine information using SQL. Usually, Kolide is configured to send a Slack message or email if there is a finding, which I guess can be seen as intrusive but IMO not very.
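To make the osquery model concrete: machine state is exposed as SQL tables, and an agent schedules queries against them. Here is a rough, hypothetical sketch that just shells out to the osqueryi CLI (assuming it's installed); the queries are arbitrary examples, not anything Kolide actually ships:

```python
import json
import subprocess

def run_osquery(sql: str):
    # osqueryi evaluates a one-off query; --json asks for machine-readable
    # output instead of the default table view.
    out = subprocess.run(
        ["osqueryi", "--json", sql],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

# The kind of posture check a Kolide-style agent might run on a schedule:
print(run_osquery("SELECT name, version FROM os_version;"))
print(run_osquery("SELECT name, pid FROM processes LIMIT 5;"))
```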
> reading and reporting all files
It does not read and report all files as far as I know, but I think it's possible to make SQL queries to read specific files. But all files or file names aren't stored in Kolide or anything like that. And that live query feature is audited (end users can see all queries run against their machines) and can be disabled by administrators.
> web browsing history
This is not directly possible as far as I know - maybe via a file read query - but it's not something built in out of the box by default. And again, custom queries are transparent to users and can be disabled.
> Kolide's whole spiel about "honest security"[1] reeks of PR mumbo jumbo whose only purpose is to distance themselves from other "bad" solutions in the same space
While it's definitely a PR thing, they might still believe in it and practice what they preach. To me it sounds like a good thing to differentiate oneself from bad actors.
Kolide gives users full transparency of what data is collected via their Privacy Center, and they allow end users to make decisions about what to do about findings (if anything) rather than enforcing them.
> It's built by Facebook alumni, after all, and relies on FB software (osquery).
For example, React and Semgrep are also built by Facebook/Facebook alumni, but I don't really see the relevance other than some ad hominem.
Full disclosure: No association with Kolide, just a happy user.
I concede that I may be unreasonably biased against Kolide because of the type of software it is, but I think you're minimizing some of these issues. My memory may be vague on the specifics, but there were certainly many complaints in the areas I mentioned in the company I worked at.
That said, since Kolide/osquery is a very flexible product, the complaints might not have been directed at the product itself, but at how it was configured by the security department as well. There are definitely some growing pains until the company finds the right balance of features that everyone finds acceptable.
Re: intrusiveness, it doesn't matter that Kolide is a report-only tool. Although it's also possible to install extensions[1,2] that give it deeper control over the system.
The problem is that the policies it enforces can negatively affect people's workflow. For example, forcing screen locking after a short period of inactivity has dubious security benefits if I'm working from a trusted environment like my home, yet it's highly disruptive. (No, the solution is not to track my location, or give me a setting I have to manage...) Forcing automatic system updates is also disruptive, since I want to update and reboot at my own schedule. Things like this add up, and the combination of all of them is equivalent to working in a babyproofed environment where I'm constantly monitored and nagged about issues that don't take any nuance into account, and at the end of the day do not improve security in the slightest.
Re: web browsing history, I do remember one engineer looking into this and noticing that Kolide read their browser's profile files, and coming up with a way to read the contents of the history data in SQLite files. But I am very vague on the details, so I won't claim that this is something that Kolide enables by default. osquery developers are clearly against this kind of use case[3]. It is concerning that the product can, in theory, be exploited to do this. It's also technically possible to pull any file from endpoints[4], so even if this is not directly possible, it could easily be done outside of Kolide/osquery itself.
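For what it's worth, the underlying worry is easy to demonstrate with nothing more than a file-read primitive, because browser history is just a SQLite database on disk. A hypothetical sketch (Chrome on Linux; the path and schema are assumptions that vary by browser, platform, and version, and this is not something Kolide ships):

```python
import sqlite3
from pathlib import Path

# Assumed location of Chrome's history database on Linux.
history_db = Path.home() / ".config/google-chrome/Default/History"

# Open read-only so we don't interfere with a running browser.
con = sqlite3.connect(f"file:{history_db}?mode=ro&immutable=1", uri=True)
for url, title in con.execute("SELECT url, title FROM urls LIMIT 5"):
    print(url, title)
```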
> Kolide gives users full transparency of what data is collected via their Privacy Center
Honestly, why should I trust what that says? Facebook and Google also have privacy policies, yet have been caught violating their users' privacy numerous times. Trust is earned, not assumed based on "trust me, bro" statements.
> For example, React and Semgrep are also built by Facebook/Facebook alumni, but I don't really see the relevance other than some ad hominem.
Facebook has historically abused their users' privacy, and even has a Wikipedia article about it.[5] In the context of an EDR system, ensuring trust from users and handling their data with the utmost care w.r.t. their privacy are paramount. Actually, it's a bit silly that Kolide/osquery is so vocal in favor of preserving user privacy, when this goes against working with employer-owned devices where employee privacy is definitely not expected. In any case, the fact that this product is made by people who worked at a company built by exploiting its users is very relevant considering the type of software it is. React and Semgrep have an entirely different purpose.
> For example, forcing screen locking after a short period of inactivity has dubious security benefits if I'm working from a trusted environment like my home, yet it's highly disruptive.
There is a better alternative too. Make it fair game for coworkers to send an invitation for a beer, from the forgetful worker's machine, to the whole company/department. It works wonders.
If your company is large enough, you can't really trust your employees. Do you really think Google can trust that not a single one of its employees will do something stupid, or even be actively malicious?
Limit their abilities using OS features? Have the vendor fix security issues rather than a third party incompetently slapping on a band-aid?
It's like you let one company build your office building and then bring in another contractor to randomly add walls and have others removed while having never looked at the blueprints and then one day "whoopsie, that was a supporting wall I guess".
Why is it not just completely normal but even expected that an OS vendor can't build an OS properly, or that the admins can't properly configure it, but instead you need to install a bunch of crap that fucks around with OS internals in batshit crazy ways? I guess because it has a nice dashboard somewhere that says "you're protected". Checkbox software.
The sensor basically monitors everything that's happening on the system and then uses heuristics and knowledge of known attack vectors and behaviour to, for example, lock compromised systems down. Take a fileless malware that connects to a C&C and then begins to upload all local documents and stored passwords, then slowly enumerates every service the employee has access to for vulnerabilities.
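Purely to illustrate the idea (a real sensor hooks kernel events rather than polling), a toy user-mode version of that kind of behavioural rule might look like the sketch below, using the third-party psutil library and made-up thresholds:

```python
import psutil

OPEN_FILE_THRESHOLD = 200  # arbitrary: "reading an unusual number of documents"

def flag_suspicious():
    flagged = []
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            open_files = proc.open_files()
            conns = proc.connections(kind="inet")
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
        # Crude rule: lots of open files plus at least one remote connection.
        has_remote_peer = any(c.raddr for c in conns)
        if len(open_files) > OPEN_FILE_THRESHOLD and has_remote_peer:
            flagged.append((proc.info["pid"], proc.info["name"]))
    return flagged

if __name__ == "__main__":
    for pid, name in flag_suspicious():
        print(f"possible exfiltration behaviour: pid={pid} name={name}")
```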
If you manage a fleet of tens of thousands of systems and you need to protect against well funded organized crime? Employees running malicious code under their user is a given and can't be prevented. Buying crowdstrike sensor doesn't seem like such a bad idea to me. What would you do instead?
As said, limit the user's abilities as much as possible with features of the OS and software in use. Maybe, if you want those other metrics, use a firewall - but not a TLS-breaking virus-scanning abomination that has all the same problems, rather a simple one that can warn you about unusual traffic patterns. If someone from accounting starts uploading a lot of data, or connects to Google Cloud when you don't use any of their products, that should stand out as odd.
If we're talking about organized crime, I'm not convinced CrowdStrike in particular doesn't actually enlarge the attack surface. So what did we have as the cause: a malformed binary ruleset that the parser, running with kernel privileges, choked on, crashing the system. Because of course the parsing needs to happen in kernel space and not in a sandboxed process. That's enough for me to make assumptions about the quality of the rest of the software, and to answer the question regarding attack surface.
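To spell out what "handle a malformed ruleset gracefully" means in practice, here's a minimal sketch (the record format is invented; it's the pattern that matters): validate every length against the buffer before touching the data, and reject the file instead of crashing.

```python
import struct

class MalformedContent(ValueError):
    """Raised when a content file fails validation."""

def parse_records(blob: bytes):
    records = []
    offset = 0
    while offset < len(blob):
        if offset + 4 > len(blob):
            raise MalformedContent("truncated length field")
        (length,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        # Bounds check before trusting the length read from the file.
        if length == 0 or offset + length > len(blob):
            raise MalformedContent("record length out of bounds")
        records.append(blob[offset:offset + length])
        offset += length
    return records

try:
    parse_records(b"\xff\xff\xff\xff")  # bogus length, like a corrupt channel file
except MalformedContent as err:
    print("rejected content update:", err)  # quarantine and report; don't bluescreen
```

A driver could follow the same principle, but as noted above, the saner place to do this parsing is an unprivileged, sandboxed process that hands only validated structures to the kernel.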
Before this incident nobody ever really looked at this product at all from a security standpoint, maybe because it is (supposed to be) a security product and thus cannot have any flaws. But it seems security researchers all over the planet are now looking at this thing and having a field day.
Bill Gates sent that infamous email in the early 2000s, I think after Code Red and Nimda hit the world, saying that security should be made the number-one priority for Windows. As much as I dislike Windows for various reasons, I think overall Microsoft does a rather good job on this front. Maybe it's time the companies behind these security products started taking security seriously too?
> Before this incident nobody ever really looked at this product at all from a security standpoint
If you only knew how absurd of a statement that is. But in any case, there are just too many threats network IDS/IPS solutions won't help you with; any decent C2 will make it trivial to circumvent them. You can't limit the permissions of your employees to the point of being effective against such attacks while still letting them do their jobs.
> If you only knew how absurd of a statement that is.
You don't seem to know either, since you don't elaborate on this. As said, people are picking this apart on Twitter and Mastodon right now. Give it a week or two and I bet we'll see a couple of CVEs from this.
For the rest of your post you seem to ignore the argument regarding attack surface, as well as the fact that there are companies not using this kind of software and apparently doing fine. But I guess we can just claim they are fully infiltrated and just don't know because they don't use crowdstrike. Are you working for crowdstrike by any chance?
But sure, at the end of the day you're just gonna weigh the damage this outage did to your bottom line and the frequency you expect this to happen with, against a potential hack - however you even come up with the numbers here, maybe crowdstrike salespeople will help you out - and maybe tell yourself it's still worth it.
In a sense the secure platform already exists. You use web apps as much as possible. You store data in cloud storage. You restrict local file access and execute permissions. Authenticate using passkeys.
The trouble is that people still need local file access, and use network file shares. You have hundreds of apps used by a handful of users that need to run locally. And a few intranet apps that are mission critical and have dubious security. That creates the necessity for wrapping users in firewalls, vpns, tls interception, end point security etc. And the less well it all works the more you need to fill the gaps.
Next you'll be saying "I dont need an immune system..."
Fun fact: an attacker only needs to steal credentials from the home directory to jump into a company's AWS account where all the juicy customer data lives, so there are reasons we want this control.
Frankly I'd like to see the smart people complaining help write better solutions rather than hinder.
There are lots of variants of this. Wazuh, Velociraptor, etc. They have several problems. One is that user-mode EDR is just not very efficient and effective, and kernel mode requires Microsoft driver signing. There are some hoops for that, and I don't know how hard they are, but I don't know of any of these products that seems to be jumping through them.
The other issue is that detection engineering is really expensive, so the detections that are included with CrowdStrike out of the box are your problem if you're using a free product. From a cost perspective you're not getting off a lot cheaper and trying to sell open source and a detection engineer's salary to a CISO who can just buy CrowdStrike instead is understandably a pretty tough sell. Or it was until this weekend, anyway.
It sounds really interesting. But the one thing it does not do is scan for viruses/malware, although this could be implemented using GRR, I guess. How does Google mitigate malware threats in-house?
> By-passing the discussion whether one actually needs root kit powered endpoint surveillance software such as CS perhaps an open-source solution would be a killer to move this whole sector to more ethical standards.
As a red teamer developing malware for my team to evade EDR solutions we come across, I can tell you that EDR systems are essential. The phrase "root kit powered endpoint surveillance" is a mischaracterization, often fueled by misconceptions from the gaming community. These tools provide essential protection against sophisticated threats, and they catch them. Without them, my job would be 90% easier when doing a test where Windows boxes are included.
> So the main tool would be open source and it would be transparent what it does exactly and that it is free of backdoors or really bad bugs.
Open-source EDR solutions, like OpenEDR [1], exist but are outdated and offer poor telemetry. Assembling various GitHub POCs that exist for production EDR is impractical and insecure.
The EDR sensor itself becomes the targeted thing. As a threat actor, the EDR is the only thing in your way most of the time. Open sourcing them increases the risk of attackers contributing malicious code to slow down development or introduce vulnerabilities. It becomes a nightmare for development, as you can't be sure who is on the other side of the pull request. TAs will do everything to slow down the development of a security sensor. It is a very adversarial atmosphere.
> On the other hand it could still be a business model to supply malware signatures as a security team feeding this system.
It is actually the other way around. Open-source malware heuristic rules do exist, such as Elastic Security's detection rules [2]. Elastic also provides EDR solutions that include kernel drivers and is, in my experience, the harder one to bypass. Again, please make an EDR without drivers for Windows, it makes my job easier.
> It could be audited by the public.
The EDR sensors already do get "audited" by security researchers and the threat actors themselves. Reverse engineering and debugging the EDR sensors to spot weaknesses that can be "abused." If I spot things like the EDR just plainly accepting kernel mode shellcode and executing it, I will, of course, publicly disclose that. EDR sensors are under a lot of scrutiny.
> Open sourcing them increases the risk of attackers contributing malicious code to slow down development or introduce vulnerabilities.
This is such a tired non-sequitur argument; there is no evidence whatsoever to back up the claim that the risk is actually higher for open source versus closed source.
I can just as easily argue that a state or non-state actor could buy[1], bribe or simply threaten someone to get weak code into a proprietary system, without users having any means to ever find out. On the other hand, it is always easier (easier, not easy) to discover compromise in open source, like it happened with xz[2], and to verify such reports independently.
If there is no proof that compromise is less likely with closed source, and it is far easier to discover it in open source, the logical conclusion is simply that open source is better for security libraries.
Funding open-source defensive security infrastructure, freely available for everyone to use, with even 1/100th of the NSA's budget (which is effectively only offensive) would improve info-security enormously for everyone - not just against nation-state actors, but also against scammers etc. Instead we get companies like CS that have an enormous vested interest in making sure that never happens, and that try to scare the rest of us into believing open source is bad for security.
I could see an open source solution with "private" or vendor specific definition files. But I think I'd disagree with the statement that open sourcing everything wouldn't cause any problem. Engineering isn't necessarily about peer reviewed studies, it's about empirical observations and applying the engineering method (which can be complemented by a more scientific one but shouldn't be confused for it). It's clear that this type of stuff is a game of cat and mouse. Attackers search for any possible vulnerability, bypass etc. It does make sense that exposing one side's machinery will make it easier for the other side to see how it works. A good example of that is how active hackers are at finding different ways to bypass Windows Defender by using certain types of Office file formats, or certain combinations of file conversions to execute code. Exposing the code would just make all of those immediately visible to everyone.
Eventually that's something that gets exposed anyways, but I think the crucial part is timing and being a few steps ahead in the cat and mouse game.
Otherwise I'm not sure what kind of proof would even be meaningful here.
I actually agree there is no intrinsic advantage in having this piece of software be open source - closed teams tend to have a more contained collaborator "blast radius", and you don't have 500 forks with patches that may modify behaviour in subtle ways and that are somehow conflated with the original project.
On the other hand, anyone serious about malware development already has "the actual source code", for either defensive or offensive operations.
Open source doesn't mean the bazaar; plenty of projects have cathedral-style development.
The bazaar works absolutely fine for security. The Linux kernel is one project which does this, and all security infrastructure uses it one way or another. Across tens of thousands of patches and forks, the intentionally-submitted subtle bug/vulnerability scenario has not once been discovered in 30 years.
There seems to be a lot of misconceptions in this thread what open source is or can do. Most of my points have been made by people much better than me for decades now.
I feel that having the solution open sourced isn't bad from a code security standpoint, but rather that it is simply not economically viable. To my knowledge most of the major open source technologies are currently funded by FAANG, purely because they're needed by them to conduct business, and the moment it becomes inconvenient for them to support a project they fork it or develop their own, see Terraform/Redis...
I also cannot get behind a government funding model, purely because it will simply become a design-by-committee nightmare, because this isn't flashy tech. Just see how many private companies have beaten NASA to market in a pretty well funded and very flashy industry. The very government you want to fund these solutions is currently running on private companies' infrastructure for all its IT needs.
Yes, open sourcing is definitely amazing and, if executed well, will be better, just like communism.
Plenty of fundamental research and development happens in academia fairly effectively.
Government has to fund it, not run it, just like any other grant works today. The existing foundations and non-profits like Apache, or even mixed ones like Mozilla, are fairly capable of handling the grants.
Expecting private companies or dedicated volunteers to maintain mission-critical libraries like xz, as we are doing now, is not a viable option.
> The phrase "root kit powered endpoint surveillance" is a mischaracterization, often fueled by misconceptions from the gaming community.
How exactly is this a mischaracterization? Technically these EDR tools are identical to kernel-level anticheat, and they are identical to rootkits, because fundamentally they're all the same thing, just with a different owner. If you disagree it would be nice if you explained why.
As for open source EDRs becoming the target, this is just as true of closed source EDR. Cortex for example was hilariously easy to exploit for years and years until someone was nice enough to tell them as much. This event from CrowdStrike means that it's probably just as true here.
The fact that the EDR is 90% of the work of attacking a Windows network isn't a sign that we should continue using EDRs. It means that nothing privileged should be in a Windows network. This isn't that complicated, I've administered such a network where everything important was on Linux while end users could run Windows clients, and if anything it's easier than doing a modern Windows/AD deployment. Good luck pivoting from one computer to another when they're completely isolated through a Linux server you have no credentials for. No endpoint should have any credentials that are valid anywhere except on the endpoint itself and no two endpoints should be talking to each other directly: this is in fact not very restrictive to end users and completely shuts down lateral movement - it's a far better solution than convoluted and insecure EDR schemes that claim to provide zero-trust but fundamentally can't, while following this simple rule actually provides you zero-trust.
Look at it this way - if you (and other redteamers) can economically get past EDR systems for the cost of a pentest, what do you think competent hackers with economies of scale and million dollar payouts can do? For now there's enough systems without EDRs that many just won't bother, but as it spread more they will just be exploited more. This is true as well of the technical analogue in kernel anticheat, which you and I can bypass in a couple days of work.
Where we are is that we're using EDRs as a patch over a fundamentally insecure security model in a misguided attempt to keep the convenience that insecurity brings.
People don't go around complaining that Microsoft Defender is "rootkit powered endpoint surveillance". Its intent is to protect the system.
There is a lot more suspicion around kernel-level anti-cheat software developed by the likes of Epic Games, due to their ownership, than there is around CrowdStrike or Microsoft.
People don't complain about kernel code from Microsoft because Microsoft wrote the kernel. You don't have a choice but to trust Microsoft with that.
People have been complaining about rootkit powered antimalware for a long time. It didn't start with CrowdStrike: there was a whole debacle about it in the Windows XP days when Microsoft stopped antiviruses from patching the kernel.
The value CrowdStrike provides is the maintenance of the signature database, and being able to monitor attack campaigns worldwide. That takes a fair amount of resources that an open source project wouldn’t have. It’s a bit more complicated than a basic hash lookup program.
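To illustrate the gap: the "basic hash lookup program" part really is trivial, something like the sketch below (the hash list is a placeholder). The expensive part is curating, updating, and distributing that list, plus the behavioural detections layered on top, which is essentially the service being sold.

```python
import hashlib
from pathlib import Path

# Placeholder "signature database" -- in reality this is huge, constantly
# updated, and curated by a paid research team.
KNOWN_BAD_SHA256 = {
    "0" * 64,  # fake hash, stands in for a real malware sample
}

def scan(root: str) -> None:
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
        except OSError:
            continue  # unreadable file; a real scanner would log this
        if digest in KNOWN_BAD_SHA256:
            print(f"known-bad file: {path}")

scan("/tmp")  # example target directory
```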
Crowdstrike is a gun. A tool. But not the silver bullet. Or training to be able to fire it accurately under pressure at the werewolf.
You can very easily shoot your own foot off instead of slaying the monster, use the wrong ammunition to be effective, or in this case a poorly crafted gun can explode in your hand when you are holding it.
DAT-style content updates and signature-based prevention are very archaic. Directly loading content into memory and a hard-coded list of threats? I was honestly shocked that CS was still doing DAT-style updates in an age of ML and real-time threat feeds. There are a number of vendors who've offered ML-based, real-time alternatives for almost a decade. We use one. We have to run updates a couple of times a year.
There are no "ethical standards" to move to. Nobody should be able to usurp control of our computers. That should simply be declared illegal. Creating contractual obligations that require people to cede control of their computers should also be prohibited. Anything that does this is malware and malware does not become justified or "ethical" when some corporation does it. Open source malware is still malware.
What does “our computer” mean when it is not owned by you, but issued to you to perform a task with by your employer? Does that also apply to the operator at a switchboard in a nuclear missile launch facility?
Does the switchboard in a nuclear missile launch facility run Crowdstrike? I picture it as a high quality analog circuit board that does 1 thing and 1 thing only. No way to run anything else.
Globally networked personal computers were a kind of cultural revolution against the setting you describe. Everyone had their own private compute and compute time, and everyone could share their own opinion. Computers became our personal extensions. This is what IBM, Atari, Commodore, Be, Microsoft and Apple (and later desktop Linux) sold. Now, given this ideology, can a company own my limbs? If not, they can't own my computers.
> What does “our computer” mean when it is not owned by you, but issued to you to perform a task with by your employer?
Well, presuming that:
1. the employee is issued a computer, that they have possession of even if not ownership (i.e. they bring the computer home with them, etc.)
2. and the employee is required to perform creative/intellectual labor activities on this computer — implying that they do things like connecting their online accounts to this computer; installing software on this computer (whether themselves or by asking IT to do it); doing general web-browsing on this computer; etc.
3. and where the extent of their job duties blurs the line between "work" and "not work" (most salaried intellectual-labor jobs are like this) such that the employee basically "lives in" this computer, even when not at work...
4. ...to the point that the employee could reasonably conclude that it'd be silly for them to maintain a separate "personal" computer — and so would potentially sell any such devices (if they owned any), leaving them dependent on this employer-issued computer for all their computing needs...
...then I would argue that, by the same chain of reasoning as in the GP post, employers should not be legally permitted to “issue” employees such devices.
Instead, the employer should either purchase such equipment for the employee, giving it to them permanently as a taxable benefit; or they should require that the employee purchase it themselves, and recompense them for doing so.
Cyberpunk analogy: imagine you are a brain in a vat. Should your employer be able to purchase an arbitrary android body for you; make you use it while at work; and stuff it full of monitoring and DRM? No, that'd be awful.
Same analogy, but with the veil stripped off: imagine you are paraplegic. Should your employer be allowed to issue you an arbitrary specific wheelchair, and require you to use it at work, and then monitor everything you do with it / limit what you can do with it because it’s “theirs”? No, that’d be ridiculous. And humanity already knows that — employers already can't do that, in any country with even a shred of awareness about accessibility devices. The employer — or very much more likely, the employer's insurance provider — just buys the person the chair. And then it's the employee's chair.
And yes, by exactly the same logic, this also means that issuing an employee a company car should be illegal — at least in cases where the employee lives in a non-walkable area, and doesn't already have another car (that they could afford to keep + maintain + insure); and/or where their commute is long enough that they'd do most non-employment-related car-requiring things around work and thus use their company car. Just buy them a car. (Or, if you're worried they might run away with it, then lease-to-own them a car — i.e. where their "equity in the car" is in the form of options that vest over time, right alongside any equity they have in the company itself.)
> Does that also apply to the operator at a switchboard…
Actually, no! Because an operator of a switchboard is not a “user” of the computer that powers the switchboard, in the same sense that a regular person sitting at a workstation is a "user" of the workstation.
The system in this case is a “kiosk computer”, and the operator is performing a prescribed domain-specific function through a limited UX they’re locked into by said system. The operator of a nuclear power plant is akin to a customer ordering food from a fast-food kiosk — just providing slightly more mission-critical inputs. (Or, for a maybe better analogy: they're akin to a transit security officer using one of those scanner kiosk-handhelds to check people's tickets.)
If the "computer" the nuclear-plant operator was operating, exposed a purely electromechanical UX rather than a digital one — switches and knobs and LEDs rather than screens and keyboards[1] — then nothing about the operator's workflow would change. Which means that the operator isn't truly computing with the computer; they're just interacting with an interface that happens to be a computer.
[1] ...which, in fact, "modern" nuclear plants are. The UX for a nuclear power plant control-center has not changed much since the 1960s; the sort of "just make it a touchscreen"-ification that has infected e.g. automotive has thankfully not made its way into these more mission-critical systems yet. (I believe it's all computers under the hood now, but those computers are GPIO-relayed up to panels with lots and lots of analogue controls. Or maybe those panels are USB HID devices these days; I dunno, I'm not a nuclear control-systems engineer.)
Anyway, in the general case, you can recognize these "the operator is just interacting with an interface, not computing on a computer" cases because:
• The machine has separate system administrators who log onto it frequently — less like a workstation, more like a server.
• The machine is never allowed to run anything other than the kiosk app (which might be some kind of custom launcher providing several kiosk apps, but where these are all business-domain specific apps, with none of them being general-purpose "use this device as a computer" apps.)
• The machine is set up to use domain login rather than local login, and keeps no local per-user state; or, more often, the machine is configured to auto-login to an "app user" account (in modern Windows, this would be a Mandatory User Profile) — and then the actual user authentication mechanism is built into the kiosk app itself.
• Hopefully, the machine is using an embedded version of the OS, which has had all general-purpose software stripped out of it to remove vulnerability surface.
> the employee could reasonably conclude that it'd be silly for them to maintain a separate "personal" computer — and so would potentially sell any such devices
What a bizarre leap of logic. Can FedEx employees reasonably sell their non-uniform clothes? Just because the employer in this scenario didn't 100% lock down the computer (which is a good thing, because the alternative would be incredibly annoying for day-to-day work) doesn't mean the employee can treat it as their own. Even from the privacy perspective, it would be pretty silly. Are you going to use the employer-provided computer to apply to your next job?
People do do it, though. Especially poor people, who might not use their personal computers very often.
Also, many people don't own a separate "personal" computer in the first place. Especially, again, poor people. (I know many people who, if needing to use "a PC" for something, would go to a public library to use the computers there.)
Not every job is a software dev position in the Bay Area, where everyone has enough disposable income to have a pile of old technology laying around. Many jobs for which you might be issued a work laptop still might not pay enough to get you above the poverty line. McDonald's managers are issued work laptops, for instance.
(Also, disregarding economic class for a moment: in the modern day, most people who aren't in tech solve most of their computing problems by owning a smartphone, and so are unlikely to have a full PC at home. But their phone can't do everything, so if they have a work computer they happen to be sat in front of for hours each day — whether one issued to them, or a fixed workstation at work — then they'll default to doing their rare personal "productivity" tasks on that work computer. And yes, this does include updating their CV!)
---
Maybe you can see it more clearly with the case of company cars.
People sometimes don't own any other car (that actually works) until they get issued a company car; so they end up using their company car for everything. (Think especially: tradespeople using their company-logo-branded work box-truck for everything. Where I live, every third vehicle in any parking lot is one of those.)
And people — especially poorer people — also often sell their personal vehicle when they are issued a company car, because this 1. releases them from the need to pay a lease + insurance on that vehicle, and 2. gets them possibly tens of thousands of dollars in a lump sum (that they don't need to immediately reinvest into another car, because they can now rely on the company car.)
The point is that if you do do it, it's on you to understand the limitations of using someone else's property. Just like the difference between rented vs owned housing.
There are also fairly obvious differences between work-issued computers and all of your other analogies:
1. A car (and presumably the cyberpunk android body) is much more expensive than a computer, so the downside of owning both a personal and a work one is much higher.
2. A chair or a wheelchair doesn't need security monitoring because it's a chair (I guess you could come up with an incredibly convoluted scenario where it would make sense to put GPS tracking in a wheelchair, but come on).
> just buys the person the chair. And then it's the employee's chair.
It's not because there's a law against loaning chairs, it's because the chair is likely customized for a specific person and can't be reused. Or if you're talking about WFH scenarios, they just don't want to bother with return shipping.
No, it's the difference between owned housing vs renting from a landlord who is also your boss in a company town, where the landlord has a vested interest in e.g. preventing you from using your apartment to also do work for a competitor.
Which is, again, a situation so shitty that we've outlawed it entirely! And then also imposed further regulations on regular, non-employer landlords, about what kinds of conditions they can impose on tenants. (E.g. in most jurisdictions, your landlord can't restrict you from having guests stay the night in your room.)
Tenants' rights are actually a great analogy for what I'm talking about here. A company-issued laptop is very much like an apartment, in that you're "living in it" (literally and figuratively, respectively), and that you therefore should deserve certain rights to autonomous possession/use, privacy, freedom from restriction/compromise in use, etc.
While you don't literally own an apartment you're renting, the law tries to, as much as possible, give tenants the rights of someone who does own that property; and to restrict the set of legal justifications that a landlord can use to punish someone for exercising those (temporary) rights over their property.
IMHO having the equivalent of "tenants' rights" for something like a laptop is silly, because that'd be a lot of additional legal edifice for not-much gain. But, unlike with real-estate rental, it'd actually be quite practical to just make the "tenancy" case of company IT equipment use impossible/illegal — forcing employers to do something else instead — something that doesn't force employees into the sort of legal area that would make "tenants' rights" considerations applicable in the first place.
No, that would be more like sleeping at the office (purely because of employee preferences, not because the employer forces you to or anything like that) and complaining about security cameras.
Tangent — a question you didn't ask, but I'll pretend you did:
> If employers allowed employees to "bring their own devices", and then didn't force said employees to run MDM software on those devices, then how in the world could the employer guarantee the integrity of any line-of-business software the employee must run on the device; impose controls to stop PII + customer-shared data + trade secrets from being leaked outside the domain; and so forth?
My answer to that question: it's safe to say that most people in the modern day are fine with the compromise that your device might be 100% yours most of the time; but, when necessary — when you decide it to be so — 99% yours, 1% someone else's.
For example, anti-cheat software in online games.
The anti-cheat logic in online games is this little nugget of code that runs on a little sub-computer within your computer (Intel SGX or equivalent). This sub-computer acts as a "black box" — it's something the root user of the PC can't introspect or tamper with. However:
• Whenever you're not playing a game, the anti-cheat software isn't loaded. So most of the time, your computer is entirely yours.
• You get to decide when to play an online game, and you are explicitly aware of doing so.
• When you are playing an online game, most of your computer — the CPU's "application cores", and 99% of the RAM — is still 100% under your control. The anti-cheat software isn't actually a rootkit (despite what some people say); it can't affect any app that doesn't explicitly hook into it.
• In a brute-force sense, you still "control" the little sub-computer as well — in that you can force it to stop running whatever it's running whenever you want. SGX and the like aren't like Intel's Management Engine (which really could be used by a state actor to plant a non-removable "ring -3" rootkit on your PC); instead, SGX is more like a TPM, or an FPGA: it's something that's ultimately controlled by the CPU from ring 0, just with a very circumscribed API that doesn't give the CPU the ability to "get in the way" of a workload once the CPU has deployed that workload to it, other than by shutting that workload off.
As much as people like Richard Stallman might freak out at the above design, it really isn't the same thing as your employer having root on your wheelchair. It's more like how someone in a wheelchair knows that if they get on a plane, then they're not allowed to wheel their own wheelchair around on the plane, and a flight attendant will instead be doing that for them.
How does that translate to employer MDM software?
Well, there's no clear translation currently, because we're currently in a paradigm that favors employer-issued devices.
But here's what we could do:
• Modern PCs are powerful enough that anything a corporation wants you to do, can be done in a corporation-issued VM that runs on the computer.
• The employer could then require the installation of an integrity-verification extension (essentially "anti-cheat for VMs") that ensures that the VM itself, and the hypervisor software that runs it, and the host kernel the hypervisor is running on top of, all haven't been tampered with. (If any of them were, then the extension wouldn't be able to sign a remote-attestation packet, and the employer's server in turn wouldn't return a decryption key for the VM, so the VM wouldn't start.) A rough sketch of that handshake follows the list below.
• The employer could feel free to MDM the VM guest kernel — but they likely wouldn't need to, as they could instead just lock it down in much-more-severe ways (the sorts of approaches you use to lock down a server! or a kiosk computer!) that would make a general-purpose PC next-to-useless, but which would be fine in the context of a VM running only line-of-business software. (Remember, all your general-purpose "personal computer" software would be running outside the VM. Web browsing? Outside the VM. The VM is just for interacting with Intranet apps, reading secure email, etc.)
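Here's that measure-then-release-the-key handshake as a purely conceptual sketch, with the hardware-backed signing faked using an HMAC. None of this corresponds to a real MDM or attestation API; the keys and measurement values are invented.

```python
import hashlib
import hmac

# Hypothetical values: in reality the key lives in a TPM/SGX-style element and
# the measurement covers the VM image, hypervisor, and host kernel.
DEVICE_KEY = b"per-device-secret-provisioned-at-enrollment"
KNOWN_GOOD = hashlib.sha256(b"vm-image|hypervisor|host-kernel").hexdigest()
VM_DISK_KEY = b"vm-disk-decryption-key"

def client_attest(stack_blob: bytes):
    # Measure the software stack and prove the measurement with the device key.
    measurement = hashlib.sha256(stack_blob).hexdigest()
    proof = hmac.new(DEVICE_KEY, measurement.encode(), hashlib.sha256).hexdigest()
    return measurement, proof

def server_release_key(measurement: str, proof: str):
    expected = hmac.new(DEVICE_KEY, measurement.encode(), hashlib.sha256).hexdigest()
    if hmac.compare_digest(proof, expected) and measurement == KNOWN_GOOD:
        return VM_DISK_KEY  # untampered stack: the work VM may boot
    return None             # tampered or unknown stack: refuse the key

measurement, proof = client_attest(b"vm-image|hypervisor|host-kernel")
assert server_release_key(measurement, proof) == VM_DISK_KEY
```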
There you go. An anti-cheat rootkit so ineptly coded it serves as literal privilege escalation as a service. Can we stop normalizing this stuff already?
My computer is my computer, and your computer is your computer.
The game company owns their servers, not my computer. If their game runs on my machine, then cheating is my prerogative. It is quite literally an exercise of my computer freedom if I decide to change the game's state to give myself infinite health or see through walls or whatever. It's not their business what software I run on my computer. I can do whatever I want.
It's my machine. I am the god of this domain. The game doesn't get to protect itself from me. It will bend to my will if I so decide. It doesn't have a choice in the matter. Anything that strips me of this divine power should be straight up illegal. I don't care what the consequences are for corporations, they should not get to usurp me. They don't get to create little extraterritorial islands in our domains where they have higher power and control than we do.
I don't try to own their servers and mess with the code running on them. They owe me the exact same respect in return.
> If their game runs on my machine, then cheating is my prerogative.
Sure.
However, due to the nature of how these games work, cheating cannot be prevented serverside only.
So, if you want to play the game, you have to agree to install the anti-cheat because it's the only way to actually stop cheating.
The only other alternative is to sell a separate category of gaming machines where users wouldn't have access to install cheats, using something like the TPM to enforce it.
I don't have to agree to a thing. They're the ones who should have to accept our freedom. We're not about to sacrifice our power and freedom for the sake of preventing cheating in video games. Not only are we going to play the games, we're going to impose some of our terms and conditions on these things.
Yes, that is why the owners of the computers (corps) use these tools - to maintain control over their hardware (and IP accessible on it). The end user is not the customer or user here.
Oh stop it. It’s not your machine, it’s your employer’s machine. You’re the user of the machine. You’re cargo-culting some ideological take that doesn’t apply here at all.
> It’s not your machine, it’s your employer’s machine.
Agreed. I'm fine with this, as long as the employer also accepts that I will never use a personal device for work, that I will never use a minute of personal time for work, and that my productivity is significantly affected by working on devices and systems provided and configured by the employer. This knife cuts both ways.
If only that were possible. Luckily for my employer, I end up thinking about problems to be solved during my off hours like when I'm sleeping and in the shower. Then again, I also think about non-work life problems sitting at my desk when I'm supposed to be working, so (hopefully) it evens out.
I don't think it's possible either. But the moment my employer forces me to install a surveillance rootkit on the machine I use for work—regardless of who owns the machine—any trust that existed in the relationship is broken. And trust is paramount, even in professional settings.
If you don't already have an antivirus on your work machine, you're in an extremely small minority. As a consultant with projects that go about a week, I've experienced the onboarding process of over a hundred orgs first hand. They almost all hand out a Windows laptop, and every single Windows laptop had an AV on it. It's considered negligent not to have some AV solution in the corporate world. And these days, almost all the fancy AVs live in the kernel.
Setting aside the question of whether these security tools are effective at their stated goal, what does this have to do with trust at all? Does the existence of a bank vault break the trust between the bank and the tellers? What is the mechanism that would prevent your computer from getting infected by a 0-day if only your employer trusted you?
> Does the existence of a bank vault break the trust between the bank and the tellers?
That's a strange analogy, since the vault is meant to safeguard customer assets from the public, not from bank employees. Besides, the vault doesn't make the teller's job more difficult.
> What is the mechanism that would prevent your computer from getting infected by a 0-day if only your employer trusted you?
There isn't one. What my employer does is trust that I take care of their assets and follow good security practices to the best of my abilities. Making me install monitoring software is an explicit admission that they don't trust me to do this, and with that they also break my trust in them.
You mean like AV software is meant to safeguard the computer from malware? I'm sure banks have a lot of annoying security-related processes that make tellers' jobs more difficult.
My experience is that in these workplaces where EDR is enforced on all devices used for work, your hypothetical is true (i.e. you are not expected to work on devices not provided by your employer - on the contrary, that is most likely forbidden).
But how come they didn't catch it in the testing deployments? What was the difference that caused it to happen when they deployed to the outside world? I find it hard to believe that they didn't test it before deployment. I also think companies should all have a testing environment before deploying 3rd-party components. I mean, we all install packages during development that fail or cause problems, but nobody thinks it is a good idea to do that directly in their production environment before testing, so how is this different?
My guess -- there are two separate pipelines: one for code changes and one for data files.
Pipeline 1 --
Code updates to their software are treated as material changes that require non-production and canary testing before global roll-out of a new "Version".
Pipeline 2 --
Content / channel updates are handled differently -- via a separate pipeline -- because only new malware signatures and the like are distributed via this route. The new files are just data files -- they are supposed to be in a standard format and only read, not "executed".
This pipeline itself must have been tested originally and found to be working satisfactorily -- but inside the pipeline there is no "test" stage that verifies the integrity of the data file so generated, nor, more importantly, a check that this new data file works without errors when deployed to the latest versions of the software in use.
The agent software that reads these daily channel files must have been "thoroughly" tested (as part of pipeline 1) for all conceivable data file sizes and simulated contents before deployment. (any invalid data files should simply be rejected with an error ... "obviously")
But the exact scenario here -- possibly caused by a broken pipeline in the second path (pipeline 2) -- created invalid data files with some quirks. And THAT specific scenario was not imagined or tested in the software-version dev-test-deploy pipeline (pipeline 1).
If this is true --
The lesson, obviously, is that even for "data"-only distributions and roll-outs, however standardized and stable their pipelines may be, testing is still an essential part before large-scale roll-outs. It will increase cost and add latency, sure, but we have to live with it. (Similar to how people pay for "security" software in the first place.)
Same lesson for enterprise customers as well -- test new distributions on non-production within your IT setup, or have a canary deployment in place before allowing full roll-outs into production fleets.
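As a rough illustration of the missing "test stage" in pipeline 2, here is a sketch of a pre-release gate for a candidate data file. The real channel-file format isn't public, so these checks are placeholders; the point is only that the file gets exercised (ideally by the exact parser build that ships in the agent, on a throwaway machine) before it is promoted to the rollout step.

#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical pre-release gate for the "data only" pipeline. The checks are
// stand-ins; a real gate would also load the file with the shipping parser.
bool candidate_is_releasable(const std::vector<unsigned char>& bytes) {
    if (bytes.size() < 16) {
        std::fprintf(stderr, "refusing release: file too small\n");
        return false;
    }
    // A file of nothing but zero bytes is clearly not a valid rule set.
    if (std::all_of(bytes.begin(), bytes.end(),
                    [](unsigned char b) { return b == 0; })) {
        std::fprintf(stderr, "refusing release: file is all zeros\n");
        return false;
    }
    // ...then feed the file to the exact agent build customers run, in a VM,
    // and only promote it to the staged/global rollout if that survives.
    return true;
}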
It was mentioned in one of the HN threads that the update was pushed overriding the settings customers had [1]. What recourse can any customer have in such a case?
But the problem here is that the code runs in kernel mode. As such, any data that it may consume should have been tested with the same care as the code itself, which has never been the case in this industry.
> I find it hard to believe that they didn't test it before deployment.
I’m not sure why you find that hard to believe - based on the (admittedly fairly limited) evidence we have right now, it’s highly unlikely that this deployment was tested much, if at all. It seems much more likely to me that they were playing fast and loose with definition updates to meet some arbitrary SLAs[1] on zero-day prevention, and it finally caught up with them. Much more likely than somehow every single real-world PC running their software being affected while their test machines were somehow all impervious.
[1] When my company was considering getting into endpoint security and network anomaly detection, we were required on multiple occasions by multiple potential clients to provide a 4-hour SLA on a wide number of CVE types and severities. That would mean 24/7 on-call security engineers and a sub-4-hour definition creation and deployment. Yes, that 4 hours was for the deployment being available on 100% of the targets. Good luck writing and deploying a high-quality definition for a zero day in 4 hours, let alone running it through a test pipeline, let alone writing new tests to actually cover it. We very quickly noped out of the space, because that was considered “normal” (at least to the potential clients we were discussing). It wouldn’t shock me if CS was working in roughly the same way here.
This whole f*up was a failure of management and processes at Crowdstrike. "Intern Steve" pushing faulty code to production on a Friday is only a couple of cm of the tip of an enormous iceberg.
I wrote this in another thread already, but the fuck-up was both at CrowdStrike (they borked a release) and, more importantly, at their customers. Shit happens even with the best testing in the world.
You do not deploy anything, ever on your entire production fleet at the same time and you do not buy software that does that. It's madness and we're not talking about small companies with tiny IT departments here.
That’s a tricky one. CrowdStrike is cybersecurity. Wait until the first customer complains that they were hit by WannaCry v2 because CrowdStrike wanted to wait a few days after they updated a canary fleet.
The problem here is that this type of update (a content update) should never be able to cause this however badly it goes. In case the software receives a bad content update, it should fail back to the last known good content update (potentially with a warning fired off to CS, the user, or someone else about the failed update).
In principle, updates that could go wrong and cause this kind of issue should absolutely be deployed slowly, but per my understanding, that’s already the practice for non-content updates at CrowdStrike.
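A minimal sketch of that "fall back to the last known good content update" behaviour, with invented names standing in for whatever the agent actually does when it ingests a channel file:

#include <cstdio>
#include <vector>

// Hypothetical types: RuleSet is whatever the agent builds out of a channel
// file, and try_load_rules is a stand-in for its (hopefully defensive) parser.
struct RuleSet { std::vector<unsigned char> raw; };

bool try_load_rules(const std::vector<unsigned char>& bytes, RuleSet* out) {
    if (bytes.empty()) return false; // placeholder validation; the real parser goes here
    out->raw = bytes;
    return true;
}

RuleSet apply_content_update(const std::vector<unsigned char>& new_bytes,
                             const RuleSet& last_known_good) {
    RuleSet fresh;
    if (!try_load_rules(new_bytes, &fresh)) {
        // Bad update: keep protecting with the previous rules and phone home,
        // instead of taking the machine down.
        std::fprintf(stderr, "content update rejected; keeping last known good\n");
        return last_known_good;
    }
    return fresh;
}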
Windows updates are also cybersecurity, but the customer has (had?) a choice in how to roll those out (with Intune nowadays?). The customer should decide when to update; they own the fleet, not the vendor!
You do not know if a content update will screw you over and mark all the files of your company as malware. The "It should never happen" situations are the thing you need to prepare for, the reason we talk about security as an onion, the reason we still do staggered production releases with baking times even after tests and QA have passed...
"But it's cybersecurity" is not a justification. I know that security departments and IT departments and companies in general love dropping the "responsibility" part on someone else, but in the end of the day the thing getting screwed over is the company fleet. You should retain control and make sure things work properly, the fact those billion dollar revenue companies are unable to do so is a joke. A terrible one, since IT underpins everything nowadays.
It is a justification, just not necessarily one you agree with.
Companies choose to work with Crowdstrike. One of the reasons they do that is 'hands-off' administration: let a trusted partner do it for you. There are absolutely risks of doing it this way. But there are also risks of doing it the other way.
The difference is, if you hand over to Crowdstrike, you’re not on your own if something goes wrong. If you manage it yourself, you’ve only got yourself working on the problem if something goes wrong.
Or worse, something goes wrong and your vendor says “yes, we knew about this issue and released the fix in the patch last Tuesday. Only 5% of your fleet took the patch? Oh. Sounds like your IT guys have got a lot of work on their hands to fix the remaining 95% then!”.
Sorry, this is untrue. Enterprises have SOCs and on-calls; if there is a high risk they can do at least minimal testing (which would have found this issue, as it has a 100% BSOD rate) and then a fleet rollout. It would have been rolled out by Friday evening in this case without crashing hundreds of thousands of servers.
The CS customer has decided to offload the responsibility of its fleet to CS. In my opinion that's bullshit and negligence (it doesn't mean I don't understand why they did it), particularly at the scale of some of the customers :)
Disagree with the part where you put the onus on the customer. As has been mentioned in another HN thread [1], this update was pushed ignoring whatever settings the customer had configured. The original mistake of the customer, if any, was that they didn't read this in the fine print of the contract (if this point about updates was explicitly mentioned in the contract).
1. https://news.ycombinator.com/item?id=41003390
Only highlighting that "best practice" of cybersecurity is, charitably, total bullshit; less charitably, a racket. This is apparent if you look at the costs to the day-to-day ability of employees to do work, but maybe it'll be more apparent now that people got killed because of it.
You'd think that the software would sit in a kind of sandbox so that it couldn't nuke the whole device but only itself. It's crazy that this is possible.
The software basically works as a kernel module as far as I understand, I don’t think there’s a good way to separate that from the OS while still allowing it to have the capabilities it needs to have to surveil all other processes.
And even then, you wouldn’t want the system to continue running if the security software crashes. Such a crash might indicate a successful security breach.
> You do not deploy anything, ever on your entire production fleet at the same time and you do not buy software that does that
I am sympathetic to that, but it's only possible if both policy and staffing allow.
for policy, there are lots of places that demand CVEs be patched within x hours depending on severity. A lot of times, that policy comes from the payment integration systems provider/third party.
However you are also dependent on programs you install not auto-updating. Now, most have an option to flip that off, but it's not always 100% effective.
> I am sympathetic to that, but it's only possible if both policy and staffing allow.
We are not talking about small companies here. We're talking about massive billion-dollar-revenue enterprises with enormous IT teams and in some cases multiple NOCs and SOCs and probably thousands of consultants all around, at minimum.
I find it hard to be sympathetic to this complete disregard of ownership just to ship responsibility somewhere else (because that is the real need at the end of the day, let's not joke around). I can understand it, sure, and I can believe - to a point - that someone did a risk calculation (possibility of a CrowdStrike upgrade killing all systems vs. a hack if we don't patch a CVE in <4h), but it's still madness from a reliability standpoint.
> for policy, there are lots of places that demand CVEs be patched within x hours depending on severity.
I'm pretty sure that when leadership needs to choose between production being down for an unspecified amount of time and taking the risk of delaying the patching (by hours, in this case), they will choose the delay.
Partners and payment integration providers can be reasoned with, contracts are not code. A BSOD you cannot talk away.
Sure, leadership is also now saying "but we were doing the same thing as everyone else, the consultants told us to, and how could we have known this random software with root on every machine we own could kill us?!" to cover their asses. The problem is solved already, since it impacted everyone, and they're not the ones spending their weekend hammering systems back to life.
> However you are also dependent on programs you install not auto-updating. Now, most have an option to flip that off, but it's not always 100% effective.
You choose what to install on your systems, and you have the option to refuse to engage with companies that don't provide such options. If you don't, you accept the risk.
Oh absolutely. There’s many levels of failure here. A few that I see as being likely:
- Lack of testing of a deployment
- Lack of required procedures to validate a deployment
- Engineering management prioritizing release pace over stability/testing
- Management prioritizing tech debt/pentests/etc far too low
- Sales/etc promising fast turnarounds that can’t be feasibly met while following proper standards
- Lack of top-down company culture of security and stability first, which should be a must for any security company
This outage wasn’t caused only by “the intern pushing release.” It was caused by a poor company culture (read: incorrect direction from the top) resulting in a lack of testing of the program code, lack of testing environment for deployments, lack of formal deployment process, and someone messing up a definition file that was caught by 0 other employees or automated systems.
I can't speak to its veracity but there's a screenshot making its way around in which Crowdstrike discouraged sites from testing due to the urgency of the update.
I don’t work with CS products atm, but my experience with a big CS deployment was exactly like this. They were openly quite hostile to any suggestion of testing their products; we were frequently rebuked for running our prod sensors on version n-1. I talked about it a bit in this comment.
It’s kind of hard to pitch “zero-day prevention” if you suggest people roll out definitions slowly, over the course of days/weeks. Thus making it a lot harder to charge to the moon for your service.
Now, if these sorts of things were battle tested before release, and had an (ideally decade-plus-long) history of stability with well-documented processes to ensure that stability, you can more easily make the argument that it’s worth it. None of those things are close to true though (and more than likely will never be for any AV/endpoint solution), so it is very hard to justify this sort of configuration.
Detecting system crashes would be hard. You could try logging and comparing timestamps on agent startups and see if the difference is 5 minutes or less. Buggy kernel drivers crash Windows hard and fast.
Store something like an `attemptingUpdate` flag before updating, and remove it if the update was successful. Upon system startup, if the flag is present, revert to the previous config and mark the new config bad.
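A small sketch of that flag-file idea, assuming the agent is able to restore its previous channel files; the path and the two helper functions are made up, and only the ordering of operations matters:

#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Invented path; in reality this would live wherever the agent keeps state.
const fs::path kFlag = "C:/ProgramData/agent/attempting_update";

void revert_to_previous_config() { /* assumed: restore the last good channel files */ }
void apply_new_config()          { /* assumed: swap in the freshly downloaded files */ }

void on_startup() {
    if (fs::exists(kFlag)) {
        // We crashed (or got rebooted) mid-update last time: roll back and
        // treat the new config as bad so we don't boot-loop on it.
        revert_to_previous_config();
        fs::remove(kFlag);
    }
}

void on_update() {
    std::ofstream(kFlag).put('1'); // set the flag *before* touching anything
    apply_new_config();            // if this takes the box down, on_startup() reverts
    fs::remove(kFlag);             // update survived: clear the flag
}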
One possible explanation could be automated testing deployments for definitions updates that don't run the current version of the definition consumer, and the old one they do run is unaffected.
Even on hn, comments advocating engineering excellence or just quality in general are frequently looked down on, which probably also tells you a lot about the wider world.
This is why we can’t have nice things, but maybe we just don’t want them anyway? “Mistakes will be made” is way less true if you actually put the effort in to prevent them, but I am beginning to think this has become code for quiet-quitters to telegraph a “I want to get paid for no effort and sympathize with others who feel the same” sentiment and appear compassionate and grimly realistic all at the same time.
Yes, billion-dollar companies are going to make mistakes, but almost always because of cost cutting, willful ignorance, or negligence. If average people are apologizing for them and excusing that, there has to be some reason that it’s good for them.
If it is a standard production rocket, I agree. If it is a first of kind or even third of kind launch, celebrating the lessons learned from a failure is a healthy attitude. This production software is not the same thing at all.
SpaceX celebrating when their rocket blows up after hitting a certain milestone is like us devs celebrating when the branch with that big new feature only fails a few tests. Did it pass? No. Are you satisfied with it as a first try? Probably.
I find it hard to believe they didn't do any testing. I wonder if they tested the virus signatures against the engine, but didn't check the final release artefact (the .sys file) and the bug was somehow introduced in the packaging step.
This would have been poor, but to have released it with no testing would have been the most staggering negligence.
The thing I don't understand about all of this is something else, much less technical and much more important.
Why was the blast radius so huge?
I have deployed much less important services much more slowly with automatic monitoring and rollback in place.
You first deploy to beta, where you don't get customer traffic; then, if everything goes right, to a small part of your fleet, slowly increasing the percentage of hosts that receive the update.
This would have stopped the issue immediately, and somehow I thought it was common practice...
It wasn't a software update. It was a signature database update. It's supposed to roll out as fast as possible. When you learn about a new virus, it's already in the wild, so every minute counts. You don't want to delay the update for a day just to find out that your servers were breached 20 hours ago.
We can see clearly now that this is a stupid approach. Viruses don't move that fast.
This situation is akin to the immune system overreacting and melting the patient in response to a papercut. This sometimes happens, but it's considered a serious medical condition, and I believe the treatment is to nuke someone's immune system entirely with hard radiation, and reinstall a less aggressive copy. Take from that analogy what you want.
Yes they do? And it’s more akin to a shared immune system than a single organism.
In this case, it’s not like viruses move fast relative to the total population of machines, but within the population of machines being targeted they do move fast.
Still, better to let them spread a bit and deal with the localized damage than risk nuking everything. There is such a thing as treatment that's very effective, but not used because of a low probability risk of terminal damage.
[SQL] Slammer spread incredibly quickly, even though the vulnerability was patched in the prior year.
> As it began spreading throughout the Internet, it doubled in size every 8.5 seconds. It infected more than 90 percent of vulnerable hosts within 10 minutes.
Worms are not technically viruses, but they can have similar impacts/perform similar tasks on an infected host.
Also keep in mind 8.5 million is likely the count of machines fully impacted, and does not count the machines that were impacted but were able to be automatically recovered.
> Also keep in mind 8.5 million is likely the count of machines fully impacted, and does not count the machines that were impacted but were able to be automatically recovered.
Do you have evidence of this? Please bring sources with you.
Can you explain why you find this idea of fast moving viruses so improbable? Just from the way the internet works, I wouldn’t be surprised if every reachable host could be infected in a few hours if the virus can infect a machine in a short time (a few seconds) and would then begin infecting other machines. Why is that so hard to imagine?
Proper firewalling for one. "Every reachable host" should be a fairly small set, ideally an empty set, when you're on the outside looking in.
And operating systems aren't that bad anymore. You don't have services out of the box opening ports on all the interfaces, no firewalls, accepting connections from everywhere, and using well-known default (or no) credentials.
Even stuff like the recent OpenSSH bug that is remotely exploitable and grants root access wasn't anything close to this kind of disaster because (a) most computers are not running SSH servers on the public internet (b) the exploit is rather difficult to actually execute. Eventually it might not be, but that gives people a bit of breathing space to react.
Most cyberattacks use old, unpatched vulnerabilities against unprotected systems combined with social engineering to get the payload past the network boundary. If you are within a pretty broad window of "up to date" on your OS and antivirus updates, you are pretty safe.
Microsoft puts the count at 8.5 million computers. So, percentage-wise, the MyDoom virus in 2004 infected a far greater share of computers in a month, which in the context of internet penetration, availability, and speeds in 2004 (40kb/s average, 450kb/s fastest) was about as fast as it could have spread. So it might as well have been 70 minutes, given that downloading a 50mb file on dial-up would take way longer than 70 minutes.
To the smart people below:
It’s clear to everyone that 70 minutes is not 1 month. The point is that it’s not a fair comparison: it would simply not have been possible to infect that many computers in 70 minutes; the internet infrastructure just wasn’t there.
It’s like saying “the Spanish flu didn’t do that much damage because there were fewer people on the planet” - it’s a meaningless absolute comparison, whereas the relative comparison is what matters.
There's also orders of magnitude more machines today than 20 years ago -- so it should be easier to infect more machines now than before, and yet no one can cite a virus that moved as quickly and did as much damage as what CrowdStrike did through gross negligence.
Computer security as a whole has improved, whilst the complexity of interconnected systems has exponentially increased.
This has made the barrier to entry for malware higher, and so means we no longer have the same historic examples of large scale worms targeting consumer machines that we used to.
At the same time the financial rewards for finding and exploiting a vulnerability within an organisations complex stack have greatly increased. The rewards are coupled to the time it takes to execute on the vulnerability.
This leads to what we have today: localised, and often specialised attacks against valuable targets that are executed as fast as possible in order to minimise the chance a target has to respond or the vulnerability they are exploiting to be burned.
Of course the “smart people below” must know this, so it’s unclear why they are pretending to be dumb.
> This leads to what we have today: localised, and often specialised attacks against valuable targets that are executed as fast as possible in order to minimise the chance a target has to respond or the vulnerability they are exploiting to be burned.
Yup, exactly that.
So what I'm saying is: it's beyond idiotic to combat this with a kernel-level backdoor managed by one entity and deployed across half the Internet. If anyone manages to breach that, they have a way to make their attack much simpler and much less localized (though they're unlikely to be prepared to capitalize on that). A fuckup on the defense side, on the other hand, can kill everything everywhere all at once. Which is what just happened.
It's a "cure" for disease that happens to both boost the potency of the disease, and, once in blue moon, randomly kills the patient for no reason.
The fact is that this does help organisations. Definitely not all of the orgs that buy Crowdstrike, but rapid defence against evolving threats is a valuable thing for companies.
So, individually it’s good for a company. But as a whole, and as currently implemented, it’s not good for everyone.
However that doesn’t matter. Because individually it’s a benefit.
Which is why I'm hoping that this incident will make both security professionals and regulators reconsider the idea of endpoint security as it's currently done, and that there will be some cultural and regulatory pushback. Maybe this will incentivize people to come up with other ideas on how to secure systems and companies, that don't look like a police state on steroids.
But you’re conflating a few different things here. The regulations don’t say “you must use a fragile kernel module that runs the risk of boot-locking” do they?
The underlying fault in this drama is Microsoft - third party code shouldn’t be able to have the impact it did, regardless of how it is loaded or what it does. Their commitment to supporting legacy interfaces has shot them in the foot here.
If HP pushed a dodgy printer driver (and if those still lived in the kernel) that nuked tens of millions of machines, would you be out here saying “regulators and security professionals need to re-consider printers”?
Microsoft will shit bricks, start to do something to isolate kernel modules, Crowdstrike will be the first shining user of this, and life will go on.
You’re not displaying a pattern of healthy behavior by creating numerous new accounts to try and provoke an argument on such a stupid point, without contributing anything of substance to the discussion.
ILOVEYOU is a pretty decent contender, although the Internet was smaller back then and it didn't "crash" computers, it did different damage. Computer viruses and worms can spread extremely quickly.
> infected millions of Windows computers worldwide within a few hours of its release
It’s quite unclear about what your point/agenda is here. Are you truly this unfamiliar with the topic? If so, why comment, and if not, then why comment?
It takes about 1 search and 2 clicks to find an article posted less than 24 hours after the initial infection, quoting 2.5 million infected machines.
No, they're in the wrong. They didn't test adequately, regardless of their motive for not doing so. Obviously reality is not backing up your theory there.
No, but there are impenetrable barriers. 0-days in particular are usually very specific and affect few systems directly, but even the broader ones aren't usually followed by a blanket attack that pwns everything and steals all the data or monies. Just about the only way to achieve this kind of blast radius is to have a kernel-level backdoor installed in every other computer on the planet - which is exactly what those endpoint "security" systems are.
It’s quite impressive really — crowdstrike were deploying a content update to all of their servers to warn them of the “nothing but nulls, anti-crowdstrike virus”
Their precognitive intelligence suggested that a world wide attack was only moments away. The same precognitive system showed that the virus was so totally incapacitating that the only safe response was to incapacitate the server.
Knowing that the virus was capable of taking down every crowdstrike server, they didn’t waste time trying it on a subset of servers.
Surely there is a happy medium between zero (nil, none, nada, zilch) staging and 24 hours of rolling updates? A single 30-second-or-so VM test would have revealed this issue.
There should have been a test catching the error before rollout. However, this doesn’t require a staged rollout as suggested by the GP comment (testing the update on some customers, who would still be hosed in that case); it only requires executing the test before the rollout.
Because the kernel needs to parse the data in some way, and that parser apparently was broken enough. Whether it could be done in a more resilient manner, I don't know; you need to remember that an antivirus works in a hostile environment and can't necessarily trust userspace, so they probably need to verify signatures and parse the payload in kernel space.
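For illustration, this is the general shape of "validate before you dereference" that such a kernel-side parser needs when reading an untrusted blob containing an offset table. The actual channel-file format is unknown, so the layout here (a 4-byte magic, a count, then offsets) is purely invented:

#include <cstdint>
#include <cstring>
#include <vector>

struct ParsedRules {
    std::vector<uint32_t> rule_offsets;
};

// Reject anything that doesn't fit, instead of walking off the end of the
// buffer and treating whatever happens to be there as a pointer or offset.
bool parse_rules(const uint8_t* data, size_t size, ParsedRules* out) {
    if (size < 8) return false;                       // header: magic + count
    if (std::memcmp(data, "RULE", 4) != 0) return false;

    uint32_t count;
    std::memcpy(&count, data + 4, 4);
    if (count > (size - 8) / 4) return false;         // count can't possibly fit in the file

    out->rule_offsets.resize(count);
    std::memcpy(out->rule_offsets.data(), data + 8, static_cast<size_t>(count) * 4);

    for (uint32_t off : out->rule_offsets) {
        if (off >= size) return false;                // every offset must stay in-bounds
    }
    return true;                                      // caller may now follow the offsets
}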
Yup. If they were delaying the update to half of their customers for 24 hours, and in those 24 hours some of their customers got hacked by a zero day, say leading to ransomware, the comment threads would be demanding their heads for that!
Sure. And if someone showed up here with a story about how they got attacked and ransomwared enterprise-wide in the however many several hours that they were waiting for their turn to rollout, what do you think HN response would be?
Hmm, maybe you could have companies pay more to be in the first rollout group? That'd go over well too.
True, there would be comments blaming CS for not doing a faster rollout. But there would also be some comments empathizing with CS's viewpoint and pointing out the conflicting compromise between velocity and correctness. Even now I think the comments wouldn't have been unequivocally critical of CS if the hosts affected were only a particular variant of Windows (say the issue was only seen on a version of Windows 10 that was two updates behind); there would have been some emphasizing the thorniness of the problem and sympathizing with CS.
It doesn't matter what kind of update it was: signature, content, etc. The only thing that matters is whether the update has the potential to disrupt the user's normal activity (let alone bricking the host); if yes, ensure it either works or has a staged rollout with a remediation plan.
It's answered in the post (in the thread) as well. But for comparison, when I worked for an AV vendor we pushed maybe 4 updates a day to a much bigger customer base (if the numbers reported by MS are true).
It was a long time ago and I wasn't as involved with this, so I don't know with certainty what was used and how. We had multiple channels for major product versions + beta customers on the most recent one. On top of these we could stage different CDN regions.
There were different types of data you could update and those might have been treated differently (e.g. simple file signatures vs definitions for heuristics).
One thing I am surprised no one has been discussing is the role Microsoft have played in this and how they set the stage for the CrowdStrike outage through a lack of incentive (profit, competition) to make Windows resilient to this sort of situation.
While they were not directly responsible for the bug that caused the crashes, Microsoft does hold an effective monopoly position over workstation computing space (I'd consider this as infrastructure at this point) and therefore have a duty of care to ensure the security/reliability and capabilities of their product.
Without competition, Microsoft have been asleep at the wheel on innovations to Windows - some of which could have prevented this outage.
For example, CrowdStrike runs in user space on macOS and Linux - does Windows not provide the capabilities needed to run CrowdStrike in user space?
What about innovations in application sandboxing which could mitigate the need for level of control CrowdStrike requires?
The fact is, Microsoft is largely uncontested in holding the keys to the world's computing infrastructure, and they have virtually no oversight.
Windows has fallen from making over 80% of Microsoft's revenue to 10% today - there is nothing wrong with being a private company chasing money - but when your product is critical to the operation of hospitals, airlines, critical infrastructure, you can't be out there tickling your undercarriage on AI assistants and advertisements to increase the product's profitability.
IMO Microsoft have dropped the ball on their duty of care to consumers and CrowdStrike is a symptom of that. Governments need to seriously consider encouraging competition in the desktop workspace market. That, or regulate Microsoft's Windows product.
So the grandparent poster has a fundamental misunderstanding of how Windows works, and why CrowdStrike has a kernel driver in the first place.
Microsoft has long desired to kick AV vendors out of kernel space and has even attempted to do so prior, however because of its dominant position in the market, it is unable to do so. I was at MS when an iteration of this effort was underway, and the EU said no.
See, Windows is a highly regulated OS today, and making a change like kicking out AV vendors from the kernel runs afoul of antitrust laws.
Microsoft also has ELAM: https://learn.microsoft.com/en-us/windows-hardware/drivers/i... which is a rootkit / bootkit defensive mechanism. A defect in the definition files (as noted in the twitter thread) is what caused the crash in an ELAM driver. CrowdStrike obviously was not following the required process for ELAM drivers.
All good points, I might have been slightly over-impassioned and under-informed in my original rant (though still salty at Microsoft's assault on the usability of Windows).
My understanding was that CrowdStrike breaking on Debian was actually the motivation for them moving to user-space on Linux. I'm surprised that, assuming they have the capability to do so, they haven't done the same on Windows.
I don’t run CrowdStrike and to the best of my knowledge haven’t had it installed on one of my systems (something similar ran on my machine at the last corporate job I had), so correct me if I’m wrong.
It seems great pains are made to ensure the CS driver is installed first _and_ cannot be uninstalled (presumably the remote monitor will notice) or tampered with (signed driver).
Then the driver goes and loads unsigned data files that can be arbitrarily deleted by end users? Can these files also be arbitrarily added by end users to get the driver to behave in ways that it shouldn’t? What prevents a malicious actor from writing a malicious data file and starting another cascade of failing machines or worse, getting kernel privileges?
These files cannot be deleted or modified by the user, even with admin privs. That would make it trivial to disable the antivirus. It's only possible by mounting the file system in a different OS, which is typically prevented by Bitlocker.
Not in the BitLocker configurations I've seen over the last few days. The file is deletable as a local administrator in safe mode without the BitLocker recovery key in at least some configurations.
Do these customers of CrowdStrike even have a say in these updates going out, or do they all just bend over and let CrowdStrike have full RCE on every machine in their enterprise?
I sure hope the certificate authorities and other crypto folks get to keep that stuff off their systems at least.
I don't know if there's a way to outsource ongoing endpoint security to a third party like Crowdstrike without giving them RCE (and ring 0 too) on all endpoints to be secured. Having Crowdstrike automate that part is kind of the point of their product.
Does anybody know if these “channel files” are signed and verified by the CS driver? Because if not, that seems like a gaping hole for a ring 0 rootkit. Yeah, you need privileges to install the channel files, but once you have it you can hide yourself much deeper in the system. If the channel files can cause a segfault, they can probably do more.
Any input for something that runs at such high privilege should be at least integrity checked. That’s the basics.
And the fact that you can simply delete these channel files suggests there isn’t even an anti-tamper mechanism.
This is a pretty brief 'analysis'. The poster traces back one stack frame in assembler, it basically amounts to just reading out a stack dump from gdb. It's a good starting point I guess.
These "channel files" sound like they could be used to execute arbitrary code... Would be a big embarrassment if it shows up in KDU as a provider...
(This is just an early guess from looking at some of the csagent in ida decompiler, haven't validated that all the sanity checks can be bypassed as these channel files appear to have some kind of signature attached to them.)
A 'channel file' is a file interpreted by their signature detection system. How far is this from a bytecode compiled domain specific language? Javascript anyone?
eBPF, much the same thing, is actually thought through and well designed. If it weren't, it would be easy to crash Linux.
This is what they do and they are doing badly. I bet it's just shit on shit under the hood, developed by somewhat competent engineers, all gone or promoted to management.
Oddly enough, there was an issue last month with CrowdStrike and RHEL 9 kernel where they were triggering a kernel panic when attempting to load a bpf program from their newer bpf sensor. One of the workarounds was to switch to their kernel driver mode.
This was obviously a bug in the RHEL kernel, because even if the bpf program was bunk it should not cause the kernel to panic. However, it's almost like CrowdStrike does zero testing of their software and looks at their end users as Test/QA.
The kernel update in question was released as part of a RHEL point release (9.3 or 9.4, I forget which).
I’m not sure how much early warning RH gives to folks when a kernel change comes in via a point release. Looking at https://www.redhat.com/en/blog/upcoming-improvements-red-hat..., it seems like it’s changing for 9.5. I hope CrowdStrike will be able to start testing against those beta kernels.
It was 9.4. I don’t think any amount of heads up will make a difference considering it took them like 3+ years to notice that E4S streams were a thing. Most of these security vendors tend to treat Linux as the red headed step child and do the least.. With that said, after the recent event it would seem that CrowdStrike treats all OSes as red headed step children lol
It's really difficult to evaluate the risk the CrowdStrike system imposed. Was this a confluence of improbable events or an inevitable disaster waiting to happen?
Some still-open questions in my mind:
- was the broken rule in the config file (C-00000291-...32.sys) human authored and reviewed or machine-generated?
- was the config file syntactically or semantically invalid according to its spec?
- what is the intended failure mode of the kernel driver that encounters an invalid config (presumably it's not "go into a boot loop")?
- what automated testing was done on both the file going out and the kernel driver code? Where would we have expected to catch this bug?
- what release strategy, if any, was in place to limit the blast radius of a bug? Was there a bug in the release gates or were there simply no release gates?
Given what we know so far, it seems much more likely that this was a "disaster waiting to happen" but I still think there's a lot more to know. I look forward to the public post-mortem.
Would any of these, or even a collection of these, resolving in some direction make it highly improbable that it'll happen again?
Seems to me that 3rd-party code, running in the kernel, on parsed inputs, that can be remotely updated is enough to be a disaster waiting to happen (gestures breezily at Friday).
That's, in the Taleb parlance, a Fat Tony argument, but barring it being a cosmic ray causing an uncorrected bit flip during deploy, I don't think there's room to call it anything but "a disaster waiting to happen".
Yes, if CrowdStrike was following industry best practices and this happened, it would teach us something novel about industry practices that we could learn from and use to reduce the risk of a similar scale outage happening again.
If they weren't following these practices, this is kind of a boring incident with not much to be learned, despite how dramatic the scale is. Practices like staged rollout of changes exist precisely because we've learned these lessons before.
Well, kernel code is kernel code, and kernel code in general takes input from outside the kernel. An audio driver takes audio data, a video driver might take drawing instructions, a file system interacts with files, etc. Microsoft, and others, have been releasing kernel code since forever and for the most part, not crashlooping their entire install base.
My Tesla remote updates ... hmph.
It doesn't feel like this is inherently impossible. It feels more like not enough design/process to mitigate the risks.
The kernel driver could have a data check on the channel file and fail gracefully (ignore the bad file) instead of BSODing.
This code is executed only once, during driver initialization, so it shouldn't add much overhead, but it would greatly improve reliability against a broken channel file.
This is going to code as radical, but I always assumed it was derivable from bog-standard first principles that would fit in any economics class I sat in for my 40 credits:
the natural cost of these bits we sell is zero, so in the long run, if the bar is "just write a good & tested kernel driver", there will always be one more subsequent market entrant who will go too cheap on engineering. Then, they touch the hot wire and burn down the establishment.
That doesn't mean capitalism bad, but it does mean I expect only Microsoft is capable of writing and maintaining this type of software in the long run.
Ex. The dentist and dental hygienist were asking me who was attacking Microsoft on Friday, and they were not going to get through to the subtleties of 3rd-party kernel driver release gating strategy.
MS has a very strong incentive to fix this. I don't know how they will. But I love when incentives align and assume they always will, in the long run.
To answer some of my questions based on the "Preliminary Post Incident Review", the config file was indeed invalid, it was only checked with a (buggy) validator, and then released to the whole world at once. Never was the config file ever tested with the actual software that would read it in an actual environment like the customer machines that got this update.
They don't say why it was invalid or really what the file is, but it seems like it is some kind of relatively complex set of rules that are evaluated by the kernel module. Presumably they are manually authored and reviewed and it seems possible the bug was missed in review because this was a relatively new type of rule.
So this isn't a case of an incident that slipped through a rigorous testing and release process following industry best practice, but rather a "disaster waiting to happen". Further, CrowdStrike CEO George Kurtz should have known better, considering an analogous incident happened under his watch as CTO of McAfee in 2010.
The glaring question is how and why it was rolled out everywhere all at once?
Many corporations have pretty strict rules on system update scheduling so as to ensure business continuity in case of situations like this but all of those were completely circumvented and we had fully synchronised global failure. It really does not seem like business as usual situation.
> The glaring question is how and why it was rolled out everywhere all at once?
Because the point of these updates is to be rolled out quickly and globally. It wasn't a system/driver update, but a data file update: think antivirus signature file. (Yes, I know it can get complicated, and that AV signatures can be dynamic... not the point here.)
Why those data updates skipped validity testing at the source is another question, and one that CrowdStrike better be prepared to answer; but the tempo of redistribution can't be changed.
A customer should be able to test an update, whether a signature file or literally any kind of update, before rolling it out to production systems. Anything else is madness. Being "vulnerable" for an extra few hours carries less risk than auto-updates (of any kind) on production systems. As we've seen here. If you can point to hard evidence to the contrary, where many companies were saved just in time because of a signature update and would have been exploited if they'd waited a few hours, I'd love to read about it. It would have to have happened on a rather large scale for all of the instances combined to have had a larger positive impact than this single instance.
Is it realistic that there's a threat actor that will be attacking every computer on the whole planet at once?
I can understand that it's most practical to update everyone when pushing an update to protect a few actively under attack but I can also imagine policies where that isn't how it's done, while still getting urgent updates to those under attack.
Which CrowdStrike gets to bypass because they claim to be an antivirus and malware detection platform - at least, this is what the executives they've wined and dined into the purchase contracts have been told. The update schedule is independently controlled by CrowdStrike, rather than by a system admin, I believe.
From the article on The Verge it seems that this kind of update is downloaded automatically even if you disable automatic updates. So those users who took this kind of issue seriously would have thought that everything was configured correctly to not automatically update.
It seems like a none-of-the-above situation, because each of those should have really minimized the chances of something like this happening. But this is pure speculation. Even the most perfect organization engineering culture can still have one thing get through... (Wasn't there some Linux incident a little while back, though?)
Quality starts with good design, good people, etc. the process parts come much after that. I'd like to think that if you do this "right" then this sort of stuff simply can't happen.
If we have organization/culture/engineering/process issues then we're likely not going to get an in-depth public post-mortem. I'd love to get one, just for all of us to learn from it. Let's see. Given the cost/impact, having something like the Challenger investigation with some smart uninvolved people would be good.
You do remember Solarwinds right? This is an obvious high value target, so it is reasonable to entertain malicious causes.
Given the number of systems infected, if you could push code that rebooted every client into a compromised state you’d still have run of some % of the lot until it was halted. That time window could be invaluable.
Now, imagine if you screw up the code and just boot loop everything.
I’d say business wise it’s better for crowd strike to let people think it’s an own-goal.
The truth may be mundane but a hack is as reasonable a theory as “oops we pushed boot loop code to world+dog”.
> The truth may be mundane but a hack is as reasonable a theory as “oops we pushed boot loop code to world+dog”.
No it's not. There are many signs that point to this being a mistake. There are very few that point to it being a hack. You can't just go "oh it being a hack is one of the options therefore it is also something worth considering".
In a world of complex systems, a "confluence of improbable events" is the same thing as "a disaster waiting to happen". It's the Swiss cheese model of failure.
Has anyone looked into their terms and conditions? Usually any resulting damage from software malfunctioning is excluded. Only the software itself being unavailable may be an SLA breach.
Typically there would also be some clauses where CS is the only one that is allowed to determine an SLA breach, SLA breaches only result in future licence credits no cash, and if you disagree it's limited to mandatory arbitration...
The biggest impact is probably only their reputation taking a huge hit. Losing some customers over this and making it harder to win future business.
No big company is going to agree to the terms and conditions that are listed on their website, they'll have their own schedules for indemnification that CS would agree to, not the other way around. Those 300 of the Fortune 500 companies are going to rip CS apart.
They will still need to hire lawyers to prove this. Thousands of litigants. I am sure there is some tort which is not covered by the arbitration agreement that would give a plaintiff standing, no?
Commenter on stack exchange had an interesting counter:
In some jurisdictions, any attempt to sidestep consumer law may be interpreted by the courts as conspiracy, which can prove more serious than merely accepting the original penalties.
They really are delusional, as a security person crowdstrike was overvalued before this event, and to everyone in tech this shows how bad their engineering practices are.
but they are able to insert themselves into this many enterprise machines! So regardless of your security credentials, they made good business decisions.
On the other hand, this may open the veil for a lot of companies to dump them.
This reminds me of the vulnerability that hit jwt tokens a few years ago, when you could set the 'alg' to 'none'.
Surely CrowdStrike encrypts and signs their channel files, and I'm wondering if a file full of 0's inadvertently signaled to the validating software that a 'null' or 'none' encryption algo was being used.
This could imply the file full of zeros is just fine, as the null encryption passes, because it's not encrypted.
That could explain why it tried to reference the null memory location, because the null encryption file full of zeroes just forced it to run to memory location zero.
The risk is, if this is true, then their channel-file verification system is critically exposed by being able to load malicious channel files through disabled encryption.
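For what it's worth, the generic "alg: none" pitfall alluded to above looks roughly like this in miniature. This is not a claim about how CrowdStrike's verifier works (nobody outside the company knows that); it only shows why a zeroed header that decodes to a "no algorithm" value is dangerous if "none" is treated as "nothing to check":

#include <vector>

enum class Alg { kNone = 0, kEd25519 = 1 }; // a zeroed header decodes to kNone

// Placeholder standing in for a real signature verification routine.
bool signature_ok(const std::vector<unsigned char>&, const std::vector<unsigned char>&, Alg) {
    return false;
}

// BUGGY: "none" is interpreted as "nothing to verify", so an unsigned
// (or all-zero) blob sails straight through.
bool verify_buggy(Alg alg, const std::vector<unsigned char>& payload,
                  const std::vector<unsigned char>& sig) {
    if (alg == Alg::kNone) return true; // <-- the hole
    return signature_ok(payload, sig, alg);
}

// SAFER: allow-list the algorithms you accept; reject "none" and anything unknown.
bool verify_strict(Alg alg, const std::vector<unsigned char>& payload,
                   const std::vector<unsigned char>& sig) {
    if (alg != Alg::kEd25519) return false;
    return signature_ok(payload, sig, alg);
}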
The only thing I know about crowdstrike is they hired a large percentage of the underperforming engineers we fired at multiple companies I’ve worked at
You'd use WinDBG today. It allows you to do remote kernel debugging over a network. This also includes running Windows in a virtual machine, and debugging it through the private network connection.
SoftIce predates me, but when I was doing filesystem filter driver work, the tool of choice was WinDbg. Been out of the trade for a bit, but it looks to still be in use. We had it set up between a couple of VMs on VMware.
They're referencing an (in)famous video of a drunk/drugged/tired Orson Welles attempting to do a commercial; his line is "Ahhh, the... French... champagne has always been celebrated for its excellence..."
I don't think there's anything more to the inclusion of "French" in their comment beyond it being in the original line.
lol, I’ve lost count of how many CI systems I’ve seen that are essentially no-ops, letting through all errors, because somewhere there was a bash script without set -o errexit.
In my experience, testing data and config is very rare in the whole industry.
Feeding software corrupted config files or corrupted content from its own database often makes software crash. Most often this content is "trusted" to be "correct".
The question I have is, why doesn't Windows have a way to allow booting still without the faulting kernel module?
I know there's safe mode, but that's the nuclear option, and safe mode isn't really "usable".
Couldn't a lot of this have been avoided if Windows could just retry its boot after a BSOD without the faulting module, and then they could push out a new module with a fix shortly after?
Try solving some crackme's. They're binary executables of various difficulty (with rated difficulty), where the goal ranges from finding a hardcoded password to making a keygen to patching the executable. They used to be more popular, but I'm guessing you can still find tutorials on how to get started and solve a simple one.
Take this with a grain of salt as I’m not an SME, but there is a need for volunteers on reverse-engineering projects such as the Zelda decompilation projects[1]. This would probably give you some level of exposure, particularly if you have an interest in videogames.
Writing your own simple programs and debugging/disassembling them is a solid option. Windbg and Ida are good tools to start with. Reading a disassembly is a lot easier than coding in assembly, and once you know what things like function calls and switch statements, etc. look like you can get a feel for what the original program was doing.
You can compile your own hello world and look at the executable with x64dbg. Press space on any instruction and you can assemble your own instruction in its place (optionally filling the leftover bytes with NOPs).
First you need to learn assembly; second, you can start by downloading Ghidra and decompiling some simple things you use to see what they do.
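Building on the suggestions above, a tiny program like this one is a reasonable first target: compile it with optimizations off so the mapping stays obvious, then open the binary in x64dbg, Ghidra, or IDA and match each construct to its disassembly.

#include <cstdio>

// Small enough to map every source line to the disassembly, but it still
// contains a call, a switch (often lowered to a jump table), and a loop.
int classify(int x) {
    switch (x % 4) {
        case 0:  return 10;
        case 1:  return 20;
        case 2:  return 30;
        default: return 40;
    }
}

int main() {
    int total = 0;
    for (int i = 0; i < 8; ++i) {
        total += classify(i);   // a plain call you can step into
    }
    std::printf("%d\n", total); // library call: watch the calling convention
    return 0;
}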
I wonder what privilege level this service runs at. If it's less than ring 0, i think some blame needs to go to Windows itself. If it's ring 0, did it really need to be that high??
Surely an OS doesn't have to go completely kaput due to one service crashing.
It's not a service, it's a driver. "Anti"malware drivers typically run with a lot of permissions to allow spying on all processes. Driver failures likely mean the kernel state is borked as well, so Windows errs on the side of caution and halts.
I am genuinely curious what their CI process that passed this looks like, as well as if they're doing any sort of dogfooding or manual QA? Are changes just CI/CD'd out to production right away?
No, this is kernel space, and so while all addresses are 'virtual', an unmapped address is an address that hasn't been mapped in the page tables. Normally critical kernel drivers and data are marked as non-pageable (note: the Linux kernel doesn't page out kernel memory; the NT kernel does, a legacy of when it was first written and the memory constraints of the time). So if a driver needs to access pageable data it must not be part of the storage flow (and CrowdStrike is almost certainly part of it), and it must be at the correct IRQL (the interrupt request level; anything at or above dispatch, AKA the scheduler's level, has severe restrictions on what can happen there).
So no an unmapped address is a completely different BSOD, usually PAGE_FAULT_IN_UNPAGED_AREA which is a very bad sign
PAGE_FAULT_IN_NONPAGED_AREA[1]... was the BSOD that occurred in this case. That's basically the first sign that it was a bad pointer dereference in the first place.
(DRIVER_)IRQL_NOT_LESS_OR_EQUAL[2][3] is not this case, but it's probably one of the most common reasons drivers crash the system generally. Like you said it's basically attempting to access pageable memory at a time that paging isn't allowed (i.e. when at DISPATCH_LEVEL or higher).
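As a rough sketch of that IRQL rule (my own toy illustration of the standard WDK pattern, not anything from the CrowdStrike driver; PAGED_CODE, KeGetCurrentIrql and DISPATCH_LEVEL are the stock WDK names):

#include <ntddk.h>

/* Pageable data may only be touched below DISPATCH_LEVEL: at or above it,
   a page fault cannot be serviced and the system bugchecks. */
VOID TouchPageableBuffer(PUCHAR Buffer, SIZE_T Length)
{
    PAGED_CODE();   /* asserts IRQL <= APC_LEVEL on checked builds */

    if (Buffer == NULL || Length == 0)
        return;

    if (KeGetCurrentIrql() >= DISPATCH_LEVEL)
        return;     /* refuse rather than risk an IRQL_NOT_LESS_OR_EQUAL bugcheck */

    volatile UCHAR first = Buffer[0];   /* a fault here can be serviced safely */
    (void)first;
}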
R8 is 0x9c in that example, which is somewhat typical for null+offset, but in the twitter thread it's 0xffff9c8e0000008a.
So the actual bug is further back. It's not a null pointer dereference, but it somehow results in the mov r8, [rax+r11*8] instruction reading random data (could be anything) into r8, which then gets used as a pointer.
As an example to illustrate the sibling comments’ explanations:
char *array = NULL;
int pos = 0x9C;
int a = array[pos]; // equivalent to *(array + 0x9C) - dereferencing NULL+0x9C, which is just address 0x9C
This will segfault (or equivalent) due to reading invalid memory at address 0x9C. Most people would call array[pos] a null pointer dereference casually, even though it’s actually a 0x9C pointer dereference, because there’s very little effective difference between them.
Now, whether this case was actually something like this (dereferencing some element of a null array pointer) or something like type confusion (value 0x9C was supposed to be loaded into an int, or char, or some other non-pointer type) isn’t clear to me. But I haven’t dug into it really, someone smarter than me could probably figure out which it is.
What we are witnessing quite starkly in this thread is that the majority of HN commenters are the kinds of people exposed to anti-woke/DEI culture warriors on Twitter.
0x9c (156 dec) is still a very small number, all things considered. To me that sounds like attempting to access an offset from null - for instance, using a null pointer to a struct type, and trying to access one of its member fields.
It is pretty common for null pointers to structures to have members dereferenced at small offsets, and people usually consider those null dereferences despite not literally being 0. (However, the assembly generated in this case does not match that access pattern, and in fact there was an explicit null check before the dereference.)
Such an invalid access of a very small address probably does result from a nullptr error:
#include <cstdio>

struct BigObject {
    char stuff[0x9c]; // random fields occupying the first 0x9c bytes
    int field;        // lives at offset 0x9c
};

int main() {
    BigObject* object = nullptr;
    printf("%d\n", object->field); // reads from nullptr + 0x9c
}
That will result in "Attempt to read from address 0x9c". Just because it's not trying to read from literal address 0x0 doesn't mean it's not a nullptr error.
In every real world implementation anyone cares about, it's zero. Also I believe it is defined to compare equal to zero in the standard, but don't quote me on that.
> Also I believe it is defined to compare equal to zero in the standard, but don't quote me on that.
That's true for the literal constant 0. For 0 in a variable it is not necessarily true. Basically when a literal 0 is assigned to a pointer or compared to a pointer the compiler takes that 0 to mean whatever bit pattern represents the null pointer on the target system.
What? If you have a null pointer to a class, and try to reference the member that starts 156 bytes from the start of the class, you'll dereference 0x9c (0 + 156).
I found Windows confusing. In Linux speak, was this some kind of kernel module thing that CS installed? It's all I can think of for why the machines BSODed.
It was a binary data file (supposedly invalid) that caused the actual CS driver component to BSOD. However, they used the ".sys" suffix to make it look just like a driver, supposedly so that Windows would protect it from being deleted by a malicious actor. AFAIU.
Windows filesystem protection doesn't rely upon the filename, but on the location.
They could have named their files "foo.cfg", "foo.dat", "foo.bla" and been equally protected.
The use of ".sys" here is probably related to the fact it is used by their system driver. I don't think anybody was trying to pretend the files there are system drivers themselves, and a quick look at the exports/disassembly would make that apparent anyway.
'Analysis' of the null pointer is completely missing the point. The simple fact of the matter is they didn't do anywhere near enough testing before pushing the files out. Auto-update comes with big responsibility; this was criminally reckless.
The other issue is that they push to everyone. As someone whose last job had a million boxes in the wild, and who was very aware that bricking them all would kill the company, we would NEVER push to them all at once: we'd push to a few "friends and family" (i.e. practice each release on ourselves first), then do a few % of the customer base and wait for problems, then maybe 10%, wait again, then the rest.
Of course we didn't have any third party loading code into our boxes outside our control (and we ran Linux).
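To make the shape of that staging concrete - purely a toy sketch with invented ring sizes and bake times, not any company's actual policy:

#include <stdbool.h>
#include <stdio.h>

struct ring { const char *name; double fleet_fraction; int bake_hours; };

static const struct ring rings[] = {
    { "internal dogfood", 0.001, 24 },
    { "early adopters",   0.01,  12 },
    { "10% of customers", 0.10,  12 },
    { "everyone else",    1.00,   0 },
};

/* Stand-in for real telemetry; in practice this would query crash/BSOD rates. */
static bool healthy_after(const struct ring *r) { (void)r; return true; }

int main(void)
{
    for (size_t i = 0; i < sizeof rings / sizeof rings[0]; i++) {
        printf("deploying to %s (%.1f%% of fleet), baking for %d h\n",
               rings[i].name, rings[i].fleet_fraction * 100.0, rings[i].bake_hours);
        if (!healthy_after(&rings[i])) {
            printf("halting rollout and rolling back\n");
            return 1;
        }
    }
    return 0;
}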
I'm not overly familiar with CrowdStrike's processes, but I assume they are long-running. If it's all loaded into memory, e.g. a config, I can't see how you'd get any performance gain at all. It just seems lazy.
The girl on the supermarket checkout said she hoped her computer wouldn't be affected. I knowingly laughed and said, "You probably don't have it on your own computer unless you're a bank."
She said, "I installed it before for my cybersecurity course but I think it was just a trial"
Imagine if Microsoft sold you a secure operating system, like Apple does. A staggering portion of the existing cybersecurity industry would be irrelevant if that ever happened.
Most enterprises these days also run stuff like Crowdstrike (or literally Crowdstrike) on their macOS deployments. Similarly Windows these days is bundled with OS-level antivirus which is sufficient for non-enterprise users.
Not in the security industry, but my take is that basically the desktop OS permissions and security model is wrong for a lot of these devices, but there is no alternative that is suitable or that companies are willing to invest in. Probably many of the highest-profile affected machines (airport terminals, signage, medical systems, etc.) should just resemble a phone/iPad/Chromebook in terms of security/trust, but for historical/cost/practical reasons are Windows PCs with Crowdstrike.
CrowdStrike uses eBPF on Linux and System Extensions on macOS, neither of which needs kernel-level presence. Microsoft should move towards offering these kinds of solutions to make AV and EDR more resilient on Windows devices, without jeopardising system integrity and availability.
What really blew my mind about this story is learning that a single company (CrowdStrike) has the power to push random kernel code to a large part of the world's IT infrastructure, at any time, at their will.
Correct me if I'm wrong but isn't kernel-level access essentially God Mode on every computer their software is installed on? Including spying on the entire memory, running any code, deleting data, installing ransomware? This feels like an insane amount of power concentrated into the hands of a single entity, on the level of a nuclear submarine. Wouldn't that make them a prime target for all sorts of nation-state actors?
This time the damage was (likely) unintentional and no data was lost (save for lost BitLocker keys), but were we really all this time one compromised employee away from the largest-ever ransomware attack, or even worse?
It's not perfectly clear yet if CrowdStrike is able to push executable code via those updates. It looks like they updated some definition files and not the kernel driver itself.
But the kernel driver obviously contains some bugs, so it's possible that those definition updates can inject code. There might be a bug inside the driver that allows code execution (it happens all the time that some file parsing code can be tricked into executing parts of the data). I'm not sure, but I guess a lot of kernel memory is not fully protected by NX bits.
I still have the gut feeling that this incident was connected to some kind of attack. Maybe a distraction from another attack while everyone is busy fixing all the clients. During this incident security measures were surely lowered: lists of BitLocker keys printed out for service technicians to fix the systems. Even the fix itself was to remove part of the CrowdStrike protection. I would really like to know what was inside the C-00000291*.sys file before the update replaced it with all zeros. Maybe it was a cleanup job to remove something concerning that went wrong. But Hanlon's razor tells me not to trust my gut: "Never attribute to malice that which is adequately explained by stupidity."
For what it's worth, I 10000% agree with your gut feeling. Mine is a gut feeling too, so I didn't mention it on HN; we typically don't talk about these kinds of gut feelings because of the directions they become speculative in (plus the razor). But what you wrote is exactly what is in my head, fwiw.
Data was lost in the knock on effects of this, I assure you.
> largest-ever ransomware attack
A ransomware attack would be a terrible use of this power. A terrorist attack or cover while a country invades another country is a more appropriate scale of potential damage here. Perhaps even worse.
This is the mini existential crisis I have randomly. The attack area for a modern IT computer is mind bogglingly massive. Computers are pulling and executing code from a vast array of “trusted” sources without a sandbox. If any one of those “trusted” sources are compromised (package managers, cdns, OS updates, security software updates, just app updates in general, even specific utilities like xz) then you’re absolutely screwed.
It’s hard not to be a little nihilistic about security.
Well, kernel agents and drivers are not uncommon; however, anyone doing anything at scale with something touching the kernel typically understands the system they're implementing it on well. That aside, I gather from skimming around (so I might be wrong here) that people were implementing this because of a business case, not a technical case. I read it's mostly used to achieve compliance (I think via shifted liability). So I think it was probably too easy for this to happen, and so it happened: someone in the bizniz dept said "if we run this software we are compliant with whatever, enabling XYZ multiple of new revenue, clear business case!!!" and the tech people probably went "bizniz people want this, the bizniz case is clear, this seems like a relatively advanced business who know what they're doing, it doesn't really do much on my system and I'm mostly deploying it to innocuous edge-user systems, so seems fine shrug" - and then a bad push happened, and lots and lots of IT departments had had the same convo aforementioned.
Could be wrong here so if anyone knows better and can correct me...plz do!
> implementing this because of a business case not a technical case
There are some certification requirements to do pentests/red teaming, and those security folk will all tell them to install an EDR, so they picked CrowdStrike - but the security people have a very valid technical case for that recommendation.
It doesn't shift liability to CrowdStrike; that's not how this works. In this specific case they are very likely liable due to gross negligence, but that is different.
The OS vendors themselves (Microsoft, Apple, all the linux distros) have this power as well via their automatic update channels. As do many others who have automatically-updating applications. So it's not a single company, it's many companies.
That's true; I suppose it doesn't feel as bad because they're much larger companies and more in the public eye. It's still scary to think about the amount of power they wield.
"What really blew my mind about this story is learning that a single company (CrowdStrike) has the power to push random kernel code to a large part of the world's IT infrastructure, at any time, at their will."
Isn't that every antivirus software and game anticheat?
It is a well known fact that these companies who hold huge sway over the world's IT landscape are commonly infiltrated at the top levels by intelligence agents.
I see a paradox that the null bytes are "not related" to the current situation and yet deleting the file seems to cure the issue. Perhaps the CS official statement that "This is not related to null bytes contained within Channel File 291 or any other Channel File." is poorly worded.
My opinion is that CS is trying to say the null bytes themselves aren't the actual root cause of the issue, but merely a trigger for the actual root cause, which is that CSAgent.sys has a problem where malformed input vectors can cause it to crash. Well designed programs should error out gracefully for foreseeable errors, like corrupted config files.
If we interpret that quoted sentence such that "this" is referring to "the logical error", and that "the logical error" is the error in CSAgent.sys that causes it to crash upon reading a bad channel file, then that statement makes sense.
This is a bit of a stretch, but so far my impression with CS corporate communication regarding this issue has been nothing but abject chaos, so this is totally on-brand for them.
> My opinion is that CS is trying to say the null bytes themselves aren't the actual root cause of the issue, but merely a trigger for the actual root cause,
My opinion is they say "unrelated" because they are trying to say unrelated - and hence no, this was not a trigger.
It seems really scary to me that CrowdStrike is able to push updates in real time to most of their customers' systems. I don't know of any other system that provides a similar method to inject code at kernel level. Not even Windows updates, as they always roll out with some delay and not to all computers at the same time.
If you wanted to attack high-profile systems, CrowdStrike would be one of the best possible targets.
The amount of self-pwning that goes on in both corporate and personal devices these days is insane. The number of games that want you to install kernel-level anti-cheat is astounding. The number of companies that have centralized remote surveillance and control of all devices, where access to this is through a great number of sloppily managed accounts, is beyond spooky.
Exactly. It's ridiculous to open up all or most of a company's systems to such a single point of failure. We install redundant PSUs, backup networks, generators, and many more things. But one single automatic update can bring down all systems within minutes. Without any redundancy.
I mean centralized control of devices is great for the far more common occurrence of Bob from accounting leaving his laptop on the train with his password on post-it note stuck to the screen.
Absolutely, there are many reasons for why it's useful and helps keep the IT department smaller. However, there could be a little more paranoia around how access is managed, which is possible to do without severely impacting the usability of the tool and without making work unnecessarily difficult.
The scarier thought I've had -- if a black hat had discovered this crash case, could it have been turned into a widely deployed code execution vulnerability?
Start a story for them: "and then, the hackers managed to install a rootkit which runs in kernel mode. The rootkit has sophisticated C2 mechanism with configuration files pretending to be drivers suffixed with .sys extensions. And then, they used that to prevent hospitals and 911 systems around the world from working, resulting in delayed emergency responses, injuries, possibly deaths".
After they cuss the hackers under their breath exclaiming something like: "they should be locked up in jail for the rest of their lives!...", tell them that's exactly what happened, but CS were the hackers, and maybe they should reconsider mandating installing that crap everywhere.
I mean, kernel-level access does provide features not accessible in userspace. Is it also overused when other solutions exist? You bet.
Most people don't need this stuff. Just keeping shit up to date - no, not on the nightly build branch, but like installing Windows updates at least a day or two after they come out. Or maybe regular antivirus scans.
But let's be honest, your kernel drivers are useless if your employees fall for phishing or social engineering. See, then it's not malware, it's an authorized user on the system... just copying data onto a USB drive, or a rogue employee taking your customer list to your competition. That fancy-pants kernel driver might be really good at stopping sophisticated threats, and I'm sure the marketing majors at any company cram products full of buzzwords. But remember, you can't fix incompetent or malicious employees unless you're taking steps to prevent it.
What's more likely: some foreign government hacking Kohl's? Or a script kiddie social-engineering some poor worker by pretending to be the support desk?
Not here to shit on this product; it has its place and it obviously does a good job... (heard it's expensive, but most XDR/EDR is)
Seems like we are learning how vulnerable certain things are once again. As a fellow security fellow, I must say that Jia Tan must be so envious that he couldn't have this level of market impact.
To trigger the crash, you need to write a bad file into C:\Windows\System32\drivers\CrowdStrike\
You need Administrator permissions to write a file there, which means you already have code execution permissions, and don't need an exploit.
The only people who can trigger it over network are CrowdStrike themselves... Or a malicious entity inside their system who controls both their update signing keys, and the update endpoint.
Anyone know if the updates use outbound HTTPS requests? If so, those companies that have crappy TLS terminating outbound proxies are looking juicy. And if they aren't pinning certs or using CAA, I'm sure a $5 wrench[1] could convince one of the lesser certificate authorities to sign a cert for whatever domain they're using.
Even if the HTTPS channel is compromised with a man-in-the-middle attack, the attacker shouldn't be able to craft a valid update unless they also compromised CrowdStrike's keys.
However, the fact that this update apparently managed to bypass any internal testing or staging release channels makes me question how good CrowdStrike's procedures are about securing those update keys.
Depends when/how the signature is checked. I could imagine a signature being embedded in the file itself, or the file could be partially parsed before the signature is checked.
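As a sketch of that ordering concern (verify_signature and parse_channel_file are hypothetical stand-ins of my own, not CrowdStrike's real functions): the safe pattern is to authenticate the raw bytes before any of them ever reach the parser.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; stubbed out here purely so the sketch compiles. */
static bool verify_signature(const unsigned char *buf, size_t len)   { (void)buf; (void)len; return false; }
static bool parse_channel_file(const unsigned char *buf, size_t len) { (void)buf; (void)len; return false; }

bool load_update(const unsigned char *buf, size_t len)
{
    /* Authenticate the whole blob first, so a tampered or truncated file
       never reaches the far more complex (and fragile) parsing code. */
    if (!verify_signature(buf, len))
        return false;
    return parse_channel_file(buf, len);
}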
It's wild to me that it's so normal to install software like this on critical infrastructure, but questions about how they do code signing is a closely guarded/obfuscated secret.
Though, I prefer to give people benefit of doubt for this type of thing. IMO, the level of incompetence to parse a binary file before checking the signature is significantly higher (or at least different) than simply pushing out a bad update (even if the latter produces a much more spectacular result).
Besides, we don't need to speculate.
We have the driver. We have the signature files [1]. Because of the publicity, I bet thousands of people are throwing it into binary RE tools right now, and if they are doing something as stupid as parsing a binary file before checking its signature (or not checking a signature at all), I'm sure we will hear about it.
We can't see how it was signed, because that happens on CrowdStrike's infrastructure, but checking the signature verification code is trivial.
Kind of a side tangent, but I'm currently (begrudgingly) working on a project with a Fortune 20 company that involves a complicated mess of PKI management, custom (read: non-standard) certificates, a variety of management/logging/debugging keys, and (critically) code signing. It's taken me months of pulling teeth just to get details about the hierarchy and how the PKI is supposed to work from my own coworkers in a different department (who are in charge of the project), let alone from the client. I still have absolutely zero idea how they perform code signing, how it's validated, or how I can test that the non-standard certificates can validate this black-box code signing process. So yeah, companies really don't like sharing details about code signing.
This wasn't a code update, just a configuration update. Maybe they don't put config updates through QA at all, assuming they are safe.
It's possible that QA is different enough from production (for example debug builds, or signature checking disabled) that it didn't detect this bug.
Might be an ordering issue, and that they tested applying update A then update B, but pushed out update B first.
The fact that it instantly went out to all channels is interesting. Maybe they tested it for the beta channel it was meant for (and it worked, because that version of the driver knew how to cope with that config) but then accidentally pushed it out to all channels, and the older versions had no idea what to do with it.
Or maybe they thought they were only sending it to their QA systems but pushed the wrong button and sent it out everywhere.
That's assuming they don't do cert pinning. Moreover, despite all the evil things you can supposedly do with a $5 wrench, I'm not aware of any documented cases of this sort of attack happening. The closest we've seen are misissuances seemingly caused by buggy code.
If you have a privilege escalation vulnerability, there are worse things you can do: just make the system unbootable by destroying the boot sector/EFI partition and overwriting system files. No more rebooting into safe mode and no more deleting a single file to fix the boot.
This would probably be classified as a terrorist attack, and frankly it's just a matter of time until we get one some day. A small dedicated team could pull it off. It just so happens that the people with the skills currently either opt for cyber criminality (crypto lockers and such), work for a state actor (think Stuxnet), or play defense in a cybersecurity firm.
Microsoft has leaked keys that weren't used for code signing. I've been on the receiving end of this actually, when someone from the Microsoft Active Protections Program accidentally sent me the program's email private key.
Microsoft has been tricked into signing bad code themselves, just like Apple, Google, and everyone else who does centralized review and signing.
Microsoft has had certificates forged, basically, through MD5 collisions. Trail of Bits did a good write-up of this years ago.
But I can't think of a case of Microsoft losing control of a code signing key. What are you referring to?
The hard part is the deploying. Yes, if you can get control of the CrowdStrike deployment machinery, you can do whatever you want on hundreds of millions of machines. But you don't need any vulnerabilities in the deployed CrowdStrike software for that, only in the deployment servers.
Call me crazy but that is a real worry for me, and has been for a while. How long until we see some large corporate software have their deployment process hijacked, and have it affect a ton of computers that auto-update?
One of the most dangerous versions of this IMO is someone who compromises a NPM/Pypi package that's widely used as a dependency. If you can make it so that the original developer doesn't know you've compromised their accounts (spear-phished SIM swap + email compromise while the target is traveling, for instance, or simply compromising the developer themselves), you don't need every downstream user to manually update - you just need enough projects that aren't properly configured with lockfiles, and you've got code execution on a huge number of servers.
I'm hopeful that the fallout from Crowdstrike will be a larger emphasis on software BOM risk - when your systems regularly phone home for updates, you're at the mercy of the weakest link in that chain, and that applies to CI/CD and end user devices alike.
As always, a relevant xkcd[1]. I would not be surprised if the answer to “how many machines can be compromised in 24 hours by threatening one person” was less than 8 figures. If you can find the right person, probably 9+.
I mean, isn't that roughly the SolarWinds story? There is no real shortage of supply-chain incidents in the last few years. The reality is we are all mostly okay with that tradeoff.
I had that same one. If loading a file crashed the kernel module, could it have been exploitable? Or was there a different exploitable bug in there?
Did any nation states/other groups have 0-days on this?
Did this event reveal something known to the public, or did this screw up accidentally protect us from someone finding + exploiting this in the future?
Meta conversation: X hid both of these responses under "Show Probable Spam", even though both were pretty valid, with one even getting a reply from the creator.
I just don't understand how they still have users.
Relatedly, it's crazy to me how many people still get their news from X. I mean serious people, not just Joe Schmoe.
The probable spam thing was nuts to me too. My guess was it's maybe trying to detect users with lower engagement. Like people who aren't moving the investigation forward but are trying to follow it and be in the discussion.
One of the things to keep in mind is that Twitter had most of these misfeatures before Musk bought it.
The basic problem is, no moderation results in a deluge of spam and algorithmic moderation is hot garbage that can only filter out the bulk of the spam by also filtering out like half of the legitimate comments. Human moderation is prohibitively expensive unless you want to hire Mechanical Turk-level moderators and not give them enough time to do a good job, in which case you're back to hot garbage.
Nobody really knows how to solve it outside of the knob everybody knows about that can improve the false negative rate at the expense of the false positive rate or vice versa. Do you want less ham or more spam?
I agree the problem is hard from a technical level.
The problem is also getting significantly worse because it's trivial to generate entire pages of inorganic content with LLMs.
The backstories of inorganic accounts are also much more convincing now that they can be generated by LLMs. Before LLMs, backstories all focused on a small handful of topics (e.g. sports, games) because humans had to generate them from playbooks of best practices. Now they can be into almost anything.
I use X solely for the AI discussions and I actively curate who I follow, but where is there a better platform to join in conversations with the top 500 people in a particular field?
I always assumed that the reason legit answers often fall under "Show probable spam" is because of the inevitable reports coming in on controversial topics. It seems like the community notes feature works well most of the time.
If bad spam detection were such a big issue for a social platform, YouTube wouldn't be used by anyone ;). In fact it's even worse on YouTube: the same pattern of accounts with weird profile pictures copy-pasting an existing comment as-is and posting it, across thousands of videos, and it's been going on for a year now. It's actually so basic that I really wonder if there's some other secret sauce to those bots that makes them undetectable.
Well if it's just the comments, I think a lot of people just don't read those. In fact, it's a fair bit of effort just to read the descriptions with the YouTube app on some devices (e.g. smart TVs), and it's really not worth the effort to read the comments when users can just move on to the next video.
I don't necessarily think that's true anymore. YouTube comments are important to the algorithm so creators are more and more active in the comment section, and the comments in general have been a lot more alive and often add a lot of context or info for some type of videos. YouTube has also started giving the comments a lot more visibility in the layout (more than say, the video description). But you're probably right w.r.t platforms like TVs.
Before this wave of insane bot spam, the comments had started to be so much better than what they used to be (low-effort boomer spam). In fact, I think they were much better than the absolute cringy mess that comments on dedicated forums like Reddit have turned into.
I'd go so far to say that almost all responses that I see under "probable spam" are legitimate. Meanwhile real spam is everywhere in replies, and most ads are dropshipped crap and crypto scams with community notes. It's far worse than it's ever been before.
I believe that is dependent on your account settings. I block all comments from accounts that do not have a verified phone number, for example, and those get dropped into that bucket.
There’s literally not a better alternative and nobody seems to be earnestly trying to fill that gap. Threads is boomer chat with an instagram requirement. Every Mastodon instance is slow beyond reason and it’s still confusing to regular users in terms of how it works. And is Bluesky still invite only? Honestly haven’t heard about it in a long time.
Mastodon is a PERFECT replacement. But it'll never win because there isn't a business propping it up and there is inherent complexity, mixed with the biggest problem, cost.
No one wants to pay for anything, and that's the true root of every issue around this. People complain YouTube has ads, but won't buy Premium. People hate Elon and Twitter but won't take even an ounce of temporary inconvenience to try and solve it.
Threads exists, and I'm happy they integrate with ActivityPub, which should give us the best of both worlds. Why don't people use Threads? It's a little more popular outside the US, but personally I think the "algorithm" pushes a lot of engagement-bait nonsense.
> No one wants to pay for anything, and that's the true root of every issue around this. People complain YouTube has ads, but won't buy Premium.
Perhaps if buying into a service guaranteed that they would not be sold out then there would be more engagement. When someone signs up it is pretty much a rock-hard guarantee that their personal information will be marketed and sold to any entity with the money and interest to buy it - paying customers, free-loaders, etc.
When someone chooses to buy your app or SaaS then they should be excluded from the list of users that you sell or trade between "business partners".
When paying for a service guarantees that you're selling all details of your engagement with that service to unrelated business entities you have a disincentive to pay.
People are wising up to all this PII harvesting and those clowns who sold everyone out need to find a different model or quit bitching when real people choose to avoid their "services" since most of these things are not necessary for people to enjoy life anyway. They are distractions.
EDIT: This is not intended as a personal attack on you but is instead a general observation from the perspective of someone who does not use or pay for any apps or SaaS services and who actively avoids handing out accurate personal information when the opportunity arises.
In my experience, Mastodon is nice until you want to partake in discussions. To do so, you need an account.
With an account you can engage in civilized discussions. Some people don't agree with you, and you don't agree with some people. That's fine, maybe you'll learn something new. It's a discussion.
And then, suddenly, a secret court convenes and kills your account just like that; no reason will be given, no recourse will be available, admins won't reply, and you can do two things: go away for good, or try again on a different server.
I'm happy with a read-only Mastodon via a web interface.
But read-write? Never again, I probably don't have the correct ideology for it.
All the people I know that are still active on Twitter because they need to be "informed" are constantly sending me alarmist "news" that breaks on Twitter that, far more often than not, turns out to be wrong.
> Every Mastodon instance is slow beyond reason and it’s still confusing to regular users in terms of how it works.
I'll concede the confusing part but all the major Mastodon servers I interact with regularly are pretty quick so I'm not sure where that part comes from.
It is not so bad with Mastodon, but a lot of fedi software gets slower the longer it's been running. "Akkoma rot" is the one that's most typically talked about, but the universe of Misskey forks experiences the same problems, and Mastodon can sometimes absolutely crunch to a halt on 4GB of RAM even for a single-user instance.
Maybe the experience varies depending on where the user is located. Users near Mastodon servers (possibly on the US East or West Coast) may not feel the slowness as much as users in other parts of the world. I see noticeably slower response times when I use Mastodon from my location (Korea).
I think a lot of people use Hetzner. I notice slowness, especially with media, in Hong Kong. A workaround I've found is to use VPNs that seem to route over networks with better peering with local ISPs.
It is the best internet social feed for me as well. I use Pro a lot for following different communities, and there is nothing today that comes close to being on the edge of change online.
Some people don't jump on every fad out there. Most of the people who miss out on fads quickly realize that they aren't losing out on much simply because fads are so ephemeral. As far as I can tell, this is normal (though different people will come to that realization at different stages of their life).
While a fad (in this context) depends upon a company maintaining a product, the act of maintaining a product is not a measure of how long the fad lasts. Take Facebook, the product. I'm fairly certain that it is long past its peak as a communications tool between family, friends, and colleagues. Facebook, the company, remains relevant for other reasons.
As for ChatGPT, I'm sure time will prove it is a fad. That doesn't mean that LLMs are a fad (though it is too early to tell).
Sadly enough the "average" instagram user doesn't use threads. It's just a weird subset of them that use it, and imo it's not the subset that makes Instagram great lol. (It's a lot of pre 2021 twitter refugees, and that's an incredibly obnoxious and self centered crowd in my experience)