Sounds vaguely similar to the trickiest bug I ever had (and the first really hard bug I ever dealt with on my own). Mine was in AIX 3.2.5, circa 1995. We were buffering a latency-prone data stream between read and write processes using shared memory buffers. The original design used shmat(), which was limited to three buffers total on AIX. I rewrote the buffering using mmap() to create memory-mapped anonymous file buffers - the number was effectively unlimited and could be tuned via configuration. Around 100 buffers gave optimum performance, with a 300%+ throughput improvement in real-world conditions - a huge win on a huge problem!
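In modern POSIX spelling (not the actual AIX 3.2.5 code, which is long gone - buffer sizes and error handling here are just for illustration), the idea was roughly:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define BUF_SIZE (64 * 1024)
    #define NUM_BUFS 100          /* tunable; around 100 gave the best throughput for us */

    int main(void)
    {
        void *bufs[NUM_BUFS];

        for (int i = 0; i < NUM_BUFS; i++) {
            /* MAP_ANONYMOUS | MAP_SHARED so a forked reader/writer pair sees
               the same pages; systems without MAP_ANONYMOUS mapped /dev/zero. */
            bufs[i] = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (bufs[i] == MAP_FAILED) {
                perror("mmap");
                exit(1);
            }
        }
        /* ... fork(); the reading process fills buffers while the writer drains them ... */
        return 0;
    }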
Then it blew up in production. Like hard crashes shortly after starting. Upon investigation, I found that entire pages of mapped memory were being overwritten with nulls, more or less randomly (4096 bytes at a time).
Turned out the bug was in mmap() due to the order patches were applied on various servers. The dev/qa servers were patched at different times than the production servers. That, that was hairy. And for a junior programmer to have to explain this to the tech leads and IBM support - I don't even know how many times I heard variants on "What's wrong with your code, really? mmap() isn't buggy!"
Ah, the 1990s. No Stack Overflow. No SSL (not even HTTP). We did network programming by wrapping raw sockets in C and writing stream parsers in lex and yacc. Kids these days, they don't know hacking!
I once #defined strlen to return a short. This was in the days of the 16-bit to 32-bit transition on Windows, and I was running a project to eliminate spurious warnings from our code. There were several hundred places where the return value of strlen was being assigned to a short, every one of which created a warning.
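Reconstructed from memory of the idea rather than the actual code, it amounted to something like this (the preprocessor doesn't re-expand a macro inside its own replacement, so the inner strlen still calls the real function):

    #include <stdio.h>
    #include <string.h>

    /* Cast strlen's result to short so the hundreds of `short n = strlen(...)`
       assignments stop warning about truncation. */
    #define strlen(s) ((short)strlen(s))

    int main(void)
    {
        short n = strlen("hello");   /* no more "possible loss of data" warning... */
        printf("%d\n", n);           /* ...until someone feeds in a string longer than 32767 bytes */
        return 0;
    }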
I was a senior developer but it was my first commercial job--I'd been in academia previously. I ran the change by the most senior technical people in the company, all the way up to the guy who'd written the original application as employee #1. He OK'd it because, as he said, "strlen returns a short on the Mac anyway, and since our code runs on Mac as well as Windows it's a limitation we have to respect anyways."
A few years later the company stopped supporting the Mac.
A few years after that (and well after I'd left the company) one user site started getting crashes. I heard later that it took four senior devs a week to track down the cause, and much head-scratching because strlen was documented to return an int. They eventually found my #define, along with a comment that this eliminated so-many hundred warnings in the build, and that the change had been approved by the most senior people (I wasn't totally naive).
It turned out the problem at the specific site having the issue was users putting the entire contents of files into what amounted to tool-tips. It was totally unexpected user behaviour, but they'd found a place they could cache some useful data and we let them do it, so it should have worked.
Today I'd write a script that auto-edited all the cases where the problem occurred, and regression-test the hell out of it, but yeah: the '90s were a different time!
I'd just like to point out that at least in the draft C89 standard (http://port70.net/~nsz/c/c89/c89-draft.html#4.11.6.3), strlen() had the same prototype as it has today: in other words it returns size_t. Not int.
How about this one from just last year: I'm writing an application that encrypts packets using datagram TLS with OpenSSL. Take a raw packet, send it through OpenSSL, it comes out the same on the other end and everything looks fine.
Then I try load testing it and start getting mangled packets. Application data where there should be IP headers and vice versa. So I get out the debugger and find that the data going into SSL_write() on one end isn't the same as the data coming out of SSL_read() on the other end. No TLS error, just mangled data.
So I install the OpenSSL symbols and discover that the data going into the cipher is exactly the same as the data that gets decrypted on the other side, hence no HMAC verification failure. But OpenSSL by default compresses data before encrypting it, and decompression output is not the data that was originally compressed.
TLS requires records to be delivered reliably. Datagram TLS, by contrast, is like UDP. Packets can be lost. And if they are, you can't use stateful compression or the lost data creates a hole in the decompressor's stream and corrupts the output. But OpenSSL was doing exactly that. So disable compression and the problem disappears instantly. (After three days in a debugger.)
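The fix itself is one line when setting up the context. A sketch, assuming an OpenSSL version that has SSL_OP_NO_COMPRESSION (1.0.0 and later):

    #include <openssl/ssl.h>

    SSL_CTX *make_dtls_ctx(void)
    {
        SSL_CTX *ctx = SSL_CTX_new(DTLS_method());   /* DTLSv1_method() on older OpenSSL */
        if (ctx == NULL)
            return NULL;
        /* Never negotiate TLS-level compression: over an unreliable transport a
           lost record corrupts the decompressor's state on the other side. */
        SSL_CTX_set_options(ctx, SSL_OP_NO_COMPRESSION);
        return ctx;
    }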
FYI, using TLS compression makes you susceptible to the CRIME attack. I think I have a ticket with OpenSSL for them to turn that off by default, but I don't think they've done it, yet. Glad you got there accidentally!
In my day job, I have to write reams of complicated code that can slow down the system or make maintenance more annoying...just because a user can do something (even though doing that is unsupported).
That's what it means to write business-class software. Nobody worth having as a customer is going to build their business on your platform if your attitude whenever something goes wrong is "you shouldn't have been doing that in the first place".
(I am surprised to hear this story though. I actually found a bug in the NTFS buffer cache a few years ago which was introduced in (as I recall) Windows Server 2012. Maybe the Server organization are way more on the ball than the consumer OS organization, which is definitely possible. But they took it seriously and fixed it in a patch.)
> Nobody worth having as a customer is going to build their business on your platform if your attitude whenever something goes wrong is "you shouldn't have been doing that in the first place".
My favorite set of APIs is AWS. You know why? Because they've realized they hold two very weighty sticks that they can use when designing, and they've put them in place all over.
1. They can make any arbitrary message to the API cost the user money every time they send it, to disincentivize using that part of the API thoughtlessly. That's whether or not they expect this to be an actual revenue stream at the rates people are charged for reasonable usage.
2. They can put a "soft cap" on any arbitrary resource, so that you have to phone them and get the cap raised if you want more than [some reasonable number] of something. This likewise disincentivizes bad designs that use a nigh-infinite number of costly somethings to accomplish tasks that could be just as easily accomplished some more idiomatic, less costly way.
AWS doesn't prevent you from doing stupid things... but it makes you really not want to. I love it.
Have you ever tried to set up AWS IAM permissions for a user pursuant to the principle of least privilege? Because Amazon's APIs are about as far from friendly as you can get in this respect.
Their docs make it easy to make the mistake of thinking that fine-grained controls are available for most things, but when it comes to really important things like being able to segregate a production and a dev VPC, their APIs basically force you to grant permissions to everything or nothing.
Some examples of things I've hit:
Not being able to restrict a user to only change a specific routing table
Not being able to restrict a user to only change a specific elastic NIC
I'm consistently surprised at what's missing from their API and couldn't disagree more about being happy with it.
These things are possible... but this gets at another aspect of AWS's design in particular.
I do a lot of my AWS work in CloudFormation. When I hit a wall, the answer is pretty much always to stand up an EC2 instance that can speak SNS, grant it larger-than-necessary permissions to my VPC, teach CloudFormation about it as a custom resource type, and have it serve as a proxy for the not-configurable-enough resource, allowing it to assert its own policy and make third-party calls before making the real callback into your VPC [or not.] It's the AWS equivalent of writing a factory method to wrap a badly-written constructor.
To generalize that thought: IAM "users" are made to either be people (e.g. your developers, your ops people), or representative tokens for entire third-party organizations (e.g. a CI bot.) Despite the existence of IAM roles, IAM isn't really made to assert "machine-agent"-granular permissions.
Instead, what you really want is to imagine a third-party service running in the AWS cloud that does exactly what you want. You would grant that third-party's IAM user overly-wide permission to play with your VPC, but trust it to only do what it should, because, obviously, you have a business relationship and it would be dumb of them to abuse it.
As soon as you can see what API needs to exist, you can turn around and become that very same imaginary third-party: make a separate AWS account, stand up an API server in it that takes requests to do what your "clients" want, and then, in turn, make requests to the AWS APIs on their behalf to accomplish those things.
AWS isn't a high-level framework; it's a kit of low-level tools. (This is really what the PaaS vs IaaS distinction implies, I think.) AWS is built assuming that you're willing and able to take their tools and pipe/script them together to build the higher-level components you need. And, since AWS is for web services, that assumption comes in the form of expecting you to be able to pipe, hook, or wrap any of their APIs to/with/in your own API.
The 'fine-grain' of IAM varies considerably depending on which AWS service you're restricting. You can add extra flexibility with 'Conditions', which I'm sure you're aware of, but I think it's a bit of a misrepresentation to paint IAM as being poor quality. AWS is a very complex environment; I can't see how you could have a user-friendly yet fine-grained user control for something that complex. Anything you choose is going to require training in how to use it.
I wouldn't say I'm happy about it, but neither am I unhappy, and neither am I happy about anything in the world of security (also in today's task list is updating https cipher lists... again...). Not even the simplest thing in security is easy. For example, the basic concept of a password is simple, but actually implementing it? Ugh - it involves every layer from backend to frontend to user training (the hardest part - no sticky notes, no friendly phone calls, no passing around in emails...).
Anyway, for those not used to IAM 'Conditions', here's an example of their use. Conditions don't work for everything, so they're not a workaround to get fine grain everywhere, but they do add a lot of flexibility. The policy below allows Packer (an AMI builder) to destroy any EC2 instance, but only if it has the tag 'Name' set to 'Packer Builder'.
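A sketch of such a policy (the action and the tag key are assumptions about a typical Packer setup, not the exact policy I use):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "ec2:TerminateInstances",
          "Resource": "*",
          "Condition": {
            "StringEquals": {
              "ec2:ResourceTag/Name": "Packer Builder"
            }
          }
        }
      ]
    }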
No wonder everyone wants to be in the *aaS business. That sounds like the analog telco days, where long-distance calls were billed harshly to get people to keep their calls short, thus getting the most out of a low number of wires.
There's a balance, though. Oftentimes, you serve the business folks better by preventing them from doing something really silly, and instead letting them optimize their workflow not to include silliness.
Unfortunately, it's tricky to tell ahead of time which features are bad ideas and which aren't.
It's pretty easy at first order: features that users want are good ideas. I agree that at second order users want a lot of features that solve their problems in ways that are less clever than one might like, but in the vast majority of cases the right answer from development is, "Users know what they need better than I do, so if they ask for something, I'll do my best to implement it even if I don't totally understand why."
That answer exposes the deeper answer, too: "My job as a developer is to understand user needs so that our software can help them fulfill them, so if I don't understand why someone needs a feature I should dig in further before implementing it so I don't implement the wrong thing or the right thing in the wrong way." Not always possible given schedule constraints, though.
Admittedly, those rules don't seem to apply in this case, since if the OS allows you to do something that corrupts data, that's a problem no matter why the user wants to do it. If it can corrupt data, the OS shouldn't allow you to do it, end of story.
It's not that simple (speaking as someone who works on OS SCSI driver code for a living...).
I will amend your statement to say that the OS itself should never corrupt your data, which I agree with entirely.
The post was not super clear about exactly what was happening (or it may be that my limited knowledge of Windows storage internals is keeping me from understanding it), but it sounds like the NTFS client requested a cache flush and then was issuing writes during the flush. I don't know what contract these operations have, but it may very well be the case that the user was violating the contract. If Microsoft responded with "don't do that", this may be the case.
But wait! Shouldn't Windows prevent the data from being corrupted? Or shouldn't NTFS fail the writes in this case? Possibly. And most likely inserting the checks to make this happen would increase write latency for every NTFS client, even the ones that don't behave in this way.
This reminds me of another scenario I encountered, with Veritas VxFS running on top of AIX. The user initiated a space reclamation, which was sending what you can think of as a delete to the storage array. And the user was also writing data to the device at the same time. Due to a race condition (which I can describe for you if you really care), the legitimate user data would sometimes be deleted.
Should VxFS have protected users against this case? Yes. Was VxFS violating the SCSI protocol? No. Was the storage array violating the SCSI protocol? No. (Has VxFS fixed this bug, almost three years after I discovered it? No comment.)
It's always a lot more complicated than it seems on the outside.
Sorry I don't buy that. An operating system's "contract" with the user is the syscall API. There's no room for argument there.
Calling a write while flushing the same data from another process (as the OP reported) or thread is a perfectly legitimate set of operations. If these operations are not supposed to run simultaneously, that has to be enforced by the OS kernel. The "right" way would be for the OS to serialize these operations internally, but even returning an error for the write might be an acceptable (though not nice) way to handle it. What's absolutely not acceptable is randomly corrupting data.
I don't know anything about Windows driver programming so I don't know what the contract of the buffer flush operation is. Obviously, "sometimes fills your buffer with random data from memory pages owned by the operating system" is not part of that contract. This is a bug. I'm not trying to argue that it isn't.
All I'm saying is that no real, sane OS is going to be capable of protecting itself against every possible misuse. Remember that these are developers coding to an API we're talking about, not end users.
What if you have an API that takes an out pointer. I pass in data that I own, then I free that pointer in another thread. If I'm in userspace, I can blow up with SIGSEGV. If I'm in the kernel, maybe now you've just scribbled all over somebody else's memory. Shame on you for corrupting data.
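A contrived sketch of that abuse, with pthreads standing in for the two threads and a plain function standing in for the API (this is deliberately incorrect usage - whether it segfaults, scribbles on someone else's heap, or appears to work is pure luck):

    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static void *fill_result(void *out)      /* stand-in for the API taking an out pointer */
    {
        usleep(1000);                        /* pretend to do real work first */
        memset(out, 0xAB, 4096);             /* write through the caller's pointer */
        return NULL;
    }

    static void *rogue_free(void *p)         /* the "other thread" freeing the buffer */
    {
        free(p);
        return NULL;
    }

    int main(void)
    {
        void *buf = malloc(4096);
        pthread_t api, rogue;

        pthread_create(&api, NULL, fill_result, buf);
        pthread_create(&rogue, NULL, rogue_free, buf);  /* races with the write above */
        pthread_join(api, NULL);
        pthread_join(rogue, NULL);
        return 0;
    }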
APIs can always be abused. All the API developer can do is try to protect against obvious forms of abuse. I guarantee you every operating system that supports simultaneous multiprocess execution has some series of APIs that, when called in parallel, will corrupt your data.
>All I'm saying is that no real, sane OS is going to be capable of protecting itself against every possible misuse. Remember that these are developers coding to an API we're talking about, not end users.
Only for a wide interpretation of 'misuse'. In the case you described, the kernel is still doing what it was asked. In the article, it wasn't. Situations where one command invalidates another should be very uncommon and very documented.
A real, sane OS should be resilient to any syscalls in any order without triggering internal bugs.
This may come as a shock to you, but operating systems have bugs. And nobody is trying to defend NTFS populating your file with random OS pages.
So yes, an OS should allow you to call its API without triggering internal bugs. But that's kind of a straw man.
There's an interesting line that you're exploring, though, and I would like to dig deeper. You said "the kernel is still doing what it was asked". Let's ignore the bug described by the article for a moment and look into this.
Let's say I have two threads. Each one does a write to the same LBA range. You wind up with a file that doesn't represent the full contents of either write (say it's an 8192-byte write, and your file has 4096 bytes from the first write and 4096 bytes from the second write). Do you consider that to be a bug in the OS or a bug in the application client?
(It's obvious that two conflicting writes will result in someone losing their data. The part of the scenario I'm exploring is that in this case everyone lost their data.)
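In code, that scenario is roughly this (a hypothetical POSIX spelling of the question, not anything from the article):

    #include <fcntl.h>
    #include <pthread.h>
    #include <string.h>
    #include <unistd.h>

    static int fd;

    static void *writer(void *arg)
    {
        char buf[8192];
        memset(buf, *(char *)arg, sizeof(buf));  /* 'A' from one thread, 'B' from the other */
        pwrite(fd, buf, sizeof(buf), 0);         /* both threads write the same 8192-byte range */
        return NULL;
    }

    int main(void)
    {
        char a = 'A', b = 'B';
        pthread_t t1, t2;

        fd = open("contended.dat", O_CREAT | O_RDWR | O_TRUNC, 0644);
        pthread_create(&t1, NULL, writer, &a);
        pthread_create(&t2, NULL, writer, &b);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        close(fd);
        /* The file may well end up with 4096 bytes of 'A' and 4096 of 'B':
           is that an OS bug or an application bug? */
        return 0;
    }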
Wait, I think you and I are having very different conversations.
If the topic you're discussing is that the clients of an OS API should not encounter a bug...uh, I mean, I don't think you'll find anyone to disagree with you. It's not like Microsoft engineers put that bug in there thinking it would be OK. They even acknowledged that it was a bug...it just wasn't high priority because the caller (presumably; again, I know nothing about these syscalls) is not following best practices. Someone's pet bug got deprioritized. This is not news.
The topic I thought we were discussing was whether it's the OS's responsibility to prevent callers from misusing their APIs in a way that causes stupid things to happen, even when the caller is doing something stupid.
I think the latter is a more interesting conversation, but I apologize if I interrupted your discussion of the former.
"Wait, I think you and I are having very different conversations."
So it would seem.
"It's not like Microsoft engineers put that bug in there thinking it would be OK."
You say that now. But the reason for my strident response was what you said in your original post that I replied to:
"I don't know what contract these operations have, but it may very well be the case that the user was violating the contract. If Microsoft responded with "don't do that", this may be the case."
That is quite absurd on its face. I replied with how the contract was nothing more or less than the syscall API, and there was no margin for negotiation on the kernel's part when it comes to corrupting data when a user program calls those APIs in whatever order it pleases.
Subsequently, your arguments seem to have become more elaborate with a lot more caveats added. I think it has stopped being fruitful for me to respond at this point.
The bug wasn't just corrupting the data in the file after the weird sequence of API calls.
It was corrupting it with data from other files on the disk, which could have been sensitive.
Where this was particularly troubling was that an unprivileged user could then use this as an exploit/attack on the system to get it to leak pages of system files. This is where fun stuff like pass-the-hash begins.
My reading of the bug is that it would write random data from memory owned by the OS, not necessarily other files on disk. Certainly no better (and could very well be worse), but just a clarification.
(Going back and reading the post again, it seems that it's even worse than that, since a virtual environment hitting this bug could get data from the host. This opens an attack vector for a guest to bypass the hypervisor and compromise sensitive host data. I don't know if data from other VMs on the system could also be exposed in this way, but that would also be quite bad.)
Anyway, I think my rambling caused my point to get lost, because I wasn't trying to argue that Microsoft are justified in their "don't do that" comment. But no sane, realistic OS can protect against every hare-brained thing a driver developer is going to do.
Getting Microsoft to fix bugs is hard. We had a bug in the .NET runtime. It took ages even to reach agreement on the memory dumps, probably because bugs in the .NET runtime are like unicorns; even I had a level of disbelief that .NET was to blame.
Once you prove it, it's completely free. Ultimately memory dumps are the way to go when it comes to MSFT bug reports, if you can catch their bug red-handed and snap it to disk things go real smooth.
Yeah, I found a great one in .NET. If you used HttpWebRequest to fetch a URL that had an invalid GZIP header, the app would hard crash. This is because the request and decompression were being handled on a threadpool thread; an exception would be thrown and no one would be around to catch it. Fortunately this was in a web crawler, so it took quite a while to build a repro case and an exact diagnosis. I believe it was fixed 4 or 5 years after I reported it in 2008.
I always found it incredibly difficult to report anything to Microsoft and usually only get a 'we have reported it to engineers and it will be fixed some day' message.
To me, stuff like this is a powerful argument for using open-source software (or at least software which you can access the source for), whenever possible. When things go very wrong, at least you can dig into the source code and try to fix it.
It's certainly an argument for having source access. For OS / "platform" level bugs it's no panacea however - you may very well need to keep your workaround, either because you can't rely on end users upgrading to the latest version, or simply need a stopgap until you finish the patch, submit the patch upstream, address all the issues brought up in the review, have it integrated into the next stable release, have it trickle downstream into the stable releases of individual distros, etc etc etc...
Ahhh, the fun with caching in the file system. Incidentally, one of the trickiest bugs I encountered also had to do with the Windows Cache Manager.
It was the interaction between the Cache Manager and the Memory Manager in managing the rare transient ModifiedNoWrite state of cache pages when dealing with reentrant IO read requests. The cache page status became Modified when its content was filled in from disk, but it was marked as NoWrite to avoid being flushed out by the Memory Manager. The physical page backing the cache page couldn't be reused since it was dirty (Modified), but the Memory Manager couldn't flush it out (NoWrite). Slowly, over time, as more pages were read, the system would run out of physical pages.
The Cache Manager was supposed to change the cache page status back to Standby after the read returned from the lower layer, but with reentrant IO read requests it wouldn't do so when the upper-layer IO request buffer was passed straight down. The workaround was to allocate a separate buffer to interact with the Cache Manager and copy the content back to the upper-layer IO request buffer, incurring an extra copy.
In the end I knew the Windows Cache Manager and Memory Manager far more intimately than I ever needed to.
The experience of dealing with Microsoft's support sounds exactly like what I had to deal with. In my situation there was no workaround possible short of rebooting the system (it was a kernel resource leak), so after spending 3 months iterating through different ways to reproduce the problem (including "we don't support Windows on VMware" after they had asked me to send them a VM image to reproduce the problem), and going through 3 levels of support, I got to the people who were able to get me a fix. Alas, it was a private patch which was only available upon request, which didn't help much as I was working for an ISV.
Similar, but the NTFS one was worse in that it was potentially a security vulnerability, since the corrupted data that it wrote inappropriately into the middle of files came from elsewhere in the filesystem, rather than zeroes.
(Software is hard and often buggy, filesystems included. Check your backups! Even if your filesystem is perfect, I spilled water on my laptop yesterday, and you could, too.)
Yeah, I like to extend this out and make it generic like this:
If you don't test it, how do you know?
For example, one person told me he can't understand antivirus software and why people buy it, because he never got a virus. I asked him, "How do you know you didn't get a virus?" He just looked at me, not saying a word. I hope my point got across, though. If you aren't checking, you don't know. The same could be said about hacking these days: you secure your system, and that's good, but if you don't have something to detect hackers, you are the same as the guy without antivirus and the guy without tested backups. You just have no idea whether or not you are protected.
Though to be fair, the people telling me that I might have a virus are the people who want me to give them money.
"How do you know you didn't get a virus?"
I don't. But it's not epistemically clean to let other people set your priors for things like risk, if they have a financial interest in making you worry.
When you reported the bug did you include a repro? I would appreciate it if someone could post it here.
To be specific, I wanted to know the specific conditions the article talks about under which this causes an issue. Flush will cause writes (obviously) so issuing writes to a file which is pending a flush is an interesting scenario.
Ah, reminds me of a Windows bug that I chased in Win2000 Server. (For all I know it might still be there - I stopped writing Windows code in 2007.)
When you wrote to a file (using WriteFile or fwrite()), it first extended the file length and then committed the buffer. This is supposed to be atomic - that is, you should never be able to see the length already extended but the data not yet there. And it apparently was atomic if both reads and writes came from the local machine, or both came from the network. However, if the write was on the local machine but the read was from the network, locking was missing, and it WAS possible to read zeros instead of the real data (but only because of a race condition - reading the same file again later would give the expected answer).
Tried to get Microsoft to at least confirm this bug, to no avail - there was no one interested in talking to a lone freelance developer back then.
I think the trickiest bug I've heard of was USAF F-22 jets losing all computer systems (navigation, communication, etc.) as they crossed the International Date Line while flying to Hawaii for the first deployment on the island.
The flight of four jets was able to return to the US mainland only because the accompanying tanker was able to guide them back.
I hate it: skinny text, low contrast, inconsistent background. It's everything wrong with modern web design. If, say, the gradient went away as you scrolled down, I'd be fine with it, but since it's fixed, it's distracting.
Agreed! I genuinely wanted to read the article but found myself squinting just to try to read the text on that awful background. It was excruciating, so I just gave up.
I might agree with at least one thing the MS guy said. I really do not trust the OS, so I would be zeroing out memory. That's me; I am an untrusting soul.
Zeroing your own pages won't help. What if the OS reads your super-secret file into cache pages you have no control over, and those cache pages somehow get written into another process's pages?
Oh, I'm so sorry: the destination had only just been allocated for you, writing to it caused some other pages to be evicted, other processes to be scheduled, and the bug to be triggered, and your data is still there. You lost.
Same here. I think it's a rare person who's been living and working with computers for more than 20 years and hasn't developed an innate distrust of them, or at the very least a subtle set of superstitions about how they work, how they're supposed to work, and the best way to get your work done.