I'm now wondering if I have enough material to do an interesting writeup for my time as a CPU bug-hunter in verification.
The client (a now vanished startup) had a small 8-bit CPU design which they wanted validation for, using the technique of executing random sequences of instructions and comparing the result against an emulator. We wrote the emulator independently from their architecture description. Given that most instructions were a single byte plus arguments and most of those were valid, the test coverage was pretty thorough. All looked fine until I added support for interrupts, at which point we discovered that an interrupt during a branch would not return to the correct point in execution.
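For anyone curious what that kind of differential test loop looks like, here's a minimal C sketch; the dut_*/emu_* hooks are placeholder names for the RTL-simulation and emulator interfaces, not the actual tool we used:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical 8-bit machine state; the real design's registers differed. */
    typedef struct {
        uint8_t a, x, flags;
        uint16_t pc, sp;
    } cpu_state_t;

    /* Assumed hooks into the RTL simulation and the independent emulator. */
    extern void dut_load(const uint8_t *prog, size_t len);
    extern void emu_load(const uint8_t *prog, size_t len);
    extern cpu_state_t dut_step(void);   /* run one instruction on the design   */
    extern cpu_state_t emu_step(void);   /* run one instruction on the emulator */

    int main(void) {
        uint8_t prog[256];
        srand(12345);                      /* fixed seed so failures reproduce */
        for (size_t i = 0; i < sizeof prog; i++)
            prog[i] = (uint8_t)rand();     /* most byte values decode to valid opcodes */

        dut_load(prog, sizeof prog);
        emu_load(prog, sizeof prog);

        for (int step = 0; step < 10000; step++) {
            cpu_state_t d = dut_step(), e = emu_step();
            if (d.pc != e.pc || d.a != e.a || d.flags != e.flags) {
                fprintf(stderr, "mismatch at step %d: dut pc=%04x emu pc=%04x\n",
                        step, (unsigned)d.pc, (unsigned)e.pc);
                return 1;
            }
        }
        return 0;
    }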
Verifying security properties of processors is really hard; you can go looking for specific types of failure, but I'm not aware of a general way of proving no information leakage.
I've done the same line of work in a larger-scale armv8 processor.
I think there is no answer that doesn't involve sign-offs on certain things.
In the situation you described, we resorted to fully randomizing everything, and then adding constraints to ensure it remained legal. This is a tedious process, one that leads to multiple false-positive failing tests.
However, the opposite situation, writing constrained stimulus from the beginning, leads to gaps where you didn't think of randomizing something, or you over-constrained it.
It's similar to a top-down vs bottom-up approach: one is difficult, tedious, and riddled with false positives, whereas the other is easier and simpler but can hide bugs.
For issues like the one you described, we implemented randomized interface behavior. Interrupts are always meant to be program-order invisible, so you can randomize those (imagine 100 interrupts back to back, or 1 every N cycles where N is random but smaller than 5, or 500, etc.). Pure random isn't interesting on its own; you need controlled randomness. From that point, you have to start adding constraints for illegal/impossible situations, and you have to be extremely careful about these. We found a bug a week before tapeout after a senior architect forced a junior engineer (wonder who this is) to add a constraint. He used his seniority to sign it off without going through a group review, and it was a mistake.
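A minimal sketch of the interrupt-randomization idea (the names are illustrative, not a real verification framework):

    #include <stdlib.h>

    /* Assumed testbench hooks: pulse the interrupt line, advance one cycle. */
    extern void assert_irq(void);
    extern void run_cycle(void);

    /* Inject an interrupt every N cycles, with N drawn uniformly from
     * [1, max_gap].  max_gap is the knob: 1 means back-to-back interrupts,
     * larger values spread them out. */
    void run_with_random_irqs(int total_cycles, int max_gap, unsigned seed) {
        srand(seed);
        int next_irq = 1 + rand() % max_gap;
        for (int cycle = 1; cycle <= total_cycles; cycle++) {
            if (cycle == next_irq) {
                assert_irq();
                next_irq = cycle + 1 + rand() % max_gap;
            }
            run_cycle();
        }
    }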
> but I'm not aware of a general way of proving no information leakage.
As I understand it, the current consensus is to use a tagged architecture, then show that there are no observable differences when the tags involve secret data and the value of the tagged data changes.
There are a few rubs here. One is that this is pretty hard to do. The other is that no one wants to pay the overhead of using a tagged architecture. Yet another is that deciding what "observable difference" means is challenging (too little, and you probably miss attacks; too much, and you will probably discover information leaks).
Further, this kind of rigorous system hasn't been rigorously empirically evaluated by a third party (as far as I know, perhaps in part because it's so hard to create these systems). "Empirically evaluated?" you scoff, "there's a proof, what's the point of testing?" Well...
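To make the tagging idea slightly more concrete, here's a toy C sketch of the non-interference check described above: run a model of the design twice, identical except for the value of the secret-tagged data, and diff whatever an attacker is assumed to observe. All the hooks here are placeholders, not a real verification flow:

    #include <stdint.h>
    #include <string.h>

    #define TRACE_LEN 1024

    /* Whatever the attacker is assumed to observe; here, the trace of memory
     * addresses touched, as a stand-in for cache/timing side channels. */
    typedef struct {
        uint32_t addr_trace[TRACE_LEN];
        int trace_count;
    } observation_t;

    /* Assumed hook: run the design model with a given value for the
     * secret-tagged data and record the observable footprint. */
    extern observation_t run_model(uint8_t secret_value);

    /* Non-interference check: if changing only the secret changes anything
     * observable, the secret leaks. */
    int secret_is_observable(void) {
        observation_t a = run_model(0x00);
        observation_t b = run_model(0xFF);
        if (a.trace_count != b.trace_count)
            return 1;
        return memcmp(a.addr_trace, b.addr_trace,
                      (size_t)a.trace_count * sizeof a.addr_trace[0]) != 0;
    }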
The overhead of tagged architectures is to some extent a myth, caused by the abysmal performance of certain implementations (e.g. iAPX, whose performance problems are AFAIK caused mainly by funky instruction encoding) and by performance issues with the "straightforward" way of running C code on such architectures.
I remember one of my professors in college talking about tracking down a register leak in the renaming block of an out of order processor he was working on. If you thought memory leaks could be annoying to deal with...
I've used this product and you can't jam the entire chip into it. You can only jam pieces of it before the search space is so big it runs forever. It is useful for smaller pieces, but it will eventually go critical and melt down your server farm if you keep growing the problem size. That was my experience a few years back.
The formal verification of the specs, microcode, and security policy is held in ACL2. Part of that is embedding a separation kernel a la the MILS or SKPP model. It also has triplicated registers for fault tolerance. Then they integrate high-level languages like SPARK and microCryptol with it using certified compilers or equivalence checking.
Most hardware work in information-flow analysis and control is currently focused on making hardware that controls bad software. I don't think most of them consider a CPU failure. If anything, they might be made more vulnerable by the practical strategy of reusing existing cores and building info-flow components that operate side-by-side with them. Here's an example from that area, though, since those designs could be modified to address recent concerns.
Article Excerpt: "So a speculatively-executed xdcbt was identical to a real xdcbt!"
I'd never thought about it until I saw the above line in the article, and then the thought went something like this:
"Assembler instructions which might never have the conditions met for their execution during a program's runtime might be speculatively executed nonetheless, and this, depending on the nature of the instruction executed, might have huge ramifications up to and including program incorrectness and even program failure."
In other words, your absolutely 100% deterministic Turing machine (or programs that you write on it that you deem to be 100% deterministic) -- may not be quite so deterministic when viewed against these understandings...
It adds a whole new dimension to what must be thought about when writing assembler code...
The machine is still deterministic; it just doesn't conform to the assumption. It's a bug, and you can only account for it when writing assembly code if you already know about it.
There is lots of hardware with strange bugs out there, and in some setups (embedded devices, mission critical systems etc.) it can be easier to work around them instead of trying to fix the hardware. I know of a medical systems manufacturer which still sticks to some incredibly old CPUs, a home-grown operating system and gcc 2.95 because after 15 years they think they at least know about all the problems and how to work around them. Fixing the issues would pretty much require them to re-validate the whole design and repeat all the testing, which would take several years.
In production though the machine will not behave deterministically from your perspective, because you do not have all the inputs to the machine. Some of the inputs are controlled by a third party (the hypervisor, or worse, another VM running on your EC2 host that is actively manipulating branch predictor data to control the speculative execution for your application.) Even if you capture every operation that runs inside your VM, including register states, timing, etc, you won't be able to reproduce the actual behavior that occurred on the host because you don't know what happened outside the VM.
One could argue that the introduction of cloud computing (using virtualization) has converted the machines we run on from deterministic to sorta-deterministic. It just happens that until now we haven't been hurt by it much. Now everyone's paying for it, to the tune of a sizable performance hit on syscalls.
A medical company using old processors and gcc2 for their system is not running on AWS.
But still no. Deterministic means the input maps consistently to the output; the hardware, in this sense, is doing the mapping. Deterministic does not mean that for all x the only mapping is f(x); it means that when f(x) is the mapping, f(x) does not change to f'(x). However, g(x) is a perfectly deterministic possibility for a mapping too. The point being: two pieces of hardware, f(x) and g(x), can each independently map x deterministically, and the existence of two does not affect whether it's possible to deterministically map x. The reality, and I think your point, is that if you are unaware a different piece of hardware is performing the mapping, the result can appear nondeterministic. But the original point of the medical-company anecdote is that this is only an appearance, and in some domains people go to great lengths to intimately understand their f(x) and make sure they're not using a g(x) they don't understand.
If you're looking for deterministic timing, that generally goes out the window the moment you add caches, at least non-manual ones. Timing aside, if you can tell the difference between a processor that speculatively executes and one that doesn't, then something is seriously wrong, which unfortunately includes xdcbt.
Awesome story! I'm curious though why the branch predictor was running the xdcbt instruction if "The symptoms were identical. Except that the game was no longer using the xdcbt instruction".
Was the game no longer "using the xdcbt instruction", but the branch predictor caused issues, because they put a jmp instruction in front of it instead of removing the instruction entirely?
As I understand it, the function that used the xdcbt instruction had an oversight that led to the first set of crashes, so it was rewritten to fix that issue. Even with the fix, the game programmer decided to remove the PREFETCH_EX flag, which is what caused the copy routine to use xdcbt in the first place. So xdcbt ended up in the compiled code, but the flag that caused it to be used wasn't there. The branch predictor ran it anyway, in an unsafe way, leading to the next set of crashes.
I think that the developer didn't yet have the fixed version. They thought they would be safe because they weren't passing PREFETCH_EX, but the branch predictor decided otherwise.
There were other ways that xdcbt could cause problems (what if the block of memory was freed before being purged from L1) so who knows, and I don't remember for sure.
And the branch predictor would decide the xdcbt branch was most likely, execute xdcbt, and then realize "oops" and do it the safe way. Which would be okay, except xdcbt is evil: it breaks cache coherency.
Any if statement could cause this, doesn't have to be feature flags.
An untrained branch predictor could still take the "wrong" path even if none of the code that calls that if statement takes the bad path. That's presumably what happened here, the calling program removed all instances of PREFETCH_EX but the CPU would still sometimes execute that branch, such as on the first run through that function.
This is a lot like the problem of "don't think of a pink elephant". The mere existence of the "evil opcode" in the library means the CPU might execute it under certain circumstances.
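For anyone who hasn't read the article, a sketch of the pattern in question (the flag name comes from the article; the prefetch "intrinsics" are placeholders for whatever the 360 compiler actually exposed):

    #define PREFETCH_EX 0x1

    /* Placeholder prefetch "intrinsics". */
    extern void prefetch(const void *p);            /* dcbt-style, safe      */
    extern void prefetch_extended(const void *p);   /* xdcbt-style, skips L2 */

    void copy_block(void *dst, const void *src, unsigned len, unsigned flags) {
        if (flags & PREFETCH_EX)
            prefetch_extended(src);   /* evil path: breaks coherency if the line
                                         is later fetched through L2 */
        else
            prefetch(src);            /* safe path */

        /* ... actual copy elided ... */
        (void)dst; (void)len;
    }

    /* Even if every caller stops passing PREFETCH_EX, an untrained or aliased
     * branch predictor can still steer execution down the prefetch_extended()
     * path speculatively, and on this chip a speculative xdcbt has the same
     * side effect as a real one. */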
Even if they had the fixed version, the branch predictor could still decide to predict the instruction would be hit, right? Your ultimate analysis seems correct: the instruction is unsafe to have anywhere.
Would it be fair to say that having any sequence of bytes in memory which looks like the xdcbt instruction (even if those bytes are just data) is unsafe? Given that a stale entry in the branch prediction table might end up pointing at those bytes.
PowerPC instructions are always four-byte aligned, and the Execute bits are checked very early, so the only instructions the developer had to care about were those that were compiled into the code segment.
SolarNet addresses one aspect of the issue but there's another important one here for PowerPC. Every instruction has to start on a 4-byte boundary[1] so if a sequence of bytes that make up the Evil Instruction is present but at a 1 byte offset you're still perfectly safe. This isn't true for x86, where if you start executing at byte 1000 you might get one valid sequence of instructions but if you start executing at byte 1001 you might get another equally valid but malicious sequence[2].
[1] Unless there's some variable-width mode like ARM's Thumb that I'm not aware of, but even then you're limited to 2-byte boundaries.
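To make [2] concrete, here's a textbook x86-64 example (not from the article) of the same bytes decoding as two different instruction streams depending on where you start; PowerPC's fixed 4-byte alignment rules this whole class of thing out:

    /* The same five bytes, decoded from two different starting offsets: */
    const unsigned char overlap[] = {
        0xB8, 0x0F, 0x05, 0xC3, 0x90
        /* from offset 0:  B8 0F 05 C3 90   mov eax, 0x90c3050f  (harmless) */
        /* from offset 1:  0F 05            syscall                         */
        /*                 C3               ret                             */
        /*                 90               nop                             */
    };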
Data and code are not quite so interchangeable (thankfully). Typically programs have code segments and data segments, and processors actually have support for specifying the difference in behavior. A branch predictor would probably not run against pointers to memory segments without the execute bit set.
Actually, this is exactly the assumption that led to the Meltdown bug. Intel's speculative (non-retired) pipeline executes instructions that access memory and load it into the cache, delaying the check that the memory is accessible in the current context until later (just before instruction retirement).
It's certainly possible that a CPU could delay the same sort of exception processing for "NX" flags until later in the pipeline, but still before retirement, allowing a problem similar to Meltdown.
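For reference, the simplified shape of the Meltdown access pattern (after the Meltdown paper; probe and kernel_addr are illustrative names, and a real PoC needs a signal handler or TSX to survive the fault):

    #include <stdint.h>

    extern uint8_t probe[256 * 4096];   /* one page per possible byte value */
    volatile uint8_t sink;

    void transient_read(const volatile uint8_t *kernel_addr) {
        uint8_t secret = *kernel_addr;   /* faults architecturally...            */
        sink = probe[secret * 4096];     /* ...but may execute transiently first,
                                            leaving a cache footprint the fault
                                            can't undo */
    }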
... unless they suffer from the same problem as Intel CPUs in the Meltdown fiasco, where the branch prediction runs concurrently with the security checking. The branch predictor hits and speculates, firing the bug; the security check then decides "that's an invalid branch" and discards it, but it's too late, the bug's side effect has occurred.
But I think for it to happen randomly, you'd need an invalid pointer that just so happens to jump to what "becomes" an instruction with a bug, plus a branch. That seems pretty rare at first glance.
84 million Xbox 360s in the wild so it may be a bit late for that :-) Consoles have very predictable patterns of execution and only run a limited set of software, so it's not so much a problem there. It would be a much more serious problem if this was a chip for a general purpose PC.
The 360 didn't open up the ability for titles to generate their own code and run it. Some internal features of the 360 did (the back compat stuff, and some other projects I can't talk about), but in general you couldn't JIT without some special permissions bits.
The issue that xdcbt is supposed to solve also shows up at a completely different level: performing a full disk-to-tape backup would slow systems down because the entire disk would be copied through the OS buffer cache, evicting data used by other processes.
The UNIX fix for this was to use a raw device or O_DIRECT to bypass the buffer cache.
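For reference, the O_DIRECT approach on Linux looks roughly like this; alignment requirements vary by filesystem and device, 4096 is just a common safe choice:

    #define _GNU_SOURCE              /* O_DIRECT is a Linux extension */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read one block while bypassing the OS page cache, so a bulk copy doesn't
     * evict data other processes are using.  O_DIRECT requires the buffer,
     * offset, and length to be suitably aligned. */
    int read_uncached(const char *path) {
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0)
            return -1;
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) { free(buf); return -1; }
        ssize_t n = read(fd, buf, 4096);
        close(fd);
        free(buf);
        return n < 0 ? -1 : (int)n;
    }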
Maybe Intel's new cache partitioning feature offers a similar fix, see:
Actually, in the comments someone mentions using cache partitioning for security. Maybe the threads used by JIT code could be placed in their own cache partition to avoid some of Spectre.
Correct me if I'm wrong, but won't the Meltdown/Spectre bugs allow people to jailbreak pretty much any OS/device with a CPU that "supports" these bugs? This could potentially open a lot of currently closed devices to people.
- Meltdown, on affected processors (...basically anything from Intel), gets you the ability to read anywhere in memory, whether you could legally map it or not (e.g. you can read from kernel, user, or w/e memory as an unprivileged user, albeit on the order of bytes per second)
- Spectre affects approximately anything with a cache and speculative execution, but only buys you reading anywhere that you could legally map, e.g. you can read anything the process you're in could legally read, but you can't, say, read kernel memory with it as an unprivileged user.
Either way, you don't get an arbitrary write, but you could maybe use it to make another exploit more reliable with the additional information you get (a big danger of the Spectre family is using it to help you read data outside your sandbox but still in your "process", e.g. stealing password auto-fill data from inside JavaScript, or using it to defeat ASLR on your sandbox, or ...).
Which is to say, the set of attacks allows kernel memory to be read from userspace:
> If the kernel's BPF JIT is enabled (non-default configuration), it also works on the AMD PRO CPU. On the Intel Haswell Xeon CPU, kernel virtual memory can be read at a rate of around 2000 bytes per second after around 4 seconds of startup time.
True, but sometimes just getting a private key from memory will be enough to gain access to the system, as it could give researchers the ability to build executables that look like they came from the manufacturer/developer.
The problem with X360 is it can't run arbitrary native code; it can only run code digitally signed by MS. The hypervisor refuses to run unsigned code, and the key never leaves the CPU, i.e. it is very hard to tamper with.
X360 can execute JavaScript in the web browser, and .NET in (now discontinued?) XNA. I'm not sure it's possible to exploit the bug if you can only run JavaScript or .NET; both are very high level and have a strong security model on top of what's in the hardware/OS/hypervisor.
You could have exploited this using XNA, potentially, yes. But I don't think the software for running your own builds on the hardware works anymore, so the threat has passed. These attacks certainly reduce the likelihood of a similar service appearing on current-gen consoles.
The paper for Spectre includes a JavaScript proof of concept - https://spectreattack.com/spectre.pdf, though I don't know if the version of IE on the 360 supports web workers (though, unpatched, it's exceedingly likely there's a way to exploit it from JavaScript that doesn't involve web workers).
Cache/RAM always looks like basically solid colors because it’s so densely packed with a repeating structure.
How would you know it’s not the green stuff? I don’t know, perhaps by size. I would imagine the green stuff is some other kind of cache or memory (op cache? TLB buffers? patchy SRAM).
This is what I know from seeing these shots online for years. I’m no expert by any means.
The green stuff might have been the VMX register files - those things were pretty huge (128 registers per thread, each register 16 bytes, so 2 × 128 × 16 bytes = 4 KiB per core). While the L1 cache was higher capacity (32 KiB), as a general rule of thumb faster memory uses more die area per bit.
But I seem to remember being told that the green areas were the VMX pipelines. I don't remember clearly, and the hardware people I asked at the time were uncertain - they were too far removed from that aspect of the hardware.
The other reason to assume that the blue is the caches is that it is closer to the cross-bar, and the cross-bar is what connects the cores to the L2. Incoming data from the L2 goes first to the L1 caches so they would logically be on that side.
In general, no, because branch predictors often ignore or override hints like this. It's hard to say without knowing this architecture specifically, of course.
Keep in mind that it's possible for a branch predictor to predict with "high confidence" a branch it's never seen before, because many branches may overlap into the same memory region of the predictor. The article touches on this.
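A toy illustration of that aliasing (the index width and hash here are made up; real predictors also mix in branch history and tags):

    #include <stdint.h>
    #include <stdio.h>

    #define INDEX_BITS 10   /* made-up table size: 1024 entries */

    /* Index the prediction table with a few low PC bits, as simplified
     * predictors do; unrelated branches that share those bits alias. */
    unsigned predictor_index(uint64_t pc) {
        return (unsigned)((pc >> 2) & ((1u << INDEX_BITS) - 1));
    }

    int main(void) {
        uint64_t hot_branch  = 0x10000040;                            /* heavily trained */
        uint64_t cold_branch = hot_branch + (1u << (INDEX_BITS + 2)); /* never executed  */
        /* Same table entry, so the cold branch inherits the hot branch's
         * "confident" prediction state the first time it's fetched. */
        printf("%u %u\n", predictor_index(hot_branch), predictor_index(cold_branch));
        return 0;
    }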
Yes, it might have, assuming it's available in the 360 compilation environment.
Another solution would be to split the existing function into safe_function() and unsafe_function(), and remove use of the PREFETCH_EX flag. But then you have to explain to callers that they must not protect calls to unsafe_function() with any kind of conditional statement.
The author hinted at possible solutions but it was probably a case of "twice bitten" by that point.
Of the possible risks to online banking (and why is it always online banking in these hypothetical world-is-ending scenarios?), CPU bugs are way down the list.
Note that the bank will carry on using their Intel systems on the front end, with probably some crawling horror buried in the back office. All the intranet webpages will be IE6-only or similar.
Not by this, but it shows the chip has speculative execution, and is therefore vulnerable to Spectre.
In this specific instance, the xdcbt instruction accidentally does a bad thing in the Spectre Zone that leaks back into the real world through a side-channel in the L1 cache. Crafted malicious code could do another bad thing in the Spectre Zone that ordinarily isn't allowed in the real world, then leak the results back to the real world through any side-channel that works.
On the Xbox 360 specifically, it could even use xdcbt vs dcbt for that side-channel. If you set code up to always execute xdcbt on core 2, and always dcbt on cores 0 and 1, you might be able to do timing-based attacks between core 0 (closer to L2) and core 1 (further from L2) comparing against core 2 (bypassing L2).
A different version of the same PowerPC chip family would have its own side-channels. All those side-channels are hardware-specific.
Just because the execution is in-order doesn't mean there is no speculative execution.
The 360 CPU (and the closely related PS3 CPU) is the perfect example of this. Despite the fact that it's 100% in-order, it has a very long instruction pipeline (up to 50 instructions long) and branch prediction. And that's all Spectre needs: a long pipeline + branch prediction.
I'm not even 100% sure the Intel Atom will be safe from Spectre & Meltdown. Its pipeline is much shorter, only 16 stages long, but it still speculatively executes up to 32 instructions (2 instructions per cycle).
It just makes things harder: you have to find a Spectre gadget that's short enough. Though it's piss-easy for Meltdown, because you can just hand-assemble short code.
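For reference, the bounds-check-bypass gadget shape from the Spectre paper (variant 1); the names follow the paper's example code, and "short enough" means the whole thing has to fit in the speculation window:

    #include <stdint.h>
    #include <stddef.h>

    extern uint8_t array1[16];
    extern size_t  array1_size;
    extern uint8_t array2[256 * 512];   /* probe array, 512-byte stride per value */
    volatile uint8_t temp;

    void victim_function(size_t x) {
        if (x < array1_size)                    /* mispredicted taken for a bad x */
            temp &= array2[array1[x] * 512];    /* leaks array1[x] via which line
                                                   of array2 gets cached */
    }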