Why Raspberry Pi Isn't Vulnerable to Spectre or Meltdown (raspberrypi.org)
1603 points by TiredOfLife on Jan 5, 2018 | 215 comments



This is great, but remember that it covers Meltdown, not Spectre. Meltdown is the more immediate disaster, but Spectre is the more batshit vulnerability. You really want to get your head around:

* The branch target injection variant of Spectre if you want to get a sense of how amazing this vulnerability is: you can spoof the branch predictor to trick a target process into running arbitrary code in its address space! This is crazy!

* The misprediction variant of Spectre if you want to get a hopeless feeling in the pit of your stomach, since the implication of misprediction is that certain kinds of programs are riddled with a new kind of side channel we didn't really grok until last week, and no microcode update seems to be in the offing.

You could probably use the same Python conceit to illustrate the other two attacks; someone might take a crack at that.
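
For what it's worth, here's a rough stab at the misprediction (bounds-check bypass) variant in the same pseudocode conceit. This is only a hand-wavy sketch: array1, array2 and i are made-up names standing in for an in-bounds array, a probe array and an attacker-chosen out-of-bounds index, and like the article's snippets it isn't runnable code.

  t = a+b                 # slow "filler" work, as in the article's examples
  u = t+c
  v = u+d                 # v stands in for a slow bounds check, "i < len(array1)"
  if v:                   # the predictor has been trained to guess "taken"
      w = array1[i]       # speculative out-of-bounds read of the victim's secret
      x = w & 0x100       # pick out bit 8 of the stolen value
      y = array2[x]       # pulls array2[0x000] or array2[0x100] into the cache

The bounds check eventually resolves to false and w, x, y are discarded, but the cache footprint remains; timing reads of array2[0x000] versus array2[0x100] then leaks the bit, exactly as in the article's Meltdown walkthrough.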

(I'm not disputing that the R-Pi's aren't vulnerable to Spectre).


This covers both Meltdown and Spectre.

> Both vulnerabilities exploit performance features (caching and speculative execution) common to many modern processors to leak data via a so-called side-channel attack. Happily, the Raspberry Pi isn’t susceptible to these vulnerabilities, because of the particular ARM cores that we use.

The reason Spectre is not a problem is that there is no branch predictor in these simpler ARM cores. Instructions are processed in parallel when possible, but never before their dependencies, including branch decisions, are resolved.

EDIT: under "What is speculation?" branch prediction is described. Then in the conclusion: "The lack of speculation in the ARM1176, Cortex-A7, and Cortex-A53 cores used in Raspberry Pi render us immune to attacks of the sort."


A lot of the cheaper Pi-like boards also aren't affected for the same reason, and neither are many lower-end Android phones. The articles claiming it affected every modern CPU were basically mistaken. It was an easy mistake to make, given that ARM's announcement only listed the cores that were affected and had just a little note saying everything not listed was unaffected by both Meltdown and Spectre. (There is precisely one ARM-designed core that is affected by Meltdown, a high-end one so new that no chips based on it have been released yet.)


One could argue that CPUs without branch prediction are not modern in the sense of "modern, state of the art design". The fact that they still produce the 8051, for example, even in "new" overall designs, doesn't make the 8051 modern.


Branch prediction is neither new nor modern. Modern state-of-the-art design doesn't actually do branch prediction in hardware at all. I mean, look at all the ideas from VLIW architecture: no hardware, no problems.


You're being pretty silly by zeroing in on the word modern and then using an equally silly redefinition of it yourself.

Branch prediction isn't new, you're right. But VLIW instructions are equally unmodern and are entirely orthogonal to speculative execution. Sufficiently smart compilers are also no substitute for runtime analysis.


My CS book (Bryant and O'Hallaron, 2003, p. 399) says the Control Data 6600 in 1964 was the first processor with out-of-order execution, which was considered exotic until IBM's RS/6000 in 1990 and the PowerPC 601 in 1993.


There is almost certainly a branch predictor even in these simple ARM cores.


Yes. The reference manuals for the cores indicate that they all do.

rpi 1: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0301h/dd...

rpi 2: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464d/DD...

rpi 3: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0500d/DD...

As others have indicated further down, though, it won't open up much of a vulnerability unless they are speculatively fetching memory.


There's no reason to predict a branch if you're not going to execute speculatively.

I need to re-read the papers but I think the real problem isn't even speculative execution but allowing speculative cache changes.

The notion that "gadgets" didn't even need to return properly was both amusing and eye opening for me. It doesn't matter because the result will be flushed anyway! ;-)


In an in-order CPU, you can still use a branch predictor to predict what to fetch and decode, so that you don't stall waiting for instruction fetch to finish after you resolve the branch.

In practice, advanced in-order designs contain more local reordering mechanisms, e.g. in the load/store unit, but they lack the unified global abstraction of a reorder buffer. The most successful timing attacks involve a mis-speculated load, so they wouldn't apply to these mechanisms, but it's not completely out of the question that they are also an effective side-channel.


> There's no reason to predict a branch if you're not going to execute speculatively.

Not quite. Branch prediction is typically used on non-speculative architectures in order to avoid pipeline bubbles. (You could argue that pipelining is a form of speculation)

Here is the branch prediction documentation for one of the processors they claim is not vulnerable. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....

Whether or not they're vulnerable has more to do with how their pipeline is structured. It's possible for an architecture to be vulnerable if a request to the load store unit can be done within the window between post-branch instruction fetch/exec and a branch resolution. Eyeballing the pipeline diagram from the above docs, it looks like you can maybe get a request to the LSU off before the branch resolves. dramatic music


Pipelined processors would be slowed down considerably without branch prediction: every branch (= loop iteration) would stall the pipeline and instruction prefetch. 20-25% of instructions are branches, so this would mean a 5-10 clock cycle pause every 4 or 5 instructions.

(Simplest cores have only static branch prediction though)


Low-end ARM parts keep track of branches to avoid purging cached pages that will likely be needed soon. That allows them to have decent performance when executing from flash.


(Asking for my own edification) What does the output of branch predictor get used for in a CPU without speculation?


Avoiding having to empty the instruction pipeline when a branch happens and restart. My favorite simple video explaining pipelining is still this old Apple one from back in the early 2000s:

https://youtu.be/PKF9GOE2q38?t=237


In a pipelined CPU, you want to start the memory fetches for which instructions to next feed into the instruction decoder, before the branch instruction has finished the execute stage. Otherwise every branch incurs a pipeline bubble.


If you're doing speculative memory fetches, wouldn't that also leave a measurable impact on the cache?

Simply put, instead of MOV AX, [something that depends on the secret value] as in the original Meltdown paper, you'd use JMP [something that depends on the secret value] to trigger the memory fetch in a branch that's never actually going to be executed.


Yeah, it's not 100% obvious to me that you couldn't stage some variant of Spectre against even this more limited form of speculation.

In the comments on the article the author argues: "Why don’t speculative instruction (and data) fetches introduce a vulnerability? Because unlike speculative execution they don’t lead to a separation between a read instruction and the process (whether a hardware page fault or a software bounds check) that determines whether that read instruction is allowed."

but that would seem to be confusing the details of Spectre with Meltdown (which is happening a lot right now). Spectre doesn't depend on unauthorized reads succeeding.


Not if you're just trying to avoid a bubble in your pipeline and not actually executing the opcodes speculatively. In this situation the code will be loaded (and probably decoded, etc.) but not executed before the CPU has made sure that the branch was actually taken. If it's not, the pipeline is flushed and a bubble is introduced after all. It's not as efficient as executing speculatively out of order, but at least if you predict correctly you avoid the cost of stalling the CPU on every conditional branch until it's resolved.

ARM also has another trick for that: every opcode in the (full, non-Thumb) instruction set has a condition code that lets you execute an instruction conditionally based on the flags state without requiring an explicit branch. This way, from the CPU's perspective, the flow of the code is linear; it's only late in the pipeline that the condition code is evaluated and the instruction discarded if it doesn't match the flags. You're sure you'll never have a bubble no matter what, although the downside is that you end up fetching instructions that may never be executed, so it's only worth it for "short" branches.


> Not if you're just trying to avoid a bubble in your pipeline and not actually executing the opcodes speculatively. In this situation the code will be loaded (and probably decoded, etc.) but not executed before the CPU has made sure that the branch was actually taken.

Here's what I'm trying to figure out. Let's say there's a JIT-generated instruction that I, an attacker, am interested in learning but cannot directly read from my position in the sandbox. If I can influence the instruction fetch speculator to issue a load for that instruction, then AFAICT it doesn't matter that it never makes it as far as the execute stage -- merely the act of fetching it for decode will have had a side-effect I can probably exploit into determining what it was.


Oh yeah, you can do that, but I'm not sure you can extract something useful out of it. Basically you can tell whether or not a branch was speculatively loaded by timing how long it takes to go over it (if it was loaded by mistake, it'll slow down execution). But then where do you go from there? Execution timings are not supposed to be secret.

I can't really imagine how you can construct an attack based on that, but maybe I lack imagination.


It is used to determine which instructions to fetch and decode, so that you don't have to wait until the branch is resolved to query the instruction cache, instruction TLB, MMU, DRAM, etc. If the branch was mispredicted, you have to stall to wait for this to complete.


Ah, that's good to know ... a comment below mentioned a "BTAC" (branch target address cache).

Is it just used for getting instructions into the L1i cache then?


I'm not arguing that R-Pi's are vulnerable to Spectre.


You're arguing that this article does not cover Spectre.

It does ... it just doesn't have to fully describe Spectre to show why the Raspberry Pi is not affected.


At least in the case of the Cortex-A7, there are about 2-3 cycles (maybe 3-4 instructions with dual-issuing) before a branch mispredict causes a pipeline flush. You can maybe fit a couple of loads into that time, and maybe one of them will result in a Dcache fill. Other in-order cores in the Cortex-A line have similar Dcache side effects under mispredicts. It's not clear whether this actually leads to speculated Dcache fills, though.

So it's possible that some of the Raspberry Pi versions are in fact vulnerable to (much weaker versions of) Meltdown and Spectre.


I too would love to see the continuation you suggest. This was a great article, and I was sad to see it end, since I was hoping to grok a little more about the technical differences between Meltdown and Spectre.



Those updates cover Meltdown and Spectre 2, not Spectre 1.


If you don't feel overwhelmed enough by the misprediction variant of Spectre yet, consider that the cache side effect of speculation is actually desirable in most normal cases, because it helps prime caches with data that is likely to be used in the correct branch as well. Just turning it off is not the right way to solve this.

Perhaps this will finally provide enough incentives to model data sensitivity in the type systems of practical programming languages.


It would be hard to model data-dependent timing of an actual CPU. Integer division is variable-time on lots of architectures, but on ARM even multiplication timing is data-dependent! And on desktops you could cause even a mov instruction to stall a few cycles if you can run something before it that uses all the renamed registers - there might be a way to turn this into data dependence and observe it from the perf counters…


You don't need to model the timing though.

What I'm saying with the type system comment is this: the cache side effects of speculation are desirable most of the time but not always. We should find a way to model data sensitivity in the type system so that a compiler can automatically choose to generate a side-effect-free code sequence where the side effects must be avoided (this assumes a future with ISA extensions that allow telling the CPU to prevent such side effects by blocking the speculative execution).
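
As a toy, source-level illustration of what "data sensitivity in the type system" might mean (purely hypothetical Python; the real mechanism would have to live in the compiler and the ISA extensions described above):

  class Secret:
      """Marks data whose value must not influence speculation or cache state."""
      def __init__(self, value):
          self._value = value
      def __index__(self):
          # refuse to be used directly as an array index or branch condition;
          # the compiler's "blessed" path would emit a barriered/branchless sequence
          raise TypeError("secret values need a speculation-safe accessor")

  def constant_time_select(secret_flag, a, b):
      # branchless select: does the same work whichever value the secret holds
      f = secret_flag._value & 1
      return a * f + b * (1 - f)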


All shortcuts leak information. Anything with the equivalent of an early return leaks information.


It's not all that hopeless, actually.

Let's ignore Meltdown, which seems solvable in hardware with no obvious performance loss (assuming AMD's existence proof is correct), and concentrate on Spectre, which everyone thinks means the world is ending.

One thing to note is that compilers have been safely speculating instructions for years. Shocking, I know :) Processors could too.

They just weren't. One of the cardinal rules of safe speculation is that you can't speculate an instruction unless you can prove it has no observable side effects (i.e. possibly faulting, or in this case, ending up in cache), or that the side effects are precisely the same whether you speculate it or not (where "precisely the same" generally includes some notion of the ordering of side-effect occurrence; I'm not going to get into all the various intricacies).

IE

  a+b      -> safe to speculate, has no side effects
  load a   -> generally unsafe, has possible observable side effects

  if foo:
    load a
  else:
    load a
  -> safe to speculate "load a" to right above the if: it must always be executed, and in this example the side effects will be the same.

  if foo:
    load a
  -> not safe to speculate "load a" above the if, it may not execute
Compilers generally did not speculate loads, mainly because they can fault at that level[1]. The fault comes from the processor though.

So processors were/are assuming that speculation of loads, calls, etc, had no side effects because they could "throw away the results". As the processor controlled the faulting, it could just throw away the fault and pretend it never happened.

These two attacks are just proving that is not true, and the side-effects of computation themselves are observable.

The end result should be the same. Processors speculate more like compilers do: only in safe situations.

The idea that you have to give up all speculative execution seems very wrong.

You only have to give it up in cases where there are possibly observable side-effects and it's not guaranteed they will always happen.

The upshot is the most likely outcome is that both processors and compilers will work harder to speculate.

Compilers will get called upon to do more safe speculation.

Processors will have to grow the logic to determine when speculation is safe (or figure out a way to actually undo all side effects, which is fairly hard).

Now, the downside is the most useful speculation is obviously to hide load/store latency, and those things are the hardest to reason about safety.

But like i said, compilers have been doing it for many years at this point. Our hardware brethren just found out the hard way that they are likely to start having to do it too, and that there are more observable side-effects than they thought.

[1] Some JITs do it and catch the result in a fault handler if it turns out badly. I expect someone is going to discover ways to exploit all of these basically instantly. These will not be fixed by processor related fixes because they are not processor directed.


Considering that on average you have a branch every 5 instructions, and the reorder buffer of a high-performance OoO CPU is on the order of a hundred instructions, such a CPU is pretty much running under speculation all the time (often speculating past multiple branches at the same time).

Not being able to fetch new cachelines when under speculation would be a huge blow, as exposing memory level parallelism is one of the most important features of OoO CPUs.

Edit: also, fetching a cacheline can't really be undone, as it is observable from other CPUs via the coherency protocol: i.e. while it might be possible to hide from the local CPU that a cacheline was loaded under a failed speculation, the effect can still be observed by another core, which can notice that an exclusive line is now shared (by timing, for example, the latency of an atomic instruction).


"Considering that on average you have a branch every 5 instructions and the reorder buffer of high performance OoO CPU is in the order of a hundred of instruction, such a CPU is pretty much running under speculation all the time (often of multiple branches at the same time)."

Yes, it is. Remember, again, that the vast majority of those instructions can still be speculated, because the vast majority of instructions are not loads or stores.

Now, certainly, the expensive ones are loads and stores, but i'm just pointing out that the hundreds of instructions you are talking about in the buffer are mostly not loads and stores.

It's true that lowering memory level parallelism would be a huge blow, as the vast majority of time in well-tuned cpu bound apps is usually spent in stalls waiting for memory (otherwise, if it's really just arithmetic bound, it may make more sense to run it on a GPU or something), and this would just increase it.

The real question is what percent can you prove are safe to speculate, and at what point can you prove that safety (ie assuming it must be dynamically speculated, can you prove safety with enough cycles left that it matters). If you have 5 instructions, yeah, no, probably not.. But it may also be the case that the execution environment can prove it safe for you as the program executes and tell you.

I expect getting back this performance is going to be done using a variety of methods, some cooperation between jits/compilers and processors, and possibly some weird abstractions around marking memory you want to protect or not (IE not speculate around).

I mean, in the absolute worst case, you could make loads/stores take constant time and speculate as much as you like :)

It's just that this has a much higher performance cost right now than not speculating at all (by a few orders of magnitude)


I doubt that any CPU manufacturer is going to go first in crippling its memory subsystem.

I think realistically the only safe and not completely performance crippling workaround will be, at the very least, to run any untrusted code under a separate address space (assuming that the cpu is immune from meltdown). That doesn't necessarily require a full blown separate process, but something like memory protection keys might work.

The alternative is full static analysis and source level annotations, which realistically is only going to be done for very few programs and will still be error prone.


Yeah, I don't disagree at all. Something like this is what I meant by "cooperation" (lfence or equivalent for code that can't be isolated but must be protected) + "weird abstractions" (i.e. back to segmented land we go)


Forgot about segments! Time to resurrect them on x86-64 cpus?


Just giving up speculation entirely is too much of a performance penalty. So one possible solution is: when you know it was a misspeculation, processor can just kick cache lines out of cache if the speculation caused the cache line to be loaded (probably you only need to do this at the last level cache closest to the memory). More states to track, but not impossible.


This only works if it's not observable by other cores/CPUs (IE you have no shared cache coherency).

Otherwise, they can observe it before you roll it back due to the way the coherency protocols work.


I don't think it really is hopeless. But what makes Spectre 1 "scary" is that it's situational: you want to serialize instructions predicating some loads (unbounded offsets based on attacker-controlled data) but not all of them. You presumably don't want your compiler to lfence every basic block containing a load.


> You only have to give it up in cases where there are possibly observable side-effects and it's not guaranteed they will always happen.

> Compilers will get called upon to do more safe speculation

Compilers can also emit code that minimizes those situations. Also, the ISA could be extended with an instruction modifier that signals that an otherwise innocent instruction does have observable side effects (ideally the processor should know that, but if the compiler already knows it, a runtime check can be skipped).

> Processors will have to grow the logic to determine when speculation is safe

In many cases, this would be as simple as flagging the micro-op as unsafe and pausing speculation at that point.

For Meltdown, the just-add-silicon approach could be to never share cache between privileged and unprivileged code. To extend that to Spectre, never share cache across different PIDs (but then the ISA will have to know what a PID is). Since that would reduce effective cache capacity, caches will have to grow.

Fun times ahead.


Intel CPUs already have PCID (process-context identifiers that tag TLB entries) to make switching processes cheaper.


Shouldn't that protect against Spectre?


If I understand it correctly, Spectre only works within a single address space.


Well... I never expected one part of my process not to be able to peek into what the others are doing.


One example would be malicious Javascript - or WebAssembly - strolling through your browser's memory looking for passwords to send home to the mothership. Or other sensitive information, like credit card numbers and so forth.


It's clearly a mistake to make all these different programs from different sources share a single security context.


By the way, didn't cperciva point out this vuln way back in 2005? http://www.daemonology.net/papers/htt.pdf

Actually that seems to be rather different, but still similar-ish. They both use threads to exploit memory caches in an unexpected way.

I thought he deserved a mention since no one really took it seriously back then. https://it.slashdot.org/story/05/05/13/0520214/hyperthreadin...

http://freerepublic.com/focus/f-news/1406913/posts

"The recent Hyper-Threading vulnerability announcement has generated a fair amount of discussion since it was released. KernelTrap has an interesting article quoting Linux creator Linus Torvalds who recently compared the vulnerability to similar issues with early SMP and direct-mapped caches suggesting, "it doesn't seem all that worrying in real life." Colin Percival, who published a recent paper on the vulnerability, strongly disagreed with Linus' assessment saying, "it is at times like this that Linux really suffers from having a single dictator in charge; when Linus doesn't understand a problem, he won't fix it, even if all the cryptographers in the world are standing against him."

Always found that amusing.


Cache timing goes back to at least 2005 with Osvik and Tromer. This isn't a simple cache timing bug, though.


Cache timing goes back to 2005 with Percival. I published a couple weeks before them. :-)


Yeah, but did you ever win a Putnam?


Just the once, though.


"Cache timing goes back to at least 2005 with Osvik and Tromer. This isn't a simple cache timing bug, though." (tptacek)

"Cache timing goes back to 2005 with Percival. I published a couple weeks before them. :-)" (cpercival)

Cache timing goes back to the VAX Security Kernel (early 1990's) designed for those A1 certification requirements that tptacek calls useless "red tape." One of the mandated techniques was covert channel analysis of the whole system. They found lots of side channels in hardware and software that they tried to mitigate. Hu found one in CPU caches, following up with a design to mitigate it. That was presented in 1992.

https://news.ycombinator.com/item?id=16083384

Since Hu is paywalled, see (b) "cache-type covert timing channels" in his patent:

https://www.google.ch/patents/US5574912

So, one of INFOSEC's founders (Paul Karger) did the first, secure VMM for non-mainframe machines. The team followed the security certification procedures discovering a pile of threats that required fixes from microcode for clean virtualization to mitigation of cache, timing channels. They published that. Most security professionals outside high assurance sector and CompSci ignored and/or talked crap about their work presumably without reading it. Those same folks reported later on virtualization stacks hit in 2000's with the attack from 1992 on new software with weaker security than KVM/370 done in 1978. Now, another attack making waves uses the 1992 weakness combined with another problem they discovered by looking at what interacts with it. That might have been discovered earlier if they did that with x86 like high-assurance security (aka "red tape") did in 1995 for B3/A1 requirements with them spotting potential for SMM and cache issues:

https://pdfs.semanticscholar.org/2209/42809262c17b6631c0f653...

Note: High-assurance security was avoiding x86 wherever allowed by market for stuff like what's in that report. As report notes with exemplar systems, market often forced it to detriment of security. Their identification of SMM as a potential attack vector preempted Invisible Things by quite a bit of lead time. That was typical in this kind of work since TCSEC require thoroughness.

In 2016, one team surveyed the components plus research on them in a modern CPU. Researchers had spotted branching as a potential timing channel pretty quickly after CPU's got mainstream attention.

https://eprint.iacr.org/2016/613.pdf

So, a team following B3 or A1 requirements of TCSEC for hardware like in 1990-1995 would've identified the cache channels (as done in 1992) plus other risky components. They'd have applied a temporal or non-interference analysis like they did with secure TCB's in the 1990's to early 2000's. A combination of human eyeballs plus model-checking or proving for interference might have found recent attacks, too, given prior problems found in ordering or information flow violations. This is a maybe, but I say focusing on interactions with known risks would've sped discovery with high probability. As far as resources, it would be one team doing this on one grant using standard certification techniques from the mid-1980's on a CPU that others analyzed in the mid-1990's and said was bad for security due to the cache leaking secrets, too many privileged modes, various components implemented poorly, and so on.

I keep posting this stuff on HN, Lobste.rs, and so on since it's apparently (a) unknown to most new people in the security field for who knows what reason or (b) dismissed by some of them based on the recommendations of popular, security professionals who have clearly never read any of it or built a secure, hardware/software system. I'm assuming you were unaware of the prior work given your scrypt work brilliantly addressed a problem at root cause quite like Karger et al did when they approached security problems. The old work's importance is clear as I see yet again well-known, security professionals are citing attack vectors discovered, mitigated, and published in the 1990's like it was a 2005 thing. How much you want to bet there's more problems they already solved in their security techniques and certifications that we'd profit from applying instead of ignoring?

I encourage all security professionals to read up on prior work in tagged/capability machines, MLS kernels, covert channel analysis, secure virtualization, trusted paths, hardware security analysis, and so on. History keeps repeating. It's why I stay beating this drum on every forum.


If you read my 2005 paper, you'll see that I devoted a section to providing the background on covert channels, dating back to Lampson's 1973 paper on the topic. I was very much aware of that earlier work.

My paper was the first to demonstrate that microarchitectural side channels could be used to steal cryptologically significant information from another process, as opposed to using a covert channel to deliberately transmit information.


Hmm. It's possible you made a previously-unknown distinction but I'm not sure. The Ware Report that started the INFOSEC field in 1970 put vulnerabilities in three categories: "accidental disclosures, deliberate penetration, and physical attack." The diagram in Figure 3 (p. 6), with its radiation and crosstalk risks, shows they were definitely considering hardware problems and side channels, at least for EMSEC. When talking of that stuff, they usually treat it as a side effect of program design rather than deliberate.

https://csrc.nist.gov/csrc/media/publications/conference-pap...

Prior and current work usually models secure operation as a superset of safe/correct operation. Schell, Karger, and others prioritized defeating deliberate penetration with their mechanisms since (a) you had to design for malice from the beginning and (b) defeating one takes care of the other as a side effect. They'd consider the ability for any Sender to leak to any Receiver to be a vulnerability if that flow violates the security policy. That's something they might not have spelled out since they habitually avoided accidental leaks with mechanisms. Then again, you might be right where they never thought of it while working on the superset model. It's possible. I'm leaning toward they already considered side channels to be covert channels given descriptions from the time:

"A covert channel is typically a side effect of the proper functioning of software in the trusted computing base (TCB) of a multilevel system... Also, as we explain later, malicious users can exploit some special kinds of covert channels directly without using any Trojan horse at all."

"Avoiding all covert channels in multilevel processors would require static, delayed, or manual allocation of all the following resources: processor time, space in physical memory, service time from the memory bus, kernel service time, service time from all multilevel processes, and all storage within the address spaces of the kernel and the multilevel processes. We doubt that this can be achieved in a practical, general purpose processor. "

https://csrc.nist.gov/CSRC/media/Publications/conference-pap...

The description is that it's an incidental problem from normal software functioning that can be maliciously exploited with or without a Trojan horse. They focus on penetration attempts since that was the culture of the time (rightly so!) but know it can be incidental. They also know, in the second quote, just how bad the problem is, with later work finding covert channels in all of that. Hu did the timing channels in caches that same year. Wray made an SRM replacement for timing channels the year before. They were all over this area but without a clear solution that wouldn't kill the performance or pricing. We may never find one if talking timing channels or just secure sharing of physical resources.

Now far as your work, I just read it for refresher. It seems to assume, not prove, that the prior research never considered incidental disclosure. Past that, you do a great job identifying and demonstrating the problem. I want to be extra clear here I'm not claiming you didn't independently discover this or do something of value: I give researchers like you plenty credit elsewhere on researching practical problems, identifying solutions, and sharing them. I'm also grateful for those like you who deploy alternatives to common tech like scrypt and tarsnap. Much respect.

My counter is directed at the misinformation rather than at you personally. My usual activity. I'm showing this was a well-known problem with potential mitigations presented at security conferences, one product was actually built to avoid it, it was highly cited with subsequent work in high-security imitating some of its ideas, this prior work/research is not getting to the new people concerned about similar problems, some people in the security field are also discouraging or misrepresenting it on top of that, and I'm giving the forerunners their due credit plus raising awareness of that research to potentially speed up development of the next new ideas. My theory is people like you might build even greater things if you know about prior discoveries in problems and solutions, esp on root causes behind multiple problems. That I keep seeing prior problems re-identified makes me think it's true.

So, I just wanted to make that clear as I was mainly debunking this recent myth of cache-based timing channels being a 2005 problem. It was rediscovered in 2005, perhaps under a new focus on incidental leaks, in a field where the majority of breakers or professionals either didn't read much prior work or went out of their way to avoid it, depending on who they are. Others and I studying such work also have posted that specific project in many forums for around a decade. You'd think people would've checked out or tried to imitate something in early secure VMM's or OS's by now when trying to figure out how to secure VMM's or OS's. For some reason, they don't in the majority of industry and FOSS. Your own conclusion echoes that problem of apathy:

"Sadly, in the six months since this work was first quietly circulated within the operating system security community, and the four months since it was first publicly disclosed, some vendors failed to provide any response."

In case you wondered, that was also true in the past. Only the vendors intending to certify under higher levels of TCSEC looked for or mitigated covert channels. The general market didn't care. There's a reason: the regulations for acquisition said they wouldn't get paid their five to six digit licensing fees unless they proved to evaluators they applied the security techniques (eg covert-channel analysis). They also knew the evaluators would re-run what they could of the analyses and tests to look for bullshit. It's why I'm in favor of security regulations and certifications since they worked under TCSEC. Just gotta keep what worked while ditching bullshit like excess paperwork, overly prescriptive, and so on. DO-178B/DO-178C has been really good, too.

Whereas, understanding why FOSS doesn't give a shit I'm not sure on. My hypothesis is cultural attitudes, how security knowledge disseminates in the groups, and rigorous analysis of simplified software not being fun to most developers versus piles of features they can quickly throw together in favorite language. Curious what your thoughts are on FOSS side of it given FOSS model always had highest potential for high-security given labor advantage. Far as high-security, it never delivered it even once with all strong FOSS made by private parties (esp in academia) or companies that open-sourced it after the fact. Proprietary has them beat from kernels to usable languages several to nothing.


Thanks for the confirmation and inside info!

Yesterday we only guessed it https://news.ycombinator.com/item?id=16069740 based on ARM CPU list,

the RPi 1-3 CPUs

  ARM11, Cortex-A7, Cortex-A53
aren't in the list.

Affected ARM cores:

  Cortex-R7, Cortex-R8, Cortex-A8, Cortex-A9, Cortex A15, 
  Cortex-A17, Cortex-A57, Cortex-A72, Cortex-A73, Cortex-A75
https://developer.arm.com/support/security-update

(I tried to post it 3 hours ago, but HN is rate-limiting my posts, oh well)


This is a good overview of modern, superscalar, out-of-order, speculative CPUs that literally any programmer could easily understand. Recommended reading for every single engineer in the whole world (who doesn't already understand this stuff from reading source material e.g., Google Zero post).


Non-engineer here, this bit is key right:

> However, suppose we flush our cache before executing the code, and arrange a, b, c, and d so that v is zero. Now, the speculative load in the third cycle:

> v, y_ = u+d, user_mem[x_]

> will read from either address 0x000 or address 0x100 depending on the eighth bit of the result of the illegal read. Because v is zero, the results of the speculative instructions will be discarded, and execution will continue. If we time a subsequent access to one of those addresses, we can determine which address is in the cache. Congratulations: you’ve just read a single bit from the kernel’s address space!

To my understanding it is that saying that by...

1) ...flushing the cache so you have a 'clean' state, you can get...

2) ...the speculative execution to 'pull in' to cache the address user_mem[x_] but...

3) ...the particular address that's pulled into cache, 0x000 or 0x100, is determined by whether...

4) ...the illegal read of kern_mem[address] 8th bit was a 1 or 0...

5) ...which you can then subsequently determine the value of by...

6) ...timing how long it takes to access that user_mem[x] address once again and...

7) ...thereby leaking the value of kern_mem[address]...

So you still have to perform some logic on the result of the speed of the access to the secondary address read right?

If read of 0x000 is slow you know kern_mem[address] was a 1 and if fast kern_mem[address] a 0, and if 0x100 is slow you know kern_mem[address] was a 0 and if fast that kern_mem[address] was a 1?

Is that correct?

If it is it seems that timing is the key right, and actually the clever leap of creativity in completing the exploit, at least to my untrained mind.

Please do correct anything I've got wrong, I'm not an engineer/developer!


You're exactly correct. This is why the browsers decreased timing resolution in javascript so that you couldn't time memory accesses accurately enough to tell if the address was cached or not.
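
A sketch of that final timing step might look like the following, reusing the article's hypothetical user_mem array. (A pure-Python timer is far too coarse and noisy to resolve a real cache hit; this only shows the shape of the logic.)

  import time

  def access_time_ns(addr):
      # time a single read; a cached line comes back noticeably faster
      start = time.perf_counter_ns()
      _ = user_mem[addr]
      return time.perf_counter_ns() - start

  t_low, t_high = access_time_ns(0x000), access_time_ns(0x100)
  leaked_bit = 1 if t_high < t_low else 0   # the faster (cached) address reveals the bit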


> You're exactly correct. This is why the browsers decreased timing resolution in javascript so that you couldn't time memory accesses accurately enough to tell if the address was cached or not.

What does that do, besides turn the exfiltration problem from an immediate one into a statistical one?


IF it can be turned into a statistical problem, it may become an infeasible attack. You'd have to run the whole attack (not just the last reading bit since that would bring it into the cache after the first read) many times to be able to ascertain the difference. Even then, the difference might be less than the noise from other processes on the system (I think 80 cycles was used in the PoC?).

Maybe there will end up being a new Jumping Around Kernel Address Space System (JAKASS - a cousin of Linux's FUKWIT patch) that periodically resets kernel ASLR to make it fully impossible.


This Mozilla post https://blog.mozilla.org/security/2018/01/03/mitigations-lan... mentions that "other timing sources and time-fuzzing techniques are being worked on".

The paper they linked to references this one: https://www.usenix.org/system/files/conference/usenixsecurit...

I think this is what all sandboxes have to do: set the TSC disable flag, restrict system timer precision (make it configurable per sandbox: web servers generally don't need more than 1ms precision), make system timer report fuzzy (randomized) time. Heck, why not also make the CPU run at randomized frequency to mess with busy loop timers.
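
As a toy version of the "coarsen and fuzz the clock" idea (an illustration only, not how any particular browser or kernel implements it):

  import random, time

  RESOLUTION_US = 20  # pretend this sandbox only deserves 20-microsecond ticks

  def fuzzy_now_us():
      t = time.perf_counter() * 1_000_000           # true time in microseconds
      t = (t // RESOLUTION_US) * RESOLUTION_US      # clamp to the coarse grid
      return t + random.uniform(0, RESOLUTION_US)   # jitter within one tick

Attackers can average away the jitter over many samples, which is why this only raises the bar rather than eliminating the side channel.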


Does that mean -another- slight performance drop ???


Not in general.

Consider an Olympic 100 metre sprinter. Today we time this event very accurately, I think it's to one hundredth of a second, using sophisticated technology.

But even if the judges used a much less accurate mechanical stopwatch, Usain Bolt wouldn't actually be slower, we'd just be less confident of how ridiculously fast he is.

In some special cases, timing things very accurately might be essential to a use of Javascript, but I can't think of any examples off the top of my head.


gotta get that sweet rollover effect right to the 24th decimal baby


It means the only secure system is one where the performance of any given instruction is constant in all cases for that instruction.


A system like that would certainly be secure against these types of attacks, but I don't believe there is evidence that that is the only secure system. It certainly opens up many of these sorts of problems, but that just means it is difficult to secure, not impossible.


Yes, CPU architecture is one of many subjects I don't know enough about, and this was both interesting & has added to my reading list.

It's also quite fun to think that the little Pi I have chugging away in a tiny corner doing a variety of background tasks, which was already the most trouble-free machine I own, may also be the safest (OK I know that's an oversimplification, but I'm feeling affectionate towards it).


Finally the minimalist approach proved its ..... (I lost the word, non English speaker, but you did understand me :D)


I think "merit" may be the word you are looking for.

https://en.wiktionary.org/wiki/merit


"worth" maybe?


Agreed. This was the best of the explanations I've read, and it helped me to finally wrap my mind around the exploits.


agreed


Me too. Finally I understand how the leakage works: no actual reading of kernel memory is taking place.

Instead, the read-ahead/speculative logic causes one of two addresses in user space to be read, and thus placed in the cache. So, by reading both of them, and checking the time it took, the exploit can indirectly determine one bit (0 or 1) of kernel memory. Scary!


What are the exploits enabled by this? Seems like mostly security-related attacks on encryption keys and such.


Well that value has to still be read into the cache, since the "kern_mem[address]&0x100" calculation is speculatively carried out. I don't think the MMU can do any bit level computations.


I understood everything up until the "suppose we flush our cache before executing the code" part which is probably the most important part.

There was a comment below the article that explained this part a little further:

> Imagine the value at the kernel address, which gets loaded into _w, was 0xabde3167. Then the value of _x is 0x100, and address user_mem[0x100] will end up in the cache. A subsequent load of user_mem[0x100] will be fast.

> Now imagine the value at the kernel address, which gets loaded into _w, was 0xabde3067. Then the value of _x is 0x000, and address user_mem[0x000] will end up in the cache. A subsequent load of user_mem[0x100] will be slow.

> So we can use the speed of a read from user_mem[0x100] to discriminate between the two options. Information has leaked, via a side channel, from kernel to user.

https://www.raspberrypi.org/blog/why-raspberry-pi-isnt-vulne...


Yes, that is the "this is left as an exercise for the reader" part of the explanation. (-:

The remaining part is to iterate the process over all of the bits in the word, using different bitmasks. The resultant set of 0 or 1 results for each bit yields the complete word.

Then one iterates that whole process over all (useful) words in (mapped) kernel memory.
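
A sketch of that iteration, assuming a leak_bit(address, bit) primitive built out of the single-bit trick above (the primitive itself is the hard part and is only stubbed here):

  def leak_bit(address, bit):
      # run the speculative gadget with mask (1 << bit), then time the probe
      # reads to see which user-space line landed in the cache (stub)
      raise NotImplementedError

  def leak_word(address, bits=32):
      word = 0
      for n in range(bits):
          word |= leak_bit(address, n) << n
      return word

  def leak_range(start, length):
      # walk a stretch of (mapped) kernel memory one 32-bit word at a time
      return [leak_word(start + 4 * i) for i in range(length // 4)]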


Thanks, this was the missing piece in my understanding. I was wondering how knowing only 1 bit would be useful. Suppose the attacker wants to read this entire value (0xabde3167) using this method. Is it guaranteed that over multiple runs, the value at that address would be the same each time at that point in execution?


It is certainly possible that the memory the exploit is trying to read might be changing under its nose. An actual implementation of the exploit would need to account for that.


How might one know which address to attack like this? I thought memory was fairly randomly laid out.


Ok I think I understand the subtleties of these attacks now. But: can anyone tell me why the accessibility check for protected memory doesn't happen before the cache loads the contents of RAM? If that happened then none of these attacks would be possible.

I got my computer engineering degree in 1999 and ended up going the computer science route making CRUD apps all day. I feel in my gut that some engineer, somewhere, MUST have asked this question at one of the big chip manufacturers.

Am I missing something fundamental? Is the access check too expensive? If it isn't, then can the microcode be updated to do this, or is caching/accessibility checking happening at a level above microcode? If that's the case then it would seem that pretty much all processors everywhere that do speculation without protected memory access checks are now obsolete.


> can anyone tell me why the accessibility check for protected memory doesn't happen before the cache loads the contents of RAM?

From what I understand, it does happen on AMD, which is why AMD CPUs are not vulnerable to the more dangerous Meltdown attack (any code reading kernel / hypervisor host memory).

Intel / ARM delays the checks until later, to the time when the speculated instructions are actually finalised and make their results available. This is faster, and loading some memory into the cache is normally invisible to the unprivileged code. The checks would still be done when actually reading that memory. But nobody spent enough time considering the timing side-effects of the cache.

Now, even if the protected memory reads are fixed by the OS updates, that still leaves the Spectre attack: code running in a process reading all of "its own" memory, regardless of any software sandboxing. This means that all sorts of sandboxing methods for javascript interpreters, bytecode interpreters, plugin architectures, etc are insecure. And the OS patches can't help here, because the sandbox isn't in protected memory.


Thank you, that was a succinct explanation of the difference between AMD/Intel and the order of speculation and protected memory access check for the Meltdown attack, and makes it easier to understand:

https://en.wikipedia.org/wiki/Meltdown_(security_vulnerabili...

I see now why the Spectre attack is so serious (reading out of bounds within the same process memory). I feel like there may be ways to catch unallocated memory access similarly to how protected memory works. But, that wouldn't help reads from allocated memory in runtime environments like for Javascript (where separate scripts in the same process space aren't meant to see each other's data). This is clearer to me now:

https://en.wikipedia.org/wiki/Spectre_(security_vulnerabilit...

Going forward, we may have to assume that security is only possible with true process isolation. For example this might put pressure on OSs to fix their slow context switching implementations to encourage the use of processes instead of threads. Beyond that, I can't see any easy way to fix the situation and am highly skeptical of things like compiler fixes, because there will likely always be another way to abuse various instructions to read outside memory boundaries.


The slow context switching between processes has nothing to do with the OS implementation. A true context switch involves switching page tables, which means flushing the TLB. This is slow due to caching, independent of the OS. The only thing the OS can do is tell the processor which parts not to mark dirty, but -- as these attacks show -- this can expose vulnerabilities.


Too bad segmentation was dropped for 64-bit code on x86, leaving just page tables. 32-bit x86 retains the segmentation model of the 286, extending it to 32 bits, and making it work with virtual addresses instead of physical addresses if the paging system is also enabled.

Most 32-bit operating systems ignored the segmentation system, basically just running everything in what the old timers would call "small model".

If we had segmentation in 64-bit mode, then I wonder if we could defend against a lot of these problems by running things we want to sandbox, such as JavaScript, in a different segment that only has access to the upper end of our virtual address space?

As long as the sandboxed code cannot change the segment registers, this would prevent it from generating an address outside the sandboxed portion of the processes' virtual address space.

I don't recall if the x86 segment system provides a way to trap attempts to change a segment register. If I recall correctly, it does support more than just the two-level kernel/user protection system, and I think it supports not allowing a segment register to be loaded with a selector that refers to a segment belonging to a higher privilege level. So maybe if user mode were split into two levels, with sandboxed code run at a less privileged level than the main process, it could work.

In general, I think processor designers need to take into account the need for processes to run sandboxed code, and provide some kind of mechanism the processes can use to protect themselves from malicious code in the sandbox.



> This means that all sorts of sandboxing methods for javascript interpreters, bytecode interpreters, plugin architectures, etc are insecure.

I just now understood the impact of Spectre. It is not just that all existing attempts to execute code in a sandbox are vulnerable. For CPUs with this problem, it is literally impossible to create a secure sandbox.

We certainly live in interesting times...


I'm still wrapping my head around Spectre. To me it does seem possible to create a sandbox secure against Spectre, but you have to recompile every program with retpoline to mitigate variant 2 and run untrusted code in a separate process to mitigate variant 1. This way, even if something breaks in a sandbox, it will only be able to see data from similarly broken sandboxes, not from other programs. Although there is still the possibility of other side-channel attacks.

(hoping for someone to correct me if I'm wrong)


I hope I understand your question correctly. The request asks for something in protected memory and also asks for something based off a portion of the protected memory (like the first byte). The system denies access to the request but puts both results in cache. The attacker then asks for a byte of memory similar to the second request, which the system tries to get from cache but then goes to memory for, since it wasn't in cache. The attacker doesn't want that result, so cancels the request and asks for another byte similar to the second request. That process repeats until the system says "Hey, this byte is in my cache" and gives the result back to the attacker. That lets the attacker know what the first byte of that protected memory was. The attacker then repeats that whole process until they've read the entire protected memory, at a rate of about 1500 bytes a second. It never gets the actual protected memory from the cache.


this is for meltdown right? how would you explain the difference between this and spectre, going from this explanation?


> Ok I think I understand the subtleties of these attacks now. But: can anyone tell me why the accessibility check for protected memory doesn't happen before the cache loads the contents of RAM? If that happened then none of these attacks would be possible.

If that happened, Meltdown wouldn't be possible on the processors on which it is possible.

Those checks in no way mitigate against Spectre. Spectre is both simpler and in many ways more profoundly devastating -- as long as an attacker can influence the statistics that the CPU uses for branch prediction, the very act of trying to "work ahead" in another process can be influenced by the attacker, and will produce some side-effects even if you throw out the result, and those side-effects suffice to infer information about the work you did but threw out.


I think you're missing that the misdirection happens in the kernel. On an indirect branch (i.e. "jmp <register>") the CPU picks a value for <register> from a small cache when speculatively executing, since the real value isn't known. A malicious program can fill the small cache with addresses of kernel code that do some computation with kernel data and then depending on the result, touch different parts of user application memory.

There is no memory protection violation because it happens in the kernel; obviously, the kernel can read its own memory.


> I feel in my gut that some engineer, somewhere, MUST have asked this question at one of the big chip manufacturers.

Sure, they were about to commit career suicide but then they learned to love the bomb and went on with their day. Maybe they even tried to explain the problem to management but somehow it got lost in translation.


The memory access could potentially take a very long time, so it makes sense to start it as early as possible, in parallel with all the other operations. Saving a few cycles on each memory access adds up significantly overall.


Right, but at the cost of a leaky cache abstraction, which is now hugely expensive to work around, making me question how that ever got approved if the security model was clearly leaking at that point (which has been known for the entirety of speculative execution, right?)


> how that ever got approved if the security model was clearly leaking

When you consider a theoretical model of the CPU, then it's not leaking - the speculative execution, cache, other parts of the CPU are designed carefully so that no data can "escape" and be read by processes that don't have the permission to do it. Speculated execution can happen, but before any results from that are released, the permissions are checked, and if they fail, the results are discarded.

What people did not consider is the timing attacks that do leak information. It's only "clear" to us now after the attack has been demonstrated, even if it has been present in CPU design for the last 20 years.

There are probably many more of these side channel data extraction paths possible. For example, in the recent years attacks on cryptographic algorithms have looked at similar timing measurements, and in some cases power consumption measurements.


Isn't the knowledge of potential for timing attacks using cache/memory fairly old? I am pretty sure I heard of the concept long ago.


Very old in the arena of cryptography.

No one in the CPU Architecture Design arena put 2 & 2 together [1] to realize that the same side channel that was devastating for cryptography work would also be quite devastating for bypassing memory permission protections in the CPU's they were designing.

[1] probably because the intersection of "cryptographers who can mount timing side channel attacks" and "CPU Architecture designers" is very close to zero.


I may be wrong, but the timing attack only reveals the cache line, and being able to read the cache line is itself something I would expect to fail even without the timing attack. How does that not cause an exception?


My understanding is that you never directly read memory to which you shouldn't have access. You arrange for the processor to read a specific address you control based on the value of the illegal read. By timing subsequent accesses to that user address, you can infer whether or not the processor brought it into the cache, and thus infer the value of the illegal read.


> how that ever got approved if the security model was clearly leaking at that point

Keep in mind that for a very long time, PCs and x86 CPUs were used in environments where either there was a single user with full access, or multiple users who weren't completely adversarial (look at Win95/98's multiuser security model, for example.) Memory protection and other security features served as a barrier to accidents and to "keep the honest honest", not determined adversaries.

They still are today, but this is very different from the shared servers/cloud computing environments which have now become common --- completely mutually untrusting users with possibly adversarial relationships are sharing the same hardware.

This is the reason why all the CPU manufacturers have had some variant of "as designed" in their public comments --- speculative execution was designed with the former model in mind, not the latter.


In retrospect, yes this is of course a big problem. But regarding how it could ever be approved...

Until this news broke, a CPU designer would tell you that speculative execution is a well-understood, proven approach to gaining a lot of performance. Branch predictors are really good at figuring out where the next instruction is going to come from, and are a really important tool for avoiding stalling out the whole machine while you wait for the next instruction to come back from memory.

And intuitively, it seems really "safe." All you're doing is having the CPU get ready to perform "future" calculations more quickly. It gets to guess where the program is going to go, and start fetching resources that it thinks will be needed. As long as nothing architecturally visible is changed by these preparations, everything is functionally the same, so what could go wrong?

And intuitively, you wouldn't think that something like a cache, which the program has no way to access directly, should be architecturally visible. Even putting on a security-minded hat, you would think that it doesn't matter what's in the cache, because if a program tries to access kernel memory, the access still has to undergo a permissions check.

The attack is pretty damn clever. And disheartening.


More or less the conclusions I came to during university. Speculative execution felt dicey (because you're doing things before you know if they are the right things), so I spent some time thinking about it and realised that you are (basically) throwing away or rolling back the erroneous execution. At worst I thought it would mean wasted execution if, e.g., your branch predictor is poor - I never even thought about timing or indirect attacks. It's a pretty subtle thing IMO.

IIRC, one of my professors at one point explained modern CPU behaviour (branch prediction/OOOE etc.) as, roughly: the CPU can do whatever it likes because the program never sees what happened under the hood - it just has to deliver the results in order and make sure they're correct.


> if the security model was clearly leaking at that point

The leak is only clear in retrospect. Many, many things are only clear after you see how they were done.

It has been twenty years since processors with this vulnerability started appearing. Over those two decades, thousands of very smart engineers (including state-sponsored ones) have collectively spent millions of hours of analysis trying to find security flaws.

No one has found such a clever timing attack until now. So "clearly leaking" wasn't "clearly leaking" until this week.


No one has reported such a clever timing attack until now. There's no reason to suppose that such an attack hasn't been used by nation-state actors for, say, a year already.


> The leak is only clear in retrospect. Many, many things are only clear after you see how they were done.

However, even at the time of the design it would have been obvious that deferring security checks is a risky design choice.


Colin Percival found a similar problem with Intel's implementation of "HyperThreading" back in 2005:

http://www.daemonology.net/hyperthreading-considered-harmful...


With all the news about these attacks lately, this is one of the best posts I've seen at explaining to less knowledgeable people how exactly speculation causes a problem.

One question I still have that gets glossed over is how timing of instructions is captured.


In order to exploit it from a script running in a web browser, you'd use the high-resolution timer in JavaScript. This one is limited to 5 to 20 us resolution to prevent such attacks.

Recently a shared-memory extension has been proposed: one JavaScript thread just increments a counter in the shared memory, functioning as a clock for the other thread.
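
A toy Python sketch of that counting-thread clock idea (purely illustrative: in a browser the counter would live in a SharedArrayBuffer and be incremented by a worker, and CPython's GIL makes this version far too coarse for a real attack):

  import threading

  counter = 0
  running = True

  def tick():
      # Busy-increment: the counter value itself serves as the clock.
      global counter
      while running:
          counter += 1

  threading.Thread(target=tick, daemon=True).start()

  before = counter
  _ = sum(range(10_000_000))   # stand-in for the memory access being timed
  ticks = counter - before     # more ticks elapsed => slower operation
  running = False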

In both cases, such (Spectre) attacks can be prevented by browser updates, so any performance impact is not system-wide.

This is different from Meltdown, which (only?) affects Intel. That one requires kernel changes which cause system-wide performance degradation.


> This one is limited to 5 to 20 us resolution to prevent such attacks.

* make such attacks more difficult.


I've been pondering how to identify the cached line without a timer. It seems like a classical race condition if multiple reads can be issued at once, eg by ILP. Another thread could easily figure out which load completes first.


Impossible. This attack relies on detecting the timing difference between a cache hit and a cache miss. If your clock resolution is larger than a cache miss then you can't differentiate the two events and so no information is leaked.


Not quite. An instruction that takes 1us is much less likely than a 10us instruction to start and end in different 20us clock ticks. Simple repeated sampling combined with statistics still yields a timing attack. It'll be slower and less deterministic, but it's still a problem.
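
A quick simulation of that point (hypothetical numbers, just to show that a coarse 20us clock still separates a 1us event from a 10us event statistically):

  import random

  RESOLUTION = 20.0   # clock granularity in microseconds

  def coarse_measure(duration_us):
      # The event starts at a random offset within a clock tick; all we can
      # observe is whether it crossed a tick boundary (reading 0 or 20).
      start = random.uniform(0, RESOLUTION)
      return ((start + duration_us) // RESOLUTION) * RESOLUTION

  def mean_reading(duration_us, samples=100_000):
      return sum(coarse_measure(duration_us) for _ in range(samples)) / samples

  print(mean_reading(1.0))    # ~1: crosses a boundary about 5% of the time
  print(mean_reading(10.0))   # ~10: crosses a boundary about 50% of the time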


The GPS signal is below thermal noise at ambient temperature [1]. By that logic you would say it is impossible to extract. However, with clever math you can extract a signal which is below the noise floor.

Particle physicists are experts at this, extracting tiny signals from huge piles of noisy data.

[1] https://sdrgps.blogspot.co.uk/2016/02/find-signal-in-noise.h...


You can wait for a random period on the scale of the clock then measure and take statistics on the results. It will slow down the attack but not stop it.


Right, but all this does is close off one possible avenue for a clever attacker to cobble together an HRT. Given the sheer number of JS APIs that have been thrown into browsers, I think it's pretty naive to imagine that we've closed off the only possible avenue to jury-rig a sufficiently precise measure of time.


> 5 to 20 μs resolution

Updated for those of μs confμsed by this μngainly grammar


One way would be to use the rdtsc instruction, which reports how many cycles have been executed since the processor was reset.


Using rdtscp is better in this case than rdtsc, because it waits for all prior instructions to complete before reading the counter. So speculation (instruction reordering) itself doesn't affect the results.


Which is precisely why Firefox is releasing an update that tweaks high-resolution timing. I guess "tweaks" probably means "breaks" for some use cases, but that's likely unavoidable and worth it.


I definitely agree


The cores in (all versions of?) Raspberry Pi do speculatively execute. It's just that the window of opportunity is tiny - just a few cycles (and maybe up to twice that many instructions) - and there's (probably) no way to get an indirected side-effect.

I wouldn't write off the ability to get a useful side-effect signal. The variants widely documented are not the only possible methods of inducing speculative side-effects.


Yes, the Raspberry Pi will issue loads before the branch resolves. But usually a processor's pipeline is laid out in such a way that the AGU won't have time to pass an address to the load pipe before the branch resolves and squashes the load. The Cortex-A8 was an interesting exception, but it was pretty deeply pipelined compared to most in-order cores.


Plus, the Raspberry Pi does have a BTAC (Branch Target Address Cache). Part of Spectre uses the fact that in many architectures the branch target cache is shared across the kernel/user boundary to create a side channel.


> The lack of speculation in the ARM1176, Cortex-A7, and Cortex-A53 cores used in Raspberry Pi render us immune to attacks of the sort.

I didn't check, but these will almost certainly have branch prediction. What they probably lack is a predictor advanced enough to speculate on indirect branches, which AIUI is the primary vector of Spectre.


Branch prediction alone is insufficient. Speculative execution alone is insufficient. You need speculative memory loads for any of these attacks to work.

The Cortex-A53 branch predictor [1] does prefetching to keep the core fed. This ensures that the instructions are ready for decoding, but has no architectural effects beyond the L1 instruction cache, which is already a well-studied timing sidechannel.

[1]: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....


What about the fact that these instructions might get partially executed in the pipeline before the branch gets resolved and the pipeline flushed? If a mis-fetched instruction can reach the LSU stage before the pipeline gets flushed, it might serve as a speculative memory load...


They're not partially executed. The branch predictor only fetches instructions. They might be decoded, but it's not an out-of-order processor-- pipeline stages only proceed if the previous phase is correct.

Here's the Cortex-A53 pipeline: https://www.anandtech.com/show/11441/dynamiq-and-arms-new-cp...

It's an in-order CPU, so that "issue" phase (pipeline step 5) stalls until the instruction pointer is resolved. Instructions must be issued to the "AGU Load" functional unit, which is what actually performs the read and pulls data into the cache hierarchy.

Note also that a single speculative memory load is insufficient for Spectre. You need two speculative memory loads.
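
For reference, the canonical bounds-check gadget uses two dependent loads, roughly like this (Python-style sketch in the spirit of the article's examples; the array names and the 4096-byte stride are the usual illustrative choices, and the leak only exists when hardware speculates past the bounds check in native code):

  array1 = bytearray(16)             # in-bounds "public" data
  array2 = bytearray(256 * 4096)     # probe array: one 4 KB slot per possible byte value

  def victim(x):
      if x < len(array1):            # branch the attacker trains to predict "taken"
          secret = array1[x]         # load 1: speculatively out of bounds when x is too large
          _ = array2[secret * 4096]  # load 2: leaves a secret-dependent footprint in the cache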


ARM Ltd has a list of vulnerable cores [1]; the above are not listed.

[1] https://developer.arm.com/support/security-update


I didn't imply they are vulnerable.


Eben delves into some unaffected speculative ARM features in the comments, and why they're not vulnerable. His responses in the comments in general are also worth a read, in case folks skip comments as a rule.


I was already on the lookout for a small ARM-based mini PC, just for doing financial transactions and record-keeping. Now that seems more pressing but I don't know of any such thing in existence.

I tried doing that on RPi 3, but the IO seemed not up to the job -- the CPU appeared to be just about tolerable, but using micro SD as a disk was too slow and prone to failure (I'd have tried an external USB disk but I believe the problems were in part because of poor I/O bandwidth). Other single board machines seemed to have better provision for disks that are up to the task I had in mind, but lack software support, so that I had little confidence in security updates, for example.

If somebody sold this I think they'd have my money tomorrow:

* An ARM mini-PC

* With a decent security update team behind it (probably the hard part?)

* That will let me run some basics: for me, a Unixy OS with Chrome/Chromium, emacs, ledger and python, without a big effort to install those and keep them up to date

* Ideally without too much anti-commodification BS (from my customer perspective) so that hardware can be swapped out if needed

Does anything like that exist?


SD cards are optimized for sequential IO (reading/writing photos, video, music). For an OS root partition, random IO is much more important for general use. If the root partition is mounted from an external USB drive with higher random 4K IOPS benchmarks, IO performance should be greatly improved.


(coming late to this thread ...)

arm-powered chromeboxes don't exist (yet?), so perhaps an arm-powered chromebook?

(chromeboxes tend to be more upgradeable than their chromebook counterparts)

Linux OS, kept up to date for you (with google backporting security fixes to their kernel, and of course - updating their browser), running on arm.

Can be used as a simple browser-only machine, or if you are decently comfortable with linux, you can unlock its potential and use it as a fully-fledged linux machine. Your choice.

If you want to avoid being part of the google foodchain, you could try dual-booting into another arm distro of your choice.

Best I can think of, at the moment ...


This is a fantastic read. Timing attacks are insidious and tend to crop up in the oddest places. I first learned of them when learning how to securely compare strings (used a lot with passwords). A naive implementation means that you can easily guess if a character is correct depending on how fast the compare function returns.
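
For anyone curious, the naive version bails out at the first mismatch, which is what leaks; the usual fix is a constant-time compare (a minimal sketch using the standard library):

  import hmac

  def naive_equal(a, b):
      if len(a) != len(b):
          return False
      for x, y in zip(a, b):
          if x != y:          # early exit: runtime depends on how long the matching prefix is
              return False
      return True

  def constant_time_equal(a, b):
      return hmac.compare_digest(a, b)   # stdlib constant-time comparison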


You shouldn't be comparing plain-text passwords anyway; you should be using a secure password hash such as bcrypt. Sure, you should use a constant-time comparison, but in this specific case using a normal comparison won't really make you vulnerable.


Normal comparison is bad even if you are comparing hashes. Letting the attacker figure out the password hash allows them to attempt to crack the password through an offline brute-force attack running on GPUs.


I don't understand how this can be an issue. This is usually what happens:

> User enters password > PW gets hashed > Hash gets compared to the DB

I don't know of any sane system that would allow you to compare a hash to a hash? Unless you have access to something you shouldn't, in which case it doesn't matter anyway, because you can probably just read the hashes.


If the user knows exactly how the hashes are generated (I believe a random salt would prevent this), it could still be used to better target an online bruteforce, though other defences such as rate-limiting should still kick in.

If, say, you submit a password with a hash that starts "b94", if the database doesn't use a constant time comparison, you can use the timing to figure out that the stored hash also starts with "b94" (statistically, given network etc. delays involved), meaning you can pre-filter your submitted guesses (i.e. bruteforce offline and only submit guesses that start "b94").

It's definitely an edge case though (and probably not worth worrying about unless you don't salt/rate-limit requests). I also don't know if the number of requests needed to determine the timing would actually be less than just making random guesses outright (intuitively it seems so, because even if it takes a lot of requests it shrinks the search space at each step).


I believe that you now know that every time you do this (comparing passwords as Strings) a puppy dies!


Sorry. Is this crude? Don't really understand down votes.


I suspect the downvotes are because the comment is pretty light on substance as well as being a tired meme. HN members typically value good, substantive, constructive comments and try to keep the signal to noise ratio up as best they can.

https://news.ycombinator.com/newsguidelines.html


Thank you.


This article wonderfully explains a complex context without losing a lot of relevant detail.


> In the good old days*, the speed of processors was well matched with the speed of memory access...Over the ensuing 35 years, processors have become very much faster, but memory only modestly so: a single Cortex-A53 in a Raspberry Pi 3 can execute an instruction roughly every 0.5ns (nanoseconds), but can take up to 100ns to access main memory.

In real-world terms, what's the fastest processor we could build today whose execution speed is reasonably matched to its main memory access speed (so it doesn't need caches, etc.)?

I could imagine that a processor, with a simple design that closely matches a naive model of how CPUs work, would be very useful for high-security applications. It would be much easier to reason about up-front.


The problem isn't really the speed of the memory but its size. The planar nature of memory and the speed of light impose an access latency on a pool of size N proportional to the square root of N: a planar array of N bits has side length proportional to sqrt(N), so the farthest cell is a sqrt(N)-proportional distance away. The L1 cache on your CPU has kept up in speed with the processor, and it is about the same size as the main memory computers had back when they could still access their main memory quickly.


The latency of DRAM is mostly governed by Dennard scaling ("regardless of transistor size, power density remains the same"), because it means that making cells smaller reduces the available charge currents for bit-lines and the like proportionally, so only a small latency advantage can be gained.


Modern DDR4 SDRAM can run up to 2133MHz, but DDR SDRAM is designed to increase speed by taking advantage of CPU caches. For peak performance, reads occur in bursts, so one read command also fills the cache with nearby data.

SRAM can go fast since that’s what caches are made with, but that’s expensive. Also it would need to be close to the CPU as wire latency is non-trivial at high clock speeds.


Something I don't understand is why SRAM is more expensive than SDRAM. Is it just the manufacturing volumes and/or manufacturing yield, or are there architectural differences that make SRAM more expensive?


Each "bit" in SRAM is bigger than each "bit" in DRAM. Typical SRAM uses six transistors for each bit, while typical DRAM uses one transistor and one capacitor. Bigger cells mean bigger chips, which means fewer chips per wafer and a larger chance for any given chip to have a manufacturing defect.


Thank you for the explanation.


> In the good old days, the speed of processors was well matched with the speed of memory access...

In fact on the Beeb, spiritual ancestor of the RPi, memory ran at 4MHz and the CPU at only 2MHz...


IIRC, some of those old machines interleaved CPU and GPU memory access windows, since memory was so much faster than CPUs back then.

Edit: this is what I'm talking about: https://www.bigmessowires.com/2011/08/25/68000-interleaved-m...:

> The Atari ST and Amiga appear to have both used a more aggressive scheme where video circuitry access occurred during known dead time in the 68000 bus cycle, so the CPU never had to wait


People use mainframes to this day for similar reasons.


That doesn't make a ton of sense to me. Aren't mainframe CPUs developed using modern techniques?


Actually, mainframe CPUs are developed using old techniques, which are all of a sudden interesting again. (-:

See for starters this discussion about reinventing the AS/400: https://news.ycombinator.com/item?id=16053518


At least some time ago IBM mainframes were powered by their regular POWER processors "just" with modified/extended microcode. Though it seems they also have dedicated CPUs for them, like the zEC12. (POWER doesn't really qualify as "old techniques")


I enjoyed reading this a lot. I wonder why the designers decided to allow reading kernel memory in the first place. When a scalar processor reads kernel memory, it crashes. When a speculative processor reads kernel memory, it relies on the assumption that the read is never committed to prevent leakage. It takes no expert to realise this is a potentially dangerous decision (and, as becomes clear now, is only valid in the absence of a cache).

To me it would make a lot more sense to use a special value to indicate the read did not succeed and propagate this value until it is time to crash. I guess this introduces some overhead (e.g. reserve a special value); but are there any other drawbacks?


This is the part I don't understand. How is the processor able to read a cacheline from a protected memory page without crashing (even if it was speculative and wouldn't happen in the idealized execution due to branching).


Because in the Intel design, for memory reads issued by speculative instructions, any "access denied" results are also delayed until the CPU control unit determines the instruction that issued the read should really have been executed.

But the actual read is allowed to occur, even if the "access denied" signal is given, which allows the read to affect the state of the data caches. This was likely done this way as a performance booster, because it would allow speculative instructions to also perform cache pre-fetching during their speculation window.

That seems to be why AMD CPUs are immune to Meltdown. AMD's design prevents the read from occurring when the "access denied" signal appears, so the cache state is not affected, so there is no side channel to detect.


> But the actual read is allowed to occur, even if the "access denied" signal is given, which allows the read to affect the state of the data caches. This was likely done this way as a performance booster, because it would allow speculative instructions to also perform cache pre-fetching during their speculation window.

Why is this? Is it because the CPU doesn't know ahead of time what is valid (because it depends on the "outcome" of instructions in flight), or is there something I'm overlooking?


Well, if you try to put yourself in the mindset of a CPU designer, without extensive cryptography experience [1] to be fully aware of timing side-channel attacks, you would see the speculative execution memory reads as harmless. If the predicted path is wrong, you'll reset the CPU state (cpu registers) so the running program sees nothing different. And if you skip running the reads through the full memory protection gamut (you still have to do the address translation) during the speculation window you'll save a few cycles on the reads, and maybe a tiny bit of power. And in the case that the predicted path was correct, any "access denied" signals need to be delayed until the accessing instruction would commit anyway (to maintain proper sync with how the signal works, in that it indicates which instruction took the memory access fault, so you can't raise the signal until you know for sure the instruction would have executed). And if you see the reads as harmless (because they are thrown away if the speculative guess was wrong) then you might also see them as "free" cache pre-fetch instructions (because they do pre-warm the cache when the speculative path is the correct path).

In the end the result is a confluence of several different factors (speculative execution, data caching, high-resolution timers [although these can be simulated with a plural CPU system]) that each in isolation is all but harmless, but together emergent behavior appears that was not immediately apparent from each one viewed individually. I.e., without caches there's no side channel to monitor. Without speculative execution there's no way to trick the CPU into reading a bad address while avoiding a memory access fault. Without high enough resolution timers it becomes very hard to detect the time difference between a cache hit and a miss.

[1] a reasonably safe assumption - most CPU architecture designers are not cryptographers, and most cryptographers are not CPU architecture designers, and most timing side-channel attacks have historically been against crypto algorithm implementations.


  t, w_ = a+b, kern_mem[address]   # speculative read of kernel memory into the shadow value w_
  u, x_ = t+c, w_&0x100            # derive a user-space index from one bit of the secret
  v, y_ = u+d, user_mem[x_]        # speculative user-memory read that warms the cache at x_
  
  if v:
     # fault
     w, x, y = w_, x_, y_      # we never get here

From the author's example, it seems like the processor is able to read the kernel memory privately (`w_`) and crashes if the code attempts to commit `w_` to `w`. It would be interesting to know how the processor does that.


The explanation is at the end of page 4 and the beginning of page 5 of the Meltdown paper. In a nutshell, the speculative feature allows the read to happen, it "doesn't segfault," and the segfault is only signalled to the OS if the instruction is actually executed.


The best explanation of Meltdown I’ve read.


The best comment made on that blog post was by Eben himself:

"One almost wishes that they’d stuck with the original name for the KPTI patchset: Forcefully Unmap Complete Kernel With Interrupt Trampolines.

https://www.theregister.co.uk/2018/01/02/intel_cpu_design_fl... "

Now that's funny!!!


I've been wondering (and haven't seen it addressed anywhere) if these attacks could be used to get the private key out of game consoles. These days I would assume not - that the key would be in a secure enclave - but the current generation of consoles are a few years old now and maybe that's not the case.


The private key used to sign code? No, that wouldn't be in the console at all.


I imagine they mean the decryption key (rather, the ‘private’ encryption key).


Off topic, but I haven't seen this discussed anywhere yet. My understanding is that font files can contain complex instruction sequences to control exactly how a font is rendered. I believe Windows implements a kernel space VM to execute these instructions. I know variants 1 and 2 did not necessarily require eBPF but that it made the attack simpler because the desired instruction sequences could be injected directly into kernel space (rather than finding existing sequences in the code base). It seems that in theory font rendering could serve a similar function on some platforms.


Now... I am interested in assembly :D Any recommendations ???

Really awesome explanation


I highly recommend http://www.nand2tetris.org/ and the The Elements of Computing Systems book https://www.amazon.com/Elements-Computing-Systems-Building-P...


Thanks :)


Hey, what about Intel XScale processors like the PXA2xx series?

These do have dynamic branch prediction/folding AFAIK and may be affected?

Does somebody have a spectre.c tuned for generic armv5tel for example?

Current versions of spectre.c, like this one https://gist.github.com/LionsAd/5116c9cd37f5805c797ed16fafbe... still contain "_mm_clflush" and therefore do not compile on ARM at all.


Fantastic read. Before reading the article, I assumed there were many HN readers who were extremely proud of their Raspberry Pi being invulnerable to Spectre or Meltdown.


I've done some cursory searching and not found anything, so I'll ask here: what mechanism is used to measure how long it takes to access a specific address in memory?

I assume there is some way to tell the CPU "when memory location X is read, store the current time in register Y" or some such thing. Could anyone share what that mechanism is?


Elsewhere in the thread, someone asked the same question and got an answer: https://news.ycombinator.com/item?id=16080230


Thank you for that link! I'll write out the conclusion I came to from reading those comments:

Instead of measuring the literal time interval between instructions, the number of cycles between two points is measured (using the RDTSCP instruction).


How many RPi users are using this board to run untrusted code?

The RPi may mitigate risk of these attacks simply in the way it is used.

Perhaps hobbyists use it to run their own small programs, not random third-party JavaScript in an enormous web browser from some corporation.



...nor are 486s, 386s, AVRs, 8051s, and a bunch of other low-performance in-order CPUs.

There's also this... https://groups.google.com/a/groups.riscv.org/forum/#!topic/i...


That's a little unfair: the RISC-V design is intended for both low-power and high-power applications, and an implementation has the potential to be comparable to the Pi 3's Cortex-A53. Whereas the two x86s you mention are very old, slow CPUs and the other two you mentioned are 8-bit microcontrollers.


Yay, so my AVR and 8051 "usermode" code can't read kernel data. Phew!


Nice. I wish I could get hold of some inexpensive general purpose RISC-V hardware for hacking on. Hopefully soon!


Me too, I don't follow the news on it much but I would like it if some single board computers appeared with RISC-V in the not too distant future.


What would finally bring this all together for me would be an example of a real-world attack carried out using these methods on some target, perhaps with an implementation.


Since it was an industry effort to find the flaw, and they are vulnerable to the threat they've exposed, it would seem at odds with their interests (and my own) to provide you or anyone else with an example of how to exploit it.

And I'll offer that if you're not capable of demonstrating it after reading Eben's description of how it works, then there is no good reason for you to have an example handed to you.

If you think you are capable I'll offer your time would be better spent working on fixes.


Didn't ARM say the Cortex A53 is vulnerable to Meltdown?


According to https://developer.arm.com/support/security-update, the Cortex-A57 is affected, but the Cortex-A53 isn't.


A Google Project Zero member said that they got Meltdown working on a Cortex-A53.


Does anyone expect bug revelations after this to be less severe, or is there still a chance there could be vulnerabilities that are worse than these?


It's hard to imagine something worse to be honest. These vulnerabilities basically amount to ripping away the entire veil of protections at every level that we've built up over the years.

Future vulnerabilities that I could imagine being "worse" would be either encryption vulnerabilities or signals level vulnerabilities.


Technical details aside, I find it quite amusing that the hardware in my Pi Zero is more secure than my desktop, which is two orders of magnitude more expensive.


Well, the vulnerabilities are directly connected to performance-enhancing architectural features, so...


Which is why I said technical details aside.

I think it's interesting: you are not just paying for speed, you are paying for a compromise, because the speed is gained through complexity, which not only increases the chance of error (in design or implementation) but, in the case of a high degree of speculative execution, can translate into worse performance per watt. In short, it's the whole "more is less" thing.


> you are paying for a compromise

Very good point. Apparently including at least one compromise that most people (probably including the engineers who designed the CPUs) didn't know they were making.


And yet, my abacus is even more secure ;-)


Wrong kind of arm processor.


Only if you wear gloves while using it and shake it after finishing a calculation.


Been reading Cryptonomicon, have we?


There might be some side-channel attacks if your rows are too close to each other....


Is an abacus Turing-complete?


I duno, technically it's the 'arm' executing the instructions

:P yes I stole it from mywittyname


Best pun of the year so far. ;-)


Only if you managed to find a flaw in the arm cores used in the Pi.


It isn't funny or witty to point out that an Opel Corsa gets better gas mileage than a 747, even if it is factually correct.

I didn't downvote you (can't even do it), but I suspect you are getting downvoted because your analogy is so off the mark that it can't even be called "Apples vs Oranges".

edit: replying to the "why the downvotes" which has since been edited away


> It isn't funny or witty to point out that an Opel Corsa gets better gas mileage than a 747

Actually, if it's true that a car gets a better mileage than a big airplane (more in one vehicle = more efficient, people seem to believe) I would find that interesting. Similarly, I can see how OP thinks it's ironic that a very cheap machine is not vulnerable whereas a quite expensive piece of equipment is, making it seemingly less-well engineered.


Transatlantic flights average 75 mpg per passenger, but can be up to almost 100 mpg per passenger:

https://en.wikipedia.org/wiki/Fuel_economy_in_aircraft

Jet fuel is 37.4 MJ/L; compare to gasoline at 34.2:

https://en.wikipedia.org/wiki/Energy_density#Energy_densitie...

An electric car will have up to 150 miles-per-gallon equivalent (aka 150 miles per the same amount of energy that's in one gallon of gasoline):

https://en.wikipedia.org/wiki/Miles_per_gallon_gasoline_equi...

So therefore, an electric car with just the driver is more efficient energy-wise than an airplane, while an average airplane is better than a gas car with even three people in it.


You already started along the road to normalisation by using MJ/L but you might as well go the whole way and consider absolute cost per person per mile... or person miles per absolute unit cost if you prefer. Those might be more telling figures.

For instance, many EV proponents note the total cost of ownership difference due to significant differences in maintenance costs. Playing the other side of the argument: they tend to forget that the (time) value of an upfront cost is greater than that of a later cost.

It's still possible to achieve some sort of quantitative comparison by applying an interest rate based on "distance" of the cost in time before summing them. If you applied this to a regular petrol car you might be able to make a better comparison to the ticket price of a flight.


Dunno, a modern UK car will get 60mpg on motorway travel (that's what I've had), or even more. That's 50 miles per US gallon. 2 passengers means 747 fully-loaded efficiency, 3 wipes the floor.

Of course a 747 in a domestic Japan configuration will have more passengers than a typical BA 747 transatlantic flight with 100+ beds.


They are obviously way differently powered CPUs, but if they were as incomparable as you suggest then this article wouldn't even exist.

It doesn't even matter though, I said technical details aside...

I'm not stupid, I know a Pi Zero is slow, but it's a cheap computer that happens to be invulnerable to three really bad side-channel attacks that plague all the big shiny expensive ones. How is that not a little amusing? Not any more, now I've had to argue for it.



