Here's the basic idea:

1. The first thing the AES encrypt function does is XOR user input against the secret key.
2. The result of this operation is used to index into a lookup table (the S-box).
3. The timing characteristics of step 2 vary depending on whether the indexed S-box entries are already in the cache.
By measuring the timings for a large number of random inputs and correlating them by index (e.g. sorting all messages into buckets based on the value of the first byte), we can recover the corresponding byte of the secret key.
Though it's not clear a priori what the timing characteristics should look like for a particular key byte, we can easily measure this on a machine we control that matches the description of the target platform.
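Here's a rough sketch in C of the bucket-and-average step. This is purely illustrative, not Bernstein's actual tooling: time_encryption is a placeholder oracle you'd replace with a real measurement (an rdtsc-wrapped AES call, a network round trip, ...), and SAMPLES/BYTE_POS are names I made up.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define SAMPLES  200000
    #define BYTE_POS 0            /* which input byte we are profiling */

    /* Placeholder oracle: in a real experiment this would do one AES
     * encryption of `input` and return an elapsed-cycle count. */
    static uint64_t time_encryption(const uint8_t input[16])
    {
        (void)input;
        return (uint64_t)rand() % 100;   /* stand-in noise, not real data */
    }

    int main(void)
    {
        double   sum[256]   = {0};
        unsigned count[256] = {0};
        uint8_t  input[16];

        for (unsigned i = 0; i < SAMPLES; i++) {
            for (int j = 0; j < 16; j++)
                input[j] = (uint8_t)(rand() & 0xff);

            uint64_t t = time_encryption(input);

            /* Bucket by the value of the byte under attack. */
            sum[input[BYTE_POS]]   += (double)t;
            count[input[BYTE_POS]] += 1;
        }

        /* Print average timing per bucket.  Comparing this profile against
         * one collected with a known key on an identical machine reveals
         * key[BYTE_POS]: the table index is input ^ key, so the two profiles
         * are XOR-shifted relative to each other by (known_key ^ secret_key). */
        for (int b = 0; b < 256; b++)
            if (count[b])
                printf("%3d  %.2f\n", b, sum[b] / count[b]);

        return 0;
    }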
Unsurprisingly excellent description. How much control do we have over the cache lines populated by AES in a TLS scenario? We can't fully randomize the AES inputs, because they have to correspond to the TLS protocol.
I think it would be tough to attack TLS with this.
A couple important stumbling blocks off the top of my head:
We need to pull off the attack in the context of a single TLS session, because new AES keys are generated for each one. This means we can't choose any inputs to the AES function. If we try to manipulate or forge a message, the MAC will fail and then the jig is up. This might not be a big deal, since the AES ciphertexts will seem random anyway.
There will be a large amount of overhead in each request compared to Bernstein's original experiment. This will likely introduce a large amount of per-request jitter. With enough requests we can iron this out, but again: one session. So there may be an upper limit on the number of samples we can realistically take.
We may have a tough time comparing like to like. Certainly we can request the same static resource each time if we control malicious JavaScript, but there may be confounding factors. This might not be a big deal, since I think this just amounts to more jitter.
If "sleep" means "sleep until the next timer tick", how can it average out? Especially if the timer is started at the start of the encryption, and all encryptions (at least, for one block) take less time than the timer is set for. That means all times get set to exactly the timer time, and no information reaches the attacker.
"Random sleep" does not in general mean "sleep until the next timer tick". The best fix is making the function constant time, if you can achieve this with a sleep that makes the operation always take exactly one quantum then the sleep is really an implementation detail and quite far from "random sleep".
I've realized this (and am certainly not alone in that; it's rather obvious when one thinks about the nature of timing) -- but does anyone have some links on implementing it? Is there some (sequence of) x86_64 instructions that can be used to bound a procedure (in the Pascal sense of the word) to a quantum, regardless of things like when the procedure is called? (Assuming the procedure is short enough. Depending on branching behaviour, I suppose instruction fetch/decode, data fetch/write and the accompanying cache hits/misses can make it hard to a) select a worst-case run time in clock cycles to target (and if so, for which concrete CPUs) and b) make sure the CPU is actually busy for exactly that many cycles...) Is this even possible to approach in portable C99?
I suppose if one ignores the information leak due to possible changes in CPU load, one might devise a kind of evented "callback" model, where one waits to return the result of a procedure until an interrupt is triggered?
I don't expect a full answer, but if anyone has a link to some source code that isn't too complicated, I'd be very happy (either "real-world" or some good "example" code).
The most common way to avoid timing side-channel attacks is to write the procedure in C or ASM in such a way that there are no data-dependent differences in execution path. You've probably seen e.g. the memcmp that doesn't exit early. This attack is a little different in that it's not that different instructions are executed; it's that different memory access patterns take different amounts of time. For that, you can maybe change the implementation to not have any data-dependent array accesses, or maybe you can do things with prefetching to make the memory accesses constant time.
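For instance, a comparison that doesn't exit early and a masked table lookup might look roughly like this (ct_memcmp and ct_lookup are illustrative names, not from any particular library):

    #include <stddef.h>
    #include <stdint.h>

    /* Comparison with no data-dependent branches: accumulate differences
     * and only inspect the result at the end, instead of returning at the
     * first mismatching byte. */
    int ct_memcmp(const void *a, const void *b, size_t n)
    {
        const uint8_t *pa = a, *pb = b;
        uint8_t diff = 0;
        for (size_t i = 0; i < n; i++)
            diff |= pa[i] ^ pb[i];
        return diff != 0;          /* 0 if equal, nonzero otherwise */
    }

    /* Table lookup with no data-dependent memory addresses: touch every
     * entry and use a mask to keep only the one we want. */
    uint8_t ct_lookup(const uint8_t table[256], uint8_t idx)
    {
        uint8_t result = 0;
        for (unsigned i = 0; i < 256; i++) {
            /* mask is 0xff when i == idx, 0x00 otherwise, without branching */
            uint8_t mask = (uint8_t)(((i ^ idx) - 1u) >> 8);
            result |= table[i] & mask;
        }
        return result;
    }

The masked lookup is obviously far slower than table[idx], since it reads all 256 entries every time, but the memory access pattern no longer depends on the secret index.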
An approach where you watch the clock will be inherently less portable and actually much harder. Not only will the timing calls be hardware or OS specific, but so will the worst-case time. Imagine having to deal with a chip going into low power mode during your computation. Also you probably don't want to count time that your thread wasn't scheduled to run, so now you're talking about integrating with the scheduler.
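To make that concrete, a clock-watching wrapper ends up looking something like this rough POSIX-only sketch (run_padded and the 50 ms budget are invented for illustration), and every piece of it -- the clock, the sleep call, the worst-case budget -- is platform-specific:

    #define _POSIX_C_SOURCE 200809L
    #include <time.h>

    /* Run the sensitive operation, then sleep until a fixed deadline so the
     * caller always observes (at least) QUANTUM_NS.  The budget must exceed
     * the true worst case, which is exactly the hardware/OS-specific number
     * that is hard to pin down. */
    #define QUANTUM_NS 50000000L   /* 50 ms; made-up budget for illustration */

    void run_padded(void (*sensitive_op)(void *ctx), void *ctx)
    {
        struct timespec deadline;
        clock_gettime(CLOCK_MONOTONIC, &deadline);
        deadline.tv_nsec += QUANTUM_NS;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec  += 1;
            deadline.tv_nsec -= 1000000000L;
        }

        sensitive_op(ctx);

        /* Sleep until the absolute deadline, retrying if interrupted.
         * Scheduling jitter, power states, etc. still leak some timing. */
        while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL))
            ;
    }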
You always want to remove secret data, not obfuscate it. Think of the difference between those reversible pixel mosaics and a flat black censor bar, for example.
Incidentally, many 'normal' blur functions implemented in software are somewhat reversible.