Speeding up Linux disk encryption (cloudflare.com)
491 points by jgrahamc on March 25, 2020 | 134 comments



> otherwise, we just forward the encryption request to the slower, generic C-based xts(ecb(aes-generic)) implementation

This seems like at least something of a bad idea, because that implementation (if my search-fu is correct) is:

https://github.com/torvalds/linux/blob/master/crypto/aes_gen...

Which is obviously not constant time, and will leak information through cache/timing sidechannels.

AES lends itself to a table based implementation which is simple, fairly fast, and-- unfortunately-- not secure if sidechannels matter. Fortunately, AES-NI eliminated most of the motivation for using such implementations on a vast collection of popular desktop hardware which has had AES-NI for quite a few years now.

For the sake of also being constructive, here is a constant time implementation in naive C for both AES encryption and decryption (the latter being somewhat hard to find, because stream modes only use the former):

https://github.com/bitcoin-core/ctaes

(sadly, being single-block-at-a-time and constant time without hardware acceleration has a significant performance cost! ... better could be done for XTS mode, as the above algorithm could run SIMD using SSE2 -- that isn't done in this implementation because the intended use was CBC mode, which can't be parallelized like that)

Can't the kernel AES-NI code just be set up to save the FPU registers itself on the stack, if necessary?


Curious why CF needs to worry about side-channel attacks when all the software running on those machines belongs to / is written by them. They do have a "workers" product with third-party code, but they can easily keep storage servers out of that pool. Typically storage encryption is all about what happens when a machine is physically stolen, a hard disk is discarded after a failure, or other such events beyond network security. Please correct me if I am wrong.


I believe you are wrong because of https://workers.cloudflare.com/


Last time I checked, they use many contractors and third parties to deploy mini data centers inside other data centers. They ship them servers to install. Cloudflare doesn't have many private data centers.


You could measure the timing over the internet.


Users aren't interacting directly with the storage layer, so any timing attack via the network is going to be once or twice removed. Can attackers really glean useful information and mount a successful attack in this type of setup?


This is almost certainly true in practice, but it's a big risk, compared to the risk tolerance that we usually engineer into crypto. For comparison, suppose someone was suggesting: "Why not use 80-bit keys instead of 128-bit keys? No one in the real world can brute force 80 bits, and we'll save on storage." Yes, that's true, but it's taking a relatively large risk for relatively little benefit. Hardware will get faster over time, and an extremely high value target might justify an extremely expensive attack, etc etc. We prefer 128-bit keys because then we don't even have to consider those questions. I think timing attacks are similar: Yes they're very difficult in practice, but they raise questions that we'd rather not have to think about. (And which, realistically, no one will ever revisit in the future, as hardware evolves and new APIs are exposed.)


I always imagined key size to relate to computation cost and not storage — what algorithm are you referring to?


The point is that you need more bits to store a longer key, but the storage space saved is very little in this case compared to how much easier it is to crack.


Sure, but a difference of 0.1% of storage to go from an 80-bit key to a 1024-bit key for 1 Megabit of data (that's 118 bytes out of 128KiB), or about 0.0001% for 1Gbit (128MiB), seems not worth raising as a concern.

(I've chosen example numbers just to make calculation trivial)

So I can't ever imagine storage size being the driver for choosing the key size, though from the other threads, it seems that there are algorithms that do have a storage overhead that might be related to key sizes.


YES.

It requires statistical techniques to remove the noise, making the attack harder, but not necessarily infeasible.


Is that really the case though when the differences in computation would be measured in microseconds, but the network noise would be in the order of milliseconds?



I don’t know about that... in the paper the client and server are on the same network. It would be very interesting to repeat this study using faster processors (which will make this signal smaller) and over the public internet (making the noise bigger).


This is why constant time functions are used in cryptographic implementations, even over the network.

These are called timing attacks and they're less common now because professional cryptographers know how to deal with them. But this is very much a perfect example of one.


Maybe not relevant for CF-- more likely relevant for wider use of the approach!


> Which is obviously not constant time, and will leak information through cache/timing sidechannels.

This confuses me. Why is it in the kernel if it's not constant time? Isn't that a security risk? (Is there any context where it would be safe to invoke this?)


Sure. There are lots of cases where you are controlling for timing attacks elsewhere, or where a timing attack isn't a concern. This can particularly be true when you are writing data to block storage with the idea that a potential attacker won't be accessing it until much later... at which point all timing information would be gone.


Unfortunately cache sidechannels defeat a lot of measures that would otherwise destroy timing data.

I agree that there can be some cases where it doesn't matter but it's extremely expensive to be sure that it doesn't matter-- making it usually cheaper, when you consider the total costs, to deploy code that doesn't have the sidechannels.


It's pretty cheap if the use case is, "the computer will be turned off before I worry about an attacker".


I wish the world could move on from AES. We have ciphers that are nearly as fast without requiring specialized hardware, just generic SIMD. Imagine how fast a ChaCha ASIC could run!

There are other options for non-AES FDE too: most infamously Speck (suspected to be compromised by the NSA), but also Adiantum, which is now in Linux 5.0.


> Imagine how fast a ChaCha ASIC could run

Not as fast. Chacha20 uses 32-bit additions which are fast in software but expensive and slow in hardware. In addition protecting Chacha20 from power analysis attacks is more difficult compared to AES.

> just generic SIMD

Constant-time AES with SSE2 is actually faster than the naive variable-time AES. See https://www.bearssl.org/constanttime.html#aes

In addition Chacha20 is not nearly as fast as AES when using the AVX-512 Vector AES instructions.

> but also Adiantum

Which uses AES (once per sector).


Genuinely curious, would you mind explaining why some operation can be fast in software but slow in hardware?


I think the parent comment is saying it is fast in software on a modern CPU, but making that into an ASIC would be either a) slow or b) expensive, due to the 32-bit additions.

IIRC (I can't find it right now), when NIST held the contest for AES, the cipher had to run on low-power hardware of the late 90s/early 2000s. This required, among other things, that everything be fast on an 8-bit microcontroller.


To implement 32-bit + in hardware you need 31 full adders and one half adder, each of which uses multiple gates and depends on the result of the previous adder.

Meanwhile, in software, + and bitwise AND tend to take the same number of cycles, and each cycle takes the same amount of time; see https://gmplib.org/~tege/x86-timing.pdf

Chacha20 in hardware would not be any slower than chacha20 in software, but it would be slower than other algorithms which do not use 32-bit +.


> To implement 32-bit + in hardware you need 31 full adders and one half adder, each of which uses multiple gates and depends on the result of the previous adder.

This is not how CPUs typically implement addition, or other ALU operations. Carry-lookahead adders have existed since the 1950s: https://en.wikipedia.org/wiki/Carry-lookahead_adder


Thank you, I love this citation so much.

> Charles Babbage recognized the performance penalty imposed by ripple-carry and developed mechanisms for anticipating carriage in his computing engines.


> In addition Chacha20 is not nearly as fast as AES when using the AVX-512 Vector AES instructions.

Note that Cloudflare opted for Xeon Silver chips that aren't good at AVX-512, unless doing pure AVX-512 operations.


And their 10th gen prod servers switched to AMD which, as far as I know, have SIMD support, but not AVX-512 support specifically.


That is correct, Zen 2 doesn't support AVX512 (no AMD chip does).


AES in an ASIC is pretty efficient, I'd expect the difference to flatten if both had good hardware implementations. Not that I wouldn't be happy to see faster chacha20 on systems.


>Which is obviously not constant time, and will leak information through cache/timing sidechannels.

What's the threat model here? I can't think of a plausible scenario where side channel attacks can be used to gain unauthorized access to FDE contents.


Did this commercially for 15 years. Always the same problems.

We ended up with several solutions- but all of them generally work the same conceptually.

First off, separation of I/O layers. System calls into the FS stack should be reading and writing only to memory cache.

Middle layer to schedule, synchronize and prioritize process IO. This layer fills the file system cache with cleartext and schedules writes back to disk using queues or journals.

You also need a way to convert data without downtime. A simple block or file kernel thread to lock, encrypt, mark and writeback works well.

Another beneficial technique is to increase block sizes on disk. User processes usually work in 4K blocks, but writing back blocks at small sizes is expensive. Better to schedule those writebacks later as 64k blocks, so that hopefully the application is done with that particular stretch of data.

Anyway, my 2 pennies.


The blog post reads like this all happened recently, but their linked post to the dm-crypt mailing list is from September 2017[1]. I'm curious if they've interacted with the dm-crypt people more recently.

[1]https://www.spinics.net/lists/dm-crypt/msg07516.html


Yeah, the time frame is somewhat unclear. The patch they link to in their tree is dated December 2019 however [1], so I assume this blog post is about stuff they've completed recently.

[1] https://github.com/cloudflare/linux/blob/master/patches/0023...


Did they reach out to the Linux kernel mailing list? Or just the dm-crypt team? I found the answer they received rather arrogant and useless, to be honest.


I'm a huge "fan" of F/OSS but, unfortunately, such condescending answers are all too common in this "community".


Ages ago I benchmarked TrueCrypt overhead on my machine at the time (2006, I think?) and it was about 3%; I assumed that's a reasonable and still applicable number, also for dm-crypt and modern VeraCrypt. Guess I was getting gradually more wrong over the years, according to the git archaeology....


Also, disks in 2006 were probably much slower. Disks have gotten faster at a greater pace than processors over the last 10 years.


Wow, those speed improvements are very neat, and an awesome blog post accompanying them. Prior to reading this, I had considered Linux disk encryption to add negligible latency, because no HDD/SSD could be fast enough to outrun a CPU equipped with AES-NI, but that view has changed. Two questions: 1. Are there any efforts to upstream the patches? 2. Invoking non-hw-accelerated AES decryption routines sounds quite expensive. Has saving the FPU registers only when decryption is actually needed been tried?


The existing Linux system is useful for hardware that does less than 200MB/s, so you should be fine with HDDs.

Cloudflare is optimising for SSDs.

They don't talk about latency: all their crypto benchmarks measure throughput. Near the end they hint at response time for their overall cache system but there's no detailed discussion of latency issues.

The takeaway for me is that I'm OK with what's currently in Linux for the HDDs I use for my backups but I'd probably lose out if I encrypted my main SSD with LUKS.

At the end of the article they say that they're not going to upstream the patches as they are because they've only tested them with this one workload.

I'd also be interested to see a benchmark comparing SW AES with FPU-saving + HW AES. Unfortunately their post does not include stats on how often their proxy falls back to the HW or SW implementation. Whatever those numbers are, I'd expect FPU-saving + HW AES to land somewhere in the middle.


You can easily achieve more than 200 MB/s with HDDs in RAID, but the bottleneck might be altogether different — I think it is an important distinction.

While I applaud their wins, they basically profiled the wrong thing: they established the full overhead with disk speed/latency essentially removed, and only moved to the actual production workload at the very end. In the worst case, their improvements could have been for naught, but they were "lucky" (not really, they were smart, but profiles did not really guide them -- they just optimised the heck out of the system, and they could have been unlucky and gained nothing if the bottleneck had been somewhere unaffected by their code analysis).

It's great that Cloudflare allows this kind of engineering to happen (investigative, explorative, and not necessarily RoI focused), but it's rare to find a company that does.


> The takeaway for me is that I'm OK with what's currently in Linux for the HDDs I use for my backups but I'd probably lose out if I encrypted my main SSD with LUKS.

Yep, when building my latest workstation, I went with a pair of ("regular") SSDs (RAID1) for my data. Later, I decided to add an NVMe for the OS for the additional speed.

I then went and encrypted all of the drives (via LUKS), however, which basically killed any additional performance I would've gotten from the NVMe drive. I would have been just as well off with only the SSDs and without the NVMe drive.


I'm using LUKS on my SSDs. I never benchmarked them but they are fast enough that I don't care. I'm working with VMs right now, creating and destroying them with VirtualBox (automated). Kind of a local EC2. The disks are two Samsung EVOs, a 950 and a 960, 1 TB each. They're in a laptop from 2014 with SATA III at 6 Gb/s, so I guess I'm already capped by the interface and the encryption overhead doesn't matter.
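If I ever do want a quick sanity check, something along these lines should do it (the device name is a placeholder; the first command measures the kernel crypto code in memory on a single core, the second reads through the actual dm-crypt device with the page cache bypassed):

    # per-algorithm crypto throughput (in memory, single core)
    cryptsetup benchmark

    # sequential read through the dm-crypt device, bypassing the page cache
    dd if=/dev/mapper/cryptroot of=/dev/null bs=1M count=4096 iflag=direct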


They talk about throughput, but in practice their testing regimen is actually testing latency. Dm-crypt's performance ceiling is pretty high if you consider throughput rather than latency, and I would expect the trade-offs made to decrease latency to reduce maximum throughput at least slightly (although I have not tested their patch).


At least your first question is answered in the article: Yes


> Many companies, however, don't encrypt their disks, because they fear the potential performance penalty caused by encryption overhead.

There is also the overhead of automatically unlocking a remote server during an unattended reboot. Reading the encryption password from a USB stick or fetching it over the internet is a no from me. I think there are solutions that store the password in RAM or in an unencrypted partition, but that's the overhead I'm talking about. I wonder how companies deal with that.


Red Hat's solution to this problem is NBDE.

> The Network-Bound Disk Encryption (NBDE) allows the user to encrypt root volumes of hard drives on physical and virtual machines without requiring to manually enter a password when systems are restarted. [0]

[0]: https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...
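In practice that means running a tang server on the network and binding the LUKS volume to it with clevis; roughly like this (hostnames and device names are placeholders, and package/unit names can vary by distro):

    # on the key server
    systemctl enable --now tangd.socket

    # on the encrypted machine: add a clevis/tang keyslot to the LUKS volume
    clevis luks bind -d /dev/sda2 tang '{"url": "http://tang.example.com"}'

With the clevis initramfs/dracut hooks installed, the machine fetches the key from tang at boot and unlocks the volume unattended, as long as it can reach the server.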


Isn't this what TPMs are designed for? I think both Intel and AMD motherboards have them built in, using the security processor in the CPU.


I use kexec for reboots and store the keys for the disks inside an initramfs which is itself stored on an encrypted boot partition. When I do a cold boot, these systems boot into a recovery-like OS so I can fix stuff when needed, but mainly to do a kexec from there (it's not perfect, but what is). If it's possible to avoid this (i.e. I have physical access), I can decrypt the initramfs directly from GRUB using a passphrase entered locally.

A warm reboot using kexec does not need any intervention from my side and boots directly into the already decrypted initramfs with the key already present, and is thus able to mount the encrypted volumes, including the root volume.
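The warm-reboot step itself is just the standard kexec two-step, roughly like this (paths are placeholders for whatever kernel/initramfs the encrypted boot partition carries; in practice you'd trigger the jump via an orderly shutdown such as systemctl kexec so filesystems get synced first):

    # stage the new kernel plus the initramfs that carries the disk keys
    kexec -l /boot/vmlinuz --initrd=/boot/initrd.img --reuse-cmdline

    # jump straight into it, skipping firmware and the cold-boot path
    kexec -e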


Debian offers a dropbear shell in initramfs which you can use to SSH in and provide keys. I only have a handful of servers so currently I do this manually on a reboot but it would not be difficult to automate using for example SSH keys unlocking key material. The downside of this is your initramfs and kernel are on an unencrypted disk so a physical attacker could feasibly backdoor them. I'm sure there's some secure boot UEFI / TPM solution here.


You are missing an integrity-checking step. You can do it by sending some sort of ephemeral binary over SSH that does the integrity check and requests a key using the resulting hash of the check to proceed; don't blindly trust an sshd running from an unencrypted partition. But still, at the end of the day it's all obscurity and obfuscation; you can't make it provably secure. You can go far and make that binary one-time, randomly generated, obfuscated, bound in its running time, you can use a TPM and whatnot, but it probably won't matter for pretty much any realistic threat model.


Has anyone already tried to compile the kernel with these patches for their desktop/laptop with encrypted drive? https://github.com/cloudflare/linux/tree/master/patches


Yes, I'm running them on kernel 5.5.13 (which came out today)


Wow, I'd be so happy if you could share the steps you took to achieve this. Let's say I have a Debian machine, how could I try it out?


I use Gentoo Linux so it was as easy as putting the patches in a directory, and then rebuilding the kernel with the option to enable the synchronous cipher.

For Debian it's almost as easy: https://passthroughpo.st/patch-kernel-debian/
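If you'd rather do it by hand, the usual Debian-ish flow is roughly this (kernel version and paths are placeholders; the patches come from the cloudflare/linux repo linked upthread):

    # grab the patches and a matching kernel source tree
    git clone https://github.com/cloudflare/linux.git
    cd linux-5.5.13            # your unpacked kernel source

    # reuse the running kernel's config, apply the patches in order, build Debian packages
    cp /boot/config-"$(uname -r)" .config
    for p in ../linux/patches/*; do patch -p1 < "$p"; done
    make olddefconfig
    make -j"$(nproc)" bindeb-pkg

Then install the resulting linux-image .deb and reboot into it.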


Interesting. One other thing they don't mention, which I found interesting when doing my own digging on dm-crypt speeds a while back, is that the 'cryptsetup benchmark' command only shows the single-core performance of each of those encryption algorithms. You can verify this by watching processor load as it performs the benchmark. That led me to find that if you have Linux software RAID, you can get much better performance by having one dm-crypt volume per disk and then software-RAIDing the dm devices, instead of putting a single dm-crypt on top of the software RAID. Curious if that would stack, performance-wise, with what they found here, or if it just happened to help with the queuing issue they identified.
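For reference, the layering described above looks roughly like this (device names, RAID level and filesystem are just placeholders):

    # one dm-crypt volume per physical disk
    cryptsetup luksFormat /dev/sda
    cryptsetup luksFormat /dev/sdb
    cryptsetup open /dev/sda crypt0
    cryptsetup open /dev/sdb crypt1

    # assemble the RAID from the dm devices rather than the raw disks
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/mapper/crypt0 /dev/mapper/crypt1
    mkfs.ext4 /dev/md0

Each dm-crypt device then gets its own encryption workqueues, which is presumably where the extra parallelism comes from.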


I remember that efforts to parallelize dm-crypt's work, where applicable, were merged somewhat recently. However, I guess having multiple separate sets of encryption parameters and state (read: disks) leaves more opportunity for parallelization, especially if disk access patterns are not spread wide enough.


>Being desperate we decided to seek support from the Internet and posted our findings to the dm-crypt mailing list

When I see a company such as CloudFlare being so transparent about their difficulties, and trying to find an answer using their community members, it makes me love them even more.

No ego, pure professionalism.


Correspondingly, the response they received reflects just as strongly on the community itself.


Yep, the response they received was incredibly condescending. The follow-up from Cloudflare remained polite and added a lot more data, and was ignored.

It's a shame because I've seen this condescending attitude quite frequently in the crypto open source community, and am not really sure how it arises. At least in this case it seems to have had the good outcome of motivating Cloudflare to dig in deeper and solve the problem by themselves.


I'd say it comes from sitting at an ivory tower and giving 0 ducks about who you're talking to. In this case, the person/team asking was capable enough to go on their own, dig, test, change, and find a fix. It could've probably been easier for them if given proper directions.

OTOH, perhaps those that answered from atop the tower had little idea of the mechanisms that the author(s?) dug out and changed. So double shame on them, for being condescending and not knowing.

And it also falls on ourselves to be mindful of this behaviour, that can creep up on us without knowing. We sometimes think our time is super valuable and we don't have to spend it on some "newbie question" or this guy who doesn't understand. The past year I've been mentoring grad students in the lab I work at, and found myself once or twice going this route. I luckily caught it early, took a deep breath and gave them the time and explanations they needed. In the end I got a few nice surprises out of two amazing students, who were seeing a bit beyond what was evident.


I see many large projects addressing this issue by not having a public list where you can talk directly to the developers. Instead there are user lists maintained by public-relations managers, where the expected result from the original mail would be either no response, or a polite, nicely written one about testing the user configuration options they had already tested. That way the developer would not need to respond unless the user has shown enough proof of work to convince the public-relations managers that the issue should be forwarded to a developer, in which case the developer would reply under the assumption that the person/team asking is capable enough to use the directions to dig, test, change, and create a patch, which might later be added to the project.


Full response: https://www.spinics.net/lists/dm-crypt/msg07517.html

From the PoV of the person who responded, Cloudflare didn't provide any relevant information that would indicate what platform they run, or what speed they expect, or why 800MiB/s seems slow to them. On many platforms this would be a pretty good result. At first glance, it looks like they expected the speed of unencrypted storage, because that's what they tried to compare against.

So the response seems reasonable at first glance to me. They got the answer to their main question. (which they omitted from their blog article)


I disagree with your assessment.

> If the numbers disturb you, then this is from lack of understanding on your side.

This is arrogance on the part of the person replying, who hand-waved away the problem with "you just don't understand" when in fact they (Cloudflare) did understand. They then went on to prove that it was due to queuing within the kernel, not the hardware, as this person had claimed in their flippant reply.


Cloudflare did not understand at the time. Anyway, I'm not questioning that the reply was not very helpful, I just don't see it as unreasonable. I liked the technical parts of the CF writeup overall.


The next line after that is "You are probably unaware that encryption is a heavy-weight operation". If you're posting to the dm-crypt mailing list about encryption performance, you're probably very aware that encryption is a heavy-weight operation.


>Cloudflare did not understand at the time

And neither did the people who responded to them it seems.


I'm not sure why you'd try defending such behavior other than you spent as little time reading the original email as the person who responded in the thread. They clearly state at the bottom their in-memory results - while they don't give EXACT hardware, it's more than enough to determine there is a major bottleneck in the encryption engine. To claim "encryption is heavy" is also a poor response - either the poster has no concept of the overhead of encryption with CPU offload or was just too lazy to put together a helpful response. Either way - no response would've been better than that.


So how do you get from 4.5GiB/s without encryption vs 850MiB/s with encryption, with no other information, to an understanding of whether there's a bottleneck in some unspecified encryption engine (with unknown throughput and setup latency)?


I dunno, if one of my coworkers had put in the work that OP did, showed me the results, and asked my thoughts: if I didn't have enough information I'd ask for more. If you're telling me you don't have enough information to assume some basics about the setup (I'd look at that and assume it was a modern Intel or AMD CPU just based on the throughput) - then how does he have enough information to dismiss the findings as "then this is from lack of understanding on your side."

You can't have your cake and eat it too.


Makes sense.


> or what speed they expect

"Without LUKS we are getting 450MB/s write, with LUKS we are twice as low at 225MB.s"


Honestly, read the original message and consider how you would have replied.

They showed work in a vacuum - demonstrated that dm-crypt has costs over raw device access (I would hope so!) on some unknown hardware, and then asks "does this look right to you?"

Well, yeah, that looks like it looks elsewhere, and by the way, there's a built-in command that also would have told you this.

People whine about technical mailing lists, I think because they don't get the context. Think of them as sort of like water coolers at an office that specializes in whatever the list is about. You get a short slice of expert attention in between doing whatever they actually have to get done.

Throwing a bunch of data on the floor and saying "hey, is this expected?" is not going to work well. Seriously, what were they expecting?


> Well, yeah, that looks like it looks elsewhere, and by the way, there's a built-in command that also would have told you this.

It's entirely possible to say both these things in a much more constructive and less condescending tone than was used.


People are so very weirdly sensitive to these things when it is a big company that comes calling. Wonder why that is.

Context matters. If you don't take the time to understand the context you're walking in to and don't follow local rules, don't be surprised if people are rude to you back. Not that I even think what they said was all that rude.

Do you also think you can slide in to a gam3r chat and expect business etiquette?


I don't think a crypto mailing list is the same as "gam3r" subreddit, but it wasn't that rude overall.

It was the tone that they "don't understand" when in fact Cloudflare understands crypto and performance very well, and went so far as to dive into kernel code and submit patches that fixed a problem others didn't even realize existed. Even so, I agree this isn't worth such a big discussion.


> People are so very weirdly sensitive to these things when it is a big company that comes calling.

They're not sensitive when it's a "big company", they're sensitive when they're trying to get work done and they receive a flippant response.


It's a public mailing list, I see zero upstream kernel commits from arno@* so it doesn't appear the response came from someone who actually knows and works with the dm-crypt code.

I'm on a number of public mailing lists and there's often a participant who tends to be both available/communicative and callous in their communication style. My assumption is there's a filter effect going on here where some folks who have very poor social abilities wind up at their computer alone all the time and public mailing lists become part of their few remaining human interactions.

What I'd take away from this particular dm-crypt interaction isn't that the community is assholes, but that the community is small and the mailing list poorly attended/inactive.

In the past I've reported my own dm-crypt problems upstream and it took years to get a bisected regression reverted. Just getting relevant people to pay attention was a challenge.


And according to Cloudflare, the current dm-crypt implementation is horribly bit-rotted code that has gone without review for 15 years.


Especially because my old Haswell-E workstation running FreeBSD has no problem maxing out four encrypted SATA SSDs (>500MB/s each) at the same time with AES-NI. There is no excuse for slow cipher implementations, and the queuing sounds insane; saving and restoring the SSE registers can't be expensive enough to justify all those context switches between kernel threads.


You won't know about their ego or professionalism until you work with them. Posting on a mailing list and making a blog post about it is not proof of either of these, it's brand marketing. They're trumpeting their engineering talent to build good will/nerd rep so people will love their company, spend money there, and apply for jobs. (But what it does show is that they're good at marketing, because it's working)


Writing informative blog posts and publishing patches to the Linux kernel is the rare kind of marketing I can 100% support.


It is also marketing for sure, but it is well done marketing.

It's like having a high PageRank in Google because you actually write meaningful, useful, well-written blog posts which Google happens (happened) to value vs link factory blog posts.


Sounds like they're also pretty good at making disk encryption faster.


> Unlike file system level encryption it encrypts all data on the disk including file metadata and even free space.

Anyone have a source on how full disk aka block-level encryption encrypts free space? The only way I can imagine this could happen is by overwriting the entire disk initially with random data, so that you can't distinguish between encrypted data and true "free space", i.e. on a brand new clean disk. Then, when a file (which, when written, would have been encrypted) is deleted (which by any conventional meaning of the word 'deleted' means the encrypted data is still present, but unallocated, thus indistinguishable from the random data in step 1), does it get overwritten again with random data?

I would argue that overwriting an encrypted file with random data isn't really encrypting free space, but rather just overwriting the data, which already appeared random/encrypted. It is hardly any different to having a cleartext disk and overwriting deleted files with zeros, making them indistinguishable from actual free space.


The point of encrypting free space is just so you can't say how full the drive is.

This way, an attacker can't focus cracking on the fullest disk, match stolen backup disks to hosts based on non-sensitive health metrics, etc.

>The only way I can imagine this could happen is by overwriting the entire disk initially with random data

Traditionally, for speed, you'd write all zeroes to the encrypted volume (causing the physical volume to appear random), but yes

>Then, when a file (which, when written, would have been encrypted) is deleted

You'd just leave it. Crucially, you don't TRIM it.

>I would argue that overwriting an encrypted file with random data isn't really encrypting free space

Yup, that's why it's not done


Debian does the first thing you discussed if you create an encrypted partition in the installer - it writes 0s through the crypto layer to fill the entire disk with encrypted data.
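Done by hand, that step is just writing zeroes through the mapping before creating the filesystem, roughly like this (the device name is a placeholder; the dd stops with a "no space left on device" error when it hits the end of the volume, which is expected):

    cryptsetup luksFormat /dev/sdX
    cryptsetup open /dev/sdX cryptdata
    dd if=/dev/zero of=/dev/mapper/cryptdata bs=1M status=progress
    mkfs.ext4 /dev/mapper/cryptdata

The zeroes come out the other side as ciphertext, so the raw disk ends up looking uniformly random without the host having to generate gigabytes of random data.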


> Data encryption at rest is a must-have for any modern Internet company

What is it protecting against — data recovery from discarded old disks? Very stupid criminals breaking into the datacenter, powering servers off and stealing disks?

A breach in some web app would give the attacker access to a live system that has the encrypted disks already mounted…


As we push further and further to the edge — closer and closer to every Internet user — the risk of a machine just walking away becomes higher and higher. As a result, we aim to build our servers with a similar mindset to how Apple builds iPhones — how can we ensure that secrets remain safe even if someone has physical access to the machines themselves. Ignat's work here is critical to us continuing to build our network to the furthest corners of the Internet. Stay tuned for more posts on how we use Trusted and Secure Boot, TPMs, signed packages, and much more to give us confidence to continue to expand Cloudflare's network.


> criminals breaking into the datacenter, powering servers off and stealing disks?

Yes, exactly. A company I worked for had a hard drive pulled from a running server in a (third-party) data center that contained their game server binaries. Shortly afterwards a pirate company set up a business running "gray shards", with - no surprise - lower prices.


Being able to purge old disks confidently, in a secure manner, is an upside huge enough to make this statement true in my opinion. There have been numerous incidents, even involving companies specializing in securely purging disks. If your data is encrypted there is basically nothing to do; you could even outright sell the disks from your DC or something. Just delete the keys/headers from the disk and you are safe.

It's also not possible to get data injected offline into your filesystem without having the keys. Without encryption you could just take the targeted server's disk, mount it somewhere, and plant your implants or whatever you have. When the server sees the disk come back up it just looks like a hiccup or something.


> It's also not possible to get data injected offline into your filesystem without having the keys.

This is, in theory, possible against volumes encrypted using AES-XTS (which seems to be how the majority of FDE systems work), as the ciphertext is indeed malleable.


I am no expert on this, but I was thinking it is only possible to inject noise, which will likely corrupt the filesystem in the process. Copying/moving valid blocks should be prevented by XTS as far as I understood (which might not be that much). I guess using a filesystem with integrity checks helps a bit, although it's still not authenticated or anything.


There's some more details/links here (I'm also not an expert): https://en.wikipedia.org/wiki/Disk_encryption_theory#XTS_wea...


I think it's mostly against a breach of datacenter security. Most competent companies already have policies on how to deal with discarded old disks. The ones that don't might not be competent enough to use encryption at rest either.

It's all about layers of defenses.


Encrypted data at rest allows you to do an instant erase of the device.


Yes, the former. You can’t just put SSDs through a degausser!


On top of what others have said, it protects, for example, against governments of all the countries you have servers in, and their law enforcement, coming in and taking the servers, extracting keys for MITM, installing malware and backdoors, placing some child porn on the servers, etc., and against staff from the various companies in various countries that maintain, deploy, or just have access to the infrastructure doing similar nasty things, and so on.


> one can only encrypt the whole disk with a single key

You can still use partitions.

> not all cryptographic algorithms can be used as the block layer doesn't have a high-level overview of the data anymore

I do not really understand this. Which cryptographic algorithms can't be used?

> Most common algorithms require some sort of block chaining to be secure

Nowadays I would say that, of these, only CTR is common, and it does not require chaining.

> Application and file system level encryption are usually the preferred choice for client systems because of the flexibility

One big issue with "Application and file system level encryption" is that you often end up leaking metadata (such as the date edited, file name, file size, etc).

Regardless I think that this is a really nice article. I can't wait to try their patches on my laptop.


> Which cryptographic algorithms can't be used?

You can't use any algorithm that requires O(n) IVs (e.g. a separate IV per disk sector), because there's nowhere to store the IVs. (Another consequence of this is that you can't store checksums anywhere, so you can't provide integrity checks.)

You can't use CTR mode either, because you'll end up reusing counter values. What do you do when you need to overwrite a block with new data?

XTS mode solves this, at least partially. Each 16-byte block is encrypted independently (so there is no chaining and nothing extra to store), but with a "tweak" derived from the sector number and the block's position within the sector, so the same plaintext stored at different locations produces different ciphertext.

This isn't perfect, though, because it's still deterministic. If an attacker can see multiple states of the disk, they can tell when you revert a block to a previous state. But it's much better than other modes, especially since the main threat you want to protect against is your laptop getting stolen (in which case the attacker only sees a single state).
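For reference, the XTS construction per 16-byte block within a sector is roughly this (K1 and K2 are the two halves of the XTS key, i is the sector number, j the block index within the sector, and the multiplication is in GF(2^128)):

    T   = E_K2(i) * alpha^j
    C_j = E_K1(P_j xor T) xor T

So the tweak T depends only on the position on disk, which is why identical data written twice to the same place encrypts identically.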


> You can't use any algorithm that requires O(n) IVs (e.g. a separate IV per disk sector), because there's nowhere to store the IVs

Certainly you can. You just have to reduce the effective sector size that the file system can use.

> What do you do when you need to overwrite a block with new data?

You generate a new random nonce (as per XChacha) and you store it in the sector.


> Certainly you can. You just have to reduce the effective sector size that the file system can use.

Get back to me when you find a high-performance (FAT doesn't count) Linux filesystem that supports a sector size of 496 bytes.


Modern disks use much bigger sectors. See https://en.wikipedia.org/wiki/Advanced_Format


The issue is non-power-of-2 sector sizes. The kernel computes sector numbers with shifts, not division (which would be slow).


I do not see how you would need to use divisions in that case.

But even if that was the case, you could just pretend to the OS that you have 7 sectors of 512 bytes each rather than a single sector of 4032 bytes. (or if that was not possible you could just take the hit)


You need division to go from a file offset in bytes to a sector number, hence the need for power-of-2 sizes to make this fast. The kernel assumes in multiple places that sectors are a power of 2 for this reason - it doesn't rely on the compiler to optimize it (which may not even be possible for some of the compilers it works with).

If you are talking about using reserved sectors for bookkeeping at the end of the disk, that is possible and commonly done.


> I do not really understand this. Which cryptographic algorithms can't be used?

CBC - which is one of the most common stream cipher algorithms.

It's not clear to me whether GCM would work or not.


GCM requires somewhere to put the nonces and authentication tags. In principle, you could use a layer of indirection not entirely unlike a page table to store that information. For example, a 64-bit nonce, 64-bit block pointer, and 128-bit authentication tag could pack together in a radix tree for the job, retiring 7 bits of the virtual-to-physical mapping per level for 4 kB blocks.

Of course, the downside is that now the block layer must tackle all of the write ordering issues that a filesystem does when updating the tree. The block layer would find itself greenspunning up a filesystem inside itself, even if it was a filesystem of only one file.


The 128-bit tag length, which offers less than 128-bit strength depending on the nonce size, makes GCM and similar AEAD constructions poorly suited for archival storage. If you want to store more data without rekeying you need to reduce the authentication security. GCM makes perfect sense for ephemeral, message-based network traffic. Traditional, separate, keyed MACs still seem preferable for archival storage, especially with tree-based modes--native as with BLAKE3 or KangarooTwelve, or constructed like SHA-3-based ParallelHash.


The tag's strength doesn't depend on the nonce size in cases where you can use sequential nonces. Longer nonce sizes are valuable only when using randomly allocated nonces and you need to avoid the birthday paradox. 64 bits is considerably longer than the total write lifetime of modern disks. Even if you used a nonce per 512-byte block, you'd need well over a yottabyte of writes to roll through that counter.

The profile that authenticated encryption defends against is an attacker who is attempting to feed the victim specially crafted bad blocks. 128-bit tags are good enough that the disk will be completely trashed long before the victim executes something of the attacker's choosing.


Apparently CBC is used by cryptsetup by default, see https://linux.die.net/man/8/cryptsetup

It might not be ideal but it still can be used. Though, I would not call CBC common at all. Pretty much everyone has switched to CTR or some variant of it (such as GCM).

Also, CBC is not a stream cipher algorithm.


That online manpage is quite outdated; I recommend man7.org, which gets regularly auto-updated: http://man7.org/linux/man-pages/man8/cryptsetup.8.html . The current default for LUKS is XTS.


> One big issue with "Application and file system level encryption" is that you often end up leaking metadata (such as the date edited, file name, file size, etc).

I wonder how cryfs stacks up in this regard.

https://www.cryfs.org



That response from the dm-crypt mailing list is unreal.


Offtopic, but why am I getting two scrollbars on this website? This is weird.


Hi, I work on the Cloudflare Blog, we're working on deploying a fix now.


There is a scrollable div, the one that leads with:

grep -A 11 'xts(aes)' /proc/crypto

Is that what you mean?


I can confirm: they broke scrolling with a CSS overflow-x setting on #main-body, which for me also shows two scrollbars.


No, I'm literally getting two scroll bars for the entire page. The first scrollbar works, the second scrollbar is disabled. The scrollable div is a third scrollbar, but that's OK. It looks like this: https://i.imgur.com/Rs8a7m5.png


Does anyone know what the picture is like on FreeBSD? Is it faster?


Does CloudFlare plan to get their kernel patches merged upstream?


Second to last paragraph:

> We are going to submit this work for inclusion in the main kernel source tree, but most likely not in its current form. Although the results look encouraging we have to remember that Linux is a highly portable operating system: it runs on powerful servers as well as small resource constrained IoT devices and on many other CPU architectures as well. The current version of the patches just optimises disk encryption for a particular workload on a particular architecture, but Linux needs a solution which runs smoothly everywhere.


I missed that, thank you.


Any chance of this patch making it to the mainline kernel?


Not this one, specifically, but they've mentioned that they're working on upstreaming some derivative patches.


Neat. Poorly optimized queues can have a significant impact on performance, doubling throughput for disk encryption with some queue tweaks is pretty significant.


[flagged]


The article discusses this in the conclusion:

> We are going to submit this work for inclusion in the main kernel source tree, but most likely not in its current form. Although the results look encouraging we have to remember that Linux is a highly portable operating system: it runs on powerful servers as well as small resource constrained IoT devices and on many other CPU architectures as well. The current version of the patches just optimises disk encryption for a particular workload on a particular architecture, but Linux needs a solution which runs smoothly everywhere.

That is, they think their current patch is too specialized for their own use-case to warrant inclusion in the mainline kernel without significant adaptation.


>but there doesn't appear to have been any serious effort to coordinate with other Linux contributors to figure out a solution to the problem.

Well when they reached out to the community they were told they're idiots and should f* off in only somewhat nicer language. Then they were simply ignored.

When your community is toxic don't complain that people don't want to be part of it.


They are submitting their work, after they put in even more work to make it more universally applicable to all Linux users. They also did try to engage with the community who basically told them that they didn't know how fast crypto should be.


All this seems to me a series of very strong arguments for doing the crypto in your application.


That would be even slower and more complex.


Why? The slowness in this article comes from architectural brain damage inside the kernel. Doing the encryption and IO on your threads, when and where you choose to do it, is the solution. As your performance requirements increase, you are less and less likely to want kernel features.



