PostgreSQL Encryption: The Available Options (hezmatt.org)
187 points by JNRowe on Nov 7, 2023 | 65 comments


“So, you can encrypt your users’ dates of birth, in such a way that anyone with the appropriate keys can query the database to return, say, all users over the age of 18, but an attacker just sees unintelligible gibberish.”

Dangerous oversimplification of Order Preserving Encryption (OPE). In structured data like dates of birth, an attacker can infer age ranges, age distributions, and possibly even individual records.

There’s nuance to it of course. But for an article that started out with nuance, the details of the solution and its tradeoffs left me wanting.

The paper Enquo bases itself on could also do a better job of describing what’s missing. Here’s the link to that, https://eprint.iacr.org/2016/612.pdf


If you can encrypt or generate queries, then you can decrypt any order-preserving encryption scheme in log_2(n) probes, regardless of the underlying data distribution.

n is the number of possible values. If you want to decrypt all-caps strings, you can decrypt the first character with log_2(26) queries, then the second with another log_2(26) queries, and so on.
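
To make the probe count concrete, here is a minimal sketch of that binary search (Go). encryptOPE is a toy stand-in for whatever scheme is deployed; the attack relies only on the order-preserving property, not on the scheme's internals:

    package main

    import "fmt"

    // Toy stand-in for an OPE encryption oracle the attacker can query with
    // chosen values; NOT real crypto, it only needs to preserve order.
    func encryptOPE(v uint64) uint64 { return 3*v + 7 }

    // recoverPlaintext finds the value behind target in ~log2(hi-lo) probes.
    func recoverPlaintext(target, lo, hi uint64) uint64 {
        for lo < hi {
            mid := lo + (hi-lo)/2
            if encryptOPE(mid) < target {
                lo = mid + 1
            } else {
                hi = mid
            }
        }
        return lo
    }

    func main() {
        ct := encryptOPE(19910304) // e.g. a date of birth encoded as YYYYMMDD
        fmt.Println(recoverPlaintext(ct, 0, 99999999)) // ~27 probes, not 10^8 guesses
    }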

If you can only observe the query stream or the disk reads, then you end up paying a constant factor more than that, but not a very large one.

Even if the attacker only has access to data at rest, if you have multiple OPE columns in a table, or if the database has foreign key relationships that link OPE data, you end up leaking all sorts of information about the underlying database.

It's hard to say anything polite about these schemes or the vendors that push them.


As with any secure system, it depends on your threat model, but the early versions of OPE are woefully insecure. The attack you're describing sounds like the one on the Boldyreva scheme from 2009. While all OPE schemes leak some information, the more recent schemes are much more secure and often quite appropriate for many use cases (arguably, at least).

ORE has different security properties, particularly Lewi-Wu 2016. There are also now hybrid schemes like the EncodeORE scheme of 2020 which is both more efficient and more secure than previous OPE and practical ORE schemes.


Precisely!!! Even ORE, a supposed improvement on OPE, is flawed. http://users.cs.fiu.edu/~mjura011/documents/Jurado_2021_CSF%...


Note that this paper is about the CLWW scheme of 2015. The Lewi-Wu Scheme of 2016 (Block Order Revealing Encryption) has entirely different security properties. Storage of the right ciphertexts (as they describe in the paper) is semantically secure.


The 2021 paper I posted does focus on the 2015 CLWW scheme, but it also buckets Lewi-Wu as non-ideal. The Lewi-Wu paper is reference [17].


Thanks for posting, very interesting project!

I implemented something similar at the company I work with:

1. All private data is encrypted/decrypted client-side using envelope encryption: a random AES key is generated to encrypt the data, and then we use KMS (which is basically the same service in both AWS and GCP) to encrypt the AES key with a public key. Thus, storing data requires no KMS API call, but decryption does. But we basically have per-user AES keys, and since our data access patterns are largely per-user, it means we can cache the decrypted key (see the sketch after this list).

2. We use blind indexes for lookups.

3. We then hit the issue of needing to order on encrypted data, and we used one of the order-preserving encryption algos, but your explanation was super helpful to me! I'll definitely look more into the Lewi-Wu scheme.
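
A minimal sketch of the flow in step 1 (Go). kmsWrap here is a hypothetical stand-in for encrypting the data key with the KMS public key, which happens locally, so the write path needs no API call:

    package main

    import (
        "crypto/aes"
        "crypto/cipher"
        "crypto/rand"
        "fmt"
    )

    // kmsWrap is a hypothetical stand-in for wrapping the data key with the
    // KMS public key; only unwrapping later goes through KMS and is audited.
    func kmsWrap(dataKey []byte) ([]byte, error) { return dataKey, nil } // placeholder

    // newUserDataKey generates a per-user AES-256 data key and its wrapped form.
    func newUserDataKey() (dataKey, wrapped []byte, err error) {
        dataKey = make([]byte, 32)
        if _, err = rand.Read(dataKey); err != nil {
            return nil, nil, err
        }
        wrapped, err = kmsWrap(dataKey)
        return dataKey, wrapped, err
    }

    // encryptField seals one value under the user's data key and returns
    // nonce||ciphertext for storage in the column.
    func encryptField(dataKey, plaintext []byte) ([]byte, error) {
        block, err := aes.NewCipher(dataKey)
        if err != nil {
            return nil, err
        }
        gcm, err := cipher.NewGCM(block)
        if err != nil {
            return nil, err
        }
        nonce := make([]byte, gcm.NonceSize()) // fresh nonce per value
        if _, err := rand.Read(nonce); err != nil {
            return nil, err
        }
        return append(nonce, gcm.Seal(nil, nonce, plaintext, nil)...), nil
    }

    func main() {
        key, wrapped, _ := newUserDataKey()
        stored, _ := encryptField(key, []byte("1990-03-04"))
        fmt.Printf("wrapped key: %d bytes, stored value: %d bytes\n", len(wrapped), len(stored))
    }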

All in all a very cool project I look forward to digging in more. I haven't looked yet at how you store/retrieve the encryption keys but integration with a service like KMS would be great. For us it is ideal because all calls to KMS are auditable so we can audit exactly when data was decrypted.


The only problem with KMS in this model is that to get that auditability you need a data key per value/record. That means every decryption requires a request to KMS, as it does not (and likely won't ever) support batched requests. We tried this for ages and the performance was terrible: queries that took under 100ms blew out to 3 or 4 seconds.


Only if you need the auditability at that granular a level.

For us, each user has (for the most part) their own data key, and most of the time a user is accessing their own data. So we can decrypt the key once and then cache it for the rest of the user's session. This tells us "the user accessed their private data", so we don't get the per-value auditability, but for us that was sufficient. If you want, you could even have different data keys based on sensitivity, e.g. a user's name, phone, and address are encrypted with one data key but their SSN or credit info is encrypted with another.
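
A rough sketch of that caching pattern (Go). kmsDecrypt is a hypothetical stand-in for the audited KMS Decrypt call, and the 15-minute TTL is an arbitrary choice:

    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    // kmsDecrypt is a hypothetical stand-in for the cloud KMS call that
    // unwraps the per-user data key (this is the audited operation).
    func kmsDecrypt(wrapped []byte) ([]byte, error) { return wrapped, nil } // placeholder

    type cachedKey struct {
        key     []byte
        expires time.Time
    }

    type keyCache struct {
        mu   sync.Mutex
        keys map[string]cachedKey
    }

    // dataKey returns the user's decrypted data key, calling KMS at most
    // once per session-length TTL.
    func (c *keyCache) dataKey(userID string, wrapped []byte) ([]byte, error) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if e, ok := c.keys[userID]; ok && time.Now().Before(e.expires) {
            return e.key, nil
        }
        key, err := kmsDecrypt(wrapped) // one auditable "user accessed their data" event
        if err != nil {
            return nil, err
        }
        c.keys[userID] = cachedKey{key: key, expires: time.Now().Add(15 * time.Minute)}
        return key, nil
    }

    func main() {
        c := &keyCache{keys: map[string]cachedKey{}}
        k, _ := c.dataKey("user-123", []byte("wrapped-edk"))
        fmt.Println(len(k))
    }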


That's true except that if that session key is lost or exfiltrated, the scope of the breach is everything that key was used to encrypt (all of the user's data in your example).

The other consideration is how to safely cache the data key? What if the cache is popped?


Key management and ability to support SQL/searching are the 2 biggest considerations.

`pg_enquo` uses Block ORE which is reasonably secure but results in very large (like 100x) ciphertext sizes. For an alternative (also written in Rust) check out https://ore.rs. It will soon support variable block sizes for smaller encrypted values.

If you want to do partial text queries or LIKE, you'll need a Searchable Symmetric Encryption (SSE) or Structured Encryption (STE) scheme. There are literally dozens of these schemes out there, each with their own tradeoffs so it can be hard to choose (Seny Kamara alone has published several: https://cs.brown.edu/people/seny/papers/).

Amazon KMS (and Google/Azure equivalents) all require a network request per encryption unless you cache/reuse keys. To put that into perspective, 1 query with 3 fields encrypted and 100 rows returned would result in 300 separate network requests.

You can use data-key caching to reuse a data key for many records to improve encryption performance. However, decryption performance tends not to improve much, because data keys likely won't be used uniformly across your data set. Not to mention that you lose the ability to apply controls to records based on the data key.

At CipherStash, we created Tandem (https://cipherstash.com/products/tandem) which uses a revised version of ORE, STE and fast key (bulk-ops) management to encrypt columns of your choosing. The core encryption is AES-256-GCM and the whole thing is written in Rust. It runs as a Docker container or standalone binary. We are working on WASM support as well as a separate Rust SDK. Most SQL queries "just work" and performance overhead is tiny (< 10ms per request).

Tandem is in preview and will be generally available at the end of November.

For some other gotchas when doing encryption in Postgres, I did a talk at Linux Conf last year (based on some ideas from the Paul Grubbs et al. paper of the same name): https://www.youtube.com/watch?v=JD8dtLjhmAM


I’ve never seen pg_enquo before but from first skim it looks excellent. Since it’s built with Rust it should be able to run on RDS (in theory, now that AWS supports pl/Rust).

Also worth checking out this excellent extension which is a wrapper around libsodium:

https://github.com/michelp/pgsodium

We use it at Supabase to provide our in-database “Vault”:

https://supabase.com/blog/supabase-vault


We use traditional envelope encryption on a highly normalized database. In the end, only a few sensitive columns on a few sensitive tables get encrypted.

It’s kind of like a sparse matrix of encrypted vs. plain data, and works great for our scenario.


This is similar to what my shop is doing. Any notes on how you manage keys?


For now, we only need the equality operator, so we use deterministic encryption for DEKs and handle KEKs in vaults. KEK rotation is kind of a pain, so we index EDEKs in order to improve our queries, which basically build dynamic UNION ALL or IN statements.

I'm sure it's not the most secure scheme in the world, but it makes retrieval fast, and most analytics can be worked out with dynamic query building, while making the DB a scrambled mess for those with partial access.

I guess you could call it “Security by insanity.”


It sounds like both of you do something very similar to what we do. Our data keys are themselves encrypted by AWS KMS. The data keys are decrypted and kept in memory on application startup. They are stored encrypted in S3.


Is something like AES useful for field encryption?

The problem is that the IV should be unique for each message/field. But that makes querying impossible or very slow.

If the IV is the same for all fields, some querying is possible, but that is not how AES is supposed to be used.

How bad is having a constant IV (for all fields) per column?


If you're using AES in GCM mode? Bad...like catastrophic. An attacker can reveal the key.

If you want to use a constant IV for deterministic (exact) lookups, make sure you use AES in SIV mode, which is resistant to IV reuse, or CBC mode with an HMAC tag. It's slower than GCM, unfortunately, but it's one of the few secure options when you want deterministic encryption.

This stuff is hard to get right and can bite you in subtle and unexpected ways.


> An attacker can reveal the key.

An attacker can reveal the keystream, but not the AES key. Still catastrophic.

And AES-SIV is a lot stronger than CBC with deterministic IV, since CBC reveals if two messages start with the same sequence of 16-byte blocks, while SIV only reveals if the messages are identical.

---

There is another interesting option: Create two columns, one using randomized authenticated encryption and one using an HMAC. Then you can use the HMAC column for equality lookups.
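
A minimal sketch of that two-column idea (Go; the key and the email_idx column name are illustrative):

    package main

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
    )

    // blindIndex is the value stored in a separate lookup column and used
    // in equality queries, e.g. WHERE email_idx = $1; indexKey must be
    // independent of the encryption key.
    func blindIndex(indexKey []byte, plaintext string) string {
        mac := hmac.New(sha256.New, indexKey)
        mac.Write([]byte(plaintext))
        return hex.EncodeToString(mac.Sum(nil))
    }

    func main() {
        fmt.Println(blindIndex([]byte("example-index-key"), "user@example.com"))
    }

The value column itself stays randomized (fresh nonce per row), so equal plaintexts only look equal in the index column; that equality leakage is the usual blind-index tradeoff.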


I was thinking of plain CBC mode without HMAC tag.

Is authentication needed for field encryption?

EDIT: I should add that this is for fields with unique values. So, constant key & IV (per column), but unique data in each field.


It's a good idea. CBC without verification is vulnerable as well. An attacker can modify the IV and the value will still decrypt. It's quite easy to change what plaintext will pop out the other side, and the client will be none the wiser.

Depends on what kinds of data you're encrypting, but if it's anything to do with money or health data, authenticity is a must.
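
A small demonstration of that malleability (Go). The key, IV and value are made up, the constant zero IV mirrors the scenario above, and it assumes the attacker can influence the IV the application decrypts with:

    package main

    import (
        "crypto/aes"
        "crypto/cipher"
        "fmt"
    )

    func main() {
        key := []byte("0123456789abcdef0123456789abcdef") // 32 bytes, illustrative only
        iv := make([]byte, aes.BlockSize)                  // constant zero IV, as discussed
        plaintext := []byte("amount=00000100;")            // exactly one 16-byte block

        block, _ := aes.NewCipher(key)
        ct := make([]byte, len(plaintext))
        cipher.NewCBCEncrypter(block, iv).CryptBlocks(ct, plaintext)

        // Without knowing the key, flip bits in the IV to turn "00000100"
        // into "00009100" in the decrypted output.
        tampered := make([]byte, aes.BlockSize)
        copy(tampered, iv)
        tampered[11] ^= '0' ^ '9'

        out := make([]byte, len(ct))
        cipher.NewCBCDecrypter(block, tampered).CryptBlocks(out, ct)
        fmt.Printf("%s\n", out) // prints "amount=00009100;" and nothing complains
    }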


So, when using CBC without verification, an attacker with access to the DB won't be able to see the original plaintext, but will be able to change the data?

But how can the attacker control what the plaintext will become if he doesn't have the key? Wouldn't he be limited to either a random value or a value from another field?

Since the IV is constant, it doesn't need to be stored in the DB and can be treated like a key. So an attacker (with access to the DB) can't change the IV for a server app reading from the DB.


An attacker who has write access to the database and gets feedback if a decryption was successful can still mount the standard padding oracle against CBC, because the first block acts as IV for the second block.


Thanks. So, AES-CBC + HMAC would prevent this, but AES-CBC does not.

How do the following methods compare when using constant IV?

AES-CBC + HMAC (encrypt then MAC) vs AES-GCM-SIV vs AES-SIV


Be careful how you combine these components: https://soatok.blog/2021/07/30/canonicalization-attacks-agai...

(The linked article talks about CTR + HMAC, but CBC + HMAC is also affected.)


If your IV is constant then you open yourself to chosen plaintext attacks. For example, if I can get your application to encrypt "hackernews" for me, I will just look for other rows with the same ciphertext prefix as my encrypted field and know they start with "hackernews" (this is going to be true up to the block size of AES).

One technique for using AES (assuming you're using AES in a secure mode) and still being able to search is to compute a corresponding hash digest based on a small predefined length of the plaintext. Then you can search on that and get back all the rows that start with that. The actual field may still not be what you're looking for but you've narrowed it down to a much smaller subset. Then you just decrypt the field from the subset of rows and return the ones that match. It's definitely inefficient and still has issues but an improvement over a constant IV for all rows.
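
A rough sketch of that prefix-digest approach (Go; the three-character prefix length and the key are arbitrary choices):

    package main

    import (
        "crypto/hmac"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "strings"
    )

    // prefixIndex hashes only the first n characters of the plaintext, so a
    // match narrows the search to a candidate set without revealing
    // whole-value equality; callers then decrypt and filter exactly.
    func prefixIndex(indexKey []byte, plaintext string, n int) string {
        p := strings.ToLower(plaintext)
        if len(p) > n {
            p = p[:n]
        }
        mac := hmac.New(sha256.New, indexKey)
        mac.Write([]byte(p))
        return hex.EncodeToString(mac.Sum(nil))
    }

    func main() {
        // Both values land in the same bucket; only decryption tells them apart.
        fmt.Println(prefixIndex([]byte("example-index-key"), "hackernews", 3))
        fmt.Println(prefixIndex([]byte("example-index-key"), "hacking", 3))
    }

Anything that lands in the same bucket still has to be decrypted and compared, which is the post-filtering step described above.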


Using a separate hash column for querying is a nice idea.

It seems to me that some types of data might allow a constant IV and still be secure.

From what I understand, it is important (in CBC mode) that the combination of key + IV + first block of plaintext is unique.

So, if key and IV are constant, but data is unique it is still secure. For example, social security number is unique and nine digits, which means it fits into 128-bit block. Using constant IV to encrypt SSN should be secure, right?

An email can also be used as a unique identifier. But the length of an email can exceed 16 bytes, so we don't have a guarantee that the first block of plaintext will be unique (two different emails can have the same first 16 bytes). So it's not secure to encrypt an email address with a constant IV.

But if we used a 16-byte hash of a unique email as the IV, there would be a very low chance of an (IV + first block) collision. Probably secure enough?


> From what I understand, it is important (in CBS mode) that the combination of key + IV + first block of plaintext is unique.

> So, if key and IV are constant, but data is unique it is still secure. For example, social security number is unique and nine digits, which means it fits into 128-bit block. Using constant IV to encrypt SSN should be secure, right?

I think you meant CBC mode? Yes, IV + key + first block = unique ciphertext. But if you make the IV constant, then any two values with the same first block will produce the same first ciphertext block, because all three inputs are now the same.

SSNs are unique(ish?), but the range is quite small. Do you allow your users to input their own data? If so, they can create a whole bunch of accounts that enumerate some range of SSNs and look for anyone with the same ciphertext for that field.


CBC mode yes. Thanks, I fixed it now.

The small range of unique values is a good argument. But wouldn't the hash field (used for lookup) expose the same information?

Is there any good online resource focused on field encryption techniques in databases?


> But wouldn't the hash field (used for lookup) expose the same information?

Yes absolutely which is why you pick a small substring as the input to the hash function.

For example, for the SSN field, you can use the first 2 digits as the input. Even if you get a digest match, the fields themselves may not actually be the exact same value. SSN, however, given its small range, is kind of hard to secure. You might be able to do something like the email address plus the SSN as the input to the digest function.

I don't have any online resources specific to this topic but I did find this book to be very accessible for someone like me who's an engineer:

https://www.amazon.com/gp/product/1593278268/ref=ppx_yo_dt_b...


Thanks, I'll take a look at it.

I find encrypting data in DB challenging, as you often still need to run queries on that data in a performant way.


What kind of data do you usually manually encrypt? Other than tokens and API keys, I have never gone further. Encrypting anything else would make the app much less efficient IMO.


I encrypted healthcare data at a previous job. It did indeed make the app much less efficient. Each health institute had their own private keys. I considered having key pairs per user; that would have been neater, but it was a bit too much work.

The indexes were a challenge. I was using an argon2d key derivation algorithm to hash the values used in some indexes. For example, if you want to get all the records with the value "toto" in a field, you derive "toto" (with some common salt), and then you can look up in the index all the documents that have the same derived value.

It did leak some information, and some values couldn't be indexed like this because that would leak too much. So sometimes we had to fetch all the documents from the database and filter at the application level. We also sometimes didn't encrypt the datetimes so we could do efficient queries on specific time ranges.

To be honest, I did that mostly for fun. I know that some people are content with the managed encryption from their favourite cloud provider.


Health data primarily.

You aim to encrypt the fields you can, without hampering usability too much.

Anything you need to be able to search for (name, ssn) to find patients, or filter on for reports, is generally plaintext.

For more sensitive things, such as "this patient has AIDS", you'll have to decide whether to encrypt it or do a massive select from the DB any time you need statistics on it. (Or better yet, encrypt it, but store an anonymized tracker elsewhere. This is less useful for curable diseases, though.)


Oh, I forgot about that, indeed. I guess the proposed solution in the article would not even be useful, since filtering on an encrypted boolean would not work and there is no way to "hide" this value.

Is it okay if everything is plaintext but the name? In that case you have a row of sensitive data without anything to link to the actual patient if it leaks.


Sometimes the fact that someone is a patient is sensitive. Eg an abortion clinic.


It depends on the use case, but for example in the EU a lot of PII needs to be encrypted for medical/pharmaceutical services. Of course health data needs to be encrypted, as well as financial data, etc.


* API keys

* Health data

* Financial data


Union membership status

Ethnicity

Sexual orientation

Political party affiliation


Usually not. AFAIK there are no special security risks or regulatory requirements/penalties like there are for the other data.


https://gdpr-info.eu/art-9-gdpr/

> Art. 9 GDPR Processing of special categories of personal data

> Processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation shall be prohibited.


It’s not very much related to encryption.


Whether something is considered a "special category of personal data" is related to how such data is handled.

Encryption is one tool for protecting data - it's not immediately obvious that encryption is always a useful tool.


Another approach is to have the DB run in a secure enclave (SGX, Nitro, etc.), with the data encrypted by the enclave process during disk I/O, and the querying clients also in their own enclaves with encrypted communication between the enclaves. While there are edge cases, this is a more general "treat the DB as a black box" approach.


If you're looking for the best way to take a container and run it with Nitro, I work on https://github.com/edgebitio/enclaver

Works great with Kubernetes as a DaemonSet or straight on a VM.

Like: protection and privacy for apps

Love: using Nitro attestation and provenance like SLSA


This is really slow unless you figure out how to encrypt large batches of rows. It's harder to do as a Postgres plug-in. Do you have an example today of enclaved/transparent database encryption?


Interesting.

I'm still building my automation platform product and looking to encrypt my data going into the DB.

In my case I’m using Go for the backend applications. Any suggestions there?

Also do you suggest I do per user level keys or organization/workspace level?

Working solo and there’s a lot I don’t know. Happy to learn, thanks!


Encryption of records should be tied to their location (table + column + row) in the DB, preferably using AEAD methods. This helps mitigate Confused Deputy attacks:

https://soatok.blog/2023/03/01/database-cryptography-fur-the...
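
A minimal sketch of that binding (Go): the table, column and primary key go in as AES-GCM associated data, so a ciphertext copied to another row or column fails to decrypt. Names and values are illustrative:

    package main

    import (
        "crypto/aes"
        "crypto/cipher"
        "crypto/rand"
        "fmt"
    )

    // seal binds the ciphertext to its location: decryption only succeeds if
    // the same table/column/primary-key string is supplied again. (A real
    // implementation should use an unambiguous encoding for the AAD.)
    func seal(gcm cipher.AEAD, plaintext []byte, table, column, pk string) []byte {
        nonce := make([]byte, gcm.NonceSize())
        if _, err := rand.Read(nonce); err != nil {
            panic(err)
        }
        aad := []byte(table + "/" + column + "/" + pk)
        return append(nonce, gcm.Seal(nil, nonce, plaintext, aad)...)
    }

    func open(gcm cipher.AEAD, stored []byte, table, column, pk string) ([]byte, error) {
        nonce, ct := stored[:gcm.NonceSize()], stored[gcm.NonceSize():]
        return gcm.Open(nil, nonce, ct, []byte(table+"/"+column+"/"+pk))
    }

    func main() {
        key := make([]byte, 32)
        rand.Read(key)
        block, _ := aes.NewCipher(key)
        gcm, _ := cipher.NewGCM(block)
        ct := seal(gcm, []byte("1990-03-04"), "users", "dob", "42")
        _, err := open(gcm, ct, "users", "dob", "43") // wrong row
        fmt.Println(err)                              // authentication fails
    }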


I love the way this is written. In just a paragraph, any reader, no matter their level of experience, gets a good understanding of the assumptions, strengths, and weaknesses of (my new favorite term) 'tortoise-shell security' (with very juicy meat inside).


pgsodium is also a very good option: https://github.com/michelp/pgsodium


Homomorphic encryption should solve the problem of field encryption, but not now. Note: the choice among the different solutions should be aligned with the sensitivity of the data, e.g. field encryption for highly sensitive data (bank data, health data, etc.), while the DB is, of course, only encrypted at rest.


Homomorphic encryption is literally hundreds of thousands of times slower than operations on plaintext. A comparison of two 64-bit integers can take upwards of 50ms. So even with a b-tree where maybe ~100 comparisons occur, the query will take 5s. A linear scan over 1M records would take 13 hours!

SSE, ORE, STE schemes are all far more practical.


Can you use that in Azure Flexible Server for Postgres as well?


> All this brings us to the project I run: Enquo. It takes application-layer encryption to a new level, by providing a language- and framework-agnostic cryptosystem that also enables encrypted data to be efficiently queried by the database.

If you want people to use this, don't bury the lede.

Is this a problem worth solving? How does aws/gcp/azure solve for this?


> Is this a problem worth solving? How does aws/gcp/azure solve for this?

Yes. I had to build something very similar, and neither GCP nor AWS "solve" this at all. They provide good building blocks to solve it, like KMS and tools for envelope encryption (e.g. https://cloud.google.com/sql/docs/postgres/client-side-encry...), but importantly, if you want to search on this encrypted data you need to roll your own with something like blind indexes (the linked project explains some of the problems with that), and it's even harder if you need to sort by that data, which this Enquo project also addresses.

There are a bunch of "PII vaulting services", companies like Very Good Security, that provide similar solutions, but it would be ideal to have this all securely encrypted in the DB if you're already using Postgres.


First, you complain that the person started by explaining the current state of the art and why a better solution is needed ("don't bury the lede").

Second, you wonder if the problem is worth solving and what the current state of the art is, which is exactly what the text you're complaining about answered clearly.

I'm personally quite happy with how they formatted this blogpost.


I know of deployments that chose Oracle because Oracle supports encryption and Postgres does not. They don't really care about encryption, but they need to tick a box for compliance.


Encrypting the data reduces the attacker’s options, and allows you to focus your resources on hardening the application against attack, safe in the knowledge that an attacker who gets into the database directly isn’t going to get anything valuable.

And what's more likely: that you get access to my public-facing crappy Node or Rails app, or to the Postgres server on a private local network?


Depends on whether every disk is wiped correctly when it's replaced.


Hence disk encryption?


I feel like this is security theater (like most modern IT security), a marketing checkbox or a certification requirement that doesn't protect against actually realistic attacks.

Are there any publicly available case reports of where an encrypted database protected data?


I think it depends on the situation, but there are some legitimate defense-in-depth benefits here.

I've personally run across "dump.sql" files before in public s3 buckets, and as I understand it, this would help in that case.


> "dump.sql" files before in public s3 buckets

That kind of sounds like something half of the companies I worked for would end up eventually doing, fair point.


It's not just theft, it can make DB management much easier.

For example, since our sensitive DB fields are encrypted, relevant developers can get full read-only access to the DB for debugging/analysis purposes without needing to worry about leaked PII. Similarly, we can log all of our DB queries, including fields, because the sensitive fields are encrypted.

This has huge operational benefits, and for compliance reasons is usually the best way to solve this problem. You can do things like limit access to columns by DB roles, but that is much more fraught, and it doesn't give you the logging benefits.


The security properties become much more interesting when the data remains encrypted when it leaves the database. For example, an ETL process doesn't necessarily need to know what data is, just the structure. Transferring data between systems can remain encrypted the whole time.



