
Recently went deep on efficient kv storage in Postgres; there is an order of magnitude difference in storage size between different approaches (naive skinny table, EAV, array values, mapped generic columns, etc).

I wonder what approach this project takes, I’ll have to poke around!



Can you share some more about your learnings? I'm familiar with EAV as a concept but not the performance implications of EAV vs jsonb. Googling "mapped generic columns" didn't turn up anything that seemed relevant, and I'm curious what you mean specifically by "array values" as a solution in this space.


Yeah, sorry, I just made up 'mapped generic columns'.

My progression was trying to find a solution for immutable key/values (immutability enabling some different tradeoffs than others might have), where most of my values are bigints, so only a few bytes each to store.

I worked through approaches:

1. Obvious skinny table with duplication.

2. EAV optimisation to reduce duplication.

The problem with both of these is that having 1 key/value per row with a small value (~8 bytes) meant 23-27 bytes of row overhead per value, which obviously increases your storage cost dramatically.
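Roughly the shape of 1 and 2 (table and column names made up here just for illustration, not the actual schema):

    -- 1. skinny table: one key/value per row, the key text duplicated on every row
    CREATE TABLE kv (
        key   text   NOT NULL,
        value bigint NOT NULL
    );

    -- 2. EAV-style: intern the keys into a lookup table to cut the duplication
    CREATE TABLE keys (
        key_id serial PRIMARY KEY,
        key    text   NOT NULL UNIQUE
    );

    CREATE TABLE kv_eav (
        key_id int    NOT NULL REFERENCES keys (key_id),
        value  bigint NOT NULL
    );

Either way, each stored value is still its own heap tuple, so it pays the ~23-byte tuple header plus alignment padding on top of the 8-byte bigint.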

So I looked at how to get more values onto each row, so it's maybe 23-27 bytes of overhead for 500 values.

3. EAV with values grouped into an array per row: similar to EAV, but instead of 1 value per row, map N values onto 1 row using an array column. So N values all sit in one array on a row and are accessed by index. This has the lowest storage overhead: thanks to row compression you can get an 8-byte bigint stored for only ~6 bytes per value on average (so no overhead, actual savings on the raw values). But value access via array index isn't very efficient, especially for sparse arrays or sparse queries across millions of rows, and Tom Lane says maybe not: https://marc.info/?l=pgsql-performance&m=131229661705005&w=2
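A sketch of what 3 could look like (names and the 500-values-per-row bucketing are made up for illustration):

    -- 3. N values packed into an array column per row; a value's id maps to
    --    a (bucket, array index) pair
    CREATE TABLE kv_array (
        key_id int      NOT NULL REFERENCES keys (key_id),
        bucket int      NOT NULL,           -- value_id / 500
        vals   bigint[] NOT NULL,           -- up to 500 values per row
        PRIMARY KEY (key_id, bucket)
    );

    -- fetch value_id 12345 for key 42: bucket 24, array index 346 (arrays are 1-based)
    SELECT vals[(12345 % 500) + 1]
    FROM kv_array
    WHERE key_id = 42 AND bucket = 12345 / 500;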

The good thing about 3 is that you can abstract the value -> row/array-index mapping in SQL, so the app doesn't need to care about it. But array access is slower than I'd like, so how to avoid that? Maybe map N values onto generic columns instead: no array access penalty, and less framework fighting.
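E.g. the mapping can be hidden behind a small SQL function (a made-up helper on the sketch schema above, not what we actually ran), so the app just asks for a value id:

    -- hide the value_id -> (bucket, array index) mapping behind a function
    CREATE FUNCTION get_value(p_key_id int, p_value_id int)
    RETURNS bigint LANGUAGE sql STABLE AS $$
        SELECT vals[(p_value_id % 500) + 1]
        FROM kv_array
        WHERE key_id = p_key_id AND bucket = p_value_id / 500;
    $$;

    SELECT get_value(42, 12345);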

4. Like 3, but instead of N values in an array per row, put the values in N generic columns per row (value_1, value_2, etc). It's too hard / bad ergonomics to do this mapping in SQL/PL/pgSQL, so now the app has to know which generic column a given value lives in. That was OK for us, and it has much better query performance. In terms of storage cost, there is a relationship between the number of columns/values per row, how much data has to be paged in to retrieve a value (performance), row overhead, and compression efficacy. We landed on 100 values per row I think, which meant a bigint took 8-10 bytes to store instead of the more efficient ~6 bytes of approach 3; it's tunable though, up to the column-per-row limit of ~1600.
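Shape of 4, shown only 4 columns wide here for brevity (the real thing would be ~100 columns wide, names again made up):

    -- 4. N generic bigint columns per row; the app, not SQL, decides which
    --    column a given value_id lives in
    CREATE TABLE kv_wide (
        key_id  int    NOT NULL REFERENCES keys (key_id),
        bucket  int    NOT NULL,    -- value_id / row_width
        value_1 bigint,
        value_2 bigint,
        value_3 bigint,
        value_4 bigint,
        PRIMARY KEY (key_id, bucket)
    );

    -- with 4 values per row, value_id 10 -> bucket 2, column value_3
    SELECT value_3 FROM kv_wide WHERE key_id = 42 AND bucket = 2;

The row width is the tuning knob mentioned above: fewer columns per row means more row-header overhead per value, more columns means more of the row has to be read to fetch a single value.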

Can dig up proper numbers later if useful, but that's the general idea.


Ah yeah it’s probably jsonb (duh). That’s not efficient, but sometimes you don’t care about that!



