This is really neat. I like how the compression can be applied to specific columns rather than whole rows - or whole pages. I could imagine storing JSON from an HTTP request here and using Generated Columns[1] to extract the bits my application is interested in.
I also like how simple and powerful the custom partitioning is.
Not the author, but if you have a write-intensive workload and I/O starts to be your bottleneck, compressing the data application-side lets your database write fewer bytes, and thus handle more load for the same capacity.
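A minimal sketch of the idea in Python, with made-up table and file names (I'm using SQLite's stdlib bindings purely for illustration; the same shape applies to a Postgres bytea column):

    import json
    import sqlite3
    import zlib

    conn = sqlite3.connect("events.db")  # hypothetical database file
    conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, body BLOB)")

    payload = {"d": 42, "message": "example HTTP response body"}

    # Compress application-side, so the database writes (and later reads) fewer bytes.
    compressed = zlib.compress(json.dumps(payload).encode("utf-8"))
    conn.execute("INSERT INTO events (body) VALUES (?)", (compressed,))
    conn.commit()

    # Decompress on the way back out, again in the application.
    row = conn.execute("SELECT body FROM events LIMIT 1").fetchone()
    restored = json.loads(zlib.decompress(row[0]))

The trade-off is that the database can no longer see inside the blob, so any querying on the contents has to happen application-side (or via something like the generated-column approach below, on an uncompressed copy of the fields you care about).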
ETL: you grab the data from somewhere, transform it to clean it/format it, then load it (in this case into Postgres).
ELT: you grab the data and load it as is. Then you use another step to extract what you need or transform it. You keep the original data in case you need it there too.
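A rough sketch of the ELT shape of that, with made-up table names (the transform step uses SQLite's JSON1 json_extract, which recent builds include):

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_events (body TEXT)")          # load target: data as it arrived
    conn.execute("CREATE TABLE clean_events (d INT, msg TEXT)")  # transform target

    # Load: store the payload untouched, so the original is always there to go back to.
    conn.execute("INSERT INTO raw_events (body) VALUES (?)",
                 (json.dumps({"d": 42, "message": "hello"}),))

    # Transform, as a later, separate step, leaving raw_events intact.
    conn.execute("""
        INSERT INTO clean_events (d, msg)
        SELECT json_extract(body, '$.d'), json_extract(body, '$.message')
        FROM raw_events
    """)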
On that subject I've often thought of writing a combination decompressor/JSON parser that can parse the contents of the dictionary before decompression for faster JSON parsing and lower memory usage.
Thanks, but that wasn't quite what I meant. What I was thinking about was processing existing gzipped or zstd compressed JSON rather than having a new compression format.
I could see how customising the compressor could be helpful - a bit like how there's an `--rsyncable` option to `gzip`, but I'd like to keep compatibility with existing formats. I like to avoid coming up with new formats whenever possible because you might want to mine the data in 20 years time - and if it's a custom format it'll be much harder to deal with.
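For the gzip case, a minimal sketch of the boring baseline that stays within existing formats, using only the Python standard library (the filename is made up, and the combined parse-during-decompression idea would obviously need real machinery beyond this):

    import gzip
    import json

    # Write a gzip-compressed JSON file as a stand-in for data compressed elsewhere.
    with gzip.open("payload.json.gz", "wt", encoding="utf-8") as f:
        json.dump({"d": 42}, f)

    # Read it back with the same standard tools: no custom container format,
    # so anything that speaks gzip and JSON can still open it in 20 years.
    with gzip.open("payload.json.gz", "rt", encoding="utf-8") as f:
        data = json.load(f)  # note: this still materialises the whole decompressed text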
I don't have a current use-case for this, but I like thinking about it. In particular I like data stores to be as simple as possible. Managing persistent state is the hard bit of most systems: unlike with runtime state or code, you have to manage forwards and backwards compatibility, migrations, etc., and you can't just "turn it off and on again" to fix issues.
a lot of document key/value databases that store arbitrary JSON tend to work this way (for example Cassandra does this, IIRC). some will even automatically build indices for you (though hand-tuning is usually better)
wait, are you saying that the columns of view tables could be compressed? that doesn't make sense to me cause views don't use extra space to begin with....?
storing raw JSON in sqlite means the entire blob would be compressed and live in a single column, right?
You'd compress the JSON, and leave the data that you extract from the JSON uncompressed using Generated Columns[1]. That way you can save the original JSON data, but still have fast querying, etc.
It would look something like this:
    CREATE TABLE t1(
      json TEXT,
      data INT GENERATED ALWAYS AS (json_extract(json, '$.d')) STORED
    );
If all you need is fast querying you could make `data` a `VIRTUAL` column and set up an index on it.
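That variant would look roughly like this (same illustrative column names; needs a SQLite build new enough to have generated columns):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # VIRTUAL means the value is computed on read and takes no space in the row,
    # while the index below stores the extracted values for fast lookups.
    conn.executescript("""
        CREATE TABLE t1(
          json TEXT,
          data INT GENERATED ALWAYS AS (json_extract(json, '$.d')) VIRTUAL
        );
        CREATE INDEX t1_data ON t1(data);
    """)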
[1]: https://www.sqlite.org/gencol.html