This is really neat. I like how the compression can be applied to specific columns rather than whole rows - or whole pages. I could imagine storing JSON from an HTTP request here and using Generated Columns[1] to extract the bits my application is interested in.
I also like how simple and powerful the custom partitioning is.
Not the author, but if you have a write-intensive workload and I/O starts to be your bottleneck, compressing the data application-side lets your database write fewer bytes, and thus handle more load for the same capacity.
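A minimal sketch of the idea in Python, with made-up table and file names (I'm using SQLite's stdlib bindings purely for illustration; the same shape applies to a Postgres bytea column):

    import json
    import sqlite3
    import zlib

    conn = sqlite3.connect("events.db")  # hypothetical database file
    conn.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, body BLOB)")

    payload = {"d": 42, "message": "example HTTP response body"}

    # Compress application-side, so the database writes (and later reads) fewer bytes.
    compressed = zlib.compress(json.dumps(payload).encode("utf-8"))
    conn.execute("INSERT INTO events (body) VALUES (?)", (compressed,))
    conn.commit()

    # Decompress on the way back out, again in the application.
    row = conn.execute("SELECT body FROM events LIMIT 1").fetchone()
    restored = json.loads(zlib.decompress(row[0]))

The trade-off is that the database can no longer see inside the blob, so any querying on the contents has to happen application-side (or via something like the generated-column approach below, on an uncompressed copy of the fields you care about).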
ETL: you grab the data from somewhere, transform it to clean it/format it, then load it (in this case into Postgres).
ELT: you grab the data and load it as is. Then you use another step to extract what you need or transform it. You keep the original data in case you need it there too.
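A rough sketch of the ELT shape of that, with made-up table names (the transform step uses SQLite's JSON1 json_extract, which recent builds include):

    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_events (body TEXT)")          # load target: data as it arrived
    conn.execute("CREATE TABLE clean_events (d INT, msg TEXT)")  # transform target

    # Load: store the payload untouched, so the original is always there to go back to.
    conn.execute("INSERT INTO raw_events (body) VALUES (?)",
                 (json.dumps({"d": 42, "message": "hello"}),))

    # Transform, as a later, separate step, leaving raw_events intact.
    conn.execute("""
        INSERT INTO clean_events (d, msg)
        SELECT json_extract(body, '$.d'), json_extract(body, '$.message')
        FROM raw_events
    """)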
On that subject I've often thought of writing a combination decompressor/JSON parser that can parse the contents of the dictionary before decompression for faster JSON parsing and lower memory usage.
Thanks, but that wasn't quite what I meant. What I was thinking about was processing existing gzipped or zstd compressed JSON rather than having a new compression format.
I could see how customising the compressor could be helpful - a bit like how there's an `--rsyncable` option to `gzip`, but I'd like to keep compatibility with existing formats. I like to avoid coming up with new formats whenever possible because you might want to mine the data in 20 years time - and if it's a custom format it'll be much harder to deal with.
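For the gzip case, a minimal sketch of the boring baseline that stays within existing formats, using only the Python standard library (the filename is made up, and the combined parse-during-decompression idea would obviously need real machinery beyond this):

    import gzip
    import json

    # Write a gzip-compressed JSON file as a stand-in for data compressed elsewhere.
    with gzip.open("payload.json.gz", "wt", encoding="utf-8") as f:
        json.dump({"d": 42}, f)

    # Read it back with the same standard tools: no custom container format,
    # so anything that speaks gzip and JSON can still open it in 20 years.
    with gzip.open("payload.json.gz", "rt", encoding="utf-8") as f:
        data = json.load(f)  # note: this still materialises the whole decompressed text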
I don't have a current use-case for this, but I like thinking about it. In particular I like data stores to be as simple as possible. Managing persistent state is the hard bit of most systems: unlike with runtime state or code, you have to manage forwards and backwards compatibility, migrations, etc., and you can't just "turn it off and on again" to fix issues.
a lot of document key/value databases that store arbitrary JSON tend to work this way (for example Cassandra does this, IIRC). some will even automatically build indices for you (though hand-tuning is usually better)
wait, are you saying that the columns of view tables could be compressed? that doesn't make sense to me cause views don't use extra space to begin with....?
storing raw JSON in sqlite means the entire blob would be compressed and live in a single column, right?
You'd compress the JSON, and leave the data that you extract from the JSON uncompressed using Generated Columns[1]. That way you can save the original JSON data, but still have fast querying, etc.
It would look something like this:
    CREATE TABLE t1(
      json TEXT,
      data INT GENERATED ALWAYS AS (json_extract(json, '$.d')) STORED
    );
If all you need is fast querying you could make `data` a `VIRTUAL` column and set up an index on it.
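That variant would look roughly like this (same illustrative column names; needs a SQLite build new enough to have generated columns):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # VIRTUAL means the value is computed on read and takes no space in the row,
    # while the index below stores the extracted values for fast lookups.
    conn.executescript("""
        CREATE TABLE t1(
          json TEXT,
          data INT GENERATED ALWAYS AS (json_extract(json, '$.d')) VIRTUAL
        );
        CREATE INDEX t1_data ON t1(data);
    """)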
[1]: https://www.sqlite.org/gencol.html