One option not listed under "Other options for compressing data in a database", but probably closest to what this solution does under the hood, is row-interdependent compression at the application level.
That is, while saving rows to the database, pick out every (say) 10th row and compress it by itself, but keep the compression state in memory. Then compress each of the next 9 rows against that dictionary, storing a reference to the row it's based on. (For example, in Python's zlib module, you'd use the zdict parameter [1] of compressobj and decompressobj.) A bit like keyframes in video compression.
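A rough sketch of what I mean, in Python (untested; the keyframe bookkeeping and storage format are just placeholders):

    import zlib

    def compress_batch(rows):
        # rows: list of bytes blobs. rows[0] becomes the "keyframe" and is
        # compressed on its own; the rest are compressed against it via zdict.
        keyframe = rows[0]
        out = [(zlib.compress(keyframe), None)]
        for row in rows[1:]:
            c = zlib.compressobj(zdict=keyframe)
            out.append((c.compress(row) + c.flush(), 0))  # 0 = index of the keyframe row
        return out

    def decompress_row(blob, keyframe):
        # Requires fetching (and decompressing) the keyframe row first.
        d = zlib.decompressobj(zdict=keyframe)
        return d.decompress(blob) + d.flush()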
You can potentially get better results this way than a generic solution like the one in the article, because you might have some application knowledge about what rows are likely to be similar to others. But it obviously requires application-level code (e.g. fetching a logical row requires fetching two physical rows so you have enough information to decode), and it would be a nightmare if you ever need to remove or update rows.

[1] https://docs.python.org/3/library/zlib.html#zlib.compressobj
> You can potentially get better results this way than a generic solution like the one in the article, because you might have some application knowledge about what rows are likely to be similar to others
The linked project actually does exactly this: you can specify an SQL expression (such as strftime(date, '%Y-%m')) that is then used to group rows together and decide what to train dictionaries on. What you're describing as picking every 10th row would be the expression (id/10). So you can use application knowledge; you just don't have to.
The difference is that in my project the dictionary is stored separately, so it's not affected by row deletions, and it is trained on all of the samples, not just one of them. Since the dictionary is stored separately, it's only worth it if you combine at least around 100-1000 rows under one dictionary.
> Python's zlib module, you'd use the zdict parameter
Yes, the training feature of zstd is basically a better version of keeping another sample of the data around and using it as a prefix. In fact, you can pass a data sample instead of a trained dictionary as the `-D` parameter of the zstd command line; training just adds the step of reducing a set of samples down to the most essential (data-saving) parts.
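To make that concrete with the third-party `zstandard` Python package (nothing to do with this project; just illustrating raw-sample vs trained dictionary, and the exact numbers will vary):

    import zstandard

    # Fake "rows" that share a lot of structure (keys, enum values, ...).
    samples = [
        ('{"id": %d, "kind": "sensor-reading", "unit": "celsius", '
         '"value": %f, "site": "building-%d"}' % (i, i * 0.1, i % 7)).encode()
        for i in range(2000)
    ]

    # Option 1: treat one raw sample as the dictionary, like `zstd -D somefile`.
    raw = zstandard.ZstdCompressionDict(samples[0], dict_type=zstandard.DICT_TYPE_RAWCONTENT)

    # Option 2: train a dictionary that keeps only the most useful substrings.
    trained = zstandard.train_dictionary(4096, samples)

    for label, d in (("raw sample", raw), ("trained", trained)):
        c = zstandard.ZstdCompressor(dict_data=d)
        dec = zstandard.ZstdDecompressor(dict_data=d)
        blob = c.compress(samples[123])
        assert dec.decompress(blob) == samples[123]
        print(label, len(blob))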
Thanks for making this project; it looks fantastic! I might end up using it if I can figure out how to use it from Python.
I did mention that this is what you're doing; I just thought it wasn't clear from the blog post. But I did miss that you can use an SQL expression to choose the grouping. This is ideal for the application I'm thinking of, which has a "message type" field that corresponds pretty closely with which rows are similar.
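Presumably something along these lines would work from Python (untested, and I'm guessing at the extension filename and the exact config keys from memory of the README):

    import sqlite3

    conn = sqlite3.connect("messages.db")
    conn.enable_load_extension(True)
    conn.load_extension("./libsqlite_zstd")  # name/path of the compiled extension, platform-dependent

    # Group rows by the message_type column so each type gets its own dictionary.
    conn.execute("""
        SELECT zstd_enable_transparent('{
            "table": "messages",
            "column": "body",
            "compression_level": 19,
            "dict_chooser": "message_type"
        }')
    """).fetchone()

    # Actual compression happens in the maintenance call
    # (args are time budget and db load, if I read the README right).
    conn.execute("SELECT zstd_incremental_maintenance(null, 1)").fetchone()
    conn.commit()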
Your explanation is great. I like the keyframe codec analogy. Like you said, though, this really belongs at the DB layer. An application-level solution would be much more than a nightmare.
Plugin/extension/module development is severely underused in our field. Extensibility, when possible, is often the best way to handle edge cases or incremental changes in a dependency without forking it.