One option not listed under "Other options for compressing data in a database", but probably closest to what this solution does under the hood, is row-interdependent compression at the application level.
That is, while saving rows to the database, pick out every (say) 10th row and compress it by itself, but keep the compression state in memory. Then compress each of the next 9 rows against that dictionary, storing a reference to the row it's based on. (For example, in Python's zlib module, you'd use the zdict parameter [1] of compressobj and decompressobj.) A bit like keyframes in video compression.
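A rough sketch of what I mean, in Python (untested; the keyframe bookkeeping and storage format are just placeholders):

    import zlib

    def compress_batch(rows):
        # rows: list of bytes blobs. rows[0] becomes the "keyframe" and is
        # compressed on its own; the rest are compressed against it via zdict.
        keyframe = rows[0]
        out = [(zlib.compress(keyframe), None)]
        for row in rows[1:]:
            c = zlib.compressobj(zdict=keyframe)
            out.append((c.compress(row) + c.flush(), 0))  # 0 = index of the keyframe row
        return out

    def decompress_row(blob, keyframe):
        # Requires fetching (and decompressing) the keyframe row first.
        d = zlib.decompressobj(zdict=keyframe)
        return d.decompress(blob) + d.flush()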
You can potentially get better results this way than a generic solution like the one in the article, because you might have some application knowledge about what rows are likely to be similar to others. But it obviously requires application-level code (e.g. fetching a logical row requires fetching two physical rows so you have enough information to decode), and it would be a nightmare if you ever need to remove or update rows.

[1] https://docs.python.org/3/library/zlib.html#zlib.compressobj
> You can potentially get better results this way than a generic solution like the one in the article, because you might have some application knowledge about what rows are likely to be similar to others
The linked project actually does exactly this: you can specify an SQL expression (such as strftime(date, '%Y-%m')) that is then used to group rows together and decide what to train dictionaries on. What you're describing as picking every 10th row would be the expression (id/10). So you can use application knowledge; you just don't have to.
The difference is that in my project the dictionary is stored separately, so it's not affected by row deletions, and it is trained on all of the samples, not just one of them. Since the dictionary is stored separately, it's only worth it if you combine at least around 100-1000 rows under one dictionary.
> Python's zlib module, you'd use the zdict parameter
Yes, the training feature of zstd is basically a better version of keeping another sample of the data around and using it as a prefix. In fact, you can pass a data sample instead of a trained dictionary as the `-D` parameter of the zstd command line; training just adds the step of reducing a set of samples down to the most essential (data-saving) parts.
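To make that concrete with the third-party `zstandard` Python package (nothing to do with this project; just illustrating raw-sample vs trained dictionary, and the exact numbers will vary):

    import zstandard

    # Fake "rows" that share a lot of structure (keys, enum values, ...).
    samples = [
        ('{"id": %d, "kind": "sensor-reading", "unit": "celsius", '
         '"value": %f, "site": "building-%d"}' % (i, i * 0.1, i % 7)).encode()
        for i in range(2000)
    ]

    # Option 1: treat one raw sample as the dictionary, like `zstd -D somefile`.
    raw = zstandard.ZstdCompressionDict(samples[0], dict_type=zstandard.DICT_TYPE_RAWCONTENT)

    # Option 2: train a dictionary that keeps only the most useful substrings.
    trained = zstandard.train_dictionary(4096, samples)

    for label, d in (("raw sample", raw), ("trained", trained)):
        c = zstandard.ZstdCompressor(dict_data=d)
        dec = zstandard.ZstdDecompressor(dict_data=d)
        blob = c.compress(samples[123])
        assert dec.decompress(blob) == samples[123]
        print(label, len(blob))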
Thanks for making this project; it looks fantastic! I might end up using it if I can figure out how to use it from Python.
I did mention that this is what you're doing; I just thought it wasn't clear from the blog post. But I did miss that you can use an SQL expression to choose the grouping. This is ideal for the application I'm thinking of, which has a "message type" field that corresponds pretty closely with which rows are similar.
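Presumably something along these lines would work from Python (untested, and I'm guessing at the extension filename and the exact config keys from memory of the README):

    import sqlite3

    conn = sqlite3.connect("messages.db")
    conn.enable_load_extension(True)
    conn.load_extension("./libsqlite_zstd")  # name/path of the compiled extension, platform-dependent

    # Group rows by the message_type column so each type gets its own dictionary.
    conn.execute("""
        SELECT zstd_enable_transparent('{
            "table": "messages",
            "column": "body",
            "compression_level": 19,
            "dict_chooser": "message_type"
        }')
    """).fetchone()

    # Actual compression happens in the maintenance call
    # (args are time budget and db load, if I read the README right).
    conn.execute("SELECT zstd_incremental_maintenance(null, 1)").fetchone()
    conn.commit()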
Your explanation is great. I like the keyframe codec analogy. Like you said, though, this really belongs at the DB layer. An application-level solution would be much more than a nightmare.
Plugin/extension/module development is severely underused in our field. Extensibility, when possible, is often the best way to handle edge cases or incremental changes in a dependency without forking it.