
One option not listed (under "Other options for compressing data in a database"), but probably closest to what this solution does under the hood, is to do row-interdependent compression at the application level.

That is, while saving rows to the database, pick out every (say) 10th row, and compress it by itself but keep the compression state in memory. Then for the next 9 rows, compress that row based on that dictionary, and a reference to the row it's based on. (For example, in Python's zlib module, you'd use the zdict parameter [1] of compress and decompress.) A bit like keyframes in video compression.

You can potentially get better results this way than a generic solution like the article, because you might have some application knowledge about what rows are likely to be similar to others. But it obviously requires application level code (e.g. fetching a logical row requires fetching two physical rows so you have enough information to decode), and it would be a nightmare if you need to remove or update the rows.

[1] https://docs.python.org/3/library/zlib.html#zlib.compressobj
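A minimal sketch of that keyframe scheme, using the zdict parameter mentioned above (the row format, group size, and storage layout here are made up for illustration):

```python
import zlib

# Made-up sample data: 20 similar JSON rows
rows = [f'{{"id":{i},"user":"u{i % 3}","event":"login"}}'.encode() for i in range(20)]

KEY_EVERY = 10          # every 10th row is a standalone "keyframe"
stored = []             # (is_key, keyframe_index, compressed_blob)

for i, row in enumerate(rows):
    key_idx = i - i % KEY_EVERY
    if i == key_idx:
        # Keyframe: compressed on its own
        stored.append((True, i, zlib.compress(row)))
    else:
        # Delta row: compressed against its keyframe via zdict
        c = zlib.compressobj(zdict=rows[key_idx])
        stored.append((False, key_idx, c.compress(row) + c.flush()))

def decode(idx):
    # Fetching a delta row costs two physical reads: keyframe + row
    is_key, key_idx, blob = stored[idx]
    if is_key:
        return zlib.decompress(blob)
    key = zlib.decompress(stored[key_idx][2])
    d = zlib.decompressobj(zdict=key)
    return d.decompress(blob) + d.flush()

assert decode(7) == rows[7]
```

Deleting or rewriting a keyframe row invalidates every delta row that references it, which is exactly why this gets nightmarish at the application level.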




> You can potentially get better results this way than a generic solution like the article because you might have some application knowledge about what rows are likely to be similar to others

The linked project actually does exactly this - you can basically specify an SQL expression (such as strftime('%Y-%m', date)) that is then used to group rows together in order to decide what to train dictionaries on. What you're describing as picking every 10th row would be the expression (id/10). So you can use application knowledge, you just don't have to.
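For concreteness, a hedged sketch of what that configuration might look like (the function name and JSON keys here, like `dict_chooser`, are based on my reading of the project's README and may not match the current API):

```sql
-- Assumed API: enable transparent compression of events.data,
-- training one dictionary per calendar month of the date column
SELECT zstd_enable_transparent('{
    "table": "events",
    "column": "data",
    "compression_level": 19,
    "dict_chooser": "strftime(''%Y-%m'', date)"
}');
```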

The difference is that in my project the dictionary is stored separately so not affected by row deletions, and the trained dictionary is based on all samples not just one of them. Since the dictionary is stored separately it's only worth it if you combine at least maybe 100-1000 rows under one dictionary.

> Python's zlib module, you'd use the zdict parameter

Yes, the training feature of zstd is basically a better version of keeping another sample of data around and using it as a prefix. In fact you can use a data sample instead of a trained dictionary as the `-D` parameter of the zstd command line, zstd just has the additional feature of reducing a set of samples down to the most essential (data-saving) parts.
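As a rough illustration of the raw-sample-as-dictionary feature, assuming the zstd CLI is installed (the file names and contents are made up for the demo):

```shell
# Two hypothetical JSON rows that share most of their structure
printf '{"user":"alice","event":"login","ts":1}\n' > sample.json
printf '{"user":"bob","event":"login","ts":2}\n' > row.json

# Use the raw sample directly as the dictionary via -D (no training step)
zstd -q -f -D sample.json row.json -o row.json.zst

# Decompression must supply the same dictionary
zstd -q -f -d -D sample.json row.json.zst -o row.out
cmp row.json row.out && echo "round-trip ok"
```

A trained dictionary produced with `zstd --train samples/* -o rows.dict` plugs into that same `-D` flag; it's just distilled from many samples instead of being one of them.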


Thanks for making this project, it looks fantastic! I might end up using it if I can figure out how to use it from Python.

I did mention that this is what you're doing, I just thought it wasn't clear from the blog post. But I did miss that you can use an SQL expression to choose the grouping. This is ideal for the application I'm thinking of, which has a "message type" field that corresponds pretty closely with which rows are similar.


You can use it from python as follows:

    db = sqlite3.connect(...)
    db.enable_load_extension(True)  # takes a bool; extension loading is off by default
    db.load_extension("libsqlite_zstd.so")
You probably still have to figure out how to make that work cross-platform / how to ship that binary though.
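If it helps, a sketch of the cross-platform part (the suffix table and helper are my own invention, and the extension filename is assumed):

```python
import platform
import sqlite3

# Hypothetical helper: map OS to its shared-library suffix so the same
# code can load the extension on Linux, macOS, and Windows.
SUFFIXES = {"Linux": ".so", "Darwin": ".dylib", "Windows": ".dll"}

def load_zstd_extension(db: sqlite3.Connection, basename: str = "libsqlite_zstd") -> None:
    suffix = SUFFIXES.get(platform.system(), ".so")
    db.enable_load_extension(True)
    try:
        db.load_extension(basename + suffix)  # raises if the binary isn't present
    finally:
        db.enable_load_extension(False)       # re-disable as a precaution
```

Shipping the compiled binary for each platform alongside your package is still on you, though.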


Oh interesting, I was assuming that recompiling the whole sqlite module would be in order. Thanks again.


Your explanation is great. I like the key frame codec analogy. Like you said though, this really belongs at the DB layer. An application level solution would be much more than a nightmare.

Plugin/extension/module development is severely underused in our field. Extensibility, where possible, is often the best way to handle edge cases or incremental needs in a dependency without forking it.

See "Benefits of Being an Extension to PostgreSQL": https://www.citusdata.com/blog/2017/10/25/what-it-means-to-b...

Some great software that is extensible in brilliant ways through plugins, that comes to mind, is:

postgres: https://www.postgresql.org/docs/current/external-extensions....

nginx: https://www.nginx.com/resources/wiki/modules/

sqlite: https://www.sqlite.org/loadext.html

redis: https://redis.io/docs/modules/

babel: https://babeljs.io/docs/en/plugins/



