
I am happy for the author, but I want to warn the credulous that this is a very strange way to approach compression with SQLite. The much more common way is to use a SQLite VFS:

https://www.sqlite.org/vfs.html

A brief search for "SQLite compressed VFS" will turn up many extensions. Some are quite mature.

I get the impression that the author doesn't know about the VFS layer, as it's not even mentioned in the README. The clickbait title similarly seems amateurish. Caveat emptor.




I did actually previously write a VFS, though it did something else entirely:

https://phiresky.github.io/blog/2021/hosting-sqlite-database...

You're right that a comparison to compression VFSes is completely missing. I knew about their existence, but I have to admit I didn't look too closely.

Note that the main "novelty" here is training a shared / partitioned compression dictionary specifically to take advantage of redundancy that wouldn't appear within a single row, or even within a single database page / block. The compression happens at the row level and can additionally use application knowledge - you couldn't do that at the VFS level. For example, you can have separate compression dictionaries per column and per group of rows that share some commonality.
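
A rough sketch of the idea, using the Python zstandard bindings rather than this extension's actual API (the sample data and dictionary size here are made up):

    import zstandard as zstd

    # Pretend these are the values of one TEXT column across many rows.
    # Each value compresses poorly on its own, but the values share a
    # lot of structure that a trained dictionary can exploit.
    samples = [
        f'{{"id":{i},"level":"info","msg":"user {i % 100} logged in"}}'.encode()
        for i in range(20_000)
    ]

    # Train one dictionary per column (or per group of similar rows).
    dictionary = zstd.train_dictionary(16 * 1024, samples)

    compressor = zstd.ZstdCompressor(level=19, dict_data=dictionary)
    row = samples[0]
    compressed = compressor.compress(row)
    print(len(row), "->", len(compressed))

    # Decompression needs the same dictionary again:
    decompressor = zstd.ZstdDecompressor(dict_data=dictionary)
    assert decompressor.decompress(compressed) == row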

I'll have to compare against a compression VFS (do you have a favorite?) and see if maybe the two methods can even be combined.

Edit: I see that https://github.com/mlin/sqlite_zstd_vfs does actually also train dictionaries. It still works at the database-page level, so it can't take application knowledge into account or compress only parts of the data, but that's still pretty smart.


How does this compare with modern compression algorithms like Brotli that do context modeling, etc.? I've found that they manage to aggressively compress kinds of data you wouldn't expect to compress well, to the point that investing energy into doing compression yourself doesn't provide big returns anymore. The downside is that codecs like Brotli tend to compress very slowly, but I can imagine a setup where you only compress old rows. It would be cool to see an experiment that just compresses rows or columns and compares the sizes against your method.
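
Roughly the experiment I have in mind, sketched with the brotli PyPI package (the database, table, and column names are placeholders):

    import sqlite3
    import brotli

    conn = sqlite3.connect("app.db")  # hypothetical database
    total_raw = total_compressed = 0

    # Compress each row's payload individually -- the granularity the
    # extension works at -- and compare summed sizes with the originals.
    for (payload,) in conn.execute("SELECT payload FROM events"):
        raw = payload if isinstance(payload, bytes) else payload.encode()
        total_raw += len(raw)
        total_compressed += len(brotli.compress(raw, quality=11))

    print(total_raw, "->", total_compressed)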


Zstd with trained dictionaries should beat Brotli, and Brotli will struggle with non-text data, although I haven't benchmarked it.

Your underlying point remains valid: the incremental complexity of building that training data probably doesn't warrant it, because the situations where it becomes valuable are quite rare, particularly for typical SQLite databases. Still a neat trick, though.
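
If you want to see why dictionaries matter for small rows, here's a quick sketch with the zstandard and brotli PyPI packages (the sample data is made up):

    import brotli
    import zstandard as zstd

    # Many small, similar records: the worst case for per-row
    # compression without a dictionary.
    samples = [
        f'{{"user_id":{i},"status":"active"}}'.encode()
        for i in range(20_000)
    ]
    record = samples[0]

    dictionary = zstd.train_dictionary(4 * 1024, samples)
    with_dict = zstd.ZstdCompressor(level=19, dict_data=dictionary)
    no_dict = zstd.ZstdCompressor(level=19)

    print("raw:        ", len(record))
    print("brotli:     ", len(brotli.compress(record, quality=11)))
    print("zstd:       ", len(no_dict.compress(record)))
    print("zstd + dict:", len(with_dict.compress(record)))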


I was surprised to discover that Brotli is actually very adaptable. I spent a month or two doing research on custom compression techniques for WebAssembly early in the spec process, and we ended up discovering that you can just throw a naive binary executable through Brotli and end up with at most about a 5% size loss versus doing a bunch of fancy compression, at which point the cost of the fancy compression starts looking questionable. We ended up not shipping custom compression as part of the spec as a result.


Curious if you compared Zstd against Brotli. I'd expect Zstd to beat Brotli by a fair margin for non-text payloads (and to be faster at decompression, which matters since these payloads are compressed once and decompressed many times).


We didn't, and I'll have to make a note to investigate zstd for my own purposes later! Brotli is kind of the only game in town since it's shipped in every web browser now as a transport codec (the other one is gzip).


Let's not water down the meaning of the word "clickbait". I have no problem with this title.


> amateurish

There's no need to be this dismissive.


welcome to HN


Meh, I find HN to be among the least dismissive places on the internet, especially when you only consider nerd-heavy places. I like to think that this is in part due to comments like the one I wrote being commonplace. I've been at the receiving end of them as well (including a few kicks in the butt from dang himself), and it's helped me assume good intent, not be overly dismissive, etc. I'm just passing the favour along.


I think the frustrating thing about HN is the (relatively) high number of comments that are both dismissive and substantive. It's much easier (for my brain at least) to ignore comments where the entirety is "this sucks" or "lame" (or even "If I made something this crappy I would throw myself off a building" which isn't uncommon in some forums).

In comparison, gp comment is both dismissive and substantive in that it brings up SQLite VFS layers that can accomplish a similar goal. Author of TFA even updated the README to mention these, addressing gp's complaint.

The whole "this is crap because of X", where X is something that could subjectively be considered legitimate criticism, is relatively common on HN, so the comments I don't automatically skip over are rather more dismissive than on other forums.


To call it strange or unusual is fine, but that isn't a compelling argument against using it. Is it better or worse than a VFS approach, and why?


Taking an amateurish, fancy, or experimental approach to compressing a SQLite database is a great way to end up with data loss. That doesn't make it a worthless or "bad" project, but it's worthy of a warning.


Why not read the SQLite team's own explanation:

https://www.sqlite.org/zipvfs/doc/trunk/www/howitworks.wiki


> Briefly search for SQLite compressed vfs and you will find many extensions.

I found three with a brief search: CEVFS, sqlite_zstd_vfs, and ZIPVFS (not free, so I'm discounting it).

Neither CEVFS nor sqlite_zstd_vfs supports WAL mode (which means, e.g., no Litestream replication).

Can you recommend a mature one that does support WAL mode?
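
For reference, this is roughly how I tested WAL support (a Python sketch; the extension path and the "zstd" VFS name are assumptions, so check your extension's docs):

    import sqlite3

    # Load the VFS extension on a throwaway connection; the VFS it
    # registers stays available process-wide. Requires a Python build
    # with loadable-extension support.
    boot = sqlite3.connect(":memory:")
    boot.enable_load_extension(True)
    boot.load_extension("./zstd_vfs")  # hypothetical path
    boot.close()

    # Open a database through the compressed VFS and ask for WAL mode.
    conn = sqlite3.connect("file:test.db?vfs=zstd", uri=True)
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Prints "wal" if the VFS supports it; a VFS without shared-memory
    # support will stay on its previous mode or raise an error.
    print(mode)
    conn.close()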



