The problem with Parquet is it’s static. Not good for use cases that involve continuous writes and updates. Although I have had good results with DuckDB and Parquet files in object storage. Fast load times.
If you host your own embedding model, you can transmit the numpy float32 arrays as compressed bytes and then decode them back into numpy arrays on the receiving end.
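Something like this round trip works (a rough sketch; zlib is just an example codec, any byte-level compressor slots in the same way):

    import zlib
    import numpy as np

    def encode_embedding(vec: np.ndarray) -> bytes:
        # Ensure a contiguous float32 buffer, then compress the raw bytes.
        return zlib.compress(np.ascontiguousarray(vec, dtype=np.float32).tobytes())

    def decode_embedding(payload: bytes, dim: int) -> np.ndarray:
        # Decompress and reinterpret the bytes as a float32 vector of known dimension.
        return np.frombuffer(zlib.decompress(payload), dtype=np.float32).reshape(dim)

    vec = np.random.rand(768).astype(np.float32)
    assert np.array_equal(decode_embedding(encode_embedding(vec), 768), vec)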
Personally I prefer using SQLite with the usearch extension: binary vectors for the first pass, then rerank the top 100 with float32. It’s about 2 ms for ~20k items, which beats LanceDB in my tests. Maybe Lance wins on bigger collections, but for my use case it works great, as each user has their own dedicated SQLite file.
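The two-stage search is roughly this pattern (a plain numpy sketch of the idea, not the actual usearch calls; the function names are mine):

    import numpy as np

    def to_binary(vectors: np.ndarray) -> np.ndarray:
        # Quantize float32 vectors to packed bits: 1 where a component is positive.
        return np.packbits(vectors > 0, axis=1)

    def search(query, binary_db, float_db, k=10, prefilter=100):
        # Stage 1: cheap Hamming distance over the packed binary vectors.
        q_bits = np.packbits(query > 0)
        hamming = np.unpackbits(binary_db ^ q_bits, axis=1).sum(axis=1)
        candidates = np.argpartition(hamming, prefilter)[:prefilter]
        # Stage 2: exact cosine similarity on the float32 originals of the candidates.
        cand = float_db[candidates]
        sims = cand @ query / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
        return candidates[np.argsort(-sims)[:k]]

    float_db = np.random.randn(20_000, 768).astype(np.float32)
    binary_db = to_binary(float_db)
    top = search(np.random.randn(768).astype(np.float32), binary_db, float_db)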
> The problem with Parquet is it’s static. Not good for use cases that involve continuous writes and updates.
parquet is columnar storage, so its use case is heavy filtering/aggregation within analytical workloads (OLAP).
continuous writes/updates, i.e. basically transactional (OLTP) use cases, are never going to perform well in columnar storage. it's the wrong format to use for that.
for faster writes/updates you'd want something row-based, e.g. CSV or an actual database. which i'm glad to see is where you kind of ended up anyway.
There's no reason why an update query that doesn't change the file layout and only twiddles some values in place couldn't be made fast with columnar storage.
When you run a read query, there's one phase that determines the offsets where values are stored and another that reads the value at a given offset. For an update query that doesn't change the offsets, you can change the direction from reading the value at an offset to writing a new value to that location instead, and it should be plenty fast.
Parquet libraries just don't seem to consider that use case worth supporting for some reason and expect people to generate an entire new file with mostly the same content instead. Which definitely doesn't have great performance!
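As a toy illustration, assuming a plain fixed-width, uncompressed column sitting on disk (which is not how Parquet actually lays data out), the update really is just a seek and a write:

    import numpy as np

    # A "column" of a million int64 values stored raw on disk.
    np.arange(1_000_000, dtype=np.int64).tofile("price_column.bin")

    # Memory-map the file and overwrite one value in place.
    col = np.memmap("price_column.bin", dtype=np.int64, mode="r+")
    col[123_456] = 42   # effectively: seek to 123_456 * 8 bytes, write 8 bytes
    col.flush()

    assert np.fromfile("price_column.bin", dtype=np.int64)[123_456] == 42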
Columnar storage systems rarely store the raw value at a fixed position. They store values as run-length encoded, dictionary encoded, delta encoded, etc., and then store metadata about chunks of values for pruning at query time. So you can rarely seek to an offset and update a value. The compression achieved means less data to read from disk when doing large scans and lower storage costs for very large datasets that are largely immutable - some of the important benefits of columnar storage.
Also, many applications that require updates also update conditionally (update a where b = c). This requires re-synthesizing (at least some of) the row to make a comparison, another relatively expensive operation for a column store.
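To make the layout problem concrete, here is a hypothetical run-length-encoded column: changing a single row splits its run, which shifts every offset after it.

    def rle_update(runs, row, new_value):
        # runs is a list of (value, count) pairs; returns a new run list with one row changed.
        out, start = [], 0
        for value, count in runs:
            if start <= row < start + count:
                before, after = row - start, count - (row - start) - 1
                if before:
                    out.append((value, before))
                out.append((new_value, 1))
                if after:
                    out.append((value, after))
            else:
                out.append((value, count))
            start += count
        return out

    # One run of 1000 identical values becomes three runs after updating one row.
    print(rle_update([(5, 1000)], row=437, new_value=7))
    # [(5, 437), (7, 1), (5, 562)]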
Also, the data is typically stored with general-purpose binary compression (snappy, zlib) on top of those columnar encodings. In-memory formats might only use the semantic encoding, e.g. Arrow.
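For reference, both layers are explicit knobs when writing with pyarrow (a small sketch; the column names are made up and the defaults already behave roughly like this):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "country": ["DE", "DE", "FR", "DE"],   # low cardinality, dictionary-encodes well
        "amount": [1.5, 2.0, 0.7, 3.1],
    })

    # Semantic encoding (dictionary) plus a general-purpose codec (snappy) on top.
    pq.write_table(table, "payments.parquet", use_dictionary=True, compression="snappy")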
But it's... fine? Batch the writes and rewrite the dirty parts. Most of our cases are either appending events or enriching with new columns, which can be modeled columnarly. It is a bit more painful in GPU land because we like big chunks (250 MB-1 GB) for saturating reads, but CPU land is generally fine for us.
We have been eyeing Iceberg and friends as a way to automate that, so I've been curious how much of that optimization, if any, they handle for us.
Parquet files being immutable is not a bug, it is a feature. That is how you accomplish good compression and keep the columnar data organized.
Yes, it is not useful for continuous writes and updates, but that is not what it was designed for. Use a database (e.g. SQLite, just like you suggested) if you want to ingest real-time/streaming data.
I've had great luck using either Athena or DuckDB with parquet files in S3 split across a few partitions. You can query across the partitions pretty efficiently, and if date/time is one of your partition keys, it's very efficient to add new data.
> The problem with Parquet is it’s static. Not good for use cases that involve continuous writes and updates. Although I have had good results with DuckDB and Parquet files in object storage. Fast load times.
You can use glob patterns in DuckDB to query the remote parquet files and get around this, though? Maybe break things up using a hive partitioning scheme or similar.
I like the pattern described too. The only snag is deletes and updates. In my experience, you have to delete the underlying file or create and maintain a view that exposes only the data you want visible.
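A sketch of that setup with the DuckDB Python client (the bucket, paths, and the id/dt columns here are made up):

    import duckdb

    con = duckdb.connect()
    con.sql("INSTALL httpfs")
    con.sql("LOAD httpfs")

    # Query every partition at once; hive_partitioning exposes dt=... as a column,
    # and the view filters out rows listed in a separate "deletes" parquet file.
    con.sql("""
        CREATE VIEW events AS
        SELECT *
        FROM read_parquet('s3://my-bucket/events/dt=*/*.parquet', hive_partitioning = true)
        WHERE id NOT IN (SELECT id FROM 's3://my-bucket/deletes.parquet')
    """)

    con.sql("SELECT dt, count(*) FROM events GROUP BY dt").show()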
> each user has their own dedicated SQLite file
For portability there’s Litestream.