Delta Lake solves a lot of the Parquet limitations mentioned in this post. Disclosure: I work on the Delta Lake project.
Parquet files store metadata about row groups in the file footer. Delta Lake adds file-level metadata in the transaction log. So Delta Lake can perform file-level skipping before even opening any of the Parquet files to get the row-group metadata.
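For example, a filtered read only has to open the files whose min/max stats in the log can match the predicate. A rough PySpark sketch, where the table path and the event_time column are just placeholders:

    from pyspark.sql import SparkSession

    # A Spark session with the Delta Lake extensions enabled.
    spark = (
        SparkSession.builder
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Delta checks the per-file min/max stats in the transaction log,
    # so files that cannot contain matching rows are never opened.
    df = (
        spark.read.format("delta")
        .load("/data/events")
        .where("event_time >= '2023-01-01'")
    )
    df.show()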
Delta Lake allows you to rearrange your data to improve file-skipping. You can Z Order by timestamp for time-series analyses.
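A rough sketch with the delta-spark Python API (same placeholder path and column, reusing the Spark session from above):

    from delta.tables import DeltaTable

    # Rewrite the data files so rows with nearby timestamps land in the
    # same files, which tightens the per-file min/max stats used for skipping.
    (DeltaTable.forPath(spark, "/data/events")
        .optimize()
        .executeZOrderBy("event_time"))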
Delta Lake also allows for schema evolution, so you can evolve the schema of your table over time.
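For example, appending a DataFrame that has a new column can evolve the table schema instead of failing on the mismatch (sketch; new_df and the path are placeholders):

    # mergeSchema adds any new columns from new_df to the table schema.
    (new_df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .save("/data/events"))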
This company may have a cool file format, but is it closed source? It seems like enterprises don't want to be locked into closed formats anymore.
Wow! I've been reading about Delta Lake for a while and I'm interested in the company. Is there any chance to send a CV for remote work? (I'm from Spain.)
Schema evolution is something that came up in a water cooler conversation on my team just the other day.
This reminds me of HDF5: even though the data is written/appended in row format, there is an API to chunk the data, organize it into columns, compress based on the regularities within each column, and write it to storage.
On reading, the reverse happens.
This becomes the compute/space conundrum: space is reduced by exploiting column-based regularity, but time is increased by the extra overhead of columnar compression.
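Something like this h5py sketch, with one chunked, compressed dataset per column (file name, dataset names, and sizes are made up):

    import numpy as np
    import h5py

    with h5py.File("timeseries.h5", "w") as f:
        # Each column gets its own chunked dataset, so the compressor can
        # exploit the regularity within that column.
        f.create_dataset("timestamp", data=np.arange(1_000_000, dtype="i8"),
                         chunks=(65536,), compression="gzip", shuffle=True)
        f.create_dataset("value", data=np.random.rand(1_000_000),
                         chunks=(65536,), compression="gzip", shuffle=True)

    # Reading decompresses chunk by chunk - the reverse of the write path.
    with h5py.File("timeseries.h5", "r") as f:
        first_values = f["value"][:10]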
I find that Parquet works great for time series data in general. We use it a bunch for datasets of a few million rows for each of 10-1000 devices, with typically around 10 columns per device.
What I am missing is good support for delta encoding, which often gives the best compression for time series data. The format specifies delta encoding for all integer types; however, I find it poorly supported in most Python-friendly libraries (fastparquet/polars/duckdb).
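For what it's worth, pyarrow does let you request it explicitly on write. A minimal sketch (column names are placeholders; as far as I know, dictionary encoding has to be disabled when you pass explicit column encodings):

    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "timestamp": pa.array(np.arange(1_000_000), type=pa.int64()),
        "value": np.random.rand(1_000_000),
    })

    # Request DELTA_BINARY_PACKED for the integer timestamp column;
    # the other columns fall back to PLAIN.
    pq.write_table(
        table,
        "timeseries.parquet",
        use_dictionary=False,
        column_encoding={"timestamp": "DELTA_BINARY_PACKED"},
    )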