The pitch for this sounds very similar to the pitch for Vortex (i.e. obviating t...

apavlo · 2025-10-02T01:51:39 1759369899

The backstory is complicated. The plan was to establish a consortium between CMU, Tsinghua, Meta, CWI, VoltronData, Nvidia, and SpiralDB to unify behind a single file format. But that fell through after CMU's lawyers freaked out over the NDA Meta required for access to a preview of Velox Nimble. IANAL, but Meta's NDA seemed reasonable to me. So after about a year the plan was dead, and then everyone released their own format:

→ Meta's Nimble: https://github.com/facebookincubator/nimble

→ CWI's FastLanes: https://github.com/cwida/FastLanes

→ SpiralDB's Vortex: https://vortex.dev

→ CMU + Tsinghua F3: https://github.com/future-file-format/f3

On the research side, we (CMU + Tsinghua) weren't interested in developing new encoders and instead wanted to focus on the WASM embedding part. The original idea for that came as a suggestion from Hannes@DuckDB to Wes McKinney (one of our co-authors). We just used Vortex's implementations since they were in Rust, and with some tweaks we could get most of them to compile to WASM. Vortex is orthogonal to the F3 project and has the engineering energy necessary to support it. F3 is an academic prototype right now.

I note that the Germans also released their own file format this year, and it uses WASM too. But they WASM-ify the entire file rather than individual column groups:

→ Germans: https://github.com/AnyBlox

rancar2 · 2025-10-02T03:01:36 1759374096

Andrew, it’s always great to read the background from the author on how (and even why!) this all played out. This comment is incredibly helpful for understanding why so many formats came into being.

Centigonal · 2025-10-03T15:16:32 1759504592

If I could ask you to speculate for a second, how do you think we will go from here to a clear successor to Parquet?

Will one of the new formats absorb the others' features? Will there be a format war à la Iceberg vs. Delta Lake vs. Hudi? Will there be a new consortium now that everyone's formats are out in the wild?

digdugdirk · 2025-10-02T02:57:25 1759373845

... Are you saying that there are 5 competing "universal" file format projects? Each with different, incompatible approaches? Is this a laughing/crying thing, or a "lots of interesting paths to explore" thing?

Also, back on topic - is your file format encryptable via that WASM embedding?

tomnicholas1 · 2025-10-02T14:37:08 1759415828

Thank you for the explanation! But what a mess.

I would love to bring these benefits to the multidimensional array world, via integration with the Zarr/Icechunk formats somehow (which I work on). But this fragmentation of formats makes it very hard to know where to start.