
> Why not just use traditional DBMS like PostgreSQL

Serialization. It usually accounts for a non-trivial share of the time spent in distributed systems, and for some workloads it's the bulk of it.

If I want to grab 3GB of data from a remote host and process it locally, we have to agree on how that data is going to be transferred so I can use it. It could be SQL, so we have some sort of network-based tabular data stream. Maybe it's Parquet files, so we're using NFS/S3 to copy the files to local disk before reading them into a completely separate in-memory data structure. At the end of this workload, I have 1GB of data that I now want to write back. Maybe the data is stored in memory as an array of mixed-type structs, but I can't just send those bytes as-is to a SQL server, or mmap them to the filesystem and expect Parquet to know what they mean.

Apache Arrow and DataFusion aim to eliminate all that work of rewriting bytes between hosts. Imagine being able to create a cost-based optimized query plan on Host A, send it to Hosts M-P for processing, and even have that query plan trickle down to Parquet predicates when reading files from disk, before returning data to Host A, where it can simply be copied from the network into local memory and worked with right away.


