Disclosure: I am a contributor to DataFusion. I have done a lot of work in the E...

eduren · on Aug 25, 2021

Oh hey, thanks for the info!

I spent some time evaluating Arc for my team's ETL purposes and I was really impressed. I hesitated somewhat to move forward with it because it seemed really tied into the Spark ecosystem (for great reasons). We just weren't at all familiar with deploying and operating Spark, so we ended up rolling our own scripts on top of an (existing) Airflow cluster for now.

Besides performance reasons, are there any other advantages to porting Arc to run on top of DataFusion? If the porting effort was shared somewhere I'd love to dig in and see what the proof-of-concept looks like.

seddonm1 · on Aug 26, 2021

Hi eduren. Give me a few days and I'll see what I can publish as a WIP repo. The aim of Arc was always to allow swapping the execution engine whilst retaining the logic (hence SQL), so this should hopefully be easy.

FridgeSeal · on Aug 25, 2021

Rust stuff tends to be a bit more resource efficient than Java.

Currently using DataFusion from Rust, and being more resource efficient means we can use smaller machines, which means our costs go down. Deploying services is also faster (smaller docker images, faster startup times) and puts less extraneous load on our machines.

I imagine Arc, and thus downstream users, would see similar benefits.