
This kind of data infrastructure is a great use case for Rust. A lot of it is memory-bound, so saving the memory overhead of GC is a huge win.

The use of Arrow to support multiple programming languages is also a great concept. Other distributed computing engines have ended up tied to the JVM (Spark, Presto, Kafka) as a way of avoiding serialization/deserialization costs when you go across a language boundary. Arrow is a really elegant solution, as long as you're willing to batch up operations.
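As a rough sketch of the batched, columnar style this implies, here's what building and scanning one Arrow column looks like with Arrow's Java vector API (class name and values are illustrative):

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.BigIntVector;

    public class ArrowBatchSketch {
        public static void main(String[] args) {
            try (BufferAllocator allocator = new RootAllocator();
                 BigIntVector values = new BigIntVector("values", allocator)) {
                // Fill one column as a single batch; the bytes land in
                // Arrow's standard columnar layout, off the Java heap.
                values.allocateNew(4);
                for (int i = 0; i < 4; i++) {
                    values.set(i, (i + 1) * 10L);
                }
                values.setValueCount(4);
                // The layout is language-neutral, so a Rust or Python
                // process can read these same buffers without ser/de --
                // the trade-off is operating on whole batches, not rows.
                long sum = 0;
                for (int i = 0; i < values.getValueCount(); i++) {
                    sum += values.get(i);
                }
                System.out.println("sum = " + sum); // 100
            }
        }
    }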

Databricks recently rebuilt Spark's execution engine in C++ in a project called "Delta Engine" to overcome the JVM limitations you're pointing out. You're right that Rust is a great way to sidestep the dreaded JVM GC pauses.


Our experience with Delta Engine has been that it's far more resource-hungry than the JVM code it replaced. It doesn't handle resource exhaustion well at all; we see frequent crashes and deadlocks when it nears full resource utilization.

I would love something more resource-efficient than Spark on the JVM, but Delta Engine isn't there yet.


At the same time, the JVM is getting better memory-tracking analysis and incremental, pauseless collectors (C4, ZGC, Shenandoah, ongoing G1 improvements).

https://blogs.oracle.com/javamagazine/understanding-the-jdks...
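For anyone wanting to try these: ZGC has been production-ready since JDK 15 and Shenandoah is included in many JDK builds, while C4 is Azul's proprietary collector. Heap sizes and the class name below are made up:

    # ZGC
    java -XX:+UseZGC -Xms64g -Xmx64g MyDataJob

    # Shenandoah
    java -XX:+UseShenandoahGC -Xms64g -Xmx64g MyDataJob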


These new GCs are amazing technology, but they primarily target pause time, whereas in data processing the primary concern is the “headroom” of extra space in your heap that lets the GC work efficiently. As a rough rule of thumb, a collector wants the heap to be 1.5-2x the live data set, which gets expensive when the live set is hundreds of gigabytes.


For those cases, large off-heap structures of arrays can make hundreds of GB of data invisible to the GC.
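As a minimal illustration of the idea with a plain direct buffer (sizes and names made up; Arrow's own allocator applies the same principle at much larger scale):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class OffHeapSketch {
        public static void main(String[] args) {
            // allocateDirect places the bytes outside the Java heap, so the
            // GC tracks only the small ByteBuffer handle, never the data.
            int count = 1_000_000;
            ByteBuffer buf = ByteBuffer.allocateDirect(count * Long.BYTES)
                    .order(ByteOrder.nativeOrder());
            for (int i = 0; i < count; i++) {
                buf.putLong(i * Long.BYTES, i); // absolute put; no per-element objects
            }
            System.out.println(buf.getLong(42 * Long.BYTES)); // prints 42
        }
    }

A single ByteBuffer tops out at 2 GB (int indexing), so reaching hundreds of GB means many buffers or a purpose-built allocator like Arrow's.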

One can do both.


Dremio is a query engine for data lakes that uses Apache Arrow heavily for much of its processing. The application still runs on the JVM.
