Databricks recently rebuilt Spark in C++ in a project called "Delta Engine" to o...

kenhwang · on Jan 18, 2021

Our experience with Delta Engine has been that it's way more resource-hungry than the JVM code it replaced. It doesn't handle resource exhaustion well at all: lots of crashes and deadlocks when nearing full resource utilization.

I would love to have something more resource efficient than Spark on JVM, but Delta Engine isn't there yet.

sitkack · on Jan 18, 2021

At the same time, the JVM is getting better memory tracking and analysis, plus incremental, low-pause collectors (C4, ZGC, Shenandoah, G1 improvements).

https://blogs.oracle.com/javamagazine/understanding-the-jdks...

georgewfraser · on Jan 19, 2021

These new GCs are amazing technology, but they primarily target pause time, whereas in data processing the primary concern is the “headroom” of extra space in your heap to allow the GC to work efficiently.

sitkack · on Jan 19, 2021

For those cases, large off-heap structures of arrays can make hundreds of GB of data invisible to the GC.

One can do both.
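
To make the off-heap idea concrete, here is a rough sketch of a structure-of-arrays layout backed by direct buffers. The class name `OffHeapColumns` and its two columns are made up for illustration and are not taken from Spark or Delta Engine. It relies only on `java.nio.ByteBuffer.allocateDirect`, so the GC sees two small wrapper objects rather than the row data itself.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical structure-of-arrays store: one off-heap buffer per column. */
final class OffHeapColumns {
    private final ByteBuffer ids;     // column of longs, 8 bytes per row
    private final ByteBuffer values;  // column of doubles, 8 bytes per row

    OffHeapColumns(int capacity) {
        // allocateDirect reserves native memory outside the Java heap.
        // multiplyExact guards against int overflow for large capacities.
        this.ids = ByteBuffer
                .allocateDirect(Math.multiplyExact(capacity, Long.BYTES))
                .order(ByteOrder.nativeOrder());
        this.values = ByteBuffer
                .allocateDirect(Math.multiplyExact(capacity, Double.BYTES))
                .order(ByteOrder.nativeOrder());
    }

    /** Writes one row using absolute (index-based) puts. */
    void set(int row, long id, double value) {
        ids.putLong(row * Long.BYTES, id);
        values.putDouble(row * Double.BYTES, value);
    }

    long id(int row) {
        return ids.getLong(row * Long.BYTES);
    }

    double value(int row) {
        return values.getDouble(row * Double.BYTES);
    }

    /** Scans a single column sequentially without allocating any objects. */
    double sumValues(int rows) {
        double sum = 0.0;
        for (int i = 0; i < rows; i++) {
            sum += values.getDouble(i * Double.BYTES);
        }
        return sum;
    }
}
```

A single `ByteBuffer` tops out at about 2 GB because it is indexed by `int`, so reaching hundreds of GB would mean many such buffers, memory-mapped files, or a newer foreign-memory API. Direct memory is also capped by the JVM (see `-XX:MaxDirectMemorySize`), so this moves the sizing problem rather than removing it.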