
This kind of data infrastructure is a great use case for Rust. A lot of it is memory-bound, so saving the memory overhead of GC is a huge win.

The use of Arrow to support multiple programming languages is also a great concept. Other distributed computing engines have ended up tied to the JVM (Spark, Presto, Kafka) as a way of avoiding serialization/deserialization costs when you go across a language boundary. Arrow is a really elegant solution, as long as you're willing to batch up operations.
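As a rough sketch of the batched, columnar style this implies, here's what building and scanning one Arrow column looks like with Arrow's Java vector API (class name and values are illustrative):

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.BigIntVector;

    public class ArrowBatchSketch {
        public static void main(String[] args) {
            try (BufferAllocator allocator = new RootAllocator();
                 BigIntVector values = new BigIntVector("values", allocator)) {
                // Fill one column as a single batch; the bytes land in
                // Arrow's standard columnar layout, off the Java heap.
                values.allocateNew(4);
                for (int i = 0; i < 4; i++) {
                    values.set(i, (i + 1) * 10L);
                }
                values.setValueCount(4);
                // The layout is language-neutral, so a Rust or Python
                // process can read these same buffers without ser/de --
                // the trade-off is operating on whole batches, not rows.
                long sum = 0;
                for (int i = 0; i < values.getValueCount(); i++) {
                    sum += values.get(i);
                }
                System.out.println("sum = " + sum); // 100
            }
        }
    }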

Databricks recently rebuilt Spark's execution engine in C++ in a project called "Delta Engine" to overcome the JVM limitations you're pointing out. You're right that Rust is a great way to sidestep the dreaded JVM GC pauses.


Our experience with Delta Engine has been that it's far more resource-hungry than the JVM code it replaced. It doesn't handle resource exhaustion well at all; we see frequent crashes and deadlocks when it nears full resource utilization.

I would love something more resource-efficient than Spark on the JVM, but Delta Engine isn't there yet.


At the same time, the JVM is getting better memory-tracking analysis and incremental, pauseless collectors (C4, ZGC, Shenandoah, ongoing G1 improvements).

https://blogs.oracle.com/javamagazine/understanding-the-jdks...
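For anyone wanting to try these: ZGC has been production-ready since JDK 15 and Shenandoah is included in many JDK builds, while C4 is Azul's proprietary collector. Heap sizes and the class name below are made up:

    # ZGC
    java -XX:+UseZGC -Xms64g -Xmx64g MyDataJob

    # Shenandoah
    java -XX:+UseShenandoahGC -Xms64g -Xmx64g MyDataJob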


These new GCs are amazing technology, but they primarily target pause time, whereas in data processing the primary concern is the “headroom” of extra space in your heap that lets the GC work efficiently. As a rough rule of thumb, a collector wants the heap to be 1.5-2x the live data set, which gets expensive when the live set is hundreds of gigabytes.


For those cases, large off-heap structures of arrays can make hundreds of GB of data invisible to the GC.
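As a minimal illustration of the idea with a plain direct buffer (sizes and names made up; Arrow's own allocator applies the same principle at much larger scale):

    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class OffHeapSketch {
        public static void main(String[] args) {
            // allocateDirect places the bytes outside the Java heap, so the
            // GC tracks only the small ByteBuffer handle, never the data.
            int count = 1_000_000;
            ByteBuffer buf = ByteBuffer.allocateDirect(count * Long.BYTES)
                    .order(ByteOrder.nativeOrder());
            for (int i = 0; i < count; i++) {
                buf.putLong(i * Long.BYTES, i); // absolute put; no per-element objects
            }
            System.out.println(buf.getLong(42 * Long.BYTES)); // prints 42
        }
    }

A single ByteBuffer tops out at 2 GB (int indexing), so reaching hundreds of GB means many buffers or a purpose-built allocator like Arrow's.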

One can do both.


Dremio is a query engine for data lakes that uses Apache Arrow heavily for much of its processing. The application still runs on the JVM.
