This kind of data infrastructure is a great use case for Rust. A lot of data infrastructure is memory-bound, so saving the memory overhead of GC is a huge win.
The use of Arrow to support multiple programming languages is also a great concept. Other distributed computing engines have ended up tied to the JVM (Spark, Presto, Kafka) as a way of avoiding serialization/deserialization costs when you go across a language boundary. Arrow is a really elegant solution, as long as you're willing to batch up operations.
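To give a sense of what "batching up operations" looks like in practice, here's a minimal sketch using the Rust `arrow` crate (the schema and values are invented for illustration): each column is one contiguous buffer, and a RecordBatch of many rows is the unit that crosses a language boundary, so the fixed per-handoff cost is amortized over the whole batch.

```rust
use std::sync::Arc;

use arrow::array::{Float64Array, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    // Hypothetical two-column schema, just for illustration.
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("value", DataType::Float64, false),
    ]));

    // Each Arrow array is a contiguous, typed buffer.
    let ids = Int64Array::from(vec![1, 2, 3]);
    let values = Float64Array::from(vec![0.5, 1.5, 2.5]);

    // The RecordBatch is the unit you "batch up": many rows,
    // one cross-language handoff, zero per-row serialization.
    let batch = RecordBatch::try_new(schema, vec![Arc::new(ids), Arc::new(values)])?;
    println!("rows = {}", batch.num_rows());
    Ok(())
}
```

Because the buffers follow the standard Arrow memory layout, the same batch can be exposed to Python or C++ through the Arrow C data interface without copying, which is exactly where the serialization savings come from.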
Databricks recently rebuilt Spark's execution engine in C++ in a project called "Delta Engine" to overcome the JVM limitations you're pointing out. You're right that Rust is a great way to sidestep the dreaded JVM GC pauses.
Our experience with Delta Engine has been that it's far more resource-hungry than the JVM code it replaced. It doesn't handle resource exhaustion well at all: lots of crashing and deadlocking when it nears full resource utilization.
I would love to have something more resource efficient than Spark on JVM, but Delta Engine isn't there yet.
These new GCs are amazing technology, but they primarily target pause time, whereas in data processing the primary concern is throughput, which depends on the “headroom” of spare heap space the GC needs to work efficiently: the less headroom you leave above the live set, the more CPU the collector burns per byte you allocate.
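To make the headroom point concrete, here's a back-of-the-envelope model (my own sketch, not a benchmark, with made-up numbers): a copying collector does work proportional to the live set on each collection, and collections fire roughly once per `heap - live` bytes allocated, so GC work per allocated byte grows sharply as headroom shrinks.

```rust
/// Rough model of copying-GC cost: each collection copies the live set,
/// and a collection happens about once per (heap - live) bytes allocated,
/// so work per allocated byte ~ live / (heap - live). A sketch of the
/// standard approximation, not a measurement of any particular GC.
fn gc_work_per_allocated_byte(heap_gb: f64, live_gb: f64) -> f64 {
    live_gb / (heap_gb - live_gb)
}

fn main() {
    // Hypothetical 8 GB live set; watch the cost fall as headroom grows.
    for heap in [12.0, 16.0, 24.0, 48.0] {
        println!(
            "heap {:>2} GB -> {:.2} bytes copied per byte allocated",
            heap,
            gc_work_per_allocated_byte(heap, 8.0)
        );
    }
}
```

Under this model, shrinking the heap from 48 GB to 12 GB (same 8 GB of live data) quadruples the GC work per allocation, which is why data-processing jobs that pack the heap tight spend so much of their CPU in the collector.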