Ballista: Distributed compute platform implemented in Rust using Apache Arrow

habitue · on Jan 18, 2021

It saddens me a little bit that a nascent protocol like arrow flight is using grpc + protobufs behind the scenes with some hacks on top to skip the cost of deserializing protobufs. It seems like a really common belief that protobufs have so much engineering time behind them and are cross-language that it's a no brainer to implement your new protocol on top of them.

In reality, all the engineering and optimization time is behind the implementations for the google internal languages, and even the python protobuf implementation is pretty bad.

Protobuf makes some stunningly bad decisions like using varints, etc that you shouldn't make the immediate assumption "google has tons of great engineers, google uses protobuf for everything internally, therefore, protobuf is a good foundation to build my new thing on top of"

In reality, path dependence and the (amazing) internal tooling ecosystem at google both play a huge part of why they use protobuf so extensively.

(Grpc is a little overly complicated to be a universal recommendation, but I could believe it's a good choice for Arrow Flight. But it seems like they didn't do grpc + arrow or grpc + flatbuffer + arrow in the hopes that "dumb" grpc + protobuf implementations would be able to still benefit. In my opinion, grpc implementations are so coupled, there's no reason to make this unnecessary concession to protobufs)

jfim · on Jan 18, 2021

I'm not sure I'd call varints a stunningly bad decision. It seems more like the kind of tradeoff one would make if storage and network transfer costs are considered to be more important than the serialization/deserialization speed.

That being said, the fact that that particular tradeoff was considered to be good for Google doesn't mean it actually is, or that it's applicable to one's application.

xiphias2 · on Jan 18, 2021

I guess it was good before compression. Nowdays working with protobufs is painful inside Google as well, but at least it's supported by everything.

Most of what the CPUs at Google are doing is just copying fields from one protobuffer to another.

cogman10 · on Jan 18, 2021

I think that's a fair description of pretty much every webapp :D

Most of them are doing nothing more than copying data from the database into an http stream.

ithkuil · on Jan 18, 2021

Well, all what CPUs do is copying memory from one memory location to another, perform some arithmetic and do some conditional jumps :-)

xiphias2 · on Jan 18, 2021

Most of the arithmetic is checking optional fields if they are empty or not before copying :)

cogman10 · on Jan 18, 2021

pop pop jump jump oh what a crunch it is.

(sung to this tune https://www.youtube.com/watch?v=iENQXIQ8wH0 )

Side note: It really is incredible what happened in the early days of computing when memory and computation were limited. How much care was taken in the precise layout of memory or even the timing of a calculation was insane.

munk-a · on Jan 18, 2021

Don't forget the validation!

Webapps are dumb middleware that pipes data from the database into an http stream - but it needs to determine which database calls to invoke and sanitize all the incoming junk.

lmeyerov · on Jan 19, 2021

We use arrow to do the validation 'for free': it has a typed columnar schema, so by passing the schema, we get the recordset validation without inspecting values.

we still need to do semantic validation, but w/ arrow, now we do that in bulk and on gpus :)

xiphias2 · on Jan 18, 2021

Sure, but when you are using C++, protobufs are not the best way to store data...but I guess it could be worse.

philsnow · on Jan 18, 2021

At least when I was there, proto (de)serialization consumed the plurality of cpu cycles, but not the majority.

It isn't really the majority these days, is it?

habitue · on Jan 18, 2021

You're right, it's probably fair to say it's a stunningly bad tradeoff for most applications most of the time, given we have fast compression like snappy & brotli available now and cloud costs are heavily weighted towards CPU costing more than storage & network transfer

lmeyerov · on Jan 19, 2021

We directly stream arrow across devices, skipping grpc indirection / confusion, and have done that since near the beginning of arrow. Others do too, works great :)

Flight is interesting to us bc they're thinking through parallel i/o. I'm guessing grpc overhead there is more tolerable, though yeah, I'd be curious. I had similar initial reservations, esp as we were happy to remove slow google protobuf stack stuff as part of our switch to arrow :)

sitkack · on Jan 18, 2021

What you describe is truly an https://github.com/TimelyDataflow/abomonation

ampdepolymerase · on Jan 18, 2021

What are some production ready alternatives to gRPC that have both a pleasant developer experience and great performance? Apache Avro? Apache Thrift?

e12e · on Jan 18, 2021

Cap'n'proto?

https://capnproto.org/

habitue · on Jan 18, 2021

I think for this kind of high performance stuff, grpc is a reasonable choice. For ergonomics though, http + json is fine for many/most applications and there is a lot more widely available tooling for it than there is for grpc.

It's very possible that will change over time

(My implicit assumption here is that a project like Arrow Flight wants a cross-language, widely used foundation for their protocol, and there's not a ton of things that fit that bill. But depending on your application's needs, implementing a language-specific rpc system is perfectly acceptable, and may have even better ergonomics. Rust and Python both have a plethora of mono-lingual rpc frameworks)

jeffnappi · on Jan 18, 2021

GraphQL is one.

Here's an article making the argument for it https://blog.spaceuptech.com/posts/why-we-moved-from-grpc-to...

e12e · on Jan 18, 2021

I think that if you honestly consider GraphQL a better fit than gRPC, you probably should never have considered gRPC to begin with...

And much as we're considering GraphQL for some services as work... I'm not sure I buy it as an RPC framework. I suppose it has about the same appeal as SOAP for that purpose.

jeffnappi · on Jan 19, 2021

Agreed - they both have their place.

hilbertseries · on Jan 18, 2021

gRPC has a pleasant developer experience? This is news to me.

riku_iki · on Jan 18, 2021

Maybe it is relatively pleasant when compared to alternatives.

ampdepolymerase · on Jan 18, 2021

It doesn't.

georgewfraser · on Jan 18, 2021

This kind of data infrastructure is a great use case for Rust. A lot of data infrastructure is memory-bound, so saving the memory overhead of GC is a huge win.

The use of Arrow to support multiple programming languages is also a great concept. Other distributed computing engines have ended up tied to the JVM (Spark, Presto, Kafka) as a way of avoiding serialization/deserialization costs when you go across a language boundary. Arrow is a really elegant solution, as long as you're willing to batch up operations.

MrPowers · on Jan 18, 2021

Databricks recently rebuilt Spark in C++ in a project called "Delta Engine" to overcome the JVM limitations you're pointing out. You're right, Rust is a great way to sidestep the dreaded JVM GC pauses.

kenhwang · on Jan 18, 2021

Our experience with Delta Engine has been that it's way more resource hungry than the JVM code it replaced. It doesn't handle resource exhaustion well at all; lots of crashing and deadlock when nearing full resource utilization.

I would love to have something more resource efficient than Spark on JVM, but Delta Engine isn't there yet.

sitkack · on Jan 18, 2021

At the same time the JVM is getting better memory tracking analysis and incremental pauseless collectors (C4, ZGZ, Shenandoah, G1 improvements)

https://blogs.oracle.com/javamagazine/understanding-the-jdks...

georgewfraser · on Jan 19, 2021

These new GCs are amazing technology, but they primarily target pause time, whereas in data processing the primary concern is the “headroom” of extra space in your heap to allow the GC to work efficiently.

sitkack · on Jan 19, 2021

For those cases large off heap structures of arrays can make hundreds of GB of data invisible to the GC.

One can do both.

lavender1a · on Jan 19, 2021

Dremio is a query engine for data lake that uses Apache Arrow heavily for a lot of processing. The application still runs on JVM.

speps · on Jan 18, 2021

> using Apache Arrow MEMORY MODEL

Probably got cut because of maximum title length but important nonetheless.

davesque · on Jan 18, 2021

Not sure what that omitted portion would have clarified. Apache Arrow, at its core, is a memory model spec. Also, it appears that this project is using the official Rust library which is developed in the same repo as all the other language implementations. So the simple statement that they're using Apache Arrow seems adequate.

eb0la · on Jan 18, 2021

Author is the same guy that wrote Arrow Rust library ;-)

davesque · on Jan 18, 2021

So he is, hah. Well there you go :).

superbcarrot · on Jan 18, 2021

It has Rust in the title, that will be enough.

DSingularity · on Jan 18, 2021

secondcoming · on Jan 18, 2021

Hopefully this is the beginning of the end for JVM use in data-centric applications like this. I'm not particularly bothered if it's Rust or C++

joshlk · on Jan 19, 2021

I still don't understand why lots of new databases and data applications (Neo4J, Cassandra, Hadoop ...) are written in Java when it is well known that Java is not well suited to these types to tasks: GC freezes, bad memory model for large data, not fully compiled as so much slower when compared to C++, C or Rust. Why choose Java for these applications?

pjmlp · on Jan 19, 2021

Productivity, unmatched tooling for cluster monitoring, memory safe by default, value types are coming.

And even without value types, there are off heap allocations, also if "Python" libraries can be actually written in C, so can Java ones, without loosing the plus of the ecosystem.

joshlk · on Jan 19, 2021

I think those are ok trade offs for a lot of projects but not for high performance and low latency applications such as databases. It seems to me that those project have shot themselves in the foot before even starting.

pjmlp · on Jan 19, 2021

I doubt Hadoop would have been as successful if written in C or C++.

While the database engine of most RDMS servers is written in C and C++, anc it won't change given the history behind the code, the bulk of the code is written in some form of managed SQL, with IBM, Oracle and Microsoft also allowing for Java and .NET code.

Finally, https://www.efinancialcareers.co.uk/news/2020/11/low-latency...

Once upon a time, anyone knew that high performance systems were naturally only viable in Assembly unless proven otherwise.

Or how C++ wasn't ever to be a thing in game development, C was the king of console SDKs, after years trying to take Assembly's place on 16 bit platforms.

secondcoming · on Jan 19, 2021

Well, to quote TFA:

> Essentially, we use a contrived form of Java that avoids all the Java constructs that make things go slow. We only use the constructs that are fast and efficient, and we avoid all the garbage

> The only problem with low latency Java is that most experience Java programmers struggle with the new paradigm. "A lot of people who program in Java are used to working in an environment where latency isn't a criteria," says Lawrey.

So, the best Java developers would be former C/C++ developers. That's hardly a ringing endorsement for the language. Look at LMAX's Disruptor, for example. It's hardly Java since it gets its performance from use of sun.misc.UnSafe.

Java does give you quite good IDEs though. That's about it.

pjmlp · on Jan 19, 2021

35 years ago,

So, the best C developers would be former Assembly developers. That's hardly a ringing endorsement for the language. Look at XYZ game, for example. It's hardly C since it gets its performance from use of inline Assembly.

C does give you quite good shell utilities though. That's about it.

secondcoming · on Jan 19, 2021

We switched from Cassandra (Java) to Scylla (C++) and saw immediate improvements in both query latency and machine count.

pjmlp · on Jan 19, 2021

Except value types will put an end to the anti-JVM dream.

C++ is never going to be safe by default, unless ISO is willing to do a Python 3.

Rust still needs to improve its usability story against everything that is available on the JVM, and compile times, oh boy.

eb0la · on Jan 18, 2021

Blog post that started the project. Worth reading. https://andygrove.io/2019/07/announcing-ballista/

andygrove · on Jan 18, 2021

There is also a more recent blog post which perhaps led to the project being posted here (I am guessing).

https://andygrove.io/2021/01/ballista-2021/

Kinrany · on Jan 19, 2021

I saw the project in github.com/trending

pjmlp · on Jan 19, 2021

> The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.

While such projects are welcomed, when value types finally land, the argumentation would need to be upgraded.

And as someone that has to track down memory issues in C and C++ projects, not having a GC doesn't mean the memory usage is deterministic.

mkevac · on Jan 19, 2021

What is a distributed compute platform? It's the one thing they are not explaining :-(

lumost · on Jan 18, 2021

I think it's telling on the state of rust in 2021 that this project can't compile fully for the latest rust versions. Maintaining these types of frameworks in the early days is an intensive and often thankless job, having your language leave you behind is a near guaranteed way to kill off your project, not too mention introduce the obvious "I tried using library X and hit compilation issue Y type issues".

I'm curious to see how this evolves as there are a number of motivated folks working on similar efforts such as Vega. I for one would welcome a mature rust based distributed compute platform.

jasonpbecker · on Jan 18, 2021

> With the release of Apache Arrow 3.0.0 there are many breaking changes in the Rust implementation and as a result it has been necessary to comment out much of the code in this repository and gradually get each Rust module working again with the 3.0.0 release.

This appears to be an issue with Arrows implementation hitting a new major version and the Rust libraries not yet being compatible with the newest versions of Arrow. That's not something specific to the Rust ecosystem. It's not like a new version of Rust broke this project.

But even if it had, maintenance is always hard and the health of a project is better measured by how long it takes to be working with new, major, stable versions after widespread community adoption of those new, major, stable versions.

I don't know if Arrow 3.0 is the most commonly used implementation-- it may not have even reached that milestone.

nevi-me · on Jan 18, 2021

Arrow 3.0 will be released likely in the next week. The Rust implementation has a lot of changes because we've had to make public-facing changes, mostly for performance benefits.

jasonpbecker · on Jan 18, 2021

Thanks! This is helpful context, and supports my notion.

andygrove · on Jan 18, 2021

The project uses stable Rust. Which version are you trying to compile with?

frankmcsherry · on Jan 18, 2021

I think maybe they were confused by this text (which I agree has nothing to do with Rust itself breaking):

> With the release of Apache Arrow 3.0.0 there are many breaking changes in the Rust implementation and as a result it has been necessary to comment out much of the code in this repository and gradually get each Rust module working again with the 3.0.0 release.

andygrove · on Jan 18, 2021

Ah, yes, that makes sense. I can see how this could have been misread.

lumost · on Jan 19, 2021

aye - it ultimately was my misreading of the commit history. Agree that this wasn't a rust specific change.