Yep; fast+safe+well-tooled common IO format for tabular data, especially columnar, for going across runtimes/languages/processes/servers with zero copy. It supports a variety of common types, including nested (ex: lists, records, ...), modes like streaming, and schemas that can be inferred+checked, so it covers a lot of classic protobuf messaging use cases.
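A quick pyarrow sketch of what that looks like, with schema inference over nested Python values and a round-trip through the streaming IPC format:

```python
import pyarrow as pa

# Schema is inferred from Python values, including nested list/struct types
table = pa.table({
    "id": [1, 2, 3],
    "tags": [["a"], ["b", "c"], []],          # inferred as list<string>
    "meta": [{"x": 1}, {"x": 2}, {"x": 3}],   # inferred as struct<x: int64>
})
print(table.schema)

# Round-trip through the streaming IPC format; readers check the schema
sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)

reader = pa.ipc.open_stream(sink.getvalue())
assert reader.read_all().equals(table)
```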
When messages look more like data, even better, as columnar representations are often better for compression & compute. So it's pretty common now for data frameworks, and increasingly, DB connectors. Ex: Our client https://github.com/graphistry/pygraphistry takes a bunch of formats, but we convert to Arrow before sending to our server, zero-copy it between server-side GPU processes, and stream it into WebGL buffers on the client, all without touching it (beyond our own data cleaning).
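Not our exact pipeline, but the generic shape of that convert-once-then-share pattern in pyarrow (file path is hypothetical; the IPC file format lets a second process memory-map the batches rather than copy them):

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"src": ["a", "b"], "dst": ["b", "c"], "w": [0.5, 1.2]})

# Ingest: one conversion at the edge, then everything downstream stays Arrow
table = pa.Table.from_pandas(df, preserve_index=False)

# Write the IPC file format so another process can memory-map it
with pa.OSFile("/tmp/edges.arrow", "wb") as f:
    with pa.ipc.new_file(f, table.schema) as writer:
        writer.write_table(table)

# Reader side: a memory-mapped open means the record batches reference the
# mapped file pages directly instead of being copied into this process's heap
with pa.memory_map("/tmp/edges.arrow", "r") as mm:
    shared = pa.ipc.open_file(mm).read_all()
```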
The main reason I'd stick to protobuf is legacy, especially awkward data schemas / highly sparse data, and I'm guessing Google probably has internal protobuf tooling that's faster+easier+better-tooled than the OSS stuff.
I'm probably way out of date, but arrow-flight's main innovation seems to be scaling reads across multiple machines, such as in dask envs.
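As I understand it, the read-scaling model is: ask a planner for flight info, get back (location, ticket) endpoints, and pull those from the data nodes in parallel. A minimal pyarrow.flight sketch, with hypothetical hosts and dataset name:

```python
import pyarrow.flight as fl

# Hypothetical planner host + dataset path; get_flight_info returns
# (location, ticket) endpoints so reads can fan out across machines
client = fl.connect("grpc://planner:8815")
info = client.get_flight_info(fl.FlightDescriptor.for_path("my-dataset"))

tables = []
for endpoint in info.endpoints:
    # Each endpoint can point at a different data node; read serially here
    # for brevity, where a dask-style setup would pull tickets in parallel
    data_client = fl.connect(endpoint.locations[0])
    tables.append(data_client.do_get(endpoint.ticket).read_all())
```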
For server->client, arrow's record batch representation is useful for streaming binary blobs without having to touch them, but for the surrounding protocol, we care way more about http/browser APIs and how the protocol works, vs the envs + problems flight was built for.

So we built some tuned layers for pushing arrow record batches through GPU -> CPU -> network -> browser. The NodeJS ecosystem is surprisingly good for streaming buffers, so it's a few libs from there, careful protocol use to ensure zero-copy on the browser side, and tricks around compression.

Flight, at least at the time, was more in the way than a help for all that, and all arrow sponsor money goes to ursa labs, so we never ended up having the resources to OSS that part, just the core Arrow JS libs. Our commercial side is starting to launch certain cloud layers that may make it more aligned to revisit & OSS those pieces, but for now, it's still low on the priority list for what our users care about. Our protocol here is about to get upgraded again, but not fundamentally wrt the network/client parts.
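Our actual layers are NodeJS, but in pyarrow terms the record-batch streaming core looks roughly like this (a sketch only: send() is a stand-in for the real transport, and buffering the whole encoded stream in a BytesIO is a simplification):

```python
import io
import pyarrow as pa

def stream_batches(table, send, max_chunksize=65536):
    """Push an Arrow table over some transport one IPC message at a time.

    The browser side (e.g. the Arrow JS RecordBatchReader) can consume the
    chunks incrementally and hand the buffers onward without touching them.
    """
    buf = io.BytesIO()
    writer = pa.ipc.new_stream(buf, table.schema)
    sent = 0

    def drain():  # send whatever the writer has encoded since the last call
        nonlocal sent
        data = buf.getvalue()
        if len(data) > sent:
            send(data[sent:])
            sent = len(data)

    drain()  # schema message (and anything else already encoded) goes first
    for batch in table.to_batches(max_chunksize=max_chunksize):
        writer.write_batch(batch)
        drain()  # each batch goes out as soon as it's encoded
    writer.close()  # writes the end-of-stream marker
    drain()
```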
Interestingly, for the server case... despite increasingly using dask etc. on the server, we don't explicitly use arrow-flight, and I don't believe our next moves here will either. I recorded a talk for Nvidia GTC about our early experiments on the push to doing 90+ GB/s per server, and while parquet/arrow is good here, flight still seems more of a distraction: https://pavilion.io/nvidia (with more resources, I think it can be aligned)
OK thanks for the insight. I haven't been tracking arrow-flight recently, but it does seem more geared to an RPC use case than one-way streaming.
I've been playing with pushing arrow batches through (db -> server-side turbodbc -> network -> browser -> webworker using transferable support), leveraging a custom msgpack protocol over websockets, since most databases don't support connectivity from a browser. Not dealing with big data by any means (a million records max or so), and so far it's been fairly performant.
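Server side, it's roughly this shape (a simplified sketch, not my exact code: DSN and query are made up, and the single ws.send stands in for the msgpack framing):

```python
import asyncio
import pyarrow as pa
import websockets
from turbodbc import connect

async def handler(ws):  # websockets >= 10 style: single-arg handler
    cursor = connect(dsn="analytics").cursor()  # hypothetical DSN
    cursor.execute("SELECT * FROM events")      # hypothetical query
    table = cursor.fetchallarrow()  # turbodbc hands back a pyarrow.Table

    # Encode as one Arrow IPC stream; a real protocol would frame per-batch
    # (e.g. a msgpack envelope) so the webworker can transfer buffers
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table, max_chunksize=65536)
    await ws.send(sink.getvalue().to_pybytes())

async def main():
    async with websockets.serve(handler, "localhost", 8765):
        await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(main())
```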
yep that's pretty close (and cool on turbodbc, we normally don't get control of that part!)
for saturating the network, afaict, it gets more into tuning around browser support for parallel streams, native compression, etc. we've actually been backing off browser opts b/c of unreliability across users, and instead focus more on what we can do on the server.