Yangqing here (caffe2 and ONNX). We did use protobuf, and we even had an extensive discussion about its versions, based on our experience with Caffe and Caffe2 deployments. Here is a snippet from the codebase:
// Note [Protobuf compatibility]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// Based on experience working with downstream vendors, we generally can't
// assume recent versions of protobufs. This means that we do not use any
// protobuf features that are only available in proto3.
//
// Here are the most notable contortions we have to carry out to work around
// these limitations:
//
// - No 'map' (added protobuf 3.0). We instead represent mappings as lists
// of key-value pairs, where order does not matter and duplicates
// are not allowed.
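Concretely, the workaround looks roughly like this in a proto2 schema (message and field names below are illustrative, not the exact ONNX definitions):

  syntax = "proto2";

  // Where protobuf >= 3.0 would let us write
  //   map<string, string> metadata = 5;
  // we spell the mapping out as a repeated key-value message instead.
  message StringStringEntry {
    optional string key = 1;
    optional string value = 2;
  }

  message Model {
    // Order does not matter and duplicate keys are not allowed.
    repeated StringStringEntry metadata = 5;
  }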
Hi there, I work on the protobuf team at Google. A friendly note that protobuf 3.0 and proto3 are actually two separate things (this is admittedly confusing). Protobuf 3.0 was a release that added support for proto3 schemas, but Protobuf 3.0 continues to support proto2 schemas.
proto3 was designed to be a simplification of proto2 schemas. It removed several features such as extensions and field presence. Each .proto file specifies whether it is a proto2 or proto3 schema.
Protobuf 3.0 added several features such as maps, but these features were added to both proto2 and proto3 schemas.
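To make the distinction concrete, here is a rough sketch of the two schema flavors (made-up file and message names, assuming protoc 3.0 or newer):

  // attrs_v2.proto -- a proto2 schema
  syntax = "proto2";

  message AttributesV2 {
    optional string name = 1;          // explicit presence labels (proto2 feature)
    map<string, string> metadata = 2;  // maps work here too with protoc >= 3.0
  }

  // attrs_v3.proto -- a proto3 schema (separate file)
  syntax = "proto3";

  message AttributesV3 {
    string name = 1;                   // no optional/required labels
    map<string, string> metadata = 2;
  }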
Thanks so much @haberman! Yep, the whole thing is a little bit confusing... We basically focused on two things:
- not using extensions, as proto3 does not support them
- not using map, as protobuf 2.x does not support it
and we basically landed on restricting ourselves to the common denominator among all post-2.5 versions. Wondering if this sounds reasonable to you - always great to hear the original author's advice.
Plus, I am wondering if there are recommended ways of reducing protobuf runtime size. We use protobuf-lite, but any further wins would definitely be nice for memory- and space-constrained deployments.
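For reference, by protobuf-lite I mean selecting the lite generated code per file with optimize_for, roughly like this sketch:

  syntax = "proto2";
  option optimize_for = LITE_RUNTIME;  // lite generated classes, smaller runtime

  message Tensor {
    optional string name = 1;
    repeated int64 dims = 2;
  }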
Hmm, I'm not sure I quite get your strategy. If you want to support back to Protobuf 2.5, then you'll need to use proto2 schemas (https://developers.google.com/protocol-buffers/docs/proto). Proto2 schemas will support extensions forever (even after Protobuf 3.0), so there's no need to avoid extensions.
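For instance, a proto2 schema can declare an extension range and let downstream vendors add their own fields, and this works across all the library versions you mentioned (names below are just illustrative):

  syntax = "proto2";

  message ModelProto {
    optional string name = 1;
    // Field numbers reserved for third-party extension fields.
    extensions 1000 to 1999;
  }

  // A vendor can attach fields without editing the base schema.
  extend ModelProto {
    optional string vendor_annotation = 1000;
  }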
We have been working on some optimizations to help the linker strip generated code where possible. I recommend compiling with -ffunction-sections/-fdata-sections if you aren't already, and --gc-sections on your linker invocation.
So what we do is keep syntax=proto2, but allow users to compile with both protobuf 2.x and protobuf 3.x libraries; the minimum we need is 2.6.1. We kind of feel that this gives maximum flexibility for people who have already chosen a protobuf library version.
Makes sense! All I'm saying is that there is no need to avoid using extensions if that is your strategy. Extensions will work in all versions of the library you wish to support.
Pretty cool, and thanks for that reply! Did you look at something like Arrow/Feather, which is looking to get adopted as the interoperable format in R/Pandas ... and maybe even Spark?
There's been quite a bit of momentum behind it to optimize it for huge usecases - https://thenewstack.io/apache-arrow-designed-accelerate-hado...
It is based on Google Flatbuffers, but is getting substantial engineering specifically from a big data/machine learning perspective. Instead of building directly over Protobuf, it might be interesting to build on top of Arrow (in the same way that Feather is built on top of Arrow: https://github.com/wesm/feather).
We chose protobuf mainly due to a good Caffe adoption story and its track record of compatibility across many platforms (mobile, server, embedded, etc.). We actually looked at Thrift - which is Facebook-owned - and it is equally nice, but our final decision was mainly about minimizing the switching overhead for users of existing frameworks such as Caffe and TensorFlow.
To be honest, protobuf is indeed a little bit hard to install (especially if you have Python and C++ version differences). Would definitely be interested in taking a look at possible solutions - the serialization format and the model standard are more or less orthogonal, so one may see a world where we can convert between different serialization formats (JSON <-> protobuf as an overly simplified example).
hdf5, Feather, Arrow, protobufs, json, xml -- all solve the problem of representing data on disk. They all leave the question of how to map said data to a specific problem domain up to the developer.
Projects like ONNX define said mapping for a specific domain (in ONNX's case, by agreeing on a proto schema for ML models, and its interpretation).
To use a simplistic metaphor: protobufs are the .docx format; onnx is a resume template you can fill out in Word.
I understand there could be something unique in this format, but I'm really keen to understand what.