Yangqing here (caffe2 and ONNX). We did use protobuf, and we even had an extensive discussion about its versions, based on our experience with Caffe and Caffe2 deployments. Here is a snippet from the codebase:
// Note [Protobuf compatibility]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
// Based on experience working with downstream vendors, we generally can't
// assume recent versions of protobufs. This means that we do not use any
// protobuf features that are only available in proto3.
//
// Here are the most notable contortions we have to carry out to work around
// these limitations:
//
// - No 'map' (added protobuf 3.0). We instead represent mappings as lists
// of key-value pairs, where order does not matter and duplicates
// are not allowed.
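Concretely, the workaround looks roughly like this in a proto2 schema (message and field names below are illustrative, not the exact ONNX definitions):

  syntax = "proto2";

  // Where protobuf >= 3.0 would let us write
  //   map<string, string> metadata = 5;
  // we spell the mapping out as a repeated key-value message instead.
  message StringStringEntry {
    optional string key = 1;
    optional string value = 2;
  }

  message Model {
    // Order does not matter and duplicate keys are not allowed.
    repeated StringStringEntry metadata = 5;
  }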
Hi there, I work on the protobuf team at Google. A friendly note that protobuf 3.0 and proto3 are actually two separate things (this is admittedly confusing). Protobuf 3.0 was a release that added support for proto3 schemas, but Protobuf 3.0 continues to support proto2 schemas.
proto3 was designed to be a simplification of proto2 schemas. It removed several features such as extensions and field presence. Each .proto file specifies whether it is a proto2 or proto3 schema.
Protobuf 3.0 added several features such as maps, but these features were added to both proto2 and proto3 schemas.
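To make the distinction concrete, here is a rough sketch of the two schema flavors (made-up file and message names, assuming protoc 3.0 or newer):

  // attrs_v2.proto -- a proto2 schema
  syntax = "proto2";

  message AttributesV2 {
    optional string name = 1;          // explicit presence labels (proto2 feature)
    map<string, string> metadata = 2;  // maps work here too with protoc >= 3.0
  }

  // attrs_v3.proto -- a proto3 schema (separate file)
  syntax = "proto3";

  message AttributesV3 {
    string name = 1;                   // no optional/required labels
    map<string, string> metadata = 2;
  }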
Thanks so much @haberman! Yep, the whole thing is a little bit confusing... We basically focused on two things:
- not using extensions, as proto3 does not support them
- not using map, as protobuf 2.x does not support it
and we basically landed on restricting ourselves to the common denominator among all post-2.5 versions. Wondering if this sounds reasonable to you - always great to hear the original author's advice.
Plus, I am wondering if there are recommended ways of reducing protobuf runtime size. We use protobuf-lite, but any further wins would definitely be nice for memory- and space-constrained deployments.
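For reference, by protobuf-lite I mean selecting the lite generated code per file with optimize_for, roughly like this sketch:

  syntax = "proto2";
  option optimize_for = LITE_RUNTIME;  // lite generated classes, smaller runtime

  message Tensor {
    optional string name = 1;
    repeated int64 dims = 2;
  }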
Hmm, I'm not sure I quite get your strategy. If you want to support back to Protobuf 2.5, then you'll need to use proto2 schemas (https://developers.google.com/protocol-buffers/docs/proto). Proto2 schemas will support extensions forever (even after Protobuf 3.0), so there's no need to avoid extensions.
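For instance, a proto2 schema can declare an extension range and let downstream vendors add their own fields, and this works across all the library versions you mentioned (names below are just illustrative):

  syntax = "proto2";

  message ModelProto {
    optional string name = 1;
    // Field numbers reserved for third-party extension fields.
    extensions 1000 to 1999;
  }

  // A vendor can attach fields without editing the base schema.
  extend ModelProto {
    optional string vendor_annotation = 1000;
  }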
We have been working on some optimizations to help the linker strip generated code where possible. I recommend compiling with -ffunction-sections/-fdata-sections if you aren't already, and --gc-sections on your linker invocation.
So what we do is keep syntax=proto2, but allow users to compile with both protobuf 2.x and protobuf 3.x libraries; the minimum we need is 2.6.1. We kind of feel that this gives maximum flexibility for people who have already chosen a protobuf library version.
Makes sense! All I'm saying is that there is no need to avoid using extensions if that is your strategy. Extensions will work in all versions of the library you wish to support.
Pretty cool, and thanks for that reply! Did you look at something like Arrow/Feather, which is looking to get adopted as the interoperable format in R/Pandas ... and maybe even Spark?
There's been quite a bit of momentum behind it to optimize it for huge usecases - https://thenewstack.io/apache-arrow-designed-accelerate-hado...
It is based on Google Flatbuffers, but is getting substantial engineering specifically from a big data/machine learning perspective. Instead of building directly over Protobuf, it might be interesting to build on top of Arrow (in the same way that Feather is built on top of Arrow: https://github.com/wesm/feather).
We chose protobuf mainly due to a good Caffe adoption story and its track record of compatibility across many platforms (mobile, server, embedded, etc.). We actually looked at Thrift - which is Facebook-owned - and it is equally nice, but our final decision was mainly about minimizing the switching overhead for users of existing frameworks such as Caffe and TensorFlow.
To be honest, protobuf is indeed a little bit hard to install (especially if you have Python and C++ version differences). Would definitely be interested in taking a look at possible solutions - the serialization format and the model standard are more or less orthogonal, so one may see a world where we can convert between different serialization formats (JSON <-> protobuf as an overly simplified example).
hdf5, Feather, Arrow, protobufs, json, xml -- all solve the problem of representing data on disk. They all leave the question of how to map said data to a specific problem domain up to the developer.
Projects like ONNX define said mapping for a specific domain (in ONNX's case, by agreeing on a proto schema for ML models, and its interpretation).
To use a simplistic metaphor: protobufs are the .docx format; onnx is a resume template you can fill out in Word.
I understand there could be something unique in this format, but I'm really keen to understand what.