In general, the issue is with composition. Your full message is someone else fie...

lifthrasiir · on Sept 20, 2023

Protobuf is sort of unique among serialization formats in that a message can be indefinitely extended in principle. (BSON is close, but has an explicit document size prefix.) For example, given the following definition:

    message Foo {
        repeated string bar = 1;
    }

Any repetition of `0A 03 41 42 43` (a value "ABC" for the field #1), including an empty sequence, is a valid protobuf message. In other words, there is no explicit encoding for "this is a message"! Submessages have to be delimited because otherwise they wouldn't be distinguishable from the parent message.
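To make that concrete, here is a minimal sketch in Python that uses a hand-written decoder, not any protobuf library. `read_varint` and `decode_foo` are names made up for this illustration, and the decoder only understands field #1 of `Foo`. The key byte for field 1 with the length-delimited wire type is `(1 << 3) | 2 = 0x0A`.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode_foo(buf):
    """Hypothetical decoder for Foo: collect every occurrence of field #1."""
    bars = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if field_number != 1 or wire_type != 2:
            raise ValueError(f"unexpected key {key:#04x}")
        length, pos = read_varint(buf, pos)
        if pos + length > len(buf):
            raise ValueError("truncated field")
        bars.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return bars


ABC = bytes.fromhex("0a03414243")  # field #1, length 3, "ABC"
for n in range(4):
    # Zero, one, two or three copies all decode without error.
    print(n, decode_foo(ABC * n))
```

The decoder stops only because the buffer ends, so the empty buffer is just a `Foo` with no `bar` entries.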

DougBTX · on Sept 20, 2023

This is an interesting design difference between protobuf and Cap’n Proto: one has “repeated field of type X” while the other has “field of type list of X”. So one allows adding a repeated/optional field to a schema without re-encoding any messages, while the other supports “field of type list of lists of X” etc.

kentonv · on Sept 20, 2023

> So one allows adding a repeated/optional field to a schema without re-encoding any messages,

Hmm don't they both allow that? Or am I misunderstanding what you mean here?

I guess the interesting (though only occasionally useful) thing about protobuf is if you concatenate two serialized messages of the same type and then parse the result, each repeated field in the first message will be concatenated with the same field in the second message.
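A rough sketch of that concatenation behaviour, again with a hand-rolled encoder and decoder instead of a real protobuf library. `encode_foo` and `decode_foo` are hypothetical helpers for a message with a single `repeated string` field #1, and only that case is covered here.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def read_varint(buf, pos):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_foo(bars):
    """Serialize a list of strings as repeated field #1 (key byte 0x0A)."""
    out = bytearray()
    for text in bars:
        data = text.encode("utf-8")
        out += b"\x0a" + encode_varint(len(data)) + data
    return bytes(out)


def decode_foo(buf):
    """Parse a buffer that contains only occurrences of field #1."""
    bars = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        if key != 0x0A:
            raise ValueError(f"unexpected key {key:#04x}")
        length, pos = read_varint(buf, pos)
        bars.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return bars


first = encode_foo(["a", "b"])
second = encode_foo(["c"])
# Parsing the concatenated bytes yields the concatenated repeated field.
print(decode_foo(first + second))  # ['a', 'b', 'c']
```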

nly · on Sept 20, 2023

Yes, "submessages" in Protobuf have the same field serialisation as strings.

The field bytes don't really encode tag and type; they encode tag and size (fixed 32-bit, 64-bit or variable length).

Protobuf is a TLV format. In that regard, it's not unique at all.
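As a small illustration of what the key in front of each field carries: it is a varint equal to `(field_number << 3) | wire_type`. The sketch below splits single-byte keys only, and `split_key` and the label strings are names chosen for this example.

```python
# Wire types 3 and 4 (the deprecated group markers) are left out of this table.
WIRE_TYPES = {
    0: "varint",
    1: "fixed 64-bit",
    2: "length-delimited",
    5: "fixed 32-bit",
}


def split_key(key):
    """Split a field key into (field_number, wire type description)."""
    return key >> 3, WIRE_TYPES.get(key & 0x07, "other")


print(split_key(0x08))  # (1, 'varint')
print(split_key(0x09))  # (1, 'fixed 64-bit')
print(split_key(0x0A))  # (1, 'length-delimited')
print(split_key(0x12))  # (2, 'length-delimited')
```

Strings, bytes and submessages all use the length-delimited wire type, so the key alone does not say which of them a field holds.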

crabbone · on Sept 21, 2023

> you will see that the notion of submessages are explicitly supported.

This is misinterpreting what actually happens. "Message" in Protobuf lingo means "a composite part". Everything that's not an integer (or boolean or enum, which are also integers) is a message. Lists and maps are messages and so are strings. The format is designed not to embed the length of the message in the message itself, but to put it outside. Why -- nobody knows for sure, but most likely a mistake. After all it's C++, and by the looks of the rest of the code the author seems like they felt challenged by the language, so if / when they realized that the encoding of the message length was misplaced, they felt like it'd be too much work to put it in the right place, and so it continues to this day.

For the record, I implemented a bunch of similar binary formats, e.g. AMF, Thrift and BSON. The problem in Protobuf isn't some sort of a theoretical impasse. It's really easy to avoid it if you give it an hour or so of thought before you actually start writing the code.

HelloNurse · on Sept 20, 2023

So the length of the submessage is part of its parent, and the top level message has no explicit length because it has no parent? It seems terrible for most purposes.
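For what it's worth, a tiny sketch of where the length ends up. `Outer` is a hypothetical message with a single `Foo` submessage in field #1, and `length_delimited` is a helper written for this example, not a library call.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def length_delimited(field_number, payload):
    """Emit key + length + payload for a wire type 2 field."""
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


inner = length_delimited(1, b"ABC")  # Foo { bar: "ABC" }
outer = length_delimited(1, inner)   # hypothetical Outer { foo: Foo { ... } }

print(inner.hex(" "))  # 0a 03 41 42 43
print(outer.hex(" "))  # 0a 05 0a 03 41 42 43
```

The `05` that says how long the `Foo` is belongs to the `Outer` field that contains it, and nothing in `outer` records the length of `Outer` itself.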

crabbone · on Sept 21, 2023

Precisely. This is also unique to Protobuf (at least, I don't know other formats, and I had to implement a handful, that do that).

pyrale · on Sept 20, 2023

> Having a header that only occurs in message initial position breaks this.

Why would it break it? It may make it slightly harder to parse, but since the header also determines the end of the message, anyone parsing the outer message would have a clear understanding that the inner header can be safely ignored as long as the stated outer length has not been matched.

saghm · on Sept 20, 2023

Yeah, it's not clear to me why this is an issue either. I'd expect that parsers would be written as "parse length, if it's valid, allocate that many bytes and read them in, then parse those bytes as a message".
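Something like the following sketch is what I have in mind. It frames each top-level message with a varint length prefix, which is one common convention and an assumption here. `write_frame`, `read_frame` and the 1 MiB limit are made up for the example.

```python
import io

MAX_MESSAGE_SIZE = 1 << 20  # arbitrary sanity limit chosen for this sketch


def write_frame(stream, payload):
    """Write a varint length prefix followed by the payload bytes."""
    length = len(payload)
    prefix = bytearray()
    while True:
        byte = length & 0x7F
        length >>= 7
        if length:
            prefix.append(byte | 0x80)
        else:
            prefix.append(byte)
            break
    stream.write(bytes(prefix) + payload)


def read_frame(stream):
    """Return the next payload, or None at a clean end of stream."""
    length = 0
    shift = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            if shift == 0:
                return None  # no more frames
            raise EOFError("truncated length prefix")
        length |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:
            break
        shift += 7
        if shift > 28:
            raise ValueError("length prefix too long")
    if length > MAX_MESSAGE_SIZE:
        raise ValueError("declared length exceeds limit")
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("truncated payload")
    return payload


stream = io.BytesIO()
write_frame(stream, bytes.fromhex("0a03414243"))  # Foo with one "ABC"
write_frame(stream, b"")                          # an empty Foo is zero bytes
stream.seek(0)

frames = []
while (frame := read_frame(stream)) is not None:
    frames.append(frame)
print(frames)
```

With the prefix, an empty message (a single `00` length byte) can be told apart from the end of the stream, which the bare encoding cannot express.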