In general, the issue is with composition. Your full message is someone else's field. Having a header that only occurs in message initial position breaks this.
For protobufs in particular, I have no idea. If you look at the encoding [0], you will see that the notion of submessages is explicitly supported. However, submessages are preceded by a length field, which makes the lack of a length field at the start of the top-level message a rather glaring omission. The best argument I can see is that submessages use a tag-length-value scheme instead of length-value-tag. This is because protobufs in general use a tag-value scheme, and for certain types the value begins with a length field. This means that to have a consistent and composable format, you would need the message length to start at the second byte of the message. Still, that would probably be good enough for 90% of the cases where people want to apply a length header.
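To make the tag-length-value point concrete, here is a minimal hand-rolled sketch of the wire layout described in [0]. The `Inner`/`Outer` schema and the helper names are made up for illustration; this is not the protobuf library's API.

```python
LEN = 2  # wire type for length-delimited values, per the encoding guide


def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_len_field(field_number, payload):
    """Emit tag, then length, then the payload bytes (tag-length-value)."""
    tag = encode_varint((field_number << 3) | LEN)
    return tag + encode_varint(len(payload)) + payload


# Hypothetical schema:
#   message Inner { string s = 1; }
#   message Outer { Inner inner = 1; }
inner = encode_len_field(1, b"ABC")   # a top-level Inner: no overall length anywhere
outer = encode_len_field(1, inner)    # the same bytes embedded in Outer: tag + length + inner

print(inner.hex(" "))  # 0a 03 41 42 43
print(outer.hex(" "))  # 0a 05 0a 03 41 42 43
```

The length `05` only appears once `Inner` becomes a field of `Outer`, and it sits after the parent's tag byte rather than at the start of anything.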
Protobuf is somewhat unique among serialization formats in that a message can, in principle, be extended indefinitely. (BSON is close, but has an explicit document size prefix.) For example, given the following definition:
message Foo {
  repeated string bar = 1;
}
Any repetition of `0a 03 41 42 43` (a value "ABC" for field #1; the tag byte is (1 << 3) | 2 = 0x0a for a length-delimited field), including an empty sequence, is a valid protobuf message. In other words, there is no explicit encoding for "this is a message"! Submessages have to be delimited because otherwise they wouldn't be distinguishable from the parent message.
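A rough way to check this by hand, assuming the `Foo` definition above and the standard tag rule (field number shifted left by 3, OR-ed with wire type 2 for length-delimited, giving 0x0a for field 1). `parse_foo` is a hypothetical hand-written decoder, not generated protobuf code.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):  # continuation bit clear: last byte
            return result, pos
        shift += 7


def parse_foo(buf):
    """Collect every occurrence of field 1 (length-delimited string)."""
    bars, pos = [], 0
    while pos < len(buf):  # the only end-of-message signal is running out of bytes
        tag, pos = read_varint(buf, pos)
        if tag != (1 << 3) | 2:
            raise ValueError(f"unexpected tag {tag:#x}")
        length, pos = read_varint(buf, pos)
        bars.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return bars


one = bytes.fromhex("0a 03 41 42 43")
print(parse_foo(b""))      # []
print(parse_foo(one))      # ['ABC']
print(parse_foo(one * 3))  # ['ABC', 'ABC', 'ABC']
```

Nothing in the byte stream tells the decoder where `Foo` ends, so whoever hands it the buffer has to know the boundary already.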
This is an interesting design difference between protobuf and Cap’n Proto: one has “repeated field of type X” while the other has “field of type list of X”. So one allows adding a repeated/optional field to a schema without re-encoding any messages, while the other supports “field of type list of lists of X” etc.
> So one allows adding a repeated/optional field to a schema without re-encoding any messages,
Hmm don't they both allow that? Or am I misunderstanding what you mean here?
I guess the interesting (though only occasionally useful) thing about protobuf is if you concatenate two serialized messages of the same type and then parse the result, each repeated field in the first message will be concatenated with the same field in the second message.
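Here is a small sketch of that concatenation behaviour, using a hand-written encoder and decoder rather than the real library. It assumes a hypothetical message with two repeated string fields, numbered 1 and 2; `serialize` and `parse` are illustrative names.

```python
from collections import defaultdict

LEN = 2  # wire type for length-delimited values


def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def read_varint(buf, pos):
    """Decode a base-128 varint at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return result, pos
        shift += 7


def serialize(fields):
    """Serialize (field_number, text) pairs as length-delimited string fields."""
    out = bytearray()
    for number, text in fields:
        data = text.encode("utf-8")
        out += encode_varint((number << 3) | LEN) + encode_varint(len(data)) + data
    return bytes(out)


def parse(buf):
    """Collect every string occurrence per field number, in wire order."""
    fields, pos = defaultdict(list), 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        if (tag & 0x7) != LEN:
            raise ValueError("this sketch only handles length-delimited fields")
        length, pos = read_varint(buf, pos)
        fields[tag >> 3].append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return dict(fields)


first = serialize([(1, "x"), (2, "p")])
second = serialize([(1, "y"), (1, "z")])
print(parse(first + second))  # {1: ['x', 'y', 'z'], 2: ['p']}
```

The decoder never sees a boundary between `first` and `second`, so field 1 simply accumulates values from both.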
> you will see that the notion of submessages is explicitly supported.
This is misinterpreting what actually happens. "Message" in Protobuf lingo means "a composite part". Everything that's not an integer (or boolean or enum, which are also integers) is a message. Lists and maps are messages and so are strings. The format is designed not to embed the length of the message in the message itself, but to put it outside. Why -- nobody knows for sure, but most likely a mistake. After all, it's C++, and judging by the rest of the code the author seems to have felt challenged by the language. So when they realized that the encoding of the message length was misplaced, moving it to the right place probably felt like too much work, and so it continues to this day.
For the record, I have implemented a bunch of similar binary formats, e.g. AMF, Thrift and BSON. The problem in Protobuf isn't some sort of theoretical impasse. It's really easy to avoid if you give it an hour of thought before you actually start writing the code.
So the length of the submessage is part of its parent, and the top level message has no explicit length because it has no parent? It seems terrible for most purposes.
> Having a header that only occurs in message initial position breaks this.
Why would it break it? It may make it slightly harder to parse, but since the header also determines the end of the message, anyone parsing the outer message would have a clear understanding that the inner header can be safely ignored as long as the stated outer length has not been matched.
Yeah, it's not clear to me why this is an issue either. I'd expect that parsers would be written as "parse length, if it's valid, allocate that many bytes and read them in, then parse those bytes as a message".
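For what it's worth, a minimal sketch of that "parse length, validate, read, then parse" loop. The 4-byte big-endian prefix and the `MAX_FRAME` limit are my own assumptions for the example, not something protobuf defines, and the payload is left as opaque bytes.

```python
import io
import struct

MAX_FRAME = 1 << 20  # hypothetical sanity limit (1 MiB)


def write_frame(stream, payload):
    """Write a 4-byte big-endian length, then the payload."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_frame(stream):
    """Return the next payload, or None at a clean end of stream."""
    header = stream.read(4)
    if not header:
        return None
    if len(header) < 4:
        raise EOFError("truncated length header")
    (length,) = struct.unpack(">I", header)
    if length > MAX_FRAME:  # validate before allocating anything
        raise ValueError(f"frame of {length} bytes exceeds limit")
    # Note: a real socket may return fewer bytes per read; loop there.
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("truncated payload")
    return payload


one = bytes.fromhex("0a 03 41 42 43")  # field 1 = "ABC"
buf = io.BytesIO()
for message in (one, b"", one * 2):  # an empty message is still a valid frame
    write_frame(buf, message)

buf.seek(0)
frames = list(iter(lambda: read_frame(buf), None))
print(frames == [one, b"", one * 2])  # True
```

With the length outside the message, the empty payload and the doubled payload come back as separate messages instead of merging into one.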
[0] https://protobuf.dev/programming-guides/encoding/