
Stepping up an abstraction level in this discussion...does anyone have any insight into _why_ an encoding format wouldn't want to have length prefixes standardized as part of the expected header of a message? From what I can tell, there isn't a strong argument against it; assuming you're comfortable with limiting messages to under 2^32 bits, prefixing an unsigned length should only take four bytes per message, which doesn't seem like it would ever be a bottleneck. It also allows the receiving side of a message to know up front exactly how much memory to allocate, and it makes it much easier to write correct parsing code that also handles edge cases (e.g. making it obvious to explicitly handle a message that's much larger than the amount of memory you're willing to allocate). The fact that there are formats out there that don't mandate length prefixing makes me think I might be missing something though, so I'd be interested to hear counterarguments.
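
To make it concrete, here's roughly the kind of framing I have in mind (a Python sketch; the 4-byte big-endian prefix and the helper name are just assumptions for illustration):

    import struct

    def send_message(sock, payload: bytes) -> None:
        # 4-byte unsigned big-endian length, then the payload itself.
        # The receiver can read the 4 bytes, allocate exactly len(payload),
        # and know up front whether the message is bigger than it will accept.
        sock.sendall(struct.pack(">I", len(payload)) + payload)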


In general, the issue is with composition: your full message is someone else's field. Having a header that only occurs in message-initial position breaks this.

For protobufs in particular, I have no idea. If you look at the encoding [0], you will see that the notion of submessages is explicitly supported. However, submessages are preceded by a length field, which makes the lack of a length field at the start of the top-level message a rather glaring omission. The best argument I can see is that submessages use a tag-length-value scheme instead of length-value-tag. This is because in general protobufs use a tag-value scheme, and certain types have the beginning of the value be a length field. This means that to have a consistent and composable format, you would need the message length to start at the second byte of the message. Still, that would probably be good enough for 90% of the instances where people want to apply a length header.

[0] https://protobuf.dev/programming-guides/encoding/
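
To illustrate the tag-value scheme mentioned above (a rough Python sketch; real tags are varints, so a single-byte tag is assumed here for simplicity):

    def read_tag(tag_byte: int):
        # Low 3 bits are the wire type, the rest is the field number.
        field_number = tag_byte >> 3
        wire_type = tag_byte & 0x07  # wire type 2 = length-delimited: a varint length follows
        return field_number, wire_type

    # 0x0A -> field #1, wire type 2 (string / bytes / submessage, so a length comes next)
    assert read_tag(0x0A) == (1, 2)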


Protobuf is sort of unique among serialization formats in that it can be indefinitely extended in principle. (BSON is close, but has an explicit document size prefix.) For example, given the following definition:

    message Foo {
        repeated string bar = 1;
    }
Any repetition of `0A 03 41 42 43` (the value "ABC" for field #1), including an empty sequence, is a valid protobuf message. In other words, there is no explicit encoding for "this is a message"! Submessages have to be delimited because otherwise they wouldn't be distinguishable from the parent message.
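
For instance, assuming `foo_pb2` was generated from that definition with protoc (the module name is just an assumption), the Python runtime happily parses any number of repetitions:

    from foo_pb2 import Foo  # hypothetical module generated by protoc

    msg = Foo()
    msg.ParseFromString(b"\x0a\x03ABC" * 3)  # three copies of the encoded field
    assert list(msg.bar) == ["ABC", "ABC", "ABC"]

    empty = Foo()
    empty.ParseFromString(b"")               # the empty sequence is also a valid Foo
    assert list(empty.bar) == []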


This is an interesting design difference between protobuf and cap’n proto: one has “repeated field of type X” while the other has “field of type list of X”. So one allows adding a repeated/optional field to a schema without re-encoding any messages, while the other supports “field of type list of lists of X”, etc.


> So one allows adding a repeated/optional field to a schema without re-encoding any messages,

Hmm don't they both allow that? Or am I misunderstanding what you mean here?

I guess the interesting (though only occasionally useful) thing about protobuf is that if you concatenate two serialized messages of the same type and then parse the result, each repeated field in the first message will be concatenated with the same field in the second message.


Yes, "submessages"in Protobuf have the same field serialisation as strings

The field bytes don't really encode tag and type; they encode tag and size (fixed 32-bit, 64-bit, or variable length).

Protobuf is a TLV format. In that regard, it's not unique at all.


> you will see that the notion of submessages is explicitly supported.

This is misinterpreting what actually happens. "Message" in Protobuf lingo means "a composite part". Everything that's not an integer (or boolean or enum, which are also integers) is a message. Lists and maps are messages and so are strings. The format is designed not to embed the length of the message in the message itself, but to put it outside. Why -- nobody knows for sure, but most likely it was a mistake. After all it's C++, and by the looks of the rest of the code the author seems to have felt challenged by the language, so if / when they realized that the encoding of the message length was misplaced, putting it in the right place probably felt like too much work, and so it continues to this day.

For the record, I implemented a bunch of similar binary formats, e.g. AMF, Thrift and BSON. The problem in Protobuf isn't some sort of theoretical impasse. It's really easy to avoid if you give it an hour of thinking before you get to actually writing the code.


So the length of the submessage is part of its parent, and the top level message has no explicit length because it has no parent? It seems terrible for most purposes.


Precisely. This is also unique to Protobuf (at least, I don't know of other formats that do this, and I had to implement a handful).


> Having a header that only occurs in message-initial position breaks this.

Why would it break it? It may make it slightly harder to parse, but since the header also determines the end of the message, anyone parsing the outer message would have a clear understanding that the inner header can be safely ignored as long as the stated outer length has not been matched.


Yeah, it's not clear to me why this is an issue either. I'd expect that parsers would be written as "parse length, if it's valid, allocate that many bytes and read them in, then parse those bytes as a message".
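
Something like this sketch, say (Python, assuming the 4-byte prefix framing being discussed; the names and the size cap are made up):

    import struct

    MAX_MESSAGE_SIZE = 16 * 1024 * 1024  # whatever you're willing to allocate

    def recv_exactly(sock, n: int) -> bytes:
        chunks = []
        while n > 0:
            chunk = sock.recv(n)
            if not chunk:
                raise ConnectionError("socket closed mid-frame")
            chunks.append(chunk)
            n -= len(chunk)
        return b"".join(chunks)

    def read_message(sock) -> bytes:
        (length,) = struct.unpack(">I", recv_exactly(sock, 4))
        if length > MAX_MESSAGE_SIZE:
            raise ValueError("message larger than we're willing to allocate")
        return recv_exactly(sock, length)  # then parse these bytes as a message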


I'd say the main arguments are:

1. Many transports that you might use to transmit a Protobuf already have their own length tracking, making a length prefix redundant. E.g. HTTP has Content-Length. Having two lengths feels wrong and forces you to decide what to do if they don't agree.

2. As others note, a length prefix makes it infeasible to serialize incrementally, since computing the serialized size requires most of the work of actually serializing it.

With that said, TBH the decision was probably not carefully considered; it just evolved that way, and the protocol was in wide use at Google before anyone could really change their mind.

In practice, this did turn out to be a frequent source of confusion for users of the library, who often expected that the parser would just know where to stop parsing without them telling it explicitly. Especially when people used the functions that parse from an input stream of some sort, it surprised them that the parser would always consume the entire input rather than stopping at the end of the message. People would write two messages into a file and then find when they went to parse it, only one message would come out, with some weird combination of the data from the two inputs.
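
A small Python sketch of that surprise, reusing the hypothetical `Foo` message (repeated string bar) from upthread:

    from foo_pb2 import Foo  # hypothetical generated class

    with open("two_messages.bin", "wb") as f:
        f.write(Foo(bar=["first"]).SerializeToString())
        f.write(Foo(bar=["second"]).SerializeToString())

    parsed = Foo()
    with open("two_messages.bin", "rb") as f:
        parsed.ParseFromString(f.read())

    # Only one message comes out, with the repeated fields concatenated:
    assert list(parsed.bar) == ["first", "second"]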

Based on that experience, Cap'n Proto chose to go the other way, and define a message format that is explicitly self-delimiting, so the parser does in fact know where the message ends. I think this has proven to be the right choice in practice.

(I maintained protobuf for a while and created Cap'n Proto.)


About #1, FYI you should never trust the content-length of an HTTP request. It actually says that in one spec version or another.

The problem with writing the length out at the beginning of the message is that you need to know the length before you write it out. For large objects that may cause memory issues/be problematic.

In many cases it works just fine. I doubt any protocol puts a "0" as the length for a dynamic length, but I can see a many-months long technical fight about that particular design decision.


> FYI you should never trust the content-length of an HTTP request. It actually says that in one spec version or another.

That's not quite right. When a Content-Length is present, it is always correct by definition (it is what determines the body size; there's no way for the body to be a different size). What you probably mean is that you shouldn't ever assume that a Content-Length is available, because an HTTP message can alternatively be encoded using Transfer-Encoding: chunked, in which case the length is not known upfront. That is true, but doesn't change my point: Either Content-Length or chunked transfer encoding serve to delimit the HTTP entity-body, which makes any other in-band delimiter redundant.


About #2, it seems that every time someone creates some format with a serialized-size prefix, the next required step is creating a chunked format without that prefix.

Which IMO is perfectly ok. Chunking things is a perfectly fine way to allow for incremental communication. Just do it up-front, and you will not need the complexity of 2 different sizing formats.


In truth, the vast majority of protobuf users do not construct messages in a streaming way anyway; they have the whole message tree in memory upfront and then serialize it all at once.

"Streaming" in gRPC (or Cap'n Proto) involves sending multiple messages. I think this is probably the right approach. Otherwise it seems very hard to use a message as its streaming in as you can never know which fields are not present vs. just haven't arrived yet. But representing a stream as multiple messages leaves it up to the application to decide what it wants in each chunk, which makes sense.


Yeah, that's another good point I forgot to mention; it's often hard to know the difference between "I haven't received this info yet but it might still come" and "I know this info definitely isn't coming at all". I've dealt with this more often in the context of a specific schema rather than in the encoding format itself (e.g. sending back a simple "success" response when relaying a message where the response will be delivered asynchronously so that the other side can tell the difference between the message or response getting lost and the message being successfully sent and the response just not having arrived yet), but it's definitely possible for an encoding format to be designed in a way where it's not clear where the message should end, and having a length prefix is an effective way to deal with that as well.

I also fully agree with the "streaming can be done as multiple messages" approach; from the discussion here, it sounds like there may be some nice use cases where having a length prefix would be prohibitive (e.g. compression being generated on-the-fly), but these don't sound like typical use cases for encoding formats intended to be used generally; if anything, I'd expect something like a gzip response to be sent back as the entirety of a response (e.g. an HTTP GET request for a specific file) rather than as part of a message in some custom protocol using protobuf or something similar.


> serialize incrementally

You cannot do it in Protobuf anyways (you need to allocate memory, remember? and you need to know how much to allocate, so you need to calculate the length anyways, you just throw it away after you calculate it, fun, fun fun!).


You don't necessarily have to allocate space for the whole message upfront. You can write some of it into a buffer, write that out to the socket, fill the buffer with more stuff, etc. There are actually programs that do this, although it's true that it's rarely worth the complexity.


Not going to work really.

The way it works in real life, if, say, you want to serialize a list of varints, is that you'd need some small memory chunk (let's call it staging) where you write individual integers (although this is a bad idea for long lists, as you'd really want to write multiple elements at once if you have enough elements to justify spawning more threads). So, in this staging area you write those integers, more or less a byte at a time. You know they aren't going to take more than 8 bytes in the worst case, so you can have your staging area be 8 bytes.

Then you need to keep track of how many bytes in total you wrote. And, at this point you may start writing the field with the serialized list (in bytes). The field will contain the length of the list. So, you've already calculated the length even before you started writing. Also, you need to store those varints somewhere before you start writing the field with the list...
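
Roughly, in Python terms (a sketch of the standard base-128 varint step being described, nothing Protobuf-specific beyond that):

    def encode_varint(value: int) -> bytes:
        # 7 bits per byte, least significant group first,
        # high bit set on every byte except the last.
        out = bytearray()
        while True:
            byte = value & 0x7F
            value >>= 7
            if value:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

    # You only learn the total length after staging every element somewhere:
    staging = b"".join(encode_varint(v) for v in [1, 150, 300])
    length_prefix = encode_varint(len(staging))  # only now can the field be written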

Protobuf isn't designed to do streaming. Well, really, it isn't designed at all. Like I wrote elsewhere, it was implemented first and then there was an attempt to describe what was implemented and call that "design". Having implemented several formats (eg. FLV and MP4) that were designed for streaming, I'm very confident Protobuf authors never concerned themselves with this aspect.


Among other things, length prefixing is annoying when streaming; it basically requires you to buffer the entire message, even if you could more efficiently stream it in chunks, because you need to know the length ahead of time, which you may very well not.


Remember FLV? What about MP4? Surprise! Both use length prefixing, and stream perfectly fine.

Length-prefixing is not a problem for streaming. Hierarchical data is, but even then, you have stuff like SAX (for XML).

The problem with Protobuf and why you cannot stream it is that it allows repetition of the same field, and only the last one counts. So, length-prefixing or not, you cannot stream it, unless you are sure that you don't send any hierarchical data (eg. you are sending a list of integers / floats).

Ah, also, another problem: "default fields". Until you've parsed the entire message, you don't know if default fields are present and whether they need to be initialized on the receiving end.


> length prefixing is annoying when streaming

This can be avoided with a magic number: if the length is 0, then the message length isn't known.


That does leave one problem: you still need a way to segment your stream. Most length-prefixed framing schemes do not have any way to segment the stream other than the length prefix. What you wind up wanting is something like chunked encoding.

(Also, using zero as a sentinel is not necessarily a good idea, since it makes zero length messages more difficult. I'd go with -1 or ~0 instead.)


Using `-1` would require using a signed integer for the length, which I guess could be done if you're fine with having the maximum length be half as large, but that also raises the question of what to do with the remaining negative values; what does a length of -10 mean?

I thought -0 is only something in floating point numbers, not integers, and using floats for the length of a message sounds like a nightmare to me.


Ah, I was a little unclear. I mean ~0 as in NOT 0, an integer with all bits set. This is also the same as -1 in two's complement. So basically, I'm suggesting you use the maximum unsigned integer value as a sentinel. That doesn't work if you're using a variable-length unsigned integer like base128vlq, but if you're doing base128vlq you could always make a special sentinel for that (e.g. unnecessarily set the high bit and follow it with a zero byte; this would never normally appear with a base128vlq.)
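
A sketch of what I mean, assuming a protobuf-style varint for the length (the sentinel byte pair is made up, but a canonical encoder never produces it):

    UNKNOWN_LENGTH = b"\x80\x00"  # non-canonical varint: "0, but with a continuation byte"

    def write_length(buf: bytearray, length=None) -> None:
        if length is None:
            buf.extend(UNKNOWN_LENGTH)   # sentinel: length not known up front
            return
        while True:                      # ordinary canonical varint encoding
            byte = length & 0x7F
            length >>= 7
            if length:
                buf.append(byte | 0x80)
            else:
                buf.append(byte)
                return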


Or it doesn't have a length. For messaging protocols - or in general - magic should be avoided at all times.


If you have random access, you could leave some space and then go back and fill in the actual length value. Would work better with fixed size integer as you know ahead of time how much space to leave.
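
For example (a Python sketch with a seekable file, using a fixed 4-byte length so the reserved space is known ahead of time):

    import struct

    def write_with_backpatched_length(f, chunks) -> None:
        start = f.tell()
        f.write(b"\x00\x00\x00\x00")       # reserve space for the length
        total = 0
        for chunk in chunks:               # stream the payload out incrementally
            f.write(chunk)
            total += len(chunk)
        end = f.tell()
        f.seek(start)
        f.write(struct.pack(">I", total))  # go back and fill in the real value
        f.seek(end)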


Just leave some empty electromagnetic waves in the cables before your message, got it.


If you’re streaming, you generally don’t have random access.


Somehow I missed the ‘streaming’ part… my bad


Some compression formats such as gzip support encoding streams.

This is useful when you don't know the size in advance, or if you compress on demand and want the receiver to start reading while the sender is still compressing.

One example could be a web service where you request dynamic content (like a huge CSV file). The client can start downloading earlier, and the server doesn't need to create a temporary file. The web service will stream the results directly, encoding them in chunks.
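
Roughly like this (a Python sketch; the `rows` iterator of byte chunks is hypothetical):

    import zlib

    def stream_gzipped(rows):
        # Compress on the fly: each chunk can be sent to the client as soon
        # as it's produced, and nobody needs to know the total size up front.
        comp = zlib.compressobj(wbits=31)  # wbits=31 selects the gzip container
        for row in rows:
            chunk = comp.compress(row)
            if chunk:
                yield chunk
        yield comp.flush()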


> Some compression formats such as gzip support encoding streams.

More accurately speaking, gzip (and many other compressed file formats) does have the file size information, but that information is (or, for other formats, can be) appended after the data. Protobuf doesn't have any such information, so a better analogue would be the DEFLATE compressed bytestream format.


Gzip in particular has something more akin to a size checksum appended at the end, i.e., the decompressed size modulo 2^32. This is not very helpful: since the maximum compression ratio is ~1032, this "size" could already overflow for a gzip file that is only ~4 MiB in compressed size.

https://stackoverflow.com/a/69353889/2191065
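
You can see it at the tail of any .gz file (a Python sketch; RFC 1952 calls the field ISIZE, and it only covers the last member, modulo 2^32):

    import struct

    def gzip_isize(path: str) -> int:
        # Last 4 bytes of a gzip member: uncompressed size mod 2^32, little-endian.
        with open(path, "rb") as f:
            f.seek(-4, 2)  # 2 = seek relative to end of file
            return struct.unpack("<I", f.read(4))[0]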


You may want to have yield/streaming semantics where the length is not known in advance.


The length would have to be known in advance in order to be written, so the message cannot be incrementally written. You need a more complex framing scheme like Consistent Overhead Byte Stuffing for that. And many applications do want a variable number of length bytes, because i) 4 bytes is actually too long for short messages and ii) some messages can exceed 2^32 bytes. Not to say the generic varint encoding is good for this purpose, though [1].

[1] If you ever have to design one, make sure that reading the first byte is enough to determine the number of subsequent length bytes.
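
One hypothetical way to satisfy [1] (a Python sketch; the 0xF8 cutoff is arbitrary):

    def encode_length(n: int) -> bytes:
        # The first byte alone tells the reader how many more bytes to fetch:
        #   0x00-0xF7: the length itself, nothing follows.
        #   0xF8-0xFF: (first byte - 0xF7) big-endian bytes follow (up to 8, i.e. 64-bit lengths).
        if n < 0xF8:
            return bytes([n])
        extra = (n.bit_length() + 7) // 8
        return bytes([0xF7 + extra]) + n.to_bytes(extra, "big")

    def decode_length(first: int, rest: bytes) -> int:
        return first if first < 0xF8 else int.from_bytes(rest, "big")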


> 4 bytes is actually too long for short messages

Would it ever be an actual bottleneck though? If it's not actually impeding throughput, I feel like this is more of an aesthetic argument than a technical one, and one where I'd happily sacrifice aesthetics to make the code simpler.

> some message can exceed 2^32 bytes

Fair enough, but that just makes the question "would 8 bytes per message ever actually be a bottleneck", which I'm still not convinced would ever be the case


I agree that 4 bytes of overhead in a single item is not a big deal even when the item itself is short. But if you have multiple heterogeneous items, it becomes worthwhile to set an upper bound on the ratio of overhead to data. (SQLite, for example, makes extensive use of varints for this reason [1].) I personally think a single byte of overhead is worthwhile for sub-100-byte data, while larger ones should use a fixed size prefix because the overhead is no longer significant.

[1] https://www.sqlite.org/fileformat.html#record_format


I work on an embedded OS where our IPC buffer is only about 140 bytes.

Anything more requires multiple IPCs, with lots of expensive context switches.

Wasting even one precious byte on a pointless header would absolutely be an issue in this environment.


That makes sense. I've probably been thinking more about remote network protocols, where the time it takes for a message to reach its destination will be so large that the overhead for a few extra bytes per message is negligible; for a protocol used solely within a given local system, there would be a much lower threshold of acceptable overhead.


> and it allows the receiving side of a message to know up front exactly how much memory to allocate

Not necessarily. Can you really trust the length given from a message? Couldn't a malicious sender put some fake length to fool around with memory allocation?

I was under the impression that something like this caused Heartbleed (to use one example):

* https://en.wikipedia.org/wiki/Heartbleed


Heartbleed was caused by allowing the user to specify the length of the response, not that of a request.

When receiving a message, if the user gives you a wrong length, you'll simply fail in parsing their message. Of course, it is up to you to protect against DOS attacks (like someone sending you a 5 TB message, or at least a message that claims it is 5TB) - but that is necessary regardless of whether they tell you the size ahead of time or not.

With heartbleed, a user sent a message saying "please send me a 5MB large heartbeat message", and OpenSSL would send a 5MB re-used buffer, of which only the first few bytes were overwritten with the heartbeat content. The rest of the bytes were whatever remained from a previous use of the buffer, which could be negotiated keys etc.


I don't see how sending a bad length could cause a memory issue in this case; if a message has a length that's much longer than expected, the receiving side could just discard any future messages from that destination (or even immediately close the connection). If the message is much shorter than the data received, the bytes following it would be treated as the start of a new message, and the same logic would apply.


This is nothing general. Just someone who created Protobuf "forgot" to do it. There was no reason not to do it given how everything else is encoded in Protobuf.

My guess is that Protobuf was first implemented and then designed. And by the time it was designed, the designer felt too lazy to do anything about the top-level message's length. There are plenty of other technical bloopers that indicate a lack of up-front design, so this wouldn't be surprising.


Streaming has already been mentioned. Efficiency might be another argument. If your messages are typically being sent through a channel that already has a way of indicating the end of the message then having to express the length inside the message as well would be a useless overhead in bytes sent and code complexity.


This assumes that only messages from controlled sources will be received though, right? If you're receiving messages over a TCP socket or something similar, that seems like a potentially flawed assumption; I'd think anything parsing messages coming from the network should be written in a way that explicitly accounts for malicious messages coming from the other side of a connection.

EDIT: I'm also still not any more convinced that four bytes per message would ever be a bottleneck for any general purpose protocol, but I'd be curious to hear of a case where that would actually be an issue.


> This assumes that only messages from controlled sources will be received though, right?

I don't think so. The question of whether you trust the length indication to be correct (you almost certainly shouldn't) seems to me to be independent of whether the length indication comes from inside the message or from some outside wrapper.


I might have misunderstood what you were suggesting; from rereading, it sounds like you're suggesting to rely on something like the end of a TCP packet rather than having an explicit length. If this is what you mean, my concern would be that requiring a protocol to map 1:1 to the transmission protocol's packets (or a similar construct) can be limiting; it would require all messages to fit in a single buffer, which could prevent it from working with clients or servers configured to use a different length and might make it difficult to use with other transmission protocols.

My question in the beginning of this thread was intended to be specifically about general purpose formats like protobuf; I think relying on the semantics of TCP or something like that might be a good choice for a bespoke protocol, but it doesn't seem like a great idea for something expected to be used in a wide variety of cases.


For a slow protocol like protobufs that is rarely streamed, I agree a length prefix should be the default

One way to make streaming work is just to allow the length value to be bigger than needed and add a padding scheme at the end of the message. This is overhead free in terms of processing time since fields must be decoded sequentially anyway.


> For a slow protocol like protobufs that is rarely streamed...

In my experience, protobufs are often streamed, especially in the cases where performance matters.


Protobuf over UDP can use the UDP payload length. Likewise for the many variants of self-synchronising DLE framing (DLE,STX..DLE,ETX) used on serial links.

A varint length field prepended to protobuf messages (sent over a reliable transport, such as TCP) seems sane.


Framing is a distinct concern from payload serialization.

Most protocols and serialization formats already define a form of length-prefixed framing; requiring that a protobuf payload also carry such a header would simply be a waste of bytes.

Additionally, it ensures that protobuf can be serialized and streamed without first computing the payload length, which would require serializing the entire message first.




