> The Cap’n Proto encoding is appropriate both as a data interchange format and an in-memory representation, so once your structure is built, you can simply write the bytes straight out to disk!
Eh, I'd rather pay the cost of serialisation once and deserialisation once, and then access my data for as close to free as possible, rather than relying on a compiler to actually inline calls properly.
> Integers use little-endian byte order because most CPUs are little-endian, and even big-endian CPUs usually have instructions for reading little-endian data.
*sob* There are a lot of things Intel has to account for, and frankly little-endian byte order isn't the worst of them, but it's pretty rotten. Writing 'EFCDAB8967452301' for 0x0123456789ABCDEF is perverse in the extreme. Why? Why?
As pragmatic design choices go, Cap'n Proto's is a good one (although it violates the standard network byte order). Intel effectively won the CPU war, and we'll never be free of the little-endian plague.
Big-endian vs. little-endian is an ancient flamewar that isn't going to go away any time soon, but sure, let's argue.
Once you've spent as much time twiddling bits as I have (as the author of proto2 and Cap'n Proto), you start to realize that little endian is much easier to work with than big-endian.
For example:
- To reinterpret a 64-bit number as 32-bit in BE, you have to add 4 bytes to your pointer. In LE, the pointer doesn't change. (This point and the next are sketched in code below.)
- Just about any arithmetic operation on integers (e.g. adding) starts from the least-significant bits and moves up. It's nice if that can mean iterating forward from the start instead of backwards from the end, e.g. when implementing a "bignum" library.
- Which of the following is simpler?
// Extract nth bit from byte array, assuming LE order.
(bytes[n/8] >> (n%8)) & 1
// Extract nth bit from byte array, assuming BE order.
(bytes[n/8] >> (7 - n%8)) & 1
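To make the first two points concrete, here is a minimal C sketch (the function names and values are just illustrative): on a little-endian machine the low 32 bits of a 64-bit value live at the same address as the value itself, and a multi-limb addition can propagate its carry by walking forward from limb 0.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    // Low 32 bits of a 64-bit value. On a little-endian machine this
    // copies from the same address; on big-endian you would have to
    // start 4 bytes in. (memcpy keeps us within strict-aliasing rules.)
    static uint32_t low32(const uint64_t *p) {
        uint32_t out;
        memcpy(&out, p, sizeof out);                 // LE: no pointer adjustment
        // memcpy(&out, (const char *)p + 4, 4);     // BE would need the +4
        return out;
    }

    // Bignum-style addition over limbs stored least-significant first:
    // the carry starts at limb 0, so the loop simply runs forward.
    static void bignum_add(uint32_t *a, const uint32_t *b, size_t nlimbs) {
        uint64_t carry = 0;
        for (size_t i = 0; i < nlimbs; i++) {
            carry += (uint64_t)a[i] + b[i];
            a[i] = (uint32_t)carry;
            carry >>= 32;
        }
    }

    int main(void) {
        uint64_t x = 0x0123456789ABCDEFULL;
        printf("low 32 bits: 0x%08" PRIX32 "\n", low32(&x));  // 0x89ABCDEF on LE

        uint32_t a[2] = {0xFFFFFFFF, 0};  // limbs, least-significant first
        uint32_t b[2] = {1, 0};
        bignum_add(a, b, 2);              // 0xFFFFFFFF + 1 = 0x1_00000000
        printf("sum: 0x%08" PRIX32 "%08" PRIX32 "\n", a[1], a[0]);
        return 0;
    }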
There's really no good argument for big-endian encoding except that it's the ordering that we humans use in writing.
> There's really no good argument for big-endian encoding except that it's the ordering that we humans use in writing.
And not even always that. We call our numbering system "Arabic", but for the Arabs, it's little-endian.
For some reason humans seem to want high powers on the left, even if it makes no sense in a left-to-right language.
Take polynomials: they are typically written big-endian:
ax^2 + bx + c
But infinite series have to be little-endian.
c_0 + c_1*x + c_2*x^2 + ...
If you think for a moment about how you would write multiplication, you will see the latter form is much easier to reason about and program with.
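To make that concrete, here is a small C sketch (names are illustrative) with coefficients stored lowest-degree first, i.e. a[0] is the constant term, the little-endian convention of the series form above: the x^i and x^j terms contribute at index i + j, and the coefficient at index k depends only on entries 0..k of each input, which is what lets the same loop handle a truncated infinite series without knowing the final degree up front.

    #include <stddef.h>

    // Multiply two polynomials whose coefficients are stored
    // lowest-degree first (a[0] and b[0] are the constant terms).
    // Assumes na, nb >= 1; out must hold na + nb - 1 coefficients.
    void poly_mul(const double *a, size_t na,
                  const double *b, size_t nb,
                  double *out) {
        for (size_t k = 0; k < na + nb - 1; k++)
            out[k] = 0.0;
        for (size_t i = 0; i < na; i++)
            for (size_t j = 0; j < nb; j++)
                out[i + j] += a[i] * b[j];   // x^i * x^j contributes at x^(i+j)
    }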
I think figure 1 illustrates a common misunderstanding. People who object to little-endian are often imagining it in their heads as order (3): they imagine that the bits of each byte are ordered most-significant to least-significant (big-endian), but then for some reason the bytes are ordered the opposite, least-significant to most-significant (little-endian). That indeed would make no sense.
But because most architectures don't provide any way to address individual bits, only bytes, it's entirely up to the observer to decide in which order they want to imagine the bits. When using little-endian, you imagine that the bits are in little-endian order, to be consistent with the bytes, and then everything is nice and consistent.
> When using little-endian, you imagine that the bits are in little-endian order, to be consistent with the bytes, and then everything is nice and consistent.
But isn't that kind of at odds with how shifting works? (i.e. that a left shift moves towards the "bigger" bits and a right shift moves toward the "smaller" ones.) Perhaps for a Hebrew or Arabic speaker this all works out nicely, but for those of us accustomed to progressing from left to right it seems a bit backwards...
> To reinterpret a 64-bit number as 32-bit in BE, you have to add 4 bytes to your pointer. In LE, the pointer doesn't change.
But one shouldn't do that very often: those are two different types. The slight cost of adding to a pointer is negligible.
> Just about any arithmetic operation on integers (e.g. adding) starts from the least-significant bits and moves up. It's nice if that can mean iterating forward from the start instead of backwards from the end, e.g. when implementing a "bignum" library.
-- is a thing, just as ++ is.
> There's really no good argument for big-endian encoding except that it's the ordering that we humans use in writing.
That's like saying, 'there's really no good argument for pumping nitrogen-oxygen mixes into space stations except that it's the mixture we humans use to breathe.'
It's simplicity itself for a computer to do big-endian arithmetic; it's horrible pain for a human being who has to read a little-endian listing. A computer can be made to do the right thing. Who is the master: the computer or the man?
That line of argument suggests you'd be happier with a human-readable format like JSON. Which is another eternal flamewar that we aren't likely to resolve here. Needless to say I like binary formats. :)
Floating point isn't legible either, nor are bitfields, opcodes, instruction formats, etc. That's why we use computers to do the 'right thing' and format the data if we need to read it.
I don't have a preference for either one, but using little-endian when most/every processor you will be targeting supports it makes more sense than using big-endian + extra work on x86 just so you can read it with less effort in a memory dump.
I've just begun learning about endianness and I'm sorry if this comes off as pedantic, but doesn't the last example refer to bit numbering rather than (byte) endianness?
Endianness applies to both! Or, it should, if you're being consistent. It makes no sense to say that the first byte is the most-significant byte, but that bit zero is the least significant bit of the byte.
Because all modern computer architectures assign addresses to bytes, not bits, it's up to us to decide which way to number the bits. But we should always number the bits the same way we number the bytes.
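A tiny C check of that consistency claim (little-endian host assumed, values are illustrative): when bytes and bits are both numbered least-significant first, "bit n" of the raw byte array and "bit n" of the integer value agree across byte boundaries, with no 7 - n%8 correction anywhere.

    #include <assert.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint16_t v = 0xA53C;
        uint8_t bytes[sizeof v];
        memcpy(bytes, &v, sizeof v);   // raw bytes; little-endian host assumed

        for (int n = 0; n < 16; n++) {
            int from_bytes = (bytes[n / 8] >> (n % 8)) & 1;  // the LE formula from above
            int from_value = (v >> n) & 1;                   // bit n of the integer
            assert(from_bytes == from_value);
        }
        return 0;
    }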
> Eh, I'd rather pay the cost of serialisation once and deserialisation once, and then access my data for as close to free as possible, rather than relying on a compiler to actually inline calls properly.
The thing is, as close to free as possible is surprisingly expensive. Protobuf's varint encoding is extremely branchy, and hurts performance in a datacenter environment (where bandwidth is free, and CPU is expensive).
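For anyone who hasn't looked at varints, here is a rough sketch of the usual byte-at-a-time decoder (the general protobuf-style wire format, not Protobuf's actual optimized implementation). Note the data-dependent branch on every byte; a fixed-width little-endian field, by contrast, is a single load.

    #include <stddef.h>
    #include <stdint.h>

    // Decode one varint: 7 payload bits per byte, least-significant
    // group first, high bit set on every byte except the last.
    // Returns a pointer just past the varint, or NULL on bad input.
    static const uint8_t *decode_varint(const uint8_t *p, const uint8_t *end,
                                        uint64_t *out) {
        uint64_t result = 0;
        for (int shift = 0; p < end && shift < 64; shift += 7) {
            uint8_t byte = *p++;
            result |= (uint64_t)(byte & 0x7F) << shift;
            if ((byte & 0x80) == 0) {   // taken or not depending on the data,
                *out = result;          // once per byte: hard to predict and
                return p;               // hard to vectorize
            }
        }
        return NULL;   // truncated or longer than 64 bits
    }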
> As pragmatic design choices go, Cap'n Proto's is a good one (although it violates the standard network byte order). Intel effectively won the CPU war, and we'll never be free of the little-endian plague.
Did they though? Arguably there are far more ARM CPUs (like the one in your pocket) than there are server CPUs. Since cellphones and other low-power devices are almost all big endian, it seems like network byte order would have been better to use. High-powered servers can pay the cost of converting byte order, but battery-powered devices cannot afford to do so.
ARM (since v3) is bi-endian for data accesses and defaults to little. You can confirm this by searching the ARM Information Center [1] for 'support for mixed-endian data', but I can't get a working URL for that exact page.
> > Integers use little-endian byte order because most CPUs are little-endian, and even big-endian CPUs usually have instructions for reading little-endian data.
> *sob* There are a lot of things Intel has to account for, and frankly little-endian byte order isn't the worst of them, but it's pretty rotten. Writing 'EFCDAB8967452301' for 0x0123456789ABCDEF is perverse in the extreme. Why? Why?
Little endian means that a CPU designer can make buses shorter, which makes the CPU more efficient and smaller. There are also several benefits from the programming side. So it is actually better than big endian in /some/ cases, at the cost of being less intuitive to humans.
So while Intel did choose little endian, they had very good reason to (and it's probably why everything except SystemZ and POWER use it).
Hah. According to Wikipedia: "The Datapoint 2200 used simple bit-serial logic with little-endian to facilitate carry propagation. When Intel developed the 8008 microprocessor for Datapoint, they used little-endian for compatibility. However, as Intel was unable to deliver the 8008 in time, Datapoint used a medium scale integration equivalent, but the little-endianness was retained in most Intel designs." Which I guess is a reason to use little-endian, though not one that's really relevant to anyone that's ever actually used an Intel CPU. (ARM, incidentally, was little-endian by default because the 6502 was.)
It's all so depressing.