One reason that we don't "just stored structured data as-is" is because there ar...

clarry · on Feb 19, 2020

The author's point is that if our APIs worked with structured data instead of byte streams, then we wouldn't need to care about the in-memory format. The standard API would do the hard work for us, allowing us to call send(fd, my_struct);

Since there's no standard API (blessed by the OS) and the lowest common denominator is byte streams, we're seeing all these ad-hoc solutions and a hodgepodge of formats and libraries to deal with. That's lots and lots of time and money spent on rather basic stuff.

Bystroushaak · on Feb 19, 2020

Thats one of my points, yes.

PaulDavisThe1st · on Feb 19, 2020

And that's why I think the author is wrong about this (he is probably right about thinking (more often) about filesystems as databases, but that's somewhat orthogonal).

The approach you're describing only works for POD-style "structured data". Once you start using OOP of almost any type (though not every type), you no longer have ... well, POD that you can move to/from a storage medium. You have objects whose in-memory format IS important and compiler dependent.

There are other concerns too. His WAV example (I write pro-audio software for a living) doesn't even begin to touch on the actual complexities of dealing with substantial quantities of audio (or any other data that cannot reasonably be expected to fit into memory). Nor on the cost of changing the data ... does the entire data set need to be rewritten, or just part of it? How would you know which sections matter? Oh wait, the data fundamentally IS a byte stream, so now you have to treat it like that. If you don't care about performance (or storage overhead, but that's less and less of a concern these days), there are all kinds of ways of hiding these sorts of details. But the moment that performance starts to matter, you need to get back the bytestream level.

And so yes, there's no standard API and yes the lowest common denominator is byte streams ... because the __only__ common denominator is byte streams. Thinking about this any other way is a repeat of a somewhat typical geek dream that the world (or a representation of the world or part of it) can be completely ordered and part of a universal system.

clarry · on Feb 20, 2020

Structured data can be streamed too, and indeed there is software that does it at scale. Data with much more complex structure than audio frames.

PaulDavisThe1st · on Feb 21, 2020

Of course it can be streamed! Nobody ever suggested it could not be. The point is that to stream it portably (i.e. without knowing the hardware characteristics - and possibly software characteristics too - of the reciever) you have to first serialize it and then deserialize it, because the in-memory representation within the sender is NOT portable.

clarry · on Feb 21, 2020

You're too hung up on in-memory representation. Yes, if it's not right, then it needs to be converted. That can be done for you, or you can do it manually with byte streams like cave man. If you can do it manually fast, then it can be done just as fast automatically based on the declared structure.

cycloptic · on Feb 21, 2020

It isn't possible to be in a world where the in-memory representation doesn't matter. Someone has to write and maintain that code, and if it's you, then you get the fun task of telling some users their workload won't fit into it. And then they go off and write their own in-memory representation.

silon42 · on Feb 19, 2020

Different architectures is actually a smaller problem than evolution of data format. In-memory formats are not extensible by default.

_8ljf · on Feb 19, 2020

As I’ve noted above, this problem would go away if data streams were adequately tagged in the first place.

Having that high-level knowledge of data structure enables all sorts of intelligent automation.

In the event that the client uses a different memory layout, it could look up a coercion handler that converts the supplied data from its original layout to the layout required by the client. This is, for instance, how the Apple Event Manager was designed to work: all data is tagged with a type descriptor:

typeSInt16, typeSInt32, typeSInt64 typeUInt16, typeUInt32, typeUInt64

typeIEEE32BitFloatingPoint, typeIEEE64BitFloatingPoint, typeIEEE128BitFloatingPoint

typeUTF8Text, typeUnicodeText (UTF16BE), typeUTF16ExternalRepresentation (UTF16 w. endian mark)

typeAEList (ordered collection) typeAERecord (keyed collection)

and so on. (The tags themselves are encoded as UInt32; nothing so advanced as MIME types, but at least they’re compact.)

The AEM includes a number of standard coercion handlers for converting data between different representations, and clients may also supply their own handlers if needed. Thus the server just packs data in its current representation, and if the client uses the same representation then, great, it can use it as-is. Otherwise the client-side AEM automatically coerces the data to the form the client as part of the unpacking process.

There are limits in the AEM’s design, not least the inability to describe complex data (arrays and structs) with a single “generic” descriptor, e.g. `AEList(SInt32)`. That would vastly simplify packing and unpacking—in best case to simple flat serialization/deserialization, at worst to a single recursive deserialization—instead of two recursive operations with lots of extra mallocs and frees for interim data. But the basic principle is sound, and adheres well to the “be liberal in what you accept” principle.

I believe Powershell does something similar when connecting outputs to inputs of different (though broadly compatible) types, intelligently coercing the output data to the exact type the input requires. No manual work required; it “Just Works”.

Or, if you don’t mind the extra overhead then content negotiation is also an option, which is something HTTP does very well (though web programmers very badly). That is advantageous when communicating with “less intelligent” clients as it permits the server, which best understands its own data, to pre-convert (e.g. via lossy coercion) its data to a form the client will accept.

Lots of ways that Unix’s “throw its hands up and dump the problem all over the users” non-answer can be massively improved on, in other words, without ever losing the lovely loose coupling that is a Unix system’s strength. It only requires a single piece of essential—yet missing—information: the data’s type.

PaulDavisThe1st · on Feb 19, 2020

The problem outlined in the original article isn't about data streams. It is, at bottom, about the contrast between data storage and in-memory representation.

Typed data streams were not invented by Apple. Back in the 1980s, there was (for example) Sun's RPC mechanism, which gave you "seamless" remote procedure call, including transfer of arbitrary structures over a network.

But the original post is much more about filesystems. I used the socket example merely to illustrate the problem, not the actual topic.