I see that, when you create several maps that have the same keys but with different values, the key-set is shared between instances, and this is basically what brings this particular implementation of maps up-to-par with replacing Erlang records in their use-case.
I'm curious whether this efficiency in storage translates to efficiency in messaging, though:
• When you pass a map N times to another process, is the key-set copied N times? (Not that bad if true, so I'm guessing so.)
• When you pass a map N times over the distribution protocol, is the key-set serialized and transmitted N times? (This'd be pretty bad if true.)
The first one: yes, it has to be copied. Otherwise the keyset would be shared among processes, and sharing heap data between processes is generally not possible in Erlang (immutable binary data being the exception).
The second one: yes and no. The format is described at http://erlang.org/doc/apps/erts/erl_ext_dist.html, and maps are laid out as a serialized construction like you say. Two points balance this out: there is an atom cache which allows you to cache atoms and make their representation small, and you can zlib-compress the data.
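To see what "laid out as a serialized construction" means concretely, here is a toy Python sketch of how the external term format encodes a map: a version byte, the MAP_EXT tag, the arity, then every key written in full next to its value. The tag values (131, 116, 119, 97) are from the erl_ext_dist page; the helper functions themselves are mine, and only small atoms/integers are handled.

```python
import struct
import zlib

def encode_atom(name: str) -> bytes:
    data = name.encode("utf-8")
    return bytes([119, len(data)]) + data          # SMALL_ATOM_UTF8_EXT

def encode_small_int(n: int) -> bytes:
    return bytes([97, n])                          # SMALL_INTEGER_EXT, 0..255

def encode_map(pairs: dict) -> bytes:
    body = struct.pack(">BI", 116, len(pairs))     # MAP_EXT tag + 32-bit arity
    for key, value in pairs.items():
        body += encode_atom(key) + encode_small_int(value)
    return bytes([131]) + body                     # 131 = format version byte

# Two maps with the same key-set: the keys appear in full in both encodings,
# so sending N such maps transmits the key-set N times over the wire ...
a = encode_map({"name": 1, "age": 2})
b = encode_map({"name": 3, "age": 4})
assert a.count(b"name") == 1 and b.count(b"name") == 1

# ... but zlib compression recovers much of the redundancy across a stream
# of terms that repeat the same key-set.
stream = b"".join(encode_map({"name": i % 250, "age": i % 250})
                  for i in range(100))
print(len(stream), len(zlib.compress(stream)))
```

The atom cache addresses the same redundancy at the level of individual atoms rather than whole key-sets.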
It is not set in stone yet and might even change in a later version. The term format is versioned, so it can be upgraded later if need be.
Right, what I was asking in the second question is basically whether there exists, or is planned, a keyset cache to go along with the atom cache. I'd think that, if the atom cache is a good idea, a keyset cache would be good for exactly the same reasons.
If you do that, it is better to build an arbitrary caching construction over subtrees and then reap the benefit of tighter packing of data in general. I agree a keyset cache could be really nice to have going forward. It would resemble what happens in-heap.
But the rule of the Ericsson OTP team is to get it correct before making it fast.
Atoms and keysets both have pretty much the same caching semantics, though: they get repeated over the wire with pretty good locality, and don't have conflicting terms busting the cache in-between. That's not really true for anything else, which makes me guess that an arbitrary term-branch cache would be pretty useless.
The idea of having arbitrary subtrees is to support an efficient compression scheme. But you are indeed right that a keyset cache would be extremely effective at limiting the size of maps.
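The keyset-cache idea discussed above might work along these lines: the first time a key-set crosses the wire it is sent in full and assigned a slot, and later maps with the same key-set send only the slot id plus their values, much like atom cache references. Everything here (the class names, the wire tuples) is invented for illustration; no such mechanism exists in erl_ext_dist today.

```python
class KeysetCacheSender:
    def __init__(self):
        self.slots = {}            # frozenset of keys -> slot id

    def encode(self, m: dict):
        keyset = tuple(sorted(m))
        if keyset in self.slots:
            # cache hit: ship only the slot id and the values
            return ("ref", self.slots[keyset], [m[k] for k in keyset])
        slot = len(self.slots)
        self.slots[keyset] = slot
        # cache miss: ship the full key-set once, with its new slot id
        return ("full", slot, keyset, [m[k] for k in keyset])

class KeysetCacheReceiver:
    def __init__(self):
        self.slots = {}            # slot id -> key tuple

    def decode(self, msg):
        if msg[0] == "full":
            _, slot, keyset, values = msg
            self.slots[slot] = keyset
        else:
            _, slot, values = msg
            keyset = self.slots[slot]
        return dict(zip(keyset, values))

sender, receiver = KeysetCacheSender(), KeysetCacheReceiver()
wire = [sender.encode({"name": i, "age": i + 1}) for i in range(3)]
assert wire[0][0] == "full"          # keys cross the wire once ...
assert wire[1][0] == "ref"           # ... then only values + slot id
assert receiver.decode(wire[0]) == {"name": 0, "age": 1}
assert receiver.decode(wire[1]) == {"name": 1, "age": 2}
```

This mirrors the in-heap sharing mentioned above: per-connection state trades a little bookkeeping for not re-sending identical key-sets.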