A better solution might be to just emulate a block device with a sparse file and...

cyphar · on Jan 23, 2019

This is what Singularity (and to a lesser extent, LXD) do. The main problem with this is that you don't get de-duplication of transfer (or really of storage) -- any small change in your published rootfs and you'd have to re-download the whole thing. In addition, it requires that the system you're mounting it on supports the filesystem you use (and that the admins are happy using that filesystem).

There is also a potential security risk -- since filesystem drivers are generally not secured against potentially malicious sources (there are plenty of attacks that have been found against the big Linux filesystems when you attack them with un-trusted filesystem data). This is one of the reasons auto-mounting USBs is generally seen as a bad security practice.

Don't get me wrong, there is a _huge_ benefit to using your runtime format as your image distribution format. But there are downsides that are non-trivial to work around. I am thinking about how to bridge that gap though.

ofrzeta · on Jan 23, 2019

What about some kind of binary delta diffs such as bsdiff?

cyphar · on Jan 23, 2019

Yes, and this is what LXD does. I think I mentioned it in the article, but basically the issue is that it requires one of:

1. A clever server, which asks you which version do you have so it can generate a diff for you. This has quite a few drawbacks (storage and processing costs as well as making it harder to verify that the image you end up with is what was signed by the original developers). But this will guarantee that you will always get de-duplication.

2. Or you could pre-generate diffs for a specific set of versions, which means it's a lottery whether or not users actually get transfer de-duplication. If you generate a diff for _every_ version you're now back to storage costs (and processing costs on the developer side that increase with each version of the container image you have). You could make it so that the diffs only step you forward one version rather than instantly get you the latest, but then you now have clients having to pull many binary diffs again.

This system has existed for a long time with BSD as well as distributions having delta-RPMs (or the equivalent for debs). It works _okay_ but it's far from ideal, and the other negatives of using loopback filesystems only make it less ideal.

In my view, dumb formats are best.

bruce_one · on Jan 23, 2019

Using [zsync](http://zsync.moria.org.uk/) is an alternative option from my understanding?

I could be technically inaccurate, but my understanding is that it's rsync but with the server serving a metadata file which allows the rsync-diffing to happen from the client side rather than the server side - hence no clever server required.

It also doesn't require diffing particular revisions; but only the different blocks will be fetched. It does require server the metadata file, but they're note very large afaik.

drbawb · on Jan 23, 2019

I thought much the same thing. ZFS scratches most of these itches. (Sharing common blocks, it metadata is self-verifying, it's able to serialize a minimal set of changes between two snapshots, etc.) Just ship the filesystem images as, well, filesystem images. Plus if you want to go from development to production: you can `zfs-send` your image onto your HA cluster. ZFS makes for a durable & reliable storage subsystem that's been production-grade for many years.

This is essentially what Illumos/SmartOS does, and it seems to work out well for them.

cyphar · on Jan 23, 2019

The problem is when you have systems that don't have ZFS, or cases where you want to operate on an image without mounting it.

Also (from memory) ZFS de-duplication is not transmitted over zfs-send which means that you don't really get ZFS's de-dup when transferring (you do get it for snapshots -- but then we're back to layer de-duplication which is busted when it comes to containers).

Don't get me wrong, I'm a huge fan of ZFS but it really doesn't solve all the problems I went through.

waz0wski · on Jan 25, 2019

ZFS supports both compression and de-duplication on send streams, the behavior on the receiving side depends on the configuration of the pool+dataset where the data is being received.

There used to be some differences in features/behavior depending on the ZFS version in use (Solaris vs FreeBSD vs ZFSoL/OpenZFS) but I believe as of 2018 all ZFS implementations have these features for send streams

AceJohnny2 · on Jan 23, 2019

I thought Apple's approach with DMGs weird, but that's basically what they've been doing for forever.

ben509 · on Jan 22, 2019

It needs to archive individual files to be able to reconstruct the layers. They also don't want to build squashfs into the interface specification.