Lol AWS or Cloudflare could cover these costs easily. Let’s see who steps up.
Tangentially, if they self-hosted using, say, ZFS, would the cached artifacts and other such things be candidates for the dedup features of the file system?
Bezos could cover my costs easily, yet he is not stepping up to help. It seems presumptuous to say that these companies should donate just because they could afford it. As distasteful as it seems, the open source community really needs a better funding model.
Their current sponsor isn’t a deep-pocketed publicly traded cloud provider. All of these cloud providers advocate for open source. Cloudflare could come in, cover these costs, and write them off as donations or something similar. For AWS, these costs probably wouldn’t even make a material difference in the accounting.
I think you’ve got some deep-seated angst against the uber-wealthy (join the club), but I never mentioned Bezos. I mentioned these companies sponsoring selfishly, for the goodwill it would bring.
> Tangentially, if they self-hosted using, say, ZFS, would the cached artifacts and other such things be candidates for the dedup features of the file system?
Yes, block-level deduplication would be hugely beneficial. If you're running Nix locally, you can set auto-optimise-store = true, which enables file-level deduplication using hardlinks; that already saves a great deal of space.
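For reference, that's roughly this (nix-store --optimise is the one-off pass that hard-links duplicates already in the store; /etc/nix/nix.conf is the usual location on a daemon install):

    # /etc/nix/nix.conf -- dedupe identical store files with hard links going forward
    auto-optimise-store = true

    # one-off pass over the existing store
    nix-store --optimise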
The Nix cache stores artifacts as compressed NAR archives, which I think defeats deduplication.
It's possible to serve a Nix cache directly out of a Nix store, creating the NAR archives on the fly. But that would require a) a big filesystem and b) a lot more compute.
Which are not insurmountable challenges, but they are challenges.
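If anyone wants a picture of what that looks like, the classic tool is nix-serve pointed at a live store. Rough sketch, with the key name and port being placeholders:

    # generate a signing key pair for the cache (names here are placeholders)
    nix-store --generate-binary-cache-key example-cache-1 cache-secret.key cache-public.key

    # serve /nix/store as a binary cache over HTTP, packing NARs on the fly
    NIX_SECRET_KEY_FILE=cache-secret.key nix-serve -p 8080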
My org does the live-store-on-ZFS model with a thin proxy cache on the front end, so the CPU isn't constantly regenerating the top ~5% most popular NARs.
It seems to work well for us, but I am curious how that would look at 425TB scale: would filesystem deduplication pay for the losses in compression, and if so, by how much? How does that change if you start sharding the store across multiple ZFS hosts?
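(For what it's worth, ZFS can at least estimate the dedup side of that without committing: zdb can simulate a dedup table against data already in a pool. Pool and dataset names below are placeholders.)

    # simulate a dedup table and print the projected dedup ratio (read-only, no changes)
    zdb -S tank

    # compare against what compression is already saving on the store dataset
    zfs get compressratio tank/nix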
I think you are confusing deduplication and compression provided by the filesystem with deduplication and compression provided by nix.
If you are just storing a static set of pre-generated NAR archives, you will not see any benefit from filesystem-provided compression or deduplication.
If you host a live nix store (i.e. uncompressed files under /nix/store), then you could benefit from filesystem-provided compression and deduplication. Also, nix itself can replace duplicates with hard links. The downside is that you then have to generate the NAR archives on the fly when a client requests a derivation.
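A rough sketch of that live-store setup, for the curious (pool/dataset names are placeholders, compression=zstd needs a reasonably recent OpenZFS, and dedup=on is the part you'd want to measure before turning on):

    # dataset for the live store, with transparent compression and block-level dedup
    zfs create -o compression=zstd -o dedup=on -o atime=off -o mountpoint=/nix tank/nix

    # on top of that, let nix replace identical files with hard links
    nix-store --optimise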
That might be worth it, especially since they get great hit rates from Fastly. But on the other hand, that means a lot more moving parts to maintain and monitor vs. simple blob storage.
Definitely. It's not hard to imagine that even if they end up rolling their own for this, it might make sense for it to be a bunch of off-the-shelf Synology NASes stuffed with 22TB drives in donated rackspace around the world, running MinIO. If the "hot" data is all still in S3/CDN/cache, then you're really just keeping the rest of it around on a just-in-case basis, and for that, the simpler the better.
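(Not saying they'd do it exactly this way, but for a sense of how little is involved: a single MinIO node on a box like that is basically a one-liner. Paths and credentials below are made up.)

    # single-node MinIO exposing an S3-compatible API over a big local volume
    export MINIO_ROOT_USER=cache-admin MINIO_ROOT_PASSWORD='change-me'
    minio server /volume1/nix-cache --console-address ':9001'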
Yeah, that's what I was thinking. Say 3 nodes in 3 different data centers, all synced, each running enough spinning rust to meet their needs and then some, plus the headroom that ZFS requires (i.e. the rule of thumb of not using more than 80%? of the usable storage, to prevent fragmentation I think), and then exposing that via an S3-compatible API with MinIO plus a CDN of their choice.
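Back-of-the-envelope for one such node, with everything below (pool name, device paths, vdev layout) being placeholder numbers rather than a real proposal:

    # 3 x 12-wide raidz2 vdevs of 22TB drives: 3 x 10 x 22TB ~= 660TB usable;
    # staying under the ~80% rule leaves ~528TB, enough for the ~425TB set plus growth
    zpool create tank \
      raidz2 /dev/disk/by-id/drive{01..12} \
      raidz2 /dev/disk/by-id/drive{13..24} \
      raidz2 /dev/disk/by-id/drive{25..36}

    # the dataset MinIO would serve from
    zfs create -o compression=zstd -o atime=off tank/nix-cache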
de-dupe in ZFS is a "here be dragons" sort of feature, best left alone unless you're already familiar with ZFS administration and understand the tradeoffs it brings [0].
in this case, the files being stored are compressed archives (NAR files) so the block-level dedupe that ZFS does would probably find very little to work with, because the duplicated content would be at different offsets within each archive.
beyond that, there's also the issue that they would need a single ZFS pool with 500TB+ of storage (given the current ~425TB size). that's certainly doable with ZFS, but the hardware needed for that server would be...non-trivial. and of course that's just a single host, they would want multiple of those servers for redundancy. distributed object storage like S3 is a much better fit for that volume of data.