> Self-referentially incorruptible data is the standard to beat here; moving the concern to a different layer doesn't increase the efficiency or integrity of the data itself.
I respectfully disagree. By putting it in the layer below, there is the ability to do repairs.
For example, consider storing XZ files on a Ceph storage cluster. Ceph supports Reed-Solomon coding. This means that if data corruption occurs, Ceph is capable of automatically repairing the data corruption by recomputing the original file and writing it back to disk once more.
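To make that concrete, here is a minimal sketch of erasure-coded repair at the storage layer. Real Ceph erasure-coded pools use Reed-Solomon over k data chunks plus m coding chunks; this toy uses a single XOR parity chunk (the m=1 case) purely to show the idea that any one lost or corrupted chunk can be rebuilt from the survivors. The function names and chunk count are illustrative, not Ceph's API.

```python
# Simplified analogue of erasure-coded repair (single XOR parity, i.e. m=1).
# Real Reed-Solomon coding generalizes this to tolerate m lost chunks.

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int = 4) -> list[bytes]:
    """Split data into k equal chunks and append one parity chunk."""
    size = -(-len(data) // k)                              # ceil division
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = chunks[0]
    for c in chunks[1:]:
        parity = xor_bytes(parity, c)
    return chunks + [parity]

def repair(chunks: list[bytes], lost: int) -> bytes:
    """Recompute chunk `lost` from every other chunk (data or parity)."""
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    rebuilt = survivors[0]
    for c in survivors[1:]:
        rebuilt = xor_bytes(rebuilt, c)
    return rebuilt

chunks = encode(b"archive bytes stored on the cluster", k=4)
assert repair(chunks, lost=2) == chunks[2]                 # lost chunk rebuilt
```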
Even if XZ were able to recover from some forms of data corruption, is it realistic that such repairs propagate back to the underlying data store? Likely not.
You are thinking in the wrong direction. I tried to explain, but maybe I can be clearer:
If you can't read the data in question, though, you cannot do the repairs; it doesn't matter whether you do Reed-Solomon coding or not. You are thinking about coding against failures in the underlying hardware, which is the kind of data corruption the Reed-Solomon scheme you are talking about is designed to fix - it does not solve the problem of bad writes coming from above.
To do that, you would actually have to decode the data in question and perform a Reed-Solomon encoding on the actual file inside the archive, and this only gets worse with, e.g., nested archives.
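Here is a minimal sketch of that layering problem, using a hypothetical toy block store rather than any real filesystem API: the layer below recomputes its integrity data for whatever bytes arrive, so a bogus overwrite coming from above verifies perfectly.

```python
# Sketch: the storage layer checksums whatever it is told to write, so a
# corrupted or nonsense write from the application is indistinguishable
# from a legitimate one and "scrubs" clean.
import hashlib

class BlockStore:
    """Toy lower layer: stores bytes plus a checksum it maintains itself."""
    def __init__(self):
        self.blocks: dict[str, tuple[bytes, str]] = {}

    def write(self, name: str, payload: bytes) -> None:
        # The layer below has no idea what the payload means; it just
        # (re)computes its own integrity data for whatever arrives.
        self.blocks[name] = (payload, hashlib.sha256(payload).hexdigest())

    def scrub(self, name: str) -> bool:
        payload, digest = self.blocks[name]
        return hashlib.sha256(payload).hexdigest() == digest

store = BlockStore()
store.write("backup.tar.xz", b"\xfd7zXZ\x00...")   # the real archive
store.write("backup.tar.xz", b"GIF89a cat.gif")    # bogus overwrite from above
print(store.scrub("backup.tar.xz"))                # True: the layer can't tell
```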
If the data is self-referentially repairable, however, it doesn't matter if the file gets overwritten with, e.g., a cat gif; the format will work around that. The filesystem, on the other hand, will have written the cat gif to the file and updated the Reed-Solomon encoding for your file, assuming (incorrectly) that the writes were valid.
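For illustration, here is a hedged sketch of a hypothetical self-repairing container (not how .xz actually behaves - xz only detects corruption, it cannot repair it): each chunk carries its own checksum and a redundant copy, so the reader can detect a damaged copy and fall back to the intact one, regardless of which layer mangled it.

```python
# Hypothetical self-repairing container: two copies per chunk plus a
# per-chunk checksum, repaired transparently at read time.
import hashlib

def pack(chunks: list[bytes]) -> list[dict]:
    return [{"copies": [c, c], "sha256": hashlib.sha256(c).hexdigest()}
            for c in chunks]

def read(container: list[dict]) -> bytes:
    out = []
    for record in container:
        good = next((c for c in record["copies"]
                     if hashlib.sha256(c).hexdigest() == record["sha256"]), None)
        if good is None:
            raise IOError("both copies damaged; unrecoverable chunk")
        out.append(good)
    return b"".join(out)

container = pack([b"chunk-0", b"chunk-1"])
container[1]["copies"][0] = b"GIF89a"      # one copy clobbered from above
print(read(container))                     # b'chunk-0chunk-1' -- repaired on read
```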
I suppose you could mandate that any file written to your filesystem must first be completely decompressed, and then store some encoding information alongside the archive, but this would be inefficient in the extreme, since merely copying a file onto the system would mean you have to decompress the file and then checksum it.
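To make the cost concrete, a sketch of what that ingest path implies (the path name is hypothetical): every copy onto the filesystem means streaming the whole .xz through the decompressor just to compute integrity data over the decompressed contents.

```python
# Copying an .xz file under that scheme means a full LZMA decode on every
# copy, with CPU cost proportional to the decompressed size.
import hashlib
import lzma

def ingest(path: str) -> str:
    """Return a checksum of the decompressed contents of an .xz file."""
    digest = hashlib.sha256()
    with lzma.open(path, "rb") as stream:
        for block in iter(lambda: stream.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()

# ingest("backup.tar.xz")   # hypothetical path
```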
At any rate, even if you did decompress the file in question, you have failed to separate the layers like you want to, since now you have mandated that the XZ and LZMA algorithms also be baked directly into the filesystem itself.
Better not to needlessly couple the filesystem to some compression algorithm; let the compression system handle its own error correction.
You use CoW at the filesystem level and take a snapshot every 5 minutes. Your concern about writes coming from above is gone.
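A minimal sketch of that approach, assuming a btrfs subvolume at /data (paths and interval are illustrative): take periodic read-only snapshots so a bad overwrite can be rolled back by pulling the file out of an earlier snapshot.

```python
# Periodic read-only btrfs snapshots; rollback of a bad write is manual
# (copy the file back out of an earlier snapshot).
import subprocess
import time
from datetime import datetime, timezone

SUBVOLUME = "/data"                    # assumed layout; adjust as needed
SNAPSHOT_DIR = "/data/.snapshots"

def take_snapshot() -> None:
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    subprocess.run(
        ["btrfs", "subvolume", "snapshot", "-r",
         SUBVOLUME, f"{SNAPSHOT_DIR}/{stamp}"],
        check=True,
    )

while True:
    take_snapshot()
    time.sleep(300)                    # every 5 minutes
```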
The point about being able to use different media like Blu-ray drives is valid, but since xz doesn't do any error correction it doesn't really matter; it has to be done out-of-band anyway.
Right, but my concerns about an overly complex system coming in as a stand-in for a simple one are not.
The simple, cost-effective thing is not to engineer a complex redundancy system above and below to try to adhere to some misguided "separation of concerns"; it's to use the simplest, most effective solution which presents itself.
When you try to separate things which should not be separated in a software (or other) system, you get high coupling and low cohesion. Not everything should be "de-coupled".