I find it very, very hard to go wrong with Syncthing (for stuff I truly need replicated, code/photos/text-records) and ZFS + znapzend + rsync.net (automatic snapshots of `/home` and `/var/lib` on servers).
The only thing missing: I'd like to stop syncing code with Syncthing and instead build some smarter daemon. The daemon would take a manifest of repositories, each with a mapping of worktrees->branches to be actualized and fsmonitored. The daemon would auto-commit changes on those worktrees into a shadow branch and push/pull it. Ideally this could leverage (the very amazing, you must try it) `jj` for continuous committing of the working copy and (in the future, with the native jj format) even handle the likely-never-to-happen conflict scenario. (I'd happily collaborate on a Rust impl and/or donate funds to one.)
Given the number of worktrees I have of some huge repos (nixpkgs, linux, etc.), it would likely mean a significant reduction in CPU/disk usage compared to what Syncthing has to do now to monitor/rescan as much as I'm asking it to (it has to dumb-sync .git, syncs gitignored content, etc., etc.).
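To make that concrete, a purely hypothetical sketch of what such a manifest could look like (nothing here exists; the format, field names, and paths are all made up):

```toml
# hypothetical manifest for the imagined sync daemon -- nothing here is real
[[repo]]
url = "https://github.com/NixOS/nixpkgs"

# worktree path -> branch to keep checked out and fsmonitored
[repo.worktrees]
"~/code/nixpkgs" = "master"
"~/code/nixpkgs-staging" = "staging"

[[repo]]
url = "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git"
[repo.worktrees]
"~/code/linux" = "master"
```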
> Given the number of worktrees I have of some huge repos (nixpkgs, linux, etc.), it would likely mean a significant reduction in CPU/disk usage compared to what Syncthing has to do now to monitor/rescan as much as I'm asking it to (it has to dumb-sync .git, syncs gitignored content, etc., etc.).
Are you really hitting that much of a resource utilization issue with syncthing though? I use it on lots of small files and git repos and since it uses inotify there's not really much of a problem. I guess the worst case is switching to very different branches frequently, or committing very large (binary?) files where it may need to transfer them twice, but this hasn't been a problem in my own experience.
I'm not sure you could really do a whole lot better than syncthing by being clever, and it strikes me as a lot of effort to optimize for a specific workflow.
Edit: actually, I wonder if you could just exclude the working copies with a clever exclude list in syncthing, such that you'd ONLY grab .git so you wouldn't even need the double transfer/storage. You risk losing uncommitted work I suppose.
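For example, something along these lines as the folder's `.stignore` might do it, though I haven't checked how the scanner behaves with negated patterns at scale (a sketch only):

```
// .stignore sketch: keep only the .git directories, ignore everything else
!**/.git
*
```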
Replication to another machine that has a COW file system with snapshots is backup though :-)
We back up our data storage for an entire HPC cluster, about 2 PiB of it, to a single machine with 4 disk shelves running ZFS with snapshots. It works very well. Simple rsync every night, and snapshotted.
We use the backup as a sort of Time Machine should we need data from the past that we deleted in the primary. Plus, we don't need to wait for the tapes to load or anything; it is pretty fast and intuitive.
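Conceptually the nightly job is just something like the following; hostnames, pool, and dataset names are placeholders rather than any real setup:

```sh
# pull the primary storage onto the ZFS-backed machine, then snapshot it
rsync -aH --delete primary:/export/data/ /tank/backup/data/
zfs snapshot tank/backup@nightly-$(date +%Y-%m-%d)
```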
The person you're replying to said "Syncthing ... and ZFS + znapzend + rsync.net" though. You're ignoring the rsync.net part.
I have something similar; it's Nextcloud + restic to AWS S3, but it's the same principle. You can give people the convenience and human-comprehensibility of sync-based sharing, but also back that up, for the best of both worlds. In my case the odds of needing "previous versions" of things approach zero and a full sync is fairly close to a backup, but even so, I do have a full solution here.
When I mentioned de-duping and append-only logs, I had this in mind. It's hard to imagine implementing a backup system with those two properties that doesn't include snapshotting almost by design necessity.
(Beyond even the fact that ~/code is also on a ZFS volume that is snapshotted and replicated off-site, which I argue can be used in all of the same important ways any other "backup" is used.)
Hence the comment! After all this blockchain hoopla and everyone's understanding of how "cool" Git is, we really, really deserve better in our backup tools.
But, it makes things easy. I have e.g. a home computer, a server-in-the-closet thing, a laptop and a work computer, all with a shared Syncthing folder.
So to bolster that, I just have a simple bash script that reminds me every 7 days to make a copy of that folder somewhere else on that machine. It's not precise because I often don't know what machine I'll be using, but that creates a natural staggering that I figure should be sufficient if something goes weird and I lose something; I'm likely to have an old copy somewhere.
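The general shape of such a reminder script, with the paths and the 7-day window purely as an illustration rather than the actual script:

```sh
#!/usr/bin/env bash
# sketch: nag me if it's been more than 7 days since the last manual copy
STAMP="$HOME/.last-sync-copy"

if [ ! -e "$STAMP" ] || [ -n "$(find "$STAMP" -mtime +7)" ]; then
  echo "Reminder: copy ~/Sync somewhere else on this machine, then run: touch $STAMP"
fi
```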
What is the actual difference between a backup and replication? If the 1’s and 0’s are replicated to a different host, is that any different than “backing up” (replicating them) to a piece of external media?
> What is the actual difference between a backup and replication?
Simplest way to think about it is that a backup must be an immutable snapshot in time. Any changes and deletions which happen after that point in time will never reflect back onto the backup.
That way, any files you accidentally delete or corrupt (or other unwanted changes, like ransomware encrypting them for you) can be recovered by going back to the backup.
Replication is very different: you intentionally want all ongoing changes to replicate to the multiple copies for availability. But it means that unwanted changes or data corruption happily replicate to all the copies, so now all of them are corrupt. That's when you reach for the most recent backup.
That's why you always need to back up, and you'll usually want to replicate as well.
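A concrete illustration of the difference, using ZFS since it comes up elsewhere in the thread (pool and dataset names are placeholders): a snapshot taken before the damage is read-only, so a later deletion or ransomware run can't touch it.

```sh
# take an immutable point-in-time snapshot of the dataset
zfs snapshot tank/home@before-damage

# later, after an accidental delete (or ransomware), pull files back out
cp /tank/home/.zfs/snapshot/before-damage/important.txt /tank/home/

# or roll the whole dataset back to the snapshot
zfs rollback tank/home@before-damage
```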
When those 1s and 0s are deleted and that delete is replicated (or other catastrophic change, such as ransomware) you presumably don't have the ability to restore if all you're doing is replication. A strategy that layers replication + backup/versioning is the goal.
I'll add that _usually_ a backup strategy includes generational backups of some kind. That is, daily, weekly, monthly, etc., to hedge against individually impacted files, as mentioned.
Ideally there is also an offsite and inaccessible from the source component to this strategy. Usually this level of robustness isn't present in a "replication" setup.
Put more simply, backups account for and mitigate the common risks to data during storage while minimizing costs; ransomware is one of those common risks. It's organization-dependent, based on costs and available budget, so it varies.
Long-term storage usually has some form of Forward Error Correction (FEC) protection scheme (for bitrot), and backups are often segmented, which may be a mix of full and incremental or delta backups (to mitigate cost), with corresponding offline components (for ransomware resiliency), but that too is very dependent on the environment as well as the strategy being used for data minimization.
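For the bitrot/FEC piece specifically, one common approach is to generate parity data alongside the backup segments, e.g. with par2; the redundancy level and filenames below are arbitrary:

```sh
# create ~10% recovery data for a backup segment
par2 create -r10 backup-2024-06.tar.par2 backup-2024-06.tar

# later: verify, and repair from the parity files if blocks have rotted
par2 verify backup-2024-06.tar.par2
par2 repair backup-2024-06.tar.par2
```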
> Usually this level of robustness isn't present in a "replication" setup.
Exactly, and thinking about replication as a backup often also gives those using it a false sense of security in any BC/DR situations.
I use Syncthing between Mac, Windows (I've had Linux in the mix at one point too), and my Synology NAS. Syncthing is more for my short-term backup though. I will either commit it to a repo, save it to a Synology share, or delete it.
*edit* my gitea server saves its backups to synology
Yes. I just let Syncthing sync among devices, using it to create copies of the backup. The daily backup scripts do their thing and create one backup snapshot, then Syncthing picks up the new backup files and propagates them to multiple devices.
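The moving parts in a setup like that are roughly a scheduled job writing one dated archive into a Syncthing-shared directory; paths and schedule below are illustrative only:

```sh
# crontab entry: run the backup script at 02:00 every day
# 0 2 * * * /home/me/bin/daily-backup.sh

# daily-backup.sh (sketch): write one dated archive into the shared folder;
# Syncthing then replicates it to the other devices on its own
tar czf "$HOME/Sync/backups/home-$(date +%F).tar.gz" -C "$HOME" Documents Projects
```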
Sparkleshare does something kind of similar. It uses git as the backend to automatically sync directories across a few computers. https://www.sparkleshare.org/
My two 80%-full 1 TB laptops and a 1 TB desktop back up to around 300-400 GB after dedupe and compression. Currently have around 12 TB of backups stored in that 300 GB.
Incremental backups run in about 5 mins, even against the spinning disks they're stored on.
Python programmer here, but I actually prefer Restic [0]. While more or less the same experience, the huge selling point to me is that the backup program is a single executable that can be easily stored alongside the backups. I do not want any dependency/environment issues to assert themselves when restoration is required (which is most likely on a virgin, unconfigured system).
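To illustrate the self-contained point: the whole restore path is one static binary plus the repository, so a copy of the binary can live next to the repo. The repo path and restore target below are just an example:

```sh
# initialize a repository and keep the restic binary alongside it
restic init --repo /mnt/backup/restic-repo
cp "$(command -v restic)" /mnt/backup/restic

# back up a home directory
restic --repo /mnt/backup/restic-repo backup ~/Documents

# on a fresh machine: list snapshots and restore with the stored binary
/mnt/backup/restic --repo /mnt/backup/restic-repo snapshots
/mnt/backup/restic --repo /mnt/backup/restic-repo restore latest --target /restore
```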
I've been using Borg, Restic and Kopia for a long time and Kopia is my personal favorite - very fast, very efficient, runs in the background automatically without having to schedule a cron job or anything like that.
Only downside is that the backups are made of a HUGE number of files, so when synchronizing it can sometimes take a bit of time to check the ~5k files.
No, I distinctly don't want borg. It doesn't help or solve anything that Syncthing doesn't do. The obsession with borg and bup is pretty baffling to me. We deserve better in this space. (see: Asuran and another whose name I forget...)
Critically, I'm specifically referring to code sync that needs to operate at a git-level to get the huge efficiencies I'm thinking of.
Syncthing, or borg, scanning 8 copies of the Linux kernel is pretty horrific compared to something doing a "git commit && git push" and "git pull --rebase" in the background (over-simplifying the shadow-branch process here for brevity.)
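Spelled out, that over-simplified background loop per worktree would be roughly the following; branch naming and conflict handling are hand-waved, this is the brevity version rather than a real design:

```sh
# on machine A, inside the worktree, triggered by fsmonitor or a timer
git add -A
git commit -m "autosync $(date -Is)" || true      # no-op if nothing changed
git push origin HEAD:refs/heads/shadow/machine-a

# on machine B, picking up A's shadow branch
git fetch origin
git rebase origin/shadow/machine-a
```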
re: 'we deserve better' -- case in point, see Asuran - there's no real reason that sync and backup have to be distinctly different tools. Given chunking and dedupe and append-only logs, we really, really deserve better in this tooling space.
I don't think GP was talking about backups (which is what Borg is good for) but about synchronization between machines which is another issue entirely.
They work together. I use syncthing to keep things synchronized across devices, including to an always-on "master" device that has more storage. Then borg runs on the master device to create backups.
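As a sketch of the borg half of that arrangement (repo path, archive naming, and retention are just examples):

```sh
# one-time setup on the always-on "master" device
borg init --encryption=repokey /srv/borg/sync-backup

# nightly: snapshot the Syncthing folder into a new archive, then prune old ones
borg create --stats /srv/borg/sync-backup::'{hostname}-{now:%Y-%m-%d}' ~/Sync
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /srv/borg/sync-backup
```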