There still are. As someone who has done both production and homelab deployments: unless you're specifically looking for experience with it or are just setting up a demo, don't bother.
When it works, it works great - when it goes wrong it's a huge headache.
Edit: If distributed storage is something you're interested in for its own sake, there are much better options for a homelab setup:
- seaweedfs has been rock solid for me for years at both small and huge scales. We actually moved our production Ceph setup to this.
- longhorn was solid for me when I was in the k8s world
- glusterfs is still fine as long as you know what you are going into.
I just want to hoard data. I hate having to delete stuff to make space. Things disappear from the web every day. I should hold onto them.
My requirements for a storage solution are:
> Single root file system
> Storage device failure tolerance
> Gradual expansion capability
The problem with every storage solution I've ever seen is the lack of gradual expandability. I'm not a corporation, I'm just a guy. I don't have the money to buy 200 hard disks all at once. I need to gradually expand capacity as needed.
I was attracted to Ceph because it apparently allows you to throw a bunch of drives of any make and model at it and it just pools them all up without complaining. The complexity is nightmarish though.
ZFS is nearly perfect but when it comes to expanding capacity it's just as bad as RAID. Expansion features have seemed to be just about to land for quite a few years now. I remember getting excited after seeing news about it here, only for people to deflate my expectations. Btrfs has a flexible block allocator, which is just what I need, but... it's btrfs.
> ZFS is nearly perfect but when it comes to expanding capacity it's just as bad as RAID.
if you don't mind the overhead of a "pool of mirrors" approach [1], then it is easy to expand storage by adding pairs of disks! This is how my home NAS is configured.
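For a sense of scale, the expansion step really is a single operation per new pair. A minimal sketch (pool name "tank" and the by-id device paths are placeholders for your own setup):

    # Minimal sketch: grow a pool-of-mirrors zpool by one disk pair.
    # "tank" and the by-id paths are placeholders for your own setup.
    import subprocess

    def add_mirror_pair(pool: str, disk_a: str, disk_b: str) -> None:
        # Appends a new two-way mirror vdev; the extra capacity is usable
        # immediately and no resilver of existing data is triggered.
        subprocess.run(["zpool", "add", pool, "mirror", disk_a, disk_b],
                       check=True)

    add_mirror_pair("tank",
                    "/dev/disk/by-id/ata-NEW_DISK_1",
                    "/dev/disk/by-id/ata-NEW_DISK_2")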
Looks good. I don't mind the overhead. That seems to be much more resilient compared to RAID5/6 and its ZFS equivalents and addresses all the concerns I outlined in this comment:
I'm still somewhat alarmed by the possibility of the surviving mirror drive failing during resilver and destroying the pool... Are there any failure chance calculators for this pool of mirrors topology? No doubt it's much lower than the RAID5/6 setups but still.
Is this topology usable in btrfs without the famous reliability issues? How good is ZFS support on Linux? I'm a Linux guy so I'd really like to keep using Linux if possible. Maybe Linux LVM/mdadm?
50% storage efficiency is a tough pill to swallow, but drives are pretty big and the ability to expand as you go means it can be cheaper in the long run to just buy the larger, new drives coming out than pay upfront for a bunch of drives in a raidz config.
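Rough numbers to make the trade-off concrete (8 equal 16TB drives is an arbitrary example, and ZFS overhead is ignored):

    # Ballpark usable capacity for 8 equal 16TB drives; ignores ZFS slop/
    # metadata overhead, so treat the figures as illustrative only.
    n_drives, size_tb = 8, 16

    pool_of_mirrors = (n_drives // 2) * size_tb   # 4 x 2-way mirrors
    raidz2_8wide    = (n_drives - 2) * size_tb    # single 8-wide raidz2

    print(f"pool of mirrors: {pool_of_mirrors} TB usable (50%)")  # 64 TB
    print(f"raidz2, 8-wide:  {raidz2_8wide} TB usable (75%)")     # 96 TB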
ZFS using mirrors is extremely easy to expand. Need more space and you have small drives? Replace the drives in a mirror one by one with bigger ones. Need more space and already have huge drives? Just add another vdev mirror. And the added benefit of not living in fear of drive failure while resilvering as it is much faster with mirrors than raidX.
Sure, the density isn't great as you're essentially running at 50% of raw storage, but - touches wood - my home zpool has been running strong for about a decade doing the above, from 6x 6TB drives (3x 6TB mirrors) to 16x 10-20TB drives (8x mirrors, differing drive sizes but matched per mirror, like a 10TB x2 mirror, a 16TB x2 mirror, etc).
Edit: Just realised someone else has already mentioned a pool of mirrors. Consider this another +1.
> Replace the drives in a mirror one by one with bigger ones.
That's exactly what I meant by "just as bad as RAID". Expanding an existing array is analogous to every single drive in the array failing and getting replaced with higher capacity drives.
When a drive fails, the array is in a degraded state. Additional drive failures put the entire system in danger of data loss. The rebuilding process generates enormous I/O load on all the disks. Not only does it take an insane amount of time, but according to my calculations the probability of read errors happening during the recovery process is about 3%. Such expansion operations have a real chance of destroying the entire array.
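Roughly the calculation I mean, with placeholder figures (swap in your own drives' rated URE spec and the amount of data your rebuild actually reads):

    # Chance of hitting at least one unrecoverable read error (URE) while
    # reading the whole surviving array back during a rebuild.
    # The rate and size below are placeholder figures, not my actual array.
    import math

    ure_per_bit = 1e-15        # rated errors per bit read (enterprise class)
    rebuild_tb  = 40           # data that must be read during the rebuild

    bits_read = rebuild_tb * 1e12 * 8
    # 1 - (1 - p)^n, via the exponential approximation for numerical stability
    p_ure = -math.expm1(-ure_per_bit * bits_read)
    print(f"P(>=1 URE during rebuild) ~ {p_ure:.1%}")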
That's not the case with mirrored vdevs. There is no degradation of the array with a failed drive in a mirrored vdev; it continues humming along perfectly fine.
Also, resilvers are not as intensive when rebuilding a mirror, as you are just copying from one disk in the vdev to the other, not reading from all X other drives and recalculating parity at the same time. This means fewer reads across the entire array and much, much quicker resilver times, and thus a smaller window for drive failure.
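To put a rough number on that smaller window, here's a back-of-the-envelope sketch (the AFR and resilver time are just example figures):

    # Rough chance the surviving disk of a mirror dies inside the resilver
    # window, assuming a constant failure rate. AFR and resilver duration
    # are example figures; substitute your drive stats and measured times.
    import math

    afr            = 0.02    # annual failure rate of the surviving disk
    resilver_hours = 12      # time to copy one disk's worth of data

    p_fail = -math.expm1(-afr * resilver_hours / (24 * 365))
    print(f"P(surviving disk fails during resilver) ~ {p_fail:.3%}")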
I see. That addresses my concerns, and it's starting to make a lot of sense. I'm gonna study this in depth, starting with the post you linked. Thank you.
I've run Ceph at home since the jewel release. I migrated to it after running FreeNAS.
I use it for RBD volumes for my OpenStack cluster and for CephFS, with a total raw capacity of around 350TiB. Around 14TiB of that is NVMe storage for RBD and CephFS metadata. The rest is rust. This is spread across 5 nodes.
I'm currently only buying 20TB Exos drives for rust. SMR (and I think host-managed SMR too) is a no-go for Ceph, as are non-enterprise SSDs, so storage is expensive. I do have a mix of disks though, as the cluster has grown organically, so there are a few 6TB WD Reds in there from before their SMR shift.
My networks for OpenStack, Ceph and the Ceph backend are all 10Gbps. With the flash storage I get about 8GiB/s when repairing. With rust it is around 270MiB/s. The bottleneck, I think, is due to 3 of the nodes running on first-gen Xeon-D boards, and the few Reds do slow things down too. The 4th node runs an AMD Rome CPU, and the newest an AMD Genoa CPU. So I am looking at about 5k CAD a node before disks. I colocate the MDS, OSDs and MONs, with 64GiB of RAM each. Each node gets 6 rust and 2 NVMe drives.
Complexity-wise, it's pretty simple. I deployed the initial iteration by hand, and then when cephadm was released I converted it daemon by daemon smoothly. I find that on the mailing lists and Reddit, most of the people encountering problems deployed it via Proxmox and don't really understand Ceph because of it.
If you're willing to use mirror vdevs, expansions can be done two drives at a time. Also, depending on how often your data changes, you should check out SnapRAID. It doesn't have all the features of ZFS, but it's perfect for stuff that rarely changes (media or, in your case, archiving).
Also, unionfs or similar can let you merge ZFS and SnapRAID into one unified filesystem, so you can place important data in ZFS and unchanging archive data in SnapRAID.
On a single host, you could do this with LVM. Add a pair of disks, make them a RAID 1, create a physical volume on them, then a volume group, then a logical volume with XFS on top. To expand, you add a pair of disks, RAID 1 them, and add them to the LVM. It's a little stupid, but it would work.
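Roughly, as a sketch only (all device names, VG/LV names and the mount point are placeholders; this is illustrative, not a tested provisioning script):

    # Sketch of the mdadm RAID1 -> LVM -> XFS stack described above, plus
    # the expansion step. All device paths and names are placeholders.
    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    # Initial pair: RAID1, then PV -> VG -> LV, then XFS on top.
    run("mdadm", "--create", "/dev/md0", "--level=1", "--raid-devices=2",
        "/dev/sda", "/dev/sdb")
    run("pvcreate", "/dev/md0")
    run("vgcreate", "vg_data", "/dev/md0")
    run("lvcreate", "-l", "100%FREE", "-n", "lv_data", "vg_data")
    run("mkfs.xfs", "/dev/vg_data/lv_data")
    run("mount", "/dev/vg_data/lv_data", "/mnt/data")

    # Expansion: a second RAID1 pair joins the VG, then the LV and the
    # filesystem grow online (XFS can grow but never shrink).
    run("mdadm", "--create", "/dev/md1", "--level=1", "--raid-devices=2",
        "/dev/sdc", "/dev/sdd")
    run("pvcreate", "/dev/md1")
    run("vgextend", "vg_data", "/dev/md1")
    run("lvextend", "-l", "+100%FREE", "/dev/vg_data/lv_data")
    run("xfs_growfs", "/mnt/data")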
If multiple nodes are not off the table, also look into seaweedfs.
Also consider how (or if) you are going to back up your hoard of data.
> Also consider how (or if) you are going to back up your hoard of data.
I actually emailed Backblaze years ago about their supposedly unlimited consumer backup plan and asked them if they would really allow me to dump dozens of terabytes of encrypted, undeduplicable data into their systems. They responded that yes, they would. I still didn't believe them; these corporations never really mean it when they say unlimited. Plus they had no Linux software.
EOS (https://cern.ch/eos, https://github.com/cern-eos/eos) is probably a bit more complicated than other solutions to set up and manage, but it does allow adding/removing disks and nodes serving data on the fly. This is essential to let us upgrade hardware of the clusters serving experimental data with minimal to no downtime.
Not sure what the multi-disk consensus is for btrfs nowadays, but adding/removing devices is trivial, you can do "offline" dedupe, and you can rebalance data if you change the disk config.
As an added bonus it's also in-tree, so you don't have to worry about kernel updates breaking things.
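For reference, the add/remove/rebalance flow is only a few commands; a rough sketch (the mount point and device paths are placeholders):

    # Sketch of growing/shrinking a multi-device btrfs filesystem.
    # Mount point and device paths are placeholders.
    import subprocess

    def run(*cmd):
        subprocess.run(cmd, check=True)

    mnt = "/mnt/pool"

    # Add a disk to the mounted filesystem; its space is usable right away.
    run("btrfs", "device", "add", "/dev/sdx", mnt)

    # Rebalance so data/metadata are spread (and optionally converted to a
    # different profile, e.g. raid1) across the new disk layout.
    run("btrfs", "balance", "start", "-dconvert=raid1", "-mconvert=raid1", mnt)

    # Removing a device migrates its data off before detaching it.
    run("btrfs", "device", "remove", "/dev/sdx", mnt)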
I think you can also potentially do btrfs+LVM and let LVM manage the multi-device side. Not sure what performance looks like there, though.
> glusterfs is still fine as long as you know what you are going into.
Does that include storage volumes for databases? I was using GlusterFS as a way to scale my Swarm cluster horizontally, and I am reasonably sure that it corrupted one database to the point that I lost more than a few hours of data. I was quite satisfied with the setup until I hit that.
I know I'm considered crazy for sticking with Docker Swarm until now, but aside from this lingering issue with how to manage stateful services, I honestly don't feel the need to move to k8s yet. My cluster is ~10 nodes running < 30 stacks, and it's not like I have tens of people working with me on it.
> Storage optimizations: erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication.
This is probably a no-go for most use cases where you work with large datasets...
Minio doesn't make any sense to me in a homelab. Unless I'm reading it wrong it sounds like a giant pain to add more capacity while it is already in use. There's basically no situation where I'm more likely to add capacity over time than a homelab.
You get a new NAS (a MinIO server pool), you plug it into your homelab (site replication), and now it's part of the distributed MinIO storage layer (k8s is happy). How is that hard? It's the same basic thing for Ceph or any distributed JBOD mass storage engine. MinIO has some funkiness with how you add more storage, but it's totally capable of doing it while in use. Everything is atomic.
Ceph is sort of a storage all-in-one: it provides object storage, block storage, and network file storage. May I ask, which of these are you using seaweedfs for? Is it as performant as Ceph claims to be?
I really wish there was a benchmark comparing all of these + MinIO and S3. I'm in the market for a key value store, using S3 for now but eyeing moving to my own hardware in the future and having to do all the work to compare these is one of the major things making me procrastinate.
Minio gives you "only" S3 object storage. I've setup a 3-node Minio cluster for object storage on Hetzner, each server having 4x10TB, for ~50€/month each. This means 80TB usable data for ~150€/month. It can be worth it if you are trying to avoid egress fees, but if I were building a data lake or anything where the data was used mostly for internal services, I'd just stick with S3.
minio is good but you really need fast disks.
They also really don't like it when you want to change the size of your cluster setup.
There's no plan to add cache disks; they just say to use faster disks.
I have it running and it goes smoothly, but it's not really user-friendly to optimize.
Note that the Red Hat Gluster Storage product has a defined support lifecycle through to 31-Dec-24, after which the Red Hat Gluster Storage product will have reached its EOL. Specifically, RHGS 3.5 represents the final supported RHGS series of releases.
For folks using GlusterFS currently, what's your plan after this year?
Curious, what do you mean by "know what you are going into" re glusterfs?
I recently tried Ceph in a homelab setup, gave up because of the complexity, and settled on GlusterFS. I'm not a pro though, so I'm not sure if there are any shortcomings that are clear to everybody but me, which is why your comment caught my attention.