There still are. As someone who has done both production and homelab deployments...

matheusmoreira · on Jan 19, 2024

I just want to hoard data. I hate having to delete stuff to make space. Things disappear from the web every day. I should hold onto them.

My requirements for a storage solution are:

> Single root file system

> Storage device failure tolerance

> Gradual expansion capability

The problem with every storage solution I've ever seen is the lack of gradual expandability. I'm not a corporation, I'm just a guy. I don't have the money to buy 200 hard disks all at once. I need to gradually expand capacity as needed.

I was attracted to this ceph because it apparently allows you to throw a bunch of drives of any make and model at it and it just pools them all up without complaining. The complexity is nightmarish though.

ZFS is nearly perfect but when it comes to expanding capacity it's just as bad as RAID. Expansion features seem to be just about to land for quite a few years now. I remember getting excited about it after seeing news here only for people to deflate my expectations. Btrfs has a flexible block allocator which is just what I need but... It's btrfs.

chromatin · on Jan 20, 2024

> ZFS is nearly perfect but when it comes to expanding capacity it's just as bad as RAID.

if you don't mind the overhead of a "pool of mirrors" approach [1], then it is easy to expand storage by adding pairs of disks! This is how my home NAS is configured.

[1] https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs...

roygbiv2 · on Jan 20, 2024

This is also exactly how mine is done. Started off with a bunch of 2TB disks. I've now got a mixture of 16TB down to 4TB, all in the original pool.

matheusmoreira · on Jan 21, 2024

Looks good. I don't mind the overhead. That seems to be much more resilient compared to RAID5/6 and its ZFS equivalents and addresses all the concerns I outlined in this comment:

https://news.ycombinator.com/item?id=39068493

I'm still somewhat alarmed by the possibility of the surviving mirror drive failing during resilver and destroying the pool... Are there any failure chance calculators for this pool of mirrors topology? No doubt it's much lower than the RAID5/6 setups but still.

Is this topology usable in btrfs without the famous reliability issues? How good is ZFS support is on Linux? I'm a Linux guy so I'd really like to keep using Linux if possible. Maybe Linux LVM/mdadm?

Snow_Falls · on Jan 20, 2024

50% storage efficiency is a tough pill to swallow, but drives are pretty big and the ability to expand as you go means it can be cheaper in the long run to just buy the larger, new drives coming out than pay upfront for a bunch of drives in a raidz config.

deadbunny · on Jan 20, 2024

ZFS using mirrors is extremely easy to expand. Need more space and you have small drives? Replace the drives in a mirror one by one with bigger ones. Need more space and already have huge drives? Just add another vdev mirror. And the added benefit of not living in fear of drive failure while resilvering as it is much faster with mirrors than raidX.

Sure the density isn't great as you're essentially running at 50% or raw storage but - touches wood - my home zpool has been running strong for about a decade doing the above from 6x 6tb drives (3x 6tb mirrors) to 16x 10-20tb drives (8x mirrors, differing sized drives but matched per mirror like a 10tb x2 mirror, a 16tb x2 mirror etc).

Edit: Just realised someone else as already mentioned a pool or mirrors. Consider this another +1.

matheusmoreira · on Jan 20, 2024

> Replace the drives in a mirror one by one with bigger ones.

That's exactly what I meant by "just as bad as RAID". Expanding an existing array is analogous to every single drive in the array failing and getting replaced with higher capacity drives.

When a drive fails, the array is in a degraded state. Additional drive failures put the entire system in danger of data loss. The rebuilding process generates enormous I/O loads on all the disks. Not only does it take an insane amount of time, according to my calculations the probability of read errors happening during the error recovery process is about 3%. Such expansion operations have a real chance of destroying the entire array.

deadbunny · on Jan 20, 2024

That's not the case it mirrored vdevs. There is no degredatuon of the array with a failed drive in a mirrored vdev, it continues humming along perfectly fine.

Also resilvers are not as intensive when rebuilding a mirror as you are just copying from one disk in the vdev to the other, not all X other drives and recalculating parity at the same time. This means less reads across the entire array and much much quicker resilver times, thus less window for drive failure.

But don't just take my word for it. This is a blog post that go much into much more detail https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs...

matheusmoreira · on Jan 21, 2024

I see. That addresses my concerns, and it's starting to make a lot of sense. I'm gonna study this in depth, starting with the post you linked. Thank you.

sekh60 · on Jan 20, 2024

I've run Ceph at home since the jewel release. I migrated to it after running FreeNAS.

I use it for RBD volumes for my OpenStack cluster and for CephFS. With a total raw capacity of around 350TiB. Around 14 of that is nvme storage for RBD and CephFS metadata. The rest is rust. This is spread across 5 nodes.

I currently am only buying 20TB exos drives for rust. SMR and I think HSMR are both no goes for Ceph as are non enterprise SSDs, so storage is expensive. Ibdinhave a mix of disks though as the cluster has grown organically. So I have a few 6TB WD Reds in there, before their SMR shift.

My networks for OpenStack, Ceph and Ceph backend are all 10Gbps. With the flash storage when repairing I get about 8GiB/s a second. With rust it is around 270MiB/s. The bottle neck I think is due to 3 of the nodes running on first gen xeon-d boards, the the few Reds do slow things down too. The 4th node runs an AMD Rome CPU, and the newest an AMD Genoa cpu. So I am looking at about 5k CAD a node before disks. I colocate the MDS, OSDs and MONs, with 64GiB of ram each. Each node gets 6 rust, and 2 nvme drives.

Complexity is pretty simple. I deployed the initial iteration by hand, and then when cephadmin was released i converted it daemon by daemon smoothly. I find on the mailing lists and Reddit most of the people encountering problems deploy it via Proxmox and don't really understand Ceph because of it.

Snow_Falls · on Jan 20, 2024

If you're willing to use mirror vdevs, expansions can be done two drives at a time.Also, depending on how often your data changes, you should check out snapraid. Doesn't have all the features of ZFS but its perfect for stuff that rarely changes (media or, in your case, archiving).

Also unionfs or similar can let you merge zfs and snapraid into one unified filesystem so you can place important data in zfs and unchanging archive data in snapraid.

bityard · on Jan 20, 2024

On a single host, you could do this with LVM. Add a pair of disks, make them a RAID 1, create a physical volume on them, then a volume group, then a logical volume with XFS on top. To expand, you add a pair of disks, RAID 1 them, and add them to the LVM. It's a little stupid, but it would work.

If multiple nodes are not off the table, also look into seaweedfs.

Also consider how (or if) you are going to back up your hoard of data.

matheusmoreira · on Jan 20, 2024

> Also consider how (or if) you are going to back up your hoard of data.

I actually emailed backblaze years ago about their supposedly unlimited consumer backup plan. Asked them if they would really allow me to dump into their systems dozens of terabytes of encrypted undeduplicable data. They responded that yes, they would. Still didn't believe them, these corporations never really mean it when they say unlimited. Plus they had no Linux software.

nijave · on Jan 20, 2024

> these corporations never really mean it when they say unlimited. Plus they had no Linux software

Afaik they rely on the latter to mitigate the risk of the former.

Snow_Falls · on Jan 20, 2024

Considering the fact that most data heavy servers are llnux, that would be a pretty clever way of staying true to their word.

amadio · on Jan 20, 2024

EOS (https://cern.ch/eos, https://github.com/cern-eos/eos) is probably a bit more complicated than other solutions to setup and manage, but does allow to add/remove new disks and nodes serving data on the fly. This is essential to let us upgrade harware of the clusters serving experimental data with minimal to no downtime.

nijave · on Jan 20, 2024

Not sure what the multidisk consensus is for btrfs now-a-days but adding/removing devices is trivial, you can do "offline" dedupe, and you can rebalance data if you change the disk config.

As an added bonus it's also in-tree so you don't have to worry about kernel updates breaking things

I think you can also potentially do btrfs+LVM and let LVM manage multi device. Not sure what performance looks like there, though

matheusmoreira · on Jan 20, 2024

That's all great but btrfs parity striping is still unusable. How many more decades will it take?

rglullis · on Jan 19, 2024

> glusterfs is still fine as long as you know what you are going into.

Does that include storage volumes for databases? I was using glusterFS as a way to scale my swarm cluster horizontally and I am reasonably sure that it corrupted one database to the point I lost more than a few hours of data. I was quite satisfied with the setup until I hit that.

I know that I am considered crazy for sticking with Docker Swarm until now, but aside from this lingering issue with how to manage stateful services, I've honestly don't feel the need to move yet to k8s. My clusters is ~10 nodes running < 30 stacks and it's not like I have tens of people working with me on it.

camkego · on Jan 20, 2024

Docker Swarm seems to be underrated, from a simplicity and reliability perspective, IMHO.

reactordev · on Jan 19, 2024

I'd throw minio [1] in the list there as well for homelab k8s object storage.

[1] https://min.io/

speedgoose · on Jan 19, 2024

Also garage. https://garagehq.deuxfleurs.fr/

BlackLotus89 · on Jan 20, 2024

Garage seems to only to duplication https://garagehq.deuxfleurs.fr/documentation/design/goals/

> Storage optimizations: erasure coding or any other coding technique both increase the difficulty of placing data and synchronizing; we limit ourselves to duplication.

This is probably a nogo for most use cases where you work with large datasets....

speedgoose · on Jan 21, 2024

Yes I mentioned it in the context of homelabs, it’s not an alternative to Ceph or MinIO for large installations.

But the simplicity is appreciated. You can recover all the data using rsync.

plagiarist · on Jan 20, 2024

Minio doesn't make any sense to me in a homelab. Unless I'm reading it wrong it sounds like a giant pain to add more capacity while it is already in use. There's basically no situation where I'm more likely to add capacity over time than a homelab.

reactordev · on Jan 20, 2024

You get a new nas (minio server pool) and you plug it into your home lab (site replication) and now it's part of the distributed minio storage layer (k8s are happy). How is that hard? It's the same basic thing for Ceph or any distributed JBOD mass storage engine. Minio has some funkiness with how you add more storage but it's totally capable of doing it while in use. Everything is atomic.

bityard · on Jan 19, 2024

Ceph is sort of a storage all-in-one: it provides object storage, block storage, and network file storage. May I ask, which of these are you using seaweedfs for? Is it as performant as Ceph claims to be?

dataangel · on Jan 19, 2024

I really wish there was a benchmark comparing all of these + MinIO and S3. I'm in the market for a key value store, using S3 for now but eyeing moving to my own hardware in the future and having to do all the work to compare these is one of the major things making me procrastinate.

rglullis · on Jan 19, 2024

Minio gives you "only" S3 object storage. I've setup a 3-node Minio cluster for object storage on Hetzner, each server having 4x10TB, for ~50€/month each. This means 80TB usable data for ~150€/month. It can be worth it if you are trying to avoid egress fees, but if I were building a data lake or anything where the data was used mostly for internal services, I'd just stick with S3.

woopwoop24 · on Jan 19, 2024

minio is good but you really need fast disks. They also really don't like, when you want to change the size of your cluster setup. No plan to add cache disks, they just say use faster disks. I have it running, goes smoothly but not really user friendly to optimize

cholmon · on Jan 20, 2024

GlusterFS support looks to be permanently ending later this year.

https://access.redhat.com/support/policy/updates/rhs

Note that the Red Hat Gluster Storage product has a defined support lifecycle through to 31-Dec-24, after which the Red Hat Gluster Storage product will have reached its EOL. Specifically, RHGS 3.5 represents the final supported RHGS series of releases.

For folks using GlusterFS currently, what's your plan after this year?

sob727 · on Jan 20, 2024

Curious, what do you mean by "know what you go into" re glusterfs?

I recently tried ceph in a homelab setup, gave up because of complexity, and settled on glusterfs. I'm not a pro though, so I'm not sure if there's any shortcomings that are clear to everybody but me, hence why your comment caught my attention.

asadhaider · on Jan 19, 2024

I thought it was popular for people running Proxmox clusters

geerlingguy · on Jan 20, 2024

It is, and if you have a few nodes with at least 10 GbE networking, it's certainly the best clustered storage option I can think of.