
Everyone blaming Gentoo here is completely wrong. We had a very similar meltdown where the OSDs were hosted on RHEL 7: they would not start, and we had MDS failures on RHEL as well.

Ceph is just a buggy piece of shit. It configures its options like a Windows registry. At one point I literally built Ceph on my Gentoo system and ran the MDS over the VPN; it was the only MDS that wasn't crashing. Nothing about Gentoo is to blame here; if anything, Ceph is more stable on it.
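
(By "like a Windows registry" I mean the flat key-value namespace: every daemon carries hundreds of options you can dump or poke at runtime. Roughly, and from memory, something like this, where "mds.a" and the value are just placeholders:

  ceph daemon mds.a config show                          # dump every option the daemon currently holds
  ceph tell mds.a injectargs '--mds-cache-size 500000'   # flip a single key on a running daemon

)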




From reading the article, it sounds like many of their problems were related to several Ceph upgrades having happened, and possibly also an ABI-incompatible libstdc++ update rolling out. That's probably not the cause of your RHEL 7 issues, since RHEL is a much more conservative release.

It sounds to me like the choice of Gentoo was at least partially responsible for that 5-day outage.

RHEL/CentOS is pretty good about keeping things stable from a software-version perspective.

The downside is when you really do need (or just plain really want) newer versions of software. But at least then you can make the conscious decision to manually maintain a few packages, while the rest of the core stays stable.

I won't run anything but an LTS release in production. People always say "Well <my favorite distro> has release cycles every 2 years, and upgrading your systems that often should be fine." But I've been involved, not infrequently, in OS upgrades that take 6+ months to complete.

Most reasonably complex services (in my experience) take man-months of effort to go from, say, 12.04 to 16.04. Once you have 3 or 4 such services, plus all the non-upgrade work that needs to be done, you start running out of time in a 2-year release cycle to complete the upgrades.


This has been my experience with Ceph as well. Hell, the documented setup commands for Debian didn't even work out of the box.
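
(For reference, the Debian quickstart at the time boiled down to a handful of ceph-deploy commands, roughly like the following, from memory; the hostnames and device paths are placeholders:

  ceph-deploy new mon1
  ceph-deploy install mon1 osd1 osd2
  ceph-deploy mon create-initial
  ceph-deploy osd prepare osd1:/dev/sdb
  ceph-deploy osd activate osd1:/dev/sdb1

)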

I also agree with everyone that Gentoo was probably not a great choice, but Ceph doesn't even seem production-ready unless you have a dedicated team just to manage it.


Part-time sysadmin here (founder, so biz duties, dev duties, PM and product duties, so not a ton of time for sysadmin stuff) and I've had no major problems with Ceph. I had some performance issues, but I figured out how to tune scrubs appropriately for my workload, and found out the hard way that prosumer SSDs do not make good journal devices. I run a three-node, 18-OSD Ceph cluster and it's been no problem. I may eat my words when I do a rolling upgrade from Firefly to Jewel (multiple versions apart), but I'm not dreading that process much.
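
For anyone curious, scrub tuning is basically a handful of [osd] knobs in ceph.conf. An illustrative sketch (option names are from the Ceph docs, the values are made up for the example, and some of these only exist in newer releases):

  [osd]
  osd max scrubs = 1                 # at most one concurrent scrub per OSD
  osd scrub load threshold = 0.5     # skip scheduled scrubs when host load is high
  osd scrub begin hour = 1           # confine scrubbing to off-peak hours
  osd scrub end hour = 6
  osd deep scrub interval = 1209600  # stretch deep scrubs out to two weeks (seconds)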

I don't understand why people are complaining about the docs here. The docs are good. Read them, read the config options, and keep reading until you understand what those options actually do. Passively watch the mailing list. I think expecting to run your own petabyte-capable storage cluster without a little effort is a bit misguided...



