Anatomy of a Ceph meltdown (chneukirchen.org)
147 points by panic on Feb 9, 2018 | 104 comments



Most of this post-mortem has nothing to do with Ceph or Gentoo, and everything to do with competent system administration and change management. No one appears to have stopped to consider the ramifications of unannounced upgrades to the fileserver or OSDs. Once you're frantically rebuilding world, it's over. The lessons learned completely ignore communication between sysadmins and management or, god forbid, even customers. "Do everything whenever you like" is not system administration.

>If we notice something is going wrong with Ceph, we will not hesitate to shut down the cluster prematurely.

Once bitten, twice shy? How does this solve future problems in the HA filesystem for your customer?

>We should not update Ceph on all machines at once.

HA/HPC 101. Rolling upgrades are fine: slow and steady, with lots of testing and a documented rollback procedure if things go screwy. Large enough/critical systems often have dev, test, and prod environments.

>We will build glibc with debugging symbols.

You will plan your OS upgrades separately from your Ceph upgrades, and with properly communicated actions before, during, and after the events.

>We will track Ceph releases more closely

As opposed to... what? Being a sysadmin, I'm on more than a dozen release mailing lists. I track load balancer known issues and patches to make sure I don't roll out, say, an F5 firmware with a known failure to properly proxy HTTP/2.


To be fair, the author says right at the beginning:

"Please remember that we are all unpaid volunteers who have our own studies and/or day jobs, and no one has had more experience with Ceph than what you get from reading the manual."

So while you might have a valid point, it's hard to develop those skills if you are not a proper sysadmin. If you are just an unpaid volunteer, it most likely means that the environment you work in can't afford (or won't pay for) actual sysadmins whose paid job is to know these things.

Maybe the discussion should be, why were there unpaid volunteers doing sysadmin work instead of an actual sysadmin, but I guess that is a normal occurrence in universities all over the world?


They also said:

> First and foremost, having a multiple-day spanning downtime is completely unacceptable for a central service like this [...]

Which goes to your last point a bit. It's not a normal occurrence at all universities, but it is at some. If it were me, I'd turn this thing off repeatedly until the people who used it realized its value and tried to dedicate a bit of manpower to it.


Considering sysadmins often have keys to the kingdom, unpaid volunteers for sysadmin work seems really risky to me. Want volunteers to design a webapp for you? Go ahead, your professional sysadmin will make sure they only get permission to what they need. It doesn't work well the other way around.


These kinds of volunteer positions are exactly how you learn before you have a real job. Worked great for me. And the experience is what actually got me into a real job.


In defense of Ceph, the documentation is fantastic and Ceph clusters have only gotten easier to manage with every release. I am a homelabber, not a sysadmin; I run a cluster and have only run into one or two issues that the docs or patch notes didn't cover. Also, the mailing list is incredibly friendly and helpful. The IRC channel is not as useful in my experience, though; as with many channels, there are a ton of idlers.

One does need to keep up with the release notes, mainly for the past couple of releases, and especially with the latest release, Luminous, which marked BlueStore stable as a more performant alternative to FileStore.


I don't even think Ceph is at fault here, since they are not the ones making a critical piece of infrastructure be the responsibility of unpaid volunteers.

Even if said volunteers were seasoned sysadmins, I don't think they should be stressing over something so critical while being unpaid. Obviously you don't always get paid in money. Sometimes you do it for the greater good, or because you like the institution, or whatever the reason. However, I still think that as a university you should not rely on unpaid staff to handle critical systems, because it's not a nice thing to do. I don't think most people enjoy stressing over a hobby.


Or perhaps the question should be why the gun didn't have basic/full protection against foot-shooting.


I don't think people notice how difficult rollback is.

Some people think because their app is stateless, rollback is trivial. But virtually every single piece of technology that has an indirect effect on your app does have state, and each one moves your app away from a known good state. Everything from firewall rules and network routes, to switch firmware, to DNS, to BIOS updates, to LDAP directory changes, compute and storage configuration changes, software updates, etc all affect whether you can roll back successfully.

Do you know when your cloud provider updates its software, and how it will affect your app? If they change something and don't notify you, and it causes a problem, at what point will you notice when their latest changes were? Can they even roll back their change, and how long will that take?

This was an example of one tier or service failing. But it could have easily been a firewall change, a DNS change, a time sync issue, etc. If you want to roll back to the last time your app worked, all of those need a rollback method, too.


> I don't think people notice how difficult rollback is.

Yep. While we always had a fallback plan for releases/upgrades, everyone knew it would never happen.

We had enough testing in qa/preprod that no matter what happened, a fallback would cost much more time and money than any methods (manual or otherwise) to rectify the problem.


Ceph meltdowns seem to be the worst. I've seen two other projects struggle with Ceph and eventually both gave up and moved on to something else. The main problem seems to be a lack of documented usage guidelines and not much written on what to do when something blows up... and things seemed to blow up with both Ceph systems frequently, and in many cases without any discernible error logging.

The after actions on both systems seemed to indicate that they had simply been provisioned wrong, or had violated some other best practice, but that advice never seemed to be written down anywhere and the documentation on the Ceph site was woefully inadequate.

One team simply moved on to using HDFS and rewrote their entire approach to assume HDFS and it's been pretty solid since. I think the other group moved to using Gluster.

I really hope it's getting better, because the idea is really great, but I wouldn't recommend it to anybody right now based on what I've seen.


There are lots of other options too like LizardFS:

https://lizardfs.com

Seems really nice and easy to deploy.


That one is derived from MooseFS (https://moosefs.com/).

What turned me away from LizardFS was that when they forked the project, the very first thing they did was rewrite the code in C++. This makes it much harder to merge future improvements, and it feels like they made some decisions for the wrong reasons (the user doesn't care what language it is written in).

MooseFS is quite good and has fairly good performance, but its weakness is that the central point is its master, and the open source version doesn't have HA for it.

I know that LizardFS supposedly has HA (it didn't have it at the time I looked into it), how is that implemented?


MooseFS can have a second master that takes over if the first one should fail, though.


I'm familiar with 1.x, which was very basic, so I'm glad they added that. The previous approach with metaloggers had the issue that before you could convert one to a master you had to merge changelogs, which could take a while when you have many files.


What is the use case where one would want to put LizardFS/MooseFS into prod? Why? What are you solving for with these?


When evaluating various distributed filesystems MooseFS[1] was the fastest one.

[1] I did not evaluate LizardFS but back then it was freshly forked so it was pretty much the same thing.


But Ceph seems to be supported in everything. I know it's supported in Ganeti, which I currently run (and love) and Proxmox (which I used to run).


LizardFS/MooseFS is mounted as a regular filesystem (it behaves like NFS), so I would imagine everything should work with it.


Not everything works with NFS, including things like file locking.


NFS has lockd and MooseFS apparently also supports locking since version 3.0.


I have had very good experiences with MooseFS, the predecessor to LizardFS. Rock steady, never crashed or did anything finicky. Performance was OK.


I don't understand why people use Gentoo in production servers. Is it good practice? I can see so many downsides.


Compared with Debian or Red Hat, Gentoo is really a meta-distribution. It's a nice basis for creating your own distribution. In theory, Gentoo lets you mix the latest version of some packages with well-tested, old versions of other packages and generate a release perfected for your organization. If the Gentoo release of some package, Ceph for example, isn't as up to date as you want, you can easily update it yourself and publish a new release to your servers.

In practice, many people don't understand that Gentoo is a meta-distribution and try to run it like a simple distribution, leading to arbitrary ABI breakage and chaos. If you want to run Gentoo on servers, you need to test all upgrades on build machines before rolling them out. You need to publish your own binary packages. You need to do a lot of the testing that Debian, Ubuntu, Red Hat, etc. would be doing for you.

I have some friends who set up their own hosting business based on Gentoo and they seem to be doing it right. Their servers are very stable. Testing updates is an ongoing expense, but I think it has worked out well for them.


I do not want to sound annoying, but this is how I read your very informative comment:

> Compared with Debian or Red Hat, Gentoo is really a meta-distribution. It's a nice basis for creating your own distribution.

I do not want that on my production server. We're in the $SOME_BACKEND_SERVICE business, not in the distro creating business.

> In theory, Gentoo lets you mix the latest version of some packages with well-tested, old versions of other packages and generate a release perfected for your organization.

I want a hyper-stable base that is tested by the masses and, when possible, has some big org behind it who's supporting it (and thus very conservative when changing stuff). On top of that base I might roll my own packages or include a fringe package repository. I never want the base itself to be different for everyone.

> If the Gentoo release of some package, Ceph for example, isn't as up to date as you want, you can easily update it yourself and publish a new release to your servers.

This is not too hard for any distro I've used.

> In practice, many people don't understand that Gentoo is a meta-distribution and try to run it like a simple distribution, leading to arbitrary ABI breakage and chaos. If you want to run Gentoo on servers, you need to test all upgrades on build machines before rolling them out. You need to publish your own binary packages. You need to do a lot of the testing that Debian, Ubuntu, Red Hat, etc. would be doing for you.

So may we conclude, as GP said, that running Gentoo on production servers is just not a good idea?

> I have some friends who set up their own hosting business based on Gentoo and they seem to be doing it right. Their servers are very stable.

This is a little too anecdotal for me :)

I've done my share of Gentoo, and the system is intriguing. But when I need stability, as I do for backend services, it will not even make the long list.

I found some similar-level intriguing features in NixOS. I'd probably still not run it for backend services, but I might be compelled to run it on the application layer in some cases.


You've correctly understood the nontrivial costs involved in trying to use Gentoo on production servers, and correctly understood that the benefits aren't relevant to your personal use-cases, or the most-common use-cases for production distributions in most industries.

I don't at all read your parent as asserting that all production deployments should make use of a meta-distribution like Gentoo, but instead describing the specialized features that are useful for some specific niches. If you have some business reason to want to run arbitrary combinations of different versions of different software together, or to more-easily manage distributing software to your platform with many different features that you sometimes want compiled out, or otherwise if a slow stable extremely-reliable distribution isn't handling your needs, you might want to consider using Gentoo instead of building all that infrastructure yourself.

That's not the case for you, or for me, and that's fine. You've described some of the reasons I also prefer a stable, well-supported distribution and don't use Gentoo. My reading of your comment is that because you personally have no use for the benefits, Gentoo is therefore completely unsuitable for any production use ever. If I've misread your comment, and you were just commenting on your personal use cases without trying to speak to any general audience, I apologize for my mischaracterization.


It depends on the application. I can easily imagine supercomputers running Gentoo, since Gentoo lets you build the world with risky optimization flags that just might work for you. I can also imagine high-security environments using Gentoo because Gentoo lets you eliminate unnecessary dependencies that may be a security risk.

But honestly, I think most people who run Gentoo do it for the learning experience and for fun. :-)


> I can easily imagine supercomputers running Gentoo

Yeah, nice one! Very customized flags required.

I was more talking about typical backend services, like Ceph or, say, a Riak or Postgres cluster.


You can get source RPMs for, say, CentOS or Scientific Linux and maintain your own set of patches and flags.

Minimise your deviation from the stable stuff that everyone else is using.


Which I expect is more common in super computers.

But still, maybe there is some stuff with special processors that needs all to be compiled specifically with support for that type of thing, in order to make proper use of the underlying machine.


>> Compared with Debian or Red Hat, Gentoo is really a meta-distribution. It's a nice basis for creating your own distribution.

> I do not want that on my production server. We're in the $SOME_BACKEND_SERVICE business, not in the distro creating business.

A lot of companies know that Google and Facebook have their own distros. Ergo, to become as successful, they too must build their own distro.


I know this comment was probably sarcastic, but it's also a good teachable moment: what did those huge companies with heavily customized distros base their distros on before they started customizing? You'll find a lot of RHEL/CentOS/Debian core in those lists, not much Gentoo or similar. The stability (and stability-emphasizing philosophy) of those original base distros will permeate customizations for years to come.


The most notable exception to this would be Chrome OS (which also served as a starting point for CoreOS and, IIRC, Google's container OS).

This hasn't stopped Google from using a Debian variant internally as a desktop OS, so they all kind of have their place depending on what you intend to build.


Thank you for this information. So the bottom line is that, when comparing to more traditional distros, it's a lot of work to do it right.


Thank you, that's a very interesting point. I suppose shared hosting providers already do their own builds of PHP, MySQL, Apache, etc. so a distro that's designed for that can work well.


Like what?


Like having to regularly compile and build big packages like glibc (while under load) just to update the system. Also, as it's a rolling release distribution, as I understand it you won't have stable enough software (an update might require a configuration file change, which doesn't happen in CentOS or Debian Stable).


> you have to regularly compile and build big packages like Glibc (while under load) just to update the system

No, you should have build hosts for that, publish binpkgs, and then install the binpkgs.

Gentoo's binpkgs do not require compiling on your production machines; compile them elsewhere.
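A minimal sketch of that workflow, where the binhost URL and paths are assumptions rather than anything canonical:

    # on the build host: /etc/portage/make.conf
    FEATURES="buildpkg"              # write binary packages to PKGDIR after each build
    PKGDIR="/var/cache/binpkgs"

    # on the production machines: /etc/portage/make.conf
    PORTAGE_BINHOST="https://binhost.example.org/packages"   # hypothetical URL

    # install prebuilt packages only, never compiling locally
    emerge --getbinpkg --usepkgonly @world

The build host does the compiling (and the testing), and the production boxes just unpack what it publishes.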

> an update might require configuration file change, this doesn't happen in CentOS or Debian Stable

Debian has debconf, and CentOS has a similar tool. When configuration files change in the Debian world, they often prompt with those maintainer scripts.

Gentoo's dispatch-conf/etc-update model is quite nice because it also lets maintainers recommend new config defaults which the user may easily accept or reject.

You clearly don't know what you're talking about. Then again, many people who use Gentoo don't use it as you would in production, with binpkg mirrors and well-managed package unmasking/masking/etc.


I wish people would quit peddling the "rolling release means less stable" trope, because it's simply untrue. Many rolling release distros tend to be bleeding edge; _that's_ why they might be unstable (though honestly I've never had Arch Linux crash for any reason that wasn't my own fault). What's more, CentOS and Debian do sometimes have updates that require configuration changes. That's why Debian likes to separate out base config files from ones that are likely to change, so you can put your preferential overrides in a .local file (IIRC) and the master version is free to be upgraded. E.g. this is how fail2ban works on Debian.
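For example, Debian's fail2ban package reads a local override on top of the shipped defaults, so an upgrade can replace jail.conf without clobbering your settings (the jail and values below are just illustrative):

    # /etc/fail2ban/jail.local -- overrides the packaged /etc/fail2ban/jail.conf
    [sshd]
    enabled  = true
    maxretry = 3
    bantime  = 3600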


I had three breaking issues in a year and a half using Arch. By breaking - I mean some piece of software I rely on quit working reliably after an update. These were all upstream bugs - not Arch's fault (KDE in two cases, Erlang/OTP in the other), but Ubuntu and RHEL never saw these versions of software hit their repositories because the problems were found (in at least one case by Arch and Gentoo users) and fixed in upstream long before they made it through the long release and testing cycles.

The nice thing about Arch is that it's absolutely trivial to roll back a package, or even roll back all your packages to the ones published on any given date. So I was able to deal with all of these issues quite effectively, which is a credit to Arch, but it simply is the case that you will see more breakage on the leading edge.
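For reference, a rough sketch of both kinds of rollback on Arch; the package file name and archive date are placeholders:

    # roll back a single package from the local cache
    pacman -U /var/cache/pacman/pkg/erlang-20.1-1-x86_64.pkg.tar.xz

    # or pin the whole system to the repos as they stood on a given date
    # (Arch Linux Archive), then downgrade everything to match:
    echo 'Server = https://archive.archlinux.org/repos/2018/02/09/$repo/os/$arch' > /etc/pacman.d/mirrorlist
    pacman -Syyuu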


> I had three breaking issues in a year and a half using Arch. By breaking - I mean some piece of software I rely on quit working reliably after an update. These were all upstream bugs - not Arch's fault (KDE in two cases, Erlang/OTP in the other), but Ubuntu and RHEL never saw these versions of software hit their repositories because the problems were found (in at least one case by Arch and Gentoo users) and fixed in upstream long before they made it through the long release and testing cycles.

What you're describing there is bleeding edge; not rolling release. If you ran a Ubuntu with the testing repos then you'd have experienced the same issues.

> The nice thing about Arch is that it's absolutely trivial to roll back a package, or even roll back all your packages to the ones published on any given date. So I was able to deal with all of these issues quite effectively, which is a credit to Arch, but it simply is the case that you will see more breakage on the [bleeding] edge.

Indeed you will on bleeding edge. That was my point. Most of the complaints people attribute to rolling release are actually problems with bleeding edge distros rather than rolling release. You've downvoted me only to reiterate the same point.


I see your point now, but is there a rolling release distribution that isn't bleeding edge? I think that's the reason for the confusion. All the rolling release distributions are closely tracking their upstreams. Indeed, without a release cycle, what exactly would they be waiting for?


> I see your point now, but is there a rolling release distribution that isn't bleeding edge?

I'd covered that in my first post as well (did you even read it? :P)

> without a release cycle, what exactly would they be waiting for?

Rolling release distros still have release cycles. eg they often have testing repos where many packages will be trialled before they hit the main repositories. Much like you see happen with packages which get updated between major releases on non-rolling release distros.

The difference between rolling release and non-rolling release is a bit less pronounced these days, because even most non-rolling release distros now have easy upgrade paths from one major release version to the next. Heck, with the Debian / Ubuntu derived distros you can just update your apt config to point to the next release repos and apt will carry on as if you're on a rolling release distro. So I think the real difference between rolling release and non-rolling is simply that you're given greater assurances[1] that you don't need to perform manual intervention during package upgrades outside of the major version upgrades, whereas with rolling release the risk of manual intervention can come without warning[2].
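A sketch of that Debian-style jump, with the release names just as examples:

    # point apt at the next release, e.g. stretch -> buster
    sed -i 's/stretch/buster/g' /etc/apt/sources.list
    apt update
    apt full-upgrade    # one big step instead of a continuous roll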

In terms of package stability, there's nothing stopping someone creating an Arch fork which runs packages a few months behind the main Arch repos. In fact I think that might have been the concept behind Arch Server - not that it ever took off.

So anyway, back to my original point: rolling release doesn't have to mean "unstable". It just means breaking changes don't all get held back until major milestones are reached. But that's often true for non-rolling release as well. It just so happens that most rolling release distros are also bleeding edge, and that is where the trope comes from. So saying rolling release can only be unstable is akin to saying Red Hat can only be bleeding edge if you only ever use Fedora.

[1] I was going to say "guaranteed" but that's not quite true either; eg FreeBSD will happily perform breaking changes to applications (within a major release version) if you're not careful what application versions are being installed

[2] This is also not quite true but many distros will warn you about breaking changes in their news feed. But also the package manager itself gives strong clues too (eg Apache 2.2.x -> 2.4.x will obviously result in manually updating some Apache config files).


Rolling release is unstable. You have absolutely no idea if any given update will crash some part of your system, because every variable of the known use cases has not yet been re-tested and certified.

Software and firmware updates from Linux distros have recently bricked modern hardware several times. This should never have happened, because a period of testing is necessary to evaluate whether a given update will cause failures. With rolling release, you update and pray.


Again, what you're describing is bleeding edge; not rolling release.


Well it's not an OS upon which they "run a comprehensive functional, regression, and stress test suite ... on a continuous basis" for starters (CentOS and Ubuntu, only)

http://docs.ceph.com/docs/master/start/os-recommendations/


It’s almost impossible to achieve any sort of reproducibility when all packages are built from source.

Portage is a great tool, but I always thought it was better suited to building packages and operating systems as opposed to being an end-user package management solution.


I've learned recently that even using Nix when building from source is something of a false hope of reproducibility. Many packages use the fetchgit function and fetchgit violates the contract.


fetchgit does no such thing. To use it, you need to specify the git revision and the hash of the output; it's a constant-output derivation just like fetchurl.

Reproducibility is still a difficult problem, but that isn't one of the reasons, and Nix comes far closer than Gentoo.



I do hope somebody either makes that start warning, then removes it, or at least adds it to a linter of some sort.


If I am correctly recalling a conversation I had with Gentoo founder Daniel Robbins on IRC a couple years back, Portage was originally intended as a tool for building binary packages, not necessarily the end-user package management.

As for reproducibility when building from source, that depends on your practices. Is every box a special snowflake with its own USE flags and package set? Well then yes, good luck with reproducibility. Or are you standardizing your targets and building one package to deploy to them all?


Build a gentoo system each day for seven days on the same hardware with the same configuration(package versions, USE flags, etc) and see how well they match up.


I have done this. I used to manage a fleet of 50 servers with everything built from source aside from the base Linux install and build utilities. They were all identical.


Okay, I'm tempted now, partly out of a someone-might-be-wrong-on-the-Internet cussedness, and partly out of curiosity. By what metrics would you propose measuring how consistent the systems are or are not?


Binpkgs output from builds on the same machine are often bit-for-bit identical.

It, of course, depends on the project; if the ./configure script drops the current date in source code, it'll be different, but that doesn't happen often.



...like saying "5 days of downtime" followed by having to "rebuild the world"?


I hate to be blunt, but the moral of this story is that you should not trade good operational practice for personal preferences. Distros generally don't matter, until they do. Choices made at the distro level, like "hey, let's build world!", don't mesh well with providing a stable and reliable service.

You can run any distro you want, safely, with best practices. Which means you aim to not make upstream changes which have a high probability of breaking things ... like glibc ... and you focus upon maintaining a stable platform for your service. You can do that with any distro. Any unix(-like) thing really. There's no magic, just common sense.

The flip side is that some practices encouraged by various distros range between glacial change (read as stability), or near relativistic change (read as here be dragons). Gentoo encourages the latter, and Debian/CentOS encourage the former. This does not mean one is better than the other, just that you have to pay more attention with some distros, to maintaining operational discipline.


I would have run the Ceph in a VM, with the disks passed through so I could downgrade/crossgrade to different Ceph versions while having a rollback plan that is the identity function.


The "restart after upgrade" lesson is one we learned the hard way too, though typically at the OS level, not the application level.

We had cases where machines with 200+ days of uptime changed from one operations team to another, and the new team did a reboot -- and found that some NFS mounts or IP routes or firewall rules had been added manually at run time, without persistent configuration (like calling mount directly instead of adding stuff to /etc/fstab). Of course, those things were lost after a reboot, and had to be reconstructed somehow.
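The classic shape of that trap, with a made-up NFS server and export:

    # survives only until the next reboot:
    mount -t nfs filer01:/export/home /home

    # survives reboots: the same mount recorded in /etc/fstab
    filer01:/export/home  /home  nfs  defaults,_netdev  0  0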

And this is in an organization with several professional sysadmins per team, not "just" a team of volunteers.

As for the choice of OS: it makes sense to use one that people are familiar with, and that fits their style of operation. I don't want to pass any value judgment on that topic.

All in all, impressive debugging work!


I used to be all about how long my uptime was.

Around a decade ago, I started hating big uptime. "That machine has been up 200 days? WAY past time for a reboot!" In my previous job I implemented a policy of rebooting at least every 6 months, or whenever kernel or glibc/etc updates were done.

This was very effective at ensuring that changes to systems were ALWAYS reflected in boot scripts, etc... Way better than finding out that all sorts of problems existed on a hundred servers during an emergency when power went out, and nobody could remember the change that was made a year or more ago.


Even that is too long. Once a week at a scheduled time is best. That way if anything goes wrong, it's relatively easy to find out what changed because people's memory is still fresh.


What worked for me with volunteers and paid sysops: monitor changes in a cronjob and mail diffs to every sysop involved. Someone usually spots that only the live firewall rules have changed but not the config, and after a few (friendly) applications of the LART people tend to be more careful ;)


Hello, that sounds useful! You wouldn't happen to have the source code somewhere?


Put sysconfig in Git. Run git diff in Cron. Profit.
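A minimal sketch of that; the recipient address is an assumption, and tools like etckeeper do a more polished version of the same idea:

    #!/bin/sh
    # cron job: snapshot /etc and mail any drift to the sysops involved
    cd /etc || exit 1
    git add -A
    if ! git diff --cached --quiet; then
        git diff --cached | mail -s "config drift on $(hostname)" sysops@example.org
        git commit -q -m "automatic snapshot $(date -I)"
    fi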


I have a dumb question when it comes to Ceph:

Whatever happened to the good ol' days of best practices? At least with Microsoft products, say, something like SCCM, you have metrics that can tell you what works & what doesn't work. Or, a software product on Microsoft's platform like PDQ Inventory/PDQ Deploy. I know with a decent 4GB memory/60GB HDD/1Gbps network, I can manage 40-50 desktop PCs fairly easily.

Do people not do baselines anymore? There's so much random documentation for Ceph. It's poorly documented. There are no logical help switches for it. And there's no "Oh, you should have x mons for this many OSDs"

It's just poor. Poor all around. :-(


If you're operating it in an environment where you need it to be up with multiple 9s or you're losing money, you call Red Hat and pay for a consulting engagement. As an open source piece of software, I think it's significantly harder to keep up with the level of public documentation that MSFT can with some of their products.

They'll provide RH certified packages that have been tested in-depth on certain OSes (not Gentoo), provide someone on-site to make sure you're following their best practices around configuration, upgrade process, etc.

Ceph is a very complex distributed system and it's very difficult (IMO) to nail down what the best practices are around hardware and resources - it depends almost entirely on the workload, and requirements out of it. If you're operating it solely for sequential RGW workloads, it looks _completely_ different than what you'd want for high performance RBD devices.

They do have some documentation online, to the extent that they can: separate networks, 5 MON nodes (to avoid one going down and ruining your day), etc. I just think it's significantly harder given that it's used on such a diverse set of workloads, which is why they are more than happy to sell consulting for this.


I was forced by management at a previous gig to use a RedHat consultant to set up a satellite server. In the end, I had to help him set it up after it became clear he wasn't able to do it on his own. It was a huge waste of money, and I'd caution anybody to pay close attention to contract management if you're going to get them in.


Satellite is a product sold to management, not to people who do not mind the command line.

That can be reflected a little bit in Red Hat technicians. Do not let that affect your judgement of their technical and operational skills.


Obviously as others have mentioned there were a lot of mistakes here around engineering best practises: using a bleeding edge source based distribution for critical storage infrastructure, allowing drift of package versions, and not testing releases.

I've done some work in Ceph and in my opinion it is too ambitious: block, object, and file system storage, custom raw partition format (Bluestore), along with a somewhat legacy C/C++ codebase makes the system fairly fragile. I would only recommend it if you "know what you are doing" or are willing to get some help from Red Hat.

The other comment I'd make is that 12TB unreplicated (assuming they are using 3x replication for Ceph) is actually not a huge amount of data, and in my opinion a ZFS setup would be cheaper and more stable. It is not too challenging to do an HA ZFS setup, and ZFS's mirroring, scrubbing, and checksumming abilities make it very resilient. Copy-on-write and ZFS snapshots are also great features that can save on disk usage.

I feel the sweet spot for Ceph is probably 50TB - 1PB where you need both block and object storage and are unwilling to use cloud solutions. Lower than 50TB and the overhead and risk of managing Ceph makes it less practical than traditional solutions. From the other direction CERN did testing up to 30PB with Ceph but they had to make significant code changes and had Ceph committers in their team. To compare Hadoop is running in clusters of up to 600PB (but you will be accessing the data very differently).

[1] https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2...


I distinctly remember being dialed into a meeting by phone where someone was explaining the project requirements of what would be our first Ceph deployment. Over and over it was explained there was simply NO provision for restarting nodes without the remaining nodes trying to rebuild for the configured n+x redundancy. I completely disagreed and said it made no sense at all, until I read up all I could on the matter. No provision for patching. Period. If a node goes down, the cluster tries to rebuild.

I'm not sure what reality this design comes from, but it's not one I care to inhabit.


This is absolutely not true.

The normal flags you're looking for are: noout, noscrub, nodeep-scrub. Do your maintenance. Unset those flags.

The noout is what tells Ceph to essentially not shuffle data when OSDs go down. This is also referenced in the article.
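In practice the maintenance dance looks roughly like this (a sketch, not a full runbook):

    ceph osd set noout          # stopped OSDs are not marked "out", so no rebalancing
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # ... stop OSDs, patch/reboot the node, bring the OSDs back up ...
    ceph osd unset nodeep-scrub
    ceph osd unset noscrub
    ceph osd unset noout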


Even that might be too rigid. A better way is to have a time-based setting. For x < N seconds, ignore the missing node. N could be 600 or 900. During that time, you have degraded performance, but the system is still serving. If you exceed that time, then perhaps the node is gone for good, so you should rebuild.

Think of the case where you have more than the three servers described in this incident. You really don't want to disable cluster-wide automated repairs because of some other kind of maintenance. Something else WILL happen during your scheduled work and... ooops.

Why 600 or 900? Unavailability events in larger clusters mostly tend to last only 10-15 minutes, for routine things such as regular node reboots, per page 2 of https://static.googleusercontent.com/media/research.google.c...


Ceph already has this, the setting is called mon_osd_down_out_interval, and the default is 300 seconds (you can easily change it).

The reason for "noout" is you generally want to minimize IO operations while you're performing your maintenance. You should be closely monitoring your cluster while performing maintenance, and if something else goes wrong, you abort and unset noout, wait for it to finish rebuilding, and reassess.

EDIT: This used to be 300 in Jewel, and seems to have changed to 600 in a later version.
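For reference, a sketch of where that knob lives; the value shown is just the default discussed above, and the runtime syntax varies a little between releases:

    # ceph.conf
    [mon]
    mon osd down out interval = 600    # seconds before a down OSD is marked out

    # or adjust it on a running cluster:
    ceph tell mon.* injectargs '--mon-osd-down-out-interval 600'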


That's cool. Thanks!

I'd still argue that you want to provision N+2 capacity, so you can withstand a planned event and an unplanned one at the same time, without having to go through manual tweaking. I'd leave that for truly exceptional cases, such as when the whole cluster is hit by some nasty bug or has entered a spiral that calls for drastic measures.


That's also configurable. The default setting for "mon osd down out interval" is 600 seconds. See http://docs.ceph.com/docs/master/rados/configuration/mon-osd... for a lot more configuration that can be done here.


Thank you. I was (obviously) very far from a Ceph expert, and I never had to implement it, in part because we could never get a satisfactory approach to this problem. Nobody I met with had an answer, and I couldn't find one after doing quite a bit of research. However, as usual with these kinds of things, simple document reading and keyword searches won't always find you the answer -- you need to understand the technology in depth to even understand which questions to ask.

In an ideal world, we would have had multiple people to dedicate to building the kind of cluster they wanted to build. In the real world, after a dozen people left, I'm afraid they may well have ended up in a situation not much better than the University in question.


I'm curious if there's a use case or edge scenario I'm not aware of that noout does not cover. I can't say I've got exhaustive testing for this, but when performing maintenance on a ceph cluster I ran at a previous employer, "ceph osd set noout" before maintenance did exactly what it is documented to do and prevented rebalancing while OSDs were stopped. Despite my poor memory, I'm pretty confident about that, because at least one upgrade I performed did end up requiring some data migration (I believe it was migrating from straw to straw2 bucket type), and we only made that change after all nodes were successfully upgraded.

http://docs.ceph.com/docs/master/rados/troubleshooting/troub...

My memory is generally quite poor, but I vaguely recall this feature being present for as long as I've been familiar with Ceph. I obviously have no way of knowing what happened in that meeting you were in, and maybe the other people who were proposing using Ceph were not very familiar with performing maintenance on it, or maybe there was other constraint or use case they had in mind, but I'm quite confident in saying that the normal case for Ceph maintenance involves only a very marginal amount of data movement (to bring the temporarily-down OSD back up-to-date with changes that occurred while it was down).


I've used ceph since firefly - and noout existed back then as well.

Assuming that you run with 3 failure domains and only maintain one failure domain at a time, noout mostly gets the job done. What it doesn't do for you is save you from an actual failure in a different failure domain during maintenance. EC pools with k+(m>=2), or replication > 3, would cover this as well.

We've had mostly great success with noout + maintaining one failure domain at a time: wait for recovery, proceed to the next failure domain, repeat until done. To the point where we've been comfortable leaving a lot of the babysitting & work to machines.


Nice debugging skills, but I can't help think that's what you get for running Gentoo in production. You should be using what everyone else uses (i.e. Debian or CentOS) so they run into problems before you.


I didn’t read past the first paragraph before I formed the opinion that running all your VMs on a file system like this is a terrible idea.

If you can’t afford an enterprise SAN (I’m not even talking a NAS, I mean a real fiberchannel-based block-store) for your virtual environment (and I get it - many cannot) then just do yourself a favor and run on local disk.

The reduction in moving parts will pay dividends. I promise.


Agreed, and this post does a good job describing why:

https://thornelabs.net/2014/06/14/do-not-use-shared-storage-...


According to https://www.fs.lmu.de/angebot, this server hosts

* a bunch of wordpress pages

* mailing lists

* mail server for council addresses, likely mostly unused

* git server

for a few student councils at LMU. Git is likely used only by the CS council, maybe maths and physics.

Is it really necessary to have 3 * 12TB for this, and all this complexity? Fault tolerance? Gentoo? Would a single server running boring software, with occasional backups, not have been more appropriate for something like this? To me it looks like a few CS students tasked with managing the digital infrastructure went completely overboard.


Ceph is a very finicky beast and I'd advise anyone to stay away unless you're willing to dedicate significant resources to mastering it. I've seen more than one Ceph system go belly up during my days at IBM. IBM is throwing significant resources at it and has people and procedures developed to tackle the usual and less usual operations... and it still caused headaches.

It will make you hate your job, whether it's some kind of networking problem (configuration, a bad NIC, or just plain ol' SoftLayer dropping packets while support claims there are no problems), some kind of HW problem (one bad OSD node in a semi-failed state will pretty much kill cluster throughput), or just some bad disks causing it to rebalance in the middle of high I/O time. It always made me nervous, even with the most basic 'let's swap the failed OSD' operations. The blog post itself is a classic showcase of putting all eggs in one basket. They'd be better off with a simpler setup. Btw, running only 1 monitor node will make you cry when the single monitor decides to take a break, which will happen sooner or later.


Another story where the root of the problem is code that lacks robust type checking, allowing for subtle memory corruption.


For anyone else that doesn't know Ceph: http://docs.ceph.com/docs/master/


What would be a more fool-proof distributed storage? I tried GlusterFS briefly, but also heard some meltdown stories about it.

It almost seems like if you need some persistent storage running plain NFS would be the safest thing if you don't have a dedicated team.


The more-fool-proof option for any distributed storage system is to not deploy multiple package upgrades across your entire fleet without any testing. This entire problem would have been avoided (for any distributed storage system) if they had first tested the upgrade on a single server before rolling it across their entire fleet. If you've got different code sitting around on disk ready for a surprise upgrade whenever something eventually someday gets restarted, you're going to be in for a bad time.

> It turned out that all three systems had been updated a few times without restarting the OSD. No OSD could start anymore. We kept the last two OSD running (this turned out to be a mistake). The file servers, running Gentoo, also had a profile update done by another administrator.

If you don't have the resources to perform safe upgrades, the fool-proof option is to pay a vendor to run your storage for you. I agree that if you want to run a reliable distributed storage system, you need a professional sysadmin who has enough time to maintain it safely. I further agree that if you don't have any professionals who are funded to dedicate at least some time to this, you'll probably have a lot less failure by just running a big dedicated node providing NFS, iSCSI, SMB, or whatever.

I can't think of any software that I'd call fool-proof under "We've upgraded multiple versions without any testing, we have no systems in a known-good state, and we don't have any way to actually revert back to a known-good state".


A /lot/ of times the issue is one of budget.

It's //really// nice to actually have a budget so you can setup a testing environment and actually validate the things you're about to do to your production environment.


That's what I've done. Hardware RAID + XFS + NFS + backups, done. It wasn't HA, but at least I understood how to get it back up.


> That's what I've done. Hardware RAID + XFS + NFS + backups, done.

Even if this was HA, it doesn't scale to HPC type needs. You end up with a single node bottleneck for all data you need. You also would have some sort of direct attached storage all connected to a single node.

Lustre/Gluster/Ceph/GFS and maybe HDFS (ideally with commercial MapR type NFS access) are probably the only viable options in the HPC use case.

If you have a lot of money to burn Isilon/NFS is an option as well.


Yep, I haven't worked in an HPC/big data environment, but neither have the people who wrote the original article. They have three OSDs and only two(?) servers accessing them.


I can't emphasize this enough. Being able to comprehend the system is my first design criterion, no matter how great/stable/fast the alternatives are. You are going to pay the price some day as long as you don't know how stuff works or interacts with each other.


> That's what I've done. Hardware RAID + XFS + NFS + backups, done. It wasn't HA, but at least I understood how to get it back up.

Looks like in this case you never actually needed what Ceph brings to the table: for a tremendous cost in complexity (and fragility, from personal anecdotal evidence), it lets your dataset and load grow beyond the physical capabilities of a single node.

The HA argument in less demanding settings ends up being a cargo cult mostly. Your setup will probably produce higher reliability figures than running Ceph, assuming the lack of some serious engineering capabilities.


Also, take a step back and ask if you really need distributed storage of the OS-mounted filesystem variety. Userspace file storage can be fundamentally less fragile and easier to understand. You really want it to be easy to understand and troubleshoot if you aim for high availability.


Local storage would be advantageous here, and for most cases. VMs are cattle, not pets, and should be redeployable in the event the HV is having issues.


I'm not surprised to see comments here blaming them for using Gentoo in production, even though I'd argue that wasn't fully the cause of their outage. Hell, I use Gentoo on my personal boxen because I love its package management, and even I winced when I read that. "Oh man, I hope they're keeping on top of their system administration. Damn, nope, got bit by the libstdc++ change and the recent profile change. This is gonna be painful."


I'm also a fan of Gentoo, but realistically, if they were running Debian or Red Hat or a derivative, they would be able to use the recommended releases directly from Ceph:

http://docs.ceph.com/docs/master/install/get-packages/

My interpretation of the writeup suggests that most of their problems would have been avoided by running the latest supported release of Ceph, on a supported distribution.

I ran Gentoo servers myself for a few years, but I had to give it up when I realized I wasn't getting much benefit for all the extra effort I was putting in. It was a great way to learn how free software packages interact with each other, but it became a fairly significant time sink to rebuild the world every so often.


Unless they host their file system on public-facing IPs, it would be simpler to use private IPv4 addresses rather than IPv6. With IPv6, you are running your code against less hardened paths, and depending on the type of IP address assigned to your server, behavior can change like this.


Everyone blaming Gentoo here is completely wrong. We had a very similar meltdown where the OSDs were hosted on RHEL 7. They would not start. We had MDS failures as well on RHEL.

Ceph is just a buggy piece of shit. It configures its options like a Windows registry. At one point I literally built Ceph on my Gentoo system and ran the MDS over the VPN. It was the only MDS that wasn't crashing. Nothing about Gentoo is to blame here; it's more stable with Ceph, if anything.


From reading the article, it sounds like many of their problems were related to several upgrades of Ceph having happened, and possibly also an ABI-incompatible libstdc++ update rolling out. Probably not the cause of your RHEL 7 issues, since it is a much more conservative release.

It sounds to me like the choice of Gentoo was at least partially responsible for that 5 day outage.

With RHEL/CentOS, it's pretty good about keeping things stable from a software version perspective.

The downside is when you really do need (or just plain really want) newer versions of software. But at least then you can make the conscious decision to manually maintain a few packages, while the rest of the core stays stable.

I won't run anything but an LTS release in production. People always say "Well <my favorite distro> has release cycles every 2 years, and upgrading your systems that often should be fine." But I've been involved in OS upgrades that require 6+ months to do, and not infrequently.

Most reasonably complex services (in my experience) take man-months of effort to go from, say, 12.04 to 16.04. Start having 3 or 4 such services, and non-upgrade work that needs to be done, and you start running out of time in a 2 year release cycle to complete updates.


This has been my experience with ceph as well. Hell, the documented setup commands for Debian didn't even work out of the box.

I also agree with everyone that gentoo was probably not a great choice, but ceph doesn't even seem production ready unless you have a dedicated team just to manage it.


Part-time sysadmin here (founder, so biz duties, dev duties, PM and product duties, so not a ton of time for sysadmin stuff) and I've had no major problems with Ceph. I had some performance issues, but figured out how to tune scrubs appropriately for my workload, and found out the hard way that prosumer SSDs do not make good journal devices. I run a three-node, 18-OSD Ceph cluster and it's been no problem. I may eat my words when I do a rolling upgrade from Firefly to Jewel (multiple versions), but I'm not dreading that process much.
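For what it's worth, the scrub tuning came down to a handful of knobs along these lines; option names are from recent releases and the values are purely illustrative, since the right ones depend on your workload:

    # ceph.conf, [osd] section
    osd max scrubs = 1            # at most one scrub per OSD at a time
    osd scrub begin hour = 1      # confine scrubbing to off-peak hours
    osd scrub end hour = 6
    osd scrub sleep = 0.1         # throttle scrub I/O against client I/O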

I don't understand why people are complaining about docs here. The docs are good. Read them, read config options, read until you understand what config options actually do. Passively watch the mailing list. I think expecting to run your own petabyte capable storage cluster without a little effort is a bit misguided...



