* in a legacy system, a server (possibly the one hosting the leader node) filled up its disk and became unhealthy; the Consul agent kept reporting both the node and itself as healthy, and failover and gossip generally wedged
* in a dev environment, after we replaced some servers in the cluster, the other nodes noticed the certificate changes and refused to talk to the new servers
* second-hand, in self-hosted installations, it caused a number of hard-to-troubleshoot outages
* something about circular dependencies, and startup ordering that went by "wait _n_ seconds" rather than by actual healthiness (see the sketch below)
It was reliable enough to accumulate a really significant blast radius, and its failure modes were gnarly and varied, so documentation from one incident was often irrelevant to the next.
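On that last point, the difference between a fixed delay and an actual readiness check is roughly the following. This is a minimal Go sketch with a made-up health endpoint and timings, nothing Consul-specific:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitFixed is the fragile pattern: assume the dependency is up after n seconds.
func waitFixed(n time.Duration) {
	time.Sleep(n) // if the dependency is slow today, everything downstream starts broken
}

// waitHealthy polls a health endpoint until it answers 200 OK, or gives up.
// The URL and deadline are hypothetical, for illustration only.
func waitHealthy(url string, deadline time.Duration) error {
	stop := time.Now().Add(deadline)
	for time.Now().Before(stop) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // dependency is actually ready
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("dependency at %s not healthy after %s", url, deadline)
}

func main() {
	// "dependency.internal" is a made-up host; substitute whatever you actually depend on.
	if err := waitHealthy("http://dependency.internal:8080/health", 60*time.Second); err != nil {
		fmt.Println(err)
	}
}
```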
I don't want to sound harsh, but to me all of these sound like misconfiguration and a lack of understanding of how Consul works, tbh. But I don't know the full context, so eh, these things happen.
We rearchitected. At one workplace, we built and distributed our own discovery service. At another, we shifted to semi-automated, mostly static lists of servers per role; the servers themselves were much less dynamic.
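A minimal sketch of what "static lists of servers per role" can look like in practice; the role names, hosts, and file format here are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// roles.json is a hand-maintained (or semi-automatically generated) file, e.g.:
//   {"db": ["db1.internal", "db2.internal"], "cache": ["cache1.internal"]}
// All names are made up for illustration.
func loadRoles(path string) (map[string][]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	roles := map[string][]string{}
	if err := json.Unmarshal(data, &roles); err != nil {
		return nil, err
	}
	return roles, nil
}

func main() {
	roles, err := loadRoles("roles.json")
	if err != nil {
		fmt.Println("load:", err)
		return
	}
	// "Discovery" is now just a map lookup; no gossip, no agents, no quorum.
	fmt.Println("db hosts:", roles["db"])
}
```

The obvious trade-off is that someone (or some automation) has to keep the file current, which is why this only works when the servers really are that static.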
There are exceptions, but most of the time replacing an off-the-shelf standard solution with something self-made looks like NIH syndrome to me.
The exceptions are the few cases where you know you will only ever need a strict subset of features, and your own solution can provide them by far simpler means than the comparatively "fat" off-the-shelf solution.
I'm not sure service discovery in a cluster is one of those cases.
Not OP, but go look at the Consul documentation. In fact, just look at the "Configuration" page alone: https://www.consul.io/docs/agent/options - it goes on FOREVER. Maybe you don't need most of that, but whatever you do need, you're going to have to find it somewhere in there.
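For context, a common "what you actually need" is registering a service with a health check. Via Consul's Go client that is roughly the following; this is only a sketch, and the service name, port, and check URL are made up:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talks to the local agent on the default address.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Service name, port, and health-check URL are hypothetical.
	reg := &api.AgentServiceRegistration{
		ID:   "web-1",
		Name: "web",
		Port: 8080,
		Check: &api.AgentServiceCheck{
			HTTP:     "http://127.0.0.1:8080/health",
			Interval: "10s",
			Timeout:  "1s",
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
	log.Println("registered service web-1 with the local Consul agent")
}
```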
And yes, the error messages can be utterly cryptic.
One thing: don't ever let junior devs try to create a cluster out of their own laptops (yes, they will think it sounds like a good idea), as this will be a never-ending nightmare.