* in a legacy system, a server (possibly the one hosting the leader node) filled up its disk and became unhealthy; the Consul agent kept reporting both the node and itself as healthy, and failover and gossip generally wedged
* in a dev environment, after we replaced some servers in the cluster, the other nodes noticed the certificate changes and refused to talk to the new servers
* second-hand, in self-hosted installations, it caused a number of hard-to-troubleshoot outages
* something about circular dependencies, and startup ordering that went by "wait _n_ seconds" rather than by actual healthiness (see the sketch below)
It was reliable enough to accumulate a really significant blast radius, and its failure modes were gnarly and varied, so documentation from one incident was often irrelevant to the next.
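On that last point, the difference between a fixed delay and an actual readiness check is roughly the following. This is a minimal Go sketch with a made-up health endpoint and timings, nothing Consul-specific:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitFixed is the fragile pattern: assume the dependency is up after n seconds.
func waitFixed(n time.Duration) {
	time.Sleep(n) // if the dependency is slow today, everything downstream starts broken
}

// waitHealthy polls a health endpoint until it answers 200 OK, or gives up.
// The URL and deadline are hypothetical, for illustration only.
func waitHealthy(url string, deadline time.Duration) error {
	stop := time.Now().Add(deadline)
	for time.Now().Before(stop) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // dependency is actually ready
			}
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("dependency at %s not healthy after %s", url, deadline)
}

func main() {
	// "dependency.internal" is a made-up host; substitute whatever you actually depend on.
	if err := waitHealthy("http://dependency.internal:8080/health", 60*time.Second); err != nil {
		fmt.Println(err)
	}
}
```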
I don't want to sound harsh, but to me all of these sound like misconfiguration and a lack of understanding of how Consul works, tbh. But I don't know the full context, so eh, these things happen.
We rearchitected. At one workplace, we built and distributed our own discovery service. At another, we shifted to semi-automated, mostly static lists of servers per role; the servers themselves were much less dynamic.
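A minimal sketch of what "static lists of servers per role" can look like in practice; the role names, hosts, and file format here are hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// roles.json is a hand-maintained (or semi-automatically generated) file, e.g.:
//   {"db": ["db1.internal", "db2.internal"], "cache": ["cache1.internal"]}
// All names are made up for illustration.
func loadRoles(path string) (map[string][]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	roles := map[string][]string{}
	if err := json.Unmarshal(data, &roles); err != nil {
		return nil, err
	}
	return roles, nil
}

func main() {
	roles, err := loadRoles("roles.json")
	if err != nil {
		fmt.Println("load:", err)
		return
	}
	// "Discovery" is now just a map lookup; no gossip, no agents, no quorum.
	fmt.Println("db hosts:", roles["db"])
}
```

The obvious trade-off is that someone (or some automation) has to keep the file current, which is why this only works when the servers really are that static.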
There are exceptions, but most of the time replacing an off-the-shelf standard solution with something self-made looks like NIH syndrome to me.
The exceptions are the few cases where you know you will only ever need a strict subset of features, and your own solution can provide them by far simpler means than the comparatively "fat" off-the-shelf solution.
I'm not sure service discovery in a cluster is one of those cases.
Not OP, but go look at the Consul documentation. In fact, just look at the "Configuration" page alone: https://www.consul.io/docs/agent/options - it goes on FOREVER. Maybe you don't need most of that, but whatever you do need, you're going to have to find it somewhere in there.
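For context, a common "what you actually need" is registering a service with a health check. Via Consul's Go client that is roughly the following; this is only a sketch, and the service name, port, and check URL are made up:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talks to the local agent on the default address.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Service name, port, and health-check URL are hypothetical.
	reg := &api.AgentServiceRegistration{
		ID:   "web-1",
		Name: "web",
		Port: 8080,
		Check: &api.AgentServiceCheck{
			HTTP:     "http://127.0.0.1:8080/health",
			Interval: "10s",
			Timeout:  "1s",
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
	log.Println("registered service web-1 with the local Consul agent")
}
```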
And yes, the error messages can be utterly cryptic.
One thing: don't ever let junior devs try to create a cluster out of their own laptops (yes, they will think it sounds like a good idea), as this will be a never-ending nightmare.