
"Access was initially fronted by nginx with consul-template generating the config. When it did not scale anymore nginx was replaced by Traefik."

Wonder why Nginx didn't scale.




If I were to guess, it was the reloads triggered by config changes.

Consul-template writes a config and then runs an action; in the case of nginx, I would assume the action is to send a SIGHUP. I think haproxy would also have been an option here, since it has better support for things like pulling backend updates from SRV records.
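For context, a minimal sketch of that kind of setup (paths, the service name "app" and the upstream name are made up for illustration):

    # consul-template config: render an nginx upstream file, reload nginx on change
    template {
      source      = "/etc/nginx/templates/upstreams.ctmpl"
      destination = "/etc/nginx/conf.d/upstreams.conf"
      command     = "nginx -s reload"   # equivalent to sending SIGHUP to the nginx master
    }

    # upstreams.ctmpl: one server line per healthy instance registered in Consul
    upstream app {
      {{ range service "app" }}server {{ .Address }}:{{ .Port }};
      {{ end }}
    }

Every registration or deregistration in Consul rewrites the file and triggers a reload, which is where the reload churn comes from.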


Where I am at the moment we're running clusters of 400-800 containers sitting behind nginx instances, and even though we own nginx+ licenses, we've found the consul-template + SIGHUP route to be totally fine; even at a churn of maybe a dozen containers a minute everything still seems to work. If a particularly busy node dies then we occasionally see a few requests get errors back, but nginx's passive health checking (i.e. watching response codes and not sending traffic to an upstream that's returning a ton of 500s) seems to handle all of that ok.
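For anyone unfamiliar, that passive health checking is just the stock upstream parameters; a rough sketch (upstream name, addresses and thresholds are illustrative, not our actual config):

    upstream app {
      # take a server out of rotation for 30s after 3 failed attempts
      server 10.0.0.11:8080 max_fails=3 fail_timeout=30s;
      server 10.0.0.12:8080 max_fails=3 fail_timeout=30s;
    }

    server {
      location / {
        proxy_pass http://app;
        # retry the next upstream on connection errors, timeouts and 5xx responses;
        # this also defines what counts as a "failure" for max_fails
        proxy_next_upstream error timeout http_500 http_502 http_503;
      }
    }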

The only time our tried and tested consul-template + SIGHUP method is ever unsuccessful (and we've ended up just having to put processes in place to stop this happening) is when the same nginx is handling inbound connections to the cluster under high load and we try to respawn all the containers at once. Then things go wrong for five minutes or so before returning to normal.

While "the occasions error response" isn't perfect, I suspect that for most use cases it's good enough, so I'd still be interested in knowing more specifically what happened to that nginx...


nginx behaves in an RFC-conformant way: if you send it a SIGHUP it respawns all workers, closing (from the server side) all open connections. The problem is that this behaviour confuses some HTTP libs/connection poolers more than others; OkHttp, for example, seems able to deal with it, but others not so much. Once you reach something like 6-12 reloads per second you run into latency issues, because you have to establish a new connection for every request, and if you're still on HTTP/1.1 every benefit of idle connections and connection pooling is defeated. Proxies like Traefik (or, more old school, the F5 BIG-IP LTM) split the frontend and backend handling of connections and deal with that many reloads more gracefully. Besides avoiding issues with HTTP libs, it at least improves your latency.
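If you want to see this from the client side, one way is to trace connection reuse. A minimal sketch in Go (URL and request count are just placeholders) that prints whether each request rode an existing pooled connection or had to open a new one:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "net/http/httptrace"
    )

    func main() {
        client := &http.Client{} // default transport keeps idle connections for reuse
        for i := 0; i < 10; i++ {
            trace := &httptrace.ClientTrace{
                GotConn: func(info httptrace.GotConnInfo) {
                    // Reused is false whenever the pool had nothing usable,
                    // e.g. because the server closed the connection on reload.
                    fmt.Printf("request %d: reused=%v\n", i, info.Reused)
                },
            }
            req, _ := http.NewRequest("GET", "http://127.0.0.1:8080/", nil)
            req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
            resp, err := client.Do(req)
            if err != nil {
                fmt.Println("request failed:", err)
                continue
            }
            io.Copy(io.Discard, resp.Body) // drain so the connection can return to the pool
            resp.Body.Close()
        }
    }

Against an idle proxy you'd expect reused=true after the first request; against one that's reloading constantly you'd expect reused=false to show up far more often, which is the extra handshake latency described above.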



