So at one point we were doing scale testing for our product, where we needed to simulate systems running our software connected back to a central point. The idea was to run as many docker containers as we could on a server with 2x 24-core CPUs and 512GB of RAM. The RAM needed for each container was very small. No matter what we did, the system would start to break around ~1000 containers (this was 4 years ago). After many hours of the usual debugging we did not see anything on the network stack or Linux limits side that we had not already tweaked (or so we thought). So out comes strace! Bingo! We found out that the system could not handle the ARP cache with so many endpoints. Playing with net.ipv4.neigh.default.gc_interval and the settings associated with it got us up to 2500+ containers.
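For anyone hitting the same wall, the knobs live under net.ipv4.neigh.default. A rough sketch of the kind of tuning that helps (the values here are illustrative, not the exact ones we used):

    # Raise the neighbor (ARP) table thresholds; the kernel defaults are
    # 128/512/1024, far too small for thousands of container endpoints.
    sysctl -w net.ipv4.neigh.default.gc_thresh1=8192
    sysctl -w net.ipv4.neigh.default.gc_thresh2=16384
    sysctl -w net.ipv4.neigh.default.gc_thresh3=32768
    # Garbage-collect stale entries less aggressively.
    sysctl -w net.ipv4.neigh.default.gc_interval=3600
    sysctl -w net.ipv4.neigh.default.gc_stale_time=3600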
512GB RAM / 2500 containers is still ~200MB per container. In former days™ a couple hundred MB was enough for a computer to run a complete desktop environment with a web browser and 20 tabs open (source: I had a PC with 500MB of physical RAM). Is this really the limit for such a decently equipped machine? (I guess a server-grade 48-core, 512GB-RAM box should cost less than 5kEUR nowadays.)
He said "The RAM needed for each container was very small" - the RAM is _not_ the limit here. The point is that running many containers is a very different (hard to compare directly) type of load than the main, well optimized use case of running a single large desktop environment.
Networking in particular: with thousands of containers, now there are lots more interfaces, routes, conntrack entries, "background" traffic, iptables rules etc.
Some problems are algorithmic: for example, historically a lot of code has been written with the assumption that the number of network interfaces is small. With >1k interfaces, O(n) lookups suddenly take noticeable time. Similarly, iptables rules are evaluated sequentially, and so on.
Some resources have limits. Exceeding these may impact performance. In his case the load blew up the ARP cache. Nice!
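If you want to see whether you're getting close to one of these limits, a quick sketch of the sort of checks that help (the conntrack files only exist once the nf_conntrack module is loaded):

    # How many neighbor (ARP) entries vs. the GC threshold?
    ip neigh | wc -l
    sysctl net.ipv4.neigh.default.gc_thresh3
    # Conntrack usage vs. its ceiling
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max
    # How long is the sequentially evaluated iptables rule list?
    iptables -S | wc -l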
In ~2003, I replaced my desktop, which had been Intel-based, with a Duron 800MHz system, only I didn't have enough budget to get it the RAM it required (new/different slot IIRC), so I only had the 128MB it came with (whereas my old machine had 768MB cobbled together from like six DIMMs).
I figured that one hop over 100Mbit Ethernet to remote memory was going to be faster than swapping to spinning rust (remember, this was before consumer SSDs, and onions on our belts), so I made a ramdisk on the old machine and mounted it over the network with the nbd (network block device) kernel driver, ran swapon on the nbd and boom, extra "512MB" of RAM.
It worked amazingly well, and (knock on wood) none of my roommates ever tripped over the Ethernet cables.
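For the curious, the setup was roughly this. A sketch from memory; the nbd-server/nbd-client invocation syntax has changed between versions, and the hostname, port and paths here are made up:

    # On the old machine: carve out a ramdisk-backed file and export it
    mkdir -p /mnt/ramdisk
    mount -t tmpfs -o size=512m tmpfs /mnt/ramdisk
    dd if=/dev/zero of=/mnt/ramdisk/swapfile bs=1M count=512
    nbd-server 2000 /mnt/ramdisk/swapfile

    # On the new machine: attach the export and swap onto it
    modprobe nbd
    nbd-client old-machine 2000 /dev/nbd0
    mkswap /dev/nbd0
    swapon /dev/nbd0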
You're assuming that all the memory was used, even though it's not specified in the post. 512GB/2500 is the upper limit. Anything between 0 and that could be the case.
Is there any talk of increasing these defaults on higher-memory systems? The low defaults feel like footguns that people stumble into rather than something needed for optimal performance.
I've given up on expecting sane defaults from every piece of software. Some packages work perfectly fine out of the box, or merely run below peak performance if not tuned slightly. Other software has so many dangerous default settings that it's hard to understand the rationale.
Case in point for me is Docker itself. By default it will write logs to json-file and never truncate or rotate them. Packages distributed for Ubuntu et al. also don't set these limits, so unless you manually make it a system-wide default or set it for each container you run individually, it will eventually eat your disk space. This is a very dangerous default since you're probably using Docker in a lot of cases where all you do is run ephemeral, stateless workloads. Having out-of-disk creep up on you on these kinds of hosts is most unexpected. The same could be said of having the -p port-forwarding option default to binding to 0.0.0.0 instead of 127.0.0.1. Also a footgun, and an exposed Elasticsearch with PII waiting to happen...
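The workarounds are simple enough once you know you need them; roughly something like this (the image names are just placeholders):

    # Per-container log rotation; the same keys can also go system-wide
    # into /etc/docker/daemon.json ("log-driver" / "log-opts") so you
    # don't have to remember it every time.
    docker run -d \
      --log-driver json-file \
      --log-opt max-size=10m \
      --log-opt max-file=3 \
      myapp:latest

    # And bind published ports to loopback unless they really must be public:
    docker run -d -p 127.0.0.1:9200:9200 my-elasticsearch:latest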
I installed PHP on my server (yeah, I know) to run some PHP-based web software. By default PHP is configured in a very insecure manner. You would think that with all the security problems they would ship it with a set of secure defaults.
Probably, but you can still make the configuration secure by default, and people would then be aware of the security implications when enabling insecure features.
I am talking about whoever is packaging PHP for the OS: the default php.ini that comes with PHP on CentOS is insecure by default (I can't remember off the top of my head which settings were set to something insecure).
We are talking about an ini file. This isn't rocket science.
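To illustrate, a sketch of the sort of directives people typically check and tighten; which ones a given distro actually ships insecurely varies, and the script name is a placeholder:

    # See what the packaged php.ini currently says
    php -i | grep -E 'expose_php|display_errors|allow_url_include'
    # Override per invocation without touching the packaged file
    # (script.php is a placeholder)
    php -d expose_php=Off -d display_errors=Off -d allow_url_include=Off script.php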
Right, but they need to be conscious of their end users. If they make it secure by default and someone upgrades, their software stops working. Should PHP have had these defaults to begin with? Yes, absolutely. But now we're all stuck with a million miles of code that will break if register_globals is turned off. That's the point. Everything you've stated above might as well be an alien language to the majority of people using this stuff.
No, it should be secure by default and people should have to enable insecure features themselves. It doesn't stop old software from working, since the person can simply re-enable whatever the insecure feature is.
However, they will now be aware that said feature is insecure and should know the consequences of enabling it.
This is really the million dollar question. Right now I'm not aware of any single cookbook example of how to tune your server for an optimal docker load. It's all buried inside engineering organizations, or blog posts like the one here.
One of the things the MySQL developers did originally was to ship the code with three examples of the my.cnf config file for small, medium and large memory systems. I wish there was something like that for docker container density and OS tuning parameters.
I think what we need is basically a matrix of OS settings that will max out your docker density for N CPUs and M gigs of memory. There are probably also network settings that depend on your link capacity.
I can dream of a day when you can spin up an Ubuntu Docker-flavor server which optimizes itself to give the highest density of containers given the hardware (or VM) it's running on.
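Until that exists, the closest thing I have is a sketch of the usual suspects that would form the rows of such a matrix; the values here are purely illustrative, not a recipe:

    # Illustrative high-density knobs; persist them in /etc/sysctl.d/
    # rather than running them ad hoc.
    sysctl -w fs.file-max=2097152
    sysctl -w fs.inotify.max_user_instances=8192
    sysctl -w fs.inotify.max_user_watches=1048576
    sysctl -w net.netfilter.nf_conntrack_max=1048576
    sysctl -w net.ipv4.ip_local_port_range="1024 65535"
    sysctl -w net.core.somaxconn=4096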
For the port-binding thing, I'd just remember that it binds to 0.0.0.0 when not explicitly told otherwise, and then use docker networks rather than port forwards unless absolutely needed. For example, if you have an application and a couple of backing services (database, Redis, Elasticsearch), then only your application needs a port forwarded from the host; the rest can live within the docker network.
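Roughly like this (the network, container names and app image are placeholders):

    # One user-defined network for the stack; containers on it can reach
    # each other by name, with nothing published to the host.
    docker network create app-net
    docker run -d --network app-net --name db postgres
    docker run -d --network app-net --name cache redis
    # Only the application gets a published port, and only on loopback
    # (or 0.0.0.0 if it really must be reachable from outside):
    docker run -d --network app-net --name app -p 127.0.0.1:8080:8080 myapp:latest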
Last time I checked (~a year ago), Docker used different iptables chain(s) than ufw, or inserted itself before the ufw rules, so ufw was useless for securing access to ports exposed by containers.
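The escape hatch I'm aware of (from the Docker docs rather than something I've battle-tested) is the DOCKER-USER chain, which Docker evaluates before its own forwarding rules; ufw still won't see those ports, but at least you can filter there. The interface and subnet below are assumptions for illustration:

    # Drop traffic to published ports from anywhere outside the allowed subnet.
    iptables -I DOCKER-USER -i eth0 ! -s 203.0.113.0/24 -j DROP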
The big bottleneck we had with docker containers per host was not sustained peak but simultaneous start. This was with Docker 1.6-1.8, but we'd see containers failing to start if more than 10 or so (sometimes as few as 2!) were started at the same time.
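An easy mitigation if you hit the same thing is to stagger the starts rather than firing them all at once; a sketch, with a placeholder image and name:

    # Start containers with a small delay between them instead of all at once.
    for i in $(seq 1 100); do
        docker run -d --name "worker-$i" myimage:latest
        sleep 0.2
    done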
Hopefully rootless docker completely eliminates the races by removing the kernel resource contention.
If I were to guess, reloads triggered from config changes.
Consul-template writes a config and then does an action. In the case of nginx, I would assume the action is to send a SIGHUP. I think haproxy would have also been an option here; it has better support for doing updates from SRV records and the like.
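The wiring is roughly a one-liner: consul-template's -template flag takes source:destination:command, and nginx -s reload just signals the master process with a HUP. The paths here are made up:

    # Re-render the upstream config whenever services change in Consul,
    # then tell nginx to reload (which HUPs the master process).
    consul-template \
      -template "/etc/consul-templates/upstreams.ctmpl:/etc/nginx/conf.d/upstreams.conf:nginx -s reload"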
Where I am at the moment we're running clusters of 400-800 containers sitting behind nginx instances, and even though we own nginx+ licenses, we've found the consul-template + SIGHUP route to be totally fine. Even at a churn of maybe a dozen containers a minute everything still seems to work. If a particularly busy node dies we occasionally see a few requests get errors back, but nginx's passive health checking (i.e. watching response codes and not sending traffic to an upstream returning a ton of 500s) seems to handle all of that OK.
The only time our tried and tested consul-template + SIGHUP method is ever unsuccessful (and we've ended up just having to put processes in place to stop this) is if we have the same nginx handling inbound connections to the cluster under high load and we try to respawn all the containers at once. Then things go wrong for 5 minutes or so before returning to normal.
While "the occasions error response" isn't perfect, I suspect that for most use cases it's good enough, so I'd still be interested in knowing more specifically what happened to that nginx...
nginx behaves in an RFC-conformant way. If you send it a SIGHUP it will respawn all workers, closing (from the server side) all open connections. The problem is that this behaviour confuses some HTTP libs/connection poolers more than others. For example OkHTTP seems to be able to deal with it, but others not so much.
Once you reach something like 6-12 reloads per second you run into latency issues, because you have to establish a new connection for every request, and if you're still running HTTP/1.1 every benefit of idle connections and connection pooling is defeated.
Examples like Traefik (or, more old school, the F5 BIG-IP LTM) split frontend and backend handling of connections and deal with that many reloads more gracefully. Besides avoiding issues with HTTP libs, it at least improves your latency.
"With /proc/sys/kernel/pid_max defaulting to 32768 we actually ran out of PIDs. We increased that limit vastly, probably way beyond what we currently need, to 500000. Actuall limit on 64bit systems is 222"
Because doubling, quadrupling, etc. the number of servers is quick and has a well-known cost compared to going for a complete re-write in another language?
I'm a massive Go and Rust fan, but I don't expect the entire world to be rewritten in them any time soon.
The density-per-server savings are probably a drop in the bucket compared to the cost of the engineers themselves... also, by the sounds of it, memory usage isn't the issue here, which is the only thing I think you'd get from C++. I've seen well-written Java applications do amazing things performance-wise that even expert C programmers couldn't match (without an absurd amount of effort).
>> The density-per-server savings are probably a drop in the bucket compared to the cost of the engineers themselves
For C++ and Rust yes, unless scale is huge. For Go, emphatically no. Go is simpler to work with, and on typical programs it uses half the RAM and fewer threads. Although even for C++ and Rust at medium scale I'd rather do proper engineering, and pay my (rather than somebody else's) people better. "Hard" languages tend to select better engineers. Go is a bit of an odd one in this regard because it selects better engineers by not allowing the kind of mindless GoF masturbation one often sees in Java programs, at the language feature level. Can't abuse OOP if there's no OOP.
>> [memory] is the only thing I think you'd get from C++
This is far, far from the "only" thing you'd get out of the languages I mentioned: a quarter of the memory, as few threads as you'd like, access to vector instruction sets, easier performance optimization where the code calls for it, not having to tune your GC for a smaller heap, not having to new up a class every time you do anything (I know you don't have to, but that wouldn't be idiomatic Java), etc., etc.
Because of the existing code and libraries that are not available in other languages. For instance, if I were writing some NLP-related code I would write it in Python.
Ironically, that'd still be better than the JRE, because your NLP libs end up spending most of their time in the high-performance, memory-efficient C++ code that underpins them.
No man, I wouldn't want to call those horrific C functions compared to those sweet Python functions! And the libs calling those C functions do much more than just being a wrapper; they add a lot of tooling around them.