Go, Containers, and the Linux Scheduler (riverphillips.dev)
377 points by rbanffy on Nov 7, 2023 | 138 comments



The common problem I see across many languages is: applications detect machine cores by looking at /proc/cpuinfo. However, in a docker container (or other container technology), that file looks the same as the container host (listing all cores, regardless of how few have been assigned to the container).

I wondered for a while if docker could make a fake /proc/cpuinfo that apps could parse that just listed "docker cpus" allocated to the job, but upon further reflection, that probably wouldn't work for many reasons.


Point of clarification: Containers, when using quota based limits, can use all of the CPU cores on the host. They're limited in how much time they can spend using them.

(There are exceptions, such as documented here: https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana...)


Maybe I should be clearer: Let's say I have a 16 core host and I start a flask container with cpu=0.5 that forks and has a heavy post-fork initializer.

flask/gunicorn will fork 16 processes (by reading /proc/cpuinfo and counting cores) all of which will try to share 0.5 cores worth of CPU power (maybe spread over many physical CPUs; I don't really care about that).

I can solve this by passing a flag to my application; my complaint is more that apps shouldn't consult /proc/cpuinfo, but should have another standard interface to ask "what should I set my max parallelism to (NOT CONCURRENCY, ROB) so my worker threads get adequate CPU time and the framework doesn't time out on startup?"


It's not clear to me what the max parallelism should actually be on a container with a CPU limit of .5. To my understanding that limits the CPU time the container can use within a certain time interval, but doesn't actually limit the parallel processes an application can run. In other words, that container with .5 on the CPU limit can indeed use all 16 physical cores of that machine. It'll just burn through its budget 16x faster. Whether that's desirable vs. limiting itself to one process is going to be highly application dependent and not something kubernetes and docker can just tell you.
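To make that concrete (a rough worked example, assuming the default 100ms CFS period): a limit of 0.5 CPUs is a quota of 50ms of CPU time per 100ms period. Sixteen busy threads spread over 16 cores consume that 50ms in roughly 3ms of wall-clock time and then sit throttled for the remaining ~97ms of the period. The total CPU time per period is the same either way; what changes is how bursty and throttle-prone the container looks.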


It won't burn through the budget faster by having more cores. You're given a fixed time-slice of the whole CPU (in K8s, caveats below); whether you use all the cores or just one doesn't particularly matter. On one hand, it would be nice to be able to limit workloads on K8s to a subset of cores too; on the other, I can only imagine how catastrophically complex that would make scheduling and optimisation.

Caveats: up to the number of cores exposed to your VM. I also believe the later versions of K8s let you do some degree of workload-core pinning, and I don't yet know how that interacts with core availability.


That interface partly exists. It's /sys/fs/cgroup/(cgroup here)/cpu.max

I know the JVM automatically uses it, and there's a popular library for Go that sets GOMAXPROCS using it.
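For the curious, a minimal sketch of that idea (not the library's actual code), assuming a cgroup v2 unified hierarchy mounted at /sys/fs/cgroup:

    package main

    import (
        "fmt"
        "math"
        "os"
        "runtime"
        "strconv"
        "strings"
    )

    func maxProcsFromCgroup() int {
        data, err := os.ReadFile("/sys/fs/cgroup/cpu.max") // cgroup v2 only
        if err != nil {
            return runtime.NumCPU() // no cgroup info: fall back to visible CPUs
        }
        fields := strings.Fields(string(data)) // "<quota|max> <period>"
        if len(fields) != 2 || fields[0] == "max" {
            return runtime.NumCPU() // "max" means no quota is set
        }
        quota, err1 := strconv.ParseFloat(fields[0], 64)
        period, err2 := strconv.ParseFloat(fields[1], 64)
        if err1 != nil || err2 != nil || period <= 0 {
            return runtime.NumCPU()
        }
        // Round the quota up so a limit like 2.5 CPUs still gets 3 worker threads.
        if n := int(math.Ceil(quota / period)); n >= 1 {
            return n
        }
        return 1
    }

    func main() {
        runtime.GOMAXPROCS(maxProcsFromCgroup())
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
    }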


https://stackoverflow.com/questions/65551215/get-docker-cpu-...

Been a bit, but I do believe dotnet exhibits exactly this behavior. Sounds like gunicorn needs a PR to mimic it, if they want to replicate this.

https://github.com/dotnet/runtime/issues/8485


You generally shouldn't set CPU limits. You might want to configure CPU requests, which are a guaranteed chunk of CPU time that the container will always receive. With CPU limits you'll encounter situations where the host CPU is not loaded but your container's workload is throttled at the same time, which is just a waste of CPU resources.


It's complicated. I've worked on every kind of application in a container environment: ones that ran at ultra-low priority while declaring zero CPU request and infinite CPU limit. I ran one or a few of these on nearly every machine in Google production for over a year, and could deliver over 1M xeon cores worth of throughput for embarrassingly parallel jobs. At other times, I ran jobs that asked for and used precisely all the cores on a machine (a TPU host), specifically setting limits and requests to get the most predictable behavior.

The true objective function I'm trying to optimize isn't just "save money" or "don't waste CPU resources", but rather "get a million different workloads to run smoothly on a large collection of resources, ensuring that revenue-critical jobs always can run, while any spare capacity is available for experimenters, up to some predefined limits determined by power capacity, staying within the overall budget, and not pissing off any really powerful users." (well, that's really just a simplified approximation)


The problem is your experience involves a hacked up Linux that was far more suitable for doing this than is the upstream. Upstream scheduler can't really deal with running a box hot with mixed batch and latency-sensitive workloads and intentionally abusive ones like yours ;-) That is partly why kubernetes doesn't even really try.


This. Some googlers forget there is a whole team of kernel devs in TI that maintains a patched kernel (including a patched CFS) specifically for Borg.


I used Linux for mixed workloads (as in, my desktop that was being used for dev work was also running multi-core molecular dynamics jobs in the background). Not sure I agree completely that the Google linux kernel is significantly better at this.

At my new job we run mixed workloads in k8s and I don't really see a problem, but we also don't instrument well enough that I could say for sure. In our case it usually just makes sense to not oversubscribe machines (Google oversubscribed and then paid a cost due to preemptions and random job failures that got masked over by retries) and instead get more machines.


I think you touch on the key issue which is the upstream scheduler does not have all the stats that you need to have confidence in the solution. You want to know how long threads are waiting to get on a CPU after becoming runnable.


According to the article, this is not true. The limits become active only when the host cpu is under pressure.


I don't think that's correct. --cpus is equivalent to setting --cpu-quota/--cpu-period, which is a CPU limit. You can easily check it yourself: just run a docker container with --cpus set, run a multi-core load there and check your activity monitor.


CFS quotas only become active under contention and even then are relative: if you’re the only thing running on the box and want all the cores but only set one cpu, you get all of them anyway.

If you set cpus to 2 and another process sets to 1 and you both try to use all CPUs all out, you’ll get 66% and they’ll get 33%.

This isn’t the same as cpusets, which work differently.


> CFS quotas only become active under contention

That's not true at all. Take a look at `cpu.cfs_quota_us` in https://kernel.googlesource.com/pub/scm/linux/kernel/git/glo...

It's a hard time limit. It doesn't care about contention at all.

`cpu.shares` is relative, for choosing which process gets scheduled, and how often, but the CFS quota is a hard limit on runtime.


Yes, there are hard limits in the CFS. I have used them for thermal reasons in the past, such that the system remained mostly idle although some threads would have had more work to do.

Not at my work environment right now, don't remember the parameters I used.


I just tried it and you are right. Ran

    docker run --cpus 0.1 --rm -it progrium/stress --cpu 16
and the machine's cpu sits idle.


`gunicorn --workers $(nproc)`, see my comment on the parent


I only use `nproc` and see it used in other containers as well, e.g. `bundle install -j $(nproc)`. This honors cpu assignment and provides the functionality you're seeking. Whether or not random application software uses nproc if available, idk.

> Print the number of processing units available to the current process, which may be less than the number of online processors. If this information is not accessible, then print the number of processors installed

https://www.gnu.org/software/coreutils/manual/html_node/npro...

https://www.flamingspork.com/blog/2020/11/25/why-you-should-...


This is not very robust. You probably should use the cgroup cpu limits where present, since `docker --cpus` uses a different way to set quota:

    # cgroup v1 paths; a cfs_quota_us of -1 means "no limit", so fall back to nproc
    quota_file=/sys/fs/cgroup/cpu/cpu.cfs_quota_us
    period_file=/sys/fs/cgroup/cpu/cpu.cfs_period_us
    if [[ -e "$quota_file" ]] && [[ -e "$period_file" ]] && [[ "$(cat "$quota_file")" -gt 0 ]]; then
        # GOMAXPROCS = ceil(quota / period), e.g. --cpus=2.5 -> 3
        GOMAXPROCS=$(perl -e 'use POSIX; printf "%d\n", ceil($ARGV[0] / $ARGV[1])' "$(cat "$quota_file")" "$(cat "$period_file")")
    else
        GOMAXPROCS=$(nproc)
    fi
    export GOMAXPROCS
This follows from how `docker --cpus` works (https://docs.docker.com/config/containers/resource_constrain...), as well as https://stackoverflow.com/a/65554131/207384 to get the /sys paths to read from.

Or use https://github.com/uber-go/automaxprocs, which is very comprehensive, but is a bunch of code for what should be a simple task.
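To be fair, on the consuming side it's just a blank import (per the library's README); its init() does the quota math for you:

    package main

    import (
        "fmt"
        "runtime"

        // Imported for its side effect: init() adjusts GOMAXPROCS to match
        // the container's CPU quota, if one is set.
        _ "go.uber.org/automaxprocs"
    )

    func main() {
        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
    }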


A shell script that invokes perl to set an environment variable used by Go. Some days I feel like there is a lot of duct tape involved in these applications.


Yeah, I'd also go with cgroups... but... You need to know if it's cgroups v1 or v2, where this filesystem is mounted, how to find your own process. Also, there's this hierarchical stuff going on there... also this can change dynamically while your program is running...


That's not what Go does, though. Go looks at the population of the CPU mask at startup. It never looks again, which is problematic in K8s where the visible CPUs may change while your process runs.


We use https://github.com/uber-go/automaxprocs after we joyfully discovered that Go assumed we had the entire cluster's cpu count on any particular pod. Made for some very strange performance characteristics in scheduling goroutines.


My opinion is that setting GOMAXPROCS that way is a quite poor idea. It tends to strand resources that could have been used to handle a stochastic burst of requests, which with a capped GOMAXPROCS will be converted directly into latency. I can think of no good reason why GOMAXPROCS needs to be 2 just because you expect the long-term CPU rate to be 2. That long-term quota is an artifact of capacity planning, while GOMAXPROCS is an artifact of process architecture.


How do you suggest handling that?


> which of problematic in K8s where the visible CPUs may change while your process runs

This is new to me. What is this… behavior? What keywords should I use to find any details about it?

The only thing that rings a bell is requests/limit parameters of a pod but you can't change them on an existing pod AFAIK.


If you have one pod that has Burstable QoS, perhaps because it has a request and not a limit, its CPU mask will be populated by every CPU on the box, less one for the Kubelet and other node services, less all the CPUs requested by pods with Guaranteed QoS. Pods with Guaranteed QoS will have exactly the number of CPUs they asked for, no more or less, and consequently their GOMAXPROCS is consistent. Everyone else will see fewer or more CPUs as Guaranteed pods arrive and depart from the node.


If by "CPU mask" you refer to the `sched_getaffinity` syscall, I can't reproduce this behavior.

What I tried: I created a "Burstable" Pod and ran `nproc` [0] on it. It returned N CPUs (N > 1).

Then I created a "Guaranteed QoS" Pod with both requests and limit set to 1 CPU. `nproc` returned N CPUs on it.

I went back to the "Burstable" Pod. It returned N.

I created a fresh "Burstable" Pod and ran `nproc` on it, got N again. Please note that the "Guaranteed QoS" Pod is still running.

> Pods with Guaranteed QoS will have exactly the number of CPUs they asked for, no more or less

Well, in my case I asked for 1 CPU and got more, i.e. N CPUs.

Also, please note that Pods might ask for fractional CPUs.

[0]: coreutils `nproc` program uses `sched_getaffinity` syscall under the hood, at least on my system. I've just checked it with `strace` to be sure.


I don't know what nproc does. Consider `taskset`


I re-did the experiment with `taskset` and got the same results, i.e. the mask is independent of the creation of the "Guaranteed QoS" Pod.

FWIW, `taskset` uses the same syscall as `nproc` (according to `strace`).


Perhaps it is an artifact of your and my various container runtimes. For me, taskset shows just 1 visible CPU in a Guaranteed QoS pod with limit=request=1.

  # taskset -c -p 1
  pid 1's current affinity list: 1

  # nproc
  1
I honestly do not see how it can work otherwise.


After reading https://kubernetes.io/docs/tasks/administer-cluster/cpu-mana..., I think we have different policies set for the CPU Manager.

In my case it's `"cpuManagerPolicy": "none"` and I suppose you're using `"static"` policy.

Well, TIL. Thanks!


TIL also. The difference between guaranteed and burstable seems meaningless without this setting.


Even way back in the day (1996) it was possible to hot-swap a CPU. Used to have this Sequent box, 96 Pentiums in there, 6 on a card. Could do some magic, pull the card and swap a new one in. Wild. And no processes died. Not sure if a process could lose a CPU then discover the new set.


What is the population of the CPU mask at startup? Is this a kernel call? A /proc file? Some register?


On Linux, it likely calls sched_getaffinity().
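A quick sketch of reading the same mask yourself from Go, via golang.org/x/sys/unix (roughly what coreutils nproc reports too):

    package main

    import (
        "fmt"

        "golang.org/x/sys/unix"
    )

    func main() {
        var set unix.CPUSet
        // Pid 0 means "the calling process"; the kernel fills in its affinity mask.
        if err := unix.SchedGetaffinity(0, &set); err != nil {
            panic(err)
        }
        // Count() is the number of CPUs set in the mask.
        fmt.Println("CPUs in affinity mask:", set.Count())
    }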


hmm, I can see that as being useful but I also don't see that as the way to determine "how many worker threads I should start"


It's not a bad way to guess, up to maybe 16 or so. Most Go server programs aren't going to just scale up forever, so having 188 threads might be a waste.

Just setting it to 16 will satisfy 99% of users.


There's going to be a bunch of missing info, though, in some cases I can think of. For example, more and more systems have asymmetric cores. /proc/cpuinfo can expose that information in detail, including (current) clock speed, processor type, etc, while cpu_set is literally just a bitmask (if I read the man pages right) of system cores your process is allowed to schedule on.

Fundamentally, intelligent apps need to interrogate their environment to make concurrency decisions. But I agree- Go would probably work best if it just picked a standard parallelism constant like 16 and just let users know that can be tuned if they have additional context.


Yes, running on a set of heterogeneous CPUs presents further challenges, for the program and the thread scheduler. Happily there are no such systems in the cloud, yet.

Most people are running on systems where the CPU capacity varies and they haven't even noticed. For example in EC2 there are 8 victim CPUs that handle all the network interrupts, so if you have an instance type with 32 CPUs, you already have 24 that are faster than the others. Practically nobody even notices this effect.


> in EC2 there are 8 victim CPUs that handle all the network interrupts, so if you have an instance type with 32 CPUs, you already have 24 that are faster than the others

Fascinating. Could you share any (all) more detail on this that you know? Is it a specific instance type, only ones that use nitro? (or only ones without?) This might be related to a problem I've seen in the wild but never tracked down...


I've only observed it on Nitro, but I have also rarely used pre-Nitro instances.


> I wondered for a while if docker could make a fake /proc/cpuinfo

This exists: https://github.com/lxc/lxcfs

lxcfs is a FUSE filesystem that mocks /proc by inferring cgroup values in a way that makes other applications and libraries work without having to care about whether it runs in a container (to the best of its ability - there are definitely caveats).

One such example is that /proc/uptime should reflect the uptime of the container, not the host; additionally /proc/cpuinfo reflects the number of CPUs as a combination of cpu.max and cpuset.cpus (whichever the lower bound is).

As others also mentioned, inferring the number of CPUs could also be done using the sched_getaffinity syscall - this doesn't depend on /proc/cpuinfo, so depending on the library you're using you might be in a pickle.


Containers are a crappy abstraction and VMware fumbled the bag, is my takeaway from this comment…


> VMware fumbled the bag

Oh they did, they're a modern day IBM.

> Containers are a crappy abstraction

They're one of the best abstractions we have (so far) because they contain only the application and what it needs.


> they contain only the application and what it needs.

Delusion level: over 9900.

I've yet to find a container that contains only the application and what it needs. Most of the time I find that they contain at least libc and libpthread (which are already present on the host, so not needed). More often I find metric tonnes of garbage that was not necessary by any metric, but was just too hard to remove, so was allowed to stay.


Blame that on glibc, not the container.


This is subtly incorrect - as far as Docker is concerned, the CFS cgroup extension has several knobs to tune: cfs_quota_us, cfs_period_us (the typical default is 100ms, not a second), and shares. When you set shares you get weighted proportional scheduling (but only when there's contention). The former two enforce a strict quota. Don't use Docker's --cpus flag and instead use --cpu-shares to avoid (mostly useless) quota enforcement.

From Linux docs:

  - cpu.shares: The weight of each group living in the same hierarchy, that
    translates into the amount of CPU it is expected to get. Upon cgroup creation,
    each group gets assigned a default of 1024. The percentage of CPU assigned to
    the cgroup is the value of shares divided by the sum of all shares in all
    cgroups in the same level.
  - cpu.cfs_period_us: The duration in microseconds of each scheduler period, for
    bandwidth decisions. This defaults to 100000us or 100ms. Larger periods will
    improve throughput at the expense of latency, since the scheduler will be able
    to sustain a cpu-bound workload for longer. The opposite is true for smaller
    periods. Note that this only affects non-RT tasks that are scheduled by the
    CFS scheduler.
  - cpu.cfs_quota_us: The maximum time in microseconds during each cfs_period_us
    for which the current group will be allowed to run. For instance, if it is set to
    half of cpu_period_us, the cgroup will only be able to peak run for 50 % of
    the time. One should note that this represents aggregate time over all CPUs
    in the system. Therefore, in order to allow full usage of two CPUs, for
    instance, one should set this value to twice the value of cfs_period_us.


> "Don't use Docker's --cpu flag and instead use"

This is rather strong language without any real qualifiers. It is definitely not "mostly useless". Shares and quotas are for different use-cases, that's all. Understand your use-case and choose accordingly.


It doesn't make any sense to me why the --cpus flag tweaks the quota and not shares, since quota is useful in a tiny minority of use cases. A lot of people waste a ton of time debugging weird latency issues as a result of this decision.


With shares you're going to experience worse latency if all the containers on the system size their thread pool to the maximum that's available during idle periods and then constantly context-switch due to oversubscription under load. With quotas you can do fixed resource allocation and the runtimes (not Go apparently) can fit themselves into that and not try to service more requests than they can currently execute given those resources.


And how is that different from worse latency due to cpu throttling from your users’ perspective?


This is predictably worse latency due to CPU throttling, i.e. nothing is suddenly introduced into the system. The other case is worse: a non-critical microservice can cause an outage on your critical microservice.

Imagine some non-critical system like a blog service suddenly causing 2-3% of new order creations to fail.


Fixed queue, so it'll only take as many as it can process and reject the rest, which can be used to do scaling, if you have a cluster. With shares it would think it has all the CPU cores available and oversize the queue.


Doesn’t answer my q really. At least in kubernetes scaling is done by measuring usage against the request (shares) not the limit (quota)


These two options are not mutually exclusive.

When you want to limit the max CPU time available to a container use quotas (--cpus). When you want to set relative priorities (compared to other containers/processes), use shares.

These two options can be combined, it all depends on what you need.


> Don't use Docker's --cpu flag and instead use --cpu-shares to avoid (mostly useless) quota enforcement.

One caveat is that an application can detect when --cpus is used, as I think it's using cpuset. When quotas are used it cannot detect them, and more threads than necessary will likely be spawned.


It is not using cpuset (there is a separate flag for this). --cpus tweaks the cfs quota based on the number of cpus on the system and the requested amount.


--cpus sets the quota; there is a --cpuset-cpus flag for cpusets, and you can detect both by looking at /sys/fs/cgroup.


Hi I'm the blog author, thanks for the feedback

I'll try and clarify this. I think this is how the symptom presents, but I should be clearer.


People using Kubernetes don't tune or change those settings, it's up to the app to behave properly.


False. Kubernetes cpu request sets the shares, cpu limit sets the cfs quota


You said to change docker flags. Anyway, your post is irrelevant; the goal is to let the runtime know how many POSIX threads it should use.

If you set request/limit to 1 core but you run on a 64-core node, then your runtime will see all 64 cores, which will bring performance down.


The original article is about docker. That's the point of my comment: don't set a cpu limit.


I intended it to be applicable to all containerised environments. Docker is just easiest on my local machine.

I still believe it's best to set these variables regardless of cpu limits and/or cpu shares


All you did is kneecap your app to have lower performance so it fits under your arbitrary limit. Hardly what most people describe as "best" - only useful in a small percentage of use cases (like reselling compute).


I've seen significant performance gains from this in production.

Other people have encountered it too hence libraries like Automaxprocs existing and issues being open with Go for it.


Gains by what metric? Are you sure you didn't trade in better latency for worse overall throughput? Also, sure you didn't hit one of many CFS overaccounting bugs which we've seen a few? Have you compared performance without the limit at all?


Previously we had no limit. We observed gains in both latency and throughput by implementing Automaxprocs and decided to roll it out widely.

This aligns with what others have reported on the Go runtime issue open for this.

"When go.uber.org/automaxprocs rolled out at Uber, the effect on containerized Go services was universally positive. At least at the time, CFS imposed such heavy penalties on Go binaries exceeding their CPU allotment that properly tuning GOMAXPROCS was a significant latency and throughput improvement."

https://github.com/golang/go/issues/33803#issuecomment-14308...


This sort of tuning isn't necessary if you use CPU reservations instead of limits, as you should: https://home.robusta.dev/blog/stop-using-cpu-limits

CPU reservations are limits, just implicit ones and declared as guarantees.

So let the Go runtime use all the CPUs available, and let the Linux scheduler throttle according to your declared reservations if the CPU is contended for.


I don't set limits because I'm afraid of how a pod is going to affect other pods. I set limits because I don't want to get used to being able to tap on the excess CPU available because that's not guaranteed to be available.

As the node fills up with more and more other pods, it's possible that a pod that was running just fine a moment ago is crawling to a halt.

Limits allow me to simulate the same behavior and plan for it by doing the right capacity planning.

They are not the only way to approach it! But they are the simplest way to do it.


Limiting CPU to the amount guaranteed to be available also guarantees very significant wasted resource utilization unless all your pods spin 100% CPU continuously.

The best way to utilize resources is to overcommit, and the smart way to overcommit is to, say, allow 4x overcommit with each allocation limited to 1/4th of the available resources so no individual peak can choke the system. Given varied allocations, things average out with a reasonable amount of performance variability.

Idle CPUs are wasted CPUs and money out the window.


> with each allocation limited to 1/4th of the available resources so not individual peak can choke the system.

This assumes that the scheduled workloads are created equal which isn't the case. The app owners do not have control over what else gets scheduled on the node which introduces uncontrollable variability in the performance of what should be identical replicas and environments. What helps here is .. limits. The requests-to-limits ratio allows application owners to reason about the variability risk they are willing to take in relation to the needs of the application (e.g. imagine a latency-sensitive workload on a critical path vs a BAU service vs a background job which just cares about throughput -- for each of these classes, the ratio would probably be very different). This way, you can still overcommit but not by a rule-of-thumb that is created centrally by the cluster ops team (e.g. aim for 1/4) but it's distributed across each workload owner (ie application ops) where this can be done a lot more accurately and with better results. This is what the parent post is also talking about.


1/4th was merely an example for one resource type, and a suitable limit may be much lower depending on the cluster and workloads. The point is that a limit set to 1/workloads guarantees wasted resources, and should be set significantly higher based on realistic workloads, while still ensuring that it takes N workloads to consume all resource to average out the risk of peak demand collisions.

> This assumes that the scheduled workloads are created equal which isn't the case.

This particular allocation technique benefits from scheduled workloads not being equal as equality would increase likelihood of peak demand collisions.


That's why you use monitoring and alerting, so you notice degraded performance before the pod crawls to a halt.

You need to do it anyway because a service might progressively need more resources as it's getting more traffic, even if you're not adding any other pod.


Sure you need monitoring and alerting and sure there are other reasons why you need to update your requests.

But having _neighbours_ affecting the behaviour of your workload is precisely what creates the kind of fatigue that then results in people claiming that it's hard to run k8s workloads. K8s is highly dynamic; pods can get scheduled on a node by chance, sometimes and on some clusters; pagers will ring, incidents will be created for conditions that may solve themselves because of another deployment (possibly of another team) happening.

Overcommit/bursting is an advanced cost saving feature.

Let me say it again: splitting up a large machine into smaller parts that can use the unused capacity of other parts in order to reduce waste is an advanced feature!

The problem is that the request/limits feature is presented in the configuration spec and in the documentation in a deceptively simple way and we're tricked to think it's a basic feature.

Not all companies have ops teams that are well equipped to do more sophisticated things. My advice for those teams who cannot setup full automation around capacity management is to just not use this advanced features.

An alternative is to just use smaller dedicated nodes and (anti)affinity rules, so you always understand which pods go with which other pods. It's clunky but it's actually easier to reason about what's going to happen.

EDIT: typos


> splitting up a large machine into smaller parts that can use the unused capacity of other parts in order to reduce waste is an advanced feature

That's an interesting take. For those of us who once used microcomputers back in the 1980s like Commodore 64s, Apple IIs, and IBM PCs running MS-DOS, sure. But time-sharing systems have been around since the 1960s and Unix dates back to the 1970s, and multi-user logins and resource controls like ulimits have been part of the picture for a very long time.

We've had plenty of time to get used to multitasking and resource contention and management. Is it complicated? It can be. But if you consider that containers are basically a program lifecycle/deployment and resource control abstraction on an OS that's had these features (maybe not cgroups and namespaces, but similar ones) since its birth, it's not really all that advanced.


The same arguments hold on old time-sharing systems. Setting up your limits in such a way that you can use excess capacity when available is very well suited (now and back then) for batch workloads. Scheduler priorities are a way to mix batch and interactive workloads.

When you're not latency sensitive but aiming at optimizing throughput, you can achieve pretty good utilization of the underlying resource. This is what setting requests < limits in k8s is good for: batch workloads.

Batch workload still exists and it's important, but the proportion of batch vs latency-sensitive workload shifted significantly from the 70s to the internet age. Request handlers that sit in the critical path for the user experience of a web or mobile app are not only latency sensitive but also _tail_ latency sensitive.

Tail latency often poses significant problems because stragglers can often determine the total duration of the interaction as perceived by the user, even if only one straggler request suffers from slowdowns.

While there are tricks to deal with tail latency while also improving utilization (i.e. reducing waste) they are hard to implement in practice and/or don't generalize well across workloads.

One thing always works and is your best option as a starting point: stop worrying about idle CPU in an _average_ usage metric. The behaviour during CPU usage spikes (even spikes that are shorter than the scraping period of your favourite monitoring tool!) determines the latency envelope of your service. Measure how much CPU you need during those and plan accordingly.

Ideally it should be possible to let some low-priority (batch) process use that idle CPU, but AFAIK that's not currently possible with k8s.

EDIT: forgot to mention that all this discussion matters only inasmuch as you have strict latency targets. If you're ok with occasionally having very slow requests, and your customers are ok with that, and your bosses don't freak out because they spoke to that one customer that is very angry that their requests time out a few times every month ... I'm very happy for you! Not everybody has these requirements, and you can go pretty far with a production system consisting of a couple of VMs on AWS without k8s, containers, cgroups or whatever in the picture. I know people who do and it works just fine for them. But in order to understand what the fuss is about, we need to frame this discussion as pertaining to batch vs controlled-latency. Otherwise it's hard to explain why there are so many otherwise intelligent people making choices that appear to be a bit silly, isn't it? Sure, there are people who are "just wrong" on the internet :-) but more often than not, if somebody ends up with a different solution it's because they have different goals and tradeoffs.


Imagine Amazon's order history microservice causing an outage for the order creation microservice. Regardless of the monitoring and alerting, you never want a non-critical system to cause an outage on the most critical system you have.


In my experience working with containers -- both on my own behalf and for customers -- I can think of relatively few situations in which CPU limits were necessary or useful for the workload.

When you set CPU reservations properly in your container definitions, they don't "crawl[] to a halt" when other containers demand more CPU than they did before. Every container is guaranteed to get the CPU it requested the moment it needs it, regardless of what other containers are using right up till that moment. So the key is to set the requested CPU to the minimum the application owner needs in the contended case. That's why I say that "requests are limits in reverse" - a request declared by container A effectively creates an implicit limit on container B when container A needs the CPU.

Perhaps you're concerned that not setting limits causes oversubscription on a node. But this is not so: when scheduling pods, the K8S scheduler only considers requests, not limits. If every container is forced to declare a CPU request - as it should - then K8S will not overprovision pods onto a node, and every container will be entitled to all the CPU it requested when needed.

Consider, too, that most applications are memory-bound and concurrency-bound, not CPU bound. And a typical web server is also demand-driven and load-balanced; it doesn't spontaneously consume CPU. For those kinds of applications, resource consumption is roughly consistent across all the containers, and the "pod running just fine a moment ago...crawling to a halt" isn't a phenomenon you're going to see, as long as they haven't yet consumed all the CPU they requested. Limits hurt more than help, especially for these kinds of workloads, because being able to burst gives them headroom to serve while a horizontal scale-out process (adding more pods) is taking place.

Most CPU-bound workloads tend to be batch workloads (e.g. ETL, map/reduce). Teams who run those on shared nodes usually just want to get whatever CPU they can. These jobs tend not to have strict deadlines and the owners don't usually panic when a worker pod has its CPU throttled on a shared node. And those who do care about getting all the CPU available are frequently putting these workloads on dedicated nodes, often even dedicated clusters. And the nodes tend to be ephemeral - launched just-in-time to run the job and then terminated when the job has completed.

As others have pointed out, putting CPU limits on containers is a contributing cause of nodes having lower CPU utilization than they could otherwise sustain. Every node costs money -- whether you bought it or rent it -- so underutilized nodes represent wasted capital or cloud spend. You want nodes to run as hot as possible to get the most out of your investment. If your CFO/cloud financial team isn't breathing down your neck when they see idle nodes, they ought to be.


I run a few things on 128-core setups and I set CPU limits much higher than requests, but I still set them to make sure nothing runs amok.

I would be curious to see this discussed, but your article only states that people think you need limits to ensure CPU for all pods.


The Kubernetes community has this discussion every other week. This article isn't wrong per-se (and it's mostly content marketing, they have articles that are just plain wrong too, like https://home.robusta.dev/blog/containers-dont-use-chroot - i caused the "updated" tag on that one), but it's sweeping in its assertions, ignoring many good reason to set limits.

There are workloads that use up all the bursting headroom with little benefit, or cases where you'd rather prioritize burst capacity for an HTTP server instead of a cronjob that'll finish in its designated time anyway. We even had a case where developers weren't updating their requests while their app grew in requirements, and then had an incident on their hands when spare CPU time suddenly was sparse.


Reservations are not limits; they are minimum guaranteed CPU usage constraints. While theoretically those are minimum guaranteed resources, if there are other busy containers running on the host, you will still see your tail latencies and average latency increasing abnormally.

Imagine you've got a 4-core ec2 instance: the latencies you see at 50% CPU utilisation and at 90% CPU utilisation are quite different. A similar thing happens with reservations, i.e. even though each container is guaranteed its reservation, the relative CPU utilisation still climbs very high when there are other busy processes on the same host.


Interesting. This is not true for Memory, correct? The OOMKiller might get you.

You also cannot achieve a QoS class of Guaranteed without both CPU and Memory limits, so the pod might be evicted at some point.


Correct regarding memory - not true for memory because it's non-fungible unlike CPU shares

> You also cannot achieve a QoS class of Guaranteed without both CPU and Memory limits, so the pod might be evicted at some point.

Evicted due to node pressure - yes (but if all other pods also don't have limits it doesn't matter). For preemption QoS is not factored in the decision [0]

[0] - https://kubernetes.io/docs/concepts/scheduling-eviction/pod-...


> Memory is different because it is non-compressible - once you give memory you can't take it away without killing the process


Swap (Disk, RDMA, Compression)? Page migration (NUMA, CXL)?


All of this is below the K8s scheduler level; the K8s scheduler doesn't know how the underlying kernel handles memory. All it cares about is whether it thinks there's enough free memory to give to a pod, because it just keeps track of other pods' requests and limits. The fact that giving this memory, which it thinks is unavailable, would actually not result in any swapping for the pod is unknown to it.


I've been bitten many times by the CFS scheduler while using containers and cgroups. What's the new scheduler? Has anyone here tried it in a production cluster? We're now going on two decades of wasted cores: https://people.ece.ubc.ca/sasha/papers/eurosys16-final29.pdf.


The problem here isn't the scheduler. It's that resource restrictions are imposed by the container, but the containerized process (Go) doesn't check the OS features used to impose them when calculating the available amount of parallelism.



Besides GOMAXPROCS there's also GOMEMLIMIT in recent Go releases. You can use https://github.com/KimMachineGun/automemlimit to automatically set this limit, kinda like https://github.com/uber-go/automaxprocs.
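Same blank-import pattern as automaxprocs (per that project's README); a minimal sketch, where debug.SetMemoryLimit(-1) just reads back the resulting limit:

    package main

    import (
        "fmt"
        "runtime/debug"

        // Side-effect import: sets GOMEMLIMIT from the cgroup memory limit,
        // leaving some headroom below the hard limit by default.
        _ "github.com/KimMachineGun/automemlimit"
    )

    func main() {
        // A negative argument queries the current limit without changing it.
        fmt.Println("GOMEMLIMIT (bytes):", debug.SetMemoryLimit(-1))
    }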


Discovered this sometime last year in my previous role as a platform engineer managing our on-prem kubernetes cluster as well as the CI/CD pipeline infrastructure.

Although I saw this dissonance between actual and assigned CPU causing issues, particularly CPU throttling, I struggled to find a scalable solution that would affect all Go deployments on the cluster.

Getting all devs to include that automaxprocs dependency was not exactly an option for hundreds of projects. Alternatively, setting all CPU requests/limits to a whole number and then assigning that to a GOMAXPROCS environment variable in a k8s manifest was also clunky and infeasible.

I ended up just using this GOMAXPROCS variable for some of our more highly multithreaded applications which yielded some improvements but I’ve yet to find a solution that is applicable to all deployments in a microservices architecture with a high variability of CPU requirements for each project.


There isn't one answer for this. Capping GOMAXPROCS may cause severe latency problems if your process gets a burst of traffic and has naive queueing. It's best really to set GOMAXPROCS to whatever the hardware offers regardless of your ideas about how much time the process will use on average.


You could define a mutating webhook to inject GOMAXPROCS into all pod containers.


Thanks for sharing this!

And as a maintainer of ko[1], it was a pleasant surprise to see ko mentioned briefly, so thanks for that too :)

1: https://ko.build


As someone not that familiar with Docker or Go, is this behavior intentional? Could the Go team make it aware of the CGroups limit? Do other runtimes behave similarly?


I'm fairly certain that .NET had to deal with it, and Java had or still has a problem, I forget which. (Or did you mean runtimes like containerd?)


Supported in Java 10 (and backported to Java 8) since 2018. Not sure about .NET.

- "The JVM has been modified to be aware that it is running in a Docker container and will extract container specific configuration information instead of querying the operating system. The information being extracted is the number of CPUs and total memory that have been allocated to the container." https://www.oracle.com/java/technologies/javase/8u191-relnot...

- Here's a more detailed explanation and even a shared library that can be used to patch container unaware versions of Java. I wonder if the same could be done for Go?

"LD_PRELOAD=/path/to/libproccount.so java <args>"

https://stackoverflow.com/a/64271429

https://gist.github.com/apangin/78d7e6f7402b1a5da0fa3abd9381...

-

There are more recent changes to Java container awareness as well:

https://developers.redhat.com/articles/2022/04/19/java-17-wh...


Then in Java, if you don't set the limits, it gets the CPU from the VM via Runtime.getRuntime().availableProcessors()... this method returns the number of CPUs of the VM or the value set as CPU Quota. Starting from Java 11 the -XX:+PreferContainerQuotaForCPUCount is by default true. For Java <= 10 the CPU count is equal to the CPU shares. That method then is used to calculate the GC threads, fork join pool size, compiler threads etc. The solution would be to set -XX:ActiveProcessorCount=X where X is ideally the CPU shares value but as we know shares can change over time, so you would change this value over time...

Edit: or set -XX:-PreferContainerQuotaForCPUCount


Yes, I've experienced the same problem with the JVM (in Scala).


There are also GC techniques to make the pause shorter, for example, doing the work for the pause concurrently and then repeating it in the safepoint. The hope is that the concurrent work will turn the safepoint work into a simpler check that no work is necessary. Doubling the work may hurt GC throughput.


This is talking about containers but it seems that the problem is just whenever Go has access to less CPU time than it expects. Wouldn't the same thing happen when running Go on a system with another process that is using CPU? Or even just two Go programs at the same time?


I know that the .NET CLR team adjusted its behavior to address this scenario, fwiw!


So did OpenJDK and the Rust standard library.


I feel like this isn't the first time I've read about issues with schedulers-in-schedulers, but I also can't find any immediate references on hand for other examples. Anyone know of any?


Isn't this a bug in the Go runtime and shouldn't they fix it? It looks like they are using the wrong metric to tune the internal scheduler.


I think this is a great article talking about a thorny point in Golang but boy do I wish I never read this article. I wish this article was never useful to anyone.


How about GKE and containerd?


I still don't get the benefit of running Go binaries in containers. Totally get it for Rails, Python, etc, where the program needs a consistent set of supporting libraries. But Go doesn't need that. Especially now we can embed whole file systems into the actual binary.

I've been looking at going in the other direction and using a Go binary with a unikernel; the machine just runs the binary and nothing else. I haven't got this working to my satisfaction yet - it works but there's still too much infrastructure, and deployment is still "fun". But I think this is way more interesting for Go deployments than using containers.
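For anyone who hasn't tried the embedding part: a tiny sketch with the standard embed package, assuming a ./static directory of assets:

    package main

    import (
        "embed"
        "log"
        "net/http"
    )

    // The static/ directory is baked into the binary at build time,
    // so the deployed artifact really is just the one file.
    //
    //go:embed static
    var assets embed.FS

    func main() {
        http.Handle("/", http.FileServer(http.FS(assets)))
        log.Fatal(http.ListenAndServe(":8080", nil))
    }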


As the author of Caddy, I have often wondered why people run it in containers. The feedback I keep hearing is basically workflow/ecosystem lock-in. Everything else uses containers, so the stuff that doesn't need containers needs them now, too.


For my personal site I run Caddy in a Docker container along with other containers using a compose file. By doing it this way, getting things up when moving to a new instance is as simple as running `docker compose up`. Also making changes to the config or upgrading Caddy version on deployments is the same as a other services since they're all containers. So it's easy to add CI/CD and have it re-deploy Caddy whenever the config changes and there's no need for extra GitHub Actions .yaml's. Setup as code like this also documents all the dependencies and I think it might be helpful in the future.

Having said that, for serious business, this setup doesn't make sense. It possibly takes more work to operate as a container when the gateway runs on a dedicated instance.


Love your work :) Good to know I'm not totally out there on this :)


We just gave this a go (pun intended) on Unikraft / kraft.cloud; Caddy runs fine on it!


I find putting http proxies in containers to be a very effective method of building interesting dynamic L7 dataplanes on orchestrators like k8s. Packaging applications (particularly modern static SPAs), with a webserver embedded, is also a very intuitive way of plugging them into the leaves of this topology while abstracting away a lot of the connectivity policy from the app.

Of course there's also the well-known correlation between the quality of your k8s deployment and the number of proxies it hosts. /s


Containers can give you better isolation between the application and the host, as well as making horizontal scaling easier. It's also a must if you are using a system like Kubernetes. If you are running multiple applications in the same host, you also have control over resource limits.


Kubernetes requires a platform runtime that answers its requests, but there's no law enforcement agency that will prevent you from using a custom runtime that ignores the very existence of Linux control groups.


Yes, that is true. Though if you are using Google's or Amazon's managed Kubernetes services, I think you need to use Docker.


Certainly EKS is less managed than it appears, I’m quite confident that a node running something implementing the kubelet API convincingly would work. They changed recently (1.23 maybe) from Docker to containerd.


Go programs still make use of, e.g., /etc/resolv.conf and /etc/ssl/certs/ca-certificates.crt. These aren't strictly necessary, but using them makes it easier--more consistent, more transparent--to configure these resources across different languages and frameworks.


I use the semi-official Acme package [0], which handles all that really well. I haven't touched SSL or TLS config for years. I mean, I might be an outlier, but this seems pretty standard for Go deployments these days.

[0] https://pkg.go.dev/golang.org/x/crypto/acme


Hi, one of the maintainers of the Unikraft LF OSS project here (www.unikraf.org), clearly in agreement :). We regularly run Go workloads, and even use Dockerfiles to build Go and other projects that we then turn into minimal unikernels. We also have a closed beta/free (unikernel) cloud platform if people want to try; sign-up is at kraft.cloud.


A lot of companies are Kubernetes based now so containers are the default delivery mechanism for all software.


Here's what a container gives you:

  - Isolation of networking
  - Isolation of process namespace
  - Isolation of filesystem namespace
  - CPU and Memory limit enforcement
  - A near-universal format for packaging, distributing, and running applications, with metadata, and support for multiple architectures
  - A simple method for declaring a multi-step build process
  - Other stuff I don't remember
You're certainly welcome to go without them, but over time, everybody ends up needing at least one of the features containers bring. I get the aversion to adding more complexity, but complexity has this annoying habit of being useful.


A VM provides the first 4 anyway - if you're deploying to a cloud instance then having these in the container is redundant. If you're deploying to bare metal then it's possibly useful, but only if you're deploying multiple containers to the same machine.

Go doesn't need a format for packaging - it's one file. It's becoming common practice to embed everything else into the binary. (side note: I haven't done this with env files yet, and tend to deploy them separately, but I don't see any reason why we couldn't do this and produce binaries targeted at specific deployment environments. I might give it a go).

I kinda prefer makefiles for the build stuff, or even just a script. The whole process of creating a Docker instance, pushing source files to it, triggering go build and then pulling back the binary seems redundant; there's no advantage to doing this in a container over doing it on the local machine. And it's a lot faster on the local machine.

Talking to people, it appears to be as mholt said: everyone just does everything in containers so apparently we do this too.


Are you suggesting setting up a separate VM for each process that may only require like 0.25 CPU? Another thing you can't do with VMs is oversubscribe (at least not with cloud ones).


You can't have both oversubscription and isolation, almost by definition. If you want isolation, VMs are great. If you want oversubscription, OSes are still better than container runtimes at managing competing processes on the same host.


OK but I can tho - oversub cpu, isolate memory, systems resources (like ports) etc.

> OSes are still better than container runtimes at managing competing processes on the same host.

OSes and container runtimes are the same thing


> OSes and container runtimes are the same thing

For a subset of OSes


Go binaries tend not to be that lightweight, because we have goroutines for that.

And yes, setting up a separate VM for each instance of a process is perfectly feasible. That's what all this cloud business was about in the first place.


At Unikraft (OSS unikernel project) we do a bit of both: if wanted, people can specify via a Dockerfile what they want/need in their filesystem, and then we have a tool that compiles (if needed) and packs the files into a (OCI formatted) unikernel (which we then deploy via kraft.cloud ).


This goes against everything we've learned about effectively deploying and managing software at runtime. Using the golang binary as a packaging format for your app has the same energy as crafting it exclusively from impenetrable one-liners.


Sorry, I don't understand this. What's the difference between a Docker image made from a script and a Go binary made from a script?


The Docker image is more useful


I think we're back to square one here. Yes for other languages the Docker image is more useful, but for Go the image just contains a single binary file. It's just more bloat rather than more utility.



