While the memory-stall information could be valuable in optimizing a program, it is correct to count that time as part of the CPU's busy time.
Busy means the CPU was occupied with a task (other than the operating system's idle task), and thus not available for another task, regardless of how well or poorly it is making any sort of useful progress.
Just like you're busy at work even when you're wiping your monitor instead of coding.
Since "waiting for memory" isn't something handled by the scheduler (is not a scheduling wait), it just looks like any other non-scheduler-related busy.
A program can keep a CPU busy in multiple ways such that there is no "utility" even in the absence of memory stalls. It might have a bug which causes it to loop infinitely. Or it could be a service application which is wrongly blowing through what should be a wait at the top of its loop.
In the days of multi-user systems at universities, administrations looked upon the running of game programs as a no-utility activity.
Yes, true by definition. But the article is trying to look past the definition to see what's really going on. Nobody says "top says I'm 99% busy, so that's as good as it gets."
Memory stall means re-fetching an evicted or not-yet-faulted-in page from disk, so it's disk I/O, which is done asynchronously and leaves the CPU available for another task.
A page fault stall is visible to tools like top. A task waiting for a page to be faulted in so that it can continue is descheduled, and so 0% CPU. So I don't think that's what the article is talking about.
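For contrast, here's a minimal sketch of the kind of stall the article is about: a pointer-chasing loop over a buffer far larger than the caches (the buffer size and iteration count here are arbitrary) spends most of its cycles waiting on DRAM, yet it is never descheduled, so top happily reports it at ~100% CPU the whole time:

    /* Pointer chasing over a buffer far larger than the caches: most cycles
     * are spent waiting on DRAM, but the task is never descheduled, so top
     * reports ~100% CPU for its entire run. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (32u * 1024 * 1024)   /* 32M entries * 8 bytes = 256 MB */

    int main(void) {
        size_t *next = malloc((size_t)N * sizeof *next);
        if (!next) return 1;

        /* Sattolo's algorithm: a random permutation that is a single cycle,
         * so the chase visits the whole buffer and defeats the prefetcher. */
        for (size_t i = 0; i < N; i++) next[i] = i;
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        /* Each load depends on the previous one: a memory-latency-bound loop. */
        size_t p = 0;
        for (long iter = 0; iter < 100000000L; iter++) p = next[p];

        printf("%zu\n", p);   /* keep the compiler from deleting the loop */
        free(next);
        return 0;
    }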
I love Brendan's article about load averages but I feel like this is missing the mark.
A core, or an execution unit stalling within a core, still counts as busy. I.e. the CPU can't process something else. The utilization metric is correct. The question of whether a CPU could be used more efficiently is in the domain of optimization. It might be executing NOPs and not stalling. It could be using an O(N^3) algorithm instead of O(N).
With out of order execution, speculation, and SMT, it's hard to say if an instruction stalling means that the CPU can't process something else; CPUs are complex parallel streams of processing and trying to think of them in linear terms necessarily misses some complexity.
Another factor (also mentioned in the article) is dynamic frequency scaling. Is a core at 100% when it's running 'flat-out' at its nominal frequency or when it has boosted? The boost clock generally depends on thermals, silicon quality and maybe time - so in that case what do you make 100%? If you go for nominal being 100% then sometimes you're going to be at say 120%.
The most obvious answer: 100% is what the manufacturer claims it is capable of handling continuously under the worst supported conditions.
The more difficult answer/question is how to communicate that 'full use' value as well as the current use (possibly greater than full) to software which calculates a usage estimate based on various already existing interfaces. Or if yet another interface (standard, if thinking about that XKCD comic) is needed.
I prefer the NASA approach: 100% is whatever is defined in the spec sheet, and any improvements above that are measured in percentages above 100.
As an example, the SSME was nominally operated at 104.5%, and the newer expendable RS-25Es nominally operate at 111%.[1]
So basically: 100% is the CPU running at base frequency, and anything higher (eg: turboing/boosting, overclocking) should result in even higher percentages above 100. This would be a lot more meaningful than whatever "100% CPU load" means today.
That's mostly what I stated, though with adjustable-frequency products like CPUs and GPUs, the question of 'what is the base frequency' is also a question. That's why I carefully phrased it around the manufacturer's claimed continuous operation in the worst supported environment. (Though implicitly with adequate cooling in operation, not obviously broken cases like a CPU with its heat sink shaken off.)
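To make the convention concrete, a tiny sketch (the frequencies and busy fraction are placeholder values, not any particular vendor's spec) of reporting utilization against the base frequency, so that boosting shows up as more than 100%:

    /* Sketch of the convention discussed above: report utilization against
     * the manufacturer's base (sustained) frequency, so boost shows as >100%.
     * The inputs are placeholders; on Linux they could come from /proc/stat
     * (busy fraction) and cpufreq sysfs (current and base frequency). */
    #include <stdio.h>

    static double utilization_vs_base(double busy_fraction,
                                      double current_mhz,
                                      double base_mhz) {
        return busy_fraction * (current_mhz / base_mhz) * 100.0;
    }

    int main(void) {
        /* Hypothetical core: fully busy while boosted from 3.0 GHz to 3.6 GHz. */
        printf("%.1f%%\n", utilization_vs_base(1.0, 3600.0, 3000.0));  /* 120.0% */
        return 0;
    }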
In the case of SMT (aka hyperthreading) you'd only know if the CPU can process something else by profiling a different thread that's running on the same core.
What users want from these metrics is the feedback about their hardware performance. It should absolutely reflect on issues related to memory latency. This is not about going faster, this is about making good use of the resource you have.
My typical use of similar metrics is from iostat: a tool that shows various statistics about how the system is doing I/O to block devices. Beside other things, it shows CPU utilization (which, in the context of this tool means the amount of CPU work dedicated to I/O). In the context of looking at the output of this tool, I don't use CPU utilization to directly judge the speed (it has read / write requests per second for that), this aspect tells me if I'm utilizing the capacity of the system to do I/O to its full extent (and I don't care if I may be writing in improperly aligned blocks causing write amplification, or not merging smaller blocks -- I will use different tools for that).
The problem is with CPU utilization as displayed by eg. top and our intuitive understanding of what it means to do work on a CPU -- they are different. But tools that display that utilization go for metrics that are easy to obtain rather than trying to match our intuition / be better sources of actionable information.
We want utilization to count progress along the code instructions, because that's where intuitively we'd draw the line between hardware utilization and software issues. Instead, we get a metric that never over-estimates utilization, but is usually wrong.
I disagree. The general user has no control over the code executing. It's an application written by someone else. When that application is utilizing a core, then it's utilizing a core and this is what this metric is (correctly) telling us. If you're in the business of writing software and trying to squeeze the most out of a core then you use different tools.
These tools aren't for "general user". They are either for system programmers, or for system administrators.
> When that application is utilizing a core,
Core of what? A real CPU? A virtual CPU? Do we count hyperthreading TM?
You are just repeating a term that you didn't define -- "utilization". I did define it in the way that to me seems plausible given how people usually understand it intuitively. You just keep throwing this word around, but you don't even care to explain what you mean.
We start with a physical core. Virtual cores have "virtual" utilization, and similarly hyperthreaded cores (which is a bit of a marketing term that isn't always useful in the real world). Naturally if you want to understand what a VM is doing you need to also look at the hypervisor. If you want to dive into exactly what's going on with hyper-threaded cores it can be harder given you don't have perfect visibility.
A physical core can either be idle. Or it can be executing instructions. The portion of the time that it's executing instructions is when it's utilized. I think this is a pretty clear and meaningful definition that's been used for decades.
A system admin running Outlook on a server is not going to be able to do anything about a pipeline stall in Outlook on some particular CPU/memory/motherboard. From their perspective when the utilization is 100% Outlook is cpu-bound and can't do more work. And that's why we have this metric. A stall, or an unused execution unit, or an inefficient sequence of instructions, or inefficient algorithms or many other things are all things that cause the actual work you're getting out of the core to be less than what you could get if you rewrote the program. This is not what CPU utilization % means. If there are power management or thermal considerations then that's also another thing you need to look at to get a complete picture.
Now Outlook might be I/O bound, which is a different problem, for which we look at different metrics. By the way, your I/O metrics reported by various tools are also all imperfect, things like whether the I/Os are sequential, or random, the block size, the mix of reads and writes, all have their own peculiar performance characteristics. Which again are of interest for some people optimizing I/O but not generally something that users of applications can do much about.
EDIT: It feels like you are looking for something that tells you as a programmer how much more you can squeeze out of your CPU. There's no such metric. It's up to you to use tools like profilers and your understanding of architecture and your imagination to figure that out. The utilization metric is super useful. I use it a lot. I've used it for years. Do I need to understand all the other factors that influence it - sure do. Is it something I'd use instead of profiling? no.
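For what it's worth, that idle-vs-executing definition is essentially what top computes. A minimal Linux-only sketch from two snapshots of the aggregate "cpu" line in /proc/stat (error handling kept minimal):

    #include <stdio.h>
    #include <unistd.h>

    /* Read the first line of /proc/stat: cpu user nice system idle iowait irq
     * softirq steal.  Returns total and idle (idle + iowait) jiffies. */
    static int read_cpu_line(unsigned long long *total, unsigned long long *idle) {
        unsigned long long v[8] = {0};
        FILE *f = fopen("/proc/stat", "r");
        if (!f) return -1;
        int n = fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
                       &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
        fclose(f);
        if (n < 5) return -1;
        *total = 0;
        for (int i = 0; i < n; i++) *total += v[i];
        *idle = v[3] + v[4];
        return 0;
    }

    int main(void) {
        unsigned long long t0, i0, t1, i1;
        if (read_cpu_line(&t0, &i0)) return 1;
        sleep(1);
        if (read_cpu_line(&t1, &i1)) return 1;

        /* "Busy" is simply "not idle": time stalled on DRAM counts as busy too. */
        double busy = 100.0 * (double)((t1 - t0) - (i1 - i0)) / (double)(t1 - t0);
        printf("busy: %.1f%%\n", busy);
        return 0;
    }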
Hmm, I feel like this is a bit too big of a generalization to be useful. It's a cool observation but I don't think you can interpret IPC so directly as to conclude that IPC < 1 indicates "you are likely memory stalled". There are a million reasons why you'd have an IPC < 1, many of which are not memory. Maybe you just have a lot of hard-to-predict branches which take a long time to resolve (all that mispeculated work is wasted cycles!), maybe you use a lot of long-latency FP operations, maybe you just have really long dependency chains. If your core doesn't have full speculative wakeups for variable latency instructions (meaning things like loads, but often things like division and multiplication have early exits), you can very easily end up with plenty of cycles where you don't complete an instruction. And, yes, of course, it could be that you're having a lot of cache misses. Memory is a good guess but it's not really actionable advice for performance analysis, you really need to pull out some proper tools to understand why you're getting that number.
I think this is a cool article in that, yes, Task Manager/top/Activity Monitor aren't telling you the full story when it comes to what "CPU %" means. In the end though, there really isn't an easy way to come up with a better metric for utilization that can be summarized in one number, so realistically "how much time did the scheduler put this process on a core for" is plenty good enough for most purposes.
>Memory is a good guess but it's not really actionable advice for performance analysis, you really need to pull out some proper tools to understand why you're getting that number.
I don't really understand, and by extension don't agree with, this point. If you're doing performance analysis on a regular basis, having more readily available tools that give you a solid guess at where you want to look next is useful. It's a lot easier for me to get pcm up and running on any random system than it is to dig in with vtune, for example, and if I can get a fairly accurate determination that pcm and pcm-memory are all I will need to run to conclusively know that it is in fact memory, then I've saved time out of my day.
While the absolute number of possible reasons to have a <1 IPC might be large, the handful of most common ones make up the vast majority of situations in my experience.
That's why there's "likely" in the passage you quoted. No, you don't know that for sure. Yes, you need to check out that possibility, because it's very likely the cause of poor utilization.
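And if you want the first-guess number without installing anything, here's a rough sketch of reading it yourself on Linux with perf_event_open (the same hardware counters perf stat reports; the measured loop below is just a placeholder for whatever code you actually care about):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /* Open a per-process hardware counter (any CPU, user space only). */
    static int open_counter(__u64 config) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void) {
        int cycles = open_counter(PERF_COUNT_HW_CPU_CYCLES);
        int instrs = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
        if (cycles < 0 || instrs < 0) { perror("perf_event_open"); return 1; }

        ioctl(cycles, PERF_EVENT_IOC_RESET, 0);
        ioctl(instrs, PERF_EVENT_IOC_RESET, 0);
        ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(instrs, PERF_EVENT_IOC_ENABLE, 0);

        /* Placeholder workload: replace with the code you care about. */
        volatile long sum = 0;
        for (long i = 0; i < 100000000L; i++) sum += i;

        ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(instrs, PERF_EVENT_IOC_DISABLE, 0);

        long long c = 0, n = 0;
        read(cycles, &c, sizeof(c));
        read(instrs, &n, sizeof(n));
        printf("instructions=%lld cycles=%lld IPC=%.2f\n", n, c, (double)n / c);

        close(cycles);
        close(instrs);
        return 0;
    }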
Did anyone have the impression that utilization reflects the actual computations the CPU does? It only reflects the fraction of wall clock time spent in the context, nothing more.
SMT is pitched as a good way to "hide" low IPC loads by giving the core something else to do while waiting for memory. Hence IBM and (very briefly) Marvell both shipped 4-way SMT cores for database-heavy workloads.
AFAIK this further complicates the IPC metric. What looks like a stalled core might actually be working on another thread. And at the other extreme, two very high IPC threads on one core would lower the observed IPC of each thread.
Except that you've just split your available cache region between multiple processes. This can actually decrease performance, I remember cache thrashing back in the P4 days that could cause an application like MS Word to take minutes to load. One process evicting data the other one needs and vice versa. It circumstantially makes the situation worse instead of better.
Turn off hyper-threading and boom, everything started to function normally.
The solution to not having enough food for one mouth often isn't to add another mouth.
I still see this in modern benchmarks on modern systems where multicore throughput on many benchmarks is just better when you just turn off SMT.
AMD's 3D chips go faster because of the huge cache. Every time we add another process to a core we just split the cache in half instead.
Clearly, this is workload dependent. But I immediately turn off SMT on new systems I get. I'm not core limited, I'm usually single thread limited. Like, always. No reason to keep SMT on for this case.
I was always curious why they didn't just make pairs of CPUs that shared a few logic units in common. Like I don't know what the instruction mix is these days and just how long the pipeline is, but say most code can do 2-3 ALU instructions per clock, why not 2 cores with 1-2 ALUs each and 2 shared?
As things are, there is very little guarantee that the hyperthread will make any substantial progress.
AMD Bulldozer did something like this. The UltraSPARC T1 had a shared FPU too. There are savings there, but neither of these solutions ended up being practically fast. The amount of die space spent on an ALU is relatively low compared to the rest of the CPU, hence why they shared FPUs instead.
We might see it again now that the server world is having an ARM CPU renaissance. But I doubt AMD or Intel will make anything that exotic.
I sort of hope we don't see SMT ARM cores. SMT was/is a huge pain for side channels and all the ARM chip houses dodged that one by nature of just never implementing it. I would hope that the concern of cloud vendors over isolation would be enough to discourage bringing more SMT uarchs in to this world.
Or just disabling SMT? SMT can give a good utilization boost and it feels like a shame for it not to be an option, especially since CPUs are not getting faster at the same rate they used to anymore. You can disable it on AMD and Intel x86 cores, as well as on the IBM Power ones.
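On Linux there's also a runtime knob for it, so you don't even need a trip into firmware settings. A small sketch (the sysfs path is the kernel's SMT control interface; it needs a reasonably recent kernel, and root to actually change the setting):

    #include <stdio.h>
    #include <string.h>

    #define SMT_CONTROL "/sys/devices/system/cpu/smt/control"

    int main(int argc, char **argv) {
        if (argc > 1 && strcmp(argv[1], "off") == 0) {
            /* Needs root; the kernel offlines all sibling threads. */
            FILE *w = fopen(SMT_CONTROL, "w");
            if (!w) { perror(SMT_CONTROL); return 1; }
            fputs("off\n", w);
            fclose(w);
        }

        char state[32] = "";
        FILE *r = fopen(SMT_CONTROL, "r");
        if (!r) { perror(SMT_CONTROL); return 1; }
        if (fgets(state, sizeof state, r))
            printf("SMT control: %s", state);  /* on / off / forceoff / notsupported */
        fclose(r);
        return 0;
    }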
Generation after generation, hyperthreads on Intel cores have turned in mediocre to terrible benchmarks. Neither thread makes full progress, and on some versions you would see more IPC with them turned off.
AMD might have managed better, though I see benchmarks where theirs is also a wash. It doesn’t mean Arm or RISC-V will see any benefit.
I've seen around a 20% increase even in dumb "a web app returning some text" benchmarks. It certainly shouldn't be counted the way a second real core would be, but it isn't nothing (like it was in the P4 era you apparently time travelled from).
But that's entirely beside the point, and I'm not sure why you're trying to derail the conversation in a different direction.
ALUs[1] do not really consume that large an area of a CPU, so just sharing those is a significant complication with little benefit. Sharing register files, caches, store buffers, ROBs, TLBs, prediction tables, and the logic to drive them is a better trade-off.
[1] outside of large SIMD FPUs, which indeed bulldozer tried to share with not great results.
The article mixes a few things together. It essentially describes the von Neumann bottleneck, a consequence of the literal way many traditional computer architectures interpreted Turing’s vision. The thrust of the article is to warn performance engineers to not confuse the OS scheduler view (of whether a CPU is available for more work) and the micro-architectural view (of whether you expect the CPU to retire more instructions for a given number of clock cycles).
In my experience on large ARM cores, the max IPC can be high, but programs that do useful work rarely achieve it. Scientific code intended for HPC makes good use of vector units or just superscalar processing, along with (manual) interleaving of compute and memory I/O. Other code, like most web browsers, can hit IPC=1, only after a ton of tuning. Both categories are important, but usually the pot of money is larger for the HPC code, or at least the optimization path is clearer.
In other words: the article is intended primarily for someone to understand when they might want a performance engineer and not just call it a day when they see full CPU scheduling utilization.
I was recently in a similar situation having to first investigate how (various) GPU utilization metrics are defined, and then I also had to explain this information to someone who had the task of representing GPU-related metrics to users.
Both turned out to be quite difficult... First had to do with NVML (the library behind nvidia-smi), which isn't very well documented... but I had some help there. But, even after sort of making a mental image of available metrics, the task of presenting users with actionable statistics turned out to be also very hard.
On top of what OP writes, imagine now the added modality of having a hierarchical structure of PU workers, some of which are not even hardware... and, subsequently, the stalls resulting from bad data alignment rather than waiting on physical processes to complete. Add to this yet another modality of vRAM occupancy and its bandwidth utilization...
And if you want to show a single number that tells users how loaded their system is... what do you do? Average? Sum? Weighted sum? What if you wanted some absolute units rather than percents? When the data doesn't properly align (to fill up all the SIMD lanes) should this be counted towards "bad" GPU performance or just assumed to be the nature of the workload?
Interestingly, the "utilization" nvidia-smi reports is about time, not the number of instructions executed. And it's about time that at least one SP wasn't marked as idle.
This stuff becomes even more fuzzy, when eg. NVLink is involved... it's really hard to give concise information to not extremely technical users that'd be actionable from their perspective.
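For anyone who hasn't dug into it: the number in question comes from nvmlDeviceGetUtilizationRates, and both fields it returns are fractions of time over the sample period, not fractions of compute or bandwidth capacity. A minimal sketch (assumes the NVML headers and library are installed; link with -lnvidia-ml):

    #include <stdio.h>
    #include <nvml.h>

    int main(void) {
        if (nvmlInit() != NVML_SUCCESS) return 1;

        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
            nvmlUtilization_t u;
            if (nvmlDeviceGetUtilizationRates(dev, &u) == NVML_SUCCESS) {
                /* u.gpu: % of the sample period in which at least one kernel ran.
                 * u.memory: % of the sample period in which device memory was
                 * being read or written. Neither says how *well* it was used. */
                printf("gpu=%u%% mem=%u%%\n", u.gpu, u.memory);
            }
        }
        nvmlShutdown();
        return 0;
    }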
Stalls are not necessary for thermal regulation, DVFS [1] is the tool we actually use for handling thermal limits (generally referred to as "thermal throttling"). In a modern CPU, you really don't expect (or, rather, want) to be fully stalled ever, that's the point of very deep reorder buffers. You can try this yourself, make a long loop with a ton of unrelated int and fp operations (unrelated so as not to cause dependency chains) and you'll get 100% occupancy without the core catching on fire.
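Something along these lines (the counts and constants are picked arbitrarily) -- several independent integer and floating-point accumulators, no memory traffic, so the core stays fully occupied without ever waiting on anything:

    /* Roughly the experiment suggested above: a long loop of int and fp
     * operations with no memory traffic. Each accumulator carries only its
     * own short chain, and there are enough independent chains to keep the
     * execution units fed, so occupancy stays at ~100% with no stalls. */
    #include <stdio.h>

    int main(void) {
        long a = 1, b = 2, c = 3, d = 4;
        double x = 1.0, y = 2.0, z = 3.0, w = 4.0;

        for (long i = 0; i < 500000000L; i++) {
            /* Independent integer work spread over several registers. */
            a += i; b ^= i; c += 7; d -= 3;
            /* Independent floating-point work on separate accumulators. */
            x += 1.5; y *= 1.0000001; z -= 0.5; w += 0.25;
        }

        /* Print the results so the compiler can't delete the loop. */
        printf("%ld %ld %ld %ld %f %f %f %f\n", a, b, c, d, x, y, z, w);
        return 0;
    }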
I didn't realize how much pressure there can be on memory. This explains that part of Apple's M1/2 performance is just from SoC benefits, and that even in non-IO applications, hyperthreading is more than a marketing gimmick.
I'm not familiar with modern CPUs, but I remember that most instructions take more than one cycle to execute, without counting memory or cache delays. So, expecting to see 1:1 IPC is ... a fantasy?
Yes, that's true, instructions generally take more than one cycle to execute, although the most common ones are fast. But processors these days can execute more than one instruction at a time (this has nothing to do with multiple cores, this is on a single core) and aren't bound to the order the instructions appear in the code; the processor will make sure that instructions that depend on each other are executed in the right order, and otherwise it can opportunistically execute whatever it has room for. Modern processors even speculatively execute code from branches they don't yet know will be taken, which caused the whole Meltdown/Spectre security headache.
Many processing units can offset this via superscalar methods, that is, tricks to have more than one instruction processing at the same time: pipelining, speculative execution, SMT, etc.
The article does not go into great depth about it, but does say that the 1 IPC ratio number is based more on gut feel than anything else. I assume the idea is that the superscalar bits (greater than 1 IPC) help compensate for the slow bits (less than 1 IPC), normalizing out at around an IPC of 1 when your code is good.
They do take more than a cycle (how many depends on how you count), but they are fully pipelined, so they can start executing an (independent) instruction before a previous one has finished.
They are also superscalar so they have multiple (pipelined) units that can start executing instructions at the same time.
Yes but also many instructions are in flight at the same time, so just because the total pipeline is many cycles long doesn’t mean you can’t have a high throughput.
That's why the number is 1 and not 4. The processor is 4 wide so if it was doing the absolute theoretical maximum you'd get an IPC of 4. Brendan's "rule of thumb" of 1 is taking the multi-cycle thing into account.
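One way to see the gap between the 4-wide theoretical max and the ~1 rule of thumb is to compare a single dependent chain against independent accumulators doing the same total work. A sketch (run each version under an IPC counter, e.g. perf, to compare):

    /* Same number of floating-point adds, first as one dependent chain
     * (each add waits out the full add latency) and then spread across
     * four independent accumulators the core can overlap. */
    #include <stdio.h>

    #define ITERS 400000000L

    static double dependent_chain(double x) {
        double s = 0.0;
        for (long i = 0; i < ITERS; i++)
            s += x;                      /* each add depends on the previous */
        return s;
    }

    static double independent_chains(double x) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (long i = 0; i < ITERS; i += 4) {
            s0 += x; s1 += x; s2 += x; s3 += x;   /* four overlapping chains */
        }
        return s0 + s1 + s2 + s3;
    }

    int main(void) {
        printf("%f %f\n", dependent_chain(1.0), independent_chains(1.0));
        return 0;
    }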
Why use top when you can use htop? It breaks down CPU usage into various parts like IO wait and vm guest time... they even integrated functionality from iotop so you eventually won't need that program either.