
Disappointingly, it's a dark art, often because the CPU is a black box. Intel x86 chips translate the instructions you give them into internal micro-ops and then execute them speculatively, out of order, etc. I'm still mystified by the performance gains afforded by randomly inserting NOP instructions.



Fortunately, at least in my experience, the variability that CPUs introduce (which does matter in many contexts) isn't often the source of slowness. More often, plain old algorithmic improvements go a long way toward making stuff faster.

I can't tell you the number of times I've fixed code like this

    matches = []
    for (first : items) {
      for (second : items) {
        if (first.name == second.name)
          matches.add(first);
      }
    }
Very frequently a bright red spot in most profiler output.
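
For what it's worth, the usual fix is a single pass with a hash map keyed by name instead of comparing every pair. A minimal sketch in C++ (the Item struct and find_matches are just illustrative names, and it assumes the intent was "collect items whose name appears more than once"):

    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Item { std::string name; };

    // One pass to count names, one pass to collect duplicates: O(n)
    // on average, instead of comparing every pair of items (O(n^2)).
    std::vector<Item> find_matches(const std::vector<Item>& items) {
        std::unordered_map<std::string, int> counts;
        for (const auto& item : items)
            ++counts[item.name];

        std::vector<Item> matches;
        for (const auto& item : items)
            if (counts[item.name] > 1)
                matches.push_back(item);
        return matches;
    }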


I think there's an element of selection bias in that observation. Since that is the type of performance issue that a profiler is good at finding, those are the performance issues you'll find looking at a profiler.


> I think there's an element of selection bias in that observation.

Almost certainly true, I can only speak of my own experiences.

> Since that is the type of performance issue that a profiler is good at finding, those are the performance issues you'll find looking at a profiler.

I have to disagree with you on this. Sampling profilers are good at finding out exactly which methods are eating performance. In fact, if anything they have a tendency to push you towards looking at single methods for problems rather than moving up a layer or two to see the big picture (it's why flame graphs are so important in profiling).

I have, for example, seen plenty of cases where the profiler indicated that double math was the root cause of a problem, yet popping a few layers up the stack revealed (sometimes non-obviously) that there was n^2 behavior going on.
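
To make that concrete, here is a contrived C++ sketch of the pattern (illustrative names, not the actual case): the flat profile blames the double math in the leaf, while walking up the stack or looking at a flame graph reveals the quadratic caller.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Point { double x, y; };

    // A sampling profiler charges most of the time to this leaf...
    static double distance(const Point& a, const Point& b) {
        double dx = a.x - b.x, dy = a.y - b.y;
        return std::sqrt(dx * dx + dy * dy);
    }

    // ...but the real problem is the caller: an O(n^2) all-pairs loop.
    double closest_pair_naive(const std::vector<Point>& pts) {
        double best = INFINITY;
        for (std::size_t i = 0; i < pts.size(); ++i)
            for (std::size_t j = i + 1; j < pts.size(); ++j)
                best = std::min(best, distance(pts[i], pts[j]));
        return best;
    }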


There is a plethora of information regarding instruction timings, throughput/latency, execution port usage, etc. that compilers make liberal use of for optimization purposes. You could, in theory, also use that information to establish an upper bound on how long a series of instructions would take to execute. The problem lies in the difference in magnitude between the average case and the worst case, due to dynamic execution state like CPU caches, branch prediction, kernel scheduling, and so on.


There is uiCA, which achieves an error of about 1% relative to actual measurements of basic-block throughput across a wide range of microarchitectures. And then there's FACILE, which is similar to uiCA. I don't know of any compilers using these more accurate models, but it is certainly possible.


Intel also provides VTune, which can annotate instruction sequences and profile timing and power consumption down to the level of individual instructions.

I assume those NOPs you mention exist for alignment padding. Clang and GCC let you configure the alignment padding of any or all functions, and Clang lets you explicitly align any for-loop anywhere you want with `[[clang::code_align]]`.
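
For the curious, a minimal sketch of that loop attribute (assumes a Clang recent enough to support it; the function and the 64-byte figure are just illustrative choices):

    #include <cstddef>

    // Requires a recent Clang; GCC relies on flags such as -falign-loops instead.
    float sum(const float* data, std::size_t n) {
        float total = 0.0f;
        // Pad with NOPs so the loop body starts on a 64-byte boundary.
        [[clang::code_align(64)]]
        for (std::size_t i = 0; i < n; ++i)
            total += data[i];
        return total;
    }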


That is why tools like VTune exist.



