
Disappointingly, it's a dark art, often because the CPU is a black box. Intel x86 chips translate the instructions you give them into internal micro-ops and then execute them speculatively, out of order, etc. I'm still mystified by the performance gains afforded by randomly inserting NOP instructions.



Fortunately, at least in my experience, the variability that CPUs introduce (which does matter in many contexts) isn't often the source of slowness. More often, plain old algorithmic improvements go a long way toward making stuff faster.

I can't tell you the number of times I've fixed code like this

    matches = []
    for (first : items) {
      for (second : items) {
        if (first.name == second.name)
          matches.add(first);
      }
    }
Very frequently a bright red spot in most profiler output.
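
For what it's worth, the usual fix is a single pass with a hash map keyed by name instead of comparing every pair. A minimal sketch in C++ (the Item struct and find_matches are just illustrative names, and it assumes the intent was "collect items whose name appears more than once"):

    #include <string>
    #include <unordered_map>
    #include <vector>

    struct Item { std::string name; };

    // One pass to count names, one pass to collect duplicates: O(n)
    // on average, instead of comparing every pair of items (O(n^2)).
    std::vector<Item> find_matches(const std::vector<Item>& items) {
        std::unordered_map<std::string, int> counts;
        for (const auto& item : items)
            ++counts[item.name];

        std::vector<Item> matches;
        for (const auto& item : items)
            if (counts[item.name] > 1)
                matches.push_back(item);
        return matches;
    }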


I think there's an element of selection bias in that observation. Since that is the type of performance issue that a profiler is good at finding, those are the performance issues you'll find looking at a profiler.


> I think there's an element of selection bias in that observation.

Almost certainly true, I can only speak of my own experiences.

> Since that is the type of performance issue that a profiler is good at finding, those are the performance issues you'll find looking at a profiler.

I have to disagree with you on this. Sampling profilers are good at finding out exactly which methods are eating performance. In fact, if anything they have a tendency to push you towards looking at single methods for problems rather than moving up a layer or two to see the big picture (it's why flame graphs are so important in profiling).

I have, for example, seen plenty of cases where the profiler indicated that double math was the root cause of a problem, yet popping a few layers up the stack revealed (sometimes non-obviously) that there was n^2 behavior going on.
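
To make that concrete, here is a contrived C++ sketch of the pattern (illustrative names, not the actual case): the flat profile blames the double math in the leaf, while walking up the stack or looking at a flame graph reveals the quadratic caller.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Point { double x, y; };

    // A sampling profiler charges most of the time to this leaf...
    static double distance(const Point& a, const Point& b) {
        double dx = a.x - b.x, dy = a.y - b.y;
        return std::sqrt(dx * dx + dy * dy);
    }

    // ...but the real problem is the caller: an O(n^2) all-pairs loop.
    double closest_pair_naive(const std::vector<Point>& pts) {
        double best = INFINITY;
        for (std::size_t i = 0; i < pts.size(); ++i)
            for (std::size_t j = i + 1; j < pts.size(); ++j)
                best = std::min(best, distance(pts[i], pts[j]));
        return best;
    }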


There is a plethora of information regarding instruction timings, throughput/latency, execution port usage, etc. that compilers make liberal use of for optimization purposes. You could, in theory, also use that information to establish an upper bound on how long a series of instructions would take to execute. The problem lies in the difference in magnitude between the average case and the worst case, due to dynamic execution state like CPU caches, branch prediction, kernel scheduling, and so on.


There is uiCA, which achieves an error of about 1% relative to actual measurements of basic-block throughput across a wide range of microarchitectures. And then there's FACILE, which is similar to uiCA. I don't know of any compilers using these more accurate models, but it is certainly possible.


Intel also provides VTune, which can annotate instruction sequences and profile timing and power consumption down to the level of individual instructions.

I assume those NOPs you mention exist for alignment padding. Clang and GCC let you configure the alignment padding of any or all functions, and Clang lets you explicitly align any for-loop anywhere you want with `[[clang::code_align]]`.
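
For the curious, a minimal sketch of that loop attribute (assumes a Clang recent enough to support it; the function and the 64-byte figure are just illustrative choices):

    #include <cstddef>

    // Requires a recent Clang; GCC relies on flags such as -falign-loops instead.
    float sum(const float* data, std::size_t n) {
        float total = 0.0f;
        // Pad with NOPs so the loop body starts on a 64-byte boundary.
        [[clang::code_align(64)]]
        for (std::size_t i = 0; i < n; ++i)
            total += data[i];
        return total;
    }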


That is why tools like VTune exist.



