Sure, I think it's fair to say that if cachegrind finds the branching to be predictable, any modern processor will do fine on it. But if cachegrind predicts poor performance, I wouldn't suggest changing your code unless you've confirmed that the actual performance is poor on a real processor. Overall, I think cachegrind's branch prediction is nowadays less accurate than the rule of thumb that "if there is a pattern, the CPU will find it". Since using performance counters is so simple and accurate (at least on x86/x64), I'm not sure there is any benefit to using the older method of emulation, which is both slower and less accurate.
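
To make the "depends on the predictor model" point concrete, here is a toy sketch comparing two textbook predictors on a strictly alternating branch. This is not cachegrind's code and not a model of any real CPU; the function names, the initial counter state, and the 4-bit history length are all my own assumptions for illustration.

```python
def bimodal_misses(outcomes):
    """Toy model: a single 2-bit saturating counter (states 0..3).

    Predicts taken when the counter is >= 2. The starting state (2) is an
    arbitrary assumption.
    """
    state = 2
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # Move toward 3 on taken, toward 0 on not taken.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


def history_misses(outcomes, history_bits=4):
    """Toy model: one 2-bit counter per pattern of recent outcomes.

    The last `history_bits` outcomes select which counter to use, so a
    repeating pattern ends up with a dedicated counter per position.
    """
    counters = [2] * (1 << history_bits)
    mask = (1 << history_bits) - 1
    history = 0
    misses = 0
    for taken in outcomes:
        counter = counters[history]
        if (counter >= 2) != taken:
            misses += 1
        counters[history] = min(counter + 1, 3) if taken else max(counter - 1, 0)
        # Shift the newest outcome into the history register.
        history = ((history << 1) | int(taken)) & mask
    return misses


if __name__ == "__main__":
    # A branch that strictly alternates taken / not taken.
    pattern = [True, False] * 5000
    print("single counter misses:", bimodal_misses(pattern))
    print("history-indexed misses:", history_misses(pattern))
```

In this toy setup the single counter keeps mispredicting a large share of the alternating branch, while the history-indexed version stops missing after a short warm-up. The same code gets a very different verdict depending on which model you emulate, which is why I'd trust the hardware counters over any simulation.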

Your response encouraged me to check whether the branch predictor in cachegrind had been updated since I last looked at it. It doesn't look like it has. It's still pretty simple, about 20 lines of actual code: https://github.com/fredericgermain/valgrind/blob/master/cach...

I greatly appreciated the honesty in one of the comments: "TODO: use predictor written by someone who understands this stuff."

It is a valid question whether the consistency from processor to processor is any better than cachegrind or a rule of thumb. My experience was that for Nehalem/Sandy Bridge/Haswell, things were more alike than different (and prediction only got better with each generation), but I don't know about other processor lines.