Yes, exactly. To first order, I think Spectre didn't really change the performance of existing userspace-only code. What slowed down were system calls, kernel code, and things that were recompiled or otherwise adjusted to mitigate some aspect of Spectre. There might be a rare exception: e.g., IIRC `lfence` got slower on AMD in order to make it more useful as a speculation barrier there, but that's hardly an instruction that saw much use before.
> I don't know what the state of the art is, although I've seen results showing both speedups and slowdowns
Yeah. This seems like a pretty cut-and-dried case where you'd get a speedup from wrong-path misses: the independent next search will be correctly predicted from the start and will access exactly the right nodes, so it acts as highly accurate prefetching. That work only gets thrown out because of a mispredict at the end of the _prior_ search.
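Roughly the shape being described (just a sketch, names mine):

```c
#include <stddef.h>

/* Plain branchy binary search; the comparison branch is what the CPU
 * speculates through. */
static size_t search(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key) lo = mid + 1; else hi = mid;
    }
    return lo;
}

/* Batch of independent lookups. While search(queries[i]) is stalled on
 * its last-level miss, the CPU can speculate past the loop-exit branch
 * into search(queries[i+1]), whose address stream is already correct.
 * The mispredict at the end of search i squashes that work, but the
 * cache lines it touched stay resident -- effectively prefetching. */
void run_queries(const int *a, size_t n,
                 const int *queries, size_t q, size_t *out) {
    for (size_t i = 0; i < q; i++)
        out[i] = search(a, n, queries[i]);
}
```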
The misses within a single binary search are a more ambiguous case: for random input the prediction accuracy drops off like 0.5^n as you speculate n levels deep, but that still adds up to roughly double the memory-level parallelism (MLP) compared to not speculating, so in a microbenchmark it tends to look good. In the real world, with one lookup mixed in among a lot of other code, the many cache lines brought in on the bad path may be worse overall than inserting a speculation barrier yourself.
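For reference, "inserting a speculation barrier yourself" could look something like this on x86 (a sketch, assuming the `_mm_lfence()` intrinsic; as noted above, on AMD `lfence` only became reliably dispatch-serializing as part of the Spectre mitigations):

```c
#include <immintrin.h>  /* _mm_lfence(), x86 only */
#include <stddef.h>

/* Binary search with a speculation barrier per level (hypothetical
 * name). The lfence at the top of each iteration keeps this level's
 * load from issuing until the previous level's branch has actually
 * resolved: no wrong-path misses, but no speculative MLP either. */
static size_t lower_bound_fenced(const int *a, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        _mm_lfence();
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}
```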
That's the cool part: we can choose whether we want speculation or not, if we know up front whether it's harmful.
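And one way to make that choice without an explicit fence (again only a sketch, not from the thread): make the next probe's address data-dependent instead of branch-dependent, as in the classic branchless lower bound. Compilers typically emit a conditional move for the ternary (not guaranteed), so there's no branch to mispredict and no wrong-path misses, at the cost of serializing the loads:

```c
#include <stddef.h>

/* Branchless lower bound: the next probe depends on the comparison via
 * a conditional move rather than a branch, so there is nothing to
 * mispredict -- but the loads form a serial dependency chain, i.e. we
 * also give up the speculative MLP. */
static size_t lower_bound_branchless(const int *a, size_t n, int key) {
    if (n == 0) return 0;
    const int *base = a;
    while (n > 1) {
        size_t half = n / 2;
        base = (base[half] < key) ? base + half : base;  /* usually a cmov */
        n -= half;
    }
    return (size_t)(base - a) + (*base < key);
}
```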