*My understanding (let me know if I'm wrong) is that Hyperthread aware OSes (whi...

namibj · on Aug 23, 2018

I have a handful of examples, https://github.com/simonfuhrmann/mve/tree/master/libs/dmreco... is one, which are coded without too much respect towards using cache-efficient data structures, in fact it's actually hrader in C++ to not totally ignore the cache handling data as whole cachelines. Note that in any cases the compiler could use more respectful datastructures with at least very similar performance even if they don't spill out of cache.

In this case, reconstruction 2 MP images on a quadcore E3 Skylake, the performance without HT was better, and even better after replacing some of the pathological uses with B-tree and similar structures under iirc MIT/BSD using the same interface (it was just a typedef away). Also they used size_t for thenumber of an image in your dataset, yet their software is far from scaling that far without a major performance fix due to the cost//benefit of optimization leaning towards a good couple sessions with a profiler, before spending the money on the compute (unless the deadline precludes it).

The dataset still doesn't fit into L3, and even then there are ways to block the image similar to matrix multiplication.

perf stat -dd works wonders. The ubuntu package is perf-tools-unstable, iirc, and setting lbr for the callgraoh of perf top if you run on Haswell or newer gives you stack traces for code compiled with -fomit-frame-pointer.

Sohcahtoa82 · on Aug 23, 2018

> In my floating-point heavy tests on i7 however, there is still a small advantage in leaving HT on, the common wisdom is if you are doing FP, HT is pointless and may actually harm performance, but that doesn't match my observations if your working set doesn't fit into L2 cache. YMMV.

I benchmarked this myself using POV-Ray (which is extremely heavy on floating point) when I first got my i7-3770k (4 cores, 8 threads).

Using two rendering threads was double the speed of one, four was double the speed of two, but eight was only about 15% faster than four.

I don't think I've ever actually seen an example of real-world tasks that get slowed down by HT. Every example I've seen was contrived, built specifically to be slower with HT.