Question: If I rent a 4-core AWS instance, does it mean 4 physical cores or 4 hyperthreaded cores? Is there a standard definition of “cores” across GCP, DO, Linode, etc.? I don’t have the experience or knowledge about cloud computing; I just have a DO instance running a web server. I’m curious.
A cloud "vCPU" is a hyperthread and in good providers (EC2/GCE) they are properly pinned to the hardware such that, for example, a 4-vCPU VM would be placed on two dedicated physical cores. This was probably done for performance originally but now it also has security benefits. You can get hints of this by running lstopo on VMs and similar bare metal servers.
On second- and third-tier cloud providers, the vCPUs tend to be dynamically scheduled, so they may share cores with other VMs.
You mean it's random then, right? I mean, let's talk about what a hyperthread really is: it's the leftover functional units (or execution units).
Say you have 100 adders. The processor tries to schedule as many instructions as it can on those 100 adders, but eventually it will run into data dependencies. The leftover units can go to a hyperthread.
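A crude way to see the sharing, assuming stress-ng is installed; whether 0/2 or 0/1 are siblings varies by CPU, so check the sysfs file first:

    # which logical cpus share cpu0's physical core (numbering varies):
    cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
    # two FP-heavy workers squeezed onto the two hyperthreads of one core...
    taskset -c 0,2 stress-ng --cpu 2 --cpu-method double --timeout 10s --metrics-brief
    # ...versus spread over two physical cores; compare the bogo-ops/s
    taskset -c 0,1 stress-ng --cpu 2 --cpu-method double --timeout 10s --metrics-brief

If the two hyperthreads really do share execution units, the per-worker throughput in the first run should come in well below the second.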
My understanding (let me know if I'm wrong) is that hyperthread-aware OSes (which is like what, everything since WinXP/Linux kernel 2.4?) will schedule lower-priority tasks to the logical cores and higher-priority tasks to the real cores.
So when it comes to a hosted provider (a.k.a. cloud provider, a.k.a. somebody else's computer), what you get pretty much depends on the virtualisation layer they use: VMware, KVM, Xen, Hyper-V, etc.
Do hypervisors typically peg VMs to a real physical core? I was always under the impression they over-provision on most hosts, so you're getting part of a core, and the vCPU count they list in the product documents just indicates your priority and how many vCPUs appear to your particular VM.
> My understanding (let me know if I'm wrong) is that hyperthread-aware OSes (which is like what, everything since WinXP/Linux kernel 2.4?) will schedule lower-priority tasks to the logical cores and higher-priority tasks to the real cores.
You understand it wrong, even though you're somewhat correct as to what the scheduler actually does. In a single physical core, both logical processors are equivalent, and neither one has higher internal priority over the other. The hyperthreading-aware scheduler will take extra care in scheduling in this scenario, but not in the sense you describe: if you have 2 physical cores, and thus 4 logical processors, and 2 CPU-intensive tasks, the scheduler might attempt to schedule them on different physical cores instead of stuffing them onto the two logical processors of a single physical core. It's not because one logical core is better than the other; it's because the two tasks would simply compete with each other in a way they wouldn't if they were on physically separate cores.
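You can watch the Linux scheduler make that choice; a rough sketch (which logical CPUs are siblings varies by model, so check sysfs):

    # start two cpu-hungry tasks and leave them unpinned...
    yes > /dev/null &
    yes > /dev/null &
    # ...then see which logical cpus they landed on (the PSR column):
    ps -C yes -o pid,psr,comm
    # map those numbers against the sibling pairs:
    cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list
    # on an otherwise idle box they should sit on different physical cores
    kill %1 %2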
> My understanding (let me know if I'm wrong) is that hyperthread-aware OSes (which is like what, everything since WinXP/Linux kernel 2.4?) will schedule lower-priority tasks to the logical cores and higher-priority tasks to the real cores.
That is not how I understand it. The OS sees two identical logical cores per physical core and the CPU manages which is which internally. Also it's not really high and low priority - it's two queues multiplexing between the available execution units. If one queue is using the FPU then the other is free to execute integer instructions, but a thousand cycles later they might have switched places. Or if one queue stalls while fetching from main memory, the other gets exclusive use of the execution units until it gets unstuck.
In my floating-point-heavy tests on an i7, however, there is still a small advantage in leaving HT on. The common wisdom is that if you are doing FP, HT is pointless and may actually harm performance, but that doesn't match my observations when the working set doesn't fit into L2 cache. YMMV.
A semi-modern OS will try to keep a process on the same physical core if it can, so it may be flip-flopping between the two logicals but should still see the same cache. Disabling HT means the OS still sees logical cores, just half as many of them, with a 1:1 correspondence between logicals and physicals.
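On Linux you can see that correspondence directly; a quick sketch (the runtime SMT knob is an assumption on my part in that it needs a reasonably recent kernel, 4.19 or so):

    # logical-to-physical mapping as the OS sees it: with HT on, two CPU
    # rows share each CORE id; with HT off the mapping is 1:1
    lscpu --extended=CPU,CORE,SOCKET,ONLINE
    # toggle SMT at runtime instead of rebooting into the BIOS
    echo off | sudo tee /sys/devices/system/cpu/smt/control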
I have a handful of examples; https://github.com/simonfuhrmann/mve/tree/master/libs/dmreco... is one. They are coded without too much respect for cache-efficient data structures; in fact, in C++ it's actually harder to not totally ignore the cache and handle data as whole cachelines. Note that in any case the code could use more cache-respectful data structures with at least very similar performance, even when they don't spill out of cache.
In this case, reconstructing 2 MP images on a quad-core E3 Skylake, the performance without HT was better, and even better after replacing some of the pathological uses with a B-tree and similar structures (under an MIT/BSD license, IIRC) exposing the same interface (it was just a typedef away). Also, they used size_t for the number of an image in your dataset, yet their software is far from scaling that far without a major performance fix; the cost/benefit of optimization leans towards a good couple of sessions with a profiler before spending the money on the compute (unless the deadline precludes it).
The dataset still doesn't fit into L3, and even then there are ways to block the image, similar to blocked matrix multiplication.
perf stat -dd
works wonders. The Ubuntu package is perf-tools-unstable, IIRC, and setting lbr as the call-graph mode of perf top, if you run on Haswell or newer, gives you stack traces even for code compiled with -fomit-frame-pointer.
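Put together, the workflow looks roughly like this (./your_program is a placeholder; as said, lbr needs Haswell or newer):

    perf stat -dd ./your_program          # cache, branch and frontend counters
    perf record --call-graph lbr ./your_program
    perf report                           # stacks despite -fomit-frame-pointer
    perf top --call-graph lbr             # same idea, live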
> In my floating-point-heavy tests on an i7, however, there is still a small advantage in leaving HT on. The common wisdom is that if you are doing FP, HT is pointless and may actually harm performance, but that doesn't match my observations when the working set doesn't fit into L2 cache. YMMV.
I benchmarked this myself using POV-Ray (which is extremely heavy on floating point) when I first got my i7-3770k (4 cores, 8 threads).
Using two rendering threads was double the speed of one, four was double the speed of two, but eight was only about 15% faster than four.
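If anyone wants to reproduce it, a rough sketch; +WT sets the worker-thread count in POV-Ray 3.7 and -benchmark runs the built-in scene, but check the flags against your version:

    # time the standard benchmark at each thread count and compare
    for n in 1 2 4 8; do
        echo "threads: $n"
        time povray -benchmark +WT$n > /dev/null 2>&1
    done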
I don't think I've ever actually seen an example of real-world tasks that get slowed down by HT. Every example I've seen was contrived, built specifically to be slower with HT.
From my understanding, you can't (necessarily) even rely on your guest's CPUs mapping to the host's actual CPUs, which makes spending time twiddling NUMA actively useless. Assuming that's actually the case, I very much doubt the guest's scheduler has the ability to schedule tasks between logical and physical cores, based on priority.
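You can at least dump the topology the guest thinks it has (numactl lives in the numactl package on most distros); whether it corresponds to anything on the host is exactly the pinning question from upthread:

    # the NUMA layout as seen from inside the guest; on providers that pin
    # vCPUs this may mirror the host, on oversubscribed hosts it can be fiction
    numactl --hardware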