

Most recent Intel cores actually appear to be slightly slower at integer performance. http://imgur.com/a/2fiLF


A lot of their bugs are triggered when you have one cgroup with multithreaded or multi-process tasks and another cgroup with just a few processes using CPU time.

Just testing one benchmark will not show it, unless you have something else running too.


Ok, it depends on what you mean by something else running. Another PID? That's why I tested make -j32. I also tested multi-threaded applications from a single PID (with more threads than our CPU count), since that best reflects our application workloads.
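
Roughly the shape of the test, for reference (the job count is just what fits our machines):

    # parallel build across all cores
    time make -j32
    # same build again with a competing single-threaded hog pinned to one core
    taskset -c 0 yes > /dev/null &
    time make -j32
    kill %1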

They ought to be posting it to lkml, where many engineers regularly do performance testing. I've looked enough to think that my company isn't really hurt by this.


Just read the paper; it's explained there. Or read the presentation, which uses pictures.

Basically they run R, a single-threaded statistics tool which is set up to hog a core, and in some other cgroup a wildly multithreaded tool. If you have a NUMA system (check with `lstopo`) then it's possible that the scheduler thinks the many tasks in one domain of cores are balanced with just R on one core of another domain. Meaning you can have several (e.g. 7 out of 8) cores idle. It has to do with the way hierarchical rebalancing is coded, and with the fact that their 8x 8-core AMD machine has a deep hierarchy.
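
A rough reproduction sketch, assuming cgroup v1 mounted at /sys/fs/cgroup (the group names are made up, and `yes` stands in for their R workload):

    sudo mkdir /sys/fs/cgroup/cpu/hog /sys/fs/cgroup/cpu/parallel
    # single-threaded hog in one group
    sudo sh -c 'echo $$ > /sys/fs/cgroup/cpu/hog/tasks; exec yes > /dev/null' &
    # wildly multithreaded job in the other (use whatever parallel workload you have)
    sudo sh -c 'echo $$ > /sys/fs/cgroup/cpu/parallel/tasks; exec make -j64' &
    # watch whether cores sit idle despite runnable threads (mpstat is in sysstat)
    mpstat -P ALL 1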


It seems like everything with buffer/process queues needs a good polish to optimize for fairness.

E.g. this is just being readied: http://blog.cerowrt.org/post/fq_codel_on_ath10k/

(also note the higher throughput as a result)
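
fq_codel itself has been in mainline since 3.5, so on a wired NIC you can already try it (the interface name is an example):

    tc qdisc replace dev eth0 root fq_codel
    tc -s qdisc show dev eth0    # shows per-qdisc drop/ECN-mark statistics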


You can check whether your system is NUMA with `lstopo`.
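
E.g. (lstopo ships with hwloc; numactl works too):

    lstopo --no-io        # draws packages, NUMA nodes, caches and cores
    numactl --hardware    # lists the nodes with their CPUs and memory sizes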


That might be what they call the 'overload on wakeup' bug. Maybe try the patch? I've read that the patch additionally needs a small fix: the position of a goto label went missing.

Probably just before rcu_read_unlock() in that function: http://lxr.free-electrons.com/source/kernel/sched/fair.c#L51...


This research applies to 'NUMA' systems: commonly servers with multiple physical CPUs that each have a connection to their own memory banks. One CPU can access the memory of the other by requesting it, but that takes time. So the process scheduler has to take that into account, usually by keeping processes somewhat affixed to the node where they started.
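
You can measure that penalty yourself with numactl, running the same program with local and then remote memory (./membench is a placeholder for any memory-bound benchmark):

    # CPUs of node 0, memory of node 0: all accesses stay local
    numactl --cpunodebind=0 --membind=0 ./membench
    # CPUs of node 0, memory of node 1: every access crosses the interconnect
    numactl --cpunodebind=0 --membind=1 ./membench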

Off-topic, but high-performance long-running processes are mainly programmed in C, C++ & Java. Maybe stuff like Rust and Swift in the future. Fortran if you are doing mathematical computation, but then you'd probably already be using it if you needed it.

If what you mean is high traffic on PHP or Node systems across multiple servers, you probably want to look at Elixir and its Phoenix web framework. It's more appropriate for responsiveness (as in low latency), and it has less boilerplate than Java. |> http://www.phoenixframework.org/docs/overview


From what I understand from this presentation, the 'scheduling domain' abstraction is reused at different levels of the hierarchy. So for example the two hyperthreads on one physical core are also modeled as a 'scheduling domain'.

https://events.linuxfoundation.org/images/stories/slides/lfc...
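
If your kernel is built with CONFIG_SCHED_DEBUG you can inspect that hierarchy directly; each domainN directory is one level (hyperthreads, cores in a package, NUMA nodes):

    ls /proc/sys/kernel/sched_domain/cpu0/
    # typically domain0, domain1, domain2, ... from SMT up to NUMA
    cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags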


The scheduler, looking at an idle core, decides whether to steal work from an overloaded neighbour. It will only compare over the interconnects in the figure (between domains).

E.g. the two cores in the dark grey box can steal work from each other. But they will only see load averages of the neighbouring domain. In certain cases the current scheduler calculates those load figures oddly, so the idle core decides that a neighbouring overloaded 'scheduling domain' is not overloaded.
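
If CONFIG_SCHEDSTATS is enabled, /proc/schedstat exposes per-domain balancing counters, so you can see at which level of the hierarchy pulls happen (or fail to):

    # one "domainN <cpumask> <counters>" line per level under each cpu line
    grep -A 3 '^cpu0 ' /proc/schedstat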


It should probably mainly be seen as a proof of concept for their scheduler decision visualization tools, which are not public for the time being. Those should make checking and fixing bugs easier in the future.

