Article is better than expected. Because it looks like the application itself isn't doing much at all (receive message over socket, touch some memory) you're probably better off with a simple thread pool and some lock free data structures if you're really going for raw performance. On the other hand, it does serve as another data point that a Java solution can be fast enough.
Anyway, I think that thinking in terms of "message throughput" is really harmful. The author starts at 120,000 messages per second and ends up at 800,000 messages per second, which gives the advertised 6x speedup. But at 120,000 messages per second you have a "process overhead" of a mere 0.008 milliseconds per message. That's 8 microseconds. Microseconds!
I'll just repeat that: an overhead of 0.008 milliseconds per message in the SLOW case. So if you want to write the message to a file, do any kind of hash table lookup, or do any kind of back-end processing, whether you have 0.008 milliseconds or roughly 0.001 milliseconds of overhead per message is not going to be that big of a deal.
Usually the 'real work' required to process the message would drown this overhead in the noise. The moment a message requires you to establish an outgoing connection, open a file, hit the database, or do something similar, that work dominates and the throughput figure goes out the window.
Also, it seems that the overhead here is poor use of shared memory or threading semantics, something we cannot debug without the source (and the required time).
> it seems that the overhead here is poor use of shared memory or threading semantics
Worth noting on the graphs - even though CPUs aren't maxed out, they add up to (roughly) just above 100%. That just shouts "lock contention / thread switching".
There are plenty of 'real' workloads that don't involve opening a file, hitting a database or establishing an outgoing connection. Even a relatively simple task like serving up a web-app uses caching heavily to avoid being bogged down by the grinding slowness of disk access. If he's aiming for performance, then the message processing overhead could be very significant, and is worth optimising.
My advice is to find the bottleneck first. Watch performance counters. How do your cache hit and miss rates at each cache level vary with the CPU count? For example, poorly designed GCs don't scale well on multiple processors. You could very well be paying for a single shared allocator lock or have a young generation shared across multiple packages.
When are your threads idle? Can you produce log events for when your threads block on locks to see how often your threads are live versus sitting around?
Do you have bad thread migration? Even if your GC is careful and you only allocate thread-local data, if the OS keeps moving your threads between packages every timeslice, you'll pay a huge hit.
Analyzing performance data for multi-core program execution is hard, particularly in a virtual machine environment. It's even worse with some of the latest 4 and 6 core processors, where even measuring cache miss rates can have a noticeable impact on performance. (shameless plug: these issues and parallel language features are what our research group works on -- http://manticore.cs.uchicago.edu)
>For example, poorly designed GCs don't scale well on multiple processors. You could very well be paying for a single shared allocator lock or have a young generation shared across multiple packages.
Java's GC is anything but poorly designed, and it definitely scales in spades.
>When are your threads idle? Can you produce log events for when your threads block on locks to see how often your threads are live versus sitting around?
A good profiler should be able to show this. I've had a lot of luck with YourKit, but others may work just as well.
Poorly designed was a harsh choice of words. A better way of putting it would be "not yet optimized for single program execution processor counts > 6 or so." The allocation wall (http://portal.acm.org/citation.cfm?id=1639949.1640116) is something a lot of folks are seeing in practice across a variety of virtual machine-based languages (including Java) when scaling to about 8 cores.
Hrm, interesting. I would have thought the way allocation is done thread-locally would negate a lot of the scaling problems for allocs, but I could be wrong. Do you have a direct link to the PDF, or mind putting it somewhere? I don't have an ACM login.
Sorry, but unfortunately I can't. The authors are permitted to put it online (where scholar.google and CiteSeerX usually immediately find them) but in this case they don't appear to have done so.
I think the bottleneck is the problem he is trying to solve: forwarding messages from one producer to one consumer, in order. This is an inherently sequential problem; all it requires is a FIFO and a select on two file descriptors. The only thing that changes if you introduce threads is that they must continuously coordinate to find out who should send the next message. "I'm only dangerous with synchronized blocks", indeed.
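To make that concrete, here's a minimal sketch of the single-threaded shape described above (the addresses, ports and buffer size are invented for the example, and this is not the author's code):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.nio.channels.SocketChannel;

    // Hypothetical single-threaded forwarder: one thread selects on the producer
    // socket and copies whatever arrives straight to the consumer socket.
    // No locks, no coordination, and ordering is preserved by construction.
    public class SingleThreadForwarder {
        public static void main(String[] args) throws IOException {
            SocketChannel producer = SocketChannel.open(new InetSocketAddress("localhost", 9000));
            SocketChannel consumer = SocketChannel.open(new InetSocketAddress("localhost", 9001));
            producer.configureBlocking(false);   // consumer stays blocking: writes drain fully

            Selector selector = Selector.open();
            producer.register(selector, SelectionKey.OP_READ);

            ByteBuffer buffer = ByteBuffer.allocateDirect(64 * 1024);
            while (true) {
                selector.select();               // wait until the producer has data
                selector.selectedKeys().clear(); // only one key is ever registered
                buffer.clear();
                if (producer.read(buffer) == -1) {
                    return;                      // producer closed the connection
                }
                buffer.flip();
                while (buffer.hasRemaining()) {
                    consumer.write(buffer);      // in-order hand-off to the consumer
                }
            }
        }
    }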
While htop is a cool program with some nice features, the standard Linux top can show you everything you need to know. You can hit 'H' to show all the threads in a process. You can add the 'j' column, which will show you the process's last known CPU assignment. You already get %CPU utilization per process, and you can cut down what top shows you by hitting 'i' to only see processes/threads in the run queue. I like to add 'z' for cool color, and '1' to see SMP CPU details. While standard top doesn't show you cool CPU graphs, it's still pretty great.
It's a pity there is no source with the article. At a guess I would say there is some resource contention going on that drops away when the task is assigned to a single core (effectively that stops the task from being truly multithreaded; the threads still run, but in separate timeslices rather than concurrently).
Chances are that if you really inspected the code you'd be able to pinpoint the single resource that's causing the threads to block.
Once a single core is handling the process, all the threads can run to the end of their slice, and so more messages are handled per second.
On another note, I'm impressed with the author's focus on speed, but I'd work on releasing something rather than optimizing something that already handles an insane number of messages per second. Unless there is a really important reason for it, this is just premature optimization.
The interesting part (to me) is that, regardless of the cause of this issue, there are likely other (maybe many other) applications out there that could "benefit" from running on only a single core.
The server favors synchronized blocks over volatiles (i.e. Atomics) at the moment. I tend to play things conservative until I find a problem or see an obvious opportunity. And I hope I can say I'm not "over-synchronizing", but I did synchronize as often as I determined it was needed (modulo the normal issue of humans not being very good at that).
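For readers who haven't seen the two styles side by side, here is a minimal hypothetical sketch (the server's source isn't published, so none of these names come from it):

    import java.util.concurrent.atomic.AtomicLong;

    // Illustration only: the two styles discussed above, applied to a trivial counter.
    public class CounterStyles {

        // Conservative style: a plain field guarded by a synchronized block.
        // Correct, but every increment acquires and releases a monitor.
        static class SynchronizedCounter {
            private long value;
            public synchronized void increment() { value++; }
            public synchronized long get()       { return value; }
        }

        // Lock-free style: AtomicLong uses a compare-and-swap loop internally,
        // so threads never block, though they can retry under contention and
        // still bounce the cache line between cores.
        static class AtomicCounter {
            private final AtomicLong value = new AtomicLong();
            public void increment() { value.incrementAndGet(); }
            public long get()       { return value.get(); }
        }
    }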
Again, however: even if the server is written in the silliest manner you can imagine, the idea still seems to expose an issue (and roundabout solution) I've never heard of before.
- The difference in performance is too big to be a memory bus issue
- Since the number of threads doesn't change, just the number of processors they run on, it's not a scalability problem in the program itself.
- It's most likely not an issue with which CPUs he chose; in a hyperthreaded system there can be a pretty large performance hit if he pinned the server to logical CPUs that are on the same physical core. I don't know how Intel numbers its cores in hyperthreaded processors, but their general approach is "make the cores that are closest to each other have the biggest difference in CPU ids", which would mean that, for example, CPUs 1 and 5 are on the same core.
- I would guess that it's not a thread migration issue, since I wouldn't expect the Linux scheduler to be that bad.
Poor cache performance is the most likely cause. As a minimal example of what the author concludes, if you have two threads that sit in a loop incrementing the same counter, it is much faster to run those two threads on the same cpu than on their own cpus. This is because when running on separate cpus, that particular cache line will continually be ping-pong'ed between the two. Even if threads don't access the same addresses, if they access addresses that are in the same cache line, you'll still get ping-ponging ("false sharing"). This explanation is also supported by the fact that it disappears on i7 processors; one of the largest improvements with the Nehalem line is the improved cache architecture.
Poor cache behavior is usually the culprit in non-cpu-bound and non-io-bound systems.
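For concreteness, a small hypothetical demo of the false-sharing effect described above (iteration count and padding are arbitrary, and nothing here comes from the article's code):

    // Two threads each increment their own counter. In Counters the fields very
    // likely share a 64-byte cache line, so on separate cores every write
    // invalidates the other core's copy; PaddedCounters spreads them apart.
    // (Crude padding: the JVM may reorder fields; @Contended in later JDKs is
    // the reliable way to do this.)
    public class FalseSharingDemo {

        static class Counters {
            volatile long a;
            volatile long b;
        }

        static class PaddedCounters {
            volatile long a;
            long p1, p2, p3, p4, p5, p6, p7;   // push 'b' onto another cache line
            volatile long b;
        }

        static final long ITERATIONS = 100_000_000L;

        static long runMillis(Runnable first, Runnable second) throws InterruptedException {
            Thread t1 = new Thread(first);
            Thread t2 = new Thread(second);
            long start = System.nanoTime();
            t1.start(); t2.start();
            t1.join();  t2.join();
            return (System.nanoTime() - start) / 1_000_000;
        }

        public static void main(String[] args) throws InterruptedException {
            Counters shared = new Counters();
            PaddedCounters padded = new PaddedCounters();

            long sharedMs = runMillis(
                    () -> { for (long i = 0; i < ITERATIONS; i++) shared.a++; },
                    () -> { for (long i = 0; i < ITERATIONS; i++) shared.b++; });
            long paddedMs = runMillis(
                    () -> { for (long i = 0; i < ITERATIONS; i++) padded.a++; },
                    () -> { for (long i = 0; i < ITERATIONS; i++) padded.b++; });

            System.out.println("same cache line: " + sharedMs + " ms, padded: " + paddedMs + " ms");
        }
    }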
He should really be using dstat (or something like that) to show number of context switches per second.
His benchmark is mostly a measurement of Linux kernel thread context switching, which is a lot slower across cores/CPUs. I suspect that if he tweaked the Linux kernel HZ to 1000, he'd see another 3-4x improvement.
In most real-world servers the threads are actually doing something real, like updating some complex data structure or doing some transformation (encoding/decoding, etc.), which would make context-switching overhead less of a problem.
However the benchmark does illustrate the importance of context switching performance for certain workloads.
If I had to guess, I'd say the code is doing so much synchronization between threads that, in the end, only one thread really gets to run at a time anyway. At that point, moving the process from CPU to CPU is the only thing going on besides running that one thread, hence it's way faster if it doesn't have to do any moving around.
IMHO this is an indication of a bug in the code and not generic advice. Or does somebody see another issue that might cause this?
Sort of agree, although my intuition tells me it's not an excessively-coarse-locking problem but most likely a contention problem.
The essence of a contention problem, for those who don't know, is that some shared structure (maybe a counter, or the head of a queue, or something) is polled and updated for every message that needs to be processed. Since processors don't (usually) share their caches, updating a value in shared memory causes that cache line on the other processors to be invalidated, so the next time the other processor needs to poll that value it has to retrieve it from memory, which takes a lot of extra time compared to getting it off the L1.
This problem often arises with shared work queues and counters. For work queues you can solve it by giving each processor its own work queue and having it steal work from other processors when it finishes its own queue.
A textbook by Herlihy and Shavit, The Art of Multiprocessor Programming, discusses this stuff in detail.
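A rough sketch of the per-worker-queue idea (class and method names invented here; java.util.concurrent.ForkJoinPool in later JDKs implements the pattern properly):

    import java.util.concurrent.ConcurrentLinkedDeque;
    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical sketch of per-worker queues with stealing. Each worker pushes
    // and pops its own deque at one end (no contention in the common case) and
    // only touches another worker's deque, at the opposite end, when it runs dry.
    public class WorkStealingSketch {

        private final ConcurrentLinkedDeque<Runnable>[] queues;

        @SuppressWarnings("unchecked")
        public WorkStealingSketch(int workers) {
            queues = new ConcurrentLinkedDeque[workers];
            for (int i = 0; i < workers; i++) {
                queues[i] = new ConcurrentLinkedDeque<>();
            }
        }

        /** Called by worker 'id' to add work to its own queue. */
        public void submit(int id, Runnable task) {
            queues[id].addLast(task);
        }

        /** Called by worker 'id' to fetch its next task, stealing if necessary. */
        public Runnable next(int id) {
            Runnable task = queues[id].pollLast();     // own queue, LIFO end
            if (task != null) return task;
            int victim = ThreadLocalRandom.current().nextInt(queues.length);
            return queues[victim].pollFirst();         // steal from the other end
        }
    }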
This hypothesis makes a lot of sense. One should be able to test it on a Nehalem architecture: benchmark the program running on two cores, once with two different physical cores and once with two logical cores on the same physical core. Your theory predicts the latter option to be much faster, and faster than the single-core mode, because the two virtual cores share the same caches.
This is pretty much correct. Another solution is to keep an array of per-CPU counters/structures, where each CPU only updates its own entry.
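Java has no portable per-CPU index, so a hypothetical sketch of the same idea stripes by thread instead (all names invented; LongAdder in later JDKs packages this trick), padding the slots so hot entries don't share a cache line:

    import java.util.concurrent.atomic.AtomicLongArray;

    // Hypothetical striped counter: each thread updates its own slot, and slots
    // are spread out so two hot slots are unlikely to share a cache line. Reads
    // sum all slots, which is fine for a statistic that doesn't need to be exact
    // at every instant.
    public class StripedCounter {

        private static final int STRIDE = 8;      // 8 longs = 64 bytes apart
        private final AtomicLongArray cells;
        private final int stripes;

        public StripedCounter(int stripes) {
            this.stripes = stripes;
            this.cells = new AtomicLongArray(stripes * STRIDE);
        }

        public void increment() {
            int index = (int) (Thread.currentThread().getId() % stripes) * STRIDE;
            cells.incrementAndGet(index);
        }

        public long sum() {
            long total = 0;
            for (int i = 0; i < stripes; i++) {
                total += cells.get(i * STRIDE);
            }
            return total;
        }
    }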
He's using an Intel Core i7. This architecture is very similar to AMD's HyperTransport-based designs in that each CPU is directly attached to a bank of memory and has an on-die memory controller.
This has obvious performance benefits because smart operating system kernels can take advantage of it and locate a process's memory on the bank connected directly to the CPU the process is running on. There is a slight disadvantage though when running highly multithreaded applications: if a thread running on CPU A wants to access memory attached to CPU B, it takes an extra hop to get there. This increases memory latency.
What is happening is that he has a highly multithreaded application where threads are accessing memory attached to a different CPU. When he restricts the process to a single core, all memory is directly attached to the CPU and there are no extra hops, decreasing memory latency and increasing performance dramatically.
This trick only works on processes that can effectively run on only one CPU and within the memory limit of that CPU's attached RAM.
I may have to revise my theory, because it seems he is running a single-socket system (quad core, hyperthreaded to look like 8 cores). Without multiple physical CPUs, the extra memory hop doesn't exist.
This could also be an issue caused by hyperthreading itself. The extra 4 cores are not real physical cores on the die. They are just an extra thread that runs on each of the existing cores. What could be happening is that there is some thread locking going on where one thread is waiting for a message or event from a thread that is executing on the virtual second core, which in turn isn't executing right at the moment because it's not a real core.
An interesting test would be to have him turn off Hyperthreading in the BIOS and see if that improves performance without limiting to only 1 core. Alternatively, he could restrict the process to run on CPUs 0, 2, 4, and 6, which should run it on only the "real" cores.
I always disliked Intel's hyperthreading. Don't try to fool me into thinking I have more cores than I really do. I know it speeds up some apps, but it hurts others.
If you disliked HT in the Pentium 4, take another look at it today. The Pentium 4 had four dispatch ports and not too many execution units - a single thread was able to exploit that. The i7, in contrast, has 6 dispatch ports feeding 12 execution units. It's extremely difficult to find instruction-level parallelism within a single thread that can exploit that architecture, hence the need for HT to keep the execution units cranking. These days it's quite rare to find applications which are slowed by HT.
You seem to think a hyperthreaded system exposes to the OS 4 "real" cores and 4 "virtual", slower ones. This is not true: if you have two threads executing on a single core at the same time, each gets equal attention/performance.
Also, hyperthreading really does make a lot of sense on today's hardware. The gap between CPU and memory performance is wide and getting wider each year; when you have a cache miss and your core has several hundred cycles to wait for its data to arrive from memory, you'd better have another thread that can keep it busy.
I get your point, but by turning off HT or only running processes on 1 of 2 threads you can ensure that only a single thread is executing on a single core at a time.
Your point about cache misses is well taken. HT is good overall.
I've done a lot of work on similar topics, building real-time messaging systems in C++. The author's findings are not unique and are very common to any producer-consumer messaging system. Essentially the issue is that you want your L2 to stay hot. If a single core is enough to satisfy your processing requirements, adding more threads and more cores isn't going to speed you up. Instead the parallelism will actually put more pressure on the scheduler, causing your threads to bounce around to cores with a cold L2 cache. If your cache is cold, your pipeline stalls out, and your performance is dead.
>Essentially the issue is that you want your L2 to stay hot
Exactly the reason a 'parallel' app can slow down as you get more cores. More contention means more ACTUAL locking, which means more cache flushes, from my understanding.
The thing is, this is a micro-optimization, so the effect of taking a slightly different code path explodes and becomes significant. We could be measuring anything here.
Maybe we're measuring the internal locking mechanics of Java. Maybe we're measuring the TCP stack internals. Maybe it's just about the CPU cache. Maybe it's about context switching in his specific linux kernel.
It could be any number of things you're measuring here. Anything!
His idea of running Apache Tomcat and ApacheBench shows that he didn't really think it through: OF COURSE a 6 microsecond difference is not going to show up when processing HTTP headers takes milliseconds! Orders of magnitude of difference here!
If it was a mutex contention issue, moving it to a single core wouldn't change anything -- there would still be contention. I'm guessing it has something to do with CPU cache hits.
No, that's not right. There is only one code path executing so there can never be another thread contending for the same lock.
A lock begins as a spinlock (busy wait loop) for a few thousand iterations before degrading to a full-blown lock that puts the thread to sleep. Because there is no other thread vying for the lock, the spinlock succeeds immediately and the thread keeps on running.
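HotSpot's real adaptive locking (thin locks, biased locking, lock inflation) is far more elaborate, but a toy sketch of that "spin for a while, then sleep" idea might look like this (constants and names invented):

    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.locks.LockSupport;

    // Hypothetical spin-then-sleep lock, only to illustrate why an uncontended
    // lock stays cheap: with no other thread competing, the first CAS succeeds
    // and neither the spin loop nor the sleep path is ever taken.
    public class SpinThenParkLock {

        private static final int SPIN_LIMIT = 4096;
        private final AtomicBoolean locked = new AtomicBoolean(false);

        public void lock() {
            // Fast path: busy-wait for a bounded number of attempts.
            for (int spins = 0; spins < SPIN_LIMIT; spins++) {
                if (locked.compareAndSet(false, true)) {
                    return;
                }
            }
            // Slow path: back off instead of burning the CPU.
            while (!locked.compareAndSet(false, true)) {
                LockSupport.parkNanos(1_000);   // crude stand-in for a real wait queue
            }
        }

        public void unlock() {
            locked.set(false);
        }
    }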
> There is only one code path executing so there can never be another thread contending for the same lock.
Are you sure there is only one code path executing? Don't all the threads get allocated time slices of the single CPU? I believe it's true that there is only one concurrent code path, but a thread can still acquire a lock and then come off the CPU.
Yes, you are right - and my previous comment somewhat skims over the details. Sorry for that. :)
The difference in a single-core scenario is that the threads can't all become starved because a thread on another CPU holds the lock - the VM will simply cycle to the next thread and continue. In interpreted mode you don't even suffer a context switch.
One edge case is where a thread acquires a lock and is then itself put to sleep because of blocking I/O or trying to obtain another lock. But that is simply a bad practice.
An inter-core mutex is MUCH slower than a single-core one because of the "memory barrier" problem. See, any lock depends upon a single memory location that is atomically tested and set. When more than one CPU is involved, that location has to be plumbed down through the caches until it's in memory common to all involved CPUs, which is probably main memory when more than 2 cores are involved. On a single CPU it's the L2 cache.
I think you are on to something. His example code uses AtomicLong, a class that needs to synchronize on itself if the architecture has no native way to atomically update 64-bit integers. Too bad Paul doesn't mention whether he is running a 32-bit or a 64-bit VM.
I'm fairly sure the code simply hasn't been made to scale properly with more cores or processors. The fact that he doesn't write much about the code in the article suggests to me that he doesn't really know how to make processing scale -- for anything nontrivial that's not easy at all, and I'd personally be able to make it more optimal in C than in Java, simply because there's more that I don't directly control in Java.
> The fact that he doesn't write much about the code in the article suggests to me that he doesn't really know how to make processing scale
If you read his blog, you may revisit that opinion. He wrote the custom email server that powers Mailinator, which processes over 2 million emails an hour.
I still claim he doesn't understand scalability on multiprocessor platforms. I don't claim he didn't build fast processing by writing his custom code, but he writes that his code is fastest on a single core, and as far as I can see, he doesn't know why. So I conclude he knows how to code, but only non-parallel code.
Guys, he's only pushing 120MB/s over the loopback. That's totally lame. Sorry, but there's something wrong in the software stack. All the comments about hardware, caches, blah blah are just FUD. Locked to a single CPU, he should be pushing well over this (nearing an order of magnitude more) if the software stack were doing the correct thing. Do the math, or read my comment when it arrives on the article.
You can reproduce this bandwidth usage in slow-ass languages like Ruby, and I've done so many times. Btw, it's ~120MB/s he's peaked at. DDR2 peaks at what, in the region of 10GB/s? The i7's memory architecture should totally murder that, iirc.
raggi@mbk: ~/volatile % time dd if=/dev/zero of=bigfile bs=1048576 count=500
500+0 records in
500+0 records out
524288000 bytes transferred in 1.527235 secs (343292283 bytes/sec)
real 0m1.532s
user 0m0.003s
sys 0m0.796s
raggi@mbk: ~/volatile % time sh -c "cat bigfile | cat - > bigfile2"
real 0m2.846s
user 0m0.117s
sys 0m2.185s
The up-to-date version would be "sudo killall -9 eclipse" or something to that effect, although a "sudo killall firefox-bin" will do just fine in most cases.
It's not impossible emacs takes up less space and cycles than gedit...