I like the factual tone of the article, and specifically that it honestly mentions, in a quote from the researchers, that there has been no practical performance benefit in benchmarks yet. I wish more journalism had this balance, and I try to praise it when I see it. It's exciting while avoiding hype.
As scott_s says, you could start by looking at the papers presented at PACT in recent years (not all are about coherence, but there are a few almost every year). You should also look at ISCA and HPCA.
In fact, you could start by looking at the Related Work section of this paper itself. The version at https://people.csail.mit.edu/devadas/pubs/tardis.pdf is better. It is quite telling that the paper does not make the outrageous claim of the title of MIT's press release.
O(log N) memory overhead per block is nothing new. There were commercial systems in the 1990s that achieved it (search for SCI coherence). Note that there are other overheads to consider (notably latency and traffic).
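Rough arithmetic to make the comparison concrete (my own back-of-the-envelope, not from the paper): a full-map directory keeps one presence bit per core for each block, while a pointer-based scheme like SCI keeps a head pointer of about log2(N) bits per block (plus list pointers in the caches, also O(log N)):

```latex
% Per-block directory storage for N cores (back-of-the-envelope, not from the paper)
\[
  \underbrace{N \ \text{bits}}_{\text{full-map bit vector}}
  \qquad \text{vs.} \qquad
  \underbrace{\lceil \log_2 N \rceil \ \text{bits}}_{\text{SCI-style head pointer}}
\]
% Example: N = 256 cores -> 256 bits vs. 8 bits of directory state per block.
```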
This paper is very interesting and looks sound, but MIT's press release makes it look silly.
Excuse me for not even trying to make a summary of the last 30 years of research in this field.
I'm not an architecture researcher, so I can't say from first-hand knowledge. But looking at the titles of papers from previous years of the conference this paper was published in, Parallel Architectures and Compilation Techniques (PACT), would be a good start: http://dl.acm.org/event.cfm?id=RE206
I don't approve of linkbait/hyperbolic titles either, and I don't think it's a good excuse that the titles are selected by the editor rather than the author, but it does seem to be a less serious sin than the typical crap found in article bodies.
The quote from the article
> MIT researchers unveil the first fundamentally new approach to cache coherence in more than three decades
is more hedged ("fundamentally"). I don't know if anyone in this thread has the expertise to evaluate that judgement.
The benefit is obvious in bookkeeping space overhead for the shared cache as the number of cores rises. This could be very useful when/if we ever hit 256 cores per CPU.
The technique reminds me a bit of Jefferson's virtual time (and Time Warp), which comes from a distributed simulation context. Virtualizing time to manage the coherence of reads and writes is a very good idea.
Yes, though IIRC they had to modify the long-standing IA memory ordering rules in order to make that happen. Xeon Phi requires manual fence management in most cases, I believe.
GPUs aren't cache-coherent yet. That said, they have fantastic atomic ops performance relative to CPUs. I'd guess that the lack of performance benefit so far is because it's possible to write many algorithms with the assumption that there is no cache coherency.
Their atomic operations used to be extremely costly from a utilization perspective. They would shut down all other threads in a warp while the thread performing the atomic ran alone. Is that still the case?
Fixed as of Maxwell to the best of my knowledge. But even back then, I found them more efficient for reduction operations in global memory than any other method (using fixed-point math for places where a deterministic sum was required).
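Not the GPU kernel itself, but a minimal CPU-side sketch of the fixed-point trick mentioned above (the SCALE constant and function names are made up for illustration): because integer addition is associative, atomic adds of pre-scaled integers give the same total regardless of the order in which threads commit them, which floating-point accumulation doesn't guarantee.

```c
/* Sketch of fixed-point accumulation for a deterministic sum (illustration only;
 * SCALE and the function names are hypothetical, not from the parent's code). */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define SCALE 1000000LL                     /* fixed-point scale: 6 fractional digits */

static _Atomic int64_t total_fx;            /* shared fixed-point accumulator */

/* Quantize once, then add atomically. Integer addition is associative, so the
 * final total is the same no matter which order concurrent adds land in. */
static void accumulate(double x)
{
    atomic_fetch_add_explicit(&total_fx, (int64_t)(x * SCALE), memory_order_relaxed);
}

int main(void)
{
    accumulate(0.125);
    accumulate(2.5);
    printf("sum = %f\n", (double)atomic_load(&total_fx) / SCALE);   /* 2.625000 */
    return 0;
}
```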
I'm not really a GPU person, but my understanding is that GPUs have very simple memory models; memory reads and writes are scheduled more manually. This is part of their appeal and why they can run so fast, but also why they aren't as general as CPUs.
But my knowledge is a bit out of date, since GPUs are actually including caches now, and what they can do seems to evolve rapidly.
Yeah, the traditional GPU usage is a streaming thing. Textures are read only, vertex and fragment data is either output-only or input-only depending on shader type. The only R/W memory requiring synchronization is the framebuffer, which is handled by specialized hardware anyway.
This seems kind of hyperbolic to me (not unlike most MIT news releases). There have been plenty of new cache coherence mechanisms in the last 30 years. This may be the greatest departure from the classic MOESI and friends, but it's certainly not like the research community has been sitting on their hands all this time.
I implemented a distributed message system/API a long time (10+ years) ago on SMP, AMP, and x86 CPUs that was completely lock-free and non-blocking. The APIs/system ran in both userspace and Linux kernel space.
One thing the APIs depended on was atomic-add. I tried to reach 10 million msgs/second between processes/threads within an SMP CPU group at the time. At 10 million msgs/s, the APIs had 100 ns to route and distribute each message. The main issue was uncached memory access latency, especially for the atomic-add variables. The uncached memory latency was 50+ ns on DDR2 when I measured it on a 1.2 GHz Xeon. It was hard to get that performance.
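For context, the kind of atomic-add dependency being described looks roughly like this in today's C (a minimal sketch; the ring layout and names are mine, not the original API): one fetch-and-add claims a slot per message, and that single contended add is where most of the ~100 ns budget goes.

```c
/* Minimal sketch of lock-free message slot claiming via atomic add
 * (illustration only; RING_SLOTS, msg_ring, next_seq are hypothetical names). */
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 4096                          /* power of two so the mask below works */

struct msg { uint64_t payload; };

static struct msg msg_ring[RING_SLOTS];          /* shared ring buffer */
static _Atomic uint64_t next_seq;                /* the hot atomic-add variable */

/* Each producer claims a unique slot with a single fetch-and-add; on a
 * contended (or uncached) line, this is the expensive part of the fast path. */
static struct msg *claim_slot(void)
{
    uint64_t seq = atomic_fetch_add_explicit(&next_seq, 1, memory_order_relaxed);
    return &msg_ring[seq & (RING_SLOTS - 1)];
}

int main(void)
{
    claim_slot()->payload = 42;                  /* producer side of the hand-off */
    return 0;
}
```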
I even considered adding an FPGA on PCI/PCIe that could be mmap'd to a physical/virtual address and would auto-increment on every read access, to get a very high performance atomic_add.
If that same FPGA were mapped into 128, 256, or 1024 cores, one could easily build a very high speed distributed synchronized message system, hopefully reaching 10+ million msgs/second across 1024 cores.
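The software side of that idea could look something like the sketch below; everything here is hypothetical (the device node, the mapping size, and an FPGA register that increments itself on every read are all assumptions), but it shows why a single mmap'd read would behave like a hardware fetch-and-add shared by every core that maps it.

```c
/* Hypothetical: map an FPGA BAR exposed via a UIO-style device node and read
 * a register that auto-increments on each read, giving a hardware fetch-and-add
 * shared by every core that maps it. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);          /* assumed device node for the FPGA */
    if (fd < 0) { perror("open"); return 1; }

    volatile uint64_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    /* Each read returns the next value; the increment happens in hardware,
     * so no locks or cache-line ping-pong between the cores sharing it. */
    uint64_t ticket = regs[0];
    printf("ticket = %llu\n", (unsigned long long)ticket);

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}
```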
We're having enough trouble utilizing just a few cores and dealing with parallelism and concurrency; I'm not sure how a hardware improvement is going to help.
The generalized popular programming industry: largely web, desktop, or mobile high-level programming. Not an entirely useful generalization, but one common enough to habitually pick up on when "we" is used in such a context.
N00b question: with more cores, wouldn't there also be more overhead for "managing" process execution? There are probably some specific applications where it makes sense to have more cores, but does the everyday mobile user benefit from having thousands of them?
Depends on the application. Amdahl's law comes into play: if there's a non-parallelizable component to what you're doing, it tends to dominate execution time. Managing lots of threads isn't really a big deal for modern operating systems, but communication between threads can be a big bottleneck.
Having lots of cores isn't likely to matter much for mobile users, simply because most mobile apps are neither optimized for parallelism nor CPU-hungry in the first place. If 4 cores are good enough, it doesn't matter if you add hundreds more. That said, there may be specific applications that benefit, such as computer vision.
Basically the workload that an average person is going to have: a bunch of processes running simultaneously with varying degrees of CPU and memory intensiveness.
In special cases you can get better performance from more, less beefy cores (graphics is a prime example), but in general a few powerful cores perform better. The main reason is that communication among cores is hard and inefficient, so only embarrassingly parallel programs work well when divided among many cores. Plus, the speedup you get from parallelizing a program is modest in most cases; see Amdahl's law[1] for more on this topic.
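For reference, the usual statement of Amdahl's law, with p the fraction of the work that parallelizes and N the number of cores:

```latex
% Amdahl's law: speedup on N cores when a fraction p of the work parallelizes
\[
  S(N) \;=\; \frac{1}{(1-p) + \dfrac{p}{N}},
  \qquad
  \lim_{N \to \infty} S(N) \;=\; \frac{1}{1-p}
\]
% e.g. p = 0.95 caps the speedup at 20x, no matter how many cores you add.
```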
Also, I'm not an expert in this area, but I have some familiarity with it. So hopefully someone with a bit more experience can come and confirm (or refute) what I've written.
I remember watching a talk on the C++ memory model, memory ordering, and atomics where it was claimed that the CPU industry was moving toward sequential consistency because cache-coherency protocols aren't expected to be a practical bottleneck in the near future.
This is good news for scaling caches along with cores, though, given how much die space is actually used for cache compared to cores.
Sounds like a nature-inspired design modeled on the brain: thousands of cores at slower speeds.
Most likely we will have a few speedy main cores for non-parallel code, like mobile ARM big.LITTLE, and then thousands of slower cores for parallel work. Will there be transparent cloud execution, where workloads can migrate from the local CPU out to a massive cloud CPU and back?
Apple's shown that pro and consumer-level workstations and laptops have reached the maturity level where integration matters much more than optimization of components. Is it about time for integration between hardware designers and language designers? Languages that have message passing semantics could benefit from a particular hardware architecture that enables that. Functional languages with persistent collections might benefit from specific hardware designs.
I suspect that an Erlang-like language running on hardware specifically designed to support it could achieve tremendous scalability.