I like the factual tone of the article, and specifically that it honestly mentions, in a quote from the researchers, that there has been no practical performance benefit in benchmarks yet. I wish more journalism had this balance, and I try to praise it when I see it. It's exciting while avoiding hype.
As scott_s says, you could start by looking at the papers presented at PACT in recent years (not all are about coherence, but there are a few almost every year). You should also look at ISCA and HPCA.
In fact, you could start by looking at the Related Work section of this paper itself. The version at https://people.csail.mit.edu/devadas/pubs/tardis.pdf is better. It is quite telling that the paper does not make the outrageous claim of the title of MIT's press release.
O(log N) memory overhead per block is nothing new. There were commercial systems in the 1990s that achieved it (search for SCI coherence). Note that there are other overheads to consider (notably latency and traffic).
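Rough arithmetic to make the comparison concrete (my own back-of-the-envelope, not from the paper): a full-map directory keeps one presence bit per core for each block, while a pointer-based scheme like SCI keeps a head pointer of about log2(N) bits per block (plus list pointers in the caches, also O(log N)):

```latex
% Per-block directory storage for N cores (back-of-the-envelope, not from the paper)
\[
  \underbrace{N \ \text{bits}}_{\text{full-map bit vector}}
  \qquad \text{vs.} \qquad
  \underbrace{\lceil \log_2 N \rceil \ \text{bits}}_{\text{SCI-style head pointer}}
\]
% Example: N = 256 cores -> 256 bits vs. 8 bits of directory state per block.
```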
This paper is very interesting and looks sound, but MIT's press release makes it look silly.
Excuse me for not even trying to make a summary of the last 30 years of research in this field.
I'm not an architecture researcher, so I can't say from first-hand knowledge. But looking at the titles of papers from previous years of the conference this paper was published in, Parallel Architectures and Compilation Techniques (PACT), would be a good start: http://dl.acm.org/event.cfm?id=RE206
I don't approve of linkbait/hyperbolic titles either, and I don't think it's a good excuse that the titles are selected by the editor rather than the author, but it does seem to be a less serious sin than the typical crap found in article bodies.
The quote from the article
> MIT researchers unveil the first fundamentally new approach to cache coherence in more than three decades
is more hedged ("fundamentally"). I don't know if anyone in this thread has the expertise to evaluate that judgement.
The benefit is obvious in bookkeeping space overhead for the shared cache as the number of cores rises. This could be very useful when/if we ever hit 256 cores per CPU.
The technique reminds me a bit of Jefferson's virtual time (and Time Warp), which comes from a distributed simulation context. Virtualizing time to manage the coherence of reads and writes is a very good idea.
Yes, though IIRC they had to modify the long-standing IA memory ordering rules in order to make that happen. Xeon Phi requires manual fence management in most cases, I believe.
GPUs aren't cache-coherent yet. That said, they have fantastic atomic ops performance relative to CPUs. I'd guess that the lack of performance benefit so far is because it's possible to write many algorithms with the assumption that there is no cache coherency.
Their atomic operations used to be extremely costly from a utilization perspective. They would shut down all other threads in a warp while the thread performing the atomic ran alone. Is that still the case?
Fixed as of Maxwell to the best of my knowledge. But even back then, I found them more efficient for reduction operations in global memory than any other method (using fixed-point math for places where a deterministic sum was required).
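Not the GPU kernel itself, but a minimal CPU-side sketch of the fixed-point trick mentioned above (the SCALE constant and function names are made up for illustration): because integer addition is associative, atomic adds of pre-scaled integers give the same total regardless of the order in which threads commit them, which floating-point accumulation doesn't guarantee.

```c
/* Sketch of fixed-point accumulation for a deterministic sum (illustration only;
 * SCALE and the function names are hypothetical, not from the parent's code). */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define SCALE 1000000LL                     /* fixed-point scale: 6 fractional digits */

static _Atomic int64_t total_fx;            /* shared fixed-point accumulator */

/* Quantize once, then add atomically. Integer addition is associative, so the
 * final total is the same no matter which order concurrent adds land in. */
static void accumulate(double x)
{
    atomic_fetch_add_explicit(&total_fx, (int64_t)(x * SCALE), memory_order_relaxed);
}

int main(void)
{
    accumulate(0.125);
    accumulate(2.5);
    printf("sum = %f\n", (double)atomic_load(&total_fx) / SCALE);   /* 2.625000 */
    return 0;
}
```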
I'm not really a GPU person, but my understanding is that GPUs have very simple memory models; memory reads and writes are scheduled more manually. This is part of their appeal and why they can run so fast, but also why they aren't as general as CPUs.
But my knowledge is a bit out of date, since GPUs are actually including caches now, and what they can do seems to evolve rapidly.
Yeah, the traditional GPU usage is a streaming thing. Textures are read only, vertex and fragment data is either output-only or input-only depending on shader type. The only R/W memory requiring synchronization is the framebuffer, which is handled by specialized hardware anyway.
This seems kind of hyperbolic to me (not unlike most MIT news releases). There have been plenty of new cache coherence mechanisms in the last 30 years. This may be the greatest departure from the classic MOESI and friends, but it's certainly not like the research community has been sitting on their hands all this time.
I implemented a distributed message system/API a long time (10+ years) ago on SMP, AMP, and x86 CPUs that was completely lock-free and non-blocking. The APIs/system ran in both userspace and Linux kernel space.
One thing the APIs depended on was atomic-add. I tried to reach 10 million msgs/second between processes/threads within an SMP CPU group at the time. At 10 million msgs/s, the APIs had 100 ns to route and distribute each message. The main issue was uncached memory access latency, especially for the atomic-add variables. The uncached memory latency was 50+ ns on DDR2 when I measured it on a 1.2 GHz Xeon. It was hard to get that performance.
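For context, the kind of atomic-add dependency being described looks roughly like this in today's C (a minimal sketch; the ring layout and names are mine, not the original API): one fetch-and-add claims a slot per message, and that single contended add is where most of the ~100 ns budget goes.

```c
/* Minimal sketch of lock-free message slot claiming via atomic add
 * (illustration only; RING_SLOTS, msg_ring, next_seq are hypothetical names). */
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 4096                          /* power of two so the mask below works */

struct msg { uint64_t payload; };

static struct msg msg_ring[RING_SLOTS];          /* shared ring buffer */
static _Atomic uint64_t next_seq;                /* the hot atomic-add variable */

/* Each producer claims a unique slot with a single fetch-and-add; on a
 * contended (or uncached) line, this is the expensive part of the fast path. */
static struct msg *claim_slot(void)
{
    uint64_t seq = atomic_fetch_add_explicit(&next_seq, 1, memory_order_relaxed);
    return &msg_ring[seq & (RING_SLOTS - 1)];
}

int main(void)
{
    claim_slot()->payload = 42;                  /* producer side of the hand-off */
    return 0;
}
```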
I even considered adding an FPGA on PCI/PCIe that could be mmap'd to a physical/virtual address and would auto-increment on every read access, to get a very high performance atomic_add.
If that same FPGA were mapped into 128, 256, or 1024 cores, one could easily build a very high speed distributed synchronized message system, hopefully reaching 10+ million msgs/second across 1024 cores.
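The software side of that idea could look something like the sketch below; everything here is hypothetical (the device node, the mapping size, and an FPGA register that increments itself on every read are all assumptions), but it shows why a single mmap'd read would behave like a hardware fetch-and-add shared by every core that maps it.

```c
/* Hypothetical: map an FPGA BAR exposed via a UIO-style device node and read
 * a register that auto-increments on each read, giving a hardware fetch-and-add
 * shared by every core that maps it. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/uio0", O_RDWR);          /* assumed device node for the FPGA */
    if (fd < 0) { perror("open"); return 1; }

    volatile uint64_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    /* Each read returns the next value; the increment happens in hardware,
     * so no locks or cache-line ping-pong between the cores sharing it. */
    uint64_t ticket = regs[0];
    printf("ticket = %llu\n", (unsigned long long)ticket);

    munmap((void *)regs, 4096);
    close(fd);
    return 0;
}
```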
We're having enough trouble utilizing just a few cores and dealing with parallelism and concurrency; I'm not sure how a hardware improvement is going to help.
The generalized popular programming industry: largely web, desktop, or mobile high-level programming. Not an entirely useful generalization, but one common enough to habitually pick up on when "we" is used in such a context.
N00b question: with more cores, wouldn't there also be more overhead for "managing" process execution? There are probably some specific applications where it makes sense to have more cores, but does the everyday mobile user benefit from having thousands of them?
Depends on the application. Amdahl's law comes into play: if there's a non-parallelizable component to what you're doing, it tends to dominate execution time. Managing lots of threads isn't really a big deal for modern operating systems, but communication between threads can be a big bottleneck.
Having lots of cores isn't likely to matter much for mobile users, simply because most mobile apps are neither optimized for parallelism nor CPU-hungry in the first place. If 4 cores are good enough, it doesn't matter if you add hundreds more. That said, there may be specific applications that benefit, such as computer vision.
Basically the workload that an average person is going to have: a bunch of processes running simultaneously with varying degrees of CPU and memory intensiveness.
In special cases you can get better performance from more, less beefy cores (graphics is a prime example), but in general a few powerful cores perform better. The main reason is that communication among cores is hard and inefficient, so only embarrassingly parallel programs work well when divided among many cores. Plus, the speedup you get from parallelizing a program is modest in most cases; see Amdahl's law[1] for more on this topic.
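For reference, the usual statement of Amdahl's law, with p the fraction of the work that parallelizes and N the number of cores:

```latex
% Amdahl's law: speedup on N cores when a fraction p of the work parallelizes
\[
  S(N) \;=\; \frac{1}{(1-p) + \dfrac{p}{N}},
  \qquad
  \lim_{N \to \infty} S(N) \;=\; \frac{1}{1-p}
\]
% e.g. p = 0.95 caps the speedup at 20x, no matter how many cores you add.
```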
Also, I'm not an expert in this area, but I have some familiarity with it. So hopefully someone with a bit more experience can come and confirm (or refute) what I've written.
I remember watching a talk on the C++ memory model, memory ordering, and atomics where it was claimed that the CPU industry was moving toward sequential consistency because cache-coherency protocols aren't expected to be a practical bottleneck in the near future.
This is good news for scaling caches along with cores, though, given how much die space is actually used for cache compared to cores.
Sounds like a nature-inspired design modeled on the brain: thousands of cores at slower speeds.
Most likely we will have a few speedy main cores for non-parallel code, like mobile ARM big.LITTLE, and then thousands of slower cores for parallel work. Will there be transparent cloud execution, where workloads can migrate from the local CPU out to a massive cloud CPU and back?
Apple's shown that pro and consumer-level workstations and laptops have reached the maturity level where integration matters much more than optimization of components. Is it about time for integration between hardware designers and language designers? Languages that have message passing semantics could benefit from a particular hardware architecture that enables that. Functional languages with persistent collections might benefit from specific hardware designs.
I suspect that an Erlang-like language running on hardware specifically designed to support it could achieve tremendous scalability.