Do folks have practical examples that would make development of alternative computational models worth it?
The GPU is the only recent example where I've seen people really willing to rethink how they work to get (very large) performance gains. But I'm not seeing that here. For the examples I can think of, it seems like it would be easier to install more servers with modest amounts of RAM, rather than having a smaller number of servers with lots more RAM and in-memory processing.
Of course, maybe I'm just not thinking of the right examples.
> Do folks have practical examples that would make development of alternative computational models worth it?
This is the question to ask people when they say "the von Neumann architecture is a bottleneck".
That being said, I think a strong contender is the dataflow computing model. The GPU already heavily punishes control flow, and workloads are increasingly being shaped to fit those constraints, hence the possibility of a real-world success for dataflow machines.
Ok, I have seen dataflow as a model for writing programs. But such languages still have conditional constructs -- and I always assumed that once it gets down to the metal the cost of a conditional is going to be much the same.
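Not necessarily, at the metal. Here's a rough sketch (plain C++, nothing GPU-specific, hypothetical function names) of the difference: a conditional can often be turned into a select on data rather than a branch, and compilers typically lower the second form to a conditional move or predicated instruction. That is essentially the dataflow way of handling a conditional, and it's why "has an if in the source" doesn't have to mean "pays for a branch":

```cpp
#include <cstddef>

// Branchy version: each element takes a data-dependent branch. On a GPU
// (or with an unpredictable condition on a CPU), divergence/mispredicts
// make this expensive.
void clamp_branchy(float* v, std::size_t n, float limit) {
    for (std::size_t i = 0; i < n; ++i) {
        if (v[i] > limit) {
            v[i] = limit;
        }
    }
}

// "Dataflow-style" version: the conditional becomes a select on values.
// Compilers typically lower this to a conditional move or predicated
// instruction, so the condition only picks a result -- no control-flow
// divergence, and the loop vectorizes more easily.
void clamp_select(float* v, std::size_t n, float limit) {
    for (std::size_t i = 0; i < n; ++i) {
        float x = v[i];
        v[i] = (x > limit) ? limit : x;   // usually a select, not a branch
    }
}
```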
"The march toward specialized systems" is an interesting slogan (if you will), since that's exactly where we came from; we (as in: the industry) made a huge point out of doing as much as possible using general-purpose components (remember GPGPU?) and more or less open standards. It is of course obvious that the general purpose approach has inherent inefficiencies, but we gladly paid the price. I see some similarities here with the recent rise in popularity of lower level programming languages (better C++, Rust, even Go), after the move to VM-based highest-level languages (Python, Ruby, JavaScript, countless others, perhaps even Java).
We see the gains in productivity and the reduction in cost, at least by some measures, but it comes with inherent inefficiencies: simple applications using far more CPU and memory than they ought to.
This perhaps makes people think again about what would be possible if all that abstraction (analogous to the general-purposeness of hardware) were wiped away, and about what was possible in the past using far fewer resources.
The trend of finally adopting AOT on OpenJDK and .NET (NGEN is just for faster startup) kind of proves the point that they should have offered AOT since v1.0 instead of leaving it to third parties.
Another example is the ongoing effort to improve their support for value types.
A 20-year delay to catch up with what Common Lisp, Eiffel, Modula-3 and the Oberon variants already offered in those days.
Well AOT even in OpenJDK is not necessarily about resource savings, just startup time. Native code is much larger than the equivalent bytecode, so compiling lots of cold code that's rarely used AOT to native can make things more bloated and slower, rather than tighter and faster.
I suspect we'll see the same thing for value types: it's not going to be quite the easy win it seems. Even in C++, it's easy to create accidental footprint explosions with templates and lose performance to excessive copying with value types. And C++ allows mutable values, which are out of fashion now so Java won't allow them ...
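To make the "excessive copying" point concrete, here's a minimal C++ sketch (the Block type is made up purely for illustration): passing a large value type by value copies the whole thing on every call, which is exactly the kind of footprint and bandwidth cost that an identity-based, by-reference design hides from you.

```cpp
#include <array>
#include <numeric>

// A largish value type: 4 KB, copied in full on every by-value pass.
struct Block {
    std::array<double, 512> samples{};
};

// Pass by value: the whole 4 KB Block is copied into the callee,
// even though it is only read.
double sum_by_value(Block b) {
    return std::accumulate(b.samples.begin(), b.samples.end(), 0.0);
}

// Pass by const reference: no copy, same observable behavior for a
// read-only use.
double sum_by_ref(const Block& b) {
    return std::accumulate(b.samples.begin(), b.samples.end(), 0.0);
}
```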
I was curious if the article would mention memory chips that knew how to do bulk memmoves by themselves. Does anyone know if memory subsystems can already do that? If you issue a memmove() for, say, 3kb of memory, does the CPU still have to read it all into the cache and then immediately write it out again, or is there some way the CPU can signal to the DRAM chips that they should do the copy themselves? Fast copying of memory would be useful for GCs.
30 years ago we had blitter chips dedicated to moving memory around without using the CPU; I'm pretty sure most memory controllers have something similar these days.
memcpy is done by the CPU core, at least for x86. The IMCs don't process data.
The CPU core has more bandwidth than the IMC anyway, so there would be no speed-up from adding this complexity to the IMC (it would not only need to perform the operation, but it would also need a way to maintain cache coherence and communicate with the issuing CPU, none of which is a problem if you just do it in the core). It might not even save power.
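For intuition, this is roughly what a bulk copy looks like from the core's point of view. It's a simplified sketch: real memcpy/memmove implementations use wide vector loads/stores and sometimes non-temporal stores, but the bytes still flow through the core rather than being moved by the memory controller on its own.

```cpp
#include <cstddef>
#include <cstdint>

// What a bulk copy boils down to on the CPU side: the core issues loads
// and stores, so every byte travels DRAM -> cache/registers -> DRAM.
void copy_bytes(std::uint8_t* dst, const std::uint8_t* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];   // load into a register, then store it back out
    }
}
```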
I must admit to not being an expert in these matters, and I therefore defer to this comment by JoshTriplett, where he talks about ECC as being inferior to FEC:
OK, so I read that, too. I'm still fuzzy on what is being asked for. Extrapolating based on my 20 years of experience as a CPU logic designer, and graduate work in error correcting codes, I'm guessing that what the author seems to want is to have uncorrectable memory read errors passed to user space as an exception that the application can handle as it sees fit. (I wouldn't call that FEC, but, /shrug ...)
The OS gets the exception. There are always (usually privileged) instructions to read and set the memory check bits. It would be darn hard to write memory diagnostics without them, or sometimes even to boot a machine that powers up with random bits in the memory.
If I understand what the author wants, (and I have my doubts that the author understands what they want), then they are asking for an OS feature, not a CPU feature, and simply want the exception to bubble up to user space.
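For what it's worth, Linux already does something close to this: an uncorrectable (machine-check) memory error can be delivered to the process as SIGBUS with si_code BUS_MCEERR_AR (or BUS_MCEERR_AO for errors found asynchronously), and the application can handle it however it likes. A minimal sketch, Linux-specific and assuming glibc exposes the BUS_MCEERR_* constants under _GNU_SOURCE:

```cpp
// Linux-specific sketch: catch an uncorrectable memory error in user space.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for BUS_MCEERR_AR / BUS_MCEERR_AO on glibc
#endif
#include <signal.h>
#include <unistd.h>

static void mem_error_handler(int /*sig*/, siginfo_t* info, void* /*ctx*/) {
    if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
        // The kernel is reporting an uncorrectable error at info->si_addr.
        // A real application might discard and rebuild the affected page or
        // object here instead of exiting.
        const char msg[] = "uncorrectable memory error\n";
        write(STDERR_FILENO, msg, sizeof msg - 1);  // async-signal-safe
    }
    _exit(1);
}

int main() {
    struct sigaction sa {};
    sa.sa_sigaction = mem_error_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, nullptr);
    // ... run the application; a reported memory error now reaches this
    // handler instead of simply killing the process.
    return 0;
}
```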