Do folks have practical examples that would make development of alternative computational models worth it?
The GPU is the only recent example where I've seen people really willing to rethink how they work to get (very large) performance gains. But I'm not seeing that here. For the examples I can think of, it seems like it would be easier to install more servers with modest amounts of RAM, rather than having a smaller number of servers with lots more RAM and in-memory processing.
Of course, maybe I'm just not thinking of the right examples.
> Do folks have practical examples that would make development of alternative computational models worth it?
This is the question to ask people when they say "the von Neumann architecture is a bottleneck".
That being said, I think a strong contender is the dataflow computing model. The GPU already heavily punishes control flow, and workloads are increasingly being shaped to fit those constraints, hence the possibility of a real-world success for dataflow machines.
Ok, I have seen dataflow as a model for writing programs. But such languages still have conditional constructs -- and I always assumed that once it gets down to the metal the cost of a conditional is going to be much the same.
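Not necessarily, at the metal. Here's a rough sketch (plain C++, nothing GPU-specific, hypothetical function names) of the difference: a conditional can often be turned into a select on data rather than a branch, and compilers typically lower the second form to a conditional move or predicated instruction. That is essentially the dataflow way of handling a conditional, and it's why "has an if in the source" doesn't have to mean "pays for a branch":

```cpp
#include <cstddef>

// Branchy version: each element takes a data-dependent branch. On a GPU
// (or with an unpredictable condition on a CPU), divergence/mispredicts
// make this expensive.
void clamp_branchy(float* v, std::size_t n, float limit) {
    for (std::size_t i = 0; i < n; ++i) {
        if (v[i] > limit) {
            v[i] = limit;
        }
    }
}

// "Dataflow-style" version: the conditional becomes a select on values.
// Compilers typically lower this to a conditional move or predicated
// instruction, so the condition only picks a result -- no control-flow
// divergence, and the loop vectorizes more easily.
void clamp_select(float* v, std::size_t n, float limit) {
    for (std::size_t i = 0; i < n; ++i) {
        float x = v[i];
        v[i] = (x > limit) ? limit : x;   // usually a select, not a branch
    }
}
```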
"The march toward specialized systems" is an interesting slogan (if you will), since that's exactly where we came from; we (as in: the industry) made a huge point out of doing as much as possible using general-purpose components (remember GPGPU?) and more or less open standards. It is of course obvious that the general purpose approach has inherent inefficiencies, but we gladly paid the price. I see some similarities here with the recent rise in popularity of lower level programming languages (better C++, Rust, even Go), after the move to VM-based highest-level languages (Python, Ruby, JavaScript, countless others, perhaps even Java).
We see the gains in productivity and the reduction in cost, at least by some measures, but it comes with inherent inefficiencies: simple applications using far more CPU and memory than they ought to.
This perhaps makes people think again about what would be possible if all that abstraction (analogous to the general-purposeness of hardware) were wiped away, and about what was possible in the past using far fewer resources.
The trend of finally adopting AOT on OpenJDK and .NET (NGEN is just for faster startup) kind of proves the point that they should have offered AOT since v1.0 instead of leaving it to third parties.
Another example is the ongoing effort to improve their support for value types.
A 20-year delay to catch up with what Common Lisp, Eiffel, Modula-3 and the Oberon variants already offered in those days.
Well AOT even in OpenJDK is not necessarily about resource savings, just startup time. Native code is much larger than the equivalent bytecode, so compiling lots of cold code that's rarely used AOT to native can make things more bloated and slower, rather than tighter and faster.
I suspect we'll see the same thing for value types: it's not going to be quite the easy win it seems. Even in C++, it's easy to create accidental footprint explosions with templates and lose performance to excessive copying with value types. And C++ allows mutable values, which are out of fashion now so Java won't allow them ...
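To make the "excessive copying" point concrete, here's a minimal C++ sketch (the Block type is made up purely for illustration): passing a large value type by value copies the whole thing on every call, which is exactly the kind of footprint and bandwidth cost that an identity-based, by-reference design hides from you.

```cpp
#include <array>
#include <numeric>

// A largish value type: 4 KB, copied in full on every by-value pass.
struct Block {
    std::array<double, 512> samples{};
};

// Pass by value: the whole 4 KB Block is copied into the callee,
// even though it is only read.
double sum_by_value(Block b) {
    return std::accumulate(b.samples.begin(), b.samples.end(), 0.0);
}

// Pass by const reference: no copy, same observable behavior for a
// read-only use.
double sum_by_ref(const Block& b) {
    return std::accumulate(b.samples.begin(), b.samples.end(), 0.0);
}
```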
I was curious if the article would mention memory chips that knew how to do bulk memmoves by themselves. Does anyone know if memory subsystems can already do that? If you issue a memmove() for, say, 3kb of memory, does the CPU still have to read it all into the cache and then immediately write it out again, or is there some way the CPU can signal to the DRAM chips that they should do the copy themselves? Fast copying of memory would be useful for GCs.
30 years ago we had blitter chips dedicated to moving memory around without using the CPU; I'm pretty sure most memory controllers have something similar these days.
memcpy is done by the CPU core, at least for x86. The IMCs don't process data.
The CPU core has more bandwidth than the IMC anyway, so there would be no speed-up from adding this complexity to the IMC (it would not only need to perform the operation, but it would also need a way to maintain cache coherence and communicate with the issuing CPU, none of which is a problem if you just do it in the core). It might not even save power.
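For intuition, this is roughly what a bulk copy looks like from the core's point of view. It's a simplified sketch: real memcpy/memmove implementations use wide vector loads/stores and sometimes non-temporal stores, but the bytes still flow through the core rather than being moved by the memory controller on its own.

```cpp
#include <cstddef>
#include <cstdint>

// What a bulk copy boils down to on the CPU side: the core issues loads
// and stores, so every byte travels DRAM -> cache/registers -> DRAM.
void copy_bytes(std::uint8_t* dst, const std::uint8_t* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];   // load into a register, then store it back out
    }
}
```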
I must admit to not being an expert in these matters, and I therefore defer to this comment by JoshTriplett, where he talks about ECC as being inferior to FEC:
OK, so I read that, too. I'm still fuzzy on what is being asked for. Extrapolating based on my 20 years of experience as a CPU logic designer, and graduate work in error correcting codes, I'm guessing that what the author seems to want is to have uncorrectable memory read errors passed to user space as an exception that the application can handle as it sees fit. (I wouldn't call that FEC, but, /shrug ...)
The OS gets the exception. There are always (usually privileged) instructions to read and set the memory check bits. It would be darn hard to write memory diagnostics without them, or sometimes even to boot a machine that powers up with random bits in the memory.
If I understand what the author wants, (and I have my doubts that the author understands what they want), then they are asking for an OS feature, not a CPU feature, and simply want the exception to bubble up to user space.
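For what it's worth, Linux already does something close to this: an uncorrectable (machine-check) memory error can be delivered to the process as SIGBUS with si_code BUS_MCEERR_AR (or BUS_MCEERR_AO for errors found asynchronously), and the application can handle it however it likes. A minimal sketch, Linux-specific and assuming glibc exposes the BUS_MCEERR_* constants under _GNU_SOURCE:

```cpp
// Linux-specific sketch: catch an uncorrectable memory error in user space.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for BUS_MCEERR_AR / BUS_MCEERR_AO on glibc
#endif
#include <signal.h>
#include <unistd.h>

static void mem_error_handler(int /*sig*/, siginfo_t* info, void* /*ctx*/) {
    if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
        // The kernel is reporting an uncorrectable error at info->si_addr.
        // A real application might discard and rebuild the affected page or
        // object here instead of exiting.
        const char msg[] = "uncorrectable memory error\n";
        write(STDERR_FILENO, msg, sizeof msg - 1);  // async-signal-safe
    }
    _exit(1);
}

int main() {
    struct sigaction sa {};
    sa.sa_sigaction = mem_error_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, nullptr);
    // ... run the application; a reported memory error now reaches this
    // handler instead of simply killing the process.
    return 0;
}
```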