I feel that single-threaded processing power stalled at two major events in history:
* The arrival of video cards around 1997 (focus shifted from general computation to digital signal processing)
* The arrival of the iPhone around 2007 (focus shifted from performance to power consumption)
I'd vote to undo these setbacks by moving to local data processing, where a large number of cores each hold 1/N of the total memory, shared across M memory buses. Memory controllers would handle shuffling data to where it's needed so that the memory appears as one contiguous address space to any process.
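As a toy picture of the translation such a memory controller might perform, here is a minimal sketch in Go, assuming simple round-robin cache-line interleaving across banks (all names and sizes here are mine, purely illustrative):

```go
package main

import "fmt"

// Hypothetical parameters: N banks of local memory, striped in 64-byte
// lines so software sees one flat, contiguous address space while
// consecutive lines land on different buses.
const (
	numBanks = 256
	lineSize = 64
)

// bankFor maps a flat address to the bank that owns it and the offset
// within that bank's local memory.
func bankFor(addr uint64) (bank, offset uint64) {
	line := addr / lineSize
	bank = line % numBanks
	offset = (line/numBanks)*lineSize + addr%lineSize
	return bank, offset
}

func main() {
	for _, a := range []uint64{0, 64, 128, 64 * 256} {
		b, off := bankFor(a)
		fmt.Printf("addr %6d -> bank %3d, offset %d\n", a, b, off)
	}
}
```

Consecutive lines hit different banks, which is how striping turns many slow local buses into one wide aggregate pipe.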
In other words, this would look identical to the desktop CPUs we have today, just with a large number of cores (over 256) and memory bandwidth hundreds or thousands of times higher than what we have now, if it uses content-addressable memory with copy-on-write internally. The speed difference is like comparing BitTorrent to FTP, and it's the same reason GPUs run orders of magnitude faster than CPUs (unfortunately limited to their narrow use cases).
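One way to read "content-addressable memory with copy-on-write internally" is as a deduplicating block store: identical content is stored once and shared, and mutation produces a new block instead of touching a shared one. A toy sketch of that idea in Go (every name here is hypothetical, not a real memory-controller API):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// A toy content-addressable store: blocks are keyed by their hash, so
// identical data is stored once and shared. Writes never mutate a
// shared block in place; they produce a new block (copy-on-write).
type Store struct {
	blocks map[[32]byte][]byte
}

func NewStore() *Store { return &Store{blocks: make(map[[32]byte][]byte)} }

// Put returns the content address of data, storing it only if unseen.
func (s *Store) Put(data []byte) [32]byte {
	key := sha256.Sum256(data)
	if _, ok := s.blocks[key]; !ok {
		s.blocks[key] = append([]byte(nil), data...)
	}
	return key
}

// Write copies an existing block, applies a change, and stores the
// result under its new address; the original stays intact for sharers.
func (s *Store) Write(key [32]byte, off int, b byte) [32]byte {
	copied := append([]byte(nil), s.blocks[key]...)
	copied[off] = b
	return s.Put(copied)
}

func main() {
	s := NewStore()
	a := s.Put([]byte("hello"))
	b := s.Put([]byte("hello")) // deduplicated: same address as a
	c := s.Write(a, 0, 'j')     // COW: new block "jello", new address
	fmt.Println(a == b, a == c, len(s.blocks)) // true false 2
}
```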
This would let us get back to traditional programming in the language of our choice (perhaps something like Erlang, Go or Octave/MATLAB) rather than shaders.
Apple appears to be trying to do this with the M1, using ideas loosely borrowed from transputers. But since their platform is proprietary, they won't approach anything close to the general computing power their transistor count could deliver for at least a decade, maybe never.
So there's an opportunity here for someone to reintroduce multicore CPUs and scalable transputers composed of them. Then we could write whatever OpenGL/Vulkan/Metal/TensorFlow libraries we wanted over that, since they are trivial with the right architecture.
This would also allow us to drop the async and parallel keywords from our languages and just use higher-order methods that are self-parallelizing. Processing big data would "just work", since Amdahl's law only bounds the serial fraction of a workload, and these operations have almost none.
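A minimal sketch of what such a self-parallelizing higher-order method could look like, in Go (one of the languages named above); ParallelMap is my name for it, not an existing API:

```go
package main

import (
	"fmt"
	"sync"
)

// ParallelMap is a self-parallelizing higher-order method: callers pass
// a pure function, and the fan-out happens internally with no async or
// parallel annotations at the call site.
func ParallelMap[T, U any](in []T, f func(T) U) []U {
	out := make([]U, len(in))
	var wg sync.WaitGroup
	for i, v := range in {
		wg.Add(1)
		go func(i int, v T) {
			defer wg.Done()
			out[i] = f(v) // each element computed independently
		}(i, v)
	}
	wg.Wait()
	return out
}

func main() {
	squares := ParallelMap([]int{1, 2, 3, 4}, func(x int) int { return x * x })
	fmt.Println(squares) // [1 4 9 16]
}
```

The call site stays ordinary sequential code; the parallelism (naively one goroutine per element here, where a real version would chunk the work) lives entirely inside the library.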
The advantages are so numerous that I struggle to understand why things would stay the way they are, other than the Intel/Nvidia hegemony. And I've felt this way since 1997, back when people thought I was crazy for projecting to the endgame as with any other engineering challenge.
> I'd vote to undo these setbacks by moving to local data processing, where a large number of cores each hold 1/N of the total memory, shared across M memory buses. Memory controllers would handle shuffling data to where it's needed so that the memory appears as one contiguous address space to any process.
Cheap RAM is DDR. Fast RAM would be on-die, but that would be very expensive, or maybe now on-package (though that still needs some tech to be developed). But apart from decoupling access latencies, I don't really see the point of having N buses (from each core to its local memory), especially if you need a very large number of cores. More memory channels seem good enough. The bandwidth is already hard to saturate on a well-designed SoC like the M1 Pro and above; improvements to latency would probably yield more benefit than trying to increase bandwidth further.
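For a rough sense of scale (my numbers, not the parent's): a single 64-bit DDR5-6400 channel moves 6400 MT/s × 8 bytes ≈ 51.2 GB/s, so a typical dual-channel desktop tops out near 100 GB/s, while the M1 Pro's much wider LPDDR5 interface is rated at roughly 200 GB/s.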
> In other words, this would look identical to the desktop CPUs we have today, just with a large number of cores (over 256) and memory bandwidth hundreds or thousands of times higher than what we have now, if it uses content-addressable memory with copy-on-write internally. The speed difference is like comparing BitTorrent to FTP, and it's the same reason GPUs run orders of magnitude faster than CPUs (unfortunately limited to their narrow use cases).
"content-addressable memory with copy-on-write internally" are you describing what caches already kind of do, in a way (esp. if I mix that with: "memory appears as 1 contiguous address space to any process")? The good news would then be: we already have them :)
What remains, if I fully understand what you mean, seems to be: more cores. The other good news here is that it is in progress. Six years ago you would have gotten 6 to 8 cores on an enthusiast platform; today you would probably choose 12 to 16 on a basic one (and even more on a modern enthusiast one).
There has been a pause in recent years, but it was basically Intel having process difficulties and being caught up by the rest of the industry, including some players with power consumption also in mind. And given what a high-performance CPU dissipates today, power consumption has become key to unlocking raw performance anyway.
The shift to a focus on power consumption was already happening without the iPhone anyway, even on the desktop: CPUs were already getting into nuclear-reactor territory in how much heat they produce per unit area.