DRAM stores each bit in a capacitor. With our traditional materials, those capacitors essentially hit a hard limit at around 400MHz a very long time ago. This means that if you need to read random locations from RAM one after another, you can't do it faster than 400MHz. Our only answer here is better AI prefetchers and less-random memory patterns in our software (the penalty for not prefetching is so great that theoretically less efficient algorithms can suddenly become more efficient if they are simply more predictable).
As to capacitor sizes, we've been at the volume limit for quite a while. When the capacitor is discharged, we must amplify the charge. That gets harder as the charge gets weaker and there's a fundamental limit to how small you can go. Right now, each capacitor has somewhere in the range of a mere 40,000 electrons holding the charge. Going lower dramatically increases the complexity of trying to tell the signal from the noise and dealing with ever-increasing quantum effects.
Getting more capacitors closer means a smaller diameter, but keeping the same volume means making the cylinder longer. You quickly reach a point where even dramatic increases in height (something very complicated to do in silicon) give only minuscule decreases in diameter.
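To put rough numbers on that trade-off, here is a small sketch that takes the constant-volume cylinder model above at face value. Real cell geometry is more complicated, and the function name and the unit-sized baseline are made up for illustration. With V = pi * (d/2)^2 * h held fixed, the diameter scales as 1/sqrt(h):

```python
import math

def diameter_for_height(volume, height):
    """Diameter of a cylinder of the given volume and height (V = pi * (d/2)**2 * h)."""
    return 2.0 * math.sqrt(volume / (math.pi * height))

# Hypothetical baseline: a cylinder with height 1 and diameter 1 (arbitrary units).
base_volume = math.pi * 0.5 ** 2 * 1.0

for scale in (1, 2, 4, 8, 16):
    d = diameter_for_height(base_volume, float(scale))
    print(f"height x{scale:<2} -> diameter x{d:.3f}")
```

Under this model each doubling of height narrows the cylinder by only about 29%, and the absolute gain keeps shrinking: going from 8x to 16x the height trims the diameter by roughly a tenth of its original width.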
What does “faster than 400MHz” mean in this context? Does that mean you can’t ask for a unit of memory from it more than 400M times a second? If so, what’s the basic unit there, a bit? A word?
I built a little CPU in undergrad but never got around to building RAM and admit it’s still kind of a black box to me.
Bonus question: When I had an Amiga, we’d buy 50 or 60ns RAM. Any idea what that number meant, or what today’s equivalent would be?
The capacitors take time to charge and discharge. You can't do that faster than around 400MHz with current materials. You are correct that it means you can't access the same bit of memory more than 400M times per second. This is the same whether you are accessing 1 bit or 1M bits, because the individual capacitors that make up those bits can't be charged/discharged any faster.
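One way to see this: each DRAM generation raised its interface speed mostly by fetching more bits per array access (the prefetch depth), not by cycling the cells faster. A rough sketch, assuming the commonly cited transfer rates and prefetch depths for these speed grades (the table and function name are illustrative, not taken from a datasheet):

```python
# (transfer rate in MT/s, prefetch depth in bits per pin per array access).
# Commonly cited values; treat them as assumptions.
GENERATIONS = {
    "SDR-133":   (133, 1),
    "DDR-400":   (400, 2),
    "DDR2-800":  (800, 4),
    "DDR3-1600": (1600, 8),
    "DDR4-3200": (3200, 8),
    "DDR5-6400": (6400, 16),
}

def array_clock_mhz(transfer_rate_mts, prefetch):
    """Approximate internal array clock: interface transfer rate divided by prefetch depth."""
    return transfer_rate_mts / prefetch

for name, (rate, prefetch) in GENERATIONS.items():
    print(f"{name:<10} {rate:>5} MT/s / {prefetch:>2}n prefetch = "
          f"{array_clock_mhz(rate, prefetch):.0f} MHz array clock")
```

With these inputs the array clock stays in the 133-400 MHz range across every generation, even though the pin rate grows by nearly 50x.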
When we moved from SDR to DDR1, latencies dropped from 20-25ns to about 15ns too, but if you run the math, we've been at 13-17ns of latency ever since.
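The math here is just CAS cycles divided by the command clock. A minimal sketch, assuming typical CAS latency settings for each speed grade (the specific CL values below are example timings I picked, and CAS latency is only one component of a full random access):

```python
def cas_latency_ns(transfer_rate_mts, cas_cycles, transfers_per_clock=2):
    """CAS latency in nanoseconds.

    The command clock in MHz is the transfer rate divided by transfers per
    clock (2 for DDR, 1 for SDR); latency is cycles / clock.
    """
    clock_mhz = transfer_rate_mts / transfers_per_clock
    return cas_cycles / clock_mhz * 1000.0

# (name, MT/s, CL, transfers per clock): example timings, assumed typical.
EXAMPLES = [
    ("SDR PC133 CL3",  133,  3,  1),
    ("DDR-400 CL3",    400,  3,  2),
    ("DDR2-800 CL6",   800,  6,  2),
    ("DDR3-1600 CL11", 1600, 11, 2),
    ("DDR4-3200 CL22", 3200, 22, 2),
    ("DDR5-4800 CL40", 4800, 40, 2),
]

for name, rate, cl, tpc in EXAMPLES:
    print(f"{name:<15} {cas_latency_ns(rate, cl, tpc):5.1f} ns")
```

With these example timings, SDR lands a bit above 22ns and everything from DDR onward falls between roughly 13.75ns and 16.7ns.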
If it were even 20% faster than DRAM, there would be a market for it at the higher price. The post I replied to was asserting that there was a physical limit of 400MHz for DRAM entirely due to the capacitor. If SRAM could run with lower latency, memory-bound workloads would get comparably faster.
This is sort of the role that L3 cache plays already. Your proposal would be effectively an upgradable L4 cache. No idea if the economics on that are worth it vs bigger DRAM so you have less pressure on the nvme disk.
Coreboot and some other low-level stuff uses cache-as-RAM during early steps of the boot process.
There was briefly a product called vCage loading a whole secure hypervisor into cache-as-RAM, with a goal of being secure against DRAM-remanence ("cold-boot") attacks where the DIMMs are fast-chilled to slow charge leakage and removed from the target system to dump their contents. Since the whole secure perimeter was on-die in the CPU, it could use memory encryption to treat the DRAM as untrusted.
Yeah, you’re basically betting that people will put a lot of effort into trying to out-optimize the hardware and perhaps to some degree the OS. Not a good bet.
When SMP first came out we had one large customer that wanted to manually handle scheduling themselves. That didn’t last long.
Effort? It's not like it's hard to map an SRAM chip to whatever address you want and expose it raw or as a block device. That's a 100 LOC kernel module.
A 5nm process can hold roughly a gigabyte of SRAM on a CPU-sized die; that's around $130/GB, I believe. At some point 5nm will be cheap enough that we can start considering replacing DRAM with SRAM directly on the chip (aka L4 cache). I wonder how big of a latency and bandwidth bonus that'd be. You could even go for a larger node size without losing much capacity for half the price.
SRAM also requires more power than DRAM. The simple, regular structure of SRAM arrays compared to (other) logic makes it possible to get good yield rates through redundancy and error correction codes, so you could have giant monolithic dies, but information can't exceed the speed of light in a medium. There just isn't enough time for the signals to propagate to get the latency you expect of an L3 cache out of big dies containing gigabytes of SRAM that are (in relative terms) far away. Also, moving all that data around to perform computations without caching would be terribly wasteful given how much energy is needed just to move the data. Instead, you would probably end up with something closer to the computing-memory concept: map computation to ALUs close to the data, with an at least two-tier network (on-die, inter-die) to support reductions.
Oh yeah this would definitely be something like L4 cache rather than L3 like AMD's X3D cpus. The expectation is as an alternative to DRAM (or as a supplement), kind of like what Xeon Phi did.
What’s the likely ETA for DRAM?