Well, the article is basically just wrong on that. There's a single memory controller, which means the CPU & GPU compete for bandwidth. It's not an exclusive lock, no, but you will see a very sharp decline in memcpy performance if you're also hammering the GPU with blit commands.
Which is the same as basically all integrated GPUs on all modern (or even semi-modern) CPUs. You don't usually run heavy CPU and GPU memory-bandwidth workloads simultaneously, so it's mostly fine in practice, but there is real resource contention there.
Is this from real-world observation, or just from looking at the diagrams? Because by all accounts, these chips don't seem to have any issues driving the UI or encoding video, which strongly implies that memory contention between the GPU and CPU is not a significant issue.
If you have actual data that backs up your position, I would love to see it.
As I said, in most scenarios this doesn't matter, so I'm not sure why you're pointing at an average scenario as some sort of counter-argument?
There's a single memory controller on the M1. The CPU, GPU, Neural Engine, etc. all share that single controller (that's how "unified memory" is achieved). Given that the theoretical maximum throughput of the M1's memory controller is 68GB/s, and that the CPU alone can hit around 68GB/s in a memcpy, I'm not sure what you're expecting? If you hammer the GPU at the same time, it must by design share that single 68GB/s pipe to memory; there's no secondary dedicated pipe for it to use. So the bandwidth must necessarily be split in a multi-use scenario. There's no other option here.
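For reference, numbers like that come from a plain memcpy microbenchmark. A rough sketch of one (buffer size, iteration count, and the read+write accounting are my own choices here, not anything from the article):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUF_SIZE (1UL << 30)  /* 1 GiB per buffer */
    #define ITERS 10

    int main(void) {
        char *src = malloc(BUF_SIZE);
        char *dst = malloc(BUF_SIZE);
        memset(src, 1, BUF_SIZE);  /* fault the pages in before timing */
        memset(dst, 0, BUF_SIZE);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            memcpy(dst, src, BUF_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        /* each memcpy reads BUF_SIZE bytes and writes BUF_SIZE bytes */
        printf("%.1f GB/s\n", 2.0 * ITERS * BUF_SIZE / secs / 1e9);
        return 0;
    }

(Note the factor of 2: a copy both reads and writes, so whether you count one direction or both changes the headline number.)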
Think of it like a network switch. You can put 5 computers behind a gigabit switch and 99% of the time nobody will ever have an issue. At any given point you can run a speed test on one of them and see a full 1 Gbit/s. But if you run it on all of them simultaneously, of course each one won't see 1 Gbit/s, since there's only a single 1 Gbit/s connection upstream. Same thing here, just with a 68GB/s 8x16-bit channel LPDDR4X memory controller.
The only way you can get full-throughput CPU & GPU memory performance simultaneously is if you physically dedicate memory modules to the CPU & GPU independently (as in a typical desktop PC with discrete graphics). Otherwise they must compete for the shared resource, by definition of it being shared. You can see the splitting effect with CPU threads alone, as in the sketch below.
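A rough way to see that splitting without involving the GPU at all: run the same copy loop as above on several CPU threads at once. Per-thread numbers drop as you add threads, because they're all behind the same controller. (This only demonstrates CPU threads sharing the pipe; a proper CPU-vs-GPU test would need a Metal blit workload on the other side, which I'm not sketching here. Thread count and sizes are arbitrary picks.)

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define NTHREADS 4              /* try 1 vs 4 and compare per-thread GB/s */
    #define BUF_SIZE (256UL << 20)  /* 256 MiB per buffer, per thread */
    #define ITERS 16

    static void *copy_worker(void *arg) {
        long id = (long)arg;
        char *src = malloc(BUF_SIZE);
        char *dst = malloc(BUF_SIZE);
        memset(src, 1, BUF_SIZE);  /* fault pages in before timing */
        memset(dst, 0, BUF_SIZE);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)
            memcpy(dst, src, BUF_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("thread %ld: %.1f GB/s\n",
               id, 2.0 * ITERS * BUF_SIZE / secs / 1e9);
        free(src);
        free(dst);
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, copy_worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }

Build with something like cc -O2 -pthread bench.c and run it with NTHREADS set to 1, then 4; the aggregate stays roughly flat while each thread's share shrinks.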
The point I was originally responding to was, "Unified memory is not a good thing, makes the CPU and GPU fight for access." There is no evidence here that this is a significant issue for the M1.
So even if the memory bandwidth does get split under simultaneous load, there's still no evidence that this is an actual problem in practice.