> why couldn't Intel and AMD extend x86-64 more fully with SIMD / MIMD instructions
I think it comes down to the latency vs bandwidth trade-off: a CPU is designed for low latency and a GPU for high bandwidth (throughput), and you can't optimize for both on a single chip.
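To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. It assumes a very simple cost model (fixed startup latency plus streaming time), and the figures are made up for illustration; they are not measurements of any real CPU or GPU. The names `transfer_time`, `crossover_bytes`, `CPU_LIKE` and `GPU_LIKE` are just mine.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Toy cost model: fixed startup latency plus time to stream the data."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def crossover_bytes(lat_a, bw_a, lat_b, bw_b):
    """Working-set size at which device B (higher latency, higher bandwidth)
    catches up with device A (lower latency, lower bandwidth)."""
    # Solve lat_a + n / bw_a == lat_b + n / bw_b for n.
    return (lat_b - lat_a) / (1.0 / bw_a - 1.0 / bw_b)


# Hypothetical, illustrative parameters only.
CPU_LIKE = dict(latency_s=100e-9, bandwidth_bytes_per_s=50e9)
GPU_LIKE = dict(latency_s=1e-6, bandwidth_bytes_per_s=500e9)

if __name__ == "__main__":
    for n in (64, 10**6):
        cpu = transfer_time(n, **CPU_LIKE)
        gpu = transfer_time(n, **GPU_LIKE)
        print(f"{n} bytes: cpu-like {cpu:.3e} s, gpu-like {gpu:.3e} s")
    print("crossover (bytes):", crossover_bytes(100e-9, 50e9, 1e-6, 500e9))
```

With these made-up parameters the low-latency design wins on small accesses and the high-bandwidth design wins once the working set passes a few tens of kilobytes, which is the sense in which one design point can't be best at both.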
I guess this is fundamentally a homogeneous vs heterogeneous ISA question. I.e. is your ISA intended to operate one chip, or multiple cooperative chips / complexes?
Does it make sense for a hardware ISA to express cooperation between chips? I would think a HW ISA is meant to control its local microarchitecture. I could see a virtual ISA or compiler IR built with a multi-chip view.