SIMD works great for doing the same thing to multiple pieces of data, but it doesn't do the scaling up that I described.
I'm no chip engineer, so maybe what I'm envisioning isn't possible. In essence, instead of making 4x 64-bit cores, you make 128x 2-bit cores plus some architecture on the die that selects groups of cores to build a processor of the required size, executes some instructions with that processor, and then disassembles it back into a pool of resources.
So SIMD might be able to calculate two 16-bit sums on a 32-bit processor in one cycle, but the hypothetical CPU I'm describing would be able to calculate a single 128-bit sum and eight 16-bit sums in one cycle, at the same time.
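To make the idea concrete, here's a rough Python sketch of the grouping logic only. It assumes each "core" is just a 2-bit adder slice, and that grouping means letting the carry ripple between slices inside a group and cutting it at group boundaries. The names (`slice_add`, `grouped_add`, `pool_size`) are made up for illustration, and it says nothing about whether real hardware could finish a 64-slice carry chain in one cycle.

```python
SLICE_BITS = 2
SLICE_MASK = (1 << SLICE_BITS) - 1


def slice_add(a, b, carry_in):
    """One hypothetical 2-bit adder slice: returns (2-bit sum, carry out)."""
    total = a + b + carry_in
    return total & SLICE_MASK, total >> SLICE_BITS


def to_slices(value, n_slices):
    """Split an integer into n_slices 2-bit chunks, least significant first."""
    return [(value >> (SLICE_BITS * i)) & SLICE_MASK for i in range(n_slices)]


def from_slices(slices):
    """Reassemble 2-bit chunks (least significant first) into an integer."""
    value = 0
    for i, s in enumerate(slices):
        value |= s << (SLICE_BITS * i)
    return value


def grouped_add(pairs, widths, pool_size=128):
    """Add each (a, b) pair on its own group of slices drawn from one pool.

    widths[i] is the bit width of the i-th temporary adder. Carries ripple
    between slices inside a group and are dropped at the group boundary,
    so each result wraps modulo 2**width.
    """
    if any(w % SLICE_BITS for w in widths):
        raise ValueError("each width must be a multiple of the slice size")
    if sum(w // SLICE_BITS for w in widths) > pool_size:
        raise ValueError("not enough slices in the pool")

    results = []
    for (a, b), width in zip(pairs, widths):
        n = width // SLICE_BITS
        carry = 0  # no carry enters a group from its neighbour
        out = []
        for sa, sb in zip(to_slices(a, n), to_slices(b, n)):
            s, carry = slice_add(sa, sb, carry)
            out.append(s)
        results.append(from_slices(out))
    return results


# One 128-bit sum plus eight 16-bit sums: 64 + 8 * 8 = 128 slices,
# which uses the whole pool exactly.
widths = [128] + [16] * 8
pairs = [(2**127 + 5, 2**127 + 7), (65535, 1)] + [(i * 1000, i * 37) for i in range(1, 8)]
sums = grouped_add(pairs, widths)
```

In this model, ordinary SIMD is the special case where every group has the same fixed width; the hypothetical CPU is the case where `widths` can be any mix that fits in the pool.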
What you're describing is basically a modern FPGA[1]. You can wire it up as you want at runtime, and it can contain specialized blocks like hardware multipliers and fast local memory to accelerate certain workloads.