
That's very verbose but not super clear.

A better explanation is that the two main problems of processor design are that a memory read can take 100-1000 times as long as an arithmetic operation, and that hardware is faster when operations run in parallel.

CPUs handle those issues by having large memory caches, and lots of circuitry to execute instructions "out of order", i.e. run other instructions that don't depend on the memory read result or the result of other operations. This is great for running sequential code as fast as possible, but quite inefficient overall.
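
To make that concrete, here's a toy single-issue timing model (my own simplification, not a real pipeline): an in-order core can't issue an instruction before the one ahead of it has issued, so a stalled load blocks all the independent work queued behind it, while an out-of-order core issues each instruction as soon as its inputs are ready.

```python
def finish_times(program, out_of_order):
    """program: list of (name, latency_cycles, deps).
    Toy model: issue time is when an instruction starts executing;
    it finishes issue + latency cycles later."""
    done, last_issue = {}, -1
    for name, latency, deps in program:
        ready = max((done[d] for d in deps), default=0)
        # In-order: can't issue before the previous instruction issued.
        issue = ready if out_of_order else max(ready, last_issue + 1)
        done[name] = issue + latency
        last_issue = issue
    return done

program = [("load", 100, [])]                      # memory read: ~100x an ALU op
program += [("use", 1, ["load"])]                  # depends on the load
program += [(f"add{i}", 1, []) for i in range(8)]  # independent arithmetic

print(max(finish_times(program, out_of_order=False).values()))  # 109
print(max(finish_times(program, out_of_order=True).values()))   # 101
```

In-order, the eight independent adds all queue up behind the stalled `use`; out-of-order, they finish during the load's 100-cycle stall, so total time collapses to the critical path.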

GPUs instead handle the memory problem by switching to another thread in hardware, and the parallelism problem by mainly using SIMD (with masking and scatter/gather memory accesses, so the lanes look like multiple threads). This works well if you are doing mostly the same operations on thousands or millions of values, which is what graphics rendering and GPGPU are.
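
The masking trick can be sketched like this (a toy 4-lane model, not any real ISA): every lane executes both sides of a branch, and a per-lane mask selects which result each lane keeps; gather lets each lane read its own address.

```python
def simd_where(mask, a, b):
    """Per-lane select: lane i keeps a[i] where mask[i], else b[i]."""
    return [x if m else y for m, x, y in zip(mask, a, b)]

lanes = [-3, 1, 4, -1]                  # one value per SIMD lane ("thread")
mask  = [x > 0 for x in lanes]          # predicate: which lanes take the if-side

then_side = [x * 2 for x in lanes]      # ALL lanes compute the "then" branch
else_side = [-x for x in lanes]         # ALL lanes compute the "else" branch
result = simd_where(mask, then_side, else_side)
print(result)                           # [3, 2, 8, 1]

memory  = [10, 20, 30, 40, 50]
indices = [4, 0, 2, 2]                  # per-lane addresses
gather  = [memory[i] for i in indices]  # gather: each lane reads its own address
print(gather)                           # [50, 10, 30, 30]
```

That's why divergent branches cost you on a GPU: both paths run for the whole lane group, and the mask just throws half the work away.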

Then there are also DSPs, which solve the memory access issue by having only a small amount of on-chip memory, paired with either explicit DMA or memory reads that deliver a delayed result, and solve the parallelism issue by being VLIW.
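
The explicit-DMA style is usually a ping-pong buffer pattern, sketched here as sequential Python for illustration (on real hardware the DMA engine runs concurrently with the core; the names and block size are mine): while the core processes one buffer, the DMA fills the other, so memory latency hides behind useful work.

```python
BLOCK = 4  # illustrative block size

def dma_fetch(src, offset):
    """Stand-in for kicking off a DMA transfer of one block into local memory."""
    return src[offset:offset + BLOCK]

def process(samples):
    return [2 * s for s in samples]     # stand-in for the real DSP kernel

def run(src):
    out = []
    buf = dma_fetch(src, 0)             # prime the first ("ping") buffer
    for off in range(BLOCK, len(src) + BLOCK, BLOCK):
        next_buf = dma_fetch(src, off)  # start filling the "pong" buffer...
        out += process(buf)             # ...while crunching the current block
        buf = next_buf                  # swap ping/pong
    return out

print(run(list(range(8))))              # [0, 2, 4, 6, 8, 10, 12, 14]
```

The key property is that the core never waits on a load inside the inner loop: every read hits the small on-chip buffer that was filled a block ago.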

And finally there are the in-order/microcontroller CPUs that simply don't care about performance and do the cheapest, simplest thing.


