Each of the internal units is reasonably simple compared to a modern superscalar x86 core. There are just huge numbers of them. The complexity is in the software needed to keep all those units busy and to move data between them. What you're doing down at the bottom of rendering or machine learning is usually a very simple computation done a huge number of times. Somewhere above that is the problem of parceling out work to all those hardware resources in a somewhat optimal way. That's the hard problem.
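
To make the split between the two layers concrete, here is a toy sketch in Python. It is not how any real GPU driver or runtime works: threads stand in for hardware units, and every name in it (`saxpy_element`, `split_into_chunks`, `run_chunk`, `saxpy`, `num_units`) is made up for illustration. The bottom layer is a single multiply-add; the layer above it just decides which unit handles which slice of the data.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_element(a, x, y):
    # The "very simple computation": one multiply-add per element.
    return a * x + y


def split_into_chunks(n, num_units):
    # Divide n items into num_units contiguous ranges whose sizes
    # differ by at most one. This is a naive static split (an assumption
    # for this sketch, not a claim about real schedulers).
    base, extra = divmod(n, num_units)
    chunks, start = [], 0
    for unit in range(num_units):
        size = base + (1 if unit < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def run_chunk(a, xs, ys, out, start, stop):
    # Work done by one hypothetical unit: apply the simple computation
    # to every element in its assigned range.
    for i in range(start, stop):
        out[i] = saxpy_element(a, xs[i], ys[i])


def saxpy(a, xs, ys, num_units=4):
    # Parcel the work out to num_units workers and wait for all of them.
    out = [0.0] * len(xs)
    with ThreadPoolExecutor(max_workers=num_units) as pool:
        futures = [
            pool.submit(run_chunk, a, xs, ys, out, start, stop)
            for start, stop in split_into_chunks(len(xs), num_units)
        ]
        for future in futures:
            future.result()  # re-raise any worker exception
    return out
```

Even in this toy, the arithmetic is one line and everything else is distribution. An even static split like this only works well when every element costs the same; once the work is uneven or the data has to be moved to where the unit is, the parceling-out part is where the difficulty goes.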