It's a good question actually. Intel tried to make a GPU called Larrabee that was mostly a bunch of small x86 cores with giant vector units. Turns out that it couldn't compete in rendering performance on existing games (in 2010) without the fixed function units that GPUs have, so they canceled it as a GPU. It did result in the AVX-512 instruction set though.
I think the idea still has promise but there's a chicken and egg issue where you'd really need to rearchitect game engines and content pipelines to take full advantage of the flexibility before you'd see a benefit. It's possible that it would work better today, and it's also possible that Intel just gave up too early. In some cases we're already seeing people bypassing the fixed function rasterizer in GPUs and doing rasterization manually in compute shaders [1] [2].
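To make "doing rasterization manually" concrete: at its core it means evaluating triangle edge functions per pixel in a compute kernel instead of letting the fixed-function rasterizer do it. A minimal scalar C++ sketch of that test (my own names, not code from either reference):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Signed area of the parallelogram spanned by (a->b) and (a->p);
    // its sign says which side of edge a->b the point p falls on.
    static float edge(float ax, float ay, float bx, float by, float px, float py) {
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
    }

    // Rasterize one counter-clockwise triangle into a coverage buffer.
    // A compute shader would run the inner test for many pixels in parallel;
    // this is the same math written out in scalar form.
    void rasterize_triangle(float x0, float y0, float x1, float y1, float x2, float y2,
                            std::vector<uint8_t>& coverage, int width, int height) {
        int minX = std::max(0, (int)std::min({x0, x1, x2}));
        int maxX = std::min(width - 1, (int)std::max({x0, x1, x2}));
        int minY = std::max(0, (int)std::min({y0, y1, y2}));
        int maxY = std::min(height - 1, (int)std::max({y0, y1, y2}));
        for (int y = minY; y <= maxY; ++y) {
            for (int x = minX; x <= maxX; ++x) {
                float px = x + 0.5f, py = y + 0.5f;  // sample at the pixel center
                float e0 = edge(x0, y0, x1, y1, px, py);
                float e1 = edge(x1, y1, x2, y2, px, py);
                float e2 = edge(x2, y2, x0, y0, px, py);
                if (e0 >= 0.0f && e1 >= 0.0f && e2 >= 0.0f)  // inside all three edges
                    coverage[y * width + x] = 1;             // (flip the test for the other winding)
            }
        }
    }

The real implementations obviously do far more than this (binning, depth testing, attribute interpolation), but that inside/outside test is the part the fixed-function rasterizer otherwise gives you for free.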
I don't think the programming model with Larrabee would really have been any simpler, though. You still face many of the same issues that you do with GPUs, and being SIMD instead of SIMT would actually make it slightly harder to work with.
The actual hard part with GPUs is ensuring you can divide up the work and that it doesn't branch within a given chunk size. You have those same issues when trying to leverage a many-core CPU with AVX-512. You still want to keep those AVX-512 units loaded, which means work units of 16 FP32 lanes must all take the same "branch" - not really any different from feeding warps on a GPU. And you've still got to scale across dozens if not hundreds of CPU cores.
AVX-512 has an execution mask and should usually be programmed with a SIMT-like model (e.g. with ispc). Writing SIMT kernels or chunking up the work is not the hard part.
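To make the execution mask concrete, here is a minimal C++ sketch with AVX-512 intrinsics (my own example, not ispc output): both sides of the "branch" are computed for all 16 lanes and the mask selects per lane, which is essentially the predication a SIMT-style compiler emits. Assumes a compiler flag along the lines of -mavx512f.

    #include <immintrin.h>

    // y[i] = x[i] < 0 ? -x[i] : sqrt(x[i]), 16 floats per iteration.
    // There is no per-lane jump: both results are computed and the mask selects.
    void branchy_kernel(const float* x, float* y, int n) {
        for (int i = 0; i + 16 <= n; i += 16) {
            __m512    v      = _mm512_loadu_ps(x + i);
            __mmask16 neg    = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_LT_OQ);
            __m512    if_neg = _mm512_sub_ps(_mm512_setzero_ps(), v);  // -x
            __m512    if_pos = _mm512_sqrt_ps(v);                      // sqrt(x)
            // Lanes with the mask bit set take if_neg, the rest take if_pos.
            _mm512_storeu_ps(y + i, _mm512_mask_blend_ps(neg, if_pos, if_neg));
        }
        // A scalar tail for the n % 16 leftover elements is omitted here.
    }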
The actual actual hard part with GPUs is writing portable code in the face of a million edge cases due to different proprietary hardware architectures and buggy drivers, which you can't test without actually buying and maintaining whole rooms full of hardware. Reducing fixed function parts of the hardware and using a documented ISA, as Larrabee tried, would help with that.
If I'm reading that right, Doom Eternal only uses compute shader rasterization for writing a lighting acceleration structure where they need to make some fine-grained/coarse-grained decisions depending on depth complexity. The scene is still using rasterization hardware.
Nanite uses compute shader rasterization partly because of the quad overdraw problem, since they are targeting roughly one triangle per pixel. But they also say they fall back to traditional rasterization, using the mesh shaders recent hardware added, when that is faster (mesh shaders remove a different set of fixed-function stages, on the transform side, so the same point still applies).
For graphics use, GPUs perform a very significant amount of work in fixed-function hardware (rasterization and texture interpolation being the two most computationally intensive, probably followed by the ROPs, which blend pixel shader output into the framebuffer; you can easily calculate that the ALU bandwidth of the TMUs is on the same order of magnitude as that of all the shader cores), which gives them a huge efficiency lead over anything done with programmable hardware only.
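As a rough illustration, here is what a single bilinear texture sample costs when done in software; a TMU does this per fetch in fixed function, on top of the addressing, wrapping, mip selection and format conversion that are left out below. Scalar C++ sketch with an assumed single-channel, row-major texture layout:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // One bilinear sample from a single-channel float texture, u and v in [0, 1].
    // Just the filtering is roughly 5 multiplies and 8 adds/subtracts per channel
    // (x4 for RGBA), per pixel, per texture lookup.
    float sample_bilinear(const std::vector<float>& tex, int w, int h, float u, float v) {
        float x = u * (w - 1), y = v * (h - 1);               // map to texel space
        int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
        int x1 = std::min(x0 + 1, w - 1), y1 = std::min(y0 + 1, h - 1);
        float fx = x - x0, fy = y - y0;                       // fractional weights
        float t00 = tex[y0 * w + x0], t10 = tex[y0 * w + x1];
        float t01 = tex[y1 * w + x0], t11 = tex[y1 * w + x1];
        float top = t00 + fx * (t10 - t00);                   // lerp along x, row y0
        float bot = t01 + fx * (t11 - t01);                   // lerp along x, row y1
        return top + fy * (bot - top);                        // lerp along y
    }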
Michael Abrash had a great series of articles in Dr. Dobb's detailing how he came to work for Intel (which was spinning up Larrabee) after talking at a game conference with some of their people to ask them for a lerp (linear interpolation) instruction in the x86 extensions[0] :)
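(For anyone who hasn't met the term: a lerp is just a + t*(b - a), i.e. one subtract plus one fused multiply-add. A one-line C++ version of what was being asked for, as I understand it:)

    #include <cmath>

    // Linear interpolation: returns a at t == 0 and b at t == 1.
    inline float lerp(float a, float b, float t) {
        return std::fma(t, b - a, a);   // a + t*(b - a); C++20 also ships std::lerp
    }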
Oh, and Larrabee gave us more than AVX-512: it also gave us the Xeon Phis, which were accelerators (much akin to Nvidia's GPGPU cards?) aimed at scientific code, under the promise that "since it's x86, you don't need to change your code that much!". However:
> An empirical performance and programmability study has been performed by researchers, in which the authors claim that achieving high performance with Xeon Phi still needs help from programmers and that merely relying on compilers with traditional programming models is still far from reality. However, research in various domains, such as life sciences, and deep learning demonstrated that exploiting both the thread- and SIMD-parallelism of Xeon Phi achieves significant speed-ups.
(from Wikipedia[1])
So pretty much the same as a GPU. It is a bit unfortunate, because in theory good OpenCL support could have made the same code run on 2/4/8-core CPUs (with or without SMT) or on the thread beasts that the Phis are/were. But that would've probably required OpenCL to be a bit more mature, and Intel skipped that train too.
OpenCL is very specifically tailored for GPUs (though FPGAs may benefit). The concept of "constant memory", "shared memory", and "global memory" is very GPU-centric, and doesn't benefit Xeon Phi at all.
I'd assume that any OpenCL program would simply function better on a GPU, even compared to a 60-core in-order 512-bit SIMD-based processor like Xeon Phi.
---------------
Xeon Phi's main advantage really was running "like any other x86 processor", with 60 cores / 240 threads. But you still needed to AVX512 up your code to really benefit.
Honestly, I think Xeon Phi just needed a few more revisions to figure itself out. It was on the market for less than 5 years. But I guess it wasn't growing as fast as Nvidia and CUDA.
Maybe I was mixing up names in my head, but I remember from 5~10 years back an Open[Something] (I thought it was OpenCL) that in theory could transparently handle multithreaded code across single/dual/quad-core[0] CPUs or GPGPUs (either Nvidia or AMD).
This is what I had in mind when I wrote "if Intel had given it good OpenCL support". Again, maybe I'm mixing things up in my head, since my career never took me down the path of writing massively parallel code (though I am a user of it, indirectly, through deep learning frameworks).
Where you'd have to use float8 types to be assured of SIMD benefits in CPU code. As such, it's probably more useful to rely on auto-vectorizers in C++ code (such as #pragma omp simd) and maybe intrinsics for the complicated cases.
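A minimal sketch of that approach (names mine, built with something like -fopenmp): the simd clause asks the compiler to vectorize the loop for whatever width the target has, and the outer parallel for covers the thread-level parallelism mentioned upthread, so the same source can run on a Phi or on an ordinary desktop CPU.

    #include <cstddef>

    // y[i] += a * x[i]
    // "parallel for" spreads iterations across cores (60+ on a Phi),
    // "simd" vectorizes each thread's chunk for the widest unit available
    // (AVX-512 on a Phi or Skylake-X, AVX2 or SSE elsewhere).
    void saxpy(float a, const float* x, float* y, std::ptrdiff_t n) {
    #pragma omp parallel for simd
        for (std::ptrdiff_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }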
Probably not, since it's optimized for scientific workloads (it was designed specifically for the K computer's replacement), so it doesn't have texture units, ROPs, etc.; you'd have to do too much in software to make it actually render things. However, I think the overall design is really good and has enormous potential, if not for graphics then at the very least for ML.
The vector architectures with extremely high memory bandwidth coming out of Japan recently (NEC SX-Aurora Tsubasa, Fujitsu A64FX) are pretty fascinating.
They showed Quake ray tracing demos with it, and other people assumed it was supposed to be sold as a GPU instead of listening to what they were actually saying.
No one thought a collection of Atom-class CPUs with AVX-512 SIMD was going to be able to compete head to head on rasterizing games with the best Nvidia cards.
Why would they put texture units on it if it wasn't intended to be sold as a GPU? Consumer gaming GPUs were explicitly planned. Initial released versions even had DirectX drivers. Here is Intel's SIGGRAPH paper featuring benchmarks of Half-Life 2 Episode 2, Gears of War, and F.E.A.R. http://download-software.intel.com/sites/default/files/m/9/4...
[1] Doom Eternal: http://advances.realtimerendering.com/s2020/RenderingDoomEte...
[2] Epic Nanite: https://twitter.com/briankaris/status/1261098487279579136