It's a good question actually. Intel tried to make a GPU called Larrabee that was mostly a bunch of small x86 cores with giant vector units. Turns out that it couldn't compete in rendering performance on existing games (in 2010) without the fixed function units that GPUs have, so they canceled it as a GPU. It did result in the AVX-512 instruction set though.
I think the idea still has promise but there's a chicken and egg issue where you'd really need to rearchitect game engines and content pipelines to take full advantage of the flexibility before you'd see a benefit. It's possible that it would work better today, and it's also possible that Intel just gave up too early. In some cases we're already seeing people bypassing the fixed function rasterizer in GPUs and doing rasterization manually in compute shaders [1] [2].
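To make "doing rasterization manually" concrete: at its core it means evaluating triangle edge functions per pixel in a compute kernel instead of letting the fixed-function rasterizer do it. A minimal scalar C++ sketch of that test (my own names, not code from either reference):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Signed area of the parallelogram spanned by (a->b) and (a->p);
    // its sign says which side of edge a->b the point p falls on.
    static float edge(float ax, float ay, float bx, float by, float px, float py) {
        return (bx - ax) * (py - ay) - (by - ay) * (px - ax);
    }

    // Rasterize one counter-clockwise triangle into a coverage buffer.
    // A compute shader would run the inner test for many pixels in parallel;
    // this is the same math written out in scalar form.
    void rasterize_triangle(float x0, float y0, float x1, float y1, float x2, float y2,
                            std::vector<uint8_t>& coverage, int width, int height) {
        int minX = std::max(0, (int)std::min({x0, x1, x2}));
        int maxX = std::min(width - 1, (int)std::max({x0, x1, x2}));
        int minY = std::max(0, (int)std::min({y0, y1, y2}));
        int maxY = std::min(height - 1, (int)std::max({y0, y1, y2}));
        for (int y = minY; y <= maxY; ++y) {
            for (int x = minX; x <= maxX; ++x) {
                float px = x + 0.5f, py = y + 0.5f;  // sample at the pixel center
                float e0 = edge(x0, y0, x1, y1, px, py);
                float e1 = edge(x1, y1, x2, y2, px, py);
                float e2 = edge(x2, y2, x0, y0, px, py);
                if (e0 >= 0.0f && e1 >= 0.0f && e2 >= 0.0f)  // inside all three edges
                    coverage[y * width + x] = 1;             // (flip the test for the other winding)
            }
        }
    }

The real implementations obviously do far more than this (binning, depth testing, attribute interpolation), but that inside/outside test is the part the fixed-function rasterizer otherwise gives you for free.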
I don't think the programming model with Larrabee would really have been any simpler, though. You still face many of the same issues that you do with GPUs, and being SIMD instead of SIMT would actually make it slightly harder to work with.
The actual hard part with GPUs is ensuring you can divide up the work and that it doesn't branch within a given chunk size. You have those same issues when trying to leverage a many-core CPU with AVX-512. You still want to keep those AVX-512 units loaded, which means work units of 16 FP32 lanes must all take the same "branch" - not really any different from feeding warps on a GPU. And you've still got to scale across dozens if not hundreds of CPU cores.
AVX-512 has an execution mask and should usually be programmed with a SIMT-like model (e.g. with ispc). Writing SIMT kernels or chunking up the work is not the hard part.
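To make the execution mask concrete, here is a minimal C++ sketch with AVX-512 intrinsics (my own example, not ispc output): both sides of the "branch" are computed for all 16 lanes and the mask selects per lane, which is essentially the predication a SIMT-style compiler emits. Assumes a compiler flag along the lines of -mavx512f.

    #include <immintrin.h>

    // y[i] = x[i] < 0 ? -x[i] : sqrt(x[i]), 16 floats per iteration.
    // There is no per-lane jump: both results are computed and the mask selects.
    void branchy_kernel(const float* x, float* y, int n) {
        for (int i = 0; i + 16 <= n; i += 16) {
            __m512    v      = _mm512_loadu_ps(x + i);
            __mmask16 neg    = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_LT_OQ);
            __m512    if_neg = _mm512_sub_ps(_mm512_setzero_ps(), v);  // -x
            __m512    if_pos = _mm512_sqrt_ps(v);                      // sqrt(x)
            // Lanes with the mask bit set take if_neg, the rest take if_pos.
            _mm512_storeu_ps(y + i, _mm512_mask_blend_ps(neg, if_pos, if_neg));
        }
        // A scalar tail for the n % 16 leftover elements is omitted here.
    }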
The actual actual hard part with GPUs is writing portable code in the face of a million edge cases due to different proprietary hardware architectures and buggy drivers, which you can't test without actually buying and maintaining whole rooms full of hardware. Reducing fixed function parts of the hardware and using a documented ISA, as Larrabee tried, would help with that.
If I'm reading that right, Doom Eternal only uses compute shader rasterization for writing a lighting acceleration structure where they need to make some fine-grained/coarse-grained decisions depending on depth complexity. The scene is still using rasterization hardware.
Nanite uses compute shader rasterization partly because of the quad overdraw problem, since they are targeting roughly one triangle per pixel. But they also say they fall back to traditional rasterization, using the mesh shaders recent hardware added, when that is faster (mesh shaders remove a different set of fixed-function stages, on the transform side, so the same point still applies).
For graphics use, GPUs perform a very significant amount of work in fixed-function hardware (rasterization and texture interpolation being the two most computationally intensive, probably followed by the ROPs, which blend pixel shader output into the framebuffer; you can easily calculate that the ALU bandwidth of the TMUs is on the same order of magnitude as that of all the shader cores), which gives them a huge efficiency lead over anything done with programmable hardware only.
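As a rough illustration, here is what a single bilinear texture sample costs when done in software; a TMU does this per fetch in fixed function, on top of the addressing, wrapping, mip selection and format conversion that are left out below. Scalar C++ sketch with an assumed single-channel, row-major texture layout:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // One bilinear sample from a single-channel float texture, u and v in [0, 1].
    // Just the filtering is roughly 5 multiplies and 8 adds/subtracts per channel
    // (x4 for RGBA), per pixel, per texture lookup.
    float sample_bilinear(const std::vector<float>& tex, int w, int h, float u, float v) {
        float x = u * (w - 1), y = v * (h - 1);               // map to texel space
        int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
        int x1 = std::min(x0 + 1, w - 1), y1 = std::min(y0 + 1, h - 1);
        float fx = x - x0, fy = y - y0;                       // fractional weights
        float t00 = tex[y0 * w + x0], t10 = tex[y0 * w + x1];
        float t01 = tex[y1 * w + x0], t11 = tex[y1 * w + x1];
        float top = t00 + fx * (t10 - t00);                   // lerp along x, row y0
        float bot = t01 + fx * (t11 - t01);                   // lerp along x, row y1
        return top + fy * (bot - top);                        // lerp along y
    }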
Michael Abrash had a great series of articles in Dr. Dobb's detailing how he came to work for Intel (which was spinning up Larrabee) after talking at a game conference with some of their people to ask them for a lerp (linear interpolation) instruction in the x86 extensions[0] :)
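(For anyone who hasn't met the term: a lerp is just a + t*(b - a), i.e. one subtract plus one fused multiply-add. A one-line C++ version of what was being asked for, as I understand it:)

    #include <cmath>

    // Linear interpolation: returns a at t == 0 and b at t == 1.
    inline float lerp(float a, float b, float t) {
        return std::fma(t, b - a, a);   // a + t*(b - a); C++20 also ships std::lerp
    }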
Oh, and Larrabee gave us more than AVX-512: it also gave us the Xeon Phis, which were accelerators (much akin to Nvidia's GPGPU cards?) aimed at scientific code, under the promise that "since it's x86, you don't need to change your code that much!". However:
> An empirical performance and programmability study has been performed by researchers, in which the authors claim that achieving high performance with Xeon Phi still needs help from programmers and that merely relying on compilers with traditional programming models is still far from reality. However, research in various domains, such as life sciences, and deep learning demonstrated that exploiting both the thread- and SIMD-parallelism of Xeon Phi achieves significant speed-ups.
(from Wikipedia[1])
So pretty much the same as a GPU. It is a bit unfortunate, because in theory good OpenCL support could have made the same code run on 2/4/8-core CPUs (with or without SMT) or on the thread beasts that the Phis are/were. But that would've probably required OpenCL to be a bit more mature, and Intel skipped that train too.
OpenCL is very specifically tailored for GPUs (though FPGAs may benefit). The concept of "constant memory", "shared memory", and "global memory" is very GPU-centric, and doesn't benefit Xeon Phi at all.
I'd assume that any OpenCL program would simply function better on a GPU, even compared to a 60-core in-order 512-bit SIMD-based processor like Xeon Phi.
---------------
Xeon Phi's main advantage really was running "like any other x86 processor", with 60 cores / 240 threads. But you still needed to AVX512 up your code to really benefit.
Honestly, I think Xeon Phi just needed a few more revisions to figure itself out. It was on the market for less than 5 years. But I guess it wasn't growing as fast as Nvidia and CUDA.
Maybe I was mixing up names in my head, but I remember from 5~10 years back an Open[Something] (I thought it was OpenCL) that in theory could transparently handle multithreaded code across single/dual/quad-core[0] CPUs or GPGPUs (either Nvidia or AMD).
This is what I had in mind when I wrote "if Intel had given it good OpenCL support". Again, maybe I'm mixing things up in my head, since my career never took me down the path of writing massively parallel code (though I am a user of it, indirectly, through deep learning frameworks).
Where you'd have to use float8 types to be assured of SIMD benefits in CPU code. As such, it's probably more useful to rely on auto-vectorizers in C++ code (such as #pragma omp simd) and maybe intrinsics for the complicated cases.
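A minimal sketch of that approach (names mine, built with something like -fopenmp): the simd clause asks the compiler to vectorize the loop for whatever width the target has, and the outer parallel for covers the thread-level parallelism mentioned upthread, so the same source can run on a Phi or on an ordinary desktop CPU.

    #include <cstddef>

    // y[i] += a * x[i]
    // "parallel for" spreads iterations across cores (60+ on a Phi),
    // "simd" vectorizes each thread's chunk for the widest unit available
    // (AVX-512 on a Phi or Skylake-X, AVX2 or SSE elsewhere).
    void saxpy(float a, const float* x, float* y, std::ptrdiff_t n) {
    #pragma omp parallel for simd
        for (std::ptrdiff_t i = 0; i < n; ++i)
            y[i] += a * x[i];
    }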
Probably not, since it's optimized for scientific workloads (it was designed specifically for the K computer's replacement), so it doesn't have texture units, ROPs, etc.; you'd have to do too much in software to make it actually render things. However, I think the overall design is really good and has enormous potential, if not for graphics then at the very least for ML.
The vector architectures with extremely high memory bandwidth coming out of Japan recently (NEC SX-Aurora Tsubasa, Fujitsu A64FX) are pretty fascinating.
They showed Quake ray tracing demos with it, and other people assumed it was supposed to be sold as a GPU instead of listening to what they were actually saying.
No one thought a collection of Atom-class CPUs with AVX-512 SIMD was going to be able to compete head to head on rasterizing games with the best Nvidia cards.
Why would they put texture units on it if it wasn't intended to be sold as a GPU? Consumer gaming GPUs were explicitly planned. Initial released versions even had DirectX drivers. Here is Intel's SIGGRAPH paper featuring benchmarks of Half-Life 2 Episode 2, Gears of War, and F.E.A.R. http://download-software.intel.com/sites/default/files/m/9/4...
[1] Doom Eternal: http://advances.realtimerendering.com/s2020/RenderingDoomEte...
[2] Epic Nanite: https://twitter.com/briankaris/status/1261098487279579136