CUDA did not share many similarities with the graphics pipeline back then. Even now it's a stretch to compare the two.
x86 was, and still is, a weak selling point. Developers from all backgrounds deploy their apps on varying architectures like ARM or the JVM without much stress. The hard part of writing code that is fast for a given architecture is only made more complex when your compute units are full x86 cores rather than the simple SP vector units of a GPU.
I know many folks without a graphics background writing CUDA apps, but no one outside of HPC research environments dabbling in the complexities of Xeon Phi.
The reason for this is simple: if you get your code into a CUDA-friendly structure, you have created a data-parallel rewrite that leverages the highly parallel memory bus of GPUs for a fairly easy speedup. By being constrained to a semi-opinionated programming interface, people can see real speedups without getting bogged down in multithreading, multiprocessing, and buggy device drivers.
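To make that concrete, here is a rough sketch of what that "CUDA-friendly structure" tends to look like, a throwaway SAXPY kernel (my own illustration, nothing more): one thread per element, contiguous loads and stores, and the wide memory bus does the rest.

    // Minimal data-parallel sketch: each thread owns one element.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        // Unified memory keeps the example short; explicit cudaMemcpy works too.
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        int block = 256;
        int grid  = (n + block - 1) / block;   // enough blocks to cover n elements
        saxpy<<<grid, block>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);  // expect 4.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }

Nothing clever is happening there, which is exactly the point: once your problem fits that shape, the runtime and hardware handle the parallelism for you.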