What Can We Learn from the Intel SPMD Program Compiler? (intel.com)
72 points by matt_d on Dec 24, 2018 | 18 comments


Interesting to see Intel boosting ISPC again. My impression had been that it was a skunkworks project they never really supported or accepted.

See Matt Pharr's very interesting blog posts on the development and open-sourcing of ISPC.

[1] https://pharr.org/matt/blog/2018/04/30/ispc-all.html


Management at Intel has seen a major shakeup recently, and the blog series you linked was public enough that the new management probably knows about it. It makes good points not just about the technology but also about the organizational failures at Intel.

It's entirely possible that the winds have shifted and that the new management basically pointed a finger at the ISPC blog post series and said "more of that".


Thanks for this link. Just read through the whole series and I found it to be very valuable.

I'd +1 these recommendations and add that the series is probably valuable to anyone interested in GPGPU or fast parallel programming.

Also it links out to a bunch of other high-quality resources.


I second that.

Matt Pharr's posts on ispc are an extremely interesting read, both from a technical perspective (programming abstraction and hardware design) and a human perspective (Intel politics, managing style, and career decisions).


Thirded. One of the best takeaways from his series of posts is that a programming model a programmer can reason about is far better than compiler optimizations a programmer can only guess about.


I worked on a project that compiled a declarative DSL to both vectorized CPU code and GPU code.

For the vectorized CPU code, I found ISPC generally pleasant to use, but don't be fooled by its similarities to C. It's not a C dialect, and I got burned by assuming that implicit type conversion rules which are identical across C, C++ and Java would also hold for ISPC. The code in question was a pretty simple conversion of a pair of uniformly distributed uint64_ts to a pair of normally distributed doubles (Box-Muller transform). As I remember, operations between a double and an int64_t result in the double being truncated rather than the int64_t being converted. I wrote some C++ code and ported it to Java more or less without modification, but was scratching my head as to why the ISPC version was buggy. I remember the feeling of the hair standing up on the back of my neck as it dawned on me that the implicit conversion rules might be different from those of C/C++/Java.
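
A minimal C sketch of the rule I was (wrongly) assuming would carry over; the ISPC behavior in the comment is just my recollection, not a spec citation:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t u = 3;    /* stand-in for the uniformly distributed integer */
        double   d = 0.5;

        /* C's usual arithmetic conversions promote u to double, so r == 1.5.
           My recollection is that the equivalent ISPC expression behaved as if
           d were truncated to an integer first (0 * 3 == 0) -- hence the bug. */
        double r = d * u;
        printf("%f\n", r);
        return 0;
    }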

Seemingly arbitrarily deviating from C/C++ behavior in a language that's so syntactically close is a big footgun. Honestly, I think that C made a mistake. An operation between an integer and a floating point number should result in the smallest floating point type that has a mantissa and exponent range at least as large as the floating point operand's, and that is capable of losslessly representing every value in the range of the integer operand's type. If no such type is supported, then C should have forced the programmer to choose what sort of loss is appropriate via explicit casts. Disallowing implicit type conversion altogether is also reasonable. However, if your language looks and feels so close to C, you really need good reasons to change these sorts of details about implicit behavior.
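
To illustrate the loss that rule is meant to avoid (plain C, values chosen only for illustration): double has a 53-bit mantissa, so implicitly promoting a large int64_t to double silently rounds it:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int64_t big = (1LL << 53) + 1;   /* not exactly representable in a double */
        double  d   = big;               /* implicit conversion rounds to 2^53 */

        /* Prints -1 on a typical IEEE-754 system: the round trip lost the +1. */
        printf("%lld\n", (long long)((int64_t)d - big));
        return 0;
    }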

Similarly, I think C should have made operations between signed and unsigned integers of the same size result in the next largest signed integer type (uint32_t + int32_t = int64_t), or just not allowed operations between signed and unsigned types without explicit casts.
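
The kind of surprise that last rule would remove, in standard C (assuming a typical platform where int is 32 bits):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint32_t u = 1;
        int32_t  i = -2;

        /* The usual arithmetic conversions turn i into an unsigned value,
           so u + i wraps to UINT32_MAX instead of being -1. */
        if (u + i > 0)
            printf("1 + (-2) came out \"positive\": %u\n", u + i);
        return 0;
    }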


I very strongly think that Rust went the right route here:

    error[E0277]: cannot add a float to an integer
     --> src/main.rs:2:6
      |
    2 |     1+1.0;
      |      ^ no implementation for `{integer} + {float}`
Casts are historically such a massive source of bugs and instability that the correct casting rule for every numerical computation is always: "Make the programmer explicitly choose."


I disagree. While C's signed/unsigned type conversion is a source of bugs and should be explicit, that doesn't mean that all implicit conversions are bad.

I think that implicit conversion from int to float is mostly harmless: since the result is a float and the float-to-int conversion is explicit, it is unlikely to create bugs.

intX to intY, or uintX to uintY with Y >= X, are two other conversions which are unlikely to create bugs but save a lot of boilerplate.
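
A small C illustration of why those widening cases are benign (made-up values; I use double rather than float so the integer-to-floating conversion stays exact):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint16_t narrow = 65535;
        uint64_t wide   = narrow;   /* uintX -> uintY, Y >= X: always value-preserving */

        int32_t n = -123;
        double  x = n;              /* int32 -> double: exact, 53-bit mantissa */

        printf("%llu %f\n", (unsigned long long)wide, x);
        return 0;
    }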


Interestingly enough, most of the boilerplate and straitjacket rules from the Algol family of languages, which cowboy programmers used to complain about, are being adopted again thanks to the age of rising CVEs in always-connected devices.


I agree that implicit conversions in C/C++ are something that ideally wouldn't exist.

IMO, though, the only way to avoid having those rules is to require explicitness on the part of the programmer. All other options just substitute a different set of rules, which will be expected in some cases and unexpected in others.


ISPC vs OpenCL is a very interesting discussion. I think the discussion warrants a bit more introduction.

The ideal is to deliver the illusion of true threads to the programmer, where each thread is mapped to a SIMD lane. This is because on x86 systems, an AVX2 SIMD register (YMM) can hold 8x 32-bit values (256 bits); that's 16x 32-bit values for ZMM AVX-512 registers. So you get a huge boost to parallelism if you treat each SIMD lane as a separate thread.

In effect: it is far cheaper to make SIMD-"threads" than to make "real threads". So you get the Vega 64 (4096 shaders) or the NVIDIA V100 (5120 CUDA cores). Even on Intel, you effectively get 8 SIMD lanes per actual core with AVX2.

The question is how to deal with this mapping, to keep things efficient and reasonable for the programmer. OpenCL and ISPC have different strategies for dealing with this abstraction and for efficiently presenting this illusion to the programmer. The general "work-group" (OpenCL) and "warp"/"wavefront" (CUDA/AMD) abstractions seem to primarily capture this SIMD-width question.
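
To make the "8 lanes per register" point concrete, here is a rough C sketch using AVX2 intrinsics; it is roughly the kind of code an AVX2-targeting compiler such as ISPC would generate for a simple data-parallel loop over floats (the explicit intrinsics, function name, and multiple-of-8 assumption are mine):

    #include <immintrin.h>

    /* One YMM register holds 8 x 32-bit floats, so each "program instance" /
       SIMD-thread maps to one lane. Assumes n is a multiple of 8 to keep the
       sketch short. */
    void add_arrays(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(dst + i, _mm256_add_ps(va, vb));
        }
    }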

But there's one additional thing on GPUs that doesn't exist on CPUs: the LDS memory (AMD) or shared memory (NVIDIA). Shared memory is effectively a manually-managed L1 cache on GPUs.

GPUs have very weak cache hierarchies. In high-performance GPGPU compute, the programmer is expected to manually manage transfers between global RAM and the ~64kB shared memory region. And of course, only threads within the same wavefront / work-group are guaranteed to share a portion of shared memory.

Intel systems don't need this abstraction: Intel CPUs have a full, hardware-managed cache hierarchy that transparently caches global RAM.

So I guess Intel is "taking a victory lap" with this blog post: noting that all of the complexity of Wavefronts / etc. etc. isn't really necessary on ISPC / Intel systems.

---------

Still, Intel never really matched the performance that the NVIDIA V100 or AMD Vega 64 can achieve. Even Intel's Xeon Phi has far lower raw performance than what NVIDIA / AMD were able to put out.

It's certainly more difficult to write high-performance code on GPUs. But with so much more raw power on GPUs, many people have been able to extract higher performance in practice.


> Intel CPUs have a full and proper cache which is unified with global RAM

> more difficult to write high-performance code on GPUs. But with so much more raw power on GPUs

These seem like two sides of the same coin; of course you can extract higher performance with lower coupling.

I don’t think the Xeon Phi was given a fair chance; they did two iterations on the idea before canning it. NVidia has been doing video cards forever.


Xeon Phi's origins are in Larrabee. They sold two iterations, but they had several more in house.


I sat in on an Intel session at GDCE 2009 introducing Larrabee programming and how it would take over the GPU programming world for game developers.

Making it a niche product available only to a select few, instead of a mainstream GPGPU, meant most of us hardly cared about it.


The primary advantage of ISPC over the GPU is latency. If you issue a small amount of parallelizable work to the GPU, you are still likely waiting ~500 microseconds for the result.


I wonder if that is possible to optimize at compile or execution time. Is it possible to determine if the gain from sending the work to the GPU is worth the latency hit?


It would be possible to generate both CPU and GPU versions at compile-time, and pick the best one at run-time based on the data encountered. I do research on a similar technique for the Futhark language.
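
A crude C sketch of that kind of run-time dispatch; the function names, the doubling kernel, and the cutoff are all made up, and a real system (like the Futhark work mentioned above) would measure or autotune the threshold rather than hard-code it:

    #include <stddef.h>

    /* Hypothetical placeholders for the two compiled versions of one kernel. */
    static void kernel_cpu(float *out, const float *in, size_t n) {
        for (size_t i = 0; i < n; i++) out[i] = in[i] * 2.0f;   /* vectorized CPU path */
    }

    static void kernel_gpu(float *out, const float *in, size_t n) {
        /* stand-in: a real version would enqueue a GPU kernel and copy back */
        for (size_t i = 0; i < n; i++) out[i] = in[i] * 2.0f;
    }

    /* Below some problem size, the fixed GPU launch/transfer latency dominates,
       so the CPU version wins despite lower peak throughput. */
    enum { GPU_CUTOFF = 1 << 16 };   /* made-up number */

    void apply_kernel(float *out, const float *in, size_t n) {
        if (n < GPU_CUTOFF) kernel_cpu(out, in, n);
        else                kernel_gpu(out, in, n);
    }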


I have little experience in this area, but if your workload size isn't static and depends on, say, user input, I don't see a good way to select "the better option" at compile time. Plus, do compilers take into account graphics capabilities when compiling programs?



