High-Performance GPU Computing in the Julia Programming Language

jlebar · on Oct 27, 2017

> This is in part because of the work by Google on the NVPTX LLVM back-end.

I'm one of the maintainers at Google of the LLVM NVPTX backend. Happy to answer questions about it.

As background, Nvidia's CUDA ("CUDA C++?") compiler, nvcc, uses a fork of LLVM as its backend. Clang can also compile CUDA code, using regular upstream LLVM as its backend. The relevant backend in LLVM was originally contributed by nvidia, but these days the team I'm on at Google is the main contributor.

I don't know much (okay, anything) about Julia except what I read in this blog post, but the dynamic specialization looks a lot like XLA, a JIT backend for TensorFlow that I work on. So that's cool; I'm happy to see this work.

Full debug information is not supported by the LLVM NVPTX back-end yet, so cuda-gdb will not work yet.

We'd love help with this. :)

Bounds-checked arrays are not supported yet, due to a bug [1] in the NVIDIA PTX compiler. [0]

We ran into what appears to be the same issue [2] about a year and a half ago. nvidia is well aware of the issue, but I don't expect a fix except by upgrading to Volta hardware.

[0] https://julialang.org/blog/2017/03/cudanative [1] https://github.com/JuliaGPU/CUDAnative.jl/issues/4 [2] https://bugs.llvm.org/show_bug.cgi?id=27738

syllogism · on Oct 27, 2017

Does this mean we could hook Cython up to NVPTX as the backend?

I've always thought it weird that I'm writing all my code in this language that compiles to C++, with semantics for any type declaration etc...And then I write chunks of code in strings, like an animal.

nicwilson · on Oct 27, 2017

IDK about Cython, but I remember a blog post using Python's AST reflection to jit to LLVM ->NVPTX -> PTX. It's relatively simple to do, I've done it for LDC/D/DCompute[1,2,3]. It's a little tricker if you want to be able to express shared memory surfaces & textures, but it should still be doable.

[1] https://github.com/ldc-developers/ldc [2] dlang.org [3] http://github.com/libmir/dcompute

jlebar · on Oct 27, 2017

Running code on the GPU usually isn't as easy as "compile and go", but...yeah, if you can emit LLVM IR, you can get PTX which you can run on the GPU.

vchuravy · on Oct 27, 2017

I spend a while looking at debug information for NVPTX last year and came to the conclusion that it luckily dwarf, with some weird serialisation for the assembler.

The NVPTX backend would benefit imo to move towards the more general LLVM infrastructure so that emitting the dwarf info is not another special case.

jlebar · on Oct 27, 2017

We'd like this too. Unfortunately a lot of the special cases can't be eliminated because we have to interface with ptxas, the closed-source PTX -> SASS (GPU machine code) optimizing assembler.

vchuravy · on Oct 27, 2017

Yeah I know and the DWARF info special cases are even worse for ptxas. I never had enough time, but Nvidia has surprisingly a lot information on it out there.

waynecochran · on Oct 27, 2017

nvcc installed on a Mac seems tied to the current clang and the latest clang's don't support CUDA development so I have to retrograde my clang to an older version to use CUDA. Why is nvcc tied to clang?

jlebar · on Oct 27, 2017

nvcc installed on a Mac seems tied to the current clang and the latest clang's don't support CUDA development so I have to retrograde my clang to an older version to use CUDA. Why is nvcc tied to clang?

To be clear, there are two ways to compile CUDA (C++) code. You can either use nvcc (which itself may use clang), or you can use regular, vanilla clang, without ever involving nvcc.

Nvidia's closed-source compiler, nvcc, uses your host (i.e. CPU) compiler (gcc or clang) because it transforms your input .cu file into two files, one of which it compiles for the GPU (using a program called cicc), and the other of which it compiles for the CPU using the host compiler.

The other way to do it is to use regular open-source clang without ever involving nvcc. The version of clang that comes with your xcode may not be new enough (I dunno), but the LLVM 5.0 release should be plenty new, unless you want to target CUDA 9, in which case you'll need to build from head.

I don't know the technical reasons why nvcc is so closely tied to the host compiler version -- it annoys me sometimes, too.

CyberDildonics · on Oct 27, 2017

Are you guys working with the people on the ISPC team? They also had experimental compiling from ISPC to NVPTX, was it this backend?

jlebar · on Oct 27, 2017

I haven't been privy to any work with the ISPC folks. It's totally possible they're using the LLVM NVPTX backend, I dunno.

dragontamer · on Oct 27, 2017

In my experience, CUDA / OpenCL are actually rather easy to use.

The hard part is optimization, because the GPU architecture (SIMD / SIMT) is so alien compared to normal CPUs.

Here's a step-by-step example of one guy optimizing a Matrix Multiplication scheme in OpenCL (specifically for NVidia GPUs): https://cnugteren.github.io/tutorial/pages/page1.html

Just like how high-performance CPU computing requires a deep understanding of cache and stuff... high-performance GPU computing requires a deep understanding of the various memory-spaces on the GPU.

------------

Now granted: deep optimization of routines on CPUs is similarly challenging, and actually undergoes a very similar process in how to partition your work problem into L1-sized blocks. But high-performance GPUs not only have to consider their L1 Cache... but also "Shared" (or OpenCL __local) memory and "Register" (or OpenCL __private) memory as well. Furthermore, GPUs in my experience have way less memory than CPUs per thread/shader. IE: Intel "Sandy Bridge" CPU has 64kb L1 cache per core, which can be used as 2-threads if hyperthreading is enabled. A "Pascal" GPU has 64kb of "Shared" memory, which is extremely fast like L1 cache. But this 64kb is shared between 64 FP32 cores!!!.

Furthermore, not all algorithms run faster on GPGPUs either. For example:

https://askeplaat.files.wordpress.com/2013/01/ispa2015.pdf

This paper claims that their GPGPU implementation (Xeon Phi) was slower than the CPU implementation! Apparently, the game of "Hex" is hard to parallelize / vectorize.

---------------

Now don't get me wrong, this is all very cool and stuff. Making various programming tasks easier is always welcome. Just be aware that GPUs are no silver bullet for performance. It takes a lot of work to get "high-performance code", regardless of your platform.

And sometimes, CPUs are faster.

ViralBShah · on Oct 27, 2017

Absolutely. The goal with Julia is to make it easy to use whatever hardware is best suited for the problem you are solving. This work, IMO, reduces the barrier to entry for writing code for GPUs and gives Julia users more options.

gravypod · on Oct 27, 2017

> Julia has recently gained support for syntactic loop fusion, where chained vector operations are fused into a single broadcast

Wow. That's very impressive.

I hope one day we get this sort of tooling with AMD GPUs.

one-more-minute · on Oct 27, 2017

Ask and ye shall receive: https://github.com/JuliaGPU/CLArrays.jl

gravypod · on Oct 27, 2017

That's amazing. I'm very excited about the prospect of auto-magically trans-piling code into GPU code. This sort of tech will make GPUs approachable to many more scientists and programmers.

vchuravy · on Oct 27, 2017

ROCm will make similar things possible! I prefer native codegen, CLArrays.jl is an excellent solution in the meantime.

jernfrost · on Oct 27, 2017

How does the Julia approach compare to the alternatives in performance and ease of use? Can e.g. Python or R do this in any way?

vchuravy · on Oct 27, 2017

The big difference is that Julia can handle user defined structs and handle higher-level functions, e.g. you pass a Julia function to you GPU kernel and that function will get compile for the GPU without you having to declare it GPU-compatible.

ChrisRackauckas · on Oct 27, 2017

The key difference here is that, while Python and R has a lot of their standard library written in other languages (C), Julia's is mostly written in Julia. Same with Julia's packages. This means that you can throw a lot of library functions and they will GPU compile just fine because the whole stack is Julia all the way down (in many cases. There are of course exceptions).

kxyvr · on Oct 27, 2017

I keep hearing this, but each time I look at the links on HN, I see that the high-performance libraries being cited are those still written in C, C++, or some other low level language. For example, even in this link, the code is tying into things like cuBLAS, which is definitely not Julia code. For me, high performance linear algebra routines are important and I just checked here:

https://docs.julialang.org/en/latest/stdlib/linalg/

It looks like Julia uses a combination of LAPACK and SuiteSparse. These are good choices, but it's not Julia code and these routines are callable from all sorts of other languages like Python, MATLAB, and Octave. As such, it still appears as though Julia is operating more like a glue language rather than a write all of your numerical libraries in Julia language, which is fine, but I don't feel like that's what it's being sold as.

ViralBShah · on Oct 27, 2017

We use BLAS, LAPACK and SuiteSparse - because they are incredibly high quality libraries. For example, if you translate LAPACK or SuiteSparse into Julia, you will get the same performance. BLAS is a different story (and while not impossible to have a Julia one, the effort to build one would be better deployed elsewhere for now).

The benefit comes from user code, which in many dynamic languages is interpreted and is much slower than built-in C libraries. For example, look at the Julia `sum`. It is written in Julia. Or that we are in the process of replacing openlibm (based on freebsd libm) with a pure julia implementation. Or any of the fused array kernels (arithmetic, indexing, etc.). Our entire sparse matrix implementation (except for the solvers) is in pure Julia.

kxyvr · on Oct 27, 2017

To be sure, I agree and think it's the right thing to do to hook into external libraries when they provide the functionality we need. That's just an extension of the right tool for the right job philosophy.

Alright, so I write numerical codes professionally. Though it's not quite fair, I tend to bulk things into glue languages and computation languages. In a glue language, we combine all of our numerical drivers and produce an application. For example, optimization solvers don't really need to be written in a low-level language since their parallelism and computation is primarily governed by the function evaluations, derivatives, and linear system solvers. As long as these are fast, we can use something like like Python to code it and it runs about the same speed, and in parallel, as a C or C++ code. On the other hand, we have the computation languages where we code the low level and parallel routines like linear algebra solvers. Typically, this is done is C/C++/Fortran, but I'm curious to see how Rust can fit in with these language. For me, the primary focus of a computation language is one that it's fast and two that it's really, really easy to hook into glue languages. Since just about every language has a c-api, that's our pathway forward.

Alright, so now we have Julia. Is it a glue language? Is it a computation language? Maybe it's designed to be both. However, at the end of the day, most of the examples I see of Julia on HN are using Julia as a glue language. To me, we have lots of glue languages that already hook into whatever other stuff we care about be it plotting tools or database readers or whatever. If Julia is designed to be a computation language, great. However, that means we should be seeing people writing the next generation of things like parallel factorizations and then hooking them into a more popular glue language like Python or MATLAB or whatever. Maybe these examples exist and I haven't seen them. However, until this is more clear, I personally stay away from Julia and I advise my clients to as well.

And, to be clear, Julia may be wonderfully suited for these things. Mostly, I wanted to express my frustration of what I see as an ambiguity in the marketing.

sixbrx · on Oct 27, 2017

I think the biggest reason that Julia might not satisfy your definition of "computation language" is just that Julia has a significant runtime, as a garbage collected language. So it's not really suited to writing something as a library and then using it from glue languages as your proposing for "computation languages", at least currently. I think that would remain true even if it had the speed and flexibility and developer resources to not need to call out to native libraries for its own purposes.

Which reminds me a bit of Java, where the speed is either there or getting there for tight loops, but it just doesn't play well with others at all when they are wanting to do the driving.

kxyvr · on Oct 27, 2017

That's fair. And, certainly, there's nothing wrong with a glue language geared toward computation. Then, from my perspective, the question for me becomes whether Julia provides an good resources for the end application. Stuff like good plotting, reading from databases and diverse file formats, easy to generate GUIs, etc. Honestly, that's part of why I think Python became popular in the computation world. Personally, I dislike the language, but I support it because there's code floating around to do just about anything for the end application and that's hugely useful.

There's one other domain that, depending, Julia may fit well. At the moment, I prototype everything in MATLAB/Octave because the debugger drops us into a REPL where we can perform arbitrary computations on terms easily. Technically, this is possible in something like Python, but it's moderately hateful compared to MATLAB/Octave because factorizing, spectral analysis, and plotting can be done extremely easily in MATLAB/Octave. That said, I tend not to keep my codes there since MATLAB/Octave are not good, in my opinion, for developing large, deliverable applications. As such, in my business where I quickly develop one off prototype codes on a tight deadline, maybe it would be a reasonable choice.

Though, thinking about it, there may be licensing problems. The value in MATLAB is that they provide the appropriate commercial license for codes like FFTW and the good routines out of SuiteSparse rather than the default GPL. I'm looking now and it's not clear to me Julia provides the same kind of cover. This complicates the prototyping angle.

ska · on Oct 27, 2017

It's a difficult trade off, if you wait until all the basic functionality is written, debugged, and optimized in Julia then nobody can use it for "real" work for ages. On the other hand they've pretty clearly been making design decisions that allow for efficient native implementations (unlike, say python).

I haven't been following very closely recently but there has been some active native implementation work such as: https://github.com/JuliaDiffEq/DifferentialEquations.jl

ViralBShah · on Oct 29, 2017

ChrisRackauckas (above) is the author of DifferentialEquations.jl!

ChrisRackauckas · on Oct 28, 2017

Of course the linear algebra parts were replaced with cuBLAS because it's implemented in Julia via BLAS (OpenBLAS/MKL), but also because numerical linear algebra needs to be architecture-specific (though there are minor efforts for a JuliaBLAS). But most of those other functions, like sum, findfirst, etc (all of those higher level functions that you use Julia/Python/MATLAB for) will be available through this mechanism.