It actually works with OpenCL. I had to make a few changes to get it working on Mac OS X, but I believe the language barrier, the fact that his project is in flux, and the fact that I came out of the blue caused these few lines not to get merged. Please see my fork of his dev branch, which works with 9.5dev: https://github.com/pg-strom/devel/pull/144
Correction: those changes to compile on Mac OS X did get pulled, so https://github.com/pg-strom/devel should build with OpenCL on your Mac and link against 9.5dev.
edit:
His SlideShare page http://www.slideshare.net/kaigai has even more recent presentations. I've only had time to skim the slides, but it seems he's flip-flopped a couple of times between OpenCL and CUDA.
If I simplify this & other news to the statement, “our GPUs are Turing-complete”, then one question seems to follow logically: When will our CPUs become multi-core on the scale of a GPU? Or is that a stupid question that only looks logical?
If you look at computing history, you'll find many parallels over the years. For example the Floating Point Unit started out as a co-processor, this massive unit you put next to your mainframe and plugged in.
Over several iterations it went from external, to part of the same chassis, then to an expansion card, before eventually making its way on-die; by the time the first Pentium came out, the thought of a CPU without an FPU was preposterous. (I remember kicking myself for having bought the 486SX-25 instead of the 486DX-25 that had the integrated FPU.)
We've seen a similar thing with GPUs. The original 3dfx was a card you had to use a loop-through video connector with because it had no 2D capabilities. Slowly it got integrated with 2D capabilities, and then grew into the modern GPU, before starting to transition onto the motherboard, and eventually making its way into Intel chips. I don't believe you can buy a Core i-series or Xeon-series Intel chip these days that doesn't have a GPU integrated into it. It's not the most powerful thing, but it is usable via OpenCL.
Latency is a huge driver for most of that integration.
GPUs are an odd case because they need a lot of internal bandwidth and are minimally impacted by latency, which makes external cards far more viable. And unlike sound cards, they can easily eat more or less unlimited FLOPS.
GPUs were designed for a problem that is easy to make massively parallel. If we tried to make our CPUs like GPUs, we'd end up with something not very useful for most of what we do.
Most CPUs these days already have a GPU's worth of cores onboard. So, to answer your question: about 5 years ago?
The other answer to the rest of your question is probably about when the majority of software starts running entirely on the GPU.
So, let's take the GeForce GTX 980, a pretty heavyweight GPU right now. The tech specs say it has 2048 "CUDA cores", whatever that means. Well, a CUDA "core" is really one lane of a SIMD unit, and in this case I think it's a 32-element-wide SIMD unit, which makes the 980 actually contain 64 cores (is that actually the number of cores, or just instruction schedulers?). Which is still a lot, but you can get a modern 18-core Xeon, which only puts us off by about a factor of 4 in core count.
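Spelled out, the back-of-envelope arithmetic (taking the 32-wide figure above at face value) is:

    2048 \text{ CUDA cores} \div 32 \text{ lanes per SIMD unit} = 64 \text{ SIMD units}, \qquad 64 \div 18 \approx 3.6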
So, the main reason for the gap at this point is functionality. The CPU does a lot of things that the GPU doesn't, and all of that takes silicon that the GPU instead spends on extra ALUs. Things like legacy instruction decode, instruction reordering, deep branch prediction pipelines, and a giant bucket of instructions that just aren't needed on the GPU, but that a lot of modern CPU-targeted software really relies on for performance. And really, if the average piece of software properly took advantage of many cores, we'd all have a ton of cores in our CPUs. It's much easier for CPU designers to bolt on another core than to deal with complicated branch prediction invalidation logic, or any other trick to make crappy code run well.
All in all, I think the gap is smaller than you might expect, and we will continue to see convergence, particularly as engineers start writing code better suited to the GPU and rely less on the CPU's fancy performance tricks. But there are some fundamental differences that make complete crossover tricky.
There are design tradeoffs: between few cores and many cores, between optimizing cores for throughput versus latency, and between optimizing for DSP-like code with predictable memory access and control patterns versus business logic with lots of pointer chasing.
It's actually pretty easy to have CPUs on the scale of a GPU right now. A 15-core Xeon can put out 600 GFLOPS of compute, so you just need 10 of them to equal one Titan X. Now, those 10 Xeons take much more silicon area and power to generate that processing power than the Titan X does, but that's because they have all these branch predictors and out-of-order execution engines and such, which means their performance falls off much more slowly as you throw them at problems that are more complicated than DSP-ish code.
FYI, Intel tried that with the aborted "Larrabee" project, so it's not a stupid question at all. It ended up being too hard a problem to solve. They reused the technology to make the Xeon Phi coprocessors, which are basically PCIe compute "accelerators"; for a class of workloads, and if you're willing to rewrite your code, they are faster and more power-efficient than GPUs.
Although there are many companies trying to crack this many-core problem (Kalray, AppliedMicro, Adapteva, Cavium, Tilera), I'm not aware of any "successful" solution.
Reason #1: Many (most?) computational tasks are not easily parallelizable. The canonical analogy is making a baby:
- One woman can make a baby in nine months
- Nine women can make nine babies in nine months
- But nine women can't make a baby in one month
Each step of fetal development depends on the previous step. For a given single baby, you can't have one woman working on step #49 while another woman works on step #17.
A lot of computational tasks are similar. Think of the Fibonacci sequence: the computation of each number depends on previous results.
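To make the dependency concrete, the naive recurrence is

    F_n = F_{n-1} + F_{n-2}, \qquad F_0 = 0, \quad F_1 = 1

so, computed this way, step n can't start until steps n-1 and n-2 have finished.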
Reason #2: The overhead of communication and coordination. If you split up a task amongst 32 cores, those cores need to communicate with the parent process and perhaps with each other as well. This eats into your transistor budget.
Reason #3: Resource contention. It's nice that you can split up your memory-bandwidth-intensive task across 128 cores, but if most of those cores are just sitting around idling because of memory bandwidth contention, you haven't gained anything.
Reason #4: It's hard. Programmers struggle to write parallel code. Some tools make it much easier of course.
Pedantic comment: you don't need the previous results to compute the nth Fibonacci number; the recurrence has a closed-form solution. In practice it is not constant time to compute (unless you assume constant-time, infinite-precision arithmetic), but it can certainly be done without the prior terms.
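For reference, the closed form being alluded to is Binet's formula:

    F_n = \frac{\varphi^n - \psi^n}{\sqrt{5}}, \qquad \varphi = \frac{1 + \sqrt{5}}{2}, \quad \psi = \frac{1 - \sqrt{5}}{2}

which gives F_n without touching any of the earlier terms (precision caveats as noted).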
The rest of this post is on point, though. Serialization bottlenecks (Amdahl's Law) prevent a lot of optimizations from making meaningful dents in real workloads.
GPUs were designed to run thousands of threads, CPUs to run only a few of them.
CPU cores do branch prediction and instruction reordering to reduce latency. GPU cores do not: when they need to wait, they pause the thread and resume some other thread on the same core.
GPUs were designed to execute the same code on many threads, CPUs to run arbitrary code on each thread.
Each CPU core has its own instruction fetch, decode, and branch modules. On GPUs, there's only one set of fetch/decode/branch modules per group of many cores (32 on nVidia).
These two factors are among the reasons why we have 4 cores in a $200 mid-range desktop CPU and 1024 cores in a $200 mid-range desktop GPU.
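To make the shared fetch/decode point concrete, here's a minimal CUDA sketch (toy code, purely illustrative): all 32 threads of a warp share one instruction stream, so when a branch splits them the hardware runs the two sides one after the other with part of the warp masked off.

    // Toy kernel: threads in the same 32-wide warp share one fetch/decode
    // unit, so whenever a warp contains both even and odd indices this
    // branch executes as two serialized passes: even lanes run while odd
    // lanes sit masked out, then the roles flip.
    __global__ void divergent(const int *in, int *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;
        if (i % 2 == 0)
            out[i] = in[i] * 2;
        else
            out[i] = in[i] + 1;
    }
    // launched as e.g. divergent<<<(n + 255) / 256, 256>>>(d_in, d_out, n);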
What NVidia calls a core and what Intel calls a core really aren't comparable. If you go by the Intel definition, which is more or less "capable of independently issuing an instruction", then you might find 15 cores in a Xeon versus 24 SMM "cores" in the top NVidia chip. Going by the NVidia definition, which is more or less "things that can execute instructions from a queue", NVidia would have 3072 cores in its top chip and Intel would have 120 cores, or "execution ports" as they call them.
That's just a bunch of standard (ARM currently) boxes networked together with a special backplane that allows for DMA between different servers. Not really comparable to a GPU.
I was under the impression that to get the speed benefits of CUDA you had to coalesce memory access. Does anyone know how nested loop joins can use indexes without messing up the memory access coalescing?
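For anyone else wondering, coalescing just means the 32 threads of a warp touch 32 adjacent addresses in the same load, so the hardware can service them with one wide transaction instead of 32 separate ones. A toy CUDA sketch of the difference (illustrative only, not how PG-Strom actually structures its kernels):

    // Coalesced: thread i touches element i, so a warp's 32 4-byte loads
    // fall into one contiguous 128-byte segment -> one memory transaction.
    __global__ void copy_coalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: thread i touches element i * stride, so the same 32 loads
    // scatter across many segments and each needs its own transaction.
    __global__ void copy_strided(const float *in, float *out, int n, int stride)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i * stride < n)
            out[i] = in[i * stride];
    }

Random index probes during a nested loop join look a lot more like the second pattern, which is presumably part of why the chunked scans mentioned elsewhere in the thread are a better fit.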
What would be nice is a document which would explain how to convert queries to use the GPU at its best, a document which would not require its reader to be a GPU geek nor a PostgreSQL expert. A technology can only succeed when the average programmer can use it without making many errors, or can quickly find a solution using Google.
> A technology can only succeed when the average programmer can use it without making many errors, or can quickly find a solution using Google.
I don't think that's true. A technology might only be able to become a mainstream product if the average programmer can use it, but there are a lot of niche products that most programmers will never use. That doesn't mean those technologies haven't succeeded.
If you only ever use things that are similar to the things you already know (e.g. things you can use without making errors) then you will only make very slow progress as a developer. You'll always be avoiding the things that are different to your existing skill set. Sometimes, if you want to use something radically different to what you already use, you have to put in the effort and learn something that is hard. You can't expect the technology to come to you.
> how to convert queries to use the GPU at its best, a document which would not require its reader to be a GPU geek nor a PostgreSQL expert
Uh, isn't this like asking "how do I do calculus without being a calculus geek"? I mean it's specialized - how do you get around being a "GPU geek" if you're coding for a GPU "at its best"?
> A technology can only succeed when the average programmer can use it without making many errors, or can quickly find a solution using Google.
Well, if you want to be successful like PHP, sure that's true.
For me, I'd sooner use something which works well when used by a person who's taken the time to learn what they're doing. It's not important to me what a person of median skill and effort can accomplish: there are only so many things the average engineer is going to be able to learn, for whatever reason.
I like it that there's lots of robust criticism in the comments, but I think people are maybe being a little pessimistic. This is pretty exciting, isn't it? GPUs have sped things up in a lot of technologies, and I'm intrigued to see what they can do for databases.
It's all in-memory, so not I/O-bound. Rows are processed in parallel with PGStrom (in 15MB chunks), while Postgres does it sequentially in one thread. Apparently moving the data around is fast enough not to matter.
Think of GPUs as excelling at MapReduce: either they do the same operation on many data points, or they reduce the data by combining it into an aggregate. The query listed takes an average over a variable number of joins. Each join is doing a sequential search, which can be a lot faster if you can process it in parallel, and then once you've found the rows you want to aggregate, you compute the aggregate. Both steps are dramatically faster in parallel.
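A toy CUDA sketch of the reduce half, i.e. an average with a filter (hypothetical column names; global atomics are used only to keep the sketch short, a real kernel would reduce within each block first):

    // One thread per row: rows matching the filter contribute to a running
    // sum and count; the host divides the two afterwards to get the average.
    __global__ void filtered_avg(const float *value, const int *key,
                                 int target_key, int n,
                                 float *sum, unsigned long long *count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && key[i] == target_key) {
            atomicAdd(sum, value[i]);
            atomicAdd(count, 1ULL);
        }
    }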
The GPU was designed so you can, let's say, move all the vertices in a character's body 1 inch forward: you need to apply that one translation to potentially thousands of vertices. It was also designed to do bitmap operations very quickly, where you need the same image but smaller, so the GPU combines nearby pixels to give a good approximation of what that image would look like farther away.
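That vertex example maps to about the simplest kernel imaginable; a sketch (assumed layout, one thread per vertex):

    struct Vertex { float x, y, z; };

    // Move every vertex the same distance along x: thousands of threads
    // all executing the identical instruction stream on different data.
    __global__ void translate_x(Vertex *verts, int n, float dx)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            verts[i].x += dx;
    }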
Joining a large number of tables is a sweet spot for PGStrom, since that's where there are a lot of parallelizable computations. For a lot of queries it can actually be slower.
Hopefully the query planner would choose whether or not to use PGStrom for a specific query.
Many interesting queries in normalised databases involve a lot of joins. In a BCNF database most interesting queries use 3 or more joins. Whether or not I believe the microbenchmarks is another thing :)
And it's under the GPL, which means this code cannot be contributed upstream to PostgreSQL itself, because PostgreSQL uses a more liberal license: http://opensource.org/licenses/postgresql.
Unfortunately most of the discussion seems to be going down in Japanese.
That said if you don't mind just reading diffs you can get an idea of the state of the project by taking a gander at their github: https://github.com/pg-strom/devel
Probably queries that have a lot of brute-force comparisons involving full table scans. "select foo from bar where baz > 30" with no index, for example.
(Source: I've done a fair bit of experimentation in GPU queries in the domain of CSS selector matching.)
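In GPU terms that kind of query is just a massively parallel predicate test; a toy CUDA sketch, borrowing the column name from the example above (nothing to do with PG-Strom's actual kernels):

    // Brute-force WHERE baz > 30 over a column of n rows: one thread per
    // row writes a 0/1 match flag that a later pass can compact into the
    // result set.
    __global__ void scan_baz_gt_30(const int *baz, unsigned char *match, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            match[i] = (baz[i] > 30) ? 1 : 0;
    }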
Interesting thought. I was assuming that the impressive performance figures based on joining 10 tables together would be making use of indexes, because with full table scans it might be difficult to keep the GPUs fed with data.
In the article linked, they show a multi-way join on integer keys, which is able to leverage accelerated integer comparison (presumably.) I'd imagine a whole range of GIS queries also have the potential to be sped up by GPU offloading.