What do you mean by major apps? Do you have some reason to believe that apps will suddenly become embarrassingly parallel? Major apps often don't even take full advantage of SIMD instructions on the CPU. As soon as you need a context switch, branching, or fast memory access, your GPU is crap.
I used to believe that, until I started to get into cryptocoin mining. There were algorithms that were specifically designed to be GPU-resistant, and they were all ported and saw significant gains. It was that experience that pushed me to learn how to program these devices.
The tooling isn't good enough yet, but there is no question in my mind that practically everything will run on a GPU-like processor in the future. The speedups are just too tremendous.
Intel knows this, which is why they purposely limit PCIe bandwidth.
>I used to believe that, until I started to get into cryptocoin mining.
99% of the apps we use are totally unlike cryptocoin mining. And the style they are written in (and, more importantly, their function) is even more GPU-resistant.
>The tooling isn't good enough yet, but there is no question in my mind that practically everything will run on a GPU-like processor in the future. The speedups are just too tremendous.
Don't hold your breath. For one, nobody's porting the thousands of apps we already have and depend on every day to the GPU.
While true in the short term, it should be noted that Intel is moving more towards the GPU model with its many-core CPUs (more, slower cores) and very wide vector units. There is value in a large core capable of fast out-of-order execution for control purposes. But data processing can be done much faster with a GPU model.
You can even implement explicit speculative execution - simply use a warp for each path and choose at the end. It is very wasteful but can often come out ahead.
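A minimal sketch of the per-thread version of that idea in CUDA (the warp-per-path variant is the same trick at a coarser granularity); the kernel name and the toy formulas here are purely illustrative:

    __global__ void both_paths_select(const float *x, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Deliberately wasteful: evaluate both sides of the branch...
        float if_path   = sqrtf(fabsf(x[i])) * 2.0f;  // toy "taken" computation
        float else_path = x[i] * x[i] + 1.0f;         // toy "not taken" computation
        // ...and only select at the end, so the warp never diverges.
        out[i] = (x[i] > 0.0f) ? if_path : else_path;
    }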
No, Intel's approach is very different from GPUs, because Intel has a strong interest in not making porting to GPUs easy (also, wide vectors are far easier to do in a CPU than in a "GPU-like model").
Cryptocoin mining is embarrassingly parallel by its nature. You are trying lots of different inputs to a hash function, so you can always run an arbitrary number of them in parallel. There are various ways to reduce the GPU/FPGA/ASIC advantage, like requiring lots of RAM, but the task is still parallel if you have enough RAM. Something like a JavaScript JIT on the other hand is fundamentally hard to parallelize.
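To make the contrast concrete, the mining side is trivially data-parallel: every GPU thread just tries its own nonce against the same function. Something like this sketch (the hash is a toy stand-in, not a real mining hash):

    // Toy mixer standing in for SHA-256/scrypt/etc.; illustration only.
    __device__ unsigned int toy_hash(unsigned int nonce) {
        unsigned int h = nonce * 2654435761u;
        h ^= h >> 16;
        h *= 2246822519u;
        h ^= h >> 13;
        return h;
    }

    // Each thread tries a different nonce; no thread has to talk to any
    // other until one of them finds a winner (initialise *found to 0xFFFFFFFF).
    __global__ void search(unsigned int base, unsigned int target, unsigned int *found) {
        unsigned int nonce = base + blockIdx.x * blockDim.x + threadIdx.x;
        if (toy_hash(nonce) < target)
            atomicMin(found, nonce);
    }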
Primecoin is a good example. If you look at the implementation, it requires generating a tightly packed bitfield for the sieve and then randomly accessing it afterwards. Lots of synchronization is required so you don't overwrite previously set bits, and the memory accesses are random, so it's suboptimal for the GPU's memory subsystem.
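Roughly what that marking step looks like on a GPU, as a simplified sketch (not the actual miner code): bits get set with atomics so concurrent threads can't clobber each other's writes, and the scattered addresses are exactly what the memory subsystem dislikes.

    // Mark every multiple of `prime` (from 2*prime up) in a packed bitfield.
    __global__ void sieve_mark(unsigned int *bits, unsigned int nbits, unsigned int prime) {
        unsigned int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int stride = gridDim.x * blockDim.x;
        for (unsigned int m = (tid + 2) * prime; m < nbits; m += stride * prime) {
            // atomicOr stops threads from losing each other's previously set bits,
            // but the accesses are effectively random, hence the poor fit.
            atomicOr(&bits[m >> 5], 1u << (m & 31));
        }
    }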
It took under a year for a GPU miner to come out. Having optimized it for Intel in assembly, I was convinced it wasn't possible for a GPU to beat it - and yet it happened.
It turns out that even when used inefficiently, a thousand cores can brute-force their way through anything.
But you don't have thousands of cores. You have a medium number of cores (64 in the latest Vega 64 GPU) with a high number of ALUs (and threads) per core. When the GPU executes an instruction for a thread, it checks whether the other threads are executing the same instruction and then utilises all ALUs at once. This is great for machine learning and HPC, where you often just have a large matrix or array of the same datatype that you want to process, but most of the time this isn't the case.
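That lockstep execution is also why branchy code hurts: if threads in a warp disagree on a condition, the warp runs both sides one after the other with part of it masked off each time. A contrived CUDA sketch of the effect:

    __global__ void divergent(const int *x, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (x[i] & 1) {
            // Threads in the warp whose x[i] is odd run this first...
            out[i] = x[i] * 3 + 1;
        } else {
            // ...then the remaining threads run this, while the others sit idle.
            out[i] = x[i] / 2;
        }
    }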
> a thousand cores can brute-force their way through anything
This is so fundamentally wrong that your complete lack of understanding of the underlying math is obvious. Let us use 1024-bit RSA as the key to brute-force. If we use the entire universe as a computer, i.e. every single atom in the observable universe enumerates 1 possibility every millisecond, it would take ~6 * 10^211 years to go through them all. In comparison, the universe is less than ~14 * 10^9 years old.
And this is for 1024-bit; today we use 2048-bit or larger.
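For reference, the shape of that back-of-envelope calculation with the commonly quoted ~10^80 atoms (the exact exponent shifts with whatever atom count you assume, but the conclusion doesn't):

    2^1024                       ~ 1.8 * 10^308 candidate keys
    10^80 atoms * 10^3 guesses/s = 10^83 guesses/s
    1.8 * 10^308 / 10^83         ~ 1.8 * 10^225 s  ~ 6 * 10^217 years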
Except you're not brute-forcing through all 2^1024; you're factoring a number, which is much easier, and is why RSA-768 is broken and RSA-1024 is deprecated.
Do you know how to brute-force a prime factorization? Because I'm pretty sure from your comment that you don't. The calculation is based on enumerating all prime pairs of 512-bit length or less each (the lengths of the individual primes may be longer, but for napkin math it's a very good approximation).
That is brute-forcing. That faster and better methods exist is irrelevant to a brute-forcing discussion, but I do mention in the very next message in that thread that they exist.
XPM has nothing to do with finding the primes used in RSA; I'm not sure where that came from. Its PoW is finding Cunningham chains of smaller-sized primes. That said, I wouldn't be surprised if it could be adapted to find larger primes on a GPU significantly faster than a CPU.
It was simply a response to your statement about bruteforcing anything.
There are ways to factor a number that are much faster than brute-forcing, and yes, they work on a GPU. But that has nothing to do with brute-forcing; it's clever math.
The statement was stating an absolute. Algorithmic complexity matters more than the speed of the computing hardware. Obviously, more power means you can compute more, but not everything.
That's a very uncharitable reading of what I wrote. The topic of the conversation was GPU performance vs CPU performance. GPUs are less flexible, but the sheer quantity of execution units more than makes up for it.
But no, I suppose it's more likely I was really saying GPUs aren't bounded by the limits of the universe.
The context is Primecoin and the ability to find primes. Prime factorization is related to that, and it is natural to read your statement as an absolute given that context and its relation to it. At least I did.
That's just simply not true. The 'real computation' happening on GPUs is either very heavy floating point work or graphics related work, almost everything else is running on the CPU.
GPUs are only useful if your problem is data parallel. The majority of compute intensive problems that are also data parallel at the same time have been shifted to GPUs and SIMD instructions already. A GPU isn't some pixie dust that makes everything faster.
LibreOffice has OpenCL acceleration for some spreadsheet operations. With the advent of NVMe storage, and the potential bandwidth it yields, I would expect to see database systems emerging that can GPGPU accelerate operations on tables to be way, way faster than what a CPU can handle.
> I would expect to see database systems emerging that can GPGPU accelerate operations on tables to be way, way faster than what a CPU can handle.
Why do you expect that? Many DBMS operations that are not I/O limited are memory limited, and a GPU does not help you there (on the contrary, you get another bottleneck in data transfers to the small GPU memory). What can help is better data organization, e.g. transposed (columnar) storage.
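To make the data-organization point concrete: whether the scan runs on a GPU or a CPU, columnar storage is what turns it into a contiguous, coalesced pass over one array per column. A hypothetical sketch of such a scan as a GPU kernel (not any real DBMS's code):

    // Count rows where `price` exceeds a threshold, touching only that column.
    // With row-oriented storage the same scan would drag every other column
    // across the bus too; the columnar layout is what keeps the reads coalesced.
    __global__ void filter_count(const float *price, int n, float threshold,
                                 unsigned int *count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && price[i] > threshold)
            atomicAdd(count, 1u);
    }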
People use spreadsheets for anything that you'd use a "normal" programming language for. When I worked at a bank, a real-time trading system was implemented as an Excel spreadsheet. There were third-party and internally-developed libraries to do the complicated stuff (multicast network protocols, complicated calculations that needed to be the same across all implementations, etc.) but the bulk of the business logic and UI were Excel. It's easy to modify, extend, and play with... which also makes it easy to break functionality and introduce subtle bugs. Though the same is true of any software development environment -- things you don't test break.
Don't think of spreadsheets as glorified tables. Think of them as the world's most-commonly used business logic and statistical programming language. A competitor to R, if you will.
Statistics, sure, that's definitely a good candidate for GPUs. I don't know much about R, but a quick google suggests you can run R code on a GPU, by working with certain object types, like matrices with GPU-accelerated operations.
That doesn't seem like it maps very well to a spreadsheet unless you have one big matrix per cell. I'm guessing (maybe incorrectly) that when people work with matrices in Excel, they're spread across a grid of cells. You probably could detect matrix-like operations and convert them to GPU batch jobs, but it seems very hard and I'm skeptical of how much you'd gain.
So I'm still wondering what kinds of typical Excel tasks are amenable to GPU acceleration in the first place. People use Excel to do a lot of surprising things, sure. But people use C++ and Python and JavaScript for a lot of things too, and you can't just blithely move those over to the GPU.
Maybe it's specific expensive operations, like "fit a curve to the data in this huge block of cells"?
I'm aware of big spreadsheets, but from what I've seen, it tends to be complex and very ad-hoc calculations that (I imagine) don't lend themselves very well to GPUs.
Making very complex tasks run well on a GPU is hard, whereas CPUs are great for dealing with that stuff.
If you have something like a 100,000 row spreadsheet where every row is doing exactly the same calculation on different input data, sure, that starts to make sense. If people are really doing that in Excel, I'm surprised! (but maybe I shouldn't be)
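That case does map almost one-to-one onto a GPU kernel: one thread per row, all applying the same formula. A hypothetical sketch of filling a 100,000-row column with something like =A2*B2+C2 (the columns and formula are made up for illustration):

    // One thread per spreadsheet row, all evaluating the same formula,
    // i.e. filling D2:D100001 with =A2*B2+C2.
    __global__ void fill_column_d(const float *a, const float *b, const float *c,
                                  float *d, int rows) {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r < rows)
            d[r] = a[r] * b[r] + c[r];
    }

    // Launch: fill_column_d<<<(100000 + 255) / 256, 256>>>(a, b, c, d, 100000);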
Intel chips have a giant iGPU that, on certain models, occupies nearly half the die. It can currently decode 4K Netflix. The technology is equally applicable to GPUs.
Let's try a compromise: how about multithreaded programs? Or how about programs utilizing less control flow so that static processors (yes, like the infamous VLIW) can extract ILP?
The CPU's performance has started to matter less and less.