What do you mean by major apps? Do you have some reason to believe that apps will suddenly become embarrassingly parallel? Major apps often don't even take full advantage of SIMD instructions on the CPU. As soon as you need a context switch, branching, or fast memory access, your GPU is crap.
I used to believe that, until I started to get into cryptocoin mining. There were algorithms that were specifically designed to be GPU-resistant, and they were all ported and saw significant gains. It was that experience that pushed me to learn how to program these devices.
The tooling isn't good enough yet, but there is no question in my mind that practically everything will run on a GPU-like processor in the future. The speedups are just too tremendous.
Intel knows this, which is why they purposely limit PCIe bandwidth.
>I used to believe that, until I started to get into cryptocoin mining.
99% of the apps we use are totally unlike cryptocoin mining. And the style they are written in (and, more importantly, their function) is even more GPU-resistant.
>The tooling isn't good enough yet, but there is no question in my mind that practically everything will run on a GPU-like processor in the future. The speedups are just too tremendous.
Don't hold your breath. For one, nobody's porting the thousands of apps we already have and depend on every day to the GPU.
While true in the short term, it should be noted that Intel is moving more towards the GPU model with its many-core CPUs (more, slower cores) and very wide vector units. There is value in a large core capable of fast out-of-order execution for control purposes. But data processing can be done much faster with a GPU model.
You can even implement explicit speculative execution - simply use a warp for each path and choose at the end. It is very wasteful but can often come out ahead.
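A minimal sketch of the per-thread version of that idea in CUDA (the warp-per-path variant is the same trick at a coarser granularity); the kernel name and the toy formulas here are purely illustrative:

    __global__ void both_paths_select(const float *x, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        // Deliberately wasteful: evaluate both sides of the branch...
        float if_path   = sqrtf(fabsf(x[i])) * 2.0f;  // toy "taken" computation
        float else_path = x[i] * x[i] + 1.0f;         // toy "not taken" computation
        // ...and only select at the end, so the warp never diverges.
        out[i] = (x[i] > 0.0f) ? if_path : else_path;
    }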
No, Intel's approach is very different from GPUs, because Intel has a strong interest in not making porting to GPUs easy (also, wide vectors are far easier to do in a CPU than in a "GPU-like model").
Cryptocoin mining is embarrassingly parallel by its nature. You are trying lots of different inputs to a hash function, so you can always run an arbitrary number of them in parallel. There are various ways to reduce the GPU/FPGA/ASIC advantage, like requiring lots of RAM, but the task is still parallel if you have enough RAM. Something like a JavaScript JIT on the other hand is fundamentally hard to parallelize.
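To make the contrast concrete, the mining side is trivially data-parallel: every GPU thread just tries its own nonce against the same function. Something like this sketch (the hash is a toy stand-in, not a real mining hash):

    // Toy mixer standing in for SHA-256/scrypt/etc.; illustration only.
    __device__ unsigned int toy_hash(unsigned int nonce) {
        unsigned int h = nonce * 2654435761u;
        h ^= h >> 16;
        h *= 2246822519u;
        h ^= h >> 13;
        return h;
    }

    // Each thread tries a different nonce; no thread has to talk to any
    // other until one of them finds a winner (initialise *found to 0xFFFFFFFF).
    __global__ void search(unsigned int base, unsigned int target, unsigned int *found) {
        unsigned int nonce = base + blockIdx.x * blockDim.x + threadIdx.x;
        if (toy_hash(nonce) < target)
            atomicMin(found, nonce);
    }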
Primecoin is a good example. If you look at the implementation, it requires generating a tightly packed bitfield for the sieve and then randomly accessing it afterwards. Lots of synchronization is required so you don't overwrite previously set bits, and the memory accesses are random, so it's suboptimal for the GPU's memory subsystem.
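Roughly what that marking step looks like on a GPU, as a simplified sketch (not the actual miner code): bits get set with atomics so concurrent threads can't clobber each other's writes, and the scattered addresses are exactly what the memory subsystem dislikes.

    // Mark every multiple of `prime` (from 2*prime up) in a packed bitfield.
    __global__ void sieve_mark(unsigned int *bits, unsigned int nbits, unsigned int prime) {
        unsigned int tid    = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int stride = gridDim.x * blockDim.x;
        for (unsigned int m = (tid + 2) * prime; m < nbits; m += stride * prime) {
            // atomicOr stops threads from losing each other's previously set bits,
            // but the accesses are effectively random, hence the poor fit.
            atomicOr(&bits[m >> 5], 1u << (m & 31));
        }
    }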
It took under a year for a GPU miner to come out. Having optimized it for Intel in assembly, I was convinced it wasn't possible for a GPU to beat it - and yet it happened.
It turns out that even when used inefficiently, a thousand cores can brute-force their way through anything.
But you don't have thousands of cores. You have a medium number of cores (64 in the latest Vega 64 GPU) with a high number of ALUs (and threads) per core. When the GPU executes an instruction for a thread, it checks whether the other threads are executing the same instruction and then utilises all ALUs at once. This is great for machine learning and HPC, where you often just have a large matrix or array of the same datatype that you want to process, but most of the time this isn't the case.
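That lockstep execution is also why branchy code hurts: if threads in a warp disagree on a condition, the warp runs both sides one after the other with part of it masked off each time. A contrived CUDA sketch of the effect:

    __global__ void divergent(const int *x, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (x[i] & 1) {
            // Threads in the warp whose x[i] is odd run this first...
            out[i] = x[i] * 3 + 1;
        } else {
            // ...then the remaining threads run this, while the others sit idle.
            out[i] = x[i] / 2;
        }
    }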
> a thousand cores can brute-force their way through anything
This is so fundamentally wrong that your complete lack of understanding of the underlying math is obvious. Let us use 1024-bit RSA as the key to brute-force. If we use the entire universe as a computer, i.e. every single atom in the observable universe enumerates 1 possibility every millisecond, it would take ~6 * 10^211 years to go through them all. In comparison, the universe is less than ~14 * 10^9 years old.
And this is for 1024-bit; today we use 2048-bit or larger.
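For reference, the shape of that back-of-envelope calculation with the commonly quoted ~10^80 atoms (the exact exponent shifts with whatever atom count you assume, but the conclusion doesn't):

    2^1024                       ~ 1.8 * 10^308 candidate keys
    10^80 atoms * 10^3 guesses/s = 10^83 guesses/s
    1.8 * 10^308 / 10^83         ~ 1.8 * 10^225 s  ~ 6 * 10^217 years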
Except you're not brute-forcing through all 2^1024; you're factoring a number, which is much easier, and is why RSA-768 is broken and RSA-1024 is deprecated.
Do you know how to brute-force a prime factorization? Because I'm pretty sure from your comment that you don't. The calculation is based on enumerating all prime pairs of 512-bit length or less each (the lengths of the individual primes may be longer, but for napkin math it's a very good approximation).
That is brute-forcing. That faster and better methods exist is irrelevant to a brute-forcing discussion, but I do mention in the very next message in that thread that they exist.
XPM has nothing to do with finding the primes used in RSA; I'm not sure where that came from. Its PoW is finding Cunningham chains of smaller-sized primes. That said, I wouldn't be surprised if it could be adapted to find larger primes on a GPU significantly faster than a CPU.
It was simply a response to your statement about bruteforcing anything.
There are ways to factor a number that are much faster than brute-forcing, and yes, they work on a GPU. But that has nothing to do with brute-forcing; it's clever math.
The statement was stating an absolute. Algorithmic complexity matters more than the speed of the computing hardware. Obviously, more power means you can compute more, but not everything.
That's a very uncharitable reading of what I wrote. The topic of the conversation was GPU performance vs CPU performance. GPUs are less flexible, but the sheer quantity of execution units more than makes up for it.
But no, I suppose it's more likely I was really saying GPUs aren't bounded by the limits of the universe.
The context is Primecoin and the ability to find primes. Prime factorization is related to that, and it is natural to read your statement as an absolute given that context and its relation to it. At least I did.
That's just simply not true. The 'real computation' happening on GPUs is either very heavy floating point work or graphics related work, almost everything else is running on the CPU.
GPUs are only useful if your problem is data parallel. The majority of compute intensive problems that are also data parallel at the same time have been shifted to GPUs and SIMD instructions already. A GPU isn't some pixie dust that makes everything faster.
LibreOffice has OpenCL acceleration for some spreadsheet operations. With the advent of NVMe storage, and the potential bandwidth it yields, I would expect to see database systems emerging that can GPGPU accelerate operations on tables to be way, way faster than what a CPU can handle.
> I would expect to see database systems emerging that can GPGPU accelerate operations on tables to be way, way faster than what a CPU can handle.
Why do you expect that? Many DBMS operations that are not I/O limited are memory limited, and a GPU does not help you there (on the contrary, you get another bottleneck in data transfers to the small GPU memory). What can help is better data organization, e.g. transposed (columnar) storage.
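To make the data-organization point concrete: whether the scan runs on a GPU or a CPU, columnar storage is what turns it into a contiguous, coalesced pass over one array per column. A hypothetical sketch of such a scan as a GPU kernel (not any real DBMS's code):

    // Count rows where `price` exceeds a threshold, touching only that column.
    // With row-oriented storage the same scan would drag every other column
    // across the bus too; the columnar layout is what keeps the reads coalesced.
    __global__ void filter_count(const float *price, int n, float threshold,
                                 unsigned int *count) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && price[i] > threshold)
            atomicAdd(count, 1u);
    }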
People use spreadsheets for anything that you'd use a "normal" programming language for. When I worked at a bank, a real-time trading system was implemented as an Excel spreadsheet. There were third-party and internally-developed libraries to do the complicated stuff (multicast network protocols, complicated calculations that needed to be the same across all implementations, etc.) but the bulk of the business logic and UI were Excel. It's easy to modify, extend, and play with... which also makes it easy to break functionality and introduce subtle bugs. Though the same is true of any software development environment -- things you don't test break.
Don't think of spreadsheets as glorified tables. Think of them as the world's most-commonly used business logic and statistical programming language. A competitor to R, if you will.
Statistics, sure, that's definitely a good candidate for GPUs. I don't know much about R, but a quick google suggests you can run R code on a GPU, by working with certain object types, like matrices with GPU-accelerated operations.
That doesn't seem like it maps very well to a spreadsheet unless you have one big matrix per cell. I'm guessing (maybe incorrectly) that when people work with matrices in Excel, they're spread across a grid of cells. You probably could detect matrix-like operations and convert them to GPU batch jobs, but it seems very hard and I'm skeptical of how much you'd gain.
So I'm still wondering what kinds of typical Excel tasks are amenable to GPU acceleration in the first place. People use Excel to do a lot of surprising things, sure. But people use C++ and Python and JavaScript for a lot of things too, and you can't just blithely move those over to the GPU.
Maybe it's specific expensive operations, like "fit a curve to the data in this huge block of cells"?
I'm aware of big spreadsheets, but from what I've seen, it tends to be complex and very ad-hoc calculations that (I imagine) don't lend themselves very well to GPUs.
Making very complex tasks run well on a GPU is hard, whereas CPUs are great for dealing with that stuff.
If you have something like a 100,000 row spreadsheet where every row is doing exactly the same calculation on different input data, sure, that starts to make sense. If people are really doing that in Excel, I'm surprised! (but maybe I shouldn't be)
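That case does map almost one-to-one onto a GPU kernel: one thread per row, all applying the same formula. A hypothetical sketch of filling a 100,000-row column with something like =A2*B2+C2 (the columns and formula are made up for illustration):

    // One thread per spreadsheet row, all evaluating the same formula,
    // i.e. filling D2:D100001 with =A2*B2+C2.
    __global__ void fill_column_d(const float *a, const float *b, const float *c,
                                  float *d, int rows) {
        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r < rows)
            d[r] = a[r] * b[r] + c[r];
    }

    // Launch: fill_column_d<<<(100000 + 255) / 256, 256>>>(a, b, c, d, 100000);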
Intel chips have a giant iGPU that, on certain models, occupies nearly half the die. It can currently decode 4K Netflix. The technology is equally applicable to GPUs.
Let's try a compromise: how about multithreaded programs? Or how about programs utilizing less control flow so that static processors (yes, like the infamous VLIW) can extract ILP?
The CPU's performance has started to matter less and less.