Open Hardware Pushes GPU Computing Envelope (nextplatform.com)
43 points by rbanffy on March 21, 2017 | 9 comments


I'm wondering if an application like Deep Learning would require a fast interconnect network between GPUs. Without this requirement, scaling up would be much simpler, I suppose.


I think this is an area where Intel is sort of mucking things up for everyone.

Because of licensing issues, we'll never see a modern version of the nForce/XBox design where the CPU and GPU share relatively fast RAM via a common MMU from Intel/NVidia.

The best we can hope for in future shared-RAM designs is Zen/Radeon (or a new HyperTransport), Intel/Knights Landing, or an ARM/NVidia solution.

But I'm surprised NVidia doesn't make something like a Tegra on steroids for this application. Basically an ARM running its own Linux, a 10GigE ethernet port, and a Titan/1080Ti, all in a single blade/PCI-E card. I guess the market demand isn't there yet.

edit: Looks like the Drive PX2 is more or less what I'm talking about but meant for cars, so the market demand is there: http://wccftech.com/nvidia-pascal-gpu-drive-px-2/


https://www.nextplatform.com/2017/02/28/amd-researchers-eye-...

AMD Research, the research division of the chip maker, recently took a look at using accelerated processing units (APUs, AMD’s name for processors with integrated CPUs and GPUs) combined with multiple memory technologies, advanced power-management techniques, and an architecture leveraging what they call “chiplets” to create a compute building block called the Exascale Node Architecture (ENA) that would form the foundation for a high performing and highly efficient exascale-capable system.


> I'm wondering if an application like Deep Learning would require a fast interconnect network between GPUs.

NVIDIA® NVLink™ (or RDMA)


Actually, NVLink isn't fast enough in some cases either, and it only really helps with GPU-to-GPU communication. The best way is to not parallelize at all, since parallelizing SGD requires all sorts of trickery, such as ring allreduce, that can quickly go unstable.

On the matter of shoveling data from the CPU to the deep learning processor (in this case it's a GPU, but in other cases it could be a dedicated deep learning chip), there are other, (much) better ways to do that.
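
For illustration, here is a minimal single-process NumPy simulation of the ring allreduce pattern mentioned above; the four workers and toy gradients are made-up placeholders, and real implementations (e.g. NCCL) run the same reduce-scatter/allgather pattern across GPUs and overlap it with compute.

    import numpy as np

    def ring_allreduce(worker_grads):
        """Simulate ring allreduce over equal-shaped 1-D gradients, one per worker."""
        n = len(worker_grads)
        # Each worker splits its gradient into n chunks.
        chunks = [np.array_split(g.astype(float), n) for g in worker_grads]

        # Reduce-scatter: n-1 steps. Each worker passes one chunk to its right-hand
        # neighbour, which adds it to its own copy of that chunk. Afterwards
        # worker i holds the fully summed chunk (i+1) mod n.
        for step in range(n - 1):
            sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                     for i in range(n)]
            for i, c, payload in sends:
                chunks[(i + 1) % n][c] += payload

        # Allgather: n-1 more steps circulate the summed chunks so every worker
        # ends up with the complete reduced gradient.
        for step in range(n - 1):
            sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                     for i in range(n)]
            for i, c, payload in sends:
                chunks[(i + 1) % n][c] = payload

        return [np.concatenate(c) for c in chunks]

    # Four workers with toy gradients; all end up with the same element-wise sum.
    grads = [np.arange(8.0) + i for i in range(4)]
    reduced = ring_allreduce(grads)
    assert all(np.allclose(r, reduced[0]) for r in reduced)

Fast links like NVLink (or RDMA between nodes) are what make each of those neighbour-to-neighbour transfers cheap; the algorithm is bandwidth-efficient, but it still adds a synchronization point to every SGD step.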


> all sorts of trickery such as ring all reduce that can quickly go unstable.

What do you mean by that?

> there are yet other ways, (much) better ways to do that.

Don't leave us hanging, such as?


1. SGD doesn't always play nice with parallelization since it's technically supposed to be a serial operation. There is a "bag of tricks" that can be used to parallelize SGD, such as Downpour, Hogwild, and most recently Ring Allreduce. Unfortunately, these can all go unstable without warning, and debugging them is a nightmare (and sometimes impossible, since it's a black box). You also have to move a lot of data between nodes, which causes problems. The best way to deal with this is to not parallelize at all, which means you need a ridiculously fast single processor, which is what my startup is aiming to do.

The other ways to do it involve a special (proprietary, even though I dislike that word) interconnect scheme. If you're interested, you can contact me.
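
For illustration, here is a toy Hogwild-style run, i.e. lock-free asynchronous SGD on shared parameters, one of the tricks listed above; the least-squares problem, learning rate, and worker count are invented for the example, and the deliberately unsynchronized updates are precisely what makes such schemes hard to reason about when they go wrong.

    import numpy as np
    from multiprocessing import Array, Process

    def worker(shared_w, X, y, lr, steps, seed):
        rng = np.random.default_rng(seed)
        w = np.frombuffer(shared_w.get_obj())    # writable view of the shared parameters
        for _ in range(steps):
            i = rng.integers(len(y))
            grad = (X[i] @ w - y[i]) * X[i]      # gradient of 0.5 * (x_i . w - y_i)^2
            w -= lr * grad                       # lock-free update: races are tolerated

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 10))          # toy noise-free regression data
        true_w = rng.normal(size=10)
        y = X @ true_w
        shared_w = Array('d', 10)                # shared, zero-initialised parameters
        procs = [Process(target=worker, args=(shared_w, X, y, 0.01, 5000, s))
                 for s in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # The workers never take the Array's lock (that is the Hogwild part);
        # the recovered weights should still be close to true_w.
        print(np.abs(np.frombuffer(shared_w.get_obj()) - true_w).max())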


> SGD doesn't always play nice with parallelization since it's technically supposed to be a serial operation.

Ok, that makes sense.

> There is a "bag of tricks" that can be used to parallelize SGD, such as Downpour, Hogwild, and most recently Ring Allreduce. Unfortunately, these can all go unstable without warning, and debugging them is a nightmare (and sometimes impossible, since it's a black box).

Because there will be a lot of synchronization and dependencies between the nodes and any deviation from the path will result in cumulative errors?

> You also have to move a lot of data between nodes, which causes problems.

Because of latency or throughput? (at a guess: throughput)

> The best way to deal with this is to not parallelize at all, which means you need to have a ridiculously fast single processor, which is what my startup is aiming to do.

But if you solve that, you have essentially just moved the bar for where parallelization makes sense, for other problems too. After all, parallelization is just an end-run around Moore's law for special use cases. Are you saying that you have found such an 'end run' for a special use case on one processor, or even for the more general case?

> The other ways to do it involve a special (proprietary even though I dislike that word) interconnect scheme.

Something along the lines of transputer links?

> If you're interested you can contact me.

Yes, I'm interested, but I'm not sure I know enough about this stuff to be able to follow you for very long; it is more of a layman-level interest than anything at that level.

I could not find any contact info for you, mine is jacques@mattheij.com

Thank you for the explanation.


Right, so parallelization is usually for special cases, and you're absolutely right that some people will still want parallelization. What I meant is that most labs will probably be fine with the speed we offer. But there are some cases in which you will still want to parallelize, primarily for AI supercomputers. For applications like supercomputing, you need to move to something that does a lot of work per iteration and makes much more progress per iteration. The traditional way to do this is with big batch sizes, but those can lead to "sharp" local minima (I'm aware of the recent papers purporting to achieve good generalization with large mini-batch sizes, and another achieving good generalization with sharp minima, but my own experiments and their reported ones leave me unconvinced), which don't lead to good accuracy.

As a result, we've had to develop our own way to do that, separate from the interconnect scheme. As for a bit more detail on the interconnect scheme for HPC/supercomputing (please note that it's DL-specific, although there are some general-purpose aspects), I'd naturally be pretty reluctant ;), but I'd be willing to go into more detail in private.

Tl;dr: What I meant is that it's good enough for most labs, but we can still parallelize better for HPC/supercomputing applications; we just don't recommend it for most users.
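
For context on the large-minibatch papers referenced above (not on the poster's own, undisclosed approach): the usual recipe there is the linear learning-rate scaling rule plus a warmup period. A minimal sketch with made-up baseline numbers:

    def scaled_lr(step, base_lr=0.1, base_batch=256, batch=8192, warmup_steps=500):
        """Linear scaling rule: lr proportional to batch size, ramped up during warmup."""
        target = base_lr * batch / base_batch          # 0.1 * 8192 / 256 = 3.2
        if step < warmup_steps:
            return target * (step + 1) / warmup_steps  # gradual warmup avoids early divergence
        return target

    print(scaled_lr(0), scaled_lr(499), scaled_lr(1000))  # 0.0064, 3.2, 3.2

Whether that recipe really avoids the "sharp minima" problem at extreme batch sizes is exactly the point being doubted above.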



