I feel I can talk about this! My master's thesis was the computation of a spiking neural network (SNN) using FPGA + CPU. The original SNN code was in C++; my thesis reimplemented it in OpenCL, using Altera's OpenCL-to-FPGA toolchain.
Essentially, I took the innermost loop (which computed whether a neuron would spike or not) and implemented it as a kernel in OpenCL.
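As a rough sketch of what that innermost loop can look like, assuming a simple leaky integrate-and-fire model (the thesis's actual neuron model isn't given here, so the `Lif` struct and parameters below are illustrative), the per-neuron body is the part that maps naturally onto an OpenCL work-item:

```cpp
#include <cstddef>
#include <vector>

// Illustrative leaky integrate-and-fire neuron: integrates input,
// leaks toward rest, and spikes when the membrane potential
// crosses a threshold.
struct Lif {
    float v = 0.0f;  // membrane potential
};

// One simulation step over all neurons. The loop body is the
// per-neuron work that would become one OpenCL work-item.
std::vector<bool> step(std::vector<Lif>& neurons,
                       const std::vector<float>& input,
                       float leak = 0.9f, float threshold = 1.0f) {
    std::vector<bool> spiked(neurons.size(), false);
    for (std::size_t i = 0; i < neurons.size(); ++i) {
        neurons[i].v = neurons[i].v * leak + input[i];
        if (neurons[i].v >= threshold) {
            spiked[i] = true;
            neurons[i].v = 0.0f;  // reset after a spike
        }
    }
    return spiked;
}
```

Because each iteration touches only its own neuron's state, the loop parallelizes with no synchronization between work-items, which is what makes it a good kernel candidate.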
Step 1 was showing the speedup from single-threaded C++ to the OpenCL kernel: 6-10x on an i7-2600K running on all logical cores.
Step 2 was the FPGA implementation. This meant pre-shipping data to the FPGA while the CPU calculated other things, starting computation on the FPGA, and receiving the results back on the CPU. Performance was 75x compared to the single-threaded C++ code.
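The overlap in step 2 can be sketched as classic double buffering. This is only an analogy in plain host threads: the hypothetical `prepare_batch` and `compute_batch` helpers stand in for the PCI-E transfer and the FPGA kernel, whereas the real implementation would use OpenCL command queues and events rather than `std::thread`:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for the FPGA kernel: consume one batch of data.
float compute_batch(const std::vector<float>& batch) {
    float sum = 0.0f;
    for (float x : batch) sum += x;
    return sum;
}

// Stand-in for host-side preparation (the "pre-shipping" step).
std::vector<float> prepare_batch(int i, std::size_t n) {
    return std::vector<float>(n, static_cast<float>(i));
}

// Double buffering: while batch i is being computed, batch i+1 is
// prepared concurrently, so transfer and compute overlap.
float pipeline(int batches, std::size_t n) {
    float total = 0.0f;
    std::vector<float> current = prepare_batch(0, n);
    for (int i = 0; i < batches; ++i) {
        std::vector<float> next;
        std::thread prep([&] {
            if (i + 1 < batches) next = prepare_batch(i + 1, n);
        });
        total += compute_batch(current);  // overlaps with prep
        prep.join();
        current = std::move(next);
    }
    return total;
}
```

The same pattern is why the PCI-E bottleneck mentioned below matters so much: overlap hides transfer latency, but it cannot manufacture bandwidth.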
Important things I didn't expect:
The bottleneck was memory transfer bandwidth across PCI-E.
Power consumption was lower on the FPGA than on the CPU.
Development time was significantly reduced. Altering the design is simple when going from OpenCL to FPGA, compared to Verilog to FPGA.
In the late nineties I had the rare opportunity to work with a very exotic “FPGA hypercomputer” (yes, the marketing makes me cringe) that basically consisted of an array of FPGAs that instantiated logic at the hardware level and dynamically reconfigured it when required. It was a prototype built by the now-defunct Starbridge Systems and designed, amongst others, by Faggin of Intel microprocessor fame.
This kind of book is always my favorite: by covering the whole field and devoting a chapter to each topic, it gives readers a broad and firm grasp of what's going on in that field.
These kinds of books make perfect introductions.
If I read this book, would it be practical to build an Erlang VM targeting GPUs? The Erlang GPU work I've seen provides access via NIFs, but as I understand it, those will continue to hit the PCI-E bottleneck. I'm speculating about the feasibility of Erlang putting its "processes" onto the GPU cores, with the data staying on the GPU until it needs to do network, disk, or other OS-mediated access.
GPUs are best under SIMD conditions: single instruction, multiple data. You're talking about running `eval` thousands of times. Each unit of execution would be running different code on different data, because each process is executing its own program (especially once you consider the different branches of a conditional statement), which is the opposite of SIMD.
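A minimal sketch of the contrast, using hypothetical helpers rather than real GPU code: the first loop is SIMD-friendly because every element sees the same instruction, while the second is roughly what thousands of independent processes look like, with each element choosing its own code path:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// SIMD-friendly: the same instruction applied to many data elements.
// This maps cleanly onto GPU lanes.
std::vector<float> scale_all(const std::vector<float>& xs, float k) {
    std::vector<float> out(xs.size());
    for (std::size_t i = 0; i < xs.size(); ++i)
        out[i] = xs[i] * k;  // same operation, different data
    return out;
}

// Process-per-element: each element runs its own code. On a GPU,
// lanes that take different paths serialize, losing most of the
// parallelism; this is the Erlang-processes-on-GPU case.
std::vector<float> run_each(const std::vector<std::function<float()>>& procs) {
    std::vector<float> out(procs.size());
    for (std::size_t i = 0; i < procs.size(); ++i)
        out[i] = procs[i]();  // different instructions per element
    return out;
}
```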
My only complaint is that the PDF is formatted for print and uses typographic ligatures. To me, HTML-first makes sense for anything intended for widespread digital sharing. Reflowable text and screen-reader support are minimum design accommodations for a large number of people with physical or hardware limitations.
Or to put it another way: I think that over the long run such features matter more than the license, because there are non-technical workarounds for the license that don't require duplicated effort.
A lot of interesting research can be done with FPGA+CPU in parallel computing.