Funny, just the other day at work I was trying to explain to someone what an FPGA was. I used the 7400 TTL example and said: imagine if you had a crate full of TTL chips and you could put them all onto a single die, arrange the most common functions into little modules called LUTs, and then be able to cross-connect any input to any output using a memory-type array. The guy I was explaining it to was familiar with TTL from school and immediately the light went on.
Is there a good idiot's guide to understanding FPGAs and how to develop for them? I have a CS background (so I've read both Hennessy & Patterson books) but don't know FPGAs, nor do I have an EE background.
Reading about FPGAs always confuses me, from the perspective of someone who thinks like a normal SW developer. Like, where does the FPGA begin execution, etc...
FPGAs are synchronous digital logic. That may sound complicated but it's actually conceptually quite simple. If I had a white board I could show you a diagram in 10 seconds that would probably clear everything up. But instead here's my attempt at a description.
1 - digital logic: from your CS background you're familiar with AND, OR, NOT, etc... Imagine having a bunch of these simple functions and being able to combine them and reorganize them to meet your needs. These functions are "continuous" in that they always update their output when their inputs change.
2 - synchronous: this essentially means "registers" that store values (flip flops) are controlled by a global clock signal. When the clock "triggers" (usually on the rising or falling edge) new values are loaded into the registers.
So with those definitions in mind, an FPGA, or any other synchronous digital system, can be thought of as: inputs feeding registers, feeding a cloud of logic, feeding registers, etc... for as many stages as is needed (or possible) until you get to the outputs.
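To make that a bit more concrete, here's a minimal Verilog sketch of that register -> logic cloud -> register pattern (the module and signal names are just made up for illustration):

    // Inputs feed registers, registers feed a combinational "cloud",
    // which feeds another register stage.
    module pipeline_stage (
        input  wire       clk,
        input  wire [7:0] a,
        input  wire [7:0] b,
        output reg  [7:0] sum
    );
        reg [7:0] a_q, b_q;

        // Input registers: capture a and b on the rising clock edge.
        always @(posedge clk) begin
            a_q <= a;
            b_q <= b;
        end

        // Combinational cloud: continuously updates when a_q or b_q changes.
        wire [7:0] sum_d = a_q + b_q;

        // Output register: capture the result on the next rising edge.
        always @(posedge clk) begin
            sum <= sum_d;
        end
    endmodule

On every clock edge, data moves one stage forward; the clock period just has to be long enough for the logic in between to settle.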
Starting from a functional programming or even Prolog background may give you some useful insights. You also absolutely have to learn state machines.
The first thing to realise is that FPGAs do not have linear execution. The design "begins" everywhere at once. You define a set of logic, and all of it executes concurrently. Between logic blocks are state elements (flip-flops) which capture the state at every tick of the clock. If you want to impose some sort of temporal order on things, you have to do it yourself - with a state machine or a counter. You also need to define a "reset" state.
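As a rough illustration, here's what a tiny state machine with an explicit reset state might look like in Verilog (the names and the LED behaviour are invented for the example):

    module blinker (
        input  wire clk,
        input  wire rst,   // the explicit "reset" state
        output reg  led
    );
        localparam IDLE = 2'd0, ON = 2'd1, OFF = 2'd2;
        reg [1:0] state;

        // All of this logic exists at once; the *sequence* only comes from
        // the state register advancing on each clock edge.
        always @(posedge clk) begin
            if (rst) begin
                state <= IDLE;        // known starting point
                led   <= 1'b0;
            end else begin
                case (state)
                    IDLE: state <= ON;
                    ON:   begin led <= 1'b1; state <= OFF; end
                    OFF:  begin led <= 1'b0; state <= ON;  end
                    default: state <= IDLE;
                endcase
            end
        end
    endmodule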
Normally you would work with a simulator (Modelsim is the usual choice for students), which lets you run your code or a subset of it on your PC with a "testbench".
One of the frustrating things for newbies is that the same language - Verilog or VHDL - has some features that are "synthesisable" and some that only work in the simulator, and it's not at all obvious which is which!
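For example, a testbench for the hypothetical blinker module sketched above might look like this; the initial block, the # delays and $display are simulation-only, while the design under test is the synthesisable part:

    `timescale 1ns/1ps
    module blinker_tb;
        reg  clk = 0;
        reg  rst = 1;
        wire led;

        // The design under test (the synthesisable part).
        blinker dut (.clk(clk), .rst(rst), .led(led));

        // Simulation-only: generate a clock with explicit time delays.
        always #5 clk = ~clk;

        initial begin
            #20 rst = 0;                  // release reset
            repeat (10) begin
                @(posedge clk);
                $display("t=%0t led=%b", $time, led);
            end
            $finish;
        end
    endmodule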
When an FPGA powers up and self-configures, it's in an unknown state. If the design uses synchronous elements, then reset logic must be architected in by the designer; in contrast, purely combinatorial circuits would not require reset logic since their outputs are not dependent on a clock edge.
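A small, hypothetical sketch of that contrast (signal names invented): the flip-flop needs an explicit reset branch to reach a known state, while the purely combinatorial assignment doesn't, because its output just follows its inputs:

    module reset_contrast (
        input  wire clk, rst,
        input  wire a, b,
        output reg  q,
        output wire y
    );
        // Synchronous element: without the reset branch, q powers up unknown.
        always @(posedge clk) begin
            if (rst) q <= 1'b0;
            else     q <= a;
        end

        // Purely combinatorial: no clock, no reset required.
        assign y = a & b;
    endmodule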
I think gricardo99's example is best, but I'll give one too just to see if I can cover it differently.
FPGA stands for field programmable gate array, where "field" here actually means "in the field," not some electrical term.
And it really is an array of gates, where a gate is a very simple function taking from one to several single-bit inputs and providing a single-bit output. Such as AND, OR, NOT. Actually, as sort of a quirk of the transistor-level design, it takes fewer transistors to build a NOT-AND gate (called NAND) and a NOT-OR gate (called NOR), so those are usually the primitives.
The output of these gates is updated continuously, with a small delay as the electricity flows through the gate. Maybe 100 picoseconds of delay or so. It varies by technology, but it's non-zero. It's called the propagation delay.
Gates are often understood by looking at truth-tables and waveforms.
So the truth table for an AND gate with inputs a,b and output z would be:
    a b | z
    ----+--
    0 0 | 0
    0 1 | 0
    1 0 | 0
    1 1 | 1
And a waveform could look like this:
           +---------------------+
  a        |                     |
  ---------+                     +------

                +-------------+
  b             |             |
  --------------+             +---------
  ======================================
                  +-------------+
  z               |             |
  ----------------+             +-------
The X axis is time, and each signal can be low or high (0 or 1).
In this diagram, a goes high first, but it's only after both inputs are high that the output goes high. I put a small propagation delay in the output.
If one of the inputs went low, a very brief time later the output would drop as well.
By cascading gates such that the outputs of one feed to the inputs of another, arbitrarily complex logic can be built.
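For instance, here's a hypothetical gate-level Verilog sketch where the outputs of a few primitive gates feed another gate, building a slightly more complex function (a 2-to-1 multiplexer):

    module mux2 (
        input  wire a, b, sel,
        output wire y
    );
        wire sel_n, a_path, b_path;

        not g0 (sel_n,  sel);          // invert the select line
        and g1 (a_path, a, sel_n);     // pass a when sel is 0
        and g2 (b_path, b, sel);       // pass b when sel is 1
        or  g3 (y,      a_path, b_path);
    endmodule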
A latch is one such complex gate. It has two inputs, one being the control, the other being data. Each a single bit. When the control line is high, the input will propagate to the output. When the control line is low, the output "freezes" at the last value it had when the control was high.
Another way to describe a latch would be as being "transparent" allowing the input to flow to the output unchanged (other than delayed) when the control line is high. And then maintaining the last value once the control line drops.
You can also design a latch so that it works with the control line inverted, so it's transparent when the control line is low, and freezes when it goes high. You could describe one as positive and one as negative. I forget what the real terms are, but that should work.
A flip flop is a storage element made out of two latches. It is said to capture data on the clock edge, but this is sort of emulated. What happens is the first latch is transparent while the clock is low (negative), and the second latch is transparent while the clock is high. The effect is that all the while the clock is low, the first latch is letting the input through. When the clock transitions high, it freezes, and then the second latch becomes transparent, letting the frozen value of the first one through.
This gives the appearance of capturing data on the rising edge. When the clock goes low again, the output is frozen on the second stage.
I don't know if flops are really made this way anymore, but that's a sort of naive way to make one and it works.
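Just to illustrate that naive two-latch construction (not claiming real flops are built this way), a behavioural Verilog sketch might be:

    module dff_from_latches (
        input  wire clk, d,
        output wire q
    );
        reg stage1, stage2;

        // First latch: transparent while the clock is low.
        always @(*) if (!clk) stage1 = d;

        // Second latch: transparent while the clock is high.
        always @(*) if (clk) stage2 = stage1;

        // Net effect: q appears to capture d on the rising edge of clk.
        assign q = stage2;
    endmodule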
A group of flops designed to work together, loading in parallel, is called a "register."
Logic gates that do not use a clock directly (like AND, OR, NOT) are used to build up a cloud of combinational logic. Named because it is a combination of gates.
Logic gates that do use a clock directly (latches and flip flops) are called sequential. This name comes from how they are used... a series or cascade of flip flops will pass data from input to output on the rising (or falling, if so designed) edge of the clock, one flip flop at a time, in sequence.
This works because the clock signal path is carefully designed to get to all flip flops at nearly the same time. However the data path from the first flip flop to the second flip flop has normal propagation delay. So all the flops see the same rising edge at the same time, but the data hasn't made it through the first flop and all the way to the second flop when the second flop sees that edge.
So the data sort of hop-scotches along through the flops. On each clock edge, it jumps to the next flop.
Now if you take a set of parallel flops making up a register, and put a combinational cloud of logic at the output (taking in the output of the register) and feed back the output of the cloud of combo logic to the input of the register (forming a circle), you can make all sorts of interesting things.
If you made the cloud of logic add 1 to the input, you have a counter that increments on every clock edge. You have to initialize the flops, and while I didn't talk about it, they are designed with an extra input to set them to zero. But once initialized, it will count. And if you don't design the logic to handle it, the count will overflow and start over eventually.
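Here's a rough Verilog sketch of that counter: a register whose combinational cloud just adds 1 and feeds back, plus the extra "clear" input to initialize it (names are made up):

    module counter (
        input  wire       clk,
        input  wire       clear,    // the extra input that sets the flops to zero
        output reg  [7:0] count
    );
        // The combinational "+1" cloud, fed back to the register's input.
        wire [7:0] next = count + 8'd1;

        always @(posedge clk) begin
            if (clear) count <= 8'd0;   // initialize
            else       count <= next;   // overflows from 255 back to 0
        end
    endmodule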
I don't know if that was understandable or not, but that's pretty much the crux of all logic design. Well, really taking the "counter" idea a few steps further and you are building state machines. They are the crux really. They don't seem to show up in software nearly as often as they do in digital logic. I'd say the state machine in logic design is about as common as a for-loop (or any other loop).
-----
edit:
-----
Forgot about "programming" an FPGA. So it's an array of gates, and depending on the type of FPGA, there are either fuses between all the gates which get blown (or not) when programmed, setting up the connections.
Or there is extra "hidden" logic that emulates fuses and uses memory such as flash to maintain the connection state.
Some use SRAM instead of flash, and when they lose power they have to load the connection configuration for your application from external flash.
This configuration of connections is created by a tool called a Synthesizer (well, nobody calls them that, they just say the tool does Synthesis), and when people talk about "programming" an FPGA, they mean feeding instructions to a synthesizer, and usually aren't referring to the loading of the configuration bitfile into the FPGA.
Synthesis tools take VHDL, Verilog, SystemVerilog and SystemC as input, and these are "description" languages. At least they started that way. They were more akin to HTML than JavaScript, but as synthesis tools have evolved, so have the "description" languages, and it really is much closer to programming now than ever.
Some tools will take in C or other high level languages (don't laugh, it's high level comparatively), and what they do is called High Level Synthesis.
The main problem is that C doesn't have the semantics to describe things like clock edges.
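For comparison, in Verilog the clock edge is a first-class concept; a trivial flip-flop (hypothetical snippet) looks like this, and there's no direct way to say "on the rising edge of clk" in plain C:

    module dff (
        input  wire clk, d,
        output reg  q
    );
        // q only updates on the rising edge of clk.
        always @(posedge clk)
            q <= d;
    endmodule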
So the gurus came up with SystemC, which is a C++ library that includes a simulation engine and classes for the semantics of describing logic. It's not nearly as popular as the other three, but it's probably a good starting point for someone with a CS background.
One thing I've wondered is to what extent does discrete logic "scale proportionally" to integrated logic.
I mean if you make a big machine out of these things, are the trade-offs you make regarding e.g. signal fanout and propagation delay similar to those made in an IC that did the same computation?
One of the biggest issues is complexity. Sure, you can make some fairly complex things within an FPGA fabric, like a CPU or GPU. However, you'll see a slowdown in two areas. The first is routing: whereas an IC would just have a wire between two components, FPGAs have to route further and through different interchanges. Second, components can only be built up from the basic elements of the FPGA, typically a LUT and FFs; the more complex the component, the more small parts it takes. Another aspect is memory; block RAM works decently but it's an explicit design choice, and DRAM via DMA is a pain too.
Originally CPLDs used arrays of PLDs/PLAs which use a wide network of programmable AND-OR gates to implement logic equations with relatively sparse routing resources between logic elements. FPGAs will use a more fine grained element for implementing logic (SRAM LUTs or small clusters of muxes) with denser routing resources.
That distinction is a bit blurred nowadays because Altera has "CPLD" products that are using conventional FPGA architecture equivalent to what was state of the art for FPGAs 25 years ago.
It is mainly an architecture difference. An FPGA is typically made up of many logic blocks that contain a programmable look up table (LUT) connected to a flip flop. There are also various multiplexers (muxes) that route signals around the flip-flop or to other neighboring logic blocks. In this way, you can chain many of these logic blocks together to create your desired circuit.
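As a rough behavioural sketch (not any vendor's actual architecture), one such logic block can be modelled as a 4-input LUT, i.e. a 16-entry truth table indexed by the inputs, feeding a flip-flop, with a mux to optionally bypass the flop:

    module logic_block (
        input  wire        clk,
        input  wire [3:0]  in,
        input  wire [15:0] lut_init,  // the "program": one truth-table bit per input pattern
        input  wire        use_ff,    // route through the flip-flop or around it
        output wire        out
    );
        // The LUT is just a lookup: select the configured bit for this input pattern.
        wire lut_out = lut_init[in];

        reg ff;
        always @(posedge clk)
            ff <= lut_out;

        // Bypass mux: registered or combinational output.
        assign out = use_ff ? ff : lut_out;
    endmodule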
A CPLD typically has a network of AND or OR gates leading to the pins of the device. The outputs of these gates then feed into the complementary gate (OR for AND, and AND for OR) and then most likely into flip flops. You program the device by selectively connecting the pins through a switching matrix to the various elements of the first logic array, and then another switching matrix to the various elements of the second logic array. There is also the possibility of a third switching matrix for feedback through the system.
As the others have pointed out the technical details, the high-level difference is that FPGAs are typically volatile and hence lose their programming once there is a power loss. CPLDs have non-volatile memory that allows for hard resetting, and so are used in environments that need fault tolerance (factories, airplanes, etc.).
Bitcoin miners are already using custom ASICs on expensive process nodes. Anyone using discrete logic would be running at a snail's pace compared to them.
I don't know specifics on Ethereum. Basically, it has to be within the hardware constraints of FPGAs where CPUs, GPUs, and ASICs don't convey an advantage.
Do you know how bitcoin miners use FPGAs? They're used for communication somehow. The Antminer S7 uses an SoC FPGA which allegedly was being used with ASICBoost.
You can implement many forms of I/O on FPGAs that let you bridge things that otherwise can't communicate for whatever reason. The FPGA can also be used to accelerate I/O vs a software implementation. For example, many network devices offload TCP/IP from the host. Finally, the connecting systems might have the necessary I/O built in, but the FPGA accelerates some computational part of it. You see that in compression, TLS, or IPsec engines.
I don't know what Bitcoin people are doing. The above are common use cases, though.
That's (sort of) asked and answered in the comments :)
"it would be theoretically possible. But if you calculate that... it isn't. For example: a DE0-Nano board ($79) mines at 12 MH/s. It has 22320 logic elements. Assuming the slices are equivalent, you would have to have 22320 of these boards. Each containing 12 7400 chips."