The History, Status, and Future of FPGAs (acm.org)
170 points by skovorodkin on July 23, 2020 | 158 comments


As a bit of a counterpoint:

One of my prior projects involved working with a lot of ex-FPGA developers. This is obviously a rather biased group of people, but I heard a lot of feedback from them that was very negative about FPGAs.

One comment that's telling is that since the 90s, FPGAs were seen as the obvious "next big technology" for the HPC market... and then Nvidia came out and pushed CUDA hard, and now GPGPUs have cornered the market. FPGAs are still trying to make inroads (the article here mentions it), but the general sense I have is that success has not been forthcoming.

The issue with FPGAs is that you start with a clock rate in the 100s of MHz (the exact clock rate depends on how long the critical paths are), compared with a few GHz for GPUs and CPUs. Thus you need roughly a 5× win in work done per clock just to break even, and you probably need another 2× on top of that to motivate people to go through the pain of FPGA programming. Nvidia made GPGPU work by demonstrating performance gains meaningful enough to make the cost of rewriting code worth it; FPGAs have yet to do that.

Edit: It's worth noting that the programming model of FPGAs has consistently been cited as the thing holding FPGAs back for the past 20 years. The success of GPGPU, despite the need to move to a different programming model to achieve gains there, and the inability of the FPGA community to furnish the necessary magic programming model, suggest to me (and my FPGA-skeptic coworkers) that the programming model isn't the actual issue preventing FPGAs from succeeding, but that FPGAs have structural issues (e.g., low clock speeds) that prevent their utility in wider market classes.


GPUs work great for accelerating many applications, and it's true that this reduces interest in FPGAs. For applications that map well to GPUs, you're absolutely correct that the higher clock speeds (and greater effective logic area) make GPUs superior as accelerators.

However, some applications do not map well to GPUs. Particularly those applications with a great deal of bit-level parallelism can achieve enormous speedups with bespoke hardware. For those applications where it doesn't make sense to tape out an ASIC, FPGAs are beautiful--even if they only operate at a few hundred MHz.
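
To make the bit-level point concrete, here is a minimal sketch in Verilog (hypothetical module and signal names, not from any specific design). The bit reversal is literally free after synthesis (it's just wiring), and the population count unrolls into a small adder tree with a new result every clock; a CPU would spend several instructions per word on the same job.

    // Hypothetical sketch: bit-level parallelism that costs almost nothing
    module bit_parallel_example (
        input  wire        clk,
        input  wire [63:0] din,
        output reg  [63:0] reversed,
        output reg  [6:0]  ones_count
    );
        integer i;
        reg [63:0] rev_c;
        reg [6:0]  cnt_c;
        always @* begin
            cnt_c = 0;
            for (i = 0; i < 64; i = i + 1) begin
                rev_c[i] = din[63 - i];     // pure routing after synthesis
                cnt_c    = cnt_c + din[i];  // unrolled into an adder tree
            end
        end
        always @(posedge clk) begin
            reversed   <= rev_c;            // one result per clock cycle
            ones_count <= cnt_c;
        end
    endmodule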

I think the "programming model" is actually the biggest barrier to wider adoption. Your comment is suffused with what I believe is the source of this disagreement: The idea that one programs an FPGA. One designs hardware that is implemented on an FPGA. The difference may sound pedantic, but it really is not. There is a massively huge difference between software programming and hardware design, and hardware design is downright unnatural for software developers. They are completely different skill sets.

On top of that add all the headaches that come with implementing a physical device with physical constraints (the article complains about P&R times but this is far from the only burden) and it becomes clear that FPGAs are quite frankly a massive pain in the ass compared to software running on CPUs or GPUs.


Very much this.

(Also, in general, FPGA tools are just some of the lowest quality garbage out there... and that is saying something. They're that bad. This is a completely unnecessary speedbump.)

The rebuttal to your objection is always tools like "HLS" (High-Level Synthesis), or in English, "C to HDL". (FPGAs are 'programmed' in one of the two hardware description languages: VHDL (bad) or Verilog (worse, but manageable if you learn VHDL first).) These are not programming languages; they are hardware description languages. That means things like "everything in a block always executes in parallel". (Take that, Erlang?) In fact, everything on the chip always executes in parallel, all the time, no exceptions; you "just" select which output is valid. That's because this is how hardware works.
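
A minimal Verilog sketch of that mental model (hypothetical module, nothing vendor-specific): both "computations" below exist as physical logic at all times and both run on every clock edge; the final assignment is just a multiplexer choosing which result is considered valid.

    // Hypothetical sketch: everything runs in parallel, you select the output
    module always_parallel (
        input  wire       clk,
        input  wire       sel,
        input  wire [7:0] a,
        input  wire [7:0] b,
        output reg  [7:0] y
    );
        reg [7:0] sum;
        reg [7:0] diff;
        // Neither block is ever "called": the adder and the subtractor
        // are both present in the fabric and both update every cycle.
        always @(posedge clk) sum  <= a + b;
        always @(posedge clk) diff <= a - b;
        // The apparent "control flow" is really a mux on the outputs.
        always @(posedge clk) y <= sel ? sum : diff;
    endmodule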

This model maps very, very poorly to traditional programming languages. This makes FPGAs hard to learn for engineers and hard to target for HLS tools. The tools can give you decent enough output to meet low- to mid-performance needs, but if you need high performance -- and if not, why are you going through this masochism? -- you're going to need to write some HDL yourself, which is hard and makes you use the industry's worst tools.

Thus, FPGAs languish.


The biggest problem with HLS is that the HLS vendors still want to pretend it's "C++ / OpenCL / whatever to gates". What you get is a pretense that there is no such concept as a clock, even though you know it is always there and you care about it, and the language you are really writing consists mostly of the crazy pragmas you have to sprinkle over everything. It ends up failing on both counts: it isn't C++ to gates, and it is an exceedingly difficult HDL to use because it always tries to hide the clock from you, even when you really need to do something with it (e.g., a handshake).

A weak spot of high-end commercial HLS tools (Catapult, Stratus) is interfacing with the rest of the hardware world, and how the clock is handled: either you handle it yourself (SystemC) or it is handled only vaguely (Catapult's ac_channel). Getting HLS to deal with pipeline scheduling is great, but sometimes you want to break through and do something with the clock. Want to write a memory DMA in HLS? Talk AXI? Build a NoC in HLS? Build even something like a CPU in HLS? Interface with "legacy" RTL blocks, whether combinational or straight pipeline or with ready/valid interfaces or whatever? These things are just barely feasible at present with these commercial HLS tools, but very, very hard (I've tried).
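
For readers without an RTL background, this is the kind of cycle-level control being talked about. A minimal ready/valid producer in Verilog (hypothetical names, not tied to any particular HLS tool): every statement is about what happens on a specific clock edge, which is exactly what untimed C++ tries to abstract away, and exactly what you need to pin down when talking to AXI-style interfaces.

    // Hypothetical sketch of a ready/valid handshake producer
    module rv_producer (
        input  wire        clk,
        input  wire        rst,
        input  wire        out_ready,  // consumer can accept data this cycle
        output reg         out_valid,  // we are presenting data this cycle
        output reg  [31:0] out_data
    );
        always @(posedge clk) begin
            if (rst) begin
                out_valid <= 1'b0;
                out_data  <= 32'd0;
            end else if (!out_valid || out_ready) begin
                // A transfer happens on any cycle where valid and ready are
                // both high; data may only change across that boundary.
                out_valid <= 1'b1;
                out_data  <= out_data + 32'd1;
            end
        end
    endmodule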

If they want to stick with it, I think C++11 could provide a superior type-safe metaprogramming facility for building hardware, compared to the extremely primitive metaprogramming and lack of type safety in SystemVerilog, or to generators such as Chisel or the hand-written Perl/Python/Tcl/whatever ones in use at most companies. But sometimes you need to break down and do something with the clock, or interface with things that care about a clock, much in the same way that one would put inline asm statements in code. I want to be able to do that, but not have to deal with the clock the 95% of the time when I don't really need to -- which is where the generators fail (let the tool determine the schedule most of the time). HLS needs to sit between the two: not a generator (glorified RTL), but not "pretend you write untimed C++ all the time" (not hardware at all).


Again, a counterpoint:

I worked on hardware for something akin to an FPGA but at a much coarser granularity (kind of like coarse-grained reconfigurable arrays)--close enough that you have to adapt tools like place-and-route to compile to the hardware. The programming for this was mostly done in pretty vanilla C++, with some extra intrinsics thrown in. This C++ got close enough to hand-coded performance that many people didn't even bother trying to tune their applications by dropping down to the assembly-ish syntax.

This helped bolster my opinion that FPGAs aren't really the answer that most people are looking for, and that there are useful nearby technologies that can leverage the benefits of FPGAs while having programming models that are on par with (say) GPGPU.


For sure. FPGAs are probably not the answer that most people are looking for. FPGAs are but one point in the trade-off space, and they're not one you jump to "just because".

> [...] there are useful nearby technologies that can leverage the benefits of FPGAs while having programming models that are on par with (say) GPGPU

I think CGRAs are really cool but they're even more niche, and I suspect your original point about GPUs eating everyone's lunch applies particularly strongly to CGRAs. The point is well taken, though, and I don't necessarily disagree.


> FPGA tools are just some of the lowest quality garbage out there

I think things are about to change thanks to yosys and other open source tools.

> VHDL (bad) or Verilog (worse,

VHDL (and its software counterpart Ada) are very well thought out and great to use once you get to know them (and understand why they are the way they are). Yeah, they are a bit verbose, but I prefer a strong base to syntactic sugar.


> VHDL (and its software counterpart Ada) are very well thought out and great to use once you get to know them (and understand why they are the way they are). Yeah, they are a bit verbose, but I prefer a strong base to syntactic sugar.

As a professional FPGA developer: VHDL (and Verilog even more so) are bad [1] at what they're used for today: implementing and verifying digital hardware designs. In fact, they're at most moderately tolerable at what they were originally intended for: describing hardware.

[1] They're not completely terrible – a completely terrible idea would be to start with C and try to bend it so that you can design FPGAs with it...


Parts of VHDL leave a little to be desired, but overall I find it to be a really great language. So much so that I bought Ada 2012 by John Barnes, and I kind of like that too after coding in C/C++ etc., but maybe I'm now biased after many years of VHDL coding :) It's not uncommon to see "VHDL is bad" and the like, and I do wonder what the reasons are for those comments.


> It's not uncommon to see "VHDL is bad" and the like, and I do wonder what the reasons are for those comments.

VHDL is bad because it's bad at prototyping and implementing digital hardware [1]. One reason why it's bad at that task is the mismatch between the hardware you want and the way you have to describe it in the language. For example: You want a 32-bit register x which is assigned the value of a plus b whenever c is 0, and you want its reset value to be 25. VHDL code:

    signal x: unsigned(31 downto 0);
    ...
    process (clk, rst)
    begin
        if rst then
            x <= to_unsigned(25, x'length);
        elsif rising_edge(clk) then
            if c = '0' then
                x <= a + b;
            end if;
        end if;
    end;
The synthesis software has to interpret the constructs you use according to some quasi-standard conventions, and will hopefully emit those hardware primitives you intended. I say "hopefully", because of the many, many footguns arising from those two translation steps.

[1] Okay, I concede that in theory, there might be a use case where VHDL is perfectly suited for, which would make VHDL a not-bad language. But designing digital hardware is not such a use case.


Writing this with good intentions, not trying to start a fight...

---

There are some minor issues with your code that show you are probably a Verilog/SV guy and not an experienced VHDL guy.

Please read Andrew Rushton's "VHDL for Logic Synthesis". I also recommend you read up on VHDL's 9-valued logic, why it was designed this way, and how it differs from Verilog's bit type.


> you are probably a verilog/SV guy and not an experienced VHDL guy

Wrong on both counts.

Please, enlighten me, what's wrong with my code? Note that it's in VHDL-2008, and the async. reset is intentional.

> I also recommend you read on VHDLs 9-valued logic and why it was designed this way

My main issue with VHDL is not the IEEE 1164 std_(u)logic, although it really doesn't help that this de-facto standard type for bitvectors and numbers (via the signed/unsigned types) is just a second-class citizen in the language – as opposed to bit and integer, which are fully supported syntactically and semantically, but which have serious shortcomings.


Inconsistent Boolean expressions + lack of familiarity with unsigned and how it is supported by the tools.

Nothing major, but in my book this is the difference between a Jr and a Sr designer. Nitpicking, yes. But that's how the hardware business is.


> lack of familiarity with unsigned and how it is supported by the tools

Do you mean this: "x <= to_unsigned(25, x'length);" ? Some tools, like Synopsys, allow "x <= 25;" here, but other tools, like ModelSim, do not. The VHDL-2008 standard does not allow "x <= 25;".

> Inconsistent Boolean expressions

Do you mean because I wrote "if rst ..." but later "if c = '0'..."? Come on, you're not nitpicking, you're trying to find issues where there are none. Fixating on such anal-retentive details does not make you a "Sr designer", it makes you a bad engineer.


As someone who just said that exact thing upthread, half of it is general curmudgeonry. VHDL is not a terrible language, though it does have terrible tools. The IDE side of things is a big opportunity to improve the language. Making refactoring easier by not needing to manually touch up three different files to fix one name is a huge help. (And the IDEs have probably improved in recent times; I've done mostly hardware recently.) The compilers/synthesizers... those are vendor crud and so dragons lie there. VHDL-2008 support would go a long way to improving life....


If IDE support for basics like consistent renaming is an issue, then language server protocol support will help:

https://github.com/ghdl/ghdl-language-server

Edit: typo in url


So what’s better?


I've heard good things about Bluespec. It is used for Cambridge's CHERI capability architecture extensions, for example - https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/


> The rebuttal to your objection is always tools like "HLS"

Yup. I know HLS has gotten a lot better recently but my impression is that, somewhat like fusion, HLS as a first-class design paradigm is always a decade away.

> FPGA tools are just some of the lowest quality garbage out there

Absolutely. I think the problem is vendors see FPGA tooling as a cost center and a necessary evil in order to use their real products, the chips themselves. Users are also highly technical and traditionally have no alternative, so (mostly) working but poor-quality software is simply pushed out the door. "They'll figure it out".

Finally, to expand on the difficulties imposed by physical constraints, I think another huge blocker to wide adoption is that FPGAs are physically incompatible. I cannot take a bitstream compiled for one FPGA and program it to any other FPGA. Hell, I can't even take a bitstream compiled for one FPGA and use that bitstream for any other device in the same device family. Without some kind of standardized portability, FPGAs will remain niche devices used only for very specific applications.


> cannot take a bitstream compiled for one FPGA and program it to any other FPGA.

That's like dumping the memory contents of one PC, reinjecting them into another with a different RAM layout and devices, and complaining that the OS and programs can't continue running. Is that a sane expectation?

There are upstream formats targeting FPGAs that can be shared, although yes redoing place and route is slow.

Should manufacturers provide new formats, closer to the final form, that would still allow binaries to be adjusted, kind of like .a, .so, or even LLVM?

Alternatively, would building whole images for many families of FPGA make sense? Feels like programs distributed as binaries for p OS variants times q hardware architectures, each producing a different binary... random example https://github.com/krallin/tini/releases/tag/v0.19.0 has 114 assets.


> bitstream ... Is that a sane expectation?

No. Bitstream formats are not in any way compatible across devices. Because timing is a factor, even if you had the same physical layout of LUTs and routing, it's unlikely that your design would work.

(From parent)

> use that bitstream for any other device in the same device family

Not at the bitstream level. However, you can take a placed-and-routed chunk of logic and treat it as a unit. You can replicate it (without repeating P&R), move it around, and copy it onto other devices in the same family. This is super useful, as most FPGA applications have large repeating structures; by default P&R doesn't know that a block is a factorable unit, so it repeats P&R for each instance and you get unpredictable timing characteristics.

> Should manufacturers provide new formats closer to final form yet would allow binaries that can be adjusted, kind of like .a .so or even llvm?

> would building whole images for many families of FPGA make sense

You can license libraries that are a P&R'd blob and drop them into your design. There's no easy way to make this generalizable across devices without shipping the original RTL, and conversion from RTL->bitstream is where most of the pain lies.


> Like considering dumping memory content on a PC and reinject it on another with different RAM layout and devices and complaining the OS and programs can't continue running? Is that a sane expectation?

Even worse; it's more like that plus extracting the raw microarchitectural state of a CPU, serializing it in a somewhat arbitrary way, trying to shove that blob into a different CPU and still expecting everything to continue running.

I'm not necessarily complaining, just pointing out this significant difference WRT software programs running on CPUs.

> There are upstream formats targeting FPGAs that can be shared, although yes redoing place and route is slow.

Can you show me an example? I'd like to see this. You do not mean FPGA overlays, correct?

> Should manufacturers provide new formats closer to final form yet would allow binaries that can be adjusted, kind of like .a .so or even llvm?

Like you say, at the very least you will need to re-do place and route. But actually the problem is much worse than this. Different FPGAs have different physical resources. Not just differing amounts of logic area, but different amounts of block RAM, different DSP blocks and in varying numbers, high-speed transceivers, etc. This necessitates making different design trade-offs. Simply shoehorning the same design into different FPGAs, even if it were kind of possible, will not work well.

> Alternatively, would building whole images for many families of FPGA make sense?

Currently I think that's the only real option. But the extreme overhead, duplication of effort and maintenance burden make it very unattractive.

My napkin sketch is some sort of generalized array of partial reconfiguration regions with standardized resources in each region. Accelerator applications can distribute versions targeting different numbers of regions (e.g. one version for FPGAs supporting up to 8 regions, one for FPGAs supporting up to 16 regions, etc.). The FPGA gets loaded with a bitstream supporting a PCIe endpoint and management engine, and some sort of crossbar between regions. At accelerator load time, previously mapped, placed, and routed logical regions used in the application are placed onto actual partial reconfiguration regions and connections between regions are routed appropriately. The idea is to pre-compute as much of the work as possible, leaving a lower dimension problem to solve for final implementation. Timing closure and clock management are left as exercises for the reader :P.


> Can you show me an example? I'd like to see this. You do not mean FPGA overlays, correct?

Some of the coolest work to come out of the Chisel project is their intermediate representation FIRRTL.


Not sure why they think chip details and bitstreams need to be kept secret. If they would open up, people would make better tools for them.


Because competitors could make compatible chips.


>I think the problem is vendors see FPGA tooling as a cost center and a necessary evil

Yes, to a degree, but another part of the problem is the "physical constraints" you mention. FPGA tooling has to solve multiple hard problems, on the fly, at large scale (some of the latest chips are edging up to 10M logic elements). Unfortunately for the FPGA industry, I think this is unavoidable - though a lot of interesting work is being done around partial reconfiguration, which should allow users to work with smaller designs on a large chip.


Well, that's an explanation for why FPGA compilation flows take so much time, but it's not a good explanation for why the software is so crap.

I think partial reconfiguration is really sexy, but it's been around for a long time. What's new and exciting there? Genuinely curious.


> HLS as a first-class design paradigm is always a decade away.

What about Chisel?


Chisel is not an HLS. Chisel is much closer to VHDL and Verilog, since the hardware is directly described.


Chisel would allow me to write, say, a codec algorithm and compile it into hardware, correct? As well as specify the hardware that is necessary to describe it?

I'm a casual in that space but I thought Chisel was an HDL that could be used to support HLS.


And you do the same in VHDL and Verilog. And like in Chisel, you have to manually pipeline it and you can exactly control where registers are used and how resources are reused.
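
For anyone wondering what "manually pipeline it" means in practice, here is a hedged two-stage sketch in plain Verilog (hypothetical module): the multiply and the add are cut into separate register stages by hand, and nothing schedules this for you the way an HLS tool would.

    // Hypothetical sketch: hand-pipelined multiply-accumulate, two stages
    module mul_add_pipe2 (
        input  wire        clk,
        input  wire [15:0] a,
        input  wire [15:0] b,
        input  wire [31:0] c,
        output reg  [31:0] result
    );
        reg [31:0] prod_s1;  // stage 1: register the raw product
        reg [31:0] c_s1;     // carry c along so it lines up with stage 2
        always @(posedge clk) begin
            prod_s1 <= a * b;           // stage 1
            c_s1    <= c;
            result  <= prod_s1 + c_s1;  // stage 2: one cycle later
        end
    endmodule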

You could build something HLS like using Scala/JVM and Chisel, but Chisel itself is much closer to traditional HDLs.

https://en.m.wikipedia.org/wiki/High-level_synthesis


> These are not programming languages; they are hardware description languages.

There's a subtle point in that Verilog/SystemVerilog and VHDL are also just not powerful languages. While parametric, they lack polymorphism, object oriented programming (excluding SV simulation-only constructs), functional programming, etc.

Your point about the abstraction being different is well taken---hardware description languages describe circuits and programming languages describe programs. However, it's exceedingly unfortunate that the industry is stuck in a rut of such weak languages and trying to explain that weakness to hardware engineers, who haven't seen anything else, runs into the "Blub paradox" (e.g., a programmer who only knows assembly can't evaluate the benefits of C++). [^1]

[^1]: http://www.paulgraham.com/avg.html


While there's plenty of room to improve a language like Verilog I fail to see how these paradigms would help me in RTL. What would polymorphism even look like in an environment without a concept of runtime? Can you elaborate and enlighten me?

Edit: Disclaimer, I'm well aware of the pros and cons of these paradigms in software development and use them plenty


(Sorry! Just saw this!)

Polymorphism makes it way easier to build hardware that can handle any possible data type. Things like queues and arbiters beg for type parameters (you should be able to enqueue any data). Without polymorphism you can make something parameterized by data width (and then flatten/reconstruct the data), but it's janky and you lose any concept of type safety (as you're "casting" to a collection of bits and then back).
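
As a rough sketch of that workaround (SystemVerilog, hypothetical module): the queue is parameterized by width only, so any richer type has to be packed down to bits at the boundary and reassembled on the far side, with nothing checking that producer and consumer agree on the layout.

    // Hypothetical sketch: width-parameterized queue, no type safety
    module width_param_queue #(
        parameter int WIDTH = 32,
        parameter int DEPTH = 16
    ) (
        input  logic             clk,
        input  logic             rst,
        input  logic             push,
        input  logic             pop,
        input  logic [WIDTH-1:0] din,   // caller must flatten its struct here
        output logic [WIDTH-1:0] dout   // ...and cast it back on the way out
    );
        logic [WIDTH-1:0]         mem [DEPTH];
        logic [$clog2(DEPTH)-1:0] wr_ptr, rd_ptr;
        always_ff @(posedge clk) begin
            if (rst) begin
                wr_ptr <= '0;
                rd_ptr <= '0;
            end else begin
                if (push) begin
                    mem[wr_ptr] <= din;
                    wr_ptr      <= wr_ptr + 1'b1;
                end
                if (pop) begin
                    dout   <= mem[rd_ptr];
                    rd_ptr <= rd_ptr + 1'b1;
                end
            end
        end
    endmodule
On the consumer side you end up writing a raw cast from the flat vector back to your packed struct type, which the compiler cannot meaningfully check.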

There was some interesting work out of the University of Washington [^1] to build a "standard template library" using SystemVerilog. Polymorphism was identified as one of the shortcomings that made this difficult (Section 5: "A Wishlist for SystemVerilog"). [^2]

[^1]: https://github.com/bespoke-silicon-group/basejump_stl [^2]: http://cseweb.ucsd.edu/~mbtaylor/papers/BaseJump_STL_DAC_Sli...


Just let those programmers play around with Redstone in Minecraft before you hand them an FPGA. They'll understand it very quickly.


Another big advantage of FPGAs is low latency and the ability to hit precise timing deadlines. When working with radio hardware, you still need an FPGA for automatic gain control calculations and recording/playing out samples. Similarly, you need to do your CRC and other calculations in an FPGA if you need to immediately respond to incoming signals, such as the CTS->RTS->DATA->ACK exchange in 802.11.
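
To make the CRC point concrete, here is a hedged sketch of a bit-serial CRC-32 in Verilog (textbook MSB-first form of the 0x04C11DB7 polynomial; the exact 802.3/802.11 convention additionally reflects bits and inverts the result). Because the remainder updates as each bit arrives, the check is already done the moment the last bit lands, which is what lets the hardware turn a response around within the interframe deadline.

    // Hypothetical sketch: bit-serial CRC-32 LFSR, one input bit per clock
    module crc32_serial (
        input  wire        clk,
        input  wire        rst,     // preload the running remainder
        input  wire        bit_en,  // a new input bit is valid this cycle
        input  wire        din,
        output reg  [31:0] crc
    );
        wire fb = crc[31] ^ din;    // feedback bit for the LFSR
        always @(posedge clk) begin
            if (rst)
                crc <= 32'hFFFF_FFFF;
            else if (bit_en)
                crc <= {crc[30:0], 1'b0} ^ (fb ? 32'h04C1_1DB7 : 32'h0);
        end
    endmodule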


I think that's the big advantage of FPGA. If you need acceleration to hit a 10 microsecond latency target, FPGA is what you need. If your latency target is like a millisecond or longer, then GPU can handle a lot more throughput. But GPU can't typically give you a 10-us guarantee.

Okay, bit-banging is another advantage of FPGA that GPU doesn't do as well. There are a few things.


Regarding DNN inference, FPGAs can provide low latency AND higher throughput than GPUs.

If you want to compare apples-to-apples, we have done a comparison with realistic (and not synthetic) data regarding the performance of GPUs and FPGAs.

https://medium.com/@inaccel/faster-inference-real-benchmarks...


Ugh, ad spam taking over HN.


See, it's funny: I (a software guy) have recently started doing a bunch of FPGA stuff on the side for "fun", and I find the programming model is not the biggest challenge.

The tools, yes, because it seems like hardware engineers have a fetish for all-encompassing, painful, vendor-specific IDEs with half the features that we software developers have, and with a crapload of vendor lock-in... but I digress.

I find working in Verilog to be pretty pleasant. Yes, I can see that with sufficient complexity it wouldn't scale out well. But SystemVerilog does give you some pretty good tools for managing that with modularity.
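
For example (a hedged sketch with hypothetical names): SystemVerilog packages and interfaces let you bundle types and related signals, so a connection is one port instead of a loose pile of wires.

    // Hypothetical sketch: a package plus an interface for modularity
    package bus_pkg;
        typedef struct packed {
            logic [31:0] addr;
            logic [31:0] data;
            logic        write;
        } req_t;
    endpackage
    interface simple_bus_if (input logic clk);
        bus_pkg::req_t req;
        logic          req_valid;
        logic          req_ready;
        modport master (output req, req_valid, input  req_ready);
        modport slave  (input  req, req_valid, output req_ready);
    endinterface
    // A module now takes one interface port instead of several loose signals.
    module dma_engine (simple_bus_if.master bus);
        // drive bus.req / bus.req_valid, observe bus.req_ready here
    endmodule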

On the other hand, I've never particularly enjoyed working with GPUs, CUDA, etc.

So I would agree with your statement that structural issues prevent their utility in wider market classes -- and those really are, as you say, lower clock speeds and cost, but also vendor tooling.

FPGAs could really do with GCC/LLVM-style open, universal, modular tooling. I use fusesoc, which is about as close to that as I will get (a declarative build that generates the Vivado project behind the scenes), but it's not perfect, still.


I don't mean to belittle your exploration, but are you sure it's an apples-to-apples comparison? This suggests to me that it isn't:

> it seems like hardware engineers have a fetish for all-encompassing painful vendor specific IDEs

Hardware engineers feel pain just like you do. The reason why they put up with those awful software suites is because they have features they need that aren't available elsewhere. In particular, they interface with IP blocks and hard blocks, including at a debug + simulation level. Those tend to evolve quickly and last time I looked -- which admittedly was a while ago -- the open source FPGA tooling pretty much completely ignored them, even though they're critical to commercial development.

If you are content to live without gigabit transceivers, PCIe controllers, DRAM controllers, embedded ARM cores, and so on, I suspect it would be relatively easy to use the open source tooling, but you would only be able to address a small fraction of FPGA applications.


Vivado ships all kinds of "IP" for those things, yes. And once you get past the GUI wizards, the drag-and-drop boxes and lines, and the Tcl scripts, you find that in the end it's just a library of Verilog, all mangled to the point of illegibility.

I wasn't talking about open sourcing. I accept we won't have open source DRAM controllers and the like from them. I understand the licensing restrictions. I just don't like how they force all this stuff to be gatewayed through their baroque and over complicated GUI tools.

I prefer tools that are scriptable, that can work with the build system of my choice, that work properly with source control (imagine that!), where you have your choice of editor rather than having their garbage one rammed down your throat, where there are whizbang features like reformatting and auto-indentation... hell, even refactoring.

Vivado and Quartus just get in the way. There's no reason to tie all the stuff you're talking about into an integrated tool. They could just ship libraries.

Fusesoc does in fact try to make them behave this way. But you can tell it's a bit of a war to make it happen.


Well yes, they shouldn't cram the awful GUI tools down HW engineers' throats, but they do.

I'm glad Fusesoc is fighting the good fight and I'm glad you're fighting the good fight, but as you point out, it's definitely a fight. It was hardly fair to call the desire to avoid said fight a "fetish."


I can only assume hardware engineers are asking for this kind of tooling, because I can't imagine why companies would be spending the enormous development effort on them and then giving them away for free if they weren't being asked for?

So many things that could be done in a programmatic, testable, declarative, scriptable, repeatable way are done with futzy GUI tools in hardware land. Schematic design _could_ be a matter of declaring components, buses, etc. and letting the tool produce something (and then manually manipulating the visual layout if necessary); I mean, you could literally describe your board using something similar to Verilog and get the tool to produce the schematic for you... we have these kinds of powers in the 21st century. Instead it's futzing with tools that are vaguely Illustrator-esque, finding that half your connection points are not actually connected, etc. Why do people want to suffer like this?

Want to use a DRAM controller in Vivado? Find the wizard, enter into 10 text boxes... and if you're lucky you can find the Tcl scripts it generated and in the future just write your Tcl script... but they certainly won't make it easy.

Vivado project in source control? You're going to jump through hoops for that.

I want hardware engineers to demand better.


> the open source FPGA tooling pretty much completely ignored them, even though they're critical to commercial development.

"ignored" as in the vendors aren't cooperating with the developers of the open source tools? What the opensource tools are doing is hard enough as is. When you consider how fragmented FPGA chips are it's difficult to support a wide variety of them even if you wanted.


I'm not blaming the open source devs at all. I admire them greatly. Unfortunately, it's one thing to admire someone greatly and quite another to believe they have a compelling offering.


Please then explain why there is still no standardized synthesizable subset of Verilog. Even C/C++ at its worst was never this absurd.


LLVM folks have actually just started on such tooling: CIRCT. With Chris Lattner at the helm, and industry players like Xilinx and Intel seemingly on board.


Agreed. I never thought the mental leap to Verilog was a big hurdle. It's just C-like syntax with some new constructs around signaling and parallelism. I found this interesting rather than foreboding.

The main challenge I had was compilation time. It can sometimes take overnight to compile a simple application if there's a lot of nested looping, only to have it run out of gates. This can be a royal pain.

I'd expect most HPC scenarios would have lots of nested looping, and probably memory accesses, and thus have to spend a lot of time writing state machines to get around gate count limitations and wait for memory responses, at which point you're basically designing a 200 MHz CPU.

So I don't see it as being very useful for general purpose acceleration, but could be a good CPU offload for some very specific use cases that are more bit-banging than computing. Azure accelerates all its networking via FPGA, which seems like the ideal use case.


There's no such thing as a "loop" on an FPGA. If you declare a loop in Verilog, the synthesizer allocates one set of gates per iteration. That's probably why your runs take all night.

HLS notwithstanding, you don't use traditional control structures to tell an FPGA what to do. You use clocked FSMs and asynchronous expressions to tell it what to be.
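
A hedged side-by-side sketch of that distinction in Verilog (hypothetical modules): the loop version tells the FPGA what to be -- eight adders that all exist at once -- while the FSM version reuses one adder and spreads the work over eight clock cycles.

    // Hypothetical sketch. "Loop" version: unrolled at synthesis into eight
    // parallel adders; it costs area, not clock cycles.
    module sum8_unrolled (
        input  wire [63:0] x_flat,  // eight packed 8-bit values
        output reg  [10:0] total
    );
        integer i;
        always @* begin
            total = 0;
            for (i = 0; i < 8; i = i + 1)
                total = total + x_flat[i*8 +: 8];
        end
    endmodule
    // FSM version: one adder reused over eight cycles; the sequencing is
    // now explicit state that you design and verify yourself.
    module sum8_fsm (
        input  wire        clk,
        input  wire        start,
        input  wire [63:0] x_flat,
        output reg  [10:0] total,
        output reg         done
    );
        reg [2:0] idx;
        reg       busy;
        always @(posedge clk) begin
            done <= 1'b0;
            if (start && !busy) begin
                busy  <= 1'b1;
                idx   <= 3'd0;
                total <= 11'd0;
            end else if (busy) begin
                total <= total + x_flat[idx*8 +: 8];
                idx   <= idx + 3'd1;
                if (idx == 3'd7) begin
                    busy <= 1'b0;
                    done <= 1'b1;
                end
            end
        end
    endmodule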


Right. But for HPC, loops (in Verilog) will be the norm, to squeeze out as much from each clock tick as possible. Running everything as discrete steps in an FSM would defeat the purpose.


It's not the speed that holds FPGA adoption back; it's the development process/time. While one can start with a GPU immediately, with an FPGA there is a need to develop the whole PCIe infrastructure and efficient data movers first. One is done with the GPU while the FPGA developers are just starting on the algorithms. As long as one does not need real-time capability, the GPU is an obvious choice. My 200 MHz design outcompetes every CPU and GPU out there within a very narrow data processing window, but the development time is 5x that of regular software.


You ever work with an FPGA? The programming model and the tooling are a huge part of the problem.

Verilog and VHDL have basically nothing in common with any language you've ever used.

Compilation can take multiple days. This means that debugging happens in simulation, at maybe 1/10000th of the desired speed of the circuit.

If you try to make something too big, it just plain won't fit. There is no graceful degradation in performance; an inefficient design will just not function, come Hell or high water.

The existing compilers will happily build you the wrong thing if you write something ill-defined. There are a ton of things expressible in a hardware description language that don't actually map onto a real circuit (at least not one that can be automatically derived). In any normal language anything you can express is well-defined and can be compiled and executed. Not so in hardware.

Timing problems are a nightmare. Every single logic element acts like its own processor, writing directly into the registers of its neighbours, with no primitives for coordination. Imagine if you had to worry about race conditions inside of a single instruction!

Maybe if all these problems are solved FPGAs still wouldn't catch on, but let's not pretend the programming model isn't a problem. Hardware is fundamentally hard to design and the tooling is all 50 years out of date.


> You ever work with an FPGA? The programming model and the tooling are a huge part of the problem.

I'd argue FPGAs aren't programmed and don't have a programming model. Complaints that the programming model of FPGAs holds their adoption back are thus conceptually ill-founded. (The tooling still sucks).


I mean, the problem is that in the FPGA world the tooling and synthesis languages are inextricably linked. HLS is an approach that, IMO, is also the completely wrong direction since a general purpose programming language like C/C++ won't map nicely to the constructs you need in FPGA design.

What we really need is a lightweight, open source toolchain for FPGAs and one or more "higher level" synthesis languages. I've always wondered if a DSL hosted in a higher-level language like Python isn't a better way to do this. Rather than try to transpile an entire language, just provide building blocks and interfaces that can then be used to generate Verilog/VHDL.


> What we really need is a lightweight, open source toolchain for FPGAs and one or more "higher level" synthesis languages.

nMigen: a Python-based DSL-to-Verilog translator

LiteX: open source gateware

SymbiFlow: open source Verilog compiler + PnR tooling

There's a Linux kernel running on LiteX and a RISC-V core running on an ECP5 out on the internets.

A MicroPython version running on a RISC-V core and Migen (an earlier version of nMigen) can also be found here: https://fupy.github.io/


> I've always wondered if a DSL using a higher language like Python isn't a better way to do this

Like this? http://www.myhdl.org/


nMigen for python is where it's at these days.

https://github.com/m-labs/nmigen


There is another traditional FPGA use case where you need real time data capture or signal generation. That seems to be getting eaten from the bottom now that there are really high speed MCUs that are easier to program. It's less efficient, but easier to develop for.


The other problem with using an FPGA here is that microcontrollers are cheap and have great cheap dev boards. FPGAs, not so much. I've wanted to just "drop in" a small FPGA in several designs, the way you can drop in a microcontroller, but there's no available FPGA that's not a massive headache in that use case. Trust me, I've looked.

The iCE40 series is almost there but not quite. It's a bit pricey (this is sometimes okay, sometimes a dealbreaker) but its care and feeding is too annoying. Who wants to source a separate configuration memory? Sometimes I don't have the space for that crap.

If any company can bring a small, cheap, low power FPGA to the market, preferably with onboard non-volatile configuration memory, a microcontroller-like peripheral mix (UART, I2C, SPI, etc.), easy configuration (re)loading, and with good tool and dev board support, they'll sell a lot of units. They don't even have to be fast!


The MiniZED is $89 and a ton of fun! It has an ARM processor (Xilinx Zynq XC7Z007S SoC), Arduino compatible daughterboard connectors, microcontroller-like peripheral mix, and runs linux.

http://zedboard.org/product/minized

https://www.avnet.com/shop/us/products/avnet-engineering-ser...

Oh, and Vivado (the FPGA development IDE) is free (as in beer) for that FPGA as well as Xilinx' other mid to low end FPGAs.


The XC7Z007S is $46 in volume at distributors (though with no volume discounts; Xilinx pricing is weird).

Zynq chips are beautiful parts. But they are not "low-cost drop-in" anything. They are chips that you can architect an entire system around and replace a dozen other chips with. I know; I've done it. (But they didn't bite on our proposal, so my sketched architecture remained just a detailed sketch.)


In my last project, I just bit-banged a port to load up the configuration bits in a 4K iCE40, something like 131 KBytes; this was just a .h file that was included in the bit-banger; the static array ended up in flash (the ST MPU had 2 MB of flash, so no problem), and it only took a second or so to load the FPGA bits before it was ready to go. So, from my perspective, what you describe is already here. If even that's too much trouble, there's always the TinyFPGA BX https://tinyfpga.com/ You can use the open source yosys or you can use Synplify and the Lattice dev system, which is free w/free license.


Dropping in a midsize MCU with 256kB of Flash just to program a single FPGA is not viable in a margin-constrained commercial product. It works great if it's already there, of course, but the applications I'm thinking of have been the ones where it isn't.

Not to mention there are many FPGA applications where one purpose of the FPGA is to avoid having software in the path. If software is only responsible for configuration load, it's better, but still can be a problem.


Crowd Supply has an endless variety of hobbyist-friendly variously FPGA / USB / MCU / PCIE / SDR combination boards.

It's ridiculous for anybody to insist that programming an FPGA isn't writing software. By definition, anything you can put in a text file that ends up controlling what some piece of hardware does is software. Probably almost all of what is wrong with FPGA ecosystems comes from failure to treat it like software.

It's not much like your typical C program, but that's a very parochial viewpoint. The languages available to program FPGAs in are abysmal, a poor match to the hardware: actually too much like ordinary programming languages, to their detriment. A person who makes an FPGA do something is going to be an engineer, and to an engineer any microprocessor and any FPGA are just two different state machines. Somebody who studied "computer science" will be disoriented, but that is just because the field has narrowed, as network effects pared down the field of computing substrates until practically nothing is left.

FPGAs emulating ASICs or von Neumann CPUs is the greatest waste of potential anywhere. If the architecture of (some) FPGAs could be elucidated, it could fuel a renaissance of programming formalisms. We could begin to program them in a language actually well-suited to the task, and vary their configuration in real time according to the instantaneous task at hand.


FPGAs aren't state machines or processors. Not inherently, anyway, even if you can build those things out of them or if they sometimes are sold co-packaged.

And their internal architecture is pretty well documented. See, for example, the Spartan-6 slices: https://www.xilinx.com/support/documentation/user_guides/ug3...

What's less well documented, at least publicly, is the routing, but on some level that's less interesting since it's "just" how you get the electrons from point A to point B, not about choosing A or B. But even the routing is decently well described, though you have to look in some fairly obscure places (like the device floorplan viewer).

I'm not sure why you think FPGAs emulating ASICs is a "waste of potential". By definition, ASICs are strictly more capable and more powerful than FPGAs, so you're climbing up the potential ladder, not down!


Why? Because ASICs do one thing from the first time they are powered up until they are finally ground up into sand. But an FPGA could, if programmed right, do completely different things from one millisecond to the next. Their ability to do that is never exploited because our tooling is still much too primitive, and current devices' internal connectivity probably can't route signals to the places needed.

If you think an FPGA is not inherently and necessarily a state machine, no matter how it is programmed (provided power and clock are in specified bounds), that only means you don't know what a state machine is. All clocked digital devices are state machines, and can never be anything other than state machines.

(There is an argument to be made that an FPGA is, itself, an ASIC: an IC whose Specific Application is to be an FPGA. But such an argument would be transparent sophistry.)


There's also plenty of unclocked stuff in the FPGA... like the LUTs that do all the work. There's enough of this and it's important enough that I believe thinking of FPGAs as "just state machines" is dumb. But then I also believe that digital electronics are not "just digital circuits", but better thought of as "bistable analog circuits", so what do I know....


If the results of the LUTs don't end up clocked into a register, where do they go?

Of course everything is analog, and ultimately quantum-electrodynamic, but the languages FPGAs are programmed in don't provide access to those domains.


Gowin might just fill this niche. They are working with yosys on open source support as well.

https://www.gowinsemi.com/en/product/detail/2/

http://www.clifford.at/yosys/cmd_synth_gowin.html


I think Cypress had a product line that combined a CPU and a small programmable array, just big enough to implement your own custom IO and protocols and maybe some minimal logic beyond that.

Maybe that's what most hobbyists need?


You're probably thinking of the Cypress PSoC, Programmable System on Chip.

Those things are fantastic for hobbyists and can be nice for low-volume production. But they're kind of crap for higher volume work:

* Expensive

* Physically fragile/easy to kill: personal experience suggests they are noticeably more fragile than their competition; ALWAYS add pull resistors and ESD diodes to their JTAG/SWD pins and use a real voltage supervisor, not the internal PoR/brownout, no matter what the datasheet says because it does not speak the truth

* Actually, just add external ESD diodes to anything even the least bit sketchy

* On-chip analog not good enough for serious applications or stupidly limited (just give me two of those please? no?)

* On-chip routing is very, very limiting

* Weak MCU cores

* Few large parts (high GPIO, fast core, ...); the 5LP is better but needs a refresh with bigger, better, cheaper flagships

* More digital blocks (UDBs). They use a crappy old macrocell architecture, which wouldn't be a problem except they only give you TWO of them!

I've actually whined about the last one to the Cypress FAE (great guy!) and he just started laughing. Turns out, he's repeatedly said that to their higher-ups and gotten shot down... only to have customers like me ask for it again, over and over....

Hopefully under Infineon the PSoC line will be better managed. It could be a huge powerhouse, but right now it just does not have a good enough lineup of sane models.


Since you seem to have some experience with these: are the tools hobbyist-friendly?

(Small install, no need for licenses and license renewal, works reasonably well on a cheap laptop)


Yeah, not bad at all. A little annoying, but above average for the HW side of things.

But that's PSoC Creator, used for their PSoC 4 and 5 lines. (Avoid the 3 and older -- they're really old.) The newer 6 requires Modus Toolbox, which I think doesn't support the 4 or 5 lines (STUPID). I have no experience with that one. It's Eclipse based, so who knows.


In the hobbyist space, I also see a fair amount of CPLDs used when something like a GAL (https://en.m.wikipedia.org/wiki/Generic_array_logic) would be much cheaper and easier. Doesn't work for everything, but they can be handy.


A good example of this is XMOS. Their chips are divided into "tiles" which can simultaneously run code, together with multiple interfaces such as USB, I2S, I2C, and GPIO. Latency is very deterministic because the tiles are not using caches, interrupts, shared buses, etc.

Their development environment is Eclipse based with numerous libraries such as audio processing, interface management, DFU etc. They use a variant of C (xc) that lets you send data between channels/tiles, and easily parallelize processing.

An example use is in voice assistants, where multiple microphones need to be analyzed simultaneously, echo and background noise have to be eliminated, and the speaker isolated into a single audio stream. I've used it for an audio processing product that needed to match hardware timers exactly, provide USB access, have matched input and output, etc.


Just to throw in one more complication, I'll assert that the only benefits of FPGAs over ASICs are one time costs and time to market. Those are big benefits, but almost by definition, they aren't as important for workloads that are large scale and stable. So, if you do have a workload that's an excellent match for FPGAs, and if that workload will have lots of long term volume, you should make an ASIC for it.

So, for FPGAs to be the next big thing in HPC, you'd need to find a class of workloads that benefit from the FPGA architecture, for long enough and with high enough volume to be worth the work to move over, and are also unstable or low volume enough that it's not worth making them their own chip.


That's not entirely true - the flexibility can have its own value. Unlike an ASIC, you can handle multiple workloads or update flows.

For example, timing protocols on backbone equipment handling 100-400 Gbps. Depending on how it's configured, you may need to do different things. Additionally, you probably don't want to replace 6-figure hardware every generation.

Another example is test equipment where you can't run the tests in parallel. A single piece of hardware can be far more portable / cost effective.


I may not have said it well, but I broadly agree with you. If a workload needs high performance but not consistently (e.g. because you're doing serial tests by swapping bitstreams), predictably (e.g. because you need flexibility for network stuff you can't predict at design time), or with enough volume (e.g. costs in the low millions are prohibitive), an ASIC isn't the right solution.

But my point is that for FPGAs to come to prominence as a major computation paradigm, it probably won't be because it outperforms GPU on one really big workload like bitcoin or genetic analysis or something. It'll have to be a moderately large number of medium scale workloads.


There is also glue logic between different interfaces that can be satisfied with FPGAs or CPLDs.


> I'll assert that the only benefits of FPGAs over ASICs are one time costs and time to market.

There's one more big one: the ability to update the logic in the field.


Take a look at Vitis. Xilinx is aware of this problem and is seeking to capture the market of people who want magic programming solutions to speed up existing software. Who knows if it will be successful, but they are trying more than ever to make FPGAs usable without having to know how to do hardware design and verification.


I work with FPGAs, but from LabVIEW. NI have put some effort into making the same language work for everything including FPGAs, and a graphical language is great for this kind of work.

It's so easy that it's quite common to see people pass off work onto the FPGA if it involves some slightly heavier data processing, which is exactly how it should be.


I am working right now on a bare-metal WebSockets implementation on Xilinx 7 series FPGAs. Currently it's a Zynq SoC, but the final product will probably have a Kintex-7 inside, so no Linux. The tools make me cry: no examples, application notes from 2014 with ancient libraries. I hope the vendors will fix the tooling. But I see Xilinx has released Vitis, so their focus is elsewhere; no interest in the old crap. Using Git with Vivado is already enough pain, so I keep my text sources in Git and complete zipped projects as releases. Ouch!


I posted this elsewhere, there are a lot of good resources and examples for the tools:

https://github.com/xupgit/FPGA-Design-Flow-using-Vivado/tree...

https://www.xilinx.com/support/university.html

https://www.xilinx.com/video/hardware/getting-started-with-t...

There are others that cover the SDK side of things, but the HW side/Vivado is well documented.


I feel you completely. The Vivado IDE/toolchain is absolutely atrocious and the designers should be shamed for the horrifying bloatware they push as the STANDARD. Sometimes I have better luck doing everything in tcl/commandline there.


Vivado is amazing compared with the ASIC counterparts: Design Compiler is for RTL synthesis only, and you need years of experience to get any decent QoR out of it. In ASIC land you have separate tools for every step: synthesis, STA, PnR, simulation, floor planning, power analysis, etc. Vivado does all that in one seamless tool, and allows you to cross-probe from a routed net right back to the RTL code it came from. Try doing that with ASIC tools.

So to me it's a matter of perspective: once you understand how difficult the problem of hardware design is to solve, and what some of the existing de facto industry standard tools are like (for ASIC), you come to appreciate Vivado for just how well it brings all of these complex facets together. Of course, if you come from a SW background you may think Vivado is terrible compared to VS Code or some other IDE, but that's an unfair comparison. I guess to reframe the question: show me a hardware design environment that is better than Vivado. Also, I separate Vivado from the Xilinx SDK, as they are different tools, and Vivado is explicitly for the HW parts of the design.


I added one small Verilog file to a Vivado project.

It froze the IDE for 45 minutes before I could do anything else.

This was on a beefy machine at AWS too, not some cheap home desktop thing.

That wasn't compiling, no synthesis, P&R, nothing.

There was no giant netlist I'd been working on either. Most of the FPGA was empty.

That was literally just adding a small source file which the IDE auto-indexed so you could browse the contents.

In Verilator, an open source Verilog simulator, that same source file loaded, completed its simulation and checked test results in less than a second. So it wasn't that hard to compile and expand its contents.

Vivado is excellent for some things. But the excellence is not uniform, unfortunately. On that project, I had to do most of the Verilog development outside Vivado because it was vastly faster outside, only importing modules when they were pretty much ready to use and behaviorally validated.


That's definitely an anomaly. I use Vivado with ASIC code regularly, very large designs, and have not seen anything like this. I use Vivado to elaborate and analyse code intended for ASIC use, as it's better than other ASIC tools for that purpose. Once I'm happy with it in Vivado, then I push it through Design Compiler, etc. Elaborating a design that takes 4 hours in DC synthesis takes about 3 mins in Vivado elaboration.


FPGA vendors are in a tight spot, thanks to their customers. Their customers want better silicon, so they're forced to allocate their resources toward R&D, rather than making their software tools better. If you look at the Xilinx jobs page, you'll see maybe ONE job related to software tools programming, which is shocking given the complexity of Vivado/Vitis.

If some FPGA company comes along and throws out conventional market wisdom (the old Henry Ford quote seems pertinent: "If I'd asked customers what they wanted, they would have said "a faster horse"") and makes a FPGA with software tools that are fast, non-buggy, with good UI/UX, I think they would be able to steal significant market share. Early FPGA patents should be expiring by now...


Have you looked at open source solutions? Tim Ansell is managing some great projects on open source solutions. Check out Symbiflow, LiteX, Yosys etc.


Are these mature already? It took some time for KiCad to get to its current usable state and I don't want to be an early adopter. In fact, I want to have my private hardware MVP next year with current tools. On the other hand, I can't imagine my slacker colleagues using anything other than Vivado. Learning Vivado was already mission impossible for them.


I wouldn't say KiCad is usable yet. I've made multiple attempts to use it and it just is fundamentally user hostile. Unfortunately the devs see any attempt to improve user friendliness as "dumbing down".

Fortunately there is (finally!) an open source PCB design program that doesn't suck: Horizon EDA. I've only made one PCB with it but honestly it was pretty great and the author fixed every usability bug I reported in a matter of hours, which is an insane difference from KiCad's "you're holding it wrong".

The only thing I don't like about it is that it has an unnecessarily powerful and confusing component system (there are modules, entities, gates, etc.). But really it is the best by far.

Anyway, on FPGAs, I think the tools are only vaguely mature for iCE40 and even then you basically need to already be an expert unfortunately.


What have you found wrong with KiCad?

I've only recently started designing PCBs and I started with KiCad, but I've found it to be very easy to use after watching one video of someone going through a simple board design.


So many things. It was a few years ago that I tried so I don't remember the specifics but it's just generally very unintuitive and makes questionable UI choices. E.g. when you move a component in the schematic the wires don't stay attached to it.

I didn't need a video to figure out how to use Horizon.


Thank you, I’ll look at it. Last time I wasn’t happy about KiCad’s differential lines. My design was space constrained and it was really hard to match lengths of short traces.


Last time I used KiCad the UX itself was fine, although totally lacking polish, but the parts management was absolutely atrocious.


Parts management is a pretty significant part of the UX I would have said!

I have yet to design a PCB where I didn't have to create basically all of the components myself.


Have you tried LibrePCB? https://librepcb.org/


I actually haven't yet - that does look quite good! Will definitely check it out next time I do a PCB.


It is still in development, but I think it is way more usable than the Xilinx tools, I guess.

I am curious to know if you are using QEMU by any chance to prototype your hardware. I am doing some work on QEMU to make prototyping custom hardware easier and would love to hear the pain points.


I wonder if it is possible to add a (small) FPGA to a personal computer that could accelerate any specific software tasks (video/audio encoding, ML algorithms, compression, extra FPU capabilities) on user demand.


The problem with this will be the overhead of transferring data to/from the FPGA, which, once accounted for, often means doing the computation on the CPU makes more sense. It's obviously not a show-stopper, since GPUs have the same problem but are still useful; it's just hard to find a workload that maps well to this solution.


In a DAW, accelerating a heavy VST plugin might make sense. But often those are amenable to being translated to GPGPU code already.

I guess the one place where GPGPU-based solutions wouldn't work is when the code you want to accelerate is necessarily acting as some kind of Turing machine (i.e., emulation of some other architecture). However, I can't think of a situation where an FPGA programmed with the netlist for arch A, running alongside a CPU running arch B, would make more sense than just getting the arch-B CPU to emulate arch A; unless, perhaps, the instructions in arch A are very, very CISC, perhaps with analogue components (e.g., RF logic, like a cellular baseband modem).


This is normally handled in emulation by putting the inner parts of the testbench (the transactors) onto the FPGA as well, to minimize the amount of data that has to be transferred between the CPU and the FPGA. If the FPGA is to be used as a peripheral, again a division of labor needs to be found that minimizes the amount of data that needs to be communicated. But if there is FPGA logic on the same chip as the CPU cores, the overhead can be greatly reduced, and we're seeing more of that now.


I assumed this was kind of Intel's plan when they purchased Altera. I think the issue with this is the amount of time it takes to load the bitstream, but I thought I saw some things recently where progress was being made on this front.


> issue with this is the amount of time it takes to load the bitstream, but I thought I saw some things recently where progress was being made on this front

You saw correctly, work is indeed being done to build "shells" that can accept workloads without the user having to go through the FPGA tooling/build process.


It's been possible for a long time, but there are big challenges to adoption. Every FPGA is different and the image is tightly coupled to the chip, so you'd have to compile the algorithm specifically for your chip before loading, which can take hours. Then loading the image each time you change out accelerators for a different application can take minutes. Then the software that uses the accelerator would have to know which chip and which image you're running and send data to it accordingly. Then you have to remember that FPGAs aren't really that great as accelerators sometimes, since they run at such low clock speeds, have crummy memory interfaces, limited gate support for floating point or even integer multiplication, etc. CPUs commonly outperform them even at the things they're supposed to be good at.

So it's unlikely ever to gain broad acceptance because the software vendors would have to support such a high number of permutations and the return can be questionable. This is why you see far more accelerators based on ASICs that have higher clock speeds and baked-in circuitry for specific tasks, with standardized APIs.

But sure, there's nothing preventing you from buying an FPGA board, hooking it up to your PC, creating a few images that do the accelerations you want, and writing software that uses them, swapping the image in when your program loads. You could even write a smart driver that swaps the image only if it's not in use by another app, or whatever. It's just unlikely you'll ever find a bunch of third-party software that supports it.


There absolutely is. There are PCIe cards you can plug in and use as accelerators, just like you would use a GPU. Of course programming them to do the task you want is harder, but they can do anything. I saw a great example where someone implemented memcached on a single FPGA card and replaced many Xeons with it.


Isn't that what Apple did with the Afterburner card for the Mac Pro? I read in https://www.anandtech.com/show/15646/apple-now-offering-stan... that that card is an FPGA.

I could imagine that Apple will include something like this in their Apple Silicon SoC for ARM Macs.

The Afterburner card is not user-programmable, but maybe it will be in the future, and this was just a first try at getting the hardware into the field.


Yes, and it has been done. There are FPGAs that you can connect to with PCIe, and you only have to pay the small price of writing an FPGA implementation for your use case. It usually takes just a couple of weeks (OK, maybe months).


You might actually go even faster than PCIe by pretending to be a DDR4 memory stick.


IIRC some CPUs of the Intel Atom series already have an embedded FPGA.


Intel has launched a couple of Xeon Gold CPUs (like a variant of the 6138P) with integrated FPGAs for specific markets. Nothing mass-market, though, and they don't seem to have caught on much.


Maybe you will find this article about Large-Scale Field-Programmable Analog Arrays [FPAAs] interesting as well: https://hasler.ece.gatech.edu/FPAA_IEEEXPlore_2020.pdf


FPGAs are good at nothing at a scale that can challenge non-configurable silicon...

They are good at a lot of things at smaller scales, like general prototyping/testing/simulation, telecom, special-purpose real-time computing, etc.

The underlying logic is that FPGAs can never make things as flexible as software, and flexible software always offsets the inefficiency of a non-configurable chip. Just comparing FPGAs with CPUs/GPUs will never teach FPGA vendors this reality, or they choose to ignore it after all...


I believe you are incorrect. A counterexample to your claim is the increasing use of FPGAs in the datacenter, and various AI engines are FPGA-based. You'll do better with a CPU in real silicon, but a full-featured MPU with standard peripherals plus an FPGA for unusual, must-be-fast functions is hard to beat.


Tell me how many users are actually using FPGAs, and why Xilinx is just a fraction of Nvidia's market cap. Five years ago Nvidia was 2x Xilinx in market cap; now it's 10x.


There are two main challenges to FPGA utilization:

- The first is FPGA programming. Using OpenCL and HLS is now much easier than VHDL/Verilog for designing your own accelerators.

- The second is FPGA deployment and integration. Until now it was very difficult to integrate your design with applications, to scale out efficiently, and to share it among multiple threads/users. The main reason was the lack of an OS layer (or abstraction layer) that would allow FPGAs to be treated like any other computing resource (CPU, GPU).

This is why at inaccel we developed a vendor-agnostic orchestrator for FPGAs. The orchestrator allows much easier integration, scaling, and resource sharing of FPGAs.

That way we have managed to decouple the FPGA designer from the software developer. The FPGA designer creates the bitstream and the software developer just calls the function they want to accelerate. No need to specify the bitstream file, the interface, or the memory buffer allocation.

And the best part: it is vendor and platform agnostic. The FPGA designer creates multiple bitstreams for different platforms and the software developer couldn't care less. The developer just calls the function and the inaccel FPGA orchestrator magically configures the right FPGA for the right function.
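A purely hypothetical sketch of what that decoupling might look like from the software side (this is not the actual inaccel API; the names and signatures below are invented just to illustrate the idea):

    # Hypothetical abstraction layer: the software developer asks for a
    # function by name, and an orchestrator (not shown) would pick a
    # bitstream for whatever FPGA is available, program it, and manage
    # buffers. None of these names are a real API.
    from typing import Callable, Sequence

    def request_accelerator(function_name: str) -> Callable[[Sequence[float]], list]:
        def run(data: Sequence[float]) -> list:
            # Stand-in for "send data to the FPGA, get results back";
            # here we just compute on the CPU so the sketch is runnable.
            return sorted(data)
        return run

    # No bitstream file, device ID, or buffer allocation in sight.
    sort_accel = request_accelerator("vector-sort")
    print(sort_accel([3.0, 1.0, 2.0]))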


> Intel, AMD, and many other companies use FPGAs to emulate their chips before manufacturing them.

Really? I'm assuming if this is true it can only be for tiny parts of the design, or they have some gigantic wafer-scale FPGA that they're not telling anyone about :-) Anyway I thought they mainly used software emulation to verify their designs.


Having been involved with CPU emulation in the past, a couple of comments:

1. It's not just a single FPGA but a large box full of them. For example: https://www.synopsys.com/verification/emulation/zebu-server....

2. Software models are employed for parts of the system (for example, the southbridge and all the peripherals connected to it are generally a software model that communicates with the hardware-emulated portion in the FPGA via a PCIe model that is partly in hardware and partly in software). This saves a lot of gates in the FPGA - those parts have already been well tested anyway, so there's no need to put them into the hardware emulation.


Of the half-dozen semiconductor-designing companies I've worked for, all of them used FPGAs for emulation.

- modern FPGAs are huge.

- when an ASIC design won't fit in a single FPGA, it's usually possible to partition the design into multiple FPGAs

- software emulation/simulation is not guaranteed to be "more accurate". FPGAs can interact with a real-world environment in ways that simulation simply cannot

- simulations run 1000s of times slower than FPGAs. Months of simulation time can be covered in minutes on the FPGA

Edit: to be clear, they all use simulation too, but FPGAs are used to accelerate the verification process


Is that still true in 2020? Or is the simulation getting good enough to skip the FPGA prototyping phase?


It's still very much true. ASIC designs are described as massively parallel, tiny communicating sequential processes. FPGAs are also extremely fine-grained CSP, to a degree that is much finer than anything a CPU can offer today.


Many years ago, we had a custom-made board with 8 huge Xilinx Virtex-5 FPGAs (the largest available at the time) to emulate a large SoC. Those FPGAs were something like $20K apiece.

We had 10 such boards, good for millions of dollars in hardware, and a small team to keep them running.

These platforms were mostly used by the firmware team to develop everything before real silicon came back. They could run the full design at ~1 to 10 MHz, vs. 500+ MHz on silicon or ~10 kHz in simulation.

After running for a while, that FPGA platform crashed on a case where a FIFO in a memory controller overflowed.

Our VP of engineering said that finding this one bug was sufficient to justify the whole FPGA emulation investment.
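To put those rates in perspective, here's a quick back-of-envelope calculation using the numbers above (one second of activity on a 500 MHz chip is about 5e8 cycles):

    # How long it takes to cover one second of silicon time on each platform,
    # using the rates quoted above (500 MHz silicon, ~5 MHz FPGA, ~10 kHz sim).
    cycles = 500e6 * 1.0   # one second of real chip activity

    for name, hz in [("RTL simulation", 10e3), ("FPGA platform", 5e6)]:
        seconds = cycles / hz
        print(f"{name:>15}: {seconds:,.0f} s (~{seconds / 3600:.1f} h)")
    # RTL simulation: 50,000 s (~13.9 h); FPGA platform: 100 s (~0.03 h).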


Design verification is big business, and your VP was exactly right: a factor of 100 to 1000 speed increase allows for much more thorough and broader testing, for instance hooking the design up to other hardware with reasonable fidelity compared to the real thing. Still coarse, but a lot better than nothing. Good call. It isn't rare at all to need a respin if you don't do design verification.

One of the nicer stories about the first ARM chip is that they built a software simulator to verify the design and as a result they found plenty of bugs in the hardware before committing to silicon. The first delivered chips worked right away.


The multi-FPGA boards are generally from Dini Group, right? Fantastic boards.

Ref: https://www.dinigroup.com/web/index.php


Dini's naming schemes are hilarious. They're all named like monsters in B-movies -- their latest system, the DNVUF4A, is called "Godzilla's Butcher on Steroids", for instance.

Also, Dini got acquired by Synopsys a few years ago.


Oh I love their humor. There is always something humorous written about their status LEDs.

"Although no specific testing was performed, sophisticated statistical finite element models and back of the envelope calculations are showing the number of status LEDs to be bright enough to execute dermatological procedures normally done with CO2 lasers. Contact the factory for more information about this sophisticated feature and make sure an adult is present during operation. These LEDs are user controllable from the FPGAs so can be used as visual feedback in addition to burning skin."

"As with all of our FPGA-based products boards, the DNVUPF4A is loaded with LEDs. The LEDs are stuffed in several different colors (red, green, blue, orange et al.). There are enough LEDs here to melt cheese. Please don't melt cheese without adult supervision. These LEDs are user controllable from the FPGAs so can be used as visual feedback in addition to the gratifying task of creating gooey messes."


There are a lot of companies that create multi-FPGA boards. The market for FPGAs-for-ASIC-prototyping is substantial.


No, it was custom-made in-house for the purpose.

Huge PCBs, ~2ft by 2ft.


Curious, what was the reason for going with a custom board instead of COTS boards?

Is board-to-board connection with high-speed connectors feasible? That's what I heard from verification folks.


The largest FPGAs were reticle-busters when I used to work on them. Today I think the largest FPGAs use chiplet-style integration. Even with the inefficiency of an FPGA, many smaller chip designs can still fit on the largest FPGA.

Also, there are prototyping boards specifically built for emulation that integrate multiple FPGAs, although this does introduce a partitioning problem that has to be solved either manually or via dedicated emulator software.


The FPGA emulator for a chip I was working on involved an entire rack of FPGAs... for a single core.



IMO the next big application for FPGAs is going to be serving as a programmable DMA engine of sorts. Have a bunch of hard logic like ALUs and/or I/Os strewn about, for things like hw-accelerated SQL queries, malloc/free, data-specific compressors, and the like.


I wonder what the advantages would be of using an FPGA to test a CPU design, compared to relying on a (presumably more accurate) computer-based simulation. (I understand the reasons one might want to implement a CPU in an FPGA.)


This idea is more than 30 years old. It has been done, and once upon a time companies were built around this idea.

First off, mapping an entire CPU to an FPGA cluster is a design challenge in itself. Assuming you can build an FPGA cluster large enough to hold your CPU, and reliable enough to get work done on it, you have the problem of partitioning your design across the FPGAs. Second problem: observability. In a simulator, you can probe anywhere trivially; with an FPGA cluster, you must route the probed signal to something you can observe. (I am not even going to talk about getting stimulus in and results out, since with FPGA or simulator, either way you have that problem, it is just different mechanics.)

The big problem is that an FPGA models each signal with two states: 1 and 0. A logic simulator can use more states, in particular U or "unknown". All latches should come up U, and getting out of reset (a non-trivial problem), to grossly oversimplify, is "chasing the U's away". An FPGA model could, in theory, model signals with more than two states. The model size will grow quickly.

Source: Once upon a time I was pre-silicon validation manager for a CPU you have heard of, and maybe used. Once upon a time I was architect of a hardware-implemented logic simulator that used 192 states (not 2) to model the various vagaries of wired-net resolution. Once upon a time I watched several cube-neighbors wrestle with the FPGA model of another CPU you have heard of, and maybe used.

Note: What would 3 state truth tables look like, with states 0,1,U? 0 and 1 is 0. 0 and U is 0. 1 and U is U -- etc. You can work out the rest with that hint, I think.
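A minimal sketch of those three-valued truth tables in Python (just to spell out the hint above; real simulator signal types such as IEEE 1164 std_logic carry even more states):

    # Three-valued AND/OR/NOT over {0, 1, U}: a controlling value
    # (0 for AND, 1 for OR) wins even against U; otherwise any U
    # makes the result U.
    U = "U"  # unknown

    def and3(a, b):
        if a == 0 or b == 0:
            return 0      # 0 AND anything is 0, even 0 AND U
        if a == U or b == U:
            return U      # 1 AND U is U
        return 1

    def or3(a, b):
        if a == 1 or b == 1:
            return 1      # 1 OR anything is 1, even 1 OR U
        if a == U or b == U:
            return U      # 0 OR U is U
        return 0

    def not3(a):
        return U if a == U else 1 - a

    for a in (0, 1, U):
        for b in (0, 1, U):
            print(f"{a} AND {b} = {and3(a, b)}    {a} OR {b} = {or3(a, b)}")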

Edit to add: Why are U's important? They uncover a large class of reset bugs and bus-clash bugs. I once worked on a mainframe CPU where we simulated the design using a two-state simulator. Most of the bugs in bring-up were getting out of reset. Once we could do load-add-store-jump, the rest just mostly worked. Reset bugs suck.


> Reset bugs suck.

Indeed they do. And even if you have working chips you get the next stage: board-level reset bugs. An MC68K board I helped develop didn't want to boot; a nasty side effect of a reset line that didn't stay at the same level long enough stopped the CPU from resetting reliably while everything else did just fine. That took a while to debug.


Because it's substantially faster. Simulating a large CPU design in software is slow and it doesn't parallelise well, so your tests will take a lot longer (and these aren't fast even with FPGA acceleration: runtimes can be days or weeks if you're running a large fraction of the design for even a tiny amount of time in the simulation).


SW-based simulation is mostly about the functional correctness and robustness of an implementation. Even with cycle-accurate simulations, there is a lot of data pertaining to timing and performance constraints that you can't just extrapolate from simulation results. And that's where emulating CPU/GPU/ASIC designs generally helps the most.


The thing with FPGAs is that companies, when faced with a cash and time crunch, will opt to use an FPGA instead of designing an ASIC. The tools suck, but companies will hire someone who will do it. FPGAs fit a very particular constraint and still solve very specific problems efficiently.


The problem that FPGAs have is that they are only good for low-volume solutions that require flexibility and have no power constraints.

That's a really narrow market. Telecom equipment and lab equipment, basically.

If I need volume, I need at least an ASIC. If I need to manage power, I need a full custom design.


Microsemi (now part of Microchip) makes some low-power FPGAs. Xilinx has made the CoolRunner CPLDs for years, which are mighty low-power (they're not huge, but often big enough for some needed extra logic). (Another market that doesn't care too much about power is the military.)


This is really interesting. If a CPU hardware vulnerability like Spectre could be repaired by patching an FPGA on the SoC, that would be incredible. That type of functionality would take over the entire cloud market in about 3 days.


I'm afraid it doesn't work like this. That would only be possible if the chip was using an FPGA fabric for the relevant parts of the design. For example if the L1 cache was implemented as an FPGA you could in theory patch around L1TF. But they wouldn't do that because it would be far slower/larger than implementing it directly as an ASIC.

Or you might imagine a chip that has an FPGA on the side (I expected Intel would ship this after acquiring Altera, but it never happened). But the FPGA would somehow have to have access to the paths that caused the vulnerability, which is highly unlikely, and it would also be really slow compared to what they actually do, which is hacking around it with microcode changes.


> Or you might imagine a chip that has an FPGA on the side (I expected Intel would ship this after acquiring Altera, but it never happened).

They did: https://www.anandtech.com/show/12773/intel-shows-xeon-scalab...

But I get the sense this part was aimed at a few very specific customers. It required some PCB-level power delivery changes, so you couldn't even drop it into a standard server motherboard.


FPGAs are too slow for that. I think you can get the clock rate up to about 600 MHz, but that is only for very small portions of the chip. Otherwise you run into timing issues. The clock speed for most of the chip will be significantly lower.


Yup. If you just want a CPU, use a CPU. An FPGA is a terrible substitute, and generally you only want to embed a CPU in one if you are either developing a CPU or you want a not-very-fast CPU as an add-on to a design that is already using an FPGA (and for this the vendors nowadays make FPGAs with a CPU on the same die, because it's so common and it frees up quite a lot of the FPGA fabric and power budget).


Amazon already has FPGAs in the cloud: https://aws.amazon.com/ec2/instance-types/f1/

I don't think they are very popular though. Maybe they are used sometimes for machine learning?


It would also open up new attack vectors.


That's the real nightmare. Now all of a sudden, you can program the CPU itself if you can access the update mechanism. CPUs being non-programmable is a feature as well as a bug.


CPUs are already "programmable" via microcode updates.


Pretty much every new non-x86 CPU doesn't have updatable microcode, so that's a very x86-centric problem.


Microcode is loaded when the OS starts though right? At the very least it's not persistent.


BIOS or OS


And they have been for ages; that was one of the themes of the RISC vs. CISC debate.


inaccel.com is taking a lot of steps to bring FPGAs into 2020:

Spark/k8s integration, abstraction of popular cores, Python APIs, serverless deployments, etc.



