We haven't specifically tested on any iCE40 FPGAs yet. If this is something you'd really like to see, let me know! Looking at the lineup, the iCE40 LP8K and LP4K would be suitable for running a very small version of the Tensil accelerator. You'd want to run a small model to get reasonable performance.
Generally speaking, FPGAs with some kind of DSP (digital signal processing) capability will work best, since they can most efficiently implement the multiply-accumulate operations needed.
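To make "multiply-accumulate" concrete, here is a rough software sketch of the operation a DSP block performs in hardware once per step. The function names (`to_fixed`, `mac_dot`) and the 8 fractional bits are assumptions chosen for illustration; they are not Tensil's actual data format or API.

```python
def to_fixed(x: float, frac_bits: int = 8) -> int:
    """Quantize a float to a signed fixed-point integer (hypothetical format)."""
    return round(x * (1 << frac_bits))


def mac_dot(weights, activations, frac_bits: int = 8) -> float:
    """Dot product built from repeated multiply-accumulate steps."""
    acc = 0  # the accumulator is wider than the operands, as in a hardware MAC
    for w, a in zip(weights, activations):
        # One multiply-accumulate: a single DSP block can do this in hardware,
        # while plain logic fabric has to build the multiplier out of LUTs.
        acc += to_fixed(w, frac_bits) * to_fixed(a, frac_bits)
    # The product of two fixed-point values carries 2 * frac_bits fractional bits.
    return acc / (1 << (2 * frac_bits))


print(mac_dot([0.5, -0.25, 1.0], [2.0, 4.0, 0.5]))
```

A neural network layer is essentially a very large number of these steps, which is why dedicated multipliers matter so much.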
I think the iCE40 LP/HX series are the biggest ones, but the iCE40UP5K is also neat: unlike the LP/HX, it has hardware multipliers, plus a relatively large 1 megabit of on-chip RAM. Unfortunately, I think the UP family is relatively slow (in terms of propagation delay and max clock frequency).
I don't have hard numbers at hand, but I'd estimate something like an order of magnitude improvement from using DSP blocks for multiplication versus building multipliers out of general logic. If DSP blocks are available on the fabric, you'll definitely want to use them! If this is an experiment you want to run, I'd be very happy to help you figure out how to do it.
So it definitely can be done with some careful attention to the limited number of multipliers on the device. I’ll be curious to see how Tensil does at mapping onto highly resource-constrained FPGAs. Regardless, Tensil looks like a very cool tool.
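As a back-of-envelope way to reason about a small multiplier budget, the sketch below estimates a lower bound on cycles for one convolution layer, assuming every multiplier completes one multiply-accumulate per cycle with perfect utilization. The layer shape, the count of 8 multipliers, and the 12 MHz clock are all hypothetical placeholders; check the device datasheet and your own timing results before relying on any of them.

```python
def conv_macs(kernel: int, in_ch: int, out_ch: int, out_h: int, out_w: int) -> int:
    """Multiply-accumulates needed for one 2D convolution layer."""
    return kernel * kernel * in_ch * out_ch * out_h * out_w


def estimate_cycles(total_macs: int, multipliers: int) -> int:
    """Lower bound on cycles, assuming one MAC per multiplier per cycle."""
    if multipliers <= 0:
        raise ValueError("multipliers must be positive")
    return -(-total_macs // multipliers)  # ceiling division


# Hypothetical layer: 3x3 kernel, 8 input channels, 16 output channels, 32x32 output.
macs = conv_macs(3, 8, 16, 32, 32)
cycles = estimate_cycles(macs, multipliers=8)  # assumed multiplier count
clock_hz = 12_000_000                          # assumed clock frequency
print(f"{macs} MACs, {cycles} cycles, about {cycles / clock_hz * 1e3:.2f} ms")
```

Real designs will land above this bound because of memory stalls and imperfect scheduling, but it gives a quick sense of whether a model is in the right ballpark for the device.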
Wow, awesome project! This is exactly the kind of thing we had in mind when we built Tensil. I'd be very curious to hear what happens if you make a v2 perhaps using Tensil for comparison.
That's excellent - feel free to join our Discord if you'd like to brainstorm ideas or get help choosing models and boards: https://discord.gg/TSw34H3PXr