Really awesome project. I want to get into FPGAs, but honestly it's hard to even grasp where to start, and the whole field feels very intimidating.
My eventual goal would be to create an acceleration card for LLMs (completely arbitrary), so a lot of the same bits and pieces as in this project, probably except for the memory-offloading part needed to load bigger models.
Reframe it in your mind. "Getting into FPGAs" needs to be broken down. There are so many subsets of skills within the field that you need to set your expectations accordingly. No one expects a software engineer to jump in by building a full computer from first principles, writing an instruction set architecture, understanding machine code, converting that to assembly, and then developing a programming language just so they can write a bit of Python to build an application. You start from the top and work your way down the stack.
If you abstract away the complexities and focus on building a system using some pre-built IP, FPGA design is pretty easy. I always point people to something like MATLAB, so they can create some initial applications using HDL Coder on a dev kit with a reference design. Otherwise, there's the massive overhead of learning digital computing architecture, Verilog, timing, transceivers/IO, pin planning, Quartus/Vivado, simulation/verification, embedded systems, etc.
In short, start with some system-level design. Take some plug-and-play IP, learn how to hook it together at the top level, and insert that module into a prebuilt reference design. Eventually, peel back the layers to reveal the complexity underneath.
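To make that concrete, here's a minimal sketch (module and port names invented for illustration) of what "system-level" design looks like in Verilog: the top module does almost nothing itself, it just instantiates blocks and wires them together:

    // Top level: pure plumbing between two pre-built blocks.
    module blink_top (
        input  wire clk,     // board oscillator, say 12 MHz
        input  wire button,  // push-button, active high
        output wire led
    );
        wire tick;

        // Block 1: divides the clock down to a ~1 Hz tick.
        tick_gen #(.DIV(24'd12_000_000)) u_tick (
            .clk  (clk),
            .tick (tick)
        );

        // Block 2: toggles the LED on each tick while the button is held.
        led_toggle u_led (
            .clk    (clk),
            .enable (button),
            .tick   (tick),
            .led    (led)
        );
    endmodule

    module tick_gen #(parameter DIV = 24'd12_000_000) (
        input  wire clk,
        output reg  tick
    );
        reg [23:0] cnt = 24'd0;
        initial tick = 1'b0;
        always @(posedge clk)
            if (cnt == DIV - 1) begin
                cnt  <= 24'd0;
                tick <= 1'b1;
            end else begin
                cnt  <= cnt + 24'd1;
                tick <= 1'b0;
            end
    endmodule

    module led_toggle (
        input  wire clk,
        input  wire enable,
        input  wire tick,
        output reg  led
    );
        initial led = 1'b0;
        always @(posedge clk)
            if (enable && tick)
                led <= ~led;
    endmodule

In a real vendor flow the two inner blocks would be IP you configured in a GUI rather than modules you wrote yourself, but the top level looks exactly like this.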
Edit: And if you want to get into CPU design and can get a grip on "Advanced Computer Architecture: Parallelism, Scalability, Programmability" by Kai Hwang, then I'd recommend that too. It's super old and some things are probably done differently in newer CPUs, but it's exceptionally good for learning the fundamentals. Very well written. But I think it's hard to find a good (physical) copy.
1. Clone this educational repo https://github.com/yuri-panchul/basics-graphics-music - a set of simple labs for those learning Verilog from scratch. It's written by Yuri Panchul, who worked at Imagination developing GPUs, by the way. :)
2. Obtain one of the dozens of supported FPGA boards and some accessories (keys, LEDs, etc.).
3. Install Yosys and friends.
4. Perform as many labs from the repo as you can, starting from lab01 - DeMorgan (see the sketch below).
You can work through the labs while reading Harris & Harris. Once you're done with the labs and the book, it's time to start your own project. :)
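For a taste of what lab01 is about: the whole exercise boils down to checking De Morgan's law in hardware (port names here are mine, not necessarily the repo's exact ones):

    // De Morgan: ~(a & b) is the same signal as ~a | ~b.
    // Drive a and b from two switches; the two LEDs should always agree.
    module demorgan (
        input  wire a,
        input  wire b,
        output wire nand_ab,   // ~(a & b)
        output wire or_nanb    // ~a | ~b
    );
        assign nand_ab = ~(a & b);
        assign or_nanb = ~a | ~b;
    endmodule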
PS: They have a weekly meetup at Hacker Dojo; you can participate over Zoom if you are not in the Valley.
If you want to accelerate LLMs, you will need to know the architecture first. Start from that. The hardware is actually both the easy part (design) and the hard part (manufacturing).
Depends heavily on what system it is supposed to provide acceleration for.
If it is an MCU based on a simple ARM Cortex-M0, M0+, M3, or a RISC-V RV32I core, then you could use an iCE40 or similar FPGA to provide a big speedup just by using the DSPs and the large SPRAM.
Basically, you add the custom compute operations and memory space that don't exist in the MCU: operations that would otherwise take many instructions to do in software. Also, offloading to the FPGA AI 'co-processor' frees up the MCU to do other things.
The kernel operations in the Tiny GPU project are actually really good examples of things you could efficiently implement in an iCE40UP FPGA device, resulting in substantial acceleration. And using EBRs (block RAM) and/or the SPRAM for block queues would make a nice interface to the MCU.
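As a hedged sketch of the kind of kernel worth offloading (my own example, not taken from the Tiny GPU code): an 8-bit multiply-accumulate, the inner loop of any matrix or convolution kernel. On an iCE40UP, Yosys can map the multiply onto the SB_MAC16 DSP blocks, and even one of these doing a MAC per clock beats a Cortex-M0 running the same loop in software:

    module mac8 (
        input  wire               clk,
        input  wire               clear,  // pulse to start a new dot product
        input  wire               valid,  // a/b pair is valid this cycle
        input  wire signed [7:0]  a,      // activation from the queue
        input  wire signed [7:0]  b,      // weight from the queue
        output reg  signed [31:0] acc     // running sum
    );
        always @(posedge clk)
            if (clear)
                acc <= 32'sd0;
            else if (valid)
                acc <= acc + a * b;       // one MAC per clock
    endmodule

Feeding a and b from the EBR/SPRAM block queues mentioned above closes the loop with the MCU.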
One could also implement a RISC-V core in the FPGA, giving you a single chip with a low-latency interface to the AI accelerator. You could even implement the AI accelerator as a set of custom instructions. There are so many possible solutions!
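To make the custom-instruction route concrete: PicoRV32 exposes PCPI (Pico Co-Processor Interface) for exactly this. The interface signals below are PicoRV32's real ones; the instruction encoding is my own invention on the RISC-V custom-0 opcode (7'b0001011), so treat this as a sketch:

    module pcpi_mac (
        input  wire        clk,
        input  wire        resetn,
        input  wire        pcpi_valid,
        input  wire [31:0] pcpi_insn,
        input  wire [31:0] pcpi_rs1,
        input  wire [31:0] pcpi_rs2,
        output reg         pcpi_wr,
        output reg  [31:0] pcpi_rd,
        output wire        pcpi_wait,
        output reg         pcpi_ready
    );
        // Accept a hypothetical "mac rd, rs1, rs2" on custom-0, funct7 == 0.
        wire insn_mac = pcpi_valid
                     && pcpi_insn[6:0]   == 7'b0001011
                     && pcpi_insn[31:25] == 7'b0000000;

        reg [31:0] acc;
        assign pcpi_wait = 1'b0;          // single cycle, never stalls

        always @(posedge clk) begin
            pcpi_wr    <= 1'b0;
            pcpi_ready <= 1'b0;
            if (!resetn) begin
                acc <= 32'd0;
            end else if (insn_mac && !pcpi_ready) begin
                acc        <= acc + pcpi_rs1 * pcpi_rs2;
                pcpi_rd    <= acc + pcpi_rs1 * pcpi_rs2;  // return updated sum
                pcpi_wr    <= 1'b1;
                pcpi_ready <= 1'b1;       // one-cycle completion pulse
            end
        end
    endmodule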
An iCE40UP5K FPGA will set you back 9 EUR in single quantity.
This concept of course scales up to the performance and cost levels you're talking about, with many possible steps in between.
Something that seems to be little appreciated is that the transformer architecture needs to become more compute-bound. Inventing a machine-learning architecture that is FLOPs-heavy instead of bandwidth-heavy would be a good start.
It could be as simple as using a CNN instead of the V matrix. Yes, this makes the architecture less efficient, but it also makes it easier for an accelerator to speed up, since CNNs tend to be compute-bound.
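To put rough numbers on "bandwidth-heavy vs FLOPs-heavy" (my own back-of-envelope, batch-1 decoding, fp16 weights, ignoring activation traffic):

    matvec with a d x d weight matrix:
      FLOPs ~ 2*d^2
      bytes ~ 2*d^2        (every weight read once, reused never)
      => ~1 FLOP/byte: hopelessly memory bound

    conv with a k-tap kernel over n positions:
      FLOPs ~ 2*k*n per filter
      bytes ~ 2*k          (the same k weights reused at all n positions)
      => ~n FLOPs/byte: compute bound for any reasonable n

That weight reuse is the whole trade being proposed here.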