You use DSPs for that. Efinix has direct bfloat16 support in their FPGAs. The real game changer is using the carry chain with your LUT-based adders. Assuming 16 LUTs per adder, you could be getting 11 teraops out of a Ti180 at a few watts. That is only a theoretical peak, of course, but I could imagine using four FPGAs to run speech recognition, speech synthesis, and vision-based LLMs in real time.
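For anyone who wants to sanity-check that kind of peak number, here is a rough back-of-the-envelope sketch. The LUT count and clock rate are placeholder assumptions, not datasheet figures, and the function names are made up for illustration. It also assumes every adder finishes one operation per clock, which real designs will not reach once routing, control logic, and memory bandwidth are accounted for.

```python
def peak_ops_per_second(total_luts: int, luts_per_op: int, clock_hz: float) -> float:
    """Upper bound on throughput: each group of LUTs completes one op per cycle."""
    units = total_luts // luts_per_op  # number of parallel adders that fit
    return units * clock_hz


def clock_needed_hz(target_ops: float, total_luts: int, luts_per_op: int) -> float:
    """Clock rate required to reach target_ops with the same idealized model."""
    units = total_luts // luts_per_op
    return target_ops / units


# Placeholder inputs (assumptions, check the datasheet before relying on them).
ASSUMED_LUTS = 176_000      # hypothetical usable LUT count
LUTS_PER_OP = 16            # from the comment: 16 LUTs per adder
ASSUMED_CLOCK_HZ = 1e9      # hypothetical; achievable fabric clocks are often lower

if __name__ == "__main__":
    peak = peak_ops_per_second(ASSUMED_LUTS, LUTS_PER_OP, ASSUMED_CLOCK_HZ)
    print(f"peak: {peak / 1e12:.1f} teraops")
    needed = clock_needed_hz(11e12, ASSUMED_LUTS, LUTS_PER_OP)
    print(f"clock needed for 11 teraops: {needed / 1e6:.0f} MHz")
```

Plugging in a lower clock or a smaller usable LUT count shows how quickly the peak drops, which is why it is best treated as a ceiling rather than an estimate.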