Hacker News



Similar in concept, though I think the idea is that it would be used as an application coprocessor rather than as the main processor, and with many more threads.

I don't remember all the details, but picture a bunch of those attached to different parts of the processor hierarchy remotely, e.g., one per core or one per NUMA node. The connection between the coprocessor and the processor can be thin, because the processor would just be sending commands to the coprocessor, so they wouldn't consume much of the constrained processor bandwidth, and each coprocessor would have a high-bandwidth connection to memory.
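A minimal sketch of that "thin command" pattern, with entirely hypothetical names and a toy API (the real hardware interface would of course be a descriptor queue, not Python objects): the host sends only a small command, the coprocessor does the full data traversal against its own high-bandwidth local memory, and only a scalar result comes back over the narrow link.

```python
from dataclasses import dataclass

@dataclass
class Command:
    op: str      # e.g. "sum" or "max" -- illustrative opcodes only
    addr: int    # start index in coprocessor-local memory
    length: int  # number of elements to touch locally

class MemorySideCoprocessor:
    """Hypothetical memory-side coprocessor: heavy traversal stays local."""

    def __init__(self, local_memory):
        self.mem = local_memory  # stands in for high-bandwidth local DRAM

    def execute(self, cmd: Command):
        # The full region scan happens here, next to memory; only the
        # command descriptor and the scalar result cross the thin link.
        region = self.mem[cmd.addr:cmd.addr + cmd.length]
        if cmd.op == "sum":
            return sum(region)
        if cmd.op == "max":
            return max(region)
        raise ValueError(f"unknown op: {cmd.op}")

# Host side: a few bytes of command instead of cmd.length elements of data.
copro = MemorySideCoprocessor(list(range(1000)))
print(copro.execute(Command("sum", 0, 1000)))  # 499500
```

The point of the sketch is the traffic asymmetry: the host-to-coprocessor message is constant-size regardless of how much memory the operation touches.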


There were also the Tera MTA and various "processor-in-memory" research projects in academia.

Eventually, it all comes full circle to supercomputer versus "Hadoop cluster" again: can you farm out work locally, near the bits of data, or does your algorithm effectively need global scope to "transpose" data and thus hit the bisection-bandwidth limits of your interconnect topology?
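A back-of-envelope sketch of that trade-off, using made-up numbers (dataset size, node count, and bandwidths are all hypothetical): a global all-to-all "transpose" moves roughly half the dataset across the network's bisection cut, so it is bounded by bisection bandwidth, while a node-local pass only streams each node's own shard from local memory.

```python
def local_pass_time_s(dataset_bytes, nodes, mem_bw_bytes_per_s):
    # Each node streams only its own shard from local memory.
    return (dataset_bytes / nodes) / mem_bw_bytes_per_s

def transpose_time_s(dataset_bytes, bisection_bw_bytes_per_s):
    # All-to-all: roughly half the data crosses the bisection cut.
    return (dataset_bytes / 2) / bisection_bw_bytes_per_s

DATA = 100e12        # 100 TB dataset (hypothetical)
NODES = 1000
MEM_BW = 200e9       # 200 GB/s per-node memory bandwidth (hypothetical)
BISECTION = 10e12    # 10 TB/s aggregate bisection bandwidth (hypothetical)

print(f"local pass: {local_pass_time_s(DATA, NODES, MEM_BW):.2f} s")  # 0.50 s
print(f"transpose:  {transpose_time_s(DATA, BISECTION):.2f} s")       # 5.00 s
```

With these particular numbers the global shuffle is 10x slower than a local pass, which is the whole argument for keeping work near the data when the algorithm allows it.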




