AFAIK this doesn't really help for interactive use, because autoregressive LLMs generate one token at a time: with the layers split across cards, each new token's forward pass has to traverse every GPU in sequence, so only one card is doing useful work at any moment and every hop adds PCIe transfer latency. Better than nothing, but it only really pays off if you can batch requests so that all the GPUs stay busy at once instead of taking turns.
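To make that concrete, here's a rough PyTorch sketch (toy layer split, made-up sizes, assumes two CUDA devices) of what a pipeline-split decode step looks like for a single request: stage 1 sits idle while stage 0 runs, then the roles swap, with an activation transfer over PCIe in between.

```python
import torch
import torch.nn as nn

# Toy "LLM": 8 transformer-style blocks split evenly across 2 GPUs.
blocks = [nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
          for _ in range(8)]
stage0 = nn.Sequential(*blocks[:4]).to("cuda:0")
stage1 = nn.Sequential(*blocks[4:]).to("cuda:1")

@torch.no_grad()
def decode_step(hidden):                    # hidden: [batch, seq, 512]
    h = stage0(hidden.to("cuda:0"))         # GPU 0 works, GPU 1 idle
    h = h.to("cuda:1")                      # activations cross PCIe
    return stage1(h)                        # GPU 1 works, GPU 0 idle

# Single interactive request: one token at a time, so the stages never
# overlap and every generated token pays the full serial traversal.
hidden = torch.randn(1, 1, 512)             # stand-in for a token embedding
for _ in range(4):                          # stand-in for a 4-token decode loop
    out = decode_step(hidden)
```

With a batch of requests in flight, stage 0 can be working on one request's token while stage 1 handles another's, which is what keeps all the cards busy.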