> but I wonder how important performance really is here.
Perf is important, but in my experience American MLEs are less likely to dig into GPU and OS internals to squeeze out maximum perf, and tend to just throw money at the problem.
> solder on a large amount of high bandwidth memory and produce these cards relatively cheaply
HBM is somewhat limited in China as well. CXMT is around 3-4 years behind other HBM vendors.
That said, you don't need the latest and most performant GPUs if you can tune older GPUs and parallelize training at a large scale.
-----------
IMO, model training is an embarrassingly parallel problem, and a large enough, heavily tuned cluster built on architectures 1-2 generations old should be able to deliver similar training performance.
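Rough back-of-envelope sketch of what I mean. Everything here is hypothetical: the function name, the throughput units, and the efficiency figures are made up for illustration, not measurements of any real GPU. The only assumption is that sustained cluster throughput is roughly device count times per-device throughput times a scaling-efficiency factor that absorbs communication and sync overhead.

```python
import math


def gpus_needed(target_throughput, per_gpu_throughput, scaling_efficiency):
    """Estimate how many GPUs are needed to sustain target_throughput.

    scaling_efficiency is the fraction of ideal linear scaling the cluster
    actually retains (0 < efficiency <= 1). It is a stand-in for interconnect,
    kernel, and synchronization overhead, i.e. the part that tuning improves.
    """
    if not 0 < scaling_efficiency <= 1:
        raise ValueError("scaling_efficiency must be in (0, 1]")
    if per_gpu_throughput <= 0:
        raise ValueError("per_gpu_throughput must be positive")
    effective_per_gpu = per_gpu_throughput * scaling_efficiency
    return math.ceil(target_throughput / effective_per_gpu)


# Hypothetical target, in arbitrary throughput units.
target = 900

# Hypothetical older GPU with half the per-device throughput of a newer one.
older_gpu = 0.5

tuned = gpus_needed(target, older_gpu, scaling_efficiency=0.9)
untuned = gpus_needed(target, older_gpu, scaling_efficiency=0.6)
print(f"tuned cluster: {tuned} GPUs, untuned cluster: {untuned} GPUs")
```

The point is just that the efficiency term sits in the denominator next to per-device throughput, so systems-level tuning buys back a lot of what an older architecture gives up.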
This is why I bemoan America's failures at OS internals and systems education. You have entire generations of "ML Engineers" and researchers in the US who don't know their way around CUDA or InfiniBand optimization or the ins and outs of the Linux kernel.
They're just boffins who like math and using wrappers.
That said, I'd be cautious about trusting a press release or secondhand report from CCTV, especially after the Kirin 9000 saga and SMIC.
But arguably, it doesn't matter - even if Alibaba's system isn't comparably performant to an H20, if it can be manufactured at scale without eating Nvidia's margins, it's good enough.