Thanks for the context. I remember recently reading a paper, I think from Baidu, where they claimed a container arrival rate in the millions per second, which made it practical to operate their whole site in the style of lambda/cloud functions.

Actually, now that I am searching for it, it seems Baidu has a number of papers on workload orchestration at scale, specifically for machine learning.

I will note a trend I have observed with recent ML: as we increasingly use accelerators and models correspondingly grow in size, we are returning to a "one machine, one workload" paradigm for the biggest training and inference jobs. You might have 8k accelerators but only 1,000 machines (roughly eight accelerators per host), and if you run one container per host, 300 schedules per second is fast.
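
For a rough sense of scale, here is a back-of-the-envelope sketch in Python; the eight-accelerators-per-host density, one container per host, and 300/s scheduler rate are illustrative assumptions drawn from the numbers above, not measurements.

    # Back-of-the-envelope: why a modest scheduler rate is plenty for the
    # "one machine, one workload" regime described above.
    accelerators = 8_000
    accelerators_per_host = 8                      # assumed host density
    hosts = accelerators // accelerators_per_host  # ~1,000 machines

    containers_per_host = 1                        # one workload per machine
    total_containers = hosts * containers_per_host

    scheduler_rate = 300                           # placements per second
    full_placement_seconds = total_containers / scheduler_rate
    print(f"{hosts} hosts, {total_containers} containers, "
          f"placed from scratch in ~{full_placement_seconds:.1f}s")
    # -> 1000 hosts, 1000 containers, placed from scratch in ~3.3s

In other words, even rescheduling the entire cluster at 300 placements per second takes only a few seconds, so scheduler throughput is nowhere near the bottleneck in that regime.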

At the same time, as you note, we have function-style models for container execution that are approaching millions of dispatches per second for highly partitionable work, especially in data engineering and ETL.
