Thanks for the context. I remember recently reading a paper, I think from Baidu, where they claimed a container arrival rate in the millions per second, which made it practical to operate their whole site in the style of lambda/cloud functions.

Actually, now that I am searching for it, it seems Baidu has a number of papers on workload orchestration at scale, specifically for machine learning.

I will note a trend I have observed with recent ML: as we increasingly use accelerators and models correspondingly grow in size, we are returning to a "one machine, one workload" paradigm for the biggest training and inference jobs. You might have 8k accelerators but only 1,000 machines (roughly eight accelerators per host), and if you run one container per host, 300 schedules per second is fast.
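
For a rough sense of scale, here is a back-of-the-envelope sketch in Python; the eight-accelerators-per-host density, one container per host, and 300/s scheduler rate are illustrative assumptions drawn from the numbers above, not measurements.

    # Back-of-the-envelope: why a modest scheduler rate is plenty for the
    # "one machine, one workload" regime described above.
    accelerators = 8_000
    accelerators_per_host = 8                      # assumed host density
    hosts = accelerators // accelerators_per_host  # ~1,000 machines

    containers_per_host = 1                        # one workload per machine
    total_containers = hosts * containers_per_host

    scheduler_rate = 300                           # placements per second
    full_placement_seconds = total_containers / scheduler_rate
    print(f"{hosts} hosts, {total_containers} containers, "
          f"placed from scratch in ~{full_placement_seconds:.1f}s")
    # -> 1000 hosts, 1000 containers, placed from scratch in ~3.3s

In other words, even rescheduling the entire cluster at 300 placements per second takes only a few seconds, so scheduler throughput is nowhere near the bottleneck in that regime.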

At the same time, as you note, we have function-style models for container execution that are approaching millions of dispatches per second for highly partitionable work, especially in data engineering and ETL.
