I'm surprised how many hardware issues they were having.

I am on call for a ~10k-node system, and this log looks pretty similar to my workload... Yet for this, Facebook had only 1% of the number of machines I look after! With far fewer machines, they should have far fewer failures!

I suspect they are doing a bad job of root-causing failures so that they never happen again. For example, that Nvidia infoROM message should have ended up with all the logs and a couple of troublesome boards sent to Nvidia engineering to find out why the corruption happens, how to make it never happen again, how to scan to find out if it has happened, how to auto-undo the corruption, etc.

The same goes for the InfiniBand bandwidth issues - get that stuff sent to someone who can hook up a logic analyzer or look at traces to find out exactly why it's happening, and adjust the design of the hardware, firmware or software to make sure it can't happen again and that you have good visibility of any future similar issues beyond just 'it's kinda slow, shrug'.