On compute clusters there are quite a few "exotic" things that can go wrong. The workload orchestrator is typically SLURM, which can throw errors of its own and has a million config options to get lost in.
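As a rough illustration of the babysitting involved, here's a minimal Python sketch that polls SLURM for nodes in a bad state. The sinfo invocation is standard, but the set of "bad" state codes and the idea of running this as a periodic check are my own assumptions, not anything prescribed by SLURM.

    import subprocess

    def unhealthy_nodes():
        # Ask SLURM for every node and its compact state code.
        out = subprocess.run(
            ["sinfo", "-N", "-h", "-o", "%N %t"],
            capture_output=True, text=True, check=True,
        ).stdout
        bad = {}
        for line in out.splitlines():
            node, state = line.split()
            # States like down*, drain, drng, fail usually mean trouble.
            if state.rstrip("*") in {"down", "drain", "drng", "fail", "maint"}:
                bad[node] = state
        return bad

    if __name__ == "__main__":
        for node, state in sorted(unhealthy_nodes().items()):
            print(f"{node}: {state}")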
Then you have storage, often tiered in three levels: job-temporary scratch storage on each node, a fast distributed filesystem with only a few weeks' retention, and external permanent storage attached somehow. Relatively often it's the middle layer here, Lustre or something similar, that throws a fit.
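To make the tiering concrete: a job usually computes against node-local scratch and then copies results up before the retention policy eats them. The paths below ($TMPDIR, /scratch, /archive) are placeholders for the example; every site names these mount points differently.

    import os
    import shutil
    from pathlib import Path

    # Hypothetical tier locations -- substitute your site's actual mount points.
    NODE_SCRATCH = Path(os.environ.get("TMPDIR", "/tmp"))      # wiped after the job
    PARALLEL_FS  = Path("/scratch") / os.environ["USER"]       # Lustre-like, weeks of retention
    PERMANENT    = Path("/archive") / os.environ["USER"]       # long-term storage

    def stage_out(job_name: str) -> None:
        """Copy results from node-local scratch up through the storage tiers."""
        results = NODE_SCRATCH / job_name
        # First hop: parallel filesystem, visible from all nodes.
        shutil.copytree(results, PARALLEL_FS / job_name, dirs_exist_ok=True)
        # Second hop: permanent storage, before the retention window closes.
        shutil.copytree(results, PERMANENT / job_name, dirs_exist_ok=True)

    if __name__ == "__main__":
        stage_out("my_simulation_run")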
Then you have the interconnect, which can be anything from super flaky to rock solid. I've seen fifteen-year-old setups be rock solid, and in one extreme example a brand-new system was so unstable that all the IB cards were shipped back to Mellanox and replaced under warranty with a previous-generation model. Reliability here usually follows something like a Weibull distribution with a decreasing failure rate: wrinkles are ironed out over time and the IB drivers become more robust for a particular HW model.
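The Weibull point is really about the hazard rate: with shape parameter k < 1 the failure rate falls over time, which matches the "early wrinkles get ironed out" pattern. A quick sketch, with arbitrary parameter values chosen only to show the shape:

    import math

    def weibull_hazard(t: float, k: float, lam: float) -> float:
        # Hazard function h(t) = (k/lam) * (t/lam)**(k-1).
        # For k < 1 this decreases with t: early failures dominate.
        return (k / lam) * (t / lam) ** (k - 1)

    # Illustrative only: shape 0.5, scale 12 (say, months in service).
    for month in (1, 3, 6, 12, 24):
        print(f"month {month:2d}: hazard {weibull_hazard(month, 0.5, 12):.3f}")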
Then you have the general hardware and drivers on each node. Typically there is extensive performance testing to establish the best compiler flags etc., as well as how to distribute work optimally for a given workload. Failures at this level are easier to deal with, in the sense that they typically affect just a couple of nodes, which you can take offline and fix while the rest keep running.
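Taking a node offline is a single scontrol call; the sketch below just wraps it from Python. The node name and drain reason are placeholders, and whether you automate this at all is a site-specific call.

    import subprocess

    def drain_node(node: str, reason: str) -> None:
        # Mark the node DRAIN: running jobs finish, nothing new lands on it.
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}",
             "State=DRAIN", f"Reason={reason}"],
            check=True,
        )

    def resume_node(node: str) -> None:
        # Put the node back into service once it's been fixed.
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}", "State=RESUME"],
            check=True,
        )

    if __name__ == "__main__":
        drain_node("node042", "ECC errors, pending DIMM swap")  # placeholder node and reason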