They're using preemptible CPUs/GPUs on Google Compute Engine for model training? Interesting. The big pro there is cost efficiency, which isn't something I expected OpenAI to be optimizing for. :P
How does training RL with preemptible VMs work when they can shut down at any time with no warning? A PM on that project asked me the same question a while ago (https://news.ycombinator.com/item?id=14728476), and I'm not sure model checkpointing works as well for RL. (Maybe after each episode?)
Cost efficiency is always important, regardless of your total resources.
The preemptibles are just used for the rollouts — i.e. to run copies of the model and the game. The training and parameter storage is not done with preemptibles.
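To make the fault-tolerance argument concrete, here's a minimal sketch (illustrative only, not OpenAI's actual code; all class and function names are made up) of why that split tolerates preemption: a preemptible worker only holds the partial episode it is currently playing, while the parameters and optimizer state live on machines that are never preempted.

    # Minimal sketch of the rollout-worker / optimizer split (illustrative only).
    import queue
    import numpy as np

    class ParameterStore:
        """Lives on a non-preemptible machine; the only copy of state that matters."""
        def __init__(self, dim):
            self.weights = np.zeros(dim)

        def latest(self):
            return self.weights.copy()

        def apply_gradient(self, grad, lr=0.01):
            self.weights -= lr * grad

    def play_episode(weights, rng):
        """Stand-in for running one game with the current policy."""
        obs = rng.normal(size=(16, weights.shape[0]))
        rewards = obs @ weights + rng.normal(size=16)
        return obs, rewards

    def rollout_worker(store, rollouts, rng):
        """Runs on a preemptible VM: if it dies, we lose one partial episode,
        never any training state."""
        obs, rewards = play_episode(store.latest(), rng)
        rollouts.put((obs, rewards))

    def optimizer_step(store, rollouts):
        """Runs on the non-preemptible trainer: consumes whatever rollouts arrive."""
        obs, rewards = rollouts.get()
        grad = -(obs.T @ rewards) / len(rewards)  # toy gradient, not PPO
        store.apply_gradient(grad)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        store, rollouts = ParameterStore(dim=8), queue.Queue()
        for _ in range(100):
            rollout_worker(store, rollouts, rng)  # in reality: thousands of VMs
            optimizer_step(store, rollouts)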
If these (or similar) experiments show that this network architecture is viable, the cost could be reduced substantially by developing even more specialized hardware.
One could also compare against the cost of custom-developing bots and AIs with other, more specialized techniques: sure, training this network may take more processing power, but it won't take as much specialized human effort to adapt it to a different task. In that case the human labor cost drops significantly, even if the initial compute cost is higher. So in a way you guys do actually optimize for cost efficiency.
Disclosure: I work on Google Cloud (and with OpenAI), though I'm not a PM :).
As gdb said below, the GPUs doing the training aren't preemptible. Just the workers running the game (which don't need GPUs).
I'm surprised you felt cost isn't interesting. While OpenAI has lots of cash, that doesn't mean they shouldn't do 3-5x more computing for the same budget. The 256 "optimizers" cost less than $400/hr, while the 128k worker cores would run over $6k/hr on regular (non-preemptible) VMs. So using preemptible is just the responsible choice :).
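Rough sanity check on those numbers, assuming roughly that era's GCE list prices (about $0.01/vCPU-hr preemptible vs $0.0475/vCPU-hr on-demand; exact rates vary by machine type and region, so treat these as ballpark):

    # Back-of-the-envelope cost comparison for the 128k worker cores
    # (assumed list prices, not official figures).
    worker_cores = 128_000
    preemptible_per_core_hr = 0.01    # assumed ~2018 preemptible price
    on_demand_per_core_hr = 0.0475    # assumed ~2018 on-demand price

    print(f"preemptible: ${worker_cores * preemptible_per_core_hr:,.0f}/hr")  # ~$1,280/hr
    print(f"on-demand:   ${worker_cores * on_demand_per_core_hr:,.0f}/hr")    # ~$6,080/hr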
There's lots of low-hanging fruit in any of these setups, and OpenAI is executing towards a deadline, so they need to optimize for their human time. That said, I did just encourage the team to consider checkpointing the DOTA state on preemption, to try to eke out even more utilization. Similarly, being tighter on the custom shapes is another 5-10% "easily".
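For context on how a worker would even find out it's about to be preempted: GCE gives preemptible VMs roughly a 30-second warning, which a process can observe through the instance's shutdown script or by polling the metadata server's preempted flag. A rough sketch of what checkpoint-on-preemption could look like (the metadata endpoint is real; checkpoint_rollout() is a hypothetical stand-in for serializing the in-progress game/episode state):

    # Hypothetical checkpoint-on-preemption loop for a rollout worker.
    import time
    import urllib.request

    PREEMPTED_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                     "instance/preempted")

    def is_preempted():
        req = urllib.request.Request(PREEMPTED_URL,
                                     headers={"Metadata-Flavor": "Google"})
        with urllib.request.urlopen(req, timeout=1) as resp:
            return resp.read().strip() == b"TRUE"

    def checkpoint_rollout():
        """Placeholder: dump the partial episode to durable storage (e.g. GCS)
        so a replacement worker can pick it up instead of discarding it."""
        pass

    def worker_loop():
        while True:
            if is_preempted():
                checkpoint_rollout()  # spend the ~30s grace period saving state
                return
            # ...otherwise keep stepping the game / collecting the rollout...
            time.sleep(1)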