It's possible and has been done, but it's super slow and inefficient, resulting in long training times for small models.
To keep compute occupied you need to pass gradients very fast.
Yes, but could you break it up into chunks of sets of gradients to compute? I know that compute needs the full chunk to compute a set. Again, these are things I'm exploring, but ultimately it's no different than just having the full dataset on disk and scaling out compute nodes in ro mode.
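
For a rough sense of why gradient passing speed matters for keeping compute occupied, here's a back-of-the-envelope sketch. Every number in it (model size, step time, link speeds) is a made-up assumption for illustration, and it ignores latency, gradient compression, and overlapping communication with compute.

```python
def sync_time_seconds(num_params, bytes_per_param, bandwidth_bytes_per_s):
    """Time to ship one full set of gradients over a link (latency ignored)."""
    return num_params * bytes_per_param / bandwidth_bytes_per_s


def compute_utilization(step_compute_s, sync_s):
    """Fraction of wall-clock time spent computing when sync is not overlapped."""
    return step_compute_s / (step_compute_s + sync_s)


# Hypothetical workload: 100M parameters, fp32 gradients, 0.5 s of compute per step.
num_params = 100_000_000
bytes_per_param = 4
step_compute_s = 0.5

# Hypothetical links, converted from bits/s to bytes/s.
links = {
    "100 Mbit/s link": 100e6 / 8,
    "100 Gbit/s link": 100e9 / 8,
}

for name, bandwidth in links.items():
    sync_s = sync_time_seconds(num_params, bytes_per_param, bandwidth)
    util = compute_utilization(step_compute_s, sync_s)
    print(f"{name}: sync {sync_s:.3f} s per step, compute busy {util:.1%}")
```

Under these assumptions the slow link spends almost all of each step waiting on gradients, while the fast link keeps compute mostly busy. Chunking the gradients doesn't change the total bytes that have to move per step, only when they move.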