We don't compute per-example gradients, so in your second code snippet there wou...

sillysaurusx · on Jan 31, 2021

Performance is rarely the issue, at least for us. The problem is, when the algorithms don’t work, then what?

A batch size of 1 != an average of larger batch sizes. It’s why the BigGAN paper reports “bigger batches = better FID.”

This proposal gives the advantage of a small batch size (and there are advantages) without sacrificing the option of large batches.

koningrobot · on Jan 31, 2021

You're right what you propose is not quite equivalent to batch size 1, as you don't update the parameters until processing the entire batch.

Still, having to process the examples in a batch sequentially seems like a very costly concession to make. Traditionally the reason to use batches has been because GPU-style parallelism makes them cheap. If you take away that reason by making the computation sequential, large batches become much harder to justify. Moreover it's not clear what you gain by making the computation sequential in this way -- do you think Adam actually has trouble keeping up with mean/variance of gradients so it needs more frequent updates? I would be surprised if so.