This style of documentation is called literate programming, and if you've never heard of the term, it's worth googling it and the implementations that exist for most widespread programming languages. It's an eye-opener how clear, transparent and well-intertwined good code and comments can be.
I've used such a literate programming style with scientific Python once in university classes and it was a breeze to prepare and hand in exercise sheets (rendered with LaTeX to PDF). My feeling is that today people use Jupyter/IPython notebooks to achieve something similar (especially the embedding of results), but a Jupyter notebook is much more complex than a traditional, clean, terminal-readable literate programming source file.
One of the problems with notebooks for literate programming is that the approach kind of breaks down when you define a class or a long function: the entire definition has to live in a single cell, so you can't interleave prose with it.
Or you could use org-mode with org-babel in Emacs to get a great document that lets you mix many different programming languages. The coffeescript URL already has an org in it. Coincidence? I think not! ;)
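If you've never seen it: an org file is plain text with prose around source blocks, and C-c C-c on a block executes it and inserts the result right below, which gets you most of the notebook feel while staying terminal-readable. A minimal sketch (the function is made up purely for illustration):

    * Generator loss
    Prose and LaTeX go here and get exported to HTML or PDF.
    #+BEGIN_SRC python :results output
    import math
    def generator_loss(p_fake):
        # toy stand-in for a real loss term, just for illustration
        return -math.log(p_fake)
    print(generator_loss(0.9))
    #+END_SRC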
I don't think this qualifies as literate programming, at least not in my book. Yes, it looks nice and everything, but from my point of view OP just moved regular code comments into a sidebar.
For instance, there are many comments that explain what values are stored in opaque-looking variables/parameters. But comments like that are simply necessary (especially in data science, where almost everything is a tensor), should be part of every decently documented program anyway, and don't turn this into literate programming just because they now sit in a sidebar.
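To be concrete, I mean shape annotations like these (a made-up snippet, not taken from OP's code):

    import math
    import torch

    def attention(query, key, value):
        # query/key/value: [batch_size, heads, seq_len, d_k]
        scores = query @ key.transpose(-2, -1) / math.sqrt(query.shape[-1])
        # scores: [batch_size, heads, seq_len, seq_len]
        return scores.softmax(dim=-1) @ value

Useful, absolutely -- but putting such comments in a sidebar doesn't make the result literate programming.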
Besides, there are a lot of comments that 1) just repeat verbatim what the code does, e.g.
    # Run through each layer
    for layer in self.layers:
(https://nn.labml.ai/gan/cycle_gan.html)
or that 2) explain what the code does only because the code, at times, is unnecessarily hard to read and/or poorly modularized (no offense, OP).
Both types, I'd say, belong to the sort of comments and documentation that can (and usually should) be avoided. They don't tell me more about the code than the code already tells me (or should/could tell me), and so they're not what I would expect from literate programming, either.
There is a whole set of other comment types that I would expect from literate programming, though, and these are mostly missing. There was an excellent article[0] a few years back by Salvatore Sanfilippo aka antirez (of Redis fame) where he identified the useful comment types: function comments, design comments, "why" comments, teacher comments, checklist comments and guide comments.
Now, the OP's code checks off one or two items on that list but only in parts and in a few places. Overall, looking at the many code snippets antirez's article presents from the Redis code base, I find Redis's style is much closer to my idea of literate programming than the OP's code.
(Again, I hope OP is not taking offense to my comment. I am aware that they didn't claim they were doing literate programming.)
EDIT: When I wrote my comment, I hadn't looked at all the pages/files and simply assumed the others I hadn't looked at would be similar. I've now noticed, though, that some of them do follow the literate programming style quite closely. Nice. :)
Something like this could be incredibly helpful with arXiv articles: being able to pinpoint which fragment of text or which formula corresponds to the actual implementation. That could save so much time and ping-ponging between the article and the code.
We don't compute per-example gradients, so in your second code snippet there would not be a loop across examples. We compute the batch-averaged gradient in the same time it would take to compute a single example's gradient, so it's much more efficient than your proposal, which is equivalent to using a batch size of 1.
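In PyTorch terms, roughly this (toy model just to illustrate the point, not our actual code):

    import torch

    model = torch.nn.Linear(4, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 4), torch.randn(32, 1)   # one batch of 32 examples

    loss = torch.nn.functional.mse_loss(model(x), y)  # 'mean' reduction over the batch
    loss.backward()   # one backward pass; p.grad already holds the batch-averaged gradient
    opt.step()
    opt.zero_grad()

The averaging happens inside the single backward pass, so there is never a loop over examples and no per-example gradients are materialized.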
You're right that what you propose is not quite equivalent to batch size 1, since you don't update the parameters until you've processed the entire batch.
Still, having to process the examples in a batch sequentially seems like a very costly concession to make. Traditionally the reason to use batches is that GPU-style parallelism makes them cheap; if you take that away by making the computation sequential, large batches become much harder to justify. Moreover, it's not clear what you gain by making the computation sequential in this way -- do you think Adam actually has trouble keeping up with the mean/variance of the gradients, such that it needs more frequent updates? I would be surprised if so.
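If I'm reading it right, the sequential part of the proposal amounts to something like gradient accumulation (toy sketch, leaving out whatever happens to the optimizer statistics per example):

    import torch

    model = torch.nn.Linear(4, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 4), torch.randn(32, 1)

    # sequential variant: one example at a time, gradients pile up in p.grad,
    # parameters are only updated once the whole batch has been seen
    for i in range(x.shape[0]):
        loss = torch.nn.functional.mse_loss(model(x[i:i+1]), y[i:i+1])
        (loss / x.shape[0]).backward()   # scale so the accumulated sum equals the batch mean
    opt.step()
    opt.zero_grad()

You end up with the same averaged gradient, but via 32 tiny forward/backward passes instead of one batched pass, which is exactly where the GPU parallelism (and the usual justification for large batches) goes.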
This seems a bit weird to me. You're accumulating statistics in an order-dependent way without updating the parameters, and you're doing k updates of the statistics with noisier estimates of the gradient. I'm not really a stats guy, but this doesn't seem like it would give better estimates of the Adam statistics, as you suggest. I'm sure it would have some impact, but it doesn't seem like it would beat simply tuning the beta hyperparameters so that the statistics incorporate gradient changes more quickly. But you probably just want to try it if you believe in it, not try to get traction for it in the HN comments.
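For anyone following along, the statistics in question are Adam's exponential moving averages of the gradient, and the betas set how quickly they forget old gradients. A schematic sketch (ignoring bias correction; 0.9/0.999 are the torch.optim.Adam defaults):

    import torch

    grad = torch.randn(10)                # stand-in for one batch-averaged gradient
    m, v = torch.zeros(10), torch.zeros(10)
    beta1, beta2 = 0.9, 0.999             # torch.optim.Adam defaults

    # Adam's running statistics, updated once per gradient it sees
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)

Lowering the betas already makes these estimates track recent gradients more quickly, which is why I'd reach for that knob before feeding the same statistics k noisier per-example updates.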