This style of documentation is called literate programming, and if you've never heard of the term, it's worth googling it and the implementations that exist for most widespread programming languages. It's an eye-opener how clear, transparent and well-intertwined good code and comments can be.
I've used such a literate programming style with scientific Python once in university classes and it was a breeze to prepare and hand in exercise sheets (rendered with LaTeX to PDF). My feeling is that today people use Jupyter/IPython notebooks to achieve something similar (especially the embedding of results), but a Jupyter notebook is much more complex than a traditional, clean, terminal-readable literate programming source file.
One of the problems with notebooks for literate programming is that the approach kind of breaks down when you define a class or a long function: the entire definition has to live in a single cell, so you can't interleave prose with it.
Or you could use org-mode with org-babel in Emacs to get a great document that lets you mix many different programming languages. The coffeescript URL already has an org in it. Coincidence? I think not! ;)
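If you've never seen it: an org file is plain text with prose around source blocks, and C-c C-c on a block executes it and inserts the result right below, which gets you most of the notebook feel while staying terminal-readable. A minimal sketch (the function is made up purely for illustration):

    * Generator loss
    Prose and LaTeX go here and get exported to HTML or PDF.
    #+BEGIN_SRC python :results output
    import math
    def generator_loss(p_fake):
        # toy stand-in for a real loss term, just for illustration
        return -math.log(p_fake)
    print(generator_loss(0.9))
    #+END_SRC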
I don't think this qualifies as literate programming, at least not in my book. Yes, it looks nice and everything, but from my point of view OP just moved regular code comments into a sidebar.
For instance, there are many comments that explain what values are stored in opaque-looking variables/parameters. But comments like that are simply necessary (especially in data science, where almost everything is a tensor), should be part of every decently documented program anyway, and don't turn this into literate programming just because they now sit in a sidebar.
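To be concrete, I mean shape annotations like these (a made-up snippet, not taken from OP's code):

    import math
    import torch

    def attention(query, key, value):
        # query/key/value: [batch_size, heads, seq_len, d_k]
        scores = query @ key.transpose(-2, -1) / math.sqrt(query.shape[-1])
        # scores: [batch_size, heads, seq_len, seq_len]
        return scores.softmax(dim=-1) @ value

Useful, absolutely -- but putting such comments in a sidebar doesn't make the result literate programming.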
Besides, there are a lot of comments that 1) just repeat verbatim what the code does, e.g.
    # Run through each layer
    for layer in self.layers:
(https://nn.labml.ai/gan/cycle_gan.html)
or that 2) explain what the code does only because the code, at times, is unnecessarily hard to read and/or poorly modularized (no offense, OP).
Both types, I'd say, belong to the sort of comments and documentation that can (and usually should) be avoided. They don't tell me more about the code than the code already tells me (or should/could tell me), and so they're not what I would expect from literate programming, either.
There is a whole set of other comment types that I would expect from literate programming, though, and these are mostly missing. There was an excellent article[0] a few years back by Salvatore Sanfilippo aka antirez (of Redis fame) where he identified the useful comment types: function comments, design comments, "why" comments, teacher comments, checklist comments and guide comments.
Now, the OP's code checks off one or two items on that list but only in parts and in a few places. Overall, looking at the many code snippets antirez's article presents from the Redis code base, I find Redis's style is much closer to my idea of literate programming than the OP's code.
(Again, I hope OP is not taking offense to my comment. I am aware that they didn't claim they were doing literate programming.)
EDIT: When I wrote my comment, I hadn't looked at all the pages/files and simply assumed the others I hadn't looked at would be similar. I've now noticed, though, that some of them do follow the literate programming style quite closely. Nice. :)
Something like this could be incredibly helpful with arXiv articles: being able to pinpoint which fragment of text or which formula corresponds to the actual implementation. That could save so much time and ping-ponging between the article and the code.
We don't compute per-example gradients, so in your second code snippet there would not be a loop across examples. We compute the batch-averaged gradient in the same time it would take to compute a single example's gradient, so it's much more efficient than your proposal, which is equivalent to using a batch size of 1.
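In PyTorch terms, roughly this (toy model just to illustrate the point, not our actual code):

    import torch

    model = torch.nn.Linear(4, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 4), torch.randn(32, 1)   # one batch of 32 examples

    loss = torch.nn.functional.mse_loss(model(x), y)  # 'mean' reduction over the batch
    loss.backward()   # one backward pass; p.grad already holds the batch-averaged gradient
    opt.step()
    opt.zero_grad()

The averaging happens inside the single backward pass, so there is never a loop over examples and no per-example gradients are materialized.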
You're right that what you propose is not quite equivalent to batch size 1, since you don't update the parameters until you've processed the entire batch.
Still, having to process the examples in a batch sequentially seems like a very costly concession to make. Traditionally the reason to use batches is that GPU-style parallelism makes them cheap; if you take that away by making the computation sequential, large batches become much harder to justify. Moreover, it's not clear what you gain by making the computation sequential in this way -- do you think Adam actually has trouble keeping up with the mean/variance of the gradients, such that it needs more frequent updates? I would be surprised if so.
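If I'm reading it right, the sequential part of the proposal amounts to something like gradient accumulation (toy sketch, leaving out whatever happens to the optimizer statistics per example):

    import torch

    model = torch.nn.Linear(4, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 4), torch.randn(32, 1)

    # sequential variant: one example at a time, gradients pile up in p.grad,
    # parameters are only updated once the whole batch has been seen
    for i in range(x.shape[0]):
        loss = torch.nn.functional.mse_loss(model(x[i:i+1]), y[i:i+1])
        (loss / x.shape[0]).backward()   # scale so the accumulated sum equals the batch mean
    opt.step()
    opt.zero_grad()

You end up with the same averaged gradient, but via 32 tiny forward/backward passes instead of one batched pass, which is exactly where the GPU parallelism (and the usual justification for large batches) goes.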
This seems a bit weird to me. You're accumulating statistics in an order-dependent way without updating the parameters, and you're doing k updates of the statistics with noisier estimates of the gradient. I'm not really a stats guy, but this doesn't seem like it would give better estimates of the Adam statistics, as you suggest. I'm sure it would have some impact, but it doesn't seem like it would beat simply tuning the beta hyperparameters so that the statistics incorporate gradient changes more quickly. But you probably just want to try it if you believe in it, not try to get traction for it in the HN comments.
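For anyone following along, the statistics in question are Adam's exponential moving averages of the gradient, and the betas set how quickly they forget old gradients. A schematic sketch (ignoring bias correction; 0.9/0.999 are the torch.optim.Adam defaults):

    import torch

    grad = torch.randn(10)                # stand-in for one batch-averaged gradient
    m, v = torch.zeros(10), torch.zeros(10)
    beta1, beta2 = 0.9, 0.999             # torch.optim.Adam defaults

    # Adam's running statistics, updated once per gradient it sees
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentered variance)

Lowering the betas already makes these estimates track recent gradients more quickly, which is why I'd reach for that knob before feeding the same statistics k noisier per-example updates.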