Well it depends what you mean by “best” :-) removing the linear layer is the eas...

mike_hearn · on Aug 11, 2023

Why does adding a ReLU create more layers and parameters? Isn't the total number of neurons the same?

hansvm · on Aug 12, 2023

The representational capacity of two consecutive linear layers is the same as one slightly different linear layer. The capacity when you introduce a relu into the mix is (up to a complexity defined by the number of parameters) any "nice" function -- including things like e^sin(x) -- not just linear functions. With two consecutive linear layers many of the weights and computations are redundant.

mike_hearn · on Aug 12, 2023

Right, I get that: it increases learning capacity, but doesn't introduce more parameters? Like the GPU requirements would be the same beyond the extra cost of the ReLU operation itself, yes?

spi · on Aug 14, 2023

Yes of course, sorry my write-up was confusing: I meant that "adding a ReLU between the two linear layers" (the second option) would result in more parameters than "directly removing the second linear layer" (the first option). And my message just meant "I don't know which of the two options achieves the best trade-off between speed and quality". I didn't consider the option "leave it as it is in the blog post" because it is essentially equivalent to the first option (removing the linear layer) but slower (as you say, with exactly the same number of parameters as the second option), so it definitely shouldn't be a "best" option.

mike_hearn · on Aug 17, 2023

Thank you!