
What is the guiding principle behind using SwiGLU instead of ReLU? Did the authors decide by simply trying all available nonlinearities, or is there a deeper reason?
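
For context, SwiGLU (from Shazeer's "GLU Variants Improve Transformer") replaces the plain ReLU feed-forward layer with a Swish-gated pair of linear projections. A minimal PyTorch-style sketch; the class and layer names and the sizes below are illustrative, not taken from the paper discussed in this thread:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SwiGLUFFN(nn.Module):
        """Feed-forward block with a SwiGLU activation."""
        def __init__(self, d_model: int, d_hidden: int):
            super().__init__()
            self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
            self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # value projection
            self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # project back to d_model

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # SiLU/Swish: x * sigmoid(x). The gated elementwise product
            # replaces ReLU(xW) in a standard transformer FFN.
            return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

    # usage with made-up dimensions
    ffn = SwiGLUFFN(d_model=512, d_hidden=1365)
    y = ffn(torch.randn(2, 16, 512))

The hidden width is often shrunk by roughly 2/3 so that the three projections cost about the same as the two projections of a plain ReLU FFN.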



Like a lot of research, unless there's a clear explanation supported by rigorous study, they probably randomly hill-climbed through a bunch of cool new one-liner changes and stopped when it was time to start writing the paper and doing ablation studies.


To be less glib: just wait until there are a bunch of papers picking SwiGLU over ReLU, and then you can stop handwringing. It doesn't really matter whether there was a super-specific, concrete, well-articulated reason that SwiGLU worked well for their particular approach. For now you're still going to use ReLU by default and quickly try SwiGLU regardless.

It's fine; I waited a while before adopting ReLU over tanh by default for all hidden, non-final (not outputting a probability) layers.
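
A minimal sketch of that convention, assuming a PyTorch-style stack with hypothetical layer sizes: ReLU on the hidden layers, and a sigmoid only on the final layer that outputs a probability:

    import torch.nn as nn

    # Made-up sizes; only the last layer gets a probability-producing activation.
    model = nn.Sequential(
        nn.Linear(64, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 1), nn.Sigmoid(),
    )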


Thanks a lot for your explanations :)



