
> Removed KL Divergence

Wait, how are they computing the loss?



Oh, it's the KL term, sorry - beta * KL, i.e. they set beta to 0.

The goal of it was to "force" the model not to stray too far from the original checkpoint, but it can hinder the model from learning new things.
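For concreteness, here's a minimal sketch of how the term typically shows up in a PPO/GRPO-style objective. The names (policy_logprobs, ref_logprobs, advantages) and the simple log-ratio KL estimator are illustrative, not taken from any particular codebase:

    import torch

    def rl_loss(policy_logprobs, ref_logprobs, advantages, beta=0.04):
        # Policy-gradient term: raise log-probs of tokens with positive
        # advantage, lower those with negative advantage.
        pg_loss = -(advantages * policy_logprobs).mean()

        # KL penalty against the frozen reference (the original checkpoint),
        # using the simple log-ratio estimator for illustration.
        kl = (policy_logprobs - ref_logprobs).mean()

        # With beta > 0 the policy is pulled back toward the reference.
        # "Removing the KL divergence" just means beta = 0 (or dropping the
        # term entirely), leaving only pg_loss.
        return pg_loss + beta * kl

In practice, setting beta to 0 also means you no longer need to run the frozen reference model for this term, which saves memory and a forward pass.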


It's become trendy to delete it. I say "trendy" because many papers delete it without offering any evidence that it is unnecessary.


It's just a penalty term in the loss that they drop.



