No questions. After just a quick skim, this paper looks like great work. The findings are remarkable, and they're presented in clear, to-the-point language.
I confess to being a bit shocked that, given the same number of parameters, training is 1.65x faster (whoa), generation is 9x faster (wait, what!?), and perplexity is better (a flawed measure, but still) -- all from a new form of "curriculum learning" and from adding position embeddings to the queries and keys but not the values.
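For the curious, here's roughly what that last trick looks like. This is a minimal PyTorch-style sketch of my own, not the authors' code; all names and sizes are made up. The point is just that the position embeddings feed the query and key projections, while the values stay position-free:

```python
# Minimal sketch (not the Shortformer implementation) of attention where
# position embeddings are added to queries and keys but not to values.
import math
import torch
import torch.nn.functional as F


def attention_with_positions(x, pos_emb, w_q, w_k, w_v):
    # Positions are added to the inputs of the query and key projections
    # only; the value projection sees the raw token states, so the
    # attention output carries no positional information.
    q = (x + pos_emb) @ w_q   # queries: positions added
    k = (x + pos_emb) @ w_k   # keys: positions added
    v = x @ w_v               # values: NO positions
    scores = (q @ k.T) / math.sqrt(q.size(-1))
    return F.softmax(scores, dim=-1) @ v


# Toy usage: 8 tokens, model dimension 16 (made-up sizes).
seq_len, dim = 8, 16
x = torch.randn(seq_len, dim)          # token representations
pos_emb = torch.randn(seq_len, dim)    # position embeddings
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = attention_with_positions(x, pos_emb, w_q, w_k, w_v)
print(out.shape)  # torch.Size([8, 16])
```

As I understand it, keeping the values position-free is what lets token representations be cached and reused during generation, which is where that big generation speedup comes from.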
And it's so nice to see new ideas and improvements that don't rely on yet more computation or yet more parameters (I'm looking at you, GPT-3).
Perplexity is just e^loss. It's a bad name for a confusing concept: loss. (Plotting e^loss is just another way of plotting the loss, after all.)
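Concretely (toy numbers, nothing from the paper):

```python
# Perplexity is the exponentiated average per-token cross-entropy
# (the loss, measured in nats), so an e^loss curve is the loss curve
# on a different scale. Purely illustrative values.
import math

loss = 3.0                    # average cross-entropy per token, in nats
perplexity = math.exp(loss)   # "perplexity" is just e^loss
print(perplexity)             # ~20.09
```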
Loss isn't the whole story -- the stretch of training where the loss drops most steeply often produces the worst-quality language models. You want a nice, gentle downward slope.
SubsimulatorGPT2 (https://reddit.com/r/subsimulatorgpt2) continued to improve in terms of human evaluation even though the loss stayed flat for over a week.
There's a summary of our paper on Twitter: https://twitter.com/OfirPress/status/1344387959563325442
And our code is on GitHub: https://github.com/ofirpress/shortformer