
Transformers are trained in parallel across the context: BERT processes all 512 tokens of a context at once, and GPT likewise feeds in many tokens per forward pass. This parallelism is what lets us scale to larger models. Older architectures such as RNNs couldn't be trained this way, which limited their power/quality.
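
A minimal sketch of that "whole context in parallel" point, assuming PyTorch (the sizes and layer choice here are illustrative, not how BERT/GPT are actually configured):

    # Illustrative PyTorch sketch: during training, one forward pass of a
    # transformer layer processes all 512 positions of the context at once.
    # A causal mask (GPT-style) only hides future tokens; it doesn't make
    # the computation serial.
    import torch
    import torch.nn as nn

    batch, ctx, d_model = 8, 512, 768
    x = torch.randn(batch, ctx, d_model)        # embeddings for 512 tokens

    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=12,
                                       batch_first=True)
    causal = nn.Transformer.generate_square_subsequent_mask(ctx)
    out = layer(x, src_mask=causal)             # one pass over all positions
    print(out.shape)                            # torch.Size([8, 512, 768])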


This is only sort of true, since you can still train RNNs (including LSTMs, etc.) in big batches, which is usually enough to make use of your GPU's parallel capabilities. The inherently serial part only applies to the length of your context. Transformer architectures thus happen to be helpful if you have lots of idle GPUs, such that you're actually constrained by not being able to parallelize along the context dimension.
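
A rough sketch of that distinction, assuming PyTorch (the names and sizes are made up for illustration): the batch dimension is parallel, but the loop over time steps is not.

    # Illustrative sketch: an LSTM cell still parallelizes over the batch,
    # but each time step needs the previous hidden state, so the loop over
    # the context length is inherently serial.
    import torch
    import torch.nn as nn

    batch, ctx, d = 256, 512, 128
    x = torch.randn(batch, ctx, d)
    cell = nn.LSTMCell(d, d)
    h = torch.zeros(batch, d)
    c = torch.zeros(batch, d)
    for t in range(ctx):                  # serial along the context axis
        h, c = cell(x[:, t, :], (h, c))   # but all 256 sequences at once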


In an RNN, the hidden states must be computed sequentially; in transformers, the attention mechanism breaks free of that sequential requirement. Transformers are therefore more amenable to parallelism and make the fullest use of GPUs (along the context axis as well as across the batch).
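
For concreteness, a bare-bones sketch of scaled dot-product attention (plain PyTorch tensors, single head, made-up sizes): every position attends to every other position via a couple of matrix multiplies, with no step-to-step dependency.

    # Illustrative sketch: attention for all 512 positions is computed at
    # once; there is no loop carrying a hidden state from one token to the
    # next, which is what makes it GPU-friendly along the context axis.
    import torch

    ctx, d = 512, 64
    Q = torch.randn(ctx, d)
    K = torch.randn(ctx, d)
    V = torch.randn(ctx, d)
    scores = Q @ K.T / d ** 0.5           # (512, 512): all pairs at once
    attn = torch.softmax(scores, dim=-1)
    out = attn @ V                        # (512, 64), no serial loop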


Ahh, that makes a lot of sense



