
Transformers are trained in parallel across the context: BERT processes all 512 tokens of a context at once, and GPT likewise feeds in many tokens per forward pass. This parallelism is what lets us scale to larger models. Older architectures such as RNNs couldn't be trained this way, which limited their power/quality.
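
A minimal sketch of that "whole context in parallel" point, assuming PyTorch (the sizes and layer choice here are illustrative, not how BERT/GPT are actually configured):

    # Illustrative PyTorch sketch: during training, one forward pass of a
    # transformer layer processes all 512 positions of the context at once.
    # A causal mask (GPT-style) only hides future tokens; it doesn't make
    # the computation serial.
    import torch
    import torch.nn as nn

    batch, ctx, d_model = 8, 512, 768
    x = torch.randn(batch, ctx, d_model)        # embeddings for 512 tokens

    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=12,
                                       batch_first=True)
    causal = nn.Transformer.generate_square_subsequent_mask(ctx)
    out = layer(x, src_mask=causal)             # one pass over all positions
    print(out.shape)                            # torch.Size([8, 512, 768])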


This is only sort of true, since you can still train RNNs (including LSTMs, etc.) in big batches, which is usually enough to make use of your GPU's parallel capabilities. The inherently serial part only applies to the length of your context. Transformer architectures thus happen to be helpful if you have lots of idle GPUs, such that you're actually constrained by not being able to parallelize along the context dimension.
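
A rough sketch of that distinction, assuming PyTorch (the names and sizes are made up for illustration): the batch dimension is parallel, but the loop over time steps is not.

    # Illustrative sketch: an LSTM cell still parallelizes over the batch,
    # but each time step needs the previous hidden state, so the loop over
    # the context length is inherently serial.
    import torch
    import torch.nn as nn

    batch, ctx, d = 256, 512, 128
    x = torch.randn(batch, ctx, d)
    cell = nn.LSTMCell(d, d)
    h = torch.zeros(batch, d)
    c = torch.zeros(batch, d)
    for t in range(ctx):                  # serial along the context axis
        h, c = cell(x[:, t, :], (h, c))   # but all 256 sequences at once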


In an RNN, the hidden states must be computed sequentially; in transformers, the attention mechanism breaks free of that sequential requirement. Transformers are therefore more amenable to parallelism and make the fullest use of GPUs (along the context axis as well as across the batch).
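
For concreteness, a bare-bones sketch of scaled dot-product attention (plain PyTorch tensors, single head, made-up sizes): every position attends to every other position via a couple of matrix multiplies, with no step-to-step dependency.

    # Illustrative sketch: attention for all 512 positions is computed at
    # once; there is no loop carrying a hidden state from one token to the
    # next, which is what makes it GPU-friendly along the context axis.
    import torch

    ctx, d = 512, 64
    Q = torch.randn(ctx, d)
    K = torch.randn(ctx, d)
    V = torch.randn(ctx, d)
    scores = Q @ K.T / d ** 0.5           # (512, 512): all pairs at once
    attn = torch.softmax(scores, dim=-1)
    out = attn @ V                        # (512, 64), no serial loop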


Ahh, that makes a lot of sense



