
This is an implementation of GPT using the PyTorch library. It is not meant to be the shortest implementation of a trainable GPT, but the code is very clean. PyTorch does a lot of the heavy lifting, especially for training on multiple GPUs. This implementation only supports distributed data parallel (DDP) training, which replicates the full model on every GPU, so one could not train models the size of GPT-4 with it out of the box; that would require model or pipeline parallelism.
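For context, here is a minimal sketch of the DDP setup pattern the comment is describing; the model and batch are placeholders rather than the repo's actual code, and it assumes a launch via `torchrun --nproc_per_node=<num_gpus> train.py`:

  import os
  import torch
  import torch.distributed as dist
  import torch.nn as nn
  from torch.nn.parallel import DistributedDataParallel as DDP

  def main():
      # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
      dist.init_process_group(backend="nccl")
      local_rank = int(os.environ["LOCAL_RANK"])
      torch.cuda.set_device(local_rank)

      # Placeholder model; in the real repo this would be the GPT module.
      model = nn.Linear(768, 768).to(local_rank)

      # DDP replicates the model on every GPU and all-reduces gradients,
      # so each replica must hold the full model: data parallelism only.
      model = DDP(model, device_ids=[local_rank])

      optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
      x = torch.randn(8, 768, device=local_rank)  # dummy batch
      loss = model(x).pow(2).mean()
      loss.backward()  # gradients are synchronized across ranks here
      optimizer.step()

      dist.destroy_process_group()

  if __name__ == "__main__":
      main()

Because every rank holds a complete copy of the model, DDP scales throughput with the number of GPUs but never raises the maximum model size beyond what fits on one device, which is exactly why GPT-4-scale training is out of reach here.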


Perhaps they were thinking of https://github.com/karpathy/micrograd




