
As the model weights (even quantized) would be several hundred GBs, it’s unlikely, unless special inference code is written that loads only a small subset of the weights into memory at a time and processes the model piece by piece. But running it that way would be painfully slow.
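
Very roughly, that "small subset at a time" approach looks like the sketch below: each layer's weights stay on disk until the moment they are needed, then get freed again. The file layout and helper names here are made up purely for illustration.

    # Hypothetical sketch: stream one layer's weights off disk at a time,
    # run it, then free it before touching the next layer.
    import torch

    def streamed_forward(x, layer_files, build_layer):
        # layer_files: per-layer checkpoint paths; build_layer: constructs an
        # empty layer module (both are assumptions for this illustration).
        for path in layer_files:
            state = torch.load(path, map_location="cpu")  # read this layer's weights
            layer = build_layer()
            layer.load_state_dict(state)
            with torch.no_grad():
                x = layer(x)
            del layer, state  # drop the weights before loading the next layer
        return x

Every forward pass re-reads essentially the whole checkpoint from disk, which is why it's painfully slow: disk bandwidth, not compute, becomes the bottleneck.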


The code is already there: DeepSpeed
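
For reference, the relevant piece of DeepSpeed is ZeRO stage 3 with parameter offload ("ZeRO-Inference"), which keeps the weights in CPU RAM or on NVMe and streams each layer to the GPU only while it executes. A hedged sketch, not a tested recipe; config keys can vary between DeepSpeed versions, and the model below is just a stand-in:

    import deepspeed
    import torch
    import torch.nn as nn

    # Stand-in for a model far too large to fit in GPU memory.
    model = nn.Sequential(*[nn.Linear(8192, 8192) for _ in range(4)])

    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {
            "stage": 3,                  # partition parameters (ZeRO-3)
            "offload_param": {
                "device": "cpu",         # or "nvme" plus an nvme_path
                "pin_memory": True,
            },
        },
    }

    engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
    engine.module.eval()
    with torch.no_grad():
        out = engine(torch.randn(1, 8192, device=engine.device))

The trade-off is the same as above: the weights have to cross the PCIe (or NVMe) link for every layer on every forward pass, so throughput is far below keeping the model resident on the GPU.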




