Depends on which model, but assuming the largest: 175B parameters * 16 bits (2 bytes) each = 350GB. Half of that if it's quantized to 8 bits. Good luck finding a GPU that can fit that in memory.
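As a quick back-of-the-envelope sketch in Python (weights only; activations, KV cache, and any optimizer state come on top):

    # Weights-only footprint (ignores activations, KV cache, optimizer state).
    def weight_gb(n_params, bits_per_param):
        return n_params * bits_per_param / 8 / 1e9

    for bits in (16, 8):
        print(f"175B params @ {bits}-bit: {weight_gb(175e9, bits):.0f} GB")
    # 175B params @ 16-bit: 350 GB
    # 175B params @ 8-bit: 175 GB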
To run it at a reasonable speed, yes. Computing even a single word requires touching all of the parameters, so if they don't fit in GPU memory you have to re-transfer all those gigabytes to the GPU for every forward pass. That's a severe performance hit: the host-to-GPU bandwidth becomes the bottleneck, so you can't keep the compute busy, and inference on even a single example takes many seconds purely because of the data movement.
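Rough numbers, assuming a PCIe 4.0 x16 link at around 32 GB/s of usable bandwidth (the exact figure depends on your hardware):

    # Lower bound on time per forward pass if the full fp16 weights have to be
    # streamed from host memory to the GPU every time.
    weights_gb = 350
    pcie_gb_per_s = 32   # assumed: PCIe 4.0 x16, ~32 GB/s in practice
    print(f"~{weights_gb / pcie_gb_per_s:.0f} s spent just moving weights per pass")
    # ~11 s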
The GPT-3 paper itself just mentions that they used a cluster of V100 GPUs, presumably with 32GB each, but doesn't go into detail about the setup. IMHO you'd want a chain of GPUs, each holding part of the parameters, passing only the (much, much smaller) intermediate activations to the next GPU, instead of having a single GPU reload the full parameter set for each part of the model; and a proper NVLink cluster gets an order of magnitude faster interconnect than the PCIe link between the GPU and your main memory.
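To make the "chain of GPUs" idea concrete, here's a minimal PyTorch sketch of a hypothetical toy stack with layers hand-placed on four devices; real systems use pipeline-parallel frameworks rather than doing this by hand:

    import torch
    import torch.nn as nn

    # Toy 4-GPU pipeline: each device holds a contiguous slice of the layers,
    # and only the (comparatively tiny) activations cross the interconnect.
    class PipelinedStack(nn.Module):
        def __init__(self, n_layers=48, d_model=4096, n_heads=32, n_devices=4):
            super().__init__()
            self.devices = [torch.device(f"cuda:{i}") for i in range(n_devices)]
            per_device = n_layers // n_devices
            self.stages = nn.ModuleList(
                nn.Sequential(*[
                    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                    for _ in range(per_device)
                ]).to(dev)
                for dev in self.devices
            )

        def forward(self, x):
            for dev, stage in zip(self.devices, self.stages):
                x = stage(x.to(dev))   # hop to the next device, run its layers
            return x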
So this is not going to be a model that's usable on cheap hardware. It's effectively open only to organizations that can afford to plop down a $100k compute cluster for their $x00k/yr engineers to work with.
Exactly! This is called "model parallelism" - each layer of the graph is spread across multiple compute devices. Large clusters like the V100s or the forthcoming trn1 instances (disclosure, I work on this team) need _stupid_ amounts of inter-device bandwidth, particularly for training.
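As a toy illustration of what spreading a single layer across devices looks like, here's a hypothetical column-parallel matmul in PyTorch; real Megatron-style tensor parallelism keeps the shards on-device and uses collective ops instead of shuttling partial results through the host:

    import torch

    # Toy column-parallel linear layer: each device holds a vertical slice of W,
    # computes its slice of the output, and the slices are concatenated at the end.
    def column_parallel_linear(x, weight_shards, devices):
        outs = []
        for w, dev in zip(weight_shards, devices):
            outs.append((x.to(dev) @ w).cpu())   # partial output from each device
        return torch.cat(outs, dim=-1)

    d_in, d_out, n_dev = 4096, 16384, 4
    devices = [torch.device(f"cuda:{i}") for i in range(n_dev)]
    shards = [torch.randn(d_in, d_out // n_dev, device=d) for d in devices]
    y = column_parallel_linear(torch.randn(8, d_in), shards, devices)
    print(y.shape)   # torch.Size([8, 16384])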
NVLink also gives you memory pooling; 8*32GB just baaarely fits the model. NVBus is the public version of an InfiniBand interconnect that supports V-RDMA (which people have been doing for years), which in turn allows distributed execution using pydist or Megatron (or DeepSpeed). So it's probably similar infrastructure to Nvidia's own supercomputers, since that's what everyone built before Nvidia started selling them.
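For a flavor of what those frameworks sit on top of, here's a bare-bones NCCL all-reduce via torch.distributed (a hypothetical single-node, 8-GPU run launched with torchrun); the parallelism strategies are layered over collectives like this, and the interconnect is what makes them fast:

    import torch
    import torch.distributed as dist

    # Minimal NCCL all-reduce. Launch with: torchrun --nproc_per_node=8 this_file.py
    def main():
        dist.init_process_group(backend="nccl")
        rank = dist.get_rank()
        torch.cuda.set_device(rank)
        t = torch.full((1024, 1024), float(rank), device="cuda")
        dist.all_reduce(t, op=dist.ReduceOp.SUM)   # every rank ends up with the sum
        if rank == 0:
            print(t[0, 0].item())   # sum of ranks 0..7 = 28.0 with 8 GPUs
        dist.destroy_process_group()

    if __name__ == "__main__":
        main()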
Someone can correct me if I'm wrong, but "30B parameters" means 30B individual weight values spread across the model's matrices, and assuming they're all 16-bit numbers, that's 2 bytes * 30B = 60GB.
175B * 16 bits = 350GB, but it does compress a bit.
GPT-J-6B, which you can download at https://github.com/kingoflolz/mesh-transformer-jax, is 6B parameters but weighs 9GB as a download. It decompresses to 12GB, as expected. Assuming the same compression ratio, GPT-3's download size would be about 263GB, not 350GB.
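That extrapolation in a few lines of Python, using the numbers above:

    # Extrapolating GPT-J-6B's checkpoint compression ratio to a 175B model.
    gptj_download_gb, gptj_raw_gb = 9, 12       # numbers from the comment above
    ratio = gptj_download_gb / gptj_raw_gb      # 0.75
    gpt3_raw_gb = 175e9 * 2 / 1e9               # 350 GB at 2 bytes/param
    print(f"estimated download: ~{gpt3_raw_gb * ratio:.1f} GB")   # ~262.5 GB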