Out of curiosity, what's the file size on that?


Depends on which model, but assuming the largest: 175B parameters * 16 bits (2 bytes) each = 350GB. Half that if it's quantized to 8 bits. Good luck finding a GPU that can fit that in memory.
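
A quick back-of-the-envelope in Python (assuming dense weights only, ignoring optimizer state and activation memory, which would add more on top):

    # Raw size of the weights alone: parameter count * bytes per parameter.
    def weights_gb(params_billion, bits_per_param=16):
        return params_billion * 1e9 * (bits_per_param / 8) / 1e9

    print(weights_gb(175))      # 350.0 GB at 16 bits per parameter
    print(weights_gb(175, 8))   # 175.0 GB quantized to 8 bits
    print(weights_gb(175, 32))  # 700.0 GB if you wanted full fp32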


Does the model need to be in memory in order to run it with current tooling?


To run it at a reasonable speed, yes. Computing even a single word requires all of the parameters; if they don't fit in GPU memory, you'd have to re-transfer all those gigabytes to the GPU for every forward pass. That's a severe performance hit: you can't fully use your compute power because the transfer bandwidth becomes the bottleneck, so running inference on even a single example would take many seconds just because of the bandwidth limitations.
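
Rough numbers, assuming around 25 GB/s of usable host-to-device bandwidth on a PCIe 4.0 x16 link (real throughput varies):

    # If the weights don't fit on the GPU, every forward pass has to stream
    # them over the host-to-device link, so latency is bounded below by
    # model_size / link_bandwidth, before any compute happens at all.
    model_size_gb = 350    # GPT-3 175B at 16 bits per parameter
    pcie_gb_per_s = 25     # assumed usable PCIe 4.0 x16 bandwidth
    print(model_size_gb / pcie_gb_per_s, "seconds per pass, minimum")  # 14.0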

The GPT-3 paper itself just mentions that they use a cluster of V100 GPUs, presumably with 32GB each, but doesn't go into detail about the structure. IMHO you'd want a chain of GPUs, each holding part of the parameters and transferring only the (much, much smaller) intermediate activations to the next GPU, instead of having a single GPU reload the full parameter set for each part of the model; and a proper NVLink cluster can get an order of magnitude faster interconnect than the PCIe link between a GPU and your main memory.

So this is not going to be a model that's usable on cheap hardware. It's effectively open only to organizations that can afford to plop down a $100k compute cluster for their $x00k/yr engineers to work with.


Exactly! This is called "model parallelism": each layer of the graph is spread across multiple compute devices. Large clusters of V100s, or the forthcoming trn1 instances (disclosure: I work on that team), need _stupid_ amounts of inter-device bandwidth, particularly for training.
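
A minimal PyTorch sketch of the idea, with a hypothetical two-stage toy model (real systems like Megatron also split individual layers across devices; this only splits between layers):

    import torch
    import torch.nn as nn

    class TwoStageModel(nn.Module):
        # Toy model parallelism: the first half of the layers lives on cuda:0,
        # the second half on cuda:1. Only the small activation tensor crosses
        # the interconnect; the weights never move.
        def __init__(self, d=4096):
            super().__init__()
            self.stage0 = nn.Sequential(nn.Linear(d, d), nn.ReLU()).to("cuda:0")
            self.stage1 = nn.Sequential(nn.Linear(d, d), nn.ReLU()).to("cuda:1")

        def forward(self, x):
            h = self.stage0(x.to("cuda:0"))
            return self.stage1(h.to("cuda:1"))

    model = TwoStageModel()
    out = model(torch.randn(8, 4096))  # needs two CUDA devices to actually run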


The following is entirely speculation.

NVLink also gives you memory pooling; 8*32GB just baaarely fits the model. NVBus is the public version of an InfiniBand interconnect enabling V-RDMA (which people have been doing for years), which would then allow distributed execution using pydist or Megatron (or DeepSpeed). So it's probably infrastructure similar to Nvidia's own supercomputers, since that's what everyone built before Nvidia started selling them.


I wonder if a 64GB Orin or M1 Max could fit the 30B model...


Someone can correct me if I'm wrong, but "30B parameters" means the model's weights total 30B numbers, and assuming they're all 16-bit, that's 2 bytes * 30B = 60GB.
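
Quick check in Python (ignoring activation and KV-cache memory, which add more on top):

    params = 30e9
    print(params * 2 / 1e9)  # 60.0 GB at 16 bits per parameter: too big for a 64GB device once anything else needs memory
    print(params * 1 / 1e9)  # 30.0 GB at 8 bits, if quantization is acceptable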


175B * 16 bits = 350GB, but it does compress a bit.

GPT-J-6B, which you can download at https://github.com/kingoflolz/mesh-transformer-jax, is 6B parameters but the download weighs 9GB. It decompresses to 12GB, as expected. Assuming the same compression ratio, the download size would be 263GB, not 350GB.
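
The arithmetic behind that estimate, assuming the same 9/12 compression ratio holds for the larger model:

    raw_gb = 175e9 * 2 / 1e9   # 350 GB of 16-bit weights
    ratio = 9 / 12             # observed for the GPT-J-6B download
    print(raw_gb * ratio)      # 262.5, i.e. roughly 263 GB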



