
Just as an FYI/additional data point, I bought a 3090 FE from eBay a few months ago for £605 including delivery.

I've only just started using it to run Llama locally on my computer at home and I have to say... colour me impressed.

It generates the output slightly faster than reading speed so for me it works perfectly well.
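
(Back-of-the-envelope, with rough numbers: people read at something like 200-300 words per minute, i.e. 3-5 words per second, and a token is roughly 0.75 English words on average, so anything above about 5-7 tokens/s is already at or past reading speed.)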

The 24GB of VRAM should keep it relevant for a bit too, and I can always buy another and NVLink them should the need arise.



> The 24GB of VRAM should keep it relevant for a bit too

If anything, I think models are going to shrink a bit, because assumptions around small models reaching capacity during training don’t seem fully accurate in practice[0]. We’re already starting to see some effects, like Phi-1[1] (a 1.3B code model outperforming 15B+ models) and BTLM-3B-8K[2] (a 3B model outperforming 7B models).

[0]: https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...

[1]: https://arxiv.org/pdf/2306.11644.pdf

[2]: https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a...


We had a long phase of "models aren't good enough but get better if we make them bigger, let's see how far we can go". This year we finally reached "some models are pretty great, let's see if we can do the same with smaller models". I'm excited for where this will take us.


Is there any way to compute the "capacity" of a model? In theory, if it's encoding all data with 100% efficiency, I guess the data stored in the model should be something like 2^(parameter count) (weights + biases)?


There’s a theoretical, but impractical, way: for a given model, each possible set of weight/bias values yields a specific loss value when run against the full corpus. There’s at least one set of weight values which minimizes it, and for that set the idealized bits-per-byte entropy can be computed.

That can be compared to what OpenAI’s scaling law paper[0] calls the “entropy of natural language”, which they estimate at about 0.57 bits per byte, based on the differing power laws for data vs. compute. In my mind, that says more about the imprecision of the approach than about the information-theoretic content of language: an omniscient being would predict things better, so the closest thing to the true entropy would be computed from the list of matching text prefixes among all texts ever.

[0]: https://arxiv.org/pdf/2001.08361
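
For anyone who wants to play with the numbers, here is a minimal sketch of the bits-per-byte conversion (the loss, token count and byte count below are made-up placeholders, not figures from the paper):

    import math

    # Placeholder numbers, purely for illustration.
    loss_nats_per_token = 2.2      # average cross-entropy over the corpus
    tokens_in_corpus = 1_000_000   # how many tokens that loss was averaged over
    bytes_in_corpus = 4_200_000    # raw UTF-8 size of the same text

    # Convert nats -> bits, then re-normalize from per-token to per-byte.
    bits_per_token = loss_nats_per_token / math.log(2)
    bits_per_byte = bits_per_token * tokens_in_corpus / bytes_in_corpus

    print(f"{bits_per_byte:.3f} bits per byte")  # ~0.756 with these made-up numbers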


Thanks for the explanation!

> should be computed from the list of matching text prefixes among all texts ever

I initially thought that value would be pretty low (the space of possible things you can say), but it's probably effectively unbounded. Even so, in practice we don't say that many different things, and we use a very limited subset of the words in the dictionary.


Anyone with experience running two linked consumer GPUs want to chime in on how well this works in practice?


You get a fast link between the GPUs, which should help when you’ve got a model split between them.

However, that split isn’t automatic. You can’t expect to run a 40GB model on that setup unless the software has been designed for it (the way llama.cpp can split a model between the GPU and CPU, for instance).

What you can do without trouble is keep more models loaded, do more things at the same time, and occasionally run the same model at double speed if it batches well.
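
If you do want to spread a single model across both cards, the knobs are exposed; a minimal sketch, assuming a CUDA build of llama-cpp-python and a local GGUF file (the path below is a placeholder):

    from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

    llm = Llama(
        model_path="models/llama-33b.Q4_K_M.gguf",  # placeholder filename
        n_gpu_layers=-1,          # offload every layer to the GPUs
        tensor_split=[0.5, 0.5],  # put half the weights on each card
    )

    out = llm("The capital of France is", max_tokens=8)
    print(out["choices"][0]["text"])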


CUDA multi-GPU with NVLink is pretty well tested with a shared memory space. You still want NCCL to get efficient communication between the cards, but many CUDA-aware libraries (and the ML tools built on them) handle this.
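
If you want to sanity-check whether the two cards can actually reach each other directly (NVLink, or PCIe P2P without the bridge), PyTorch exposes a query for it; a quick sketch assuming both GPUs are visible:

    import torch

    assert torch.cuda.device_count() >= 2, "need two visible GPUs"

    # True means device 0 can read/write device 1's memory directly
    # (over NVLink if the bridge is installed, otherwise PCIe P2P).
    print(torch.cuda.can_device_access_peer(0, 1))
    print(torch.cuda.can_device_access_peer(1, 0))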


This is incorrect if you are talking about a 3090 or 3090 Ti using NVLink.


You mean those would work like a single virtual GPU with 48GB of VRAM?


No. But PyTorch will make use of both GPUs and an NVLink bridge if you use its model parallel and distributed data parallel approaches.
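
For reference, the model-parallel version is just a matter of pinning submodules to different devices and moving the activations between them; a toy sketch with a made-up two-block network:

    import torch
    import torch.nn as nn

    class TwoGpuNet(nn.Module):
        """Toy model parallelism: first half on cuda:0, second half on cuda:1."""
        def __init__(self):
            super().__init__()
            self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
            self.part2 = nn.Linear(4096, 1024).to("cuda:1")

        def forward(self, x):
            x = self.part1(x.to("cuda:0"))
            # Activations hop between the cards here; this is the transfer
            # that a fast NVLink bridge speeds up.
            return self.part2(x.to("cuda:1"))

    net = TwoGpuNet()
    out = net(torch.randn(8, 1024))
    print(out.shape, out.device)  # torch.Size([8, 1024]) cuda:1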


I think you need enterprise-grade cards to make that work. If I remember correctly, consumer cards with NVLink can't pool their memory to host a 40GB model in VRAM.


I bought a used 3090 FE from eBay for $600 too! Mine is missing the connector latch, but it seems to be firmly seated, so I think the fire risk is negligible.

I went with the 3090 because I wanted the most VRAM for the buck, and the price of new GPUs is insane. Most GPUs in the $500-1500 range, even the Quadros and A series, don’t have anywhere near 24GB of VRAM.


> It generates the output slightly faster than reading speed

For 33b? It should be much faster.

What stack are you running? llama.cpp and ExLlama are SOTA as far as I know.



