
Just as an FYI/additional data point, I bought a 3090 FE from eBay a few months ago for £605 including delivery.

I've only just started using it to run Llama locally on my computer at home and I have to say... colour me impressed.

It generates the output slightly faster than reading speed so for me it works perfectly well.
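
(Back-of-the-envelope, with rough numbers: people read at something like 200-300 words per minute, i.e. 3-5 words per second, and a token is roughly 0.75 English words on average, so anything above about 5-7 tokens/s is already at or past reading speed.)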

The 24GB of VRAM should keep it relevant for a bit too, and I can always buy another and NVLink them should the need arise.



> The 24GB of VRAM should keep it relevant for a bit too

If anything, I think models are going to shrink a bit, because assumptions around small models reaching capacity during training don’t seem fully accurate in practice[0]. We’re already starting to see some effects, like Phi-1[1] (a 1.3B code model outperforming 15B+ models) and BTLM-3B-8K[2] (a 3B model outperforming 7B models).

[0]: https://espadrine.github.io/blog/posts/chinchilla-s-death.ht...

[1]: https://arxiv.org/pdf/2306.11644.pdf

[2]: https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a...


We had a long phase of "models aren't good enough but get better if we make them bigger, let's see how far we can go". This year we finally reached "some models are pretty great, let's see if we can do the same with smaller models". I'm excited for where this will take us.


Is there any way to compute the "capacity" of a model? In theory, if it's encoding all data with 100% efficiency, I guess the data stored in the model should be something like 2^(parameter count) (weights + biases)?


There’s a theoretical, but impractical, way: for a given model, each possible set of weight/bias values yields a specific loss value when run against the full corpus. There’s at least one set of weight values which minimizes it, and for that set the idealized bits-per-byte entropy can be computed.

That can be compared to what OpenAI’s scaling law paper[0] calls the “entropy of natural language”, which they estimate at about 0.57 bits per byte, based on the differing power laws for data vs. compute. In my mind, that says more about the imprecision of the approach than about the information-theoretic content of language: an omniscient being would predict things better, so the closest thing to the true entropy would be computed from the list of matching text prefixes among all texts ever.

[0]: https://arxiv.org/pdf/2001.08361
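
For anyone who wants to play with the numbers, here is a minimal sketch of the bits-per-byte conversion (the loss, token count and byte count below are made-up placeholders, not figures from the paper):

    import math

    # Placeholder numbers, purely for illustration.
    loss_nats_per_token = 2.2      # average cross-entropy over the corpus
    tokens_in_corpus = 1_000_000   # how many tokens that loss was averaged over
    bytes_in_corpus = 4_200_000    # raw UTF-8 size of the same text

    # Convert nats -> bits, then re-normalize from per-token to per-byte.
    bits_per_token = loss_nats_per_token / math.log(2)
    bits_per_byte = bits_per_token * tokens_in_corpus / bytes_in_corpus

    print(f"{bits_per_byte:.3f} bits per byte")  # ~0.756 with these made-up numbers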


Thanks for the explanation!

> should be computed from the list of matching text prefixes among all texts ever

I initially thought that value would be pretty low (the space of possible things you can say), but it's probably effectively unbounded. Even so, in practice we don't say that many different things, and we use a very limited subset of the words in the dictionary.


Anyone with experience running two linked consumer GPUs want to chime in on how well this works in practice?


You get a fast link between the GPUs, which should help when you’ve got a model split between them.

However, that split isn’t automatic. You can’t expect to run a 40GB model on that setup unless the software has been designed for it (the way llama.cpp can split a model between the GPU and CPU, for instance).

What you can do without trouble is keep more models loaded, do more things at the same time, and occasionally run the same model at double speed if it batches well.
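
If you do want to spread a single model across both cards, the knobs are exposed; a minimal sketch, assuming a CUDA build of llama-cpp-python and a local GGUF file (the path below is a placeholder):

    from llama_cpp import Llama  # pip install llama-cpp-python (CUDA build)

    llm = Llama(
        model_path="models/llama-33b.Q4_K_M.gguf",  # placeholder filename
        n_gpu_layers=-1,          # offload every layer to the GPUs
        tensor_split=[0.5, 0.5],  # put half the weights on each card
    )

    out = llm("The capital of France is", max_tokens=8)
    print(out["choices"][0]["text"])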


CUDA multi-GPU with NVLink is pretty well tested with a shared memory space. You still want NCCL to get efficient communication between the cards, but many CUDA-aware libraries (and the ML tools built on them) handle this.
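
If you want to sanity-check whether the two cards can actually reach each other directly (NVLink, or PCIe P2P without the bridge), PyTorch exposes a query for it; a quick sketch assuming both GPUs are visible:

    import torch

    assert torch.cuda.device_count() >= 2, "need two visible GPUs"

    # True means device 0 can read/write device 1's memory directly
    # (over NVLink if the bridge is installed, otherwise PCIe P2P).
    print(torch.cuda.can_device_access_peer(0, 1))
    print(torch.cuda.can_device_access_peer(1, 0))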


This is incorrect if you are talking about a 3090 or 3090 Ti using NVLink.


You mean those would work like a single virtual GPU with 48GB of VRAM?


No. But PyTorch will make use of both GPUs and an NVLink bridge if you use its model parallel and distributed data parallel approaches.
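
For reference, the model-parallel version is just a matter of pinning submodules to different devices and moving the activations between them; a toy sketch with a made-up two-block network:

    import torch
    import torch.nn as nn

    class TwoGpuNet(nn.Module):
        """Toy model parallelism: first half on cuda:0, second half on cuda:1."""
        def __init__(self):
            super().__init__()
            self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
            self.part2 = nn.Linear(4096, 1024).to("cuda:1")

        def forward(self, x):
            x = self.part1(x.to("cuda:0"))
            # Activations hop between the cards here; this is the transfer
            # that a fast NVLink bridge speeds up.
            return self.part2(x.to("cuda:1"))

    net = TwoGpuNet()
    out = net(torch.randn(8, 1024))
    print(out.shape, out.device)  # torch.Size([8, 1024]) cuda:1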


I think you need enterprise-grade cards to make that work. If I remember correctly, consumer cards with NVLink can't pool their memory to host a 40GB model in VRAM.


I bought a used 3090 FE from eBay for $600 too! Mine is missing the connector latch, but it seems to be firmly seated, so I think the fire risk is negligible.

I went with the 3090 because I wanted the most VRAM for the buck, and the price of new GPUs is insane. Most GPUs in the $500-1500 range, even the Quadros and A series, don’t have anywhere near 24GB of VRAM.


> It generates the output slightly faster than reading speed

For 33b? It should be much faster.

What stack are you running? llama.cpp and ExLlama are SOTA as far as I know.



