> not ... web/db servers, lightweight stuff like that.
They scale very well for web and db servers as well. You just put lots of containers/VMs on a single server.
AMD EPYC has a separate core design specifically for such workloads. It's a bit weaker, runs at lower frequency and power, and takes less silicon area. This way AMD can put more such cores on a single CPU (192 Zen 5c cores vs 128 Zen 5). So it's the other way round - web servers love high-core-count CPUs.
not really - you can certainly put lots of lightweight services on it, but they don't scale well, because each core doesn't really get that much cache or memory bandwidth. it's not bad, just not better.
Not true. You should look up Siena chips and something like the ASUS S14NA-U12. It has six DDR5-4800 channels, two physical PCIe 5.0 slots, two M.2 ports, and six MCIO x8 ports. All lanes are full-bandwidth. The 8434PN CPU gets you 48 physical cores in a 150W envelope. Zen 4c really is magic, and there's LOTS of bandwidth to play with.
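A quick back-of-envelope check of the bandwidth claim (channel count from the board description above; the per-core split over 48 cores is my own illustrative assumption):

```python
# Theoretical peak memory bandwidth of a 6-channel DDR5-4800 platform
# like the Siena board mentioned above.
CHANNELS = 6
TRANSFERS_PER_S = 4800e6   # DDR5-4800: 4800 MT/s per channel
BYTES_PER_TRANSFER = 8     # 64-bit data bus per channel

peak_gb_s = CHANNELS * TRANSFERS_PER_S * BYTES_PER_TRANSFER / 1e9
print(f"peak ~{peak_gb_s:.1f} GB/s")        # ~230.4 GB/s

# Shared across the 48 cores of an 8434PN (assumed even split):
print(f"~{peak_gb_s / 48:.1f} GB/s per core")  # ~4.8 GB/s per core
```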
The problem is with the form factor, not the server hardware per se. If one buys a regular ATX motherboard that accepts server CPUs and fits it in a regular ATX case, there's lots of space for a relatively quiet CPU air cooler. A 2690 v4 idles at less than 40W, which is not much more than a regular gaming desktop with a powerful GPU.
The only problem in practice is that server CPUs don't support S3 suspend, so putting the whole thing to sleep after you're done with it doesn't work.
Better to build a single workstation - less noise, less power usage, and the form factor is way more convenient. A budget of $3000 can buy 128 cores with 512GB of RAM on a single regular EATX motherboard, plus a case, a power supply and other accessories. Power usage is ~550W at maximum utilization, which is not much more than a gaming rig with a powerful GPU.
> Today it's a bit more complicated when you have servers with 100+ cores as an option for under $30k (guestimate based on $10k CPU price).
If one can buy used, then a previous-generation 128C/256T EPYC server is less than $5k. For homelabs that can accept non-rackmount gear it's less than $3k.
That's just an artifact of Intel disabling ECC on consumer processors.
There's no reason for ECC to have significantly higher power consumption. It's just one additional memory chip per stick and a tiny bit of additional logic on the CPU side to calculate the ECC.
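The overhead is easy to quantify: a standard ECC DIMM widens the 64-bit data bus to 72 bits, i.e. one extra x8 DRAM chip per rank, so the extra capacity (and roughly the extra DRAM power) is a fixed fraction:

```python
# Why ECC barely moves power: 8 code bits per 64 data bits means one
# extra chip per eight, a fixed ~12.5% overhead on the DRAM side.
DATA_BITS = 64
ECC_BITS = 8   # SECDED code bits stored on the extra chip

overhead = ECC_BITS / DATA_BITS
print(f"ECC overhead: {overhead:.1%}")  # ECC overhead: 12.5%
```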
If power consumption is the target, ECC is not a problem. I know firsthand that even old Xeon D servers can hit 25W full-system idle. On the AMD side, the 4850G has 8 cores and can hit sub-25W full-system idle as well.
For example, look into https://github.com/kvcache-ai/ktransformers, which achieves >11 tokens/s on a relatively old two-socket Xeon server + a retail RTX 4090 GPU. Even more interesting is the prefill speed of more than 250 tokens/s. This is very useful in use cases like coding, where large prompts are common.
The above is achievable today. In the meantime the Intel guys are working on something even more impressive. In https://github.com/sgl-project/sglang/pull/5150 they claim to achieve >15 tokens/s generation and >350 tokens/s prefill. They don't share what exact hardware they run this on, but from various bits and pieces across various PRs I reverse-engineered that they use 2x Xeon 6980P with MRDIMM 8800 RAM, without a GPU. Total cost of such a setup will be around $10k once cheap engineering samples hit eBay.
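A rough sketch of why a GPU-less 2x 6980P box can decode that fast (the 12-channels-per-socket figure is my assumption about that platform, based on the MRDIMM 8800 guess above):

```python
# Hedged estimate of aggregate memory bandwidth for a dual-socket
# system with 12 channels of MRDIMM-8800 per socket (assumed config).
CHANNELS_PER_SOCKET = 12
TRANSFERS_PER_S = 8800e6   # MRDIMM-8800: 8800 MT/s
BYTES_PER_TRANSFER = 8     # 64-bit data bus per channel
SOCKETS = 2

per_socket = CHANNELS_PER_SOCKET * TRANSFERS_PER_S * BYTES_PER_TRANSFER / 1e9
total = per_socket * SOCKETS
print(f"~{per_socket:.0f} GB/s per socket, ~{total:.0f} GB/s total")
# ~845 GB/s per socket, ~1690 GB/s total - GPU-class bandwidth from RAM.
```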
Incorrect. https://en.wikipedia.org/wiki/USB_hardware#USB_Power_Deliver... is a good starting point on the subject: "PD-aware devices implement a flexible power management scheme by interfacing with the power source through a bidirectional data channel and requesting a certain level of electrical power <...>".
For all intents and purposes the cache may as well not exist when the working set is 17B or 109B parameters. So it's still better that fewer parameters are activated for each token. 17B active parameters run ~6x faster than 109B just because less data needs to be loaded from RAM.
Yes, loaded from RAM versus loaded to RAM is the big distinction here.
It will still be slow if portions of the model need to be read from disk to memory each pass, but only having to execute portions of the model for each token is a huge speed improvement.
Most laptops are severely limited by heat dissipation, so it's normal that performance is much worse. The CPU cannot stay in turbo as long and must drop to lower frequencies sooner. On longer benchmarks the CPU starts throttling due to heat and becomes even slower.