> not ... web/db servers, lightweight stuff like that.
They scale very well for web and db servers as well. You just put lots of containers/VMs on a single server.
AMD EPYC has a separate core design specifically for such workloads. It's a bit weaker, runs at lower frequency and power, and takes less silicon area. This way AMD can put more such cores on a single CPU (192 Zen 5c cores vs 128 Zen 5). So it's the other way round - web servers love high-core-count CPUs.
not really - you can certainly put lots of lightweight services on it, but they don't scale well, because each core doesn't really get that much cache or memory bandwidth. it's not bad, just not better.
Not true. You should look up Siena chips and something like the ASUS S14NA-U12. It has six DDR5-4800 channels, two physical PCIe 5.0 slots, two M.2 ports, and six MCIO x8 ports. All lanes are full-bandwidth. The 8434PN CPU gets you 48 physical cores in a 150W envelope. Zen 4c really is magic, and there's LOTS of bandwidth to play with.
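A quick back-of-envelope check of the bandwidth claim (channel count from the board description above; the per-core split over 48 cores is my own illustrative assumption):

```python
# Theoretical peak memory bandwidth of a 6-channel DDR5-4800 platform
# like the Siena board mentioned above.
CHANNELS = 6
TRANSFERS_PER_S = 4800e6   # DDR5-4800: 4800 MT/s per channel
BYTES_PER_TRANSFER = 8     # 64-bit data bus per channel

peak_gb_s = CHANNELS * TRANSFERS_PER_S * BYTES_PER_TRANSFER / 1e9
print(f"peak ~{peak_gb_s:.1f} GB/s")        # ~230.4 GB/s

# Shared across the 48 cores of an 8434PN (assumed even split):
print(f"~{peak_gb_s / 48:.1f} GB/s per core")  # ~4.8 GB/s per core
```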
The problem is with the form factor, not the server hardware per se. If one buys a regular ATX motherboard that accepts server CPUs and fits it in a regular ATX case, there's lots of space for a relatively quiet CPU air cooler. A 2690 v4 idles at less than 40W, which is not much more than a regular gaming desktop with a powerful GPU.
The only problem in practice is that server CPUs don't support S3 suspend, so putting the whole thing to sleep after you're done with it doesn't work.
Better to build a single workstation - less noise, less power usage, and the form factor is way more convenient. A budget of $3000 can buy 128 cores with 512GB of RAM on a single regular EATX motherboard, plus a case, a power supply and other accessories. Power usage is ~550W at maximum utilization, which is not much more than a gaming rig with a powerful GPU.
> Today it's a bit more complicated when you have servers with 100+ cores as an option for under $30k (guestimate based on $10k CPU price).
If one can buy used, then a previous-generation 128C/256T EPYC server is less than $5k. For homelabs that can accept non-rackmount gear it's less than $3k.
That's just an artifact of Intel disabling ECC on consumer processors.
There's no reason for ECC to have significantly higher power consumption. It's just one additional memory chip per stick and a tiny bit of additional logic on the CPU side to calculate the ECC.
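The overhead is easy to quantify: a standard ECC DIMM widens the 64-bit data bus to 72 bits, i.e. one extra x8 DRAM chip per rank, so the extra capacity (and roughly the extra DRAM power) is a fixed fraction:

```python
# Why ECC barely moves power: 8 code bits per 64 data bits means one
# extra chip per eight, a fixed ~12.5% overhead on the DRAM side.
DATA_BITS = 64
ECC_BITS = 8   # SECDED code bits stored on the extra chip

overhead = ECC_BITS / DATA_BITS
print(f"ECC overhead: {overhead:.1%}")  # ECC overhead: 12.5%
```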
If power consumption is the target, ECC is not a problem. I know firsthand that even old Xeon D servers can hit 25W full-system idle. On the AMD side, the 4850G has 8 cores and can hit sub-25W full-system idle as well.
For example, look into https://github.com/kvcache-ai/ktransformers, which achieves >11 tokens/s on a relatively old two-socket Xeon server + a retail RTX 4090 GPU. Even more interesting is the prefill speed of more than 250 tokens/s. This is very useful in use cases like coding, where large prompts are common.
The above is achievable today. In the meantime the Intel guys are working on something even more impressive. In https://github.com/sgl-project/sglang/pull/5150 they claim to achieve >15 tokens/s generation and >350 tokens/s prefill. They don't share what exact hardware they run this on, but from various bits and pieces across various PRs I reverse-engineered that they use 2x Xeon 6980P with MRDIMM 8800 RAM, without a GPU. Total cost of such a setup will be around $10k once cheap engineering samples hit eBay.
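A rough sketch of why a GPU-less 2x 6980P box can decode that fast (the 12-channels-per-socket figure is my assumption about that platform, based on the MRDIMM 8800 guess above):

```python
# Hedged estimate of aggregate memory bandwidth for a dual-socket
# system with 12 channels of MRDIMM-8800 per socket (assumed config).
CHANNELS_PER_SOCKET = 12
TRANSFERS_PER_S = 8800e6   # MRDIMM-8800: 8800 MT/s
BYTES_PER_TRANSFER = 8     # 64-bit data bus per channel
SOCKETS = 2

per_socket = CHANNELS_PER_SOCKET * TRANSFERS_PER_S * BYTES_PER_TRANSFER / 1e9
total = per_socket * SOCKETS
print(f"~{per_socket:.0f} GB/s per socket, ~{total:.0f} GB/s total")
# ~845 GB/s per socket, ~1690 GB/s total - GPU-class bandwidth from RAM.
```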
Incorrect. https://en.wikipedia.org/wiki/USB_hardware#USB_Power_Deliver... is a good starting point on the subject: "PD-aware devices implement a flexible power management scheme by interfacing with the power source through a bidirectional data channel and requesting a certain level of electrical power <...>".
For all intents and purposes the cache may as well not exist when the working set is 17B or 109B parameters. So it's still better that fewer parameters are activated for each token. 17B active parameters run ~6x faster than 109B just because less data needs to be loaded from RAM.
Yes, loaded from RAM versus loaded to RAM is the big distinction here.
It will still be slow if portions of the model need to be read from disk to memory each pass, but only having to execute portions of the model for each token is a huge speed improvement.
Most laptops are severely limited by heat dissipation, so it's normal that performance is much worse. The CPU cannot stay in turbo as long and must drop to lower frequencies sooner. On longer benchmarks the CPU starts throttling due to heat and becomes even slower.