What hardware advancement? There's hardly any these days... Especially not for t...

Sebguer · 2025-07-15T06:14:43 1752560083

Have you heard of TPUs?

Dylan16807 · 2025-07-15T07:03:06 1752562986

Sort of a hardware advancement. I'd say it's more of a sidegrade between different types of well-established processor. Take out a couple cores, put in some extra wide matrix units with accumulators, watch the neural nets fly.

But I want to point out that going from CPU to TPU is basically the opposite of a Moore's law improvement.

oblio · 2025-07-15T06:25:15 1752560715

Yeah, I'm a regular Joe. How do I get one and how much does it cost?

Dylan16807 · 2025-07-15T07:13:20 1752563600

If your goal is "a TPU" then you buy a Mac or anything labeled Copilot+. You'll need about $600. RAM is likely to be your main limit.

(A mid to high end GPU can get similar or better performance but it's a lot harder to get more RAM.)

oblio · 2025-07-15T08:19:30 1752567570

I want something I can put in my own PC. GPUs are utterly insane in pricing, since for the good stuff you need at least 16GB but probably a lot more.

Dylan16807 · 2025-07-15T08:34:36 1752568476

9060 XT 16GB, $360

5060 Ti 16GB, $450

If you want more than 16GB, that's when it gets bad.

And you should be able to get two and load half your model into each. It should be about the same speed as if a single card had 32GB.

oblio · 2025-07-15T21:49:01 1752616141

> And you should be able to get two and load half your model into each. It should be about the same speed as if a single card had 32GB.

This seems super duper expensive and not really supported by the more reasonably priced Nvidia cards, though. SLI is deprecated, NVLink isn't available everywhere, etc.

Dylan16807 · 2025-07-15T22:11:00 1752617460

No, no, nothing like that.

Every layer of an LLM runs separately and sequentially, and there isn't much data transfer between layers. If you wanted to, you could put each layer on a separate GPU with no real penalty. A single request will only run on one GPU at a time, so it won't go faster than a single GPU with a big RAM upgrade, but it won't go slower either.
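A rough sketch of what I mean, in Python. The numbers are made up for illustration (32 layers, hidden size 4096, 2-byte activations), and `split_layers` / `handoff_bytes_per_token` are hypothetical helpers, not any real library's API. It only shows how layers could be assigned to cards and how little data crosses from one card to the next per token.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    device: str      # label only; nothing here touches real hardware
    layers: range    # contiguous block of layer indices held by this device


def split_layers(num_layers: int, devices: list[str]) -> list[Stage]:
    """Assign contiguous blocks of layers to devices, as evenly as possible."""
    base, extra = divmod(num_layers, len(devices))
    stages, start = [], 0
    for i, device in enumerate(devices):
        count = base + (1 if i < extra else 0)  # first `extra` devices get one more
        stages.append(Stage(device, range(start, start + count)))
        start += count
    return stages


def handoff_bytes_per_token(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Size of the one activation vector passed between stages for one token."""
    return hidden_size * bytes_per_value


# Assumed model shape, not a specific model.
stages = split_layers(num_layers=32, devices=["gpu0", "gpu1"])
handoff = handoff_bytes_per_token(hidden_size=4096)

for stage in stages:
    print(stage.device, "holds layers", stage.layers.start, "to", stage.layers.stop - 1)
print("bytes crossing the gpu0 -> gpu1 boundary per token:", handoff)
```

Under those assumptions only a few kilobytes cross the boundary per generated token, which is why a plain PCIe link is enough and nothing like SLI or NVLink is needed for this kind of split.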

oblio · 2025-07-16T06:28:53 1752647333

Interesting, thank you for the feedback, it's definitely worth looking into!

haiku2077 · 2025-07-15T07:53:36 1752566016

$500 if you catch a sale at Costco or Best Buy!

haiku2077 · 2025-07-15T06:53:44 1752562424

Specifically, I upgraded my Mac and ported my software, which ran on Windows/Linux, to macOS and Metal. Literally >100x faster in benchmarks, and overall user workflows became fast enough that I had to "spend" the performance elsewhere or else the responses became so fast they were kind of creepy. Have a bunch of _very_ happy users running the software 24/7 on Mac Minis now.

oblio · 2025-07-15T21:53:20 1752616400

The thing is, these kinds of optimizations happen all the time. Some of them can be as simple as using a hashmap instead of some home-baked data structure. So what you're describing is not necessarily some LLM-specific improvement (though in your case it is, we can't generalize to every migration of a feature to an LLM).

And nothing I've seen about recent GPUs or TPUs, from ANY maker (Nvidia, AMD, Google, Amazon, etc.), says anything about general speedups of 100x. Heck, if you go across multiple generations of what are still very new hardware categories, for example Amazon's Inferentia/Trainium, even their claims (which are quite bold) would probably put the most recent generations at 10x the first generations at best. And as we all know, all vendors exaggerate the performance of their products.