You haven't really been getting 'human written and thoughtful content' for a vast swath of search topics for probably 15-20 years now. You get SEO-hyper-optimized (probably LLM-generated for anything in the last 3 years) blog spam. In terms of searching for information and getting that information, there are a lot of topics where an LLM-generated result is vastly better just by virtue of not being buried inside blog spam. The slop ship sailed years ago.
I've been running the 'frontier' open-weight LLMs (mainly deepseek r1/v3) at home, and I find that they're best for asynchronous interactions. Give it a prompt and come back in 30-45 minutes to read the response. I've been running on a dual-socket 36-core Xeon with 768GB of RAM and it typically gets 1-2 tokens/sec. Great for research questions or coding prompts, not great for text auto-complete while programming.
Let's say 1.5 tok/sec, and that your rig pulls 500 W. That's 10.8 tok/Wh, and assuming you pay, say, 15c/kWh, you're paying in the vicinity of $13.9/Mtok of output. Looking at R1 output costs on OpenRouter, that's about 5-7x what you'd pay for third-party inference (which also produces tokens ~30x faster).
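Spelling that arithmetic out (a quick sketch; the 500 W draw and 15c/kWh rate are assumptions):

    # Back-of-the-envelope cost of local inference at the numbers above.
    tokens_per_sec = 1.5
    power_watts = 500
    price_per_kwh = 0.15                              # USD, assumed rate

    tokens_per_hour = tokens_per_sec * 3600           # 5400
    tokens_per_wh = tokens_per_hour / power_watts     # 10.8
    cost_per_mtok = price_per_kwh / (tokens_per_wh * 1000) * 1e6
    print(f"{tokens_per_wh:.1f} tok/Wh, ${cost_per_mtok:.1f} per million output tokens")
    # -> 10.8 tok/Wh, $13.9 per million output tokens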
It's not really an apples-to-apples comparison - I enjoy playing around with LLMs, running different models, etc., and I place a relatively high premium on privacy. The computer itself was $2k about two years ago (and my employer reimbursed me for it), and 99% of my usage is for research questions, which have relatively high output per input token. Using one as a coding assistant seems like it can run through a very high number of tokens with relatively few of them actually being used for anything. If I wanted a real-time coding assistant, I would probably be using something that fits in my 24GB of VRAM and has very different cost/performance tradeoffs.
For what it is worth, I do the same thing you do with local models: I have a few scripts that build prompts from my directions and the contents of one or more local source files. I start a local run and get some exercise, then return later for the results.
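In case it is useful, the shape of those scripts is roughly this (a minimal sketch rather than my actual code; it assumes a local OpenAI-compatible server such as llama-server on port 8080, and the model name and output path are placeholders):

    #!/usr/bin/env python3
    # Build a prompt from a short instruction plus local source files, send it
    # to a local OpenAI-compatible server, and save the reply for later reading.
    import sys, pathlib, requests

    instruction = sys.argv[1]       # e.g. "Review this module for concurrency bugs"
    files = [pathlib.Path(p) for p in sys.argv[2:]]

    prompt = instruction + "\n\n" + "\n\n".join(
        f"--- {f} ---\n{f.read_text()}" for f in files)

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",   # assumed local endpoint
        json={"model": "local",
              "messages": [{"role": "user", "content": prompt}]},
        timeout=None)                                  # local runs can take a long while

    pathlib.Path("answer.md").write_text(
        resp.json()["choices"][0]["message"]["content"])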
I own my computer, it is energy-efficient Apple Silicon, and it is fun and satisfying to do practical work in a local environment, with the ability to switch to commercial APIs when I am in a hurry or need a more capable model and much faster inference.
Off topic, but: I cringe when I see social media posts of people running many simultaneous agentic coding systems and spending a fortune in money and energy. Maybe it is just ancient memories of using assembly language 50 years ago to wring maximum value out of hardware, but I still believe in getting maximum utilization from it and in being at least the ‘majority partner' in AI-enhanced coding sessions: save tokens by thinking more on my own and being more precise in what I ask for.
- For polishing Whisper speech-to-text output, so I can dictate things to my computer and get coherent sentences, or for shaping the dictation into a specific format, e.g. "generate ffmpeg to convert mp4 video to flac with fade in and out, input file is myvideo.mp4 output is myaudio flac with pascal case" -> Whisper -> "generate ff mpeg to convert mp4 video to flak with fade in and out input file is my video mp4 output is my audio flak with pascal case" -> Local LLM -> "ffmpeg ..." (rough sketch below)
- Doing classification/selection type of work, e.g. classifying business leads based on their profiles
Basically the win for a local LLM is that the running cost (in my case, on a second-hand M1 Ultra) is so low that I can run a large quantity of calls that don't need frontier models.
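A rough sketch of the dictation-cleanup step from the first bullet (assumptions: the raw Whisper transcript has already been saved to a text file, and a local OpenAI-compatible server is listening on port 8080):

    #!/usr/bin/env python3
    # Turn a raw Whisper transcript into the command the user actually dictated.
    import sys, requests

    raw = open(sys.argv[1]).read()   # e.g. "generate ff mpeg to convert mp4 video to flak ..."

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",   # assumed local endpoint
        json={
            "model": "local",
            "messages": [
                {"role": "system",
                 "content": "The user dictated a shell command. Fix the transcription "
                            "errors and reply with only the corrected command."},
                {"role": "user", "content": raw},
            ],
        })

    print(resp.json()["choices"][0]["message"]["content"])
    # prints something like: ffmpeg -i myvideo.mp4 ... MyAudio.flac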
My comment was not very clear. I specifically meant Claude Code/Codex-like workflows, where the agent generates and runs code interactively with user feedback. My impression is that consumer-grade hardware is still too slow for these things to work.
You are right, consumer-grade hardware is mostly too slow... although it's a relative thing, right? For instance, you can get a Mac Studio Mx Ultra with 512GB of RAM, run GLM-4.5-Air, and have a bit of patience. It could work.
I was able to run a batch job that amounted to ~2 weeks of inference time on my M4 Max by running it overnight against a large dataset I wanted to mine. It cost me pennies in electricity plus a simple Python script as a scheduler.
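The scheduler really can be simple; something along these lines is enough (a sketch rather than the actual script, assuming a JSON-lines input file, a local OpenAI-compatible server on port 8080, and a hard-coded 23:00-07:00 window):

    #!/usr/bin/env python3
    # Naive overnight batch scheduler: only works between 23:00 and 07:00 local
    # time, and resumes where it left off by counting lines already written.
    import json, time, datetime, pathlib, requests

    IN, OUT = pathlib.Path("dataset.jsonl"), pathlib.Path("results.jsonl")
    done = sum(1 for _ in OUT.open()) if OUT.exists() else 0

    with OUT.open("a") as out:
        for i, line in enumerate(IN.open()):
            if i < done:
                continue                      # processed on a previous night
            while not (datetime.datetime.now().hour >= 23 or datetime.datetime.now().hour < 7):
                time.sleep(600)               # wait for the overnight window
            item = json.loads(line)
            resp = requests.post(
                "http://localhost:8080/v1/chat/completions",   # assumed local endpoint
                json={"model": "local",
                      "messages": [{"role": "user", "content": item["prompt"]}]})
            answer = resp.json()["choices"][0]["message"]["content"]
            out.write(json.dumps({"id": item.get("id", i), "answer": answer}) + "\n")
            out.flush()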
This generally isn't true. Cloud vendors have to make back the cost of electricity and the cost of the GPUs. If you already bought the Mac for other purposes, also using it for LLM generation means your marginal cost is just the electricity.
Also, vendors need to make a profit! So tack a little extra on as well.
However, you're right that it will be much slower. Even just an 8xH100 can do 100+ tps for GLM-4.7 at FP8; no Mac can get anywhere close to that decode speed. And for long prompts (which are compute constrained) the difference will be even more stark.
A question on the 100+ tps - is this for short prompts? For large contexts that generate a chunk of tokens at context sizes of 120k+, I was seeing 30-50 - and that's with a 95% KV cache hit rate. I'm wondering if I'm simply doing something wrong here...
I did the same, then put in 14 3090's. It's a little bit power-hungry but fairly impressive performance-wise. The hardest parts are power distribution and riser cards, but I found good solutions for both.
To the point that I had to pull an extra circuit... but it's three-phase, so I'm good to go even if I'd like to go bigger.
I've limited power consumption to what I consider the optimum: each card will draw ~275 Watts (you can very nicely configure this on a per-card basis). The server itself also uses some for the motherboard. The whole rig is powered from four 1600W supplies; the GPUs are divided 5/5/4 and the motherboard is connected to its own supply. It's a bit close to the edge for the supplies that have five 3090's on them, but so far it has held up quite well, even with higher ambient temps.
Interesting tidbit: at 4 lanes/card, throughput is barely impacted; 1 or 2 is definitely too low. 8 would be great, but the CPUs don't have that many lanes.
I also have a Threadripper which should be able to handle that much RAM, but at current RAM prices that's not interesting (that server I could populate with RAM I still had that fit the board, plus some more I bought from a refurbisher).
What PCIe version are you running? Normally I would not mention one of these, but you have already invested in all the cards, and it could free up some space if any of the lanes you're using now are 3.0.
If you can afford the 16 (PCIe 3) lanes, you could get a PLX switch ("PCIe Gen3 PLX Packet switch X16 - x8x8x8x8" on eBay for around $300) and get 4 of your cards up to x8.
All are PCIe 3.0. I wasn't aware of those switches at all, in spite of buying my risers and cables from that source! Unfortunately all of the slots on the board are x8; there are no x16 slots at all.
So that switch would probably work but I wonder how big the benefit would be: you will probably see effectively an x4 -> (x4 / x8) -> (x8 / x8) -> (x8 / x8) -> (x8 / x4) -> x4 pipeline, and then on to the next set of four boards.
It might run faster on account of the three passes that are double the speed they are right now, as long as the CPU does not need to talk to those cards and all transfers are between layers on adjacent cards (very likely), and with even more luck (due to timing and lack of overlap) it might run the two x4 passes at approaching x8 speeds as well. And then of course you need to do this a couple of times, because four cards isn't enough, so you'd need four of those switches.
I have not tried having a single card with fewer lanes in the pipeline but that should be an easy test to see what the effect on throughput of such a constriction would be.
But now you have me wondering to what extent I could bundle 2 x8 into an x16 slot and then use four of these cards inserted into a fifth! That would be an absolutely unholy assembly, but it has the advantage that you would need far fewer risers, just one x16 to x8/x8 run in reverse (which I have no idea is even possible, but I see no reason right away why it would not work, unless there are more driver chips in between the slots and the CPUs, which may be the case for some of the farthest slots).
PCIe is quite amazing in terms of the topology tricks that you can pull off with it, and c-payne's stuff is extremely high quality.
If you end up trying it please share your findings!
I've basically been putting this kind of gear in my cart, and then deciding I don't want to manage more than the 2 3090s, 4090 and A5000 I have now, and then I take the PLX out of my cart.
Seeing as you already have the cards, it could be a good fit!
Yes, it could be. Unfortunately I'm a bit distracted by both paid work and some more urgent stuff but eventually I will get back to it. By then this whole rig might be hopelessly outdated but we've done some fun experiments with it and have kept our confidential data in-house which was the thing that mattered to me.
Yes, the privacy is amazing, and there's no rate limiting, so you can be as productive as you want. There are also tons of learnings in this exercise. I have just 2x 3090's and I've learnt so much about PCIe and hardware, which just makes the creative process that much more fun.
The next iteration of these tools will likely be more efficient so we should be able to run larger models at a lower cost. For now though, we'll run nvidia-smi and keep an eye on those power figures :)
You can tune that power down to what gives you the best token count per joule, which I think is a very important metric by which to optimize these systems, and by which you can compare them as well.
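As a sketch of what that tuning loop could look like (assumptions: a single GPU at index 0, nvidia-smi on the PATH with root access for changing power limits, and a local OpenAI-compatible server on port 8080 to benchmark against; the limits and prompt are arbitrary):

    #!/usr/bin/env python3
    # Sweep per-card power limits and report tokens per joule at each setting.
    import subprocess, time, requests

    PROMPT = "Summarize the plot of Hamlet in three sentences."

    def gpu_power_draw_watts(gpu=0):
        out = subprocess.check_output(
            ["nvidia-smi", "-i", str(gpu),
             "--query-gpu=power.draw", "--format=csv,noheader,nounits"])
        return float(out.decode().strip())

    def benchmark_tokens_per_sec(url="http://localhost:8080/v1/chat/completions"):
        t0 = time.time()
        r = requests.post(url, json={"model": "local", "max_tokens": 256,
                                     "messages": [{"role": "user", "content": PROMPT}]})
        return r.json()["usage"]["completion_tokens"] / (time.time() - t0)

    for limit in (200, 225, 250, 275, 300):
        subprocess.run(["nvidia-smi", "-i", "0", "-pl", str(limit)], check=True)  # needs root
        tps = benchmark_tokens_per_sec()
        watts = gpu_power_draw_watts()
        # tokens per joule = (tokens/second) / (joules/second)
        print(f"{limit} W limit: {tps:.1f} tok/s, {tps / watts:.3f} tok/J")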
I have a hard time understanding all of these companies that toss their NDAs and client confidentiality to the wind and feed newfangled AI companies their corporate secrets with abandon. You'd think there would be a more prudent approach to this.
You get occasional accounts of 3090 home-superscalers where they put up eight, ten, fourteen cards. I normally attribute this to obsessive-compulsive behaviour. What kind of motherboard did you end up using, and what bi-directional bandwidth are you seeing? Something tells me you're not using EPYC 9005's with their 128 PCIe 5.0 lanes per socket or something... Also: I find it hard to believe the "performance" claims when your rig is pulling 3 kW from the wall (assuming undervolting at 200W per card?). The electricity costs alone would surely make this intractable, i.e. the same as running six washing machines all at once.
I love your skepticism of what I consider to be a fairly normal project; this is not to brag, simply to document.
And I'm way above 3 kW, more likely 5,000 to 5,500 watts or thereabouts with the GPUs running as high as I'll let them, but I only have one power meter and it maxes out at 2,500 watts or so. This is using two Xeons on a very high-end but slightly older motherboard. When it runs, the space it's in becomes hot enough that even in the winter I have to use forced air from outside, otherwise it will die.
As for electricity costs, I have 50 solar panels and on a good day they more than offset the electricity use; at 2 pm (solar noon here) I'd still be pushing 8 kW extra back into the grid. This obviously does not work out so favorably in the winter.
Building a system like this isn't very hard, it is just a lot of money for a private individual but I can afford it, I think this build is a bit under $10K, so a fraction of what you'd pay for a commercial solution but obviously far less polished and still less performant. But it is a lot of bang for the buck and I'd much rather have this rig at $10K than the first commercial solution available at a multiple of this.
I wrote a bit about power efficiency in the run-up to this build when I only had two GPUs to play with:
My main issue with the system is that it is physically fragile; I can't transport it at all, you basically have to take it apart, move the parts, and re-assemble it on the other side. It's just too heavy, and the power distribution is messy, so you end up with a lot of loose wires and power supplies. I could make a complete enclosure for everything, but this machine is not running permanently, and when I need the space for other things I just take it apart and store the GPUs in their original boxes until the next home-run AI project. Putting it all together is about 2 hours of work. We call it Frankie, on account of how it looks.
edit: one more note, the noise it makes is absolutely incredible and I would not recommend running something like this in your house unless you (1) are crazy or (2) have a separate garage where you can install it.
Thanks for replying, and your power story does make more sense, all things considered. I'm no stranger to homelabbing; in fact right now I'm running both an IBM POWER9 system (really power-hungry) and an AMD EPYC 8004, both water-cooled now while I try to bring the noise down. The whole rack, along with 100G switches and NICs/FPGAs, is certainly keeping us warm in the winter! And it's only dissipating up to 1.6 kW (mostly thanks to the ridiculous efficiency of the 8434PN CPU, which is something like 48 cores at 150W).
I stick the system in my garage when it is working... I very enthusiastically put it together on the first iteration (with only 8 GPUs) in the living room while the rest of the family was holidaying, but that very quickly turned out to be a mistake. It has a whole pile of high-speed fans mounted in the front and the noise was roughly comparable to sitting in a jet about to take off.
One problem that move caused was that I didn't have a link to the home network in the garage, and the files that go to and from that box are pretty large, so in the end I strung a UTP cable through a crazy path of little holes everywhere until it reached the switch in the hallway cupboard. The devil is always in the details...
Running a POWER9 in the house is worthy of a blog post :)
As for Frankie: I fear his days are numbered. I've already been eyeing more powerful solutions, and for the next batch of AI work (most likely large-scale video processing and model training) we will probably put something better together, otherwise it will simply take too long.
I almost bought a second-hand, fully populated NVIDIA AI workstation, but the seller was more than a little bit shady and kept changing the story about how they got it and what they wanted for it. In the end I abandoned that because I didn't feel like being used as a fence for what was looking more and more like stolen property. But buying something like that new is out of the ballpark for me; at 20 to 30% of list I might do it, assuming the warranty transfers, and that's not a complete fantasy - there are enough research projects that have this kind of gear and sell it off when the project ends.
People joke I don't have a house but a series of connected workshops and that's not that far off the mark :)
1-2 tokens/sec is perfectly fine for 'asynchronous' queries, and the open-weight models are pretty close to frontier quality (maybe a few months behind?). I frequently use it for a variety of research topics, feasibility studies for wacky ideas, and some prototype-y coding tasks. I usually give it a prompt and come back half an hour later to see the results (although the thinking traces are sufficiently entertaining that sometimes it's fun to just read them as they come out). Being able to see the full thinking traces (and pause and alter/correct them if needed) is one of my favorite aspects of running these models locally. The thinking traces are frequently just as useful as, or more useful than, the final outputs.
Kodak didn't really have the option to compete. Their business was largely film, which just disappeared completely, and even digital cameras got replaced pretty quickly by phones. There was nothing for Kodak to pivot to.
and some simple math says that 10-15 minus 5-7 still leaves Kodak in the lurch. But it's now 2025, and Kodak the corporation is still around, so I don't know that my supposition (that Kodak would be doing better today if they'd gone all in on digital camera sensor technology) is disproven by that fact.
Did you have to do anything special to get the SSD to play nice with OS9? I tried adding one to a 300MHz G3 iMac and it took forever to initialize on boot and would randomly stall a lot.
I use an mSATA-to-IDE adapter that I buy in bulk. This is the equivalent available on Amazon: https://amzn.to/48qEaOm
I use only 128 GB mSATA cards from reputable brands.
I always do the following:
- Boot from the Mac OS 9 Lives 9.2.2 image (v9 of the image) by CD
- Wipe the SSD using Disk Utilities 2.1
- Restore from the CD
I will say this fails perhaps 1 out of 20 times. Hard to say how often this is an actual hardware failure versus some kind of incompatibility with the mSATA SSD since I do use a range of brands. I am always using the same adapters.
If anyone can use a fully automated factory on a pay-as-you-go basis, and the factories are interchangeable, it seems like the value of capital is nearly zero in your envisioned future. Anyone with an idea for soup can start producing it with world-class efficiency; prices for consumers should be low and variety should be sky-high.
Buy a used workstation with 512GB of DDR4 RAM. It will probably cost like $1-1.5k, and be able to run a Q4 version of the full deepseek 671B models. I have a similar setup with dual-socket 18 core Xeons (and 768GB of RAM, so it cost about $2k), and can get about 1.5 tokens/sec on those models. Being able to see the full thinking trace on the R1 models is awesome compared to the OpenAI models.
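Rough sizing for why 512GB is enough (back-of-the-envelope; assumes a Q4_K_M-style quant at roughly 4.8 bits per weight and ignores KV cache and OS overhead):

    # Why the full 671B model fits in 512 GB of RAM at a 4-bit-ish quant.
    params = 671e9
    bits_per_weight = 4.8              # roughly what Q4_K_M-style quants average
    weight_gb = params * bits_per_weight / 8 / 1e9
    print(f"~{weight_gb:.0f} GB of weights")   # ~403 GB, leaving some headroom in 512 GB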
I use a dual-socket 18-core (so 36 cores total) Xeon with 768GB of DDR4, and get about 1.5-2 tokens/sec with a 4-bit quantized version of the full deepseek models. It really is wild to be able to run a model like that at home.
Yeah, it was just a giant HP workstation - I currently have 3 graphics cards in it (but only 40GB total of VRAM, so not very useful for deepseek models).
Interns and new grads have always been a net-negative productivity-wise in my experience, it's just that eventually (after a small number of months/years) they turn into extremely productive more-senior employees. And interns and new grads can use AI too. This feels like asking "Why hire junior programmers now that we have compilers? We don't need people to write boring assembly anymore." If AI was genuinely a big productivity enhancer, we would just convert that into more software/features/optimizations/etc, just like people have been doing with productivity improvements in computers and software for the last 75 years.
Isn't that every new employee? For the first few months you are not expected to be firing on all cylinders as you catch up and adjust to company norms.
An intern is much more valuable than AI in the sense that everyone makes micro-decisions that contribute to the business. An intern can remember what they heard in a meeting a month ago or some important water-cooler conversation and incorporate that into their work. AI cannot do that.
AI/ML and Offshoring/GCCs are both side effects of the fact that American new grad salaries in tech are now in the $110-140k range.
At $70-80k the math for a new grad works out, but not at almost double that.
Also, going remote-first during COVID for extended periods proved that operations can work in a remote-first manner, so at that point the argument was made that you can hire top talent abroad at American new-grad salaries. Around early-to-mid 2020, plenty of employees on visas were given the option to take a pay cut and "remigrate" to help start a GCC in their home country, or get fired and try to find a job within 60 days.
The skills aspect also played a role to a certain extent - by the late 2010s it was getting hard to find new grads who actually understood systems internals and OS/architecture concepts, so a lot of jobs adjacent to those ended up moving abroad to Israel, India, and Eastern Europe, where universities still treat CS as engineering instead of an applied math discipline - I don't care if you can prove Dixon's factorization method using induction if you can't tell me how threading works or how the rings in the Linux kernel work.
The Japan example mentioned above only works because salaries in Japan have remained extremely low and Japanese is not an especially mainstream language (making it harder for Japanese firms to offshore en masse - though they have done so in plenty of industries where they used to hold a lead, like battery chemistry).
> by the late 2010s it was getting hard to find new grads who actually understood systems internals and OS/architecture concepts, so a lot of jobs adjacent to those ended up moving abroad to Israel, India, and Eastern Europe where universities still treat CS as engineering instead of an applied math discipline
That doesn't fit my experience at all. The applied math vs engineering continuum mostly depends on whether a CS program at a given school came out of the engineering department or the math department. I haven't noticed any shift on that spectrum coming from CS departments, except that people are more likely to start out programming in higher-level languages where they are more insulated from the hardware.
That’s the same across countries though. I certainly haven’t noticed that Indian or Eastern European CS grads have a better understanding of the OS or the underlying hardware.
> I certainly haven’t noticed that Indian or Eastern European CS grads have a better understanding of the OS or the underlying hardware.
Absolutely, but that's if they are exposed to these concepts, and that's become less the case beyond maybe a single OS class.
> except that people are more likely to start out programming in higher level languages where they are more insulated from the hardware
I feel that's part of the issue, but also, CS programs in the US are increasingly making computer architecture an optional class. And network-specific classes have always been optional.
---------
Mind you, I am biased towards cybersecurity, DevOps, DBs, and HPC, because those are the industries I've worked in for over a decade now, and it legitimately has become difficult to hire new grads in the US with a "NAND-to-Tetris" mindset, because curricula have moved away from that aside from a couple of top programs.
ABET still requires computer architecture and organization, and they also require coverage of networking. There are 130 ABET-accredited programs in the US and a ton more programs that use it as an aspirational guide.
Based on your domain, I think a big part of what you’re seeing is that over the last 15 years there was a big shift in CS students away from people who are interested in computers towards people who want to make money.
The easiest way to make big bucks is in web development, so that's where most graduates go. They think of DBA, devops, and cybersecurity as low status. The "low status" of those jobs becomes a bit of a self-fulfilling prophecy: few people in the US want to train for them or apply to them.
I also think that the average foreign worker doing these jobs isn’t equivalent to a new grad in the US. The majority have graduate degrees and work experience.
You could hire a 30 year old US employee with a graduate degree and work experience too for your entry level job. It would just cost a lot more.