
> No one who claims this ever posts a benchmark

I meant to explain why no one ever posts a benchmark: it's expensive as hell to do a professional benchmark against accepted standards. It's several days' work, plus very expensive rental of several pieces of $10K hardware, etc. You don't often hand that over for free. With my benchmark results, some companies could save millions if they took my advice.

>any interesting use cases for massive amounts of memory outside of training?

Dozens, hundreds. Almost anything you use databases, CPUs, GPUs or TPUs for. 90% of computing is done on the wrong hardware, not just datacenter hardware.

The interesting use case we discussed here on HN last week was running the full DeepSeek-R1 LLM locally on computers with 778 GB of fast DRAM. I benchmarked hundreds of tokens per second on a cluster of M4 Mac Minis or a cluster of M2 Mac Studio Ultras, where others reported 0.015 or 6 tokens per second on single machines.

I just heard of a Brazilian man who built a 256 Mac Mini cluster at double the cost that I would. He leaves $600K of value on the table because he won't reverse engineer the instruction set, rewrite his software, or even call Apple to negotiate a lower price.

HN votes me down for commenting that I, a supercomputer builder for 43 years, can build better, cheaper, faster, lower-power supercomputers from Mac Minis and FPGAs than from any Nvidia, AMD or Intel state-of-the-art hardware. It even beats the fastest supercomputer of the moment or the Cerebras wafer engine V3 (on energy, coding cost and performance per watt per dollar).

I design and build wafer-scale, 2-million-core reconfigurable supercomputers for $30K apiece that cost $150-$300 million to mass produce. That's why I know how to benchmark M2 Ultra and M4 Macs: they are the second-best chips at the moment, the ones we need to compete against.

As a consulting job I do benchmarks or build your on-prem hardware or datacenter. This job consists mainly of teaching the customer's programming staff how to program massively parallel software, or convincing the CEO not to rent cloud hardware but to buy on-prem hardware. OP at Fly.io should have hired me; then he wouldn't have needed to write his blog post.

I replied to your comment in the hope that someone will hire me when they read this.



Interesting! Fingers crossed someone who's looking for your skillset finds your post.

What is your process to turn Mac Minis into a cluster? Is there any special hardware involved? And if you can get 100x tok/s vs others on comparable hardware, what do you do differently: hardware, software, something else?


>What is your process to turn Mac minis into a cluster

1) Apply science. Benchmark everything until you understand whether it's memory-bound, I/O-bound or compute-bound [1].

2) Rewrite the software from scratch in a parallel form with message passing.

3) Reverse engineer the native instruction sets of the CPU, GPU and ANE or TPU. Same for NVIDIA (don't use CUDA).

No special hardware is needed, but adding FPGAs to optimize the network between machines might help.

So you analyse the software and hardware, then restructure the system by reprogramming, rewiring and adaptive compilers. Then you benchmark again until you find which hardware runs the algorithm fastest for less money using less energy, and weigh that against the extra cost of reprogramming.

[1] https://en.wikipedia.org/wiki/Roofline_model
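The memory-bound vs. compute-bound question in step 1 is exactly what the roofline model [1] answers. A minimal sketch, where the peak numbers are illustrative assumptions (not measurements of any real machine):

```python
# Roofline model: attainable throughput is capped either by peak compute
# or by memory bandwidth times arithmetic intensity, whichever is lower.

def roofline(peak_flops, peak_bw_bytes_s, intensity_flops_per_byte):
    """Attainable FLOP/s for a kernel with the given arithmetic intensity."""
    return min(peak_flops, peak_bw_bytes_s * intensity_flops_per_byte)

# Hypothetical machine: 4 TFLOP/s compute, 120 GB/s memory bandwidth.
PEAK_FLOPS = 4.0e12
PEAK_BW = 120e9

# LLM token generation streams weights constantly: ~1 FLOP per byte,
# so it sits far left on the roofline and is memory-bound.
print(roofline(PEAK_FLOPS, PEAK_BW, 1.0))    # bandwidth-limited: 1.2e11
print(roofline(PEAK_FLOPS, PEAK_BW, 100.0))  # compute-limited: 4.0e12
```

The ridge point (here 4e12 / 120e9 ≈ 33 FLOPs/byte) tells you which resource to buy more of, which is why the benchmarking step comes before any hardware decision.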


I discussed all the points you ask about in my HN comments last month, but never in enough detail, so you'd have to ask me to specify, and that's when people hire me.

As you can see from this comments thread, most people, especially programmers, lack the knowledge we computer scientists, parallel programmers and chip or hardware designers have.

>What is your process

Science. To measure is to know, my prof always said.

To answer your questions in detail, email me.

You first need to be specific. The problem is not how to turn Mac Minis into a cluster, with or without custom hardware (I do both), on code X or Y. Or how to optimize software or rewrite it from scratch (which is often cheaper).

First find the problem. In this case the problem is finding the lowest OPEX and CAPEX for the stated compute load, versus changing the compute load. It turns out, in a simulation or a cruder spreadsheet calculation, that the energy cost dominates the hardware choice: it trumps the cost of programming, the cost of off-the-shelf hardware, and the difference made by adding custom hardware. M4s are lower power, lower OPEX and lower CAPEX, especially if you rewrite your (Nvidia GPU) software. The problem is the ignorance of the managers and their employee programmers.

You can repurpose the 2 x 10 Gbps USB-C, the 10 Gbps Ethernet and the three 32 Gbps PCIe ports or Thunderbolts, but you have to use better drivers. You need to weigh whether doubling the 960 Gbps, 16 GB unified memory for 2 x $400 is faster than 2 Tbps memory at 1.23 times the cost, versus 3 x 4 x 32 Gbps PCIe 4.0, versus 3 x 120 Gbps unidirectional, for this particular algorithm, and what changes if you use the 10 CPU cores, the 10 x 400 GPU cores and the 16 Neural Engine cores (at 38 trillion 16-bit OPS) rather than just the CUDA cores.

Usually the answer is: rewrite the algorithm and use an adaptive compiler, and then a cluster of smaller 'sweet spot' off-the-shelf hardware will outperform the fanciest high-end hardware if the network is balanced. This varies at runtime, so you'll only know if you know how to code. As Alan Kay said and Steve Jobs quoted: if you're serious about software, you should do your own hardware. If you can't, you can approximate the hardware with commodity components if that turns out to be cheaper. I estimate that for $42K of labour I can save you a few hundred $K.
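The claim that memory bandwidth dominates can be sanity-checked with a one-line estimate: autoregressive token generation streams the whole weight set through memory once per token, so bandwidth divided by model size upper-bounds tokens per second. All numbers below are illustrative assumptions:

```python
# Upper bound on single-stream LLM decode speed: each generated token
# must read every weight byte once, so tok/s <= bandwidth / model size.

def max_tokens_per_second(model_bytes, mem_bw_bytes_per_s):
    """Bandwidth-bound ceiling on decode throughput (batch size 1)."""
    return mem_bw_bytes_per_s / model_bytes

# Hypothetical: an 8 GB quantized model on a 120 GB/s machine.
print(max_tokens_per_second(8e9, 120e9))   # 15.0 tok/s ceiling
```

Real throughput lands below this ceiling (attention cache traffic, drivers, network hops in a cluster), but the estimate already explains why two machines with identical FLOPS but different memory bandwidth give wildly different tok/s numbers.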


Sounds interesting, but I don’t see any HN submissions on your profile last month. Are you referring to comments you made?


>Are you referring to comments you made?

Yes. Several pages of comments about M4 clusters and wafer-scale integration, and a few about DeepSeek.

https://news.ycombinator.com/threads?id=morphle (a few pages; press "more").

https://news.ycombinator.com/item?id=42799072


> it's expensive as hell to do a professional benchmark against accepted standards. It's several days' work, very expensive rental of several pieces of $10K hardware, etc

When people casually ask for benchmarks in comments, they’re not looking for in-depth comparisons across all of the alternatives.

They just want to see “Running Model X with quantization Y I get Z tokens per second”.

> That's why I know how to benchmark M2 Ultra and M4 Macs, as they are the second best chip a.t.m. that we need to compete against.

Macs are great for being able to fit models into RAM within a budget and run them locally, but I don’t understand how you’re concluding that a Mac is the “second best option” to your $30K machine unless you’re deliberately excluding all of the systems that hobbyists commonly build under $30K which greatly outperform Mac hardware.


>They just want to see “Running Model X with quantization Y I get Z tokens per second”.

Influencers on YouTube will give them that [1], but it's meaningless. If a benchmark is not part of an in-depth comparison, then it doesn't mean anything and can't tell you which hardware will run this software best.

These shallow benchmarks that influencers post on YouTube and Twitter are not just meaningless but also take days to browse through. And they are influencers; they are meant to influence you and are therefore not honest or reliable.

[1] https://www.youtube.com/watch?v=GBR6pHZ68Ho

>but I don’t understand how you’re concluding that a Mac is the “second best option” to your $30K machine

I conclude that if you can't afford to develop custom chips, then in certain cases a cluster of M4 Mac Minis will be the fastest, cheapest option. Cerebras wafers or NVIDIA GPUs have always been too expensive compared to custom chips or Mac Mini clusters, independent of the specific software workload.

I also meant to say that a cluster of $599 Mac Minis will outperform a $6500 M2 Ultra Mac Studio with 192GB, at half the price for higher performance and more DRAM, but only if you utilize the M4 Mac Mini's aggregated 100 Gbps networking.
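The Mini-vs-Studio comparison boils down to bandwidth per dollar. A back-of-envelope sketch, where both prices and bandwidth figures are assumptions recalled from public spec sheets, not quotes; substitute real numbers before deciding:

```python
# Back-of-envelope: memory bandwidth bought per dollar, per config.
# All figures below are assumed, not verified vendor data.
configs = {
    "M4 Mac Mini (16 GB)":          {"price_usd": 599,  "mem_bw_gb_s": 120},
    "M2 Ultra Mac Studio (192 GB)": {"price_usd": 6500, "mem_bw_gb_s": 800},
}

def bw_per_dollar(cfg):
    """GB/s of memory bandwidth per dollar spent on the box."""
    return cfg["mem_bw_gb_s"] / cfg["price_usd"]

for name, cfg in configs.items():
    print(f"{name}: {bw_per_dollar(cfg):.3f} GB/s per $")
```

Under these assumed numbers the Mini buys noticeably more aggregate bandwidth per dollar, which is the whole argument for the cluster, with the caveat from above: the advantage only materializes if the inter-node network can actually feed it.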



