
> No one who claims this ever posts a benchmark

I meant to explain why no one ever posts a benchmark: it's expensive as hell to do a professional benchmark against accepted standards. It's several days' work, plus very expensive rental of several pieces of $10K hardware, etc. You don't often hand that over for free. With my benchmark results, some companies could save millions if they took my advice.

>any interesting use cases for massive amounts of memory outside of training?

Dozens, hundreds. Almost anything you use databases, CPUs, GPUs or TPUs for. 90% of computing is done on the wrong hardware, not just datacenter hardware.

The interesting use case we discussed here on HN last week was running the full DeepSeek-R1 LLM locally on computers with 778 GB of fast DRAM. I benchmarked hundreds of tokens per second on a cluster of M4 Mac Minis or a cluster of M2 Mac Studio Ultras, where others reported 0.015 or 6 tokens per second on single machines.

I just heard of a Brazilian man who built a 256 Mac Mini cluster at double the cost that I would. He leaves $600K of value on the table because he won't reverse engineer the instruction set, rewrite his software, or even call Apple to negotiate a lower price.

HN votes me down for commenting that I, a supercomputer builder for 43 years, can build better, cheaper, faster, lower-power supercomputers from Mac Minis and FPGAs than from any Nvidia, AMD or Intel state-of-the-art hardware. It even beats the fastest supercomputer of the moment or the Cerebras wafer engine V3 (on energy, coding cost and performance per watt per dollar).

I design and build wafer-scale, 2-million-core reconfigurable supercomputers for $30K apiece that cost $150-$300 million to mass produce. That's why I know how to benchmark M2 Ultra and M4 Macs: they are the second-best chips at the moment, the ones we need to compete against.

As a consulting job I do benchmarks or build your on-prem hardware or datacenter. This job consists mainly of teaching the customer's programming staff how to program massively parallel software, or convincing the CEO not to rent cloud hardware but to buy on-prem hardware. OP at Fly.io should have hired me; then he wouldn't have needed to write his blog post.

I replied to your comment in the hope that someone will hire me when they read this.



Interesting! Fingers crossed someone who's looking for your skillset finds your post.

What is your process to turn Mac Minis into a cluster? Is there any special hardware involved? And if you can get 100x tok/s vs others on comparable hardware, what do you do differently: hardware, software, something else?


>What is your process to turn Mac minis into a cluster

1) Apply science. Benchmark everything until you understand whether it's memory-bound, I/O-bound or compute-bound [1].

2) Rewrite the software from scratch in a parallel form with message passing.

3) Reverse engineer the native instruction sets of the CPU, GPU and ANE or TPU. Same for NVIDIA (don't use CUDA).

No special hardware is needed, but adding FPGAs to optimize the network between machines might help.

So you analyse the software and hardware, then restructure the system by reprogramming, rewiring and adaptive compilers. Then you benchmark again until you find which hardware runs the algorithm fastest for less money using less energy, and weigh that against the extra cost of reprogramming.

[1] https://en.wikipedia.org/wiki/Roofline_model
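The memory-bound vs. compute-bound question in step 1 is exactly what the roofline model [1] answers. A minimal sketch, where the peak numbers are illustrative assumptions (not measurements of any real machine):

```python
# Roofline model: attainable throughput is capped either by peak compute
# or by memory bandwidth times arithmetic intensity, whichever is lower.

def roofline(peak_flops, peak_bw_bytes_s, intensity_flops_per_byte):
    """Attainable FLOP/s for a kernel with the given arithmetic intensity."""
    return min(peak_flops, peak_bw_bytes_s * intensity_flops_per_byte)

# Hypothetical machine: 4 TFLOP/s compute, 120 GB/s memory bandwidth.
PEAK_FLOPS = 4.0e12
PEAK_BW = 120e9

# LLM token generation streams weights constantly: ~1 FLOP per byte,
# so it sits far left on the roofline and is memory-bound.
print(roofline(PEAK_FLOPS, PEAK_BW, 1.0))    # bandwidth-limited: 1.2e11
print(roofline(PEAK_FLOPS, PEAK_BW, 100.0))  # compute-limited: 4.0e12
```

The ridge point (here 4e12 / 120e9 ≈ 33 FLOPs/byte) tells you which resource to buy more of, which is why the benchmarking step comes before any hardware decision.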


I discussed all the points you ask about in my HN comments last month, but never in enough detail, so you'd have to ask me to specify, and that's when people hire me.

As you can see from this comments thread, most people, especially programmers, lack the knowledge we computer scientists, parallel programmers and chip or hardware designers have.

>What is your process

Science. To measure is to know, my prof always said.

To answer your questions in detail, email me.

You first need to be specific. The problem is not how to turn Mac Minis into a cluster, with or without custom hardware (I do both), on code X or Y. Or how to optimize software or rewrite it from scratch (which is often cheaper).

First find the problem. In this case the problem is finding the lowest OPEX and CAPEX for the stated compute load, versus changing the compute load. It turns out, in a simulation or a cruder spreadsheet calculation, that the energy cost dominates the hardware choice: it trumps the cost of programming, the cost of off-the-shelf hardware, and the difference made by adding custom hardware. M4s are lower power, lower OPEX and lower CAPEX, especially if you rewrite your (Nvidia GPU) software. The problem is the ignorance of the managers and their employee programmers.

You can repurpose the 2 x 10 Gbps USB-C, the 10 Gbps Ethernet and the three 32 Gbps PCIe ports or Thunderbolts, but you have to use better drivers. You need to weigh whether doubling the 960 Gbps, 16 GB unified memory for 2 x $400 is faster than 2 Tbps memory at 1.23 times the cost, versus 3 x 4 x 32 Gbps PCIe 4.0, versus 3 x 120 Gbps unidirectional, for this particular algorithm, and what changes if you use the 10 CPU cores, the 10 x 400 GPU cores and the 16 Neural Engine cores (at 38 trillion 16-bit OPS) rather than just the CUDA cores.

Usually the answer is: rewrite the algorithm and use an adaptive compiler, and then a cluster of smaller 'sweet spot' off-the-shelf hardware will outperform the fanciest high-end hardware if the network is balanced. This varies at runtime, so you'll only know if you know how to code. As Alan Kay said and Steve Jobs quoted: if you're serious about software, you should do your own hardware. If you can't, you can approximate the hardware with commodity components if that turns out to be cheaper. I estimate that for $42K of labour I can save you a few hundred $K.
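The claim that memory bandwidth dominates can be sanity-checked with a one-line estimate: autoregressive token generation streams the whole weight set through memory once per token, so bandwidth divided by model size upper-bounds tokens per second. All numbers below are illustrative assumptions:

```python
# Upper bound on single-stream LLM decode speed: each generated token
# must read every weight byte once, so tok/s <= bandwidth / model size.

def max_tokens_per_second(model_bytes, mem_bw_bytes_per_s):
    """Bandwidth-bound ceiling on decode throughput (batch size 1)."""
    return mem_bw_bytes_per_s / model_bytes

# Hypothetical: an 8 GB quantized model on a 120 GB/s machine.
print(max_tokens_per_second(8e9, 120e9))   # 15.0 tok/s ceiling
```

Real throughput lands below this ceiling (attention cache traffic, drivers, network hops in a cluster), but the estimate already explains why two machines with identical FLOPS but different memory bandwidth give wildly different tok/s numbers.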


Sounds interesting, but I don’t see any HN submissions on your profile last month. Are you referring to comments you made?


>Are you referring to comments you made?

Yes. Several pages of comments about M4 clusters and wafer-scale integration, and a few about DeepSeek.

https://news.ycombinator.com/threads?id=morphle (a few pages; press "more").

https://news.ycombinator.com/item?id=42799072


> it's expensive as hell to do a professional benchmark against accepted standards. It's several days' work, very expensive rental of several pieces of $10K hardware, etc

When people casually ask for benchmarks in comments, they’re not looking for in-depth comparisons across all of the alternatives.

They just want to see “Running Model X with quantization Y I get Z tokens per second”.

> That's why I know how to benchmark M2 Ultra and M4 Macs, as they are the second best chip a.t.m. that we need to compete against.

Macs are great for being able to fit models into RAM within a budget and run them locally, but I don’t understand how you’re concluding that a Mac is the “second best option” to your $30K machine unless you’re deliberately excluding all of the systems that hobbyists commonly build under $30K which greatly outperform Mac hardware.


>They just want to see “Running Model X with quantization Y I get Z tokens per second”.

Influencers on YouTube will give them that [1], but it's meaningless. If a benchmark is not part of an in-depth comparison, then it doesn't mean anything and can't tell you which hardware will run this software best.

These shallow benchmarks that influencers post on YouTube and Twitter are not just meaningless but also take days to browse through. And they are influencers; they are meant to influence you and are therefore not honest or reliable.

[1] https://www.youtube.com/watch?v=GBR6pHZ68Ho

>but I don’t understand how you’re concluding that a Mac is the “second best option” to your $30K machine

I conclude that if you can't afford to develop custom chips, then in certain cases a cluster of M4 Mac Minis will be the fastest, cheapest option. Cerebras wafers or NVIDIA GPUs have always been too expensive compared to custom chips or Mac Mini clusters, independent of the specific software workload.

I also meant to say that a cluster of $599 Mac Minis will outperform a $6500 M2 Ultra Mac Studio with 192GB, at half the price for higher performance and more DRAM, but only if you utilize the M4 Mac Mini's aggregated 100 Gbps networking.
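The Mini-vs-Studio comparison boils down to bandwidth per dollar. A back-of-envelope sketch, where both prices and bandwidth figures are assumptions recalled from public spec sheets, not quotes; substitute real numbers before deciding:

```python
# Back-of-envelope: memory bandwidth bought per dollar, per config.
# All figures below are assumed, not verified vendor data.
configs = {
    "M4 Mac Mini (16 GB)":          {"price_usd": 599,  "mem_bw_gb_s": 120},
    "M2 Ultra Mac Studio (192 GB)": {"price_usd": 6500, "mem_bw_gb_s": 800},
}

def bw_per_dollar(cfg):
    """GB/s of memory bandwidth per dollar spent on the box."""
    return cfg["mem_bw_gb_s"] / cfg["price_usd"]

for name, cfg in configs.items():
    print(f"{name}: {bw_per_dollar(cfg):.3f} GB/s per $")
```

Under these assumed numbers the Mini buys noticeably more aggregate bandwidth per dollar, which is the whole argument for the cluster, with the caveat from above: the advantage only materializes if the inter-node network can actually feed it.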



