Doesn't seem practical. It might be useful as a learning-framework for MPI / Supercomputer programming... but it wouldn't be a tool that I'd use personally.
I wonder if you could run distributed QEMU[1] on it and present it as a single (very "NUMA-ish") virtual machine? I know, node-to-node latency would kill you, but it could be fun to try.
> Doesn't seem practical. It might be useful as a learning-framework for MPI / Supercomputer programming... but it wouldn't be a tool that I'd use personally.
I read somewhere that some real supercomputer systems programmers actually use toy clusters of Raspberry Pis to test their scheduling software. It helps speed up their development cycle because they can do initial testing on their desktops.
Supercomputers have high-latency communications through thick pipes. True, supercomputers have 40Gbit or 100Gbit connections between nodes, but it can still take multiple microseconds to send a message around.
A bunch of containers all sitting on the same box would be able to handle communications within dozens of nanoseconds. So it's a bad "emulator" for supercomputers.
Coordinating all your nodes to compute a problem, while achieving high utilization, is tricky. It's not like programming a normal computer, where threads share RAM and can communicate in nanoseconds.
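To make that concrete: the multiple-microseconds figure is what an MPI ping-pong microbenchmark measures. Here's a minimal sketch using mpi4py (assuming mpi4py and an MPI runtime are installed; the hostnames in the comment are placeholders):

```python
# Minimal MPI ping-pong latency probe. Launch across two nodes, e.g.:
#   mpirun -np 2 --host node0,node1 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.zeros(1, dtype='b')   # 1-byte message: measures latency, not bandwidth
iters = 10000

comm.Barrier()
start = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # Each iteration is one full round trip; half of that is the one-way latency.
    print(f"one-way latency ~ {elapsed / iters / 2 * 1e6:.1f} us")
```

Run it between two containers on one box and then between two real nodes, and the gap described above shows up immediately.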
You can add artificial delay to your local container network to better simulate a production environment.
For example, using https://github.com/alexei-led/pumba and "pumba netem delay" you can add network delay between Docker containers, and "pumba netem rate" can limit the bandwidth between them as well.
(Under the hood, "pumba" just uses the standard Linux traffic control machinery, such as the "tc" command from "iproute2". You don't have to use "pumba"; you can set this up manually, but a tool like "pumba" makes it a lot easier.)
I've used netem to emulate millisecond delays before, but I'm not sure it has the granularity to emulate microsecond-level delays.
Basically, netem is designed to provide milliseconds of delay, emulating a worldwide network. Supercomputers are thousands of times faster than that. I'd have to play with netem before I was certain it could handle the sub-10µs delays that supercomputers have node-to-node.
Considering that Linux scheduling granularity is on the order of ~10 ms or so, I have severe doubts that µs-level delays will work with netem.
The NanoPi Fire3 uses normal Gigabit Ethernet, which probably has latencies in the ~50µs region. That's slower than a real supercomputer, but "proportionally" it should be representative of supercomputers (since the tiny embedded ARM chips are around 50x slower than a real supercomputer node anyway).
A bunch of Raspberry Pis on Gigabit Ethernet seems like a better overall "cheap supercomputer" architecture for students of supercomputing: better than containers or software-emulated network delays.
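One way to settle the granularity question empirically is to time round trips yourself before and after applying a netem rule. Here's a quick, hypothetical UDP echo timer (nothing from the article, just a sketch):

```python
# Quick-and-dirty UDP round-trip timer, for checking what delay netem really adds.
# Run "python rtt.py server" on one host/container,
# then "python rtt.py client <server-ip>" on another.
import socket
import sys
import time

PORT = 9999

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("0.0.0.0", PORT))
    while True:
        data, addr = s.recvfrom(64)
        s.sendto(data, addr)          # echo straight back

def client(host, iters=1000):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(2.0)
    start = time.perf_counter()
    for _ in range(iters):
        s.sendto(b"x", (host, PORT))
        s.recvfrom(64)
    rtt = (time.perf_counter() - start) / iters
    print(f"mean RTT: {rtt * 1e6:.1f} us")

if __name__ == "__main__":
    if sys.argv[1] == "server":
        server()
    else:
        client(sys.argv[2])
```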
> handle communications within dozens of nanoseconds. So it's a bad "emulator" for supercomputers
But ping on localhost reports 0.3 ms latency...?
In any case, it's still an easy way to get started, and arguably, once exact latency starts counting for something, you have to be tuning your code on the system it's going to run on. An RPi cluster could skew that in all sorts of ways, e.g. the TCP stack being disproportionately slower, etc.
Had to read your answer twice... "High latency???" But I see you consider 1-2µs high latency. :P
It's an interesting question: what should people use if they don't have access to a supercomputer, but would like to learn and optimize for HPC-style distributed-memory programming?
I've found AWS to be pretty nice, except there are no RDMA drivers for the elastic NIC and the bandwidth is a bit low (25Gbit vs. 100Gbit). For MPI bulk-synchronous programs, it's probably a pretty close model, though.
OTOH if you want to see how your massively parallel algorithm behaves on a 96-node cluster / network, such a box is just $500, and is portable and can work offline.
The comparisons by GFLOPS were more or less a lark, especially the ones comparing energy efficiency with a supercomputer from the 90s. This 96-core rig produces about 1 GFLOPS per watt; compare that to an i9-9900K (250 GFLOPS) with a Z390 chipset and one stick of DDR4 (95W + 7W + 2.5W = 104.5W), which does ~2.3 GFLOPS per watt.*
* this is back-of-napkin; real-world results will vary
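Spelling the napkin math out (all inputs are the rough figures quoted above, not measurements):

```python
# Back-of-napkin GFLOPS-per-watt comparison, using the numbers quoted above.
rig_gflops_per_watt = 1.0                  # stated figure for the 96-core NanoPi rig

i9_gflops = 250.0                          # i9-9900K peak, as quoted
i9_watts = 95.0 + 7.0 + 2.5                # CPU TDP + Z390 chipset + one DDR4 stick
i9_gflops_per_watt = i9_gflops / i9_watts  # ~2.39, i.e. the ~2.3 figure above

print(i9_gflops_per_watt / rig_gflops_per_watt)  # the i9 setup is roughly 2.4x as efficient
```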
That page is embarrassingly wrong about how power management works in Intel CPUs. By default Intel CPUs will not allow their rolling average power consumption over a period of ~1 minute to exceed the specified TDP (95 W in this case). Once the limit is reached the CPU reduces its frequency to bring power consumption down. Intel optimizes their CPUs to achieve a good balance between efficiency and performance when operating at the TDP.
What you see in Anandtech's review is the result of motherboard firmware effectively disabling the power limit by setting it to a very high value. This is a common practice among enthusiast motherboards in order to boost scores in reviews. Unfortunately it also results in drastically lower power efficiency and lots of clueless people, including many tech writers, complaining about unrealistic TDP numbers.
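As a purely illustrative toy model of that behaviour (not Intel's actual algorithm): treat PL1 as a cap on an exponentially weighted average of package power, with PL2 as the short-term burst limit. The constants below are assumptions for illustration only.

```python
# Toy model of the PL1/PL2 behaviour described above -- NOT Intel's real algorithm,
# just an illustration of "the rolling average must stay under the TDP".
PL1 = 95.0    # long-term limit (the advertised TDP), watts
PL2 = 210.0   # short-term turbo limit, watts
TAU = 56.0    # averaging time constant, seconds (order of ~1 minute)
DT = 1.0      # simulation step, seconds

requested = 170.0   # what an all-core load would like to draw
avg = 0.0
for t in range(301):
    # Bursting up to PL2 is only allowed while the weighted average is under PL1.
    allowed = PL2 if avg < PL1 else PL1
    power = min(requested, allowed)
    # Exponentially weighted moving average of package power.
    alpha = DT / TAU
    avg = alpha * power + (1 - alpha) * avg
    if t % 60 == 0:
        print(f"t={t:3d}s  drawing {power:5.1f} W  (avg {avg:5.1f} W)")
```

The point of the toy: the chip can draw well above 95 W for a while, but the long-run average gets pulled back toward the TDP unless the firmware raises the limits, which is exactly what the enthusiast boards do.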
> By default Intel CPUs will not allow their rolling average power consumption over a period of ~1 minute to exceed the specified TDP (95 W in this case).
From the page in question: "In this case, for the new 9th Generation Core processors, Intel has set the PL2 value to 210W. This is essentially the power required to hit the peak turbo on all cores, such as 4.7 GHz on the eight-core Core i9-9900K. So users can completely forget the 95W TDP when it comes to cooling. If a user wants those peak frequencies, it’s time to invest in something capable and serious."
95W is the required power to sustain the base clocks.
Also, calling AnandTech clueless... Are there any better hardware review sites? I would consider them a tier 1 site, with HardOCP and not a whole lot else...
The "new" sites seem to be up-and-coming Youtube channels.
AnandTech's quality has dropped since Anand Lal Shimpi left for Apple. It's still decent, but they're missing the chip-level wizardry that Anand used to bring. I still consider them a good website, just down a few notches.
The new sites with quality are YouTube-based. It's just where the eyeballs and money are right now.
GamersNexus is probably the best of the up-and-coming sites (they have a traditional webpage / blog, but also post YouTube videos regularly). And Buildzoid is one of the best if you want to discuss VRM management on motherboards. These focus more on "builder" issues than the chip-level engineering Anand used to write about.
Like I said, the AnandTech article has a lot of inaccurate information in it. Unfortunately, the quality of tech journalism has taken a dive over the last few years as most of the good writers have been hired away by the very tech companies they used to cover.
See this article on Gamers Nexus for a much better summary of the power consumption situation for Intel CPUs.
I would find it hilarious if this conversation somehow prompted it.
Anyways, AnandTech's position seems to be:
We test at stock, out-of-the-box motherboard settings, except for memory profiles. We do this for three reasons:
1. This is the experience almost all users will have.
2. This is what the benchmarks published by Intel reflect.
3. This is what damn near every other review site has done forever, and to do otherwise would make results less useful.
So that's why their power draw number for the i9-9900K was 170W and not 95W: motherboard vendors take Intel's recommended settings and laugh. But so does Intel for its own benchmarks.
The real question is: "Will it be powerful enough to control <simple-ish device>, even though I'm using a desktop operating system and a software stack designed for programmer comfort rather than efficiency?"
> The NanoPi Fire3 is a high performance ARM Board developed by FriendlyElec for Hobbyists, Makers and Hackers for IOT projects. It features Samsung's Cortex-A53 Octa Core S5P6818@1.4GHz SoC and 1GB 32bit DDR3 RAM
Who needs such a powerful CPU with so little RAM? The reason I still haven't bought any Pi is that all of them have 2 GiB of RAM or less, and I'm not interested in buying anything with less than 4.
I've been trying to do something similar with 4 Orange Pi Zero Plus boards (this blog was one of my main inspirations). While I know it's not practical, it's fun to design the case and the stand, figure out how everything needs to connect, and route it all together. In the end I hope to host a distributed personal website on it, plus an MQTT server for any IoT tinkering I'd want to do!
Nice! Distcc-based compilation might be something to try on this. :) One thing I noticed is that the heatsink fins are oriented in the wrong direction. Air should be going through the fins, not to the side of them. But I guess any air movement is enough to cool this.
Here is a simple study on distcc, pump mode, and the make -j# option on low-end hardware. It seems the network could be a bottleneck; compilation time would probably decrease to about 1/4. But I think using -j# is the best advice.
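If you want to try it on a cluster like this, driving a distcc build looks roughly like the sketch below; the hostnames, per-host job limits, and -j value are placeholders, not values from the study.

```python
# Rough sketch of a distcc build spread across the cluster nodes.
# Hostnames, per-host job limits, and the -j value are placeholders.
import os
import subprocess

env = os.environ.copy()
# One entry per board; "/8" caps distcc at 8 jobs on each octa-core node.
env["DISTCC_HOSTS"] = "node1/8 node2/8 node3/8 node4/8"

# Common rule of thumb: -j of roughly twice the total number of cores available.
subprocess.run(["make", "-j32", "CC=distcc gcc"], env=env, check=True)

# For distcc's pump mode, wrap the build in the "pump" helper instead:
#   subprocess.run(["pump", "make", "-j32", "CC=distcc gcc"], env=env, check=True)
```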
The only supercomputer they compare it to is 27 years old, and it uses Gigabit Ethernet as its interconnect. I think they have a much looser definition of 'Supercomputer' than most people.
I wonder what topology this has; it definitely seems reminiscent of older supercomputers like the famous Thinking Machines CM-5, which used a fat-tree network (its predecessor, the CM-2, was the hypercube machine).
A bunch of Pis networked together is sufficiently far removed from a real HPC cluster that you could probably create a more realistic simulator without too much trouble.
A practical baseline for anyone interested in ARM compute would be the Thunder X CPU (cloud rental: https://www.packet.com/cloud/servers/c1-large-arm/): 48 cores per socket, 2x for 96-core servers.
As another commenter said: the primary use of this NanoPi cluster is the ability to emulate a "real" supercomputer and really use MPI and such. MPI is a different architecture than a massive node (like a 96-core Thunder X ARM), and you need to practice a bit with it to become proficient.
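If you've never touched MPI, the architectural difference is visible even in a toy example. Here's a minimal sketch with mpi4py (assuming it's installed): every rank is a separate process with its own private memory, and anything shared has to move through explicit messages or collectives, which is exactly what a toy cluster forces you to practice.

```python
# Minimal distributed-memory example: every rank owns a slice of the work and
# nothing is shared implicitly -- unlike threads on one big shared-memory node.
# Run with e.g.:  mpirun -np 8 python partial_sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

N = 1_000_000
# Each rank sums only its own stripe of the range.
local = sum(range(rank, N, size))

# Explicit collective communication replaces shared memory.
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print(total == N * (N - 1) // 2)   # True
```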