Doesn't seem practical. It might be useful as a learning-framework for MPI / Supercomputer programming... but it wouldn't be a tool that I'd use personally.
I wonder if you could run distributed QEMU[1] on it and present it as a single (very "NUMA-ish") virtual machine? I know, node-to-node latency would kill you, but it could be fun to try.
> Doesn't seem practical. It might be useful as a learning-framework for MPI / Supercomputer programming... but it wouldn't be a tool that I'd use personally.
I read somewhere that some real supercomputer systems programmers actually use toy clusters of Raspberry Pis to test their scheduling software. It helps speed up their development cycle because they can do initial testing on their desktops.
Supercomputers have high-latency communications through thick pipes. True, supercomputers have 40Gbit or 100Gbit connections between nodes, but it can still take multiple microseconds to send a message around.
A bunch of containers all sitting on the same box would be able to handle communications within dozens of nanoseconds. So it's a bad "emulator" for supercomputers.
Coordinating all your nodes to compute a problem, while achieving high utilization, is tricky. It's not like programming a normal computer, where threads share RAM and can communicate in nanoseconds.
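To make that concrete: the multiple-microseconds figure is what an MPI ping-pong microbenchmark measures. Here's a minimal sketch using mpi4py (assuming mpi4py and an MPI runtime are installed; the hostnames in the comment are placeholders):

```python
# Minimal MPI ping-pong latency probe. Launch across two nodes, e.g.:
#   mpirun -np 2 --host node0,node1 python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = np.zeros(1, dtype='b')   # 1-byte message: measures latency, not bandwidth
iters = 10000

comm.Barrier()
start = MPI.Wtime()
for _ in range(iters):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    elif rank == 1:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # Each iteration is one full round trip; half of that is the one-way latency.
    print(f"one-way latency ~ {elapsed / iters / 2 * 1e6:.1f} us")
```

Run it between two containers on one box and then between two real nodes, and the gap described above shows up immediately.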
You can add artificial delay to your local container network to better simulate a production environment.
For example, using https://github.com/alexei-led/pumba and "pumba netem delay" you can add network delay between Docker containers, and "pumba netem rate" can limit the bandwidth between them as well.
(Under the hood, "pumba" just uses the standard Linux traffic control machinery, such as the "tc" command from "iproute2". You don't have to use "pumba"; you can set this up manually, but a tool like "pumba" makes it a lot easier.)
I've used netem to emulate millisecond delays before, but I'm not sure it has the granularity to emulate microsecond-level delays.
Basically, netem is designed to provide milliseconds of delay, emulating a worldwide network. Supercomputers are thousands of times faster than that. I'd have to play with netem before I was certain it could handle the sub-10µs delays that supercomputers have node-to-node.
Considering that Linux scheduling granularity is on the order of ~10 ms or so, I have severe doubts that µs-level delays will work with netem.
The NanoPi Fire3 uses normal Gigabit Ethernet, which probably has latencies in the ~50µs region. That's slower than a real supercomputer, but "proportionally" it should be representative of supercomputers (since the tiny embedded ARM chips are around 50x slower than a real supercomputer node anyway).
A bunch of Raspberry Pis on Gigabit Ethernet seems like a better overall "cheap supercomputer" architecture for students of supercomputing: better than containers or software-emulated network delays.
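One way to settle the granularity question empirically is to time round trips yourself before and after applying a netem rule. Here's a quick, hypothetical UDP echo timer (nothing from the article, just a sketch):

```python
# Quick-and-dirty UDP round-trip timer, for checking what delay netem really adds.
# Run "python rtt.py server" on one host/container,
# then "python rtt.py client <server-ip>" on another.
import socket
import sys
import time

PORT = 9999

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("0.0.0.0", PORT))
    while True:
        data, addr = s.recvfrom(64)
        s.sendto(data, addr)          # echo straight back

def client(host, iters=1000):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(2.0)
    start = time.perf_counter()
    for _ in range(iters):
        s.sendto(b"x", (host, PORT))
        s.recvfrom(64)
    rtt = (time.perf_counter() - start) / iters
    print(f"mean RTT: {rtt * 1e6:.1f} us")

if __name__ == "__main__":
    if sys.argv[1] == "server":
        server()
    else:
        client(sys.argv[2])
```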
> handle communications within dozens of nanoseconds. So it's a bad "emulator" for supercomputers
But ping on localhost reports 0.3 ms latency...?
In any case, it's still an easy way to get started, and arguably, once exact latency starts counting for something, you have to be tuning your code on the system it's going to run on. An RPi cluster could skew that in all sorts of ways, e.g. the TCP stack being disproportionately slower, etc.
Had to read your answer twice... "High latency???" But I see you consider 1-2µs high latency. :P
It's an interesting question: what should people use if they don't have access to a supercomputer, but would like to learn and optimize for HPC-style distributed-memory programming?
I've found AWS to be pretty nice, except there are no RDMA drivers for the elastic NIC and the bandwidth is a bit low (25Gbit vs. 100Gbit). For MPI bulk-synchronous programs, it's probably a pretty close model, though.
OTOH if you want to see how your massively parallel algorithm behaves on a 96-node cluster / network, such a box is just $500, and is portable and can work offline.
The comparisons by GFLOPS were more or less a lark, especially the ones comparing energy efficiency with a supercomputer from the 90s. This 96-core rig produces about 1 GFLOPS per watt; compare that to an i9-9900K (250 GFLOPS) with a Z390 chipset and one stick of DDR4 (95W + 7W + 2.5W = 104.5W), which does ~2.3 GFLOPS per watt.*
* this is back-of-napkin; real-world results will vary
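Spelling the napkin math out (all inputs are the rough figures quoted above, not measurements):

```python
# Back-of-napkin GFLOPS-per-watt comparison, using the numbers quoted above.
rig_gflops_per_watt = 1.0                  # stated figure for the 96-core NanoPi rig

i9_gflops = 250.0                          # i9-9900K peak, as quoted
i9_watts = 95.0 + 7.0 + 2.5                # CPU TDP + Z390 chipset + one DDR4 stick
i9_gflops_per_watt = i9_gflops / i9_watts  # ~2.39, i.e. the ~2.3 figure above

print(i9_gflops_per_watt / rig_gflops_per_watt)  # the i9 setup is roughly 2.4x as efficient
```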
That page is embarrassingly wrong about how power management works in Intel CPUs. By default Intel CPUs will not allow their rolling average power consumption over a period of ~1 minute to exceed the specified TDP (95 W in this case). Once the limit is reached the CPU reduces its frequency to bring power consumption down. Intel optimizes their CPUs to achieve a good balance between efficiency and performance when operating at the TDP.
What you see in Anandtech's review is the result of motherboard firmware effectively disabling the power limit by setting it to a very high value. This is a common practice among enthusiast motherboards in order to boost scores in reviews. Unfortunately it also results in drastically lower power efficiency and lots of clueless people, including many tech writers, complaining about unrealistic TDP numbers.
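As a purely illustrative toy model of that behaviour (not Intel's actual algorithm): treat PL1 as a cap on an exponentially weighted average of package power, with PL2 as the short-term burst limit. The constants below are assumptions for illustration only.

```python
# Toy model of the PL1/PL2 behaviour described above -- NOT Intel's real algorithm,
# just an illustration of "the rolling average must stay under the TDP".
PL1 = 95.0    # long-term limit (the advertised TDP), watts
PL2 = 210.0   # short-term turbo limit, watts
TAU = 56.0    # averaging time constant, seconds (order of ~1 minute)
DT = 1.0      # simulation step, seconds

requested = 170.0   # what an all-core load would like to draw
avg = 0.0
for t in range(301):
    # Bursting up to PL2 is only allowed while the weighted average is under PL1.
    allowed = PL2 if avg < PL1 else PL1
    power = min(requested, allowed)
    # Exponentially weighted moving average of package power.
    alpha = DT / TAU
    avg = alpha * power + (1 - alpha) * avg
    if t % 60 == 0:
        print(f"t={t:3d}s  drawing {power:5.1f} W  (avg {avg:5.1f} W)")
```

The point of the toy: the chip can draw well above 95 W for a while, but the long-run average gets pulled back toward the TDP unless the firmware raises the limits, which is exactly what the enthusiast boards do.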
> By default Intel CPUs will not allow their rolling average power consumption over a period of ~1 minute to exceed the specified TDP (95 W in this case).
From the page in question: "In this case, for the new 9th Generation Core processors, Intel has set the PL2 value to 210W. This is essentially the power required to hit the peak turbo on all cores, such as 4.7 GHz on the eight-core Core i9-9900K. So users can completely forget the 95W TDP when it comes to cooling. If a user wants those peak frequencies, it’s time to invest in something capable and serious."
95W is the required power to sustain the base clocks.
Also, calling AnandTech clueless... Are there any better hardware review sites? I would consider them a tier 1 site, with HardOCP and not a whole lot else...
The "new" sites seem to be up-and-coming Youtube channels.
AnandTech's quality has dropped since Anand Lal Shimpi left for Apple. It's still decent, but they're missing the chip-level wizardry that Anand used to bring. I still consider them a good website, just down a few notches.
The new sites with quality are YouTube-based. It's just where the eyeballs and money are right now.
GamersNexus is probably the best of the up-and-coming sites (they have a traditional webpage / blog, but also post YouTube videos regularly). And Buildzoid is one of the best if you want to discuss VRM management on motherboards. These focus more on "builder" issues than the chip-level engineering Anand used to write about.
Like I said, the AnandTech article has a lot of inaccurate information in it. Unfortunately, the quality of tech journalism has taken a dive over the last few years as most of the good writers have been hired away by the very tech companies they used to cover.
See this article on Gamers Nexus for a much better summary of the power consumption situation for Intel CPUs.
I would find it hilarious if this conversation somehow prompted it.
Anyways, AnandTech's position seems to be:
We test at stock, out-of-the-box motherboard settings, except for memory profiles. We do this for three reasons:
1. This is the experience almost all users will have.
2. This is what the benchmarks published by Intel reflect.
3. This is what damn near every other review site has done forever, and to do otherwise would make results less useful.
So that's why their power draw number for the i9-9900K was 170W and not 95W: motherboard vendors take Intel's recommended settings and laugh. But so does Intel for its own benchmarks.
The real question is: "Will it be powerful enough to control <simple-ish device>, even though I'm using a desktop operating system and a software stack designed for programmer comfort rather than efficiency?"
> The NanoPi Fire3 is a high performance ARM Board developed by FriendlyElec for Hobbyists, Makers and Hackers for IOT projects. It features Samsung's Cortex-A53 Octa Core S5P6818@1.4GHz SoC and 1GB 32bit DDR3 RAM
Who needs such a powerful CPU with so little RAM? The reason I still haven't bought any Pi is that all of them have 2 GiB of RAM or less, and I'm not interested in buying anything with less than 4.
I've been trying to do something similar with 4 Orange Pi Zero Plus boards (this blog was one of my main inspirations). While I know it's not practical, it's fun to design the case and the stand, figure out how everything needs to connect, and route it all together. In the end I hope to host a distributed personal website on it, plus an MQTT server for any IoT tinkering I'd want to do!
Nice! Distcc-based compilation might be something to try on this. :) One thing I noticed is that the heatsink fins are oriented in the wrong direction. Air should be going through the fins, not to the side of them. But I guess any air movement is enough to cool this.
Here is a simple study on distcc, pump mode, and the make -j# option on low-end hardware. It seems the network could be a bottleneck; compilation time would probably decrease to about 1/4. But I think using -j# is the best advice.
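If you want to try it on a cluster like this, driving a distcc build looks roughly like the sketch below; the hostnames, per-host job limits, and -j value are placeholders, not values from the study.

```python
# Rough sketch of a distcc build spread across the cluster nodes.
# Hostnames, per-host job limits, and the -j value are placeholders.
import os
import subprocess

env = os.environ.copy()
# One entry per board; "/8" caps distcc at 8 jobs on each octa-core node.
env["DISTCC_HOSTS"] = "node1/8 node2/8 node3/8 node4/8"

# Common rule of thumb: -j of roughly twice the total number of cores available.
subprocess.run(["make", "-j32", "CC=distcc gcc"], env=env, check=True)

# For distcc's pump mode, wrap the build in the "pump" helper instead:
#   subprocess.run(["pump", "make", "-j32", "CC=distcc gcc"], env=env, check=True)
```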
The only supercomputer they compare it to is 27 years old, and it uses Gigabit Ethernet as its interconnect. I think they have a much looser definition of 'Supercomputer' than most people.
I wonder what topology this has; it definitely seems reminiscent of older supercomputers like the famous Thinking Machines CM-5, which used a fat-tree network (its predecessor, the CM-2, was the hypercube machine).
A bunch of Pis networked together is sufficiently far removed from a real HPC cluster that you could probably create a more realistic simulator without too much trouble.
A practical baseline for anyone interested in ARM compute would be the Thunder X CPU (cloud rental: https://www.packet.com/cloud/servers/c1-large-arm/): 48 cores per socket, 2x for 96-core servers.
As another commenter said: the primary use of this NanoPi cluster is the ability to emulate a "real" supercomputer and really use MPI and such. MPI is a different architecture than a massive node (like a 96-core Thunder X ARM), and you need to practice a bit with it to become proficient.
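If you've never touched MPI, the architectural difference is visible even in a toy example. Here's a minimal sketch with mpi4py (assuming it's installed): every rank is a separate process with its own private memory, and anything shared has to move through explicit messages or collectives, which is exactly what a toy cluster forces you to practice.

```python
# Minimal distributed-memory example: every rank owns a slice of the work and
# nothing is shared implicitly -- unlike threads on one big shared-memory node.
# Run with e.g.:  mpirun -np 8 python partial_sum.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

N = 1_000_000
# Each rank sums only its own stripe of the range.
local = sum(range(rank, N, size))

# Explicit collective communication replaces shared memory.
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print(total == N * (N - 1) // 2)   # True
```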