Given that today's HPC architectures are mostly power-constrained, and a majority of the FLOPS often come from GPUs (for their FLOP/watt ratios), this direction is not surprising.
ARM has been making major strides in the high-performance area. The new AWS Graviton processors are pretty nice from what I have heard. And then there's ARM in the Mac. Yup, and Julia will run on all of these!
While I say all of this, I should also point out that the top500 benchmark is pretty much not representative of most real-life workloads, and is largely based on your ability to solve the largest dense linear system you possibly can - something almost no real application does.
(The website is down, so I haven't been able to look at the specs of the actual machine).
It looks like this is not an ARM core, but a Fujitsu implementation of the Armv8-A instruction set and the Fujitsu-developed Scalable Vector Extension. Most likely the latter is doing all the heavy lifting.
>A64FX is the world's first CPU to adopt the Scalable Vector Extension (SVE), an extension of Armv8-A instruction set architecture for supercomputers. Building on over 60 years' worth of Fujitsu-developed microarchitecture, this chip offers peak performance of over 2.7 TFLOPS, demonstrating superior HPC and AI performance.
The text you linked to actually says that the SVE was developed cooperatively by Fujitsu and ARM, without, however, going into details about who did what.
So looking at AnandTech's breakdown, the CPUs are closer to a Knights Landing 'CPU/GPU' than a traditional CPU (currently). They also have a ton of HBM2 right next to the dies, so this should be insanely fast, as they can feed those cores very very quickly regardless of how fast each core is by clock and pipeline. That should massively reduce stalls.
Oh agreed, but honestly what makes this so interesting is how tuned it is. I'm honestly surprised we haven't seen Intel or AMD ship an HPC CPU with on-package HBM2 yet.
Besides FLOP/Watt, what's also very interesting here is the FLOP/Byte ratio (memory bandwidth). It has kept the same balance as the K computer, i.e. it is geared at scientific workloads and not just benchmarks (duh, just worth pointing out here as it makes this machine quite special, especially compared to Xeon-based clusters - Intel has IMO dropped the ball on bandwidth over the last 5 years or so).
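Rough back-of-the-envelope, assuming roughly 1 TB/s of HBM2 bandwidth per A64FX alongside the 2.7 TFLOPS figure quoted above: that's about 2.7 FLOP per byte, while a K computer node at roughly 128 GFLOP/s over 64 GB/s was about 2 FLOP per byte - the same ballpark, rather than the considerably higher FLOP/byte ratios of recent Xeon nodes.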
As an early user of KNL, I don't get the "GPU" bit. KNL runs normal x86_64 code and doesn't look that much different to the AMD Interlagos systems I once used apart from the memory architecture.
It comes from the fact that KNL came from Larrabee, which was actually developed as a GPU initially (and even ran games... sort of) but was never actually released. The next revision of that was the Xeon Phi chips you used. So the connection is "lots of small cores with lots of high-bandwidth RAM", although these cores are definitely superscalar, whereas Larrabee and its derivatives were not really.
(SVE isn't 512-bit SIMD like AVX512.)
I don't know what BLAS they're using, though I know they've long worked on their own, but BLIS has gained SVE support recently, for what it's worth.
Yes, SVE, like the RISC-V vector extension, is a "real" vector ISA, with things like a vector length register (no need for a scalar loop epilogue), scatter/gather memory ops for sparse matrix work, mask registers for if-conversion, and looser alignment requirements (no/less need for loop prologues).
That being said, apart from becoming wider, AVX-NNN has also gotten more "real" vector features with every generation. The difference might not be as huge anymore.
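To make the "no scalar epilogue" point concrete, here's a minimal sketch (my own illustration, not anything from Fujitsu's code) of a vector-length-agnostic loop using the ACLE SVE intrinsics from arm_sve.h; the whilelt predicate covers the tail elements, so there's no scalar cleanup loop:

```c
#include <arm_sve.h>
#include <stdint.h>

/* y[i] += a * x[i], written once, runs unchanged on any SVE vector length. */
void axpy(int64_t n, float a, const float *x, float *y) {
    for (int64_t i = 0; i < n; i += svcntw()) {   /* svcntw() = floats per vector */
        svbool_t pg = svwhilelt_b32_s64(i, n);     /* predicate covers the tail    */
        svfloat32_t vx = svld1_f32(pg, x + i);     /* masked loads                 */
        svfloat32_t vy = svld1_f32(pg, y + i);
        vy = svmla_n_f32_x(pg, vy, vx, a);         /* vy = vy + a*vx on active lanes */
        svst1_f32(pg, y + i, vy);                  /* masked store                 */
    }
}
```

(Compile with something like `-march=armv8-a+sve`; the same binary runs on 128-bit through 2048-bit SVE implementations.)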
Worth noting that Fugaku has no GPU/accelerator; all the compute happens on the CPU. The core itself has some GPU-like qualities, of course, since it's more optimized for semi-uniform compute throughput than a "normal" CPU is.
Fujitsu has been building its own HPC CPUs for a long time; whether they use the ARM architecture or SPARC probably doesn't matter much for them. They know how to make them fast.
> While I say all of this, I should also point out that the top500 benchmark is pretty much not representative of most real-life workloads, and is largely based on your ability to solve the largest dense linear system you possibly can - something almost no real application does.
They also publish the HPCG benchmark with sparse matrices, and unsurprisingly the flops are an order of magnitude lower across the board. The Fujitsu chip scales a whole lot better than the usual Nvidia GPUs, though.
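For anyone wondering why the HPCG numbers drop so much: its kernels are dominated by sparse matrix-vector products and similar memory-bound sweeps, which do only about 2 flops per nonzero while streaming indices and values from memory. A minimal CSR sketch (illustrative only, not HPCG's actual code):

```c
/* y = A*x with A in compressed sparse row (CSR) form:
 * row_ptr[n+1], col_idx[nnz], val[nnz].
 * ~2 flops per nonzero vs. ~12+ bytes moved, so it is bandwidth-bound. */
void spmv_csr(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y) {
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            sum += val[k] * x[col_idx[k]];   /* indirect (gather-style) access to x */
        y[i] = sum;
    }
}
```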
I'll count myself as someone surprised, given that GPUs are often better tuned to HPC code, that Fujitsu was able to do so well with an Intel Phi-like approach of just using larger vector units on general-purpose CPUs. I wouldn't have thought you could make an out-of-order core efficiently support scatter/gather the way this thing seems to, though I guess it's possible that the vector unit is in-order. Well, the proof is in the pudding, and hats off to Fujitsu and ARM.
> based on your ability to solve the largest dense linear system you possibly can - something almost no real application does.
Sounds right.
I was going to say what about large-scale optimization problems? But I realized that most typically only require sparse linear solves.
Newton-type methods do require the solution of dense Ax=b systems. But the most visible/popular application of large-scale optimization today, neural networks, typically uses SGD, which requires no dense linear solves at all.
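To illustrate the contrast (my own sketch, not from the thread): an (S)GD step is just a streaming update with no linear algebra beyond a multiply-add per parameter, whereas a Newton step would first need a dense solve.

```c
/* One plain gradient-descent step: x <- x - lr * g.
 * No linear system is solved; it's a streaming, axpy-style update. */
void gd_step(int n, double *x, const double *g, double lr) {
    for (int i = 0; i < n; ++i)
        x[i] -= lr * g[i];
}
/* A Newton step, by contrast, would solve the dense system H * dx = -g
 * (e.g. with LAPACK's dgesv), which is O(n^3) work and is the kind of
 * dense solve that LINPACK-style benchmarks actually measure. */
```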