Hacker News
ARM's Cortex A57 and Cortex A53: The First 64-bit ARMv8 CPU Cores (anandtech.com)
85 points by sciwiz on Oct 30, 2012 | 44 comments



Looks very cool; the idea of having "big" and "little" processors on the same SoC is interesting. I wonder whether this is a big win vs. just having a "big" processor that is frequently put into idle mode when not needed.

As a side note, does anyone else find the ARM naming conventions totally impossible to follow? I look at the list of ARM models (http://en.wikipedia.org/wiki/List_of_ARM_microprocessor_core...) and have no idea where to begin. The family/architecture/core scheme means you can have a single chip that is an ARM11 family, ARMv6 architecture, ARM1136JF-S core. How do people make sense of this?

As someone who's writing a library that I want to be supported across all (or most) ARM cores, I don't have any idea how many different chips I'd need to test against to get a representative sample. There are so many optional features (NEON, Thumb, Thumb-2, VFP, etc.) that seem to be supported in various combinations across the different models. It's like a maze, and I never have any idea how a new model I read about fits in.


One good thing about 64-bit ARM is that there are no optional extensions (yet...) Floating-point and SIMD support are both required.

As for your library, ignore CPU models to start with. Instead, look at the ARM architecture reference manual for a given revision; it will say which instructions are supported and which extensions are optional. Also see the quick reference [1], which has a good summary of when various core instructions were added. Then test for the instructions you actually care about.

You can safely ignore Thumb; it's an alternate encoding with tradeoffs that made sense nearly two decades ago but not anymore. You can also mostly ignore Thumb-2; there are processors that only support Thumb-2 but they also lack MMUs. Thumb-2 added some additional ARM instructions that can be useful, however.

So essentially (with each revision including the prior):

    ARMv4   : baseline
    ARMv5E  : Load/store double, more multiply instructions, prefetch, bx <reg>
    ARMv6   : DSP extensions, rev, load/store exclusive
    ARMv6T2 : Thumb-2, movw/movt, bitfield manipulation, rbit, orn
    ARMv7-A : Memory barriers
Thumb was optional in v4 and v5, and mandatory in v6. VFP was optional as of ARMv6, NEON optional as of ARMv7. NEON mandates VFP. Revisions of VFP and NEON aren't terribly important, except that VFPv3 (but not VFPv3-d16) gets 32 registers.
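
To make the "test for the instructions you actually care about" part concrete, here's a minimal sketch of runtime feature detection on ARM Linux. It assumes glibc 2.16+ for getauxval(); on anything older you'd parse /proc/self/auxv or /proc/cpuinfo instead:

    /* Minimal sketch: query the kernel's HWCAP bits for optional features.
     * Assumes 32-bit ARM Linux and glibc >= 2.16 for getauxval(). */
    #include <stdio.h>
    #include <sys/auxv.h>   /* getauxval, AT_HWCAP */
    #include <asm/hwcap.h>  /* HWCAP_THUMB, HWCAP_VFP, HWCAP_VFPv3, HWCAP_NEON */

    int main(void)
    {
        unsigned long hwcap = getauxval(AT_HWCAP);

        printf("Thumb : %s\n", (hwcap & HWCAP_THUMB) ? "yes" : "no");
        printf("VFP   : %s\n", (hwcap & HWCAP_VFP)   ? "yes" : "no");
        printf("VFPv3 : %s\n", (hwcap & HWCAP_VFPv3) ? "yes" : "no");
        printf("NEON  : %s\n", (hwcap & HWCAP_NEON)  ? "yes" : "no");
        return 0;
    }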

As for chips to care about, ARMv6 + VFP (ARM1176JZF-S), ARMv7-A + VFPv3-d16 (Tegra 2), and ARMv7-A + NEON are by far the most common for general-purpose stuff. If you want to get old-school, test a SheevaPlug for ARMv5E and an old PDA for ARMv4, though I'd not worry too much about ARMv4 (LLVM, for instance, requires a baseline of ARMv5E last I checked).

[1] http://infocenter.arm.com/help/topic/com.arm.doc.qrc0001l/QR...


This is a fantastically informative comment, thank you!


Does your library have to be a single compiled binary across all these variants? If so, good luck...

Otherwise, hopefully, a compiler takes care of most of the mess for you and gets you the best it can on the platform you're targeting. You might need to check that things would run on a variety of configurations - for instance hardfloat and softfloat do indeed have very different performance profiles when it comes to floats. Thumb shouldn't bother you too much as an application programmer (unless I'm very much mistaken) because it's just another instruction set for a compiler to target.

Errr....

You can get into all sorts of complications when actually looking at the platform ABIs. Debian, for instance, seem to have something of a lowest-common-denominator approach that targets features present everywhere. That's why someone had to rebuild it to get decent FP performance out of the Raspberry Pi, which has hard-float hardware...

Part of the complexity is that ARM is a licensed architecture. Some companies license the design of the whole core, some incorporate their own additions, and some just license the instruction set and design everything else themselves.

What do you mean by 'supported across all (or most) ARM cores'? Because that's huge and varies massively. There are the sub-100MHz embedded devices I happen to be working on at the moment (which may be running any of a load of different OSes), there are ARM cores embedded in all sorts of controllers where I wouldn't think you'd want your library to run, and then there's the multi-core, multi-GHz stuff from the likes of Samsung, Qualcomm and Marvell...


" Debian, for instance, seem to have something of a lowest-common-denominator approach that targets features present everywhere."

Debian actually has three ARM ports in progress.

ArmEabiPort - newer port using the "new" ABI (EABI), supported on ARM v4t and higher. First released with 5.0 (Lenny). GNU Triplet: arm-linux-gnueabi

ArmHardFloatPort - the latest 32-bit port, using the hard-float version of the "new" ABI (EABI), targeting ARM v7 and up. To be released with 7.0 (Wheezy). GNU Triplet: arm-linux-gnueabihf

Arm64Port - the latest port, for the 64-bit ARMv8 architecture. Likely to be released with 8.0 (Jessie). GNU Triplet: aarch64-linux-gnu

http://wiki.debian.org/ArmPorts


Perhaps it would have been more accurate to say that's the way Debian used to do it, then; I know there's been a lot of movement on hard-float support.


> Does your library have to be a single compiled binary across all these variants?

No no, not that. :)

As Matt mentioned, my work does include JIT compilers, so I'm curious to know how many instruction variants I'd have to support. But I also want to simply test that my plain C code (i.e. the interpreted, slow paths) doesn't make any platform-specific assumptions that break on some processors.

> What do you mean 'supported across all (or most) ARM cores'?

My intention is that anyone can compile my library out-of-the-box and have it just work, unless their CPU has a fundamental limitation that I can't support. So far the only such limitation I want to concede is that I require at least a 32-bit CPU (for one, my program's code and data will only barely fit in 64k of RAM, and wouldn't leave much space for anything else).
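
For what it's worth, two of the classic assumptions that bite on ARM are that plain char is signed (it defaults to unsigned in the ARM ABIs) and that unaligned word loads through cast pointers are safe (they fault or silently rotate on older cores). A rough sketch of the portable patterns; the function names are just illustrative:

    /* Illustrative sketch of two common ARM portability gotchas;
     * the function names are made up. */
    #include <stdint.h>
    #include <string.h>

    /* Unsafe on older ARM: *(const uint32_t *)p can fault or return
     * rotated data if p isn't 4-byte aligned. memcpy compiles to the
     * right thing on every target. */
    static uint32_t read_u32(const uint8_t *p)
    {
        uint32_t v;
        memcpy(&v, p, sizeof(v));
        return v;   /* native byte order; swap explicitly if needed */
    }

    /* Plain 'char' is unsigned on ARM but signed on x86: be explicit
     * whenever the sign of a byte matters. */
    static int has_high_bit(const char *s)
    {
        return (unsigned char)s[0] >= 0x80;
    }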


I know that some of haberman's libraries (like upb) include JIT compilers. In those cases you can't just rely on the compiler to take care of instruction set differences. (I'm on the mobile Firefox team, and we run into similar issues targeting our JavaScript engine to different ARM flavors.)


Then that very well could turn out to be a complete nightmare!

Yes, if you're in the business of writing compilers - traditional, JIT or otherwise - then you're going to hit all sorts of issues with this stuff, and rapidly head off beyond the realms in which I have anything useful to say :)


Although I prefer the performance of pure native code, I have to say that this is what makes bytecodes + JIT so appealing nowadays.

IBM has been doing this since the OS/360 days, as far as I know.

The JIT lives at the kernel level and all languages, even C, compile to bytecode.


AS/400 aka "i" is JITed; 360 aka "z" is directly executed.


Thanks for the clarification.


Well, with regard to the first question: generally a processor's performance is proportional to the square root of its area at a given frequency and voltage, and its active power use and leakage are both proportional to its area. So yes, a smaller core will be able to accomplish the same task for less energy even though it takes longer.
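
A back-of-envelope check, taking those proportionalities at face value (the numbers are purely illustrative):

    /* Back-of-envelope comparison, assuming perf ~ sqrt(area) and
     * power ~ area at a fixed frequency and voltage, per the claim above. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double area_ratio   = 0.5;                      /* little core: half the area */
        double perf_ratio   = sqrt(area_ratio);         /* ~0.71x the performance     */
        double power_ratio  = area_ratio;               /* ~0.50x the power           */
        double time_ratio   = 1.0 / perf_ratio;         /* task takes ~1.41x as long  */
        double energy_ratio = power_ratio * time_ratio; /* ~0.71x the energy          */

        printf("little core: %.2fx time, %.2fx energy per task\n",
               time_ratio, energy_ratio);
        return 0;
    }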

It used to be that a processor's power budget was dominated by active power, and the solution was to scale clock speeds and voltage levels. As we move to smaller process nodes it's more about leakage power, which only goes away when you power down the core, meaning that race-to-idle is now a better choice. But that's tangential to the bigger/smaller core issue.


The two CPUs appear to differ in dispatch and branch prediction. Interestingly, that's where much of the power/silicon budget goes nowadays.


As I understand it, incorrect branch prediction becomes significantly more expensive (power-/time-wise) as pipeline length increases, so getting the prediction right and being able to correct quickly is quite important.
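
A purely illustrative way to see the scaling: lost cycles per instruction is roughly branch frequency x misprediction rate x pipeline refill cost, so a deeper pipeline multiplies the cost of every miss (the numbers below are made up):

    /* Illustrative numbers only: average cycles lost per instruction to
     * branch mispredictions, for a shallow vs. a deeper pipeline. */
    #include <stdio.h>

    int main(void)
    {
        double branch_freq  = 0.20;  /* ~1 in 5 instructions is a branch   */
        double refill_short = 8.0;   /* cycles to refill a short pipeline  */
        double refill_deep  = 15.0;  /* cycles to refill a deeper pipeline */
        double accuracies[] = { 0.90, 0.94, 0.98 };

        for (int i = 0; i < 3; i++) {
            double miss = 1.0 - accuracies[i];
            printf("accuracy %.2f: %.3f vs %.3f lost cycles/instruction\n",
                   accuracies[i],
                   branch_freq * miss * refill_short,
                   branch_freq * miss * refill_deep);
        }
        return 0;
    }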


Tegra 3 already has this big/little architecture, but using the same core design for all cores, the difference being that the "little" core is clocked lower and fabbed with a special low-power process. I imagine using two different core designs would be even more effective.


Pretty creative stuff. I am looking forward to AMD building an ecosystem around these things. My guess is that if AMD could deliver 64-bit ARM server chips and full programming documentation to support a robust Linux server OS architecture, they could take a huge chunk of server share away from Intel-based machines.

My reasoning is that people started migrating to AMD when they had a 64-bit x86 architecture and Intel didn't. That showed me that folks were willing to go with AMD if they provided something that Intel wouldn't (or couldn't). Given that ARM isn't bogged down by Intel's staggering licensing challenges (chipsets, buses, instruction sets, etc.), this suggests a very interesting couple of years ahead.


It's not clear how AMD's ARM chips will be any better than the other ones (e.g. Calxeda), though.


Well, they could start by replacing Calxeda's 32-bit physical memory addressing (4GB) with 40-bit physical addressing (1TB). All my servers have 96GB and will have 192GB on the next iteration. I'd love to put 512GB of 'flash' memory in the physical address space, and then bump the L2 cache to 12MB, maybe 16MB.


And you think Calxeda will not have 40-bit and 64-bit chips? Of course they will. It's only 32-bit because they've used the Cortex A9. Soon they'll use the Cortex A15 with 40-bit addressing. And in 2014 they'll use the 64-bit A57, just like AMD and everyone else. But Calxeda will be on generation 3 already.


To be clear, let's compare an AMD chip based on the Cortex-A57 against a Calxeda chip based on the Cortex-A57, or maybe a Samsung chip based on the Cortex-A57. The GPU is an obvious differentiator, but not in servers.


Sure, but one of the challenges for 'SoC' companies vs 'server' chip companies has been that 'SoC' companies continue to think that the 'system' is mostly just their 'chip' (hence the SoC moniker rather than, say, a CPU moniker).

To date, I am not familiar with any ARM licensees who have made 'CPUs' rather than 'SoCs'; they may be out there, but I've not seen them yet. Something where the ARM CPU is the core but the system designer can pick the level of IO or memory that is included - basically a socketed ARM chip, like server CPUs are socketed today.


Memory interface and core interconnects are another differentiator; just look at how much they vary across Cortex-A9 SoCs, for instance (1x32-bit to 4x32-bit, to say nothing of the frequency or memory type).


Putting flash into the physical address space isn't likely to be practical due to its block-based nature.


The current flash architecture is, from a building blocks perspective, identical to dynamic RAM architecture. There is a 'memory controller' which arbitrates delivery of the data from the data bus to the devices themselves based on device specific constraints.

Of course, in the case of DRAM it's contingent on whether or not that particular row has been refreshed lately, and there is a 'page' register which is used to provide part of the address as well.

If you look at how Intel and others have constructed their PCIe cards which have flash on them you will see that the controller presents a "memory" type interface to the PCIe bus as it would in the case of connecting it to the physical memory bus.

Putting it into the physical address map, combined with its non-volatility, would create opportunities for 'on demand' server-type OSes that boot in milliseconds rather than seconds.
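
As a userspace-level illustration of what "flash in the physical address map" buys you: loads go straight to the device, with no block layer in between. The base address and size here are made up, and /dev/mem access is normally root-only and may be disabled by the kernel, so treat this as a sketch:

    /* Hedged sketch: if a flash region really sat in the physical address
     * map, it could be mapped and read like ordinary memory.
     * FLASH_PHYS_BASE and FLASH_SIZE are hypothetical. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define FLASH_PHYS_BASE 0x40000000UL   /* made-up physical address */
    #define FLASH_SIZE      (16UL << 20)   /* 16MB window              */

    int main(void)
    {
        int fd = open("/dev/mem", O_RDONLY);
        if (fd < 0) { perror("open /dev/mem"); return 1; }

        const uint8_t *flash = mmap(NULL, FLASH_SIZE, PROT_READ,
                                    MAP_SHARED, fd, FLASH_PHYS_BASE);
        if (flash == MAP_FAILED) { perror("mmap"); return 1; }

        /* Plain loads; no read() through a block device needed. */
        printf("first byte: 0x%02x\n", flash[0]);

        munmap((void *)flash, FLASH_SIZE);
        close(fd);
        return 0;
    }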


> The current flash architecture is, from a building blocks perspective, identical to dynamic RAM architecture.

This is true of typical NOR flash, but not NAND flash. The smallest word you can read from typical, modern NAND is 8kB. This is not a feature of the controller, but of the way the bit array is laid out. While you could build this memory bus out of NOR, it would be quite expensive -- NAND is not only much more dense, but because it's a commodity with a lot of competition, cost per mm^2 is also much lower than NOR.


> If you look at how Intel and others have constructed their PCIe cards which have flash on them you will see that the controller presents a "memory" type interface to the PCIe bus as it would in the case of connecting it to the physical memory bus.

I'm pretty sure this is not correct; you still have to use DMA. Doing MMIO to flash may make a core unhappy (or at least extremely bored) when it blocks for ~20us.


I think we're talking past each other.

I completely agree with you that it would be challenging and probably quite unsatisfactory to take existing flash controllers and 'pretend' they were a memory controller.

What I am suggesting is that if there is a processor out there which can "consume" a large amount of flash attached to the physical memory bus, then you will see controllers that are designed to operate well in that mode. Current ARM chips, for example, often have the PSMI "bus", which is used for pseudo-static memory, and sometimes people hook up LCD controllers to that bus.

AMD is big enough to create a market for such flash controllers. And they could put hooks in the TLB such that a cache-line fetch from that space could happen asynchronously with other stuff going on. I've got core memory planes that have slower read speeds than flash; I know it's possible to make it work :-) But as you (wmf) point out it hasn't been done yet (other than internal NAND flash for embedded devices)

So if I were writing up the MRD or PRD [1] for the controller chip, I'd start with: provide a multi-channel way of doing loads and stores of program data at the cache-line level. I'd put the wear leveling in the controller to minimize implementation complexity. I might also provide some 'staging' static RAM, much like the 'open page' registers in a DRAM controller, to keep track of requests that had happened so that I could do read-ahead to improve sequential access. (A toy sketch of the wear-leveling side follows the footnote below.)

[1] When I worked at Intel each new chip started with a 'Market Requirements Document' (MRD) which described the market for the chip and a 'Product Requirements Document' (PRD) which was a description of a product that could be sold into that market.
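
To make the wear-leveling part concrete, here's a toy sketch of the kind of logical-to-physical map such a controller would keep. Names and sizes are made up, and a real controller would also track bad blocks, ECC state, and so on:

    /* Toy sketch of controller-side wear leveling: a logical-to-physical
     * block map plus per-block erase counts. Everything here is hypothetical. */
    #include <stdint.h>

    #define NUM_BLOCKS 4096u   /* made-up number of erase blocks */

    struct wear_map {
        uint32_t log_to_phys[NUM_BLOCKS];  /* logical block -> physical block    */
        uint32_t erase_count[NUM_BLOCKS];  /* erases seen by each physical block */
    };

    /* On an erase/rewrite, remap the logical block to the least-worn free
     * physical block so hot data doesn't keep hammering the same cells. */
    static uint32_t pick_least_worn(const struct wear_map *m,
                                    const uint8_t *is_free)
    {
        uint32_t best = 0, best_count = UINT32_MAX;
        for (uint32_t i = 0; i < NUM_BLOCKS; i++) {
            if (is_free[i] && m->erase_count[i] < best_count) {
                best = i;
                best_count = m->erase_count[i];
            }
        }
        return best;
    }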


> But as you (wmf) point out it hasn't been done yet (other than internal NAND flash for embedded devices)

It's not done there either. When NAND is used in such applications, it is usually copied over to volatile RAM before use.


The biggest advantages they have are Supermicro, existing partnerships with server sales channels and manufacturers, and, most notably, memory controllers.

Modern ARM really is very near to competitive with x86, but only so long as you don't touch RAM. The entire memory hierarchy of the A15 or A9 is really weak compared to what is found in any x86 CPU. The most egregious example is that while the A15 is much faster than Atom when doing any kind of mostly-register/L1 benchmark, it tends to lose to Atom on real benchmarks with a lot of data.

Low-latency high-speed memory controllers are really hard, and there really are only 3 companies in the world that have proven that they can build them properly (IBM, AMD, Intel). Server workloads tend to be very memory-bound, so if AMD is the only ARM server vendor with a really good memory subsystem, they can potentially do very well.


AMD could package them with their ATI/Radeon graphics processors.


Unless they intend to make a mobile GPU like Nvidia did with Tegra, how will their Brazos GPUs compete in performance/Watt with something like the ARM Mali GPU, which would probably be the default choice for Calxeda?


Interestingly, ATI had mobile GPUs but sold them to Qualcomm. Nvidia's trickle-down approach of putting really old desktop GPUs in a mobile SoC doesn't seem to be doing very well.


> ATI had mobile GPUs but sold them to Qualcomm

Yeah, I think that will turn out to have been really shortsighted. Focus is a good thing when you're as small as AMD, but it doesn't seem like they've done too well in the x86 or high-end graphics markets since then either. Of course, their game of CEO musical chairs suggests that the effect of any particular business decision may be swamped by general mismanagement.

> Nvidia's trickle-down approach of putting really old desktop GPUs in a mobile SoC doesn't seem to be doing very well.

Granted, not having unified shaders and some other niceties has been pretty weak, but I think an even larger factor may be the content problem they face. The dominant mobile GPUs (and most importantly, the mobile GPUs in the market-creating, market-leading iOS devices) use a tile-based, or chunker, architecture where the GPU renders one chunk of the screen at a time. Nvidia GPUs don't; they render the whole screen at once. Nvidia believes their architecture is better; I tend to agree for desktop/console content and in a "pure" technical sense I suppose. The problem is that the vast majority of mobile game content has been optimized to work well on chunker GPUs, so any performance advantage Nvidia might see vanishes, and their power efficiency suffers slightly.

They can overcome this structural disadvantage by (1) adopting a chunker architecture, (2) encouraging developers to make games that work better or look cooler on Tegra [1], or (3) capturing a bunch of market share by building a better overall platform, and making Tegra the default target for developers. They could also just build a GPU that's sufficiently better than PowerVR/Mali/Exynos/Adreno in other respects that the tile-based/non-tile-based divide matters less.

For what it's worth, it's hard to compare mobile SoCs from a pure-technical-goodness standpoint, because their designers have such different market positions. A5X's GPU will smoke Tegra at a lot of things, but a lot of that is because Apple can afford for A5X to be twice as big [2]. Nvidia's customers cannot afford a 150mm^2 chip at their current market position.

[1] http://www.tegrazone.com/support/game-support

[2] http://en.wikipedia.org/wiki/Apple_system_on_chips


It wouldn't. It would be a 16-CPU-core beast with a Radeon GPU aimed at competing with x86-64 performance/Watt.


It sounds like the A53 will start at 1.3GHz, and the A57 will end at 3GHz. A 3GHz ARM CPU. Interesting:

"For those who are still looking for gigahertz performance numbers Hurley sais]d that new A-50 family will deliver performance ranging from 1.3 gigahertz to 3 Gigahertz depending on how the ARM licensees tweak their designs."

http://gigaom.com/2012/10/30/meet-arms-two-newest-cores-for-...


Any Intel employees in the crowd? What's the level of worry surrounding ARM these days?

Given Intel's high levels of competitive paranoia, news like this must have people fairly worried. Not to mention the explosive growth in adoption of the ISA over the past few years.


I spoke to my friend @ Intel a while back, and he said that the company viewed Samsung as their main rival (not AMD, not ARM). His argument was that Intel's main competitive advantage (and where the majority of spending/R&D happens) was their fabs, and Samsung was the only other company that could come close to competing on that front.


I think Intel's major worry with ARM is that, simply speaking, ARM don't need their licensees to be particularly profitable at making chips for them to carry on designing chips; they just need them to buy new licences.

For Intel, having a competitor designing CPUs who can't be undercut directly is a real issue - Intel are not going to be letting Samsung produce Xeon chips under licence in the next 5 years, though I do wonder about Atom ones...


Well, INTC could adopt the same IP-based business model as ARM and then give up 98% of its revenue: http://www.wolframalpha.com/input/?i=revenue+of+intel+vs+arm

So the folks at INTC are doing what you'd expect: competing on discrete components and application-tuned architectures. A prime example is the recent Merrifield-based chipset for smartphones, which is obviously a key initiative:

http://www.intomobile.com/2012/05/14/intel-merrifield-system...


Intel is first and foremost a fab company - they're ahead of pretty much everyone in semiconductor processes by at least one generation. If Intel needed to create an ARM-ISA compatible CPU tomorrow, they probably could, and it would be top notch (though maybe not at a margin Intel wants to play at).

The biggest threats come from companies which have shown that they're willing to pour the big bucks into R&D and new fab construction (Samsung).


I wish they would release the architecture reference manual to the public... all we have is the instruction set right now. Not enough to do bare-metal/OS work.


You can get it if you register at arm.com and agree to their legal terms.


I don't believe these will be the first 64-bit ARMv8 CPU cores. Doesn't that honor go to Applied Micro's X-Gene SoCs? They'll ship a lot sooner than 2014.



