What a peculiar design. They went for an in-order design instead of the out-of-order design that has been standard in high-end ARM cores for years. Out-of-order usually brings more instructions per clock cycle ("more efficient"), but makes the chip more power-hungry and makes it harder to achieve a high clock speed.
So the in-order design they have now allows them to scale up to 2.5 GHz which is very high for such a constrained chip. They also have this weird "code optimizer" that appears to sacrifice some RAM and a CPU core to dynamically reorganize processor instructions to run better on this in-order design. It seems as though they want to have their cake and eat it too, but at the same time they appear to paradoxically introduce complexity to achieve simplicity.
I wonder if someone more knowledgeable about chip design could comment on this design?
Saying "When an in-order CPU stalls on memory, it's still burning power while waiting, while an OOO processor is still getting work done" seems deceptive, though. The OOO processor is (often) doing speculative work, which may be unneeded, and a stalled in-order CPU won't be using quite the same amount of power as if the execution units were actually switching.
Though I'm pretty sure the NVIDIA design must speculatively prefetch and hoist memory reads aggressively to be performance-competitive (see the huge cache sizes!), which also burns power.
"Denver is just one of the variants in that line. T50 was going to be a full 64-bit x86 CPU, not ARM cored chip, but Nvidia lacked the patent licenses to make hardware that was x86 compatible."
Neat. On Android it's easy to believe Dalvik/ART generated code leaves more performance on the table compared to other platforms.
Another "what's different this time" point: compared to Transmeta era, CPUs now typically have idle core(s) waiting for work to do. This may enable more ambitious optimizations as you're not constantly stealing cycles from the app.
Why do you lump Dalvik and ART together? The whole point of ART is that it's compiled to native code, is it not? Therefore it's "not leaving anything on the table".
Not sure why I'm being downvoted:
> The big paradigm-shift that ART brings, is that instead of being a Just-in-Time (JIT) compiler, it now compiles application code Ahead-of-Time (AOT). The runtime goes from having to compile from bytecode to native code each time you run an application, to having it to do it only once, and any subsequent execution from that point forward is done from the existing compiled native code.
Dalvik and ART both generate native code. ART does it at app install time and Dalvik, being a JIT, does it at app runtime.
He was speculating about quality of the generated native code, not its existence.
Google has been very quiet about ART generated code quality. I haven't found any benchmarks that would compliment its performance either. It's probably better than Dalvik but worse than mature optimizing compilers.
I don't think they have been very quiet about ART. They had a whole I/O session and they recently dedicated a whole dev backstage podcast to ART and its future evolution (btw, some nice things are planned: removal of the 65k limit, code hot-swap, ...).
20 or more years ago there was a peephole optimizer that looked at the code generated by Turbo Pascal (and other compilers that were not so advanced for their time), rearranged instructions, removed holes, and replaced instructions with better ones (it turns out some of the "macro" instructions were not that fast).
Nowadays calling a .so/.dylib/.dll function, or accessing a thread-local variable, also generates lots of cruft code that could possibly get optimized once the data is loaded. It won't always work (shared libraries can be unloaded, but with enough hints, or the assumption that this will never be done, one can gain some benefits). On the negative side, this may reduce code sharing across processes.
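One crude way to illustrate the "bind once, then call through a cached pointer" idea in C (only a sketch, valid under the assumption that the library is never unloaded; libfoo.so and foo_compute are made-up names):

    #include <dlfcn.h>
    #include <stdio.h>

    /* Resolve the function once; every later call is a plain indirect call
       instead of going through the PLT/lazy-binding stub each time.
       Only safe if "libfoo.so" is guaranteed to stay loaded. */
    typedef double (*compute_fn)(double);

    static compute_fn resolve_compute(void) {
        static compute_fn cached = NULL;
        if (!cached) {
            void *lib = dlopen("libfoo.so", RTLD_NOW);   /* placeholder library */
            if (lib) cached = (compute_fn)dlsym(lib, "foo_compute");
        }
        return cached;
    }

    int main(void) {
        compute_fn f = resolve_compute();
        if (f) printf("%f\n", f(2.0));
        else   fprintf(stderr, "libfoo.so not found (placeholder example)\n");
        return 0;
    }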
In general CPUs have idle cores, but this looks like it's supposed to be a dual core setup so that's going to be less of a factor. I think the main idea is that you only have to do the optimization once but you run the code many times, which is an improvement from Transmeta where you threw out your optimizations when the process ended.
Oh, Transmeta. Back in the heyday of Slashdot, Transmeta had a massive cult following for many months before the product was even released. I remember the disappointment when they did finally ship. My dad bought one of the little Sony laptops with their proc and it was a dog (although power-efficient).
I asked it when apple announced 64 bit support for the iPhone, and I'll ask it again now: what is the point in 64 bit cpus in devices that will be deprecated and replaced before we get around to having more than 32 bits worth of addressable ram in these devices?
These aren't $1500 desktops with user upgradable components. They are sealed, non upgradable, relatively disposable devices that are intentionally obsoleted by their manufacturers every 12-18 months. It makes for a nice marketing bullet point, but are we actually getting anything besides a bigger number to impress the impressionable customer with?
"In short, the improvements to Apple's runtime make it so that object allocation in 64-bit mode costs only 40-50% of what it does in 32-bit mode. If your app creates and destroys a lot of objects, that's a big deal." [1]
One example: in 64bit NSNumber and (short) NSString objects on iOS can be stored completely in a pointer (tagged pointers) on the stack without having to create anything on the heap. That's possible because the size of a 64bit pointer is large enough to contain the required information. Creating and accessing one of these objects becomes far faster. Based on what I gathered at WWDC this year, Apple is also inclined to move as many various objects there as possible (as long as they can be stored in a 64 bit pointer).
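A rough sketch of the idea in C - the bit layout here is invented for illustration, since Apple's real tagged-pointer encoding is an undocumented implementation detail that has changed between releases:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical tagged pointer: small payloads ride in the pointer bits
       themselves, so "creating" the object is just arithmetic, no heap. */
    #define TAG_BIT 0x1ULL   /* low bit set => not a real heap address */

    static inline uintptr_t make_tagged_int(int64_t value) {
        return ((uintptr_t)value << 1) | TAG_BIT;
    }

    static inline int     is_tagged(uintptr_t p)    { return p & TAG_BIT; }
    static inline int64_t tagged_value(uintptr_t p) { return (int64_t)p >> 1; }

    int main(void) {
        uintptr_t n = make_tagged_int(42);   /* no malloc, no refcounting */
        if (is_tagged(n))
            printf("payload = %lld\n", (long long)tagged_value(n));
        return 0;
    }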
It is worth noting that this is likely to be an iOS-on-AArch64 specific tweak. Apple doesn't do this on x64 OS X (right?).
Furthermore, nothing has been said (AFAIK) about similar tricks Google is doing with Android on AArch64 or 64 bit architectures in general. From what I've heard they just made the Android runtime capable of emitting AArch64 (and x64, MIPS64) instructions instead of 32-bit ARMv7 ones.
That alone gives plenty of a performance boost as AArch64 instructions are supposedly quite a bit faster than the old 32-bit ones. It remains to be seen though whether Google can make as much use of clever allocation tricks as much as Apple can: Android is much more platform agnostic and uses a garbage collector for some of these tasks.
64-bit support on ARM entails the implementation of AArch64 - not just widening the current registers to 64-bits and nothing else.
AArch64 has a better instruction set and more general purpose registers (31 vs 16), so applications recompiled to target AArch64 should see performance benefits even if they only use 1 MB of RAM and never encounter integers greater than 2^32.
You can't just release a 64 bit chip and expect a full stack of 64-bit clean software to be available on day 1. Indeed, last time I checked Dalvik wasn't fully 64-bit clean. Releasing earlier than you need it ensures that by the time you do need it, the software is ready to go.
Also, besides the correct points made in sister comments, there is the issue that virtual address space is used for things other than simply mapping physical memory. There was a time, for example, when Linux mapped all of physical memory into the kernel's virtual address space. This was simple. These days, on 32-bit systems, it splits the 4GB address space into 3GB for the user and 1GB for the kernel, and maps physical memory in and out of a 128MB window in kernel space, as needed. This is obviously more complicated.
One thing that Mike missed, though, is that you can do large-bandwidth operations faster. You can blend two pixels at a time with 64-bit, so your graphical blend can be twice as fast. (Of course, if you're really worried about the speed of these types of operation, you go SIMD - but that's a pain in the butt and requires inline assembly language to get the biggest boosts. In particular, ARM Neon's C function wrappers on SIMD instructions generally gives you a much smaller boost than if you hand code the assembly language calls. With 64-bit, you can get some of the advantage without the hassle.)
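To illustrate the two-pixels-per-register point, here's the classic SWAR averaging trick in plain C - just a sketch of a 50/50 blend that rounds down, not production blending code:

    #include <stdint.h>
    #include <stdio.h>

    /* 50/50 blend of packed 8-bit channels, two 32-bit RGBA pixels per
       64-bit word. Uses the identity a+b = 2*(a&b) + (a^b), with a mask so
       carries can't cross channel boundaries. Rounds down. */
    static inline uint64_t blend2_avg(uint64_t a, uint64_t b) {
        return (a & b) + (((a ^ b) >> 1) & 0x7f7f7f7f7f7f7f7fULL);
    }

    int main(void) {
        uint64_t two_px_a = 0x00FF000000FF0000ULL;  /* two reddish pixels */
        uint64_t two_px_b = 0x000000FF000000FFULL;  /* two bluish pixels  */
        printf("%016llx\n", (unsigned long long)blend2_avg(two_px_a, two_px_b));
        return 0;
    }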
> In particular, ARM Neon's C function wrappers on SIMD instructions generally gives you a much smaller boost than if you hand code the assembly language calls.
My practical experience using intrinsics is quite the opposite. Using C + NEON or SSE intrinsics yielded a lot better performance than my hand-written assembly could.
I've heard similar stories to yours from others but they were from years ago. Later versions of compilers seem to be a lot better at this.
In particular, the C compilers (GCC, Clang) were able to further optimize code written using intrinsics, where using assembler code practically inhibits all compiler optimizations.
While I am able to do pretty good instruction selection by hand (which the compiler wasn't very good at), what blew me away was the compiler's ability to do instruction scheduling and register allocation. I could not have matched that without spending a huge amount of time reading architecture specific optimization manuals.
So I got pretty neat, readable and portable code (x86+SSE and ARM+NEON) using a single code base where I had only written some "primitive" functions using low level intrinsics. I could write nice high level code and the compiler would inline my primitive function calls and then further optimize and re-organize the instructions.
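Something like this is what such a "primitive" can look like (a sketch; the name add_f32 and the multiple-of-4 length assumption are mine). The kernel is written once per ISA with intrinsics, and the compiler is then free to inline it into callers, schedule the vector instructions, and allocate registers:

    #include <stddef.h>

    #if defined(__ARM_NEON)
      #include <arm_neon.h>
    #elif defined(__SSE__)
      #include <xmmintrin.h>
    #endif

    /* out[i] = a[i] + b[i], four floats at a time.
       Assumes n is a multiple of 4 to keep the sketch short. */
    static void add_f32(float *out, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i += 4) {
    #if defined(__ARM_NEON)
            vst1q_f32(out + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
    #elif defined(__SSE__)
            _mm_storeu_ps(out + i,
                          _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    #else
            for (size_t j = i; j < i + 4; ++j) out[j] = a[j] + b[j]; /* scalar */
    #endif
        }
    }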
Um, I don't think my GCC was that old; it was for an Android tablet from about two years ago - GCC 4.3 or 4.4 if memory serves me correctly. But yes, it was definitely a failure to optimise that was causing the problem. To be honest, I generated my "hand-written" assembly by taking the output of GCC compiling intrinsic-based code, and then removing all of the dumb, unnecessary shuffles to main memory by better use of the registers... That said, there were clearly bugs in the compiler - you could crash GCC by doing certain innocuous things in your assembly code...
Your virtual address space gets cramped before your physical RAM hits your addressable VA limit. See the pain points in 32-bit desktop operating systems trying to run with 4GB ram.
Linus has said that the pain point actually starts around 1GB, especially when using a 3/1 userspace/kernel address split. At 1GB, you can't map the entirety of physical ram into kernel address space simultaneously, since you also have memory-mapped IO taking up space. Let alone mapping the same physical memory to multiple virtual addresses with different cache attributes, which is useful because approximately none of these SoCs have full cache coherence across all hardware blocks.
Being able to mmap large files, even if you don't have the memory to hold them completely is a great advantage. Address space isn't just about physical memory.
This! I hope that 32 bit processors will soon be history and we will be able to use mmap (+ madvise, fadvise, etc) instead of read/write without having to worry about portability.
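For reference, the pattern looks roughly like this on a POSIX system (a sketch with minimal error handling; with a 64-bit address space, mapping a file much larger than RAM is no problem):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file at once; only the pages we touch get paged in. */
        const unsigned char *data =
            mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* Hint: we'll read front to back, so the kernel can read ahead
           and drop pages behind us. */
        madvise((void *)data, st.st_size, MADV_SEQUENTIAL);

        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i++) sum += data[i];
        printf("checksum: %lu\n", sum);

        munmap((void *)data, st.st_size);
        close(fd);
        return 0;
    }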
> I asked it when apple announced 64 bit support for the iPhone, and I'll ask it again now: what is the point in 64 bit cpus in devices that will be deprecated and replaced before we get around to having more than 32 bits worth of addressable ram in these devices?
I imagine the additional cpu registers (general purpose and floating point) are handy for crypto and encoder/decoder stuff.
Roughly double the performance, though not simply because of the move to 64-bit. For instance, the dual-core 64-bit Apple chip does quite well in benchmarks and application performance against numerous 32-bit Android devices with four cores.
Cortex-A15 designs and up already offer 36 to 40 bits of addressable memory, or from 64GB to 1TB. This is at the OS level (via LPAE), of course; individual 32-bit processes would still be limited to 2-3GB of virtual address space, but that's hardly a limit in a mobile design.
But ultimately it isn't about memory in the near term. ARMv8 offers more and larger registers, new instructions, a higher hardware baseline, and, in the future, more addressable memory.
This doesn't do any accelerated graphics. It remains to be seen whether NV are just trying to make life smoother for the proprietary parts of the driver stack on Android by getting the HW initialization etc. into upstream Linux, or if they are aiming to actually do an open 3D driver. They haven't announced any plans about the latter.
Good enough goes a long way when the competition is made of high-performance temperamental sports cars that tend to catch fire when you least want them to. Sometimes, all you need is a Honda Accord.
Analogies are failing me here, but the best one I have is this:
A bicycle requires less maintenance and is easier to build, repair, and operate than an F1 car. Taken a step further: the complex nature of the F1 car and the competitive nature of F1 races give far greater advantage to secrets which may give the team/car the competitive edge.
I don't like the fact that this is where the graphics industry is and I wish that they would simply compete on hardware and the tech was all open. but I do prefer nvidia on linux to any other GPU because they have the performance I'm looking for. Also their linux drivers are far better than AMD's.
I couldn't agree more. This has been the marketing spin for many generations of Intel graphics solutions. "Oh, this might not be as performant as one might want, but wait for next year's Sandy Bridge/Ivy Bridge/Haswell/Broadwell. Graphics will be x times as fast!"
And then it turns out it is barely more performant but hey... next year's will be though!
To be fair, Intel has made a lot of progress with their IGPs over the last few years. I have been slamming Intel graphics for exactly the same reasons as you and the commenter before you (and rightly so), but times do change. HD3000 and HD4000 were both big steps forward, much bigger than anything AMD has shown in the same timeframe, but Intel was so far behind that they were still lacking. Haswell graphics (HD5000, specifically Iris Pro) are already pretty close behind the lower-end AMD IGPs, though. If Intel manages a similarly large step forward with Broadwell, they might just step into the same league AMD is playing in.
Considering the Tegra K1 is the first CUDA-capable mobile processor, does anyone know if we'll be able to leverage CUDA bindings on devices that run on this?
Qualcomm has 64-bit chips, but AFAIK there's no Android device using them in the wild. The HTC "A11" was announced yesterday and will apparently sport a Snapdragon 410; I don't know that other devices exist yet.
Of course there's no device using Denver in the wild either.
Am I the only one weirded out by the supposed AnandTech quote appearing nowhere in the linked article, and by said article being nothing more than the announcement and display of some NVIDIA marketing material?