What a peculiar design. They went for an in-order design instead of the out-of-order design that has been standard in high-end ARM cores for years. Out-of-order usually brings more instructions per clock cycle ("more efficient"), but makes the chip more power-hungry and makes it harder to achieve a high clock speed.
So the in-order design they have now allows them to scale up to 2.5 GHz which is very high for such a constrained chip. They also have this weird "code optimizer" that appears to sacrifice some RAM and a CPU core to dynamically reorganize processor instructions to run better on this in-order design. It seems as though they want to have their cake and eat it too, but at the same time they appear to paradoxically introduce complexity to achieve simplicity.
I wonder if someone more knowledgeable about chip design could comment on this design?
Saying "When an in-order CPU stalls on memory, it's still burning power while waiting, while an OOO processor is still getting work done" seems deceptive, though. The OOO processor is (often) doing speculative work, which may be unneeded, and a stalled in-order CPU won't be using quite the same amount of power as if the execution units were actually switching.
Though I'm pretty sure the NVIDIA design must speculatively prefetch and hoist memory reads aggressively to be performance-competitive (see the huge cache sizes!), which also burns power.
"Denver is just one of the variants in that line. T50 was going to be a full 64-bit x86 CPU, not ARM cored chip, but Nvidia lacked the patent licenses to make hardware that was x86 compatible."
Neat. On Android it's easy to believe Dalvik/ART generated code leaves more performance on the table compared to other platforms.
Another "what's different this time" point: compared to Transmeta era, CPUs now typically have idle core(s) waiting for work to do. This may enable more ambitious optimizations as you're not constantly stealing cycles from the app.
Why do you lump Dalvik and ART together? The whole point of ART is that it's compiled to native code, is it not? Therefore it's "not leaving anything on the table".
Not sure why I'm being downvoted:
> The big paradigm-shift that ART brings, is that instead of being a Just-in-Time (JIT) compiler, it now compiles application code Ahead-of-Time (AOT). The runtime goes from having to compile from bytecode to native code each time you run an application, to having it to do it only once, and any subsequent execution from that point forward is done from the existing compiled native code.
Dalvik and ART both generate native code. ART does it at app install time and Dalvik, being a JIT, does it at app runtime.
He was speculating about quality of the generated native code, not its existence.
Google has been very quiet about ART generated code quality. I haven't found any benchmarks that would compliment its performance either. It's probably better than Dalvik but worse than mature optimizing compilers.
I don't think they have been very quiet about ART. They had a whole I/O session and they recently dedicated a whole dev backstage podcast to ART and its future evolution (btw, some nice things are planned: removal of the 65k limit, code hot-swap, ...).
20 or more years ago there was a peephole optimizer that looked at the code generated by Turbo Pascal (and other compilers that were not so advanced for their time), rearranged instructions, removed holes, and replaced instructions with better ones (it turns out some of the "macro" instructions were not that fast).
Nowadays calling a .so/.dylib/.dll function, or accessing a thread-local variable, also generates lots of cruft code that could possibly get optimized once the data is loaded. It won't always work (shared libraries can be unloaded, but with enough hints, or the assumption that this will never be done, one can gain some benefits). On the negative side, this may reduce code sharing across processes.
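One crude way to illustrate the "bind once, then call through a cached pointer" idea in C (only a sketch, valid under the assumption that the library is never unloaded; libfoo.so and foo_compute are made-up names):

    #include <dlfcn.h>
    #include <stdio.h>

    /* Resolve the function once; every later call is a plain indirect call
       instead of going through the PLT/lazy-binding stub each time.
       Only safe if "libfoo.so" is guaranteed to stay loaded. */
    typedef double (*compute_fn)(double);

    static compute_fn resolve_compute(void) {
        static compute_fn cached = NULL;
        if (!cached) {
            void *lib = dlopen("libfoo.so", RTLD_NOW);   /* placeholder library */
            if (lib) cached = (compute_fn)dlsym(lib, "foo_compute");
        }
        return cached;
    }

    int main(void) {
        compute_fn f = resolve_compute();
        if (f) printf("%f\n", f(2.0));
        else   fprintf(stderr, "libfoo.so not found (placeholder example)\n");
        return 0;
    }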
In general CPUs have idle cores, but this looks like it's supposed to be a dual core setup so that's going to be less of a factor. I think the main idea is that you only have to do the optimization once but you run the code many times, which is an improvement from Transmeta where you threw out your optimizations when the process ended.
Oh, Transmeta. Back in the heyday of Slashdot, Transmeta had a massive cult following for many months before the product was even released. I remember the disappointment when they did finally ship. My dad bought one of the little Sony laptops with their proc and it was a dog (although power-efficient).
I asked it when apple announced 64 bit support for the iPhone, and I'll ask it again now: what is the point in 64 bit cpus in devices that will be deprecated and replaced before we get around to having more than 32 bits worth of addressable ram in these devices?
These aren't $1500 desktops with user upgradable components. They are sealed, non upgradable, relatively disposable devices that are intentionally obsoleted by their manufacturers every 12-18 months. It makes for a nice marketing bullet point, but are we actually getting anything besides a bigger number to impress the impressionable customer with?
"In short, the improvements to Apple's runtime make it so that object allocation in 64-bit mode costs only 40-50% of what it does in 32-bit mode. If your app creates and destroys a lot of objects, that's a big deal." [1]
One example: in 64bit NSNumber and (short) NSString objects on iOS can be stored completely in a pointer (tagged pointers) on the stack without having to create anything on the heap. That's possible because the size of a 64bit pointer is large enough to contain the required information. Creating and accessing one of these objects becomes far faster. Based on what I gathered at WWDC this year, Apple is also inclined to move as many various objects there as possible (as long as they can be stored in a 64 bit pointer).
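A rough sketch of the idea in C - the bit layout here is invented for illustration, since Apple's real tagged-pointer encoding is an undocumented implementation detail that has changed between releases:

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical tagged pointer: small payloads ride in the pointer bits
       themselves, so "creating" the object is just arithmetic, no heap. */
    #define TAG_BIT 0x1ULL   /* low bit set => not a real heap address */

    static inline uintptr_t make_tagged_int(int64_t value) {
        return ((uintptr_t)value << 1) | TAG_BIT;
    }

    static inline int     is_tagged(uintptr_t p)    { return p & TAG_BIT; }
    static inline int64_t tagged_value(uintptr_t p) { return (int64_t)p >> 1; }

    int main(void) {
        uintptr_t n = make_tagged_int(42);   /* no malloc, no refcounting */
        if (is_tagged(n))
            printf("payload = %lld\n", (long long)tagged_value(n));
        return 0;
    }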
It is worth noting that this is likely to be an iOS-on-AArch64 specific tweak. Apple doesn't do this on x64 OS X (right?).
Furthermore, nothing has been said (AFAIK) about similar tricks Google is doing with Android on AArch64 or 64 bit architectures in general. From what I've heard they just made the Android runtime capable of emitting AArch64 (and x64, MIPS64) instructions instead of 32-bit ARMv7 ones.
That alone gives plenty of a performance boost as AArch64 instructions are supposedly quite a bit faster than the old 32-bit ones. It remains to be seen though whether Google can make as much use of clever allocation tricks as much as Apple can: Android is much more platform agnostic and uses a garbage collector for some of these tasks.
64-bit support on ARM entails the implementation of AArch64 - not just widening the current registers to 64-bits and nothing else.
AArch64 has a better instruction set and more general purpose registers (31 vs 16), so applications recompiled to target AArch64 should see performance benefits even if they only use 1 MB of RAM and never encounter integers greater than 2^32.
You can't just release a 64 bit chip and expect a full stack of 64-bit clean software to be available on day 1. Indeed, last time I checked Dalvik wasn't fully 64-bit clean. Releasing earlier than you need it ensures that by the time you do need it, the software is ready to go.
Also, besides the correct points made in sister comments, there is the issue that virtual address space is used for things other than simply mapping physical memory. There was a time, for example, when Linux mapped all of physical memory into the kernel's virtual address space. This was simple. These days, on 32-bit systems, it splits the 4GB address space into 3GB for the user and 1GB for the kernel, and maps physical memory in and out of a 128MB window in kernel space, as needed. This is obviously more complicated.
One thing that Mike missed, though, is that you can do large-bandwidth operations faster. You can blend two pixels at a time with 64-bit, so your graphical blend can be twice as fast. (Of course, if you're really worried about the speed of these types of operation, you go SIMD - but that's a pain in the butt and requires inline assembly language to get the biggest boosts. In particular, ARM Neon's C function wrappers on SIMD instructions generally gives you a much smaller boost than if you hand code the assembly language calls. With 64-bit, you can get some of the advantage without the hassle.)
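To illustrate the two-pixels-per-register point, here's the classic SWAR averaging trick in plain C - just a sketch of a 50/50 blend that rounds down, not production blending code:

    #include <stdint.h>
    #include <stdio.h>

    /* 50/50 blend of packed 8-bit channels, two 32-bit RGBA pixels per
       64-bit word. Uses the identity a+b = 2*(a&b) + (a^b), with a mask so
       carries can't cross channel boundaries. Rounds down. */
    static inline uint64_t blend2_avg(uint64_t a, uint64_t b) {
        return (a & b) + (((a ^ b) >> 1) & 0x7f7f7f7f7f7f7f7fULL);
    }

    int main(void) {
        uint64_t two_px_a = 0x00FF000000FF0000ULL;  /* two reddish pixels */
        uint64_t two_px_b = 0x000000FF000000FFULL;  /* two bluish pixels  */
        printf("%016llx\n", (unsigned long long)blend2_avg(two_px_a, two_px_b));
        return 0;
    }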
> In particular, ARM Neon's C function wrappers on SIMD instructions generally gives you a much smaller boost than if you hand code the assembly language calls.
My practical experience using intrinsics is quite the opposite. Using C + NEON or SSE intrinsics yielded a lot better performance than my hand-written assembly could.
I've heard similar stories to yours from others but they were from years ago. Later versions of compilers seem to be a lot better at this.
In particular, the C compilers (GCC, Clang) were able to further optimize code written using intrinsics, where using assembler code practically inhibits all compiler optimizations.
While I am able to do pretty good instruction selection by hand (which the compiler wasn't very good at), what blew me away was the compiler's ability to do instruction scheduling and register allocation. I could not have matched that without spending a huge amount of time reading architecture specific optimization manuals.
So I got pretty neat, readable and portable code (x86+SSE and ARM+NEON) using a single code base where I had only written some "primitive" functions using low level intrinsics. I could write nice high level code and the compiler would inline my primitive function calls and then further optimize and re-organize the instructions.
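Something like this is what such a "primitive" can look like (a sketch; the name add_f32 and the multiple-of-4 length assumption are mine). The kernel is written once per ISA with intrinsics, and the compiler is then free to inline it into callers, schedule the vector instructions, and allocate registers:

    #include <stddef.h>

    #if defined(__ARM_NEON)
      #include <arm_neon.h>
    #elif defined(__SSE__)
      #include <xmmintrin.h>
    #endif

    /* out[i] = a[i] + b[i], four floats at a time.
       Assumes n is a multiple of 4 to keep the sketch short. */
    static void add_f32(float *out, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i += 4) {
    #if defined(__ARM_NEON)
            vst1q_f32(out + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
    #elif defined(__SSE__)
            _mm_storeu_ps(out + i,
                          _mm_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    #else
            for (size_t j = i; j < i + 4; ++j) out[j] = a[j] + b[j]; /* scalar */
    #endif
        }
    }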
Um, I don't think my GCC was that old; it was for an Android tablet from about two years ago - GCC 4.3 or 4.4 if memory serves me correctly. But yes, it was definitely a failure to optimise that was causing the problem. To be honest, I generated my "hand-written" assembly by taking the output of GCC compiling intrinsic-based code, and then removing all of the dumb, unnecessary shuffles to main memory by better use of the registers... That said, there were clearly bugs in the compiler - you could crash GCC by doing certain innocuous things in your assembly code...
Your virtual address space gets cramped before your physical RAM hits your addressable VA limit. See the pain points in 32-bit desktop operating systems trying to run with 4GB ram.
Linus has said that the pain point actually starts around 1GB, especially when using a 3/1 userspace/kernel address split. At 1GB, you can't map the entirety of physical ram into kernel address space simultaneously, since you also have memory-mapped IO taking up space. Let alone mapping the same physical memory to multiple virtual addresses with different cache attributes, which is useful because approximately none of these SoCs have full cache coherence across all hardware blocks.
Being able to mmap large files, even if you don't have the memory to hold them completely is a great advantage. Address space isn't just about physical memory.
This! I hope that 32 bit processors will soon be history and we will be able to use mmap (+ madvise, fadvise, etc) instead of read/write without having to worry about portability.
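For reference, the pattern looks roughly like this on a POSIX system (a sketch with minimal error handling; with a 64-bit address space, mapping a file much larger than RAM is no problem):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file at once; only the pages we touch get paged in. */
        const unsigned char *data =
            mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* Hint: we'll read front to back, so the kernel can read ahead
           and drop pages behind us. */
        madvise((void *)data, st.st_size, MADV_SEQUENTIAL);

        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i++) sum += data[i];
        printf("checksum: %lu\n", sum);

        munmap((void *)data, st.st_size);
        close(fd);
        return 0;
    }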
> I asked it when apple announced 64 bit support for the iPhone, and I'll ask it again now: what is the point in 64 bit cpus in devices that will be deprecated and replaced before we get around to having more than 32 bits worth of addressable ram in these devices?
I imagine the additional cpu registers (general purpose and floating point) are handy for crypto and encoder/decoder stuff.
Roughly double the performance, though not simply because of the move to 64-bit. For instance, the dual-core 64-bit Apple chip does quite well in benchmarks and application performance against numerous 32-bit Android devices with four cores.
Cortex-A15 designs and up already offer 36 to 40 bits of addressable memory, or from 64GB to 1TB. This is at the OS level (via LPAE), of course; individual 32-bit processes would still be limited to 2-3GB of virtual address space, but that's hardly a limit in a mobile design.
But ultimately it isn't about memory in the near term. ARMv8 offers more and larger registers, new instructions, a higher hardware baseline, and, in the future, more addressable memory.
This doesn't do any accelerated graphics. It remains to be seen whether NV are just trying to make life smoother for the proprietary parts of the driver stack on Android by getting the HW initialization etc. into upstream Linux, or if they are aiming to actually do an open 3D driver. They haven't announced any plans about the latter.
Good enough goes a long way when the competition is made of high-performance temperamental sports cars that tend to catch fire when you least want them to. Sometimes, all you need is a Honda Accord.
Analogies are failing me here, but the best one I have is this:
A bicycle requires less maintenance and is easier to build, repair, and operate than an F1 car. Taken a step further: the complex nature of the F1 car and the competitive nature of F1 races give far greater advantage to secrets which may give the team/car the competitive edge.
I don't like the fact that this is where the graphics industry is and I wish that they would simply compete on hardware and the tech was all open. but I do prefer nvidia on linux to any other GPU because they have the performance I'm looking for. Also their linux drivers are far better than AMD's.
I couldn't agree more. This has been the marketing spin for many generations of Intel graphics solutions. "Oh, this might not be as performant as one might want, but wait for next year's Sandy Bridge/Ivy Bridge/Haswell/Broadwell. Graphics will be x times as fast!"
And then it turns out it is barely more performant but hey... next year's will be though!
To be fair, Intel has made a lot of progress with their IGPs over the last few years. I have been slamming Intel graphics for exactly the same reasons as you and the commenter before you (and rightly so), but times do change. HD3000 and HD4000 were both big steps forward, much bigger than anything AMD has shown in the same timeframe, but Intel was so far behind that they were still lacking. Haswell graphics (HD5000, specifically Iris Pro) are already pretty close behind the lower-end AMD IGPs, though. If Intel manages a similarly large step forward with Broadwell, they might just step into the same league AMD is playing in.
Considering the Tegra K1 is the first CUDA-capable mobile processor, does anyone know if we'll be able to leverage CUDA bindings on devices that run on this?
Qualcomm has 64-bit chips, but AFAIK there's no Android device using them in the wild. The HTC "A11" was announced yesterday and will apparently sport a Snapdragon 410; I don't know that other devices exist yet.
Of course there's no device using Denver in the wild either.
Am I the only one weirded out by the supposed AnandTech quote appearing nowhere in the linked article, and by said article being nothing more than the announcement and display of some NVIDIA marketing material?