> Continuum -- the capability that will allow Windows 10 Mobile devices to connect to external displays and keyboards -- is going to be a key for the company
This actually sounds like a very good move by Microsoft. Just issue people a phone and they will do all their work on that. There's really no need for giant workstations anymore, and I think this will be more successful than a Chromebook-type thing.
I agree. Developers often assume that everyone has the latest flagship laptop. A large majority of users, particularly in developing countries, only have access to $200-$400 laptops that struggle to run modern applications.
We either have to focus on making our applications more efficient, as we did 10 or so years ago, or make high performance computers more accessible.
I think for many that slowness is a hard-drive thing combined with dozens of toolbars, though. Phones have solid-state memory, so that shouldn't be a major problem.
The solid-state memory in phones is a far cry from anything in your computer, though, and we regularly see devices with storage that's noticeably slower than even mechanical HDDs.
Think of flash storage on phones as being in the class of a fast SD card, not the SATA3 SSD in your laptop.
By the time this is in place, it's likely a high-end phone will easily be sufficient for average computing. You won't run VMs on it, but it should be comparable in power to an iPad Pro.
Seems about every 2-3 years a company tries this. And every single time my hopes go up like crazy. I SOOOOOOOO want this, and every time I'm disappointed. The iPad Pro is close, as is the Surface Book.
What ends up biting me is the need for a docking station or a laptop form factor thing for use on the go. If display glasses/goggles could ever become usable for long duration it would go a long way towards making a single device usable everywhere.
I think Bluetooth and Miracast are getting us very close to a good "on the go" solution. Even though hotel screens haven't picked up Miracast support directly yet, the price of a good Miracast-supporting HDMI stick has dropped spectacularly (Amazon's Fire Stick, for one; Microsoft also has a Miracast-only stick whose name I forget). Between that stick and a good Bluetooth keyboard/mouse combo you have most of an on-the-go solution.
Or the built-in ability to run 16-bit DOS/Windows apps on NT4 running on MIPS Magnum R4000. I had one for a while and it was rather nifty at the time...
And then, as you mention, FX!32 for running 32-bit apps (slowly) on NT/Alpha - I had it running on a Multia for a while. Those were some fun tiny machines.
I miss Alpha. Such a good arch. I wish Intel would do something with it, though I know they never will because there's really no market for it. I'd do something drastic to get my hands on an Alpha system in a Raspberry Pi form factor, though... or an Alpha laptop, or desktop, or Chromebook, or tablet, or server, or...
Last time I checked, there wasn't even a working emulator for Alpha. I've been thinking about trying my hand at it, if only I had the time and the resources...
Adding to andriew's comment, x86's stricter memory model reduces the hardware's flexibility and enforces stricter ordering than is usually required. If you think you really need the x86 memory model, you're probably doing something wrong, but you can get it (at significant cost) by adding lots of memory fences.
The Alpha was also interesting in that it tried extremely hard to avoid saddling itself or future versions with legacy baggage.
Unlike x86, ARM, MIPS, POWER, PA-RISC, SuperH, etc., Alpha was designed without a legacy 32-bit addressing mode. Of course, if you link against a malloc implementation that stays below the 4 GB boundary, you can use JVM-like pointer compression. (If you need to support heap objects with pointers to the stack, you'll of course also need your stack allocated below the 4 GB boundary.) A lot of pointer-heavy code, however, can be rewritten to use 32-bit array indexes instead.
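To make the index trick concrete, here's a minimal C sketch (the names and pool size are mine, not from any real Alpha codebase):

    #include <stdint.h>

    #define NIL UINT32_MAX

    /* Linked list whose links are 32-bit indexes into a pool rather
       than 64-bit pointers: half the per-link footprint on a
       64-bit-only architecture like Alpha, with no compressed-pointer
       tricks needed. */
    struct node {
        int32_t  value;
        uint32_t next;          /* index into pool[], NIL terminates */
    };

    static struct node pool[1u << 20];

    static int64_t sum_list(uint32_t head)
    {
        int64_t total = 0;
        for (uint32_t i = head; i != NIL; i = pool[i].next)
            total += pool[i].value;
        return total;
    }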
They tried to get away without single-byte load or store instructions (many string manipulations don't really act on single bytes), though they were added in a later revision. A friend hired to write compilers for DEC shortly before the Compaq buyout told me that hardware engineers had to show simulated benchmark improvements when arguing for new instructions. They pushed back hard to keep valuable instruction space from becoming a junkyard of legacy instructions.
As mentioned, they made as few memory guarantees as practical and forced applications to use memory fence instructions to make their needs explicit to the processor. This left them more flexibility in implementing later models.
The firmware (PAL Code) was almost a hypervisor/nanokernel, with the OS kernel making calls to the PAL Code. The PAL Code version used for Tru64 UNIX/Linux implemented just two protection rings, while the PAL Code version used with OpenVMS emulated more protection rings. As you remember, as long as access violations trap to the most privileged ring, you can emulate an arbitrary number of rings between the least and most privileged rings.
> Adding to andriew's comment, x86's stricter memory model reduces the hardware's flexibility and enforces stricter ordering than is usually required. If you think you really need the x86 memory model, you're probably doing something wrong, but you can get it (at significant cost) by adding lots of memory fences.
I think it's perfectly sensible to be weaker than x86. But having to deal with data dependency as you have to do on Alpha is just too cumbersome / hard to get right. Acquire/release (or read/write/full) memory barriers are quite easy to understand, but data dependency barriers are damned finicky, and missing ones are really hard to debug.
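For anyone who hasn't fought with this: the data-dependency case is exactly what C11's memory_order_consume was invented for. A sketch (in practice compilers just promote consume to the stronger acquire):

    #include <stdatomic.h>

    struct msg { int payload; };
    static _Atomic(struct msg *) slot;

    /* Reader that relies only on the data dependency between loading
       the pointer and dereferencing it. On ARM/POWER that dependency
       orders the two loads for free; on Alpha even this needed an
       explicit memory barrier, which is the oddity memory_order_consume
       was meant to model. */
    static int read_payload(void)
    {
        struct msg *m = atomic_load_explicit(&slot, memory_order_consume);
        return m ? m->payload : -1;
    }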
I'm more than a bit doubtful that the cost of a halfway sane (even if weak) coherency model is a relevant blocker in upping the scale of both CPU core counts and applications. At the moment the biggest issue seems to be the ability of developers to write scalable software, and to come up with patterns that make writing scalable software more realistic.
I suspect there's room for selectively reducing the coherency effects of individual instructions in some memory models - e.g. x86 atomic instructions that don't essentially imply a full memory barrier (disregarding uncached access and such) would be great. Selectively weakening the ordering makes it possible to pay the cost only in the hottest codepaths.
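C11 already lets you express that distinction at the source level; a small sketch (whether the hardware can actually exploit the relaxed ordering is up to the architecture):

    #include <stdatomic.h>

    static atomic_ulong hits;

    /* A statistics counter needs atomicity but no ordering, and C11
       lets you say so. On ARM64 this can compile to a cheap unordered
       atomic add; on x86 the lock-prefixed instruction acts as a full
       memory barrier whether you asked for one or not. */
    static void bump(void)
    {
        atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
    }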
And yet a weaker memory model (as well as explicit cache and TLB maintenance) removes a major glass ceiling from theoretical chip performance, practical perf/watt, and silicon design and manufacture costs.
> He noted that this technology seemingly has a new name, "CHPE." [...] My guess is the HP here is HP, as HP has been working increasingly closely with Microsoft on its Windows 10 PCs and the HP Elite x3 Windows Phone. (Maybe the "E" is for emulation?)
The binary file format on Windows is called PE (portable executable). I wonder if this might possibly be a fat binary format.
They wouldn't need to change the binary format for emulation though. It may be something more like "universal" executables from OS X when they went to x86 though?
Before someone says WOW64 isn't an emulator, the article isn't actually wrong. “64-bit Windows” originally meant Itanium, and WOW64 was an emulator on that platform. Of course, it isn't (very much of one?) on AMD64.
In this case, "minimal hypervisor and address space translation layer" (more or less - I've no idea exactly what it does) is somewhat harder to remember.
I think they were going for "simplest/shortest word that gets the point across" since that document is likely to be read by people leaning more toward executive/high-level positions as opposed to the detail-jugglers who shout at bare metal all day.
Address space translation is not a small thing to do in the win32 API. Not only do you need to translate pointer parameters and return values for thousands of API function calls, you also need to translate pointers in (possibly nested) structures, and also to do the same for each and every possible window message, where the general message payload may be any one of a combination of integers, pointers, and structures of (structures of) pointers. The pointers may be data pointers where the pointed-to object may be data such as a string that needs to be readable and writable from either side of the address space, or functions that can be called from either side.
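To give a flavor of the problem, here's a hypothetical thunk in C; the structure and function names are illustrative, not the real WOW64 interfaces:

    #include <stdint.h>

    /* 32-bit-side layout: embedded pointers are 32-bit values. */
    struct rect32  { int32_t left, top, right, bottom; };
    struct paint32 { uint32_t hdc; uint32_t rect; /* 32-bit pointer */ };

    /* 64-bit-side layout the native API actually expects. */
    struct paint64 { uint64_t hdc; struct rect32 *rect; };

    /* In a shared address space this is a zero-extension; with
       separate address spaces it would be a real translation step. */
    static void *from_guest(uint32_t p)
    {
        return (void *)(uintptr_t)p;
    }

    /* One call's worth of translation: widen the handle, translate
       the embedded pointer. Nested structures and window-message
       payloads recurse through the same kind of logic, thousands of
       times over across the API surface. */
    static void thunk_paint(const struct paint32 *in, struct paint64 *out)
    {
        out->hdc  = in->hdc;
        out->rect = from_guest(in->rect);
    }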
Why would you do that? They can just include the 32-bit libraries.
Wine works by changing enough of the Windows system that the libraries can communicate with the host operating system.
Microsoft already has the 32-bit libraries; 'all' they have to do is make sure the parts of the libraries that interface with the rest of the system do the translation, which is much less work and a reasonably well-defined edge.
Wow. Now I'm wondering whether WoW64 does all that, then - one missed API call reference or data structure and boom goes the 32-bit app.
I actually forgot memory managers were a thing when I was writing the parent comment, and momentarily thought the lack of address space translation would mean all 32-bit processes were stuck in the first 4GB of RAM. AST is indeed more of a thunking issue.
Thanks for the info about Win3emu, that's awesome!
It's a very similar context where the word "emulator" is often erroneously used to describe how it works. It's the first thing I thought of when I read "WOW64 is the x86 emulator..."
To be precise, Win64 for Itanium contained a binary blob provided by Intel, which worked as an x86 emulator. I think some Itaniums used it and others had hardware to run x86 code.
Yeah, that's correct. Some (most?) of the first series (Merced) of Itaniums had hardware to help accelerate the emulation of x86 instructions. I don't know about the later iterations, though, as I've only run OpenVMS on the later ones and not Windows.
There's not much to emulate, but it's still there; I've spotted plenty of stuff labeled Windows-on-Windows or WOW64 digging around Win7 installs. It may not be doing CPU emulation (IIRC they just use the 64-bit registers as 32-bit ones and remap calls), but it is emulating nonetheless.
How awesome would it be if we could have a PnP processor? If you are docked, you run native x86 code. If you are mobile, you emulate it. The docking/undocking process could even be similar to VMotion.
It's a cool idea, but wouldn't the CPU be awfully far from the cache and RAM if you did that? You could put some RAM in the docking station with the x86 chip, but at that point you're two-thirds of the way toward just having another computer in there, which kind of defeats the purpose.
I agree, it would probably be necessary for everything to live in the docking station other than storage. You'd basically be transferring the live system state, which would then take over from the portable device.
The idea here is that you can "convert" your mobile device into a powerful desktop, though in reality you are just moving the current, live state from mobile to desktop.
That is what some people have conjectured is in the works with a "Surface Phone," although it's really anyone's guess at this point. But it certainly is very appealing. With this new emulation news, the use-case could be that you can run your x86 apps (slowly) in a pinch while out and about. When you get home or to the office, you dock up to get a full-powered PC experience.
It shouldn't take very long to copy a couple gigabytes (and it's usually not more than a couple) from RAM to RAM on docking. And a clever memory controller could copy it only when the memory is first demanded by the docking station CPU.
You don't need x86 when mobile; legacy desktop apps don't work on small phone screens anyway. When mobile, you use native ARM for low power; when connected to power and a screen/mouse, you can use more power and emulate x86 on the phone itself.
You can do this in the 99% case on Linux using qemu-user, but I'm not aware of an equivalent for Windows (and booting an entire operating system in an emulator is a poor experience and will be even slower). I imagine this Windows variant works like qemu-user and Rosetta.
QEMU has some bad performance penalties, I think partially related to the generic JIT (TCG), extremely strict floating point, and full host memory isolation. ExaGear is significantly closer to native performance but I think only supports emulating 32-bit x86.
This is a trade-off with any emulation really. The more hardware accurate your emulation gets, the more you have to keep track of in virtual registers and state, and all that extra tracking adds up in a hurry.
Native virtualization dodges a lot of that performance penalty by using the hardware features directly where possible. But that's only really feasible on systems that have nearly identical hardware; x86 is pretty standardized at this point and virtualizes quite well. ARM is much more varied and harder to virtualize, even on other ARM platforms due to differing feature sets.
It's not that in this case, this is plain binary translation and not cycle accurate emulation. QEMU is super slow on x-to-x translation as well (something like 8x to 80x slowdown). Its translator is inefficient (for the sake of portability, the source native code is first translated to an IR and then the IR is compiled to target native code with limited optimisation) and it emulates all floating point instructions in software.
And these guys ship an x86-to-ARM DBT which they claim has significantly better performance: https://eltechs.com/product/exagear-desktop/. I haven't tried it. QEMU is the slowest DBT system I'm aware of, so it's entirely plausible. Translating from a strongly ordered architecture to a weakly ordered one is a big challenge; I wonder if they handle threads efficiently.
qemu-user can already do that too. On Linux you have to put a magic string into an /etc/binfmt.d file (on Fedora at least, the qemu-user package does that for you).
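For reference, the line in question is a binfmt_misc registration; it looks something like this (the exact magic/mask bytes are whatever your distro's qemu-user package ships, so treat these as illustrative):

    # Hypothetical /etc/binfmt.d/qemu-arm.conf: hand 32-bit ARM ELF
    # binaries to qemu-arm. Fields: :name:type:offset:magic:mask:interpreter:
    :qemu-arm:M::\x7fELF\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x28\x00:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff:/usr/bin/qemu-arm: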
How would this actually work? Wouldn't it be painfully slow? Even ARM emulation on x86 seems to be pushing it, and I feel like the reverse should be 10x worse at best...
If your experience of ARM emulation of x86 is qemu, especially the full-system variant, you're getting something much slower than emulation needs to be:
- Anything floating point or SIMD is generally done with a big mass of C helper functions rather than fully inline, and it's always unrolled rather than translated into native SIMD (see the sketch after this list). You can blame qemu's choice to use an architecture-neutral IR, but even with that design it could be a lot smarter than it is. In any case, a JIT designed for a specific guest/native architecture combo should be able to produce much more efficient instruction sequences.
- In system mode, qemu has to emulate page tables, which it does by translating every load/store to call a stub that maps the given virtual address to a "physical" one before loading. This is quite slow. User-mode qemu (where qemu is a host-arch Linux process pretending to be a single target-arch Linux process) is faster because it doesn't need to perform any translation itself (the host page tables provided by the native kernel do the job). Here too qemu could be improved - it's possible to use host page tables for full system emulation, though in some cases at the expense of hardware accuracy - but AFAIK Microsoft's emulator is user-mode, so this too isn't an issue.
- Those are only the big issues; I think there are a lot of small cases where there's room to do better than qemu, either in general or by optimizing for a specific architecture pair.
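To make the first point concrete, here's what "unrolled helper" versus native SIMD means for one guest instruction, SSE2's paddd; the C below is illustrative, not qemu's actual code:

    #include <stdint.h>

    typedef struct { uint32_t lane[4]; } vec128;

    /* paddd performs four parallel 32-bit adds. A generic IR tends to
       lower it to an element-by-element loop like this one, while an
       x86-to-ARM64-specific translator could emit a single NEON
       "add v0.4s, v1.4s, v2.4s" instead. */
    static vec128 emulate_paddd(vec128 a, vec128 b)
    {
        vec128 r;
        for (int i = 0; i < 4; i++)
            r.lane[i] = a.lane[i] + b.lane[i];
        return r;
    }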
Also, I'm not sure how much this matters, but translating x86->ARM64 has some small built-in advantages over other pairs due to the nature of the ISAs. x86 has few general-purpose registers, while ARM64 has 31, so you can map each x86 register to an ARM register throughout the emulation; going the other way around you'd be constantly shuffling registers to and from memory. And a small one: ret on x86 requires the use of the stack pointer and memory, while ARM64's ret just takes an arbitrary register argument. (In both cases, the visible semantics are the same as a regular indirect jump, but the instructions are optimized to quickly return to the corresponding call instruction using a shadow stack in hardware.) All indirect jumps, including returns, have some emulation overhead because you need to translate the guest code address to a native one; it's a bit easier when the native instructions are more flexible.
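As a sketch of that indirect-branch overhead (entirely illustrative; no claim this is how Microsoft's translator works): the hot path has to map every guest x86 code address to its translated native address, typically through a small cache in front of a full table:

    #include <stdint.h>
    #include <stddef.h>

    #define CACHE_BITS 12
    #define CACHE_MASK ((1u << CACHE_BITS) - 1)

    struct map_entry { uint64_t guest_pc; void *host_code; };
    static struct map_entry branch_cache[1u << CACHE_BITS];

    /* Slow path: in a real translator, a hash-table walk over all
       translated blocks (or a call back into the JIT); stubbed here. */
    static void *full_lookup(uint64_t guest_pc)
    {
        (void)guest_pc;
        return 0;
    }

    /* Executed for every emulated ret or indirect jmp: hash the guest
       address, hit the cache if possible, fall back otherwise. */
    static void *translate_target(uint64_t guest_pc)
    {
        struct map_entry *e =
            &branch_cache[(guest_pc ^ (guest_pc >> CACHE_BITS)) & CACHE_MASK];
        if (e->guest_pc != guest_pc || e->host_code == NULL) {
            e->guest_pc  = guest_pc;
            e->host_code = full_lookup(guest_pc);
        }
        return e->host_code;
    }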
Doesn't ARM have very different memory consistency guarantees than x86 across cores, requiring memory barriers? I figured that would be one of the trickiest parts of emulating x86 on ARM.
Given the entire OS would still be native, it might not be too bad for certain classes of applications. You would expect games to perform terribly but standard corporate desktop applications would probably perform pretty well.
Many of our critical corporate applications were designed a decade ago -- so you just have to emulate the average desktop performance from those days.
>standard corporate desktop applications would probably perform pretty well.
Don't say such evil things.
I spend half my days being frustrated by corporate software: .NET software that runs on a modern i7 ultrabook with an SSD. It doesn't do anything wild (mostly buttons & menus)... yet it's bloody slow. HOW???
The button calls into a recently developed micro-service, that calls into a previous-generation SOAP webservice, that sends a message to the enterprise service bus, that then goes to netweaver middleware, that calls into the SAP R/3 backend that has been there since 1992 but was recently upgraded to use an even more expensive database.
Of course, the button has to wait for the response of all this, and blocks the UI while waiting...
A friend showed me the industry-standard database app used to manage most Internet-connected second-hand bookshops (I think).
Its development decisions went something like:
"Hey, let's be sooooo awesome and update the search results live, as you type! Wow, without even a one-second delay this is SO REALTIME - like we're in 2187"
(later)
"Do we need to index the database? ...eh, nah. I don't think our custom database solution even supports indexing. Kay."
In my friend's case the store he's at has tens of thousands of books.
Cue perpetual SSD replacements by all the shops using this software
My approach to using it (when my friend and I were discussing it while I visited) was to hit Win+R to obtain a text box, type my search string, then ^C/ESC/^V it over. The database app would lock up for about 1.2 seconds per keypress.
The problem is there shouldn't be so much I/O for loading programs in the first place, but .NET loves loading an exabyte from disk whenever any program runs.
Given that Apple pulled it off twice with the Mac 68K emulator (for the Motorola 680x0 to PowerPC transition) and on OS X with Rosetta (for the PowerPC to x86 transition), it's certainly possible and ARM processors have come a long way.
Back then, "faster" really meant faster. These days, there's far less of a gap between processor generations for a consumer use-case. Running Firefox on Windows 10 to check Gmail and Facebook and occasional Word usage probably wouldn't feel much slower if you're on an Apple A10 processor vs a Skylake core i5. We long ago reached the point where an iPad processor was fast enough for consumer usage, and I knew plenty of college students who were happy with their Surface (non-pro, ARM-based).
You and I want the fastest machines possible, and right now that's x86. For everyone else, I doubt they've even given it a thought, and I don't think they'd notice.
It could be difficult for Microsoft, but since Apple has its own line of ARM processors, could Apple theoretically add a couple of strategic instructions to its CPUs to help speed up the emulation?
They did the same for 68k programs when they first went to PowerPC. And there's mild rumor building of them switching to ARM as well. 4 processor architectures across the Mac's history is pretty impressive.
Apple will stay Apple. I don't think they'll go anywhere.
The question is Google. If this happened in 2008, I don't think Android would have taken off anywhere close to the way it did.
But now? On one hand, Android has millions of apps already on the market. On the other hand, Microsoft now has potentially millions of old, existing applications.
I don't think it will make a dent in the phone market. It's too commonly used as a hand-held rather than a station, and Windows apps are useless there.
On the other hand, it could tank the Android tablet market.
I agree on the phone side. How many of those millions of apps are usable from a phone screen, using a phone interface? This sounds like cool technology, but this particular use case seems of very limited use on phones.
However, on the tablet side, it may allow Microsoft to bring down the price of the Surface a bit while still maintaining legacy app compatibility.
My understanding is that Intel caught up with ARM a couple years ago on performance-per-Watt, but how's the idle power consumption of 64-bit Atom processors these days compared to 64-bit ARM offerings? For many consumer use cases, idle power consumption has a bigger impact on battery life than does performance-per-Watt.
I'm a bit sad this news came out after SoftBank bought up my ARM shares, but I'm glad to see more evidence we may yet get the x86 monkey off our back.
A lot of WP fans seem to think that having desktop apps will help it gain a ton of popularity. I think at this point people would probably like the idea, but it would still be a very niche choice.
My skepticism says it'll benefit HP's competition but, most of all, Microsoft, since whatever they do will be part of Windows and become available to every other Windows OEM.
SoCs like the LattePanda can run the full version of Windows. I speculate that flagship phones by MS will have an Intel chip in them, so there'd be no need for emulation. Evan Blass also tweeted that we may see an Intel chip in our phones in the near future.
But anyway, this emulation will be helpful for cheaper phones. I am holding my breath for that day.
I guess this is the same reason HyperKit exists for macOS: so Apple can switch to ARM for desktop or run iOS apps on macOS.
But I still don't get how Bitcode fits into this. It seems like two separate teams at Apple had to tackle the problem of how to run apps on different architectures. One solution is Bitcode; the other is HyperKit, which would be used like a new Rosetta, but for x86 -> arm64 instead of PowerPC -> x86.
Long overdue, but still welcome. Intel direly needs the competition. With AMD Zen and ARM notebooks coming, hopefully the market will look much more competitive in 2018.
Also, maybe Microsoft will have the guts to do what Google never did: standardize ARM processors, so that all ARM devices can be updated at once. Although I assume Microsoft will be "Qualcomm-only" at first, just like it was for phones.
Thus far the approach (dating back to Windows 8) seems to be offloading that need to the Store (cloud) backend and/or the installer delivering the right bits instead of delivering everything. This is clearest in .NET applications and the .NET Native stack on Windows: most of the work in .NET Native builds to produce platform binaries actually happens on Store servers. Even Windows 8's Store encouraged uploading .NET IL, and in 8 the Store client would do the final AOT compile to ARM/x86 (with that getting offloaded to Store servers in 8.1).
FWIW, neither did Mach-O in the beginning; the fat magic and fat_arch structs are bolted onto an archive of single-architecture binaries, and they debuted in 2005.
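For the curious, the whole "fat" layer really is just a tiny table of contents in front of N ordinary Mach-O files; paraphrased in C from <mach-o/fat.h> (the real headers use cpu_type_t typedefs):

    #include <stdint.h>

    #define FAT_MAGIC 0xcafebabe

    struct fat_header {
        uint32_t magic;        /* FAT_MAGIC, stored big-endian */
        uint32_t nfat_arch;    /* number of fat_arch entries following */
    };

    struct fat_arch {
        uint32_t cputype;      /* e.g. x86_64, arm64 */
        uint32_t cpusubtype;
        uint32_t offset;       /* file offset of this arch's slice */
        uint32_t size;
        uint32_t align;        /* alignment as a power of 2 */
    };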
The article cites Windows 10, so I suspect it's primarily Mobile and potentially IoT. That doesn't mean some flavor of it won't make it into Windows Server, but currently they don't ship an ARM flavor of Server. If we saw ARM support in Server, it probably wouldn't be until 2020, since Microsoft isn't known for making sweeping changes with their R2 releases and 2016 just shipped.