Finding a bug in Win10 BOOTMGR when chaining to NTLDR (bugzilla.mozilla.org)
126 points by yuhong on Jan 27, 2017 | 46 comments



The summary, approximately: there are some crashes detected by Firefox and it seems Firefox (or whatever Firefox uses as its libraries) thinks it can use AVX instructions even if the OS isn't aware of the AVX technology and therefore can't preserve the content of the registers during its "normal work."

And yuhong has found that the reason Firefox thinks that, specifically on Windows XP booted with the Windows 10 loader, is that the Windows 10 loader sets an "I know about AVX" bit and does not clear it before handing control to Windows XP.


Set it then forgot to clear it. The "bits" include CR4.OSXSAVE and XCR0 (set by XSETBV). BOOTMGR is used for booting Vista and later, and it chains to NTLDR for booting XP/Server 2003 and older.


Windows has always had the worst dualboot story. It'll gladly overwrite grub etc without asking (to "helpfully" fix boot problems?), but if your partitions are ordered a little out of the ordinary it'll still throw a ridiculously not-helpful error. They might as well just stop pretending to support it.


This took me a while to figure out, but if you use UEFI, and have Windows and Linux on different drives / different EFI system partitions, you can use grub-install --force-extra-removable to keep Linux bootable after Windows helpfully clears out the NVRAM Boot* variables: https://github.com/ludios/ubuntils/blob/master/bin/reinstall...


I legit just physically disconnect the drives with other operating systems while installing Windows in dual boot situations. Can't break what isn't there!


That worked with MBR boot, but with UEFI, Windows will notice you have Boot* variables in motherboard NVRAM pointing to a Linux install that doesn't exist, then clear them for you.


Sometimes very necessary. Windows 7 had a bug that caused it to spread boot information across multiple drives if you had more than one plugged in at install time, even if you specified only to use one.


Seems legit... nice summary.


Yes, that's been my reading as well.


A cool way to fix this if Microsoft doesn't want to (although I can't see it being that hard) would be to make a small stub that simply kills the AVX flags then chainloads through to NTLDR.

It would probably be quite a simple project to work on, and a fun way to learn about bootloader-level software development. Chainloading NTLDR is well understood (and will never change), and being booted by BOOTMGR is fairly well understood too. If I was running Windows at the moment I'd be seriously considering playing with this myself.
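A rough model of the register arithmetic such a stub would need, written in Python only to show the bit manipulation. The bit positions are the architectural ones (CR4.OSXSAVE is bit 18; XCR0 bits 0, 1 and 2 cover x87, SSE and AVX state). The function and constant names are made up for this sketch, and nothing here is taken from BOOTMGR or NTLDR; a real stub would do this in assembly with MOV to CR4 and XSETBV.

```python
from typing import Tuple

# Architectural bit positions; all names are hypothetical, for illustration.
CR4_OSXSAVE = 1 << 18   # CR4.OSXSAVE: OS has enabled XSAVE/XGETBV/XSETBV
XCR0_X87 = 1 << 0       # x87 state, must always stay set in XCR0
XCR0_SSE = 1 << 1       # XMM register state
XCR0_AVX = 1 << 2       # upper halves of the YMM registers


def scrub_avx_state(cr4: int, xcr0: int) -> Tuple[int, int]:
    """Return the (cr4, xcr0) values a stub would want before chaining on.

    This models only the arithmetic. On real hardware XCR0 has to be
    rewritten with XSETBV while CR4.OSXSAVE is still set, because XSETBV
    faults once that bit is clear.
    """
    if cr4 & CR4_OSXSAVE:
        # Put XCR0 back to its reset value: only x87 state enabled.
        xcr0 = XCR0_X87
    # Then drop OSXSAVE so CPUID stops reporting OS support for XSAVE.
    cr4 &= ~CR4_OSXSAVE
    return cr4, xcr0
```

The ordering in the docstring is the part that is easy to get wrong: clear XCR0 first, CR4.OSXSAVE second.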


It's easier to just patch NTLDR. I've made changes to it before - it's very well (if unofficially) documented.


Hah. Well, I expect an appropriately patched version of NTLDR will surface pretty soon then - especially considering the signing issues are moot (see elsewhere in this thread).


Yeah, but good idea about the stub loader. That was actually my first train of thought, too.


Would this work with Secure Boot enabled? Doesn't everything have to be signed by Microsoft?


Ooooh. Good point - and I don't actually know. [EDIT: See comment below, XP doesn't do Secure Boot, the following is moot for this context.]

So... BOOTMGR (Win10) is chaining through to NTLDR to load WinXP. And either Win10 comes with a copy of NTLDR, or pokes around to find the one on the XP system.

By my reasoning, Secure Boot should say "okay" and happily start the machine when it decides BOOTMGR is okay, on the basis that BOOTMGR will verify whatever it loads. The question is whether BOOTMGR actually does that. If it doesn't, there isn't really a boot trust chain, so it probably does verify what it loads.

I fear this is something only Microsoft would be able to fix properly for users who want/need Secure Boot. Slightly ironic. But thinking about it, Secure Boot on XP is kind of like deadbolting your front door when your walls have completely disappeared (picture a door sitting in the middle of nowhere), because XP is officially EOL now.


Bootmgr won't chainload other bootloaders in UEFI mode, secure boot enabled or not, unfortunately.

Not even other signed and verified uefi bootloaders.


Ah, that answers that question then.


Secure Boot is UEFI only (it can't be enabled unless you disable CSM support), so if you want to boot XP you won't have access to Secure Boot anyway.


Ah, I see. I was getting Secure Boot confused with Trusted Boot, which was available in the XP era.


God I bet that was fun to debug!


I even used a checked build of NTLDR with the boot debugger to confirm.


BTW, the boot debugger in checked builds of NTLDR is harder to use than the one in BOOTMGR. You have to manually load symbols by reading the PE header into memory using a copy with the real mode stub removed, and you have to use F10 at the boot menu in order to break into the debugger.


As someone completely naive about Windows (and particularly low-level development), I'm casually/idly curious if the DDK includes the debugging tools you used, or if I have to throw money at people to get at them. (I know the DDK exists, but not much more than that.)

Also, how did you remove the real-mode stub? Hex editing, or is there a tool that will do that if fed the right combination of [obscure] options?

IMHO, the best way to finish this off would be an in-depth tutorial-style blog post. You definitely get my recommendation to consider that :)


Yes, the DDK used to include the needed debug version of NTLDR too. And yes, it is done by hex editing looking for the "MZ" signature and trimming everything before that. With BOOTMGR you just set some BCD options to enable the boot debugger.
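For anyone who would rather script the trimming than do it in a hex editor, here is a minimal sketch. It assumes the embedded image is an ordinary PE file, so besides looking for "MZ" it checks that the DOS header's e_lfanew field (offset 0x3C) points at a "PE\0\0" signature, which guards against a stray "MZ" byte pair inside the real-mode stub. The function name is hypothetical, and whether NTLDR's embedded image passes this extra check is an assumption, not something confirmed above.

```python
import struct


def strip_real_mode_stub(image: bytes) -> bytes:
    """Return the input starting at the first plausible MZ/PE image in it."""
    pos = image.find(b"MZ")
    while pos != -1:
        # The DOS header is 0x40 bytes; e_lfanew is a 32-bit little-endian
        # offset to the PE header, stored at 0x3C within the DOS header.
        if pos + 0x40 <= len(image):
            (e_lfanew,) = struct.unpack_from("<I", image, pos + 0x3C)
            pe = pos + e_lfanew
            if image[pe:pe + 4] == b"PE\0\0":
                return image[pos:]
        # Not a real header, keep scanning for the next "MZ".
        pos = image.find(b"MZ", pos + 1)
    raise ValueError("no MZ/PE image found")
```

Typical use would be to read the whole file as bytes, pass it through the function, and write the result to a new file for the debugger to load symbols from.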


Oh of course, the XP DDK.

And MZ-hunting makes perfect sense.

Thanks for the quick reply :) and kudos for figuring out it was the bootloader, hah


What does "checked build" mean here?


https://msdn.microsoft.com/en-us/windows/hardware/drivers/de...

A checked build basically has optimization off and assert() on.


Optimizations are still on in checked builds.


From the MSDN article I linked:

> Many compiler optimizations (such as stack frame elimination) are disabled in the checked build. This makes it easier to understand disassembled machine instructions, and therefore it is easier to trace the cause of problems in system software.


Umm... what exactly is happening?


If I'm understanding this right:

- Firefox is seeing a crash associated with use of the AVX vector-processing (~= high-performance math) instructions

- The AVX instructions use more registers, and registers need to be saved/restored during a context switch, so you can switch back to a program and have it be transparent. So you can only use AVX if the OS supports it, and promises to save/restore those registers in addition to regular registers when it does a context switch. The OS reports to the CPU "Yes, it's okay to let people use AVX" by setting a bit in a control register. Applications check that bit before using AVX instructions.

- The crash is an illegal-operation exception, which should be impossible because Firefox checks to see if that bit is set before using those instructions.

The answer to the mystery: Some people are using an old version of Windows, that does not support AVX, but with a bootloader from a new version of Windows. For whatever reason, the bootloader sets the "Yeah, AVX is fine" bit, and expects the new version of Windows to detect AVX and set support as appropriate. Old versions of Windows don't know about that bit, though, and never clear it. So Firefox proceeds to use AVX on a CPU that has no AVX support.

This was discovered by someone mentioning that they were dual-booting Windows versions, and that the crash went away when restoring the older bootloader.
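The check described in the bullets above can be sketched like this. It is a Python model that takes the register values as plain integers, not Firefox's actual code: cpuid1_ecx stands for ECX after CPUID leaf 1 (bit 27 OSXSAVE, bit 28 AVX) and xcr0 for the result of XGETBV with ECX=0. The function and constant names are invented for illustration.

```python
CPUID1_ECX_OSXSAVE = 1 << 27  # mirrors CR4.OSXSAVE, i.e. set by the OS
CPUID1_ECX_AVX = 1 << 28      # the CPU itself implements AVX
XCR0_SSE = 1 << 1             # OS saves/restores XMM state
XCR0_AVX = 1 << 2             # OS saves/restores upper YMM state


def avx_usable(cpuid1_ecx: int, xcr0: int) -> bool:
    """Decide whether AVX looks safe to use from user mode.

    xcr0 is only meaningful when OSXSAVE is set, since XGETBV is not
    legal to execute otherwise; real code must test OSXSAVE first.
    """
    if not cpuid1_ecx & CPUID1_ECX_AVX:
        return False  # hardware has no AVX at all
    if not cpuid1_ecx & CPUID1_ECX_OSXSAVE:
        return False  # OS never enabled XSAVE, so no YMM context switching
    wanted = XCR0_SSE | XCR0_AVX
    return (xcr0 & wanted) == wanted
```

If a loader leaves OSXSAVE and the XCR0 AVX bits set, a check of this shape returns True even though the OS that boots afterwards never opted in, which is the situation described in this thread.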


"So Firefox proceeds to use AVX on a CPU that has no AVX support". To make it clear; it uses AVX on a CPU that does support it (otherwise you'd run into an illegal instruction error), but the OS doesn't. Firefox doesn't know this, however, because it thinks the OS set the bit that says it does.


Oh, does VZEROUPPER generate an illegal instruction if the OS has set the control register bit but has not called XSETBV? OK, that makes more sense.


The newer BOOTMGR chains to the older NTLDR, which in turn loads XP/Server 2003 or older, and they forgot to clear the bits when chaining.


It seems a bit odd that they'd be enabling AVX in the bootloader. Any idea why they do that? Seems like something for the kernel to do during initialization (like most other features), not the bootloader.


Does the bootloader choose a kernel entry point based on the CPU features?


My understanding is that BOOTMGR goes to WINLOAD when booting Vista or later. The routine that ensures the correct CR4/XCR0 values in this case is called bootmgr!ArchRestoreProcessorFeatures and it is called just before bootmgr!BlBdStop in bootmgr!ArchExecuteTransition.


I'm sure it would be helpful for those who might attempt to reiterate the submission if you described as much as you think you understand to provide some context.


All right. I didn't understand a thing. Hope this helps.


Are you presently multibooting or planning to in the future?

If so, do you use a Windows 10 (NT6) boot menu to select from other operating systems which include Windows XP (or presumably other versions of NT5)?

Then the defective Microsoft engineering described applies to your use of XP after it is booted from the W10 menu.

Looks like it may also apply to other Windows versions or operating systems like some Linux distributions, so beware.

This bug was reported by user Brindusa Tot during operation of Mozilla Firefox. 2015-11-16

Confirmed with difficulty by engineer Yuhong Bao as a defect in the Windows 10 BOOTMGR routine. 2017-01-27

yuhong is on this thread, this is the kind of engineer I would want on my team.

Further testing needs to be done to determine if there is an alternative BOOTMGR or procedure which does not suffer from this insidious failure.

Mozilla is not a source of this defect at all.

It's simply the Microsoft Windows 10 code that's failing to properly do what it says it does.


So ... is there a way to report it to Microsoft and will it get fixed?


XP support ended years ago.


What is the Win 10 bootmgr code for XP loading counted as?


Lots of bootloaders (I would say most I have used) allow chainloading another bootloader. That is what is going on here. Pointing bootmgr at another partition and saying "boot that".


Backwards compatibility!


Based on what the problem description here is (failing to clear some bits in a register), patching it yourself doesn't seem too difficult. Probably less than two dozen bytes to change.



