*such as system calls* This sounds like something the CPU hardware should be han...

johncolanduoni · on Oct 2, 2016

There are some problems with that that have arisen due to the long disuse of the other privilege levels:

1. The fast methods for system calls (syscall/sysret/sysenter/sysleave) completely ignore these privilege levels and can only perform transitions between 0 and 3. That means you have to use interrupts which are slow, and may be even slower than 0/3 interrupt transitions because the processors aren't used to dealing with them.

2. You can't make much use of them for x86_64 programs, since these disable segment based protection and the x86_64 page tables (you guessed it) only have a single bit to select privilege level of a page. Somebody that remembers the Intel manuals better can hopefully inform us if you can use them in x86 compatibility mode under a 64-bit kernel, but I'm going to guess you'll have some wrinkles here.

I would be very surprised if these two issues don't kill any performance gains you would get from avoiding the recompilation step.

bonzini · on Oct 2, 2016

All x86 page tables have a single bit for page tables, not just 64-bit ones.

comex · on Oct 2, 2016

I agree with what you seem to be getting at, that Native Client and VX32 are essentially a hack: but to do it right, you don't actually need hardware support, only kernel support. After all, user processes - on all common architectures, not just x86 - are already fully isolated from the rest of the system; their only methods of communication are system calls and other triggerable exceptions (e.g. segfault), and the kernel controls the handlers for all exceptions. In theory, all you need to run untrusted code safely, even portably (across OSes if not CPU architectures), is a kernel API to run a process without direct access to the kernel's syscall layer - e.g. the kernel could forward all attempts to invoke syscalls to a configurable handler. In fact, this sort of exists already in the form of seccomp on Linux.

One drawback to a fully hardware based approach is that you can only trap instructions the hardware lets you trap. On x86, for example, CPUID is not in that category (for normal ring 3; see below about VMX), so you can't prevent the untrusted code from learning about what kind of CPU it's running on. Nor is CLFLUSH, an instruction to flush memory from cache to RAM, which is not supposed to be dangerous - but turned out to make it a lot easier to exploit the rowhammer bug on vulnerable systems. Native Client originally allowed CLFLUSH, but was updated to block it once the vulnerability was revealed. (That said, CLFLUSH is/was not strictly necessary to exploit rowhammer, and in fact someone wrote an exploit that worked from a JavaScript VM; the only way to fully prevent it is to fix the RAM refresh rate.)

By the way, there is also VMX, hardware virtualization, which both Linux and macOS (but not Windows AFAIK) allow unprivileged processes to use at will. While traditionally used to run full operating systems, which in theory should be safe too but requires exposing a relatively large amount of hardware surface area to the guest - there's nothing preventing you from having your own mini kernel and running untrusted code in ring 3 inside the VM. This provides multiple advantages: VMX allows trapping CPUID, faults from ring 3 can be handled by the mini kernel without a context switch, more control over various bits of the execution environment, etc. Too bad it's often disabled fully and/or not supported if you happen to already be inside a VM (because nested VMX, while possible, requires some software emulation and incurs a speed penalty)...

geofft · on Oct 2, 2016

> both Linux and macOS (but not Windows AFAIK) allow unprivileged processes to use at will

Only sort of true on Linux. Most Linux distributions make /dev/kvm non-world-accessible because it's a good source of security issues (see e.g. http://www.ubuntu.com/usn/USN-2417-1/ ); the KVM driver isn't quite hardened against people who are trying to compromise the host kernel instead of actually make a VM. PolicyKit often gives access to the current logged-in desktop user, but that's precisely because those processes aren't quite unprivileged (e.g., processes running as the logged-in desktop user can usually shut down the machine or prevent it from sleeping without authentication).

Which leads to an interesting point: doing this in software, as Vx32 and Native Client do, fails safe. Since it's just regular user code, it can't possibly do things that regular user code can't do, and you can belt-and-suspenders it with an extremely tight seccomp policy on the emulator (as Chrome does). If you do this at the OS level via seccomp directly, and the OS gets it wrong, it fails open (e.g., CVE-2009-0835), but still shouldn't allow execution of non-user-mode code. If you do this at the CPU level via privilege rings, the CPU isn't very likely to get it wrong -- but if it does it fails very open (i.e., into a privileged ring) and is the hardest to patch.

bonzini · on Oct 2, 2016

It's absolutely not true that we don't care about KVM host vulnerabilities. KVM survived a good deal of fuzzing with only a handful of trivially fixed NULL-pointer dereference oopses found (including one which turned out to be a bug in a completely different part of the kernel) and no privilege escalations.

Most distros actually make /dev/kvm world-accessible; you are confusing that with virt-manager requiring PolicyKit authentication by default (that's because networking is better integrated if libvirtd runs as root), but for example GNOME Boxes doesn't

geofft · on Oct 2, 2016

That must be a relatively recent change, then - several years ago I think everyone did care quite as much, there just were more vulnerabilities. I'm not insinuating this is because of anyone not caring. :-) It was just a fact that there were a bunch of CVEs.

For instance, Debian stable makes it 664 root:kvm, and a bug I opened a bunch of years ago to change that got wontfixed: https://bugs.debian.org/640328 Is it time to request reconsideration?

bonzini · on Oct 4, 2016

Perhaps it is, but mjt (the Debian maintainer) is pretty stubborn...

bonzini · on Oct 2, 2016

It's possible to do that with KVM, running the untrusted guest as a user-mode program in a guest and trapping system calls into the hypervisor. The cost of a system call would be about 6000 clock cycles.