> proxy PCI passthrough driver thing

Do you mean vCS [1], which is already integrated and licensed by KVM/RedHat/Nutanix, Xen/Citrix and VMware?

It's distinct from SR-IOV, PCI passthrough, vGPU-for-VDI, and MIG.

[1] https://blogs.nvidia.com/blog/virtualcomputeserver-expands-v...



I assume anything we did to make MIG work for us would have been custom.


Going back to the blog post:

> Alternatively, we could have used a conventional hypervisor. Nvidia suggested VMware (heh). But they could have gotten things working had we used QEMU. We like QEMU fine, and could have talked ourselves into a security story for it, but the whole point of Fly Machines is that they take milliseconds to start.

Someone could implement virtio-cuda (there are PoCs on github [1] [2]), but it would be a huge maintenance burden. It should really be done by Nvidia, in lockstep with CUDA extensions.
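
To make the API-remoting idea concrete, here is a rough C sketch of what the guest side of such a driver might forward for a single CUDA call. The opcodes and record layout are invented for illustration; this is not qCUDA's actual protocol:

  /* Hypothetical request record for a forwarded CUDA runtime call.
   * The guest driver would chain this into a virtqueue; the host
   * backend would replay it against the real libcudart and write
   * dev_ptr/result back before completing the descriptor. */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  enum vcuda_op {                 /* invented opcodes */
      VCUDA_MALLOC = 1,
      VCUDA_MEMCPY_H2D,
      VCUDA_LAUNCH_KERNEL,
      VCUDA_FREE,
  };

  struct vcuda_req {              /* one record per forwarded call */
      uint32_t op;                /* enum vcuda_op */
      uint64_t dev_ptr;           /* device pointer, assigned by the host */
      uint64_t size;              /* allocation / copy size in bytes */
      uint64_t guest_paddr;       /* guest-physical address of any payload */
      int32_t  result;            /* cudaError_t equivalent, filled by host */
  };

  /* Guest side: build the request that would go onto the virtqueue. */
  static void vcuda_prepare_malloc(struct vcuda_req *req, uint64_t size)
  {
      memset(req, 0, sizeof(*req));
      req->op = VCUDA_MALLOC;
      req->size = size;
  }

  int main(void)
  {
      struct vcuda_req req;
      vcuda_prepare_malloc(&req, 1 << 20);   /* ask the host for 1 MiB on the GPU */
      printf("op=%u size=%llu\n", (unsigned)req.op, (unsigned long long)req.size);
      return 0;
  }

The plumbing isn't the hard part; keeping a scheme like this current with every new CUDA release is, which is why it really belongs with Nvidia.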

Nvidia vCS makes use of licensed GPGPU emulation code in the VM device model, which is QEMU in the case of KVM and Xen. Cloud Hypervisor doesn't use QEMU; it has its own device model, written in Rust: https://github.com/cloud-hypervisor/cloud-hypervisor/blob/ma...

So the question is: how to reuse Nvidia's proprietary GPGPU emulation code from QEMU with Cloud Hypervisor? C and Rust are not friends. Could a Firecracker or Cloud Hypervisor VM use QEMU only for GPGPU emulation, alongside the existing device model, without impacting millisecond launch times? Could an emulated vGPGPU be hotplugged after VM launch?

There has been some design/PoC work for QEMU disaggregation [3][4] of emulation functions into separate processes. It might be possible to apply similar techniques so that Cloud Hypervisor's device model (in Rust) process could run alongside a QEMU GPGPU emulator (in C) process, with some coordination by KVM.
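
As a rough illustration of what that disaggregation buys you (the message layout here is invented; the real mpqemu/vfio-user protocols are richer), the VMM keeps its own device model and forwards MMIO accesses for one device to an out-of-process emulator over a socket:

  /* Toy out-of-process device emulator: the VMM forwards guest MMIO
   * accesses for one PCI device over a UNIX socket and the remote
   * process emulates a small register file. */
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/socket.h>
  #include <unistd.h>

  struct mmio_msg {               /* invented wire message */
      uint64_t addr;              /* offset within the device BAR */
      uint32_t size;              /* access width in bytes */
      uint32_t is_write;          /* 1 = write, 0 = read */
      uint64_t data;              /* payload on write, reply on read */
  };

  /* Device process: serve MMIO requests until the VMM disconnects. */
  static void serve_device(int fd)
  {
      struct mmio_msg msg;
      uint64_t regs[64] = { 0 };  /* toy register file */

      while (read(fd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg)) {
          uint64_t idx = (msg.addr / 8) % 64;
          if (msg.is_write)
              regs[idx] = msg.data;          /* emulate the write */
          else
              msg.data = regs[idx];          /* emulate the read */
          if (write(fd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
              break;
      }
  }

  /* VMM side: send one request and wait for the reply. */
  static void xfer(int fd, struct mmio_msg *msg)
  {
      if (write(fd, msg, sizeof(*msg)) != (ssize_t)sizeof(*msg) ||
          read(fd, msg, sizeof(*msg)) != (ssize_t)sizeof(*msg))
          perror("xfer");
  }

  int main(void)
  {
      int sv[2];
      if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
          return 1;
      if (fork() == 0) {                     /* child: device process */
          close(sv[0]);
          serve_device(sv[1]);
          _exit(0);
      }
      close(sv[1]);
      struct mmio_msg w = { .addr = 0x10, .size = 8, .is_write = 1, .data = 0xabcd };
      xfer(sv[0], &w);                       /* guest write */
      struct mmio_msg r = { .addr = 0x10, .size = 8, .is_write = 0 };
      xfer(sv[0], &r);                       /* guest read */
      printf("read back 0x%llx\n", (unsigned long long)r.data);
      return 0;
  }

The interesting work is in sharing guest memory and interrupt delivery across the two processes and deciding who talks to KVM, which is exactly what the multi-process QEMU effort is about.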

If this approach is feasible, the architecture and code changes should be broadly useful to upstream for long-term support and maintenance, rather than custom to Fly. The custom code would be the GPGPU emulator, which is already maintained by Nvidia and running within QEMU on RedHat, Nutanix, etc.

It would also advance the state of the art in security isolation and access control of emulated devices used by VMs.

[1] https://github.com/coldfunction/qCUDA

[2] https://github.com/juniorprincewang/virtio-cuda-module

[3] https://www.qemu.org/docs/master/devel/multi-process.html

[4] https://wiki.qemu.org/Features/MultiProcessQEMU


> Someone could implement virtio-cuda (there are PoCs on github [1] [2])

Wouldn't any company (let alone Fly) doing this run afoul of Nvidia's Enterprise T&Cs?

> how to reuse Nvidia's proprietary GPGPU emulation code from QEMU

If it has been contributed to QEMU, wouldn't it be GPL/LGPL?

> Could an emulated vGPGPU be hotplugged after VM launch?

gVisor instead bounces ioctls back and forth between "guest" and host. Sounds like a nice, lightweight (even if limited & sandbox-busting) approach, too. Unsure if it mitigates the need for the "licensing dance" tptacek mentioned above, but I reckon the security posture of such a setup is unacceptable for Fly.

https://gvisor.dev/docs/user_guide/gpu/
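
For a sense of what that proxying amounts to, here's a toy C sketch: intercept the application's ioctl, check it against an allowlist, and re-issue it on a host fd. The ioctl numbers are placeholders rather than the real NV_ESC_* values, and /dev/null stands in for /dev/nvidiactl so the sketch runs anywhere:

  /* Toy ioctl proxy in the spirit of gVisor's nvproxy: only
   * allowlisted requests are re-issued against the host device fd,
   * everything else is rejected. */
  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/ioctl.h>

  #define FAKE_NV_ESC_CARD_INFO  0xc020462aUL   /* placeholder value */
  #define FAKE_NV_ESC_ALLOC      0xc020462bUL   /* placeholder value */

  static const unsigned long allowed[] = {
      FAKE_NV_ESC_CARD_INFO,
      FAKE_NV_ESC_ALLOC,
  };

  /* Forward an application ioctl to the host fd only if it is allowlisted. */
  static int proxy_ioctl(int host_fd, unsigned long req, void *argp)
  {
      for (size_t i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++)
          if (allowed[i] == req)
              return ioctl(host_fd, req, argp);  /* re-issue on the host */
      errno = EPERM;                             /* everything else is denied */
      return -1;
  }

  int main(void)
  {
      int fd = open("/dev/null", O_RDWR);        /* stand-in for /dev/nvidiactl */
      if (fd < 0)
          return 1;
      if (proxy_ioctl(fd, 0xdeadbeefUL, NULL) < 0 && errno == EPERM)
          printf("unknown ioctl rejected, as intended\n");
      return 0;
  }

Most of the real work in nvproxy is understanding each ioctl's parameter structs well enough to copy them in and out safely, which is where both the maintenance burden and the attack surface live.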

> would also advance the state of the art in security isolation and access control of emulated devices used by VMs

I hope I'm not talking to DeepSeek / DeepResearch (:


> Wouldn't any company (let alone Fly) doing this run afoul of Nvidia's Enterprise T&Cs?

Good question for a lawyer. All the more reason (beyond maintenance cost) that it would best be done by Nvidia. The qCUDA paper has a couple dozen references on API-remoting research: https://www.cs.nthu.edu.tw/~ychung/conference/2019-CloudCom....

> If it has been contributed to QEMU, wouldn't it be GPL/LGPL?

Not contributed, but integrated with QEMU by commercial licensees. Since the GPGPU emulation code isn't public, presumably it's a binary blob.

> I hope I'm not talking to DeepSeek / DeepResearch (:

Will take that as a compliment :) Not yet tried DS/DR.


NVIDIA support is not special as far as QEMU is concerned—the special parts are all in their proprietary device driver, and they talk to QEMU via the VFIO infrastructure for userspace drivers. They just reimplemented the same thing in Cloud Hypervisor.
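
For anyone who hasn't looked at that path: the VFIO side is plain kernel UAPI, and a userspace driver or VMM starts with roughly this (container setup only; the group and device steps are system-specific and left as comments):

  /* Minimal start of a VFIO userspace driver: open the container,
   * check the API version, and confirm the Type1 IOMMU backend. */
  #include <fcntl.h>
  #include <linux/vfio.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  int main(void)
  {
      int container = open("/dev/vfio/vfio", O_RDWR);
      if (container < 0) {
          perror("open /dev/vfio/vfio");
          return 1;
      }
      if (ioctl(container, VFIO_GET_API_VERSION) != VFIO_API_VERSION) {
          fprintf(stderr, "unexpected VFIO API version\n");
          return 1;
      }
      if (!ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) {
          fprintf(stderr, "Type1 IOMMU not supported\n");
          return 1;
      }
      printf("VFIO container ready\n");

      /* A real VMM would now open /dev/vfio/<group>, add it to the
       * container (VFIO_GROUP_SET_CONTAINER), select the IOMMU model
       * (VFIO_SET_IOMMU), map guest RAM (VFIO_IOMMU_MAP_DMA), and get
       * a device fd (VFIO_GROUP_GET_DEVICE_FD) whose regions it maps
       * into the guest. NVIDIA's vGPU manager plugs in underneath this
       * via the kernel's mediated-device (mdev) framework. */
      close(container);
      return 0;
  }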

Red Hat for one doesn't ship any functionality that isn't available upstream, much less proprietary, and they have large customers using virtual GPU.


> NVIDIA... just reimplemented the same thing in Cloud Hypervisor.

Was that recent, with MIG support for GPGPU partitioning? Is there a public mailing list thread or patch series for that work?

Nvidia has a 90-page deployment doc on vCS ("virtual compute server") for RedHat KVM, https://images.nvidia.com/content/Solutions/data-center/depl...


Not NVIDIA; fly.io reimplemented the parts that CH didn't already have. I know that CH is developed on GitHub but I don't know whether the changes are public or in-house.

That said, the slowness of QEMU's startup is routinely exaggerated. Whatever they did with CH, they could have done with QEMU.



