
Cool post. Have you looked at slicing a single GPU up for multiple VMs? Is there anything other than MIG that you have come across to partition SMs and memory bandwidth within a single GPU?


Last I checked, MIG was the only one that made hard promises, especially about memory bandwidth; as long as your memory access patterns aren't secret and you have enough trust in the other guests not to be highly unfriendly with their cache usage behavior, you should be able to get away with much less strict isolation. Think Docker vs. VMs with dedicated cores.

But I thought MIG did do the job of chopping a GPU that's too big for most individual users into something that behaves very close to a literal array of smaller GPUs stuffed into the same PCIe card form factor? Think of how a Tesla K80 was pretty much just two GK210 "GPUs" on a PLX "PCIe switch" that connects them to each other and to the host. It was obviously trivial to give one to each of two VMs (at least if the PLX didn't interfere with IOMMU separation or such); for mere performance isolation it certainly sufficed, once you blocked a heavy user from power-budget throttling its sibling.


Can you pass a MIG device into a KVM VM? The team we worked with didn't believe it was possible (they suggested we switch to VMware); the MIG system interface gives you a UUID, not a PCI BDF.
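
For context, this is roughly what the interface hands you (a minimal NVML sketch, untested here, assuming MIG mode is already enabled on GPU 0): enumerating MIG devices only ever gives you UUIDs, with no BDF in sight.

    #include <stdio.h>
    #include <nvml.h>   /* link with -lnvidia-ml */

    int main(void) {
        if (nvmlInit() != NVML_SUCCESS) return 1;

        nvmlDevice_t gpu;
        if (nvmlDeviceGetHandleByIndex(0, &gpu) != NVML_SUCCESS) return 1;

        unsigned int cur = 0, pending = 0;
        /* Check whether MIG mode is enabled on the parent GPU. */
        if (nvmlDeviceGetMigMode(gpu, &cur, &pending) == NVML_SUCCESS
            && cur == NVML_DEVICE_MIG_ENABLE) {
            unsigned int max = 0;
            nvmlDeviceGetMaxMigDeviceCount(gpu, &max);
            for (unsigned int i = 0; i < max; i++) {
                nvmlDevice_t mig;
                if (nvmlDeviceGetMigDeviceHandleByIndex(gpu, i, &mig) != NVML_SUCCESS)
                    continue;
                char uuid[NVML_DEVICE_UUID_BUFFER_SIZE];
                /* A MIG device is identified by a UUID (MIG-...), not a PCI BDF. */
                if (nvmlDeviceGetUUID(mig, uuid, sizeof uuid) == NVML_SUCCESS)
                    printf("MIG device %u: %s\n", i, uuid);
            }
        }
        nvmlShutdown();
        return 0;
    }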


KubeVirt has some examples of passing a vGPU into KVM:

https://kubevirt.io/user-guide/compute/host-devices/


Right, vGPUs are explicitly set up to generate BDF addresses that can be passed through (but require host driver support; they're essentially paravirtualized). I'm asking about MIG.


https://docs.nvidia.com/datacenter/tesla/mig-user-guide/supp... says GPU passthrough is supported on MIG...


There's a MIG vGPU mode usable for this


Have you used it? How does it work? How do you drive it? We tried a lot of different things. Is it not paravirtualized, the way vGPUs are?


It works with SR-IOV instead of mdev, AFAIK.

Still needs some host software to drive it, but it actually does static partitioning.

IIRC it's usable via the MIG-marked vGPU types.


Thanks! I haven't looked deeply into slicing up a single GPU. My understanding is that vGPU (which we briefly mention in the post) can partition memory but time-shares compute, while MIG is the only mechanism that provides partitioning of both SMs and memory bandwidth within a single GPU.


This seems very interesting. Timely, given that Yann LeCun's vision also seems to align with world models being the next frontier: https://news.ycombinator.com/item?id=45897271


An established founder claims X is the new frontier. X receives hundreds of millions in funding. Other, less established founders claim they are working on X too. VCs suffering from terminal FOMO pump billions more into X. X becomes the next frontier. The previous frontiers are promptly forgotten.


I think the terminology here is a bit confusing: this seems more graphics-focused, while I suspect that a 10-year plan as mentioned by YLC probably revolves around re-architecting AI systems to be less reliant on LLM-style nets/refinements and to better understand the world in a way that isn't as prone to hallucinations.


What’s it going to do, take away funds from the otherwise extremely prudent AI sector?


Right now it seems legal to scrape Reddit. But given their trajectory of making the API fairly expensive to use, do you think it's likely that they would also limit/prohibit scraping (assuming apps like Apollo start scraping as an alternative)?


My understanding is that scraping of public websites is generally legal, isn't it?


Legal, but probably against ToS


That ToS is meaningless if you scrape logged out.


Even if that is the case, it does not say much about their ability to evaluate what category of users (e.g., those coming from Apollo or their first-party clients) is generating more views/interactions and indirectly more "value" on their platform.


> You can't load one so/dll multiple times in some sort of container

I believe you can do that with `dlmopen` and separate link maps. Using that approach, I have worked with multiple completely isolated Python interpreters in the same process that do not share a GIL.
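
The basic shape is something like this (a minimal sketch; `libexample.so` and its `bump_counter` are hypothetical stand-ins for whatever library you want multiple copies of):

    #define _GNU_SOURCE
    #include <link.h>
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void) {
        /* Load the same shared object into two fresh link-map namespaces.
           Each copy gets its own globals, relocations, and dependencies. */
        void *a = dlmopen(LM_ID_NEWLM, "libexample.so", RTLD_NOW | RTLD_LOCAL);
        void *b = dlmopen(LM_ID_NEWLM, "libexample.so", RTLD_NOW | RTLD_LOCAL);
        if (!a || !b) {
            fprintf(stderr, "dlmopen: %s\n", dlerror());
            return 1;
        }

        /* The two handles resolve to independent copies of the same symbol. */
        void (*fa)(void) = (void (*)(void))dlsym(a, "bump_counter");
        void (*fb)(void) = (void (*)(void))dlsym(b, "bump_counter");
        if (fa && fb) { fa(); fa(); fb(); /* per-copy counters diverge: 2 vs 1 */ }

        Lmid_t la, lb;
        dlinfo(a, RTLD_DI_LMID, &la);
        dlinfo(b, RTLD_DI_LMID, &lb);
        printf("namespaces: %ld vs %ld\n", (long)la, (long)lb);
        return 0;
    }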


Thank you for the hint about dlmopen! I had a problem that can be solved by loading multiple copies of a DLL, and it looks like reading manpages of the dynamic linker would have been a better approach than googling with the wrong keywords.


That's great!

There are a few cases where `dlmopen` has issues; for example, some libraries are written with the assumption that there will only be one copy of them in the process (in their use of globals, thread-local variables, etc.), which may result in conflicts across namespaces.

Specifically, `libpthread` has one such issue [1] where `pthread_key_create` will create duplicate keys in separate namespaces. These keys are later used to index into `THREAD_SELF->specific_1stblock`, which is shared between all namespaces, and that can cause all sorts of weird issues.

There is a (relatively old, unmerged) patch to glibc where you can specify some libraries to be shared across namespaces [2].

[1]: https://sourceware.org/bugzilla/show_bug.cgi?id=24776#c13

[2]: https://patchwork.ozlabs.org/project/glibc/patch/20211010163...


IIRC glibc is limited to 16 namespaces though.


Currently it is, yes. I am not sure how fundamental it is. I tried patching glibc to support more (128 in my case) and it seemed to work fine.


This is a great analysis; thanks for writing.

I have also been working on running multiple Python interpreters in the same process by isolating them in different namespaces using `dlmopen` [1]. At a high level, the objective is to receive requests for compute-intensive operations from a TCP/HTTP server and dispatch them to different workers. In this case, a thin C++ shim receives the requests and dispatches each one to a Python interpreter in its own namespace. This eliminates contention for the GIL among the interpreters and exploits parallelism by running each interpreter on a different set of cores. The data from a request does not need to be copied into the interpreter because everything is in the same address space; similarly, the output produced by the Python interpreter is passed back to the server without any copies. A simplified sketch of the loading side is below, after the link.

[1] https://www.man7.org/linux/man-pages/man3/dlmopen.3.html
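
To make that concrete, here is a stripped-down sketch of the loading side (hypothetical and simplified: the libpython soname is an assumption, real code needs error handling, and you have to watch out for the glibc namespace limit and the `libpthread` issue discussed elsewhere in this thread):

    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NWORKERS 4

    /* Each worker thread drives a fully separate copy of libpython,
       loaded into its own link-map namespace, so there is no shared GIL. */
    static void *worker(void *arg) {
        const char *script = arg;
        void *py = dlmopen(LM_ID_NEWLM, "libpython3.11.so.1.0",
                           RTLD_NOW | RTLD_LOCAL);
        if (!py) { fprintf(stderr, "%s\n", dlerror()); return NULL; }

        void (*init)(void) = (void (*)(void))dlsym(py, "Py_Initialize");
        int (*run)(const char *) =
            (int (*)(const char *))dlsym(py, "PyRun_SimpleString");
        void (*fini)(void) = (void (*)(void))dlsym(py, "Py_Finalize");
        if (!init || !run || !fini) return NULL;

        init();
        run(script);   /* runs under this namespace's private GIL */
        fini();
        return NULL;
    }

    int main(void) {
        pthread_t t[NWORKERS];
        for (int i = 0; i < NWORKERS; i++)
            pthread_create(&t[i], NULL, worker,
                           "print(sum(x * x for x in range(10**6)))");
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }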


Can you please share pointers to some of the podcasts etc. that you listened to? I'm looking for something similar for people who aren't bio experts.


This looks like a cool project. Is there any support (or any plan to support) I/O through kernel-bypass technologies like RDMA? For example, the client could read objects via one-sided reads from the server, given that it knows the address where each object lives. This could be really beneficial for reducing latency and CPU load.
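
To sketch what I mean with libibverbs (just a fragment under big assumptions: an already-connected RC queue pair, memory registered on both sides, and the server having handed the client the remote address and rkey up front):

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Post a one-sided RDMA READ: pull `len` bytes from the server's
       memory at `remote_addr` (tagged with `rkey`) into a local buffer,
       without involving the server's CPU at all. */
    static int rdma_read_object(struct ibv_qp *qp,
                                void *local_buf, uint32_t lkey,
                                uint64_t remote_addr, uint32_t rkey,
                                uint32_t len) {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,
            .length = len,
            .lkey   = lkey,
        };
        struct ibv_send_wr wr, *bad = NULL;
        memset(&wr, 0, sizeof wr);
        wr.opcode     = IBV_WR_RDMA_READ;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_SIGNALED;   /* ask for a completion event */
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad); /* completion polled via ibv_poll_cq */
    }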


Similarly, I really liked the ideas in this paper on accelerating Memcached using an eBPF cache layer in the NIC interrupt path: https://www.usenix.org/conference/nsdi21/presentation/ghigof...


I do not know much about RDMA. Our goal is to provide a memory store that is fully compatible with Redis/Memcached protocols so that all the existing frameworks could work as before. I am not sure how RDMA fits this goal.


Cool product! Any plans to support Azure Functions?


Thanks! We do but it's a bit further down the roadmap.


Here is something that has helped me a lot. It's a paper on how to read academic papers. You can extract the general idea and apply it to a lot of other technical reading material.

https://web.stanford.edu/class/ee384m/Handouts/HowtoReadPape...

