I truly appreciate that this team uses nightly Rust and currently only runs on Linux (due to relying on io_uring). Truly in the spirit of systems programming - focusing on a single, tight, efficient implementation first and leaving other considerations such as cross-platform compatibility for later. Lets those of us who love to live on the bleeding edge have our nice things too! :)
Following that through to the blog post ([1]), it's interesting how much new API they had to add. The Windows IO API is already completion-style (as opposed to epoll, etc's readiness-style) - userspace submits an async operation to the kernel, blocks on a channel to receive its result, and the kernel enqueues the result to said channel when it's done. So I naively assumed that they'd "just" have to refit io_uring's API on top of the existing Win32 API.
io_uring is used for disk I/O as well, and Windows disk I/O completion has similar limitations to Linux aio for disks, i.e. anything that actually goes into filesystem code, like allocating new blocks, creating files, or anything involving metadata, still blocks. Readiness-oriented I/O of course doesn't work for disk I/O at all, because disks are always ready. And IOCPs still mean that for queuing new I/O you're going to call into the kernel at least once per operation.
why not prove like... "this is the best. all other platforms/circumstances/situations are subpar"
if the performance differences are enough, i can picture people making excuses to avoid all of those other platforms (or just never using this... more likely)
I'm honestly not. Desktop platforms like Windows are incredibly niche. How much rust software actually runs on those? Some, of course. But probably an incredible minority.
Supporting them is smart because people expect it, and not supporting it would be bad optics, but if a language legitimately only targeted Linux, and was better for it, I'd be fine with that - they target the most popular OS by far.
I am a Windows user, so all of the Rust software I use runs on it. And that’s virtually all Rust software. Sometimes you need a small patch or two because someone did something weird with path handling, but 99.99% of it Just Works.
So yeah, Linux is the most popular, but dropping effectively half of your users isn't always a great idea. Of course, some people will build OS-specific software in Rust, and that's great! But it is a tradeoff you're making.
I'm only saying it's a tradeoff. Obviously given Rust's origins and goals dropping desktop as a target would make no sense - it was designed explicitly for those platforms.
But if a language said "we're not going to support those" I wouldn't care at all, and a massive number of use cases - the majority, I think - would be solved with that language.
In terms of what you develop on, that's a whole other story. The majority, of course, are on Linux. But I'd be interested to know what platform they target.
> In terms of what you develop on, that's a whole other story. The majority, of course, are on Linux
I'm not sure that's as obvious as you're making it sound. The number of people who use Linux on the desktop is absolutely minuscule compared to the combined user base of Windows and MacOS. It's probably not as lopsided for developers, but I've never seen anything to imply that most developers in general are on Linux, and I'd honestly be surprised if that's the case given how much smaller the portion is in terms of people I know or have worked with, and that's as someone who does not personally own any laptop or desktop that runs anything other than Linux.
Sure, but only barely a majority. Nearly half of Rust users are on Windows or MacOS. Dropping support for those would be crazily irresponsible and would probably be a bigger programming scandal than even the Python 2->3 ordeal. And for what benefit? Making some low level libraries easier?
Yeah, obviously. For Rust to be a healthy language it would be stupid not to support Windows. But I wouldn't care at all, and the vast majority of code - not just the code written, but in terms of it being deployed to N systems - will be Linux.
With context switches becoming more and more expensive relative to faster and faster I/O devices - the costs are now almost the same order of magnitude - I believe that thread-per-core is where things are heading, because the alternative of not doing thread-per-core might literally mean halving your throughput.
That's also the most exciting thing about io_uring for me: how it enables a simple, single-threaded and yet highly performant thread-per-core control plane, outsourcing to the kernel thread pool for the async I/O data plane, instead of outsourcing to a user space thread pool as in the past. It's much more efficient and at the same time, much easier to reason about. There's no longer the need for multithreading to leak into the control plane.
My experience with io_uring has been mostly working on TigerBeetleDB [1], a new distributed database that can process a million financial transactions a second, and I find it's a whole new way of thinking... that you can now just submit I/O directly from the control plane without blocking and without the cost of a context switch. It really changes the kinds of designs you can achieve, especially in the storage space (e.g. things like LSM-tree compactions can become much more parallel and incremental, while also becoming much simpler, i.e. no longer any need to think of memory barriers). Fantastic also to now have a unified API for networking/storage.
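Not TigerBeetle's actual code, but a minimal sketch of what queuing a read from a single-threaded control plane looks like with the `io-uring` crate (the path, queue depth, and user_data value are just illustrative):

```rust
use io_uring::{opcode, types, IoUring};
use std::fs::File;
use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    // One ring owned by this thread: the control plane queues work here
    // without blocking, and the kernel completes it asynchronously.
    let mut ring = IoUring::new(256)?;

    let file = File::open("/etc/hostname")?; // illustrative path
    let mut buf = vec![0u8; 4096];

    // Build a read SQE and push it onto the submission queue (no syscall yet).
    let read_e =
        opcode::Read::new(types::Fd(file.as_raw_fd()), buf.as_mut_ptr(), buf.len() as _)
            .build()
            .user_data(0x42);
    unsafe {
        ring.submission().push(&read_e).expect("submission queue full");
    }

    // A single submit can flush many queued SQEs; here we also wait for one CQE.
    ring.submit_and_wait(1)?;

    let cqe = ring.completion().next().expect("completion queue empty");
    assert_eq!(cqe.user_data(), 0x42);
    let n = cqe.result(); // bytes read, or a negative errno
    println!("read {} bytes", n);
    Ok(())
}
```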
Totally. And these thread per core apps can talk directly to virtio or NVMe passed into the kvm guest, you can get the best of having a unix host but have applications that run directly under KVM w/o sacrificing a rich control plane. And control plane reliability doesn't impact the data plane. Wonderful times!
There are four articles in Chinese about the design and implementation on one of the author's blog. Here's a link to the first one: https://www.ihcblog.com/rust-runtime-design-1/. I don't know the subject and Chinese enough to know how much is lost by automatic translation (Google in my case), but it looks relatively good.
As an aside, Google translation integrated in Chrome breaks the formatting of the code blocks, which is surprising since they're in a <pre> block.
I've found in my own use of rust I want async/nonblocking for two things.
1. Be able to timeout a read of a socket
2. Be able to select over multiple sockets, and read whichever is ready first
Usually a combination of both.
epoll/io_uring (I guess? I only ever did research on epoll) seem like the solution being handed to me on a silver platter. However, my understanding is that if you want to use either of those you're meant to use async in Rust, and that while there are some libraries which provide interfaces for this kind of behavior outside of async, they're usually very ad-hoc, and the community is just very laser-focused on async as a language construct.
What I don't understand is why does Rust consider it necessary to introduce async, futures, runtimes, an executor, async versions of TcpListeners, Files, etc for this?
Why can't I just have a function in the standard library that takes a slice of std::net::TcpListeners and a timeout, blocks, and then gives me whichever is ready to read, when it's ready to read, or nothing if the timeout is reached? It's not like I was going to do anything else on that thread while I wait for a packet to be received; it can happily be parked.
Instead I have to select a runtime, and libraries compatible with the runtime, replace all the TcpListeners from the standard library I'm using with tokio TcpListeners or whatever, deal with API change, and now I have to deal with the "turtles all the way down" problem of async as well.
That's not even getting into the whole nightmare that a lot of really sick libraries, which I want to use in a blocking manner, now only provide async APIs, which means it's "async or the highway". I am very much not happy about this situation and I don't know what I can do about it. It seems the only response I ever get is "just use it, it's easy", but that is not at all convincing me. I don't want to use it!
You can absolutely do readiness-based IO. Either call the system-specific APIs or use the MIO library, which is a low-level platform abstraction for that: https://github.com/tokio-rs/mio
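For example, a minimal readiness loop with mio - block on one poll call with a timeout, no async anywhere (address, token, and timeout are illustrative, assuming mio 0.8's API):

```rust
use mio::net::TcpListener;
use mio::{Events, Interest, Poll, Token};
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let mut poll = Poll::new()?;
    let mut events = Events::with_capacity(128);

    let mut listener = TcpListener::bind("127.0.0.1:9000".parse().unwrap())?;
    poll.registry()
        .register(&mut listener, Token(0), Interest::READABLE)?;

    // Block until the listener is readable or the timeout elapses.
    poll.poll(&mut events, Some(Duration::from_secs(5)))?;

    if events.is_empty() {
        println!("timed out with nothing to read");
    }
    for event in events.iter() {
        if event.token() == Token(0) {
            let (_conn, addr) = listener.accept()?;
            println!("accepted connection from {addr}");
        }
    }
    Ok(())
}
```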
If you need to handle <100,000 sockets, you will probably be fine with a thread per socket. Call set_read_timeout to implement the deadline. Run load tests and adjust the socket limit.
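A rough sketch of that shape using nothing but the standard library (port, buffer size, and timeout are arbitrary):

```rust
use std::io::Read;
use std::net::TcpListener;
use std::thread;
use std::time::Duration;

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:9000")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        thread::spawn(move || {
            // Deadline for each read; a timed-out read returns an error
            // (WouldBlock or TimedOut depending on the platform).
            stream.set_read_timeout(Some(Duration::from_secs(5))).unwrap();
            let mut buf = [0u8; 1024];
            match stream.read(&mut buf) {
                Ok(0) => println!("peer closed the connection"),
                Ok(n) => println!("read {n} bytes"),
                Err(e) => println!("read failed or timed out: {e}"),
            }
        });
    }
    Ok(())
}
```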
Async lets one process handle millions of sockets.
If you want to handle millions of sockets with threads, you can use `mio` [0]. Mio's API has footguns that can cause UB. If your wire protocol is complicated, you may find yourself implementing something like async, but worse.
I wonder how much and for what Rust is being used at Bytedance, given that this seems to be under their GitHub org. It's pretty interesting that they are apparently using it.
Edit: For those wondering who Bytedance are, I can save you a Google search. These are the people making TikTok.
From what I read (in Chinese), this "Monoio" is meant to be used for the proxy/whatever part of their next-generation service mesh. So maybe a corp-wide thing.
Yes, it's an increasing trend, though much of it is way less well known outside China. There's lots of code from Alibaba, Tencent, and smaller companies, and many new CNCF projects are from China.
It's exciting to see another thread-per-core async runtime for Rust. It's truly understated how difficult the Send + Sync requirements in Tokio are for writing regular code. It's typically rare for async tasks in Tokio to be used across two threads simultaneously, but now all of your data must be Send+Sync.
Plus, Tokio is unique in that it's one of the very few runtimes in existence that's work-stealing (meaning the task can move off its original thread). Most other runtimes in other languages do not have that requirement, meaning you can use traditional Cell/RefCell/Rc instead of their slower, atomic variants.
Right now, the best you can do is write a thread pool that spawns a Tokio LocalSet on each thread to run !Send futures. In fact, this is what Actix-web does to achieve its crazy performance. Web requests finish so quickly you rarely need work-stealing to achieve good performance, and often the cost of atomics/stealing is greater than the performance gain.
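A minimal sketch of that pattern - a hand-rolled pool of current-thread runtimes, not Actix-web's actual code (worker count is arbitrary, assuming the tokio crate):

```rust
use std::rc::Rc;
use std::thread;
use tokio::runtime::Builder;
use tokio::task::LocalSet;

fn main() {
    // One OS thread per "core", each with its own single-threaded runtime and
    // LocalSet, so the futures it runs never have to be Send.
    let workers: Vec<_> = (0..4)
        .map(|id| {
            thread::spawn(move || {
                let rt = Builder::new_current_thread().build().unwrap();
                let local = LocalSet::new();
                rt.block_on(local.run_until(async move {
                    // Rc is !Send; that's fine because this task never
                    // leaves the thread it was spawned on.
                    let state = Rc::new(id);
                    tokio::task::spawn_local(async move {
                        println!("worker {state} handling requests");
                    })
                    .await
                    .unwrap();
                }));
            })
        })
        .collect();

    for w in workers {
        w.join().unwrap();
    }
}
```

Actix-web's own worker setup differs in detail, but the idea is the same: tasks never cross threads, so Rc and RefCell are fine.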
> It's truly understated how difficult the Send + Sync requirements in Tokio are for writing regular code. It's typically rare for async tasks in Tokio to be used across two threads simultaneously, but now all of your data must be Send+Sync.
> Plus, Tokio is unique in that it's one of the very few runtimes in existence that's work-stealing (meaning the task can move off its original thread). Most other runtimes in other languages do not have that requirement, meaning you can use traditional Cell/RefCell/Rc instead of their slower, atomic variants.
Most other languages don't have any concept of Send/Sync, or non-atomic Cell/RefCell/Rc. Rather than the compiler stopping you, you only find out about the issue when you hit it, and if you're lucky you can skate by a long, long while (or alternatively the language has no way to sync and everything's send).
Work-stealing runtimes may be more common than you think because of that: Erlang's BEAM, Go's scheduler(s), Java's fork/join, .NET's TPL, most if not all OpenMP implementations, Apple's GCD, ... all implement work-stealing in various measures.
Also... thread-per-core makes work-stealing even more necessary because the OS can't perform the balancing? The only ways to avoid work-stealing eventually becoming necessary (for a general-purpose scheduler) are either to have a completely single-threaded scheduler, or to not have an application-level scheduler at all and use OS threads.
> Besides Rust, the main frameworks which tried to move tasks between executors are Go and C#'s Threadpool executor (although I think the ASP.NET default executor might have fixed threads).
> Therefore the state of the world is actually more that Rust would need to prove that its approach of defaulting to thread-safe is a viable alternative, rather than questioning the effectiveness of single-threaded event loops. Their effectiveness in terms of reducing context switches and guaranteeing good cache hit rates was more or less what triggered people to move towards the model, despite the ergonomic challenges of using callbacks.
I've been thinking about how one could get around the Send + Sync requirements, it's a fascinating conundrum, from a language design standpoint.
If we didn't have the Send + Sync requirements, and multiple async functions are running concurrently on the same thread, and multiple of them locked the same RefCell, might that cause a panic?
> If we didn't have the Send + Sync requirements, and multiple async functions are running concurrently on the same thread, and multiple of them locked the same RefCell, might that cause a panic?
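For concreteness, a minimal sketch of that scenario (assuming tokio's current-thread runtime and LocalSet, which the question doesn't prescribe): two tasks on the same thread that hold a RefCell borrow across a yield point do hit a runtime panic instead of a compile error.

```rust
use std::cell::RefCell;
use std::rc::Rc;
use tokio::task::{spawn_local, yield_now, LocalSet};

// Whichever task is scheduled second calls borrow_mut() while the first one
// still holds its guard across the yield point, and panics at runtime.
async fn push_value(shared: Rc<RefCell<Vec<u32>>>, value: u32) {
    let mut guard = shared.borrow_mut(); // exclusive borrow taken...
    yield_now().await;                   // ...and held across an await
    guard.push(value);
}

fn main() {
    let rt = tokio::runtime::Builder::new_current_thread()
        .build()
        .unwrap();
    let local = LocalSet::new();
    rt.block_on(local.run_until(async {
        let shared = Rc::new(RefCell::new(Vec::new()));
        let a = spawn_local(push_value(shared.clone(), 1));
        let b = spawn_local(push_value(shared.clone(), 2));
        let (ra, rb) = (a.await, b.await);
        // One of the two JoinHandles reports a panic (BorrowMutError).
        println!("task a: {ra:?}, task b: {rb:?}");
    }));
}
```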
There's a ton of bleeding edge and novel work being done by Chinese companies and communities due to their insane scale requirements. I've been trying to learn the language to gain more insight and stay current with what they're doing.
Have any other Rust async runtimes used io_uring / gotten at all good yet?
Best of the best modern systems programmers gotta get good sometime. Not sure if it's happening yet. OK, here's one port of call: https://github.com/tokio-rs/tokio-uring
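For reference, tokio-uring's basic usage looks roughly like this (adapted from the project's documented example; the file name is illustrative):

```rust
use tokio_uring::fs::File;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // tokio_uring::start spins up a current-thread runtime driven by io_uring.
    tokio_uring::start(async {
        let file = File::open("hello.txt").await?;

        // Buffers are passed by ownership: the kernel owns the buffer while
        // the read is in flight, and we get it back alongside the result.
        let buf = vec![0u8; 4096];
        let (res, buf) = file.read_at(buf, 0).await;
        let n = res?;

        println!("read {} bytes: {:?}", n, &buf[..n]);
        Ok(())
    })
}
```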
Async models are idiomatic for high-performance server code regardless of the programming language, particularly for anything I/O or concurrency intensive. The reason people use thread-per-core software architectures, which naturally require an async model, is because they have much higher throughput than the alternatives.
If software performance, efficiency, and scalability are primary objectives, you are going to be writing a lot of async code in a systems language like Rust or C++. People that “know what they are doing” understand this and why. Hence the interest in async libraries for Rust.
The GP mentioned "async runtimes". There are other approaches to async that don't involve using an async runtime, like epoll / kqueue. I personally prefer writing synchronous code, running in multiple threads pulling from a shared work queue. It isn't a one-size-fits-all solution but it is widely applicable, and you get to avoid the complexity of writing 'async code'
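A minimal sketch of that shape with only the standard library (worker count is arbitrary; a natively multi-consumer channel such as crossbeam-channel would drop the Mutex):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Synchronous workers pulling jobs off a shared queue; no async runtime.
fn main() {
    let (tx, rx) = mpsc::channel::<String>();
    // mpsc's receiver is single-consumer, so share it behind a lock.
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..4)
        .map(|id| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                // The guard is a temporary dropped at the end of this
                // statement, so the lock isn't held while the job runs.
                let msg = rx.lock().unwrap().recv();
                match msg {
                    Ok(job) => println!("worker {id} handled {job}"),
                    Err(_) => break, // all senders dropped: shut down
                }
            })
        })
        .collect();

    for i in 0..10 {
        tx.send(format!("request-{i}")).unwrap();
    }
    drop(tx); // close the queue so workers exit

    for w in workers {
        w.join().unwrap();
    }
}
```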
can you take code that is blocking in nature and make it async (like an FFI dlsym call), or at a low level does that just boil down to lots of polling/timers?
Not sure what code you’re looking at, but in reality all code is asynchronous by nature (i.e. you ask the HW to do something and it tells you it’s done some time later). Then we layer blocking syscalls on top, and then you layer async underneath. io_uring is an attempt at getting everything asynchronous top to bottom.
Sometimes we use polling if there’s sufficient traffic and interrupts are inefficient, but that’s still asynchronous, just batched.
Last I saw there was no substantial evidence of io_uring delivering a significant improvement over epoll, but I haven't looked in the last year. I wouldn't assume it is good just because it is in Linux (this is coming from someone who uses Linux exclusively)
edit: I'm not sure why I'm unable to reply to the comment below but I don't think anyone is being unnecessarily combative. I linked it twice because it is relevant to both replies to my comment. There is useful information in that thread if you read it end to end. What it does demonstrate is io_uring isn't a clear cut improvement over a much simpler approach. That may change in the future as it is improved though.
I really don't see why you've chosen to link to that github issue multiple times in this thread. The discussion in that issue doesn't convincingly demonstrate anything except that some people are trying to be unnecessarily combative about io_uring. We don't need to be dragging that attitude into HN threads. If you want to talk about some real performance problems with io_uring, find a way to do so without that baggage.
I'm testing io_uring for file I/O for a database engine, which is a very different code path than networking with epoll, and so far I've been disappointed to find it roughly 20% slower than my userspace thread pool doing the same thing for that task.
Most likely there is some tweak to batching and maybe SQE submission watermarks, but I haven't found the formula yet.
I was surprised to find the sqpoll thread spins instead of using a fast timer. Seems like a contention problem on single-core systems, which still exist in VMs, etc.
I don't assume it's good because of Linux, I have a generally negative view of Linux. It's just stupid to attribute io_uring to rust or call it cargo culting for Rust to use io_uring. The two projects are unrelated.
As for performance, I wouldn't judge based on a year ago, obviously a lot has changed since then - you can find numbers if you'd like, I saw the maintainer posting benchmarks only a few weeks ago.
Yea, I'll have to check it out again. I see very mixed results; it's likely io_uring may not fully replace epoll for all situations as some people thought it would. That's fine though.
As I posted above I integrated Hyper (web server) with Glommio (io_uring based async runtime).
Based on my limited benchmarks against Vertx/Scala and Nginx it was significantly faster, had zero failed transactions and used a fraction of the memory/CPU.
That means I need fewer servers and have a better end-user experience.
Care to explain why they only make sense for interpreted languages and what limitations you are talking about? Async in Rust is mostly just convenient syntax around ideas which have been around in the C and C++ worlds for decades with libraries like libevent.
“Async” / event loop systems will usually be more efficient because 1) they don’t have to store unnecessary processor state to memory between handling events and 2) they are associated with cooperative event handlers and the assumption of a cooperative system provides more opportunities for optimization.
Ironically, it is not true that "async" (stackless) event loop systems don't store unnecessary processor state to memory.
Those systems store the entire processing state in future object memory. It is the same state that threaded systems store.
Stackful context switching systems (which includes some kinds of efficient threads - I would count Linux kernel internal threads among these) store that same state in the stack and context objects when context switching, so in principle store about the same amount of state. But some async, stackless systems do a bunch of extra work unwinding and restoring entire call stacks whenever an async task pauses in await, and therefore take more CPU time to context switch than stackful systems do.
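To make "the state lives in the future object" concrete, here's a hand-rolled sketch of roughly what a Rust async fn lowers to - the live locals at each suspension point become fields of the future's state enum (names are illustrative; futures::executor is only there to drive it):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Roughly what the compiler generates for something like
//     async fn add_later(a: u64, b: u64) -> u64 { yield_once().await; a + b }
// The "saved processor state" (a, b, and which step we're at) lives in the
// future object itself, not in a saved register file or a saved stack.
enum AddLater {
    Start { a: u64, b: u64 },
    Waited { a: u64, b: u64 },
    Done,
}

impl Future for AddLater {
    type Output = u64;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        match *self {
            AddLater::Start { a, b } => {
                // Suspend once: the "context switch" is just storing the
                // next state into the enum and returning to the executor.
                *self = AddLater::Waited { a, b };
                cx.waker().wake_by_ref(); // ask to be polled again
                Poll::Pending
            }
            AddLater::Waited { a, b } => {
                *self = AddLater::Done;
                Poll::Ready(a + b)
            }
            AddLater::Done => panic!("polled after completion"),
        }
    }
}

fn main() {
    // block_on is the event loop driving the state machine to completion.
    let result = futures::executor::block_on(AddLater::Start { a: 2, b: 3 });
    assert_eq!(result, 5);
}
```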
> It is the same state that threaded systems store.
Nope. Synchronous task switching systems additionally have to store CPU register state. In the cooperative case you need to store the caller-saved registers, in the pre-emptive case you need to store all the registers.
Async systems simply don’t have to do that extra work.
> But some async, stackless systems do a bunch of extra work unwinding and restoring entire call stacks whenever an async task pauses in await
Just the really bad ones. In any competent system a callback is stored ready to handle the event. No “unwinding” necessary.
A cooperative synchronous task-switching system (i.e. fiber-based) need only save the exact same information as an async-based one (i.e. stackless coroutines): at a minimum a context pointer and an instruction pointer, plus any live registers (which might be none).
You only need to save caller-saved registers if your context switching routine uses the conventional ABI, but that's not a requirement.
> You only need to save caller-saved registers if your context switching routine uses the conventional ABI, but that's not a requirement.
If you request a task switch from C or any high level language that has the concept of caller-saved registers and the compiler has no knowledge of your task switching system (vast majority of cases) you will be forced to pay an extra cost. Is there a practical system in common use that is able to elide register-saves that you’re referring to? Or is your point essentially that you don’t have save caller-saved/live registers in the theoretical case that you have no caller-saved/live registers?
You don’t need explicit compiler help. At least with C this can be done entirely with a library. A task_switch() call can conform to the standard C ABI, requiring no compiler support, and do the switching (in assembly), without Duff’s device. This is for example how kernels written in C do their task switching.
Likely the same can be said for Rust and nearly any language, since they all have ABIs that can be conformed to, such that task_switch() looks like a normal function call.
Oh, I have written my own share of userspace C context switching libraries, I know all the gory details :). For example, see my minimalist [1] stackful coroutine library: the full context switching logic is three inline asm instructions (99% of the complexity in that code is to transparently support throwing exceptions across coroutine boundaries with no overhead in the happy path).
You need compiler help for the custom calling convention support and possibly to optimize away the context switching overhead for stackful coroutines, which is something that compilers can already do for stackless coroutines.
Duff's device is just a way to simulate stackless coroutines (i.e. async/await or whatever) in plain C, in a way that the compiler can still optimize quite well.
> the full context switching logic is three inline asm instructions
You tell the compiler that you clobber the caller-saved registers (GPD_CLOBBERS), so in terms of cost it’s not just three asm instructions. Since these are caller-saved registers they will be live at every point in your code, even if your task switch routine is inlined. You have to consider the code the compiler generates to preserve the caller-saved registers before invoking your instruction sequence when evaluating total cost. This is an additional cost that is not necessary in callback style.
Caller-saved registers (aka. "volatile registers") are only saved when they are live in the caller at the point of a function call, and they are not always live. Code generation tends to prefer callee-saved registers instead at these points, precisely so they don't need to be saved there. Whether callee-saved registers are live at the inline task switch depends on whether they have been saved already in the function prologue, and if they are in use as temporaries. Not many registers are live at every point, typically just the stack and frame pointers.
Both types of code (async and stackful) have to save live state, whatever it is, across context switches, whether that's spilling registers to the stack in the stackful case, or into the future object across "await" in the async case. However, typically the code generator has more leeway to decide which spills are optimal in the stackful case, and the spills are to the hot stack, so low cost. Unless integrated with the code generator, async/await spills tend to be unconditional stores and loads, thus on average more expensive.
You're right about potentially redundant saves at stackful context switch. (Though if you control the compiler, as you should if you are comparing best implementations of both kinds, you can avoid truly redundant saves)
However, in practice few of the callee-saved registers are really redundant. If the caller doesn't use them, its caller or some ancestor further up the chain usually does. If any do, they are genuine live state rather than redundant. There are cases you can construct where no ancestor uses a register, or uses one when it would be better not to, so that in theory it would be better not to use it and not to save it on context switch. But I think this is rare in real code.
You must compare this against the various extra state storage, and memory allocations, in async/await: for example storing results in future objects in some implementations, spilling live state to an object when the stackful compiler would have used a register or the hot stack, and the way async/await implementations tend to allocate, fill, and later free a separate future object for each level of await in the call stack. All that extra storing is not free. Also, when comparing against the best of stackful, how many await implementations compile to pure continuation jumps without a return to an event loop function just to call the next async handler, and how many allow await results to be transferred directly in registers from generator to consumer, without being stored in the allocated future object?
I would summarise the difference between async/await and stackful-cooperative as: the former has considerable memory op overheads, but they are diffused throughout the code, so the context switch itself looks simple. It's an illusion, though, just like the "small asm" stackful context switch is an illusion due to clobbered live registers. The overhead is still there either way, and I think it's usually slightly higher in the async/await version. But async/await does have the advantage of not needing a fixed-size "large enough for anything" stack to be preallocated per context, which it replaces with multiple and ongoing smaller allocations & frees per context instead.
It would be interesting to see an async/await transform applied to the Linux kernel, to see if it ended up faster or slower.
I see the confusion now, I intended all of my arguments regarding caller-saved to actually refer to callee-saved registers. Hopefully you understand that you can never avoid preserving callee-saved registers with a cooperative task switching system and that this is not a necessary cost in an async/callback system.
> For example storing results in future objects in some implementations, spilling live state to an object when the stackful compiler would have used a register or the hot stack,
Regarding spill of live registers to heap state in the async case vs stack state in the sync case: contemporary compilers are very good at keeping working state in registers and eliding redundant loads/stores to memory, as long as they can prove that that memory is not referenced outside of that control flow. This is true whether the source of truth is on the stack or the heap, due in part to the C11 memory model, which was somewhat implicit before it was standardized. In other words, an optimizing compiler does not treat all loads/stores as “volatile *”. Furthermore, the heap state relevant to the current task would be as “hot” as the stack (in terms of being in cache) since it’s likely recently used. Given that, I am skeptical of the argument that using heap state is slower than stack state due to increased register spillage. Again, just to be clear, this entire case is a separate discussion from the unavoidable added cost of preserving callee-saved registers in the synchronous case.
Where heap state does have a cost is in allocation. Allocating a stack variable is essentially free, while allocating heap memory can be expensive due to fragmentation and handling concurrent allocations. This cost is usually amortized since allocation is usually done once per task instantiation. Contrast this with register preservation, which is done on every task switch.
This only adds to the confusion. Which one is the caller and the callee? The executor and the coroutine? Or the coroutine and the task-switch code?
In my implementation, at the task switching point, there are no callee-saved registers: all registers are clobbered and any live register must be saved (also there is no difference between coroutine and non-coroutine code; control transfer is completely symmetrical), so calling from the executor into the coroutine and from the coroutine into the executor (or even between coroutines) would run the same code.
At the limit, both async (i.e. stackless coroutines, i.e. non-first-class continuations) and fibers (i.e. stackful coroutines, i.e. first-class continuations) can be translated mechanically into CPS form (i.e. purely tail-calling, never-returning functions). The only difference is the number of stack frames that can be live at a specific time (exactly one for stackless coroutines, unbounded for stackful), so any optimization that can be done on the former can also be done on the latter, and a stackful coroutine that always yields from top level would compile down to exactly the same machine code as a stackless coroutine.
Thanks for the great contribution to the thread. Pretty much my thoughts.
I do believe that having to allocate a large stack is a downside, but again, with compiler help it should be possible, at least in theory, to compile stackful coroutines whose stack usage is known and bounded (i.e. all task switches happen at top level or in non-recursive inlined functions) down to exactly the same stack usage as stackless coroutines.
Sibling has an amazing long response, but tl;dr, the live registers that are clobbered are exactly the same as those that would need to be saved in the async case.
I see the confusion now. I wrote caller-saved when I meant callee-saved. In general you don’t need to preserve callee-saved registers if you don’t use them during normal control flow, but in the “sync” case you always have to save callee-saved registers on task switch. In the async case, you simply return to the executor.
> ...and the assumption of a cooperative system provides more opportunities for optimization
Also much more opportunities for blocking and starvation. Especially when people try to reinvent schedulers without understanding how they work in the first place.
I have a stupid question: why isn't an async runtime a language feature of Rust? We don't seem to see so many async runtimes in other languages? They seem to have a default way to run async tasks?
It sort of goes against Rust’s philosophy to bake anything like that into the language.
Where they screwed up was not providing the machinery to make libraries agnostic of the runtime an end user wants to use in their program, so libraries either depend on a specific runtime explicitly or use features to allow users to switch runtimes at compile time. This causes a lot of headaches for library maintainers and end users both.
There’s a lot of interest in adding said machinery (through collections of traits in std) to enable libraries to be generic over different runtimes, but a solution is still some ways off.
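The shape people usually sketch is a small set of spawn/timer/IO traits that libraries can be generic over - something like this entirely hypothetical example (nothing like it is in std today):

```rust
use std::future::Future;
use std::pin::Pin;

// Hypothetical: a runtime-agnostic spawn trait of the kind people would
// like to see standardized. It is NOT a real std or tokio API.
pub trait Spawner {
    fn spawn(&self, fut: Pin<Box<dyn Future<Output = ()> + Send + 'static>>);
}

// A library can then be generic over whatever runtime the end user picked,
// instead of hard-coding tokio::spawn or a feature flag per runtime.
pub fn start_background_job<S: Spawner>(spawner: &S) {
    spawner.spawn(Box::pin(async {
        // do work without knowing whether tokio, smol, etc. is underneath
    }));
}
```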
> I have a stupid question, why isn't an async runtime a language feature of rust?
Because it was considered inimical to the core values and purpose of the language, which is to be a systems language.
An async runtime being a language feature means a runtime is a language feature, and that is very much undesirable (in fact it used to be part of the language and was removed as it "settled" into its niche from its original design, which was much higher level and more applicative).
An async runtime inherently requires knowledge of the underlying system, which Rust remains agnostic to. Rust can be written to target anything from bare metal (where no OS exists) to WASM and everything in between. If a runtime were added to the language itself, you would inherently limit the flexibility of the platforms it could target.