Have any other Rust async runtimes use io_uring/gotten at all good yet? Best of ...

threeseed · on Dec 9, 2021

Glommio uses io_uring: https://github.com/DataDog/glommio

And I integrated Hyper as an example: https://github.com/DataDog/glommio/blob/master/examples/hype...

And the performance was blisteringly quick (6x better latency streaming from a file compared to Nginx).

mitchtbaum · on Dec 9, 2021

> Have any other Rust async runtimes use io_uring/gotten at all good yet?

yes, check out `actix-rt`

https://github.com/actix/actix-net

ibraheemdev · on Dec 9, 2021

actix-rt is a wrapper around tokio's single threaded runtime, and (optionally) tokio-uring.

otabdeveloper4 · on Dec 9, 2021

[flagged]

jandrewrogers · on Dec 9, 2021

Async models are idiomatic for high-performance server code regardless of the programming language, particularly for anything I/O or concurrency intensive. The reason people use thread-per-core software architectures, which naturally require an async model, is because they have much higher throughput than the alternatives.

If software performance, efficiency, and scalability are primary objectives, you are going to be writing a lot of async code in a systems language like Rust or C++. People that “know what they are doing” understand this and why. Hence the interest in async libraries for Rust.

errantmind · on Dec 9, 2021

The GP mentioned "async runtimes". There are other approaches to async that don't involve using an async runtime, like epoll / kqueue. I personally prefer writing synchronous code, running in multiple threads pulling from a shared work queue. It isn't a one-size-fits-all solution but it is widely applicable, and you get to avoid the complexity of writing 'async code'

alexchamberlain · on Dec 9, 2021

Async runtimes are just wrappers around the underlying async technology, like epoll, kqueue and io_uring.

They give structure in a similar way if/for/while put structure around the underlying PC jumps.

errantmind · on Dec 9, 2021

Yes, but they also introduce a lot of complexity to the developer experience. It is turtles all the way down with async code, as they say

MuffinFlavored · on Dec 9, 2021

can you take code that is blocking in nature and make it async (like an FFI dlsym call), or at a low level does that just boil down to lots of polling/timers?

vlovich123 · on Dec 9, 2021

Not sure what code you’re looking at but in reality all code is asynchronous by nature (ie you ask the HW to do something and it tells you it’s done some time later). Then we layer blocking syscalls on top and then you layer async underneath. Io_uring is an attempt at getting everything asynchronous top to bottom.

Some times we use polling if there’s sufficient traffic and interrupts are inefficient but that’s still asynchronous, just batched.

weq · on Dec 9, 2021

http://think-async.com/Asio/asio-1.5.3/doc/asio/overview/cor...

otabdeveloper4 · on Dec 9, 2021

> ...is because they have much higher throughput than the alternatives

You're equating "high-performance" with "throughput". This is a classic newbie mistake and exactly what I mean when I mention cargo-cult.

staticassertion · on Dec 9, 2021

Yes, io_uring, a linux subsystem, is clearly just a cargo cult feature for people who don't know what they're doing.

errantmind · on Dec 9, 2021

Last I saw there was no substantial evidence of io_uring delivering a significant improvement over epoll, but I haven't looked in the last year. I wouldn't assume it is good just because it is in Linux (this is coming from someone who uses Linux exclusively)

threeseed · on Dec 9, 2021

Up to 2.44x faster for network zerocopy send: https://lore.kernel.org/io-uring/cover.1638282789.git.asml.s...

Up to 1.6x faster than epoll: https://twitter.com/axboe/status/1362271500489793539

io_uring 11% faster for non-polled IO: https://twitter.com/axboe/status/1465358880502861829/photo/1

And it's not like io_uring is complete. With every kernel release the gap over epoll gets bigger.

errantmind · on Dec 9, 2021

I was unable to replicate these results, and neither can other people. I'll look into it again though and see.

https://github.com/axboe/liburing/issues/189

edit: I'm not sure why I'm unable to reply to the comment below but I don't think anyone is being unnecessarily combative. I linked it twice because it is relevant to both replies to my comment. There is useful information in that thread if you read it end to end. What it does demonstrate is io_uring isn't a clear cut improvement over a much simpler approach. That may change in the future as it is improved though.

wtallis · on Dec 9, 2021

I really don't see why you've chosen to link to that github issue multiple times in this thread. The discussion in that issue doesn't convincingly demonstrate anything except that some people are trying to be unnecessarily combative about io_uring. We don't need to be dragging that attitude into HN threads. If you want to talk about some real performance problems with io_uring, find a way to do so without that baggage.

CoolGuySteve · on Dec 9, 2021

I found that io_uring socket latency is about 25% higher than epoll in my own benchmarks.

io_uring should be faster by design but it currently doesn’t seem to be. Maybe disk io is a different story.

jlokier · on Dec 9, 2021

I'm testing io_uring for file I/O for a database engine, which is a very different code path than networking with epoll, and so far I've been disappointed to find it roughly 20% slower than my userspace thread pool doing the same thing for that task.

Most likely there is some tweak to batching and maybe SQE submission watermarks, but I haven't found the formula yet.

I was surprised to find the sqpoll thread spins instead of a fast timer. Seems like a contention problem on a single core system, which still exist in VMs, etc.

staticassertion · on Dec 9, 2021

I don't assume it's good because of Linux, I have a generally negative view of Linux. It's just stupid to attribute io_uring to rust or call it cargo culting for Rust to use io_uring. The two projects are unrelated.

As for performance, I wouldn't judge based on a year ago, obviously a lot has changed since then - you can find numbers if you'd like, I saw the maintainer posting benchmarks only a few weeks ago.

errantmind · on Dec 9, 2021

Yea, I'll have to check it out again. I see very mixed results, it is likely io_uring may not fully replace epoll for all situations as some people thought it would. That's fine though

https://github.com/axboe/liburing/issues/189

threeseed · on Dec 9, 2021

As I posted above I integrated Hyper (web server) with Glommio (io_uring based async runtime).

Based on my limited benchmarks against Vertx/Scala and Nginx it was significantly faster, had zero failed transactions and used a fraction of the memory/CPU.

That means I need less servers and have a better end user experience.

jeltz · on Dec 9, 2021

Care to explain why they only make sense for interpreted languages and what limitations you are talking about? Async in Rust is mostly just a convenient syntax around ideas which has been around in the C and C++ worlds for decades with libraries like libevent.

_dh54 · on Dec 9, 2021

“Async” / event loop systems will usually be more efficient because 1) they don’t have to store unnecessary processor state to memory between handling events and 2) they are associated with cooperative event handlers and the assumption of a cooperative system provides more opportunities for optimization.

jlokier · on Dec 9, 2021

Ironically, it is not true that "async" (stackless) event loop systems don't store unnecessary processor state to memory.

Those systems store the entire processing state in future object memory. It is the same state that threaded systems store.

Stackful context switching systems (which includes some kinds of efficient threads - I would count Linux kernel internal threads among these) store that same state in the stack and context objects when context switching, so in principle store about the same amount of state. But some async, stackless systems do a bunch of extra work unwinding and restoring entire call stacks whenever an async task pauses in await, and therefore take more CPU time to context switch than stackful systems do.

_dh54 · on Dec 9, 2021

> It is the same state that threaded systems store.

Nope. Synchronous task switching systems additionally have to store CPU register state. In the cooperative case you need to store the caller-saved registers, in the pre-emptive case you need to store all the registers.

Async systems simply don’t have to do that extra work.

> But some async, stackless systems do a bunch of extra work unwinding and restoring entire call stacks whenever an async task pauses in await

Just the really bad ones. In any competent system a callback is stored ready to handle the event. No “unwinding” necessary.

gpderetta · on Dec 9, 2021

A cooperative synchronous task switching (i.e. fiber based) need only save the exact same information as an async based one (i.e. stackless coroutines): at a minimum a context pointer and an instruction pointer. Plus any live registers (which might be none).

You only need to save caller saved registers fs your context switching routine uses the conventional ABI, but that's not a requirement.

_dh54 · on Dec 9, 2021

> You only need to save caller saved registers fs your context switching routine uses the conventional ABI, but that's not a requirement.

If you request a task switch from C or any high level language that has the concept of caller-saved registers and the compiler has no knowledge of your task switching system (vast majority of cases) you will be forced to pay an extra cost. Is there a practical system in common use that is able to elide register-saves that you’re referring to? Or is your point essentially that you don’t have save caller-saved/live registers in the theoretical case that you have no caller-saved/live registers?

gpderetta · on Dec 10, 2021

The assumption is that you control the compiler (we are comparing with async that outside of duff device hacks does require compiler help).

But even without explicit compiler help, you can go a long way, say, with gcc extended inline asm.

_dh54 · on Dec 10, 2021

You don’t need explicit compiler help. At least with C this can be done entirely with a library. A task_switch() call can conform to the standard C-ABI, requiring no compiler support, and do the switching (in assembly). Without duff’s device. This is for example how kernels written in C do their task switching.

Likely the same can be said for Rust and nearly any language, since they all have ABIs to which can be conformed such that task_switch() looks like a normal function call.

gpderetta · on Dec 10, 2021

Oh, I have written my own share of userspace C context switching libraries, I know all the gory the details :). For example see my minimalist [1] stackful coroutine library: the full context switching logic is three inline asm instructions (99% of the complexity in that code is to transparently support throwing exceptions across coroutine boundaries with no overhead in the happy path).

You need compiler help for the custom calling convention support and possibly to optimize away the context switching overhead for stackful coroutines, which is something that compilers can already do for stackless coroutines.

The duff device is just a way to simulate stackless coroutines (i.e. async/await or whateverer) in plain C, in a way that the compiler can still optimize quite well.

[1] https://github.com/gpderetta/delimited/blob/master/delimited...

_dh54 · on Dec 12, 2021

> the full context switching logic is three inline asm instructions

You tell the compiler that you clobber the caller-saved registers (GPD_CLOBBERS), so in terms of cost it’s not just three asm instructions. Since these are caller-saved registers they will be live at every point in your code, even if your task switch routine is inlined. You have to consider the code the compiler generates to preserve the caller-saved registers before invoking your instruction sequence when evaluating total cost. This is an additional cost that is not necessary in callback style.

jlokier · on Dec 13, 2021

Caller-saved registers (aka. "volatile registers") are only saved when they are live in the caller at the point of a function call, and they are not always live. Code generation tends to prefer callee-saved registers instead at these points, precisely so they don't need to be saved there. Whether callee-saved registers are live at the inline task switch depends on whether they have been saved already in the function prologue, and if they are in use as temporaries. Not many registers are live at every point, typically just the stack and frame pointers.

Both types of code (async and stackful) have to save live state, whetever it is, across context switches, whether that's spilling registers to the stack in the stackful case, or into the future object across "await" in the async case. However, typically the code generator has more leeway to decide which spills are optimal in the stackful case, and the spills are to the hot stack, so low cost. Unless integrated with the code generator, async/await spills tend to be unconditional stores and loads, thus on average more expensive.

You're right about potentially redundant saves at stackful context switch. (Though if you control the compiler, as you should if you are comparing best implementations of both kinds, you can avoid truly redundant saves)

However, in practice few of the callee-saved registers are really redundant. If the caller doesn't use them, it's caller or some ancestor further up the chain usually does. If any do, they are genuine live state rather than redundant. There are cases you can construct where no ancestor uses a register, or uses one when it would be better not to, so that in theory it would be better not to use it and not to save it on context switch. But I think this is rare in real code.

You must compare this against the the various extra state storage, and memory allocations, in async/await: For example storing results in future objects in some implementations, spilling live state to an object when the stackful compiler would have used a register or the hot stack, and the way async/await implementations tend to allocate, fill, and later free a separate future object for each level of await in the call stack. All that extra storing is not free. Also, when comparing against the best of stackful, how many await implementations compile to pure continuation jumps without a return to an event loop function just to call the next async handler, and how many allow await results to be transferred directly in registers from generator to consumer, without being stored in the allocated future object?

I would summarise the difference between async/await and stackful-cooperative is that the former has considerable memory op overheads, but they are diffused throughout the code, so the context switch itself looks simple. It's an illusion, though, just like the "small asm" stackful context switch is an illusion due to clobbered live registers. The overhead is still there, either way, and I think it's usually slightly higher overhead in the async/await version. But async/await does have the advantage of not needing a fixed size "large enough for anything" stack to be preallocated per context, which it replaces with multiple and ongoing smaller allocations & frees per context instead.

It would be interesting to see an async/await transform applied to the Linux kernel, to see if it ended up faster or slower.

_dh54 · on Dec 14, 2021

I see the confusion now, I intended all of my arguments regarding caller-saved to actually refer to callee-saved registers. Hopefully you understand that you can never avoid preserving callee-saved registers with a cooperative task switching system and that this is not a necessary cost in an async/callback system.

> For example storing results in future objects in some implementations, spilling live state to an object when the stackful compiler would have used a register or the hot stack,

Regarding spill of live registers to heap state in the Async case vs stack state in the sync case, contemporary compilers are very good at keeping working state in registers and eliding redundant loads/stores to memory as long as it can prove that that memory is not referenced outside of that control flow. This is truth whether the source of truth is on the stack or the heap. This is due in part to the C11 memory model, which was somewhat implicit before it was standardized. In other words, an optimizing compiler does not treat all load/stores as “volatile *”. Furthermore, the heap state relevant to the current task would be as “hot” as the stack (in terms of being in cache) since it’s likely recently used. Given that I am skeptical of the argument that using heap state is slower than stack state due to increased register spillage. Again, just to be clear, this entire case is a separate discussion from the unavoidable added cost of preserving callee-saved registers in the synchronous case.

Where heap state does have a cost is in allocation. Allocating a stack variable is essentially free, while allocating heap memory can be expensive due to fragmentation and handling concurrent allocations. This cost is usually amortized since allocation is usually done once per task instantiation. Contrast this with register preservation, which is done on every task switch.

gpderetta · on Dec 17, 2021

This only adds to the confusion. Which one is the caller and the callee? The executor and the coroutine? Or the coroutine and the task-switch code?

In my implementation, at the task switching point, there is no callee saved registers, all registers are clobbered and any live register must be saved (also there is no difference between coroutine vs non-coroutine code, control transfer is completely symmetrical), so calling from executor into the coroutine and coroutine into the executor (or even between coroutines) would run the same code.

At the limit both async (i.e. stackless coroutines i.e. non-first-class continuations) and fibers (i.e. stackfull coroutines, i.e. first-class continuations) can be translated mechanically in CPS form (i.e. purely tail calling, never returning functions, the only difference is the amount of stack frames that can be live at a specific time (exactly one for stackless coroutines, unbounded for stackful), so any optimization that can be done on the former can also be done on the latter, so a stackful coroutine that always yields form top level would compile down to exactly the same machine code as a stackless coroutine.

gpderetta · on Dec 13, 2021

Thanks for the great contribution to the thread. Pretty much my thoughts.

I do believe that having to allocate a large stack is a downside, but again, with compiler help it should be possible, at least in theory, to compile stackful coroutines whose stack usage is known and bounded (i.e all task switch happen at top level or on non recursive inlined functions) to exactly the same stack usage of stackless coroutines.

gpderetta · on Dec 13, 2021

Sibling has an amazing long response, but tl;dr, the live registers that are clobbered are exactly the same as those that would need to be saved in the async case.

_dh54 · on Dec 14, 2021

I see the confusion now. I wrote caller-saved when I meant callee-saved. In general you don’t need to preserve callee-saved registers if you don’t use them during normal control flow but in the “sync” case you always have to save callee-saved registers on task switch. In the Async case, you simple return to the executor.

otabdeveloper4 · on Dec 9, 2021

> ...and the assumption of a cooperative system provides more opportunities for optimization

Also much more opportunities for blocking and starvation. Especially when people try to reinvent schedulers without understanding how they work in the first place.