If you're doing truly low-latency stuff you shouldn't be swapping at all; everything should be 100% resident in memory at all times, so swapped-out "pages" are irrelevant to you. (You should also probably be using something like the PREEMPT_RT patchset, adjusting scheduling priorities, and doing your best to ensure that the CPU core(s) your app runs on aren't burdened with serving interrupts. Plus likely a lot of other stuff that I haven't touched on in this brief comment.)
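On Linux the "100% resident" part is mostly a single call; a minimal sketch (you'll need CAP_IPC_LOCK or a big enough memlock rlimit for it to succeed):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* Lock all current and future mappings in RAM so nothing is ever swapped out. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        /* ... latency-critical work ... */
        return 0;
    }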
Stock / near stock Linux is pretty close to fine for HFT.
You basically only interact with the kernel on init/shutdown or outside of the fast path, and use something like isolcpus to confine the kernel and interrupt handling to a couple of garbage cores, leaving the rest for you to do what you want with.
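E.g. something along these lines in /etc/default/grub (the core ranges here are made up; adapt them to your topology):

    GRUB_CMDLINE_LINUX="... isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7 irqaffinity=0-1"

Then pin your hot threads onto the isolated cores with taskset or pthread_setaffinity_np().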
Your comment is correct, but it might cause readers to underestimate how annoying this tuning work is and how difficult it is to get everything into hugepages (executable memory, stack memory, and shared libraries if applicable, not just specific heap allocations). We trade a joke asset class on joke venues with millisecond-scale jitter, so we can get away with using io_uring instead of kernel-bypass networking.
The part about getting everything into hugepages sounds interesting. Any idea where I can find some resources on that? Most of what I was able to find only tells you how to do it for heap allocations.
Thanks, cool stuff. Especially liblppreload.so described in [2] and [3]. I'll give it a try. Do you have any tips on how to achieve the same for the stack?
I haven't done this myself, but given that ELF doesn't have a dedicated .stack section, I guess you first have to find out which memory address range the process will use for the stack.
The beginning of the stack should be deducible from the addresses of local variables upon entering main().
The end of the stack depends on how big the stack is and whether it grows upwards or downwards. The former is dynamic in nature but usually capped at the system level on Linux (ulimit -s); as for the latter, I'm not certain, but I believe it grows downwards on most architectures.
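That said, on Linux/glibc it seems you can just ask instead of deducing; an untested sketch:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>

    int main(void) {
        pthread_attr_t attr;
        void *stack_addr;
        size_t stack_size;

        /* glibc extension: fetch the attributes of the running thread, main included. */
        if (pthread_getattr_np(pthread_self(), &attr) != 0) return 1;
        if (pthread_attr_getstack(&attr, &stack_addr, &stack_size) != 0) return 1;
        pthread_attr_destroy(&attr);

        /* stack_addr is the lowest address; the stack grows down toward it. */
        printf("stack: [%p, %p), %zu bytes\n",
               stack_addr, (void *)((char *)stack_addr + stack_size), stack_size);
        return 0;
    }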
IMO the easiest way (but certainly not the only one) is to allocate a new stack and switch to it with makecontext(). The man page has a full code example; you just need to change the stack allocation. This approach has a few drawbacks but is hard to beat in terms of simplicity.
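Roughly like this, assuming 2 MiB huge pages are available (check /proc/sys/vm/nr_hugepages); untested sketch with minimal error handling, and the stack size is an arbitrary choice:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <ucontext.h>

    /* 16 MiB stack = eight 2 MiB huge pages (size must be huge-page aligned). */
    #define STACK_SIZE (16UL * 1024 * 1024)

    static ucontext_t uctx_main, uctx_app;

    static void app_main(void) {
        /* Every stack frame from here down lives in huge pages. */
        puts("running on a huge-page-backed stack");
    }

    int main(void) {
        void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (stack == MAP_FAILED) { perror("mmap"); return 1; }

        if (getcontext(&uctx_app) == -1) { perror("getcontext"); return 1; }
        uctx_app.uc_stack.ss_sp   = stack;
        uctx_app.uc_stack.ss_size = STACK_SIZE;
        uctx_app.uc_link          = &uctx_main;  /* return here when app_main exits */
        makecontext(&uctx_app, app_main, 0);

        return swapcontext(&uctx_main, &uctx_app) == -1 ? 1 : 0;
    }

Among the drawbacks I mentioned: tools that walk the stack (debuggers, profilers, sanitizers) can get confused by the context switch, and an mmap'd stack has no guard page unless you add one yourself.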
Just wondering, how useful is it to get code and stack memory into hugepages? I thought they're usually accessed sequentially, so putting them in hugepages shouldn't matter that much.
Code is not really accessed sequentially. Just imagine a function calling another function that sits in a different translation unit. It depends on what the linker does, but there's a good chance the two won't sit near each other unless you explicitly optimized for that case. This is why source-code-level locality is also important: it minimizes instruction cache misses. It's also why you don't want to go overboard making everything dynamically dispatched (e.g. virtual functions) unless you really need to.
EDIT: Putting the code segment into hugepages will relieve some of the pressure on virtual address translation, which is otherwise higher with 4K pages. Whether this improves runtime or stays neutral depends greatly on the workload, I think, and can't be predicted upfront.
Wrt code, look at the benchmark in the article: even with sequential access you can get a decent speedup from huge pages. And unless you have a good profile and are using PGO, code access likely won't be that sequential anyway.
Like everything else, you'll need to measure to know exactly what benefit you might get. As a starting point, look at the iTLB load misses with perf stat -d.
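I.e. something like this (exact event names vary across CPUs and perf versions, and your_app is a placeholder):

    perf stat -d ./your_app
    perf stat -e iTLB-loads,iTLB-load-misses,dTLB-loads,dTLB-load-misses ./your_app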
Stack access is another story as it's usually local and sequential so it might not be that useful.
The biggest benefit is that you end up with far fewer TLB misses, since a single mapping covers a large chunk of memory. A predictable memory access pattern helps with cache misses thanks to hardware prefetch, but as far as I know, on most CPUs the prefetcher won't prefetch across a page boundary where it would cause a TLB miss.
Nothing needs to be changed about your kernel to bypass it.
You can outright install the OpenOnload drivers, LD_PRELOAD its library to intercept epoll, and it literally just works.
Going to efvi can cut out ~1us, but that requires specifically targeting efvi and brings more operational / code setup pain. It works on stock Linux all the same, though.
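If memory serves, the preload route is roughly this simple (the app name is made up):

    onload ./your_trading_app                    # wrapper that sets up the preload
    LD_PRELOAD=libonload.so ./your_trading_app   # or do it by hand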
Wondering if there's any guide to programming and accurately measuring low-latency stuff. I'm working on some low-level memory management code and would like to see its latency behavior, but I always get several microseconds of standard deviation (~10%) when I try to benchmark it.
I'm pinning the cores and have disabled SMT and turbo boost, but haven't tried isolcpus because that requires rebooting the machine.
Really? With Google Bench or Criterion I've gotten pretty good resolution on microbenchmarks
The gold standard (in my experience) for latency measurement is setting up a packet splitter, marking your outbound packets with some hash/ID of the inbound packet, taking hardware timestamps of all of these on a dedicated host, and putting it all together after the fact. Ultimately, packet-in to packet-out is all that matters anyway.
It looks like you don't have a good understanding of how virtual memory works and how the hardware (TLB), the OS (page tables), and higher-level software are intertwined in that space.
Also, PREEMPT_RT is the worst option for low latency because it's about execution-time guarantees, not speed specifically. If you're on PREEMPT_RT and give your critical thread the highest priority, be prepared for some serious OS-level lock-ups.
> If you're on PREEMPT_RT and give your critical thread the highest priority, be prepared for some serious OS-level lock-ups.
PREEMPT_RT includes priority inheritance specifically to avoid this scenario, so your app should indeed be favored if you tune accordingly. What you also seem to be saying is that PREEMPT_RT may lead to lower throughput, but that's not the same thing as latency.
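For concreteness, "tune accordingly" starts with something like this sketch (priority 80 is an arbitrary choice, and you need root or an appropriate rtprio rlimit; chrt -f 80 ./app does the same from the shell):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Put the calling thread on the SCHED_FIFO real-time policy. */
    static int make_realtime(int prio) {
        struct sched_param sp = { .sched_priority = prio };
        int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
        if (err != 0)
            fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
        return err;
    }

    int main(void) {
        if (make_realtime(80) != 0)  /* 80 is arbitrary; 1..99 is the FIFO range */
            return 1;
        /* ... critical work here ... */
        return 0;
    }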
I'm not entirely sure you understand the memory hierarchy: RAM is volatile and finite, so paging has to happen, and keep in mind that reading memory from disk is a LOT slower than reading from main memory.