If you're doing truly low-latency stuff you shouldn't be swapping at all; everything should be 100% resident in memory at all times, so swapped-out "pages" are irrelevant to you. (You should also probably be using something like the PREEMPT_RT patchset, adjusting scheduling priorities, and doing your best to ensure that the CPU core(s) your app runs on aren't burdened with serving interrupts. Plus likely a lot of other stuff that I haven't touched on in this brief comment.)
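On Linux the "100% resident" part is mostly a single call; a minimal sketch (you'll need CAP_IPC_LOCK or a big enough memlock rlimit for it to succeed):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* Lock all current and future mappings in RAM so nothing is ever swapped out. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return 1;
        }
        /* ... latency-critical work ... */
        return 0;
    }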
Stock / near stock Linux is pretty close to fine for HFT.
You basically only interact with the kernel on init/shutdown or outside of the fast path, and use something like isolcpus to confine the kernel and interrupt handling to a couple of garbage cores, leaving the rest for you to do what you want with.
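E.g. something along these lines in /etc/default/grub (the core ranges here are made up; adapt them to your topology):

    GRUB_CMDLINE_LINUX="... isolcpus=2-7 nohz_full=2-7 rcu_nocbs=2-7 irqaffinity=0-1"

Then pin your hot threads onto the isolated cores with taskset or pthread_setaffinity_np().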
Your comment is correct, but it might cause readers to underestimate how annoying this tuning work is and how difficult it is to get everything into hugepages (executable memory, stack memory, and shared libraries if applicable, not just specific heap allocations). We trade a joke asset class on joke venues with millisecond-scale jitter, so we can get away with using io_uring instead of kernel-bypass networking.
The part about getting everything into hugepages sounds interesting. Any idea where I can find some resources on that? Most of what I was able to find only tells you how to do it for heap allocations.
Thanks, cool stuff. Especially liblppreload.so described in [2] and [3]. I'll give it a try. Do you have any tips on how to achieve the same for the stack?
I haven't done this myself, but given that ELF doesn't have a dedicated .stack section, I guess you first have to find out which memory address range the process will use for the stack.
The beginning of the stack should be deducible from the addresses of local variables upon entering main().
The end of the stack depends on how big the stack is and whether it grows upwards or downwards. The former is dynamic in nature but usually capped at the system level on Linux (ulimit -s); as for the latter, I'm not certain, but I believe it grows downwards on most architectures.
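That said, on Linux/glibc it seems you can just ask instead of deducing; an untested sketch:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>

    int main(void) {
        pthread_attr_t attr;
        void *stack_addr;
        size_t stack_size;

        /* glibc extension: fetch the attributes of the running thread, main included. */
        if (pthread_getattr_np(pthread_self(), &attr) != 0) return 1;
        if (pthread_attr_getstack(&attr, &stack_addr, &stack_size) != 0) return 1;
        pthread_attr_destroy(&attr);

        /* stack_addr is the lowest address; the stack grows down toward it. */
        printf("stack: [%p, %p), %zu bytes\n",
               stack_addr, (void *)((char *)stack_addr + stack_size), stack_size);
        return 0;
    }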
IMO the easiest way (but certainly not the only one) is to allocate a new stack and switch to it with makecontext(). The man page has a full code example; you just need to change the stack allocation. This approach has a few drawbacks but is hard to beat in terms of simplicity.
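Roughly like this, assuming 2 MiB huge pages are available (check /proc/sys/vm/nr_hugepages); untested sketch with minimal error handling, and the stack size is an arbitrary choice:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <ucontext.h>

    /* 16 MiB stack = eight 2 MiB huge pages (size must be huge-page aligned). */
    #define STACK_SIZE (16UL * 1024 * 1024)

    static ucontext_t uctx_main, uctx_app;

    static void app_main(void) {
        /* Every stack frame from here down lives in huge pages. */
        puts("running on a huge-page-backed stack");
    }

    int main(void) {
        void *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (stack == MAP_FAILED) { perror("mmap"); return 1; }

        if (getcontext(&uctx_app) == -1) { perror("getcontext"); return 1; }
        uctx_app.uc_stack.ss_sp   = stack;
        uctx_app.uc_stack.ss_size = STACK_SIZE;
        uctx_app.uc_link          = &uctx_main;  /* return here when app_main exits */
        makecontext(&uctx_app, app_main, 0);

        return swapcontext(&uctx_main, &uctx_app) == -1 ? 1 : 0;
    }

Among the drawbacks I mentioned: tools that walk the stack (debuggers, profilers, sanitizers) can get confused by the context switch, and an mmap'd stack has no guard page unless you add one yourself.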
Just wondering, how useful is it to get code and stack memory into hugepages? I thought they're usually accessed sequentially, so putting them in hugepages shouldn't matter that much.
Code is not really accessed sequentially. Just imagine a function calling another function that sits in a different translation unit. It depends on what the linker does, but there's a good chance the two won't sit near each other unless you explicitly optimized for that case. This is why source-code-level locality is also important: it minimizes instruction cache misses. It's also why you don't want to go overboard making everything dynamically dispatched (e.g. virtual functions) unless you really need to.
EDIT: Putting the code segment into hugepages will relieve some of the pressure on virtual address translation, which is otherwise higher with 4K pages. Whether this improves runtime or stays neutral depends greatly on the workload, I think, and can't be predicted upfront.
Wrt code, look at the benchmark in the article: even with sequential access you can get a decent speedup from huge pages. And unless you have a good profile and are using PGO, code access likely won't be that sequential anyway.
Like everything else, you'll need to measure to know exactly what benefit you might get. As a starting point, look at the iTLB load misses with perf stat -d.
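I.e. something like this (exact event names vary across CPUs and perf versions, and your_app is a placeholder):

    perf stat -d ./your_app
    perf stat -e iTLB-loads,iTLB-load-misses,dTLB-loads,dTLB-load-misses ./your_app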
Stack access is another story as it's usually local and sequential so it might not be that useful.
The biggest benefit is that you end up with far fewer TLB misses, since a single mapping covers a large chunk of memory. A predictable memory access pattern helps with cache misses thanks to hardware prefetch, but as far as I know, on most CPUs the prefetcher won't prefetch across a page boundary where it would cause a TLB miss.
Nothing needs to be changed about your kernel to bypass it.
You can outright install the OpenOnload drivers, LD_PRELOAD its library to intercept epoll, and it literally just works.
Going to efvi can cut out ~1us, but that requires specifically targeting efvi and brings more operational / code setup pain. It works on stock Linux all the same, though.
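If memory serves, the preload route is roughly this simple (the app name is made up):

    onload ./your_trading_app                    # wrapper that sets up the preload
    LD_PRELOAD=libonload.so ./your_trading_app   # or do it by hand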
Wondering if there's any guide to programming and accurately measuring low-latency stuff. I'm working on some low-level memory management code and would like to see its latency behavior, but I always get several microseconds of standard deviation (~10%) when I try to benchmark it.
I'm pinning the cores and have disabled SMT and turbo boost, but haven't tried isolcpus because that requires rebooting the machine.
Really? With Google Bench or Criterion I've gotten pretty good resolution on microbenchmarks
The gold standard (in my experience) for latency measurement is setting up a packet splitter, marking your outbound packets with some hash/ID of the inbound packet, taking hardware timestamps of all of these on a dedicated host, and putting it all together after the fact. Ultimately, packet-in to packet-out is all that matters anyway.
It looks like you don't have a good understanding of how virtual memory works and how the hardware (TLB), the OS (page tables), and higher-level software are intertwined in that space.
Also, PREEMPT_RT is the worst option for low latency because it's about execution-time guarantees, not speed specifically. If you're on PREEMPT_RT and give your critical thread the highest priority, be prepared for some serious OS-level lock-ups.
> If you're on PREEMPT_RT and give your critical thread the highest priority, be prepared for some serious OS-level lock-ups.
PREEMPT_RT includes priority inheritance specifically to avoid this scenario, so your app should indeed be favored if you tune accordingly. What you also seem to be saying is that PREEMPT_RT may lead to lower throughput, but that's not the same thing as latency.
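For concreteness, "tune accordingly" starts with something like this sketch (priority 80 is an arbitrary choice, and you need root or an appropriate rtprio rlimit; chrt -f 80 ./app does the same from the shell):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Put the calling thread on the SCHED_FIFO real-time policy. */
    static int make_realtime(int prio) {
        struct sched_param sp = { .sched_priority = prio };
        int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
        if (err != 0)
            fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
        return err;
    }

    int main(void) {
        if (make_realtime(80) != 0)  /* 80 is arbitrary; 1..99 is the FIFO range */
            return 1;
        /* ... critical work here ... */
        return 0;
    }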
I'm not entirely sure you understand the memory hierarchy: RAM is volatile and finite, so paging has to happen, and keep in mind that reading memory from disk is a LOT slower than reading from main memory.