Hacker News

Your comment is correct, but it might cause readers to underestimate how annoying this tuning work is and how difficult it is to get everything into hugepages (executable memory, stack memory, and shared libraries if applicable - not just specific heap allocations). We are trading a joke asset class on joke venues that have millisecond-scale jitter, so we can get away with using io_uring instead of kernel-bypass networking.


The part about getting everything into hugepages sounds interesting. Any idea where I can find some resources on that? Most of what I was able to find only tells you how to do it for heap allocations.


It can be done by manually remapping the relevant sections upon application startup.

Perhaps [1] is a good resource to start with (page 7). Example code is here [2], and [3] runs some experiments with it.

[1] https://www.kernel.org/doc/ols/2006/ols2006v2-pages-83-90.pd...

[2] https://github.com/intel/iodlr/blob/master/large_page-c/exam...

[3] https://easyperf.net/blog/2022/09/01/Utilizing-Huge-Pages-Fo...


Thanks, cool stuff. Especially liblppreload.so described in [2] and [3]. I'll give it a try. Do you have any tips on how to achieve the same for the stack?


I haven't done this myself, but given that ELF binaries don't have a dedicated .stack section, I guess you first have to find out which memory address range the process uses for its stack.

Finding the top of the stack should be possible by taking the address of a local variable upon entering main().

Finding the other end depends on how big the stack is and whether it grows upwards or downwards. The former is dynamic in nature but usually capped at the system level on Linux (ulimit -s); as for the latter, the stack grows downwards on most architectures, including x86.
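A more direct way than deducing it from addresses in main(): on Linux with glibc you can ask for the current thread's stack bounds via pthread_getattr_np(). A small sketch (glibc-specific; link with -lpthread):

```c
/* Query the calling thread's stack bounds. On Linux/x86-64 the stack
 * grows downward, so *hi is where it starts and *lo is the limit. */
#define _GNU_SOURCE
#include <pthread.h>

int get_stack_bounds(void **lo, void **hi)
{
    pthread_attr_t attr;
    void *addr;
    size_t size;

    if (pthread_getattr_np(pthread_self(), &attr) != 0)
        return -1;
    pthread_attr_getstack(&attr, &addr, &size);  /* addr = lowest address */
    pthread_attr_destroy(&attr);

    *lo = addr;
    *hi = (char *)addr + size;
    return 0;
}
```

Any local variable's address should fall inside [lo, hi), which is an easy sanity check.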


IMO the easiest way (but certainly not the only way) is to allocate a new stack and switch to it with makecontext(). The manpage has a full code example. You just need to change the stack alloc. This approach has a few drawbacks but is hard to beat in terms of simplicity.


Just wondering, how useful is it to get code and stack memory into hugepages? I thought they're usually accessed sequentially, so putting them in hugepages shouldn't matter that much.


Code is not really accessed sequentially. Just imagine a function calling another function that sits in another translation unit. It depends on what the linker does, but there's a good chance these won't end up near each other unless you explicitly optimized for that case. This is why source-code-level locality is also important - to minimize instruction cache misses. It's also why you don't want to go overboard making everything dynamically dispatched (e.g. virtual functions) unless you really need to.

EDIT: Putting the code segment into hugepages will relieve some of the pressure of virtual address translation, which is otherwise higher with 4K pages. Whether this improves runtime execution or stays neutral depends greatly on the workload and cannot be said up front.


Wrt code, look at the bench in the article. Even with sequential access, you can get a decent speedup using huge pages. But unless you have a good profile and use PGO, code access likely won't be that sequential anyway. Like everything else, you'll need to measure to know exactly what benefit you might get. As a starting point, you can look at the iTLB load misses with perf stat -d.

Stack access is another story as it's usually local and sequential so it might not be that useful.


The most benefit comes from the fact that you end up with a lot fewer TLB misses, since a single mapping covers a large chunk of memory. A predictable memory access pattern helps with cache misses thanks to hardware prefetching, but as far as I know the hardware prefetcher won't prefetch past a page boundary if that would cause a TLB miss, on most CPUs.
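The arithmetic behind "a single mapping covers a large chunk": assuming a hypothetical 64-entry L1 TLB (real sizes vary by CPU), switching from 4 KiB to 2 MiB pages multiplies the TLB's reach by 512:

```c
/* TLB reach = number of entries x page size. The 64-entry TLB is an
 * illustrative assumption, not any specific CPU's configuration. */
long tlb_reach_kib(long entries, long page_kib)
{
    return entries * page_kib;
}

/* 64 entries x 4 KiB pages    -> 256 KiB of reach
 * 64 entries x 2048 KiB pages -> 128 MiB of reach */
```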



