No need for /dev/zero: Linux has memfd[1] and OSX has vm_remap[2]. You only need one file descriptor per heap because Linux lets you poke holes with fallocate[3].
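To make that concrete, here's a minimal sketch (Linux-only, glibc >= 2.27; the `heap_open` name and the sizes are mine, not from the original) of backing a heap with a single memfd:

```c
/* One anonymous, unlinked file per heap -- no /dev/zero needed. */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <assert.h>

int heap_open(off_t bytes) {
    int fd = memfd_create("heap", MFD_CLOEXEC); /* [1]: anonymous memory-backed file */
    assert(fd >= 0);
    assert(ftruncate(fd, bytes) == 0);          /* logical size; pages allocated lazily */
    return fd;
}
```

Any number of virtual mappings can then be made against this one descriptor, all sharing the same backing pages.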
I'll define objects with a header that looks roughly like this:
struct header {
    off_t  phys;
    size_t len;
    int    localrefs;
    short  flags;
    char   mstrategy, type;
};
phys is the offset into the heap file descriptor. len is the number of units, and type indexes into an array of unit sizes.
mstrategy selects the bucket size (the allocated range is a power of two, so 1<<(mstrategy&31)) and the heap number.
localrefs is an optimisation which I'll get to.
If I want to allocate an array of type t and size n, I can use a BSR[4] to identify the bucket it needs and check whether there's anything on the free list. If there isn't, I can try to split a larger bucket into two parts (this is effectively Knuth's buddy allocator).
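As a sketch of that bucket selection (names are mine; I'm using the GCC/Clang `__builtin_clzll` as a stand-in for BSR, and assuming n*unit is nonzero):

```c
#include <assert.h>

/* Pick the buddy bucket for n elements of `unit` bytes each:
   the smallest b such that (1 << b) >= n * unit. */
static int bucket_for(unsigned long long n, unsigned long long unit) {
    unsigned long long bytes = n * unit;   /* assumed nonzero; overflow unchecked */
    int b = 63 - __builtin_clzll(bytes);   /* BSR: index of the highest set bit */
    if (bytes & (bytes - 1))               /* not already a power of two? */
        b++;                               /* round up to the next bucket */
    return b;                              /* allocated range is 1 << b */
}
```

The free lists and the split-a-larger-bucket step would then be indexed by this b.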
When (mstrategy&31) < BSR(page size), the object has to be moved on append or prepend, just as in a classical allocator. But when (mstrategy&31) >= BSR(page size), I can reserve a bigger virtual address range, then either mmap the memfd (using phys) or vm_remap the existing region into the bigger one. Instead of copying the contents of the pages, the operating system simply copies the page table entries (the table is 3 levels deep[5], hence the log log log, although using 1G pages means log log with a lower coefficient). This is a tremendous win for problems that deal with several large arrays.
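A minimal sketch of the remap-instead-of-copy grow on Linux (the `grow_mapping` name is mine; it assumes the heap fd already covers phys + new_len and that the object is page-aligned):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <assert.h>

/* Grow a page-aligned object backed by the heap memfd without copying
   its bytes: map the same file range into a larger virtual region.
   The kernel copies page-table entries, not page contents. */
void *grow_mapping(int fd, off_t phys, size_t old_len, size_t new_len) {
    void *p = mmap(NULL, new_len, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, phys);
    assert(p != MAP_FAILED);
    (void)old_len;  /* the old, smaller mapping can be munmap'd by the caller */
    return p;
}
```

The same physical pages now appear at the new, larger virtual range, so writes made through the old mapping are already visible there.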
Now localrefs gives me a further optimisation: In-process, I can track the reference count of objects, and if my functions always consume their arguments I know inside the grow/append/prepend routine if this is the only holder of the object. If it is, I can potentially reuse this virtual address immediately, saving 3-4 syscalls.
When it's time to deallocate, I can put small objects on my free list, and reclaim big objects by calling fallocate() on the physical offset to poke a hole (returning the memory to the system). OSX doesn't need fallocate() because Mach has vm_deallocate.
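On Linux, the hole-punching deallocation might look like this sketch (`heap_free_big` is my name for it):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <assert.h>

/* Return a big object's pages to the OS by punching a hole in the
   heap fd at its physical offset. KEEP_SIZE leaves the file length
   (and therefore every other object's phys offset) untouched [3]. */
void heap_free_big(int fd, off_t phys, off_t len) {
    assert(fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     phys, len) == 0);
}
```

Any mapping still pointing at the punched range reads back as zeroes afterwards, since the backing pages are gone.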
> It also seems like you're trading cache misses for system calls if you have to re-map pages a lot
Cache misses aren't the dominant force here.
The real trade-off is easy to benchmark: memcpy one page, versus copying 8 bytes (a page table entry). How many pages do you need to copy before it's faster to pay the fixed (<100ns) cost of the system call and just copy the page table entries?
Memory streams at around 10GB/sec, the TLB flush costs ~100ns, and memory latency is only around 10ns, so it's easy to see how quickly the gains add up when you're using 1GB pages.
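As a back-of-envelope check using the comment's own figures (the ~1 ns per page-table entry copied is my assumption, not a measurement):

```c
#include <assert.h>

/* ns to memcpy n pages of `page` bytes each, streaming at 10 GB/s
   (i.e. 10 bytes per ns). */
static double copy_ns(double n, double page) { return n * page / 10.0; }

/* ns to remap instead: ~100 ns fixed syscall + TLB-flush cost, plus
   an assumed ~1 ns per 8-byte page-table entry copied. */
static double remap_ns(double n) { return 100.0 + n; }
```

On these numbers remapping already wins at a single 4K page (~410 ns to copy vs ~101 ns to remap), and with 1G pages the gap grows to several orders of magnitude.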
[1]: https://dvdhrm.wordpress.com/2014/06/10/memfd_create2/
[2]: http://web.mit.edu/darwin/src/modules/xnu/osfmk/man/vm_remap... because the osx manual page is pants
[3]: http://man7.org/linux/man-pages/man2/fallocate.2.html
[4]: http://x86.renejeschke.de/html/file_module_x86_id_20.html
[5]: http://wiki.osdev.org/Page_Tables#Long_mode_.2864-bit.29_pag...