> Would some kind of LD_PRELOAD interception for socket(2) work?
That would only work if the call goes through libc and the program is not statically linked. However, it's becoming more and more common to do system calls directly, bypassing libc; the Go language is infamous for doing that, but there are also things like the rustix crate for Rust (https://crates.io/crates/rustix), which does direct system calls by default.
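To make the distinction concrete, here is a minimal sketch in C. The two function names are made up for illustration, and it assumes a Linux target whose headers define SYS_socket (x86-64 and aarch64 do). The first function calls the libc socket() symbol, which an LD_PRELOAD shim can override; the second reaches the kernel without ever referencing that symbol.

```c
#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Goes through the libc wrapper: an LD_PRELOAD shim that defines
 * socket() sees this call, provided the program is dynamically linked. */
int socket_via_libc(void) {
    return socket(AF_UNIX, SOCK_STREAM, 0);
}

/* Requests the same kind of socket by syscall number. A shim that only
 * overrides socket() never sees this call. */
int socket_via_raw_syscall(void) {
    return (int)syscall(SYS_socket, AF_UNIX, SOCK_STREAM, 0);
}
```

Note that syscall() is itself a libc function, so a shim could in principle override that too; runtimes like Go's emit the syscall instruction themselves, which leaves nothing at all for LD_PRELOAD to hook.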
And Go is wrong for doing that, at least on Linux. It bypasses optimizations in the vDSO in some cases. On Fuchsia, we made direct syscalls that don't go through the vDSO illegal, and the hacks to Go that this required were funny. The system ABI of Linux really isn't the syscall interface, it's the system libc. That's because the C ABI (and the behaviors of the triple it was compiled for) and its isms for that platform are the lingua franca of that system. Going around that to call syscalls directly, at least for the 90% of useful syscalls on the system that are wrapped by libc, is asinine and creates odd bugs, and it makes crash reporters, heuristic unwinders, debuggers, etc. all more painful to write. It also prevents the system vendor from implementing user-mode optimizations that avoid mode and context switches when necessary. We tried to solve these issues in Fuchsia, but for Linux, Darwin, and hell, even Windows, if you are making direct syscalls and it's not for something really special and bespoke, you are just flat-out wrong.
> The system ABI of Linux really isn't the syscall interface, it's the system libc.
You might have reasons to prefer to use libc; some software has reason to not use libc. Those preferences are in conflict, but one of them is not automatically right and the other wrong in all circumstances.
Many UNIX systems did follow the premise that you must use libc and the syscall interface is unstable. Linux pointedly did not, and decided to have a stable syscall ABI instead. This means it's possible to have multiple C libraries, as well as other libraries, which have different needs or goals and interface with the system differently. That's a useful property of Linux.
There are a couple of established mechanisms on Linux for intercepting syscalls: ptrace and BPF. If you want to intercept all uses of a syscall, intercept the syscall. If you want to intercept a particular glibc function in programs using glibc, or for that matter a musl function in a program using musl, go ahead and use LD_PRELOAD. But the Linux syscall interface is a valid and stable interface to the system, and that's why LD_PRELOAD is not a complete solution.
It's true that Linux has a stable-ish syscall table. What is funny is that this caused a whole series of Samsung Android phones to reboot randomly with some apps: Samsung added a syscall at the same position that someone else used in upstream Linux, and folks statically linking their own libc to avoid Bionic libc were rebooting phones when calling certain functions, because the Samsung syscall caused kernel panics when called wrong. Goes back to it being a bad idea to subvert your system libc. Now, distro vendors do give out multiple versions of a libc that all work with your kernel. This generally works. When we had to fix ABI issues this happened a few times. But I wouldn't trust building our own libc and assuming that libc is portable to any Linux machine we copy it to.
> It's true that Linux has a stable-ish syscall table.
It's not "stable-ish", it's fully stable. Once a syscall is added to the syscall table on a released version of the official Linux kernel, it might later be replaced by a "not implemented" stub (which always returns -ENOSYS), but it will never be reused for anything else. There's even reserved space on some architectures for the STREAMS syscalls, which were AFAIK never on any released version of the Linux kernel.
The exception is when creating a new architecture; for instance, the syscall tables for 32-bit x86 and 64-bit x86 have completely different orders.
I think what they meant (judging by the example you ignored) is that the table changes (even if append-only) and you don't know which version you actually have when you statically compile your own version. Thus, your syscalls might be using a newer version of the table, but a given syscall might be a) not actually implemented, or b) implemented with something bespoke.
> Thus, your syscalls might be using a newer version of the table, but a given syscall might be a) not actually implemented,
That's the same case as when a syscall is later removed: it returns -ENOSYS. The correct way is to do the call normally as if it were implemented, and if it returns -ENOSYS, you know that this syscall does not exist in the currently running kernel, and you should try something else. That is the same no matter whether it's compiled statically or dynamically; even a dynamic glibc has fallback paths for some missing syscalls (glibc has a minimum required kernel version, so it does not need to have fallback paths for features introduced a long time ago).
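A minimal sketch of that pattern in C, using getrandom(2) (added in Linux 3.17) with /dev/urandom as the "something else". The function name fill_random is hypothetical and the error handling is deliberately simple; this is an illustration of the try-then-fall-back idea, not how any particular libc implements it.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fill buf with len random bytes. Returns 0 on success, -1 on failure. */
int fill_random(unsigned char *buf, size_t len) {
    size_t done = 0;

    /* First, just try the syscall as if it were implemented. */
    while (done < len) {
        long n = syscall(SYS_getrandom, buf + done, len - done, 0);
        if (n > 0) { done += (size_t)n; continue; }
        if (n < 0 && errno == EINTR) continue;
        if (n < 0 && errno == ENOSYS) break; /* not in this kernel */
        return -1; /* any other error is a real failure */
    }
    if (done == len) return 0;

    /* Fallback path for kernels without getrandom(2). */
    int fd = open("/dev/urandom", O_RDONLY | O_CLOEXEC);
    if (fd < 0) return -1;
    while (done < len) {
        ssize_t n = read(fd, buf + done, len - done);
        if (n > 0) { done += (size_t)n; continue; }
        if (n < 0 && errno == EINTR) continue;
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```

Only ENOSYS is treated as "this syscall does not exist"; every other error code is reported to the caller rather than silently triggering the fallback.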
> or b) implemented with something bespoke.
There's nothing you can do to protect against a modified kernel which does something different from the upstream Linux kernel. Even going through libc doesn't help, since whoever modified the Linux kernel to do something unexpected could also have modified the C library to do something unexpected, or libc could trip over the unexpected kernel changes.
One example of this happening is with seccomp filters. They can be used to make a syscall fail with an unexpected error code, and this can confuse the C library. More specifically, a seccomp filter which forces the clone3 syscall to always return -EPERM breaks newer libc versions which try the clone3 syscall first, and then fall back to the older clone syscall only if clone3 returned -ENOSYS (which indicates an older kernel that does not have the clone3 syscall); this breaks, for instance, running newer Linux distributions within older Docker versions.
Every kernel I’ve ever used has been different from an upstream kernel, with custom patches applied. It’s literally open source, anyone can do anything to it that they want. If you are using libc, you’d have a reasonable expectation not to need to know the details of those changes. If you call the kernel directly via syscall, then yeah, there is nothing you can do about someone making modifications to open source software.
The complication with the Linux syscall interface is that it turns "worse is better" up to 11. For example, setuid works on a per-thread basis, which is seriously not what you want, so every program/runtime must do this fun little thread stop and start and thunk dance.
Yeah, agreed. One of the items on my long TODO list is adding `setuid_process` and `setgid_process` and similar, so that perhaps a decade later when new runtimes can count on the presence of those syscalls, they can stop duplicating that mechanism in userspace.
You seem to be saying 'it was incorrect on Fuchsia, so it's incorrect on Linux'. No, it's correct on Linux, and incorrect on every other platform, as each platform's documentation makes very clear. Go did it incorrectly on FreeBSD, but that's Go being Go; they did it in the first place because it's a Linux-first system and it's correct on Linux. And glibc does not have any special privilege; the vDSO optimizations it takes advantage of are just as easily taken advantage of by the Go compiler. There's no reason to bucket Linux with Windows on the subject of syscalls when the Linux manpages are very clear that syscalls are there to be used and exhaustively document them, while MSDN is very clear that the system interface is kernel32.dll and ntdll.dll, and Windows shuffles the syscall numbers every so often so you don't get any funny ideas.
> The system ABI of Linux really isn't the syscall interface, it's the system libc.
Which one? The Linux Kernel doesn't provide a libc. What if you're a static executable?
Even on Operating Systems with a libc provided by the kernel, it's almost always allowed to upgrade the kernel without upgrading the userland (including libc); that works because the interface between userland and kernel is syscalls.
That certainly ties something that makes syscalls to a narrow range of kernel versions, but it's not as if dynamically linking libc means your program will be compatible forever either.
In the case where you're running an Operating System that provides a libc and is OK with removing older syscalls, there's a beginning and an end to support.
Looking at FreeBSD under /usr/include/sys/syscall.h, there's a good number of retired syscalls.
On Linux under /usr/include/x86_64-linux-gnu/asm/unistd_32.h I see a fair number of missing numbers --- not sure what those are about, but 222, 223, 251, 285, and 387-392 are missing. (on Debian 12.1 with linux-image-6.1.0-12-amd64 version 6.1.52-1, if it matters)
> And Go is wrong for doing that, at least on Linux. It bypasses optimizations in the vDSO in some cases.
Go's runtime does go through the vDSO for syscalls that support it, though (e.g., [0]). Of course, it won't magically adapt to new functions added in later kernel versions, but neither will a statically-linked libc. And it's not like it's a regular occurrence for Linux to add new functions to the vDSO, in any case.
Linux doesn't even have consensus on what libc to use, and ABI breakage between glibc and musl is not unheard of. (Probably not for syscalls but for other things.)
The proliferation of Docker containers seems to go against that. Those really only work well since the kernel has a stable syscall ABI.
So much so that you see Microsoft switching to a stable syscall ABI with Windows 11.
"Decoupling the User/Kernel boundary in Windows is a monumental task and highly non-trivial, however, we have been working hard to stabilize this boundary across all of Windows to provide our customers the flexibility to run down-level containers"
It's not that much work; after all, every libc needs to have its own implementation. The kernel maps the vDSO into memory for you, and gives you the base address as an entry in the auxiliary vector.
But using it does require some basic knowledge of the ELF format on the current platform, in order to parse the symbol table. (Alongside knowledge of which functions are available in the first place.)
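As a rough sketch of the first step in C: getauxval(AT_SYSINFO_EHDR) returns the vDSO base address from the auxiliary vector, and the mapping at that address starts with an ordinary ELF header. The two function names below are hypothetical, and the sketch stops short of the symbol-table parsing itself.

```c
#include <elf.h>
#include <stdint.h>
#include <string.h>
#include <sys/auxv.h>

/* Base address of the vDSO, or NULL if the kernel did not map one. */
const unsigned char *vdso_base(void) {
    return (const unsigned char *)(uintptr_t)getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the mapping begins with an ELF header describing a
 * shared object (ET_DYN), 0 otherwise. */
int vdso_is_elf_shared_object(void) {
    const unsigned char *base = vdso_base();
    if (base == NULL) return 0;
    if (memcmp(base, ELFMAG, SELFMAG) != 0) return 0;

    /* e_type is the 16-bit field right after the 16-byte e_ident,
     * in both the 32-bit and 64-bit ELF header layouts. */
    uint16_t e_type;
    memcpy(&e_type, base + EI_NIDENT, sizeof e_type);
    return e_type == ET_DYN;
}
```

From there, a real implementation would walk the program headers to PT_DYNAMIC, find the symbol and string tables, and look up a name such as __vdso_clock_gettime (the x86-64 spelling; other architectures use different names).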
It's hard work to NOT have the damn vDSO invade your address space. Only kludge part of Linux, well, apart from Nagle's, dlopen, and that weird zero copy kernel patch that mmap'd -each- socket recv(!) for a while.
It's possible, but tedious: if you disable ASLR to put the stack at the top of virtual memory, then use ELF segments to fill up everything from the mmap base downward (or upward, if you've set that), then the kernel will have nowhere left to put the vDSO, and give up.
(I investigated vDSO placement quite a lot for my x86-64 tiny ELF project: I had to rule out the possibility of a tiny ELF placing its entry point in the vDSO, to bounce back out somewhere else in the address space. It can be done, but not in any ELF file shorter than one that enters its own mapping directly.)
Curiously, there are undocumented arch_prctl() commands to map any of the three vDSOs (32, 64, x32) into your address space, if they are not already mapped, or have been unmapped.