I don't see anything to suggest that nsjail has the main feature of bubblewrap: It is safe to make bubblewrap setuid-root, and therefore bubblewrap is a safe way for unprivileged users to use containers. (arguably the only safe way at the moment)
Without nsjail making that guarantee, nsjail is just yet another command line interface to namespaces.
This tool is lighter-weight than firejail. nsjail seems to be a thin abstraction over Linux namespaces, while firejail contains profiles for common desktop applications and some X hackery to enable jailing of GUI programs.
bwrap allows passing an FD containing the seccomp rules (--seccomp FD, with a filter produced by seccomp_export_bpf). If kafel can export the compiled BPF program (seccomp filters are classic BPF, not eBPF), it should be trivial to use kafel profiles with bubblewrap/atomic/flatpak/etc.
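A rough sketch of the FD hand-off, in case it helps: bwrap reads a raw array of struct sock_filter entries from the descriptor, so any tool that can dump its compiled filter as bytes could feed it. The hand-packed allow-everything filter below is only a stand-in for a kafel-generated program (whether kafel can export one is the open question here), and the helper names and the --ro-bind / / sandbox layout are hypothetical.

```python
import os
import struct
import subprocess

# Constants from <linux/filter.h> and <linux/seccomp.h>.
BPF_RET_K = 0x06                 # BPF_RET | BPF_K
SECCOMP_RET_ALLOW = 0x7FFF0000


def sock_filter(code, jt, jf, k):
    """Pack one struct sock_filter: u16 code, u8 jt, u8 jf, u32 k (8 bytes)."""
    return struct.pack("=HBBI", code, jt, jf, k)


def allow_all_program():
    """Stand-in filter: a single instruction that allows every syscall."""
    return sock_filter(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW)


def run_with_seccomp(bpf_bytes, argv):
    """Run argv under bwrap, handing the filter over through a pipe FD.

    Assumes the filter is small enough to fit in the pipe buffer, which
    holds for typical seccomp programs.
    """
    r, w = os.pipe()
    try:
        os.write(w, bpf_bytes)
    finally:
        os.close(w)              # bwrap reads until EOF
    try:
        cmd = ["bwrap", "--ro-bind", "/", "/", "--seccomp", str(r)] + list(argv)
        return subprocess.run(cmd, pass_fds=(r,)).returncode
    finally:
        os.close(r)
```

Something like run_with_seccomp(allow_all_program(), ["true"]) would exercise the path end to end on a machine with bwrap installed; a real profile would replace allow_all_program() with the exported bytes.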
Is this what I should use if I want to intercept filesystem calls (and rewrite them, or generate on the fly the file that is about to be accessed)? Something else I should look into for this purpose?
This will make /etc/passwd empty, but nsjail doesn't rewrite syscalls. In order to do that, you'd have to use SECCOMP_RET_TRACE (TRACE(number) in kafel config lang), and then add some C code to nsjail which will use ptrace() to intercept and rewrite your syscall. It's possible, just not implemented, because it didn't seem like something that's required by users.
Yes, SECCOMP_RET_TRACE works, but nsjail doesn't have code to support that - it didn't seem that useful when mount namespaces can police access to files.
Otherwise, it's possible to make it support that. A word of caution, though: ptrace() is a complex and sometimes buggy interface with a lot of corner cases. In other words, it's easy to make a mistake that has consequences for the security of the whole setup.
PS: It's also possible to use SECCOMP_RET_TRAP (TRAP(number) in kafel, the seccomp-bpf config language nsjail uses) and rewrite syscalls in-process with the help of a SIGSYS signal handler.
Re kernel versions: it depends on when CLONE_NEWUSER and seccomp-bpf were added to the kernel for each CPU architecture. For x86-64 it was probably around 3.16; for some others (e.g. ppc64) it might be as late as 4.3. It might even work with earlier versions if you use --disable_clone_newuser and avoid seccomp-bpf filters.
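Rather than guessing from the version number, one could probe the running kernel directly. The sketch below is a heuristic under stated assumptions: /proc/self/ns/user exists when user namespaces are compiled in, and a "Seccomp:" line in /proc/self/status shows that seccomp is built in (it does not by itself prove that filter mode is available). The function names are made up for illustration.

```python
import os


def seccomp_mode_from_status(status_text):
    """Return the integer after 'Seccomp:' in /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("Seccomp:"):
            return int(line.split(":", 1)[1].strip())
    return None


def probe(proc_root="/proc"):
    """Heuristically report user-namespace and seccomp support."""
    ns_path = os.path.join(proc_root, "self", "ns", "user")
    status_path = os.path.join(proc_root, "self", "status")
    try:
        with open(status_path) as f:
            mode = seccomp_mode_from_status(f.read())
    except OSError:
        mode = None              # no procfs here, or not readable
    return {
        "user_ns": os.path.lexists(ns_path),
        "seccomp_field": mode is not None,
    }


if __name__ == "__main__":
    print(probe())
```

If user_ns comes back False, that is the case where --disable_clone_newuser would be needed.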
Re 'proot': I've never used it (it seems to be a configurator for the mount namespace), but nsjail seems much more advanced: cgroups support, seccomp-bpf policies written in a configuration language, and a few more features (config files, networking).
Thanks. I appreciate the response. I guess my only option until we move to a more recent kernel is `proot`, as our build boxes are still on 2.6.32, but I am happy to have found out about `nsjail` for the future.
You can run it as root and specify the users/groups to switch to before executing an app. Though, CLONE_NEWUSER was meant for exactly that: using namespaces without euid==0. Some systems, like Debian, have a sysctl flag:
kernel.unprivileged_userns_clone
which controls this behavior. Ultimately, it's up to you whether to set it to "1", as CLONE_NEWUSER has in the past opened many new attack vectors on the Linux kernel. However, I believe the situation is much better now, especially after syzkaller and individual researchers reported many bugs in this area and got them fixed.
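For anyone who wants to check the current value before deciding, here is a small sketch. It assumes the knob is exposed at the usual procfs location for a sysctl of that name; on kernels without the Debian-style patch the file simply does not exist, which the helper (a hypothetical name) reports as None.

```python
def read_userns_clone(path="/proc/sys/kernel/unprivileged_userns_clone"):
    """Return the sysctl value as an int, or None if the knob is absent."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    value = read_userns_clone()
    if value is None:
        print("knob not present on this kernel")
    elif value == 1:
        print("unprivileged user namespaces enabled")
    else:
        print("unprivileged user namespaces disabled")
```

Changing the value still needs root (for example via sysctl -w), so this only reads it.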