Introduce Time Namespace (2019)

infogulch · on May 4, 2020

It looks like there are 8 different namespaces now [1]: filesystem root, network, device mounts, IPC, process id, (now) time, users, and hostname.

What other kinds of namespaces are people thinking about?

[1]: http://man7.org/linux/man-pages/man7/namespaces.7.html

theamk · on May 5, 2020

I want a core_pattern namespace (or even better, a sysctl namespace).

Having a single, global handle gives me so many problems!

chupasaurus · on May 5, 2020

+1 for sysctl namespaces, but I think it would be a bigger problem to solve than all other namespaces combined.

the8472 · on May 5, 2020

Some specific sysctls are already tied to other namespaces, e.g. ip_unprivileged_port_start is per network namespace [0]. There are also a few per-userns sysctls for resource limits.

[0] https://www.kernel.org/doc/Documentation/networking/ip-sysct...
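To make the per-network-namespace point concrete, here is a minimal sketch in Python. It assumes the usual procfs location of the sysctl and the default value of 1024; the helper names `read_unprivileged_port_start` and `bind_needs_privilege` are made up for illustration. Because the value is per network namespace, the same read can return different numbers in different containers.

```python
from pathlib import Path

# Assumed procfs location of the per-netns sysctl mentioned above.
SYSCTL_PATH = Path("/proc/sys/net/ipv4/ip_unprivileged_port_start")


def read_unprivileged_port_start(path=SYSCTL_PATH, default=1024):
    """Return the first unprivileged port as seen from the caller's netns.

    Falls back to `default` (assumed to be 1024) if the file is missing
    or unreadable, e.g. on a non-Linux system.
    """
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return default


def bind_needs_privilege(port, start):
    """Ports strictly between 0 and `start` need CAP_NET_BIND_SERVICE.

    Port 0 asks the kernel to pick an ephemeral port, so it is never
    treated as privileged here.
    """
    return 0 < port < start


if __name__ == "__main__":
    start = read_unprivileged_port_start()
    for port in (80, 8080):
        print(port, "needs privilege:", bind_needs_privilege(port, start))
```

Setting the sysctl to 0 inside one network namespace would let unprivileged processes there bind port 80 without affecting any other namespace.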

st_goliath · on May 5, 2020

> What other kinds of namespaces are people thinking about?

I would really love to have something like a device namespace, i.e. being able to split up exclusive device ownership to different namespaces similar to network namespaces.

It would allow a container to see a subset of devices, and the root of the container's user namespace would actually be allowed to operate on them, while other device namespaces would no longer see those devices.

This would also allow containers to create their own loop devices and loop mount things (as the root user of a user namespace that is mapped to a non-root user in the parent).

musicale · on May 5, 2020

Time namespaces that allow you to change the real time duration of a system time second, or to outsource the clock, for testing and debugging purposes among others.

the8472 · on May 5, 2020

> filesystem root [...] device mounts

Those are the same. If you meant the cgroup (control group), that's for process resource management.

infogulch · on May 5, 2020

Thank you for the correction!

brobinson · on May 5, 2020

Keyring namespace? IIRC that and time were the two things that didn't get a unique namespace in Docker and similar containers.

pabs3 · on May 5, 2020

Some related work on that:

https://patchwork.kernel.org/patch/9394983/

https://lwn.net/Articles/791501/

infogulch · on May 5, 2020

Now we just need to be able to use all of these namespaces safely from userland... so a namespace-namespace?

KenoFischer · on May 5, 2020

User namespaces are supposed to do this, but they're disabled (or root-setup-only) on a lot of distros as a security precaution.

infogulch · on May 5, 2020

I'd heard that there was some risk of increased attack surface, let's say, from allowing userland access to namespaces for devices/mount points, but I'm not familiar with what this would really look like. What could be done to mitigate these security concerns so that everyone is comfortable enabling it?

TheDong · on May 5, 2020

"increased attack surface ... but I'm not familiar with what this would really look like"

There's a related LWN article on this called "Anatomy of a user namespaces vulnerability" [0].

Basically, the kernel has a bunch of places where it checks for various capabilities. Unprivileged user namespaces let you gain various capabilities in a limited fashion, but those limitations depend on the kernel itself recognizing that you only have those capabilities in said limited fashion.

For an easy example, a root user in a user namespace (aka an unprivileged user on the host) is allowed to mount a new tmpfs in their namespace, or mount proc... But they're still not allowed to mount /dev/sdx, since that device belongs to the real root user.

Necessarily, that means there are ways to check whether someone has permission other than the simple "is the caller uid 0": for tmpfs the check has to be "do they have CAP_SYS_ADMIN in their current namespace", while for /dev/sdx it has to be "do they have CAP_SYS_ADMIN in the root namespace".

It simply amounts to there being more total codepaths that an unprivileged user can reach because, for example, they can now pass the tmpfs mount check (intentionally), and perhaps there's a mount option that wasn't fully thought through.
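As a rough illustration of the two kinds of checks described above, here is a toy model in Python. It is not kernel code: `UserNamespace`, `Task`, `has_cap_in`, and `may_mount` are hypothetical names, and the rule that only "tmpfs" and "proc" are namespace-safe is a simplification taken from the example in this comment.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

CAP_SYS_ADMIN = "CAP_SYS_ADMIN"


@dataclass(frozen=True)
class UserNamespace:
    name: str
    parent: Optional["UserNamespace"] = None


# Stand-in for the host's root (initial) user namespace.
INIT_USER_NS = UserNamespace("init")


@dataclass(frozen=True)
class Task:
    user_ns: UserNamespace
    caps: FrozenSet[str]


def has_cap_in(task: Task, target_ns: UserNamespace, cap: str) -> bool:
    """Toy namespace-relative check.

    The capability counts only if the task's own namespace is `target_ns`
    or one of its ancestors. A task inside a child namespace never gains
    capabilities over the parent this way.
    """
    ns: Optional[UserNamespace] = target_ns
    while ns is not None:
        if ns is task.user_ns:
            return cap in task.caps
        ns = ns.parent
    return False


# Assumption for this sketch: these filesystem types are namespace-safe.
NAMESPACE_SAFE_FS = {"tmpfs", "proc"}


def may_mount(task: Task, what: str) -> bool:
    if what in NAMESPACE_SAFE_FS:
        # "CAP_SYS_ADMIN in their current namespace"
        return has_cap_in(task, task.user_ns, CAP_SYS_ADMIN)
    # Anything else (e.g. a host block device): "CAP_SYS_ADMIN in the root namespace"
    return has_cap_in(task, INIT_USER_NS, CAP_SYS_ADMIN)


container_ns = UserNamespace("container", parent=INIT_USER_NS)
container_root = Task(container_ns, frozenset({CAP_SYS_ADMIN}))
host_root = Task(INIT_USER_NS, frozenset({CAP_SYS_ADMIN}))
host_user = Task(INIT_USER_NS, frozenset())
```

In this model `container_root` passes the tmpfs check but fails the `/dev/sdx` check, while `host_user` fails both. The attack-surface concern is that everything reachable after the first branch succeeds is now exposed to what the host considers an unprivileged user.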

"What could be done to mitigate these security concerns"

Docker's seccomp filters would have actually stopped quite a few exploits here by totally gimping user namespaces to be useless. So yeah, maybe that's not the answer.

There's really not an answer though. You can't prove something is secure and bug-free. It has been a while since a really meaningful userns exploit, and there are a lot of people who believe it to be a useful thing to have; a bunch of distros have enabled it. People are still wary because of its history. I think at this point all that's left to mitigate these concerns is time. The implementation isn't likely to change drastically any time soon.

[0]: https://lwn.net/Articles/543273/

musicale · on May 5, 2020

System call namespaces. ;-)

musicale · on May 5, 2020

Clock/time namespaces are something we've needed for years - is this actually integrated into the current kernel?

simcop2387 · on May 5, 2020

5.6 and newer, yes.

musicale · on May 5, 2020

Looking more closely at it, this only seems to implement clock offsets rather than changing the length of a second. Changing the actual rate of system time progress would be helpful for a number of things including finding time-based synchronization errors.
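A small Python sketch of why offsets alone can't change the length of a second. It assumes the `<clock-id> <offset-secs> <offset-nanosecs>` line format documented for `/proc/PID/timens_offsets`; the sample text and the helper names `parse_timens_offsets` and `namespaced_clock_ns` are made up for illustration.

```python
from typing import Dict

NSEC_PER_SEC = 1_000_000_000

# Hypothetical contents of a timens_offsets file (values invented).
SAMPLE = "monotonic 86400 0\nboottime 604800 500000000\n"


def parse_timens_offsets(text: str) -> Dict[str, int]:
    """Map each clock id to its offset in nanoseconds."""
    offsets: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        clock, secs, nanos = fields
        offsets[clock] = int(secs) * NSEC_PER_SEC + int(nanos)
    return offsets


def namespaced_clock_ns(host_ns: int, offset_ns: int) -> int:
    """A time namespace only adds a constant offset to the host clock."""
    return host_ns + offset_ns


offsets = parse_timens_offsets(SAMPLE)

# Two host readings taken 2 seconds apart.
host_t0 = 10 * NSEC_PER_SEC
host_t1 = 12 * NSEC_PER_SEC

ns_t0 = namespaced_clock_ns(host_t0, offsets["monotonic"])
ns_t1 = namespaced_clock_ns(host_t1, offsets["monotonic"])

# The offset cancels out, so the elapsed time is identical inside the namespace.
elapsed_inside = ns_t1 - ns_t0
elapsed_on_host = host_t1 - host_t0
```

Changing the rate would need a multiplicative factor applied to the clock rather than an added constant, which this offset-only format does not express.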

maximilianroos · on May 5, 2020

How does Google Live Migration [1] work without this?

[1]: https://cloud.google.com/compute/docs/instances/live-migrati...

AaronFriel · on May 5, 2020

Live migration of virtual machines is kernel independent. The guest VM can be running a unikernel, BSD, Windows, etc. The interface the guest uses to sync or retrieve time depends on whether the guest reads the virtualized hardware clock (which goes by many names) or uses a paravirtualized API (such as kvm-clock).