Did you know there is an option to drop Linux capabilities in Docker? Using the docker run --cap-drop option, you can lock down root in a container so that it has limited access within the container. Sadly, almost no one ever tightens the security on a container or anywhere else.
The Day After is Too Late
There’s an unfortunate tendency in IT to think about security too late. People only buy a security system the day after they have been broken into.
Dropping capabilities can be low-hanging fruit when it comes to improving container security.
What are Linux Capabilities?
According to the capabilities man page, capabilities are distinct units of privilege that can be independently enabled or disabled.
The way I describe it is that most people think of root as being all powerful. This isn’t the whole picture: the root user with all capabilities is all powerful. Capabilities were added to the kernel around 15 or so years ago to try to divide up the power of root.
Originally the kernel allocated a 32-bit bitmask to define these capabilities. A few years ago it was expanded to 64. There are currently around 38 capabilities defined.
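To make the bitmask concrete, here is a minimal sketch that decodes a hexadecimal capability mask (the kind shown in the CapEff field of /proc/<pid>/status) into names. The decode_caps function and the CAP_NAMES table are made up for illustration and are not part of any tool. The table covers the first 38 capabilities in kernel bit order; newer kernels define a few more.

```shell
# Capability names in kernel bit order (bit 0 = chown, bit 37 = audit_read).
# Illustrative table only; newer kernels define additional bits beyond these.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read"

# decode_caps HEXMASK: print the names of the bits that are set in HEXMASK.
decode_caps() {
    mask=$(( 0x${1#0x} ))        # accept the mask with or without a 0x prefix
    bit=0
    out=""
    for name in $CAP_NAMES; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            out="${out:+$out, }$name"
        fi
        bit=$(( bit + 1 ))
    done
    printf '%s\n' "$out"
}

decode_caps 00000000a80425fb
```

The mask 00000000a80425fb decodes to the 14 capabilities in Docker's default list.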
Capabilities are things like the ability to send raw IP packets, or bind to ports below 1024. When we run containers we can drop a whole bunch of capabilities before running our containers without causing the vast majority of containerized applications to fail.
Most capabilities are required to manipulate the kernel/system, and these are used by the container framework (docker), but seldom used by the processes running inside the container. However, some containers require a few capabilities; for example, a container process needs capabilities like setuid/setgid to drop privileges. As with most things in the container world, we try to establish a compromise between security and the ability to get work done.
A few years ago the guys at grsecurity did some analysis of capabilities and found that a lot of them give you close to full access to the system.
Luckily we also use additional tools like SELinux, seccomp, and namespaces to protect the host system from the containers.
Bottom line: dropping more of the capabilities from your container is a good idea from a security point of view.
Note: When the container framework drops capabilities before starting a container, the processes inside of the container cannot get them back, even if they execute a setuid application. For more information, look for the section Capability Bounding Set in the capabilities man page.
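The bounding set can be inspected directly. As a rough sketch (the helper name cap_field is hypothetical), the CapBnd line of /proc/<pid>/status holds the bounding set as a hex mask, next to the CapInh, CapPrm, and CapEff lines:

```shell
# cap_field FIELD [STATUS_FILE]: print the hex mask for FIELD (CapInh, CapPrm,
# CapEff, or CapBnd) from a /proc/<pid>/status style file.
# Defaults to /proc/self/status, i.e. the reading process, which inherits the
# bounding set of the shell that started it.
cap_field() {
    awk -v field="$1:" '$1 == field { print $2 }' "${2:-/proc/self/status}"
}

# Example: show the bounding set of the current process (Linux only).
if [ -r /proc/self/status ]; then
    cap_field CapBnd
fi
```

Run inside a container, this shows only the bits the container framework left in place.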
What Docker gives by default
Let’s look at the default list of capabilities available to privileged processes in a docker container:
chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
The OCI/runc spec is even more drastic, retaining only audit_write, kill, and net_bind_service; users can use ocitools to add additional capabilities. As you can imagine, I like the approach of adding capabilities you need rather than having to remember to remove capabilities you don’t.
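With Docker, the same add-only approach can be approximated by dropping everything and adding back an allowlist. A minimal sketch, assuming a made-up helper named cap_allowlist_flags that only builds the flag string:

```shell
# cap_allowlist_flags CAP...: print docker run flags that drop every capability
# and add back only the ones named. Illustrative helper, not a Docker feature.
cap_allowlist_flags() {
    flags="--cap-drop=all"
    for cap in "$@"; do
        flags="$flags --cap-add=$cap"
    done
    printf '%s\n' "$flags"
}

cap_allowlist_flags net_bind_service setuid setgid
```

The output can be spliced into a command line, for example docker run $(cap_allowlist_flags setuid setgid) fedora sleep 5, left unquoted so the shell splits it into separate flags.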
Deep Dive into Capabilities
Let’s look deeper into each of the capabilities in Docker’s default list.
chown
The man page describes chown as the ability to make arbitrary changes to file UIDs and GIDs.
This means that root can change the ownership or group of any file system object. If you are not running a shell within a container and not installing packages into the container, you should drop this capability.
I would make the argument this should never be needed in production. If you need to chown, allow the capability, do the work, then take it away.
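One way to find out whether a workload really needs a capability is to run it once with only that capability dropped. The sketch below assumes a hypothetical wrapper named try_without_cap; the DOCKER variable is an invention of this example that lets you print the command instead of running it.

```shell
# try_without_cap CAP IMAGE COMMAND...: run COMMAND in a throwaway container
# with only CAP dropped. Set DOCKER=echo to print the arguments instead of
# running them. Illustrative helper; the image and command are up to you.
try_without_cap() {
    cap=$1
    image=$2
    shift 2
    "${DOCKER:-docker}" run --rm --cap-drop="$cap" "$image" "$@"
}

# Dry run: show the docker arguments without needing a Docker daemon.
DOCKER=echo try_without_cap chown fedora chown 1000 /tmp
```

Run without DOCKER=echo, the chown inside the container is expected to fail with a permission error, since root no longer holds the chown capability.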
dac_override
The man page says that dac_override allows root to bypass file read, write, and execute permission checks. DAC is an abbreviation of “discretionary access control”.
This means a root-capable process can read, write, and execute any file on the system, even if the permission and ownership fields would not allow it. Almost no apps need DAC_OVERRIDE, and if they do they are probably doing something wrong. There are probably fewer than ten in the whole distribution that actually need it. Of course, the administrator shell could require DAC_OVERRIDE to fix bad permissions in the file system.
Steve Grubb, security standards expert at Red Hat, says that “nothing should need this. If your container needs this, it’s probably doing something horrible.”
fowner
According to the man page, fowner conveys the ability to bypass permission checks on operations that normally require the filesystem UID of the process to match the UID of the file (for example, chmod and utime), excluding operations covered by cap_dac_override and cap_dac_read_search. Here’s more from the man page:
- set extended file attributes (see chattr(1)) on arbitrary files;
- set Access Control Lists (ACLs) on arbitrary files;
- ignore directory sticky bit on file deletion;
- specify O_NOATIME for arbitrary files in open(2) and fcntl(2).
This is similar to DAC_OVERRIDE: almost no applications need this other than, potentially, software installation tools. Most likely your container would run fine without this capability. You might need to allow this for docker build, but it should be blocked when you run your container in production.
fsetid
The man page says “don’t clear set-user-ID and set-group-ID mode bits when a file is modified; set the set-group-ID bit for a file whose GID does not match the filesystem or any of the supplementary GIDs of the calling process.”
My take: if you are not running an installation, you probably do not need this capability. I would disable this one by default.
kill
If a process has this capability it can override the restriction that “the real or effective user ID of a process sending a signal must match the real or effective user ID of the process receiving the signal.”
This capability basically means that a root-owned process can send kill signals to non-root processes. If your container is running all processes as root, or the root processes never kill processes running as non-root, you do not need this capability. If you are running systemd as PID 1 inside of a container and you want to stop a container running with a different UID, you might need this capability.
It’s probably also worth mentioning that, on the danger scale, this one is on the low end.
setgid
The man page says that the setgid capability lets a process make arbitrary manipulations of process GIDs and the supplementary GID list. It can also forge a GID when passing socket credentials via UNIX domain sockets, or write a group ID mapping in a user namespace. See user_namespaces(7) for more information.
In short, a process with this capability can change its GID to any other GID. Basically, it allows full group access to all files on the system. If your container processes do not change UIDs/GIDs, they do not need this capability.
setuid
If a process has the setuid capability it can “make arbitrary manipulations of process UIDs (setuid(2), setreuid(2), setresuid(2), setfsuid(2)); forge UID when passing socket credentials via UNIX domain sockets; write a user ID mapping in a user namespace (see user_namespaces(7)).”
A process with this capability can change its UID to any other UID. Basically, it allows full access to all files on the system. If your container processes do not change UIDs/GIDs and always run as the same UID, preferably non-root, they do not need this capability. Applications that need setuid usually start as root in order to bind to ports below 1024, and then change their UIDs and drop capabilities. Apache binding to port 80 requires net_bind_service, and usually starts as root. It then needs setuid/setgid to switch to the apache user and drop capabilities.
Most containers can safely drop setuid/setgid capability.
setpcap
Let’s look at the man page description: “Add any capability from the calling thread’s bounding set to its inheritable set; drop capabilities from the bounding set (via prctl(2) PR_CAPBSET_DROP); make changes to the securebits flags.”
In layman’s terms, a process with this capability can change its current capability set within its bounding set. This means a process could drop capabilities, or add capabilities it did not currently have, but only within the limits of its bounding set.
net_bind_service
This one’s easy. If you have this capability, you can bind to privileged ports (i.e., those below 1024).
If you want to bind to a port below 1024 you need this capability. If you are running a service that listens on port 1024 or above, you should drop this capability.
The risk of this capability is a rogue process impersonating a service like sshd and collecting users’ passwords. Running a container in a different network namespace reduces the risk of this capability. It would be difficult for the container process to get to the public network interface.
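The rule of thumb can be written down as a small check. This is only a sketch, assuming the traditional threshold of 1024 and a made-up function name:

```shell
# needs_net_bind_service PORT: succeed if PORT is in the privileged range.
# Assumes the traditional threshold, where ports 1 to 1023 are privileged.
needs_net_bind_service() {
    [ "$1" -gt 0 ] && [ "$1" -lt 1024 ]
}

for port in 80 443 8080; do
    if needs_net_bind_service "$port"; then
        echo "port $port: keep net_bind_service"
    else
        echo "port $port: --cap-drop=net_bind_service"
    fi
done
```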
net_raw
The man page says, “allow use of RAW and PACKET sockets. Allow binding to any address for transparent proxying.”
This access allows a process to spy on packets on its network. That’s bad, right? Most container processes would not need this access so it probably should be dropped. Note this would only affect the containers that share the same network that your container process is running on, usually preventing access to the real network.
RAW sockets also give an attacker the ability to inject scary things onto the network. Depending on what you are doing with the ping command, it could require this access.
sys_chroot
This capability allows use of chroot(). In other words, it allows your processes to chroot into a different rootfs. chroot is probably not used within your container, so it should be dropped.
mknod
If you have this capability, you can create special files using mknod.
This allows your processes to create device nodes. Containers are usually provided all of the device nodes they need in /dev, and the creation of device nodes is controlled by the device node cgroup, but I really think this should be dropped by default. Almost no containers ever do this, and even fewer containers should do this.
audit_write
If you have this one, you can write a message to the kernel auditing log. Few processes attempt to write to the audit log (login programs, su, sudo), and processes inside of the container are probably not trusted. The audit subsystem is not currently namespace aware, so this should be dropped by default.
setfcap
Finally, the setfcap capability allows you to set file capabilities on a file system. It might be needed for doing installs during builds, but in production it should probably be dropped.
How can I drop these capabilities using Docker?
So, how can we drop these capabilities using docker? First, let’s see what capabilities a process has. There is a cool Linux tool called pscap, available in the libcap-ng-utils package on Fedora, that shows what capabilities a process has.
Here’s a sample output using pscap | head -10:
ppid pid name command capabilities
1 1082 root systemd-journal chown, dac_override, dac_read_search, fowner, setgid, setuid, sys_ptrace, sys_admin, audit_control, mac_override, syslog, audit_read
1 1116 root systemd-udevd full
1 1760 root auditd full
1760 1778 root audispd full
1 1812 root mcelog full
1 1815 root bluetoothd net_bind_service, net_admin
1 1816 root ModemManager full
1 1817 root systemd-logind chown, dac_override, dac_read_search, fowner, kill, sys_admin, sys_tty_config, audit_control, mac_admin
1 1818 root rngd full
Here are the capabilities of a normal container running:
# docker run -d fedora sleep 5 >/dev/null; pscap | grep sleep
26358 26375 root sleep chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
If I wanted to drop setfcap, audit_write, and mknod, I could use --cap-drop=setfcap --cap-drop=audit_write --cap-drop=mknod:
# docker run -d --cap-drop=setfcap --cap-drop=audit_write --cap-drop=mknod fedora sleep 5 > /dev/null; pscap | grep sleep
26555 26571 root sleep chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot
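The list pscap reports is just the default set minus whatever was dropped. A minimal sketch of that subtraction, using a hypothetical helper named remaining_caps and the default list from earlier in this article:

```shell
# Docker's default capability list, as given earlier in this article.
DEFAULT_CAPS="chown dac_override fowner fsetid kill setgid setuid setpcap
net_bind_service net_raw sys_chroot mknod audit_write setfcap"

# remaining_caps DROPPED...: print the default capabilities left after
# dropping the ones named. Illustrative helper only.
remaining_caps() {
    out=""
    for cap in $DEFAULT_CAPS; do
        case " $* " in
            *" $cap "*) ;;                      # dropped: skip it
            *) out="${out:+$out, }$cap" ;;      # kept: append to the list
        esac
    done
    printf '%s\n' "$out"
}

remaining_caps setfcap audit_write mknod
```

The result can be compared against the pscap line for the container to confirm the flags took effect.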
Better yet, if you know your container only needs setuid and setgid, you can drop all capabilities and just add setgid and setuid back in.
# docker run -d --cap-drop=all --cap-add=setuid --cap-add=setgid fedora sleep 5 > /dev/null; pscap | grep sleep
26767 26783 root sleep setgid, setuid
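If pscap is not installed, the same information is available as a hex mask in the CapEff line of /proc/<pid>/status. The sketch below computes the mask you would expect for a given allowlist. The function names cap_bit and expected_mask are invented for this example, and the lookup table is deliberately limited to Docker's default capabilities.

```shell
# cap_bit NAME: print the kernel bit number for the capabilities in Docker's
# default list. Illustrative lookup table, deliberately incomplete.
cap_bit() {
    case "$1" in
        chown) echo 0 ;; dac_override) echo 1 ;; fowner) echo 3 ;;
        fsetid) echo 4 ;; kill) echo 5 ;; setgid) echo 6 ;; setuid) echo 7 ;;
        setpcap) echo 8 ;; net_bind_service) echo 10 ;; net_raw) echo 13 ;;
        sys_chroot) echo 18 ;; mknod) echo 27 ;; audit_write) echo 29 ;;
        setfcap) echo 31 ;;
        *) echo "unknown capability: $1" >&2; return 1 ;;
    esac
}

# expected_mask CAP...: print the 16-digit hex mask with those bits set.
expected_mask() {
    mask=0
    for cap in "$@"; do
        bit=$(cap_bit "$cap") || return 1
        mask=$(( mask | (1 << bit) ))
    done
    printf '%016x\n' "$mask"
}

expected_mask setuid setgid
```

For setuid and setgid this prints 00000000000000c0, which you can compare with grep CapEff /proc/<pid>/status for the container's process.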
You can even use Container Labels and the [atomic run](http://www.projectatomic.io/docs/usr-bin-atomic/) command to define the default run command which your container should run with.
# cat Dockerfile
FROM fedora
LABEL RUN /usr/bin/docker run -d --cap-drop=all --cap-add=setuid --cap-add=setgid \${IMAGE} sleep 10
# docker build -t sleep . >/dev/null
# atomic run --quiet sleep > /dev/null; pscap | grep sleep
32119 32135 root sleep setgid, setuid
Bottom Line
You are probably running containers with a lot more privileges than they need. Dropping these capabilities when the containers are in production would be a great idea.
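To see where you stand today, you can list what each running container was started with. This is a rough sketch: the audit_caps name and the DOCKER override variable are assumptions of this example, while the CapAdd and CapDrop fields come from the HostConfig section of docker inspect output.

```shell
# audit_caps: for each running container, print its name and the capabilities
# added or dropped at run time. Set DOCKER to use a different client command.
audit_caps() {
    for id in $("${DOCKER:-docker}" ps -q); do
        "${DOCKER:-docker}" inspect \
            --format '{{.Name}} add={{.HostConfig.CapAdd}} drop={{.HostConfig.CapDrop}}' \
            "$id"
    done
}
```

Call audit_caps on a Docker host. A container that shows nothing in its drop list was most likely started with the full default set.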