TL;DR
This is caused by a design flaw in AppArmor that is triggered when running runc (or Docker/Podman/containerd) inside a nested container that has an AppArmor profile applied. The very short explanation is that when runc accesses /proc/sys/..., AppArmor incorrectly thinks it is accessing /sys/... and rejects the access attempt because it violates the configured AppArmor policy.
We currently cannot work around this issue within runc, so you will need to reconfigure LXC until LXC updates its configurations to avoid this problem.
PLEASE DO NOT LEAVE COMMENTS IF THESE SOLUTIONS WORKED FOR YOU
LXC Users
This section applies to direct users of LXC.
You need to relax the /sys restrictions of the AppArmor profile used for LXC containers. The simplest way is to comment out all of the deny /sys lines in /etc/apparmor.d/abstractions/lxc/container-base:
% sudo sed -i.old '/deny \/sys/ s/^/#/g' /etc/apparmor.d/abstractions/lxc/container-base
% sudo apparmor_parser -r /etc/apparmor.d/lxc-containers
% # This should've worked, but you may need to restart your LXC containers.
This will disable the problematic /sys rules. You can also be more targeted by only changing the deny /sys/[^fdck]*{,/**} wklx rule to deny /sys/[^fdckn]*{,/**} wklx (which permits writes to the net.ipv4.ip_unprivileged_port_start sysctl), but this requires a bit more care.
If you use lxc.apparmor.profile = generated then unfortunately you will need to disable AppArmor (using lxc.apparmor.profile = unconfined) and wait for an LXC patch that fixes their internal hardcoded AppArmor profile. (This is the same boat Proxmox users find themselves in -- see that section for some additional information.)
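Before editing the live profile, it can help to rehearse the sed expression on a scratch file. The following is a minimal sketch: comment_out_sys_denies is a hypothetical helper name, and the sample rules are illustrative stand-ins rather than the real contents of container-base.

```shell
#!/bin/sh
# Sketch: rehearse the sed edit on a scratch file before touching /etc.
# comment_out_sys_denies is a hypothetical helper, not part of LXC or AppArmor.
comment_out_sys_denies() {
    # Same sed expression as the in-place command, but the result is
    # printed to stdout and the input file is left unmodified.
    sed '/deny \/sys/ s/^/#/g' "$1"
}

# Illustrative stand-in rules (assumption: NOT the full LXC profile).
sample=$(mktemp)
cat > "$sample" <<'EOF'
  deny /sys/[^fdck]*{,/**} wklx,
  deny /proc/sysrq-trigger rwklx,
  mount fstype=proc -> /proc/,
EOF

# Only the "deny /sys" line should come back with a leading '#'.
comment_out_sys_denies "$sample"
rm -f "$sample"
```

To preview the change against the real file without modifying it, the helper's output can be piped into diff, for example `comment_out_sys_denies /etc/apparmor.d/abstractions/lxc/container-base | diff -u /etc/apparmor.d/abstractions/lxc/container-base -`.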
Incus Users
Incus generates its own AppArmor rules which you cannot directly modify. However, Incus has already been patched (in lxc/incus#2624) so if you switch to the daily builds then the problem should already be resolved for you. Unfortunately there is no raw.lxc.* workaround possible.
Proxmox Users
Proxmox makes use of lxc.apparmor.profile = generated, which means that the above mitigations do not work (modifying the profiles in /etc/apparmor.d doesn't do anything because a new profile is generated automatically based on hard-coded strings in LXC). You instead need to add the following configuration to /etc/pve/lxc/$ctr.conf:
lxc.apparmor.profile: unconfined
lxc.mount.entry: /dev/null sys/module/apparmor/parameters/enabled none bind 0 0
And restart the container.
The /dev/null bind-mount is needed in order to trick Docker into thinking that AppArmor is disabled on the system. If you do not include this line, you may see the error Error response from daemon: Could not check if docker-default AppArmor profile was loaded: open /sys/kernel/security/apparmor/profiles: permission denied. This is caused by a non-configurable security policy within AppArmor related to namespace nesting -- the only real solution for this is for the profile generated by LXC to be fixed.
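If you script this change across several containers, a minimal sketch like the following appends each line only when it is missing. add_line_once and apply_workaround are hypothetical helper names, and the demo writes to a scratch file rather than a real /etc/pve/lxc/$ctr.conf.

```shell
#!/bin/sh
# Sketch: append each workaround line only if it is not already present.
# add_line_once and apply_workaround are hypothetical helpers, not Proxmox tools.
add_line_once() {
    _file=$1
    _line=$2
    # -x: match the whole line, -F: treat the line as a fixed string.
    grep -qxF -- "$_line" "$_file" 2>/dev/null || printf '%s\n' "$_line" >> "$_file"
}

apply_workaround() {
    add_line_once "$1" 'lxc.apparmor.profile: unconfined'
    add_line_once "$1" 'lxc.mount.entry: /dev/null sys/module/apparmor/parameters/enabled none bind 0 0'
}

# Demo against a scratch file; on a real host the target would be
# /etc/pve/lxc/$ctr.conf, edited as root.
conf=$(mktemp)
apply_workaround "$conf"
apply_workaround "$conf"   # the second run adds nothing
cat "$conf"
rm -f "$conf"
```

This sketch only appends to the end of the file. It does not detect an existing lxc.apparmor.profile line with a different value, and if the config file contains additional sections after the main one (for example snapshots), the lines would need to be placed in the main section by hand.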
Note
The /dev/null bind mount may be unnecessary for some guests (some users have reported that Debian guests just work with lxc.apparmor.profile = unconfined).
Note
In principle you could configure a different AppArmor profile based on lxc-container-default-with-nesting and use that instead, but based on my testing the default profile doesn't permit overlayfs mounts -- fixing that is left as an exercise for the reader.
Regarding Downgrades
A lot of people have commented that they will just downgrade runc. If you are going to do this (instead of the far less drastic workarounds we've outlined above), please do not go overboard. runc 1.2.7 and 1.3.2 are the latest releases that do not contain the security patches which caused these errors. Downgrading to earlier versions than that (especially something as old as runc 1.1.0) is overkill and is needlessly opening yourself up to 4-year-old vulnerabilities.
I also want to re-iterate that while AppArmor can in principle protect against some attacks, by downgrading you are intentionally opening yourself up to actual attacks that we know exist (AND CAN BYPASS APPARMOR in the case of CVE-2025-52881).
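If you do pin an older runc, a small guard can keep the pin from drifting further back than necessary. This is a rough sketch under stated assumptions: the helper names are hypothetical, the floors 1.2.7 and 1.3.2 are taken from the text above, treating 1.3.x versions below 1.3.2 as "older than needed" is my reading of that text, and the comparison assumes a sort that supports -V.

```shell
#!/bin/sh
# Sketch: flag downgrade targets that are older than they need to be.
# Helper names are hypothetical; floors come from the text (1.2.7 and 1.3.2).
# Assumes sort(1) supports -V (version sort).

# version_lt A B: succeeds if version A sorts strictly before version B.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# not_older_than_needed VERSION: succeeds if VERSION is at or above the
# floor for its release series (assumption: 1.3.x uses 1.3.2, others 1.2.7).
not_older_than_needed() {
    case $1 in
        1.3.*) _floor=1.3.2 ;;
        *)     _floor=1.2.7 ;;
    esac
    ! version_lt "$1" "$_floor"
}

for v in 1.1.0 1.2.6 1.2.7 1.3.0 1.3.2; do
    if not_older_than_needed "$v"; then
        echo "$v: not older than needed"
    else
        echo "$v: older than needed"
    fi
done
```

The check only answers "is this target older than the floor"; it says nothing about whether a given version contains the security patches, so the workarounds above remain the preferred route.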
Analysis
There is an issue that has been reported when running runc inside LXC where the AppArmor profile for LXC causes permission errors when we try to re-open fds for procfs operations.
This was reported in containerd/containerd#12484, moby/moby#51405, and lxc/incus#2623. Here is my breakdown of the cause:
Okay, I've figured it out. This is really dumb. tl;dr: This is really an AppArmor bug (or even a design flaw if you prefer).
For context, the file we are trying to write to is /proc/sys/net/ipv4/ip_unprivileged_port_start. @stgraber figured out that the problematic AppArmor rules are the rules they have which block writing to most /sys files. How is it possible that one affects the other?
Well, the problem is that runc now uses a detached mount of procfs to operate on (this avoids mount race attacks). Because detached mounts have not been attached to the filesystem, d_name (the kernel's facility for generating names for dentries) just generates a name that looks like /foo if you try to open a file foo inside the detached procfs mount. AFAICS this is what AppArmor uses to determine what file you are trying to write to (because AppArmor is path-based, and d_name is the only way to get pathnames from dentries).
This means that when we try to write to /proc/sys/net/ipv4/ip_unprivileged_port_start, AppArmor sees this as us trying to write to /sys/net/ipv4/ip_unprivileged_port_start which is forbidden by the /sys denial rules. I have attached a program that can show this behaviour using a detached tmpfs mount, it's very trivial to trigger:
% ./aa-bug &
c1:~ # ./aa-bug &
fd: /proc/2061/fd/5
[1] 2061
c1:~ # mkdir /proc/2061/fd/5/sys
c1:~ # mkdir /proc/2061/fd/5/sys/foo
mkdir: cannot create directory ‘/proc/2061/fd/5/sys/foo’: Permission denied
There is a trivial workaround for this particular sysctl:
- deny /sys/[^fdck]*{,/**} wklx,
+ deny /sys/[^fdckn]*{,/**} wklx,
(In /etc/apparmor.d/abstractions/lxc/container-base.)
But this doesn't help in the general case for all sysctls. @stgraber has just submitted lxc/incus#2624 which just removes these rules entirely. I think AppArmor should not do this, because it's incredibly broken (literally any detached mount could match against a rule by accident), but this is unfortunately how AppArmor's design works.
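The mismatch can be illustrated with a toy model in shell. This sketch does not talk to AppArmor or the kernel: apparent_path and matches_sys_deny are hypothetical helpers, an empty mount point stands in for a detached mount, and a shell glob only approximates the AppArmor rule deny /sys/[^...]*{,/**} wklx.

```shell
#!/bin/sh
# Toy model of the path mismatch; it does not query AppArmor or the kernel.

# apparent_path MOUNTPOINT RELPATH: the pathname a path-based check would
# see. An empty MOUNTPOINT stands in for a detached mount (assumption).
apparent_path() {
    printf '%s/%s\n' "$1" "$2"
}

# matches_sys_deny PATH EXCLUDED: shell-glob approximation of the rule
#   deny /sys/[^EXCLUDED]*{,/**} wklx,
# Shell patterns spell the negated class as [!...], and '*' in a case
# pattern also matches '/', which covers the {,/**} suffix.
matches_sys_deny() {
    _pat="/sys/[!$2]*"
    case $1 in
        $_pat) return 0 ;;
        *)     return 1 ;;
    esac
}

rel=sys/net/ipv4/ip_unprivileged_port_start
for mnt in /proc ""; do
    p=$(apparent_path "$mnt" "$rel")
    for excluded in fdck fdckn; do
        if matches_sys_deny "$p" "$excluded"; then
            verdict="matches the deny rule"
        else
            verdict="no match"
        fi
        echo "$p with [^$excluded]: $verdict"
    done
done
```

In this model only the detached case with the original [^fdck] class matches the deny rule, and adding n to the class stops the match for paths under /sys/net. Any other top-level name that is not excluded by the class would still match, which is why the one-character change does not help for sysctls in general.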
From runc's side, we could in theory use this to our advantage -- if we created a tmpfs with a subpath like .go-away-apparmor and then attached our procfs mount to that path, we might be able to subvert AppArmor. However, this has a risk of causing lifetime issues that would require a rework of how we do lookups -- the tmpfs must not be closed after we attach to it because it will lazy-unmount the procfs...