
Systemd cgroup driver: CPU quota calculation mismatch between containerd and runc causes container creation failure #4982

@zackery-parkhurst

Description

When using the systemd cgroup driver with a CPU limit of 4096m, pod creation fails intermittently because containerd non-deterministically calculates either 409600 or 410000 microseconds for the parent cgroup, while runc consistently calculates 410000 for child cgroups. When they mismatch, the Linux kernel rejects the child cgroup creation with "invalid argument".

Root Cause

Investigation reveals non-deterministic behavior in containerd when converting 4096m to microseconds:

  1. Containerd (when creating pod sandbox) - INCONSISTENT:

    • Sometimes calculates: 4096m → 409600 microseconds (correct: 4096 / 1000 * 100000)
    • Sometimes calculates: 4096m → 410000 microseconds (rounded: 4.1 * 100000)
    • Sets parent cgroup: cpu.cfs_quota_us to whichever value it calculated
  2. runc (when creating application container) - CONSISTENT:

    • Always calculates: 4096m → 410000 microseconds (appears to round 4.096 to 4.1)
    • Tries to set child cgroup: cpu.cfs_quota_us = 410000
  3. Result:

    • When containerd picks 410000: Parent = 410000, child = 410000 → Success!
    • When containerd picks 409600: Parent = 409600, child = 410000 → Kernel rejects! (child > parent)
    • In cgroup v1, child quotas cannot exceed parent quotas (both values are worked through in the sketch below)
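
A minimal sketch of the two calculations, using the default 100 ms CFS period (the helper name is illustrative, not taken from the containerd or runc source):

package main

import "fmt"

// quota converts a millicore CPU limit to a CFS quota in microseconds:
// quota_us = millicores * period_us / 1000.
func quota(milliCPU, periodUSec int64) int64 {
	return milliCPU * periodUSec / 1000
}

func main() {
	const periodUSec = 100000 // default CFS period: 100 ms

	parent := quota(4096, periodUSec) // 409600 -- the "correct" value containerd sometimes writes
	child := quota(4100, periodUSec)  // 410000 -- the "rounded" value (4.096 treated as 4.1 CPUs)

	fmt.Println(parent, child) // 409600 410000

	// cgroup v1 requires a child's cpu.cfs_quota_us to be <= its parent's, so
	// writing 410000 into a child under a 409600 parent fails with EINVAL.
	fmt.Println("child exceeds parent:", child > parent) // true
}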

Why It Appears Node-Specific

The issue seems to only affect "previously used nodes" because:

  • When containerd picks 409600 and the pod fails, the parent cgroup gets stuck
  • The pause container remains alive with the 409600 parent cgroup
  • All subsequent attempts to create the pod on that node fail (child 410000 > parent 409600)
  • Fresh nodes might get lucky and containerd picks 410000 → works fine
  • But those nodes would fail too if containerd had picked 409600 on first attempt

This is not about stale cgroups from old pods - it's about which value containerd randomly picks during pod sandbox creation.

Error Message

failed to create containerd task: failed to create shim task: OCI runtime create failed:
runc create failed: unable to start container process: error during container init:
error setting cgroup config for procHooks process: failed to write "410000":
write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podc65bd648_3faf_4778_90e4_a21afb2a6ad0.slice/cri-containerd-149d004f6e52b5665c6209d1f33a7e516049b79456444e3f74af49e62c5c80c8.scope/cpu.cfs_quota_us:
invalid argument: unknown

Evidence from Investigation

Failing node (containerd picked 409600):

# Parent cgroup quota - containerd calculated 409600
$ cat /sys/fs/cgroup/cpu,cpuacct/.../kubepods-burstable-podc65bd648...slice/cpu.cfs_quota_us
409600

# Pod sandbox metadata confirms
$ crictl inspectp b4420139f34f8
"cpu_quota": 409600

# Containerd logs show runc trying to write 410000
$ journalctl -u containerd | grep "410000"
failed to write "410000": write .../cpu.cfs_quota_us: invalid argument

Working node (containerd picked 410000):

# Parent cgroup quota - containerd calculated 410000!
$ cat /sys/fs/cgroup/cpu,cpuacct/.../kubepods-burstable-pod7f7424a2...slice/cpu.cfs_quota_us
410000

# Application container child cgroup also 410000
$ cat /sys/fs/cgroup/cpu,cpuacct/.../cri-containerd-98745b3c2c216...scope/cpu.cfs_quota_us
410000

# They match - no error!

Both nodes running:

  • Same containerd version: 1.7.27
  • Same runc version: 1.3.2
  • Same Kubernetes version: 1.30.14-eks-113cf36
  • Same pod spec with CPU limit: 4096m

Additional Context

  • This issue started occurring after changing CPU limits from 8192m → 4096m
  • The problem is specific to CPU values resulting in fractional cores (4.096)
  • The 400 microsecond difference (410000 - 409600) violates cgroup v1's parent-child quota constraint
  • Critical finding: Same containerd version behaves differently - this is non-deterministic
  • Calculation theory (a possible rounding mechanism is sketched below):
    • Correct: 4096 / 1000 * 100000 = 409600
    • Rounded: 4.1 * 100000 = 410000 (rounding 4.096 up to 4.1)
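
If, as the "Rounded" theory above suggests, runc's systemd driver rounds the per-second quota up to a whole percent of a CPU (10 ms per CPU-second, the granularity systemd's CPUQuota= property works in), that alone reproduces the 410000. A sketch of that rounding (an assumption about the mechanism, not traced through the runc source here):

package main

import "fmt"

func main() {
	const (
		milliCPU    = 4096  // pod CPU limit: 4096m
		percentUSec = 10000 // 1% of one CPU-second = 10 ms
	)

	// Quota expressed per second of wall time: 4.096 CPUs -> 4096000 us/s.
	perSec := milliCPU * 1000000 / 1000

	// Round up to the next whole percent of a CPU (10 ms boundary):
	// 4096000 -> 4100000 us/s, i.e. 4.10 CPUs.
	if perSec%percentUSec != 0 {
		perSec = (perSec/percentUSec + 1) * percentUSec
	}

	// Scale from one second back down to the default 100 ms CFS period.
	fmt.Println(perSec / 10) // 410000
}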

Questions for Maintainers

  1. Where in containerd's codebase does the millicore → microsecond conversion happen for pod sandbox creation?
  2. Why would containerd calculate two different values (409600 vs 410000) for the same input (4096m)?
  3. Is there a race condition or different code path that causes this non-determinism?
  4. Should containerd and runc be using shared conversion logic to ensure consistency?

Related Issues

This appears similar to but distinct from:

However, this is a new issue involving non-deterministic behavior in containerd 1.7.27 when calculating CPU quotas for fractional core values with the systemd cgroup driver.

Steps to reproduce the issue

  1. Deploy a Kubernetes pod with CPU limit 4096m multiple times on different fresh nodes

    • Observe: Some pods succeed, some fail (non-deterministic)
    • Successful pods: containerd calculated parent cgroup cpu.cfs_quota_us = 410000
    • Failed pods: containerd calculated parent cgroup cpu.cfs_quota_us = 409600
    • runc always tries to write 410000 for child cgroup
  2. On nodes where containerd picked 409600:

    • runc attempts to create application container
    • runc tries to write 410000 to child cgroup's cpu.cfs_quota_us
    • Kernel rejects: child quota (410000) > parent quota (409600)
    • Container creation fails with "invalid argument" error
    • Pod enters CrashLoopBackOff
    • Pause container remains alive with parent cgroup stuck at 409600
  3. All subsequent restart attempts on that node continue to fail

    • Containerd reuses the existing pod sandbox
    • Parent cgroup still has 409600
    • runc still tries 410000
    • Pattern repeats indefinitely
  4. Evicting the pod and forcing it to a different node:

    • May work if containerd picks 410000 on the new node
    • Will fail if containerd picks 409600 on the new node
    • Outcome is non-deterministic

Example pod spec that reproduces the issue:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: test-container
    image: nginx:latest
    resources:
      limits:
        cpu: "4096m"
        memory: "16Gi"
      requests:
        cpu: "1024m"
        memory: "8Gi"

How to Verify Which Value Containerd Picked

On a node where the pod was deployed:

# Get pod UID
kubectl get pod <pod-name> -o jsonpath='{.metadata.uid}'

# Check parent cgroup on the node
cat /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID_with_underscores>.slice/cpu.cfs_quota_us

# 409600 = pod will fail
# 410000 = pod will succeed
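
To check several pods or nodes without copying paths by hand, a rough helper along these lines would do the same comparison (hypothetical, not part of any existing tooling; the path layout is the one shown above and the 410000 expectation applies to the 4096m limit in this report):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Pass the pod-level (parent) cgroup directory as the only argument, e.g.
	// /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: checkquota <pod cgroup dir>")
		os.Exit(2)
	}

	raw, err := os.ReadFile(os.Args[1] + "/cpu.cfs_quota_us")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	parent, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// For the 4096m limit in this report, runc writes 410000 into the child cgroup.
	const childQuota = 410000
	if childQuota > parent {
		fmt.Printf("parent=%d child=%d: the kernel will reject the child write\n", parent, childQuota)
	} else {
		fmt.Printf("parent=%d child=%d: the container should start\n", parent, childQuota)
	}
}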

Critical Note: The issue is not about "previously used nodes" - it's about which value containerd randomly calculates during initial pod sandbox creation. The appearance of being node-specific is because once a node gets stuck with 409600, it stays stuck.

Describe the results you received and expected

Expected behavior:

Containerd and runc should use consistent, deterministic calculations when converting millicores to microseconds for CPU quotas.

For a CPU limit of 4096m:

  • Both containerd and runc should calculate: 4096 / 1000 * 100000 = 409600 microseconds
  • OR both should calculate: 4.1 * 100000 = 410000 microseconds
  • They must agree - parent and child cgroups must have compatible values
  • Container should create successfully every time, regardless of node
  • Behavior should be deterministic, not random

Actual behavior:

  • Containerd: Non-deterministically calculates either 409600 or 410000 for the same input
    • Sometimes: 409600 microseconds (mathematically correct)
    • Sometimes: 410000 microseconds (rounded)
    • No obvious pattern - same version, same config, different results
  • runc: Consistently calculates 410000 microseconds (always rounds 4.096 to 4.1)
  • When they mismatch (containerd=409600, runc=410000):
    • Child cgroup creation fails with kernel error: "invalid argument"
    • Pod enters CrashLoopBackOff with 199+ restart attempts
    • Parent cgroup gets stuck with 409600, preventing all future attempts
    • Requires manual node cordoning and pod eviction
  • When they match (containerd=410000, runc=410000):
    • Pod works perfectly fine

Impact:

  • Non-deterministic pod scheduling - same pod spec may work or fail randomly
  • Cannot reliably deploy pods with CPU limit 4096m (or other fractional core values)
  • Once a node "loses the lottery" and gets 409600, it's permanently broken for that pod
  • Requires operational workarounds (cordon/drain/evict)
  • Production impact on Amazon EKS clusters

Root Issue:

This is fundamentally a consistency bug - containerd and runc must use the same conversion logic, and that logic must be deterministic.

What version of runc are you using?

runc version 1.3.2
commit: aeabe4e711d903ef0ea86a4155da0f9e00eabd29
spec: 1.2.1
go: go1.24.9
libseccomp: 2.5.2

Additional environment details:

  • containerd version: 1.7.27 (commit: 05044ec0a9a75232cad458027ca83437aae3f4da)
  • Kubernetes version: 1.30.14-eks-113cf36 (Amazon EKS)
  • Cgroup version: v1
  • Cgroup driver: systemd (SystemdCgroup = true in containerd config at /etc/containerd/config.toml)

Host OS information

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2026-06-30"

Platform: Amazon EKS (Elastic Kubernetes Service) managed node

Host kernel information

Linux ip-10-7-66-184.prod-eks.newfront.com 5.10.245-241.976.amzn2.x86_64 #1 SMP Tue Oct 21 22:09:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Kernel version: 5.10.245-241.976.amzn2.x86_64
