
Systemd cgroup driver: CPU quota calculation mismatch between containerd and runc causes container creation failure #4982

@zackery-parkhurst

Description

When using the systemd cgroup driver with a CPU limit of 4096m, pod creation fails intermittently because containerd non-deterministically calculates either 409600 or 410000 microseconds for the parent cgroup, while runc consistently calculates 410000 for child cgroups. When they mismatch, the Linux kernel rejects the child cgroup creation with "invalid argument".

Root Cause

Investigation reveals non-deterministic behavior in containerd when converting 4096m to microseconds:

  1. Containerd (when creating pod sandbox) - INCONSISTENT:

    • Sometimes calculates: 4096m → 409600 microseconds (correct: 4096 / 1000 * 100000)
    • Sometimes calculates: 4096m → 410000 microseconds (rounded: 4.1 * 100000)
    • Sets parent cgroup: cpu.cfs_quota_us to whichever value it calculated
  2. runc (when creating application container) - CONSISTENT:

    • Always calculates: 4096m → 410000 microseconds (appears to round 4.096 to 4.1)
    • Tries to set child cgroup: cpu.cfs_quota_us = 410000
  3. Result:

    • When containerd picks 410000: Parent = 410000, child = 410000 → Success!
    • When containerd picks 409600: Parent = 409600, child = 410000 → Kernel rejects! (child > parent)
    • In cgroup v1, child quotas cannot exceed parent quotas (both values are worked through in the sketch below)
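
A minimal sketch of the two calculations, using the default 100 ms CFS period (the helper name is illustrative, not taken from the containerd or runc source):

package main

import "fmt"

// quota converts a millicore CPU limit to a CFS quota in microseconds:
// quota_us = millicores * period_us / 1000.
func quota(milliCPU, periodUSec int64) int64 {
	return milliCPU * periodUSec / 1000
}

func main() {
	const periodUSec = 100000 // default CFS period: 100 ms

	parent := quota(4096, periodUSec) // 409600 -- the "correct" value containerd sometimes writes
	child := quota(4100, periodUSec)  // 410000 -- the "rounded" value (4.096 treated as 4.1 CPUs)

	fmt.Println(parent, child) // 409600 410000

	// cgroup v1 requires a child's cpu.cfs_quota_us to be <= its parent's, so
	// writing 410000 into a child under a 409600 parent fails with EINVAL.
	fmt.Println("child exceeds parent:", child > parent) // true
}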

Why It Appears Node-Specific

The issue seems to only affect "previously used nodes" because:

  • When containerd picks 409600 and the pod fails, the parent cgroup gets stuck
  • The pause container remains alive with the 409600 parent cgroup
  • All subsequent attempts to create the pod on that node fail (child 410000 > parent 409600)
  • Fresh nodes might get lucky and containerd picks 410000 → works fine
  • But those nodes would fail too if containerd had picked 409600 on first attempt

This is not about stale cgroups from old pods - it's about which value containerd randomly picks during pod sandbox creation.

Error Message

failed to create containerd task: failed to create shim task: OCI runtime create failed:
runc create failed: unable to start container process: error during container init:
error setting cgroup config for procHooks process: failed to write "410000":
write /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podc65bd648_3faf_4778_90e4_a21afb2a6ad0.slice/cri-containerd-149d004f6e52b5665c6209d1f33a7e516049b79456444e3f74af49e62c5c80c8.scope/cpu.cfs_quota_us:
invalid argument: unknown

Evidence from Investigation

Failing node (containerd picked 409600):

# Parent cgroup quota - containerd calculated 409600
$ cat /sys/fs/cgroup/cpu,cpuacct/.../kubepods-burstable-podc65bd648...slice/cpu.cfs_quota_us
409600

# Pod sandbox metadata confirms
$ crictl inspectp b4420139f34f8
"cpu_quota": 409600

# Containerd logs show runc trying to write 410000
$ journalctl -u containerd | grep "410000"
failed to write "410000": write .../cpu.cfs_quota_us: invalid argument

Working node (containerd picked 410000):

# Parent cgroup quota - containerd calculated 410000!
$ cat /sys/fs/cgroup/cpu,cpuacct/.../kubepods-burstable-pod7f7424a2...slice/cpu.cfs_quota_us
410000

# Application container child cgroup also 410000
$ cat /sys/fs/cgroup/cpu,cpuacct/.../cri-containerd-98745b3c2c216...scope/cpu.cfs_quota_us
410000

# They match - no error!

Both nodes running:

  • Same containerd version: 1.7.27
  • Same runc version: 1.3.2
  • Same Kubernetes version: 1.30.14-eks-113cf36
  • Same pod spec with CPU limit: 4096m

Additional Context

  • This issue started occurring after changing CPU limits from 8192m → 4096m
  • The problem is specific to CPU values resulting in fractional cores (4.096)
  • The 400 microsecond difference (410000 - 409600) violates cgroup v1's parent-child quota constraint
  • Critical finding: Same containerd version behaves differently - this is non-deterministic
  • Calculation theory (a possible rounding mechanism is sketched below):
    • Correct: 4096 / 1000 * 100000 = 409600
    • Rounded: 4.1 * 100000 = 410000 (rounding 4.096 up to 4.1)
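
If, as the "Rounded" theory above suggests, runc's systemd driver rounds the per-second quota up to a whole percent of a CPU (10 ms per CPU-second, the granularity systemd's CPUQuota= property works in), that alone reproduces the 410000. A sketch of that rounding (an assumption about the mechanism, not traced through the runc source here):

package main

import "fmt"

func main() {
	const (
		milliCPU    = 4096  // pod CPU limit: 4096m
		percentUSec = 10000 // 1% of one CPU-second = 10 ms
	)

	// Quota expressed per second of wall time: 4.096 CPUs -> 4096000 us/s.
	perSec := milliCPU * 1000000 / 1000

	// Round up to the next whole percent of a CPU (10 ms boundary):
	// 4096000 -> 4100000 us/s, i.e. 4.10 CPUs.
	if perSec%percentUSec != 0 {
		perSec = (perSec/percentUSec + 1) * percentUSec
	}

	// Scale from one second back down to the default 100 ms CFS period.
	fmt.Println(perSec / 10) // 410000
}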

Questions for Maintainers

  1. Where in containerd's codebase does the millicore → microsecond conversion happen for pod sandbox creation?
  2. Why would containerd calculate two different values (409600 vs 410000) for the same input (4096m)?
  3. Is there a race condition or different code path that causes this non-determinism?
  4. Should containerd and runc be using shared conversion logic to ensure consistency?

Related Issues

This appears similar to but distinct from:

However, this is a new issue involving non-deterministic behavior in containerd 1.7.27 when calculating CPU quotas for fractional core values with the systemd cgroup driver.

Steps to reproduce the issue

  1. Deploy a Kubernetes pod with CPU limit 4096m multiple times on different fresh nodes

    • Observe: Some pods succeed, some fail (non-deterministic)
    • Successful pods: containerd calculated parent cgroup cpu.cfs_quota_us = 410000
    • Failed pods: containerd calculated parent cgroup cpu.cfs_quota_us = 409600
    • runc always tries to write 410000 for child cgroup
  2. On nodes where containerd picked 409600:

    • runc attempts to create application container
    • runc tries to write 410000 to child cgroup's cpu.cfs_quota_us
    • Kernel rejects: child quota (410000) > parent quota (409600)
    • Container creation fails with "invalid argument" error
    • Pod enters CrashLoopBackOff
    • Pause container remains alive with parent cgroup stuck at 409600
  3. All subsequent restart attempts on that node continue to fail

    • Containerd reuses the existing pod sandbox
    • Parent cgroup still has 409600
    • runc still tries 410000
    • Pattern repeats indefinitely
  4. Evicting the pod and forcing it to a different node:

    • May work if containerd picks 410000 on the new node
    • Will fail if containerd picks 409600 on the new node
    • Outcome is non-deterministic

Example pod spec that reproduces the issue:

apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: test-container
    image: nginx:latest
    resources:
      limits:
        cpu: "4096m"
        memory: "16Gi"
      requests:
        cpu: "1024m"
        memory: "8Gi"

How to Verify Which Value Containerd Picked

On a node where the pod was deployed:

# Get pod UID
kubectl get pod <pod-name> -o jsonpath='{.metadata.uid}'

# Check parent cgroup on the node
cat /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID_with_underscores>.slice/cpu.cfs_quota_us

# 409600 = pod will fail
# 410000 = pod will succeed
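
To check several pods or nodes without copying paths by hand, a rough helper along these lines would do the same comparison (hypothetical, not part of any existing tooling; the path layout is the one shown above and the 410000 expectation applies to the 4096m limit in this report):

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Pass the pod-level (parent) cgroup directory as the only argument, e.g.
	// /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod<UID>.slice
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: checkquota <pod cgroup dir>")
		os.Exit(2)
	}

	raw, err := os.ReadFile(os.Args[1] + "/cpu.cfs_quota_us")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	parent, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// For the 4096m limit in this report, runc writes 410000 into the child cgroup.
	const childQuota = 410000
	if childQuota > parent {
		fmt.Printf("parent=%d child=%d: the kernel will reject the child write\n", parent, childQuota)
	} else {
		fmt.Printf("parent=%d child=%d: the container should start\n", parent, childQuota)
	}
}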

Critical Note: The issue is not about "previously used nodes" - it's about which value containerd randomly calculates during initial pod sandbox creation. The appearance of being node-specific is because once a node gets stuck with 409600, it stays stuck.

Describe the results you received and expected

Expected behavior:

Containerd and runc should use consistent, deterministic calculations when converting millicores to microseconds for CPU quotas.

For a CPU limit of 4096m:

  • Both containerd and runc should calculate: 4096 / 1000 * 100000 = 409600 microseconds
  • OR both should calculate: 4.1 * 100000 = 410000 microseconds
  • They must agree - parent and child cgroups must have compatible values
  • Container should create successfully every time, regardless of node
  • Behavior should be deterministic, not random

Actual behavior:

  • Containerd: Non-deterministically calculates either 409600 or 410000 for the same input
    • Sometimes: 409600 microseconds (mathematically correct)
    • Sometimes: 410000 microseconds (rounded)
    • No obvious pattern - same version, same config, different results
  • runc: Consistently calculates 410000 microseconds (always rounds 4.096 to 4.1)
  • When they mismatch (containerd=409600, runc=410000):
    • Child cgroup creation fails with kernel error: "invalid argument"
    • Pod enters CrashLoopBackOff with 199+ restart attempts
    • Parent cgroup gets stuck with 409600, preventing all future attempts
    • Requires manual node cordoning and pod eviction
  • When they match (containerd=410000, runc=410000):
    • Pod works perfectly fine

Impact:

  • Non-deterministic pod scheduling - same pod spec may work or fail randomly
  • Cannot reliably deploy pods with CPU limit 4096m (or other fractional core values)
  • Once a node "loses the lottery" and gets 409600, it's permanently broken for that pod
  • Requires operational workarounds (cordon/drain/evict)
  • Production impact on Amazon EKS clusters

Root Issue:

This is fundamentally a consistency bug - containerd and runc must use the same conversion logic, and that logic must be deterministic.

What version of runc are you using?

runc version 1.3.2
commit: aeabe4e711d903ef0ea86a4155da0f9e00eabd29
spec: 1.2.1
go: go1.24.9
libseccomp: 2.5.2

Additional environment details:

  • containerd version: 1.7.27 (commit: 05044ec0a9a75232cad458027ca83437aae3f4da)
  • Kubernetes version: 1.30.14-eks-113cf36 (Amazon EKS)
  • Cgroup version: v1
  • Cgroup driver: systemd (SystemdCgroup = true in containerd config at /etc/containerd/config.toml)

Host OS information

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
SUPPORT_END="2026-06-30"

Platform: Amazon EKS (Elastic Kubernetes Service) managed node

Host kernel information

Linux ip-10-7-66-184.prod-eks.newfront.com 5.10.245-241.976.amzn2.x86_64 #1 SMP Tue Oct 21 22:09:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Kernel version: 5.10.245-241.976.amzn2.x86_64
