OCPBUGS-62619: Add etcd size limit validation for rendered MachineConfigs #5729
Force-pushed from cb489af to 3b4256d.
Fixes bug where MachineConfigPools get stuck in degraded state with "etcdserver: request is too large" errors when rendered MachineConfigs exceed etcd's 1.5MB size limit.
Changes:
- Add MaxMachineConfigSize constant (1572864 bytes) in constants.go
- Add ValidateMachineConfigSize() function in helpers.go that:
  * Validates rendered MC size before sending to etcd
  * Returns clear error message with remediation guidance
  * Logs warning when size exceeds 80% of limit
  * Provides debug logging of MC size usage
- Call validation in render controller before MC create/update
This prevents the operator from attempting to write oversized MCs to etcd, provides early detection with helpful error messages, and avoids wasting retry attempts. The error message specifically mentions large registry mirror configurations (ImageDigestMirrorSet/ICSP) as the primary cause and suggests reducing their size.
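The thresholds described above can be approximated outside the controller. The following is a minimal shell sketch, not the PR's Go implementation in helpers.go: the function name `classify_mc_size` is hypothetical, the limit and the 80% warning threshold are taken from the description, and treating a size exactly equal to the limit as acceptable is an assumption.

```shell
# Hypothetical helper mirroring the thresholds in the PR description.
# MAX_MC_SIZE matches MaxMachineConfigSize (1572864 bytes = 1.5 * 1024 * 1024).
MAX_MC_SIZE=1572864
WARN_PERCENT=80

# classify_mc_size BYTES
# Prints "too-large" above the limit, "warn" above 80% of it, otherwise "ok".
# Assumption: a size exactly equal to the limit is still accepted.
classify_mc_size() {
    size=$1
    if [ "$size" -gt "$MAX_MC_SIZE" ]; then
        echo "too-large"
    elif [ $((size * 100)) -gt $((MAX_MC_SIZE * WARN_PERCENT)) ]; then
        echo "warn"
    else
        echo "ok"
    fi
}
```

As a rough pre-check, the function can be fed the byte count of a rendered MC, for example `classify_mc_size "$(oc get mc <rendered-name> -o json | wc -c)"`. The byte count printed by `oc` may differ from the size the controller measures, so the result should be treated as approximate.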
Force-pushed from 3b4256d to fdf1f44.
@dkhater-redhat: This pull request references Jira Issue OCPBUGS-62619, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@dkhater-redhat: This pull request references Jira Issue OCPBUGS-62619, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact:
/retest-required
isabella-janssen left a comment
/lgtm
Looks good, and I especially like the warning at 80% capacity!
/retest-required
@dkhater-redhat: The following tests failed, say
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/test e2e-gcp-op-part1
Pre Merge Verification
Environment
- OpenShift 4.21.0-0.nightly-2026-03-22-203205 upgraded to 4.22.0-0-2026-03-24-131506
- Platform: AWS
Step 1: Verify Cluster Health (4.21 nightly)
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ export KUBECONFIG=/home/harshpat/Downloads/kubeconfig
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.21.0-0.nightly-2026-03-22-203205 True False 174m Cluster version is 4.21.0-0.nightly-2026-03-22-203205
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-22c218032c7a3d69a4122396bdc1e60d True False False 3 3 3 0 3h19m
worker rendered-worker-b33cbf0e1944a9cd424b8862c8d0840e True False False 3 3 3 0 3h19m
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-1-118.us-east-2.compute.internal Ready worker 3h14m v1.34.5
ip-10-0-18-24.us-east-2.compute.internal Ready control-plane,master 3h22m v1.34.5
ip-10-0-32-37.us-east-2.compute.internal Ready control-plane,master 3h21m v1.34.5
ip-10-0-50-33.us-east-2.compute.internal Ready worker 3h14m v1.34.5
ip-10-0-82-62.us-east-2.compute.internal Ready control-plane,master 3h22m v1.34.5
ip-10-0-87-200.us-east-2.compute.internal Ready worker 3h14m v1.34.5
Step 2: Check Baseline Rendered MachineConfig Size
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get mc rendered-master-22c218032c7a3d69a4122396bdc1e60d -o json | wc -c
198490
Baseline rendered MC is ~198KB, well under the 1.5MB etcd limit.
Step 3: Generate and Apply Large ImageDigestMirrorSet (IDMS) Objects
Created a script to generate 20 IDMS objects, each with 100 mirror entries (3 mirrors per source), to inflate the registries.conf and push the rendered MC close to 1.5MB.
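generate-idms.sh itself is not included in this comment. A minimal sketch of what such a generator might look like, modeled on the loop shown in Step 6, follows. The function name `gen_idms` and its arguments are hypothetical, and the registry hostnames are the placeholder `example.com` names used in Step 6.

```shell
# gen_idms INDEX ENTRIES
# Prints one ImageDigestMirrorSet manifest with ENTRIES sources,
# each mapped to three mirrors (hypothetical helper, not the original script).
gen_idms() {
    i=$1
    entries=$2
    cat <<EOF
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: large-idms-$(printf "%03d" "$i")
spec:
  imageDigestMirrors:
EOF
    j=1
    while [ "$j" -le "$entries" ]; do
        cat <<EOF
  - mirrors:
    - mirror-registry-${i}-${j}-a.example.com/org${i}/repo${j}
    - mirror-registry-${i}-${j}-b.example.com/org${i}/repo${j}
    - mirror-registry-${i}-${j}-c.example.com/org${i}/repo${j}
    source: source-registry-${i}-${j}.example.com/org${i}/repo${j}
EOF
        j=$((j + 1))
    done
}
```

A run matching the description would be along the lines of `for i in $(seq 1 20); do gen_idms "$i" 100 | oc apply -f -; done`. Writing the manifest to stdout keeps generation separate from applying, so the output can be inspected or sized with `wc -c` first.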
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ bash generate-idms.sh
=== Current rendered MC size (baseline) ===
198490
=== Applying IDMS objects ===
imagedigestmirrorset.config.openshift.io/large-idms-001 created
Applied large-idms-001
imagedigestmirrorset.config.openshift.io/large-idms-002 created
Applied large-idms-002
imagedigestmirrorset.config.openshift.io/large-idms-003 created
Applied large-idms-003
imagedigestmirrorset.config.openshift.io/large-idms-004 created
Applied large-idms-004
imagedigestmirrorset.config.openshift.io/large-idms-005 created
Applied large-idms-005
imagedigestmirrorset.config.openshift.io/large-idms-006 created
Applied large-idms-006
imagedigestmirrorset.config.openshift.io/large-idms-007 created
Applied large-idms-007
imagedigestmirrorset.config.openshift.io/large-idms-008 created
Applied large-idms-008
imagedigestmirrorset.config.openshift.io/large-idms-009 created
Applied large-idms-009
imagedigestmirrorset.config.openshift.io/large-idms-010 created
Applied large-idms-010
imagedigestmirrorset.config.openshift.io/large-idms-011 created
Applied large-idms-011
imagedigestmirrorset.config.openshift.io/large-idms-012 created
Applied large-idms-012
imagedigestmirrorset.config.openshift.io/large-idms-013 created
Applied large-idms-013
imagedigestmirrorset.config.openshift.io/large-idms-014 created
Applied large-idms-014
imagedigestmirrorset.config.openshift.io/large-idms-015 created
Applied large-idms-015
imagedigestmirrorset.config.openshift.io/large-idms-016 created
Applied large-idms-016
imagedigestmirrorset.config.openshift.io/large-idms-017 created
Applied large-idms-017
imagedigestmirrorset.config.openshift.io/large-idms-018 created
Applied large-idms-018
imagedigestmirrorset.config.openshift.io/large-idms-019 created
Applied large-idms-019
imagedigestmirrorset.config.openshift.io/large-idms-020 created
Applied large-idms-020
=== All IDMS objects applied. Waiting 90s for MC regeneration ===
=== Checking IDMS count ===
21
=== Checking 99-master-generated-registries MC size ===
1243163
=== Checking registries.conf content size ===
927657
=== Checking rendered MC size ===
1435305
=== MCP status ===
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-dab8a91ac8cc77665a27a9f4b0dfc6b9 True False False 3 3 3 0 3h23m
worker rendered-worker-65a35e5f9b1d9df882a9fd9968d0b2d1 True False False 3 3 3 0 3h23m
After 20 IDMS objects, the rendered MC is ~1.4MB (1435305 bytes) and the MCPs are still healthy, just under the limit.
Step 4: Upgrade Cluster to 4.22 Using ClusterBot Image
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc adm upgrade --to-image=registry.build07.ci.openshift.org/ci-ln-qrzrj9t/release:latest --allow-explicit-upgrade --force
warning: Using by-tag pull specs is dangerous, and while we still allow it in combination with --force for backward compatibility, it would be much safer to pass a by-digest pull spec instead
warning: The requested upgrade image is not one of the available updates. You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures. Only use this if you are testing unsigned release images or you are working around a known bug in the cluster-version operator and you have verified the authenticity of the provided image yourself.
Requested update to release image registry.build07.ci.openshift.org/ci-ln-qrzrj9t/release:latest
Monitored upgrade progress:
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.21.0-0.nightly-2026-03-22-203205 True True 78s Working towards 4.22.0-0-2026-03-24-131506-test-ci-ln-qrzrj9t-latest: 123 of 984 done (12% complete), waiting on etcd, kube-apiserver
Waited for upgrade to complete:
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.22.0-0-2026-03-24-131506-test-ci-ln-qrzrj9t-latest True False 38m Cluster version is 4.22.0-0-2026-03-24-131506-test-ci-ln-qrzrj9t-latest
Step 5: Check Rendered MC Size After Upgrade
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get mc rendered-master-7c7e3261d5aa100e599e67d4e969cc7e -o json | wc -c
1441689
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get mc 99-master-generated-registries -o json | wc -c
1250336
Rendered MC is ~1.44MB (1441689 bytes) after the upgrade, still under the 1.5MB limit (1572864 bytes) by ~131KB. Added 3 more IDMS objects to push it over.
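The headroom figure is simple arithmetic, and the same numbers give a rough estimate of how many more IDMS objects are needed to cross the limit. The sketch below is illustrative only: the function names are hypothetical, and the per-object cost is an average derived from the Step 2 and Step 3 sizes, which assumes every 100-entry IDMS adds about the same number of bytes.

```shell
MAX_MC_SIZE=1572864   # limit quoted in the PR description

# headroom_bytes CURRENT: bytes left before the limit is reached.
headroom_bytes() {
    echo $(( MAX_MC_SIZE - $1 ))
}

# avg_bytes_per_object BASELINE INFLATED COUNT: average growth per object
# (integer division, so the result is rounded down).
avg_bytes_per_object() {
    echo $(( ($2 - $1) / $3 ))
}

# objects_to_exceed CURRENT PER_OBJECT: smallest number of extra objects
# that pushes CURRENT strictly past the limit.
objects_to_exceed() {
    echo $(( (MAX_MC_SIZE - $1) / $2 + 1 ))
}
```

With the sizes recorded in this verification, `headroom_bytes 1441689` gives 131175, `avg_bytes_per_object 198490 1435305 20` gives 61840, and `objects_to_exceed 1441689 61840` gives 3, which matches the three objects added in Step 6.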
Step 6: Add Additional IDMS Objects to Exceed 1.5MB Limit
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ for i in $(seq 21 23); do
> cat <<EOF | oc apply -f -
> apiVersion: config.openshift.io/v1
> kind: ImageDigestMirrorSet
> metadata:
>   name: large-idms-$(printf "%03d" $i)
> spec:
>   imageDigestMirrors:
> $(for j in $(seq 1 100); do
> cat <<ENTRY
>   - mirrors:
>     - mirror-registry-${i}-${j}-a.example.com/org${i}/repo${j}
>     - mirror-registry-${i}-${j}-b.example.com/org${i}/repo${j}
>     - mirror-registry-${i}-${j}-c.example.com/org${i}/repo${j}
>     source: source-registry-${i}-${j}.example.com/org${i}/repo${j}
> ENTRY
> done)
> EOF
> echo "Applied large-idms-$(printf "%03d" $i)"
> done
imagedigestmirrorset.config.openshift.io/large-idms-021 created
Applied large-idms-021
imagedigestmirrorset.config.openshift.io/large-idms-022 created
Applied large-idms-022
imagedigestmirrorset.config.openshift.io/large-idms-023 created
Applied large-idms-023
Step 7: Verify Bug Reproduction — MCP Degraded
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-7e9dd6f2a77167ec04b7a765f755fa88 True False True 3 3 3 0 7h32m
worker rendered-worker-a758bee8d64e5a7312082d23d9a3d2ac True False True 3 3 3 0 7h32m
Both master and worker MCPs are DEGRADED=True.
MCP Master Conditions:
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get mcp master -o jsonpath='{range .status.conditions[*]}{.type}: {.status} - {.message}{"\n"}{end}'
PinnedImageSetsDegraded: False -
NodeDegraded: False -
RenderDegraded: True - Failed to render configuration for pool master: size validation failed: rendered MachineConfig rendered-master-2577e9f84fb4cf97fc235ba314c538b3 is too large (1612325 bytes, max 1572864 bytes). This will exceed etcd's size limit. Consider reducing the number or size of MachineConfigs, particularly large registry mirror configurations (ImageDigestMirrorSet/ImageContentSourcePolicy)
Degraded: True -
Updated: True - All nodes are updated with MachineConfig rendered-master-7e9dd6f2a77167ec04b7a765f755fa88
Updating: False -
MCP Worker Conditions:
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get mcp worker -o jsonpath='{range .status.conditions[*]}{.type}: {.status} - {.message}{"\n"}{end}'
PinnedImageSetsDegraded: False -
NodeDegraded: False -
RenderDegraded: True - Failed to render configuration for pool worker: size validation failed: rendered MachineConfig rendered-worker-4f5cc9d23653637a819db8ad423f6b03 is too large (1610456 bytes, max 1572864 bytes). This will exceed etcd's size limit. Consider reducing the number or size of MachineConfigs, particularly large registry mirror configurations (ImageDigestMirrorSet/ImageContentSourcePolicy)
Degraded: True -
Updated: True - All nodes are updated with MachineConfig rendered-worker-a758bee8d64e5a7312082d23d9a3d2ac
Updating: False -
Machine-Config-Controller Logs:
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc logs -n openshift-machine-config-operator deployment/machine-config-controller --tail=20 | grep -i "too large\|Error syncing\|Dropping"
E0324 15:16:13.383482 1 render_controller.go:545] Error syncing Generated MCFG: size validation failed: rendered MachineConfig rendered-worker-4f5cc9d23653637a819db8ad423f6b03 is too large (1610456 bytes, max 1572864 bytes). This will exceed etcd's size limit. Consider reducing the number or size of MachineConfigs, particularly large registry mirror configurations (ImageDigestMirrorSet/ImageContentSourcePolicy)
I0324 15:16:13.392159 1 render_controller.go:478] Error syncing machineconfigpool worker: size validation failed: rendered MachineConfig rendered-worker-4f5cc9d23653637a819db8ad423f6b03 is too large (1610456 bytes, max 1572864 bytes). This will exceed etcd's size limit. Consider reducing the number or size of MachineConfigs, particularly large registry mirror configurations (ImageDigestMirrorSet/ImageContentSourcePolicy)
E0324 15:16:28.440763 1 render_controller.go:545] Error syncing Generated MCFG: size validation failed: rendered MachineConfig rendered-master-2577e9f84fb4cf97fc235ba314c538b3 is too large (1612325 bytes, max 1572864 bytes). This will exceed etcd's size limit. Consider reducing the number or size of MachineConfigs, particularly large registry mirror configurations (ImageDigestMirrorSet/ImageContentSourcePolicy)
I0324 15:16:28.450421 1 render_controller.go:478] Error syncing machineconfigpool master: size validation failed: rendered MachineConfig rendered-master-2577e9f84fb4cf97fc235ba314c538b3 is too large (1612325 bytes, max 1572864 bytes). This will exceed etcd's size limit. Consider reducing the number or size of MachineConfigs, particularly large registry mirror configurations (ImageDigestMirrorSet/ImageContentSourcePolicy)
I0324 15:16:30.903916 1 render_controller.go:484] Dropping machineconfigpool "worker" out of the queue: size validation failed: rendered MachineConfig rendered-worker-4f5cc9d23653637a819db8ad423f6b03 is too large (1610456 bytes, max 1572864 bytes). This will exceed etcd's size limit. Consider reducing the number or size of MachineConfigs, particularly large registry mirror configurations (ImageDigestMirrorSet/ImageContentSourcePolicy)
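The reported and maximum sizes can be pulled out of these messages to see how far over the limit a pool is. This is a sketch that assumes the message keeps the exact `(<size> bytes, max <limit> bytes)` wording shown in the log lines; `mc_overage` is a hypothetical helper name.

```shell
# mc_overage: reads controller messages on stdin and, for each line that
# contains "is too large (<size> bytes, max <limit> bytes)", prints how many
# bytes the rendered MachineConfig is over the limit. Prints nothing otherwise.
mc_overage() {
    sed -n 's/.*is too large (\([0-9][0-9]*\) bytes, max \([0-9][0-9]*\) bytes).*/\1 \2/p' |
    # The "|| [ -n ... ]" guard handles a final line without a trailing newline.
    while read -r size max || [ -n "$size" ]; do
        echo $((size - max))
    done
}
```

Piping the same log command used in this step into the helper, `oc logs -n openshift-machine-config-operator deployment/machine-config-controller --tail=20 | mc_overage`, prints one overage per matching line. For the master message it is 39461 bytes and for the worker message 37592 bytes, so removing a single 100-entry IDMS should be enough to bring either pool back under the limit.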
Rendered MC and Registries Size:
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get mc $(oc get mcp master -o jsonpath='{.spec.configuration.name}') -o json | wc -c
1566321
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get mc 99-master-generated-registries -o json | wc -c
1437280
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get mc 99-master-generated-registries -o json | jq -r '.spec.config.storage.files[0].contents.source' | cut -d',' -f2 | base64 -d | wc -c
1067865
harshpat@harshpat-thinkpadp1gen4i:~/Downloads$ oc get imagedigestmirrorset
NAME AGE
large-idms-001 4h
large-idms-002 4h
large-idms-003 4h
large-idms-004 4h
large-idms-005 4h
large-idms-006 4h
large-idms-007 4h
large-idms-008 4h
large-idms-009 4h
large-idms-010 4h
large-idms-011 4h
large-idms-012 4h
large-idms-013 4h
large-idms-014 4h
large-idms-015 4h
large-idms-016 4h
large-idms-017 4h
large-idms-018 4h
large-idms-019 4h
large-idms-020 4h
large-idms-021 23m
large-idms-022 23m
large-idms-023 23m
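The registries.conf measurement in this step strips the data-URL prefix from the Ignition file source and base64-decodes the rest. A small sketch of that step, plus the size arithmetic it implies, follows. The helper names are hypothetical, and it assumes the source is a `data:` URL with a base64 payload, as the `cut`/`base64 -d` pipeline in the transcript suggests.

```shell
# decode_data_url: reads "data:<media type>;base64,<payload>" on stdin and
# writes the decoded payload to stdout.
decode_data_url() {
    cut -d',' -f2 | base64 -d
}

# base64_len BYTES: length of the padded base64 encoding of BYTES bytes
# (every 3 input bytes become 4 output characters, rounded up).
base64_len() {
    echo $(( ($1 + 2) / 3 * 4 ))
}
```

`base64_len 1067865` gives 1423820, so if the payload is stored base64-encoded, the encoding alone accounts for most of the 1437280 bytes measured for 99-master-generated-registries. About 355955 of those bytes are encoding overhead rather than registry configuration, which is why the rendered MC reaches the limit well before registries.conf itself does.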
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dkhater-redhat, HarshwardhanPatil07, isabella-janssen
MCC logs confirmed with
/verified by @HarshwardhanPatil07 See #5729 (review)
@isabella-janssen: This PR has been marked as verified by
/restest-required
/retest-required
Merged commit e814e20 into openshift:main
@dkhater-redhat: Jira Issue Verification Checks: Jira Issue OCPBUGS-62619
Jira Issue OCPBUGS-62619 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload.