@sonianuj287

Related issue - #5275

Summary

The issue was caused by missing timeout configurations in MicroK8s that control how Kubernetes monitors and reports node health. This led to:

Nodes remaining "Ready" even when offline
Healthy nodes being incorrectly marked "NotReady"
Inconsistent cluster behavior

Root Cause Analysis
Missing critical parameters in the Kubernetes control plane components:

--node-monitor-grace-period in kube-controller-manager
--pod-eviction-timeout in kube-controller-manager
--node-status-update-frequency in kubelet
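
A quick way to confirm these flags are indeed absent on an affected node (assuming the standard MicroK8s snap layout, where each component reads its flags from a per-component args file):

```shell
# No matching output means the flag is unset and the compiled-in
# Kubernetes default applies.
grep -E 'node-monitor-grace-period|pod-eviction-timeout' /var/snap/microk8s/current/args/kube-controller-manager
grep 'node-status-update-frequency' /var/snap/microk8s/current/args/kubelet
```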

Changes

  1. Updated the kube-controller-manager configuration (--node-monitor-grace-period, --pod-eviction-timeout).

  2. Updated the kubelet configuration (--node-status-update-frequency).
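
The concrete values are not included in this description; a plausible sketch, assuming MicroK8s's per-component args files and the commonly recommended settings (all values are hypothetical until checked against the actual diff):

```shell
# Hypothetical values -- the real ones come from the PR diff.
# MicroK8s keeps each component's flags in an args file under the snap.

# kube-controller-manager: how long a node may miss status updates
# before being marked NotReady, and how long pods stay on a NotReady
# node before eviction. (Note: --pod-eviction-timeout is deprecated in
# recent Kubernetes releases in favor of taint-based eviction.)
echo '--node-monitor-grace-period=40s' | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager
echo '--pod-eviction-timeout=5m'       | sudo tee -a /var/snap/microk8s/current/args/kube-controller-manager

# kubelet: how often each node reports its status to the API server.
echo '--node-status-update-frequency=10s' | sudo tee -a /var/snap/microk8s/current/args/kubelet

# Restart MicroK8s so the new flags take effect.
microk8s stop && microk8s start
```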

Expected Behavior After Fix
Healthy nodes remain "Ready" at all times
Failed nodes are marked "NotReady" within ~40-50 seconds
Only actually failed nodes are marked "NotReady"
Consistent behavior across different cluster configurations

Testing
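
This section was left empty; a manual verification sketch based on the reproduction steps described later in the thread (host and interface names are illustrative):

```shell
# 1. With a multi-node cluster formed (microk8s add-node / join), list the nodes.
microk8s kubectl get nodes -o wide

# 2. Simulate a node failure by dropping its network link, e.g. on host1:
#      sudo ip link set eth0 down    # interface name varies per host

# 3. From a surviving node, watch the transition; only the failed node
#    should go NotReady, within roughly 40-50 seconds.
microk8s kubectl get nodes -w

# 4. Bring the link back up and confirm the node returns to Ready.
#      sudo ip link set eth0 up
```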

Possible Regressions

Checklist

  • Read the contributions page.
  • Submitted the CLA form, if you are a first-time contributor.
  • The introduced changes are covered by unit and/or integration tests.

Notes

@sonianuj287 (Author)

Hi @lazzarello @akaihola @xnox @timgreen, please review this PR and suggest any changes. Thanks :)

@xnox (Contributor) commented Oct 23, 2025

This is nice! Lots of Pro customers are hitting this issue, and it always seemed baffling, but this is likely the root cause.

Especially since microk8s without these settings behaves unlike other k8s deployments.

@sonianuj287 (Author)

> This is nice! Lots of Pro customers are hitting this issue, and it seemed always wild, but this is likely the root cause for it.
>
> Especially since microk8s without these settings behaves unlike other k8s deployments.

Thanks @xnox, how can we run the workflow to verify the changes? If possible, I would like it merged before the end of October so it counts for Hacktoberfest.

@ktsakalozos (Member)

Hi @sonianuj287 @xnox, apologies for the late response. We will try to reproduce this issue and verify the fix. In the meantime, could you please sign the CLA and address the lint errors?

@kcarson77

Any updates on this? It is easy to reproduce: install a 3-host cluster and take the link down on host1, for example. Sometimes one node becomes NotReady, sometimes two nodes do, and sometimes no nodes do (less common). We are seeing this across different systems and it is having a critical impact.
