Describe the bug
These were suggested by Codex after a multi-hour analysis of the pods.
- Environment: OpenSearch 3.3.0, three-node cluster, default cluster.routing.allocation.node_concurrent_{incoming,outgoing}_recoveries of 2.
- Steps: kill/restart a data node so its replicas need to recover; simulate a stuck recovery (e.g., block traffic between source and target). Observe that the shard remains INITIALIZING forever, consuming the only available recovery slot.
- Expected: OpenSearch eventually marks the recovery failed and retries / lets other replicas proceed.
- Actual: recovery stays stuck indefinitely, unassigned_shards never clear, operator must manually delete the shard or wipe the node.
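To make "stuck" concrete, here is a rough sketch of how an operator could flag a hung recovery from the outside by comparing two snapshots of active recoveries taken a few minutes apart. It assumes the rows come from something like `GET _cat/recovery?format=json&active_only=true&bytes=b` and that the fields are named `index`, `shard`, `target_node`, `stage` and `bytes_recovered`; treat those names as assumptions and check them against your cluster's output. The helper `stalled_recoveries` is hypothetical, not part of any OpenSearch client.

```python
def stalled_recoveries(previous, current):
    """Return (index, shard, target_node) keys for recoveries that are still
    active in both snapshots but whose bytes_recovered did not increase."""

    def key(row):
        return (row["index"], row["shard"], row["target_node"])

    # Byte counters from the earlier snapshot, active recoveries only.
    before = {
        key(row): int(row["bytes_recovered"])
        for row in previous
        if row["stage"] != "done"
    }

    stalled = []
    for row in current:
        if row["stage"] == "done":
            continue
        k = key(row)
        # Present in both snapshots with no byte progress: candidate for "stuck".
        if k in before and int(row["bytes_recovered"]) <= before[k]:
            stalled.append(k)
    return stalled
```

Byte counters alone are a crude signal (a recovery replaying translog operations may legitimately move no file bytes), so this is only a starting point for alerting, not a definition of a hung recovery.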
Suggested fix:
- Auto-fail hung recoveries: Add a timeout to peer recoveries so that if no bytes transfer within (say) X minutes, OpenSearch marks the shard copy as failed and retries on another node. This would free the node_concurrent_*_recoveries slots and let other replicas progress without manual intervention.
- Heuristic for “no valid shard copy”: If a node rejoins with stale data and the cluster sees repeated NODE_LEFT → INITIALIZING loops, automatically wipe that shard copy (or at least flag it as unusable) so allocation can proceed. This mirrors what operators do manually by deleting PVCs.
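The "no bytes within X minutes" rule from the first suggestion can be sketched as a small progress watchdog. This is illustrative only and not OpenSearch code: the class `RecoveryWatchdog`, its methods, and the idea of feeding it byte counters per recovery are all hypothetical, and a real implementation would live inside the recovery machinery rather than in Python.

```python
import time


class RecoveryWatchdog:
    """Tracks the last time each recovery made byte progress (hypothetical sketch)."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        # recovery_id -> (highest byte count seen, time of last progress)
        self._last = {}

    def report(self, recovery_id, bytes_transferred):
        """Record a byte counter; only an increase resets the inactivity timer."""
        prev = self._last.get(recovery_id)
        if prev is None or bytes_transferred > prev[0]:
            self._last[recovery_id] = (bytes_transferred, self.clock())

    def finish(self, recovery_id):
        """Stop tracking a recovery that completed or was already failed."""
        self._last.pop(recovery_id, None)

    def expired(self):
        """Return recoveries with no byte progress for at least the timeout."""
        now = self.clock()
        return [
            rid for rid, (_, last_progress) in self._last.items()
            if now - last_progress >= self.timeout
        ]
```

Whatever calls `expired()` would then fail those shard copies and release their recovery slots, which is the behavior requested above. Keying the timer on progress rather than on total elapsed time means a large but healthy recovery is never cut off.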
Related component
No response
To Reproduce
There are no reliable repro steps, and it's hard to recover from such cases; would love to see a solution.
Expected behavior
The cluster recovers w/o manual intervention.
Additional Details
N/A