[BUG] Hung peer recovery permanently blocks replica allocation. #20177

@maxlepikhin

Describe the bug

The following was suggested by Codex after a multi-hour analysis of the pods:

  • Environment: OpenSearch 3.3.0, three-node cluster, default cluster.routing.allocation.node_concurrent_{incoming,outgoing}_recoveries of 2.
  • Steps: kill/restart a data node so its replicas need to recover; simulate a stuck recovery (e.g., block traffic between source and target). Observe that the shard
    remains INITIALIZING forever, holding a recovery slot that is never released.
  • Expected: OpenSearch eventually marks the recovery failed and retries / lets other replicas proceed.
  • Actual: the recovery stays stuck indefinitely, unassigned_shards never drops to zero, and the operator must manually delete the shard or wipe the node.
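
To make "stuck" concrete, here is a minimal sketch of how an operator might detect the condition from the outside. It assumes two snapshots of `GET _cat/recovery?format=json&active_only=true` taken some minutes apart and already parsed into lists of dicts, and it assumes the rows carry the columns `index`, `shard`, `stage`, `target_node`, `bytes_recovered`, and `translog_ops_recovered`. The helper names (`find_stalled`, `_progress`) are hypothetical and not part of OpenSearch.

```python
from typing import Dict, List, Tuple

# (index, shard, target_node) identifies one in-flight recovery in this sketch.
RecoveryKey = Tuple[str, str, str]


def _progress(rows: List[Dict[str, str]]) -> Dict[RecoveryKey, Tuple[str, str]]:
    """Map each unfinished recovery to its (bytes, translog ops) progress counters."""
    out: Dict[RecoveryKey, Tuple[str, str]] = {}
    for row in rows:
        if row.get("stage") == "done":
            continue  # finished recoveries cannot be stalled
        key = (row["index"], row["shard"], row["target_node"])
        # Counters are compared as raw strings, so no unit parsing is needed.
        out[key] = (row["bytes_recovered"], row["translog_ops_recovered"])
    return out


def find_stalled(previous: List[Dict[str, str]],
                 current: List[Dict[str, str]]) -> List[RecoveryKey]:
    """Return recoveries present in both snapshots whose counters did not change."""
    before = _progress(previous)
    after = _progress(current)
    return sorted(key for key, counters in after.items()
                  if before.get(key) == counters)
```

Any key returned by `find_stalled` is a candidate for the hang described above. Because human-readable sizes can hide small changes, requesting exact byte counts (for example with the cat API's `bytes=b` parameter) makes the comparison more reliable.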

Suggested fix:

  1. Auto-fail hung recoveries: Add a timeout to peer recoveries so that if no bytes transfer within (say) X minutes, OpenSearch marks the shard copy as failed and retries
    on another node. This would free the node_concurrent_*_recoveries slots and let other replicas progress without manual intervention.
  2. Heuristic for “no valid shard copy”: If a node rejoins with stale data and the cluster sees repeated NODE_LEFT → INITIALIZING loops, automatically wipe that shard
    copy (or at least flag it as unusable) so allocation can proceed. This mirrors what operators do manually by deleting PVCs.
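
The first suggestion can be sketched as a small stall watchdog. This is an illustration of the proposed behavior only, not OpenSearch's implementation: the class `StallWatchdog`, its methods, and the `stall_timeout_s` parameter are all hypothetical names, and the "X minutes" from the proposal is left as a caller-supplied value.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# (index, shard number, target node) identifies one recovery in this sketch.
RecoveryKey = Tuple[str, int, str]


@dataclass
class StallWatchdog:
    """Tracks the last time each recovery made byte-level progress."""
    stall_timeout_s: float
    clock: Callable[[], float] = time.monotonic  # injectable for testing
    _last: Dict[RecoveryKey, Tuple[int, float]] = field(
        default_factory=dict, init=False)

    def observe(self, key: RecoveryKey, bytes_recovered: int) -> None:
        """Record a progress sample; the timer resets only if bytes advanced."""
        prev = self._last.get(key)
        if prev is None or bytes_recovered > prev[0]:
            self._last[key] = (bytes_recovered, self.clock())

    def finish(self, key: RecoveryKey) -> None:
        """Stop tracking a recovery that completed or was cancelled."""
        self._last.pop(key, None)

    def stalled(self) -> List[RecoveryKey]:
        """Recoveries with no byte progress for at least stall_timeout_s."""
        now = self.clock()
        return sorted(key for key, (_, seen) in self._last.items()
                      if now - seen >= self.stall_timeout_s)
```

In the proposal, each key returned by `stalled()` would be marked failed so that its recovery slot is released and allocation can retry elsewhere. Resetting the timer only when the byte count increases means a connection that is open but idle still counts as hung.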

Related component

No response

To Reproduce

There are no reliable repro steps, and it's hard to recover from such cases. Would love to see a solution.

Expected behavior

The cluster recovers without manual intervention.

Additional Details

N/A
