[BUG] Hung peer recovery permanently blocks replica allocation.

### Describe the bug

The details below were suggested by Codex after a multi-hour analysis of the pods.
  - Environment: OpenSearch 3.3.0, three-node cluster, default `cluster.routing.allocation.node_concurrent_incoming_recoveries` and `cluster.routing.allocation.node_concurrent_outgoing_recoveries` of 2.
  - Steps: kill or restart a data node so its replicas need to recover, then simulate a stuck recovery (e.g., block traffic between source and target). Observe that the shard
    **remains INITIALIZING forever,** consuming the only available recovery slot.
  - Expected: OpenSearch eventually marks the recovery as failed and retries, or lets other replicas proceed.
  - Actual: the recovery stays stuck indefinitely, `unassigned_shards` never clears, and the operator must manually delete the shard or wipe the node.
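To confirm from the outside that a recovery is hung rather than slow, one option is to take two snapshots of `GET /_recovery?active_only=true` a few minutes apart and compare progress. The sketch below is a minimal illustration and not part of OpenSearch. The helper names are hypothetical, and the field names (`type`, `stage`, `index.size.recovered_in_bytes`, `translog.recovered`, `target.name`) are assumptions based on the usual shape of the recovery API response, so check them against real output from your cluster.

```python
from typing import Dict, List, Tuple

# (index name, shard id, target node name); a hypothetical key for this sketch.
ShardKey = Tuple[str, int, str]


def recovery_progress(recovery_response: dict) -> Dict[ShardKey, int]:
    """Map each in-flight peer recovery to a single progress counter."""
    progress: Dict[ShardKey, int] = {}
    for index_name, body in recovery_response.items():
        for shard in body.get("shards", []):
            # Only peer recoveries that have not finished are of interest here.
            if shard.get("type") != "PEER" or shard.get("stage") == "DONE":
                continue
            target = shard.get("target", {}).get("name", "?")
            recovered_bytes = shard.get("index", {}).get("size", {}).get("recovered_in_bytes", 0)
            replayed_ops = shard.get("translog", {}).get("recovered", 0)
            # Either counter moving means the recovery is still alive.
            progress[(index_name, shard["id"], target)] = recovered_bytes + replayed_ops
    return progress


def stalled_recoveries(earlier: dict, later: dict) -> List[ShardKey]:
    """Return recoveries present in both snapshots whose counter did not move."""
    before = recovery_progress(earlier)
    after = recovery_progress(later)
    return sorted(key for key, value in after.items() if before.get(key) == value)
```

Pass the parsed JSON of the two API responses to `stalled_recoveries`. Any shard it returns made no byte or translog progress between the snapshots, which is the symptom described above.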

Suggested fix:
  1. Auto-fail hung recoveries: add a timeout to peer recoveries so that if no bytes are transferred within (say) X minutes, OpenSearch marks the shard copy as failed and retries
     on another node. This would **free the `node_concurrent_*_recoveries` slots** and let other replicas progress without manual intervention.
  2. Heuristic for “no valid shard copy”: if a node rejoins with stale data and the cluster sees repeated NODE_LEFT → INITIALIZING loops, automatically wipe that shard
     copy (or at least flag it as unusable) so allocation can proceed. This mirrors what operators do manually by deleting PVCs.
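The no-progress timeout in item 1 can be sketched as a small watchdog. This is only an illustration of the idea in Python, not OpenSearch code: the class name, its methods, and the timeout value are all hypothetical, and a real implementation would live in the recovery code path and would fail the shard copy rather than return a list.

```python
import time
from typing import Callable, Dict, Hashable, List, Tuple


class RecoveryWatchdog:
    """Tracks the last time each recovery made progress (hypothetical helper)."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_seconds
        self._clock = clock
        # recovery id -> (highest byte count seen, time that count was first seen)
        self._last: Dict[Hashable, Tuple[int, float]] = {}

    def report(self, recovery_id: Hashable, bytes_transferred: int) -> None:
        """Record a progress sample; the timer resets only if bytes increased."""
        previous = self._last.get(recovery_id)
        if previous is None or bytes_transferred > previous[0]:
            self._last[recovery_id] = (bytes_transferred, self._clock())

    def finish(self, recovery_id: Hashable) -> None:
        """Stop tracking a recovery that completed or was cancelled."""
        self._last.pop(recovery_id, None)

    def expired(self) -> List[Hashable]:
        """Return and forget recoveries with no progress for the full timeout."""
        now = self._clock()
        hung = [rid for rid, (_, seen) in self._last.items()
                if now - seen >= self._timeout]
        for rid in hung:
            del self._last[rid]  # the caller is expected to fail and retry these
        return hung
```

A periodic task would call `expired()` and mark each returned recovery as failed, which releases its recovery slot. Resetting the timer only when the byte count increases means a recovery that keeps reporting the same count is still treated as hung.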


### Related component

_No response_

### To Reproduce

There are no repro steps, and it is hard to recover from such cases. I would love to see a solution.

### Expected behavior

The cluster recovers without manual intervention.

### Additional Details

N/A

[BUG] Hung peer recovery permanently blocks replica allocation. #20177
