
Fix: idle TEs leak from scheduler pool after disable + reconnect + expiry#850

Merged
Andyz26 merged 1 commit into master from
andyz/fixLeakIdleAgentOnReenable
Apr 22, 2026

Conversation


@Andyz26 Andyz26 commented Apr 22, 2026

Summary

Resource-cluster scheduling can enter a state where a TaskExecutor is reported as available by the metrics endpoint and by /getAvailableTaskExecutors, yet new reservations for its SKU keep failing with NoResourceAvailableException. This PR closes the gap in ExecutorStateManagerActor that produces that drift and ships a deterministic repro test covering both by-id and attribute-based disable paths.

Symptom

From an impacted master log:

Reconnected task executor TaskExecutorID(resourceId=2af3d069-…) was already marked for disabling.
Removed active disable task executors for request ExpireDisableTaskExecutorsRequest(
  request=DisableTaskExecutorsRequest(attributes={}, clusterID=ClusterID(resourceID=),
    expiry=…, taskExecutorID=Optional[TaskExecutorID(2af3d069-…)]))

After these two lines, the TE:

  • shows up in ResourceOverview.numAvailableTaskExecutors,
  • shows up in GetAvailableTaskExecutorsRequest responses,
  • is not picked by findBestFit; assignment requests for its SKU raise NoResourceAvailableException with the accompanying scheduler log "Cannot find any matching sku for request" / "Not enough available TEs found for scheduling constraints".

Root cause

The three views read different data structures:

  • ResourceOverview.numAvailable metric — taskExecutorStateMap.values().filter(isAvailable) (ExecutorStateManagerImpl.java:253)
  • GetAvailableTaskExecutorsRequest API — taskExecutorStateMap.entrySet().filter(isAvailable) (ExecutorStateManagerActor.java:410)
  • Scheduler (findBestFit) — executorsByGroup: Map<TaskExecutorGroupKey, NavigableSet<TaskExecutorHolder>> (ExecutorStateManagerImpl.java:124, 381-417)

executorsByGroup is populated only by tryMarkAvailable(id, state) and requires state.isAvailable() && state.getRegistration() != null at the moment of the call.
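This precondition can be sketched as follows (a minimal model with simplified, hypothetical names; the real index is keyed by TaskExecutorGroupKey and holds a NavigableSet of TaskExecutorHolder):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the indexing invariant: a TE only enters the group index
// when it is available AND registered at the moment of the call.
class GroupIndexSketch {
    static class State {
        boolean disabled;
        String availabilityState; // null until a heartbeat arrives
        Object registration;      // null until registered

        boolean isAvailable() {
            return "Pending".equals(availabilityState) && !disabled;
        }
    }

    final Map<String, State> taskExecutorStateMap = new HashMap<>();
    final Set<String> executorsByGroup = new HashSet<>(); // one flat group for the sketch

    // Mirrors tryMarkAvailable(id, state): silently a no-op unless both
    // preconditions hold when it is called.
    boolean tryMarkAvailable(String id) {
        State s = taskExecutorStateMap.get(id);
        if (s != null && s.isAvailable() && s.registration != null) {
            return executorsByGroup.add(id);
        }
        return false;
    }
}
```

A freshly tracked state (registration=null, availabilityState=null) therefore never indexes, which is exactly what step 3 of the drift sequence below exploits.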

Drift sequence that produces the stuck state (trace for 2af3d069-…):

  1. Targeted disable issued (e.g. from man-nfmantis's UpgradeClusterChildWorkflowImpl drain). state.onNodeDisabled() sets disabled=true. Scheduler's filter at :407 correctly excludes the TE — fine.
  2. TE disconnects (heartbeat-timeout or crash). onDisconnection() clears registration and availabilityState; archive() moves the state out of taskExecutorStateMap. A stale holder may linger in executorsByGroup but is dropped by the !taskExecutorStateMap.containsKey(...) guard during assignment.
  3. TE reconnects via heartbeat before the disable expires.
    • trackIfAbsent puts a fresh TaskExecutorState into the map. tryMarkAvailable(id, new state) is a no-op because the new state has registration=null, availabilityState=null.
    • onHeartbeat loads the registration, calls onRegistration (returns true), then sees the TE is still in disabledTaskExecutors and calls onNodeDisabled(), setting disabled=true. This is the "Reconnected task executor ... was already marked for disabling" log.
    • state.onHeartbeat(hb) flips availabilityState=Pending, but the if (stateChange && state.isAvailable()) gate fails because isAvailable() = Pending && !disabled = false. tryMarkAvailable is not called; the TE never enters executorsByGroup.
  4. Disable expiry fires. onDisableTaskExecutorsRequestExpiry calls state.onNodeEnabled(), which clears the flag — but it does not re-index the TE into executorsByGroup. After this, state.isAvailable() is true, so the metric and the API both report the TE as available, while findBestFit still cannot see it.
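The four steps above can be replayed against a toy model (illustrative names, not the production API) to show how the availability flag and the group index drift apart:

```java
import java.util.HashSet;
import java.util.Set;

// Simulates the drift sequence for a single TE with a flat group index.
class DriftSketch {
    boolean disabled;
    boolean registered;
    boolean heartbeated;
    final Set<String> executorsByGroup = new HashSet<>();

    boolean isAvailable() { return heartbeated && !disabled; }

    void tryMarkAvailable(String id) {
        if (isAvailable() && registered) executorsByGroup.add(id);
    }

    void run() {
        registered = true; heartbeated = true;
        tryMarkAvailable("te");                  // steady state: indexed
        disabled = true;                         // 1. targeted disable
        executorsByGroup.remove("te");
        registered = false; heartbeated = false; // 2. disconnect clears state
        registered = true;                       // 3. reconnect before expiry
        tryMarkAvailable("te");                  //    no-op: not yet heartbeated
        heartbeated = true;                      //    heartbeat flips availability...
        // ...but the `stateChange && isAvailable()` gate fails: disabled==true
        disabled = false;                        // 4. expiry re-enables the flag only
        // BUG: nothing re-runs tryMarkAvailable here
    }
}
```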

The symmetric gap exists in onCheckDisabledTaskExecutors: when a TE is disabled while present in executorsByGroup, state.onNodeDisabled() is called without a matching tryMarkUnavailable. The filter at :407 happens to rescue the assignment side, but the stale holder adds noise to the generation-first ordering in findBestGroupBySizeNameMatch.

Fix

Three edits in ExecutorStateManagerActor.java; all keep executorsByGroup in sync with isAvailable():

  1. onDisableTaskExecutorsRequestExpiry by-id path — after state.onNodeEnabled() flips disabled to false, call delegate.tryMarkAvailable(taskExecutorID). Idempotent: no-op if isAvailable()==false or if the holder is already present.
  2. onDisableTaskExecutorsRequestExpiry attribute path — same fix for the by-attributes branch.
  3. onCheckDisabledTaskExecutors — after state.onNodeDisabled() flips disabled to true, call delegate.tryMarkUnavailable(id). The by-id branch also guards on state.getRegistration() != null to avoid tryMarkUnavailable's log.warn firing when a disable request arrives for a TE that is currently disconnected.

All calls are guarded by the onNodeDisabled / onNodeEnabled return value, so they only run on an actual flag flip — no double-remove or double-add. tryMarkAvailable / tryMarkUnavailable are already safe and idempotent under the preconditions.
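A sketch of this shape (simplified, hypothetical names; not the actual diff) shows how gating the index update on the flag flip keeps both views in sync:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the fix: re-index on an actual enable flip, de-index on an
// actual disable flip. Names are illustrative only.
class FixSketch {
    boolean disabled;
    boolean registered = true;
    boolean heartbeated = true;
    final Set<String> executorsByGroup = new HashSet<>();

    boolean isAvailable() { return heartbeated && !disabled; }

    void tryMarkAvailable(String id) {
        if (isAvailable() && registered) executorsByGroup.add(id);
    }

    void tryMarkUnavailable(String id) { executorsByGroup.remove(id); }

    // Return true only on an actual flip, which gates the index call,
    // so there is no double-add or double-remove.
    boolean onNodeEnabled()  { boolean flip = disabled;  disabled = false; return flip; }
    boolean onNodeDisabled() { boolean flip = !disabled; disabled = true;  return flip; }

    void onDisableExpiry(String id) {
        if (onNodeEnabled()) tryMarkAvailable(id);              // fix sites 1 & 2
    }

    void onCheckDisabled(String id) {
        if (onNodeDisabled() && registered) tryMarkUnavailable(id); // fix site 3
    }
}
```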

No behavior change in any other path.

New DisableReconnectExpireStuckReproTest exercises the full sequence end-to-end via ResourceCluster:

  • reconnectDuringTargetedDisablePlusExpirePreservesSchedulability — by-id DisableTaskExecutorsRequest.
  • reconnectDuringAttributeBasedDisablePlusExpirePreservesSchedulability — attribute-matching DisableTaskExecutorsRequest (exercises the other fix site; confirmed by the "re-enable TE:" log line).

Each test registers a TE, heartbeats Occupied (so it drops from executorsByGroup just like a running TE during ASG drain), disables it, disconnects it, re-registers it with an Available heartbeat before the expiry, waits past expiry, and then asserts that getTaskExecutorsFor returns the TE. On unfixed master both tests fail with "NoResourceAvailableException despite TE being reported as available…"; with the fix, both pass.
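The test sequence can be outlined against a toy model with the fix applied (hypothetical names; the real tests drive ResourceCluster end-to-end):

```java
import java.util.HashSet;
import java.util.Set;

// Toy walk of the repro sequence with the fix in place.
class ReproSketch {
    enum Avail { NONE, OCCUPIED, PENDING }
    boolean disabled, registered;
    Avail avail = Avail.NONE;
    final Set<String> group = new HashSet<>();

    boolean isAvailable() { return avail == Avail.PENDING && !disabled; }
    void index(String id)   { if (isAvailable() && registered) group.add(id); }
    void deindex(String id) { group.remove(id); }

    void runFixedSequence(String id) {
        registered = true; avail = Avail.PENDING; index(id); // register + available hb
        avail = Avail.OCCUPIED; deindex(id);                 // Occupied, like an ASG drain
        disabled = true;                                     // targeted disable
        registered = false; avail = Avail.NONE;              // disconnect
        registered = true; avail = Avail.PENDING;            // reconnect before expiry
        // the availability gate fails while disabled, so no index(id) here
        disabled = false; index(id);                         // expiry + FIX: re-index
    }
}
```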

@github-actions

Test Results

783 tests  +2   772 ✅ +2   10m 20s ⏱️ +12s
163 suites +1    11 💤 ±0 
163 files   +1     0 ❌ ±0 

Results for commit f21fbca. ± Comparison against base commit b9bc562.

@Andyz26 Andyz26 merged commit bb22595 into master Apr 22, 2026
3 checks passed
@Andyz26 Andyz26 deleted the andyz/fixLeakIdleAgentOnReenable branch April 22, 2026 20:25
