@atrbgithub commented Nov 7, 2025

This commit ensures that completed pods are eventually cleaned up.


This relates to #57553

Previously (in 2.9.3, for example), self.job.executor.try_adopt_task_instances was always called here:

to_reset = self.job.executor.try_adopt_task_instances(tis_to_adopt_or_reset)

It was called unconditionally, even if it found no TaskInstances to adopt.

This meant that, in the Kubernetes executor, we would always reach this line:

self._adopt_completed_pods(kube_client)

This was triggered at startup by calling adopt_or_reset_orphaned_tasks, here:

self.adopt_or_reset_orphaned_tasks()

It was then also called periodically, at an interval configurable with orphaned_tasks_check_interval.

try_adopt_task_instances is now only called when the query that detects adoptable tasks actually finds some. As a result, if that query finds no tasks to adopt, we no longer call _adopt_completed_pods, and completed pods are left hanging around. This happens when an old scheduler instance is stopped and a new one takes its place.
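
To make the difference concrete, here is a minimal toy sketch of the two call patterns described above. It does not use Airflow: ToyExecutor, old_flow and new_flow are hypothetical stand-ins, and the early return in new_flow is only an assumed shape for the current scheduler code, not a copy of it.

```python
class ToyExecutor:
    """Hypothetical stand-in for an executor; not the real KubernetesExecutor."""

    def __init__(self):
        self.cleanup_calls = 0

    def _adopt_completed_pods(self):
        # In this toy, "cleanup" is just a counter.
        self.cleanup_calls += 1

    def try_adopt_task_instances(self, tis):
        # Completed-pod cleanup happens as a side effect of adoption.
        self._adopt_completed_pods()
        return list(tis)  # pretend nothing could be adopted


def old_flow(executor, tis_to_adopt_or_reset):
    # Older behaviour: the executor is called unconditionally.
    return executor.try_adopt_task_instances(tis_to_adopt_or_reset)


def new_flow(executor, tis_to_adopt_or_reset):
    # Assumed shape of the regression: skip the executor when nothing was found.
    if not tis_to_adopt_or_reset:
        return []
    return executor.try_adopt_task_instances(tis_to_adopt_or_reset)


old_exec, new_exec = ToyExecutor(), ToyExecutor()
old_flow(old_exec, [])  # empty query result
new_flow(new_exec, [])  # empty query result
print(old_exec.cleanup_calls, new_exec.cleanup_calls)  # 1 0
```

With an empty query result, only the unconditional flow ever reaches the cleanup step, which is why completed pods accumulate.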

This PR partially restores the old behaviour: _adopt_completed_pods is again called frequently, although it does not appear to run directly after startup. That should be acceptable. Even if _adopt_completed_pods were called on startup, race conditions mean it could miss pods that enter a completed state without having been captured by the query that detects adoptable tasks, so it has to run again at some later point anyway.
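
As a rough illustration of "run frequently, but not directly at startup", the sketch below shows a simple interval-based trigger. This is not the implementation in this PR: PeriodicCleanup, its tick method and the 300-second interval are all hypothetical, chosen only to show why a later run still catches pods that complete after startup.

```python
class PeriodicCleanup:
    """Hypothetical helper: runs a cleanup callable at most once per interval."""

    def __init__(self, interval, cleanup, start):
        self.interval = interval
        self.cleanup = cleanup
        # Starting the clock at startup means the first run happens
        # one full interval later, not immediately.
        self.last_run = start

    def tick(self, now):
        # Call from the main loop; returns True if cleanup ran on this tick.
        if now - self.last_run >= self.interval:
            self.cleanup()
            self.last_run = now
            return True
        return False


calls = []
timer = PeriodicCleanup(interval=300.0, cleanup=lambda: calls.append("cleanup"), start=0.0)

# Simulated loop iterations at these timestamps (seconds since startup).
results = [timer.tick(t) for t in (0.0, 100.0, 300.0, 450.0, 600.0)]
print(results)     # [False, False, True, False, True]
print(len(calls))  # 2
```

Because the cleanup does not depend on whether any tasks were found for adoption, a pod that completes between two ticks is picked up on the next one.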

Tests pass with:

breeze testing providers-tests --test-type "Providers[cncf]"
