k8s executor - ensure pods cleaned up #58047
Open
+7
−0
This commit ensures that completed pods are eventually cleaned up.
This relates to #57553
Previously, in 2.9.3 for example, `self.job.executor.try_adopt_task_instances` was always called here:
airflow/airflow/jobs/scheduler_job_runner.py
Line 1641 in 81845de
It was called unconditionally, even if it found no TaskInstances to adopt.
This meant that in the Kubernetes executor, we would always reach this line:
airflow/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py
Line 601 in 81845de
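To make the difference concrete, here is a minimal sketch of an unconditional call versus a guarded one. `ToyExecutor`, `unconditional_check`, and `guarded_check` are hypothetical stand-ins written for illustration, not Airflow's actual classes or functions; only the method names `try_adopt_task_instances` and `_adopt_completed_pods` come from the description above, and their bodies here are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyExecutor:
    """Hypothetical stand-in for an executor; not Airflow's real class."""

    completed_pods: list[str] = field(default_factory=lambda: ["pod-a", "pod-b"])
    cleaned: list[str] = field(default_factory=list)

    def _adopt_completed_pods(self) -> None:
        # Assumed behaviour: sweep every completed pod so it can be cleaned up.
        self.cleaned.extend(self.completed_pods)
        self.completed_pods.clear()

    def try_adopt_task_instances(self, tis: list[str]) -> list[str]:
        # Adoption of running task instances is elided in this sketch.
        # The completed-pod sweep runs as a side effect of every call.
        self._adopt_completed_pods()
        return []


def unconditional_check(executor: ToyExecutor, tis: list[str]) -> None:
    # Always calls the executor, even when there is nothing to adopt.
    executor.try_adopt_task_instances(tis)


def guarded_check(executor: ToyExecutor, tis: list[str]) -> None:
    # Calls the executor only when there is something to adopt,
    # so the sweep is skipped for an empty list.
    if tis:
        executor.try_adopt_task_instances(tis)


always = ToyExecutor()
unconditional_check(always, [])

guarded = ToyExecutor()
guarded_check(guarded, [])
```

In this toy model, `always.cleaned` ends up holding both pods, while `guarded.completed_pods` is left untouched because the sweep never ran.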
This was triggered at startup by calling `adopt_or_reset_orphaned_tasks`, here:
airflow/airflow/jobs/scheduler_job_runner.py
Line 928 in 81845de
It was also then called periodically, at an interval configurable with `orphaned_tasks_check_interval`.

Now, if the query that detects adoptable tasks does not find any tasks to adopt, we no longer call `_adopt_completed_pods`, and as a result completed pods are left hanging around. This happens when an old scheduler instance is stopped and a new one takes its place.

This PR partially restores the old behaviour (frequent calls to `_adopt_completed_pods`), but it does not appear to run directly after startup. Even if `_adopt_completed_pods` is called on startup, race conditions mean you may miss pods that enter a completed state but were not captured by the query that detects adoptable tasks, so it needs to run again at some point later.

Tests pass with: