
Descheduler creates eviction loop for pods with preferredDuringSchedulingIgnoredDuringExecution affinity #1707

@hamc

Description

What version of descheduler are you using?

descheduler version: v0.33.0

Does this issue reproduce with the latest release?

Yes

Which descheduler CLI options are you using?

Please provide a copy of your descheduler policy config file

# CronJob or Deployment
kind: CronJob

image:
  repository: registry.k8s.io/descheduler/descheduler
  # Overrides the image tag whose default is the chart version
  tag: ""
  pullPolicy: IfNotPresent

imagePullSecrets:
#   - name: container-registry-secret

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi

ports:
  - containerPort: 10258
    protocol: TCP

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  privileged: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

# podSecurityContext -- [Security context for pod](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/)
podSecurityContext: {}
  # fsGroup: 1000

nameOverride: ""
fullnameOverride: ""

# -- Override the deployment namespace; defaults to .Release.Namespace
namespaceOverride: ""

# labels that'll be applied to all resources
commonLabels: {}

cronJobApiVersion: "batch/v1"
schedule: "*/2 * * * *"
suspend: false
# startingDeadlineSeconds: 200
# successfulJobsHistoryLimit: 3
# failedJobsHistoryLimit: 1
# ttlSecondsAfterFinished: 600
# timeZone: Etc/UTC

# Required when running as a Deployment
deschedulingInterval: 5m

# Specifies the replica count for Deployment
# Set leaderElection if you want to use more than 1 replica
# Set affinity.podAntiAffinity rule if you want to schedule onto a node
# only if that node is in the same zone as at least one already-running descheduler
replicas: 1

# Specifies whether Leader Election resources should be created
# Required when running as a Deployment
# NOTE: Leader election can't be activated if DryRun enabled
leaderElection: {}
#  enabled: true
#  leaseDuration: 15s
#  renewDeadline: 10s
#  retryPeriod: 2s
#  resourceLock: "leases"
#  resourceName: "descheduler"
#  resourceNamespace: "kube-system"

command:
- "/bin/descheduler"

cmdOptions:
  v: 3

# Recommended to use the latest Policy API version supported by the Descheduler app version
deschedulerPolicyAPIVersion: "descheduler/v1alpha2"

# deschedulerPolicy contains the policies the descheduler will execute.
deschedulerPolicy:
  # nodeSelector: "key1=value1,key2=value2"
  # maxNoOfPodsToEvictPerNode: 10
  # maxNoOfPodsToEvictPerNamespace: 10
  # metricsProviders:
  # - source: KubernetesMetrics
  # tracing:
  #   collectorEndpoint: otel-collector.observability.svc.cluster.local:4317
  #   transportCert: ""
  #   serviceName: ""
  #   serviceNamespace: ""
  #   sampleRate: 1.0
  #   fallbackToNoOpProviderOnError: true
  profiles:
    - name: default
      pluginConfig:
        - name: DefaultEvictor
          args:
            ignorePvcPods: true
            evictLocalStoragePods: true
            nodeFit: true
        - name: RemoveDuplicates
        - name: RemovePodsHavingTooManyRestarts
          args:
            podRestartThreshold: 100
            includingInitContainers: true
        - name: RemovePodsViolatingNodeAffinity
          args:
            nodeAffinityType:
            - requiredDuringSchedulingIgnoredDuringExecution
            - preferredDuringSchedulingIgnoredDuringExecution
        - name: RemovePodsViolatingNodeTaints
        - name: RemovePodsViolatingInterPodAntiAffinity
        - name: RemovePodsViolatingTopologySpreadConstraint
        - name: LowNodeUtilization
          args:
            thresholds:
              "cpu" : 20
              "memory": 20
              "pods": 50
            targetThresholds:
              "cpu" : 50
              "memory": 50
              "pods": 50
      plugins:
        balance:
          enabled:
            - LowNodeUtilization
        deschedule:
          enabled:
            - RemovePodsViolatingInterPodAntiAffinity

priorityClassName: system-cluster-critical

nodeSelector: {}
#  foo: bar

affinity: {}
# nodeAffinity:
#   requiredDuringSchedulingIgnoredDuringExecution:
#     nodeSelectorTerms:
#     - matchExpressions:
#       - key: kubernetes.io/e2e-az-name
#         operator: In
#         values:
#         - e2e-az1
#         - e2e-az2
#  podAntiAffinity:
#    requiredDuringSchedulingIgnoredDuringExecution:
#      - labelSelector:
#          matchExpressions:
#            - key: app.kubernetes.io/name
#              operator: In
#              values:
#                - descheduler
#        topologyKey: "kubernetes.io/hostname"
topologySpreadConstraints: []
# - maxSkew: 1
#   topologyKey: kubernetes.io/hostname
#   whenUnsatisfiable: DoNotSchedule
#   labelSelector:
#     matchLabels:
#       app.kubernetes.io/name: descheduler
tolerations: []
# - key: 'management'
#   operator: 'Equal'
#   value: 'tool'
#   effect: 'NoSchedule'

rbac:
  # Specifies whether RBAC resources should be created
  create: true

serviceAccount:
  # Specifies whether a ServiceAccount should be created
  create: true
  # The name of the ServiceAccount to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
  # Specifies custom annotations for the serviceAccount
  annotations: {}

podAnnotations: {}

podLabels: {}

dnsConfig: {}

livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 10258
    scheme: HTTPS
  initialDelaySeconds: 3
  periodSeconds: 10

service:
  enabled: false
  # @param service.ipFamilyPolicy [string], support SingleStack, PreferDualStack and RequireDualStack
  #
  ipFamilyPolicy: ""
  # @param service.ipFamilies [array] List of IP families (e.g. IPv4, IPv6) assigned to the service.
  # Ref: https://kubernetes.io/docs/concepts/services-networking/dual-stack/
  # E.g.
  # ipFamilies:
  #   - IPv6
  #   - IPv4
  ipFamilies: []

serviceMonitor:
  enabled: false
  # The namespace where Prometheus expects to find service monitors.
  # namespace: ""
  # Add custom labels to the ServiceMonitor resource
  additionalLabels: {}
    # prometheus: kube-prometheus-stack
  interval: ""
  # honorLabels: true
  insecureSkipVerify: true
  serverName: null
  metricRelabelings: []
    # - action: keep
    #   regex: 'descheduler_(build_info|pods_evicted)'
    #   sourceLabels: [__name__]
  relabelings: []
    # - sourceLabels: [__meta_kubernetes_pod_node_name]
    #   separator: ;
    #   regex: ^(.*)$
    #   targetLabel: nodename
    #   replacement: $1
    #   action: replace

What k8s version are you using (kubectl version)?

v1.24.4

What did you do?

When pods use preferredDuringSchedulingIgnoredDuringExecution node affinity, the RemovePodsViolatingNodeAffinity strategy in Kubernetes Descheduler (v0.33.0) creates an eviction loop even though nodeFit: true is enabled in the DefaultEvictor: pods are repeatedly evicted from one non-preemptible node and rescheduled onto another non-preemptible node, even when no preferred (preemptible) node is available. This contradicts the expectation that nodeFit: true should prevent an eviction when the preferred destination is not genuinely available, or that the Descheduler should only evict a pod if it can actually be moved to a more preferred node.

What did you expect to see?

Pods with preferredDuringSchedulingIgnoredDuringExecution affinity for preemptible nodes, while running on non-preemptible nodes, should only be evicted by RemovePodsViolatingNodeAffinity if:

  1. A preemptible node is available and has sufficient resources to accommodate the pod.
  2. The nodeFit: true logic in DefaultEvictor has confirmed that the pod can actually be scheduled onto its preferred node type before the eviction is initiated, thereby preventing unnecessary evictions onto other, less preferred (but still allowed by preferredDuringSchedulingIgnoredDuringExecution) node types.

Instead, the observed behavior is an inefficient and disruptive eviction loop between non-preemptible nodes, even when preemptible nodes become available.

How to reproduce it (as minimally and precisely as possible):

  1. Cluster Setup:

    • Kubernetes cluster (e.g., OKE) with two node pools:
      • Preemptible Node Pool: Fixed number of nodes (e.g., 3 nodes), with label oci.oraclecloud.com/oke-is-preemptible: "true". (No Cluster Autoscaler on this pool).
      • On-Demand Node Pool: Single node (e.g., 1 node), with Cluster Autoscaler enabled. (These nodes do not have the oci.oraclecloud.com/oke-is-preemptible label).
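
    For illustration only, a node in the preemptible pool carries the label below; the node name is hypothetical, and only the label key and value come from the setup described above:

    apiVersion: v1
    kind: Node
    metadata:
      name: preemptible-node-0            # hypothetical name, for illustration only
      labels:
        oci.oraclecloud.com/oke-is-preemptible: "true"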
  2. Pod Affinity:
    Deploy an application (e.g., app-test) with the following nodeAffinity rule:

    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
            - key: oci.oraclecloud.com/oke-is-preemptible
              operator: In
              values:
              - "true"
          weight: 100 # High preference for preemptible nodes
        - preference:
            matchExpressions:
            - key: oci.oraclecloud.com/oke-is-preemptible
              operator: NotIn
              values:
              - "true"
          weight: 50 # Lower preference for non-preemptible nodes (fallback)
  3. Descheduler Deployment (via Helm):
    Install Descheduler v0.33.0 with the values.yaml shown earlier in this issue; the critical parts (the RemovePodsViolatingNodeAffinity plugin enabled for both affinity types, and DefaultEvictor with nodeFit: true) are excerpted below:
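
    (Excerpt from the full values.yaml above; no settings beyond those already shown are introduced here.)

    deschedulerPolicy:
      profiles:
        - name: default
          pluginConfig:
            - name: DefaultEvictor
              args:
                nodeFit: true   # other DefaultEvictor args from the full values.yaml omitted
            - name: RemovePodsViolatingNodeAffinity
              args:
                nodeAffinityType:
                - requiredDuringSchedulingIgnoredDuringExecution
                - preferredDuringSchedulingIgnoredDuringExecution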

  4. Scenario:

    • Initially, pods are on preemptible nodes.
    • Simulate a preemptible node being terminated by the cloud provider.
    • The app-test pods are rescheduled onto on-demand nodes (as per preferredDuringSchedulingIgnoredDuringExecution fallback).
    • Later, a preemptible node comes back online and is Ready with available resources.

Observed Behavior:
The Descheduler continually evicts app-test pods from on-demand nodes; they are then rescheduled onto other on-demand nodes, or back onto the same on-demand node if capacity allows, forming an infinite eviction loop. This happens even when a preemptible node is available but not suitable.
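
As a point of comparison, below is a sketch of the same plugin entry with the strategy restricted to required affinity only. This is not proposed as a fix, just an illustration of which setting drives the behavior, under the assumption that the plugin then ignores pods that declare only preferred rules (and therefore stops evicting them at all):

    - name: RemovePodsViolatingNodeAffinity
      args:
        nodeAffinityType:
        # sketch: list only the required (hard) type; pods declaring only
        # preferred (soft) rules are assumed to be skipped by the plugin
        - requiredDuringSchedulingIgnoredDuringExecution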

Metadata

Labels: kind/bug, lifecycle/rotten