
Descheduler creates eviction loop for pods with preferredDuringSchedulingIgnoredDuringExecution affinity #1707

@hamc

Description

What version of descheduler are you using?

descheduler version: v0.33.0

Does this issue reproduce with the latest release?

Yes

Which descheduler CLI options are you using?

Please provide a copy of your descheduler policy config file

# CronJob or Deployment
kind: CronJob

image:
  repository: registry.k8s.io/descheduler/descheduler
  # Overrides the image tag whose default is the chart version
  tag: ""
  pullPolicy: IfNotPresent

imagePullSecrets:
#   - name: container-registry-secret

resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi

ports:
  - containerPort: 10258
    protocol: TCP

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  privileged: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000

# podSecurityContext -- [Security context for pod](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/)
podSecurityContext: {}
  # fsGroup: 1000

nameOverride: ""
fullnameOverride: ""

# -- Override the deployment namespace; defaults to .Release.Namespace
namespaceOverride: ""

# labels that'll be applied to all resources
commonLabels: {}

cronJobApiVersion: "batch/v1"
schedule: "*/2 * * * *"
suspend: false
# startingDeadlineSeconds: 200
# successfulJobsHistoryLimit: 3
# failedJobsHistoryLimit: 1
# ttlSecondsAfterFinished: 600
# timeZone: Etc/UTC

# Required when running as a Deployment
deschedulingInterval: 5m

# Specifies the replica count for Deployment
# Set leaderElection if you want to use more than 1 replica
# Set affinity.podAntiAffinity rule if you want to schedule onto a node
# only if that node is in the same zone as at least one already-running descheduler
replicas: 1

# Specifies whether Leader Election resources should be created
# Required when running as a Deployment
# NOTE: Leader election can't be activated if DryRun enabled
leaderElection: {}
#  enabled: true
#  leaseDuration: 15s
#  renewDeadline: 10s
#  retryPeriod: 2s
#  resourceLock: "leases"
#  resourceName: "descheduler"
#  resourceNamespace: "kube-system"

command:
- "/bin/descheduler"

cmdOptions:
  v: 3

# Recommended to use the latest Policy API version supported by the Descheduler app version
deschedulerPolicyAPIVersion: "descheduler/v1alpha2"

# deschedulerPolicy contains the policies the descheduler will execute.
deschedulerPolicy:
  # nodeSelector: "key1=value1,key2=value2"
  # maxNoOfPodsToEvictPerNode: 10
  # maxNoOfPodsToEvictPerNamespace: 10
  # metricsProviders:
  # - source: KubernetesMetrics
  # tracing:
  #   collectorEndpoint: otel-collector.observability.svc.cluster.local:4317
  #   transportCert: ""
  #   serviceName: ""
  #   serviceNamespace: ""
  #   sampleRate: 1.0
  #   fallbackToNoOpProviderOnError: true
  profiles:
    - name: default
      pluginConfig:
        - name: DefaultEvictor
          args:
            ignorePvcPods: true
            evictLocalStoragePods: true
            nodeFit: true
        - name: RemoveDuplicates
        - name: RemovePodsHavingTooManyRestarts
          args:
            podRestartThreshold: 100
            includingInitContainers: true
        - name: RemovePodsViolatingNodeAffinity
          args:
            nodeAffinityType:
            - requiredDuringSchedulingIgnoredDuringExecution
            - preferredDuringSchedulingIgnoredDuringExecution
        - name: RemovePodsViolatingNodeTaints
        - name: RemovePodsViolatingInterPodAntiAffinity
        - name: RemovePodsViolatingTopologySpreadConstraint
        - name: LowNodeUtilization
          args:
            thresholds:
              "cpu" : 20
              "memory": 20
              "pods": 50
            targetThresholds:
              "cpu" : 50
              "memory": 50
              "pods": 50
      plugins:
        balance:
          enabled:
            - LowNodeUtilization
        deschedule:
          enabled:
            - RemovePodsViolatingInterPodAntiAffinity

priorityClassName: system-cluster-critical

nodeSelector: {}
#  foo: bar

affinity: {}
# nodeAffinity:
#   requiredDuringSchedulingIgnoredDuringExecution:
#     nodeSelectorTerms:
#     - matchExpressions:
#       - key: kubernetes.io/e2e-az-name
#         operator: In
#         values:
#         - e2e-az1
#         - e2e-az2
#  podAntiAffinity:
#    requiredDuringSchedulingIgnoredDuringExecution:
#      - labelSelector:
#          matchExpressions:
#            - key: app.kubernetes.io/name
#              operator: In
#              values:
#                - descheduler
#        topologyKey: "kubernetes.io/hostname"
topologySpreadConstraints: []
# - maxSkew: 1
#   topologyKey: kubernetes.io/hostname
#   whenUnsatisfiable: DoNotSchedule
#   labelSelector:
#     matchLabels:
#       app.kubernetes.io/name: descheduler
tolerations: []
# - key: 'management'
#   operator: 'Equal'
#   value: 'tool'
#   effect: 'NoSchedule'

rbac:
  # Specifies whether RBAC resources should be created
  create: true

serviceAccount:
  # Specifies whether a ServiceAccount should be created
  create: true
  # The name of the ServiceAccount to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
  # Specifies custom annotations for the serviceAccount
  annotations: {}

podAnnotations: {}

podLabels: {}

dnsConfig: {}

livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 10258
    scheme: HTTPS
  initialDelaySeconds: 3
  periodSeconds: 10

service:
  enabled: false
  # @param service.ipFamilyPolicy [string], support SingleStack, PreferDualStack and RequireDualStack
  #
  ipFamilyPolicy: ""
  # @param service.ipFamilies [array] List of IP families (e.g. IPv4, IPv6) assigned to the service.
  # Ref: https://kubernetes.io/docs/concepts/services-networking/dual-stack/
  # E.g.
  # ipFamilies:
  #   - IPv6
  #   - IPv4
  ipFamilies: []

serviceMonitor:
  enabled: false
  # The namespace where Prometheus expects to find service monitors.
  # namespace: ""
  # Add custom labels to the ServiceMonitor resource
  additionalLabels: {}
    # prometheus: kube-prometheus-stack
  interval: ""
  # honorLabels: true
  insecureSkipVerify: true
  serverName: null
  metricRelabelings: []
    # - action: keep
    #   regex: 'descheduler_(build_info|pods_evicted)'
    #   sourceLabels: [__name__]
  relabelings: []
    # - sourceLabels: [__meta_kubernetes_pod_node_name]
    #   separator: ;
    #   regex: ^(.*)$
    #   targetLabel: nodename
    #   replacement: $1
    #   action: replace

What k8s version are you using (kubectl version)?

v1.24.4

What did you do?

When pods use preferredDuringSchedulingIgnoredDuringExecution node affinity, the RemovePodsViolatingNodeAffinity strategy in Kubernetes Descheduler (v0.33.0) creates an eviction loop even though nodeFit: true is enabled in the DefaultEvictor: pods are repeatedly evicted from one non-preemptible node and rescheduled onto another non-preemptible node, even when no preferred (preemptible) node is available. This contradicts the expectation that nodeFit: true should prevent an eviction when the preferred destination is not genuinely available, or that the Descheduler should only evict a pod if it can actually be moved to a more preferred node.

What did you expect to see?

Pods with preferredDuringSchedulingIgnoredDuringExecution affinity for preemptible nodes, while running on non-preemptible nodes, should only be evicted by RemovePodsViolatingNodeAffinity if:

  1. A preemptible node is available and has sufficient resources to accommodate the pod.
  2. The nodeFit: true logic in DefaultEvictor has confirmed that the pod can actually be scheduled onto its preferred node type before the eviction is initiated, thereby preventing unnecessary evictions onto other, less preferred (but still allowed by preferredDuringSchedulingIgnoredDuringExecution) node types.

Instead, the observed behavior is an inefficient and disruptive eviction loop between non-preemptible nodes, even when preemptible nodes become available.

How to reproduce it (as minimally and precisely as possible):

  1. Cluster Setup:

    • Kubernetes cluster (e.g., OKE) with two node pools:
      • Preemptible Node Pool: Fixed number of nodes (e.g., 3 nodes), with label oci.oraclecloud.com/oke-is-preemptible: "true". (No Cluster Autoscaler on this pool).
      • On-Demand Node Pool: Single node (e.g., 1 node), with Cluster Autoscaler enabled. (These nodes do not have the oci.oraclecloud.com/oke-is-preemptible label).
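
    For illustration only, a node in the preemptible pool carries the label below; the node name is hypothetical, and only the label key and value come from the setup described above:

    apiVersion: v1
    kind: Node
    metadata:
      name: preemptible-node-0            # hypothetical name, for illustration only
      labels:
        oci.oraclecloud.com/oke-is-preemptible: "true"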
  2. Pod Affinity:
    Deploy an application (e.g., app-test) with the following nodeAffinity rule:

    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
            - key: oci.oraclecloud.com/oke-is-preemptible
              operator: In
              values:
              - "true"
          weight: 100 # High preference for preemptible nodes
        - preference:
            matchExpressions:
            - key: oci.oraclecloud.com/oke-is-preemptible
              operator: NotIn
              values:
              - "true"
          weight: 50 # Lower preference for non-preemptible nodes (fallback)
  3. Descheduler Deployment (via Helm):
    Install Descheduler v0.33.0 with the values.yaml shown earlier in this issue; the critical parts (the RemovePodsViolatingNodeAffinity plugin enabled for both affinity types, and DefaultEvictor with nodeFit: true) are excerpted below:
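
    (Excerpt from the full values.yaml above; no settings beyond those already shown are introduced here.)

    deschedulerPolicy:
      profiles:
        - name: default
          pluginConfig:
            - name: DefaultEvictor
              args:
                nodeFit: true   # other DefaultEvictor args from the full values.yaml omitted
            - name: RemovePodsViolatingNodeAffinity
              args:
                nodeAffinityType:
                - requiredDuringSchedulingIgnoredDuringExecution
                - preferredDuringSchedulingIgnoredDuringExecution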

  4. Scenario:

    • Initially, pods are on preemptible nodes.
    • Simulate a preemptible node being terminated by the cloud provider.
    • The app-test pods are rescheduled onto on-demand nodes (as per preferredDuringSchedulingIgnoredDuringExecution fallback).
    • Later, a preemptible node comes back online and is Ready with available resources.

Observed Behavior:
The Descheduler continually evicts app-test pods from on-demand nodes; they are then rescheduled onto other on-demand nodes, or back onto the same on-demand node if capacity allows, forming an infinite eviction loop. This happens even when a preemptible node is available but not suitable.
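
As a point of comparison, below is a sketch of the same plugin entry with the strategy restricted to required affinity only. This is not proposed as a fix, just an illustration of which setting drives the behavior, under the assumption that the plugin then ignores pods that declare only preferred rules (and therefore stops evicting them at all):

    - name: RemovePodsViolatingNodeAffinity
      args:
        nodeAffinityType:
        # sketch: list only the required (hard) type; pods declaring only
        # preferred (soft) rules are assumed to be skipped by the plugin
        - requiredDuringSchedulingIgnoredDuringExecution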

Metadata

Labels: kind/bug, lifecycle/rotten