Description
What version of descheduler are you using?
descheduler version: v0.33.0
Does this issue reproduce with the latest release?
Yes
Which descheduler CLI options are you using?
Please provide a copy of your descheduler policy config file
# CronJob or Deployment
kind: CronJob
image:
  repository: registry.k8s.io/descheduler/descheduler
  # Overrides the image tag whose default is the chart version
  tag: ""
  pullPolicy: IfNotPresent
imagePullSecrets:
#   - name: container-registry-secret
resources:
  requests:
    cpu: 500m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 256Mi
ports:
  - containerPort: 10258
    protocol: TCP
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  privileged: false
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
# podSecurityContext -- [Security context for pod](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/)
podSecurityContext: {}
#   fsGroup: 1000
nameOverride: ""
fullnameOverride: ""
# -- Override the deployment namespace; defaults to .Release.Namespace
namespaceOverride: ""
# labels that'll be applied to all resources
commonLabels: {}
cronJobApiVersion: "batch/v1"
schedule: "*/2 * * * *"
suspend: false
# startingDeadlineSeconds: 200
# successfulJobsHistoryLimit: 3
# failedJobsHistoryLimit: 1
# ttlSecondsAfterFinished 600
# timeZone: Etc/UTC
# Required when running as a Deployment
deschedulingInterval: 5m
# Specifies the replica count for Deployment
# Set leaderElection if you want to use more than 1 replica
# Set affinity.podAntiAffinity rule if you want to schedule onto a node
# only if that node is in the same zone as at least one already-running descheduler
replicas: 1
# Specifies whether Leader Election resources should be created
# Required when running as a Deployment
# NOTE: Leader election can't be activated if DryRun enabled
leaderElection: {}
#   enabled: true
#   leaseDuration: 15s
#   renewDeadline: 10s
#   retryPeriod: 2s
#   resourceLock: "leases"
#   resourceName: "descheduler"
#   resourceNamespace: "kube-system"
command:
  - "/bin/descheduler"
cmdOptions:
  v: 3
# Recommended to use the latest Policy API version supported by the Descheduler app version
deschedulerPolicyAPIVersion: "descheduler/v1alpha2"
# deschedulerPolicy contains the policies the descheduler will execute.
deschedulerPolicy:
  # nodeSelector: "key1=value1,key2=value2"
  # maxNoOfPodsToEvictPerNode: 10
  # maxNoOfPodsToEvictPerNamespace: 10
  # metricsProviders:
  # - source: KubernetesMetrics
  # tracing:
  #   collectorEndpoint: otel-collector.observability.svc.cluster.local:4317
  #   transportCert: ""
  #   serviceName: ""
  #   serviceNamespace: ""
  #   sampleRate: 1.0
  #   fallbackToNoOpProviderOnError: true
  profiles:
    - name: default
      pluginConfig:
        - name: DefaultEvictor
          args:
            ignorePvcPods: true
            evictLocalStoragePods: true
            nodeFit: true
        - name: RemoveDuplicates
        - name: RemovePodsHavingTooManyRestarts
          args:
            podRestartThreshold: 100
            includingInitContainers: true
        - name: RemovePodsViolatingNodeAffinity
          args:
            nodeAffinityType:
              - requiredDuringSchedulingIgnoredDuringExecution
              - preferredDuringSchedulingIgnoredDuringExecution
        - name: RemovePodsViolatingNodeTaints
        - name: RemovePodsViolatingInterPodAntiAffinity
        - name: RemovePodsViolatingTopologySpreadConstraint
        - name: LowNodeUtilization
          args:
            thresholds:
              "cpu": 20
              "memory": 20
              "pods": 50
            targetThresholds:
              "cpu": 50
              "memory": 50
              "pods": 50
      plugins:
        balance:
          enabled:
            - LowNodeUtilization
        deschedule:
          enabled:
            - RemovePodsViolatingInterPodAntiAffinity
priorityClassName: system-cluster-critical
nodeSelector: {}
#   foo: bar
affinity: {}
# nodeAffinity:
#   requiredDuringSchedulingIgnoredDuringExecution:
#     nodeSelectorTerms:
#       - matchExpressions:
#           - key: kubernetes.io/e2e-az-name
#             operator: In
#             values:
#               - e2e-az1
#               - e2e-az2
# podAntiAffinity:
#   requiredDuringSchedulingIgnoredDuringExecution:
#     - labelSelector:
#         matchExpressions:
#           - key: app.kubernetes.io/name
#             operator: In
#             values:
#               - descheduler
#       topologyKey: "kubernetes.io/hostname"
topologySpreadConstraints: []
# - maxSkew: 1
#   topologyKey: kubernetes.io/hostname
#   whenUnsatisfiable: DoNotSchedule
#   labelSelector:
#     matchLabels:
#       app.kubernetes.io/name: descheduler
tolerations: []
# - key: 'management'
#   operator: 'Equal'
#   value: 'tool'
#   effect: 'NoSchedule'
rbac:
  # Specifies whether RBAC resources should be created
  create: true
serviceAccount:
  # Specifies whether a ServiceAccount should be created
  create: true
  # The name of the ServiceAccount to use.
  # If not set and create is true, a name is generated using the fullname template
  name:
  # Specifies custom annotations for the serviceAccount
  annotations: {}
podAnnotations: {}
podLabels: {}
dnsConfig: {}
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /healthz
    port: 10258
    scheme: HTTPS
  initialDelaySeconds: 3
  periodSeconds: 10
service:
  enabled: false
  # @param service.ipFamilyPolicy [string], support SingleStack, PreferDualStack and RequireDualStack
  ipFamilyPolicy: ""
  # @param service.ipFamilies [array] List of IP families (e.g. IPv4, IPv6) assigned to the service.
  # Ref: https://kubernetes.io/docs/concepts/services-networking/dual-stack/
  # E.g.
  # ipFamilies:
  #   - IPv6
  #   - IPv4
  ipFamilies: []
serviceMonitor:
  enabled: false
  # The namespace where Prometheus expects to find service monitors.
  # namespace: ""
  # Add custom labels to the ServiceMonitor resource
  additionalLabels: {}
  #   prometheus: kube-prometheus-stack
  interval: ""
  # honorLabels: true
  insecureSkipVerify: true
  serverName: null
  metricRelabelings: []
  # - action: keep
  #   regex: 'descheduler_(build_info|pods_evicted)'
  #   sourceLabels: [__name__]
  relabelings: []
  # - sourceLabels: [__meta_kubernetes_pod_node_name]
  #   separator: ;
  #   regex: ^(.*)$
  #   targetLabel: nodename
  #   replacement: $1
  #   action: replace
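Since the template asks for the policy file itself: the deschedulerPolicy block above should render into roughly the following policy. This is a hand-written sketch of the generated policy.yaml (abbreviated, not copied from the cluster):

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: DefaultEvictor
        args:
          ignorePvcPods: true
          evictLocalStoragePods: true
          nodeFit: true
      - name: RemovePodsViolatingNodeAffinity
        args:
          nodeAffinityType:
            - requiredDuringSchedulingIgnoredDuringExecution
            - preferredDuringSchedulingIgnoredDuringExecution
      # ...remaining pluginConfig entries as in the values above
    plugins:
      balance:
        enabled:
          - LowNodeUtilization
      deschedule:
        enabled:
          - RemovePodsViolatingInterPodAntiAffinity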
What k8s version are you using (kubectl version)?
v1.24.4
What did you do?
When pods use preferredDuringSchedulingIgnoredDuringExecution node affinity, the RemovePodsViolatingNodeAffinity strategy in Kubernetes Descheduler v0.33.0 produces an eviction loop even though nodeFit: true is enabled in the DefaultEvictor: pods are repeatedly evicted from one non-preemptible node and rescheduled onto another non-preemptible node, even when no preferred (preemptible) node is available. This contradicts the expectation that nodeFit: true should prevent an eviction when the preferred destination is not genuinely available, i.e. that the Descheduler should only evict a pod if it can actually be moved to a more preferred node.
What did you expect to see?
Pods with preferredDuringSchedulingIgnoredDuringExecution affinity for preemptible nodes, when running on non-preemptible nodes, should only be evicted by RemovePodsViolatingNodeAffinity if:
- A preemptible node is available and has sufficient resources to accommodate the pod.
- The nodeFit: true logic within DefaultEvictor should effectively determine whether the pod can be scheduled onto its preferred node type before initiating an eviction, thereby preventing unnecessary evictions to other, less preferred (but still allowed by preferredDuringSchedulingIgnoredDuringExecution) node types.
The observed behavior leads to an inefficient and disruptive eviction loop between non-preemptible nodes, even when preemptible nodes become available.
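For comparison, limiting the plugin to required affinity only keeps preferred-only violations out of scope entirely. This is a sketch of a narrower pluginConfig entry, not a confirmed fix; it assumes the preferred-affinity handling is what triggers the loop:

- name: RemovePodsViolatingNodeAffinity
  args:
    nodeAffinityType:
      - requiredDuringSchedulingIgnoredDuringExecution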
How to reproduce it (as minimally and precisely as possible):

- Cluster Setup:
  - Kubernetes cluster (e.g., OKE) with two node pools:
    - Preemptible Node Pool: Fixed number of nodes (e.g., 3 nodes), with label oci.oraclecloud.com/oke-is-preemptible: "true". (No Cluster Autoscaler on this pool.)
    - On-Demand Node Pool: Single node (e.g., 1 node), with Cluster Autoscaler enabled. (These nodes do not have the oci.oraclecloud.com/oke-is-preemptible label.)

- Pod Affinity:
  Deploy an application (e.g., app-test) with the following nodeAffinity rule (a complete manifest sketch follows below):

  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - preference:
            matchExpressions:
              - key: oci.oraclecloud.com/oke-is-preemptible
                operator: In
                values:
                  - "true"
          weight: 100 # High preference for preemptible nodes
        - preference:
            matchExpressions:
              - key: oci.oraclecloud.com/oke-is-preemptible
                operator: NotIn
                values:
                  - "true"
          weight: 50 # Lower preference for non-preemptible nodes (fallback)
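  For completeness, a minimal Deployment sketch that reproduces the workload side; the name, labels, image, and resource requests are placeholders, not the real app-test manifest:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: app-test
  spec:
    replicas: 3
    selector:
      matchLabels:
        app: app-test
    template:
      metadata:
        labels:
          app: app-test
      spec:
        affinity:
          nodeAffinity:
            preferredDuringSchedulingIgnoredDuringExecution:
              - preference:
                  matchExpressions:
                    - key: oci.oraclecloud.com/oke-is-preemptible
                      operator: In
                      values:
                        - "true"
                weight: 100 # prefer preemptible nodes
              - preference:
                  matchExpressions:
                    - key: oci.oraclecloud.com/oke-is-preemptible
                      operator: NotIn
                      values:
                        - "true"
                weight: 50 # fallback to non-preemptible nodes
        containers:
          - name: app-test
            image: nginx # placeholder image
            resources:
              requests:
                cpu: 100m
                memory: 128Mi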
- Descheduler Deployment (via Helm):
  Install Descheduler v0.33.0 with a values.yaml similar to the one shown above (the critical parts are RemovePodsViolatingNodeAffinity enabled and DefaultEvictor with nodeFit: true).

- Scenario:
  - Initially, pods are on preemptible nodes.
  - Simulate a preemptible node being terminated by the cloud provider.
  - The app-test pods are rescheduled onto on-demand nodes (as per the preferredDuringSchedulingIgnoredDuringExecution fallback).
  - Later, a preemptible node comes back online and is Ready with available resources.
Observed Behavior:
The Descheduler continually evicts app-test pods from on-demand nodes, and they are then rescheduled onto other on-demand nodes or back onto the same on-demand node (if capacity allows), forming an infinite eviction loop. This happens even when a preemptible node is available but not suitable.
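Because the evictions go through the Eviction API, a PodDisruptionBudget at least limits how much of the workload the loop can churn at once. A minimal sketch (the app: app-test selector matches the placeholder labels above; this only dampens the symptom and does not address the nodeFit behavior):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-test
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: app-test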