7 changes: 4 additions & 3 deletions README.md
@@ -210,7 +210,7 @@ The profile exposes the following customization:
- `devActualUtilizationProfile`: Enable load-aware descheduling.
- `devDeviationThresholds`: Have the thresholds be based on the average utilization.

By default, this profile will enable load-aware descheduling based on the `PrometheusCPUCombined` Prometheus query.
By default, this profile will enable load-aware descheduling based on the `PrometheusCPUMemoryCombinedProfile` Prometheus query. That query is based on a recording rule combining the impact of CPU and memory utilization and PSI pressure.
By default, the thresholds will be dynamic (based on the distance from the average utilization) and asymmetric (all the nodes below the average will be considered as underutilized to help rebalancing overutilized outliers) tolerating low deviations (10%).
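
To illustrate the dynamic, asymmetric thresholds described above, here is a minimal Go sketch. It assumes the lower bound is the cluster average itself and the upper bound is the average plus the 10-point tolerance, with all values expressed as percentages. `classify` and `Class` are hypothetical names used only for this example; they are not part of the operator.

```go
package main

import "fmt"

// Class is a hypothetical label for how a node compares to the average.
type Class string

const (
	Underutilized Class = "underutilized"
	Balanced      Class = "appropriately utilized"
	Overutilized  Class = "overutilized"
)

// classify applies asymmetric deviation thresholds (an assumption based on
// the README text): anything below the average is underutilized, and only
// nodes more than `tolerance` points above the average are overutilized.
func classify(value, avg, tolerance float64) Class {
	switch {
	case value < avg:
		return Underutilized
	case value > avg+tolerance:
		return Overutilized
	default:
		return Balanced
	}
}

func main() {
	const avg, tolerance = 40.0, 10.0
	for _, v := range []float64{35, 45, 55} {
		fmt.Printf("%.0f%% -> %s\n", v, classify(v, avg, tolerance))
	}
}
```

With an average of 40 and a tolerance of 10, only nodes above 50 would be treated as eviction sources, while every node below 40 is a candidate destination.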

By default, this profile configures the descheduler to restrict the maximum number of overall parallel evictions to 5 and
@@ -255,6 +255,7 @@ The operator provides the following profiles:
- `PrometheusMemoryPSIPressure`: `rate(node_pressure_memory_waiting_seconds_total[1m])` (`node_pressure_memory_waiting_seconds_total` is reported in OpenShift only for nodes configured with psi=1 kernel argument)
- `PrometheusIOPSIPressure`: `rate(node_pressure_io_waiting_seconds_total[1m])` (`node_pressure_io_waiting_seconds_total` is reported in OpenShift only for nodes configured with psi=1 kernel argument)
- `PrometheusCPUCombined`: `descheduler:combined_utilization_and_pressure:avg1m` (`descheduler:combined_utilization_and_pressure:avg1m` uses a combination of CPU utilization and CPU PSI pressure based on a recording rule; CPU PSI pressure is reported in OpenShift only for nodes configured with psi=1 kernel argument)
- `PrometheusCPUMemoryCombinedProfile`: `descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m` (`descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m` uses a multidimensional combination of CPU (utilization and pressure) and memory (utilization and pressure) based on a recording rule; PSI pressure is reported in OpenShift only for nodes configured with psi=1 kernel argument)

```yaml
apiVersion: operator.openshift.io/v1
@@ -266,9 +267,9 @@ spec:
managementState: Managed
deschedulingIntervalSeconds: 3600
profiles:
- LongLifecycle
- KubeVirtRelieveAndMigrate
profileCustomizations:
devActualUtilizationProfile: PrometheusCPUUsage
devActualUtilizationProfile: PrometheusCPUMemoryCombinedProfile
```

## Descheduling modes
93 changes: 84 additions & 9 deletions bindata/assets/kube-descheduler/prometheusrule.yaml
@@ -7,33 +7,108 @@ metadata:
spec:
groups:
- name: recordingRules.rules
interval: 30s
rules:
# Base metrics (CPU and Memory utilization)
- record: descheduler:nodeutilization:cpu:avg1m
expr: avg by (instance) (1 - rate(node_cpu_seconds_total{mode='idle'}[1m]))

- record: descheduler:averageworkersutilization:cpu:avg1m
expr: avg(descheduler:nodeutilization:cpu:avg1m * on(instance) group_left(node) label_replace(kube_node_role{role="worker"}, 'instance', "$1", 'node', '(.+)'))

- record: descheduler:nodeutilization:memory:avg1m
expr: |-
(
1 - avg_over_time(node_memory_MemAvailable_bytes[1m]) /
on(instance) label_replace(kube_node_status_allocatable{resource="memory"}, 'instance', "$1", 'node', '(.+)')
) and on(instance)
label_replace(kube_node_status_allocatable{resource="memory"}, 'instance', "$1", 'node', '(.+)') > 0

- record: descheduler:averageworkersutilization:memory:avg1m
expr: avg(descheduler:nodeutilization:memory:avg1m * on(instance) group_left(node) label_replace(kube_node_role{role="worker"}, 'instance', "$1", 'node', '(.+)'))

# Pressure metrics
- record: descheduler:nodepressure:cpu:avg1m
# Return the CPU pressure only if CPU usage is over 70%; otherwise
# return zero to (partially) filter out false-positive pressure
# spikes due to CPU-limited pods.
# See: https://github.com/kubernetes/enhancements/issues/5062
expr: |-
avg by (instance) (
rate(node_pressure_cpu_waiting_seconds_total[1m])
) and (
1 - avg by (instance) (
rate(node_cpu_seconds_total{mode='idle'}[1m])
)
) > 0.7
(
avg by (instance) (rate(node_pressure_cpu_waiting_seconds_total[1m]))
and
(1 - avg by (instance) (rate(node_cpu_seconds_total{mode='idle'}[1m]))) > 0.7
)
or
(avg by (instance) (rate(node_pressure_cpu_waiting_seconds_total[1m])) * 0)

- record: descheduler:averageworkerspressure:cpu:avg1m
expr: avg(descheduler:nodepressure:cpu:avg1m * on(instance) group_left(node) label_replace(kube_node_role{role="worker"}, 'instance', "$1", 'node', '(.+)'))

- record: descheduler:nodepressure:memory:avg1m
expr: |-
avg by (instance) (
rate(node_pressure_cpu_waiting_seconds_total[1m])
) * 0
rate(node_pressure_memory_waiting_seconds_total[1m])
)

- record: descheduler:averageworkerspressure:memory:avg1m
expr: avg(descheduler:nodepressure:memory:avg1m * on(instance) group_left(node) label_replace(kube_node_role{role="worker"}, 'instance', "$1", 'node', '(.+)'))

- record: descheduler:combined_utilization_and_pressure:avg1m
expr: |-
(descheduler:nodeutilization:cpu:avg1m and on() descheduler:averageworkersutilization:cpu:avg1m < 0.8)
or
(descheduler:nodepressure:cpu:avg1m)

- record: descheduler:averageworkersutilization:memory:avg1m
expr: avg(descheduler:nodeutilization:memory:avg1m * on(instance) group_left(node) label_replace(kube_node_role{role="worker"}, 'instance', "$1", 'node', '(.+)'))

- record: descheduler:nodeutilization:memory:avg1m:positivedeviation
expr: |-
descheduler:nodeutilization:memory:avg1m - on() group_left() descheduler:averageworkersutilization:memory:avg1m
and
descheduler:nodeutilization:memory:avg1m - on() group_left() descheduler:averageworkersutilization:memory:avg1m >= 0
or
descheduler:nodeutilization:memory:avg1m * 0

- record: descheduler:nodeutilization:cpu:avg1m:positivedeviation
expr: |-
descheduler:nodeutilization:cpu:avg1m - on() group_left() descheduler:averageworkersutilization:cpu:avg1m
and
descheduler:nodeutilization:cpu:avg1m - on() group_left() descheduler:averageworkersutilization:cpu:avg1m >= 0
or
descheduler:nodeutilization:cpu:avg1m * 0

- record: descheduler:nodepressure:cpu:avg1m:positivedeviation
expr: |-
descheduler:nodepressure:cpu:avg1m - on() group_left() descheduler:averageworkerspressure:cpu:avg1m
and
descheduler:nodepressure:cpu:avg1m - on() group_left() descheduler:averageworkerspressure:cpu:avg1m >= 0
or
descheduler:nodepressure:cpu:avg1m * 0

- record: descheduler:nodepressure:memory:avg1m:positivedeviation
expr: |-
descheduler:nodepressure:memory:avg1m - on() group_left() descheduler:averageworkerspressure:memory:avg1m
and
descheduler:nodepressure:memory:avg1m - on() group_left() descheduler:averageworkerspressure:memory:avg1m >= 0
or
descheduler:nodepressure:memory:avg1m * 0

# Ideal Point Positive Distance (Euclidean distance from ideal using positive deviations)
- record: descheduler:node:ideal_point_positive_distance:avg1m
expr: |-
sqrt(
descheduler:nodeutilization:cpu:avg1m:positivedeviation ^ 2 +
descheduler:nodepressure:cpu:avg1m:positivedeviation ^ 2 +
descheduler:nodeutilization:memory:avg1m:positivedeviation ^ 2 +
descheduler:nodepressure:memory:avg1m:positivedeviation ^ 2
)

# Linear Amplified Ideal Point Positive Distance (k=3.0) - Amplified by 3x, clamped to [0,1]
- record: descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m
expr: |-
clamp_max(
3 * descheduler:node:ideal_point_positive_distance:avg1m,
1.0
)
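
The recording rules above can be read as a small numeric pipeline: gate CPU pressure on CPU usage, keep only the positive deviation of each dimension from the worker average, take the Euclidean norm of the four deviations, multiply by 3, and cap the result at 1. The Go sketch below restates that arithmetic for a single node. `NodeSample`, `gatedCPUPressure`, `positiveDeviation`, and `Score` are hypothetical names for illustration only; the operator evaluates all of this in PromQL.

```go
package main

import (
	"fmt"
	"math"
)

// NodeSample is a hypothetical container for the four per-node inputs,
// each expressed as a ratio in [0,1].
type NodeSample struct {
	CPUUtil     float64
	CPUPressure float64
	MemUtil     float64
	MemPressure float64
}

// gatedCPUPressure mirrors descheduler:nodepressure:cpu:avg1m: pressure is
// kept only when CPU utilization exceeds 0.7, otherwise it is zero.
func gatedCPUPressure(pressure, cpuUtil float64) float64 {
	if cpuUtil > 0.7 {
		return pressure
	}
	return 0
}

// positiveDeviation mirrors the *:positivedeviation rules: max(value-avg, 0).
func positiveDeviation(value, avg float64) float64 {
	return math.Max(value-avg, 0)
}

// Score mirrors the ideal-point distance followed by linear amplification
// (k=3 in the rule) and clamp_max(..., 1.0).
func Score(node, avg NodeSample, k float64) float64 {
	dCPU := positiveDeviation(node.CPUUtil, avg.CPUUtil)
	dCPUPressure := positiveDeviation(node.CPUPressure, avg.CPUPressure)
	dMem := positiveDeviation(node.MemUtil, avg.MemUtil)
	dMemPressure := positiveDeviation(node.MemPressure, avg.MemPressure)
	dist := math.Sqrt(dCPU*dCPU + dCPUPressure*dCPUPressure +
		dMem*dMem + dMemPressure*dMemPressure)
	return math.Min(k*dist, 1.0)
}

func main() {
	avg := NodeSample{CPUUtil: 0.5, MemUtil: 0.5}
	hot := NodeSample{CPUUtil: 0.8, CPUPressure: gatedCPUPressure(0.0, 0.8), MemUtil: 0.9}
	cold := NodeSample{CPUUtil: 0.2, MemUtil: 0.3}
	fmt.Printf("hot=%.3f cold=%.3f\n", Score(hot, avg, 3), Score(cold, avg, 3))
}
```

Because only positive deviations contribute, a node at or below the average in every dimension scores 0, and the 3x amplification means a raw distance of about 0.33 already saturates the score at 1.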
2 changes: 2 additions & 0 deletions pkg/apis/descheduler/v1/types_descheduler.go
@@ -186,6 +186,8 @@ const (
PrometheusIOPSIPressureProfile ActualUtilizationProfile = "PrometheusIOPSIPressure"
// PrometheusCPUCombinedProfile uses a combination of CPU utilization and CPU pressure based on a recording rule
PrometheusCPUCombinedProfile ActualUtilizationProfile = "PrometheusCPUCombined"
// PrometheusCPUMemoryCombinedProfile uses a multidimensional combination of CPU (utilization and pressure) and memory (utilization and pressure) based on a recording rule
PrometheusCPUMemoryCombinedProfile ActualUtilizationProfile = "PrometheusCPUMemoryCombinedProfile"
)

// Namespaces overrides included and excluded namespaces while keeping
4 changes: 3 additions & 1 deletion pkg/operator/target_config_reconciler.go
@@ -837,6 +837,8 @@ func utilizationProfileToPrometheusQuery(profile deschedulerv1.ActualUtilization
return "rate(node_pressure_io_waiting_seconds_total[1m])", nil
case deschedulerv1.PrometheusCPUCombinedProfile:
return "descheduler:combined_utilization_and_pressure:avg1m", nil
case deschedulerv1.PrometheusCPUMemoryCombinedProfile:
return "descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m", nil
default:
if !strings.HasPrefix(string(profile), "query:") {
return "", fmt.Errorf("unknown prometheus profile: %v", profile)
@@ -1092,7 +1094,7 @@ func kubeVirtRelieveAndMigrateProfile(profileCustomizations *deschedulerv1.Profi
args := profile.PluginConfigs[0].Args.Object.(*nodeutilization.LowNodeUtilizationArgs)

// profile defaults
const defaultActualUtilizationProfile = deschedulerv1.PrometheusCPUCombinedProfile
const defaultActualUtilizationProfile = deschedulerv1.PrometheusCPUMemoryCombinedProfile
args.UseDeviationThresholds = true
query, err := utilizationProfileToPrometheusQuery(defaultActualUtilizationProfile)
if err != nil {
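
For readers skimming the two hunks above, here is a standalone sketch of how a profile name might resolve to a Prometheus query. It reproduces only the two combined profiles and the `query:` prefix check visible in the diff. What happens to a `query:`-prefixed value after that check is not shown in the hunk, so trimming the prefix and using the remainder verbatim is an assumption. `profileToQuery` is a hypothetical stand-in for `utilizationProfileToPrometheusQuery`.

```go
package main

import (
	"fmt"
	"strings"
)

// ActualUtilizationProfile is redeclared locally so the sketch is self-contained.
type ActualUtilizationProfile string

const (
	PrometheusCPUCombinedProfile       ActualUtilizationProfile = "PrometheusCPUCombined"
	PrometheusCPUMemoryCombinedProfile ActualUtilizationProfile = "PrometheusCPUMemoryCombinedProfile"
)

// profileToQuery maps a profile to the recording rule it queries.
func profileToQuery(profile ActualUtilizationProfile) (string, error) {
	switch profile {
	case PrometheusCPUCombinedProfile:
		return "descheduler:combined_utilization_and_pressure:avg1m", nil
	case PrometheusCPUMemoryCombinedProfile:
		return "descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m", nil
	default:
		if !strings.HasPrefix(string(profile), "query:") {
			return "", fmt.Errorf("unknown prometheus profile: %v", profile)
		}
		// Assumption: the text after "query:" is passed through as the query.
		return strings.TrimPrefix(string(profile), "query:"), nil
	}
}

func main() {
	for _, p := range []ActualUtilizationProfile{
		PrometheusCPUMemoryCombinedProfile,
		"query:up",
		"NoSuchProfile",
	} {
		q, err := profileToQuery(p)
		fmt.Printf("%q -> %q (err: %v)\n", p, q, err)
	}
}
```

Since `kubeVirtRelieveAndMigrateProfile` now defaults to `PrometheusCPUMemoryCombinedProfile`, an unset `devActualUtilizationProfile` resolves to the `k3` recording rule, which matches the updated testdata files that follow.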
@@ -20,7 +20,7 @@ profiles:
- openshift-kube-scheduler
metricsUtilization:
prometheus:
query: descheduler:combined_utilization_and_pressure:avg1m
query: descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m
source: Prometheus
targetThresholds:
MetricResource: 10
@@ -20,7 +20,7 @@ profiles:
- openshift-kube-scheduler
metricsUtilization:
prometheus:
query: descheduler:combined_utilization_and_pressure:avg1m
query: descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m
source: Prometheus
targetThresholds:
MetricResource: 30
@@ -20,7 +20,7 @@ profiles:
- openshift-kube-scheduler
metricsUtilization:
prometheus:
query: descheduler:combined_utilization_and_pressure:avg1m
query: descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m
source: Prometheus
targetThresholds:
MetricResource: 10
@@ -20,7 +20,7 @@ profiles:
- openshift-kube-scheduler
metricsUtilization:
prometheus:
query: descheduler:combined_utilization_and_pressure:avg1m
query: descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m
source: Prometheus
targetThresholds:
MetricResource: 20
@@ -20,7 +20,7 @@ profiles:
- openshift-kube-scheduler
metricsUtilization:
prometheus:
query: descheduler:combined_utilization_and_pressure:avg1m
query: descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m
source: Prometheus
targetThresholds:
MetricResource: 10
@@ -20,7 +20,7 @@ profiles:
- openshift-kube-scheduler
metricsUtilization:
prometheus:
query: descheduler:combined_utilization_and_pressure:avg1m
query: descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m
source: Prometheus
targetThresholds:
MetricResource: 70
@@ -20,7 +20,7 @@ profiles:
- openshift-kube-scheduler
metricsUtilization:
prometheus:
query: descheduler:combined_utilization_and_pressure:avg1m
query: descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m
source: Prometheus
targetThresholds:
MetricResource: 10
@@ -20,7 +20,7 @@ profiles:
- openshift-kube-scheduler
metricsUtilization:
prometheus:
query: descheduler:combined_utilization_and_pressure:avg1m
query: descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m
source: Prometheus
targetThresholds:
MetricResource: 30
@@ -20,7 +20,7 @@ profiles:
- openshift-kube-scheduler
metricsUtilization:
prometheus:
query: descheduler:combined_utilization_and_pressure:avg1m
query: descheduler:node:linear_amplified_ideal_point_positive_distance:k3:avg1m
source: Prometheus
targetThresholds:
MetricResource: 50