[Feature]: Pressure Stall Information (PSI) from Linux hosts #8082

@alpineQ

Component

Instrumentation: host

Problem Statement

Linux kernel 4.20+ exposes Pressure Stall Information (PSI), which quantifies resource contention and helps locate performance bottlenecks. PSI tracks the share of time tasks spend stalled waiting for CPU, memory, or I/O, and reports two variants per resource: some (at least one task stalled) and full (all non-idle tasks stalled at the same time).

Currently, the OpenTelemetry Go host metrics instrumentation does not collect PSI metrics, which are increasingly used by modern observability platforms and are particularly valuable for:

  • Detecting resource saturation before traditional utilization metrics show problems
  • Identifying performance degradation in containerized environments
  • Understanding the real impact of resource limits on application performance
  • Proactive capacity planning and alerting

These metrics are available at /proc/pressure/{cpu,memory,io} and are already widely adopted by tools like systemd, Facebook's resource management systems, and various monitoring solutions.

Proposed Solution

Implement a new PSI metrics collector within the host metrics instrumentation package that:

  1. Reads PSI files from /proc/pressure/ for CPU, memory, and I/O

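  2. Parses the file format, where avg10, avg60, and avg300 are percentages of time stalled over the last 10, 60, and 300 seconds, and total is the cumulative stall time in microseconds:

    some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
    full avg10=0.00 avg60=0.00 avg300=0.00 total=67890
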
  3. Exposes metrics following OpenTelemetry semantic conventions:

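    • system.psi.cpu.some.pct - Percentage of time at least one task was stalled waiting for CPU
    • system.psi.cpu.full.pct - Percentage of time all non-idle tasks were stalled on CPU (at the system level the kernel reports this line only since 5.13, and always as zero)
    • system.psi.memory.some.pct - Percentage of time at least one task was stalled on memory
    • system.psi.memory.full.pct - Percentage of time all non-idle tasks were stalled on memory
    • system.psi.io.some.pct - Percentage of time at least one task was stalled on I/O
    • system.psi.io.full.pct - Percentage of time all non-idle tasks were stalled on I/O

  4. Implementation considerations: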

    • Make PSI collection opt-in or configurable
    • Add appropriate unit tests and documentation
    • Consider the collection interval (the kernel updates the PSI averages roughly every 2 seconds, so polling more often yields no new avg values)

Alternatives

No response

Prior Art

No response

Additional Context
