@jianyun8023 commented Nov 17, 2025

Implement Prometheus metrics support in Scrutiny

  • Added configuration options for enabling Prometheus metrics in example.scrutiny.yaml and config.go.
  • Introduced a new metrics package to handle metrics collection and registration.
  • Created a ScrutinyCollector to gather device metrics and expose them via a /metrics endpoint.
  • Updated the web server to conditionally register the metrics endpoint based on configuration.
  • Implemented caching for device details to optimize metrics collection.
  • Added tests for utility functions related to metrics sanitization and parsing.

This commit enhances the monitoring capabilities of Scrutiny by integrating Prometheus metrics support.

While testing, I found a bug; the related PR is #829.

@jianyun8023 (Author)

I also created a Grafana dashboard. If you need one, you can use this example.
grafana_dashboard.json

@jianyun8023 (Author)

Change Summary

1. Dependency Updates (go.mod, go.sum)

  • Added Prometheus client library: github.com/prometheus/client_golang v1.17.0
  • Updated related dependency versions:
    • golang.org/x/sync: v0.1.0 → v0.3.0
    • golang.org/x/net: v0.8.0 → v0.10.0
    • golang.org/x/sys: v0.7.0 → v0.11.0
    • golang.org/x/text: v0.8.0 → v0.9.0
    • google.golang.org/protobuf: v1.28.1 → v1.31.0

2. Configuration Management (webapp/backend/pkg/config/config.go)

  • Added configuration option: web.metrics.enabled (default: true)
  • Allows enabling/disabling Prometheus metrics endpoint via configuration file

3. Configuration Example (example.scrutiny.yaml)

  • Added web.metrics configuration section
  • Added enabled option to control metrics endpoint

4. Core Metrics Collector (webapp/backend/pkg/metrics/collector.go)

New file implementing complete Prometheus metrics collection functionality:

Main Components

  • Collector struct: Manages metrics data for all devices
    • Uses sync.RWMutex for thread-safe operations
    • Stores device metrics data in memory (keyed by WWN)
    • Maintains independent Prometheus Registry
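As a rough illustration of the locking pattern described above, here is a minimal sketch of an in-memory store keyed by WWN and guarded by sync.RWMutex. The type and method names (deviceMetrics, store, update, snapshot) are hypothetical stand-ins and are not taken from the PR; the real Collector also owns a Prometheus registry, which is left out here.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// deviceMetrics is a hypothetical stand-in for the PR's DeviceMetricsData.
type deviceMetrics struct {
	WWN         string
	Temperature int64
	UpdatedAt   time.Time
}

// store keeps the latest metrics per device, keyed by WWN.
type store struct {
	mu      sync.RWMutex
	devices map[string]deviceMetrics
}

func newStore() *store {
	return &store{devices: make(map[string]deviceMetrics)}
}

// update replaces the cached entry for one device under the write lock.
func (s *store) update(m deviceMetrics) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.devices[m.WWN] = m
}

// snapshot copies the cache under the read lock, so callers can iterate
// over the result without holding the lock.
func (s *store) snapshot() []deviceMetrics {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]deviceMetrics, 0, len(s.devices))
	for _, m := range s.devices {
		out = append(out, m)
	}
	return out
}

func main() {
	s := newStore()
	s.update(deviceMetrics{WWN: "0x5000c500a1b2c3d4", Temperature: 34, UpdatedAt: time.Now()})
	s.update(deviceMetrics{WWN: "0x5000c500a1b2c3d4", Temperature: 36, UpdatedAt: time.Now()})
	for _, m := range s.snapshot() {
		fmt.Println(m.WWN, m.Temperature)
	}
}
```

Copying into a slice before returning keeps the read lock short, which matters if a scrape and a device upload arrive at the same time.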

Core Functionality

  1. Initialization

    • Registers Go runtime metrics (memory, GC, goroutines, etc.)
    • Registers custom device metrics collector
  2. Data Updates

    • UpdateDeviceMetrics(): Updates device metrics in real-time (called when device data is uploaded)
    • LoadInitialData(): Asynchronously loads initial data from database at startup
  3. Metrics Collection (Collect() method)

    • collectDeviceInfo(): Device information metrics
    • collectDeviceCapacity(): Device capacity metrics
    • collectDeviceStatus(): Device status metrics
    • collectSmartAttributes(): SMART attribute metrics
    • collectSummaryMetrics(): Summary metrics (temperature, runtime, etc.)
    • collectStatistics(): Statistics metrics (total devices, by protocol, etc.)

Exported Metric Types

  • scrutiny_device_info: Device information (labels: wwn, device_name, model_name, serial_number, firmware, protocol, host_id, form_factor)
  • scrutiny_device_capacity_bytes: Device capacity in bytes
  • scrutiny_device_status: Device status (0=passed, 1=failed)
  • scrutiny_smart_temperature_celsius: Device temperature in Celsius
  • scrutiny_smart_power_on_hours: Device power-on hours
  • scrutiny_smart_power_cycle_count: Device power cycle count
  • scrutiny_smart_collector_timestamp: Data collection timestamp
  • scrutiny_devices_total: Total number of monitored devices
  • scrutiny_devices_by_protocol: Number of devices by protocol
  • scrutiny_smart_*: Dynamic SMART attribute metrics (auto-generated based on device attributes)
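To make the list above more concrete, the sketch below renders samples in the Prometheus text exposition format using only the standard library. The helper formatSample is hypothetical, and attaching a wwn label to the temperature metric is an assumption (the summary only lists labels for scrutiny_device_info). The PR itself relies on the Prometheus client library to produce this output.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// labelEscaper escapes the characters that must be escaped inside label
// values in the text format: backslash, double quote, and newline.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// formatSample renders one sample line: name{label="value",...} value.
// Labels are sorted by name so the output is deterministic.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(name)
	if len(keys) > 0 {
		b.WriteByte('{')
		for i, k := range keys {
			if i > 0 {
				b.WriteByte(',')
			}
			b.WriteString(k)
			b.WriteString(`="`)
			b.WriteString(labelEscaper.Replace(labels[k]))
			b.WriteByte('"')
		}
		b.WriteByte('}')
	}
	b.WriteByte(' ')
	b.WriteString(strconv.FormatFloat(value, 'g', -1, 64))
	return b.String()
}

func main() {
	// The label set here is assumed for illustration.
	fmt.Println(formatSample("scrutiny_smart_temperature_celsius",
		map[string]string{"wwn": "0x5000c500a1b2c3d4"}, 34))
	fmt.Println(formatSample("scrutiny_devices_total", nil, 2))
}
```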

5. Utility Functions (webapp/backend/pkg/metrics/utils.go)

New file providing helper functions:

  • SanitizeMetricName(): Converts strings to valid Prometheus metric names
  • TryParseFloat(): Attempts to convert various types to float64 (supports int, int64, float32, float64, string, hexadecimal strings)
  • SelectLatestSmartResult(): Selects the latest record from SMART results list
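For readers unfamiliar with what such helpers do, here is a minimal sketch of name sanitization and lenient float parsing. The lowercase names and the exact rules (lowercasing, collapsing invalid characters to a single underscore, prefixing names that start with a digit, treating "0x" strings as unsigned hex integers) are assumptions for illustration and may differ from the PR's SanitizeMetricName() and TryParseFloat().

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// invalidChars matches runs of characters not allowed in this sketch's
// metric names (colons are deliberately excluded as well).
var invalidChars = regexp.MustCompile(`[^a-zA-Z0-9_]+`)

// sanitizeMetricName lowercases s and replaces invalid runs with "_".
func sanitizeMetricName(s string) string {
	s = strings.ToLower(strings.TrimSpace(s))
	s = invalidChars.ReplaceAllString(s, "_")
	s = strings.Trim(s, "_")
	if s == "" {
		return "_"
	}
	if s[0] >= '0' && s[0] <= '9' {
		s = "_" + s // metric names must not start with a digit
	}
	return s
}

// tryParseFloat converts common value types to float64.
// The second return value is false when conversion is not possible.
func tryParseFloat(v interface{}) (float64, bool) {
	switch x := v.(type) {
	case int:
		return float64(x), true
	case int64:
		return float64(x), true
	case float32:
		return float64(x), true
	case float64:
		return x, true
	case string:
		s := strings.TrimSpace(x)
		// strconv.ParseFloat rejects plain hex integers such as "0x1F",
		// so they are handled explicitly.
		if strings.HasPrefix(s, "0x") || strings.HasPrefix(s, "0X") {
			u, err := strconv.ParseUint(s[2:], 16, 64)
			if err != nil {
				return 0, false
			}
			return float64(u), true
		}
		f, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return 0, false
		}
		return f, true
	default:
		return 0, false
	}
}

func main() {
	fmt.Println(sanitizeMetricName("Reallocated Sectors Count"))
	fmt.Println(tryParseFloat("0x1F"))
	fmt.Println(tryParseFloat("n/a"))
}
```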

6. Unit Tests (webapp/backend/pkg/metrics/utils_test.go)

New file containing unit tests for the utility functions:

  • TestSanitizeMetricName(): Tests metric name sanitization
  • TestTryParseFloat(): Tests type conversion functionality
  • TestSelectLatestSmartResult(): Tests latest SMART result selection
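The behavior that TestSelectLatestSmartResult() exercises could look roughly like the sketch below: pick the record with the newest timestamp and report whether anything was found. The smartResult type, its fields, and the helper at are hypothetical; the PR's SelectLatestSmartResult() operates on Scrutiny's own SMART models and may order or compare results differently.

```go
package main

import (
	"fmt"
	"time"
)

// smartResult is a hypothetical, trimmed-down SMART record.
type smartResult struct {
	Date time.Time
	Temp int64
}

// at builds a timestamp from Unix seconds; it only keeps the example short.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

// selectLatest returns the record with the newest Date.
// The boolean is false when the list is empty.
func selectLatest(results []smartResult) (smartResult, bool) {
	if len(results) == 0 {
		return smartResult{}, false
	}
	latest := results[0]
	for _, r := range results[1:] {
		if r.Date.After(latest.Date) {
			latest = r
		}
	}
	return latest, true
}

func main() {
	results := []smartResult{{at(100), 30}, {at(300), 36}, {at(200), 33}}
	if r, ok := selectLatest(results); ok {
		fmt.Println(r.Temp)
	}
}
```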

7. Data Models (webapp/backend/pkg/models/metrics/types.go)

New file defining metrics data structures:

  • DeviceMetricsData: Stores metrics data for a single device (device info + SMART data + update time)

8. HTTP Handler (webapp/backend/pkg/web/handler/get_metrics.go)

New file implementing Prometheus metrics endpoint:

  • GetMetrics(): Handles /api/metrics GET requests
  • Retrieves metrics collector from Gin context
  • Uses promhttp.HandlerFor() to generate Prometheus format response

9. Middleware (webapp/backend/pkg/web/middleware/metrics.go)

New file providing metrics collector injection:

  • MetricsMiddleware(): Injects metrics collector into Gin context
  • Allows handler functions to access metrics collector
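The inject-then-retrieve pattern described in sections 8 and 9 can be sketched with the standard library alone. This is not the PR's code: it swaps Gin's context for net/http plus context.WithValue, writes a single hand-formatted metric instead of calling promhttp.HandlerFor(), and every name in it (collector, withCollector, getMetrics, newHandler, scrape) is hypothetical.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// ctxKey is an unexported key type, which avoids collisions with other
// packages that store values in the request context.
type ctxKey struct{}

// collector is a hypothetical stand-in for the PR's metrics collector.
type collector struct{ devices int }

// withCollector injects c into the context of every request.
func withCollector(c *collector, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx := context.WithValue(r.Context(), ctxKey{}, c)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// getMetrics reads the collector back out of the context and writes one
// metric in the Prometheus text format.
func getMetrics(w http.ResponseWriter, r *http.Request) {
	c, ok := r.Context().Value(ctxKey{}).(*collector)
	if !ok {
		http.Error(w, "metrics collector not available", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
	fmt.Fprintf(w, "scrutiny_devices_total %d\n", c.devices)
}

// newHandler wires the middleware around the handler.
func newHandler(c *collector) http.Handler {
	return withCollector(c, http.HandlerFunc(getMetrics))
}

// scrape issues an in-process GET and returns the status code and body.
func scrape(h http.Handler, path string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := scrape(newHandler(&collector{devices: 3}), "/api/metrics")
	fmt.Println(code)
	fmt.Print(body)
}
```

The handler checks the type assertion instead of assuming the collector is present, so a route that is registered without the middleware fails with a clear error.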

10. Server Integration (webapp/backend/pkg/web/server.go)

Modified file integrating metrics functionality into web server:

Changes

  1. AppEngine Struct Extension

    • Added MetricsCollector field
  2. Setup() Method

    • Initializes metrics collector based on configuration
    • Registers metrics middleware
    • Conditionally registers /api/metrics route (only when enabled)
  3. Start() Method

    • Asynchronously loads initial metrics data (executed in background goroutine)
    • Loads device summaries and latest SMART data from database
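The conditional registration in Setup() can be illustrated with a small standard-library sketch. It uses http.ServeMux instead of Gin, and the config struct, the /api/health route, and the helper names are assumptions made for the example; only the /api/metrics path and the enabled flag come from the summary above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// config is a hypothetical stand-in for the web.metrics.enabled setting.
type config struct{ MetricsEnabled bool }

// newMux registers the metrics route only when metrics are enabled.
func newMux(cfg config) *http.ServeMux {
	mux := http.NewServeMux()
	// Placeholder for the routes that are always registered.
	mux.HandleFunc("/api/health", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	if cfg.MetricsEnabled {
		mux.HandleFunc("/api/metrics", func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprintln(w, "scrutiny_devices_total 0")
		})
	}
	return mux
}

// statusFor issues an in-process GET and returns the HTTP status code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(newMux(config{MetricsEnabled: true}), "/api/metrics"))
	fmt.Println(statusFor(newMux(config{MetricsEnabled: false}), "/api/metrics"))
}
```

With metrics disabled the route does not exist at all, so a scrape gets a 404 rather than an empty response.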

11. Device Metrics Upload Integration (webapp/backend/pkg/web/handler/upload_device_metrics.go)

Modified file updating metrics when device data is uploaded:

  • In UploadDeviceMetrics() function, updates Prometheus metrics after successful device data upload
  • Retrieves metrics collector from Gin context and calls UpdateDeviceMetrics()

Architecture Design

Data Flow

Device Data Upload → UploadDeviceMetrics() 
    ↓
Update in-memory metrics data (Collector.UpdateDeviceMetrics)
    ↓
Prometheus Scrape → /api/metrics → GetMetrics()
    ↓
Collector.Collect() generates metrics
    ↓
Return Prometheus format data

Thread Safety

  • Uses sync.RWMutex to protect shared data structures
  • Read operations use RLock(), write operations use Lock()
  • Startup data loading uses goroutines to concurrently fetch SMART data for each device

Performance Optimizations

  1. In-Memory Caching: Metrics data stored in memory, avoiding database queries on every request
  2. Asynchronous Loading: Initial data loaded in background goroutine at startup, not blocking server startup
  3. Concurrent Loading: Uses sync.WaitGroup to concurrently fetch SMART data for multiple devices
  4. On-Demand Updates: Only updates metrics for corresponding device when device data is uploaded
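The concurrent loading in item 3 can be sketched as one goroutine per device coordinated by sync.WaitGroup. The function names and the fetch signature are hypothetical, fakeFetch stands in for the database query, and skipping devices whose fetch fails is an assumption rather than something the PR states.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// fakeFetch stands in for a per-device database query. It fails for the
// WWN "bad" so the error path can be demonstrated.
func fakeFetch(wwn string) (int64, error) {
	if wwn == "bad" {
		return 0, errors.New("no SMART data")
	}
	return int64(len(wwn)), nil
}

// loadAll fetches data for every WWN concurrently and returns the results
// keyed by WWN. Devices whose fetch fails are skipped.
func loadAll(wwns []string, fetch func(string) (int64, error)) map[string]int64 {
	var (
		mu sync.Mutex // guards out, which the goroutines write to
		wg sync.WaitGroup
	)
	out := make(map[string]int64, len(wwns))
	for _, wwn := range wwns {
		wg.Add(1)
		go func(wwn string) {
			defer wg.Done()
			v, err := fetch(wwn)
			if err != nil {
				return
			}
			mu.Lock()
			out[wwn] = v
			mu.Unlock()
		}(wwn)
	}
	wg.Wait() // block until every fetch has finished
	return out
}

func main() {
	got := loadAll([]string{"a", "bad", "ccc"}, fakeFetch)
	fmt.Println(len(got), got["ccc"])
}
```

To keep server startup non-blocking, as item 2 describes, a caller would run loadAll inside its own background goroutine.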

Configuration

Enable Metrics Endpoint

In the scrutiny.yaml configuration file:

web:
  metrics:
    enabled: true  # Enable Prometheus metrics endpoint

Access Metrics Endpoint

  • URL: http://<host>:<port>/api/metrics
  • Method: GET
  • Format: Prometheus text format

Starosdev added a commit to Starosdev/scrutiny that referenced this pull request Nov 30, 2025
Starosdev pushed a commit to Starosdev/scrutiny that referenced this pull request Nov 30, 2025
## [1.1.0](v1.0.0...v1.1.0) (2025-11-30)

### Features

* Add "day" as resolution for temperature graph ([2670af2](2670af2))
* add day resolution for temperature graph (upstream PR [AnalogJ#823](https://github.com/Starosdev/scrutiny/issues/823)) ([2d6ffa7](2d6ffa7))
* add setting to enable/disable SCT temperature history (upstream PR [AnalogJ#557](https://github.com/Starosdev/scrutiny/issues/557)) ([c3692ac](c3692ac))
* Implement device-wise notification mute/unmute ([925e86d](925e86d))
* implement device-wise notification mute/unmute (upstream PR [AnalogJ#822](https://github.com/Starosdev/scrutiny/issues/822)) ([ea7102e](ea7102e))
* implement Prometheus metrics support (upstream PR [AnalogJ#830](https://github.com/Starosdev/scrutiny/issues/830)) ([7384f7d](7384f7d))
* support SAS temperature (upstream PR [AnalogJ#816](https://github.com/Starosdev/scrutiny/issues/816)) ([f954cc8](f954cc8))

### Bug Fixes

* better handling of ata_sct_temperature_history (upstream PR [AnalogJ#825](https://github.com/Starosdev/scrutiny/issues/825)) ([d134ad7](d134ad7))
* **database:** add missing temperature parameter in SCSI migration ([df7da88](df7da88))
* support transient SMART failures (upstream PR [AnalogJ#375](https://github.com/Starosdev/scrutiny/issues/375)) ([601775e](601775e))
* **ui:** fix temperature conversion in temperature.pipe.ts (upstream PR [AnalogJ#815](https://github.com/Starosdev/scrutiny/issues/815)) ([e0f2781](e0f2781))

### Refactoring

* use limit() instead of tail() for fetching smart attributes (upstream PR [AnalogJ#829](https://github.com/Starosdev/scrutiny/issues/829)) ([2849531](2849531))