This change introduces using prometheus exporter integration #21
Open
manoj-freyr wants to merge 12 commits into main
device-metrics-exporter is a publicly available package that provides Prometheus exporter capability. This change adds it to the nodes to export device metrics, which a central node can then collect, and adds Grafana integration that uses the same Prometheus instance as its data source. Deployment was tested locally using ./utils/deploy_monitoring_stack.sh, which deploys these three utilities to the nodes. After this, any other cvs tests can be run.
- Device Metrics Exporter runs 24/7, exposing GPU metrics
- Prometheus scrapes metrics every 15 seconds automatically
- Tests run independently; metrics are collected in the background
- After a test, you can query Prometheus to see what happened during the test
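Querying Prometheus after a test to see what happened during it could look roughly like the sketch below. It builds a request URL for the range-query endpoint of the Prometheus HTTP API (/api/v1/query_range) covering a test's start and end times. The helper name build_range_query_url and the example timestamps are assumptions for illustration only. The localhost:9090 default and the gpu_edge_temperature metric are borrowed from the curl example in this PR, and the 15-second step simply mirrors the scrape interval.

```python
from urllib.parse import urlencode


def build_range_query_url(metric, start_ts, end_ts, step_s=15,
                          base_url="http://localhost:9090"):
    """Build a Prometheus /api/v1/query_range URL for one test window.

    start_ts and end_ts are Unix timestamps in seconds; step_s is the
    resolution of the returned series in seconds.
    """
    if end_ts < start_ts:
        raise ValueError("end_ts must not be earlier than start_ts")
    params = urlencode({
        "query": metric,
        "start": start_ts,
        "end": end_ts,
        "step": f"{step_s}s",
    })
    return f"{base_url}/api/v1/query_range?{params}"


# Hypothetical test window: 10 minutes starting at an arbitrary timestamp.
url = build_range_query_url("gpu_edge_temperature", 1763610000, 1763610600)
print(url)
```

Recording the start and end time around a test run and passing them to a helper like this would give the metric history for exactly that window.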
fixing config for prometheus to scrape the proper targets
solaiys
reviewed
Nov 20, 2025
Contributor
It's the same as the cluster.json template, right? The user needs to fill in the nodes and key details.
Maybe we can rename it to sample_monitor_cluster.json
solaiys
reviewed
Nov 20, 2025
Contributor
I think you added this file for testing with localhost, which may not be needed in the actual case.
Maybe we can rename it to "sample_localhost_monitor_clustor.json"
added 2 commits
November 20, 2025 03:57
e65e54b to
043fd1f
043fd1f to
4a6ae0e
The local test of ./utils/deploy_monitoring_stack.sh, which deploys the three utilities to the nodes, used the following settings:
Working Directory: /etc/cvs
Cluster File: ./input/cluster_file/local_test_cluster.json
Monitoring Config: ./input/config_file/monitoring/monitoring_config.json
Prometheus Version: 2.55.0
Grafana Version: 10.4.1
Exporter Version: v1.4.0
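One way to confirm that the deployed Prometheus is actually scraping the exporters is to inspect its targets endpoint (/api/v1/targets in the Prometheus HTTP API). The sketch below is a minimal illustration that assumes the response has already been fetched as text. The function name, host names, port, and error string in the sample payload are hypothetical and not taken from this PR.

```python
import json


def unhealthy_targets(targets_response_text):
    """Return (scrape URL, last error) for every active target that is not 'up'."""
    payload = json.loads(targets_response_text)
    active = payload["data"]["activeTargets"]
    return [
        (target["scrapeUrl"], target.get("lastError", ""))
        for target in active
        if target.get("health") != "up"
    ]


# Illustrative payload only; hosts, port, and error text are made up.
sample_targets = json.dumps({
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://node-a:5000/metrics",
             "health": "up", "lastError": ""},
            {"scrapeUrl": "http://node-b:5000/metrics",
             "health": "down", "lastError": "connection refused"},
        ],
        "droppedTargets": [],
    },
})

for scrape_url, error in unhealthy_targets(sample_targets):
    print(f"{scrape_url} is not being scraped: {error}")
```

An empty result would suggest that every configured exporter is reachable from the central node.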
After a successful deployment, the metrics can be queried either with curl or through the UI:
curl -s 'http://localhost:9090/api/v1/query?query=gpu_edge_temperature' | jq -r '.data.result[] | "GPU \(.metric.gpu_id) on \(.metric.hostname): \(.value[1])°C"'

or by browsing to http://localhost:9090/
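For use inside a Python test rather than a shell, the same formatting as the curl and jq pipeline can be sketched as follows. It parses the JSON body of an instant query (/api/v1/query) and prints one line per GPU. The function name and the sample payload are assumptions for illustration; only the gpu_id and hostname labels and the gpu_edge_temperature metric come from the command above.

```python
import json


def format_gpu_temperatures(response_text):
    """Turn a Prometheus instant-query response into readable lines."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    lines = []
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        # For instant vectors, "value" is a [timestamp, "value-as-string"] pair.
        _, value = sample["value"]
        lines.append(
            f"GPU {labels.get('gpu_id', '?')} on "
            f"{labels.get('hostname', '?')}: {value}°C"
        )
    return lines


# Illustrative payload only; the label values and the reading are made up.
sample_response = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "gpu_edge_temperature",
                           "gpu_id": "0", "hostname": "node-a"},
                "value": [1763610000, "45"],
            }
        ],
    },
})

for line in format_gpu_temperatures(sample_response):
    print(line)
```

Missing labels are rendered as "?" instead of raising, so a target that lacks a hostname label still shows up in the output.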
Running gives: [screenshot] and [screenshot]
pytest tests/monitoring/install_device_metrics_exporter.py --cluster_file=input/cluster_file/sample_monitor_cluster.json --config_file=input/config_file/monitoring/monitoring_config.json -v