Description
I'm writing a PMDA, and it works as expected, but instance label propagation in PCP seems inconsistent. This affects my PMDA, but also the standard ones. Somewhere along the pipeline, instance labels are dropped, so the REST API does not return them, which in turn causes errors in grafana-pcp and other downstream consumers.
System info:
- PCP 7.0.3 (installed from packagecloud repo)
- OS: ubuntu/noble
- pmproxy: Running as systemd service on port 44322
Can't share the PMDA code (it's platform-specific anyway), but I'll show the behavior and also how to repro with a bundled PMDA.
The gist is that the following tools show all expected instance labels:
- pminfo --labels
- dbpmda with the label subcommand
- the /metrics endpoint
But they're missing from:
- REST API (PMAPI)
- pmseries and downstream grafana-pcp
Working through this with my own PMDA, which exposes low-level AMD EPYC metrics:
static int
esmi_labelCallBack(pmInDom indom, unsigned int inst, pmLabelSet **lp)
{
int serial;
if (indom == PM_INDOM_NULL)
return 0;
serial = pmInDom_serial(indom);
/* Add disp_instance label for socket indom */
if (serial == SOCKET_INDOM) {
if (inst < num_sockets && socket_names[inst] != NULL) {
return pmdaAddLabels(lp, "{\"disp_instance\":\"%s\"}",
socket_names[inst]);
}
}
/* Add disp_instance, die_id, and socket_id labels for core indom */
if (serial == CORE_INDOM) {
if (inst < num_cores && core_names[inst] != NULL) {
return pmdaAddLabels(lp, "{\"disp_instance\":\"%s\",\"die_id\":%d,\"socket_id\":%d}",
core_names[inst], core_die_id[inst], core_socket_id[inst]);
}
}
return 0;
}
All core-scope metrics get three instance labels to localize them on the chiplets and sockets. The PMDA installs and runs without warnings or errors.
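For context, the callback above is registered in the PMDA's init routine. A minimal sketch, assuming the standard PMDA_INTERFACE_7 setup (esmi_init, indomtab, and metrictab are illustrative names here, not the actual code):
void
esmi_init(pmdaInterface *dp)
{
    if (dp->status != 0)
        return;
    /* register the per-instance label callback with libpcp_pmda */
    pmdaSetLabelCallBack(dp, esmi_labelCallBack);
    pmdaInit(dp, indomtab, sizeof(indomtab) / sizeof(indomtab[0]),
             metrictab, sizeof(metrictab) / sizeof(metrictab[0]));
}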
Poking via dbpmda:
echo 'open pipe /var/lib/pcp/pmdas/esmi/pmdaesmi -d 470
label instances 470.1' | sudo dbpmda
Output:
Start pmdaesmi PMDA: /var/lib/pcp/pmdas/esmi/pmdaesmi -d 470
Instances of pmInDom: 470.1
[ 0] Labels inst: 0
die_id=0
disp_instance="core0"
socket_id=0
[ 1] Labels inst: 1
die_id=0
disp_instance="core1"
socket_id=0
[ 2] Labels inst: 2
die_id=0
disp_instance="core2"
socket_id=0
...
Then pminfo:
pminfo --labels esmi.energy.core | head -n5
Output:
esmi.energy.core
labels {"agent":"esmi","device_type":"cpu_core","domainname":"localdomain","groupid":1000,"hostname":"aitop","indom_name":"per core","machineid":"4542127d93e1480e823cf51ba57d25a3","userid":1000}
inst [0 or "core0"] labels {"agent":"esmi","device_type":"cpu_core","die_id":0,"disp_instance":"core0","domainname":"localdomain","groupid":1000,"hostname":"aitop","indom_name":"per core","machineid":"4542127d93e1480e823cf51ba57d25a3","socket_id":0,"userid":1000}
inst [1 or "core1"] labels {"agent":"esmi","device_type":"cpu_core","die_id":0,"disp_instance":"core1","domainname":"localdomain","groupid":1000,"hostname":"aitop","indom_name":"per core","machineid":"4542127d93e1480e823cf51ba57d25a3","socket_id":0,"userid":1000}
Note: die_id, disp_instance, and socket_id are present in instance labels.
Metrics endpoint:
curl -s http://localhost:44322/metrics | grep esmi_energy_core | head -5
Output:
# HELP esmi_energy_core Cumulative core energy consumption in Joules
# TYPE esmi_energy_core counter
esmi_energy_core{disp_instance="core0",agent="esmi",indom_name="per core",hostname="aitop",instid="0",instname="core0",machineid="4542127d93e1480e823cf51ba57d25a3",domainname="localdomain",die_id="0",socket_id="0",device_type="cpu_core"} 383191.321868
esmi_energy_core{disp_instance="core1",agent="esmi",indom_name="per core",hostname="aitop",instid="1",instname="core1",machineid="4542127d93e1480e823cf51ba57d25a3",domainname="localdomain",die_id="0",socket_id="0",device_type="cpu_core"} 339390.282028
esmi_energy_core{disp_instance="core2",agent="esmi",indom_name="per core",hostname="aitop",instid="2",instname="core2",machineid="4542127d93e1480e823cf51ba57d25a3",domainname="localdomain",die_id="0",socket_id="0",device_type="cpu_core"} 75834.41345199999
Still there ...
Now the instance domain REST API:
curl -s 'http://localhost:44322/pmapi/indom?indom=470.1' | jq | head -n30
Output:
{
"context": 184337792,
"indom": "470.1",
"labels": {
"device_type": "cpu_core",
"domainname": "localdomain",
"hostname": "aitop",
"indom_name": "per core",
"machineid": "4542127d93e1480e823cf51ba57d25a3"
},
"text-oneline": "Instance domain \"core\" for ESMI PMDA",
"text-help": "One instance per physical CPU core detected by the ESMI library.\nInstances are named \"core0\", \"core1\", etc.\nNote: With SMT enabled, sibling threads share the same core.",
"instances": [
{
"instance": 11,
"name": "core11",
"labels": {
"domainname": "localdomain",
"hostname": "aitop",
"machineid": "4542127d93e1480e823cf51ba57d25a3"
}
},
{
"instance": 23,
"name": "core23",
"labels": {
"domainname": "localdomain",
"hostname": "aitop",
"machineid": "4542127d93e1480e823cf51ba57d25a3"
}
Instance labels are missing.
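A quick sanity check, collapsing the per-instance label keys from the same response with jq:
curl -s 'http://localhost:44322/pmapi/indom?indom=470.1' | jq '[.instances[].labels | keys[]] | unique'
Output:
[
  "domainname",
  "hostname",
  "machineid"
]
Only the context-level labels survive; die_id, disp_instance, and socket_id are gone.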
Now the metric endpoint:
curl -s 'http://localhost:44322/pmapi/metric?names=esmi.energy.core' | jq
Output:
{
"context": 1243538900,
"metrics": [
{
"name": "esmi.energy.core",
"series": "a3f7748a41b1aca855f561e934ab092b2a78428f",
"pmid": "470.5.0",
"indom": "470.1",
"type": "double",
"sem": "counter",
"units": "none",
"labels": {
"agent": "esmi",
"device_type": "cpu_core",
"domainname": "localdomain",
"hostname": "aitop",
"indom_name": "per core",
"machineid": "4542127d93e1480e823cf51ba57d25a3"
},
"text-oneline": "Cumulative core energy consumption in Joules",
"text-help": "The cumulative energy consumption of each physical core in Joules.[...]"
}
]
}
Now, this isn't specific to my PMDA: I noticed the same missing labels for the NVIDIA PMDA, disk metrics, and others, so dashboards and pmproxy queries lack important context.
For example, there's a device_name here:
pminfo --labels disk.dev.read 2>/dev/null | grep device_name | head -n1
# inst [0 or "nvme3n1"] labels {"agent":"linux","device_name":"nvme3n1","device_type":"block","domainname":"localdomain","groupid":1000,"hostname":"aitop","indom_name":"per disk","machineid":"4542127d93e1480e823cf51ba57d25a3","userid":1000}
But not here:
pmseries -l `pmseries 'disk.dev.read'` 2>&1 | grep nvme3n1
# inst [0 or "nvme3n1"] labels {"agent":"linux","device_type":"block","domainname":"localdomain","groupid":986,"hostname":"aitop","indom_name":"per disk","machineid":"4542127d93e1480e823cf51ba57d25a3","userid":997}
Nor here:
curl -s 'http://localhost:44322/pmapi/indom?indom=60.1' | jq '.instances[] | select(.name == "nvme3n1")'
# {
# "instance": 0,
# "name": "nvme3n1",
# "labels": {
# "domainname": "localdomain",
# "hostname": "aitop",
# "machineid": "4542127d93e1480e823cf51ba57d25a3"
# }
# }
And another example, from NVIDIA:
pminfo --labels nvidia.power
# nvidia.power
# labels {"agent":"nvidia","device_type":"gpu","domainname":"localdomain","groupid":1000,"hostname":"aitop","indom_name":"per gpu","machineid":"4542127d93e1480e823cf51ba57d25a3","units":"milliwatts","userid":1000}
# inst [0 or "gpu0"] labels {"agent":"nvidia","device_type":"gpu","domainname":"localdomain","gpu":0,"groupid":1000,"hostname":"aitop","indom_name":"per gpu","machineid":"4542127d93e1480e823cf51ba57d25a3","units":"milliwatts","userid":1000,"uuid":"GPU-7235b8ec-cfc6-c44b-967f-c404e4564320"}
# inst [1 or "gpu1"] labels {"agent":"nvidia","device_type":"gpu","domainname":"localdomain","gpu":1,"groupid":1000,"hostname":"aitop","indom_name":"per gpu","machineid":"4542127d93e1480e823cf51ba57d25a3","units":"milliwatts","userid":1000,"uuid":"GPU-4db1918f-1bb1-7200-983c-0ceb1ae0c7b3"}
# inst [2 or "gpu2"] labels {"agent":"nvidia","device_type":"gpu","domainname":"localdomain","gpu":2,"groupid":1000,"hostname":"aitop","indom_name":"per gpu","machineid":"4542127d93e1480e823cf51ba57d25a3","units":"milliwatts","userid":1000,"uuid":"GPU-bb8efb89-eacc-8ef4-fd19-4ca522115940"}
gpu exists as a label here, but not in the API, the proxy, Redis, or the other downstream consumers.