
Commit 6fecf80

add grpc server metrics. (#152)
* add grpc server metrics.

  Signed-off-by: morvencao <[email protected]>

* add operator guide.

  Signed-off-by: morvencao <[email protected]>

---------

Signed-off-by: morvencao <[email protected]>
1 parent 8ccbe0f commit 6fecf80

File tree

1 file changed: +191 -0 lines changed
  • enhancements/sig-architecture/141-grpc-based-registration


enhancements/sig-architecture/141-grpc-based-registration/README.md

Lines changed: 191 additions & 0 deletions
@@ -163,6 +163,197 @@ The following security principles should be considered between the broker and so
- The sources should be authorized by the broker so that one source cannot consume event messages from other sources
- The agent should be authorized by the broker so that one agent cannot consume event messages from other clusters

### Metrics

The gRPC server exposes Prometheus metrics to monitor its health and performance. They are grouped into two categories:

1. **General gRPC server metrics**
2. **CloudEvents-specific gRPC server metrics**

#### General gRPC server metrics

Common metrics for gRPC server health and performance, using `grpc_server` as the Prometheus subsystem name. Each metric comes with an operator guide on healthy vs. degraded values.

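These metric names and label sets match what the grpc-ecosystem `go-grpc-prometheus` server interceptors produce. As a minimal sketch (assuming that middleware and placeholder ports, not necessarily the actual server wiring), the metrics could be registered and exposed like this:

```go
package main

import (
	"net"
	"net/http"

	grpcprom "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"google.golang.org/grpc"
)

func main() {
	// Interceptor metrics: grpc_server_started_total, grpc_server_handled_total,
	// grpc_server_msg_received_total, grpc_server_msg_sent_total, ...
	serverMetrics := grpcprom.NewServerMetrics()
	// Also record the grpc_server_handling_seconds histogram.
	serverMetrics.EnableHandlingTimeHistogram()
	prometheus.MustRegister(serverMetrics)

	grpcServer := grpc.NewServer(
		grpc.UnaryInterceptor(serverMetrics.UnaryServerInterceptor()),
		grpc.StreamInterceptor(serverMetrics.StreamServerInterceptor()),
	)
	// Register the CloudEventService implementation on grpcServer here, then
	// pre-populate the per-method series.
	serverMetrics.InitializeMetrics(grpcServer)

	// Expose /metrics for Prometheus to scrape (port is a placeholder).
	http.Handle("/metrics", promhttp.Handler())
	go func() { _ = http.ListenAndServe(":9090", nil) }()

	lis, err := net.Listen("tcp", ":8090")
	if err != nil {
		panic(err)
	}
	if err := grpcServer.Serve(lis); err != nil {
		panic(err)
	}
}
```

The connection gauge and the byte counters below are not part of that interceptor set and need additional instrumentation (see the sketch after the metric list).
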
- **`grpc_server_active_connections`**

**Type**: Gauge \
**Description**: Current number of active connections. \
**Healthy**: A stable or predictable number of gRPC connections, based on expected client load. \
**Degraded**: A sudden drop to zero (all clients disconnected) or a sharp surge above the baseline may indicate connection leaks, restarts, or faulty clients (see the sketch after this metric list). \
**Metrics sample**:
```
grpc_server_active_connections{local_addr="10.244.0.18:8090",remote_addr="10.244.0.16:45128"} 1
```

- **`grpc_server_started_total`**

**Type**: Counter \
**Description**: Total number of RPCs started on the server. \
**Healthy**: The number of started RPCs closely matches the number of handled RPCs (`grpc_server_handled_total`). \
**Degraded**: A growing gap between started and handled RPCs (`grpc_server_handled_total`), beyond what long-lived `Subscribe` streams account for, suggests requests are failing before completion. \
**Metrics sample**:
```
grpc_server_started_total{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary"} 3
grpc_server_started_total{grpc_method="Subscribe",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="server_stream"} 4
```

- **`grpc_server_msg_received_total`**

**Type**: Counter \
**Description**: Total number of RPC messages received on the server. \
**Healthy**: Steady growth aligned with expected traffic. \
**Degraded**: Abnormal spikes may indicate flooding or misbehaving clients. If the ratio of `grpc_server_msg_received_total` to `grpc_server_msg_sent_total` also rises, the server or downstream may not be processing requests as fast as they are received. \
**Metrics sample**:
```
grpc_server_msg_received_total{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary"} 3
grpc_server_msg_received_total{grpc_method="Subscribe",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="server_stream"} 4
```

- **`grpc_server_msg_sent_total`**

**Type**: Counter \
**Description**: Total number of gRPC messages sent by the server. \
**Healthy**: Consistent growth that matches the gRPC server and downstream processing rate. \
**Degraded**: A sudden drop may mean the server and downstream aren’t processing requests. If the ratio of `grpc_server_msg_received_total` to `grpc_server_msg_sent_total` also rises, the server or downstream may be falling behind. \
**Metrics sample**:
```
grpc_server_msg_sent_total{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary"} 3
grpc_server_msg_sent_total{grpc_method="Subscribe",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="server_stream"} 1
```

- **`grpc_server_msg_received_bytes_total`**

**Type**: Counter \
**Description**: Total number of message bytes received on the gRPC server. \
**Healthy**: Steady growth aligned with expected traffic. \
**Degraded**: Abnormal spikes may indicate flooding or misbehaving clients. If the ratio of `grpc_server_msg_received_bytes_total` to `grpc_server_msg_sent_bytes_total` also rises, the server or downstream may not be processing requests as fast as they are received. \
**Metrics sample**:
```
grpc_server_msg_received_bytes_total{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary"} 1729
grpc_server_msg_received_bytes_total{grpc_method="Subscribe",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="server_stream"} 245
```

- **`grpc_server_msg_sent_bytes_total`**

**Type**: Counter \
**Description**: Total number of message bytes sent by the gRPC server. \
**Healthy**: Consistent growth that matches the gRPC server and downstream processing rate. \
**Degraded**: A sudden drop may mean the server and downstream aren’t processing requests. If the ratio of `grpc_server_msg_received_bytes_total` to `grpc_server_msg_sent_bytes_total` also rises, the server or downstream may be falling behind. \
**Metrics sample**:
```
grpc_server_msg_sent_bytes_total{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary"} 0
grpc_server_msg_sent_bytes_total{grpc_method="Subscribe",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="server_stream"} 1147
```

- **`grpc_server_handled_total`**

**Type**: Counter \
**Description**: Total number of RPCs completed on the server, regardless of success or failure. \
**Healthy**: Most RPCs complete with `grpc_code="OK"`. \
**Degraded**: An increasing number of non-OK codes (e.g., `Unavailable`, `DeadlineExceeded`, `Internal`) signals gRPC server instability or downstream errors. \
**Metrics sample**:
```
grpc_server_handled_total{grpc_code="OK",grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary"} 3
```

- **`grpc_server_handling_seconds`**

**Type**: Histogram \
**Description**: Histogram of the duration of RPC handling by the gRPC server. \
**Healthy**: Request latencies fall mostly into the lower buckets (e.g., <0.1s). \
**Degraded**: Shifts into higher buckets (e.g., >1s or >5s) mean the server is slowing down, possibly due to load, resource starvation, or dependency issues. \
**Metrics sample**:
```
grpc_server_handling_seconds_bucket{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary",le="0.005"} 3
grpc_server_handling_seconds_bucket{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary",le="0.01"} 3
grpc_server_handling_seconds_bucket{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary",le="0.025"} 3
grpc_server_handling_seconds_bucket{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary",le="0.05"} 3
grpc_server_handling_seconds_bucket{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary",le="0.1"} 3
grpc_server_handling_seconds_bucket{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary",le="0.25"} 3
grpc_server_handling_seconds_bucket{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary",le="0.5"} 3
grpc_server_handling_seconds_bucket{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary",le="1"} 3
grpc_server_handling_seconds_bucket{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary",le="2.5"} 3
grpc_server_handling_seconds_bucket{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary",le="5"} 3
grpc_server_handling_seconds_bucket{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary",le="10"} 3
grpc_server_handling_seconds_bucket{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary",le="+Inf"} 3
grpc_server_handling_seconds_sum{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary"} 0.0055182140000000005
grpc_server_handling_seconds_count{grpc_method="Publish",grpc_service="io.cloudevents.v1.CloudEventService",grpc_type="unary"} 3
```

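The `grpc_server_active_connections` gauge and the byte counters are not provided by the standard interceptors, so they need dedicated instrumentation. Below is a minimal sketch of one way to track the connection gauge with a gRPC `stats.Handler` (all names here are illustrative, not necessarily how the server implements it):

```go
package metrics

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc/stats"
)

// activeConnections backs grpc_server_active_connections; the label set
// mirrors the sample above but is illustrative.
var activeConnections = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Subsystem: "grpc_server",
	Name:      "active_connections",
	Help:      "Current number of active connections.",
}, []string{"local_addr", "remote_addr"})

func init() {
	prometheus.MustRegister(activeConnections)
}

// connKey is the context key under which TagConn stores connection info.
type connKey struct{}

// connStatsHandler moves the gauge up and down as connections begin and end.
type connStatsHandler struct{}

func (h *connStatsHandler) TagRPC(ctx context.Context, _ *stats.RPCTagInfo) context.Context {
	return ctx
}

func (h *connStatsHandler) HandleRPC(ctx context.Context, _ stats.RPCStats) {}

func (h *connStatsHandler) TagConn(ctx context.Context, info *stats.ConnTagInfo) context.Context {
	// Remember the connection addresses so HandleConn can label the gauge.
	return context.WithValue(ctx, connKey{}, info)
}

func (h *connStatsHandler) HandleConn(ctx context.Context, s stats.ConnStats) {
	info, _ := ctx.Value(connKey{}).(*stats.ConnTagInfo)
	if info == nil {
		return
	}
	labels := prometheus.Labels{
		"local_addr":  info.LocalAddr.String(),
		"remote_addr": info.RemoteAddr.String(),
	}
	switch s.(type) {
	case *stats.ConnBegin:
		activeConnections.With(labels).Inc()
	case *stats.ConnEnd:
		activeConnections.With(labels).Dec()
	}
}
```

The handler is attached with `grpc.NewServer(grpc.StatsHandler(&connStatsHandler{}), ...)`. The same handler also receives `*stats.InPayload` and `*stats.OutPayload` events, whose payload sizes could feed the received/sent byte counters in a similar way.
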
#### CloudEvents-specific gRPC server metrics

Metrics specific to CloudEvents RPC calls, using `grpc_server_ce` as the Prometheus subsystem name. Each metric comes with an operator guide on healthy vs. degraded values.

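As an illustration of how metrics in this group could be defined with the Prometheus Go client (names and label sets mirror the samples below; the actual implementation may differ):

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

// Illustrative definitions only.
var (
	ceCalledTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Subsystem: "grpc_server_ce",
		Name:      "called_total", // full name: grpc_server_ce_called_total
		Help:      "Total number of RPC requests called on the server.",
	}, []string{"cluster", "data_type", "method"})

	ceProcessedDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Subsystem: "grpc_server_ce",
		Name:      "processed_duration_seconds",
		Help:      "Duration of CloudEvents processing on the server.",
		Buckets:   prometheus.DefBuckets, // 0.005s ... 10s, matching the samples below
	}, []string{"cluster", "data_type", "grpc_code", "method"})
)

func init() {
	prometheus.MustRegister(ceCalledTotal, ceProcessedDuration)
}
```

The received/sent/processed counters follow the same pattern with their respective label sets.
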
- **`grpc_server_ce_called_total`**

**Type**: Counter \
**Description**: Total number of RPC requests called on the server. \
**Healthy**: Regular increments matching agent traffic. \
**Degraded**: A sudden stop (no calls received) or an unexpected drop after a steady pattern may indicate communication issues with agents. \
**Metrics sample**:
```
grpc_server_ce_called_total{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",method="Publish"} 1
grpc_server_ce_called_total{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",method="Subscribe"} 1
```

- **`grpc_server_ce_msg_received_total`**

**Type**: Counter \
**Description**: Total number of messages received on the gRPC server. \
**Healthy**: Regular increments matching agent traffic; most received events eventually lead to sent events. \
**Degraded**: A large gap between `grpc_server_ce_msg_received_total` and `grpc_server_ce_msg_sent_total` (many received but few sent/processed) may indicate server bottlenecks or dropped events. \
**Metrics sample**:
```
grpc_server_ce_msg_received_total{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",method="Publish"} 1
grpc_server_ce_msg_received_total{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",method="Subscribe"} 1
```

- **`grpc_server_ce_msg_sent_total`**

**Type**: Counter \
**Description**: Total number of messages sent by the gRPC server. \
**Healthy**: Consistent growth that matches the gRPC server and downstream processing rate. \
**Degraded**: A large gap between `grpc_server_ce_msg_received_total` and `grpc_server_ce_msg_sent_total` (many received but few sent/processed) may indicate server bottlenecks or dropped events. \
**Metrics sample**:
```
grpc_server_ce_msg_sent_total{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",method="Publish"} 1
```

- **`grpc_server_ce_processed_total`**

**Type**: Counter \
**Description**: Total number of messages processed by the gRPC server. \
**Healthy**: Most CloudEvents are processed with `grpc_code="OK"`. \
**Degraded**: Rising counts of non-OK codes show the server is failing during processing (an instrumentation sketch follows this metric list). \
**Metrics sample**:
```
grpc_server_ce_processed_total{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish"} 1
```

- **`grpc_server_ce_processed_duration_seconds`**

**Type**: Histogram \
**Description**: Histogram of the duration of RPC requests for CloudEvents processed on the server. \
**Healthy**: Processing durations mostly in small buckets (e.g., <0.1s). \
**Degraded**: Shifts into higher buckets (>1s or >5s) signal a slowdown in event handling. \
**Metrics sample**:
```
grpc_server_ce_processed_duration_seconds_bucket{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish",le="0.005"} 0
grpc_server_ce_processed_duration_seconds_bucket{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish",le="0.01"} 1
grpc_server_ce_processed_duration_seconds_bucket{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish",le="0.025"} 1
grpc_server_ce_processed_duration_seconds_bucket{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish",le="0.05"} 1
grpc_server_ce_processed_duration_seconds_bucket{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish",le="0.1"} 1
grpc_server_ce_processed_duration_seconds_bucket{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish",le="0.25"} 1
grpc_server_ce_processed_duration_seconds_bucket{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish",le="0.5"} 1
grpc_server_ce_processed_duration_seconds_bucket{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish",le="1"} 1
grpc_server_ce_processed_duration_seconds_bucket{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish",le="2.5"} 1
grpc_server_ce_processed_duration_seconds_bucket{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish",le="5"} 1
grpc_server_ce_processed_duration_seconds_bucket{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish",le="10"} 1
grpc_server_ce_processed_duration_seconds_bucket{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish",le="+Inf"} 1
grpc_server_ce_processed_duration_seconds_sum{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish"} 0.001053519
grpc_server_ce_processed_duration_seconds_count{cluster="cluster1",data_type="io.open-cluster-management.works.v1alpha1.manifestbundles",grpc_code="OK",method="Publish"} 1
```

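Continuing the illustrative definitions from the sketch above, the CE-specific counters and the duration histogram could be recorded around each handled event roughly as follows (a sketch only, not the actual handler code):

```go
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc/status"
)

// ceProcessedTotal complements ceCalledTotal and ceProcessedDuration from the
// earlier sketch; all names are illustrative.
var ceProcessedTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
	Subsystem: "grpc_server_ce",
	Name:      "processed_total",
	Help:      "Total number of messages processed by the gRPC server.",
}, []string{"cluster", "data_type", "grpc_code", "method"})

func init() {
	prometheus.MustRegister(ceProcessedTotal)
}

// instrumentCE wraps the handling of one CloudEvent and records the
// CE-specific metrics around it.
func instrumentCE(cluster, dataType, method string, handle func() error) error {
	ceCalledTotal.WithLabelValues(cluster, dataType, method).Inc()

	start := time.Now()
	err := handle()
	code := status.Code(err).String() // "OK" for a nil error

	ceProcessedTotal.WithLabelValues(cluster, dataType, code, method).Inc()
	ceProcessedDuration.WithLabelValues(cluster, dataType, code, method).Observe(time.Since(start).Seconds())
	return err
}
```

The received/sent message counters would be incremented at the same points where events are read from and written to the stream.
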
### Test Plan

**Note:** *Section not required until targeted at a release.*
