enhancements/sig-architecture/141-grpc-based-registration/README.md (191 additions, 0 deletions)
@@ -163,6 +163,197 @@ The following security principles should be considered between the broker and so
- Sources should be authorized by the broker so that one source cannot consume event messages from other sources
- Agents should be authorized by the broker so that one agent cannot consume event messages from other clusters

### Metrics

The gRPC server exposes Prometheus metrics for monitoring its health and performance. They are grouped into two categories:

1. **General gRPC server metrics**
2. **CloudEvents-specific gRPC server metrics**

#### General gRPC server metrics

Common metrics for gRPC server health and performance. Their names start with `grpc_server`, the Prometheus subsystem name. Each metric comes with an operator guide to healthy vs. degraded values.

- **`grpc_server_active_connections`**

**Type**: Gauge \
**Description**: Current number of active connections. \
**Healthy**: A stable or predictable number of gRPC connections, based on expected client load. \
**Degraded**: A sudden drop to zero (all clients disconnected) or a sharp surge above the baseline may indicate connection leaks, restarts, or faulty clients. \
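The healthy and degraded conditions for `grpc_server_active_connections` can be checked mechanically against scraped gauge samples. The following is a minimal sketch, not part of the gRPC server: the function name `classify_active_connections`, the `baseline` argument, and the default `surge_factor` of 2.0 are illustrative assumptions that an operator would tune for their own environment.

```python
def classify_active_connections(samples, baseline, surge_factor=2.0):
    """Classify the latest grpc_server_active_connections sample.

    samples: gauge values ordered oldest to newest.
    baseline: expected connection count under normal load (assumed known).
    surge_factor: hypothetical multiplier above baseline treated as a surge.
    """
    if not samples:
        return "unknown"
    latest = samples[-1]
    previous = samples[-2] if len(samples) > 1 else latest
    # A drop to zero from a non-zero value means all clients disconnected.
    if latest == 0 and previous > 0:
        return "degraded: dropped to zero"
    # A value far above the baseline may indicate leaks or faulty clients.
    if latest > baseline * surge_factor:
        return "degraded: surge above baseline"
    return "healthy"


print(classify_active_connections([12, 11, 12], baseline=12))
print(classify_active_connections([12, 11, 0], baseline=12))
print(classify_active_connections([12, 14, 40], baseline=12))
```

In practice the same conditions would more likely be expressed as alerting rules in the monitoring stack; the function only makes the two degraded cases explicit.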
**Description**: Total number of RPC messages received on the server. \
**Healthy**: Steady growth aligned with expected traffic. \
**Degraded**: Abnormal spikes may indicate flooding or misbehaving clients. If the ratio of `grpc_server_msg_received_total` to `grpc_server_msg_sent_total` also rises, the server or downstream may not be processing requests as fast as they are received. \
**Description**: Total number of gRPC messages sent by the server. \
**Healthy**: Consistent growth that matches the gRPC server and downstream processing rate. \
**Degraded**: A sudden drop may mean the server and downstream aren’t processing requests. If the ratio of `grpc_server_msg_received_total` to `grpc_server_msg_sent_total` also rises, the server or downstream may be falling behind. \
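Because both message metrics are cumulative counters, the received-to-sent ratio is only meaningful when computed from their increase over a time window. The sketch below illustrates that calculation under stated assumptions: the helper names `counter_increase` and `received_to_sent_ratio` are hypothetical, the sample values are made up, and the reset handling is a simplification of what a Prometheus `increase()` query does.

```python
def counter_increase(start, end):
    """Increase of a cumulative counter between two scrapes.

    If the counter went backwards, the process is assumed to have restarted
    and the counter to have reset to zero, so the end value is used as the
    increase since the reset (a simplification).
    """
    return end - start if end >= start else end


def received_to_sent_ratio(received, sent):
    """Ratio of received to sent messages over one window.

    received, sent: (start, end) samples of grpc_server_msg_received_total
    and grpc_server_msg_sent_total. Returns None when nothing was sent.
    """
    sent_delta = counter_increase(*sent)
    if sent_delta == 0:
        return None
    return counter_increase(*received) / sent_delta


# Two consecutive windows with made-up samples: the ratio rises from 1.0
# to 4.0, the pattern described above as the server falling behind.
print(received_to_sent_ratio((1000, 1600), (900, 1500)))
print(received_to_sent_ratio((1600, 2800), (1500, 1800)))
```

The same approach applies to the byte counters by substituting the `_bytes_total` samples.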
**Description**: Total number of message bytes received on the gRPC server. \
**Healthy**: Steady growth aligned with expected traffic. \
**Degraded**: Abnormal spikes may indicate flooding or misbehaving clients. If the ratio of `grpc_server_msg_received_bytes_total` to `grpc_server_msg_sent_bytes_total` also rises, the server or downstream may not be processing requests as fast as they are received. \
**Description**: Total number of message bytes sent by the gRPC server. \
**Healthy**: Consistent growth that matches the gRPC server and downstream processing rate. \
**Degraded**: A sudden drop may mean the server and downstream aren’t processing requests. If the ratio of `grpc_server_msg_received_total` to `grpc_server_msg_sent_total` also rises, the server or downstream may be falling behind. \
**Description**: Total number of RPCs completed on the server, regardless of success or failure. \
**Healthy**: Most RPCs complete with `grpc_code="OK"`. \
**Degraded**: An increasing number of non-OK codes (e.g., `Unavailable`, `DeadlineExceeded`, `Internal`) signals gRPC server instability or downstream errors. \
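One way to turn "most RPCs complete with OK" into a number is the fraction of completed RPCs with a non-OK `grpc_code` over a window. The following is a minimal sketch: the function name `non_ok_fraction` and the sample counts are illustrative assumptions, and the input is assumed to be the per-code increase of the completed-RPC counter over the observation window.

```python
def non_ok_fraction(increase_by_code):
    """Fraction of completed RPCs whose grpc_code is not "OK".

    increase_by_code: mapping of grpc_code label value to the number of RPCs
    completed with that code during the observation window.
    """
    total = sum(increase_by_code.values())
    if total == 0:
        return 0.0
    return (total - increase_by_code.get("OK", 0)) / total


# Made-up window: 970 OK, 20 Unavailable, 10 DeadlineExceeded.
window = {"OK": 970, "Unavailable": 20, "DeadlineExceeded": 10}
print(f"{non_ok_fraction(window):.1%} of RPCs ended with a non-OK code")
```

What error fraction counts as degraded depends on the deployment; tracking how the value changes over time is usually more informative than any single reading.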
**Description**: Histogram of the duration of RPC handling by the gRPC server. \
**Healthy**: Request latencies fall mostly into the lower buckets (e.g., <0.1s). \
**Degraded**: Shifts into higher buckets (e.g., >1s or >5s) mean the server is slowing down, possibly due to load, resource starvation, or dependency issues. \
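Prometheus histograms report cumulative bucket counts, each labelled with an upper bound `le`, so "falls mostly into the lower buckets" can be read directly from the bucket values. The sketch below assumes hypothetical bucket bounds of 0.1s, 1s, and 5s (taken from the examples above, not from the server's actual configuration) and made-up counts; `fraction_at_or_below` is an illustrative helper name.

```python
def fraction_at_or_below(cumulative_buckets, upper_bound):
    """Fraction of observations at or below upper_bound.

    cumulative_buckets: mapping of bucket upper bound ("le") to cumulative
    count, as Prometheus histograms report it; must include float("inf").
    upper_bound: must be one of the configured bucket bounds.
    """
    total = cumulative_buckets[float("inf")]
    if total == 0:
        return 0.0
    return cumulative_buckets[upper_bound] / total


# Made-up snapshot: 900 of 1000 RPCs finished within 0.1s, 980 within 1s.
buckets = {0.1: 900, 1.0: 980, 5.0: 998, float("inf"): 1000}
fast = fraction_at_or_below(buckets, 0.1)
slow = 1 - fraction_at_or_below(buckets, 1.0)
print(f"within 0.1s: {fast:.0%}, slower than 1s: {slow:.0%}")
```

A falling `fast` share or a growing `slow` share across successive snapshots corresponds to the shift into higher buckets described above.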
Metrics specific to CloudEvents RPC calls. Their names start with `grpc_server_ce`, the Prometheus subsystem name. Each metric comes with an operator guide to healthy vs. degraded values.

- **`grpc_server_ce_called_total`**

**Type**: Counter \
**Description**: Total number of RPC requests called on the server. \
**Description**: Total number of messages received on the gRPC server. \
**Healthy**: Regular increments matching agent traffic; most received events eventually lead to sent events. \
**Degraded**: A large gap between `grpc_server_ce_msg_received_total` and `grpc_server_ce_msg_sent_total` (many received but few sent or processed) may indicate server bottlenecks or dropped events. \
**Description**: Total number of messages sent by the gRPC server. \
**Healthy**: Consistent growth that matches the gRPC server and downstream processing rate. \
**Degraded**: A large gap between `grpc_server_ce_msg_received_total` and `grpc_server_ce_msg_sent_total` (many received but few sent or processed) may indicate server bottlenecks or dropped events. \