Better tracking of crash-looping processes

### What would you like to be added/changed?

This came up in https://github.com/FoundationDB/fdb-kubernetes-operator/pull/2373#discussion_r2413848086. Right now the operator has no way to track crash-looping processes, i.e. processes that report to the cluster for a short time and then crash. Possible root causes include networking issues and resource constraints. The operator does not track the restart count, and a crash-looping `fdbserver` is not visible at the Pod level because `fdbmonitor` keeps running and simply restarts the `fdbserver`. As a result, it is currently hard for the operator to replace such flaky Pods. Those Pods can cause issues for the cluster and block operations that expect the cluster to be up for a specific amount of time. It can also be hard for a human operator to track down crash-looping processes.

Better tracking of crash-looping processes #2384
