Better tracking of crash-looping processes #2384

@johscheuer

What would you like to be added/changed?

This came up in #2373 (comment). Right now the operator has no way to track crash-looping processes, e.g. processes that report to the cluster for a short time and then crash. Possible root causes include networking issues and resource constraints. The operator does not track the restart count, and the crash-looping of the fdbserver is not visible at the Pod level (fdbmonitor keeps running and simply restarts the fdbserver), so it is currently hard for the operator to replace such flaky Pods. Those Pods can cause issues for the cluster and block operations that expect the cluster to be up for a specific time. It can also be hard for a human operator to track down those crash-looping processes.
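To make the idea concrete, here is a minimal sketch of one way restarts could be counted without relying on the Pod status. It is not the operator's implementation. It assumes that each process reports an uptime value that can be polled periodically, and it treats a drop in that value between two samples as a restart. The class name `CrashLoopTracker`, its parameters, and the thresholds are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


class CrashLoopTracker:
    """Counts restarts per process, inferred from drops in reported uptime.

    Hypothetical helper for illustration; thresholds are assumptions.
    """

    def __init__(self, max_restarts=3, window_seconds=600.0):
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._last_uptime = {}               # process ID -> last reported uptime
        self._restarts = defaultdict(deque)  # process ID -> restart timestamps

    def observe(self, process_id, uptime_seconds, now):
        """Record one uptime sample taken at time `now` (in seconds)."""
        previous = self._last_uptime.get(process_id)
        # An uptime lower than the previous sample means the process restarted.
        if previous is not None and uptime_seconds < previous:
            self._restarts[process_id].append(now)
        self._last_uptime[process_id] = uptime_seconds
        self._prune(process_id, now)

    def _prune(self, process_id, now):
        """Drop restart timestamps that fell out of the sliding window."""
        restarts = self._restarts[process_id]
        while restarts and now - restarts[0] > self.window_seconds:
            restarts.popleft()

    def is_crash_looping(self, process_id, now):
        """True if the process restarted at least max_restarts times in the window."""
        self._prune(process_id, now)
        return len(self._restarts[process_id]) >= self.max_restarts


tracker = CrashLoopTracker(max_restarts=3, window_seconds=600.0)
# (poll time, reported uptime) for a flaky process; a healthy one runs alongside.
samples = [(0, 40.0), (60, 5.0), (120, 12.0), (180, 3.0), (240, 8.0), (300, 2.0)]
for now, uptime in samples:
    tracker.observe("storage-1", uptime, now)
    tracker.observe("storage-2", 1000.0 + now, now)

print(tracker.is_crash_looping("storage-1", now=300))
print(tracker.is_crash_looping("storage-2", now=300))
```

This approach has a known blind spot: if a process restarts and then stays up longer than its previous sampled uptime before the next poll, the restart is missed. A shorter polling interval, or a restart counter exposed by the process supervisor, would close that gap.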

Labels: enhancement (New feature or request)
