Caching service health endpoint #4427

@pavel-jares-bcm

Description

This issue is based on #4408.

In the Caching Service it is difficult to verify the state. This relates mainly to Infinispan, but in general it could be an issue with any distributed cache implementation.

The issue is that Infinispan (JGroups) uses an independent communication channel that is not monitored. Currently we check the state of the Caching Service by checking the status of the service itself, but the status UP doesn't mean that the service is working properly. To be accurate, we would have to look in the log for the GMS message that informs us that the JGroups ports are bound. Even that is not enough to detect whether all nodes have joined. To do that, a user needs the JGroups debug log and has to analyze the communication to see whether all nodes are communicating with each other. This is very complicated, even with knowledge of how it works.

The aim of this issue is to simplify that.

The health endpoint should contain new checks:

  • JGroups is listening
  • list of connected nodes
  • number of nodes
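
As a rough illustration of what these three checks could expose, here is a minimal sketch in plain Java. The class `ClusterHealthDetails`, the method `build`, and the detail keys are hypothetical names chosen for this example; none of them come from the Caching Service, Infinispan, or JGroups. The sketch assumes the caller has already obtained the listening flag and the member list from the cache implementation (for Infinispan the member list might come from `EmbeddedCacheManager.getMembers()`, but that wiring is not shown here).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not part of the Caching Service, Infinispan, or JGroups.
final class ClusterHealthDetails {

    private ClusterHealthDetails() {
    }

    /**
     * Collects the three proposed checks into one map of health details.
     *
     * @param listening whether the JGroups transport reports that it is listening
     * @param members   names or addresses of the nodes currently in the cluster view
     */
    static Map<String, Object> build(boolean listening, List<String> members) {
        // LinkedHashMap keeps the keys in insertion order for stable output.
        Map<String, Object> details = new LinkedHashMap<>();
        details.put("jgroupsListening", listening);
        details.put("connectedNodes", List.copyOf(members));
        details.put("nodeCount", members.size());
        return details;
    }
}
```

A map like this could then be attached to the health response as additional details, for example through a Spring Boot `HealthIndicator`, if that is how the endpoint is implemented.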

It is questionable whether JGroups should influence the service status, because the service should remain available even if some instances in an HA setup are down. However, it probably makes sense to reflect it in the service state during start-up. At the very least, the service should be DOWN until JGroups is listening. I would also suggest keeping the status DOWN until the cluster is connected for the first time (that is, requiring that the cluster be established at least once).
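
The start-up rule suggested above could be sketched as a small latch, shown here in plain Java. Everything in it is an assumption made for illustration: the class `ClusterStartupGate`, its callbacks, and in particular the `minNodes` threshold used to decide that "the cluster is connected", which the issue does not define.

```java
// Hypothetical sketch of the proposed start-up rule; all names are illustrative.
final class ClusterStartupGate {

    enum Status { UP, DOWN }

    private final int minNodes;             // assumed definition of "cluster connected"
    private boolean listening;              // set once JGroups reports it is listening
    private boolean clusterEstablishedOnce; // latched: never reset after the first success

    ClusterStartupGate(int minNodes) {
        this.minNodes = minNodes;
    }

    /** To be called when the JGroups transport has bound its ports. */
    synchronized void onListening() {
        listening = true;
    }

    /** To be called on every cluster view change with the current member count. */
    synchronized void onViewChange(int memberCount) {
        if (memberCount >= minNodes) {
            clusterEstablishedOnce = true;
        }
    }

    /** DOWN during start-up; UP once listening and the cluster was formed at least once. */
    synchronized Status status() {
        return listening && clusterEstablishedOnce ? Status.UP : Status.DOWN;
    }
}
```

Because the flag is latched, a node leaving later does not flip the status back to DOWN, which matches the concern that the service should stay available when some HA instances are down.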

Labels: enhancement, new
