Skip to content

Conversation

@DennisAhausSAP
Copy link

@DennisAhausSAP DennisAhausSAP commented Nov 24, 2025

This change extends the heartbeat event payload to include ProcessLength, which represents the current number of running processes on the VM.

By including process count in heartbeats, the bosh-monitor can now align its behavior with the bosh-cli 'vms' command output. This allows the monitor to react to process state changes in real-time without requiring separate get_state RPC calls. With this change we are able to implement another health monitor api called '/unhealthy_agents' which corresponds to the existing api '/unresponsive_agents'. The reason ist that unresponsive agents regarding to the current implementation in the bosh health monitor addresses only those agents which did not respond in a certain amount of time. But if an agent does respond but is not in an healthy state (like the '-' is showing up when we call bosh-cli commands) this can not be obtained by an observer. To have a better observability we would like to have this api in the health monitor. Therefore the heartbeat is an easy and reliable way to get that information regulalry. An observer can actively be informed instead of calling apis (which lead to more effort in client and server's side) and calculate on it own. The consumers would be those who have a large scale environment deployed by bosh and want to have an active alert if any agent recognizes an unhealthy state.

This change extends the heartbeat event payload to include ProcessLength,
which represents the current number of running processes on the VM.

By including process count in heartbeats, the bosh-monitor can now align
its behavior with the bosh-cli 'vms' command output. This allows the monitor
to react to process state changes in real-time without requiring separate
get_state RPC calls
@beyhan
Copy link
Member

beyhan commented Nov 27, 2025

During the FI WG meeting we discussed this and looks like there were also past prs which tried to address this topic. Could we include those for context.

@beyhan beyhan requested a review from rkoster November 27, 2025 15:55
Copy link
Contributor

@rkoster rkoster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also could you update the description with a bit more context on why this change is needed? Who would be the consumer of this change?

@github-project-automation github-project-automation bot moved this from Pending Review | Discussion to Waiting for Changes | Open for Contribution in Foundational Infrastructure Working Group Nov 27, 2025
@DennisAhausSAP
Copy link
Author

Also could you update the description with a bit more context on why this change is needed? Who would be the consumer of this change?

Description updated

Copy link
Contributor

@rkoster rkoster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about what happens if the call to get the procceses fails. Because missing heartbeats could lead to the vm being recreated, so we should prioritize sending heartbeats even if some information is incomplete.

@aramprice
Copy link
Member

Cross linking bosh changes cloudfoundry/bosh#2649

@rkoster rkoster moved this from Waiting for Changes | Open for Contribution to Pending Review | Discussion in Foundational Infrastructure Working Group Dec 18, 2025
aramprice
aramprice previously approved these changes Dec 18, 2025
Comment on lines 258 to 270
Context("when the boshAgent fails to get processes for a heartbeat", func() {
BeforeEach(func() {
jobSupervisor.ProcessesError = errors.New("fake-processes-error")
handler.KeepOnRunning()
})

It("returns the error", func() {
err := boshAgent.Run()
Expect(err).To(HaveOccurred())
Expect(err.Error()).To(ContainSubstring("fake-processes-error"))
})
})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me that this test needs to be adapted.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I adjusted it in f1967f5

@github-project-automation github-project-automation bot moved this from Pending Review | Discussion to Waiting for Changes | Open for Contribution in Foundational Infrastructure Working Group Dec 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Waiting for Changes | Open for Contribution

Development

Successfully merging this pull request may close these issues.

7 participants