Add ProcessLength to heartbeat event #390
base: main
Conversation
This change extends the heartbeat event payload to include ProcessLength, which represents the current number of running processes on the VM. By including the process count in heartbeats, the bosh-monitor can align its behavior with the output of the bosh-cli 'vms' command. This allows the monitor to react to process state changes in real time without requiring separate get_state RPC calls.
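As a rough illustration of the shape of this change (a sketch only; the struct, field, and JSON key names below are assumptions, not the actual bosh-agent payload definition):

package heartbeatsketch

// Hypothetical sketch of the extended heartbeat payload; the real
// bosh-agent types and JSON keys may differ.
type Heartbeat struct {
	Job      string `json:"job"`
	Index    int    `json:"index"`
	JobState string `json:"job_state"`
	// ProcessLength is the newly added field: the number of processes
	// currently reported by the job supervisor for this VM.
	ProcessLength int `json:"process_length"`
}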
During the FI WG meeting we discussed this, and it looks like there were also past PRs which tried to address this topic. Could we include those for context?
rkoster left a comment:
Also could you update the description with a bit more context on why this change is needed? Who would be the consumer of this change?
Description updated.
rkoster left a comment:
I'm worried about what happens if the call to get the processes fails. Missing heartbeats could lead to the VM being recreated, so we should prioritize sending heartbeats even if some information is incomplete.
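One way to address this concern (a minimal sketch under assumed names; buildHeartbeat, JobSupervisor, and Heartbeat here are placeholders, not the actual bosh-agent code) is to log the failure and still emit the heartbeat with the process information left out:

package heartbeatsketch

import "log"

// Minimal stand-ins for the real interfaces; names are assumed.
type Process struct{ Name, State string }

type JobSupervisor interface {
	Processes() ([]Process, error)
}

type Heartbeat struct {
	JobState      string
	ProcessLength int
}

// buildHeartbeat prefers an incomplete heartbeat over no heartbeat:
// if the process lookup fails, the error is only logged and the
// heartbeat still goes out, so the VM is not flagged as unresponsive.
func buildHeartbeat(supervisor JobSupervisor) Heartbeat {
	hb := Heartbeat{JobState: "running"}

	processes, err := supervisor.Processes()
	if err != nil {
		log.Printf("heartbeat: getting processes failed: %s", err)
		return hb
	}

	hb.ProcessLength = len(processes)
	return hb
}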
Cross-linking the bosh changes: cloudfoundry/bosh#2649
Context("when the boshAgent fails to get processes for a heartbeat", func() {
	BeforeEach(func() {
		jobSupervisor.ProcessesError = errors.New("fake-processes-error")
		handler.KeepOnRunning()
	})

	It("returns the error", func() {
		err := boshAgent.Run()
		Expect(err).To(HaveOccurred())
		Expect(err.Error()).To(ContainSubstring("fake-processes-error"))
	})
})
It looks to me like this test needs to be adapted.
Good catch. I adjusted it in f1967f5.
This change extends the heartbeat event payload to include ProcessLength, which represents the current number of running processes on the VM.

By including the process count in heartbeats, the bosh-monitor can align its behavior with the output of the bosh-cli 'vms' command. This allows the monitor to react to process state changes in real time without requiring separate get_state RPC calls.

With this change we are able to implement another health monitor API, '/unhealthy_agents', which corresponds to the existing '/unresponsive_agents' API. In the current implementation of the bosh health monitor, unresponsive agents are only those agents that did not respond within a certain amount of time. If an agent does respond but is not in a healthy state (the case shown as '-' in bosh-cli command output), an observer cannot detect this today. To improve observability we would like to have this API in the health monitor, and the heartbeat is an easy and reliable way to obtain that information regularly. An observer can be informed actively instead of polling APIs (which means more effort on both the client and server side) and calculating the state on its own.

The consumers would be operators of large-scale environments deployed by BOSH who want an active alert whenever an agent reports an unhealthy state.
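To make the intended consumer side concrete (a sketch only; the real bosh-monitor is a separate Ruby codebase and the '/unhealthy_agents' endpoint described above does not exist yet, so all names below are assumptions), the classification a monitor could derive from heartbeats alone might look like this:

package monitorsketch

// AgentStatus is an assumed, simplified view of the data a monitor
// keeps per agent, fed by heartbeat events.
type AgentStatus struct {
	LastHeartbeatAgeSeconds int
	JobState                string // e.g. "running", "failing"
	ProcessLength           int    // from the new heartbeat field
	ExpectedProcesses       int    // e.g. from the deployment manifest
}

// Unresponsive covers the existing notion: no heartbeat within the window.
func Unresponsive(a AgentStatus, windowSeconds int) bool {
	return a.LastHeartbeatAgeSeconds > windowSeconds
}

// Unhealthy covers the new notion: the agent does respond, but its
// heartbeat reports a bad job state or fewer processes than expected
// (the '-' case in `bosh vms`).
func Unhealthy(a AgentStatus) bool {
	return a.JobState != "running" || a.ProcessLength < a.ExpectedProcesses
}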