Skip to content

fix: service monitor recovery detection#318

Merged
rshoemaker merged 1 commit intomainfrom
fix/PLAT-525/service_monitor
Mar 25, 2026
Merged

fix: service monitor recovery detection#318
rshoemaker merged 1 commit intomainfrom
fix/PLAT-525/service_monitor

Conversation

@rshoemaker
Copy link
Copy Markdown
Contributor

Summary

  • The service instance monitor treated failed as a terminal state — when a container restarted and became healthy, the status blob updated but the state stayed failed permanently, requiring manual intervention.
  • Added recovery logic in handleStateTransition: when a failed instance's container is healthy, transition back to running and clear the error. This matches the existing pattern used by the database instance monitor.
  • Added explicit failed state handling in handleNilStatus and handleStatusError with debug logging for better observability.

Test plan

  • Reproduced on a running 1-node+MCP fixture: killed MCP container, confirmed state stuck at failed with old code
  • Deployed fix, confirmed automatic recovery: failed → running within ~16 seconds of container restart
  • Verified stable — no flapping after recovery
  • go vet passes

PLAT-525

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 25, 2026

📝 Walkthrough

Walkthrough

This PR adds explicit handling for the Failed state in the service instance monitor. Changes include debug logging for status errors in failed instances, state transitions when container health improves, and nil status handling for failed instances.

Changes

Cohort / File(s) Summary
Failed State Handling
server/internal/monitor/service_instance_monitor.go
Introduces explicit Failed state management in three methods: adds debug logging in handleStatusError when already failed, transitions to Running in handleStateTransition upon health recovery, and prevents state changes in handleNilStatus for failed instances.

Poem

🐰 When instances fail with troubled eyes,
We log and watch, then help them rise,
From failed to running, health renewed,
With gentle handling, well-construed! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding recovery detection logic to allow failed service instances to transition back to running state when healthy.
Description check ✅ Passed The description provides a clear summary, explains the changes, and includes a comprehensive test plan with reproduction steps and verification. However, it does not follow the template structure with explicit Changes and Checklist sections.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/PLAT-525/service_monitor

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
server/internal/monitor/service_instance_monitor.go (1)

209-220: Add targeted monitor tests for the new Failed-state branches.

Please add/extend tests for: (1) Failed + healthy status ⇒ transition to Running with cleared error, (2) Failed + nil status ⇒ no-op, (3) Failed + status error ⇒ no state mutation.

Based on learnings: “Ensure nil status is handled separately by handleNilStatus, and add tests to cover both nil and non-nil status scenarios to verify the default behavior and the explicit nil-status handling.”

Also applies to: 271-277

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/internal/monitor/service_instance_monitor.go` around lines 209 - 220,
Add unit tests for the ServiceInstanceMonitor logic when a service instance is
in database.ServiceInstanceStateFailed: (1) simulate a non-nil healthy status
and assert that m.updateState is called with
database.ServiceInstanceStateRunning and the error/message cleared, (2) simulate
a nil status and assert the monitor routes to handleNilStatus and performs no
state mutation, and (3) simulate a non-nil unhealthy status/error and assert
m.updateStatus is called but no state transition occurs. Target the switch
branch that invokes m.updateState and m.updateStatus (the Failed case) and the
handleNilStatus path, using mocks/spies for updateState/updateStatus to verify
calls and arguments for each scenario.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@server/internal/monitor/service_instance_monitor.go`:
- Around line 209-220: Add unit tests for the ServiceInstanceMonitor logic when
a service instance is in database.ServiceInstanceStateFailed: (1) simulate a
non-nil healthy status and assert that m.updateState is called with
database.ServiceInstanceStateRunning and the error/message cleared, (2) simulate
a nil status and assert the monitor routes to handleNilStatus and performs no
state mutation, and (3) simulate a non-nil unhealthy status/error and assert
m.updateStatus is called but no state transition occurs. Target the switch
branch that invokes m.updateState and m.updateStatus (the Failed case) and the
handleNilStatus path, using mocks/spies for updateState/updateStatus to verify
calls and arguments for each scenario.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 88a6213e-7df9-475e-a46d-1511d5856443

📥 Commits

Reviewing files that changed from the base of the PR and between 145ba44 and 4ddd041.

📒 Files selected for processing (1)
  • server/internal/monitor/service_instance_monitor.go

Copy link
Copy Markdown
Member

@mmols mmols left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rshoemaker rshoemaker merged commit 4713732 into main Mar 25, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants