@makramkd makramkd commented Dec 19, 2025

The Problem

Described in detail in the ticket. In short, for some NOPs the plugin appears to hang inside the chainfee.Processor's Observation method, which causes the entire observation to hang forever.

This causes participation to drop and alerts to fire. Beyond the alerts, the DON loses a participant.

The Solution

The plugin must not allow Observation to hang for any reason. To achieve this, this PR changes how WaitForAllNoErrOperations works:

  • Introduce ExecuteAsyncOperations, which always respects the provided timeout. If necessary, it abandons running goroutines that do not return within that timeout.
  • To avoid spawning new goroutines for the same hanging operation, add a second helper, WrapWithSingleFlight, which checks a sync.Map before trying to run an operation again.
  • To avoid data races when ExecuteAsyncOperations returns before all goroutines have finished, each operation now returns its value instead of updating a variable captured in a closure.
  • To make it easier to track which operations failed, a UUID is generated for each execution of an operation and included as a field in the logs.

@github-actions
Metric     mk/CCIP-8540-2   main
Coverage   70.4%            69.6%

@makramkd makramkd marked this pull request as ready for review December 19, 2025 13:30
@makramkd makramkd requested a review from a team as a code owner December 19, 2025 13:30
@makramkd makramkd closed this Dec 30, 2025