-
Notifications
You must be signed in to change notification settings - Fork 3.1k
fix(k8sobjectsreceiver): restart watch on unexpected channel close #43974
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
|
Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib. Important reminders:
A maintainer will review your pull request soon. Thank you for helping make OpenTelemetry better! |
|
Please add a changelog, and is there a way to test this behavior? |
|
Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib. Important reminders:
A maintainer will review your pull request soon. Thank you for helping make OpenTelemetry better! |
|
@atoulme Thanks for the review! Here's the changelog entry and test strategy: Changelog
Test Strategy
Let me know if you'd like a unit test for the restart logic or if integration coverage is sufficient. |
| if !done { | ||
| time.Sleep(2 * time.Second) | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Utkarsh9571 Why are we sleeping here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nevermind, got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May ask that too :)? Is this supposed to give time to the watcher to start? I don't see why is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChrsMark looking at PR description, it looks like this is adding some sort of backoff before retrying immediately.
The sleep is there to avoid tight restart loops when the channel closes silently. It gives the receiver a moment to reset before retrying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think we will be okay without it, but I'll let @Utkarsh9571 reply to your question before proceeding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thank's for elaborating! My question then is if that's actually needed. It looks a bit arbitrary tbh and I'm not sure if it solves an existing problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ChrsMark @VihasMakwana I'm happy to clarify.
The time.Sleep(2 * time.Second) is a minimal backoff to prevent tight restart loops when the channel closes silently. Without it, the receiver can enter a rapid retry cycle that floods logs and consumes CPU unnecessarily, especially in edge cases where the watcher fails to initialize cleanly.
This sleep gives the receiver a moment to reset before retrying. It’s not about waiting for the watcher to start — it’s about avoiding aggressive retries when done is false due to a silent failure.
I added it, the time.Sleep(2 * time.Second) was my interpretation of how to avoid tight restart loops after a silent channel close. It wasn’t directly requested by @atoulme, but I added it based on observed behavior during testing, where the receiver would retry rapidly without delay.
Let me know your preference and I’ll update the patch accordingly.
|
@VihasMakwana Thanks for checking! The sleep is there to avoid tight restart loops when the channel closes silently. It gives the receiver a moment to reset before retrying. Glad it makes sense now — let me know if you'd prefer a different strategy. |
For adding the change-log see
I guess what @atoulme asked for is how we this fix can be manually tested, what are the steps to reproduce etc. In general, I'm not sure if we can ensure that it will fix #43928. |
This patch ensures the k8sobjectsreceiver recovers from silent watch stream drops. Previously, when the channel closed unexpectedly (!ok), the receiver exited without retrying. This change:
This improves resilience in watch mode and aligns with the retry behavior for 410 Gone.
Related to: #43928