
Conversation

Utkarsh9571 commented Nov 3, 2025

This patch ensures the k8sobjectsreceiver recovers from silent watch stream drops. Previously, when the channel closed unexpectedly (!ok), the receiver exited without retrying. This change:

  • Logs the unexpected closure
  • Returns false to trigger a restart in startWatch
  • Adds an Info log to signal recovery intent
  • Adds a 2s backoff before retrying

This improves resilience in watch mode and aligns with the retry behavior for 410 Gone.

Related to: #43928
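
For reviewers, here is a minimal sketch of the intended control flow. Only startWatch, the !ok check, the Info log, and the 2s sleep come from this patch; the type, field, and helper names below are illustrative rather than the receiver's actual identifiers, and the sketch assumes the standard watch.Interface from k8s.io/apimachinery:

package k8sobjectsreceiver // illustrative placement only, not the actual file layout

import (
    "context"
    "time"

    "go.uber.org/zap"
    "k8s.io/apimachinery/pkg/watch"
)

// objectsWatcher stands in for the real receiver type; only the fields this
// sketch needs are shown.
type objectsWatcher struct {
    logger *zap.Logger
}

// startWatch keeps the watch alive: whenever doneWatching reports an
// unexpected stop (false), it backs off briefly and starts a new watch
// instead of exiting silently.
func (o *objectsWatcher) startWatch(ctx context.Context, newWatcher func() (watch.Interface, error)) {
    for {
        w, err := newWatcher()
        if err != nil {
            o.logger.Error("could not start watch", zap.Error(err))
            return
        }
        if done := o.doneWatching(ctx, w); done {
            return
        }
        // Backoff added by this patch: avoid a tight restart loop when the
        // event channel closes silently.
        time.Sleep(2 * time.Second)
    }
}

// doneWatching consumes watch events until the context is cancelled (true)
// or the result channel closes unexpectedly (false, meaning restart).
func (o *objectsWatcher) doneWatching(ctx context.Context, w watch.Interface) bool {
    defer w.Stop()
    for {
        select {
        case <-ctx.Done():
            return true
        case event, ok := <-w.ResultChan():
            if !ok {
                o.logger.Info("watch channel closed unexpectedly, restarting watch")
                return false
            }
            _ = event // convert and forward the object here (omitted in this sketch)
        }
    }
}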

linux-foundation-easycla bot commented Nov 3, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: Utkarsh9571 / name: Utkarsh9571 (0314b23)

github-actions bot added the first-time contributor label on Nov 3, 2025
github-actions bot commented Nov 3, 2025

Welcome, contributor! Thank you for your contribution to opentelemetry-collector-contrib.

A maintainer will review your pull request soon. Thank you for helping make OpenTelemetry better!

atoulme (Contributor) commented Nov 5, 2025

Please add a changelog, and is there a way to test this behavior?


Utkarsh9571 (Author)

@atoulme Thanks for the review! Here's the changelog entry and test strategy:

Changelog

  • receiver/k8sobjects: Restart watch loop on unexpected channel closure to improve resilience and avoid silent failure.

Test Strategy

  • Verified that the receiver logs unexpected channel closure and retries with backoff.
  • Confirmed behavior aligns with existing retry logic for 410 Gone.
  • Manual testing in a simulated cluster confirmed restart after channel drop.
  • No changes to config or external interfaces; patch is scoped to internal watch loop.

Let me know if you'd like a unit test for the restart logic or if integration coverage is sufficient.
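
As a rough idea, a unit test could look something like the following. It reuses the illustrative objectsWatcher/doneWatching names from the sketch in the description (not the receiver's real API) and the fake watcher from k8s.io/apimachinery/pkg/watch:

package k8sobjectsreceiver

import (
    "context"
    "testing"

    "go.uber.org/zap"
    "k8s.io/apimachinery/pkg/watch"
)

func TestWatchRestartsOnUnexpectedChannelClose(t *testing.T) {
    fw := watch.NewFake()
    fw.Stop() // closes the result channel, simulating a silent stream drop

    o := &objectsWatcher{logger: zap.NewNop()}
    if done := o.doneWatching(context.Background(), fw); done {
        t.Fatal("expected doneWatching to report false (restart) after an unexpected channel close")
    }
}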

Comment on lines +307 to +310
if !done {
    time.Sleep(2 * time.Second)
}

Contributor

@Utkarsh9571 Why are we sleeping here?

Contributor

nevermind, got it.

Member

May I ask that too :)? Is this supposed to give the watcher time to start? I don't see why it's needed.

Contributor

@ChrsMark looking at the PR description, it looks like this is adding some sort of backoff instead of retrying immediately.

Quoting @Utkarsh9571's comment: "The sleep is there to avoid tight restart loops when the channel closes silently. It gives the receiver a moment to reset before retrying."

Contributor

I think we will be okay without it, but I'll let @Utkarsh9571 reply to your question before proceeding.

Member

I see, thanks for elaborating! My question then is whether that's actually needed. It looks a bit arbitrary tbh, and I'm not sure it solves an existing problem.

Author

@ChrsMark @VihasMakwana I'm happy to clarify.

The time.Sleep(2 * time.Second) is a minimal backoff to prevent tight restart loops when the channel closes silently. Without it, the receiver can enter a rapid retry cycle that floods logs and consumes CPU unnecessarily, especially in edge cases where the watcher fails to initialize cleanly.

This sleep gives the receiver a moment to reset before retrying. It’s not about waiting for the watcher to start — it’s about avoiding aggressive retries when done is false due to a silent failure.

To be clear, the time.Sleep(2 * time.Second) wasn't directly requested by @atoulme; it was my interpretation of how to avoid tight restart loops after a silent channel close, based on behavior observed during testing, where the receiver would otherwise retry rapidly without any delay.

Let me know your preference and I’ll update the patch accordingly.
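
For example, if a fixed 2s delay feels too arbitrary, a capped exponential backoff would be one option. A minimal illustrative helper (not part of the current patch; the name and constants are hypothetical):

package k8sobjectsreceiver

import "time"

// waitBeforeRetry sleeps 1s, 2s, 4s, ... for successive failed attempts
// (attempt is assumed non-negative), capped at 30s, so repeated silent
// closures back off progressively instead of retrying at a fixed rate.
func waitBeforeRetry(attempt int) {
    const base, maxDelay = time.Second, 30 * time.Second
    d := base << attempt
    if d <= 0 || d > maxDelay {
        d = maxDelay
    }
    time.Sleep(d)
}

The watch loop would pass an attempt counter and reset it to zero once events flow again, so a healthy watch is not penalized.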

Utkarsh9571 (Author)

@VihasMakwana Thanks for checking! The sleep is there to avoid tight restart loops when the channel closes silently. It gives the receiver a moment to reset before retrying. Glad it makes sense now — let me know if you'd prefer a different strategy.

ChrsMark (Member) commented Nov 6, 2025

Regarding the proposed changelog entry above: for adding the changelog, see the "Adding a Changelog Entry" section in the repository's CONTRIBUTING.md.

Regarding the test strategy: I guess what @atoulme asked for is how this fix can be manually tested, what the steps to reproduce are, etc.

In general, I'm not sure we can ensure that it will fix #43928.
