fix: Prevent Close from hanging on etcd reconnection #45622
|
@weiliu1031 Please associate the related pr of master to the body of your Pull Request. (eg. "pr: #") |
|
[ci-v2-notice]
To rerun ci-v2 checks, comment with:
If you have any questions or requests, please contact @zhikunyao. |
|
[INFO] PR Label Summary by Default
[WARNING] Milestone not set
You can set milestone by commenting: Use /refresh-label to update related check and label manually |
|
@weiliu1031 Please associate the related issue to the body of your Pull Request. (eg. "issue: #") |
|
/kind branch-feature |
|
/set-milestone 2.5.23 |
|
[INFO] Set milestone to: 2.5.23 |
|
/refresh-label |
|
[INFO] PR Label Summary by Refresh-Label
[INFO] Dependent PR check skipped - branch feature PR (kind/branch-feature) Use /refresh-label to update related check and label manually |
|
/ci-rerun-ut-go |
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@            Coverage Diff             @@
##              2.5   #45622      +/-   ##
==========================================
- Coverage   82.10%   82.05%   -0.05%
==========================================
  Files        1128     1587     +459
  Lines      179181   248710   +69529
==========================================
+ Hits       147110   204087   +56977
- Misses      26099    38618   +12519
- Partials     5972     6005      +33
|
When etcd reconnects, the DataCoord rewatches DataNodes and calls ChannelManager.Startup again without closing the previous instance. This causes multiple contexts and goroutines to accumulate, leading to Close hanging indefinitely waiting for untracked goroutines.

Root cause:
- Etcd reconnection triggers rewatch flow and calls Startup again
- Startup was not idempotent, allowing repeated calls
- Multiple context cancellations and goroutines accumulated
- Close would wait indefinitely for untracked goroutines

Changes:
- Add started field to ChannelManagerImpl
- Refactor Startup to check and handle restart scenario
- Add state check in Close to prevent hanging

Signed-off-by: Wei Liu <[email protected]>
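The restart handling described in the commit message can be sketched roughly as below. This is a minimal illustration, not the code from this PR: `channelManager`, `loop`, `stopLocked`, and `restartAndClose` are hypothetical names, and only the idea of a `started` flag guarding `Startup` and `Close` comes from the description above.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// channelManager is a hypothetical stand-in for ChannelManagerImpl.
// Only the idea of a `started` flag is taken from the PR description.
type channelManager struct {
	mu      sync.Mutex
	started bool
	cancel  context.CancelFunc
	wg      sync.WaitGroup
}

// Startup is safe to call repeatedly. If a previous run is still active
// (for example after an etcd reconnect triggers a rewatch), it is stopped
// first, so only one context and one set of goroutines is ever live.
func (m *channelManager) Startup(ctx context.Context) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.started {
		m.stopLocked()
	}
	runCtx, cancel := context.WithCancel(ctx)
	m.cancel = cancel
	m.started = true
	m.wg.Add(1)
	go m.loop(runCtx)
}

// loop stands in for the manager's background work; it exits when the
// context of its run is cancelled.
func (m *channelManager) loop(ctx context.Context) {
	defer m.wg.Done()
	ticker := time.NewTicker(10 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			// periodic work would go here
		}
	}
}

// Close returns immediately when the manager is not running, so it never
// waits on goroutines it does not track.
func (m *channelManager) Close() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.started {
		return
	}
	m.stopLocked()
}

// stopLocked cancels the current run and waits for its goroutines.
// The caller must hold m.mu.
func (m *channelManager) stopLocked() {
	m.cancel()
	m.wg.Wait()
	m.started = false
}

// restartAndClose replays the reconnect sequence and reports whether the
// manager ended up stopped.
func restartAndClose() bool {
	m := &channelManager{}
	m.Close() // Close before Startup is a no-op
	m.Startup(context.Background())
	m.Startup(context.Background()) // simulated rewatch after reconnect
	m.Close()
	m.Close() // a second Close is also a no-op
	m.mu.Lock()
	defer m.mu.Unlock()
	return !m.started
}

func main() {
	fmt.Println("stopped cleanly:", restartAndClose())
}
```

Holding the mutex across `wg.Wait()` is acceptable in this sketch only because `loop` never takes the lock; a real implementation whose goroutines call back into the manager would need a different arrangement.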
69c7431 to
0822d4e
|
[INFO] PR Label Summary by Default Use /refresh-label to update related check and label manually |
congqixia left a comment
/lgtm
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: congqixia, weiliu1031 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
[INFO] PR Label Summary by Default Use /refresh-label to update related check and label manually |
issue: #45623