@weiliu1031
Contributor

@weiliu1031 weiliu1031 commented Nov 17, 2025

issue: #45623
When etcd reconnects, the DataCoord rewatches DataNodes and calls ChannelManager.Startup again without closing the previous instance. This causes multiple contexts and goroutines to accumulate, leading to Close hanging indefinitely waiting for untracked goroutines.

Root cause:

  • Etcd reconnection triggers rewatch flow and calls Startup again
  • Startup was not idempotent, allowing repeated calls
  • Multiple context cancellations and goroutines accumulated
  • Close would wait indefinitely for untracked goroutines
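The root cause above can be illustrated with a minimal sketch. It is not the Milvus code: `leakyManager`, `closeReturns`, and `demoLeak` are hypothetical names, and the sketch assumes the hang comes from a stored cancel function being overwritten by the second Startup call, which is only one way the described accumulation could occur.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// leakyManager is a hypothetical stand-in for a manager whose Startup
// is not idempotent. It is not the real ChannelManagerImpl.
type leakyManager struct {
	cancel context.CancelFunc
	wg     sync.WaitGroup
}

// Startup overwrites the stored cancel func on every call, so the context
// created by an earlier call can no longer be cancelled through the manager.
func (m *leakyManager) Startup(parent context.Context) {
	ctx, cancel := context.WithCancel(parent)
	m.cancel = cancel
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		<-ctx.Done() // placeholder for a background loop
	}()
}

// Close cancels only the most recent context, then waits for every
// goroutine, including those tied to contexts it can no longer cancel.
func (m *leakyManager) Close() {
	m.cancel()
	m.wg.Wait()
}

// closeReturns reports whether Close finishes within the timeout.
func closeReturns(m *leakyManager, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		m.Close()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

// demoLeak reports whether Close returns after one Startup call and
// after two Startup calls (the second simulating a rewatch).
func demoLeak() (afterOne, afterTwo bool) {
	parent, stop := context.WithCancel(context.Background())
	defer stop() // releases the leaked goroutine once the demo is done

	one := &leakyManager{}
	one.Startup(parent)
	afterOne = closeReturns(one, 200*time.Millisecond)

	two := &leakyManager{}
	two.Startup(parent)
	two.Startup(parent)
	afterTwo = closeReturns(two, 200*time.Millisecond)
	return afterOne, afterTwo
}
```

In this sketch, Close returns promptly after a single Startup, but after a repeated Startup it blocks until the parent context is cancelled, because the first goroutine is still counted by the WaitGroup while its context is unreachable.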

Changes:

  • Add started field to ChannelManagerImpl
  • Refactor Startup to check and handle restart scenario
  • Add state check in Close to prevent hanging
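The shape of these changes might look roughly like the sketch below. It is written under assumptions: `channelManager`, `stopLocked`, and `demoRestart` are hypothetical names, and shutting down the previous run inside Startup is only one plausible reading of "handle restart scenario"; the merged code may do something different.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// channelManager is a hypothetical, simplified stand-in. It is not the
// Milvus ChannelManagerImpl; only the started flag mirrors the PR text.
type channelManager struct {
	mu      sync.Mutex
	started bool
	cancel  context.CancelFunc
	wg      sync.WaitGroup
}

// Startup is safe to call repeatedly: if a previous run is still active,
// it is shut down first so no context or goroutine is left untracked.
func (m *channelManager) Startup(parent context.Context) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.started {
		m.stopLocked()
	}
	ctx, cancel := context.WithCancel(parent)
	m.cancel = cancel
	m.started = true
	m.wg.Add(1)
	go func() {
		defer m.wg.Done()
		<-ctx.Done() // placeholder for a background loop
	}()
}

// Close returns immediately when the manager was never started or has
// already been closed; otherwise it stops the current run.
func (m *channelManager) Close() {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.started {
		return
	}
	m.stopLocked()
}

// stopLocked cancels the current context and waits for its goroutines.
// The caller must hold mu.
func (m *channelManager) stopLocked() {
	m.cancel()
	m.wg.Wait()
	m.started = false
}

// demoRestart reports whether Close returns within one second after
// Startup has been called twice (the second call simulating a rewatch).
func demoRestart() bool {
	m := &channelManager{}
	m.Close() // never started: returns without waiting
	m.Startup(context.Background())
	m.Startup(context.Background())

	done := make(chan struct{})
	go func() {
		m.Close()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(time.Second):
		return false
	}
}
```

Holding the mutex across the wait is acceptable in this sketch only because the background goroutine never takes the lock; a manager whose goroutines also lock `mu` would need to release it before waiting.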

@sre-ci-robot sre-ci-robot added the size/L Denotes a PR that changes 100-499 lines. label Nov 17, 2025
@mergify mergify bot added the dco-passed DCO check passed. label Nov 17, 2025
@mergify
Contributor

mergify bot commented Nov 17, 2025

@weiliu1031 Please associate the related pr of master to the body of your Pull Request. (eg. "pr: #")

@mergify mergify bot added do-not-merge/missing-related-pr kind/bug Issues or changes related a bug labels Nov 17, 2025
@sre-ci-robot
Contributor

[ci-v2-notice]
Notice: We are gradually rolling out the new ci-v2 system.

  • Legacy CI jobs remain unaffected, you can just ignore ci-v2 if you don't want to run it.
  • Additional "ci-v2/*" checkers will run for this PR to ensure the new ci-v2 system is working as expected.
  • For tests that exist in both v1 and v2, passing in either system is considered PASS.

To rerun ci-v2 checks, comment with:

  • /ci-rerun-code-check // for ci-v2/code-check
  • /ci-rerun-build // for ci-v2/build
  • /ci-rerun-ut-integration // for ci-v2/ut-integration
  • /ci-rerun-ut-go // for ci-v2/ut-go
  • /ci-rerun-ut-cpp // for ci-v2/ut-cpp
  • /ci-rerun-ut // for all ci-v2/ut-integration, ci-v2/ut-go, ci-v2/ut-cpp
  • /ci-rerun-e2e-arm // for ci-v2/e2e-arm

If you have any questions or requests, please contact @zhikunyao.

@sre-ci-robot sre-ci-robot added do-not-merge/need-merge-master-first any pr merge to release branch need to merge master first do-not-merge/need-milestone generate by v2-label-manager labels Nov 17, 2025
@sre-ci-robot
Contributor

[INFO] PR Label Summary by Default
[WARNING] No dependent PR reference found

  • Target branch '2.5' requires a PR merged to master first
  • Please add reference in format 'pr: #number'

[WARNING] Milestone not set

You can set milestone by commenting:
/set-milestone
Example:
/set-milestone 2.5.0

Use /refresh-label to update related check and label manually

@mergify
Contributor

mergify bot commented Nov 17, 2025

@weiliu1031 Please associate the related issue to the body of your Pull Request. (eg. "issue: #")

@weiliu1031
Contributor Author

/kind branch-feature

@weiliu1031
Contributor Author

/set-milestone 2.5.23

@sre-ci-robot sre-ci-robot added this to the 2.5.23 milestone Nov 17, 2025
@sre-ci-robot sre-ci-robot removed the do-not-merge/need-milestone generate by v2-label-manager label Nov 17, 2025
@sre-ci-robot
Contributor

[INFO] Set milestone to: 2.5.23

@weiliu1031
Contributor Author

/refresh-label

@sre-ci-robot sre-ci-robot removed the do-not-merge/need-merge-master-first any pr merge to release branch need to merge master first label Nov 17, 2025
@sre-ci-robot
Contributor

[INFO] PR Label Summary by Refresh-Label

  • Title: fix: Prevent Close from hanging on etcd reconnection
  • Target: 2.5
  • Labels: kind/bug, size/L, dco-passed, kind/branch-feature, do-not-merge/need-merge-master-first

[INFO] Dependent PR check skipped - branch feature PR (kind/branch-feature)

Use /refresh-label to update related check and label manually

@weiliu1031
Contributor Author

/ci-rerun-ut-go

@codecov

codecov bot commented Nov 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.05%. Comparing base (3a7a08f) to head (0822d4e).
⚠️ Report is 59 commits behind head on 2.5.

Additional details and impacted files

@@            Coverage Diff             @@
##              2.5   #45622      +/-   ##
==========================================
- Coverage   82.10%   82.05%   -0.05%     
==========================================
  Files        1128     1587     +459     
  Lines      179181   248710   +69529     
==========================================
+ Hits       147110   204087   +56977     
- Misses      26099    38618   +12519     
- Partials     5972     6005      +33     
Components   Coverage Δ
Client       78.90% <22.22%> (-0.06%) ⬇️
Core         84.56% <79.54%> (∅)
Go           82.38% <79.16%> (+<0.01%) ⬆️

Files with missing lines                 Coverage Δ
internal/datacoord/channel_manager.go    89.59% <100.00%> (+0.58%) ⬆️
internal/datacoord/server.go             74.16% <ø> (+0.25%) ⬆️

... and 509 files with indirect coverage changes

@mergify mergify bot added the ci-passed label Nov 17, 2025
When etcd reconnects, the DataCoord rewatches DataNodes and calls
ChannelManager.Startup again without closing the previous instance.
This causes multiple contexts and goroutines to accumulate, leading
to Close hanging indefinitely waiting for untracked goroutines.

Root cause:
- Etcd reconnection triggers rewatch flow and calls Startup again
- Startup was not idempotent, allowing repeated calls
- Multiple context cancellations and goroutines accumulated
- Close would wait indefinitely for untracked goroutines

Changes:
- Add started field to ChannelManagerImpl
- Refactor Startup to check and handle restart scenario
- Add state check in Close to prevent hanging

Signed-off-by: Wei Liu <[email protected]>
Signed-off-by: Wei Liu <[email protected]>
@weiliu1031 weiliu1031 force-pushed the fix_datacoord_close_stuck branch from 69c7431 to 0822d4e Compare November 18, 2025 06:57
@mergify mergify bot removed the ci-passed label Nov 18, 2025
@sre-ci-robot
Contributor

[INFO] PR Label Summary by Default
[INFO] Dependent PR check skipped - branch feature PR (kind/branch-feature)

Use /refresh-label to update related check and label manually

@mergify mergify bot added the ci-passed label Nov 18, 2025
Contributor

@congqixia congqixia left a comment

/lgtm

@sre-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: congqixia, weiliu1031

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sre-ci-robot
Contributor

[INFO] PR Label Summary by Default
[INFO] Dependent PR check skipped - branch feature PR (kind/branch-feature)

Use /refresh-label to update related check and label manually

@sre-ci-robot sre-ci-robot merged commit 2232dfc into milvus-io:2.5 Nov 19, 2025
17 of 19 checks passed

Labels

approved ci-passed dco-passed DCO check passed. kind/branch-feature kind/bug Issues or changes related a bug lgtm size/L Denotes a PR that changes 100-499 lines.

3 participants