[CELEBORN-2274] Fix replicate channels not resumed when transitioning from PUSH_AND_REPLICATE_PAUSED to PUSH_PAUSED by sl3635 · Pull Request #3616 · apache/celeborn

sl3635 · 2026-03-05T00:23:02Z

What changes were proposed in this pull request?

Fix a bug in MemoryManager.switchServingState() where replicate channels permanently lose autoRead=true after a memory pressure event.

When the serving state transitions from PUSH_AND_REPLICATE_PAUSED to PUSH_PAUSED, resumeReplicate() was only called inside the !tryResumeByPinnedMemory() guard. If tryResumeByPinnedMemory() returned true, the entire block was skipped and replicate channels were never resumed.

The fix moves resumeReplicate() outside the tryResumeByPinnedMemory() guard so it is always called when stepping down from PUSH_AND_REPLICATE_PAUSED to PUSH_PAUSED. This is a state machine invariant: PUSH_PAUSED means only push is paused; replicate must always be resumed.
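The difference between the two guard shapes can be sketched as a small truth table. This is a toy model, not the Celeborn source: the class name, the method names, and the boolean `resumedByPinnedMemory` (standing in for the return value of tryResumeByPinnedMemory()) are assumptions made for illustration, and only the replicate decision on entering PUSH_PAUSED is modelled.

```java
/** Toy model only; these names are hypothetical and not taken from the Celeborn source. */
class ReplicateResumeGuardSketch {
  enum ServingState { NONE_PAUSED, PUSH_PAUSED, PUSH_AND_REPLICATE_PAUSED }

  /** Before the fix: the replicate resume sits inside the pinned-memory guard. */
  static boolean resumesReplicateBeforeFix(ServingState last, boolean resumedByPinnedMemory) {
    if (!resumedByPinnedMemory) {
      if (last == ServingState.PUSH_AND_REPLICATE_PAUSED) {
        return true; // resumeReplicate() was reachable only on this path
      }
    }
    return false; // guard returned true, so the whole block was skipped
  }

  /** After the fix: the step-down alone decides; the guard result is ignored. */
  static boolean resumesReplicateAfterFix(ServingState last, boolean resumedByPinnedMemory) {
    return last == ServingState.PUSH_AND_REPLICATE_PAUSED;
  }

  public static void main(String[] args) {
    ServingState last = ServingState.PUSH_AND_REPLICATE_PAUSED;
    for (boolean pinned : new boolean[] {false, true}) {
      System.out.println("resumedByPinnedMemory=" + pinned
          + " before=" + resumesReplicateBeforeFix(last, pinned)
          + " after=" + resumesReplicateAfterFix(last, pinned));
    }
  }
}
```

In this model the two versions disagree only when the previous state is PUSH_AND_REPLICATE_PAUSED and the pinned-memory guard returns true, which is the case described above.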

Why are the changes needed?

Once replicate channels are stuck with autoRead=false, Netty I/O threads stop reading from all replicate connections. Remote workers writing to the affected worker see their TCP send buffers fill up (zero window), causing pending writes to accumulate in ChannelOutboundBuffer. Each pending write holds a reference to a direct memory ByteBuf, causing direct memory to grow indefinitely on the remote workers.
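To illustrate why the backlog on the remote side has no upper bound, here is a minimal simulation under simplifying assumptions. It does not use Netty; `OutboundBacklogSketch`, `pendingBytesAfter`, and the buffer sizes are hypothetical, and the socket buffer is modelled as a single byte budget that a reading peer empties completely between writes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of a sender whose peer may have stopped reading; not Netty code. */
class OutboundBacklogSketch {
  /**
   * Returns the bytes still queued on the sender after the given number of writes.
   * peerAutoRead=false models a peer that never drains the socket buffer.
   */
  static long pendingBytesAfter(
      int writes, int messageBytes, int socketBufferBytes, boolean peerAutoRead) {
    long inSocketBuffer = 0;
    long pendingBytes = 0;
    Deque<Integer> pending = new ArrayDeque<>();
    for (int i = 0; i < writes; i++) {
      if (peerAutoRead) {
        inSocketBuffer = 0; // a reading peer empties the socket buffer between writes
      }
      pending.addLast(messageBytes);
      pendingBytes += messageBytes;
      // Flush queued messages for as long as the socket buffer has room.
      while (!pending.isEmpty() && inSocketBuffer + pending.peekFirst() <= socketBufferBytes) {
        int flushed = pending.pollFirst();
        inSocketBuffer += flushed;
        pendingBytes -= flushed;
      }
    }
    return pendingBytes;
  }

  public static void main(String[] args) {
    System.out.println("peer reading:     " + pendingBytesAfter(1000, 1024, 64 * 1024, true));
    System.out.println("peer not reading: " + pendingBytesAfter(1000, 1024, 64 * 1024, false));
  }
}
```

With a reading peer the queue stays empty. With a peer that has stopped reading, the queue grows by one message per write once the socket buffer is full, which mirrors the growth in pending writes described above.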

The failure sequence:

1. Worker hits memory pressure → state = PUSH_AND_REPLICATE_PAUSED → all channels paused
2. Pinned memory is low → tryResumeByPinnedMemory() returns true → resumeByPinnedMemory(PUSH_PAUSED) resumes push only, replicate not resumed
3. Memory drops to push-only range → state = PUSH_PAUSED, but resumeReplicate() is never called
4. Replicate channels permanently stuck with autoRead=false, causing unbounded direct memory growth on remote workers

Does this PR resolve a correctness bug?

Yes.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a new unit test, "Test MemoryManager resume replicate by pinned memory", in MemoryManagerSuite that reproduces the exact failure scenario:

1. Enter PUSH_AND_REPLICATE_PAUSED with low pinned memory (channels resumed by pinned memory path)
2. Raise pinned memory so both push and replicate get paused
3. Drop memory to PUSH_PAUSED range with low pinned memory
4. Assert replicate listener is resumed; this assertion fails without the fix
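The test steps above can be replayed against a toy state machine. This is a sketch, not the actual MemoryManagerSuite test: `ToyManager`, `update`, and the `pinnedLow` flag (standing in for tryResumeByPinnedMemory() returning true) are assumptions, and the model tracks only whether each module is paused.

```java
/** Toy replay of the test scenario; all names here are hypothetical. */
class PinnedMemoryScenarioSketch {
  enum ServingState { NONE_PAUSED, PUSH_PAUSED, PUSH_AND_REPLICATE_PAUSED }

  /** Minimal stand-in for the memory manager, with and without the fix. */
  static final class ToyManager {
    private final boolean fixed;
    private ServingState state = ServingState.NONE_PAUSED;
    boolean pushPaused = false;
    boolean replicatePaused = false;

    ToyManager(boolean fixed) {
      this.fixed = fixed;
    }

    /** Re-evaluates the serving state; pinnedLow models the pinned-memory guard being true. */
    void update(ServingState next, boolean pinnedLow) {
      boolean steppingDown =
          state == ServingState.PUSH_AND_REPLICATE_PAUSED && next == ServingState.PUSH_PAUSED;
      state = next;
      switch (next) {
        case PUSH_AND_REPLICATE_PAUSED:
          // Low pinned memory resumes both modules; otherwise both are paused.
          pushPaused = !pinnedLow;
          replicatePaused = !pinnedLow;
          break;
        case PUSH_PAUSED:
          if (fixed && steppingDown) {
            replicatePaused = false; // fixed: resume replicate regardless of the guard
          }
          if (pinnedLow) {
            pushPaused = false; // pinned-memory path resumes push only
          } else {
            if (!fixed && steppingDown) {
              replicatePaused = false; // pre-fix: reachable only when the guard is false
            }
            pushPaused = true;
          }
          break;
        default:
          pushPaused = false;
          replicatePaused = false;
      }
    }
  }

  /** Runs steps 1 to 3 and reports whether replicate is still paused afterwards. */
  static boolean replicatePausedAfterScenario(boolean fixed) {
    ToyManager m = new ToyManager(fixed);
    m.update(ServingState.PUSH_AND_REPLICATE_PAUSED, true);  // step 1: low pinned memory
    m.update(ServingState.PUSH_AND_REPLICATE_PAUSED, false); // step 2: pinned memory raised
    m.update(ServingState.PUSH_PAUSED, true);                // step 3: step down, low pinned
    return m.replicatePaused;
  }

  public static void main(String[] args) {
    System.out.println("without fix, replicate paused: " + replicatePausedAfterScenario(false));
    System.out.println("with fix, replicate paused:    " + replicatePausedAfterScenario(true));
  }
}
```

In this model the pre-fix variant leaves replicate paused after step 3, while the fixed variant resumes it, matching the assertion in step 4.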

SteNicholas

Thanks for the fix. LGTM.

[CELEBORN-2274] Fix replicate channels not resumed when transitioning…

e4ccaba

… from PUSH_AND_REPLICATE_PAUSED to PUSH_PAUSED

github-actions bot added the module:worker label Mar 5, 2026

SteNicholas assigned SteNicholas and RexXiong Mar 5, 2026

SteNicholas approved these changes Mar 10, 2026

sl3635 wants to merge 1 commit into apache:main from sl3635:CELEBORN-2274

3 participants