
@vqianxiao

During the service provider's release period, concurrent route reads from consumers were rejected #15881

What is the purpose of the change?

Changing invokerRefreshLock from a ReentrantLock to a ReentrantReadWriteLock avoids the concurrency issue, and using invokerRefreshReadLock on the read path avoids lock blocking during highly concurrent reads.
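
A minimal sketch of the intended locking pattern (the lock field names follow the PR description, but the class, method, and list names here are simplified placeholders, not the actual AbstractDirectory code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified sketch of the locking pattern, not the actual AbstractDirectory source.
abstract class DirectoryLockSketch<T> {

    // Before: private final ReentrantLock invokerRefreshLock = new ReentrantLock();
    // After: a read-write lock, so concurrent route reads no longer block one another
    // while refresh operations still get exclusive access.
    private final ReentrantReadWriteLock invokerRefreshLock = new ReentrantReadWriteLock();
    private final ReentrantReadWriteLock.ReadLock invokerRefreshReadLock = invokerRefreshLock.readLock();
    private final ReentrantReadWriteLock.WriteLock invokerRefreshWriteLock = invokerRefreshLock.writeLock();

    private volatile List<T> validInvokers = Collections.emptyList();

    // Read path (analogous to list()): many consumer threads may hold the read lock at once.
    public List<T> listInvokers() {
        invokerRefreshReadLock.lock();
        try {
            return validInvokers;
        } finally {
            invokerRefreshReadLock.unlock();
        }
    }

    // Write path (analogous to refreshInvoker/addInvokers/removeInvokers): exclusive access,
    // held only while the invoker list is being replaced.
    public void refreshInvokers(List<T> newInvokers) {
        invokerRefreshWriteLock.lock();
        try {
            validInvokers = Collections.unmodifiableList(new ArrayList<>(newInvokers));
        } finally {
            invokerRefreshWriteLock.unlock();
        }
    }
}
```

Any number of threads can hold the read lock in listInvokers() at the same time, while refreshInvokers() waits until all in-flight readers have released it and then blocks new readers only for the duration of the update.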

Checklist

  • Make sure there is a GitHub issue filed for the change.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Write the necessary unit tests to verify your logic correction. If a new feature or significant change is committed, please remember to add a sample in the dubbo-samples project.
  • Make sure GitHub Actions can pass. (Why is the workflow failing and how to fix it?)

wangwei added 2 commits December 19, 2025 16:08
…rantReadWriteLock avoids concurrency issues, and using invokerRefreshReadLock avoids lock blocking during high concurrency reads apache#15881
@codecov-commenter

codecov-commenter commented Dec 19, 2025

Codecov Report

❌ Patch coverage is 77.77778% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.75%. Comparing base (a3a35b5) to head (ef3eef3).

Files with missing lines | Patch % | Lines
...dubbo/rpc/cluster/directory/AbstractDirectory.java | 77.77% | 7 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##                3.3   #15883      +/-   ##
============================================
+ Coverage     60.74%   60.75%   +0.01%     
  Complexity    11702    11702              
============================================
  Files          1938     1938              
  Lines         88694    88710      +16     
  Branches      13387    13389       +2     
============================================
+ Hits          53879    53900      +21     
+ Misses        29291    29278      -13     
- Partials       5524     5532       +8     
Flag Coverage Δ
integration-tests-java21 32.37% <58.33%> (-0.01%) ⬇️
integration-tests-java8 32.45% <58.33%> (-0.09%) ⬇️
samples-tests-java21 32.06% <44.44%> (+0.04%) ⬆️
samples-tests-java8 29.75% <44.44%> (+0.06%) ⬆️
unit-tests-java11 59.06% <63.88%> (+0.01%) ⬆️
unit-tests-java17 58.53% <63.88%> (-0.01%) ⬇️
unit-tests-java21 58.52% <63.88%> (-0.01%) ⬇️
unit-tests-java25 58.49% <63.88%> (+0.01%) ⬆️
unit-tests-java8 59.05% <63.88%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.



Copilot AI left a comment

Pull request overview

This pull request addresses a concurrency issue during service provider release periods by upgrading the locking mechanism from ReentrantLock to ReentrantReadWriteLock. This change allows multiple consumer threads to concurrently read routes without blocking each other, while still maintaining exclusive access for write operations.

Key Changes:

  • Replaced invokerRefreshLock (ReentrantLock) with a ReentrantReadWriteLock and extracted separate read and write lock references
  • Modified the list() method to use the read lock for concurrent access to invoker lists
  • Updated all write operations (add/remove invokers, refresh, etc.) to use the write lock



Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.



EarthChen
EarthChen previously approved these changes Dec 23, 2025
Member

@EarthChen EarthChen left a comment

LGTM

@RainYuY RainYuY added the "type/discussion" label (Everything related with code discussion or question) on Dec 23, 2025
Member

@RainYuY RainYuY left a comment

I think if we accept this PR, it will lead to the invokers not being refreshed when routing occurs. If the QPS is high, this may cause issues such as dead nodes remaining valid for an extended period. Additionally, I don’t understand why #10925 added this restriction. We need to discuss this further. @AlbumenJ

@vqianxiao
Author

vqianxiao commented Dec 24, 2025

Hi @RainYuY
Thank you for your comment, but this is not an improvement, it is a fix. In our production environment we found that after the service provider was released, Dubbo consumers could not call the provider's service normally; as you can see in #15881, the provider ended up being called 0 times. I overrode the AbstractDirectory class in the jar with my modification and redeployed it to the consumers, and after the provider released again the consumers could consume normally. Our consumers call the provider at roughly 100,000 QPS overall and about 1,000 QPS per machine, which I think already counts as a high-QPS workload.

@RainYuY
Member

RainYuY commented Dec 24, 2025

> (quoting @vqianxiao's reply above)

So you haven’t encountered the situation where invokers are refreshed late? From my understanding of your code, if a request is being routed, the invoker list cannot be refreshed. As a result, the refresh process will be blocked until routing is completed. However, if routing is ongoing continuously (e.g., a read lock is held persistently), the write lock will take much longer to be acquired.
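
To make this concern concrete, here is a small self-contained toy (not Dubbo code; every name in it is made up for illustration) that shows a refresh-style writer having to wait until the in-flight readers release the read lock, so sustained routing traffic stretches refresh latency:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy illustration of the concern discussed above; not Dubbo code.
public class WriterDelayDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        // Simulated consumer threads: each repeatedly holds the read lock for ~5 ms,
        // so at almost any instant some reader is inside the lock.
        for (int i = 0; i < 4; i++) {
            Thread reader = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    lock.readLock().lock();
                    try {
                        Thread.sleep(5); // pretend to do routing work
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        lock.readLock().unlock();
                    }
                }
            });
            reader.setDaemon(true);
            reader.start();
        }

        Thread.sleep(100); // let the readers saturate the lock

        // Simulated refresh: measure how long the write lock takes to acquire.
        long start = System.nanoTime();
        lock.writeLock().lock();
        try {
            System.out.printf("write lock acquired after %.1f ms%n",
                    (System.nanoTime() - start) / 1_000_000.0);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

As far as I understand, the JDK's non-fair ReentrantReadWriteLock also applies a heuristic that tends to block newly arriving readers once a writer is queued, so a refresh is normally delayed only by the readers already holding the lock rather than starved forever; still, under sustained high-QPS routing that delay is exactly the late-refresh window described above.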

@EarthChen
Member


What @RainYuY is concerned about is that, under highly concurrent reads, the read-write lock may keep the refresh from acquiring the write lock, so the invoker list cannot be updated successfully. In that case you will keep retrieving an outdated invoker list.

@vqianxiao
Author

> (quoting @RainYuY's question above)

You are right. I found that during the release period of a service provider, the overall Dubbo call time increased because of lock blocking until the provider finished its release. So I wonder whether there is a better way to solve this problem, but currently all I can think of is to take the lock first so that calls can still be made normally instead of failing outright.

@EarthChen
Member

> (quoting the exchange between @RainYuY and @vqianxiao above)

I think a solution more oriented toward AP (availability over consistency) would be to remove the validation between the new and old invoker lists to ensure availability. However, that validation was added in a separate PR submitted by another PMC member, so we need to confirm the intention behind that change.

@RainYuY
Member

RainYuY commented Dec 24, 2025

> (quoting the exchange between @RainYuY, @vqianxiao, and @EarthChen above)

I don’t have a better solution yet and I’m still thinking about it. But I’m wondering why this restriction exists, so I’m waiting for Kevin to give me an answer LOL. If I don’t get a reply, I’ll call him this Friday ^v^. @AlbumenJ


Labels

type/discussion Everything related with code discussion or question


4 participants