Skip to content

Net, LLMQ: Fix several sources of frequent lock contention#1799

Merged
reubenyap merged 5 commits intomasterfrom
cursor/block-header-validation-issue-667d
Apr 14, 2026
Merged

Net, LLMQ: Fix several sources of frequent lock contention#1799
reubenyap merged 5 commits intomasterfrom
cursor/block-header-validation-issue-667d

Conversation

@reubenyap
Copy link
Copy Markdown
Member

@reubenyap reubenyap commented Mar 19, 2026

PR intention

Rebases and refines the changes from #1688 onto the latest master. Addresses real lock contention issues in LLMQ signing share processing and CConnman::ThreadSocketHandler, while preserving prompt sig-share failover when peers are banned outside CSigSharesManager. Supersedes #1688.

Issues fixed:

  1. RemoveBannedNodeStates() causes frequent cs_main contention: This function acquired LOCK2(cs_main, cs) and was being called every ~100ms (every loop iteration of WorkThreadMain). Since its periodic work is only needed for local sig-share state cleanup, acquiring cs_main that often creates unnecessary contention with validation, block processing, and other cs_main consumers. Fix: throttle the periodic cleanup to once every 30 seconds, and move the correctness-sensitive cleanup for banned peers into a direct notification path.

    The direct notification path matters because peers can also be banned elsewhere in net processing. If the sig-share manager only noticed bans via the periodic IsBanned() scan, throttling that scan could delay releasing requested sig-shares until timeouts expired. Fix: add MarkNodeBanned() and call it both from CSigSharesManager::BanNode() and from the net-processing ban path when a peer is actually being disconnected, so banned peers immediately release requested sig-shares, drop cached sig-share sessions, and are ignored for future sig-share traffic.

    Additional guards now re-check the local banned flag after taking cs in the sig-share message handlers, so a concurrent MarkNodeBanned() cannot race with session creation or pending-share enqueueing. Pending verification work is also discarded for already-banned peers.

  2. Unnecessary full vNodes copy in ThreadSocketHandler: Every ~50ms, the entire vNodes vector was copied (std::vector<CNode*> vNodesCopy = vNodes) even though typically 0 nodes need disconnecting per iteration. Fix: iterate vNodes directly and collect only disconnected nodes (usually none), avoiding an O(n) heap allocation on every loop.

  3. 100ms idle sleep too frequent: The idle sleep in WorkThreadMain contributes to the overall contention frequency. Fix: increase to 250ms — using 250ms rather than the originally proposed 1000ms in Fix Several Sources of Frequent Lock Contention #1688, since there is no wakeup mechanism for new signing work and longer delays could impact ChainLock/InstantSend latency (per @aleflm's review feedback on Fix Several Sources of Frequent Lock Contention #1688).

Code changes brief

  • src/llmq/quorums_signing_shares.cpp/h: Added MarkNodeBanned() to immediately clear requested sig-shares, drop cached sig-share sessions/pending work, and ignore further sig-share messages from banned peers. MarkNodeBanned() only updates existing sig-share state, so unrelated peers do not get a synthetic banned entry. RemoveBannedNodeStates() now only reclaims already-banned peer state once the peer is gone or disconnecting, so the 30-second throttle does not delay failover correctness. Sig-share handlers now re-check the banned state after taking cs, and pending verification work is discarded for banned peers. Idle sleep remains 250ms.
  • src/net_processing.cpp: Notify CSigSharesManager when net processing marks a peer bannable and the peer is actually being disconnected, keeping LLMQ sig-share cleanup in sync with bans originating outside the sig-share manager without affecting whitelisted/addnode peers.
  • src/net.cpp: Refactored ThreadSocketHandler disconnect handling to avoid copying the full vNodes vector. Instead, iterates directly and collects only disconnected nodes before processing them. Same locking scope and behavior preserved.

Closes #1688

Open in Web Open in Cursor 

cursoragent and others added 2 commits March 19, 2026 18:11
In ThreadSocketHandler, the full vNodes vector was being copied every
iteration (~50ms) just to safely iterate while modifying. Since
typically 0 nodes are disconnected per iteration, this copies hundreds
of node pointers unnecessarily.

Instead, iterate vNodes directly and collect only the disconnected
nodes (usually none), then process them. This avoids an O(n) heap
allocation on every loop while maintaining the same locking scope
and behavior.

refs #1688

Co-authored-by: Reuben Yap <reuben@firo.org>
RemoveBannedNodeStates() acquires LOCK2(cs_main, cs) every loop
iteration of WorkThreadMain, which runs every ~100ms when idle. This
causes cs_main to be acquired ~10 times per second purely for local
memory cleanup of banned node states - creating significant contention
with validation, block processing, and other cs_main consumers.

Throttle RemoveBannedNodeStates() to run at most once every 30 seconds,
which is more than sufficient for its housekeeping purpose.

Also increase the idle sleep from 100ms to 250ms to further reduce
unnecessary polling frequency. Using 250ms (rather than the originally
proposed 1000ms) since there is no wakeup mechanism for new signing
work, and longer delays could impact ChainLock/InstantSend latency.

refs #1688

Co-authored-by: Reuben Yap <reuben@firo.org>
@cursor
Copy link
Copy Markdown

cursor bot commented Mar 19, 2026

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
Learn more about Cursor Agents

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 86ee684d-9316-464c-9108-a2e9bcea4552

📥 Commits

Reviewing files that changed from the base of the PR and between b3f9206 and 3454d88.

📒 Files selected for processing (1)
  • src/llmq/quorums_signing_shares.cpp
✅ Files skipped from review due to trivial changes (1)
  • src/llmq/quorums_signing_shares.cpp

Walkthrough

Centralized banning for LLMQ sigshares via a new MarkNodeBanned method; net processing now notifies the sigshares manager on ban; banned-node cleanup is rate-limited and worker idle sleep increased; socket handler disconnection now collects only nodes marked for disconnect.

Changes

Cohort / File(s) Summary
Quorum Signature Shares: core logic
src/llmq/quorums_signing_shares.cpp, src/llmq/quorums_signing_shares.h
Added CSigSharesManager::MarkNodeBanned(NodeId) and declared it in the header. ProcessMessage early-returns for peers with banned=true. RemoveBannedNodeStates now deletes only locally stored banned states for disconnected peers and is rate-limited to once per 30s; worker idle sleep increased from 100ms to 250ms. BanNode now calls Misbehaving(...) then delegates to MarkNodeBanned.
Quorum Signature Shares: ban notification integration
src/net_processing.cpp
SendRejectsAndCheckIfBanned calls llmq::quorumSigSharesManager->MarkNodeBanned(pnode->GetId()) when state.fShouldBan and manager is non-null, prior to disconnect.
Socket handler disconnection refactor
src/net.cpp
CConnman::ThreadSocketHandler now collects nodes with fDisconnect into vDisconnectedNodes and iterates that list for teardown, avoiding copying the full vNodes list while preserving teardown steps.

Sequence Diagram(s)

sequenceDiagram
  participant Peer
  participant NetProcessing
  participant SigSharesMgr
  participant WorkerThread

  Peer->>NetProcessing: trigger misbehavior / reach ban threshold
  NetProcessing->>SigSharesMgr: MarkNodeBanned(nodeId) 					
  SigSharesMgr-->>NetProcessing: (return)
  NetProcessing->>Peer: set fDisconnect = true / continue ban flow
  WorkerThread->>SigSharesMgr: periodic RemoveBannedNodeStates (rate-limited)
  SigSharesMgr-->>WorkerThread: removed stale banned node states
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • levonpetrosyan93
  • navidR

Poem

🐰 I sniffed the tracks and marked the bad,
Cleared crumbs and queues, made tidy, not sad.
I hop a bit slower, rest a little more,
No extra copies, cleaner floor.
A carrot cheer for code that's glad. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: fixing lock contention issues in networking and LLMQ subsystems through multiple targeted optimizations.
Description check ✅ Passed The description includes all required template sections with comprehensive detail: PR intention explains the rebase and specific issues fixed, and code changes brief outlines architectural modifications across all touched files.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch cursor/block-header-validation-issue-667d
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot requested a review from levonpetrosyan93 March 19, 2026 18:19
@reubenyap reubenyap requested a review from navidR March 19, 2026 18:21
@reubenyap reubenyap closed this Mar 19, 2026
@reubenyap reubenyap reopened this Mar 19, 2026
@reubenyap reubenyap closed this Mar 19, 2026
Teach CSigSharesManager about peer bans directly so sig-share
request/session cleanup happens immediately instead of depending on
a periodic IsBanned() scan.

This preserves the lock-contention improvement from throttling
RemoveBannedNodeStates(), while avoiding multi-second failover delays
for peers that are banned elsewhere in net processing.

Mark banned peers locally, drop their cached sessions and pending work,
ignore further sig-share messages from them, and leave the periodic
cleanup as local state reclamation for peers that have disconnected.

refs #1688

Co-authored-by: Reuben Yap <reuben@firo.org>
@reubenyap reubenyap reopened this Mar 19, 2026
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment thread src/net_processing.cpp Outdated
Comment thread src/llmq/quorums_signing_shares.cpp Outdated
Restrict the net-processing LLMQ ban notification to peers that are
actually being disconnected, so whitelisted and addnode peers do not
lose sig-share functionality after unrelated misbehavior.

Also restore the old unknown-peer behavior in MarkNodeBanned() by
using find() instead of creating a new node state for peers that never
participated in sig-share exchange.

refs #1688

Co-authored-by: Reuben Yap <reuben@firo.org>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/llmq/quorums_signing_shares.cpp`:
- Around line 240-246: Handlers currently only check nodeStates[pfrom->id] for
banned status before taking cs, so a race where MarkNodeBanned() runs in-between
lets handlers like ProcessMessageSigSesAnn and ProcessMessageBatchedSigShares
recreate sessions or push into pendingIncomingSigShares; update those handlers
to reacquire cs and re-check nodeStates[pfrom->id].banned (and return/skip)
immediately after taking LOCK(cs) before mutating state (creating sessions or
enqueueing), and also update CollectPendingSigSharesToVerify to skip nodes whose
nodeStates[].banned is true so no verification work is performed for banned
peers.
- Around line 1356-1368: BanNode currently grabs LOCK(cs_main) before performing
local sig-share cleanup which causes nested contention; move the local cleanup
out of the cs_main critical section and run it before acquiring LOCK(cs_main).
Specifically, in CSigSharesManager::BanNode, first perform the local cleanup
operations for the node (clear any requested shares and cached session state for
nodeId using the manager's existing cleanup helpers), then acquire LOCK(cs_main)
and call Misbehaving(nodeId, 100), and finally call MarkNodeBanned(nodeId);
ensure the cleanup calls do not touch cs_main to avoid nested locking.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 33cbab35-1dab-41ae-9f75-ae28a13480ea

📥 Commits

Reviewing files that changed from the base of the PR and between 5168f4b and b3f9206.

📒 Files selected for processing (2)
  • src/llmq/quorums_signing_shares.cpp
  • src/net_processing.cpp

Comment thread src/llmq/quorums_signing_shares.cpp
Comment thread src/llmq/quorums_signing_shares.cpp
Re-check the local banned flag after taking cs in the sig-share
message handlers so a concurrent MarkNodeBanned() cannot race with
session creation or enqueueing pending shares.

Also skip and clear any pending verification work for banned peers.

refs #1688

Co-authored-by: Reuben Yap <reuben@firo.org>
@navidR
Copy link
Copy Markdown
Contributor

navidR commented Mar 20, 2026

Looks good to me.

Copy link
Copy Markdown
Member Author

@reubenyap reubenyap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI Code Review — PR #1799

Overall Assessment: Approve with minor suggestions

This PR addresses real lock contention issues in LLMQ signing share processing and CConnman::ThreadSocketHandler. The architecture is well thought out: separating immediate correctness-sensitive cleanup (MarkNodeBanned) from periodic garbage collection (RemoveBannedNodeStates) is a clean design that prevents race windows where banned peers could still inject sig-share work.

Key Observations

Lock ordering is consistent — verified that cs_main → CSigSharesManager::cs ordering is maintained across all call sites: BanNode(), MarkNodeBanned() (called from SendRejectsAndCheckIfBanned under cs_main), and CollectPendingSigSharesToVerify (releases cs before acquiring cs_main). No deadlock risk introduced.

Banned peer rejection is comprehensive — every message handler (ProcessMessageSigSesAnn, ProcessMessageSigSharesInv, ProcessMessageGetSigShares, ProcessMessageBatchedSigShares) plus the top-level ProcessMessage entry point and CollectPendingSigSharesToVerify all check the banned flag. This is thorough.

The split-lock pattern in RemoveBannedNodeStates — collecting banned IDs under cs, checking connectivity via ForEachNode (no lock overlap), then erasing under cs with re-validation — is correctly implemented with no TOCTOU issues since the second lock block re-checks it->second.banned.

Findings Summary

# Severity File Finding
1 🟡 suggestion quorums_signing_shares.cpp Ternary with implicit map insertion could be clearer
2 🟡 suggestion quorums_signing_shares.cpp Extract 30s throttle to named constant
3 🔵 question quorums_signing_shares.cpp 250ms idle sleep impact on IS latency
4 🔵 nitpick net.cpp O(n·k) erase-remove inside loop
5 🟢 praise quorums_signing_shares.cpp MarkNodeBanned design eliminates race window

Static analysis: flawfinder ran with no findings in changed lines. cppcheck/clang-tidy/semgrep unavailable in current environment.

Comment thread src/llmq/quorums_signing_shares.cpp
Comment thread src/llmq/quorums_signing_shares.cpp
Comment thread src/llmq/quorums_signing_shares.cpp
Comment thread src/llmq/quorums_signing_shares.cpp
Comment thread src/net.cpp
Comment thread src/net_processing.cpp
reubenyap pushed a commit that referenced this pull request Mar 22, 2026
After CollectPendingSigSharesToVerify copies sig shares into a local map
and releases cs, a peer could be banned by another thread (e.g.
ProcessMessage) before ProcessPendingSigSharesFromNode is called. The
already-copied shares would still be processed because MarkNodeBanned
cannot retract work copied out of the pending map.

Add an IsBanned() re-check under cs_main immediately before calling
ProcessPendingSigSharesFromNode(), skipping any node banned mid-flight.

Add regression test verifying IsBanned() returns true immediately after
Misbehaving(nodeId, 100), which is the prerequisite for the guard.

refs #1799

https://claude.ai/code/session_015tEmb5UhGTsMc9XJC6mHmL
reubenyap pushed a commit that referenced this pull request Mar 22, 2026
ProcessPendingSigShares() collects sig shares into a local map under cs,
then releases the lock for batch verification. If MarkNodeBanned() runs
on another thread between collection and the final processing loop, the
already-copied shares would still be applied because MarkNodeBanned()
cannot retract work already copied out.

Add a nodeState.banned re-check under cs immediately before calling
ProcessPendingSigSharesFromNode(), skipping any node banned mid-flight.

Add regression test verifying IsBanned() returns true immediately after
Misbehaving(nodeId, 100), confirming the guard prerequisite.

refs #1799

https://claude.ai/code/session_015tEmb5UhGTsMc9XJC6mHmL
@reubenyap reubenyap added the networking/p2p Peer-to-peer networking layer label Mar 24, 2026
@reubenyap reubenyap requested review from psolstice and removed request for navidR March 30, 2026 03:15
@reubenyap reubenyap merged commit 4943a33 into master Apr 14, 2026
49 of 51 checks passed
@reubenyap reubenyap deleted the cursor/block-header-validation-issue-667d branch April 14, 2026 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

networking/p2p Peer-to-peer networking layer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants