Net, LLMQ: Fix several sources of frequent lock contention#1799
Net, LLMQ: Fix several sources of frequent lock contention#1799
Conversation
In ThreadSocketHandler, the full vNodes vector was being copied every iteration (~50ms) just to safely iterate while modifying. Since typically 0 nodes are disconnected per iteration, this copies hundreds of node pointers unnecessarily. Instead, iterate vNodes directly and collect only the disconnected nodes (usually none), then process them. This avoids an O(n) heap allocation on every loop while maintaining the same locking scope and behavior. refs #1688 Co-authored-by: Reuben Yap <reuben@firo.org>
RemoveBannedNodeStates() acquires LOCK2(cs_main, cs) every loop iteration of WorkThreadMain, which runs every ~100ms when idle. This causes cs_main to be acquired ~10 times per second purely for local memory cleanup of banned node states - creating significant contention with validation, block processing, and other cs_main consumers. Throttle RemoveBannedNodeStates() to run at most once every 30 seconds, which is more than sufficient for its housekeeping purpose. Also increase the idle sleep from 100ms to 250ms to further reduce unnecessary polling frequency. Using 250ms (rather than the originally proposed 1000ms) since there is no wakeup mechanism for new signing work, and longer delays could impact ChainLock/InstantSend latency. refs #1688 Co-authored-by: Reuben Yap <reuben@firo.org>
|
Cursor Agent can help with this pull request. Just |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
WalkthroughCentralized banning for LLMQ sigshares via a new MarkNodeBanned method; net processing now notifies the sigshares manager on ban; banned-node cleanup is rate-limited and worker idle sleep increased; socket handler disconnection now collects only nodes marked for disconnect. Changes
Sequence Diagram(s)sequenceDiagram
participant Peer
participant NetProcessing
participant SigSharesMgr
participant WorkerThread
Peer->>NetProcessing: trigger misbehavior / reach ban threshold
NetProcessing->>SigSharesMgr: MarkNodeBanned(nodeId)
SigSharesMgr-->>NetProcessing: (return)
NetProcessing->>Peer: set fDisconnect = true / continue ban flow
WorkerThread->>SigSharesMgr: periodic RemoveBannedNodeStates (rate-limited)
SigSharesMgr-->>WorkerThread: removed stale banned node states
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
📝 Coding Plan
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Teach CSigSharesManager about peer bans directly so sig-share request/session cleanup happens immediately instead of depending on a periodic IsBanned() scan. This preserves the lock-contention improvement from throttling RemoveBannedNodeStates(), while avoiding multi-second failover delays for peers that are banned elsewhere in net processing. Mark banned peers locally, drop their cached sessions and pending work, ignore further sig-share messages from them, and leave the periodic cleanup as local state reclamation for peers that have disconnected. refs #1688 Co-authored-by: Reuben Yap <reuben@firo.org>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Restrict the net-processing LLMQ ban notification to peers that are actually being disconnected, so whitelisted and addnode peers do not lose sig-share functionality after unrelated misbehavior. Also restore the old unknown-peer behavior in MarkNodeBanned() by using find() instead of creating a new node state for peers that never participated in sig-share exchange. refs #1688 Co-authored-by: Reuben Yap <reuben@firo.org>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/llmq/quorums_signing_shares.cpp`:
- Around line 240-246: Handlers currently only check nodeStates[pfrom->id] for
banned status before taking cs, so a race where MarkNodeBanned() runs in-between
lets handlers like ProcessMessageSigSesAnn and ProcessMessageBatchedSigShares
recreate sessions or push into pendingIncomingSigShares; update those handlers
to reacquire cs and re-check nodeStates[pfrom->id].banned (and return/skip)
immediately after taking LOCK(cs) before mutating state (creating sessions or
enqueueing), and also update CollectPendingSigSharesToVerify to skip nodes whose
nodeStates[].banned is true so no verification work is performed for banned
peers.
- Around line 1356-1368: BanNode currently grabs LOCK(cs_main) before performing
local sig-share cleanup which causes nested contention; move the local cleanup
out of the cs_main critical section and run it before acquiring LOCK(cs_main).
Specifically, in CSigSharesManager::BanNode, first perform the local cleanup
operations for the node (clear any requested shares and cached session state for
nodeId using the manager's existing cleanup helpers), then acquire LOCK(cs_main)
and call Misbehaving(nodeId, 100), and finally call MarkNodeBanned(nodeId);
ensure the cleanup calls do not touch cs_main to avoid nested locking.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 33cbab35-1dab-41ae-9f75-ae28a13480ea
📒 Files selected for processing (2)
src/llmq/quorums_signing_shares.cppsrc/net_processing.cpp
Re-check the local banned flag after taking cs in the sig-share message handlers so a concurrent MarkNodeBanned() cannot race with session creation or enqueueing pending shares. Also skip and clear any pending verification work for banned peers. refs #1688 Co-authored-by: Reuben Yap <reuben@firo.org>
|
Looks good to me. |
reubenyap
left a comment
There was a problem hiding this comment.
AI Code Review — PR #1799
Overall Assessment: Approve with minor suggestions
This PR addresses real lock contention issues in LLMQ signing share processing and CConnman::ThreadSocketHandler. The architecture is well thought out: separating immediate correctness-sensitive cleanup (MarkNodeBanned) from periodic garbage collection (RemoveBannedNodeStates) is a clean design that prevents race windows where banned peers could still inject sig-share work.
Key Observations
Lock ordering is consistent — verified that cs_main → CSigSharesManager::cs ordering is maintained across all call sites: BanNode(), MarkNodeBanned() (called from SendRejectsAndCheckIfBanned under cs_main), and CollectPendingSigSharesToVerify (releases cs before acquiring cs_main). No deadlock risk introduced.
Banned peer rejection is comprehensive — every message handler (ProcessMessageSigSesAnn, ProcessMessageSigSharesInv, ProcessMessageGetSigShares, ProcessMessageBatchedSigShares) plus the top-level ProcessMessage entry point and CollectPendingSigSharesToVerify all check the banned flag. This is thorough.
The split-lock pattern in RemoveBannedNodeStates — collecting banned IDs under cs, checking connectivity via ForEachNode (no lock overlap), then erasing under cs with re-validation — is correctly implemented with no TOCTOU issues since the second lock block re-checks it->second.banned.
Findings Summary
| # | Severity | File | Finding |
|---|---|---|---|
| 1 | 🟡 suggestion | quorums_signing_shares.cpp | Ternary with implicit map insertion could be clearer |
| 2 | 🟡 suggestion | quorums_signing_shares.cpp | Extract 30s throttle to named constant |
| 3 | 🔵 question | quorums_signing_shares.cpp | 250ms idle sleep impact on IS latency |
| 4 | 🔵 nitpick | net.cpp | O(n·k) erase-remove inside loop |
| 5 | 🟢 praise | quorums_signing_shares.cpp | MarkNodeBanned design eliminates race window |
Static analysis: flawfinder ran with no findings in changed lines. cppcheck/clang-tidy/semgrep unavailable in current environment.
After CollectPendingSigSharesToVerify copies sig shares into a local map and releases cs, a peer could be banned by another thread (e.g. ProcessMessage) before ProcessPendingSigSharesFromNode is called. The already-copied shares would still be processed because MarkNodeBanned cannot retract work copied out of the pending map. Add an IsBanned() re-check under cs_main immediately before calling ProcessPendingSigSharesFromNode(), skipping any node banned mid-flight. Add regression test verifying IsBanned() returns true immediately after Misbehaving(nodeId, 100), which is the prerequisite for the guard. refs #1799 https://claude.ai/code/session_015tEmb5UhGTsMc9XJC6mHmL
ProcessPendingSigShares() collects sig shares into a local map under cs, then releases the lock for batch verification. If MarkNodeBanned() runs on another thread between collection and the final processing loop, the already-copied shares would still be applied because MarkNodeBanned() cannot retract work already copied out. Add a nodeState.banned re-check under cs immediately before calling ProcessPendingSigSharesFromNode(), skipping any node banned mid-flight. Add regression test verifying IsBanned() returns true immediately after Misbehaving(nodeId, 100), confirming the guard prerequisite. refs #1799 https://claude.ai/code/session_015tEmb5UhGTsMc9XJC6mHmL

PR intention
Rebases and refines the changes from #1688 onto the latest
master. Addresses real lock contention issues in LLMQ signing share processing andCConnman::ThreadSocketHandler, while preserving prompt sig-share failover when peers are banned outsideCSigSharesManager. Supersedes #1688.Issues fixed:
RemoveBannedNodeStates()causes frequentcs_maincontention: This function acquiredLOCK2(cs_main, cs)and was being called every ~100ms (every loop iteration ofWorkThreadMain). Since its periodic work is only needed for local sig-share state cleanup, acquiringcs_mainthat often creates unnecessary contention with validation, block processing, and othercs_mainconsumers. Fix: throttle the periodic cleanup to once every 30 seconds, and move the correctness-sensitive cleanup for banned peers into a direct notification path.The direct notification path matters because peers can also be banned elsewhere in net processing. If the sig-share manager only noticed bans via the periodic
IsBanned()scan, throttling that scan could delay releasing requested sig-shares until timeouts expired. Fix: addMarkNodeBanned()and call it both fromCSigSharesManager::BanNode()and from the net-processing ban path when a peer is actually being disconnected, so banned peers immediately release requested sig-shares, drop cached sig-share sessions, and are ignored for future sig-share traffic.Additional guards now re-check the local banned flag after taking
csin the sig-share message handlers, so a concurrentMarkNodeBanned()cannot race with session creation or pending-share enqueueing. Pending verification work is also discarded for already-banned peers.Unnecessary full
vNodescopy inThreadSocketHandler: Every ~50ms, the entirevNodesvector was copied (std::vector<CNode*> vNodesCopy = vNodes) even though typically 0 nodes need disconnecting per iteration. Fix: iteratevNodesdirectly and collect only disconnected nodes (usually none), avoiding an O(n) heap allocation on every loop.100ms idle sleep too frequent: The idle sleep in
WorkThreadMaincontributes to the overall contention frequency. Fix: increase to 250ms — using 250ms rather than the originally proposed 1000ms in Fix Several Sources of Frequent Lock Contention #1688, since there is no wakeup mechanism for new signing work and longer delays could impact ChainLock/InstantSend latency (per @aleflm's review feedback on Fix Several Sources of Frequent Lock Contention #1688).Code changes brief
src/llmq/quorums_signing_shares.cpp/h: AddedMarkNodeBanned()to immediately clear requested sig-shares, drop cached sig-share sessions/pending work, and ignore further sig-share messages from banned peers.MarkNodeBanned()only updates existing sig-share state, so unrelated peers do not get a synthetic banned entry.RemoveBannedNodeStates()now only reclaims already-banned peer state once the peer is gone or disconnecting, so the 30-second throttle does not delay failover correctness. Sig-share handlers now re-check the banned state after takingcs, and pending verification work is discarded for banned peers. Idle sleep remains 250ms.src/net_processing.cpp: NotifyCSigSharesManagerwhen net processing marks a peer bannable and the peer is actually being disconnected, keeping LLMQ sig-share cleanup in sync with bans originating outside the sig-share manager without affecting whitelisted/addnode peers.src/net.cpp: RefactoredThreadSocketHandlerdisconnect handling to avoid copying the fullvNodesvector. Instead, iterates directly and collects only disconnected nodes before processing them. Same locking scope and behavior preserved.Closes #1688