Net, LLMQ: Fix several sources of frequent lock contention by reubenyap · Pull Request #1799 · firoorg/firo

reubenyap · 2026-03-19T18:18:21Z

PR intention

Rebases and refines the changes from #1688 onto the latest master. Addresses real lock contention issues in LLMQ signing share processing and CConnman::ThreadSocketHandler, while preserving prompt sig-share failover when peers are banned outside CSigSharesManager. Supersedes #1688.

Issues fixed:

RemoveBannedNodeStates() causes frequent cs_main contention: This function acquired LOCK2(cs_main, cs) and was being called every ~100ms (every loop iteration of WorkThreadMain). Since its periodic work is only needed for local sig-share state cleanup, acquiring cs_main that often creates unnecessary contention with validation, block processing, and other cs_main consumers. Fix: throttle the periodic cleanup to once every 30 seconds, and move the correctness-sensitive cleanup for banned peers into a direct notification path.

The direct notification path matters because peers can also be banned elsewhere in net processing. If the sig-share manager only noticed bans via the periodic IsBanned() scan, throttling that scan could delay releasing requested sig-shares until timeouts expired. Fix: add MarkNodeBanned() and call it both from CSigSharesManager::BanNode() and from the net-processing ban path when a peer is actually being disconnected, so banned peers immediately release requested sig-shares, drop cached sig-share sessions, and are ignored for future sig-share traffic.

Additional guards now re-check the local banned flag after taking cs in the sig-share message handlers, so a concurrent MarkNodeBanned() cannot race with session creation or pending-share enqueueing. Pending verification work is also discarded for already-banned peers.
Unnecessary full vNodes copy in ThreadSocketHandler: Every ~50ms, the entire vNodes vector was copied (std::vector<CNode*> vNodesCopy = vNodes) even though typically 0 nodes need disconnecting per iteration. Fix: iterate vNodes directly and collect only disconnected nodes (usually none), avoiding an O(n) heap allocation on every loop.
100ms idle sleep too frequent: The idle sleep in WorkThreadMain contributes to the overall contention frequency. Fix: increase to 250ms — using 250ms rather than the originally proposed 1000ms in Fix Several Sources of Frequent Lock Contention #1688, since there is no wakeup mechanism for new signing work and longer delays could impact ChainLock/InstantSend latency (per @aleflm's review feedback on Fix Several Sources of Frequent Lock Contention #1688).

Code changes brief

src/llmq/quorums_signing_shares.cpp/h: Added MarkNodeBanned() to immediately clear requested sig-shares, drop cached sig-share sessions/pending work, and ignore further sig-share messages from banned peers. MarkNodeBanned() only updates existing sig-share state, so unrelated peers do not get a synthetic banned entry. RemoveBannedNodeStates() now only reclaims already-banned peer state once the peer is gone or disconnecting, so the 30-second throttle does not delay failover correctness. Sig-share handlers now re-check the banned state after taking cs, and pending verification work is discarded for banned peers. Idle sleep remains 250ms.
src/net_processing.cpp: Notify CSigSharesManager when net processing marks a peer bannable and the peer is actually being disconnected, keeping LLMQ sig-share cleanup in sync with bans originating outside the sig-share manager without affecting whitelisted/addnode peers.
src/net.cpp: Refactored ThreadSocketHandler disconnect handling to avoid copying the full vNodes vector. Instead, iterates directly and collects only disconnected nodes before processing them. Same locking scope and behavior preserved.

Closes #1688

In ThreadSocketHandler, the full vNodes vector was being copied every iteration (~50ms) just to safely iterate while modifying. Since typically 0 nodes are disconnected per iteration, this copies hundreds of node pointers unnecessarily. Instead, iterate vNodes directly and collect only the disconnected nodes (usually none), then process them. This avoids an O(n) heap allocation on every loop while maintaining the same locking scope and behavior. refs #1688 Co-authored-by: Reuben Yap <reuben@firo.org>

RemoveBannedNodeStates() acquires LOCK2(cs_main, cs) every loop iteration of WorkThreadMain, which runs every ~100ms when idle. This causes cs_main to be acquired ~10 times per second purely for local memory cleanup of banned node states - creating significant contention with validation, block processing, and other cs_main consumers. Throttle RemoveBannedNodeStates() to run at most once every 30 seconds, which is more than sufficient for its housekeeping purpose. Also increase the idle sleep from 100ms to 250ms to further reduce unnecessary polling frequency. Using 250ms (rather than the originally proposed 1000ms) since there is no wakeup mechanism for new signing work, and longer delays could impact ChainLock/InstantSend latency. refs #1688 Co-authored-by: Reuben Yap <reuben@firo.org>

cursor · 2026-03-19T18:18:22Z

Cursor Agent can help with this pull request. Just @cursor in comments and I'll start working on changes in this branch.
_{Learn more about Cursor Agents}

coderabbitai · 2026-03-19T18:18:47Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 86ee684d-9316-464c-9108-a2e9bcea4552

📥 Commits

Reviewing files that changed from the base of the PR and between b3f9206 and 3454d88.

📒 Files selected for processing (1)

src/llmq/quorums_signing_shares.cpp

✅ Files skipped from review due to trivial changes (1)

src/llmq/quorums_signing_shares.cpp

Walkthrough

Centralized banning for LLMQ sigshares via a new MarkNodeBanned method; net processing now notifies the sigshares manager on ban; banned-node cleanup is rate-limited and worker idle sleep increased; socket handler disconnection now collects only nodes marked for disconnect.

Changes

Cohort / File(s)	Summary
Quorum Signature Shares: core logic `src/llmq/quorums_signing_shares.cpp`, `src/llmq/quorums_signing_shares.h`	Added `CSigSharesManager::MarkNodeBanned(NodeId)` and declared it in the header. `ProcessMessage` early-returns for peers with `banned=true`. `RemoveBannedNodeStates` now deletes only locally stored banned states for disconnected peers and is rate-limited to once per 30s; worker idle sleep increased from 100ms to 250ms. `BanNode` now calls `Misbehaving(...)` then delegates to `MarkNodeBanned`.
Quorum Signature Shares: ban notification integration `src/net_processing.cpp`	`SendRejectsAndCheckIfBanned` calls `llmq::quorumSigSharesManager->MarkNodeBanned(pnode->GetId())` when `state.fShouldBan` and manager is non-null, prior to disconnect.
Socket handler disconnection refactor `src/net.cpp`	`CConnman::ThreadSocketHandler` now collects nodes with `fDisconnect` into `vDisconnectedNodes` and iterates that list for teardown, avoiding copying the full `vNodes` list while preserving teardown steps.

Sequence Diagram(s)

sequenceDiagram
  participant Peer
  participant NetProcessing
  participant SigSharesMgr
  participant WorkerThread

  Peer->>NetProcessing: trigger misbehavior / reach ban threshold
  NetProcessing->>SigSharesMgr: MarkNodeBanned(nodeId) 					
  SigSharesMgr-->>NetProcessing: (return)
  NetProcessing->>Peer: set fDisconnect = true / continue ban flow
  WorkerThread->>SigSharesMgr: periodic RemoveBannedNodeStates (rate-limited)
  SigSharesMgr-->>WorkerThread: removed stale banned node states

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

levonpetrosyan93
navidR

Poem

🐰 I sniffed the tracks and marked the bad,
Cleared crumbs and queues, made tidy, not sad.
I hop a bit slower, rest a little more,
No extra copies, cleaner floor.
A carrot cheer for code that's glad. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main changes: fixing lock contention issues in networking and LLMQ subsystems through multiple targeted optimizations.
Description check	✅ Passed	The description includes all required template sections with comprehensive detail: PR intention explains the rebase and specific issues fixed, and code changes brief outlines architectural modifications across all touched files.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch cursor/block-header-validation-issue-667d

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Teach CSigSharesManager about peer bans directly so sig-share request/session cleanup happens immediately instead of depending on a periodic IsBanned() scan. This preserves the lock-contention improvement from throttling RemoveBannedNodeStates(), while avoiding multi-second failover delays for peers that are banned elsewhere in net processing. Mark banned peers locally, drop their cached sessions and pending work, ignore further sig-share messages from them, and leave the periodic cleanup as local state reclamation for peers that have disconnected. refs #1688 Co-authored-by: Reuben Yap <reuben@firo.org>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

Restrict the net-processing LLMQ ban notification to peers that are actually being disconnected, so whitelisted and addnode peers do not lose sig-share functionality after unrelated misbehavior. Also restore the old unknown-peer behavior in MarkNodeBanned() by using find() instead of creating a new node state for peers that never participated in sig-share exchange. refs #1688 Co-authored-by: Reuben Yap <reuben@firo.org>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/llmq/quorums_signing_shares.cpp`:
- Around line 240-246: Handlers currently only check nodeStates[pfrom->id] for
banned status before taking cs, so a race where MarkNodeBanned() runs in-between
lets handlers like ProcessMessageSigSesAnn and ProcessMessageBatchedSigShares
recreate sessions or push into pendingIncomingSigShares; update those handlers
to reacquire cs and re-check nodeStates[pfrom->id].banned (and return/skip)
immediately after taking LOCK(cs) before mutating state (creating sessions or
enqueueing), and also update CollectPendingSigSharesToVerify to skip nodes whose
nodeStates[].banned is true so no verification work is performed for banned
peers.
- Around line 1356-1368: BanNode currently grabs LOCK(cs_main) before performing
local sig-share cleanup which causes nested contention; move the local cleanup
out of the cs_main critical section and run it before acquiring LOCK(cs_main).
Specifically, in CSigSharesManager::BanNode, first perform the local cleanup
operations for the node (clear any requested shares and cached session state for
nodeId using the manager's existing cleanup helpers), then acquire LOCK(cs_main)
and call Misbehaving(nodeId, 100), and finally call MarkNodeBanned(nodeId);
ensure the cleanup calls do not touch cs_main to avoid nested locking.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 33cbab35-1dab-41ae-9f75-ae28a13480ea

📥 Commits

Reviewing files that changed from the base of the PR and between 5168f4b and b3f9206.

📒 Files selected for processing (2)

src/llmq/quorums_signing_shares.cpp
src/net_processing.cpp

Re-check the local banned flag after taking cs in the sig-share message handlers so a concurrent MarkNodeBanned() cannot race with session creation or enqueueing pending shares. Also skip and clear any pending verification work for banned peers. refs #1688 Co-authored-by: Reuben Yap <reuben@firo.org>

navidR · 2026-03-20T14:18:29Z

Looks good to me.

reubenyap

AI Code Review — PR #1799

Overall Assessment: Approve with minor suggestions

This PR addresses real lock contention issues in LLMQ signing share processing and CConnman::ThreadSocketHandler. The architecture is well thought out: separating immediate correctness-sensitive cleanup (MarkNodeBanned) from periodic garbage collection (RemoveBannedNodeStates) is a clean design that prevents race windows where banned peers could still inject sig-share work.

Key Observations

Lock ordering is consistent — verified that cs_main → CSigSharesManager::cs ordering is maintained across all call sites: BanNode(), MarkNodeBanned() (called from SendRejectsAndCheckIfBanned under cs_main), and CollectPendingSigSharesToVerify (releases cs before acquiring cs_main). No deadlock risk introduced.

Banned peer rejection is comprehensive — every message handler (ProcessMessageSigSesAnn, ProcessMessageSigSharesInv, ProcessMessageGetSigShares, ProcessMessageBatchedSigShares) plus the top-level ProcessMessage entry point and CollectPendingSigSharesToVerify all check the banned flag. This is thorough.

The split-lock pattern in RemoveBannedNodeStates — collecting banned IDs under cs, checking connectivity via ForEachNode (no lock overlap), then erasing under cs with re-validation — is correctly implemented with no TOCTOU issues since the second lock block re-checks it->second.banned.

Findings Summary

#	Severity	File	Finding
1	🟡 suggestion	quorums_signing_shares.cpp	Ternary with implicit map insertion could be clearer
2	🟡 suggestion	quorums_signing_shares.cpp	Extract 30s throttle to named constant
3	🔵 question	quorums_signing_shares.cpp	250ms idle sleep impact on IS latency
4	🔵 nitpick	net.cpp	O(n·k) erase-remove inside loop
5	🟢 praise	quorums_signing_shares.cpp	MarkNodeBanned design eliminates race window

Static analysis: flawfinder ran with no findings in changed lines. cppcheck/clang-tidy/semgrep unavailable in current environment.

After CollectPendingSigSharesToVerify copies sig shares into a local map and releases cs, a peer could be banned by another thread (e.g. ProcessMessage) before ProcessPendingSigSharesFromNode is called. The already-copied shares would still be processed because MarkNodeBanned cannot retract work copied out of the pending map. Add an IsBanned() re-check under cs_main immediately before calling ProcessPendingSigSharesFromNode(), skipping any node banned mid-flight. Add regression test verifying IsBanned() returns true immediately after Misbehaving(nodeId, 100), which is the prerequisite for the guard. refs #1799 https://claude.ai/code/session_015tEmb5UhGTsMc9XJC6mHmL

ProcessPendingSigShares() collects sig shares into a local map under cs, then releases the lock for batch verification. If MarkNodeBanned() runs on another thread between collection and the final processing loop, the already-copied shares would still be applied because MarkNodeBanned() cannot retract work already copied out. Add a nodeState.banned re-check under cs immediately before calling ProcessPendingSigSharesFromNode(), skipping any node banned mid-flight. Add regression test verifying IsBanned() returns true immediately after Misbehaving(nodeId, 100), confirming the guard prerequisite. refs #1799 https://claude.ai/code/session_015tEmb5UhGTsMc9XJC6mHmL

cursoragent and others added 2 commits March 19, 2026 18:11

coderabbitai bot requested a review from levonpetrosyan93 March 19, 2026 18:19

reubenyap requested a review from navidR March 19, 2026 18:21

reubenyap closed this Mar 19, 2026

reubenyap reopened this Mar 19, 2026

reubenyap closed this Mar 19, 2026

reubenyap reopened this Mar 19, 2026

cursor bot reviewed Mar 19, 2026

View reviewed changes

Comment thread src/net_processing.cpp Outdated

Comment thread src/llmq/quorums_signing_shares.cpp Outdated

coderabbitai bot reviewed Mar 19, 2026

View reviewed changes

Comment thread src/llmq/quorums_signing_shares.cpp

Comment thread src/llmq/quorums_signing_shares.cpp

reubenyap commented Mar 22, 2026

View reviewed changes

Comment thread src/llmq/quorums_signing_shares.cpp

Comment thread src/llmq/quorums_signing_shares.cpp

Comment thread src/llmq/quorums_signing_shares.cpp

Comment thread src/llmq/quorums_signing_shares.cpp

Comment thread src/net.cpp

Comment thread src/net_processing.cpp

reubenyap added the networking/p2p Peer-to-peer networking layer label Mar 24, 2026

reubenyap requested review from psolstice and removed request for navidR March 30, 2026 03:15

reubenyap merged commit 4943a33 into master Apr 14, 2026
49 of 51 checks passed

reubenyap deleted the cursor/block-header-validation-issue-667d branch April 14, 2026 17:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Net, LLMQ: Fix several sources of frequent lock contention#1799

Net, LLMQ: Fix several sources of frequent lock contention#1799
reubenyap merged 5 commits intomasterfrom
cursor/block-header-validation-issue-667d

reubenyap commented Mar 19, 2026 •

edited by cursor bot

Loading

Uh oh!

cursor bot commented Mar 19, 2026

Uh oh!

coderabbitai bot commented Mar 19, 2026 •

edited

Loading

❌ Failed checks (1 warning)

Uh oh!

cursor bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

navidR commented Mar 20, 2026

Uh oh!

reubenyap left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

reubenyap commented Mar 19, 2026 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR intention

Issues fixed:

Code changes brief

Uh oh!

cursor bot commented Mar 19, 2026

Uh oh!

coderabbitai bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

cursor bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

navidR commented Mar 20, 2026

Uh oh!

reubenyap left a comment

Choose a reason for hiding this comment

AI Code Review — PR #1799

Key Observations

Findings Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

reubenyap commented Mar 19, 2026 •

edited by cursor bot

Loading

coderabbitai bot commented Mar 19, 2026 •

edited

Loading