
Fix race condition in LLM callback system#4218

Closed
devin-ai-integration[bot] wants to merge 1 commit into main from devin/1768079022-fix-llm-callback-race-condition

Conversation

@devin-ai-integration
Contributor

Summary

Fixes a race condition in the LLM callback system where multiple LLM instances calling set_callbacks concurrently could cause callbacks to be removed before they fire. This was identified in issue #4214, where a test required a sleep(5) workaround to pass.

The fix adds a class-level threading.RLock to synchronize access to global litellm callbacks during call() and acall() methods. The lock ensures that callback registration and the subsequent LLM call execution are atomic, preventing one instance from modifying callbacks while another is using them.

Changes:

  • Add _callback_lock: threading.RLock class attribute to LLM
  • Wrap callback setup and LLM call execution in both call() and acall() with the lock
  • Remove sleep(5) workaround from test_llm_callback_replacement
  • Add new test_llm_callback_lock_prevents_race_condition test for concurrent access
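In outline, the locking pattern looks like this (a simplified sketch: the real `call()` wraps `litellm.completion`, replaced here by a module-level stand-in so the example is self-contained):

```python
import threading

# Stand-in for litellm's module-level callback list (shared, global state).
_global_callbacks = []


class LLM:
    # One reentrant lock shared by every LLM instance. An RLock (rather than
    # a plain Lock) lets a recursive call -- e.g. retrying without an
    # unsupported 'stop' parameter -- re-acquire it without deadlocking.
    _callback_lock = threading.RLock()

    def __init__(self, callbacks):
        self.callbacks = callbacks

    def set_callbacks(self, callbacks):
        global _global_callbacks
        _global_callbacks = list(callbacks)  # mutates the shared global state

    def call(self, prompt):
        # Registration and the (simulated) completion are atomic under the
        # lock: no other instance can swap the global callbacks in between.
        with LLM._callback_lock:
            self.set_callbacks(self.callbacks)
            fired = list(_global_callbacks)  # the callbacks that would fire
            return prompt.upper(), fired
```

Because the lock is class-level, it serializes across all instances, which is exactly what protects the global litellm state (and also what the review below flags as a performance concern).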

Review & Testing Checklist for Human

  • Performance impact: The lock is held during the entire LLM call (including network I/O), which serializes all LLM calls across all instances. Verify this is acceptable for multi-agent scenarios where parallelism is expected.
  • Lock scope: Consider whether the lock scope is too broad - could it protect just the callback registration/cleanup rather than the entire call?
  • Real-world validation: Run a multi-agent crew with concurrent LLM calls to verify the fix works in production scenarios and doesn't introduce deadlocks or significant latency.

Recommended test plan:

  1. Run the existing test suite to ensure no regressions
  2. Create a simple multi-agent crew with 3+ agents making concurrent LLM calls
  3. Verify callbacks (like token tracking) work correctly without interference
  4. Monitor for any performance degradation compared to before the fix
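Step 3 can be smoke-tested without a real crew. A minimal sketch (`TokenTracker` and `fake_llm_call` are illustrative names, not crewAI APIs):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical token-tracking callback: if concurrent calls interfered
# with callback state, invocations (and therefore counts) would be lost.
class TokenTracker:
    def __init__(self):
        self._lock = threading.Lock()
        self.total = 0

    def on_call(self, n_tokens):
        with self._lock:
            self.total += n_tokens


def fake_llm_call(tracker, n_tokens):
    # Stand-in for one agent's LLM call firing its callback.
    tracker.on_call(n_tokens)
    return n_tokens


tracker = TokenTracker()
with ThreadPoolExecutor(max_workers=3) as pool:  # 3+ concurrent "agents"
    list(pool.map(lambda n: fake_llm_call(tracker, n), [10] * 30))

print(tracker.total)  # 300: no callback invocations lost
```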

Notes

This commit fixes a race condition in the LLM callback system where
multiple LLM instances calling set_callbacks concurrently could cause
callbacks to be removed before they fire.

Changes:
- Add class-level RLock (_callback_lock) to LLM class to synchronize
  access to global litellm callbacks
- Wrap callback registration and LLM call execution in the lock for
  both call() and acall() methods
- Use RLock (reentrant lock) to handle recursive calls without deadlock
  (e.g., when retrying with unsupported 'stop' parameter)
- Remove sleep(5) workaround from test_llm_callback_replacement test
- Add new test_llm_callback_lock_prevents_race_condition test to verify
  concurrent callback access is properly synchronized

Fixes #4214

Co-Authored-By: João <joao@crewai.com>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring


@VedantMadane left a comment


The race condition diagnosis is correct, and RLock is the right choice for the recursive retry case. One significant concern worth discussing:

Performance consideration:

The lock is held for the entire duration of the LLM call (including network I/O):

```python
with LLM._callback_lock:
    ...  # entire completion call happens here
```

This effectively serializes all LLM calls across all instances in the process. For multi-agent crews that rely on parallel LLM calls, this could be a significant performance regression.
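The serialization effect is easy to demonstrate with a minimal stand-in (names are illustrative; the sleep stands in for network I/O):

```python
import threading
import time

_callback_lock = threading.RLock()


def locked_call():
    # Simulates an LLM call whose entire duration, including "network"
    # time, is spent holding the shared class-level lock.
    with _callback_lock:
        time.sleep(0.2)


start = time.monotonic()
threads = [threading.Thread(target=locked_call) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# Three "concurrent" calls take ~0.6s instead of ~0.2s: fully serialized.
print(round(elapsed, 1))
```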

Alternative approaches to consider:

  1. Narrow the lock scope - only protect the callback registration/cleanup, not the actual LLM call:

     ```python
     with LLM._callback_lock:
         self.set_callbacks(callbacks)

     result = litellm.completion(**params)  # outside the lock

     with LLM._callback_lock:
         ...  # cleanup callbacks
     ```

  2. Per-instance callbacks instead of global litellm callbacks - would require changes to how callbacks are wired but eliminates the global state issue entirely.

  3. Copy-on-write pattern for the callback list.

The current fix is sound from a correctness standpoint, but the performance tradeoff should be explicitly accepted or an alternative considered. The PR description does flag this for review, which is good.

Would love to hear thoughts from maintainers on acceptable parallelism tradeoffs.

@devin-ai-integration
Contributor Author

Closing due to inactivity for more than 7 days.

@VedantMadane

/review

@VedantMadane

Opened follow-up PR #4252 with the same intent as this PR but without a global lock held across the network call (keeps concurrency): #4252



Development

Successfully merging this pull request may close these issues.

[BUG] Hidden race condition in LLM callback system causing test failures
