
Fix race condition in LLM callback system#4218

Closed
devin-ai-integration[bot] wants to merge 1 commit into main from devin/1768079022-fix-llm-callback-race-condition

Conversation

@devin-ai-integration
Contributor

Summary

Fixes a race condition in the LLM callback system where multiple LLM instances calling set_callbacks concurrently could cause callbacks to be removed before they fire. This was identified in issue #4214, where a test required a sleep(5) workaround to pass.

The fix adds a class-level threading.RLock to synchronize access to global litellm callbacks during call() and acall() methods. The lock ensures that callback registration and the subsequent LLM call execution are atomic, preventing one instance from modifying callbacks while another is using them.

Changes:

  • Add _callback_lock: threading.RLock class attribute to LLM
  • Wrap callback setup and LLM call execution in both call() and acall() with the lock
  • Remove sleep(5) workaround from test_llm_callback_replacement
  • Add new test_llm_callback_lock_prevents_race_condition test for concurrent access
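In outline, the locking pattern looks like this (a simplified sketch: the real `call()` wraps `litellm.completion`, replaced here by a module-level stand-in so the example is self-contained):

```python
import threading

# Stand-in for litellm's module-level callback list (shared, global state).
_global_callbacks = []


class LLM:
    # One reentrant lock shared by every LLM instance. An RLock (rather than
    # a plain Lock) lets a recursive call -- e.g. retrying without an
    # unsupported 'stop' parameter -- re-acquire it without deadlocking.
    _callback_lock = threading.RLock()

    def __init__(self, callbacks):
        self.callbacks = callbacks

    def set_callbacks(self, callbacks):
        global _global_callbacks
        _global_callbacks = list(callbacks)  # mutates the shared global state

    def call(self, prompt):
        # Registration and the (simulated) completion are atomic under the
        # lock: no other instance can swap the global callbacks in between.
        with LLM._callback_lock:
            self.set_callbacks(self.callbacks)
            fired = list(_global_callbacks)  # the callbacks that would fire
            return prompt.upper(), fired
```

Because the lock is class-level, it serializes across all instances, which is exactly what protects the global litellm state (and also what the review below flags as a performance concern).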

Review & Testing Checklist for Human

  • Performance impact: The lock is held during the entire LLM call (including network I/O), which serializes all LLM calls across all instances. Verify this is acceptable for multi-agent scenarios where parallelism is expected.
  • Lock scope: Consider whether the lock scope is too broad - could it protect just the callback registration/cleanup rather than the entire call?
  • Real-world validation: Run a multi-agent crew with concurrent LLM calls to verify the fix works in production scenarios and doesn't introduce deadlocks or significant latency.

Recommended test plan:

  1. Run the existing test suite to ensure no regressions
  2. Create a simple multi-agent crew with 3+ agents making concurrent LLM calls
  3. Verify callbacks (like token tracking) work correctly without interference
  4. Monitor for any performance degradation compared to before the fix
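Step 3 can be smoke-tested without a real crew. A minimal sketch (`TokenTracker` and `fake_llm_call` are illustrative names, not crewAI APIs):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical token-tracking callback: if concurrent calls interfered
# with callback state, invocations (and therefore counts) would be lost.
class TokenTracker:
    def __init__(self):
        self._lock = threading.Lock()
        self.total = 0

    def on_call(self, n_tokens):
        with self._lock:
            self.total += n_tokens


def fake_llm_call(tracker, n_tokens):
    # Stand-in for one agent's LLM call firing its callback.
    tracker.on_call(n_tokens)
    return n_tokens


tracker = TokenTracker()
with ThreadPoolExecutor(max_workers=3) as pool:  # 3+ concurrent "agents"
    list(pool.map(lambda n: fake_llm_call(tracker, n), [10] * 30))

print(tracker.total)  # 300: no callback invocations lost
```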

Notes

This commit fixes a race condition in the LLM callback system where
multiple LLM instances calling set_callbacks concurrently could cause
callbacks to be removed before they fire.

Changes:
- Add class-level RLock (_callback_lock) to LLM class to synchronize
  access to global litellm callbacks
- Wrap callback registration and LLM call execution in the lock for
  both call() and acall() methods
- Use RLock (reentrant lock) to handle recursive calls without deadlock
  (e.g., when retrying with unsupported 'stop' parameter)
- Remove sleep(5) workaround from test_llm_callback_replacement test
- Add new test_llm_callback_lock_prevents_race_condition test to verify
  concurrent callback access is properly synchronized

Fixes #4214

Co-Authored-By: João <joao@crewai.com>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring


@VedantMadane left a comment


The race condition diagnosis is correct, and RLock is the right choice for the recursive retry case. One significant concern worth discussing:

Performance consideration:

The lock is held for the entire duration of the LLM call (including network I/O):

```python
with LLM._callback_lock:
    ...  # entire completion call happens here
```

This effectively serializes all LLM calls across all instances in the process. For multi-agent crews that rely on parallel LLM calls, this could be a significant performance regression.
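The serialization effect is easy to demonstrate with a minimal stand-in (names are illustrative; the sleep stands in for network I/O):

```python
import threading
import time

_callback_lock = threading.RLock()


def locked_call():
    # Simulates an LLM call whose entire duration, including "network"
    # time, is spent holding the shared class-level lock.
    with _callback_lock:
        time.sleep(0.2)


start = time.monotonic()
threads = [threading.Thread(target=locked_call) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start

# Three "concurrent" calls take ~0.6s instead of ~0.2s: fully serialized.
print(round(elapsed, 1))
```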

Alternative approaches to consider:

  1. Narrow the lock scope - only protect the callback registration/cleanup, not the actual LLM call:

     ```python
     with LLM._callback_lock:
         self.set_callbacks(callbacks)

     result = litellm.completion(**params)  # outside the lock

     with LLM._callback_lock:
         ...  # cleanup callbacks
     ```

  2. Per-instance callbacks instead of global litellm callbacks - would require changes to how callbacks are wired but eliminates the global state issue entirely.

  3. Copy-on-write pattern for the callback list.

The current fix is sound from a correctness standpoint, but the performance tradeoff should be explicitly accepted or an alternative considered. The PR description does flag this for review, which is good.

Would love to hear thoughts from maintainers on acceptable parallelism tradeoffs.

@devin-ai-integration
Contributor Author

Closing due to inactivity for more than 7 days.

@VedantMadane

/review

@VedantMadane

Opened follow-up PR #4252 with the same intent as this PR but without a global lock held across the network call (keeps concurrency): #4252



Development

Successfully merging this pull request may close these issues.

[BUG] Hidden race condition in LLM callback system causing test failures
