Fix race condition in LLM callback system #4218
devin-ai-integration[bot] wants to merge 1 commit into main
Conversation
This commit fixes a race condition in the LLM callback system where multiple LLM instances calling `set_callbacks` concurrently could cause callbacks to be removed before they fire.

Changes:
- Add class-level RLock (`_callback_lock`) to the LLM class to synchronize access to global litellm callbacks
- Wrap callback registration and LLM call execution in the lock for both `call()` and `acall()` methods
- Use RLock (reentrant lock) to handle recursive calls without deadlock (e.g., when retrying with an unsupported 'stop' parameter)
- Remove the `sleep(5)` workaround from the `test_llm_callback_replacement` test
- Add a new `test_llm_callback_lock_prevents_race_condition` test to verify that concurrent callback access is properly synchronized

Fixes #4214

Co-Authored-By: João <joao@crewai.com>
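A minimal sketch of the locking pattern the commit describes. Names here are stand-ins: `GLOBAL_CALLBACKS` plays the role of litellm's global callback list and `_fake_completion` stands in for `litellm.completion`; the real `LLM` class carries far more state.

```python
import threading

GLOBAL_CALLBACKS = []  # stand-in for litellm's module-level callback list

def _fake_completion(prompt):
    # Stand-in for litellm.completion: fires whatever callbacks are
    # currently registered globally, then returns a response.
    for cb in list(GLOBAL_CALLBACKS):
        cb(prompt)
    return f"response:{prompt}"

class LLM:
    # One lock shared by every instance, since the callbacks are global.
    _callback_lock = threading.RLock()

    def __init__(self, callbacks=None):
        self.callbacks = list(callbacks or [])

    def set_callbacks(self, callbacks):
        # In crewai this mutates litellm's global callback list, which is
        # why concurrent instances must not interleave here.
        GLOBAL_CALLBACKS.clear()
        GLOBAL_CALLBACKS.extend(callbacks)

    def call(self, prompt):
        # Registration and execution are atomic under the shared RLock,
        # so another instance cannot replace the callbacks mid-call.
        with LLM._callback_lock:
            self.set_callbacks(self.callbacks)
            return _fake_completion(prompt)
```

Without the `with LLM._callback_lock:` block, a second instance's `set_callbacks` could clear the global list between registration and the completion call, which is exactly the race from #4214.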
VedantMadane left a comment:
The race condition diagnosis is correct, and RLock is the right choice for the recursive retry case. One significant concern worth discussing:
Performance consideration:
The lock is held for the entire duration of the LLM call (including network I/O):
```python
with LLM._callback_lock:
    # ... entire completion call happens here
```
This effectively serializes all LLM calls across all instances in the process. For multi-agent crews that rely on parallel LLM calls, this could be a significant performance regression.
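To make the cost concrete, here is a toy timing sketch (pure stdlib, with `time.sleep` standing in for the network call): four "parallel" calls that each hold the lock during I/O finish in roughly four times the single-call latency.

```python
import threading
import time

lock = threading.RLock()

def call_with_lock(latency=0.05):
    # Holding the lock across the (simulated) network call means
    # concurrent callers queue up behind one another.
    with lock:
        time.sleep(latency)

start = time.monotonic()
threads = [threading.Thread(target=call_with_lock) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
# elapsed is roughly 4 * 0.05 s rather than ~0.05 s: the calls serialized.
```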
Alternative approaches to consider:
- Narrow the lock scope - only protect the callback registration/cleanup, not the actual LLM call (a `try`/`finally` keeps cleanup safe if the call raises):

  ```python
  with LLM._callback_lock:
      self.set_callbacks(callbacks)
  try:
      result = litellm.completion(**params)  # network I/O outside the lock
  finally:
      with LLM._callback_lock:
          # cleanup callbacks
          ...
  ```
- Per-instance callbacks instead of global litellm callbacks - would require changes to how callbacks are wired, but eliminates the global-state issue entirely.
- Copy-on-write pattern for the callback list.
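A hedged sketch of what the copy-on-write option could look like (the `CallbackRegistry` class and its methods are illustrative, not existing crewai or litellm APIs): writers build a new immutable snapshot under a short lock and swap it in; readers iterate their snapshot with no lock at all, so a concurrent `set_callbacks` can never remove a callback out from under a running call.

```python
import threading

class CallbackRegistry:
    def __init__(self):
        self._lock = threading.Lock()
        self._callbacks = ()  # immutable snapshot, rebound atomically

    def set_callbacks(self, callbacks):
        # Writers take a short lock only to build and swap in a fresh
        # snapshot; rebinding the attribute is atomic under the GIL.
        with self._lock:
            self._callbacks = tuple(callbacks)

    def fire(self, event):
        # Readers need no lock: they capture one consistent snapshot and
        # iterate it even if set_callbacks runs concurrently.
        for cb in self._callbacks:
            cb(event)
```

The key property is that `fire` never observes a half-updated list, and the lock is never held during the (potentially slow) callback execution.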
The current fix is correct, but the performance tradeoff should be explicitly accepted or an alternative considered. The PR description does flag this for review, which is good.
Would love to hear thoughts from maintainers on acceptable parallelism tradeoffs.
Closing due to inactivity for more than 7 days.
/review
Summary
Fixes a race condition in the LLM callback system where multiple LLM instances calling `set_callbacks` concurrently could cause callbacks to be removed before they fire. This was identified in issue #4214, where a test required a `sleep(5)` workaround to pass.

The fix adds a class-level `threading.RLock` to synchronize access to global litellm callbacks during the `call()` and `acall()` methods. The lock ensures that callback registration and the subsequent LLM call execution are atomic, preventing one instance from modifying callbacks while another is using them.

Changes:
- Add a `_callback_lock: threading.RLock` class attribute to `LLM`
- Wrap `call()` and `acall()` with the lock
- Remove the `sleep(5)` workaround from `test_llm_callback_replacement`
- Add a `test_llm_callback_lock_prevents_race_condition` test for concurrent access

Review & Testing Checklist for Human
Recommended test plan:
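A concurrency check in the spirit of `test_llm_callback_lock_prevents_race_condition` might look like the following hedged sketch (the actual test in the PR may differ; `registered` stands in for the global callback list): each thread registers, "fires", and cleans up its callback under the shared lock, and `lost` records any callback that another thread's cleanup removed before it could fire.

```python
import threading

lock = threading.RLock()
registered = []
lost = []

def worker(i):
    # Register, check, and clean up atomically; without the lock,
    # another thread's cleanup could remove our entry mid-sequence.
    with lock:
        registered.append(i)
        if i not in registered:  # callback vanished before firing
            lost.append(i)
        registered.remove(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```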
Notes
- Uses `RLock` (reentrant lock) to handle the recursive calls that occur when retrying with an unsupported 'stop' parameter
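Why `RLock` rather than a plain `Lock`: a minimal reentrancy sketch (the `call` function and its retry condition are illustrative, not the actual crewai code). On the retry path the method re-enters itself while still holding the lock; a non-reentrant `threading.Lock` would deadlock here, while `RLock` lets the same thread acquire it a second time.

```python
import threading

rlock = threading.RLock()

def call(params, _retry=False):
    # If the provider rejects the 'stop' parameter, call() retries by
    # re-entering itself while the lock is still held by this thread.
    with rlock:
        if "stop" in params and not _retry:
            stripped = {k: v for k, v in params.items() if k != "stop"}
            return call(stripped, _retry=True)
        return "ok"

result = call({"prompt": "hi", "stop": ["\n"]})
```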