Closed
Labels
documentation, enhancement, testing
Summary
PR #1236 adds Redis-based session consistency for LLMChat multi-worker deployments. While the implementation is well-architected and solves a real problem, there are several follow-up tasks to ensure production readiness.
PR #1236 Overview
- Changes: 786 additions / 295 deletions
- Key Components:
  - New ChatHistoryManager class for centralized history management
  - Redis-based distributed session storage with fallback to in-memory
  - Worker coordination with distributed locking
  - TTL-based session ownership (SESSION_TTL=300s, LOCK_TTL=30s)
- Status: All 36 CI checks passing
Required Documentation Updates
1. Redis Setup Requirements
Add documentation covering:
- Redis installation and configuration for multi-worker deployments
- Environment variables for Redis connection (REDIS_URL, CACHE_TYPE)
- Session management environment variables:
  - LLMCHAT_SESSION_TTL (default: 300 seconds)
  - LLMCHAT_LOCK_TTL (default: 30 seconds)
  - LLMCHAT_LOCK_TIMEOUT (default: 10 seconds)
  - LLMCHAT_MAX_HISTORY_MESSAGES (default: 50)
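For the docs, it may help to show how these variables are expected to behave when unset or invalid. The following is a minimal stdlib-only sketch, not the code from PR #1236: the helper names `read_positive_int` and `load_llmchat_settings` are hypothetical, and rejecting zero or negative values is an assumption about the desired validation.

```python
import os

# Defaults copied from the list above.
DEFAULTS = {
    "LLMCHAT_SESSION_TTL": 300,
    "LLMCHAT_LOCK_TTL": 30,
    "LLMCHAT_LOCK_TIMEOUT": 10,
    "LLMCHAT_MAX_HISTORY_MESSAGES": 50,
}


def read_positive_int(name, default, env=None):
    """Read an integer setting; reject zero, negatives, and non-numeric text."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default  # unset or blank falls back to the documented default
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from exc
    if value <= 0:
        raise ValueError(f"{name} must be positive, got {value}")
    return value


def load_llmchat_settings(env=None):
    """Return all four session settings as a dict keyed by variable name."""
    return {
        name: read_positive_int(name, default, env)
        for name, default in DEFAULTS.items()
    }
```

Passing a plain dict as `env` makes the behavior easy to exercise in tests without touching the real process environment.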
2. Architecture Documentation
Document the new multi-worker architecture:
- How worker coordination works
- Session ownership and TTL behavior
- Distributed locking mechanism
- Automatic session recreation from stored config
- Fallback behavior when Redis is unavailable
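To make the architecture docs concrete, the lock and ownership concepts above could be illustrated along these lines. This is a sketch under stated assumptions, not the PR's implementation: `FakeStore` is an in-memory stand-in for Redis, and the key names (`lock:<id>`, `owner:<id>`) and function names are hypothetical. With redis-py, the conditional write corresponds to `set(key, value, nx=True, ex=ttl)`; the compare-and-delete on release would need to be atomic on the server (commonly done with a Lua script).

```python
import time
import uuid


class FakeStore:
    """In-memory stand-in for Redis string keys with expiry (illustration only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def _live(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] <= self._clock():
            del self._data[key]  # lazily drop expired keys
            return None
        return entry

    def get(self, key):
        entry = self._live(key)
        return entry[0] if entry else None

    def set(self, key, value, ttl, only_if_absent=False):
        if only_if_absent and self._live(key) is not None:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True

    def delete_if_equals(self, key, value):
        if self.get(key) == value:
            del self._data[key]
            return True
        return False


def acquire_lock(store, session_id, lock_ttl=30, lock_timeout=10,
                 poll=0.05, sleep=time.sleep, clock=time.monotonic):
    """Poll until the lock is taken or lock_timeout elapses; return a token or None."""
    token = uuid.uuid4().hex  # unique per holder, so only the holder can release
    deadline = clock() + lock_timeout
    while True:
        if store.set(f"lock:{session_id}", token, lock_ttl, only_if_absent=True):
            return token
        if clock() >= deadline:
            return None
        sleep(poll)


def release_lock(store, session_id, token):
    """Release only if the stored token matches the caller's token."""
    return store.delete_if_equals(f"lock:{session_id}", token)


def claim_session(store, session_id, worker_id, session_ttl=300):
    """Claim an unowned session or refresh our own claim; refuse if another worker owns it."""
    owner = store.get(f"owner:{session_id}")
    if owner in (None, worker_id):
        store.set(f"owner:{session_id}", worker_id, session_ttl)
        return True
    return False
```

The lock TTL bounds how long a crashed worker can block others, while the session TTL decides when another worker may take over ownership and recreate the session from stored config.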
Recommended Integration Tests
1. Multi-Worker Coordination Tests
- Session handoff between workers
- Distributed lock acquisition and release
- Race condition scenarios with concurrent requests
- Session TTL expiration and renewal
- Lock timeout handling
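A race-condition test could take roughly the following shape. This sketch uses threads and an in-process `LockTable` (a hypothetical stand-in for an atomic set-if-absent store) so it runs without Redis; a real integration test would presumably point several workers at a shared Redis instance instead.

```python
import threading


class LockTable:
    """Thread-safe stand-in for an atomic set-if-absent store (hypothetical)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._owners = {}

    def try_acquire(self, key, owner):
        with self._guard:
            if key in self._owners:
                return False
            self._owners[key] = owner
            return True

    def release(self, key, owner):
        with self._guard:
            if self._owners.get(key) == owner:
                del self._owners[key]
                return True
            return False


def race_for_lock(table, key, worker_count=8):
    """Start all workers at once and return the ids of those that got the lock."""
    barrier = threading.Barrier(worker_count)
    winners = []
    winners_guard = threading.Lock()

    def worker(worker_id):
        barrier.wait()  # line the threads up so they contend at the same moment
        if table.try_acquire(key, worker_id):
            with winners_guard:
                winners.append(worker_id)

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


def test_only_one_worker_wins():
    table = LockTable()
    winners = race_for_lock(table, "lock:session-1")
    assert len(winners) == 1
    # A non-owner must not be able to release the lock.
    assert table.release("lock:session-1", "intruder") is False
    assert table.release("lock:session-1", winners[0]) is True
```

The barrier matters: without it the first thread usually finishes before the others start, and the test passes without exercising any contention.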
2. Redis Failure Scenarios
- Graceful degradation when Redis is unavailable
- Fallback to in-memory storage
- Redis connection loss during active session
- Redis recovery and session restoration
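The degradation and recovery cases above can be prototyped against a small wrapper like the one below. Everything here is an assumption for illustration: `FlakyBackend` simulates an outage, `FallbackStore` is a hypothetical wrapper, and the builtin `ConnectionError` keeps the sketch stdlib-only (a real Redis client raises its own exception types, which the wrapper would need to catch instead).

```python
class FlakyBackend:
    """Hypothetical stand-in for a Redis client; raises while 'down'."""

    def __init__(self):
        self.up = True
        self.data = {}

    def get(self, key):
        if not self.up:
            raise ConnectionError("backend unavailable")
        return self.data.get(key)

    def set(self, key, value):
        if not self.up:
            raise ConnectionError("backend unavailable")
        self.data[key] = value


class FallbackStore:
    """Prefer the shared backend; fall back to a per-process dict on failure."""

    def __init__(self, backend):
        self._backend = backend
        self._memory = {}
        self.degraded = False  # True while the last backend call failed

    def set(self, key, value):
        try:
            self._backend.set(key, value)
            self.degraded = False
        except ConnectionError:
            self.degraded = True
        # Mirror locally so reads keep working if the backend drops later.
        self._memory[key] = value

    def get(self, key):
        try:
            value = self._backend.get(key)
            self.degraded = False
        except ConnectionError:
            self.degraded = True
            return self._memory.get(key)
        # After recovery the backend may lack keys written during the outage.
        return value if value is not None else self._memory.get(key)
```

Note that the in-memory mirror is local to one worker and unbounded in this sketch, so while degraded there is no cross-worker consistency; tests should assert that explicitly rather than assume it.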
3. Chat History Persistence Tests
- History preservation across worker restarts
- Message ordering with concurrent appends
- History size limits and trimming behavior
- Clear history operations across workers
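For the size-limit test, the expected trimming behavior can be pinned down with a small reference model. The sketch below assumes "keep the most recent N messages" semantics, which the issue does not confirm for ChatHistoryManager; `BoundedHistory` is a hypothetical name. On a Redis list, the same effect is commonly achieved with RPUSH followed by LTRIM.

```python
from collections import deque


class BoundedHistory:
    """Keep only the most recent max_messages entries, oldest dropped first."""

    def __init__(self, max_messages=50):
        self._messages = deque(maxlen=max_messages)

    def append(self, role, content):
        # deque(maxlen=...) discards from the left once the limit is reached
        self._messages.append({"role": role, "content": content})

    def messages(self):
        return list(self._messages)

    def clear(self):
        self._messages.clear()


history = BoundedHistory(max_messages=50)
for i in range(60):
    history.append("user", f"msg-{i}")
```

After the loop, the history holds 50 entries, starting at `msg-10` and ending at `msg-59`, which is the ordering a trimming test would assert on.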
Monitoring Recommendations
Add observability for:
- Session recreation frequency (potential indicator of TTL tuning needs)
- Lock contention metrics
- Redis connection health
- Worker session distribution
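As a starting point, the first two signals could be captured with in-process counters before wiring them to whatever metrics backend the project uses. This is a sketch only: `CoordinationMetrics` and its method names are hypothetical, and nothing here reflects existing instrumentation in the codebase.

```python
import time
from collections import Counter
from contextlib import contextmanager


class CoordinationMetrics:
    """Minimal in-process counters for session and lock behavior."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.counts = Counter()
        self.lock_wait_seconds = []

    def record_session_recreated(self, worker_id):
        # A rising rate here may suggest SESSION_TTL is too short.
        self.counts["session_recreated"] += 1
        self.counts[f"session_recreated:{worker_id}"] += 1

    def record_lock_timeout(self):
        self.counts["lock_timeout"] += 1

    @contextmanager
    def lock_wait(self):
        """Time how long the caller spends waiting to acquire a lock."""
        start = self._clock()
        try:
            yield
        finally:
            self.lock_wait_seconds.append(self._clock() - start)
```

Wrapping the lock acquisition call in `with metrics.lock_wait():` records the wait time even when acquisition raises, since the measurement is taken in the `finally` branch.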
Configuration Improvements
Consider moving the module-level tunables, which llmchat_router.py currently reads straight from the environment, into config.py:
# Currently in llmchat_router.py:
SESSION_TTL = int(os.getenv("LLMCHAT_SESSION_TTL", "300"))
LOCK_TTL = int(os.getenv("LLMCHAT_LOCK_TTL", "30"))
LOCK_TIMEOUT = int(os.getenv("LLMCHAT_LOCK_TIMEOUT", "10"))
MAX_HISTORY_MESSAGES = int(os.getenv("LLMCHAT_MAX_HISTORY_MESSAGES", "50"))
# Should be in config.py with proper Pydantic validation:
llmchat_session_ttl: PositiveInt = Field(default=300)
llmchat_lock_ttl: PositiveInt = Field(default=30)
llmchat_lock_timeout: PositiveInt = Field(default=10)
llmchat_max_history_messages: PositiveInt = Field(default=50)
Related Files
- mcpgateway/services/mcp_client_chat_service.py - ChatHistoryManager class (lines 1193-1454)
- mcpgateway/routers/llmchat_router.py - Worker coordination logic (lines 407-669)
- mcpgateway/config.py - Potential location for configuration additions