LLMChat Multi-Worker: Add Documentation and Integration Tests (PR #1236 Follow-up) #1239

@crivetimihai

Summary

PR #1236 adds Redis-based session consistency for LLMChat multi-worker deployments. The implementation is well-architected and solves a real problem, but several follow-up tasks remain before it is production-ready: documentation, integration tests, monitoring, and configuration cleanup.

PR #1236 Overview

  • Changes: 786 additions / 295 deletions
  • Key Components:
    • New ChatHistoryManager class for centralized history management
    • Redis-based distributed session storage with fallback to in-memory
    • Worker coordination with distributed locking
    • TTL-based session ownership (SESSION_TTL=300s, LOCK_TTL=30s)
  • Status: All 36 CI checks passing

Required Documentation Updates

1. Redis Setup Requirements

Add documentation covering:

  • Redis installation and configuration for multi-worker deployments
  • Environment variables for Redis connection (REDIS_URL, CACHE_TYPE)
  • Session management environment variables:
    • LLMCHAT_SESSION_TTL (default: 300 seconds)
    • LLMCHAT_LOCK_TTL (default: 30 seconds)
    • LLMCHAT_LOCK_TIMEOUT (default: 10 seconds)
    • LLMCHAT_MAX_HISTORY_MESSAGES (default: 50)
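
The documentation could include a short startup check that tells operators which session backend a worker will end up using. The sketch below is illustrative only: `select_session_backend` is a hypothetical helper, and the rule it encodes (shared sessions require `CACHE_TYPE=redis` plus a non-empty `REDIS_URL`) is an assumption about how the two variables interact, not confirmed gateway behavior.

```python
import os
from typing import Mapping, Optional


def select_session_backend(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "redis" or "memory" based on CACHE_TYPE and REDIS_URL.

    Assumption: sessions can only be shared between workers when CACHE_TYPE
    is "redis" and REDIS_URL is set; anything else means per-worker memory.
    """
    env = os.environ if env is None else env
    cache_type = env.get("CACHE_TYPE", "").strip().lower()
    redis_url = env.get("REDIS_URL", "").strip()
    if cache_type == "redis" and redis_url:
        return "redis"
    return "memory"
```

Logging the result once per worker at startup would make a misconfigured multi-worker deployment (for example, `CACHE_TYPE=redis` without a URL) visible early.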

2. Architecture Documentation

Document the new multi-worker architecture:

  • How worker coordination works
  • Session ownership and TTL behavior
  • Distributed locking mechanism
  • Automatic session recreation from stored config
  • Fallback behavior when Redis is unavailable
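
For the locking section, a small self-contained model may help readers before they look at the router code. The sketch below is not the implementation from PR #1236: `InMemoryKV`, `acquire_lock`, and `release_lock` are hypothetical names, and the in-memory store only imitates the "set if absent, with expiry" behavior that a Redis-backed lock typically relies on. The `LOCK_TTL` value is the default quoted in this issue.

```python
import time
import uuid
from typing import Callable, Dict, Optional, Tuple

LOCK_TTL = 30  # seconds; default LLMCHAT_LOCK_TTL from this issue


class InMemoryKV:
    """Stand-in for Redis supporting set-if-absent with expiry, get, and delete."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def _purge(self, key: str) -> None:
        # Drop the key if its expiry time has passed.
        entry = self._data.get(key)
        if entry is not None and entry[1] <= self._clock():
            del self._data[key]

    def set_nx(self, key: str, value: str, ttl: float) -> bool:
        self._purge(key)
        if key in self._data:
            return False
        self._data[key] = (value, self._clock() + ttl)
        return True

    def get(self, key: str) -> Optional[str]:
        self._purge(key)
        entry = self._data.get(key)
        return entry[0] if entry else None

    def delete_if_equal(self, key: str, value: str) -> bool:
        # Only the holder of the matching token may delete the key.
        if self.get(key) == value:
            del self._data[key]
            return True
        return False


def acquire_lock(kv: InMemoryKV, session_id: str) -> Optional[str]:
    """Try to take the session lock; return an owner token, or None if held."""
    token = uuid.uuid4().hex
    return token if kv.set_nx(f"lock:{session_id}", token, LOCK_TTL) else None


def release_lock(kv: InMemoryKV, session_id: str, token: str) -> bool:
    """Release the lock only if this token still owns it."""
    return kv.delete_if_equal(f"lock:{session_id}", token)
```

The owner token matters because a lock can expire while its holder is still working. Without the token check, a slow worker could release a lock that another worker has since acquired. Against a real Redis server the compare-and-delete step has to be atomic, which this single-process model does not demonstrate.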

Recommended Integration Tests

1. Multi-Worker Coordination Tests

  • Session handoff between workers
  • Distributed lock acquisition and release
  • Race condition scenarios with concurrent requests
  • Session TTL expiration and renewal
  • Lock timeout handling
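
A race-condition test for lock acquisition could take roughly the following shape. This is a sketch under stated assumptions: `FakeLockStore` is a hypothetical thread-safe stand-in for the Redis call, threads stand in for workers, and TTLs are left out for brevity. A real integration test would run separate worker processes against an actual Redis instance.

```python
import threading
from typing import Dict, List


class FakeLockStore:
    """Thread-safe stand-in for a set-if-absent call (no TTL, for brevity)."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._owners: Dict[str, str] = {}

    def set_nx(self, key: str, owner: str) -> bool:
        with self._mutex:
            if key in self._owners:
                return False
            self._owners[key] = owner
            return True


def race_for_lock(store: FakeLockStore, session_id: str, workers: int) -> List[str]:
    """Start `workers` threads that all try to take the same lock at once."""
    barrier = threading.Barrier(workers)
    winners: List[str] = []
    winners_mutex = threading.Lock()

    def attempt(worker_id: str) -> None:
        barrier.wait(timeout=5)  # line every thread up to maximise contention
        if store.set_nx(f"lock:{session_id}", worker_id):
            with winners_mutex:
                winners.append(worker_id)

    threads = [
        threading.Thread(target=attempt, args=(f"worker-{i}",))
        for i in range(workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return winners


def test_exactly_one_worker_acquires_lock() -> None:
    winners = race_for_lock(FakeLockStore(), "session-1", workers=8)
    assert len(winners) == 1
```

The barrier is the important part of the design: without it the threads tend to run one after another and the test passes even when the store is not safe under contention.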

2. Redis Failure Scenarios

  • Graceful degradation when Redis is unavailable
  • Fallback to in-memory storage
  • Redis connection loss during active session
  • Redis recovery and session restoration
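
The failure scenarios above can be exercised without a Redis server by using a test double that fails on demand. The sketch below is hypothetical throughout: `FlakyStore`, `InMemoryStore`, and `FallbackStore` are illustrative names, and the built-in `ConnectionError` stands in for whatever connection exception the real Redis client raises. It shows one possible fallback policy, not the one PR #1236 implements.

```python
from typing import Dict, Optional


class InMemoryStore:
    """Per-worker dictionary store."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class FlakyStore(InMemoryStore):
    """Test double for Redis: raises ConnectionError while `up` is False."""

    def __init__(self) -> None:
        super().__init__()
        self.up = True

    def get(self, key: str) -> Optional[str]:
        if not self.up:
            raise ConnectionError("primary store unreachable")
        return super().get(key)

    def set(self, key: str, value: str) -> None:
        if not self.up:
            raise ConnectionError("primary store unreachable")
        super().set(key, value)


class FallbackStore:
    """Use the primary store; on connection failure use local memory instead."""

    def __init__(self, primary: InMemoryStore, fallback: InMemoryStore) -> None:
        self._primary = primary
        self._fallback = fallback
        self.degraded = False  # True while the last call hit the fallback

    def set(self, key: str, value: str) -> None:
        try:
            self._primary.set(key, value)
            self.degraded = False
        except ConnectionError:
            self.degraded = True
            self._fallback.set(key, value)

    def get(self, key: str) -> Optional[str]:
        try:
            value = self._primary.get(key)
            self.degraded = False
            return value
        except ConnectionError:
            self.degraded = True
            return self._fallback.get(key)
```

With this policy, data written while degraded stays in one worker's memory and is not copied back when the primary recovers. Whether that is acceptable for chat sessions is exactly what the "Redis recovery and session restoration" test should pin down.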

3. Chat History Persistence Tests

  • History preservation across worker restarts
  • Message ordering with concurrent appends
  • History size limits and trimming behavior
  • Clear history operations across workers
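
To make the trimming and ordering expectations concrete, here is a minimal in-process model of a bounded history. It is not the `ChatHistoryManager` from PR #1236: `BoundedHistory` is a hypothetical class, and dropping the oldest messages first is an assumption about the intended trimming behavior. The limit is the `LLMCHAT_MAX_HISTORY_MESSAGES` default quoted in this issue.

```python
import threading
from collections import deque
from typing import Deque, Dict, List

MAX_HISTORY_MESSAGES = 50  # default LLMCHAT_MAX_HISTORY_MESSAGES from this issue


class BoundedHistory:
    """Keeps at most `max_messages` entries, discarding the oldest first."""

    def __init__(self, max_messages: int = MAX_HISTORY_MESSAGES) -> None:
        self._messages: Deque[Dict[str, str]] = deque(maxlen=max_messages)
        self._lock = threading.Lock()

    def append(self, role: str, content: str) -> None:
        with self._lock:
            # deque(maxlen=...) silently drops the oldest entry when full.
            self._messages.append({"role": role, "content": content})

    def messages(self) -> List[Dict[str, str]]:
        with self._lock:
            return list(self._messages)

    def clear(self) -> None:
        with self._lock:
            self._messages.clear()
```

A trimming test can then append more than the limit and assert which messages survive, while an ordering test can append from several threads and assert that each thread's own messages keep their relative order. The cross-worker versions of both tests need the shared Redis store rather than this single-process model.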

Monitoring Recommendations

Add observability for:

  • Session recreation frequency (potential indicator of TTL tuning needs)
  • Lock contention metrics
  • Redis connection health
  • Worker session distribution
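
As a starting point for discussion, the first two metrics could be tracked with plain counters before wiring them into whatever metrics backend the gateway uses. The `LLMChatMetrics` class and its counter names are hypothetical, chosen only for this sketch.

```python
from collections import Counter
from typing import Dict


class LLMChatMetrics:
    """In-process counters for session recreation and lock contention."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record_session_recreated(self) -> None:
        self._counts["session_recreated"] += 1

    def record_lock_attempt(self, acquired: bool) -> None:
        self._counts["lock_attempts"] += 1
        if not acquired:
            self._counts["lock_contended"] += 1

    def lock_contention_ratio(self) -> float:
        """Fraction of lock attempts that found the lock already held."""
        attempts = self._counts["lock_attempts"]
        return self._counts["lock_contended"] / attempts if attempts else 0.0

    def snapshot(self) -> Dict[str, int]:
        return dict(self._counts)
```

A rising recreation count alongside a low contention ratio would point toward the session TTL being too short rather than toward workers competing for the same sessions.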

Configuration Improvements

Consider moving the tunables that llmchat_router.py currently reads directly from the environment into config.py:

```python
import os

from pydantic import Field, PositiveInt

# Currently in llmchat_router.py:
SESSION_TTL = int(os.getenv("LLMCHAT_SESSION_TTL", "300"))
LOCK_TTL = int(os.getenv("LLMCHAT_LOCK_TTL", "30"))
LOCK_TIMEOUT = int(os.getenv("LLMCHAT_LOCK_TIMEOUT", "10"))
MAX_HISTORY_MESSAGES = int(os.getenv("LLMCHAT_MAX_HISTORY_MESSAGES", "50"))

# Should be in config.py (as settings fields) with proper Pydantic validation:
llmchat_session_ttl: PositiveInt = Field(default=300)
llmchat_lock_ttl: PositiveInt = Field(default=30)
llmchat_lock_timeout: PositiveInt = Field(default=10)
llmchat_max_history_messages: PositiveInt = Field(default=50)
```
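
The practical difference is that `int(os.getenv(...))` accepts `0` and negative values, whereas `PositiveInt` rejects them. The following stdlib-only sketch shows the kind of check the move would add. `positive_int_env` is a hypothetical helper written for illustration and is not a substitute for the Pydantic fields proposed above.

```python
import os
from typing import Mapping, Optional


def positive_int_env(
    name: str, default: int, env: Optional[Mapping[str, str]] = None
) -> int:
    """Read an integer setting and reject non-numeric or non-positive values."""
    env = os.environ if env is None else env
    raw = env.get(name, str(default))
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None
    if value <= 0:
        raise ValueError(f"{name} must be greater than 0, got {value}")
    return value
```

For example, `LLMCHAT_LOCK_TTL=0` parses without complaint under the current code, while this check (like `PositiveInt`) fails fast at startup.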

Related Files

  • mcpgateway/services/mcp_client_chat_service.py - ChatHistoryManager class (lines 1193-1454)
  • mcpgateway/routers/llmchat_router.py - Worker coordination logic (lines 407-669)
  • mcpgateway/config.py - Potential location for configuration additions

Labels

documentation, enhancement, testing
