012 - Hot reload feature proposal#83

Open
Uzziee wants to merge 5 commits into kroxylicious:main from Uzziee:hot-reload-proposal

Conversation


@Uzziee Uzziee commented Nov 18, 2025

This proposal adds hot reload functionality, enabling the proxy to pick up changes to the virtual cluster configuration without restarting the application.

Signed-off-by: Urjit Patel <105218041+Uzziee@users.noreply.github.com>

@SamBarker SamBarker left a comment


Design PR#83 Feedback - Configuration Reload Design

Date: 2026-01-28
Reviewer: Sam Barker
Design PR: #83

Executive Summary

Thank you for putting together this design proposal! Configuration reload is a critical operational feature that many users have been asking for, and your design work provides a solid foundation for moving this forward.

The current proposal focuses on file watch as the primary mechanism. This feedback suggests an alternative HTTP-first approach with 2-phase validation and discusses enhancements that will make either approach production-ready. The feedback builds on analysis of the POC implementation (PR#3176).

Your POC demonstrates the core reload mechanism works well - the questions here are primarily about the trigger mechanism and operator integration patterns. The groundwork you've laid out makes these decisions much clearer.

Proposed Change to Design: HTTP Endpoints as Primary Interface

Current Design Proposal

The design PR currently proposes file watch as the primary mechanism for configuration reload (Part 1), with potential HTTP endpoints as future work.

Recommended Alternative: HTTP-First Approach

I recommend inverting this: make HTTP endpoints the primary interface, with file watching as an optional convenience layer.

Rationale for HTTP-first:

Universal: Works on bare metal, Kubernetes, and any deployment model
Operator-friendly: Natural integration point for Kubernetes operator (operator detects ConfigMap changes → POST /admin/config/reload)
Testable: Easy to test programmatically (integration tests can POST directly)
Observable: Clear success/failure responses (200 OK vs 400 Bad Request with error details)
Composable: File watching can be implemented as a layer that calls the HTTP endpoint internally
Kubernetes-native: Aligns with how operators interact with workloads (API calls, not filesystem)

File watching challenges:

  • ❌ Read-only filesystem (Kubernetes security best practice blocks file writes)
  • ❌ ConfigMap mounting complexity (..data symlinks, atomic updates)
  • ❌ No feedback mechanism (how does operator know reload succeeded/failed?)
  • ❌ Race conditions (file watch triggers before ConfigMap fully mounted)

Proposed architecture:

Core: HTTP Management Endpoints

Proxy exposes on localhost:9190 (management port):
    ↓
POST /admin/config/validate (validate without applying)
POST /admin/config/reload (apply changes)
GET /admin/config/status (current config version, last operation status)
GET /admin/health (proxy health for liveness/readiness, already exists)
    ↓
Core reload mechanism (shared by all trigger mechanisms)

2-Phase Workflow:

  1. Validate: Build models, initialize filters, check internal consistency (no port binding)
  2. Reload: If validation passes, apply changes (bind ports, register gateways)

Security:

  • Default bind: localhost:9190 (local access only)
  • For Kubernetes: Bind to 0.0.0.0:9190 (pod IP accessible to operator)
  • Authentication: Optional (TLS client certificates, bearer tokens)
  • Recommendations:
    • Bare metal: Keep localhost binding, use local access controls
    • Kubernetes: Use NetworkPolicy to restrict operator→proxy traffic
    • Production: Consider mTLS for operator↔proxy communication

Trigger Mechanisms (How to Call HTTP Endpoints)

Option 1: Direct HTTP (Kubernetes Operator)

Operator detects ConfigMap change
    ↓
POST /admin/config/validate to management Service
    ↓
POST /admin/config/reload to all pod IPs

✅ Native Kubernetes integration
✅ Immediate feedback via HTTP responses
✅ No filesystem coupling

Option 2: File Watcher (Bare Metal)

Sidecar process watches config file
    ↓
On file change → POST to localhost:9190/admin/config/validate
    ↓
If valid → POST to localhost:9190/admin/config/reload

Sidecar options:

  • Shell script: Simple inotifywait wrapper
    # -m keeps inotifywait running across events; --fail makes curl exit non-zero on HTTP 4xx/5xx
    inotifywait -m -e modify /etc/kroxylicious/config.yaml | while read -r _; do
      if curl --fail -X POST http://localhost:9190/admin/config/validate --data-binary @/etc/kroxylicious/config.yaml; then
        curl --fail -X POST http://localhost:9190/admin/config/reload --data-binary @/etc/kroxylicious/config.yaml
      fi
    done
  • Go binary: More robust error handling, retry logic
  • In-process Java: WatchService (if proxy can write to filesystem for persistence)

✅ Familiar workflow for bare metal users
✅ Decoupled from proxy (sidecar can be restarted independently)
✅ Uses same HTTP endpoints as Kubernetes
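The in-process Java option above can be sketched with the JDK's WatchService and HttpClient. This is illustrative only: it assumes the management port and endpoints from this proposal (localhost:9190, /admin/config/validate, /admin/config/reload), and the ConfigWatcher class and buildRequest helper are hypothetical names, not an existing API.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.*;

public class ConfigWatcher {

    // Build the POST for a given endpoint; the body is the current file content.
    static HttpRequest buildRequest(String endpoint, Path configFile) throws IOException {
        return HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9190" + endpoint))
                .header("Content-Type", "application/yaml")
                .POST(HttpRequest.BodyPublishers.ofString(Files.readString(configFile)))
                .build();
    }

    public static void watch(Path configFile) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            // WatchService watches directories, not individual files.
            configFile.getParent().register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
            while (true) {
                WatchKey key = watcher.take();
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (configFile.getFileName().equals(event.context())) {
                        // Validate first; only reload if validation succeeded.
                        HttpResponse<String> v = client.send(
                                buildRequest("/admin/config/validate", configFile),
                                HttpResponse.BodyHandlers.ofString());
                        if (v.statusCode() == 200) {
                            client.send(buildRequest("/admin/config/reload", configFile),
                                    HttpResponse.BodyHandlers.ofString());
                        }
                    }
                }
                key.reset();
            }
        }
    }
}
```

Note this mirrors the shell sidecar: same two-step validate-then-reload flow, same endpoints.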

This means:

  • HTTP endpoints are the primitive (required)
  • File watching is optional convenience (can be added later)
  • Both deployment models use same tested, validated endpoints
  • Validation catches config errors before any cluster goes down

Note: This is a significant change from the current design proposal, which focuses on file watch without a validation phase. If the community prefers file watch as the primary mechanism, we should address the challenges listed above (read-only filesystem, feedback mechanisms, etc.) in the design.

Cluster Modification Semantics

The design's remove→add pattern is architecturally necessary:

The proxy's channel state machine has a fundamental constraint: each frontend channel (client→proxy) has a 1:1 relationship with a backend channel (proxy→broker). There's no mechanism to redirect an existing backend connection without closing the frontend connection.

This means:

  • Any cluster modification requires draining connections (1-30 seconds downtime per cluster)
  • "Atomic swap" approaches don't eliminate downtime—they would require hot-swapping filters in the Netty pipeline, which introduces filter state management complexity
  • The remove→add pattern is the correct architectural choice, not a limitation to be overcome

Implication for design: Document that cluster modifications incur brief downtime (1-30s) and this is by design, not a quality issue.

Rollback Strategy (Needs Discussion)

Current POC behavior: Rollback ALL clusters on ANY failure (all-or-nothing semantics)

This is a critical design decision that requires community consensus. The choice affects operational complexity, user experience, and downtime characteristics. See "Questions for Design Discussion" below for detailed analysis of trade-offs.

Key question: When cluster-a succeeds but cluster-b fails, should we:

  • Option A: Rollback cluster-a (all-or-nothing) → simpler operations, more downtime
  • Option B: Keep cluster-a on new config (partial success) → less downtime, more complexity

Recommendation for design: Dedicate a section to this decision, present both options fairly, and explicitly request community feedback before proceeding.

Core Design: HTTP Endpoints with 2-Phase Commit

Validation Endpoint (Core Component)

API:

POST /admin/config/validate
Content-Type: application/yaml

{new configuration YAML}

Response (200 OK):
{
  "valid": true,
  "configVersion": "a3f5b2c19e4d"  // SHA-256 hash of config
}

Response (400 Bad Request):
{
  "valid": false,
  "errors": [
    "Filter 'record-encryption' initialization failed: KMS URL required",
    "Port conflict: 9293 used by cluster-a and cluster-b"
  ]
}

What it validates:

  • ✅ YAML syntax and structure
  • ✅ Filter types exist (registered via SPI)
  • ✅ FilterFactory.initialize() succeeds (filter config valid)
  • ✅ Port ranges internally consistent (no duplicate ports in config)

What it doesn't validate (runtime concerns):

  • ❌ Ports actually available on the OS (might be in use)
  • ❌ External dependencies reachable (KMS might be down during reload)
  • ❌ Upstream Kafka cluster healthy

Why this split is acceptable:

Validation is about catching configuration errors (syntax, invalid filter config). Runtime failures (port conflicts, KMS down at reload time) are handled by rollback. We can't guarantee "config valid at 10:00am" means "will succeed at 10:02am" for external dependencies.

Implementation note: Validation should build models and initialize filters without binding ports or registering gateways. This makes validation:

  • Fast (no network operations)
  • Deterministic (same result on all pods)
  • Resource-light (no double-memory usage)
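To make the "internally consistent" check concrete, here is a minimal sketch of the duplicate-port validation in plain Java. The flat cluster→port map and the PortConsistencyCheck class name are simplifications for illustration; a real implementation would walk the parsed virtual cluster model.

```java
import java.util.*;

public class PortConsistencyCheck {

    // Detect two virtual clusters declaring the same listener port,
    // without touching the network (no bind, no external calls).
    static List<String> findPortConflicts(Map<String, Integer> clusterPorts) {
        // Group cluster names by port; TreeMap gives deterministic error ordering.
        Map<Integer, List<String>> byPort = new TreeMap<>();
        clusterPorts.forEach((cluster, port) ->
                byPort.computeIfAbsent(port, p -> new ArrayList<>()).add(cluster));

        List<String> errors = new ArrayList<>();
        byPort.forEach((port, clusters) -> {
            if (clusters.size() > 1) {
                Collections.sort(clusters);
                errors.add("Port conflict: " + port + " used by " + String.join(" and ", clusters));
            }
        });
        return errors;
    }
}
```

Because this is pure computation over the parsed config, it has the properties listed above: fast, deterministic across pods, and resource-light.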

Reload Endpoint (Core Component)

API:

POST /admin/config/reload
Content-Type: application/yaml

{new configuration YAML}

Response (200 OK):
{
  "success": true,
  "configVersion": "a3f5b2c19e4d",
  "clustersModified": ["cluster-a", "cluster-b"]
}

Response (500 Internal Server Error):
{
  "success": false,
  "error": "Failed to modify cluster-b: filter initialization failed",
  "configVersion": "abc123"  // Rolled back to previous version
}

What it does:

  1. Applies configuration changes (remove→add clusters as needed)
  2. If any operation fails → rollback all changes
  3. Returns success/failure with current config version

Configuration Options

Management endpoint binding:

# proxy-config.yaml
admin:
  host: "localhost"  # Default: localhost only (bare metal)
  # host: "0.0.0.0"  # Kubernetes: bind to pod IP
  port: 9190
  tls:  # Optional: mTLS for operator communication
    keyStore: /path/to/keystore.jks
    trustStore: /path/to/truststore.jks

Benefits of this architecture:

  • Catches 90% of errors before any cluster goes down (validation phase)
  • Clear error messages before disruption
  • Same HTTP endpoints for Kubernetes and bare metal
  • File watching is optional, can be added as sidecar later
  • Security: localhost by default, configurable for Kubernetes

Kubernetes Integration Patterns

Management Service

Problem: Operator creates Services for Kafka traffic (ports 9292+) but not for the management port (9190).

Proposed: Create dedicated management Service for operator↔proxy communication:

apiVersion: v1
kind: Service
metadata:
  name: my-proxy-management
spec:
  type: ClusterIP  # Internal only
  selector:
    app.kubernetes.io/instance: minimal
    app.kubernetes.io/component: proxy
  ports:
  - name: management
    port: 9190
    targetPort: 9190

Benefits:

  • ✅ Automatic pod readiness handling (the Service only routes to ready pods; requests fail fast if none are ready)
  • ✅ Stable DNS endpoint (my-proxy-management.ns.svc.cluster.local)
  • ✅ Survives pod restarts/rescheduling
  • ✅ Follows Kubernetes best practices (Services for stable endpoints)

Usage:

  • Validation: POST http://my-proxy-management:9190/admin/config/validate (one pod via Service)
  • Reload: Iterate over pods, POST directly to pod IPs (all pods must succeed)

Recommendation: Add management Service pattern to Kubernetes deployment section of design.

Read-Only Filesystem Support

Problem: Kubernetes deployments use securityContext.readOnlyRootFilesystem: true as security best practice. Current design persists config to disk after successful reload, which fails with read-only filesystem.

Proposed: Make config file persistence optional:

Deployment models:

  • Bare metal: Config file on disk, persist on successful reload
  • Kubernetes: Config in ConfigMap (operator-managed), no disk persistence

Recommendation: Document read-only filesystem support as a requirement for Kubernetes deployments.

Checksum-Based Change Detection

Problem: Operator needs to detect "config actually changed" vs "CRD reconciliation loop with no real change."

Proposed: Store SHA-256 hash of config YAML in KafkaProxy annotation:

apiVersion: kroxylicious.io/v1alpha1
kind: KafkaProxy
metadata:
  name: minimal
  annotations:
    kroxylicious.io/config-checksum: "a3f5b2c19e4d"  # SHA-256 hash
spec:
  # ... config ...

Operator logic:

String newChecksum = sha256(generateYaml(kafkaProxy));
// Annotations may be null on a freshly created resource
Map<String, String> annotations = kafkaProxy.getMetadata().getAnnotations();
String oldChecksum = annotations == null ? null : annotations.get("kroxylicious.io/config-checksum");

if (newChecksum.equals(oldChecksum)) {
    LOGGER.debug("Config unchanged, skipping reload");
    return;  // No-op, avoid unnecessary reload
}

// Config changed, trigger 2-phase reload
ValidationResult validation = validateViaManagementService(yaml);
if (validation.valid()) {
    reloadAllPods(yaml);
    if (annotations == null) {
        annotations = new HashMap<>();
        kafkaProxy.getMetadata().setAnnotations(annotations);
    }
    annotations.put("kroxylicious.io/config-checksum", newChecksum);
}
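The operator snippet above assumes an undefined sha256 helper; a minimal sketch using only the JDK (HexFormat requires Java 17+). Note the hash is computed over the rendered YAML text, so purely cosmetic formatting changes also count as "changed".

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class ConfigChecksum {

    // Hash the rendered config YAML to get a deterministic config version.
    static String sha256(String configYaml) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(configYaml.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash); // 64 lowercase hex chars
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 support is mandated by the Java security spec
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```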

Benefits:

  • ✅ Automatic no-op detection (reconciliation loop doesn't trigger unnecessary reloads)
  • ✅ Rollback detection (reverting config doesn't reload if already at that state)
  • ✅ O(1) comparison vs deep config diff

Recommendation: Add checksum-based change detection to operator integration section.

Additional Design Components

Configurable Drain Timeout

Problem: Hard-coded 30-second drain timeout is too short for Kafka consumers with long poll timeouts (default 5 minutes).

Proposed:

# proxy-config.yaml
admin:
  drainTimeoutSeconds: 300  # 5 minutes for graceful connection drain

Trade-off: Longer timeouts mean longer reload times, but fewer disrupted clients.

Recommendation: Add configurable drain timeout to design.

Observability and Status Reporting

Configuration Status Endpoint:

Separate configuration status from health checks (health is for liveness/readiness):

GET /admin/config/status
{
  "currentConfigVersion": "sha256:a3f5b2c19e4d...",
  "appliedAt": "2026-01-28T10:15:30Z",
  "lastReloadAttempt": {
    "timestamp": "2026-01-28T10:15:30Z",
    "status": "SUCCESS",
    "requestedVersion": "sha256:a3f5b2c19e4d...",
    "durationMs": 1234,
    "clustersModified": ["cluster-a"]
  },
  "lastValidationAttempt": {
    "timestamp": "2026-01-28T10:15:25Z",
    "status": "SUCCESS",
    "requestedVersion": "sha256:a3f5b2c19e4d..."
  }
}

// After reload failure with rollback failure:
{
  "currentConfigVersion": "sha256:abc123...",  // Previous version still running
  "appliedAt": "2026-01-28T09:00:00Z",
  "lastReloadAttempt": {
    "timestamp": "2026-01-28T10:20:00Z",
    "status": "ROLLBACK_PARTIAL_FAILURE",
    "requestedVersion": "sha256:newversion...",
    "rollbackState": {
      "successful": ["cluster-a"],
      "failed": {
        "cluster-b": "Failed to re-register gateway: port 9293 in use"
      }
    }
  }
}

Health endpoint stays focused on proxy health:

GET /admin/health
{
  "status": "UP",
  "checks": {
    "netty": "UP",
    "virtualClusters": "UP"
  }
}

Benefit: Clean separation - operators query /admin/config/status for reload state, /admin/health for liveness/readiness.

Recommendation: Add dedicated config status endpoint to design.

Metrics:

kroxylicious_config_reload_total{result="success|failure"} counter
kroxylicious_config_reload_duration_seconds histogram
kroxylicious_config_version_info{version="a3f5b2c19e4d"} gauge

Use cases:

  • Alerting on reload failures
  • Tracking reload duration trends
  • Capacity planning (reload frequency)

Recommendation: Add metrics to observability section.

Error Handling and Recovery

Rollback Failure Handling:

Current design: Log "CRITICAL: system may be in inconsistent state"

Proposed: Track rollback state and expose via health endpoint (see above).

Recovery path:

  1. Query /admin/health to see which clusters failed rollback
  2. Manual intervention:
    • Verify cluster state (is port bound? filter initialized?)
    • Either retry reload or manually fix state
  3. Operator automation (future):
    • Detect rollback failure from health endpoint
    • Attempt recovery (remove failed cluster, re-add from old config)

Recommendation: Document rollback failure recovery procedures.

Concurrent Reload Prevention:

  • Only one reload at a time (enforced via lock)
  • Concurrent requests fail fast with 409 Conflict
POST /admin/config/reload
{new config}

Response (409 Conflict):
{
  "error": "Reload already in progress",
  "inProgressSince": "2026-01-28T10:15:30Z"
}

Recommendation: Document concurrency model in API specification.
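One way to sketch the "one reload at a time" guard: a compare-and-set on a timestamp, so a losing caller can be answered with 409 Conflict carrying the inProgressSince value from the response above. The ReloadGuard class is a hypothetical name, not an existing API.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

public class ReloadGuard {

    // null means no reload in progress; non-null holds the start time.
    private final AtomicReference<Instant> inProgressSince = new AtomicReference<>();

    /** Returns true if this caller acquired the single reload slot. */
    public boolean tryAcquire() {
        return inProgressSince.compareAndSet(null, Instant.now());
    }

    /** Exposed in the 409 body as "inProgressSince". */
    public Instant inProgressSince() {
        return inProgressSince.get();
    }

    /** Called when the reload finishes, whether it succeeded or rolled back. */
    public void release() {
        inProgressSince.set(null);
    }
}
```

The endpoint handler would call tryAcquire(), return 409 with inProgressSince() on failure, and release() in a finally block.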

Design Document Structure

Suggest organizing the design document as follows. Note: This structure assumes the HTTP-first approach described above. If the community prefers the file watch approach, the structure would need to adjust accordingly (swap "HTTP Endpoints" with "File Watch" as primary, etc.).

1. Goals and Non-Goals

Goals:

  • Zero-restart configuration updates
  • Universal deployment model (bare metal, Kubernetes)
  • Operator-friendly integration
  • Clear error handling and rollback

Non-Goals:

  • Zero-downtime modification (brief downtime per cluster is acceptable)
  • Hot-swapping filters in active connections
  • Partial success / continue-on-failure

2. Architecture

2.1 Core: HTTP Management Endpoints

Required endpoints:

  • POST /admin/config/validate - Validate config without applying
  • POST /admin/config/reload - Apply validated config
  • GET /admin/config/status - Current config version, last operation status
  • GET /admin/health - Proxy health (liveness/readiness)

Security:

  • Default bind: localhost:9190 (bare metal)
  • Kubernetes bind: 0.0.0.0:9190 (pod IP)
  • Optional TLS/mTLS for authentication
  • NetworkPolicy to restrict access in Kubernetes

2.2 Trigger Mechanisms (Optional)

Direct HTTP (Kubernetes):

  • Operator calls endpoints directly
  • No file watching needed

File Watcher Sidecar (Bare Metal):

  • Separate process watches config file
  • Calls HTTP endpoints on change
  • Options: shell script, Go binary, Java WatchService
  • Decoupled from proxy process

2.3 Reload Mechanism

  • Remove→add pattern (architecturally necessary)
  • Sequential processing (simplicity > parallelism)
  • All-or-nothing rollback (operational simplicity - needs discussion)

2.4 Validation Strategy

  • Build models + initialize filters without port binding
  • Deterministic (same result on all pods)
  • Catches config errors, not runtime failures

3. Deployment Patterns

3.1 Bare Metal

  • HTTP endpoints on localhost:9190

  • Config file on disk (optional)

  • Persist config to disk on success (if writable filesystem)

3.2 Kubernetes with Operator

  • HTTP endpoints on 0.0.0.0:9190
  • Config in ConfigMap (operator-managed)
  • Management Service for validation (exposes port 9190)
  • Checksum-based change detection (avoid no-op reloads)
  • 2-phase commit (validate via Service → reload all pods)
  • Read-only filesystem support (no disk persistence)
    • Sidecar file watcher (optional) → calls HTTP endpoints

4. Failure Modes and Recovery

  • Filter initialization failure → rollback
  • Port binding failure → rollback
  • Rollback failure → tracked state, manual recovery
  • Concurrent reload → fail fast with 409

5. Observability

  • Logging throughout reload process
  • Metrics for reload operations

6. Future Enhancements

  • Granular endpoints (/reload/cluster/{name})
  • Canary rollout strategies
  • Blue-green at pod level (operator)

Questions for Design Discussion

  1. Should FilterFactory.initialize() be documented as validation-safe?

    • Must be idempotent (can be called multiple times)?
    • Should avoid side effects (don't connect to external services)?
    • Or allow filter authors to decide (validation calls real KMS if they want)?
  2. Rollback Strategy: All-or-Nothing vs Partial Success (Critical Design Decision)

    This requires community consensus before proceeding.

    Scenario: Config change affects cluster-a, cluster-b, cluster-c

    • cluster-a: modify succeeds ✅ (downtime: 2s)
    • cluster-b: modify fails ❌ (downtime: 30s)
    • cluster-c: modify succeeds ✅ (downtime: 2s)

    Option A: All-or-Nothing (Current POC)

    Result: Rollback cluster-a and cluster-c
    Final state: All clusters on OLD config
    Total downtime: cluster-a (4s), cluster-b (30s), cluster-c (4s)
    

    Pros:

    • ✅ Single source of truth (config file intent OR previous state, never mixed)
    • ✅ Predictable retry path (fix issue → retry → all move together)
    • ✅ No configuration drift (never "cluster-a on v2, cluster-b on v1")
    • ✅ Simple status model (one config version for entire proxy)
    • ✅ Follows declarative configuration philosophy (Kubernetes/GitOps)

    Cons:

    • ❌ Unnecessary downtime for successful clusters during rollback
    • ❌ Wastes successful work (cluster-a, cluster-c succeeded but rolled back)

    Option B: Partial Success / Continue-on-Failure

    Result: Keep cluster-a and cluster-c on new config
    Final state: cluster-a (NEW), cluster-b (OLD), cluster-c (NEW)
    Total downtime: cluster-a (2s), cluster-b (30s), cluster-c (2s)
    

    Pros:

    • ✅ Less total downtime (no rollback for successful clusters)
    • ✅ Preserves successful work

    Cons:

    • ❌ Configuration drift (reality doesn't match declared intent)
    • ❌ Complex status model (per-cluster versions: {a: "v2", b: "v1", c: "v2"})
    • ❌ Unclear retry path (should cluster-a reload again? How does operator know?)
    • ❌ Reconciliation complexity (which clusters already on target version?)
    • ❌ Requires granular reload endpoints (/reload/cluster/{name})
    • ❌ Confusing user experience ("Reload failed" but some clusters succeeded?)

    Operational Comparison:

    Aspect                       | All-or-Nothing                   | Partial Success
    -----------------------------|----------------------------------|------------------------------
    Source of truth              | Config OR previous state (clear) | Mixed state (confusing)
    Retry after fixing cluster-b | Simple (reload all)              | Complex (skip a,c or reload?)
    Status API                   | One version                      | Per-cluster versions
    Downtime on failure          | Higher (rollback)                | Lower (no rollback)
    Operator logic               | Simple                           | Complex reconciliation
    User understanding           | Clear                            | Confusing

    User Experience Example:

    All-or-Nothing:

    $ kubectl apply -f new-config.yaml
    Error: Config reload failed on cluster-b (filter init error)
    Status: All clusters on version abc123 (previous config)
    Action: Fix cluster-b config, retry apply
    

    Partial Success:

    $ kubectl apply -f new-config.yaml
    Error: Config reload failed on cluster-b (filter init error)
    Status: cluster-a (def456), cluster-b (abc123), cluster-c (def456)
    Question: Should I retry? Will cluster-a reload again?
    

    Questions for the community:

    • Which operational model do users prefer?
    • Is configuration drift acceptable as a trade-off for less downtime?
    • Should this be configurable, or should we pick one approach?
    • If configurable:
      admin:
        rollbackStrategy: ALL  # Default? Or FAILED_ONLY?
    • Do we need granular reload endpoints regardless of rollback strategy?

    Claude's recommendation: Start with all-or-nothing (simpler, matches the declarative config philosophy), gather operational feedback, and add partial success later if users request it. But this needs community buy-in, not just a maintainer decision.

  3. Should we define granular reload endpoints now or defer?

    • POST /admin/config/reload (full config, current)
    • POST /admin/config/reload/cluster/{name} (single cluster, future?)
  4. What should config version format be?

    • SHA-256 hash (deterministic, no clock dependency)
    • Timestamp-based (easier for humans to understand)
    • Operator-provided (e.g., ConfigMap resourceVersion)

Summary

The configuration reload design addresses a critical operational need. This feedback proposes HTTP endpoints with 2-phase commit (validate → reload) as the primary interface (alternative to the current file watch proposal) for the following reasons:

Why HTTP-first with validation:

  • Better Kubernetes integration (operator-friendly, read-only filesystem compatible)
  • Clear observability (HTTP responses vs file watch with no feedback)
  • Testability (programmatic testing vs file system manipulation)
  • Validation catches config errors before any cluster goes down
  • File watching can still be supported as a convenience layer that calls HTTP internally

Core components proposed:

  1. POST /admin/config/validate - Validates config without applying (deterministic, fast)
  2. POST /admin/config/reload - Applies validated config (with rollback on failure)
  3. Management Service - Kubernetes Service exposing port 9190 for operator access
  4. Checksum-based change detection - Avoid unnecessary reloads on no-op reconciliation
  5. Read-only filesystem support - Make disk persistence optional for Kubernetes

Key takeaway: The architectural constraints (channel state machine, draining requirement) mean the design correctly accepts brief downtime per cluster modification. This is not a limitation—it's the right trade-off for operational simplicity and safety.

Recommended next steps:

  1. Discuss HTTP vs file watch as primary mechanism - This is a fundamental design choice that needs community input
  2. Discuss rollback strategy - All-or-nothing vs partial success requires consensus
  3. Add validation endpoint and 2-phase commit to design
  4. Add Kubernetes integration patterns (management Service, checksum-based change detection)
  5. Document failure modes and recovery procedures
  6. Refine POC implementation (PR#3176) based on finalized design

Excellent work on the POC—it provides a solid foundation for whichever trigger mechanism the community prefers!

Signed-off-by: Urjit Patel <105218041+Uzziee@users.noreply.github.com>

Uzziee commented Feb 2, 2026

Hi @gunnarmorling @SamBarker @tombentley regarding the security risk for HTTP over TCP, we can either

  1. Expose the endpoint only on localhost (this would immediately give the required security, as one would have to exec into the pods and then run the HTTP command)
  2. Replace HTTP over TCP with HTTP over a UNIX socket.

I believe option #1 would satisfy our needs without the additional code complexity of a UNIX-socket-based approach. WDYT?


SamBarker commented Feb 2, 2026

Pulling some conversation from slack for posterity:

My initial reaction would be that config should only happen in exactly one way, which is files. A new config would also be provided as a file. An HTTP endpoint should be there for triggering the validation and eventually application of changed config, but which itself would be read from a file.

Inline updates to files are suboptimal in case of validation failures, because then you have again that mismatch of the state of the config file (already changed) and what actually is applied by the proxy (not changed).

How about the following:

  • Config is provided in a file (as it is today)
  • When changing config, create a new file with those changes, e.g. by copying and mutating that copy as needed (on K8s, this could be a new config map)
  • Have HTTP endpoints for a) validating a config file (by specifying its location in the file system) and b) applying a config file (by specifying its location in the file system)

That way, config is always in files (so it can be in source control, etc.), you can always easily examine the current state of config. The operator could support this by alternating through two config maps, one with currently applied config, and one with changed config, staged for application.

by @gunnarmorling

Yeah, I hear the point about two methods of providing config.

I'd mentally turned the issues with file watches & the need for read-only file systems into a requirement for HTTP upload, which is not really true.

How does this sound:

  1. HTTP POST to /config/validate/file for validation - served by any active process. Valid files are made available on the filesystem
  2. HTTP PUT / PATCH to /config/file/, the config path which the proxy then reloads from (could be the same path for k8s); this step triggers the actual incremental reload
  3. HTTP GET on /config/[status|version|info] to serve details about what's currently running in the proxy?

by @SamBarker


SamBarker commented Feb 2, 2026

Hi @gunnarmorling @SamBarker @tombentley regarding the security risk for HTTP over TCP, we can either

  1. Expose the endpoint only on localhost (this would immediately give the required security, as one would have to exec into the pods and then run the HTTP command)
  2. Replace HTTP over TCP with HTTP over a UNIX socket.

I believe option #1 would satisfy our needs without the additional code complexity of a UNIX-socket-based approach. WDYT?

We can already support option 1 using the bindAddress config property.

Hopefully that's good enough to get started with.


Uzziee commented Feb 5, 2026

Hi @SamBarker given that we are still discussing how to trigger the reload, can we get alignment on the actual graceful restart of the virtual clusters? (Part 2 of the design)
If Part 2 is approved, I can submit a PR for it (graceful restart of clusters) while we finalize Part 1 (triggering the hot reload).

@SamBarker

Sorry @Uzziee, I've been too slow getting back to this.

In principle, yes. I think we need a little more thought to work out the interface between the trigger and the actual reloading code.

One other thought: how does the runtime know that a plugin's configuration has changed? Take the ACL authz plugin that has a rules file: the content of that file might have changed even though the path to it is the same. The runtime doesn't, and shouldn't, understand plugin configuration, so we should consider some way of asking plugins to detect config changes as well.

@SamBarker

Thanks for sticking with this @Uzziee, the proposal has come a long way and there's a lot of good thinking here. I've been noodling on a few areas and wanted to share where my head is at. Happy to discuss any of this further.

Decoupling trigger from apply

I think it would help to draw a clearer line between the trigger mechanism (the HTTP endpoint) and the bit that actually applies a new configuration to the running proxy. Something like a ProxyControl interface with an applyConfiguration(Configuration) method — the HTTP endpoint deals with parsing and validation, then hands the Configuration off to ProxyControl to do the actual work. That way if we later want to trigger from a file watcher or an operator callback, there's an obvious place to plug in. It would also make the apply logic easier to test in isolation.
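As a rough sketch of that seam (all names here are illustrative, not an existing Kroxylicious API):

```java
public class ProxyControlSketch {

    /** Placeholder for the parsed, validated proxy configuration. */
    record Configuration(String version) {}

    /** The trigger-agnostic apply interface: HTTP endpoint, file watcher,
     *  or operator callback all hand off through this one method. */
    interface ProxyControl {
        void applyConfiguration(Configuration configuration);
    }

    /** The trigger side only parses/validates, then delegates the actual work. */
    static void onReloadRequest(String yaml, ProxyControl control) {
        // Real code would parse and validate the YAML here; the version
        // derivation below is a stand-in.
        Configuration parsed = new Configuration(Integer.toHexString(yaml.hashCode()));
        control.applyConfiguration(parsed);
    }
}
```

Because ProxyControl is a plain interface, the apply logic can be unit-tested with a recording stub and no HTTP server at all.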

Plugin resource tracking

This is the bit I've been chewing on most. The runtime can spot when a filter's YAML config blob changes (via equals()), but it has no way of knowing when external files that a plugin reads during initialize() change — things like password files, TLS keystores, ACL rules. Those reads tend to happen deep in nested plugin stacks (e.g. RecordEncryption → KmsService → CredentialProvider → FilePassword) so the runtime has no visibility.

One approach that seems promising: add a readResource(URI) method to FilterFactoryContext. Instead of plugins doing direct file I/O, they'd read through this method. The runtime would read the content, hash it, track the dependency, and return the content — all in one go. On a subsequent reload check, the runtime can re-read and re-hash the tracked URIs to see if anything changed.

Some of the thinking behind this:

  • Thread-local context access: We could provide a static FilterFactoryContext.current() method (similar to how Vert.x handles context) so that code deep in the call stack — like FilePassword — can access the context without us having to thread it through every intermediate SPI. The single-thread guarantee on initialize() makes this safe. I'm happy to put together a PR for this part myself.

  • URI-based with pluggable resolvers: I'm inclined toward taking a URI rather than a Path. Files are the common case today, but there are plausible near-term use cases for reading resources over HTTP (e.g. fetching schemas or credentials from a remote endpoint). If the API takes a URI, we can ship a file:// resolver by default and add other scheme resolvers (e.g. https://) later via ServiceLoader — without changing the plugin-facing API. The resolver itself would be a simple interface (scheme() + read(URI)) so adding a new scheme doesn't require changes to the runtime, just a new implementation on the classpath.

  • Returns content, not typed objects: I think readResource should return InputStream/String rather than trying to deserialize into typed objects. The runtime's concern is tracking dependencies — the plugin knows what the content means.

  • Throws outside initialize(): I'm inclined to have FilterFactoryContext.current() throw IllegalStateException if called outside of factory initialization, rather than returning null or a stub. Reading resources outside initialize() would create untracked dependencies, and I'd rather that be a loud failure than a silent one.

  • Consistent change detection: Because the runtime reads the content and computes the hash in the same operation that provides the bytes to the plugin, the hash always matches what the plugin actually received. There's no gap between checking for changes and reading the new content.

None of this is set in stone — I'm keen to hear what others think, especially around the FilterFactoryContext.current() approach.
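A rough sketch of what the runtime side of readResource could look like, under the assumptions above. ResourceResolver, FileResourceResolver, and ResourceTracker are all hypothetical names; the real FilterFactoryContext would delegate to something like this:

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical resolver SPI: one implementation per URI scheme, discovered
// via ServiceLoader in the real thing.
interface ResourceResolver {
    String scheme();
    byte[] read(URI uri) throws IOException;
}

// Default file:// resolver.
class FileResourceResolver implements ResourceResolver {
    public String scheme() { return "file"; }
    public byte[] read(URI uri) throws IOException {
        return Files.readAllBytes(Path.of(uri));
    }
}

// Read, hash, and record the dependency in one operation, so the recorded
// hash always matches the bytes the plugin actually received.
class ResourceTracker {
    private final Map<String, ResourceResolver> resolvers = new ConcurrentHashMap<>();
    private final Map<URI, String> observedHashes = new ConcurrentHashMap<>();

    ResourceTracker(ResourceResolver... all) {
        for (ResourceResolver r : all) {
            resolvers.put(r.scheme(), r);
        }
    }

    byte[] readResource(URI uri) throws IOException {
        ResourceResolver resolver = resolvers.get(uri.getScheme());
        if (resolver == null) {
            throw new IOException("No resolver for scheme: " + uri.getScheme());
        }
        byte[] content = resolver.read(uri);
        observedHashes.put(uri, sha256(content));
        return content;
    }

    // On a reload check, re-read and re-hash every tracked URI.
    boolean anyChanged() throws IOException {
        for (Map.Entry<URI, String> e : observedHashes.entrySet()) {
            byte[] current = resolvers.get(e.getKey().getScheme()).read(e.getKey());
            if (!sha256(current).equals(e.getValue())) {
                return true;
            }
        }
        return false;
    }

    private static String sha256(byte[] content) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(content));
        }
        catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Adding an https:// resolver later is then just another ResourceResolver implementation on the classpath; neither the tracker nor the plugin-facing API changes.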

Minimising disruption

For now I think restarting a modified cluster by tearing it down and rebuilding it (remove + add) is the right starting point — dropping connections is unavoidable. It is worth calling this out explicitly in the proposal so it's clear this is a known trade-off rather than something we've overlooked. More surgical reloads (swapping just the filter chain without dropping connections or routing changes) could be interesting to explore later but I wouldn't want to block on that.

Similarly, I think we should call out the all-or-nothing rollback semantics as a deliberate choice. Even with a clean internal separation, failures during apply (port conflicts, TLS errors at bind time) can still happen, so we'll need a rollback strategy regardless.

One small thing — Should the 30-second drain timeout be configurable? Long-running consumer rebalances or slow produces with acks=all can legitimately exceed that.


Design exploration notes (for context, not part of the main proposal)

These are ideas we explored and set aside during discussion. Recording them here so we don't retread the same ground.

ResourceDependency (plugin-declared change detection): An alternative to readResource where plugins declare dependencies with an opaque version token (Object currentVersion()) and the runtime compares tokens between checks. More general (works for any resource type) but the runtime's version check and the plugin's re-read during initialize() are independent operations that can see different versions of the resource. Also relies on plugin authors remembering to declare dependencies. We leaned toward readResource for the common case since it's harder to accidentally miss a dependency.

Returning null from FilterFactoryContext.current(): We considered returning null when called outside initialize(), with a fallback to direct I/O. The worry was that silently succeeding means untracked dependencies. Also considered a no-op stub but that has the same problem. Throwing seemed like the clearest contract.

Typed readResource return (Jackson deserialization): We considered readResource(URI, Class<T>) to deserialize into typed objects, but the runtime would then need to know serialization formats (JSON? YAML? properties?). The current resources (passwords, keystores) are simple enough that raw bytes/string feels like the right level.

Plan/apply split on the public interface: We considered exposing plan() and apply() separately on ProxyControl to enable dry-run validation. Decided this is an internal concern — the trigger just needs applyConfiguration(). A validate/dry-run endpoint could be added later without changing the interface.

Kafka topic as config source: We discussed storing configuration in a dedicated Kafka topic. This feels more like a trigger mechanism than a resource dependency. Too early to design for — the URI-based readResource doesn't need to accommodate it.

ConfigurationReconciler naming: We considered this to describe the "compare desired vs current and converge" pattern, but there are actual Kubernetes reconcilers in the source tree and overloading the term seemed likely to cause confusion.

@Uzziee
Author

Uzziee commented Feb 13, 2026

Hey @SamBarker

I think it would help to draw a clearer line between the trigger mechanism (the HTTP endpoint) and the bit that actually applies a new configuration to the running proxy. Something like a ProxyControl interface with an applyConfiguration(Configuration) method — the HTTP endpoint deals with parsing and validation, then hands the Configuration off to ProxyControl to do the actual work. That way if we later want to trigger from a file watcher or an operator callback, there's an obvious place to plug in. It would also make the apply logic easier to test in isolation.

Heheh, this is something already in the works by me. I feel we are still not agreed on the "how to trigger" part of this problem, so I have already started working on a design which decouples the trigger mechanism. I'll update the new proposal here in a few days :)

Some high level thoughts about it

ReloadResult result = proxy.reload(Configuration newConfig, ReloadOptions reloadOptions)


ReloadOptions
├── OnFailure
│   └── appState: ROLLBACK | TERMINATE | CONTINUE
└── OnSuccess
    └── persistConfigToDisk: true | false

We can have an interface which triggers should implement and which internally invokes this proxy.reload() method.
e.g. HttpReloadTrigger, FileWatcherReloadTrigger, MyOwnSuperAwesomeCustomReloadTrigger

Behavior Matrix

OnFailure: ROLLBACK vs TERMINATE vs CONTINUE

| Aspect | ROLLBACK (default) | TERMINATE | CONTINUE |
| --- | --- | --- | --- |
| Cluster operations fail | Undo all successful operations in reverse order (remove added, restore modified, re-add removed) | No undo. Partial changes persist until proxy shuts down. | No undo. Partial changes persist. Proxy keeps running. |
| FilterChainFactory | Old factory remains active; new factory is closed. New connections use old filters. | New factory is committed. Moot — proxy is shutting down. | New factory is committed despite failure. New connections use new filters. |
| Proxy state | Running, consistent — fully operational with old config. | Shut down — process exits (or caller handles). | Running, inconsistent — some clusters old, some new. |
| Future result | Completes exceptionally with ReloadException. | Completes exceptionally (after shutdown initiated). | Completes exceptionally with ReloadException. |
| Recovery | Automatic — proxy is in known-good state. Retry anytime. | External — process supervisor (K8s, systemd) restarts. | Manual — operator inspects, fixes, calls reload() again. |

OnSuccess: persistConfigToDisk = true vs false

| Aspect | persistConfigToDisk = true (default) | persistConfigToDisk = false |
| --- | --- | --- |
| Config file | Overwritten with new config (old config backed up as .bak) | Unchanged — old config remains on disk |
| Proxy restart | Proxy starts with new config after restart | Proxy starts with old config after restart (reload was ephemeral) |
| Use case | Production — config file should always reflect running state | K8s (config comes from CRD), tests, temporary experiments |

Combined Examples

| Scenario | OnFailure | OnSuccess | Effect |
| --- | --- | --- | --- |
| Production HTTP reload | rollback() | withPersist() | Safest: rollback + save to disk |
| K8s Operator reconciler | terminate() | withoutPersist() | Pod restarts on failure; K8s owns the config |
| Integration test | continueRunning() | withoutPersist() | Test can inspect partial state; nothing written to disk |
| Debug session | continueRunning() | withPersist() | Keep running for investigation; save what was attempted |
| CI pipeline | rollback() | withoutPersist() | Safe rollback; config comes from CI, don't overwrite |

Related to the plugin resource tracking, I believe checking the hash would be the most optimal way to go about it.
I'll be honest, I did not quite understand your proposal on this as I don't have much context around how plugins are configured. I'll try to take a look at that part once I'm done with the above-mentioned proposal 🥲

One small thing — Should the 30-second drain timeout be configurable?

I was already planning to make it configurable. Since this is just a POC PR, a lot of hardening work might still be pending; I'll be creating a separate PR anyway when it's time to submit.

@Uzziee
Author

Uzziee commented Feb 17, 2026

Hi @SamBarker, as part of this proposal, what I am proposing is to just have the proxy.reload() method added to begin with.
In later enhancements, we could have trigger interfaces which use this internal method to reload, like HTTPTrigger, FileWatcherTrigger. This will help the PR move forward without being stuck on the discussion around the "how to trigger" part.

ReloadResult result = proxy.reload(Configuration newConfig, ReloadOptions reloadOptions)


ReloadOptions
├── OnFailure
│   └── appState: ROLLBACK | TERMINATE
└── OnSuccess
    └── persistConfigToDisk: true | false

We can later have interfaces which triggers should implement and which internally invoke this proxy.reload() method. (We can have this as part of future enhancements.)
e.g. HttpReloadTrigger, FileWatcherReloadTrigger, MyOwnSuperAwesomeCustomReloadTrigger
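A minimal sketch of how the trigger decoupling could look. All names here (Proxy, ReloadTrigger, HttpReloadTrigger, and the record stand-ins for Configuration/ReloadOptions/ReloadResult) are hypothetical, not the real Kroxylicious types:

```java
import java.util.concurrent.CompletableFuture;

// Stand-ins for the real configuration and result types.
record Configuration(String yaml) {}
record ReloadOptions(OnFailure onFailure, boolean persistConfigToDisk) {
    enum OnFailure { ROLLBACK, TERMINATE }
}
record ReloadResult(boolean success) {}

// The core operation, independent of any trigger.
interface Proxy {
    CompletableFuture<ReloadResult> reload(Configuration newConfig, ReloadOptions options);
}

// The trigger contract: each trigger sources a Configuration however it
// likes, then delegates to proxy.reload().
interface ReloadTrigger {
    void start(Proxy proxy);
}

// Example trigger: a real HTTP trigger would parse each request body into
// a Configuration; shown here with a canned config for brevity.
class HttpReloadTrigger implements ReloadTrigger {
    private final ReloadOptions options =
            new ReloadOptions(ReloadOptions.OnFailure.ROLLBACK, true);

    @Override
    public void start(Proxy proxy) {
        // In the real trigger this would happen per incoming HTTP request.
        proxy.reload(new Configuration("virtualClusters: []"), options);
    }
}
```

A FileWatcherReloadTrigger or a custom trigger would implement the same interface, so the reload semantics stay identical regardless of what initiated the change.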

Behavior Matrix

OnFailure: ROLLBACK vs TERMINATE

| Aspect | ROLLBACK (default) | TERMINATE |
| --- | --- | --- |
| Cluster operations fail | Undo all successful operations in reverse order (remove added, restore modified, re-add removed) | No undo. Partial changes persist until proxy shuts down |
| FilterChainFactory | Old factory remains active; new factory is closed. New connections use old filters | New factory is committed (proxy is shutting down) |
| Proxy state | Running, consistent — fully operational with old config | Shut down — process exits (or caller handles) |
| Future result | Completes exceptionally with ReloadException | Completes exceptionally (after shutdown initiated) |
| Recovery | Automatic — proxy is in known-good state; retry anytime | External — process supervisor (K8s, systemd) restarts |

OnSuccess: persistConfigToDisk = true vs false

| Aspect | persistConfigToDisk = true (default) | persistConfigToDisk = false |
| --- | --- | --- |
| Config file | Overwritten with new config (old config backed up as .bak) | Unchanged — old config remains on disk |
| Proxy restart | Proxy starts with new config after restart | Proxy starts with old config after restart (reload was ephemeral) |
| Use case | Production — config file should always reflect running state | K8s (config comes from CRD), tests, temporary experiments |

Combined Examples

| Scenario | OnFailure | OnSuccess | Effect |
| --- | --- | --- | --- |
| Production HTTP reload | rollback() | withPersist() | Safest: rollback + save to disk |
| K8s Operator reconciler | terminate() | withoutPersist() | Pod restarts on failure; K8s owns the config |
| CI pipeline | rollback() | withoutPersist() | Safe rollback; config comes from CI, don't overwrite |

What do you think ?

@SamBarker
Member

Thanks @Uzziee, this is heading in the right direction.

Core API shape

Agreed on deferring the trigger mechanism and focusing on the core operation first.

A thought on naming: "reload" presupposes re-reading from somewhere. Something like applyConfiguration(Configuration) better describes what it does — "make the running proxy match this configuration." Worth getting right early since it'll show up everywhere.

I'd push back on ReloadOptions as a per-call parameter though. Things like rollback-vs-terminate and disk persistence will vary between deployments, but they shouldn't vary between invocations within the same deployment. A multi-tenant ingress-style deployment might want to limp on with partial success; a sidecar model has different constraints again. These are decisions the operator makes at deployment time, not decisions the trigger makes per invocation — so they belong in the proxy's static configuration rather than the API. That keeps applyConfiguration(Configuration) simple and gives us space to figure out the right options as we understand deployment models better.

One more open question: the current proposal works with state-of-the-world snapshots — pass a complete Configuration and the proxy diffs it against what's running. That's a good starting point, but worth thinking about whether we eventually want something more granular (delta-based operations, or more targeted snapshots). No need to solve now, but the API shape should leave room for it. Thoughts?

Drain timeout

Agreed it should be configurable — details during implementation.

Plugin resource tracking

I think this is orthogonal to the core mechanism. I'd suggest splitting it into a separate proposal so it doesn't block this one and reviewers can engage with each concern independently.

The problem in brief: the runtime can detect when a filter's YAML config changes (via equals()), but has no visibility into external resources plugins read during initialize() — password files, TLS keystores, ACL rules. Those reads happen deep in plugin call stacks (e.g. RecordEncryption → KmsService → CredentialProvider → FilePassword), so the runtime can't detect when they change. Without addressing this, a reload would miss those changes entirely.

We've been exploring an approach where plugins read external resources through the runtime rather than doing direct file I/O. The runtime tracks what was read and hashes the content, so it can detect changes on subsequent checks. This makes dependency tracking automatic rather than opt-in. There are open design questions around the API shape, how deeply nested code accesses the context, and whether to support non-file resources — happy to go into detail if useful.

Proposal structure

The PR discussion has covered a lot of ground and it's getting hard for someone coming in cold to follow. I'd suggest updating the proposal document itself to reflect where we've landed:

  • Reframe around applyConfiguration() as the core API, with trigger mechanisms as future work
  • Call out remove+add with brief per-cluster downtime as a deliberate design choice
  • Call out all-or-nothing rollback as the initial default (consistent with startup, where any cluster failure fails the whole proxy) while acknowledging other deployment models may need different behaviour
  • Mention plugin resource tracking as a known gap, with a pointer to a separate proposal

That way reviewers can engage with the document rather than reconstructing the position from comments.

SamBarker and others added 3 commits February 18, 2026 16:36
Rewrite the hot reload proposal to focus on architectural decisions
rather than implementation detail. The PR discussion has established
consensus on several key points that the document didn't reflect:

- Reframe around applyConfiguration(Configuration) as the core API,
  decoupled from trigger mechanisms (HTTP, file watcher, operator)
- Remove all Java class implementations and handler chains — these
  belong in the code PR where they're reviewable in context
- Call out remove+add with brief per-cluster downtime as deliberate
- Call out all-or-nothing rollback as the initial default, consistent
  with startup behaviour
- Move ReloadOptions to deployment-level static configuration rather
  than per-call parameters
- Identify plugin resource tracking as a known gap with pointer to
  separate proposal
- Flag open questions (config granularity, failure behaviour options,
  drain timeout configurability)
- Defer trigger mechanism design as explicit future work

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
- Fix summary to read as proposed behaviour, not existing
- Use "administrators" instead of "operators" for humans to avoid
  confusion with the Kubernetes operator process
- Fix filter config examples (KMS endpoint, key selection pattern)
- Clarify failure behaviour is consistent across trigger mechanisms
- Note thundering herd as a known trade-off of remove+add
- Fix "original proposal" to "earlier iterations"

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>