
feat: Virtual cluster lifecycle state model#89

Open
SamBarker wants to merge 4 commits into kroxylicious:main from SamBarker:virtual-cluster-lifecycle

Conversation

@SamBarker
Member

Summary

Introduces proposal 014 defining a lifecycle state machine for virtual clusters with six states:

  • initializing — cluster being set up from a clean state
  • degraded — viable but dependent resource status unconfirmed/unavailable
  • healthy — fully operational, all health checks passing
  • draining — rejecting new connections, existing connections completing
  • failed — configuration not viable, resources released
  • stopped — terminal, cluster no longer operational

Key Design Decisions

  • degraded as default after init — clusters enter degraded not healthy, because runtime dependency status is unverified until health checks confirm it
  • Health check criteria are a separate concern — the lifecycle model defines that degraded ↔ healthy transitions exist, not what triggers them
  • Port binding is a proxy-level concern — scoped out of the VC lifecycle; identified as future work (proxy-level lifecycle)
  • failed releases all resources — retry from failed is always a clean initializing cycle
  • stopped is terminal — reload routes through draining → initializing, never through stopped
  • Fail-fast startup by default — best-effort is opt-in via startupPolicy config
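The decisions above imply a transition table. As a hedged illustration only — the names `VcState` and `VcTransitions` are hypothetical, and any transition not explicitly stated in the summary (e.g. whether `healthy` can go directly to `failed`) is a guess:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the six-state model; not the proposal's actual API.
enum VcState { INITIALIZING, DEGRADED, HEALTHY, DRAINING, FAILED, STOPPED }

final class VcTransitions {
    // Per the summary: init lands in DEGRADED (not HEALTHY), health checks
    // move DEGRADED <-> HEALTHY, reload routes DRAINING -> INITIALIZING,
    // retry from FAILED is a clean INITIALIZING cycle, STOPPED is terminal.
    private static final Map<VcState, Set<VcState>> ALLOWED =
            new EnumMap<VcState, Set<VcState>>(Map.of(
                    VcState.INITIALIZING, EnumSet.of(VcState.DEGRADED, VcState.FAILED),
                    VcState.DEGRADED, EnumSet.of(VcState.HEALTHY, VcState.DRAINING, VcState.FAILED),
                    VcState.HEALTHY, EnumSet.of(VcState.DEGRADED, VcState.DRAINING),
                    VcState.DRAINING, EnumSet.of(VcState.INITIALIZING, VcState.STOPPED),
                    VcState.FAILED, EnumSet.of(VcState.INITIALIZING, VcState.STOPPED),
                    VcState.STOPPED, EnumSet.noneOf(VcState.class)));

    static boolean isAllowed(VcState from, VcState to) {
        return ALLOWED.get(from).contains(to);
    }
}
```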

Relationship to Other Proposals

This provides a foundation for 012 - Hot Reload, which can define reload transitions on this state model rather than inventing its own.

Test plan

  • Review state definitions for clarity and completeness
  • Verify state transition diagram consistency with transition descriptions
  • Review rejected alternatives for completeness
  • Consider whether future enhancements section adequately captures known follow-up work

🤖 Generated with Claude Code

Introduce proposal 014 defining a lifecycle state machine for virtual
clusters: initializing, degraded, healthy, draining, failed, and stopped.

This provides a foundation for resilient startup (best-effort mode),
graceful shutdown with drain timeouts, runtime health distinction
(degraded vs healthy), and configuration reload via the hot reload
proposal (012).

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
@SamBarker SamBarker requested a review from a team as a code owner February 17, 2026 03:43
014 and 015 are already claimed by other proposals.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Member

@tombentley tombentley left a comment


Thanks @SamBarker, I think this is useful work, but I have doubts that we really need to have "healthy" as a state.


## Motivation

As Kroxylicious moves toward use cases where it acts as a multi-tenant gateway (multiple independent virtual clusters serving different teams or workloads), the blast radius of failures becomes critical. A configuration error affecting one tenant's cluster should not disrupt another tenant's traffic.
Member


This conflates tenancy with virtual cluster. They're not necessarily the same thing. For example, @k-wall's isolation filter is a different take on tenancy, one where different tenants are in the same VC.

I think we should try to keep any mention of tenancy out of this proposal, because what is proposed here can stand on its own terms without needing to bring any kind of tenancy into the discussion.

The hot-reload proposal aims to make the virtual cluster the "unit of reconfiguration". That seems reasonable in the sense that a VC defines a bunch of network-level configuration, and network level stuff is a significant source of potential reconfiguration failure. So making the VC the "unit of reconfiguration" also makes the VC the "failure domain of reconfiguration". Roughly speaking, I think the VC is the largest possible failure domain that's smaller than a whole proxy instance.

So thinking in terms of failure domain seems like a more useful way of framing the motivation, without needing to make assumptions about what the user is using virtual clusters for.


### Health Checks

The transitions between `degraded` and `healthy` are driven by health checks. This proposal defines that these transitions exist and that some mechanism triggers them, but does not prescribe what constitutes a health check. The criteria for health (upstream broker connectivity, KMS availability, filter readiness, etc.) are a separate concern from the lifecycle model itself.
Member


It's not clear to me that having these two distinct states is necessary.
For sure "healthy" and "unhealthy" are distinct states, but they don't need to be part of this state machine. AFAICS you could combine these states into a single "running" state (they have the same inward and outward transitions) and everything else would be the same.

The advantage of doing that:

  • a simpler state machine (always a good thing).
  • it frees you up from this awkward fence-sitting where you're claiming there are distinct states without defining the things which cause the transitions.
  • they would seem to depend on external things (like: is the Kafka cluster running?), which means transitions between them are likely to depend on some kind of polling mechanism, which is going to be a delayed signal.
  • their definition might not even be the same for all observers. For example what if some clients connect to the proxy directly but some via some network loadbalancer or similar. You can't define healthy in the same way for both these clients without the possibility of disagreement (the loadbalancer not working makes the VC appear unhealthy for some but not others). By building in a notion of "healthy" you're making a rod for your own back however you end up defining it.

Member Author


Without taking a position on the overall point you're making:

they would seem to depend on external things (like: is the Kafka cluster running?), which means transitions between them are likely to depend on some kind of polling mechanism, which is going to be a delayed signal.

The delayed nature of the signal was what pushed me towards healthy & degraded — partly, I'd admit, because I was including notions of availability of remote resources in the picture (which you're reasonably pushing back on) — but it felt strange to transition to running if, for instance, we can't connect to the upstream broker.

their definition might not even be the same for all observers. For example what if some clients connect to the proxy directly but some via some network loadbalancer or similar. You can't define healthy in the same way for both these clients without the possibility of disagreement (the loadbalancer not working makes the VC appear unhealthy for some but not others).

I was constructing healthy, or rather health in general, from the proxy's perspective: what does the runtime think the state of play is?

By building in a notion of "healthy" you're making a rod for your own back however you end up defining it.

I do find that point persuasive; we are always going to get different people wanting different definitions of healthy. However, I think everyone would agree that if the proxy can't connect to an upstream endpoint then it's degraded (at best). While we don't have any immediate requirements to model anything other than the basic starting, running, stopping states, I can easily imagine requirements where, on detecting a failed broker, we might want to apply rate limits to new connections to manage a thundering herd on reconnect.

The health of a virtual cluster feels like an important thing for proxy administrators to be aware of and understand, so I was combining it into this lifecycle. I can see there is a case to be made that they are orthogonal concepts.

So I'm very interested to hear from @kroxylicious/developers as to how they see the model.

…ve tenancy motivation

- Replace healthy/degraded states with single 'accepting' state that makes no
  health claims — lifecycle tracks what the proxy is doing, not runtime health
- Reframe motivation around failure domains rather than multi-tenancy
- Replace ASCII state diagram with excalidraw PNG
- Add 'Runtime health as lifecycle state' rejected alternative explaining why
  health is orthogonal to lifecycle

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Health criteria vary by deployment and may become per-destination
once request-level routing is in play. Defining a health model
prematurely would constrain future design without immediate value.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

This has several consequences:

1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully are taken down with it.
Member


Minor word pedantry: The other clusters can't be taken down if they never came up.

Suggested change
1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully are taken down with it.
1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully never become available.


1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully are taken down with it.

2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down.
2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down. While this does not violate any of the guarantees of the Kafka protocol (which needs to cope with network partitions), it would be good to shut down more gracefully in situations where that's possible.


1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully are taken down with it.

2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down.
Member


It's not immediately apparent to me how this would work in practice. Suppose I have a producer pipelining Produce requests. How do we intend to stop that flow in such a way that the producer doesn't end up with an unacknowledged request?

You mention timeouts later on, but:

  • the broker doesn't know what the client's timeout is configured to, so it doesn't know how long it needs to wait
  • if the client times out request 1 while thinking that 2, 3 and 4 are in flight, then it will send request 5


2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down.

3. **No foundation for partial failure.** Proposals such as [012 - Hot Reload](https://github.com/kroxylicious/design/pull/83) need the ability to express "cluster-b failed to apply new configuration but cluster-a is still serving traffic." Without a lifecycle model this state is undefined and unreportable.
Member


It's not clear to me exactly how these are related. The Hot Reload proposal currently says:

The initial default is all-or-nothing rollback: if any cluster operation fails during apply (e.g. a port conflict when rebinding, a TLS error, a plugin initialisation failure), all previously successful operations in that apply are rolled back in reverse order.
Added clusters are removed, modified clusters are reverted to their original configuration, and removed clusters are re-added.

Looking at the diagram of the state machine in this proposal, I think a VC would be in accepting when the reload started.

  • Would it then transition draining -> initializing -> failed -> stopped as the problem was detected? And then automatically the revert happens (applying the old configuration). According to the diagram that must be a different instance of the state machine (because stopped is the final state). So it would start initializing -> accepting (with luck). I thought it must be like this because the Hot Reload says "Cluster modification: remove + add".
  • Or does it transition draining -> initializing -> failed, at which point the revert happens and we see failed -> initializing -> accepting? In this case it doesn't seem like the reload behaviour can really be described as "Cluster modification: remove + add".


## Motivation

A virtual cluster is the natural unit of independent operation — the smallest scope at which the proxy can contain a failure without affecting unrelated traffic. Today this independence is not modelled: the proxy treats all clusters as a single unit that either starts completely or fails completely.
Member


"modelled" is a bit of an odd word here (it suggests to me that there are some components/classes in the source code which directly correspond to this, but really it's more like an emergent property of the system, rather than an inherent one). I think it would be clearer to say something like "the independence is notional".

| **initializing** | The cluster is being set up. Not yet accepting connections. Used on first boot, when retrying from `failed`, and during configuration reload. |
| **accepting** | The proxy has completed setup for this cluster and is accepting connections. This state makes no claim about the availability of upstream brokers or other runtime dependencies — it means the proxy is ready to handle connection attempts. |
| **draining** | New connections are rejected. Existing connections remain open to give in-flight requests the opportunity to complete. Connections are closed once idle or when the drain timeout is reached. |
| **failed** | The proxy determined the configuration not to be viable. All partially-acquired resources are released on entry to this state. The proxy retains the cluster's configuration and failure reason for diagnostics and retry. |
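The draining rule in the table above (reject new connections, let in-flight work finish, force-close at the drain deadline) can be sketched as follows. This is a hedged illustration only — `Connection` and `Drainer` are hypothetical stand-ins, not Kroxylicious types:

```java
import java.time.Instant;
import java.util.List;

// Hypothetical stand-in for a proxied client connection.
record Connection(String id, boolean idle) {}

final class Drainer {
    /** Returns the connections that may stay open at {@code now}, given the drain deadline. */
    static List<Connection> stillOpen(List<Connection> conns, Instant deadline, Instant now) {
        if (!now.isBefore(deadline)) {
            return List.of(); // drain timeout reached: close everything
        }
        // Before the deadline, idle connections close early; busy ones stay open.
        return conns.stream().filter(c -> !c.idle()).toList();
    }
}
```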
Member


"The proxy retains the cluster's configuration and failure reason for diagnostics and retry." Again, I don't really understand how this relates to the rollback behaviour described in the Hot Reload proposal. Assuming the rollback is successful, won't the failure reason rapidly be either out of date (the VC isn't failed because the config was rolled back), or, if the reason is removed, then what purpose did its transient presence serve?

```yaml
# startupPolicy: best-effort # start with whatever clusters succeed
```

In best-effort mode, the proxy starts and serves traffic for clusters that initialised successfully, while reporting failed clusters via health endpoints and logs. Kubernetes readiness probes or monitoring systems can apply their own thresholds (e.g. "all clusters must be accepting" vs "at least one cluster must not be failed"). The operator would typically set this policy.
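The fail-fast vs best-effort distinction can be sketched as a startup loop. `StartupPolicy`, `ProxyStartup`, and the `Runnable`-per-cluster shape are illustrative assumptions, not the proposal's API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

enum StartupPolicy { FAIL_FAST, BEST_EFFORT }

final class ProxyStartup {
    /**
     * Attempts to start each cluster in order; returns cluster name -> outcome
     * ("accepting" or the failure reason, for health endpoints and logs).
     */
    static Map<String, String> start(Map<String, Runnable> clusters, StartupPolicy policy) {
        Map<String, String> outcomes = new LinkedHashMap<>();
        for (var e : clusters.entrySet()) {
            try {
                e.getValue().run(); // hypothetical per-cluster initialisation
                outcomes.put(e.getKey(), "accepting");
            } catch (RuntimeException ex) {
                outcomes.put(e.getKey(), "failed: " + ex.getMessage());
                if (policy == StartupPolicy.FAIL_FAST) {
                    throw ex; // default: one failure fails the whole proxy start
                }
                // best-effort: record the failure and keep starting the rest
            }
        }
        return outcomes;
    }
}
```

Monitoring systems would then apply their own thresholds over the per-cluster outcomes, as described above.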
Member


Suggested change
In best-effort mode, the proxy starts and serves traffic for clusters that initialised successfully, while reporting failed clusters via health endpoints and logs. Kubernetes readiness probes or monitoring systems can apply their own thresholds (e.g. "all clusters must be accepting" vs "at least one cluster must not be failed"). The operator would typically set this policy.
In best-effort mode, the proxy starts and serves traffic for clusters that initialised successfully, while reporting failed clusters via health endpoints and logs. Kubernetes readiness probes or monitoring systems can apply their own thresholds (e.g. "all clusters must be accepting" vs "at least one cluster must not be failed"). The Kubernetes operator would typically set this policy.

Just because "operator" is easily misunderstood to mean a human.


### Observability

Cluster lifecycle state should be observable — through management endpoints, logging, or metrics — so that operators and tooling can determine which clusters are accepting connections, which have failed, and why. The specific reporting mechanism is an implementation concern and not prescribed by this proposal.
Member


The specific reporting mechanism is an implementation concern and not prescribed by this proposal.

Management endpoints and metrics count as public API so should be described here.

Comment on lines +112 to +134
### Internal Representation

Each virtual cluster holds a state object:

```java
public record ClusterState(
        LifecyclePhase phase,
        Instant since,
        @Nullable String reason) {

    public enum LifecyclePhase {
        INITIALIZING,
        ACCEPTING,
        DRAINING,
        FAILED,
        STOPPED
    }
}
```

State transitions should be validated — e.g. a cluster cannot move from `stopped` to any other state. Invalid transitions indicate a programming error and should throw.

The component responsible for managing cluster state (likely an evolution of the existing `EndpointRegistry` or a new `ClusterLifecycleManager`) should be the single source of truth for state transitions, ensuring they are logged and observable.
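A hedged sketch of the "invalid transitions should throw" rule: the `Phase` constants mirror the enum above, but the transition table and class name are assumptions — the proposal only states explicitly that `stopped` is terminal:

```java
// Illustrative validator; only "STOPPED is terminal" comes from the proposal text.
final class LifecycleValidator {
    enum Phase { INITIALIZING, ACCEPTING, DRAINING, FAILED, STOPPED }

    /** Returns {@code to} if the transition is legal, else throws (programming error). */
    static Phase transition(Phase from, Phase to) {
        boolean ok = switch (from) {
            case INITIALIZING -> to == Phase.ACCEPTING || to == Phase.FAILED;
            case ACCEPTING -> to == Phase.DRAINING;
            case DRAINING -> to == Phase.INITIALIZING || to == Phase.STOPPED;
            case FAILED -> to == Phase.INITIALIZING || to == Phase.STOPPED;
            case STOPPED -> false; // terminal: no exits
        };
        if (!ok) {
            throw new IllegalStateException("Illegal transition " + from + " -> " + to);
        }
        return to;
    }
}
```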
Member


I think we should try to avoid putting implementation details like this into proposals.

  • Those will become apparent when a PR is opened, so it's not really necessary.
  • It's highly likely that they'll become stale once an implementation is available
  • It encourages other people opening proposals to do the same. We end up with people thinking they need to write an essay just to get anything done. Personally that's not how I want things to work. Everyone benefits from up-front discussion and agreement about public, supported APIs. PR reviews should be sufficient for other code changes.


Some configuration changes will likely always require draining — for example, changes to the upstream cluster identity or TLS configuration that invalidate existing connections. The optimisation is about identifying changes where draining can be safely skipped, not eliminating it.

### Proxy-level lifecycle
Member


You've defined how proxy startup works with respect to the VC state machines. It's not really clear to me whether we need a proposal about the rest, unless you think that state is also exposed via metrics/management endpoints etc.
