
feat: Virtual cluster lifecycle state model#89

Open
SamBarker wants to merge 4 commits into kroxylicious:main from SamBarker:virtual-cluster-lifecycle

Conversation

@SamBarker
Member

Summary

Introduces proposal 014 defining a lifecycle state machine for virtual clusters with six states:

  • initializing — cluster being set up from a clean state
  • degraded — viable but dependent resource status unconfirmed/unavailable
  • healthy — fully operational, all health checks passing
  • draining — rejecting new connections, existing connections completing
  • failed — configuration not viable, resources released
  • stopped — terminal, cluster no longer operational

Key Design Decisions

  • degraded as default after init — clusters enter degraded not healthy, because runtime dependency status is unverified until health checks confirm it
  • Health check criteria are a separate concern — the lifecycle model defines that degraded ↔ healthy transitions exist, not what triggers them
  • Port binding is a proxy-level concern — scoped out of the VC lifecycle; identified as future work (proxy-level lifecycle)
  • failed releases all resources — retry from failed is always a clean initializing cycle
  • stopped is terminal — reload routes through draining → initializing, never through stopped
  • Fail-fast startup by default — best-effort is opt-in via startupPolicy config
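The decisions above imply a transition table. As a hedged illustration only — the names `VcState` and `VcTransitions` are hypothetical, and any transition not explicitly stated in the summary (e.g. whether `healthy` can go directly to `failed`) is a guess:

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the six-state model; not the proposal's actual API.
enum VcState { INITIALIZING, DEGRADED, HEALTHY, DRAINING, FAILED, STOPPED }

final class VcTransitions {
    // Per the summary: init lands in DEGRADED (not HEALTHY), health checks
    // move DEGRADED <-> HEALTHY, reload routes DRAINING -> INITIALIZING,
    // retry from FAILED is a clean INITIALIZING cycle, STOPPED is terminal.
    private static final Map<VcState, Set<VcState>> ALLOWED =
            new EnumMap<VcState, Set<VcState>>(Map.of(
                    VcState.INITIALIZING, EnumSet.of(VcState.DEGRADED, VcState.FAILED),
                    VcState.DEGRADED, EnumSet.of(VcState.HEALTHY, VcState.DRAINING, VcState.FAILED),
                    VcState.HEALTHY, EnumSet.of(VcState.DEGRADED, VcState.DRAINING),
                    VcState.DRAINING, EnumSet.of(VcState.INITIALIZING, VcState.STOPPED),
                    VcState.FAILED, EnumSet.of(VcState.INITIALIZING, VcState.STOPPED),
                    VcState.STOPPED, EnumSet.noneOf(VcState.class)));

    static boolean isAllowed(VcState from, VcState to) {
        return ALLOWED.get(from).contains(to);
    }
}
```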

Relationship to Other Proposals

This provides a foundation for 012 - Hot Reload, which can define reload transitions on this state model rather than inventing its own.

Test plan

  • Review state definitions for clarity and completeness
  • Verify state transition diagram consistency with transition descriptions
  • Review rejected alternatives for completeness
  • Consider whether future enhancements section adequately captures known follow-up work

🤖 Generated with Claude Code

Introduce proposal 014 defining a lifecycle state machine for virtual
clusters: initializing, degraded, healthy, draining, failed, and stopped.

This provides a foundation for resilient startup (best-effort mode),
graceful shutdown with drain timeouts, runtime health distinction
(degraded vs healthy), and configuration reload via the hot reload
proposal (012).

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
@SamBarker SamBarker requested a review from a team as a code owner February 17, 2026 03:43
014 and 015 are already claimed by other proposals.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Member

@tombentley tombentley left a comment


Thanks @SamBarker, I think this is useful work, but I have doubts that we really need to have "healthy" as a state.


## Motivation

As Kroxylicious moves toward use cases where it acts as a multi-tenant gateway (multiple independent virtual clusters serving different teams or workloads), the blast radius of failures becomes critical. A configuration error affecting one tenant's cluster should not disrupt another tenant's traffic.
Member


This conflates tenancy with virtual cluster. They're not necessarily the same thing. For example, @k-wall's isolation filter is a different take on tenancy, one where different tenants are in the same VC.

I think we should try to keep any mention of tenancy out of this proposal, because what is proposed here can stand on its own terms without needing to bring any kind of tenancy into the discussion.

The hot-reload proposal aims to make the virtual cluster the "unit of reconfiguration". That seems reasonable in the sense that a VC defines a bunch of network-level configuration, and network level stuff is a significant source of potential reconfiguration failure. So making the VC the "unit of reconfiguration" also makes the VC the "failure domain of reconfiguration". Roughly speaking, I think the VC is the largest possible failure domain that's smaller than a whole proxy instance.

So thinking in terms of failure domain seems like a more useful way of framing the motivation, without needing to make assumptions about what the user is using virtual clusters for.


### Health Checks

The transitions between `degraded` and `healthy` are driven by health checks. This proposal defines that these transitions exist and that some mechanism triggers them, but does not prescribe what constitutes a health check. The criteria for health (upstream broker connectivity, KMS availability, filter readiness, etc.) are a separate concern from the lifecycle model itself.
Member


It's not clear to me that having these two distinct states is necessary.
For sure "healthy" and "unhealthy" are distinct states, but they don't need to be part of this state machine. AFAICS you could combine these states into a single "running" state (they have the same inward and outward transitions) and everything else would be the same.

The advantage of doing that:

  • a simpler state machine (always a good thing).
  • it frees you up from this awkward fence-sitting where you're claiming there are distinct states without defining the things which cause the transitions.
  • they would seem to depend on external things (like: is the Kafka cluster running?), which means transitions between them are likely to depend on some kind of polling mechanism, which is going to be a delayed signal.
  • their definition might not even be the same for all observers. For example what if some clients connect to the proxy directly but some via some network loadbalancer or similar. You can't define healthy in the same way for both these clients without the possibility of disagreement (the loadbalancer not working makes the VC appear unhealthy for some but not others). By building in a notion of "healthy" you're making a rod for your own back however you end up defining it.

Member Author


Without taking a position on the overall point you're making:

they would seem to depend on external things (like: is the Kafka cluster running?), which means transitions between them are likely to depend on some kind of polling mechanism, which is going to be a delayed signal.

The delayed nature of the signal was what pushed me towards healthy & degraded — partly, I'd admit, because I was including notions of availability of remote resources in the picture (which you're reasonably pushing back on) — but it felt strange to transition to running if, for instance, we can't connect to the upstream broker.

their definition might not even be the same for all observers. For example what if some clients connect to the proxy directly but some via some network loadbalancer or similar. You can't define healthy in the same way for both these clients without the possibility of disagreement (the loadbalancer not working makes the VC appear unhealthy for some but not others).

I was constructing healthy, or rather health in general, from the proxy's perspective: what does the runtime think the state of play is?

By building in a notion of "healthy" you're making a rod for your own back however you end up defining it.

I do find that point persuasive; we are always going to get different people wanting different definitions of healthy. However, I think everyone would agree that if the proxy can't connect to an upstream endpoint then it's degraded (at best). While we don't have any immediate requirements to model anything other than the basic starting, running, stopping states, I can easily imagine requirements where, on detecting a failed broker, we might want to apply rate limits to new connections to manage a thundering herd on reconnect.

The health of a virtual cluster feels like an important thing for proxy administrators to be aware of and understand, so I was combining it into this lifecycle. I can see there is a case to be made that they are orthogonal concepts.

So I'm very interested to hear from @kroxylicious/developers as to how they see the model.

…ve tenancy motivation

- Replace healthy/degraded states with single 'accepting' state that makes no
  health claims — lifecycle tracks what the proxy is doing, not runtime health
- Reframe motivation around failure domains rather than multi-tenancy
- Replace ASCII state diagram with excalidraw PNG
- Add 'Runtime health as lifecycle state' rejected alternative explaining why
  health is orthogonal to lifecycle

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Health criteria vary by deployment and may become per-destination
once request-level routing is in play. Defining a health model
prematurely would constrain future design without immediate value.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

This has several consequences:

1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully are taken down with it.
Member


Minor word pedantry: The other clusters can't be taken down if they never came up.

Suggested change
1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully are taken down with it.
1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully never become available.


1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully are taken down with it.

2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down.
2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down. While this does not violate any of the guarantees of the Kafka protocol (which needs to cope with network partitions), it would be good to shut down more gracefully in situations where that's possible.


1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully are taken down with it.

2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down.
Member


It's not immediately apparent to me how this would work in practice. Suppose I have a producer pipelining Produce requests. How do we intend to stop that flow in such a way that the producer doesn't end up with an unacknowledged request?

You mention timeouts later on, but:

  • the broker doesn't know what the client's timeout is configured to, so it doesn't know how long it needs to wait
  • if the client times out request 1 while thinking that 2, 3 and 4 are in flight, then it will send request 5


2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down.

3. **No foundation for partial failure.** Proposals such as [012 - Hot Reload](https://github.com/kroxylicious/design/pull/83) need the ability to express "cluster-b failed to apply new configuration but cluster-a is still serving traffic." Without a lifecycle model this state is undefined and unreportable.
Member


It's not clear to me exactly how these are related. The Hot Reload proposal currently says:

The initial default is all-or-nothing rollback: if any cluster operation fails during apply (e.g. a port conflict when rebinding, a TLS error, a plugin initialisation failure), all previously successful operations in that apply are rolled back in reverse order.
Added clusters are removed, modified clusters are reverted to their original configuration, and removed clusters are re-added.

Looking at the diagram of the state machine in this proposal, I think a VC would be in accepting when the reload started.

  • Would it then transition draining -> initializing -> failed -> stopped as the problem was detected? And then automatically the revert happens (applying the old configuration). According to the diagram that must be a different instance of the state machine (because stopped is the final state). So it would start initializing -> accepting (with luck). I thought it must be like this because the Hot Reload says "Cluster modification: remove + add".
  • Or does it transition draining -> initializing -> failed, at which point the revert happens and we see failed -> initializing -> accepting? In this case it doesn't seem like the reload behaviour can really be described as "Cluster modification: remove + add".


## Motivation

A virtual cluster is the natural unit of independent operation — the smallest scope at which the proxy can contain a failure without affecting unrelated traffic. Today this independence is not modelled: the proxy treats all clusters as a single unit that either starts completely or fails completely.
Member


"modelled" is a bit of an odd word here (it suggests to me that there are some components/classes in the source code which directly correspond to this, but really it's more like an emergent property of the system, rather than an inherent one). I think it would be clearer to say something like "the independence is notional".

| **initializing** | The cluster is being set up. Not yet accepting connections. Used on first boot, when retrying from `failed`, and during configuration reload. |
| **accepting** | The proxy has completed setup for this cluster and is accepting connections. This state makes no claim about the availability of upstream brokers or other runtime dependencies — it means the proxy is ready to handle connection attempts. |
| **draining** | New connections are rejected. Existing connections remain open to give in-flight requests the opportunity to complete. Connections are closed once idle or when the drain timeout is reached. |
| **failed** | The proxy determined the configuration not to be viable. All partially-acquired resources are released on entry to this state. The proxy retains the cluster's configuration and failure reason for diagnostics and retry. |
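The draining rule in the table above (reject new connections, let in-flight work finish, force-close at the drain deadline) can be sketched as follows. This is a hedged illustration only — `Connection` and `Drainer` are hypothetical stand-ins, not Kroxylicious types:

```java
import java.time.Instant;
import java.util.List;

// Hypothetical stand-in for a proxied client connection.
record Connection(String id, boolean idle) {}

final class Drainer {
    /** Returns the connections that may stay open at {@code now}, given the drain deadline. */
    static List<Connection> stillOpen(List<Connection> conns, Instant deadline, Instant now) {
        if (!now.isBefore(deadline)) {
            return List.of(); // drain timeout reached: close everything
        }
        // Before the deadline, idle connections close early; busy ones stay open.
        return conns.stream().filter(c -> !c.idle()).toList();
    }
}
```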
Member


"The proxy retains the cluster's configuration and failure reason for diagnostics and retry." Again, I don't really understand how this relates to the rollback behaviour described in the Hot Reload proposal. Assuming the rollback is successful, won't the failure reason rapidly be either out of date (the VC isn't failed because the config was rolled back), or, if the reason is removed, then what purpose did its transient presence serve?

```yaml
# startupPolicy: best-effort # start with whatever clusters succeed
```

In best-effort mode, the proxy starts and serves traffic for clusters that initialised successfully, while reporting failed clusters via health endpoints and logs. Kubernetes readiness probes or monitoring systems can apply their own thresholds (e.g. "all clusters must be accepting" vs "at least one cluster must not be failed"). The operator would typically set this policy.
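The fail-fast vs best-effort distinction can be sketched as a startup loop. `StartupPolicy`, `ProxyStartup`, and the `Runnable`-per-cluster shape are illustrative assumptions, not the proposal's API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

enum StartupPolicy { FAIL_FAST, BEST_EFFORT }

final class ProxyStartup {
    /**
     * Attempts to start each cluster in order; returns cluster name -> outcome
     * ("accepting" or the failure reason, for health endpoints and logs).
     */
    static Map<String, String> start(Map<String, Runnable> clusters, StartupPolicy policy) {
        Map<String, String> outcomes = new LinkedHashMap<>();
        for (var e : clusters.entrySet()) {
            try {
                e.getValue().run(); // hypothetical per-cluster initialisation
                outcomes.put(e.getKey(), "accepting");
            } catch (RuntimeException ex) {
                outcomes.put(e.getKey(), "failed: " + ex.getMessage());
                if (policy == StartupPolicy.FAIL_FAST) {
                    throw ex; // default: one failure fails the whole proxy start
                }
                // best-effort: record the failure and keep starting the rest
            }
        }
        return outcomes;
    }
}
```

Monitoring systems would then apply their own thresholds over the per-cluster outcomes, as described above.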
Member


Suggested change
In best-effort mode, the proxy starts and serves traffic for clusters that initialised successfully, while reporting failed clusters via health endpoints and logs. Kubernetes readiness probes or monitoring systems can apply their own thresholds (e.g. "all clusters must be accepting" vs "at least one cluster must not be failed"). The operator would typically set this policy.
In best-effort mode, the proxy starts and serves traffic for clusters that initialised successfully, while reporting failed clusters via health endpoints and logs. Kubernetes readiness probes or monitoring systems can apply their own thresholds (e.g. "all clusters must be accepting" vs "at least one cluster must not be failed"). The Kubernetes operator would typically set this policy.

Just because "operator" is easily misunderstood to mean a human.


### Observability

Cluster lifecycle state should be observable — through management endpoints, logging, or metrics — so that operators and tooling can determine which clusters are accepting connections, which have failed, and why. The specific reporting mechanism is an implementation concern and not prescribed by this proposal.
Member


The specific reporting mechanism is an implementation concern and not prescribed by this proposal.

Management endpoints and metrics count as public API so should be described here.

Comment on lines +112 to +134
### Internal Representation

Each virtual cluster holds a state object:

```java
public record ClusterState(
        LifecyclePhase phase,
        Instant since,
        @Nullable String reason) {

    public enum LifecyclePhase {
        INITIALIZING,
        ACCEPTING,
        DRAINING,
        FAILED,
        STOPPED
    }
}
```

State transitions should be validated — e.g. a cluster cannot move from `stopped` to any other state. Invalid transitions indicate a programming error and should throw.

The component responsible for managing cluster state (likely an evolution of the existing `EndpointRegistry` or a new `ClusterLifecycleManager`) should be the single source of truth for state transitions, ensuring they are logged and observable.
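A hedged sketch of the "invalid transitions should throw" rule: the `Phase` constants mirror the enum above, but the transition table and class name are assumptions — the proposal only states explicitly that `stopped` is terminal:

```java
// Illustrative validator; only "STOPPED is terminal" comes from the proposal text.
final class LifecycleValidator {
    enum Phase { INITIALIZING, ACCEPTING, DRAINING, FAILED, STOPPED }

    /** Returns {@code to} if the transition is legal, else throws (programming error). */
    static Phase transition(Phase from, Phase to) {
        boolean ok = switch (from) {
            case INITIALIZING -> to == Phase.ACCEPTING || to == Phase.FAILED;
            case ACCEPTING -> to == Phase.DRAINING;
            case DRAINING -> to == Phase.INITIALIZING || to == Phase.STOPPED;
            case FAILED -> to == Phase.INITIALIZING || to == Phase.STOPPED;
            case STOPPED -> false; // terminal: no exits
        };
        if (!ok) {
            throw new IllegalStateException("Illegal transition " + from + " -> " + to);
        }
        return to;
    }
}
```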
Member


I think we should try to avoid putting implementation details like this into proposals.

  • Those will become apparent when a PR is opened, so it's not really necessary.
  • It's highly likely that they'll become stale once an implementation is available
  • It encourages other people opening proposals to do the same. We end up with people thinking they need to write an essay just to get anything done. Personally that's not how I want things to work. Everyone benefits from up-front discussion and agreement about public, supported APIs. PR reviews should be sufficient for other code changes.


Some configuration changes will likely always require draining — for example, changes to the upstream cluster identity or TLS configuration that invalidate existing connections. The optimisation is about identifying changes where draining can be safely skipped, not eliminating it.

### Proxy-level lifecycle
Member


You've defined how proxy startup works with respect to the VC state machines. It's not really clear to me whether we need a proposal about the rest, unless you think that state is also exposed via metrics/management endpoints etc.
