
Conversation

@flpanbin
Contributor

Related issue #22

Background

Following the meeting on September 10th, 2025, we discussed contributing our production-tested sandbox design to the Kubernetes community. This design has been successfully used in production environments.

Overview

This PR introduces the initial draft CRD design for the Sandbox custom resource, which aims to provide a declarative, standardized API for managing isolated, stateful, singleton workloads - particularly ideal for AI agent runtimes and development environments.
We welcome community feedback and will iterate on the CRD design based on comments and suggestions.

Call for Community Feedback

We're actively seeking input from the community on:
API Design: Are the field names and structure intuitive?
Missing Features: What additional capabilities should we consider?
Use Cases: How does this align with your specific requirements?
Compatibility: Any concerns with existing Kubernetes patterns?
Please share your thoughts, suggestions, and concerns in the comments below!

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 15, 2025
@k8s-ci-robot
Contributor

Welcome @flpanbin!

It looks like this is your first PR to kubernetes-sigs/agent-sandbox 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/agent-sandbox has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 15, 2025
@flpanbin flpanbin marked this pull request as draft September 15, 2025 05:33
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 15, 2025
@flpanbin flpanbin changed the title [Draft] Refine sandbox CRD design Refine sandbox CRD design Sep 15, 2025
@flpanbin flpanbin marked this pull request as ready for review September 16, 2025 06:12
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 16, 2025
@lengrongfu

@justinsb @janetkuo Based on our discussion last week, we have renamed some fields and removed obsolete ones. The Sandbox CRD has been refined to align with the new design. We welcome your feedback on the changes in this PR.

@justinsb
Contributor

Thanks for sharing @flpanbin ! There's a lot going on here, and I think one of the principles of the Sandbox CRD as a kubernetes project is that it should follow the Kubernetes design patterns and be relatively unopinionated. So right now we are using PodTemplate, because that's what Deployment, StatefulSet, and DaemonSet do. We're consuming the whole thing, even though maybe some fields are less relevant, because we want to enable people to build more opinionated layers on top. (For example, lots of companies build their own BigcoDeployment on top of Deployment, with just the features they want.)

To move forward, I can think of two ways:

  • Have your CRD create a Sandbox type, instead of creating a Pod/Deployment/whatever-it-is-currently-creating. Because Sandbox exposes the whole PodTemplate, this should be possible today, and where it isn't we want to add fields to Sandbox. Your CRD could live in your own repo or in our examples/ folder. If it is in our examples/ folder it would be nice to have a README describing the key fields you don't want to expose to end-users (or other reasons for creating an abstraction on top) - there are many good reasons, it just helps to motivate your CRD.
  • If you think your users could use Sandbox directly, except that Sandbox is missing some fields/features, then let's figure out what those are and add them more individually. I know status.conditions has come up before and @barney-s is adding them in active PRs, but I think we could use more use-cases for status.conditions to know what is important (for example, do you just want a Ready condition, or do you want more granular information).
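To make the granularity question concrete, here is a sketch of what a coarse Ready condition versus a finer-grained one might look like in a Sandbox status. The type/status/reason/message/lastTransitionTime shape is the standard Kubernetes metav1.Condition schema; the specific condition types shown are illustrative, not agreed API:

```yaml
# Hypothetical Sandbox status. The condition schema is the standard
# metav1.Condition shape; the condition types here are examples only.
status:
  conditions:
    - type: Ready                # coarse-grained: "is the sandbox usable?"
      status: "True"
      reason: PodReady
      message: Sandbox pod is running and passing readiness checks
      lastTransitionTime: "2025-09-17T08:00:00Z"
    - type: ServiceReady         # finer-grained example condition
      status: "True"
      reason: EndpointsAvailable
      message: Headless service has ready endpoints
      lastTransitionTime: "2025-09-17T08:00:05Z"
```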

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 17, 2025
@flpanbin flpanbin closed this Sep 17, 2025
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Sep 17, 2025
@flpanbin flpanbin reopened this Sep 17, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: flpanbin
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 17, 2025
@flpanbin
Contributor Author

flpanbin commented Sep 17, 2025

@justinsb Thanks for the thoughtful feedback. I fully understand your viewpoint on following Kubernetes design patterns. First, let me explain why we chose custom fields instead of directly using PodTemplateSpec.

  • Simplified User Experience for Development Environment Use Case: Most users only need to specify image, resources, networking, and storage. Exposing the full PodTemplateSpec pushes them into container probes, security contexts, and other advanced settings that are hard to set correctly for this use case.
  • Security considerations: We intentionally limit configurable fields to avoid risky or confusing configurations (e.g., not exposing the full securityContext). This allows safer defaults while still enabling a functional environment.
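For illustration, the kind of simplified, opinionated spec we had in mind looks roughly like the sketch below. The point is that users set only image, resources, storage, and networking, while probes, security contexts, and similar advanced settings are defaulted by the controller. The CRD name and all field names here are hypothetical:

```yaml
# Hypothetical sketch of a simplified, opinionated layer -- not a
# proposal for this project. Group, kind, and field names are invented
# for illustration.
apiVersion: example.dev/v1alpha1
kind: DevSandbox
metadata:
  name: my-dev-env
spec:
  image: ghcr.io/example/dev-env:latest
  resources:
    cpu: "2"
    memory: 4Gi
  storage:
    size: 20Gi
  networking:
    ports:
      - name: ide
        port: 8080
```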

I fully agree with exposing PodTemplateSpec directly. The sandbox project’s scope is a general, Kubernetes-native abstraction for single-instance, stateful workloads; “developer environment” is just one use case built on top of that. To improve the core Sandbox, I’d like to propose adding a few generic fields.

Proposal: Add a few fields to Sandbox

  • networking: Provide optional Service-level exposure and/or references to externally managed routes, without binding Sandbox to a specific ingress/gateway stack.
    • Service exposure and ports:
      • Optional spec.networking.service block; if omitted, the controller does not create an external Service (headless Service for discovery can remain internal).
      • spec.networking.service.type: ClusterIP | NodePort | LoadBalancer
      • spec.networking.service.ports[] with:
        • name (string) — aligns with container port name for mapping
        • port (int32) — Service port
        • targetPort (int32|string, optional) — defaults to the named container port if omitted
        • protocol (TCP|UDP, default TCP)
    • Optionally spec.networking.routeRefs[] to reference externally managed HTTPRoute/TCPRoute (Gateway API) or Ingress. Sandbox does not create these; it only references them and can surface reachability in status.
  • schedule: Allow an RFC3339 shutdownTime for automatic stop. This encodes a common lifecycle action for single-instance, stateful workloads.
  • pause: A boolean to explicitly stop the runtime (delete the Pod while preserving the object and persistent state), and resume when false. This aligns with the project’s focus on long-running, stateful, singleton workloads.
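Put together, a Sandbox using the proposed fields might look like the following sketch. Everything under networking, schedule, and pause is a proposal from this comment, not existing API, and the group/version is assumed for illustration:

```yaml
# Sketch of the proposed additions. All fields under networking,
# schedule, and pause are proposals, not part of the current Sandbox CRD;
# the apiVersion is assumed for illustration.
apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: demo
spec:
  podTemplate:
    spec:
      containers:
        - name: runtime
          image: ghcr.io/example/agent-runtime:latest
          ports:
            - name: http
              containerPort: 8080
  networking:
    service:
      type: ClusterIP
      ports:
        - name: http
          port: 80
          targetPort: http     # defaults to the named container port if omitted
          protocol: TCP
    routeRefs:                  # references only; Sandbox does not create these
      - kind: HTTPRoute
        name: demo-route
  schedule:
    shutdownTime: "2025-09-20T18:00:00Z"   # RFC3339 automatic stop
  pause: false                  # true deletes the Pod while preserving state
```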

Status implications

  • If networking is added, surface reachability in status (e.g., URL/IP/ports/ready).
  • If schedule/pause are added, reflect lifecycle states via conditions (e.g., Stopping, Resuming, Scheduled, Ready) so callers can reason about transitions.
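As a sketch, the corresponding status surface might look like this (all field names hypothetical; the conditions follow the standard metav1.Condition shape):

```yaml
# Hypothetical status sketch for the proposed fields.
status:
  networking:
    serviceIP: 10.96.120.7     # reachability surfaced from the Service
    ports: [80]
    ready: true
  conditions:
    - type: Ready
      status: "False"
      reason: Stopping
      message: Sandbox is stopping because spec.pause=true
      lastTransitionTime: "2025-09-17T09:00:00Z"
```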

On alternative of creating a higher-level CRD example

  • If it is helpful to the user experience, we can add an example controller/CRD under examples/ in the future.

We’re open to adjusting details based on community feedback.

@janetkuo
Member

@flpanbin Thanks for breaking down the features. It's good to discuss the APIs first and then add each feature one by one. In general we prefer smaller PRs: they can be reviewed more quickly, are less likely to introduce bugs, and are easier to roll back if needed.

spec.networking

With #9, sandbox creates a headless service automatically. Adding a networking field could be good for users to customize the network layer further.

spec.schedule

There's a related design proposal for TTL #21 which handles a similar case but in a different way. Let's discuss which we prefer (or do we need both?)
In terms of field name, I'd call it something more explicit, perhaps shutdownTime. The term "schedule" could be confused with node scheduling, and is less clear about what the schedule is for.

spec.pause

This looks interesting. Would you provide more details on how the state is saved when pausing and restored when resuming? How is it different from the shutdownTime/TTL (pausing saves state, and shutdown doesn't)? This change is likely bigger and might require multiple PRs.

@flpanbin
Contributor Author

@janetkuo Thanks for the feedback! You're absolutely right about smaller, focused PRs being easier to review and less risky. I completely agree with that approach. I will close the PR and submit issues and PRs separately for each feature.
