@Choraden

What type of PR is this?

/kind feature

What this PR does / why we need it:

This change introduces a new component, TemplateNodeInfoRegistry, which wraps the existing TemplateNodeInfoProvider. It caches the computed template NodeInfos and exposes them via a thread-safe interface.
This registry is added to the AutoscalingContext, allowing processors (like the DRA processor) to access the cached templates instead of relying on the less reliable NodeGroup.TemplateNodeInfo().
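
A minimal sketch of the caching idea, not the code from this PR: `Provider`, `Registry`, `Recompute`, `GetNodeInfo` and the simplified `NodeInfo` are hypothetical stand-ins, and the real TemplateNodeInfoProvider and registry signatures may differ. It only illustrates wrapping a provider, caching its result, and guarding reads with a `sync.RWMutex`.

```go
package main

import (
	"fmt"
	"sync"
)

// NodeInfo is a simplified stand-in for the autoscaler's NodeInfo type.
type NodeInfo struct {
	NodeGroupID string
}

// Provider is a hypothetical stand-in for TemplateNodeInfoProvider:
// it computes one template NodeInfo per node group ID.
type Provider interface {
	Process() (map[string]*NodeInfo, error)
}

// Registry wraps a Provider and caches its most recent result.
type Registry struct {
	mu       sync.RWMutex
	provider Provider
	cache    map[string]*NodeInfo
}

func NewRegistry(p Provider) *Registry {
	return &Registry{provider: p, cache: map[string]*NodeInfo{}}
}

// Recompute refreshes the cache from the provider. On error the previous
// cache is kept, so readers never observe a partially built map.
func (r *Registry) Recompute() error {
	fresh, err := r.provider.Process()
	if err != nil {
		return err
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	r.cache = fresh
	return nil
}

// GetNodeInfo returns the cached template for a node group, if any.
// It never calls the provider.
func (r *Registry) GetNodeInfo(id string) (*NodeInfo, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	ni, ok := r.cache[id]
	return ni, ok
}

// staticProvider is a fake provider that counts how often it is invoked.
type staticProvider struct{ calls int }

func (p *staticProvider) Process() (map[string]*NodeInfo, error) {
	p.calls++
	return map[string]*NodeInfo{"ng-1": {NodeGroupID: "ng-1"}}, nil
}

func main() {
	p := &staticProvider{}
	reg := NewRegistry(p)
	if err := reg.Recompute(); err != nil {
		panic(err)
	}
	ni, ok := reg.GetNodeInfo("ng-1")
	fmt.Println(ni.NodeGroupID, ok, p.calls) // ng-1 true 1
}
```

In this sketch, processors would only ever call `GetNodeInfo`, so every reader within one loop sees the same templates regardless of how often it asks.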

Which issue(s) this PR fixes:

Fixes #8881
Fixes #8882

Special notes for your reviewer:

--

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note-none (denotes a PR that doesn't merit a release note), do-not-merge/invalid-commit-message (indicates that a PR should not merge because it has an invalid commit message), and kind/feature (categorizes issue or PR as related to a new feature) labels on Dec 10, 2025
@linux-foundation-easycla

linux-foundation-easycla bot commented Dec 10, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

Welcome @Choraden!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot added the needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) label on Dec 10, 2025
@k8s-ci-robot
Contributor

Hi @Choraden. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the cncf-cla: no (indicates the PR's author has not signed the CNCF CLA) and area/cluster-autoscaler labels on Dec 10, 2025
@k8s-ci-robot added the size/L (denotes a PR that changes 100-499 lines, ignoring generated files) and cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) labels, and removed the do-not-merge/needs-area and cncf-cla: no labels, on Dec 10, 2025
@Choraden force-pushed the template_node_info_registry_v1 branch from f9c0302 to ad96941 on December 10, 2025 15:35
@k8s-ci-robot removed the do-not-merge/invalid-commit-message label on Dec 10, 2025
@Choraden marked this pull request as draft on December 11, 2025 07:32
@k8s-ci-robot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) label on Dec 11, 2025
@Choraden force-pushed the template_node_info_registry_v1 branch from ad96941 to 4fb808c on December 17, 2025 12:46
@k8s-ci-robot added the needs-rebase (indicates a PR cannot be merged because it has merge conflicts with HEAD) label on Dec 17, 2025
@Choraden force-pushed the template_node_info_registry_v1 branch from 4fb808c to 7f36de5 on December 17, 2025 13:16
@k8s-ci-robot removed the needs-rebase label on Dec 17, 2025
@Choraden marked this pull request as ready for review on December 17, 2025 13:24
@k8s-ci-robot removed the do-not-merge/work-in-progress label on Dec 17, 2025
@Choraden
Author

/assign @towca

@jackfrancis
Contributor

/cherry-pick cluster-autoscaler-release-1.35

@k8s-infra-cherrypick-robot

@jackfrancis: once the present PR merges, I will cherry-pick it on top of cluster-autoscaler-release-1.35 in a new PR and assign it to you.

In response to this:

/cherry-pick cluster-autoscaler-release-1.35

// NewTestProcessors returns a set of simple processors for use in tests.
// Note: This function injects a default TemplateNodeInfoRegistry into the provided AutoscalingContext.
// This is a necessary workaround for synthetic tests that manually construct the context without using NewStaticAutoscaler, ensuring they have access to the registry.
func NewTestProcessors(autoscalingCtx *ca_context.AutoscalingContext) *processors.AutoscalingProcessors {
Collaborator

I get that this was the easiest change to make the tests pass, but unfortunately little hacks like these make the tests really hard to understand and extend.

Looking at the usages of this function, it's ~always called after NewScaleTestAutoscalingContext(). IMO the order should be switched, like it's in the prod path - processors are a dependency of the context, not the other way around. NewScaleTestAutoscalingContext() should either take the processors as parameter, or call NewTestProcessors() internally. NewTestProcessors() technically depends on the full context now, but it only uses a small subset of it - config.AutoscalingOptions - which is also used as a parameter to NewScaleTestAutoscalingContext(). Have you explored something like that?

Author

I decided to:

  • decouple NewTestProcessors from autoscalingCtx so that it depends only on config.AutoscalingOptions
  • update NewScaleTestAutoscalingContext to accept TemplateNodeInfoRegistry, as the original NewAutoscalingContext does
  • reorder test initialization: create options -> create processors & registry -> create context

This aligns the test setup with the production architecture and improves readability and safety.

Adding it in a separate commit to streamline review. Let me know if you want it squashed eventually.
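
A rough sketch of the resulting initialization order, assuming heavily simplified types: `newTestProcessors`, `newTestContext` and every struct here are hypothetical stand-ins, and the real NewTestProcessors and NewScaleTestAutoscalingContext take more parameters than shown.

```go
package main

import "fmt"

// All types below are simplified, hypothetical stand-ins for the real ones.
type AutoscalingOptions struct{ MaxNodesTotal int }

type Processors struct{ Options AutoscalingOptions }

type TemplateNodeInfoRegistry struct{}

type AutoscalingContext struct {
	AutoscalingOptions       AutoscalingOptions
	TemplateNodeInfoRegistry *TemplateNodeInfoRegistry
}

// newTestProcessors depends only on the options and returns the registry
// alongside the processors, so nothing is injected into a context afterwards.
func newTestProcessors(opts AutoscalingOptions) (*Processors, *TemplateNodeInfoRegistry) {
	return &Processors{Options: opts}, &TemplateNodeInfoRegistry{}
}

// newTestContext receives the registry as an explicit dependency.
func newTestContext(opts AutoscalingOptions, reg *TemplateNodeInfoRegistry) *AutoscalingContext {
	return &AutoscalingContext{AutoscalingOptions: opts, TemplateNodeInfoRegistry: reg}
}

func main() {
	opts := AutoscalingOptions{MaxNodesTotal: 10} // 1. options
	procs, reg := newTestProcessors(opts)         // 2. processors & registry
	ctx := newTestContext(opts, reg)              // 3. context

	// Processors and context were built from the same options and registry.
	fmt.Println(procs.Options == ctx.AutoscalingOptions, ctx.TemplateNodeInfoRegistry == reg) // true true
}
```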

Collaborator

LGTM, thanks a lot for this!

@Choraden force-pushed the template_node_info_registry_v1 branch from 7f36de5 to f1ba828 on December 29, 2025 15:00
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Choraden
Once this PR has been reviewed and has the lgtm label, please ask for approval from towca. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the size/XL (denotes a PR that changes 500-999 lines, ignoring generated files) label and removed the size/L label on Dec 29, 2025
Author

@Choraden left a comment

@towca I've addressed your comments. PTAL

Collaborator

@towca left a comment

Some final comments for the test code, but in general LGTM!

"node_5_Dra_Unready": false,
},
},
"2 DRA node group, single driver multiple pools, more pools published including template pools": {
Collaborator

Was this case intentionally removed? Why?

Author

I guess it should not have been removed. Reverted.

"node_1": createNodeResourceSlices("node_1", []int{1, 1}),
},
expectedNodesReadiness: map[string]bool{
"node_1": true,
Collaborator

This case could pass even if the DRA processor was completely broken. I think we should assert that readiness is actually changed based on the TemplateNodeInfo fallback, not that it's left unchanged.

Author

Right, I changed the test to assert that the node is unready.
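
To illustrate the concern in this thread, here is a small sketch in which `processor`, `brokenProcessor`, `markUnready` and `passes` are all hypothetical and unrelated to the real test helpers. A case that expects readiness to stay unchanged also passes for a processor that does nothing, while a case that expects a flip does not.

```go
package main

import "fmt"

// processor is a hypothetical stand-in for the DRA processor: it takes node
// readiness and returns the (possibly adjusted) readiness.
type processor func(ready map[string]bool) map[string]bool

// brokenProcessor does nothing at all.
func brokenProcessor(ready map[string]bool) map[string]bool { return ready }

// markUnready returns a processor that flips the listed nodes to unready,
// imitating nodes whose resource slices do not match the template.
func markUnready(nodes ...string) processor {
	return func(ready map[string]bool) map[string]bool {
		out := make(map[string]bool, len(ready))
		for name, r := range ready {
			out[name] = r
		}
		for _, name := range nodes {
			if _, ok := out[name]; ok {
				out[name] = false
			}
		}
		return out
	}
}

// passes reports whether the processor output matches the expectation.
func passes(p processor, input, expected map[string]bool) bool {
	got := p(input)
	if len(got) != len(expected) {
		return false
	}
	for name, want := range expected {
		if r, ok := got[name]; !ok || r != want {
			return false
		}
	}
	return true
}

func main() {
	input := map[string]bool{"node_1": true}
	unchanged := map[string]bool{"node_1": true}
	changed := map[string]bool{"node_1": false}

	fmt.Println(passes(brokenProcessor, input, unchanged))     // true: the weak case cannot detect the bug
	fmt.Println(passes(brokenProcessor, input, changed))       // false: the strong case catches it
	fmt.Println(passes(markUnready("node_1"), input, changed)) // true
}
```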

"node_1": createNodeResourceSlices("node_1", []int{1, 1}),
},
expectedNodesReadiness: map[string]bool{
"node_1": true,
Collaborator

Same here, I'd revert this case so that readiness doesn't change in the fallback path, but it does in the preferred path.

Author

Done.

if !config.optionsBlockDefaulting {
// Apply sane default options that make testing scale-down etc. possible - if not explicitly stated in the config that this is not desired.
applySaneDefaultOpts(&autoscalingCtx)
applySaneDefaultOpts(&opts)
Collaborator

Because opts is a copy-by-value of the config.autoscalingOptions field, after these changes applySaneDefaultOpts() now doesn't actually mutate the options in the context. So we end up with defaulted options in the processors created by NewTestProcessors(), but non-defaulted options in everything that uses them directly from the context. The inconsistency probably doesn't matter for existing tests, but it could be a real head-scratcher for someone adding a new test in the future. IMO this should be applySaneDefaultOpts(&autoscalingCtx.AutoscalingOptions).

Author

Good catch! Done.
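
The copy-by-value pitfall discussed in this thread can be reproduced in isolation. In the sketch below every type, field (such as `ScaleDownEnabled`) and helper body is a simplified assumption; only the shape of the two calls mirrors the thread.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the real types.
type AutoscalingOptions struct{ ScaleDownEnabled bool }

type AutoscalingContext struct{ AutoscalingOptions AutoscalingOptions }

type testConfig struct{ autoscalingOptions AutoscalingOptions }

// applySaneDefaultOpts mutates whatever it is pointed at.
func applySaneDefaultOpts(o *AutoscalingOptions) { o.ScaleDownEnabled = true }

// defaultCopy mirrors the pitfall: opts is a copy of the config field,
// so defaulting it leaves the options stored in the context untouched.
func defaultCopy(cfg testConfig) (AutoscalingOptions, AutoscalingContext) {
	ctx := AutoscalingContext{AutoscalingOptions: cfg.autoscalingOptions}
	opts := cfg.autoscalingOptions // copy by value
	applySaneDefaultOpts(&opts)
	return opts, ctx
}

// defaultInContext mirrors the suggested fix: default the context's own
// field and hand the same values to everything else.
func defaultInContext(cfg testConfig) (AutoscalingOptions, AutoscalingContext) {
	ctx := AutoscalingContext{AutoscalingOptions: cfg.autoscalingOptions}
	applySaneDefaultOpts(&ctx.AutoscalingOptions)
	return ctx.AutoscalingOptions, ctx
}

func main() {
	opts, ctx := defaultCopy(testConfig{})
	fmt.Println(opts.ScaleDownEnabled, ctx.AutoscalingOptions.ScaleDownEnabled) // true false

	opts, ctx = defaultInContext(testConfig{})
	fmt.Println(opts.ScaleDownEnabled, ctx.AutoscalingOptions.ScaleDownEnabled) // true true
}
```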

@jackfrancis
Contributor

/test pull-cluster-autoscaler-e2e-azure-master
/cherry-pick cluster-autoscaler-release-1.35

@k8s-infra-cherrypick-robot

@jackfrancis: once the present PR merges, I will cherry-pick it on top of cluster-autoscaler-release-1.35 in a new PR and assign it to you.

In response to this:

/test pull-cluster-autoscaler-e2e-azure-master
/cherry-pick cluster-autoscaler-release-1.35

…NodeInfos

This change introduces a new component, TemplateNodeInfoRegistry, which wraps the existing TemplateNodeInfoProvider.
It caches the computed template NodeInfos and exposes them via a thread-safe interface.
This registry is added to the AutoscalingContext, allowing processors (like the DRA processor) to access the cached templates instead of relying on the less reliable NodeGroup.TemplateNodeInfo().
…gistry

Key changes:
- Updated NewScaleTestAutoscalingContext to accept TemplateNodeInfoRegistry as a parameter.
- Refactored NewTestProcessors to take AutoscalingOptions and return both Processors and TemplateNodeInfoRegistry.
- Reordered test initialization to follow the production path: Options -> Processors/Registry -> AutoscalingContext.

These changes improve test readability and extensibility by
keeping the test setup of the autoscaling environment consistent with
the production logic.

The DRACustomResourcesProcessor now attempts to retrieve NodeInfo from
the TemplateNodeInfoRegistry before falling back to the NodeGroup.
This ensures the processor uses the canonical TemplateNodeInfo for the
current autoscaling loop. Crucially, this preserves any enrichments
(such as custom DRA resource slices) that are computed during the
registry's Recompute phase but might be absent in a fresh, raw
template from the CloudProvider.
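
The lookup order described in this commit message could be sketched roughly as follows. This is an illustration under assumptions: `templateRegistry`, `GetNodeInfo`, `templateFor` and the simplified `NodeInfo` are hypothetical names, and only `Id()` and `TemplateNodeInfo()` are modeled on the NodeGroup interface.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeInfo is a simplified stand-in; ResourceSlices represents enrichments
// that a raw cloud provider template may lack.
type NodeInfo struct {
	Source         string
	ResourceSlices int
}

// templateRegistry is a hypothetical read-only view of TemplateNodeInfoRegistry.
type templateRegistry interface {
	GetNodeInfo(nodeGroupID string) (*NodeInfo, bool)
}

// nodeGroup keeps only the two NodeGroup methods this sketch needs.
type nodeGroup interface {
	Id() string
	TemplateNodeInfo() (*NodeInfo, error)
}

// templateFor prefers the registry's cached template and falls back to the
// node group only when the registry is absent or has no entry.
func templateFor(reg templateRegistry, ng nodeGroup) (*NodeInfo, error) {
	if reg != nil {
		if ni, ok := reg.GetNodeInfo(ng.Id()); ok {
			return ni, nil
		}
	}
	return ng.TemplateNodeInfo()
}

// mapRegistry is a trivial in-memory registry for the example.
type mapRegistry map[string]*NodeInfo

func (m mapRegistry) GetNodeInfo(id string) (*NodeInfo, bool) {
	ni, ok := m[id]
	return ni, ok
}

// fakeGroup is a node group whose raw template carries no enrichments.
type fakeGroup struct {
	id  string
	raw *NodeInfo
}

func (g fakeGroup) Id() string { return g.id }

func (g fakeGroup) TemplateNodeInfo() (*NodeInfo, error) {
	if g.raw == nil {
		return nil, errors.New("no template available")
	}
	return g.raw, nil
}

func main() {
	reg := mapRegistry{"ng-1": {Source: "registry", ResourceSlices: 2}}

	cached, _ := templateFor(reg, fakeGroup{id: "ng-1", raw: &NodeInfo{Source: "cloud provider"}})
	raw, _ := templateFor(reg, fakeGroup{id: "ng-2", raw: &NodeInfo{Source: "cloud provider"}})

	fmt.Println(cached.Source, cached.ResourceSlices) // registry 2
	fmt.Println(raw.Source, raw.ResourceSlices)       // cloud provider 0
}
```
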
@Choraden force-pushed the template_node_info_registry_v1 branch from f1ba828 to c96e983 on January 9, 2026 15:36
@Choraden
Author

Choraden commented Jan 9, 2026

@towca PTAL.

Labels

area/cluster-autoscaler
cncf-cla: yes (indicates the PR's author has signed the CNCF CLA)
kind/feature (categorizes issue or PR as related to a new feature)
needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test)
release-note-none (denotes a PR that doesn't merit a release note)
size/XL (denotes a PR that changes 500-999 lines, ignoring generated files)

Projects

None yet

5 participants