
Conversation

@DerekFrank
Contributor

Issue #, if available:

Description of changes:

This change introduces the tasks necessary to install and leverage Karpenter to scale a cluster. This was tested in a dev cluster.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@DerekFrank force-pushed the main branch 2 times, most recently from 1f0c547 to e7800e3 on September 16, 2025 23:48
kind: Pipeline
apiVersion: tekton.dev/v1
metadata:
  name: derekff-karpenter-testing
Contributor

Let's use generic naming please.

Contributor

Also, can we name the pipeline.yaml file karpenter-titan-pipeline or something that actually says what the file is?

Contributor Author

I was mostly including this as an example file. Where do you actually want the pipeline? Also, we can't put the pipeline in OSS, because it has to contain secrets that I've manually removed, like the Slack hook URL.

Contributor

We check in pipelines here: https://github.com/awslabs/kubernetes-iteration-toolkit/tree/main/tests/tekton-resources/pipelines/eks

For secrets and the like, you don't have to define the param value in the pipeline; in Tekton, you have the option of not specifying a default for a parameter.
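For example, a minimal sketch of such a parameter (the slack-hook-url name is illustrative, not taken from this PR): with no default declared, the secret value has to be supplied by the PipelineRun rather than living in the checked-in pipeline.

spec:
  params:
    # Hypothetical parameter; no default on purpose, so the PipelineRun supplies the real value.
    - name: slack-hook-url
      type: string
      description: Slack webhook URL for notifications; must be provided at run time.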

operator: In
values:
- medium
- large
Contributor

Why do we need larges for the purposes of the Karpenter test?

Contributor Author

EC2 indicated that they want us to use a variety of instance types to prevent ICEing.

Contributor

IIRC it's the other way around: when we wanted to leverage larges, they raised a concern about ICEing, hence mediums in the mix; if we just go with mediums, they should be fine, IIUC. I was just questioning it from the cost perspective, given the large-scale test. We should avoid larges unless the tests absolutely demand it. We don't have to be blocked for this PR's purposes, but please follow up on this.

Contributor Author

I will remove larges and let you follow up with EC2 on the cost item.

--cluster-name $(params.cluster-name) \
--nodegroup-name karpenter-system-large \
--node-role arn:aws:iam::$(params.aws-account-id):role/$(params.cluster-name)-node-role \
--instance-types r5.24xlarge \
Contributor

Shouldn't this be a param? That way, depending on the scale of the cluster, we can put the Karpenter node on an instance type that makes sense proportionally.

Contributor Author

@DerekFrank Sep 17, 2025

It could be, but I don't think it needs to be until we have a use case. There's no harm in using a large instance for now and tuning it later. We'd also have to tune the Karpenter install requests, which we don't do right now.

Contributor

There are already plans to run this at a smaller scale as well, but I am OK with punting it for now if you want to follow up on it later.

Contributor Author

Let's circle back on that when we decide the scale we want to run it at.
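When we do pick this up, one rough sketch of the parameterization (the param name below is hypothetical, not from this PR) would be a parameter with a default that the nodegroup command references:

  params:
    # Hypothetical parameter for the nodegroup that hosts the Karpenter controller.
    - name: karpenter-nodegroup-instance-type
      type: string
      default: r5.24xlarge
      description: Instance type for the karpenter-system nodegroup.

and in the step script:

    --instance-types $(params.karpenter-nodegroup-instance-type) \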

--set settings.preferencePolicy=Ignore \
--set "serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:aws:iam::$(params.aws-account-id):role/KarpenterControllerRole-$(params.cluster-name)" \
--set controller.resources.requests.cpu=60 \
--set controller.resources.requests.memory=200Gi \
Contributor

Resource allocation to the controller should depend on the size of the cluster, right? Can we parameterize these?

Contributor Author

We can parameterize everything; let's not until we have a use case. It's an easy fast follow.

Contributor

We do have a use case to run it at a smaller scale, hence the question. We can punt it for later, as mentioned in the other comment.
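For the fast follow, a sketch of what the parameterized requests could look like (param names hypothetical, not from this PR): declare them alongside the other pipeline params and feed them into the helm flags.

  params:
    # Hypothetical params sizing the Karpenter controller for the target cluster scale.
    - name: karpenter-controller-cpu
      type: string
      default: "60"
    - name: karpenter-controller-memory
      type: string
      default: "200Gi"

and in the install step:

  --set controller.resources.requests.cpu=$(params.karpenter-controller-cpu) \
  --set controller.resources.requests.memory=$(params.karpenter-controller-memory) \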

aws iam delete-instance-profile --instance-profile-name "KarpenterNodeInstanceProfile-$(params.cluster-name)"
echo "Instance profile KarpenterNodeInstanceProfile-$(params.cluster-name) deleted successfully."
else
echo "Instance profile KarpenterNodeInstanceProfile-$(params.cluster-name) does not exist. Skipping deletion..."
Contributor

Just curious: have you tested whether this falls into the else branch when the instance profile couldn't be deleted successfully?

Contributor Author

It falls into the else branch when the instance profile doesn't exist, which is the desired behavior. It errors out when the instance profile exists but can't be deleted, which is also the desired behavior.
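For context, the full check reads roughly like this (a sketch reconstructed from the hunk above, not the verbatim task; it assumes the step script runs with set -e, so a failed delete aborts before the success message):

if aws iam get-instance-profile --instance-profile-name "KarpenterNodeInstanceProfile-$(params.cluster-name)" >/dev/null 2>&1; then
  # Profile exists: delete it; with set -e, a failed delete fails the step before the echo.
  aws iam delete-instance-profile --instance-profile-name "KarpenterNodeInstanceProfile-$(params.cluster-name)"
  echo "Instance profile KarpenterNodeInstanceProfile-$(params.cluster-name) deleted successfully."
else
  echo "Instance profile KarpenterNodeInstanceProfile-$(params.cluster-name) does not exist. Skipping deletion..."
fi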

hakuna-matatah previously approved these changes Sep 18, 2025
echo "Role $PIA_ROLE_NAME does not exist, no action needed."
fi
done
- name: delete-karpenter-role
Contributor

Let's have a separate teardown for ultra clusters to ensure it won't break existing pipelines when there is a bug.
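One way to keep that opt-in (the task and param names below are hypothetical, not from this PR) is to gate the new teardown task behind a when expression, so pipelines that never set the flag are untouched:

    - name: teardown-karpenter-ultra
      when:
        - input: "$(params.cluster-profile)"
          operator: in
          values: ["ultra"]
      taskRef:
        name: awscli-karpenter-teardown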

@hakuna-matatah merged commit f999043 into awslabs:main Sep 18, 2025
4 checks passed