Datapath tests for Long running clusters. #4142
Conversation
- Implemented scheduled pipeline running every 1 hour with persistent infrastructure
- Split test execution into 2 jobs: Create (with 20-minute wait) and Delete
- Added 8 test scenarios across 2 AKS clusters, 4 VNets, and different subnets
- Implemented a two-phase deletion strategy to prevent PNI ReservationInUse errors
- Added context timeouts on kubectl commands with force-delete fallbacks
- Resource naming uses the RG name as BUILD_ID for uniqueness across parallel setups
- Added SkipAutoDeleteTill tags to prevent automatic resource cleanup
- Conditional setup stages controlled by the runSetupStages parameter
- Auto-generate the RG name from the location, or allow custom names for parallel setups
- Added a comprehensive README with setup instructions and troubleshooting
- Node selection by agentpool labels, with usage tracking to prevent conflicts
- Kubernetes naming compliance (RFC 1123) for all resources

Squashed commit messages: fix ginkgo flag. Add datapath tests. Delete old test file. Add testcases for private endpoint. Ginkgo run specs only on specified files. Update pipeline params. Add ginkgo tags. Add datapath tests. Add ginkgo build tags. Remove wait time. Set namespace. Update pod image. Add more NSG rules to block subnets s1 and s2. Test change. Change delegated subnet address range. Use delegated interface for network connectivity tests. Datapath test between clusters. Test. Test private endpoints. Fix private endpoint tests. Set storage account names in output var. Set storage account name. Fix pn names. Update pe. Update pe test. Update SAS token generation. Add node labels for sw2 scenario, cleanup pods on any test failure. Enable NSG tests. Update storage. Add rules to NSG. Disable private endpoint negative test. Disable public network access on storage account with private endpoint. Wait for default NSG to be created. Disable negative test on private endpoint. Private endpoint depends on AKS cluster VNets, change pipeline job dependencies. Add node labels for each workload type and NIC capacity. Make SKU constant. Update readme, set schedule for long running cluster on test branch.
Pull request overview
This PR adds a comprehensive long-running test pipeline for SwiftV2 pod networking on Azure AKS. The pipeline creates persistent infrastructure (2 AKS clusters, 4 VNets, storage accounts with private endpoints, NSGs) and runs scheduled tests every 3 hours to validate pod-to-pod connectivity, network security group isolation, and private endpoint access across multi-tenant scenarios.
Key Changes:
- Adds scheduled pipeline with conditional infrastructure setup (runSetupStages parameter)
- Implements 8 pod test scenarios across 2 clusters and 4 VNets with different NIC capacities
- Includes 9 connectivity tests and 5 private endpoint tests with tenant isolation validation
Reviewed changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| .pipelines/swiftv2-long-running/pipeline.yaml | Main pipeline with 3-hour scheduled trigger and runSetupStages parameter |
| .pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml | Two-stage template: setup (conditional) and datapath tests with 4 jobs |
| .pipelines/swiftv2-long-running/scripts/*.sh | Infrastructure setup scripts for AKS, VNets, storage, NSGs, and private endpoints |
| test/integration/swiftv2/longRunningCluster/datapath*.go | Test implementation split into create, connectivity, private endpoint, and delete tests |
| test/integration/swiftv2/helpers/az_helpers.go | Azure CLI and kubectl helper functions for resource management |
| test/integration/manifests/swiftv2/long-running-cluster/*.yaml | Kubernetes resource templates for PodNetwork, PNI, and Pods |
| go.mod, go.sum | Updates to support Ginkgo v2 testing framework |
| hack/aks/Makefile | Updates for SwiftV2 cluster creation with multi-tenancy tags |
| .pipelines/swiftv2-long-running/README.md | Comprehensive documentation of pipeline architecture and test scenarios |
```bash
cmd_delegator_curl="'curl -X PUT http://localhost:8080/DelegatedSubnet/$modified_custsubnet'"
cmd_containerapp_exec="az containerapp exec -n subnetdelegator-westus-u3h4j -g subnetdelegator-westus --subscription 9b8218f9-902a-4d20-a65c-e98acec5362f --command $cmd_delegator_curl"
```
Copilot AI commented on Nov 24, 2025:
Hardcoded credentials and subscription IDs in the script. The script contains a hardcoded subscription ID 9b8218f9-902a-4d20-a65c-e98acec5362f and references a specific container app subnetdelegator-westus-u3h4j in resource group subnetdelegator-westus. These hardcoded values make the script non-portable and could expose sensitive information. Consider parameterizing these values or using environment variables.
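A minimal sketch of that parameterization, assuming hypothetical environment variable names (DELEGATOR_APP, DELEGATOR_RG, and DELEGATOR_SUB are illustrative, not from this PR):

```bash
# Hypothetical variable names; fail early if the caller did not set them.
: "${DELEGATOR_APP:?set to the subnet delegator container app name}"
: "${DELEGATOR_RG:?set to the delegator resource group}"
: "${DELEGATOR_SUB:?set to the target subscription ID}"

cmd_containerapp_exec="az containerapp exec -n $DELEGATOR_APP -g $DELEGATOR_RG --subscription $DELEGATOR_SUB --command $cmd_delegator_curl"
```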
```bash
responseFile="response.txt"
modified_vnet="${vnet_id//\//%2F}"
cmd_stamp_curl="'curl -v -X PUT http://localhost:8080/VirtualNetwork/$modified_vnet/stampcreatorservicename'"
cmd_containerapp_exec="az containerapp exec -n subnetdelegator-westus-u3h4j -g subnetdelegator-westus --subscription 9b8218f9-902a-4d20-a65c-e98acec5362f --command $cmd_stamp_curl"
```
Copilot AI commented on Nov 24, 2025:
Same hardcoded credentials issue. The script contains hardcoded subscription ID 9b8218f9-902a-4d20-a65c-e98acec5362f and references to subnetdelegator-westus-u3h4j container app. Consider parameterizing these values.
Resolved review threads (marked outdated):
- test/integration/swiftv2/longRunningCluster/datapath_create_test.go
- test/integration/swiftv2/longRunningCluster/datapath_delete_test.go
- test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go
- test/integration/swiftv2/longRunningCluster/datapath_delete_test.go
- test/integration/manifests/swiftv2/long-running-cluster/pod.yaml
- test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go
Commits applying review suggestions (titles truncated):
- Co-authored-by: Copilot <[email protected]> Signed-off-by: sivakami-projects <[email protected]>
- …st.go Co-authored-by: Copilot <[email protected]> Signed-off-by: sivakami-projects <[email protected]>
- …st.go Co-authored-by: Copilot <[email protected]> Signed-off-by: sivakami-projects <[email protected]>
- …ity_test.go Co-authored-by: Copilot <[email protected]> Signed-off-by: sivakami-projects <[email protected]>
- …st.go Co-authored-by: Copilot <[email protected]> Signed-off-by: sivakami-projects <[email protected]>
/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).
Commit 3b9bc5c (force-pushed from 69f0676 to d3c4686, Compare)
Signed-off-by: sivakami-projects <[email protected]>
/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).

/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).

/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).

/azp run Azure Container Networking PR

Azure Pipelines successfully started running 1 pipeline(s).
jpayne3506 left a comment:

We need to do a pass on the other supporting files not included in this PR.
```diff
-github.com/cilium/ebpf v0.19.0
+github.com/cilium/ebpf v0.16.0
```

Why are you rolling this back?
```go
replace (
	github.com/onsi/ginkgo => github.com/onsi/ginkgo v1.12.0
	github.com/onsi/gomega => github.com/onsi/gomega v1.10.0
)
```

This is going to impact other tests within the repo.
```diff
-	$(COMMON_AKS_FIELDS)
+	$(COMMON_AKS_FIELDS) \
+		--network-plugin azure \
+		--nodepool-name nodepool1 \
```

What is replacing this field?
```bash
--allow-shared-key-access false \
--https-only true \
--min-tls-version TLS1_2 \
--tags SkipAutoDeleteTill=2032-12-31 \
```

This seems extreme. What is the thought process behind this?
| echo "==> Assigning Storage Blob Data Contributor role to service principal" | ||
| SP_OBJECT_ID=$(az ad signed-in-user show --query id -o tsv 2>/dev/null || az account show --query user.name -o tsv) | ||
| SA_SCOPE="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.Storage/storageAccounts/${SA}" | ||
|
|
||
| az role assignment create \ | ||
| --assignee "$SP_OBJECT_ID" \ | ||
| --role "Storage Blob Data Contributor" \ | ||
| --scope "$SA_SCOPE" \ | ||
| --output none \ | ||
| && echo "[OK] RBAC role assigned to service principal for $SA" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do we clean up the role assignments?
| echo "Waiting 2 minutes for pods to fully start and HTTP servers to be ready..." | ||
| sleep 120 | ||
| echo "Wait period complete, proceeding with connectivity tests" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Historically when we have relied on sleep it has resulted in CI/CD failures.
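A possible alternative, sketched assuming the test pods define a readinessProbe and carry a hypothetical app=datapath-test label (neither is confirmed by this PR):

```bash
# Hypothetical label and namespace; fails fast if pods never become Ready,
# instead of proceeding blindly after a fixed sleep.
kubectl wait --for=condition=Ready pod \
  -l app=datapath-test \
  -n "$TEST_NAMESPACE" \
  --timeout=300s
```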
```yaml
# ------------------------------------------------------------
# Job 3: Networking & Storage
# ------------------------------------------------------------
- job: NetworkingAndStorage
  timeoutInMinutes: 0
```

So, you never want this job to time out? The max is 6 hours no matter what you set.
Run locally against existing infrastructure:

```bash
export RG="sv2-long-run-centraluseuap"  # Match your resource group
```

This RG naming is way beyond the max cap I typically see. Is there not a managed cluster that gets paired with this?
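For reference, a hypothetical local invocation of the suite against that RG (the build tag name is an assumption; the PR only says ginkgo build tags were added, and the package path comes from the files listed above):

```bash
export RG="sv2-long-run-centraluseuap"   # match your resource group
# Build tag name is illustrative, not confirmed by the PR.
go test -tags=long_running -timeout 60m \
  ./test/integration/swiftv2/longRunningCluster/... \
  -ginkgo.v
```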
```yaml
branches:
  include:
    - sv2-long-running-pipeline-stage2
```

Is the intent to have a separate CI/CD branch for these long-running tests?
| echo "Provisioning finished with state: $state" | ||
| break | ||
| fi | ||
| sleep 6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we look at leveraging another option besides sleep
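One option, sketched with a hypothetical $RESOURCE_ID variable: let the CLI poll the provisioning state itself rather than hand-rolling a sleep loop:

```bash
# az handles polling internally; --custom takes a JMESPath condition.
az resource wait \
  --ids "$RESOURCE_ID" \
  --custom "properties.provisioningState=='Succeeded'" \
  --interval 10 \
  --timeout 1800
```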
Pipeline to run repeated tests on long-running SwiftV2 AKS clusters.
Test pipeline: tests are scheduled to run every 3 hours in Central US EUAP. Link to Pipeline
Recent test run
Testing Approach
Test Lifecycle (per stage):
1. Create 8 pod scenarios with PodNetwork, PodNetworkInstance, and Pods
2. Run 9 connectivity tests (HTTP-based)
3. Run private endpoint tests (storage access)
4. Delete all resources in two phases (Phase 1: Pods; Phase 2: PNI/PN/Namespaces), as sketched below
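A minimal sketch of that two-phase teardown (the namespace variable and timeouts are illustrative; the PodNetwork/PodNetworkInstance kinds come from the manifests in this PR, but their plural resource names here are assumptions):

```bash
# Phase 1: delete pods first so their IP reservations are released.
kubectl delete pods --all -n "$TEST_NAMESPACE" --timeout=120s \
  || kubectl delete pods --all -n "$TEST_NAMESPACE" --grace-period=0 --force
# Phase 2: with pods gone, PNI deletion should not hit ReservationInUse.
kubectl delete podnetworkinstances --all -n "$TEST_NAMESPACE" --timeout=60s
kubectl delete podnetworks --all --timeout=60s
kubectl delete namespace "$TEST_NAMESPACE" --timeout=120s
```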
Node Selection:
- Tests filter nodes by the workload-type=$WORKLOAD_TYPE and nic-capacity labels (see the selector sketch below)
- Ensures isolation between different workload-type stages
- Currently: WORKLOAD_TYPE=swiftv2-linux
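For illustration, a hypothetical node query using those labels (the nic-capacity value is made up; only the label keys come from this PR):

```bash
# Label keys are from the PR description; the capacity value is illustrative.
kubectl get nodes -l "workload-type=swiftv2-linux,nic-capacity=8" -o name
```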
Files Changed
Pipeline Configuration
pipeline.yaml: Main pipeline with schedule trigger
long-running-pipeline-template.yaml: Stage definitions with VM SKU constants
Setup Scripts
create_aks.sh: AKS cluster creation with node labeling
create_vnets.sh: Customer VNet creation
create_peerings.sh: VNet peering mesh
create_storage.sh: Storage accounts with public access disabled (SA1 only)
create_nsg.sh: NSG rule application with retry logic
create_pe.sh: Private endpoint and DNS zone setup
Test Code
datapath.go: Enhanced with node label filtering, private endpoint testing
datapath_create_test.go: Resource creation scenarios
datapath_connectivity_test.go: HTTP connectivity validation
datapath_private_endpoint_test.go: Private endpoint access/isolation tests
datapath_delete_test.go: Resource cleanup
Documentation
README.md:
Reason for Change:
Issue Fixed:
Requirements:
Notes: