From 896ff1f429ef54f943b384dccbe9d723b9b39a83 Mon Sep 17 00:00:00 2001
From: sivakami
Date: Sat, 22 Nov 2025 15:52:19 -0800
Subject: [PATCH 01/64] Add SwiftV2 long-running pipeline with scheduled tests

- Implemented scheduled pipeline running every 1 hour with persistent infrastructure
- Split test execution into 2 jobs: Create (with 20min wait) and Delete
- Added 8 test scenarios across 2 AKS clusters, 4 VNets, different subnets
- Implemented two-phase deletion strategy to prevent PNI ReservationInUse errors
- Added context timeouts on kubectl commands with force delete fallbacks
- Resource naming uses RG name as BUILD_ID for uniqueness across parallel setups
- Added SkipAutoDeleteTill tags to prevent automatic resource cleanup
- Conditional setup stages controlled by runSetupStages parameter
- Auto-generate RG name from location or allow custom names for parallel setups
- Added comprehensive README with setup instructions and troubleshooting
- Node selection by agentpool labels with usage tracking to prevent conflicts
- Kubernetes naming compliance (RFC 1123) for all resources

fix ginkgo flag. Add datapath tests. Delete old test file. Add testcases for private endpoint. Ginkgo run specs only on specified files. update pipeline params. Add ginkgo tags. Add datapath tests. Add ginkgo build tags. remove wait time. set namespace. update pod image. Add more nsg rules to block subnets s1 and s2. test change. Change delegated subnet address range. Use delegated interface for network connectivity tests. Datapath test between clusters. test. test private endpoints. fix private endpoint tests. Set storage account names in output var. set storage account name. fix pn names. update pe. update pe test. update sas token generation. Add node labels for sw2 scenario, cleanup pods on any test failure. enable nsg tests. update storage. Add rules to nsg. disable private endpoint negative test. disable public network access on storage account with private endpoint. wait for default nsg to be created. disable negative test on private endpoint. private endpoint depends on aks cluster vnets, change pipeline job dependencies. Add node labels for each workload type and nic capacity. make sku constant. Update readme, set schedule for long running cluster on test branch.
--- .pipelines/swiftv2-long-running/README.md | 661 +++++++++++++++++ .pipelines/swiftv2-long-running/pipeline.yaml | 43 +- .../scripts/create_aks.sh | 144 ++-- .../scripts/create_nsg.sh | 184 +++-- .../swiftv2-long-running/scripts/create_pe.sh | 57 +- .../scripts/create_storage.sh | 42 ++ .../scripts/create_vnets.sh | 160 ++-- .../long-running-pipeline-template.yaml | 331 ++++++++- go.mod | 18 +- go.sum | 34 +- hack/aks/Makefile | 15 +- .../swiftv2/long-running-cluster/pod.yaml | 73 ++ .../long-running-cluster/podnetwork.yaml | 15 + .../podnetworkinstance.yaml | 13 + .../integration/swiftv2/helpers/az_helpers.go | 343 +++++++++ .../swiftv2/longRunningCluster/datapath.go | 690 ++++++++++++++++++ .../datapath_connectivity_test.go | 165 +++++ .../datapath_create_test.go | 118 +++ .../datapath_delete_test.go | 117 +++ .../datapath_private_endpoint_test.go | 150 ++++ 20 files changed, 3164 insertions(+), 209 deletions(-) create mode 100644 .pipelines/swiftv2-long-running/README.md mode change 100644 => 100755 .pipelines/swiftv2-long-running/scripts/create_nsg.sh create mode 100644 test/integration/manifests/swiftv2/long-running-cluster/pod.yaml create mode 100644 test/integration/manifests/swiftv2/long-running-cluster/podnetwork.yaml create mode 100644 test/integration/manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml create mode 100644 test/integration/swiftv2/helpers/az_helpers.go create mode 100644 test/integration/swiftv2/longRunningCluster/datapath.go create mode 100644 test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go create mode 100644 test/integration/swiftv2/longRunningCluster/datapath_create_test.go create mode 100644 test/integration/swiftv2/longRunningCluster/datapath_delete_test.go create mode 100644 test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go diff --git a/.pipelines/swiftv2-long-running/README.md b/.pipelines/swiftv2-long-running/README.md new file mode 100644 index 0000000000..b513dcab00 --- /dev/null +++ b/.pipelines/swiftv2-long-running/README.md @@ -0,0 +1,661 @@ +# SwiftV2 Long-Running Pipeline + +This pipeline tests SwiftV2 pod networking in a persistent environment with scheduled test runs. + +## Architecture Overview + +**Infrastructure (Persistent)**: +- **2 AKS Clusters**: aks-1, aks-2 (4 nodes each: 2 low-NIC default pool, 2 high-NIC nplinux pool) +- **4 VNets**: cx_vnet_a1, cx_vnet_a2, cx_vnet_a3 (Customer 1 with PE to storage), cx_vnet_b1 (Customer 2) +- **VNet Peerings**: vnet mesh. +- **Storage Account**: With private endpoint from cx_vnet_a1 +- **NSGs**: Restricting traffic between subnets (s1, s2) in vnet cx_vnet_a1. 
+- **Node Labels**: All nodes labeled with `workload-type` and `nic-capacity` for targeted test execution + +**Test Scenarios (8 total per workload type)**: +- Multiple pods across 2 clusters, 4 VNets, different subnets (s1, s2), and node types (low-NIC, high-NIC) +- Each test run: Create all resources → Wait 20 minutes → Delete all resources +- Tests run automatically every 1 hour via scheduled trigger + +**Multi-Stage Workload Testing**: +- Tests are organized by workload type using node label `workload-type` +- Each workload type runs as a separate stage sequentially +- Current implementation: `swiftv2-linux` (Stage: ManagedNodeDataPathTests) +- Future stages can be added for different workload types (e.g., `swiftv2-l1vhaccelnet`, `swiftv2-linuxbyon`) +- Each stage uses the same infrastructure but targets different labeled nodes + +## Pipeline Modes + +### Resource Group Naming Conventions + +The pipeline uses **strict naming conventions** for resource groups to ensure proper organization and lifecycle management: + +**1. Production Scheduled Runs (Master/Main Branch)**: +``` +Pattern: sv2-long-run- +Examples: sv2-long-run-centraluseuap, sv2-long-run-eastus, sv2-long-run-westus2 +``` +- **When to use**: Creating infrastructure for scheduled automated tests on master/main branch +- **Purpose**: Long-running persistent infrastructure for continuous validation +- **Lifecycle**: Persistent (tagged with `SkipAutoDeleteTill=2032-12-31`) +- **Example**: If running scheduled tests in Central US EUAP region, use `sv2-long-run-centraluseuap` + +**2. Test/Development/PR Validation Runs**: +``` +Pattern: sv2-long-run-$(Build.BuildId) +Examples: sv2-long-run-12345, sv2-long-run-67890 +``` +- **When to use**: Temporary testing, one-time validation, or PR testing +- **Purpose**: Short-lived infrastructure for specific test runs +- **Lifecycle**: Can be cleaned up after testing completes +- **Example**: PR validation run with Build ID 12345 → `sv2-long-run-12345` + +**3. Parallel/Custom Environments**: +``` +Pattern: sv2-long-run-- +Examples: sv2-long-run-centraluseuap-dev, sv2-long-run-eastus-staging +``` +- **When to use**: Parallel environments, feature testing, version upgrades +- **Purpose**: Isolated environment alongside production +- **Lifecycle**: Persistent or temporary based on use case +- **Example**: Development environment in Central US EUAP → `sv2-long-run-centraluseuap-dev` + +**Important Notes**: +- ⚠️ Always follow the naming pattern for scheduled runs on master: `sv2-long-run-` +- ⚠️ Do not use build IDs for production scheduled infrastructure (it breaks continuity) +- ⚠️ Region name should match the `location` parameter for consistency +- ✅ All resource names within the setup use the resource group name as BUILD_ID prefix + +### Mode 1: Scheduled Test Runs (Default) +**Trigger**: Automated cron schedule every 1 hour +**Purpose**: Continuous validation of long-running infrastructure +**Setup Stages**: Disabled +**Test Duration**: ~30-40 minutes per run +**Resource Group**: Static (default: `sv2-long-run-`, e.g., `sv2-long-run-centraluseuap`) + +```yaml +# Runs automatically every 1 hour +# No manual/external triggers allowed +``` + +### Mode 2: Initial Setup or Rebuild +**Trigger**: Manual run with parameter change +**Purpose**: Create new infrastructure or rebuild existing +**Setup Stages**: Enabled via `runSetupStages: true` +**Resource Group**: Must follow naming conventions (see below) + +**To create new infrastructure for scheduled runs on master branch**: +1. 
Go to Pipeline → Run pipeline +2. Set `runSetupStages` = `true` +3. Set `resourceGroupName` = `sv2-long-run-` (e.g., `sv2-long-run-centraluseuap`) + - **Critical**: Use this exact naming pattern for production scheduled tests + - Region should match the `location` parameter +4. Optionally adjust `location` to match your resource group name +5. Run pipeline + +**To create new infrastructure for testing/development**: +1. Go to Pipeline → Run pipeline +2. Set `runSetupStages` = `true` +3. Set `resourceGroupName` = `sv2-long-run-$(Build.BuildId)` or custom name + - For temporary testing: Use build ID pattern for auto-cleanup + - For parallel environments: Use descriptive suffix (e.g., `sv2-long-run-centraluseuap-dev`) +4. Optionally adjust `location` +5. Run pipeline + +## Pipeline Parameters + +Parameters are organized by usage: + +### Common Parameters (Always Relevant) +| Parameter | Default | Description | +|-----------|---------|-------------| +| `location` | `centraluseuap` | Azure region for resources. Auto-generates RG name: `sv2-long-run-`. | +| `runSetupStages` | `false` | Set to `true` to create new infrastructure. `false` for scheduled test runs. | +| `subscriptionId` | `37deca37-...` | Azure subscription ID. | +| `serviceConnection` | `Azure Container Networking...` | Azure DevOps service connection. | + +### Setup-Only Parameters (Only Used When runSetupStages=true) + +| Parameter | Default | Description | +|-----------|---------|-------------| +| `resourceGroupName` | `""` (empty) | **Leave empty** to auto-generate based on usage pattern. See Resource Group Naming Conventions below. | + +**Resource Group Naming Conventions**: +- **For scheduled runs on master/main branch**: Use `sv2-long-run-` (e.g., `sv2-long-run-centraluseuap`) + - This ensures consistent naming for production scheduled tests + - Example: Creating infrastructure in `centraluseuap` for scheduled runs → `sv2-long-run-centraluseuap` +- **For test/dev runs or PR validation**: Use `sv2-long-run-$(Build.BuildId)` + - Auto-cleanup after testing + - Example: `sv2-long-run-12345` (where 12345 is the build ID) +- **For parallel environments**: Use descriptive suffix (e.g., `sv2-long-run-centraluseuap-dev`, `sv2-long-run-eastus-staging`) + +**Note**: VM SKUs are hardcoded as constants in the pipeline template: +- Default nodepool: `Standard_D4s_v3` (low-nic capacity, 1 NIC) +- NPLinux nodepool: `Standard_D16s_v3` (high-nic capacity, 7 NICs) + +Setup-only parameters are ignored when `runSetupStages=false` (scheduled runs). + +## Pipeline Stage Organization + +The pipeline is organized into stages based on workload type, allowing sequential testing of different node configurations using the same infrastructure. + +### Stage 1: AKS Cluster and Networking Setup (Conditional) +**Runs when**: `runSetupStages=true` +**Purpose**: One-time infrastructure creation +**Creates**: AKS clusters, VNets, peerings, storage accounts, NSGs, private endpoints, node labels + +### Stage 2: ManagedNodeDataPathTests (Current) +**Workload Type**: `swiftv2-linux` +**Node Label Filter**: `workload-type=swiftv2-linux` +**Jobs**: +1. Create Test Resources (8 pod scenarios) +2. Connectivity Tests (9 test cases) +3. Private Endpoint Tests (5 test cases) +4. 
Delete Test Resources (cleanup) + +**Node Selection**: +- Tests automatically filter nodes by `workload-type=swiftv2-linux` AND `nic-capacity` labels +- Environment variable `WORKLOAD_TYPE=swiftv2-linux` is set for this stage +- Ensures tests only run on nodes designated for this workload type + +### Future Stages (Planned Architecture) +Additional stages can be added to test different workload types sequentially: + +**Example: Stage 3 - BYONodeDataPathTests** +```yaml +- stage: BYONodeDataPathTests + displayName: "SwiftV2 Data Path Tests - BYO Node ID" + dependsOn: ManagedNodeDataPathTests + variables: + WORKLOAD_TYPE: "swiftv2-byonodeid" + # Same job structure as ManagedNodeDataPathTests + # Tests run on nodes labeled: workload-type=swiftv2-byonodeid +``` + +**Example: Stage 4 - WindowsNodeDataPathTests** +```yaml +- stage: WindowsNodeDataPathTests + displayName: "SwiftV2 Data Path Tests - Windows Nodes" + dependsOn: BYONodeDataPathTests + variables: + WORKLOAD_TYPE: "swiftv2-windows" + # Same job structure + # Tests run on nodes labeled: workload-type=swiftv2-windows +``` + +**Benefits of Stage-Based Architecture**: +- ✅ Sequential execution: Each workload type tested independently +- ✅ Isolated node pools: No resource contention between workload types +- ✅ Same infrastructure: All stages use the same VNets, storage, NSGs +- ✅ Same test suite: Connectivity and private endpoint tests run for each workload type +- ✅ Easy extensibility: Add new stages without modifying existing ones +- ✅ Clear results: Separate test results per workload type + +**Node Labeling for Multiple Workload Types**: +Each node pool gets labeled with its designated workload type during setup: +```bash +# During cluster creation or node pool addition: +kubectl label nodes -l agentpool=nodepool1 workload-type=swiftv2-linux +kubectl label nodes -l agentpool=byonodepool workload-type=swiftv2-byonodeid +kubectl label nodes -l agentpool=winnodepool workload-type=swiftv2-windows +``` + +## How It Works + +### Scheduled Test Flow +Every 1 hour, the pipeline: +1. Skips setup stages (infrastructure already exists) +2. **Job 1 - Create Resources**: Creates 8 test scenarios (PodNetwork, PNI, Pods with HTTP servers on port 8080) +3. **Job 2 - Connectivity Tests**: Tests HTTP connectivity between pods (9 test cases), then waits 20 minutes +4. **Job 3 - Private Endpoint Tests**: Tests private endpoint access and tenant isolation (5 test cases) +5. **Job 4 - Delete Resources**: Deletes all test resources (Phase 1: Pods, Phase 2: PNI/PN/Namespaces) +6. 
Reports results + +**Connectivity Tests (9 scenarios)**: + +| Test | Source → Destination | Expected Result | Purpose | +|------|---------------------|-----------------|---------| +| SameVNetSameSubnet | pod-c1-aks1-a1s2-low → pod-c1-aks1-a1s2-high | ✓ Success | Basic connectivity in same subnet | +| NSGBlocked_S1toS2 | pod-c1-aks1-a1s1-low → pod-c1-aks1-a1s2-high | ✗ Blocked | NSG rule blocks s1→s2 in cx_vnet_a1 | +| NSGBlocked_S2toS1 | pod-c1-aks1-a1s2-low → pod-c1-aks1-a1s1-low | ✗ Blocked | NSG rule blocks s2→s1 (bidirectional) | +| DifferentVNetSameCustomer | pod-c1-aks1-a2s1-high → pod-c1-aks2-a2s1-low | ✓ Success | Cross-cluster, same customer VNet | +| PeeredVNets | pod-c1-aks1-a1s2-low → pod-c1-aks1-a2s1-high | ✓ Success | Peered VNets (a1 ↔ a2) | +| PeeredVNets_A2toA3 | pod-c1-aks1-a2s1-high → pod-c1-aks2-a3s1-high | ✓ Success | Peered VNets across clusters | +| DifferentCustomers_A1toB1 | pod-c1-aks1-a1s2-low → pod-c2-aks2-b1s1-low | ✗ Blocked | Customer isolation (C1 → C2) | +| DifferentCustomers_A2toB1 | pod-c1-aks1-a2s1-high → pod-c2-aks2-b1s1-high | ✗ Blocked | Customer isolation (C1 → C2) | + +**Test Results**: 4 should succeed, 5 should be blocked (3 NSG rules + 2 customer isolation) + +**Private Endpoint Tests (5 scenarios)**: + +| Test | Source → Destination | Expected Result | Purpose | +|------|---------------------|-----------------|---------| +| TenantA_VNetA1_S1_to_StorageA | pod-c1-aks1-a1s1-low → Storage-A | ✓ Success | Tenant A pod can access Storage-A via private endpoint | +| TenantA_VNetA1_S2_to_StorageA | pod-c1-aks1-a1s2-low → Storage-A | ✓ Success | Tenant A pod can access Storage-A via private endpoint | +| TenantA_VNetA2_to_StorageA | pod-c1-aks1-a2s1-high → Storage-A | ✓ Success | Tenant A pod from peered VNet can access Storage-A | +| TenantA_VNetA3_to_StorageA | pod-c1-aks2-a3s1-high → Storage-A | ✓ Success | Tenant A pod from different cluster can access Storage-A | +| TenantB_to_StorageA_Isolation | pod-c2-aks2-b1s1-low → Storage-A | ✗ Blocked | Tenant B pod CANNOT access Storage-A (tenant isolation) | + +**Test Results**: 4 should succeed, 1 should be blocked (tenant isolation) + +## Test Case Details + +### 8 Pod Scenarios (Created in Job 1) + +All test scenarios create the following resources: +- **PodNetwork**: Defines the network configuration for a VNet/subnet combination +- **PodNetworkInstance**: Instance-level configuration with IP allocation +- **Pod**: Test pod running nicolaka/netshoot with HTTP server on port 8080 + +| # | Scenario | Cluster | VNet | Subnet | Node Type | Pod Name | Purpose | +|---|----------|---------|------|--------|-----------|----------|---------| +| 1 | Customer2-AKS2-VnetB1-S1-LowNic | aks-2 | cx_vnet_b1 | s1 | low-nic | pod-c2-aks2-b1s1-low | Tenant B pod for isolation testing | +| 2 | Customer2-AKS2-VnetB1-S1-HighNic | aks-2 | cx_vnet_b1 | s1 | high-nic | pod-c2-aks2-b1s1-high | Tenant B pod on high-NIC node | +| 3 | Customer1-AKS1-VnetA1-S1-LowNic | aks-1 | cx_vnet_a1 | s1 | low-nic | pod-c1-aks1-a1s1-low | Tenant A pod in NSG-protected subnet | +| 4 | Customer1-AKS1-VnetA1-S2-LowNic | aks-1 | cx_vnet_a1 | s2 | low-nic | pod-c1-aks1-a1s2-low | Tenant A pod for NSG isolation test | +| 5 | Customer1-AKS1-VnetA1-S2-HighNic | aks-1 | cx_vnet_a1 | s2 | high-nic | pod-c1-aks1-a1s2-high | Tenant A pod on high-NIC node | +| 6 | Customer1-AKS1-VnetA2-S1-HighNic | aks-1 | cx_vnet_a2 | s1 | high-nic | pod-c1-aks1-a2s1-high | Tenant A pod in peered VNet | +| 7 | Customer1-AKS2-VnetA2-S1-LowNic | aks-2 | cx_vnet_a2 | s1 | 
low-nic | pod-c1-aks2-a2s1-low | Cross-cluster same VNet test | +| 8 | Customer1-AKS2-VnetA3-S1-HighNic | aks-2 | cx_vnet_a3 | s1 | high-nic | pod-c1-aks2-a3s1-high | Private endpoint access test | + +### Connectivity Tests (9 Test Cases in Job 2) + +Tests HTTP connectivity between pods using curl with 5-second timeout: + +**Expected to SUCCEED (4 tests)**: + +| Test | Source → Destination | Validation | Purpose | +|------|---------------------|------------|---------| +| SameVNetSameSubnet | pod-c1-aks1-a1s2-low → pod-c1-aks1-a1s2-high | HTTP 200 | Basic same-subnet connectivity | +| DifferentVNetSameCustomer | pod-c1-aks1-a2s1-high → pod-c1-aks2-a2s1-low | HTTP 200 | Cross-cluster, same VNet (a2) | +| PeeredVNets | pod-c1-aks1-a1s2-low → pod-c1-aks1-a2s1-high | HTTP 200 | VNet peering (a1 ↔ a2) | +| PeeredVNets_A2toA3 | pod-c1-aks1-a2s1-high → pod-c1-aks2-a3s1-high | HTTP 200 | VNet peering across clusters | + +**Expected to FAIL (5 tests)**: + +| Test | Source → Destination | Expected Error | Purpose | +|------|---------------------|----------------|---------| +| NSGBlocked_S1toS2 | pod-c1-aks1-a1s1-low → pod-c1-aks1-a1s2-high | Connection timeout | NSG blocks s1→s2 in cx_vnet_a1 | +| NSGBlocked_S2toS1 | pod-c1-aks1-a1s2-low → pod-c1-aks1-a1s1-low | Connection timeout | NSG blocks s2→s1 (bidirectional) | +| DifferentCustomers_A1toB1 | pod-c1-aks1-a1s2-low → pod-c2-aks2-b1s1-low | Connection timeout | Customer isolation (no peering) | +| DifferentCustomers_A2toB1 | pod-c1-aks1-a2s1-high → pod-c2-aks2-b1s1-high | Connection timeout | Customer isolation (no peering) | +| UnpeeredVNets_A3toB1 | pod-c1-aks2-a3s1-high → pod-c2-aks2-b1s1-low | Connection timeout | No peering between a3 and b1 | + +**NSG Rules Configuration**: +- cx_vnet_a1 has NSG rules blocking traffic between s1 and s2 subnets: + - Deny outbound from s1 to s2 (priority 100) + - Deny inbound from s1 to s2 (priority 110) + - Deny outbound from s2 to s1 (priority 100) + - Deny inbound from s2 to s1 (priority 110) + +### Private Endpoint Tests (5 Test Cases in Job 3) + +Tests access to Azure Storage Account via Private Endpoint with public network access disabled: + +**Expected to SUCCEED (4 tests)**: + +| Test | Source → Storage | Validation | Purpose | +|------|-----------------|------------|---------| +| TenantA_VNetA1_S1_to_StorageA | pod-c1-aks1-a1s1-low → Storage-A | Blob download via SAS | Access via private endpoint from VNet A1 | +| TenantA_VNetA1_S2_to_StorageA | pod-c1-aks1-a1s2-low → Storage-A | Blob download via SAS | Access via private endpoint from VNet A1 | +| TenantA_VNetA2_to_StorageA | pod-c1-aks1-a2s1-high → Storage-A | Blob download via SAS | Access via peered VNet (A2 peered with A1) | +| TenantA_VNetA3_to_StorageA | pod-c1-aks2-a3s1-high → Storage-A | Blob download via SAS | Access via peered VNet from different cluster | + +**Expected to FAIL (1 test)**: + +| Test | Source → Storage | Expected Error | Purpose | +|------|-----------------|----------------|---------| +| TenantB_to_StorageA_Isolation | pod-c2-aks2-b1s1-low → Storage-A | Connection timeout/failed | Tenant isolation - no private endpoint access, public blocked | + +**Private Endpoint Configuration**: +- Private endpoint created in cx_vnet_a1 subnet 'pe' +- Private DNS zone `privatelink.blob.core.windows.net` linked to: + - cx_vnet_a1, cx_vnet_a2, cx_vnet_a3 (Tenant A VNets) + - aks-1 and aks-2 cluster VNets +- Storage Account 1 (Tenant A): + - Public network access: **Disabled** + - Shared key access: Disabled (Azure AD only) + - Blob public 
access: Disabled +- Storage Account 2 (Tenant B): Public access enabled (for future tests) + +**Test Flow**: +1. DNS resolution: Storage FQDN resolves to private IP for Tenant A, fails/public IP for Tenant B +2. Generate SAS token: Azure AD authentication via management plane +3. Download blob: Using curl with SAS token via data plane +4. Validation: Verify blob content matches expected value + +### Resource Creation Patterns + +**Naming Convention**: +``` +BUILD_ID = + +PodNetwork: pn--- +PodNetworkInstance: pni--- +Namespace: pn--- +Pod: pod- +``` + +**Example** (for `resourceGroupName=sv2-long-run-centraluseuap`): +``` +pn-sv2-long-run-centraluseuap-a1-s1 +pni-sv2-long-run-centraluseuap-a1-s1 +pn-sv2-long-run-centraluseuap-a1-s1 (namespace) +pod-c1-aks1-a1s1-low +``` + +**VNet Name Simplification**: +- `cx_vnet_a1` → `a1` +- `cx_vnet_a2` → `a2` +- `cx_vnet_a3` → `a3` +- `cx_vnet_b1` → `b1` + +### Setup Flow (When runSetupStages = true) +1. Create resource group with `SkipAutoDeleteTill=2032-12-31` tag +2. Create 2 AKS clusters with 2 node pools each (tagged for persistence) +3. Create 4 customer VNets with subnets and delegations (tagged for persistence) +4. Create VNet peerings +5. Create storage accounts with persistence tags +6. Create NSGs for subnet isolation +7. Run initial test (create → wait → delete) + +**All infrastructure resources are tagged with `SkipAutoDeleteTill=2032-12-31`** to prevent automatic cleanup by Azure subscription policies. + +## Resource Naming + +All test resources use the pattern: `-static-setup--` + +**Examples**: +- PodNetwork: `pn-static-setup-a1-s1` +- PodNetworkInstance: `pni-static-setup-a1-s1` +- Pod: `pod-c1-aks1-a1s1-low` +- Namespace: `pn-static-setup-a1-s1` + +VNet names are simplified: +- `cx_vnet_a1` → `a1` +- `cx_vnet_b1` → `b1` + +## Switching to a New Setup + +**Scenario**: You created a new setup in RG `sv2-long-run-eastus` and want scheduled runs to use it. + +**Steps**: +1. Go to Pipeline → Edit +2. Update location parameter default value: + ```yaml + - name: location + default: "centraluseuap" # Change this + ``` +3. Save and commit +4. RG name will automatically become `sv2-long-run-centraluseuap` + +Alternatively, manually trigger with the new location or override `resourceGroupName` directly. + +## Creating Multiple Test Setups + +**Use Case**: You want to create a new test environment without affecting the existing one (e.g., for testing different configurations, regions, or versions). + +**Steps**: +1. Go to Pipeline → Run pipeline +2. Set `runSetupStages` = `true` +3. **Set `resourceGroupName`** based on usage: + - **For scheduled runs on master/main branch**: `sv2-long-run-` (e.g., `sv2-long-run-centraluseuap`, `sv2-long-run-eastus`) + - Use this naming pattern for production scheduled tests + - **For test/dev runs**: `sv2-long-run-$(Build.BuildId)` or custom (e.g., `sv2-long-run-12345`) + - For temporary testing or PR validation + - **For parallel environments**: Custom with descriptive suffix (e.g., `sv2-long-run-centraluseuap-dev`, `sv2-long-run-centraluseuap-v2`) +4. Optionally adjust `location` +5. 
Run pipeline
+
+**After setup completes**:
+- The new infrastructure will be tagged with `SkipAutoDeleteTill=2032-12-31`
+- Resources are isolated by the unique resource group name
+- To run tests against the new setup, the scheduled pipeline would need to be updated with the new RG name
+
+**Example Scenarios**:
+| Scenario | Resource Group Name | Purpose | Naming Pattern |
+|----------|-------------------|---------|----------------|
+| Production scheduled (Central US EUAP) | `sv2-long-run-centraluseuap` | Daily scheduled tests on master | `sv2-long-run-<region>` |
+| Production scheduled (East US) | `sv2-long-run-eastus` | Regional scheduled testing on master | `sv2-long-run-<region>` |
+| Temporary test run | `sv2-long-run-12345` | One-time testing (Build ID: 12345) | `sv2-long-run-$(Build.BuildId)` |
+| Development environment | `sv2-long-run-centraluseuap-dev` | Development/testing | Custom with suffix |
+| Version upgrade testing | `sv2-long-run-centraluseuap-v2` | Parallel environment for upgrades | Custom with suffix |
+
+## Resource Naming
+
+The pipeline uses the **resource group name as the BUILD_ID** to ensure unique resource names per test setup. This allows multiple parallel test environments without naming collisions.
+
+**Generated Resource Names**:
+```
+BUILD_ID = <resource-group-name>
+
+PodNetwork:         pn-<BUILD_ID>-<vnet>-<subnet>
+PodNetworkInstance: pni-<BUILD_ID>-<vnet>-<subnet>
+Namespace:          pn-<BUILD_ID>-<vnet>-<subnet>
+Pod:                pod-<customer>-<cluster>-<vnet><subnet>-<nic-capacity>
+```
+
+**Example for `resourceGroupName=sv2-long-run-centraluseuap`**:
+```
+pn-sv2-long-run-centraluseuap-b1-s1   (PodNetwork for cx_vnet_b1, subnet s1)
+pni-sv2-long-run-centraluseuap-b1-s1  (PodNetworkInstance)
+pn-sv2-long-run-centraluseuap-a1-s1   (PodNetwork for cx_vnet_a1, subnet s1)
+pni-sv2-long-run-centraluseuap-a1-s2  (PodNetworkInstance for cx_vnet_a1, subnet s2)
+```
+
+**Example for different setup `resourceGroupName=sv2-long-run-eastus`**:
+```
+pn-sv2-long-run-eastus-b1-s1   (Different from centraluseuap setup)
+pni-sv2-long-run-eastus-b1-s1
+pn-sv2-long-run-eastus-a1-s1
+```
+
+This ensures **no collision** between different test setups running in parallel.
+
+## Deletion Strategy
+
+### Phase 1: Delete All Pods
+Deletes all pods across all scenarios first. This ensures IP reservations are released.
+
+```
+Deleting pod pod-c2-aks2-b1s1-low...
+Deleting pod pod-c2-aks2-b1s1-high...
+...
+```
+
+### Phase 2: Delete Shared Resources
+Groups resources by vnet/subnet/cluster and deletes PNI/PN/Namespace once per group.
+
+```
+Deleting PodNetworkInstance pni-static-setup-b1-s1...
+Deleting PodNetwork pn-static-setup-b1-s1...
+Deleting namespace pn-static-setup-b1-s1...
+```
+
+**Why**: Multiple pods can share the same PNI. Deleting PNI while pods exist causes "ReservationInUse" errors.
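+
+Below is a minimal sketch of the same two-phase order done by hand, including the unblock step described under Troubleshooting. It is illustrative only, not the pipeline's implementation (that lives in `datapath_delete_test.go`/`datapath.go`); the names follow the `BUILD_ID` pattern above, and it assumes `PodNetworkInstance` is namespaced, `PodNetwork` is cluster-scoped, and the singular resource names shown are what the CRDs expose.
+
+```bash
+BUILD_ID="sv2-long-run-centraluseuap"   # resource group name used as BUILD_ID
+NS="pn-${BUILD_ID}-b1-s1"               # namespace shared by the b1/s1 pods
+
+# Phase 1: delete pods first so their IP reservations are released.
+kubectl -n "$NS" delete pod pod-c2-aks2-b1s1-low --timeout=60s || \
+  kubectl -n "$NS" delete pod pod-c2-aks2-b1s1-low --force --grace-period=0
+sleep 10
+
+# Phase 2: delete the shared PNI/PN/namespace once per vnet/subnet group.
+kubectl -n "$NS" delete podnetworkinstance "pni-${BUILD_ID}-b1-s1" --timeout=120s
+
+# Recovery step if the PNI hangs with ReservationInUse: clear its finalizers.
+kubectl -n "$NS" patch podnetworkinstance "pni-${BUILD_ID}-b1-s1" \
+  --type=merge -p '{"metadata":{"finalizers":[]}}' || true
+
+kubectl delete podnetwork "pn-${BUILD_ID}-b1-s1" --timeout=120s
+kubectl delete namespace "$NS" --timeout=120s
+```
+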
+ +## Troubleshooting + +### Tests are running on wrong cluster +- Check `resourceGroupName` parameter points to correct RG +- Verify RG contains aks-1 and aks-2 clusters +- Check kubeconfig retrieval in logs + +### Setup stages not running +- Verify `runSetupStages` parameter is set to `true` +- Check condition: `condition: eq(${{ parameters.runSetupStages }}, true)` + +### Schedule not triggering +- Verify cron expression: `"0 */1 * * *"` (every 1 hour) +- Check branch in schedule matches your working branch +- Ensure `always: true` is set (runs even without code changes) + +### PNI stuck with "ReservationInUse" +- Check if pods were deleted first (Phase 1 logs) +- Manual fix: Delete pod → Wait 10s → Patch PNI to remove finalizers + +### Pipeline timeout after 6 hours +- This is expected behavior (timeoutInMinutes: 360) +- Tests should complete in ~30-40 minutes +- If tests hang, check deletion logs for stuck resources + +## Manual Testing + +Run locally against existing infrastructure: + +```bash +export RG="sv2-long-run-centraluseuap" # Match your resource group +export BUILD_ID="$RG" # Use same RG name as BUILD_ID for unique resource names + +cd test/integration/swiftv2/longRunningCluster +ginkgo -v -trace --timeout=6h . +``` + +## Node Pool Configuration + +### Node Labels and Architecture + +All nodes in the clusters are labeled with two key labels for workload identification and NIC capacity. These labels are applied during cluster creation by the `create_aks.sh` script. + +**1. Workload Type Label** (`workload-type`): +- Purpose: Identifies which test scenario group the node belongs to +- Current value: `swiftv2-linux` (applied to all nodes in current setup) +- Applied during: Cluster creation in Stage 1 (AKSClusterAndNetworking) +- Applied by: `.pipelines/swiftv2-long-running/scripts/create_aks.sh` +- Future use: Supports multiple workload types running as separate stages (e.g., `swiftv2-windows`, `swiftv2-byonodeid`) +- Stage isolation: Each test stage uses `WORKLOAD_TYPE` environment variable to filter nodes + +**2. 
NIC Capacity Label** (`nic-capacity`): +- Purpose: Identifies the NIC capacity tier of the node +- Applied during: Cluster creation in Stage 1 (AKSClusterAndNetworking) +- Applied by: `.pipelines/swiftv2-long-running/scripts/create_aks.sh` +- Values: + - `low-nic`: Default nodepool (nodepool1) with `Standard_D4s_v3` (1 NIC) + - `high-nic`: NPLinux nodepool (nplinux) with `Standard_D16s_v3` (7 NICs) + +**Label Application in create_aks.sh**: +```bash +# Step 1: All nodes get workload-type label +kubectl label nodes --all workload-type=swiftv2-linux --overwrite + +# Step 2: Default nodepool gets low-nic capacity label +kubectl label nodes -l agentpool=nodepool1 nic-capacity=low-nic --overwrite + +# Step 3: NPLinux nodepool gets high-nic capacity label +kubectl label nodes -l agentpool=nplinux nic-capacity=high-nic --overwrite +``` + +**Example Node Labels**: +```yaml +# Low-NIC node (nodepool1) +labels: + agentpool: nodepool1 + workload-type: swiftv2-linux + nic-capacity: low-nic + +# High-NIC node (nplinux) +labels: + agentpool: nplinux + workload-type: swiftv2-linux + nic-capacity: high-nic +``` + +### Node Selection in Tests + +Tests use these labels to select appropriate nodes dynamically: +- **Function**: `GetNodesByNicCount()` in `test/integration/swiftv2/longRunningCluster/datapath.go` +- **Filtering**: Nodes filtered by BOTH `workload-type` AND `nic-capacity` labels +- **Environment Variable**: `WORKLOAD_TYPE` (set by each test stage) determines which nodes are used + - Current: `WORKLOAD_TYPE=swiftv2-linux` in ManagedNodeDataPathTests stage + - Future: Different values for each stage (e.g., `swiftv2-byonodeid`, `swiftv2-windows`) +- **Selection Logic**: + ```go + // Get low-nic nodes with matching workload type + kubectl get nodes -l "nic-capacity=low-nic,workload-type=$WORKLOAD_TYPE" + + // Get high-nic nodes with matching workload type + kubectl get nodes -l "nic-capacity=high-nic,workload-type=$WORKLOAD_TYPE" + ``` +- **Pod Assignment**: + - Low-NIC nodes: Limited to 1 pod per node + - High-NIC nodes: Currently limited to 1 pod per node in test logic + +**Node Pool Configuration**: + +| Node Pool | VM SKU | NICs | Label | Pods per Node | +|-----------|--------|------|-------|---------------| +| nodepool1 (default) | `Standard_D4s_v3` | 1 | `nic-capacity=low-nic` | 1 | +| nplinux | `Standard_D16s_v3` | 7 | `nic-capacity=high-nic` | 1 (current test logic) | + +**Note**: VM SKUs are hardcoded as constants in the pipeline template and cannot be changed by users. 
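+
+As a quick manual sanity check of the labels (and to preview which nodes a stage would select), something along these lines can be run against either cluster. This is a sketch, not part of the pipeline; the kubeconfig path mirrors what `create_aks.sh` writes and is an assumption if credentials were fetched differently.
+
+```bash
+export KUBECONFIG=/tmp/aks-1.kubeconfig
+
+# Show the relevant labels as columns for every node.
+kubectl get nodes -L agentpool,workload-type,nic-capacity
+
+# Reproduce the selection the tests perform for the current stage.
+WORKLOAD_TYPE="swiftv2-linux"
+kubectl get nodes -l "nic-capacity=low-nic,workload-type=${WORKLOAD_TYPE}" -o name
+kubectl get nodes -l "nic-capacity=high-nic,workload-type=${WORKLOAD_TYPE}" -o name
+```
+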
## Schedule Modification
+
+To change test frequency, edit the cron schedule:
+
+```yaml
+schedules:
+  - cron: "0 */1 * * *"        # Every 1 hour (current)
+    # Examples:
+    # - cron: "0 */2 * * *"    # Every 2 hours
+    # - cron: "0 */6 * * *"    # Every 6 hours
+    # - cron: "0 0,8,16 * * *" # At 12am, 8am, 4pm
+    # - cron: "0 0 * * *"      # Daily at midnight
+```
+
+## File Structure
+
+```
+.pipelines/swiftv2-long-running/
+├── pipeline.yaml                              # Main pipeline with schedule
+├── README.md                                  # This file
+├── template/
+│   └── long-running-pipeline-template.yaml    # Stage definitions (4 jobs)
+└── scripts/
+    ├── create_aks.sh                          # AKS cluster creation
+    ├── create_vnets.sh                        # VNet and subnet creation
+    ├── create_peerings.sh                     # VNet peering setup
+    ├── create_storage.sh                      # Storage account creation
+    ├── create_nsg.sh                          # Network security groups
+    └── create_pe.sh                           # Private endpoint setup
+
+test/integration/swiftv2/
+├── helpers/
+│   └── az_helpers.go                          # Azure/kubectl helper functions
+└── longRunningCluster/
+    ├── datapath.go                            # Resource orchestration
+    ├── datapath_create_test.go                # Create test scenarios (Job 1)
+    ├── datapath_connectivity_test.go          # Connectivity tests (Job 2)
+    ├── datapath_private_endpoint_test.go      # Private endpoint tests (Job 3)
+    └── datapath_delete_test.go                # Delete test scenarios (Job 4)
+```
+
+## Best Practices
+
+1. **Keep infrastructure persistent**: Only recreate when necessary (cluster upgrades, config changes)
+2. **Monitor scheduled runs**: Set up alerts for test failures
+3. **Resource naming**: BUILD_ID is automatically set to the resource group name, ensuring unique resource names per setup
+4. **Tag resources appropriately**: All setup resources automatically tagged with `SkipAutoDeleteTill=2032-12-31`
+   - AKS clusters
+   - AKS VNets
+   - Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1)
+   - Storage accounts
+5. **Avoid resource group collisions**: Always use unique `resourceGroupName` when creating new setups
+6. **Document changes**: Update this README when modifying test scenarios or infrastructure
+
+## Resource Tags
+
+All infrastructure resources are automatically tagged during creation:
+
+```bash
+SkipAutoDeleteTill=2032-12-31
+```
+
+This prevents automatic cleanup by Azure subscription policies that delete resources after a certain period.
The tag is applied to: +- Resource group (via create_resource_group job) +- AKS clusters (aks-1, aks-2) +- AKS cluster VNets +- Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1) +- Storage accounts (sa1xxxx, sa2xxxx) + +To manually update the tag date: +```bash +az resource update --ids --set tags.SkipAutoDeleteTill=2033-12-31 +``` diff --git a/.pipelines/swiftv2-long-running/pipeline.yaml b/.pipelines/swiftv2-long-running/pipeline.yaml index b6d085901d..7abc3e1f79 100644 --- a/.pipelines/swiftv2-long-running/pipeline.yaml +++ b/.pipelines/swiftv2-long-running/pipeline.yaml @@ -1,4 +1,14 @@ trigger: none +pr: none + +# Schedule: Run every 1 hour +schedules: + - cron: "0 */3 * * *" # Every 3 hours at minute 0 + displayName: "Run tests every 3 hours" + branches: + include: + - sv2-long-running-pipeline-stage2 + always: true # Run even if there are no code changes parameters: - name: subscriptionId @@ -6,30 +16,26 @@ parameters: type: string default: "37deca37-c375-4a14-b90a-043849bd2bf1" + - name: serviceConnection + displayName: "Azure Service Connection" + type: string + default: "Azure Container Networking - Standalone Test Service Connection" + - name: location displayName: "Deployment Region" type: string default: "centraluseuap" - - name: resourceGroupName - displayName: "Resource Group Name" - type: string - default: "long-run-$(Build.BuildId)" - - - name: vmSkuDefault - displayName: "VM SKU for Default Node Pool" - type: string - default: "Standard_D2s_v3" - - - name: vmSkuHighNIC - displayName: "VM SKU for High NIC Node Pool" - type: string - default: "Standard_D16s_v3" + - name: runSetupStages + displayName: "Create New Infrastructure Setup" + type: boolean + default: false - - name: serviceConnection - displayName: "Azure Service Connection" + # Setup-only parameters (only used when runSetupStages=true) + - name: resourceGroupName + displayName: "Resource Group Name used when Create new Infrastructure Setup is selected" type: string - default: "Azure Container Networking - Standalone Test Service Connection" + default: "sv2-long-run-$(Build.BuildId)" extends: template: template/long-running-pipeline-template.yaml @@ -37,6 +43,5 @@ extends: subscriptionId: ${{ parameters.subscriptionId }} location: ${{ parameters.location }} resourceGroupName: ${{ parameters.resourceGroupName }} - vmSkuDefault: ${{ parameters.vmSkuDefault }} - vmSkuHighNIC: ${{ parameters.vmSkuHighNIC }} serviceConnection: ${{ parameters.serviceConnection }} + runSetupStages: ${{ parameters.runSetupStages }} diff --git a/.pipelines/swiftv2-long-running/scripts/create_aks.sh b/.pipelines/swiftv2-long-running/scripts/create_aks.sh index 4ab38c0f42..999a406900 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_aks.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_aks.sh @@ -7,57 +7,113 @@ RG=$3 VM_SKU_DEFAULT=$4 VM_SKU_HIGHNIC=$5 -CLUSTER_COUNT=2 -CLUSTER_PREFIX="aks" -DEFAULT_NODE_COUNT=1 -COMMON_TAGS="fastpathenabled=true RGOwner=LongRunningTestPipelines stampcreatorserviceinfo=true" - -wait_for_provisioning() { # Helper for safe retry/wait for provisioning states (basic) - local rg="$1" clusterName="$2" - echo "Waiting for AKS '$clusterName' in RG '$rg' to reach Succeeded/Failed (polling)..." 
+CLUSTER_COUNT=2 +CLUSTER_PREFIX="aks" + + +stamp_vnet() { + local vnet_id="$1" + + responseFile="response.txt" + modified_vnet="${vnet_id//\//%2F}" + cmd_stamp_curl="'curl -v -X PUT http://localhost:8080/VirtualNetwork/$modified_vnet/stampcreatorservicename'" + cmd_containerapp_exec="az containerapp exec -n subnetdelegator-westus-u3h4j -g subnetdelegator-westus --subscription 9b8218f9-902a-4d20-a65c-e98acec5362f --command $cmd_stamp_curl" + + max_retries=10 + sleep_seconds=15 + retry_count=0 + + while [[ $retry_count -lt $max_retries ]]; do + script --quiet -c "$cmd_containerapp_exec" "$responseFile" + if grep -qF "200 OK" "$responseFile"; then + echo "Subnet Delegator successfully stamped the vnet" + return 0 + else + echo "Subnet Delegator failed to stamp the vnet, attempt $((retry_count+1))" + cat "$responseFile" + retry_count=$((retry_count+1)) + sleep "$sleep_seconds" + fi + done + + echo "Failed to stamp the vnet even after $max_retries attempts" + exit 1 +} + +wait_for_provisioning() { + local rg="$1" clusterName="$2" + echo "Waiting for AKS '$clusterName' in RG '$rg'..." while :; do state=$(az aks show --resource-group "$rg" --name "$clusterName" --query provisioningState -o tsv 2>/dev/null || true) - if [ -z "$state" ]; then - sleep 3 - continue + if [[ "$state" =~ Succeeded ]]; then + echo "Provisioning state: $state" + break fi - case "$state" in - Succeeded|Succeeded*) echo "Provisioning state: $state"; break ;; - Failed|Canceled|Rejected) echo "Provisioning finished with state: $state"; break ;; - *) printf "."; sleep 6 ;; - esac + if [[ "$state" =~ Failed|Canceled ]]; then + echo "Provisioning finished with state: $state" + break + fi + sleep 6 done } +######################################### +# Main script starts here +######################################### + for i in $(seq 1 "$CLUSTER_COUNT"); do - echo "==============================" - echo " Working on cluster set #$i" - echo "==============================" - - CLUSTER_NAME="${CLUSTER_PREFIX}-${i}" - echo "Creating AKS cluster '$CLUSTER_NAME' in RG '$RG'" - - make -C ./hack/aks azcfg AZCLI=az REGION=$LOCATION - - make -C ./hack/aks swiftv2-podsubnet-cluster-up \ - AZCLI=az REGION=$LOCATION \ - SUB=$SUBSCRIPTION_ID \ - GROUP=$RG \ - CLUSTER=$CLUSTER_NAME \ - NODE_COUNT=$DEFAULT_NODE_COUNT \ - VM_SIZE=$VM_SKU_DEFAULT \ - - echo " - waiting for AKS provisioning state..." - wait_for_provisioning "$RG" "$CLUSTER_NAME" - - echo "Adding multi-tenant nodepool ' to '$CLUSTER_NAME'" - make -C ./hack/aks linux-swiftv2-nodepool-up \ - AZCLI=az REGION=$LOCATION \ - GROUP=$RG \ - VM_SIZE=$VM_SKU_HIGHNIC \ - CLUSTER=$CLUSTER_NAME \ - SUB=$SUBSCRIPTION_ID \ + echo "Creating cluster #$i..." 
+ CLUSTER_NAME="${CLUSTER_PREFIX}-${i}" + + make -C ./hack/aks azcfg AZCLI=az REGION=$LOCATION + + # Create cluster with SkipAutoDeleteTill tag for persistent infrastructure + make -C ./hack/aks swiftv2-podsubnet-cluster-up \ + AZCLI=az REGION=$LOCATION \ + SUB=$SUBSCRIPTION_ID \ + GROUP=$RG \ + CLUSTER=$CLUSTER_NAME \ + VM_SIZE=$VM_SKU_DEFAULT + + # Add SkipAutoDeleteTill tag to cluster (2032-12-31 for long-term persistence) + az aks update -g "$RG" -n "$CLUSTER_NAME" --tags SkipAutoDeleteTill=2032-12-31 || echo "Warning: Failed to add tag to cluster" + + wait_for_provisioning "$RG" "$CLUSTER_NAME" + + vnet_id=$(az network vnet show -g "$RG" --name "$CLUSTER_NAME" --query id -o tsv) + echo "Found VNET: $vnet_id" + + # Add SkipAutoDeleteTill tag to AKS VNet + az network vnet update --ids "$vnet_id" --set tags.SkipAutoDeleteTill=2032-12-31 || echo "Warning: Failed to add tag to vnet" + + stamp_vnet "$vnet_id" + + make -C ./hack/aks linux-swiftv2-nodepool-up \ + AZCLI=az REGION=$LOCATION \ + GROUP=$RG \ + VM_SIZE=$VM_SKU_HIGHNIC \ + CLUSTER=$CLUSTER_NAME \ + SUB=$SUBSCRIPTION_ID + + az aks get-credentials -g "$RG" -n "$CLUSTER_NAME" --admin --overwrite-existing \ + --file "/tmp/${CLUSTER_NAME}.kubeconfig" + + # Label all nodes with workload-type and nic-capacity labels + echo "==> Labeling all nodes in $CLUSTER_NAME with workload-type=swiftv2-linux" + kubectl --kubeconfig "/tmp/${CLUSTER_NAME}.kubeconfig" label nodes --all workload-type=swiftv2-linux --overwrite + echo "[OK] All nodes labeled with workload-type=swiftv2-linux" + + # Label default nodepool (nodepool1) with low-nic capacity + echo "==> Labeling default nodepool (nodepool1) nodes with nic-capacity=low-nic" + kubectl --kubeconfig "/tmp/${CLUSTER_NAME}.kubeconfig" label nodes -l agentpool=nodepool1 nic-capacity=low-nic --overwrite + echo "[OK] Default nodepool nodes labeled with nic-capacity=low-nic" + + # Label nplinux nodepool with high-nic capacity + echo "==> Labeling nplinux nodepool nodes with nic-capacity=high-nic" + kubectl --kubeconfig "/tmp/${CLUSTER_NAME}.kubeconfig" label nodes -l agentpool=nplinux nic-capacity=high-nic --overwrite + echo "[OK] nplinux nodepool nodes labeled with nic-capacity=high-nic" done -echo "All done. Created $CLUSTER_COUNT cluster set(s)." + +echo "All clusters complete." diff --git a/.pipelines/swiftv2-long-running/scripts/create_nsg.sh b/.pipelines/swiftv2-long-running/scripts/create_nsg.sh old mode 100644 new mode 100755 index cec91cd7cf..34c04f5c70 --- a/.pipelines/swiftv2-long-running/scripts/create_nsg.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_nsg.sh @@ -7,9 +7,59 @@ RG=$2 LOCATION=$3 VNET_A1="cx_vnet_a1" -SUBNET1_PREFIX="10.10.1.0/24" -SUBNET2_PREFIX="10.10.2.0/24" -NSG_NAME="${VNET_A1}-nsg" + +# Get actual subnet CIDR ranges dynamically +echo "==> Retrieving actual subnet address prefixes..." +SUBNET1_PREFIX=$(az network vnet subnet show -g "$RG" --vnet-name "$VNET_A1" -n s1 --query "addressPrefix" -o tsv) +SUBNET2_PREFIX=$(az network vnet subnet show -g "$RG" --vnet-name "$VNET_A1" -n s2 --query "addressPrefix" -o tsv) + +echo "Subnet s1 CIDR: $SUBNET1_PREFIX" +echo "Subnet s2 CIDR: $SUBNET2_PREFIX" + +if [[ -z "$SUBNET1_PREFIX" || -z "$SUBNET2_PREFIX" ]]; then + echo "[ERROR] Failed to retrieve subnet address prefixes!" >&2 + exit 1 +fi + +# Wait 5 minutes for NSGs to be associated with subnets +echo "==> Waiting 5 minutes for NSG associations to complete..." 
+sleep 300 + +# Get NSG IDs associated with each subnet with retry logic +echo "==> Retrieving NSGs associated with subnets..." +max_retries=10 +retry_count=0 +retry_delay=30 + +while [[ $retry_count -lt $max_retries ]]; do + NSG_S1_ID=$(az network vnet subnet show -g "$RG" --vnet-name "$VNET_A1" -n s1 --query "networkSecurityGroup.id" -o tsv 2>/dev/null || echo "") + NSG_S2_ID=$(az network vnet subnet show -g "$RG" --vnet-name "$VNET_A1" -n s2 --query "networkSecurityGroup.id" -o tsv 2>/dev/null || echo "") + + if [[ -n "$NSG_S1_ID" && -n "$NSG_S2_ID" ]]; then + echo "[OK] Successfully retrieved NSG associations for both subnets" + break + fi + + retry_count=$((retry_count + 1)) + if [[ $retry_count -lt $max_retries ]]; then + echo "[RETRY $retry_count/$max_retries] NSG associations not ready yet. Waiting ${retry_delay}s before retry..." + echo " Subnet s1 NSG ID: ${NSG_S1_ID:-}" + echo " Subnet s2 NSG ID: ${NSG_S2_ID:-}" + sleep $retry_delay + else + echo "[ERROR] Failed to retrieve NSG associations after $max_retries attempts!" >&2 + echo " Subnet s1 NSG ID: ${NSG_S1_ID:-}" >&2 + echo " Subnet s2 NSG ID: ${NSG_S2_ID:-}" >&2 + exit 1 + fi +done + +# Extract NSG names from IDs +NSG_S1_NAME=$(basename "$NSG_S1_ID") +NSG_S2_NAME=$(basename "$NSG_S2_ID") + +echo "Subnet s1 NSG: $NSG_S1_NAME" +echo "Subnet s2 NSG: $NSG_S2_NAME" verify_nsg() { local rg="$1"; local name="$2" @@ -33,77 +83,119 @@ verify_nsg_rule() { fi } -verify_subnet_nsg_association() { - local rg="$1"; local vnet="$2"; local subnet="$3"; local nsg="$4" - echo "==> Verifying NSG association on subnet $subnet..." - local associated_nsg - associated_nsg=$(az network vnet subnet show -g "$rg" --vnet-name "$vnet" -n "$subnet" --query "networkSecurityGroup.id" -o tsv 2>/dev/null || echo "") - if [[ "$associated_nsg" == *"$nsg"* ]]; then - echo "[OK] Verified subnet $subnet is associated with NSG $nsg." - else - echo "[ERROR] Subnet $subnet is NOT associated with NSG $nsg!" >&2 - exit 1 - fi +wait_for_nsg() { + local rg="$1"; local name="$2" + echo "==> Waiting for NSG $name to become available..." + local max_attempts=30 + local attempt=0 + while [[ $attempt -lt $max_attempts ]]; do + if az network nsg show -g "$rg" -n "$name" &>/dev/null; then + local provisioning_state + provisioning_state=$(az network nsg show -g "$rg" -n "$name" --query "provisioningState" -o tsv) + if [[ "$provisioning_state" == "Succeeded" ]]; then + echo "[OK] NSG $name is available (provisioningState: $provisioning_state)." + return 0 + fi + echo "Waiting... NSG $name provisioningState: $provisioning_state" + fi + attempt=$((attempt + 1)) + sleep 10 + done + echo "[ERROR] NSG $name did not become available within the expected time!" >&2 + exit 1 } # ------------------------------- -# 1. Create NSG +# 1. Wait for NSGs to be available # ------------------------------- -echo "==> Creating Network Security Group: $NSG_NAME" -az network nsg create -g "$RG" -n "$NSG_NAME" -l "$LOCATION" --output none \ - && echo "[OK] NSG '$NSG_NAME' created." -verify_nsg "$RG" "$NSG_NAME" +wait_for_nsg "$RG" "$NSG_S1_NAME" +wait_for_nsg "$RG" "$NSG_S2_NAME" # ------------------------------- -# 2. Create NSG Rules +# 2. 
Create NSG Rules on Subnet1's NSG # ------------------------------- -echo "==> Creating NSG rule to DENY traffic from Subnet1 ($SUBNET1_PREFIX) to Subnet2 ($SUBNET2_PREFIX)" +# Rule 1: Deny Outbound traffic FROM Subnet1 TO Subnet2 +echo "==> Creating NSG rule on $NSG_S1_NAME to DENY OUTBOUND traffic from Subnet1 ($SUBNET1_PREFIX) to Subnet2 ($SUBNET2_PREFIX)" az network nsg rule create \ --resource-group "$RG" \ - --nsg-name "$NSG_NAME" \ - --name deny-subnet1-to-subnet2 \ + --nsg-name "$NSG_S1_NAME" \ + --name deny-s1-to-s2-outbound \ --priority 100 \ --source-address-prefixes "$SUBNET1_PREFIX" \ --destination-address-prefixes "$SUBNET2_PREFIX" \ - --direction Inbound \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Outbound \ --access Deny \ --protocol "*" \ - --description "Deny all traffic from Subnet1 to Subnet2" \ + --description "Deny outbound traffic from Subnet1 to Subnet2" \ --output none \ - && echo "[OK] Deny rule from Subnet1 → Subnet2 created." + && echo "[OK] Deny outbound rule from Subnet1 → Subnet2 created on $NSG_S1_NAME." -verify_nsg_rule "$RG" "$NSG_NAME" "deny-subnet1-to-subnet2" +verify_nsg_rule "$RG" "$NSG_S1_NAME" "deny-s1-to-s2-outbound" -echo "==> Creating NSG rule to DENY traffic from Subnet2 ($SUBNET2_PREFIX) to Subnet1 ($SUBNET1_PREFIX)" +# Rule 2: Deny Inbound traffic FROM Subnet2 TO Subnet1 (for packets arriving at s1) +echo "==> Creating NSG rule on $NSG_S1_NAME to DENY INBOUND traffic from Subnet2 ($SUBNET2_PREFIX) to Subnet1 ($SUBNET1_PREFIX)" az network nsg rule create \ --resource-group "$RG" \ - --nsg-name "$NSG_NAME" \ - --name deny-subnet2-to-subnet1 \ - --priority 200 \ + --nsg-name "$NSG_S1_NAME" \ + --name deny-s2-to-s1-inbound \ + --priority 110 \ --source-address-prefixes "$SUBNET2_PREFIX" \ --destination-address-prefixes "$SUBNET1_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ --direction Inbound \ --access Deny \ --protocol "*" \ - --description "Deny all traffic from Subnet2 to Subnet1" \ + --description "Deny inbound traffic from Subnet2 to Subnet1" \ --output none \ - && echo "[OK] Deny rule from Subnet2 → Subnet1 created." + && echo "[OK] Deny inbound rule from Subnet2 → Subnet1 created on $NSG_S1_NAME." -verify_nsg_rule "$RG" "$NSG_NAME" "deny-subnet2-to-subnet1" +verify_nsg_rule "$RG" "$NSG_S1_NAME" "deny-s2-to-s1-inbound" # ------------------------------- -# 3. Associate NSG with Subnets +# 3. 
Create NSG Rules on Subnet2's NSG # ------------------------------- -for SUBNET in s1 s2; do - echo "==> Associating NSG $NSG_NAME with subnet $SUBNET" - az network vnet subnet update \ - --name "$SUBNET" \ - --vnet-name "$VNET_A1" \ - --resource-group "$RG" \ - --network-security-group "$NSG_NAME" \ - --output none - verify_subnet_nsg_association "$RG" "$VNET_A1" "$SUBNET" "$NSG_NAME" -done +# Rule 3: Deny Outbound traffic FROM Subnet2 TO Subnet1 +echo "==> Creating NSG rule on $NSG_S2_NAME to DENY OUTBOUND traffic from Subnet2 ($SUBNET2_PREFIX) to Subnet1 ($SUBNET1_PREFIX)" +az network nsg rule create \ + --resource-group "$RG" \ + --nsg-name "$NSG_S2_NAME" \ + --name deny-s2-to-s1-outbound \ + --priority 100 \ + --source-address-prefixes "$SUBNET2_PREFIX" \ + --destination-address-prefixes "$SUBNET1_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Outbound \ + --access Deny \ + --protocol "*" \ + --description "Deny outbound traffic from Subnet2 to Subnet1" \ + --output none \ + && echo "[OK] Deny outbound rule from Subnet2 → Subnet1 created on $NSG_S2_NAME." + +verify_nsg_rule "$RG" "$NSG_S2_NAME" "deny-s2-to-s1-outbound" + +# Rule 4: Deny Inbound traffic FROM Subnet1 TO Subnet2 (for packets arriving at s2) +echo "==> Creating NSG rule on $NSG_S2_NAME to DENY INBOUND traffic from Subnet1 ($SUBNET1_PREFIX) to Subnet2 ($SUBNET2_PREFIX)" +az network nsg rule create \ + --resource-group "$RG" \ + --nsg-name "$NSG_S2_NAME" \ + --name deny-s1-to-s2-inbound \ + --priority 110 \ + --source-address-prefixes "$SUBNET1_PREFIX" \ + --destination-address-prefixes "$SUBNET2_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Inbound \ + --access Deny \ + --protocol "*" \ + --description "Deny inbound traffic from Subnet1 to Subnet2" \ + --output none \ + && echo "[OK] Deny inbound rule from Subnet1 → Subnet2 created on $NSG_S2_NAME." + +verify_nsg_rule "$RG" "$NSG_S2_NAME" "deny-s1-to-s2-inbound" -echo "NSG '$NSG_NAME' created successfully with bidirectional isolation between Subnet1 and Subnet2." +echo "NSG rules applied successfully on $NSG_S1_NAME and $NSG_S2_NAME with bidirectional isolation between Subnet1 and Subnet2." diff --git a/.pipelines/swiftv2-long-running/scripts/create_pe.sh b/.pipelines/swiftv2-long-running/scripts/create_pe.sh index c9f7e782e0..4d83a8a700 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_pe.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_pe.sh @@ -57,7 +57,7 @@ az network private-dns zone create -g "$RG" -n "$PRIVATE_DNS_ZONE" --output none verify_dns_zone "$RG" "$PRIVATE_DNS_ZONE" -# 2. Link DNS zone to VNet +# 2. Link DNS zone to Customer VNets for VNET in "$VNET_A1" "$VNET_A2" "$VNET_A3"; do LINK_NAME="${VNET}-link" echo "==> Linking DNS zone $PRIVATE_DNS_ZONE to VNet $VNET" @@ -71,9 +71,34 @@ for VNET in "$VNET_A1" "$VNET_A2" "$VNET_A3"; do verify_dns_link "$RG" "$PRIVATE_DNS_ZONE" "$LINK_NAME" done -# 3. Create Private Endpoint +# 2b. 
Link DNS zone to AKS Cluster VNets (so pods can resolve private endpoint) +echo "==> Linking DNS zone to AKS cluster VNets" +for CLUSTER in "aks-1" "aks-2"; do + echo "==> Getting VNet for $CLUSTER" + AKS_VNET_ID=$(az aks show -g "$RG" -n "$CLUSTER" --query "agentPoolProfiles[0].vnetSubnetId" -o tsv | cut -d'/' -f1-9) + + if [ -z "$AKS_VNET_ID" ]; then + echo "[WARNING] Could not get VNet for $CLUSTER, skipping DNS link" + continue + fi + + LINK_NAME="${CLUSTER}-vnet-link" + echo "==> Linking DNS zone to $CLUSTER VNet" + az network private-dns link vnet create \ + -g "$RG" -n "$LINK_NAME" \ + --zone-name "$PRIVATE_DNS_ZONE" \ + --virtual-network "$AKS_VNET_ID" \ + --registration-enabled false \ + --output none \ + && echo "[OK] Linked DNS zone to $CLUSTER VNet." + verify_dns_link "$RG" "$PRIVATE_DNS_ZONE" "$LINK_NAME" +done + +# 3. Create Private Endpoint with Private DNS Zone integration echo "==> Creating Private Endpoint for Storage Account: $SA1_NAME" SA1_ID=$(az storage account show -g "$RG" -n "$SA1_NAME" --query id -o tsv) +DNS_ZONE_ID=$(az network private-dns zone show -g "$RG" -n "$PRIVATE_DNS_ZONE" --query id -o tsv) + az network private-endpoint create \ -g "$RG" -n "$PE_NAME" -l "$LOCATION" \ --vnet-name "$VNET_A1" --subnet "$SUBNET_PE_A1" \ @@ -84,4 +109,32 @@ az network private-endpoint create \ && echo "[OK] Private Endpoint $PE_NAME created for $SA1_NAME." verify_private_endpoint "$RG" "$PE_NAME" +# 4. Create Private DNS Zone Group to auto-register the private endpoint IP +echo "==> Creating Private DNS Zone Group to register DNS record" +az network private-endpoint dns-zone-group create \ + -g "$RG" \ + --endpoint-name "$PE_NAME" \ + --name "default" \ + --private-dns-zone "$DNS_ZONE_ID" \ + --zone-name "blob" \ + --output none \ + && echo "[OK] DNS Zone Group created, DNS record will be auto-registered." + +# 5. Verify DNS record was created +echo "==> Waiting 10 seconds for DNS record propagation..." +sleep 10 + +echo "==> Verifying DNS A record for $SA1_NAME" +PE_IP=$(az network private-endpoint show -g "$RG" -n "$PE_NAME" --query 'customDnsConfigs[0].ipAddresses[0]' -o tsv) +echo "Private Endpoint IP: $PE_IP" + +DNS_RECORD=$(az network private-dns record-set a list -g "$RG" -z "$PRIVATE_DNS_ZONE" --query "[?contains(name, '$SA1_NAME')].{Name:name, IP:aRecords[0].ipv4Address}" -o tsv) +echo "DNS Record: $DNS_RECORD" + +if [ -z "$DNS_RECORD" ]; then + echo "[WARNING] DNS A record not found. Manual verification needed." +else + echo "[OK] DNS A record created successfully." +fi + echo "All Private DNS and Endpoint resources created and verified successfully." diff --git a/.pipelines/swiftv2-long-running/scripts/create_storage.sh b/.pipelines/swiftv2-long-running/scripts/create_storage.sh index caefc69294..fd5f7addae 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_storage.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_storage.sh @@ -26,8 +26,10 @@ for SA in "$SA1" "$SA2"; do --allow-shared-key-access false \ --https-only true \ --min-tls-version TLS1_2 \ + --tags SkipAutoDeleteTill=2032-12-31 \ --query "name" -o tsv \ && echo "Storage account $SA created successfully." + # Verify creation success echo "==> Verifying storage account $SA exists..." if az storage account show --name "$SA" --resource-group "$RG" &>/dev/null; then @@ -36,8 +38,48 @@ for SA in "$SA1" "$SA2"; do echo "[ERROR] Storage account $SA not found after creation!" 
>&2 exit 1 fi + + # Assign RBAC role to pipeline service principal for blob access + echo "==> Assigning Storage Blob Data Contributor role to service principal" + SP_OBJECT_ID=$(az ad signed-in-user show --query id -o tsv 2>/dev/null || az account show --query user.name -o tsv) + SA_SCOPE="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RG}/providers/Microsoft.Storage/storageAccounts/${SA}" + + az role assignment create \ + --assignee "$SP_OBJECT_ID" \ + --role "Storage Blob Data Contributor" \ + --scope "$SA_SCOPE" \ + --output none \ + && echo "[OK] RBAC role assigned to service principal for $SA" + + # Create container and upload test blob for private endpoint testing + echo "==> Creating test container in $SA" + az storage container create \ + --name "test" \ + --account-name "$SA" \ + --auth-mode login \ + && echo "[OK] Container 'test' created in $SA" + + # Upload test blob + echo "==> Uploading test blob to $SA" + az storage blob upload \ + --account-name "$SA" \ + --container-name "test" \ + --name "hello.txt" \ + --data "Hello from Private Endpoint - Storage: $SA" \ + --auth-mode login \ + --overwrite \ + && echo "[OK] Test blob 'hello.txt' uploaded to $SA/test/" done +# # Disable public network access ONLY on SA1 (Tenant A storage with private endpoint) +# echo "==> Disabling public network access on $SA1" +# az storage account update \ +# --name "$SA1" \ +# --resource-group "$RG" \ +# --public-network-access Disabled \ +# --output none \ +# && echo "[OK] Public network access disabled on $SA1" + echo "All storage accounts created and verified successfully." # Set pipeline output variables diff --git a/.pipelines/swiftv2-long-running/scripts/create_vnets.sh b/.pipelines/swiftv2-long-running/scripts/create_vnets.sh index eb894d06ff..4649c3aca1 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_vnets.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_vnets.sh @@ -2,35 +2,31 @@ set -e trap 'echo "[ERROR] Failed while creating VNets or subnets. Check Azure CLI logs above." >&2' ERR -SUBSCRIPTION_ID=$1 +SUB_ID=$1 LOCATION=$2 RG=$3 +BUILD_ID=$4 -az account set --subscription "$SUBSCRIPTION_ID" - -# VNets and subnets -VNET_A1="cx_vnet_a1" -VNET_A2="cx_vnet_a2" -VNET_A3="cx_vnet_a3" -VNET_B1="cx_vnet_b1" - -A1_S1="10.10.1.0/24" -A1_S2="10.10.2.0/24" -A1_PE="10.10.100.0/24" - -A2_MAIN="10.11.1.0/24" - -A3_MAIN="10.12.1.0/24" - -B1_MAIN="10.20.1.0/24" +# --- VNet definitions --- +# Create customer vnets for two customers A and B. +# Using 172.16.0.0/12 range to avoid overlap with AKS infra 10.0.0.0/8 +VNAMES=( "cx_vnet_a1" "cx_vnet_a2" "cx_vnet_a3" "cx_vnet_b1" ) +VCIDRS=( "172.16.0.0/16" "172.17.0.0/16" "172.18.0.0/16" "172.19.0.0/16" ) +NODE_SUBNETS=( "172.16.0.0/24" "172.17.0.0/24" "172.18.0.0/24" "172.19.0.0/24" ) +EXTRA_SUBNETS_LIST=( "s1 s2 pe" "s1" "s1" "s1" ) +EXTRA_CIDRS_LIST=( "172.16.1.0/24,172.16.2.0/24,172.16.3.0/24" \ + "172.17.1.0/24" \ + "172.18.1.0/24" \ + "172.19.1.0/24" ) +az account set --subscription "$SUB_ID" # ------------------------------- # Verification functions # ------------------------------- verify_vnet() { - local rg="$1"; local vnet="$2" + local vnet="$1" echo "==> Verifying VNet: $vnet" - if az network vnet show -g "$rg" -n "$vnet" &>/dev/null; then + if az network vnet show -g "$RG" -n "$vnet" &>/dev/null; then echo "[OK] Verified VNet $vnet exists." else echo "[ERROR] VNet $vnet not found!" 
>&2 @@ -39,9 +35,9 @@ verify_vnet() { } verify_subnet() { - local rg="$1"; local vnet="$2"; local subnet="$3" + local vnet="$1"; local subnet="$2" echo "==> Verifying subnet: $subnet in $vnet" - if az network vnet subnet show -g "$rg" --vnet-name "$vnet" -n "$subnet" &>/dev/null; then + if az network vnet subnet show -g "$RG" --vnet-name "$vnet" -n "$subnet" &>/dev/null; then echo "[OK] Verified subnet $subnet exists in $vnet." else echo "[ERROR] Subnet $subnet not found in $vnet!" >&2 @@ -50,35 +46,99 @@ verify_subnet() { } # ------------------------------- -# Create VNets and Subnets -# ------------------------------- -# A1 -az network vnet create -g "$RG" -n "$VNET_A1" --address-prefix 10.10.0.0/16 --subnet-name s1 --subnet-prefix "$A1_S1" -l "$LOCATION" --output none \ - && echo "Created $VNET_A1 with subnet s1" -az network vnet subnet create -g "$RG" --vnet-name "$VNET_A1" -n s2 --address-prefix "$A1_S2" --output none \ - && echo "Created $VNET_A1 with subnet s2" -az network vnet subnet create -g "$RG" --vnet-name "$VNET_A1" -n pe --address-prefix "$A1_PE" --output none \ - && echo "Created $VNET_A1 with subnet pe" -# Verify A1 -verify_vnet "$RG" "$VNET_A1" -for sn in s1 s2 pe; do verify_subnet "$RG" "$VNET_A1" "$sn"; done +create_vnet_subets() { + local vnet="$1" + local vnet_cidr="$2" + local node_subnet_cidr="$3" + local extra_subnets="$4" + local extra_cidrs="$5" -# A2 -az network vnet create -g "$RG" -n "$VNET_A2" --address-prefix 10.11.0.0/16 --subnet-name s1 --subnet-prefix "$A2_MAIN" -l "$LOCATION" --output none \ - && echo "Created $VNET_A2 with subnet s1" -verify_vnet "$RG" "$VNET_A2" -verify_subnet "$RG" "$VNET_A2" "s1" + echo "==> Creating VNet: $vnet with CIDR: $vnet_cidr" + az network vnet create -g "$RG" -l "$LOCATION" --name "$vnet" --address-prefixes "$vnet_cidr" \ + --tags SkipAutoDeleteTill=2032-12-31 -o none + + IFS=' ' read -r -a extra_subnet_array <<< "$extra_subnets" + IFS=',' read -r -a extra_cidr_array <<< "$extra_cidrs" + + for i in "${!extra_subnet_array[@]}"; do + subnet_name="${extra_subnet_array[$i]}" + subnet_cidr="${extra_cidr_array[$i]}" + echo "==> Creating extra subnet: $subnet_name with CIDR: $subnet_cidr" + + # Only delegate pod subnets (not private endpoint subnets) + if [[ "$subnet_name" != "pe" ]]; then + az network vnet subnet create -g "$RG" \ + --vnet-name "$vnet" --name "$subnet_name" \ + --delegations Microsoft.SubnetDelegator/msfttestclients \ + --address-prefixes "$subnet_cidr" -o none + else + az network vnet subnet create -g "$RG" \ + --vnet-name "$vnet" --name "$subnet_name" \ + --address-prefixes "$subnet_cidr" -o none + fi + done +} + +delegate_subnet() { + local vnet="$1" + local subnet="$2" + local max_attempts=7 + local attempt=1 + + echo "==> Delegating subnet: $subnet in VNet: $vnet to Subnet Delegator" + subnet_id=$(az network vnet subnet show -g "$RG" --vnet-name "$vnet" -n "$subnet" --query id -o tsv) + modified_custsubnet="${subnet_id//\//%2F}" + + responseFile="delegate_response.txt" + cmd_delegator_curl="'curl -X PUT http://localhost:8080/DelegatedSubnet/$modified_custsubnet'" + cmd_containerapp_exec="az containerapp exec -n subnetdelegator-westus-u3h4j -g subnetdelegator-westus --subscription 9b8218f9-902a-4d20-a65c-e98acec5362f --command $cmd_delegator_curl" + + while [ $attempt -le $max_attempts ]; do + echo "Attempt $attempt of $max_attempts..." 
+ + # Use script command to provide PTY for az containerapp exec + script --quiet -c "$cmd_containerapp_exec" "$responseFile" + + if grep -qF "success" "$responseFile"; then + echo "Subnet Delegator successfully registered the subnet" + rm -f "$responseFile" + return 0 + else + echo "Subnet Delegator failed to register the subnet (attempt $attempt)" + cat "$responseFile" + + if [ $attempt -lt $max_attempts ]; then + echo "Retrying in 5 seconds..." + sleep 5 + fi + fi + + ((attempt++)) + done + + echo "[ERROR] Failed to delegate subnet after $max_attempts attempts" + rm -f "$responseFile" + exit 1 +} -# A3 -az network vnet create -g "$RG" -n "$VNET_A3" --address-prefix 10.12.0.0/16 --subnet-name s1 --subnet-prefix "$A3_MAIN" -l "$LOCATION" --output none \ - && echo "Created $VNET_A3 with subnet s1" -verify_vnet "$RG" "$VNET_A3" -verify_subnet "$RG" "$VNET_A3" "s1" +# --- Loop over VNets --- +for i in "${!VNAMES[@]}"; do + VNET=${VNAMES[$i]} + VNET_CIDR=${VCIDRS[$i]} + NODE_SUBNET_CIDR=${NODE_SUBNETS[$i]} + EXTRA_SUBNETS=${EXTRA_SUBNETS_LIST[$i]} + EXTRA_SUBNET_CIDRS=${EXTRA_CIDRS_LIST[$i]} -# B1 -az network vnet create -g "$RG" -n "$VNET_B1" --address-prefix 10.20.0.0/16 --subnet-name s1 --subnet-prefix "$B1_MAIN" -l "$LOCATION" --output none \ - && echo "Created $VNET_B1 with subnet s1" -verify_vnet "$RG" "$VNET_B1" -verify_subnet "$RG" "$VNET_B1" "s1" + # Create VNet + subnets + create_vnet_subets "$VNET" "$VNET_CIDR" "$NODE_SUBNET_CIDR" "$EXTRA_SUBNETS" "$EXTRA_SUBNET_CIDRS" + verify_vnet "$VNET" + # Loop over extra subnets to verify and delegate the pod subnets. + for PODSUBNET in $EXTRA_SUBNETS; do + verify_subnet "$VNET" "$PODSUBNET" + if [[ "$PODSUBNET" != "pe" ]]; then + delegate_subnet "$VNET" "$PODSUBNET" + fi + done +done -echo " All VNets and subnets created and verified successfully." +echo "All VNets and subnets created and verified successfully." 
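The script above both creates the pod subnets with the `Microsoft.SubnetDelegator/msfttestclients` delegation and registers them with the Subnet Delegator container app, retrying up to seven times. When a run fails here, it can help to confirm the delegation actually landed on the subnet. Below is a minimal standalone sketch, not part of the pipeline scripts, that shells out to the Azure CLI in the same style as the Go helpers added later in this patch; it assumes an authenticated Azure CLI session and that the `RG` environment variable is set to the pipeline's resource group. The `cx_vnet_a1`/`s1` names match the values used by the script.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// Assumption: RG is exported to the resource group the pipeline created.
	rg := os.Getenv("RG")

	// Query the delegations configured on the pod subnet s1 of cx_vnet_a1.
	out, err := exec.Command("az", "network", "vnet", "subnet", "show",
		"-g", rg, "--vnet-name", "cx_vnet_a1", "-n", "s1",
		"--query", "delegations[].serviceName", "-o", "tsv").CombinedOutput()
	if err != nil {
		fmt.Printf("query failed: %v\n%s\n", err, out)
		return
	}

	if strings.Contains(string(out), "Microsoft.SubnetDelegator/msfttestclients") {
		fmt.Println("subnet s1 in cx_vnet_a1 carries the expected delegation")
	} else {
		fmt.Printf("unexpected delegations on s1: %s\n", out)
	}
}
```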
\ No newline at end of file diff --git a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml index cc6016f17a..7236fc8776 100644 --- a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml +++ b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml @@ -5,16 +5,30 @@ parameters: type: string - name: resourceGroupName type: string - - name: vmSkuDefault - type: string - - name: vmSkuHighNIC - type: string - name: serviceConnection type: string + - name: runSetupStages + type: boolean + default: false + +variables: + - name: rgName + ${{ if eq(parameters.runSetupStages, true) }}: + value: ${{ parameters.resourceGroupName }} + ${{ else }}: + value: sv2-long-run-${{ parameters.location }} + - name: vmSkuDefault + value: "Standard_D4s_v3" + - name: vmSkuHighNIC + value: "Standard_D16s_v3" stages: + # ================================================================= + # Stage 1: AKS Cluster and Networking Setup (Conditional) + # ================================================================= - stage: AKSClusterAndNetworking displayName: "Stage: AKS Cluster and Networking Setup" + condition: eq(${{ parameters.runSetupStages }}, true) jobs: # ------------------------------------------------------------ # Job 1: Create Resource Group @@ -32,10 +46,13 @@ stages: scriptType: bash scriptLocation: inlineScript inlineScript: | - echo "==> Creating resource group ${{ parameters.resourceGroupName }} in ${{ parameters.location }}" + echo "Org: $SYSTEM_COLLECTIONURI" + echo "Project: $SYSTEM_TEAMPROJECT" + echo "==> Creating resource group $(rgName) in ${{ parameters.location }}" az group create \ - --name "${{ parameters.resourceGroupName }}" \ + --name "$(rgName)" \ --location "${{ parameters.location }}" \ + --tags SkipAutoDeleteTill=2032-12-31 \ --subscription "${{ parameters.subscriptionId }}" echo "Resource group created successfully." 
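Everything downstream of this job is keyed off `$(rgName)`: later tasks fetch kubeconfigs into `/tmp/aks-1.kubeconfig` and `/tmp/aks-2.kubeconfig` with `az aks get-credentials` and export `RG`/`BUILD_ID` before invoking the Ginkgo suites. The following is a minimal sketch of how test code can consume that environment using the `helpers` package introduced later in this patch; the program layout and error handling here are illustrative only.

```go
package main

import (
	"fmt"
	"os"

	"github.com/Azure/azure-container-networking/test/integration/swiftv2/helpers"
)

func main() {
	// RG doubles as BUILD_ID in this pipeline; both are exported by the test jobs.
	rg := os.Getenv("RG")

	// Kubeconfig path written by `az aks get-credentials` in the pipeline steps.
	nodes, err := helpers.GetClusterNodes("/tmp/aks-1.kubeconfig")
	if err != nil {
		fmt.Fprintf(os.Stderr, "listing nodes for aks-1 in %s failed: %v\n", rg, err)
		os.Exit(1)
	}
	fmt.Printf("aks-1 (%s) nodes: %v\n", rg, nodes)
}
```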
@@ -59,16 +76,17 @@ stages: arguments: > ${{ parameters.subscriptionId }} ${{ parameters.location }} - ${{ parameters.resourceGroupName }} - ${{ parameters.vmSkuDefault }} - ${{ parameters.vmSkuHighNIC }} + $(rgName) + $(vmSkuDefault) + $(vmSkuHighNIC) # ------------------------------------------------------------ # Job 3: Networking & Storage # ------------------------------------------------------------ - job: NetworkingAndStorage + timeoutInMinutes: 0 displayName: "Networking and Storage Setup" - dependsOn: CreateResourceGroup + dependsOn: CreateCluster pool: vmImage: ubuntu-latest steps: @@ -85,7 +103,8 @@ stages: arguments: > ${{ parameters.subscriptionId }} ${{ parameters.location }} - ${{ parameters.resourceGroupName }} + $(rgName) + $(Build.BuildId) # Task 2: Create Peerings - task: AzureCLI@2 @@ -96,7 +115,7 @@ stages: scriptLocation: scriptPath scriptPath: ".pipelines/swiftv2-long-running/scripts/create_peerings.sh" arguments: > - ${{ parameters.resourceGroupName }} + $(rgName) # Task 3: Create Storage Accounts - task: AzureCLI@2 @@ -110,31 +129,297 @@ stages: arguments: > ${{ parameters.subscriptionId }} ${{ parameters.location }} - ${{ parameters.resourceGroupName }} + $(rgName) - # Task 4: Create NSG + # Task 4: Create Private Endpoint - task: AzureCLI@2 - displayName: "Create network security groups to restrict access between subnets" + displayName: "Create Private Endpoint for Storage Account" inputs: azureSubscription: ${{ parameters.serviceConnection }} scriptType: bash scriptLocation: scriptPath - scriptPath: ".pipelines/swiftv2-long-running/scripts/create_nsg.sh" + scriptPath: ".pipelines/swiftv2-long-running/scripts/create_pe.sh" arguments: > ${{ parameters.subscriptionId }} - ${{ parameters.resourceGroupName }} ${{ parameters.location }} - - # Task 5: Create Private Endpoint + $(rgName) + $(CreateStorageAccounts.StorageAccount1) + + # Task 5: Create NSG - task: AzureCLI@2 - displayName: "Create Private Endpoint for Storage Account" + displayName: "Create network security groups to restrict access between subnets" inputs: azureSubscription: ${{ parameters.serviceConnection }} scriptType: bash scriptLocation: scriptPath - scriptPath: ".pipelines/swiftv2-long-running/scripts/create_pe.sh" + scriptPath: ".pipelines/swiftv2-long-running/scripts/create_nsg.sh" arguments: > ${{ parameters.subscriptionId }} + $(rgName) ${{ parameters.location }} - ${{ parameters.resourceGroupName }} - $(CreateStorageAccounts.StorageAccount1) + # ================================================================= + # Stage 2: Datapath Tests + # ================================================================= + - stage: ManagedNodeDataPathTests + displayName: "Stage: Swiftv2 Data Path Tests on Linux Managed Nodes" + dependsOn: AKSClusterAndNetworking + condition: or(eq(${{ parameters.runSetupStages }}, false), succeeded()) + variables: + storageAccount1: $[ stageDependencies.AKSClusterAndNetworking.NetworkingAndStorage.outputs['CreateStorageAccounts.StorageAccount1'] ] + storageAccount2: $[ stageDependencies.AKSClusterAndNetworking.NetworkingAndStorage.outputs['CreateStorageAccounts.StorageAccount2'] ] + jobs: + # ------------------------------------------------------------ + # Job 1: Create Test Resources and Wait + # ------------------------------------------------------------ + - job: CreateTestResources + displayName: "Create Resources and Wait 20 Minutes" + timeoutInMinutes: 90 + pool: + vmImage: ubuntu-latest + steps: + - checkout: self + + - task: GoTool@0 + displayName: "Use Go 1.22.5" + 
inputs: + version: "1.22.5" + + - task: AzureCLI@2 + displayName: "Create Test Resources" + inputs: + azureSubscription: ${{ parameters.serviceConnection }} + scriptType: bash + scriptLocation: inlineScript + inlineScript: | + echo "==> Installing Ginkgo CLI" + go install github.com/onsi/ginkgo/v2/ginkgo@latest + + echo "==> Adding Go bin to PATH" + export PATH=$PATH:$(go env GOPATH)/bin + + echo "==> Downloading Go dependencies" + go mod download + + echo "==> Setting up kubeconfig for cluster aks-1" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-1 \ + --file /tmp/aks-1.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Setting up kubeconfig for cluster aks-2" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-2 \ + --file /tmp/aks-2.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Verifying cluster aks-1 connectivity" + kubectl --kubeconfig /tmp/aks-1.kubeconfig get nodes + + echo "==> Verifying cluster aks-2 connectivity" + kubectl --kubeconfig /tmp/aks-2.kubeconfig get nodes + + echo "==> Creating test resources (8 scenarios)" + export RG="$(rgName)" + export BUILD_ID="$(rgName)" + export WORKLOAD_TYPE="swiftv2-linux" + cd ./test/integration/swiftv2/longRunningCluster + ginkgo -v -trace --timeout=1h --tags=create_test + + - script: | + echo "Waiting 2 minutes for pods to fully start and HTTP servers to be ready..." + sleep 120 + echo "Wait period complete, proceeding with connectivity tests" + displayName: "Wait for pods to be ready" + + # ------------------------------------------------------------ + # Job 2: Run Connectivity Tests + # ------------------------------------------------------------ + - job: ConnectivityTests + displayName: "Test Pod-to-Pod Connectivity" + dependsOn: CreateTestResources + timeoutInMinutes: 30 + pool: + vmImage: ubuntu-latest + steps: + - checkout: self + + - task: GoTool@0 + displayName: "Use Go 1.22.5" + inputs: + version: "1.22.5" + + - task: AzureCLI@2 + displayName: "Run Connectivity Tests" + inputs: + azureSubscription: ${{ parameters.serviceConnection }} + scriptType: bash + scriptLocation: inlineScript + inlineScript: | + echo "==> Installing Ginkgo CLI" + go install github.com/onsi/ginkgo/v2/ginkgo@latest + + echo "==> Adding Go bin to PATH" + export PATH=$PATH:$(go env GOPATH)/bin + + echo "==> Downloading Go dependencies" + go mod download + + echo "==> Setting up kubeconfig for cluster aks-1" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-1 \ + --file /tmp/aks-1.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Setting up kubeconfig for cluster aks-2" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-2 \ + --file /tmp/aks-2.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Running connectivity tests" + export RG="$(rgName)" + export BUILD_ID="$(rgName)" + export WORKLOAD_TYPE="swiftv2-linux" + cd ./test/integration/swiftv2/longRunningCluster + ginkgo -v -trace --timeout=30m --tags=connectivity_test + + # ------------------------------------------------------------ + # Job 3: Private Endpoint Connectivity Tests + # ------------------------------------------------------------ + - job: PrivateEndpointTests + displayName: "Test Private Endpoint Access" + dependsOn: ConnectivityTests + timeoutInMinutes: 30 + pool: + vmImage: ubuntu-latest + steps: + - checkout: self + + - task: GoTool@0 + displayName: "Use Go 1.22.5" + inputs: + version: "1.22.5" + + - task: AzureCLI@2 + displayName: "Run Private Endpoint Tests" + 
inputs: + azureSubscription: ${{ parameters.serviceConnection }} + scriptType: bash + scriptLocation: inlineScript + inlineScript: | + echo "==> Installing Ginkgo CLI" + go install github.com/onsi/ginkgo/v2/ginkgo@latest + + echo "==> Adding Go bin to PATH" + export PATH=$PATH:$(go env GOPATH)/bin + + echo "==> Downloading Go dependencies" + go mod download + + echo "==> Setting up kubeconfig for cluster aks-1" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-1 \ + --file /tmp/aks-1.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Setting up kubeconfig for cluster aks-2" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-2 \ + --file /tmp/aks-2.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Running Private Endpoint connectivity tests" + export RG="$(rgName)" + export BUILD_ID="$(rgName)" + export WORKLOAD_TYPE="swiftv2-linux" + + # Get storage account names - either from stage variables or discover from resource group + STORAGE_ACCOUNT_1="$(storageAccount1)" + STORAGE_ACCOUNT_2="$(storageAccount2)" + + # If variables are empty (when runSetupStages=false), discover from resource group + if [ -z "$STORAGE_ACCOUNT_1" ] || [ -z "$STORAGE_ACCOUNT_2" ]; then + echo "Storage account variables not set, discovering from resource group..." + STORAGE_ACCOUNT_1=$(az storage account list -g $(rgName) --query "[?starts_with(name, 'sa1')].name" -o tsv) + STORAGE_ACCOUNT_2=$(az storage account list -g $(rgName) --query "[?starts_with(name, 'sa2')].name" -o tsv) + echo "Discovered: STORAGE_ACCOUNT_1=$STORAGE_ACCOUNT_1, STORAGE_ACCOUNT_2=$STORAGE_ACCOUNT_2" + fi + + export STORAGE_ACCOUNT_1 + export STORAGE_ACCOUNT_2 + cd ./test/integration/swiftv2/longRunningCluster + ginkgo -v -trace --timeout=30m --tags=private_endpoint_test + + # ------------------------------------------------------------ + # Job 4: Delete Test Resources + # ------------------------------------------------------------ + - job: DeleteTestResources + displayName: "Delete PodNetwork, PNI, and Pods" + dependsOn: + - CreateTestResources + - ConnectivityTests + - PrivateEndpointTests + # Always run cleanup, even if previous jobs failed + condition: always() + timeoutInMinutes: 60 + pool: + vmImage: ubuntu-latest + steps: + - checkout: self + + - task: GoTool@0 + displayName: "Use Go 1.22.5" + inputs: + version: "1.22.5" + + - task: AzureCLI@2 + displayName: "Delete Test Resources" + inputs: + azureSubscription: ${{ parameters.serviceConnection }} + scriptType: bash + scriptLocation: inlineScript + inlineScript: | + echo "==> Installing Ginkgo CLI" + go install github.com/onsi/ginkgo/v2/ginkgo@latest + + echo "==> Adding Go bin to PATH" + export PATH=$PATH:$(go env GOPATH)/bin + + echo "==> Downloading Go dependencies" + go mod download + + echo "==> Setting up kubeconfig for cluster aks-1" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-1 \ + --file /tmp/aks-1.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Setting up kubeconfig for cluster aks-2" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-2 \ + --file /tmp/aks-2.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Deleting test resources (8 scenarios)" + export RG="$(rgName)" + export BUILD_ID="$(rgName)" + export WORKLOAD_TYPE="swiftv2-linux" + cd ./test/integration/swiftv2/longRunningCluster + ginkgo -v -trace --timeout=1h --tags=delete_test + \ No newline at end of file diff --git a/go.mod b/go.mod index bf07d7f6ac..8096f632b3 100644 --- a/go.mod 
+++ b/go.mod @@ -1,6 +1,8 @@ module github.com/Azure/azure-container-networking -go 1.24.1 +go 1.24.0 + +toolchain go1.24.10 require ( github.com/Azure/azure-sdk-for-go/sdk/azcore v1.19.1 @@ -68,7 +70,6 @@ require ( github.com/gofrs/uuid v4.4.0+incompatible // indirect github.com/gogo/protobuf v1.3.2 // indirect github.com/golang/groupcache v0.0.0-20210331224755-41bb18bfe9da // indirect - github.com/hpcloud/tail v1.0.0 // indirect github.com/inconshreveable/mousetrap v1.1.0 // indirect github.com/josharian/intern v1.0.0 // indirect github.com/json-iterator/go v1.1.12 // indirect @@ -104,12 +105,9 @@ require ( golang.org/x/term v0.36.0 // indirect golang.org/x/text v0.30.0 // indirect golang.org/x/time v0.14.0 - golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2 // indirect gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect - gopkg.in/fsnotify.v1 v1.4.7 // indirect gopkg.in/inf.v0 v0.9.1 // indirect gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7 // indirect - gopkg.in/yaml.v2 v2.4.0 // indirect gopkg.in/yaml.v3 v3.0.1 // indirect k8s.io/kube-openapi v0.0.0-20250910181357-589584f1c912 // indirect sigs.k8s.io/json v0.0.0-20241014173422-cfa47c3a1cc8 // indirect @@ -125,6 +123,7 @@ require ( github.com/cilium/cilium v1.15.16 github.com/cilium/ebpf v0.19.0 github.com/jsternberg/zap-logfmt v1.3.0 + github.com/onsi/ginkgo/v2 v2.23.4 golang.org/x/sync v0.17.0 gotest.tools/v3 v3.5.2 k8s.io/kubectl v0.34.1 @@ -147,9 +146,11 @@ require ( github.com/go-openapi/spec v0.20.11 // indirect github.com/go-openapi/strfmt v0.21.9 // indirect github.com/go-openapi/validate v0.22.3 // indirect + github.com/go-task/slim-sprig/v3 v3.0.0 // indirect github.com/go-viper/mapstructure/v2 v2.4.0 // indirect github.com/google/btree v1.1.3 // indirect github.com/google/gopacket v1.1.19 // indirect + github.com/google/pprof v0.0.0-20250630185457-6e76a2b096b5 // indirect github.com/gorilla/websocket v1.5.4-0.20250319132907-e064f32e3674 // indirect github.com/hashicorp/golang-lru/v2 v2.0.7 // indirect github.com/kr/pretty v0.3.1 // indirect @@ -174,10 +175,12 @@ require ( go.opentelemetry.io/otel/sdk v1.38.0 // indirect go.opentelemetry.io/otel/sdk/metric v1.38.0 // indirect go.opentelemetry.io/otel/trace v1.38.0 // indirect + go.uber.org/automaxprocs v1.6.0 // indirect go.uber.org/dig v1.17.1 // indirect go.yaml.in/yaml/v2 v2.4.3 // indirect go.yaml.in/yaml/v3 v3.0.4 // indirect go4.org/netipx v0.0.0-20231129151722-fdeea329fbba // indirect + golang.org/x/tools v0.37.0 // indirect gopkg.in/evanphx/json-patch.v4 v4.12.0 // indirect sigs.k8s.io/randfill v1.0.0 // indirect sigs.k8s.io/structured-merge-diff/v6 v6.3.0 // indirect @@ -193,11 +196,6 @@ require ( k8s.io/kubelet v0.34.1 ) -replace ( - github.com/onsi/ginkgo => github.com/onsi/ginkgo v1.12.0 - github.com/onsi/gomega => github.com/onsi/gomega v1.10.0 -) - retract ( v1.16.17 // contains only retractions, new version to retract 1.15.22. v1.16.16 // contains only retractions, has to be newer than 1.16.15. 
diff --git a/go.sum b/go.sum index dbcced8ba9..c1ac6b2891 100644 --- a/go.sum +++ b/go.sum @@ -114,6 +114,7 @@ github.com/evanphx/json-patch/v5 v5.9.11/go.mod h1:3j+LviiESTElxA4p3EMKAB9HXj3/X github.com/frankban/quicktest v1.14.6 h1:7Xjx+VpznH+oBnejlPUj8oUpdxnVs4f8XU8WnHkI4W8= github.com/frankban/quicktest v1.14.6/go.mod h1:4ptaffx2x8+WTWXmUCuVU6aPUX1/Mz7zb5vbUoiM6w0= github.com/fsnotify/fsnotify v1.4.7/go.mod h1:jwhsz4b93w/PPRr/qN1Yymfu8t87LnFCMoQvtojpjFo= +github.com/fsnotify/fsnotify v1.4.9/go.mod h1:znqG4EE+3YCdAaPaxE2ZRY/06pZUdp0tY4IgpuI1SZQ= github.com/fsnotify/fsnotify v1.6.0/go.mod h1:sl3t1tCWJFWoRz9R8WJCbQihKKwmorjAbSClcnxKAGw= github.com/fsnotify/fsnotify v1.9.0 h1:2Ml+OJNzbYCTzsxtv8vKSFD9PbJjmhYF14k/jKC7S9k= github.com/fsnotify/fsnotify v1.9.0/go.mod h1:8jBTzvmWwFyi3Pb8djgCCO5IBqzKJ/Jwo8TRcHyHii0= @@ -160,6 +161,7 @@ github.com/go-openapi/validate v0.22.3 h1:KxG9mu5HBRYbecRb37KRCihvGGtND2aXziBAv0 github.com/go-openapi/validate v0.22.3/go.mod h1:kVxh31KbfsxU8ZyoHaDbLBWU5CnMdqBUEtadQ2G4d5M= github.com/go-quicktest/qt v1.101.1-0.20240301121107-c6c8733fa1e6 h1:teYtXy9B7y5lHTp8V9KPxpYRAVA7dozigQcMiBust1s= github.com/go-quicktest/qt v1.101.1-0.20240301121107-c6c8733fa1e6/go.mod h1:p4lGIVX+8Wa6ZPNDvqcxq36XpUDLh42FLetFU7odllI= +github.com/go-task/slim-sprig v0.0.0-20210107165309-348f09dbbbc0/go.mod h1:fyg7847qk6SyHyPtNmDHnmrv/HOrqktSC+C9fM+CJOE= github.com/go-task/slim-sprig/v3 v3.0.0 h1:sUs3vkvUymDpBKi3qH1YSqBQk9+9D/8M2mN1vB6EwHI= github.com/go-task/slim-sprig/v3 v3.0.0/go.mod h1:W848ghGpv3Qj3dhTPRyJypKRiqCdHZiAzKg9hl15HA8= github.com/go-viper/mapstructure/v2 v2.4.0 h1:EBsztssimR/CONLSZZ04E8qAkxNYq4Qp9LvH92wZUgs= @@ -186,6 +188,7 @@ github.com/golang/protobuf v1.4.0-rc.2/go.mod h1:LlEzMj4AhA7rCAGe4KMBDvJI+AwstrU github.com/golang/protobuf v1.4.0-rc.4.0.20200313231945-b860323f09d0/go.mod h1:WU3c8KckQ9AFe+yFwt9sWVRKCVIyN9cPHBJSNnbL67w= github.com/golang/protobuf v1.4.0/go.mod h1:jodUvKwWbYaEsadDk5Fwe5c77LiNKVO9IDvqG2KuDX0= github.com/golang/protobuf v1.4.1/go.mod h1:U8fpvMrcmy5pZrNK1lt4xCsGvpyWQ/VVv6QDs8UjoX8= +github.com/golang/protobuf v1.4.2/go.mod h1:oDoupMAO8OvCJWAcko0GGGIgR6R6ocIYbsSw735rRwI= github.com/golang/protobuf v1.4.3/go.mod h1:oDoupMAO8OvCJWAcko0GGGIgR6R6ocIYbsSw735rRwI= github.com/golang/protobuf v1.5.4 h1:i7eJL8qZTpSEXOPTxNKhASYpMn+8e5Q6AdndVa1dWek= github.com/golang/protobuf v1.5.4/go.mod h1:lnTiLA8Wa4RWRcIUkrtSVa5nRhsEGBg48fD6rSs7xps= @@ -224,7 +227,6 @@ github.com/hashicorp/go-version v1.7.0 h1:5tqGy27NaOTB8yJKUZELlFAS/LTKJkrmONwQKe github.com/hashicorp/go-version v1.7.0/go.mod h1:fltr4n8CU8Ke44wwGCBoEymUuxUHl09ZGVZPK5anwXA= github.com/hashicorp/golang-lru/v2 v2.0.7 h1:a+bsQ5rvGLjzHuww6tVxozPZFVghXaHOwFs4luLUK2k= github.com/hashicorp/golang-lru/v2 v2.0.7/go.mod h1:QeFd9opnmA6QUJc5vARoKUSoFhyfM2/ZepoAG6RGpeM= -github.com/hpcloud/tail v1.0.0 h1:nfCOvKYfkgYP8hkirhJocXT2+zOD8yUNjXaWfTlyFKI= github.com/hpcloud/tail v1.0.0/go.mod h1:ab1qPbhIpdTxEkNHXyeSf5vhxWSCs/tWer42PpOxQnU= github.com/inconshreveable/mousetrap v1.1.0 h1:wN+x4NVGpMsO7ErUn/mUI3vEoE6Jt13X2s0bqwp9tc8= github.com/inconshreveable/mousetrap v1.1.0/go.mod h1:vpF70FUmC8bwa3OWnCshd2FqLfsEA9PFc4w1p2J65bw= @@ -297,16 +299,24 @@ github.com/mwitkow/go-conntrack v0.0.0-20190716064945-2f068394615f/go.mod h1:qRW github.com/mxk/go-flowrate v0.0.0-20140419014527-cca7078d478f h1:y5//uYreIhSUg3J1GEMiLbxo1LJaP8RfCpH6pymGZus= github.com/mxk/go-flowrate v0.0.0-20140419014527-cca7078d478f/go.mod h1:ZdcZmHo+o7JKHSa8/e818NopupXU1YMK5fe1lsApnBw= github.com/niemeyer/pretty v0.0.0-20200227124842-a10e7caefd8e/go.mod 
h1:zD1mROLANZcx1PVRCS0qkT7pwLkGfwJo4zjcN/Tysno= +github.com/nxadm/tail v1.4.4/go.mod h1:kenIhsEOeOJmVchQTgglprH7qJGnHDVpk1VPCcaMI8A= +github.com/nxadm/tail v1.4.8/go.mod h1:+ncqLTQzXmGhMZNUePPaPqPvBxHAIsmXswZKocGu+AU= github.com/nxadm/tail v1.4.11 h1:8feyoE3OzPrcshW5/MJ4sGESc5cqmGkGCWlco4l0bqY= github.com/nxadm/tail v1.4.11/go.mod h1:OTaG3NK980DZzxbRq6lEuzgU+mug70nY11sMd4JXXHc= github.com/oklog/ulid v1.3.1 h1:EGfNDEx6MqHz8B3uNV6QAib1UR2Lm97sHi3ocA6ESJ4= github.com/oklog/ulid v1.3.1/go.mod h1:CirwcVhetQ6Lv90oh/F+FBtV6XMibvdAFo93nm5qn4U= -github.com/onsi/ginkgo v1.12.0 h1:Iw5WCbBcaAAd0fpRb1c9r5YCylv4XDoCSigm1zLevwU= -github.com/onsi/ginkgo v1.12.0/go.mod h1:oUhWkIvk5aDxtKvDDuw8gItl8pKl42LzjC9KZE0HfGg= +github.com/onsi/ginkgo v1.6.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE= +github.com/onsi/ginkgo v1.8.0/go.mod h1:lLunBs/Ym6LB5Z9jYTR76FiuTmxDTDusOGeTQH+WWjE= +github.com/onsi/ginkgo v1.12.1/go.mod h1:zj2OWP4+oCPe1qIXoGWkgMRwljMUYCdkwsT2108oapk= +github.com/onsi/ginkgo v1.16.5 h1:8xi0RTUf59SOSfEtZMvwTvXYMzG4gV23XVHOZiXNtnE= +github.com/onsi/ginkgo v1.16.5/go.mod h1:+E8gABHa3K6zRBolWtd+ROzc/U5bkGt0FwiG042wbpU= github.com/onsi/ginkgo/v2 v2.23.4 h1:ktYTpKJAVZnDT4VjxSbiBenUjmlL/5QkBEocaWXiQus= github.com/onsi/ginkgo/v2 v2.23.4/go.mod h1:Bt66ApGPBFzHyR+JO10Zbt0Gsp4uWxu5mIOTusL46e8= -github.com/onsi/gomega v1.10.0 h1:Gwkk+PTu/nfOwNMtUB/mRUv0X7ewW5dO4AERT1ThVKo= -github.com/onsi/gomega v1.10.0/go.mod h1:Ho0h+IUsWyvy1OpqCwxlQ/21gkhVunqlU8fDGcoTdcA= +github.com/onsi/gomega v1.5.0/go.mod h1:ex+gbHU/CVuBBDIJjb2X0qEXbFg53c61hWP/1CpauHY= +github.com/onsi/gomega v1.7.1/go.mod h1:XdKZgCCFLUoM/7CFJVPcG8C1xQ1AJ0vpAezJrB7JYyY= +github.com/onsi/gomega v1.10.1/go.mod h1:iN09h71vgCQne3DLsj+A5owkum+a2tYe+TOCB1ybHNo= +github.com/onsi/gomega v1.37.0 h1:CdEG8g0S133B4OswTDC/5XPSzE1OeP29QOioj2PID2Y= +github.com/onsi/gomega v1.37.0/go.mod h1:8D9+Txp43QWKhM24yyOBEdpkzN8FvJyAwecBgsU4KU0= github.com/opentracing/opentracing-go v1.2.1-0.20220228012449-10b1cf09e00b h1:FfH+VrHHk6Lxt9HdVS0PXzSXFyS2NbZKXv33FYPol0A= github.com/opentracing/opentracing-go v1.2.1-0.20220228012449-10b1cf09e00b/go.mod h1:AC62GU6hc0BrNm+9RK9VSiwa/EUe1bkIeFORAMcHvJU= github.com/patrickmn/go-cache v2.1.0+incompatible h1:HRMgzkcYKYpi3C8ajMPV8OFXaaRUnok+kx1WdO15EQc= @@ -325,6 +335,8 @@ github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2 h1:Jamvg5psRI github.com/pmezard/go-difflib v1.0.1-0.20181226105442-5d4384ee4fb2/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= github.com/power-devops/perfstat v0.0.0-20210106213030-5aafc221ea8c h1:ncq/mPwQF4JjgDlrVEn3C11VoGHZN7m8qihwgMEtzYw= github.com/power-devops/perfstat v0.0.0-20210106213030-5aafc221ea8c/go.mod h1:OmDBASR4679mdNQnz2pUhc2G8CO2JrUAVFDRBDP/hJE= +github.com/prashantv/gostub v1.1.0 h1:BTyx3RfQjRHnUWaGF9oQos79AlQ5k8WNktv7VGvVH4g= +github.com/prashantv/gostub v1.1.0/go.mod h1:A5zLQHz7ieHGG7is6LLXLz7I8+3LZzsrV0P1IAHhP5U= github.com/prometheus/client_golang v1.23.2 h1:Je96obch5RDVy3FDMndoUsjAhG5Edi49h0RJWRi/o0o= github.com/prometheus/client_golang v1.23.2/go.mod h1:Tb1a6LWHB3/SPIzCoaDXI4I8UHKeFTEQ1YCr+0Gyqmg= github.com/prometheus/client_model v0.0.0-20190812154241-14fe0d1b01d4/go.mod h1:xMI15A0UPsDsEKsMN9yxemIoYk6Tm2C1GtYGdfGttqA= @@ -367,6 +379,7 @@ github.com/stretchr/objx v0.5.0/go.mod h1:Yh+to48EsGEfYuaHDzXPcE3xhTkx73EhmCGUpE github.com/stretchr/objx v0.5.2 h1:xuMeJ0Sdp5ZMRXx/aWO6RZxdr3beISkG5/G/aIRr3pY= github.com/stretchr/objx v0.5.2/go.mod h1:FRsXN1f5AsAjCGJKqEizvkpNtU+EGNCLh3NxZ/8L+MA= github.com/stretchr/testify v1.3.0/go.mod 
h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI= +github.com/stretchr/testify v1.5.1/go.mod h1:5W2xD1RspED5o8YsWQXVCued0rvSQ+mT+I5cxcmMvtA= github.com/stretchr/testify v1.6.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= github.com/stretchr/testify v1.7.1/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= @@ -465,6 +478,7 @@ golang.org/x/net v0.0.0-20190311183353-d8887717615a/go.mod h1:t9HGtf8HONx5eT2rtn golang.org/x/net v0.0.0-20190404232315-eb5bcb51f2a3/go.mod h1:t9HGtf8HONx5eT2rtn7q6eTqICYqUVnKs3thJo3Qplg= golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= golang.org/x/net v0.0.0-20200226121028-0de0cce0169b/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s= +golang.org/x/net v0.0.0-20200520004742-59133d7f0dd7/go.mod h1:qpuaurCH72eLCgpAm/N6yyVIVM9cpaDIP3A8BGJEC5A= golang.org/x/net v0.0.0-20201021035429-f5854403a974/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU= golang.org/x/net v0.0.0-20201110031124-69a78807bb2b/go.mod h1:sp8m0HH+o8qH0wwXwYZr8TS3Oi6o0r6Gce1SSxlDquU= golang.org/x/net v0.0.0-20210226172049-e18ecbb05110/go.mod h1:m0MpNAwzfU5UDzcl9v0D8zg8gWTRqZa9RBIspLL5mdg= @@ -489,11 +503,15 @@ golang.org/x/sys v0.0.0-20180830151530-49385e6e1522/go.mod h1:STP8DvDyc/dI5b8T5h golang.org/x/sys v0.0.0-20180909124046-d0be0721c37e/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY= golang.org/x/sys v0.0.0-20190412213103-97732733099d/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20190904154756-749cb33beabd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20190916202348-b4ddaad3f8a3/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20191005200804-aed5e4c7ecf9/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20191120155948-bd437916bb0e/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20200323222414-85ca7c5b95cd/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20200930185726-fdedc70b468f/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20201119102817-f84b799fce68/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20201204225414-ed752295db88/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= +golang.org/x/sys v0.0.0-20210112080510-489259a85091/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20210330210617-4fbd30eecc44/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20210423082822-04245dca01da/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs= golang.org/x/sys v0.0.0-20210510120138-977fb7262007/go.mod h1:oPkhp1MJrh7nUepCBck5+mAzfO9JrbApNNgaTdGDITg= @@ -532,6 +550,7 @@ golang.org/x/tools v0.0.0-20190524140312-2c0ae7006135/go.mod h1:RgjU9mgBXZiqYHBn golang.org/x/tools v0.0.0-20191119224855-298f0cb1881e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo= golang.org/x/tools v0.0.0-20200130002326-2f3ba24bd6e7/go.mod h1:TB2adYChydJhpapKDTa4BR/hXlZSLoq2Wpct/0txZ28= golang.org/x/tools v0.0.0-20200619180055-7c47624df98f/go.mod h1:EkVYQZoAsY45+roYkvgYkIh4xh/qjgUK9TdY2XT94GE= +golang.org/x/tools v0.0.0-20201224043029-2b0845dc783e/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA= golang.org/x/tools 
v0.0.0-20210106214847-113979e3529a/go.mod h1:emZCQorbCU4vsT4fOWvOPXz4eW1wZW4PmDk9uLelYpA= golang.org/x/tools v0.1.1/go.mod h1:o0xws9oXOQQZyjljx8fwUC0k7L1pTE6eaCbjGeHmOkk= golang.org/x/tools v0.1.12/go.mod h1:hNGJHUnrk76NpqgfD5Aqm5Crs+Hm0VOH/i9J2+nxYbc= @@ -541,8 +560,6 @@ golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8T golang.org/x/xerrors v0.0.0-20191011141410-1b5146add898/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= golang.org/x/xerrors v0.0.0-20191204190536-9bdfabe68543/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= golang.org/x/xerrors v0.0.0-20200804184101-5ec99f83aff1/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0= -golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2 h1:H2TDz8ibqkAF6YGhCdN3jS9O0/s90v0rJh3X/OLHEUk= -golang.org/x/xerrors v0.0.0-20220907171357-04be3eba64a2/go.mod h1:K8+ghG5WaK9qNqU5K3HdILfMLy1f3aNYFI/wnl100a8= gomodules.xyz/jsonpatch/v2 v2.4.0 h1:Ci3iUJyx9UeRx7CeFN8ARgGbkESwJK+KB9lLcWxY/Zw= gomodules.xyz/jsonpatch/v2 v2.4.0/go.mod h1:AH3dM2RI6uoBZxn3LVrfvJ3E0/9dG4cSrbuBJT4moAY= gonum.org/v1/gonum v0.16.0 h1:5+ul4Swaf3ESvrOnidPp4GZbzf0mxVQpDCYUQE7OJfk= @@ -582,7 +599,6 @@ gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c h1:Hei/4ADfdWqJk1ZMxUNpqntN gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c/go.mod h1:JHkPIbrfpd72SG/EVd6muEfDQjcINNoR0C8j2r3qZ4Q= gopkg.in/evanphx/json-patch.v4 v4.12.0 h1:n6jtcsulIzXPJaxegRbvFNNrZDjbij7ny3gmSPG+6V4= gopkg.in/evanphx/json-patch.v4 v4.12.0/go.mod h1:p8EYWUEYMpynmqDbY58zCKCFZw8pRWMG4EsWvDvM72M= -gopkg.in/fsnotify.v1 v1.4.7 h1:xOHLXZwVvI9hhs+cLKq5+I5onOuwQLhQwiu63xxlHs4= gopkg.in/fsnotify.v1 v1.4.7/go.mod h1:Tz8NjZHkW78fSQdbUxIjBTcgA1z1m8ZHf0WmKUhAMys= gopkg.in/inf.v0 v0.9.1 h1:73M5CoZyi3ZLMOyDlQh031Cx6N9NDJ2Vvfl76EDAgDc= gopkg.in/inf.v0 v0.9.1/go.mod h1:cWUDdTG/fYaXco+Dcufb5Vnc6Gp2YChqWtbxRZE0mXw= @@ -590,8 +606,10 @@ gopkg.in/natefinch/lumberjack.v2 v2.2.1 h1:bBRl1b0OH9s/DuPhuXpNl+VtCaJXFZ5/uEFST gopkg.in/natefinch/lumberjack.v2 v2.2.1/go.mod h1:YD8tP3GAjkrDg1eZH7EGmyESg/lsYskCTPBJVb9jqSc= gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7 h1:uRGJdciOHaEIrze2W8Q3AKkepLTh2hOroT7a+7czfdQ= gopkg.in/tomb.v1 v1.0.0-20141024135613-dd632973f1e7/go.mod h1:dt/ZhP58zS4L8KSrWDmTeBkI65Dw0HsyUHuEVlX15mw= +gopkg.in/yaml.v2 v2.2.1/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= gopkg.in/yaml.v2 v2.2.2/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= gopkg.in/yaml.v2 v2.2.4/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= +gopkg.in/yaml.v2 v2.3.0/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI= gopkg.in/yaml.v2 v2.4.0 h1:D8xgwECY7CYvx+Y2n4sBz93Jn9JRvxdiyyo8CTfuKaY= gopkg.in/yaml.v2 v2.4.0/go.mod h1:RDklbk79AGWmwhnvt/jBztapEOGDOx6ZbXqjP6csGnQ= gopkg.in/yaml.v3 v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= diff --git a/hack/aks/Makefile b/hack/aks/Makefile index 5e1c8f3f9b..3b31345ec5 100644 --- a/hack/aks/Makefile +++ b/hack/aks/Makefile @@ -29,6 +29,7 @@ PUBLIC_IPv6 ?= $(PUBLIC_IP_ID)/$(IP_PREFIX)-$(CLUSTER)-v6 KUBE_PROXY_JSON_PATH ?= ./kube-proxy.json LTS ?= false + # overrideable variables SUB ?= $(AZURE_SUBSCRIPTION) CLUSTER ?= $(USER)-$(REGION) @@ -280,22 +281,22 @@ swiftv2-dummy-cluster-up: rg-up ipv4 swift-net-up ## Bring up a SWIFT AzCNI clus --network-plugin azure \ --vnet-subnet-id /subscriptions/$(SUB)/resourceGroups/$(GROUP)/providers/Microsoft.Network/virtualNetworks/$(VNET)/subnets/nodenet \ --pod-subnet-id 
/subscriptions/$(SUB)/resourceGroups/$(GROUP)/providers/Microsoft.Network/virtualNetworks/$(VNET)/subnets/podnet \ + --tags stampcreatorserviceinfo=true \ --load-balancer-outbound-ips $(PUBLIC_IPv4) \ --no-ssh-key \ --yes @$(MAKE) set-kubeconf swiftv2-podsubnet-cluster-up: ipv4 swift-net-up ## Bring up a SWIFTv2 PodSubnet cluster - $(COMMON_AKS_FIELDS) + $(COMMON_AKS_FIELDS) \ --network-plugin azure \ - --nodepool-name nodepool1 \ - --load-balancer-outbound-ips $(PUBLIC_IPv4) \ + --node-vm-size $(VM_SIZE) \ --vnet-subnet-id /subscriptions/$(SUB)/resourceGroups/$(GROUP)/providers/Microsoft.Network/virtualNetworks/$(VNET)/subnets/nodenet \ --pod-subnet-id /subscriptions/$(SUB)/resourceGroups/$(GROUP)/providers/Microsoft.Network/virtualNetworks/$(VNET)/subnets/podnet \ - --service-cidr "10.0.0.0/16" \ - --dns-service-ip "10.0.0.10" \ - --tags fastpathenabled=true RGOwner=LongRunningTestPipelines stampcreatorserviceinfo=true \ + --nodepool-tags fastpathenabled=true aks-nic-enable-multi-tenancy=true \ + --tags stampcreatorserviceinfo=true \ --aks-custom-headers AKSHTTPCustomFeatures=Microsoft.ContainerService/NetworkingMultiTenancyPreview \ + --load-balancer-outbound-ips $(PUBLIC_IPv4) \ --yes @$(MAKE) set-kubeconf @@ -446,7 +447,7 @@ linux-swiftv2-nodepool-up: ## Add linux node pool to swiftv2 cluster --os-type Linux \ --max-pods 250 \ --subscription $(SUB) \ - --tags fastpathenabled=true,aks-nic-enable-multi-tenancy=true \ + --tags fastpathenabled=true aks-nic-enable-multi-tenancy=true stampcreatorserviceinfo=true\ --aks-custom-headers AKSHTTPCustomFeatures=Microsoft.ContainerService/NetworkingMultiTenancyPreview \ --pod-subnet-id /subscriptions/$(SUB)/resourceGroups/$(GROUP)/providers/Microsoft.Network/virtualNetworks/$(VNET)/subnets/podnet diff --git a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml new file mode 100644 index 0000000000..ffb5293b18 --- /dev/null +++ b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml @@ -0,0 +1,73 @@ +apiVersion: v1 +kind: Pod +metadata: + name: {{ .PodName }} + namespace: {{ .Namespace }} + labels: + kubernetes.azure.com/pod-network-instance: {{ .PNIName }} + kubernetes.azure.com/pod-network: {{ .PNName }} +spec: + nodeSelector: + kubernetes.io/hostname: {{ .NodeName }} + containers: + - name: net-debugger + image: {{ .Image }} + command: ["/bin/bash", "-c"] + args: + - | + echo "Pod Network Diagnostics started on $(hostname)"; + echo "Pod IP: $(hostname -i)"; + echo "Starting HTTP server on port 8080"; + + # Create a simple HTTP server directory + mkdir -p /tmp/www + cat > /tmp/www/index.html <<'EOF' + + + Network Test Pod + +

+            <h1>Pod Network Test</h1>
+            <p>Hostname: $(hostname)</p>
+            <p>IP Address: $(hostname -i)</p>
+            <p>Timestamp: $(date)</p>
+ + + EOF + + # Start Python HTTP server on port 8080 in background + cd /tmp/www && python3 -m http.server 8080 & + HTTP_PID=$! + echo "HTTP server started with PID $HTTP_PID on port 8080" + + # Give server a moment to start + sleep 2 + + # Verify server is running + if netstat -tuln | grep -q ':8080'; then + echo "HTTP server is listening on port 8080" + else + echo "WARNING: HTTP server may not be listening on port 8080" + fi + + # Keep showing network info periodically + while true; do + echo "=== Network Status at $(date) ===" + ip addr show + ip route show + echo "=== Listening ports ===" + netstat -tuln | grep LISTEN || ss -tuln | grep LISTEN + sleep 300 # Every 5 minutes + done + ports: + - containerPort: 8080 + protocol: TCP + resources: + limits: + cpu: 300m + memory: 600Mi + requests: + cpu: 300m + memory: 600Mi + securityContext: + privileged: true + restartPolicy: Always diff --git a/test/integration/manifests/swiftv2/long-running-cluster/podnetwork.yaml b/test/integration/manifests/swiftv2/long-running-cluster/podnetwork.yaml new file mode 100644 index 0000000000..25a7491d90 --- /dev/null +++ b/test/integration/manifests/swiftv2/long-running-cluster/podnetwork.yaml @@ -0,0 +1,15 @@ +apiVersion: multitenancy.acn.azure.com/v1alpha1 +kind: PodNetwork +metadata: + name: {{ .PNName }} +{{- if .SubnetToken }} + labels: + kubernetes.azure.com/override-subnet-token: "{{ .SubnetToken }}" +{{- end }} +spec: + networkID: "{{ .VnetGUID }}" +{{- if not .SubnetToken }} + subnetGUID: "{{ .SubnetGUID }}" +{{- end }} + subnetResourceID: "{{ .SubnetARMID }}" + deviceType: acn.azure.com/vnet-nic diff --git a/test/integration/manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml b/test/integration/manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml new file mode 100644 index 0000000000..4d1f8ca384 --- /dev/null +++ b/test/integration/manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml @@ -0,0 +1,13 @@ +apiVersion: multitenancy.acn.azure.com/v1alpha1 +kind: PodNetworkInstance +metadata: + name: {{ .PNIName }} + namespace: {{ .Namespace }} +spec: + podNetworkConfigs: + - podNetwork: {{ .PNName }} + {{- if eq .Type "explicit" }} + podIPReservationSize: {{ .Reservations }} + {{- else }} + podIPReservationSize: 1 + {{- end }} diff --git a/test/integration/swiftv2/helpers/az_helpers.go b/test/integration/swiftv2/helpers/az_helpers.go new file mode 100644 index 0000000000..c6e5d4b090 --- /dev/null +++ b/test/integration/swiftv2/helpers/az_helpers.go @@ -0,0 +1,343 @@ +package helpers + +import ( + "context" + "fmt" + "os/exec" + "strings" + "time" +) + +func runAzCommand(cmd string, args ...string) (string, error) { + out, err := exec.Command(cmd, args...).CombinedOutput() + if err != nil { + return "", fmt.Errorf("failed to run %s %v: %w\nOutput: %s", cmd, args, err, string(out)) + } + return strings.TrimSpace(string(out)), nil +} + +func GetVnetGUID(rg, vnet string) (string, error) { + return runAzCommand("az", "network", "vnet", "show", "--resource-group", rg, "--name", vnet, "--query", "resourceGuid", "-o", "tsv") +} + +func GetSubnetARMID(rg, vnet, subnet string) (string, error) { + return runAzCommand("az", "network", "vnet", "subnet", "show", "--resource-group", rg, "--vnet-name", vnet, "--name", subnet, "--query", "id", "-o", "tsv") +} + +func GetSubnetGUID(rg, vnet, subnet string) (string, error) { + subnetID, err := GetSubnetARMID(rg, vnet, subnet) + if err != nil { + return "", err + } + return runAzCommand("az", "resource", "show", "--ids", subnetID, 
"--api-version", "2023-09-01", "--query", "properties.serviceAssociationLinks[0].properties.subnetId", "-o", "tsv") +} + +func GetSubnetToken(rg, vnet, subnet string) (string, error) { + // Optionally implement if you use subnet token override + return "", nil +} + +// GetClusterNodes returns a slice of node names from a cluster using the given kubeconfig +func GetClusterNodes(kubeconfig string) ([]string, error) { + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "nodes", "-o", "name") + out, err := cmd.CombinedOutput() + if err != nil { + return nil, fmt.Errorf("failed to get nodes using kubeconfig %s: %w\nOutput: %s", kubeconfig, err, string(out)) + } + + lines := strings.Split(strings.TrimSpace(string(out)), "\n") + nodes := make([]string, 0, len(lines)) + + for _, line := range lines { + // kubectl returns "node/", we strip the prefix + if strings.HasPrefix(line, "node/") { + nodes = append(nodes, strings.TrimPrefix(line, "node/")) + } + } + return nodes, nil +} + +// EnsureNamespaceExists checks if a namespace exists and creates it if it doesn't +func EnsureNamespaceExists(kubeconfig, namespace string) error { + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "namespace", namespace) + err := cmd.Run() + + if err == nil { + return nil // Namespace exists + } + + // Namespace doesn't exist, create it + cmd = exec.Command("kubectl", "--kubeconfig", kubeconfig, "create", "namespace", namespace) + out, err := cmd.CombinedOutput() + if err != nil { + return fmt.Errorf("failed to create namespace %s: %s\n%s", namespace, err, string(out)) + } + + return nil +} + +// DeletePod deletes a pod in the specified namespace and waits for it to be fully removed +func DeletePod(kubeconfig, namespace, podName string) error { + fmt.Printf("Deleting pod %s in namespace %s...\n", podName, namespace) + + // Initiate pod deletion with context timeout + ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second) + defer cancel() + + cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "delete", "pod", podName, "-n", namespace, "--ignore-not-found=true") + out, err := cmd.CombinedOutput() + if err != nil { + if ctx.Err() == context.DeadlineExceeded { + fmt.Printf("kubectl delete pod command timed out after 90s, attempting force delete...\n") + } else { + return fmt.Errorf("failed to delete pod %s in namespace %s: %s\n%s", podName, namespace, err, string(out)) + } + } + + // Wait for pod to be completely gone (critical for IP release) + fmt.Printf("Waiting for pod %s to be fully removed...\n", podName) + for attempt := 1; attempt <= 30; attempt++ { + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + checkCmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "get", "pod", podName, "-n", namespace, "--ignore-not-found=true", "-o", "name") + checkOut, _ := checkCmd.CombinedOutput() + cancel() + + if strings.TrimSpace(string(checkOut)) == "" { + fmt.Printf("Pod %s fully removed after %d seconds\n", podName, attempt*2) + // Extra wait to ensure IP reservation is released in DNC + time.Sleep(5 * time.Second) + return nil + } + + if attempt%5 == 0 { + fmt.Printf("Pod %s still terminating (attempt %d/30)...\n", podName, attempt) + } + time.Sleep(2 * time.Second) + } + + // If pod still exists after 60 seconds, force delete + fmt.Printf("Pod %s still exists after 60s, attempting force delete...\n", podName) + ctx, cancel = context.WithTimeout(context.Background(), 30*time.Second) + defer cancel() + + forceCmd := 
exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "delete", "pod", podName, "-n", namespace, "--grace-period=0", "--force", "--ignore-not-found=true") + forceOut, forceErr := forceCmd.CombinedOutput() + if forceErr != nil { + fmt.Printf("Warning: Force delete failed: %s\n%s\n", forceErr, string(forceOut)) + } + + // Wait a bit more for force delete to complete + time.Sleep(10 * time.Second) + fmt.Printf("Pod %s deletion completed (may have required force)\n", podName) + return nil +} + +// DeletePodNetworkInstance deletes a PodNetworkInstance and waits for it to be removed +func DeletePodNetworkInstance(kubeconfig, namespace, pniName string) error { + fmt.Printf("Deleting PodNetworkInstance %s in namespace %s...\n", pniName, namespace) + + // Initiate PNI deletion + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "delete", "podnetworkinstance", pniName, "-n", namespace, "--ignore-not-found=true") + out, err := cmd.CombinedOutput() + if err != nil { + return fmt.Errorf("failed to delete PodNetworkInstance %s: %s\n%s", pniName, err, string(out)) + } + + // Wait for PNI to be completely gone (it may take time for DNC to release reservations) + fmt.Printf("Waiting for PodNetworkInstance %s to be fully removed...\n", pniName) + for attempt := 1; attempt <= 60; attempt++ { + checkCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "podnetworkinstance", pniName, "-n", namespace, "--ignore-not-found=true", "-o", "name") + checkOut, _ := checkCmd.CombinedOutput() + + if strings.TrimSpace(string(checkOut)) == "" { + fmt.Printf("PodNetworkInstance %s fully removed after %d seconds\n", pniName, attempt*2) + return nil + } + + if attempt%10 == 0 { + // Check for ReservationInUse errors + descCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "describe", "podnetworkinstance", pniName, "-n", namespace) + descOut, _ := descCmd.CombinedOutput() + descStr := string(descOut) + + if strings.Contains(descStr, "ReservationInUse") { + fmt.Printf("PNI %s still has active reservations (attempt %d/60). 
Waiting for DNC to release...\n", pniName, attempt) + } else { + fmt.Printf("PNI %s still terminating (attempt %d/60)...\n", pniName, attempt) + } + } + time.Sleep(2 * time.Second) + } + + // If PNI still exists after 120 seconds, try to remove finalizers + fmt.Printf("PNI %s still exists after 120s, attempting to remove finalizers...\n", pniName) + patchCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "patch", "podnetworkinstance", pniName, "-n", namespace, "-p", `{"metadata":{"finalizers":[]}}`, "--type=merge") + patchOut, patchErr := patchCmd.CombinedOutput() + if patchErr != nil { + fmt.Printf("Warning: Failed to remove finalizers: %s\n%s\n", patchErr, string(patchOut)) + } else { + fmt.Printf("Finalizers removed, waiting for deletion...\n") + time.Sleep(5 * time.Second) + } + + fmt.Printf("PodNetworkInstance %s deletion completed\n", pniName) + return nil +} + +// DeletePodNetwork deletes a PodNetwork and waits for it to be removed +func DeletePodNetwork(kubeconfig, pnName string) error { + fmt.Printf("Deleting PodNetwork %s...\n", pnName) + + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "delete", "podnetwork", pnName, "--ignore-not-found=true") + out, err := cmd.CombinedOutput() + if err != nil { + return fmt.Errorf("failed to delete PodNetwork %s: %s\n%s", pnName, err, string(out)) + } + + // Wait for PN to be completely gone + fmt.Printf("Waiting for PodNetwork %s to be fully removed...\n", pnName) + for attempt := 1; attempt <= 30; attempt++ { + checkCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "podnetwork", pnName, "--ignore-not-found=true", "-o", "name") + checkOut, _ := checkCmd.CombinedOutput() + + if strings.TrimSpace(string(checkOut)) == "" { + fmt.Printf("PodNetwork %s fully removed after %d seconds\n", pnName, attempt*2) + return nil + } + + if attempt%10 == 0 { + fmt.Printf("PodNetwork %s still terminating (attempt %d/30)...\n", pnName, attempt) + } + time.Sleep(2 * time.Second) + } + + // Try to remove finalizers if still stuck + fmt.Printf("PodNetwork %s still exists, attempting to remove finalizers...\n", pnName) + patchCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "patch", "podnetwork", pnName, "-p", `{"metadata":{"finalizers":[]}}`, "--type=merge") + patchOut, patchErr := patchCmd.CombinedOutput() + if patchErr != nil { + fmt.Printf("Warning: Failed to remove finalizers: %s\n%s\n", patchErr, string(patchOut)) + } + + time.Sleep(5 * time.Second) + fmt.Printf("PodNetwork %s deletion completed\n", pnName) + return nil +} + +// DeleteNamespace deletes a namespace and waits for it to be removed +func DeleteNamespace(kubeconfig, namespace string) error { + fmt.Printf("Deleting namespace %s...\n", namespace) + + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "delete", "namespace", namespace, "--ignore-not-found=true") + out, err := cmd.CombinedOutput() + if err != nil { + return fmt.Errorf("failed to delete namespace %s: %s\n%s", namespace, err, string(out)) + } + + // Wait for namespace to be completely gone + fmt.Printf("Waiting for namespace %s to be fully removed...\n", namespace) + for attempt := 1; attempt <= 60; attempt++ { + checkCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "namespace", namespace, "--ignore-not-found=true", "-o", "name") + checkOut, _ := checkCmd.CombinedOutput() + + if strings.TrimSpace(string(checkOut)) == "" { + fmt.Printf("Namespace %s fully removed after %d seconds\n", namespace, attempt*2) + return nil + } + + if attempt%15 == 0 { + fmt.Printf("Namespace 
%s still terminating (attempt %d/60)...\n", namespace, attempt) + } + time.Sleep(2 * time.Second) + } + + // Try to remove finalizers if still stuck + fmt.Printf("Namespace %s still exists, attempting to remove finalizers...\n", namespace) + patchCmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "patch", "namespace", namespace, "-p", `{"metadata":{"finalizers":[]}}`, "--type=merge") + patchOut, patchErr := patchCmd.CombinedOutput() + if patchErr != nil { + fmt.Printf("Warning: Failed to remove finalizers: %s\n%s\n", patchErr, string(patchOut)) + } + + time.Sleep(5 * time.Second) + fmt.Printf("Namespace %s deletion completed\n", namespace) + return nil +} + +// WaitForPodRunning waits for a pod to reach Running state with retries +func WaitForPodRunning(kubeconfig, namespace, podName string, maxRetries, sleepSeconds int) error { + for attempt := 1; attempt <= maxRetries; attempt++ { + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "pod", podName, "-n", namespace, "-o", "jsonpath={.status.phase}") + out, err := cmd.CombinedOutput() + + if err == nil && strings.TrimSpace(string(out)) == "Running" { + fmt.Printf("Pod %s is now Running\n", podName) + return nil + } + + if attempt < maxRetries { + fmt.Printf("Pod %s not running yet (attempt %d/%d), status: %s. Waiting %d seconds...\n", + podName, attempt, maxRetries, strings.TrimSpace(string(out)), sleepSeconds) + time.Sleep(time.Duration(sleepSeconds) * time.Second) + } + } + + return fmt.Errorf("pod %s did not reach Running state after %d attempts", podName, maxRetries) +} + +// GetPodIP retrieves the IP address of a pod +func GetPodIP(kubeconfig, namespace, podName string) (string, error) { + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + defer cancel() + + cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "get", "pod", podName, + "-n", namespace, "-o", "jsonpath={.status.podIP}") + out, err := cmd.CombinedOutput() + if err != nil { + return "", fmt.Errorf("failed to get pod IP for %s in namespace %s: %w\nOutput: %s", podName, namespace, err, string(out)) + } + + ip := strings.TrimSpace(string(out)) + if ip == "" { + return "", fmt.Errorf("pod %s in namespace %s has no IP address assigned", podName, namespace) + } + + return ip, nil +} + +// GetPodDelegatedIP retrieves the eth1 IP address (delegated subnet IP) of a pod +// This is the IP used for cross-VNet communication and is subject to NSG rules +func GetPodDelegatedIP(kubeconfig, namespace, podName string) (string, error) { + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + defer cancel() + + // Get eth1 IP address by running 'ip addr show eth1' in the pod + cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "exec", podName, + "-n", namespace, "--", "sh", "-c", "ip -4 addr show eth1 | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1") + out, err := cmd.CombinedOutput() + if err != nil { + return "", fmt.Errorf("failed to get eth1 IP for %s in namespace %s: %w\nOutput: %s", podName, namespace, err, string(out)) + } + + ip := strings.TrimSpace(string(out)) + if ip == "" { + return "", fmt.Errorf("pod %s in namespace %s has no eth1 IP address (delegated subnet not configured?)", podName, namespace) + } + + return ip, nil +} + +// ExecInPod executes a command in a pod and returns the output +func ExecInPod(kubeconfig, namespace, podName, command string) (string, error) { + ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) + defer cancel() + + cmd := 
exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "exec", podName, + "-n", namespace, "--", "sh", "-c", command) + out, err := cmd.CombinedOutput() + if err != nil { + return string(out), fmt.Errorf("failed to exec in pod %s in namespace %s: %w", podName, namespace, err) + } + + return string(out), nil +} diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go new file mode 100644 index 0000000000..4d138dca32 --- /dev/null +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -0,0 +1,690 @@ +package longRunningCluster + +import ( + "bytes" + "fmt" + "os" + "os/exec" + "strings" + "text/template" + "time" + + "github.com/Azure/azure-container-networking/test/integration/swiftv2/helpers" +) + +func applyTemplate(templatePath string, data interface{}, kubeconfig string) error { + tmpl, err := template.ParseFiles(templatePath) + if err != nil { + return err + } + + var buf bytes.Buffer + if err := tmpl.Execute(&buf, data); err != nil { + return err + } + + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "apply", "-f", "-") + cmd.Stdin = &buf + out, err := cmd.CombinedOutput() + if err != nil { + return fmt.Errorf("kubectl apply failed: %s\n%s", err, string(out)) + } + + fmt.Println(string(out)) + return nil +} + +// ------------------------- +// PodNetwork +// ------------------------- +type PodNetworkData struct { + PNName string + VnetGUID string + SubnetGUID string + SubnetARMID string + SubnetToken string +} + +func CreatePodNetwork(kubeconfig string, data PodNetworkData, templatePath string) error { + return applyTemplate(templatePath, data, kubeconfig) +} + +// ------------------------- +// PodNetworkInstance +// ------------------------- +type PNIData struct { + PNIName string + PNName string + Namespace string + Type string + Reservations int +} + +func CreatePodNetworkInstance(kubeconfig string, data PNIData, templatePath string) error { + return applyTemplate(templatePath, data, kubeconfig) +} + +// ------------------------- +// Pod +// ------------------------- +type PodData struct { + PodName string + NodeName string + OS string + PNName string + PNIName string + Namespace string + Image string +} + +func CreatePod(kubeconfig string, data PodData, templatePath string) error { + return applyTemplate(templatePath, data, kubeconfig) +} + +// ------------------------- +// High-level orchestration +// ------------------------- + +// TestResources holds all the configuration needed for creating test resources +type TestResources struct { + Kubeconfig string + PNName string + PNIName string + VnetGUID string + SubnetGUID string + SubnetARMID string + SubnetToken string + PodNetworkTemplate string + PNITemplate string + PodTemplate string + PodImage string +} + +// PodScenario defines a single pod creation scenario +type PodScenario struct { + Name string // Descriptive name for the scenario + Cluster string // "aks-1" or "aks-2" + VnetName string // e.g., "cx_vnet_a1", "cx_vnet_b1" + SubnetName string // e.g., "s1", "s2" + NodeSelector string // "low-nic" or "high-nic" + PodNameSuffix string // Unique suffix for pod name +} + +// TestScenarios holds all pod scenarios to test +type TestScenarios struct { + ResourceGroup string + BuildID string + PodImage string + Scenarios []PodScenario + VnetSubnetCache map[string]VnetSubnetInfo // Cache for vnet/subnet info + UsedNodes map[string]bool // Tracks which nodes are already used (one pod per node for low-NIC) +} + +// VnetSubnetInfo holds 
network information for a vnet/subnet combination +type VnetSubnetInfo struct { + VnetGUID string + SubnetGUID string + SubnetARMID string + SubnetToken string +} + +// NodePoolInfo holds information about nodes in different pools +type NodePoolInfo struct { + LowNicNodes []string + HighNicNodes []string +} + +// GetNodesByNicCount categorizes nodes by NIC count based on nic-capacity labels +func GetNodesByNicCount(kubeconfig string) (NodePoolInfo, error) { + nodeInfo := NodePoolInfo{ + LowNicNodes: []string{}, + HighNicNodes: []string{}, + } + + // Get workload type from environment variable (defaults to swiftv2-linux) + workloadType := os.Getenv("WORKLOAD_TYPE") + if workloadType == "" { + workloadType = "swiftv2-linux" + } + + fmt.Printf("Filtering nodes by workload-type=%s\n", workloadType) + + // Get nodes with low-nic capacity and matching workload-type + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "nodes", + "-l", fmt.Sprintf("nic-capacity=low-nic,workload-type=%s", workloadType), "-o", "name") + out, err := cmd.CombinedOutput() + if err != nil { + return NodePoolInfo{}, fmt.Errorf("failed to get low-nic nodes: %w\nOutput: %s", err, string(out)) + } + + lines := strings.Split(strings.TrimSpace(string(out)), "\n") + for _, line := range lines { + if strings.HasPrefix(line, "node/") { + nodeInfo.LowNicNodes = append(nodeInfo.LowNicNodes, strings.TrimPrefix(line, "node/")) + } + } + + // Get nodes with high-nic capacity and matching workload-type + cmd = exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "nodes", + "-l", fmt.Sprintf("nic-capacity=high-nic,workload-type=%s", workloadType), "-o", "name") + out, err = cmd.CombinedOutput() + if err != nil { + return NodePoolInfo{}, fmt.Errorf("failed to get high-nic nodes: %w\nOutput: %s", err, string(out)) + } + + lines = strings.Split(strings.TrimSpace(string(out)), "\n") + for _, line := range lines { + if line != "" && strings.HasPrefix(line, "node/") { + nodeInfo.HighNicNodes = append(nodeInfo.HighNicNodes, strings.TrimPrefix(line, "node/")) + } + } + + fmt.Printf("Found %d low-nic nodes and %d high-nic nodes with workload-type=%s\n", + len(nodeInfo.LowNicNodes), len(nodeInfo.HighNicNodes), workloadType) + + return nodeInfo, nil +} + +// CreatePodNetworkResource creates a PodNetwork +func CreatePodNetworkResource(resources TestResources) error { + err := CreatePodNetwork(resources.Kubeconfig, PodNetworkData{ + PNName: resources.PNName, + VnetGUID: resources.VnetGUID, + SubnetGUID: resources.SubnetGUID, + SubnetARMID: resources.SubnetARMID, + SubnetToken: resources.SubnetToken, + }, resources.PodNetworkTemplate) + if err != nil { + return fmt.Errorf("failed to create PodNetwork: %w", err) + } + return nil +} + +// CreateNamespaceResource creates a namespace +func CreateNamespaceResource(kubeconfig, namespace string) error { + err := helpers.EnsureNamespaceExists(kubeconfig, namespace) + if err != nil { + return fmt.Errorf("failed to create namespace: %w", err) + } + return nil +} + +// CreatePodNetworkInstanceResource creates a PodNetworkInstance +func CreatePodNetworkInstanceResource(resources TestResources) error { + err := CreatePodNetworkInstance(resources.Kubeconfig, PNIData{ + PNIName: resources.PNIName, + PNName: resources.PNName, + Namespace: resources.PNName, + Type: "explicit", + Reservations: 2, + }, resources.PNITemplate) + if err != nil { + return fmt.Errorf("failed to create PodNetworkInstance: %w", err) + } + return nil +} + +// CreatePodResource creates a single pod on a specified node and 
waits for it to be running +func CreatePodResource(resources TestResources, podName, nodeName string) error { + err := CreatePod(resources.Kubeconfig, PodData{ + PodName: podName, + NodeName: nodeName, + OS: "linux", + PNName: resources.PNName, + PNIName: resources.PNIName, + Namespace: resources.PNName, + Image: resources.PodImage, + }, resources.PodTemplate) + if err != nil { + return fmt.Errorf("failed to create pod %s: %w", podName, err) + } + + // Wait for pod to be running with retries + err = helpers.WaitForPodRunning(resources.Kubeconfig, resources.PNName, podName, 10, 30) + if err != nil { + return fmt.Errorf("pod %s did not reach running state: %w", podName, err) + } + + return nil +} + +// GetOrFetchVnetSubnetInfo retrieves cached network info or fetches it from Azure +func GetOrFetchVnetSubnetInfo(rg, vnetName, subnetName string, cache map[string]VnetSubnetInfo) (VnetSubnetInfo, error) { + key := fmt.Sprintf("%s/%s", vnetName, subnetName) + + if info, exists := cache[key]; exists { + return info, nil + } + + // Fetch from Azure + vnetGUID, err := helpers.GetVnetGUID(rg, vnetName) + if err != nil { + return VnetSubnetInfo{}, fmt.Errorf("failed to get VNet GUID: %w", err) + } + + subnetGUID, err := helpers.GetSubnetGUID(rg, vnetName, subnetName) + if err != nil { + return VnetSubnetInfo{}, fmt.Errorf("failed to get Subnet GUID: %w", err) + } + + subnetARMID, err := helpers.GetSubnetARMID(rg, vnetName, subnetName) + if err != nil { + return VnetSubnetInfo{}, fmt.Errorf("failed to get Subnet ARM ID: %w", err) + } + + subnetToken, err := helpers.GetSubnetToken(rg, vnetName, subnetName) + if err != nil { + return VnetSubnetInfo{}, fmt.Errorf("failed to get Subnet Token: %w", err) + } + + info := VnetSubnetInfo{ + VnetGUID: vnetGUID, + SubnetGUID: subnetGUID, + SubnetARMID: subnetARMID, + SubnetToken: subnetToken, + } + + cache[key] = info + return info, nil +} + +// CreateScenarioResources creates all resources for a specific pod scenario +func CreateScenarioResources(scenario PodScenario, testScenarios TestScenarios) error { + // Get kubeconfig for the cluster + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.Cluster) + + // Get network info + netInfo, err := GetOrFetchVnetSubnetInfo(testScenarios.ResourceGroup, scenario.VnetName, scenario.SubnetName, testScenarios.VnetSubnetCache) + if err != nil { + return fmt.Errorf("failed to get network info for %s/%s: %w", scenario.VnetName, scenario.SubnetName, err) + } + + // Create unique names for this scenario (simplify vnet name and make K8s compatible) + // Remove "cx_vnet_" prefix and replace underscores with hyphens + vnetShort := strings.TrimPrefix(scenario.VnetName, "cx_vnet_") + vnetShort = strings.ReplaceAll(vnetShort, "_", "-") + subnetNameSafe := strings.ReplaceAll(scenario.SubnetName, "_", "-") + pnName := fmt.Sprintf("pn-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + pniName := fmt.Sprintf("pni-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + + resources := TestResources{ + Kubeconfig: kubeconfig, + PNName: pnName, + PNIName: pniName, + VnetGUID: netInfo.VnetGUID, + SubnetGUID: netInfo.SubnetGUID, + SubnetARMID: netInfo.SubnetARMID, + SubnetToken: netInfo.SubnetToken, + PodNetworkTemplate: "../../manifests/swiftv2/long-running-cluster/podnetwork.yaml", + PNITemplate: "../../manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml", + PodTemplate: "../../manifests/swiftv2/long-running-cluster/pod.yaml", + PodImage: testScenarios.PodImage, + } + + // Step 1: Create PodNetwork + err = 
CreatePodNetworkResource(resources) + if err != nil { + return fmt.Errorf("scenario %s: %w", scenario.Name, err) + } + + // Step 2: Create namespace + err = CreateNamespaceResource(resources.Kubeconfig, resources.PNName) + if err != nil { + return fmt.Errorf("scenario %s: %w", scenario.Name, err) + } + + // Step 3: Create PodNetworkInstance + err = CreatePodNetworkInstanceResource(resources) + if err != nil { + return fmt.Errorf("scenario %s: %w", scenario.Name, err) + } + + // Step 4: Get nodes by NIC count + nodeInfo, err := GetNodesByNicCount(kubeconfig) + if err != nil { + return fmt.Errorf("scenario %s: failed to get nodes: %w", scenario.Name, err) + } + + // Step 5: Select appropriate node based on scenario + var targetNode string + + // Initialize used nodes tracker if not exists + if testScenarios.UsedNodes == nil { + testScenarios.UsedNodes = make(map[string]bool) + } + + if scenario.NodeSelector == "low-nic" { + if len(nodeInfo.LowNicNodes) == 0 { + return fmt.Errorf("scenario %s: no low-NIC nodes available", scenario.Name) + } + // Find first unused node in the pool (low-NIC nodes can only handle one pod) + targetNode = "" + for _, node := range nodeInfo.LowNicNodes { + if !testScenarios.UsedNodes[node] { + targetNode = node + testScenarios.UsedNodes[node] = true + break + } + } + if targetNode == "" { + return fmt.Errorf("scenario %s: all low-NIC nodes already in use", scenario.Name) + } + } else { // "high-nic" + if len(nodeInfo.HighNicNodes) == 0 { + return fmt.Errorf("scenario %s: no high-NIC nodes available", scenario.Name) + } + // Find first unused node in the pool + targetNode = "" + for _, node := range nodeInfo.HighNicNodes { + if !testScenarios.UsedNodes[node] { + targetNode = node + testScenarios.UsedNodes[node] = true + break + } + } + if targetNode == "" { + return fmt.Errorf("scenario %s: all high-NIC nodes already in use", scenario.Name) + } + } + + // Step 6: Create pod + podName := fmt.Sprintf("pod-%s", scenario.PodNameSuffix) + err = CreatePodResource(resources, podName, targetNode) + if err != nil { + return fmt.Errorf("scenario %s: %w", scenario.Name, err) + } + + fmt.Printf("Successfully created scenario: %s (pod: %s on node: %s)\n", scenario.Name, podName, targetNode) + return nil +} + +// DeleteScenarioResources deletes all resources for a specific pod scenario +func DeleteScenarioResources(scenario PodScenario, buildID string) error { + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.Cluster) + + // Create same names as creation (simplify vnet name and make K8s compatible) + // Remove "cx_vnet_" prefix and replace underscores with hyphens + vnetShort := strings.TrimPrefix(scenario.VnetName, "cx_vnet_") + vnetShort = strings.ReplaceAll(vnetShort, "_", "-") + subnetNameSafe := strings.ReplaceAll(scenario.SubnetName, "_", "-") + pnName := fmt.Sprintf("pn-%s-%s-%s", buildID, vnetShort, subnetNameSafe) + pniName := fmt.Sprintf("pni-%s-%s-%s", buildID, vnetShort, subnetNameSafe) + podName := fmt.Sprintf("pod-%s", scenario.PodNameSuffix) + + // Delete pod + err := helpers.DeletePod(kubeconfig, pnName, podName) + if err != nil { + return fmt.Errorf("scenario %s: failed to delete pod: %w", scenario.Name, err) + } + + // Delete PodNetworkInstance + err = helpers.DeletePodNetworkInstance(kubeconfig, pnName, pniName) + if err != nil { + return fmt.Errorf("scenario %s: failed to delete PNI: %w", scenario.Name, err) + } + + // Delete PodNetwork + err = helpers.DeletePodNetwork(kubeconfig, pnName) + if err != nil { + return fmt.Errorf("scenario %s: failed to 
delete PN: %w", scenario.Name, err) + } + + // Delete namespace + err = helpers.DeleteNamespace(kubeconfig, pnName) + if err != nil { + return fmt.Errorf("scenario %s: failed to delete namespace: %w", scenario.Name, err) + } + + fmt.Printf("Successfully deleted scenario: %s\n", scenario.Name) + return nil +} + +// CreateAllScenarios creates resources for all test scenarios +func CreateAllScenarios(testScenarios TestScenarios) error { + for _, scenario := range testScenarios.Scenarios { + fmt.Printf("\n=== Creating scenario: %s ===\n", scenario.Name) + err := CreateScenarioResources(scenario, testScenarios) + if err != nil { + return err + } + } + return nil +} + +// DeleteAllScenarios deletes resources for all test scenarios +// Strategy: Delete all pods first, then delete shared PNI/PN/Namespace resources +func DeleteAllScenarios(testScenarios TestScenarios) error { + // Phase 1: Delete all pods first + fmt.Printf("\n=== Phase 1: Deleting all pods ===\n") + for _, scenario := range testScenarios.Scenarios { + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.Cluster) + vnetShort := strings.TrimPrefix(scenario.VnetName, "cx_vnet_") + vnetShort = strings.ReplaceAll(vnetShort, "_", "-") + subnetNameSafe := strings.ReplaceAll(scenario.SubnetName, "_", "-") + pnName := fmt.Sprintf("pn-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + podName := fmt.Sprintf("pod-%s", scenario.PodNameSuffix) + + fmt.Printf("Deleting pod for scenario: %s\n", scenario.Name) + err := helpers.DeletePod(kubeconfig, pnName, podName) + if err != nil { + fmt.Printf("Warning: Failed to delete pod for scenario %s: %v\n", scenario.Name, err) + } + } + + // Phase 2: Delete shared PNI/PN/Namespace resources (grouped by vnet/subnet/cluster) + fmt.Printf("\n=== Phase 2: Deleting shared PNI/PN/Namespace resources ===\n") + resourceGroups := make(map[string]bool) + + for _, scenario := range testScenarios.Scenarios { + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.Cluster) + vnetShort := strings.TrimPrefix(scenario.VnetName, "cx_vnet_") + vnetShort = strings.ReplaceAll(vnetShort, "_", "-") + subnetNameSafe := strings.ReplaceAll(scenario.SubnetName, "_", "-") + pnName := fmt.Sprintf("pn-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + pniName := fmt.Sprintf("pni-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + + // Create unique key for this vnet/subnet/cluster combination + resourceKey := fmt.Sprintf("%s:%s", scenario.Cluster, pnName) + + // Skip if we already deleted resources for this combination + if resourceGroups[resourceKey] { + continue + } + resourceGroups[resourceKey] = true + + fmt.Printf("\nDeleting shared resources for %s/%s on %s\n", scenario.VnetName, scenario.SubnetName, scenario.Cluster) + + // Delete PodNetworkInstance + err := helpers.DeletePodNetworkInstance(kubeconfig, pnName, pniName) + if err != nil { + fmt.Printf("Warning: Failed to delete PNI %s: %v\n", pniName, err) + } + + // Delete PodNetwork + err = helpers.DeletePodNetwork(kubeconfig, pnName) + if err != nil { + fmt.Printf("Warning: Failed to delete PN %s: %v\n", pnName, err) + } + + // Delete namespace + err = helpers.DeleteNamespace(kubeconfig, pnName) + if err != nil { + fmt.Printf("Warning: Failed to delete namespace %s: %v\n", pnName, err) + } + } + + fmt.Printf("\n=== All scenarios deleted ===\n") + return nil +} + +// DeleteTestResources deletes all test resources in reverse order +func DeleteTestResources(kubeconfig, pnName, pniName string) error { + // Delete pods (first two nodes 
only, matching creation) + for i := 0; i < 2; i++ { + podName := fmt.Sprintf("pod-c2-%d", i) + err := helpers.DeletePod(kubeconfig, pnName, podName) + if err != nil { + return fmt.Errorf("failed to delete pod %s: %w", podName, err) + } + } + + // Delete PodNetworkInstance + err := helpers.DeletePodNetworkInstance(kubeconfig, pnName, pniName) + if err != nil { + return fmt.Errorf("failed to delete PodNetworkInstance: %w", err) + } + + // Delete PodNetwork + err = helpers.DeletePodNetwork(kubeconfig, pnName) + if err != nil { + return fmt.Errorf("failed to delete PodNetwork: %w", err) + } + + // Delete namespace + err = helpers.DeleteNamespace(kubeconfig, pnName) + if err != nil { + return fmt.Errorf("failed to delete namespace: %w", err) + } + + return nil +} + +// ConnectivityTest defines a connectivity test between two pods +type ConnectivityTest struct { + Name string + SourcePod string + SourceNamespace string // Namespace of the source pod + DestinationPod string + DestNamespace string // Namespace of the destination pod + Cluster string // Cluster where source pod is running (for backward compatibility) + DestCluster string // Cluster where destination pod is running (if different from source) + Description string + ShouldFail bool // If true, connectivity is expected to fail (NSG block, customer isolation) + + // Fields for private endpoint tests + SourceCluster string // Cluster where source pod is running + SourcePodName string // Name of the source pod + SourceNS string // Namespace of the source pod + DestEndpoint string // Destination endpoint (IP or hostname) + TestType string // Type of test: "pod-to-pod" or "storage-access" + Purpose string // Description of the test purpose +} + +// RunConnectivityTest tests HTTP connectivity between two pods +func RunConnectivityTest(test ConnectivityTest, rg, buildId string) error { + // Get kubeconfig for the source cluster + sourceKubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", test.Cluster) + + // Get kubeconfig for the destination cluster (default to source cluster if not specified) + destKubeconfig := sourceKubeconfig + if test.DestCluster != "" { + destKubeconfig = fmt.Sprintf("/tmp/%s.kubeconfig", test.DestCluster) + } + + // Get destination pod's eth1 IP (delegated subnet IP for cross-VNet connectivity) + // This is the IP that is subject to NSG rules, not the overlay eth0 IP + destIP, err := helpers.GetPodDelegatedIP(destKubeconfig, test.DestNamespace, test.DestinationPod) + if err != nil { + return fmt.Errorf("failed to get destination pod delegated IP: %w", err) + } + + fmt.Printf("Testing connectivity from %s/%s (cluster: %s) to %s/%s (cluster: %s, eth1: %s) on port 8080\n", + test.SourceNamespace, test.SourcePod, test.Cluster, + test.DestNamespace, test.DestinationPod, test.DestCluster, destIP) + + // Run curl command from source pod to destination pod using eth1 IP + // Using -m 10 for 10 second timeout, -f to fail on HTTP errors + // Using --interface eth1 to force traffic through delegated subnet interface + curlCmd := fmt.Sprintf("curl --interface eth1 -f -m 10 http://%s:8080/", destIP) + + output, err := helpers.ExecInPod(sourceKubeconfig, test.SourceNamespace, test.SourcePod, curlCmd) + if err != nil { + return fmt.Errorf("connectivity test failed: %w\nOutput: %s", err, output) + } + + fmt.Printf("Connectivity successful! 
Response preview: %s\n", truncateString(output, 100)) + return nil +} + +// Helper function to truncate long strings +func truncateString(s string, maxLen int) string { + if len(s) <= maxLen { + return s + } + return s[:maxLen] + "..." +} + +// GenerateStorageSASToken generates a SAS token for a blob in a storage account +func GenerateStorageSASToken(storageAccountName, containerName, blobName string) (string, error) { + // Calculate expiry time: 7 days from now (Azure CLI limit) + expiryTime := time.Now().UTC().Add(7 * 24 * time.Hour).Format("2006-01-02") + + cmd := exec.Command("az", "storage", "blob", "generate-sas", + "--account-name", storageAccountName, + "--container-name", containerName, + "--name", blobName, + "--permissions", "r", + "--expiry", expiryTime, + "--auth-mode", "login", + "--as-user", + "--output", "tsv") + + out, err := cmd.CombinedOutput() + if err != nil { + return "", fmt.Errorf("failed to generate SAS token: %s\n%s", err, string(out)) + } + + sasToken := strings.TrimSpace(string(out)) + if sasToken == "" { + return "", fmt.Errorf("generated SAS token is empty") + } + + return sasToken, nil +} + +// GetStoragePrivateEndpoint retrieves the private IP address of a storage account's private endpoint +func GetStoragePrivateEndpoint(resourceGroup, storageAccountName string) (string, error) { + // Return the storage account blob endpoint FQDN + // This will resolve to the private IP via Private DNS Zone + return fmt.Sprintf("%s.blob.core.windows.net", storageAccountName), nil +} + +// RunPrivateEndpointTest tests connectivity from a pod to a private endpoint (storage account) +func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) error { + // Get kubeconfig for the cluster + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", test.SourceCluster) + + fmt.Printf("Testing private endpoint access from %s to %s\n", + test.SourcePodName, test.DestEndpoint) + + // Step 1: Verify DNS resolution + fmt.Printf("==> Checking DNS resolution for %s\n", test.DestEndpoint) + resolveCmd := fmt.Sprintf("nslookup %s | tail -2", test.DestEndpoint) + resolveOutput, resolveErr := helpers.ExecInPod(kubeconfig, test.SourceNS, test.SourcePodName, resolveCmd) + if resolveErr != nil { + return fmt.Errorf("DNS resolution failed: %w\nOutput: %s", resolveErr, resolveOutput) + } + fmt.Printf("DNS Resolution Result:\n%s\n", resolveOutput) + + // Step 2: Generate SAS token for test blob + fmt.Printf("==> Generating SAS token for test blob\n") + // Extract storage account name from FQDN (e.g., sa106936191.blob.core.windows.net -> sa106936191) + storageAccountName := strings.Split(test.DestEndpoint, ".")[0] + sasToken, err := GenerateStorageSASToken(storageAccountName, "test", "hello.txt") + if err != nil { + return fmt.Errorf("failed to generate SAS token: %w", err) + } + + // Step 3: Download test blob using SAS token (proves both connectivity AND data plane access) + fmt.Printf("==> Downloading test blob via private endpoint\n") + blobURL := fmt.Sprintf("https://%s/test/hello.txt?%s", test.DestEndpoint, sasToken) + curlCmd := fmt.Sprintf("curl -f -s --connect-timeout 5 --max-time 10 '%s'", blobURL) + + output, err := helpers.ExecInPod(kubeconfig, test.SourceNS, test.SourcePodName, curlCmd) + if err != nil { + return fmt.Errorf("private endpoint connectivity test failed: %w\nOutput: %s", err, output) + } + + fmt.Printf("Private endpoint access successful! 
Blob content: %s\n", truncateString(output, 100)) + return nil +} diff --git a/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go b/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go new file mode 100644 index 0000000000..2852992581 --- /dev/null +++ b/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go @@ -0,0 +1,165 @@ +//go:build connectivity_test +// +build connectivity_test + +package longRunningCluster + +import ( + "fmt" + "os" + "strings" + "testing" + + "github.com/onsi/ginkgo/v2" + "github.com/onsi/gomega" +) + +func TestDatapathConnectivity(t *testing.T) { + gomega.RegisterFailHandler(ginkgo.Fail) + suiteConfig, reporterConfig := ginkgo.GinkgoConfiguration() + suiteConfig.Timeout = 0 + ginkgo.RunSpecs(t, "Datapath Connectivity Suite", suiteConfig, reporterConfig) +} + +var _ = ginkgo.Describe("Datapath Connectivity Tests", func() { + rg := os.Getenv("RG") + buildId := os.Getenv("BUILD_ID") + + if rg == "" || buildId == "" { + ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) + } + + ginkgo.It("tests HTTP connectivity between pods", ginkgo.NodeTimeout(0), func() { + // Helper function to generate namespace from vnet and subnet + // Format: pn--- + // Example: pn-sv2-long-run-centraluseuap-a1-s1 + getNamespace := func(vnetName, subnetName string) string { + // Extract vnet prefix (a1, a2, a3, b1, etc.) from cx_vnet_a1 -> a1 + vnetPrefix := strings.TrimPrefix(vnetName, "cx_vnet_") + return fmt.Sprintf("pn-%s-%s-%s", rg, vnetPrefix, subnetName) + } + + // Define connectivity test cases + // Format: {SourcePod, DestinationPod, Cluster, Description, ShouldFail} + connectivityTests := []ConnectivityTest{ + { + Name: "SameVNetSameSubnet", + SourcePod: "pod-c1-aks1-a1s2-low", + SourceNamespace: getNamespace("cx_vnet_a1", "s2"), + DestinationPod: "pod-c1-aks1-a1s2-high", + DestNamespace: getNamespace("cx_vnet_a1", "s2"), + Cluster: "aks-1", + Description: "Test connectivity between low-NIC and high-NIC pods in same VNet/Subnet (cx_vnet_a1/s2)", + ShouldFail: false, + }, + { + Name: "NSGBlocked_S1toS2", + SourcePod: "pod-c1-aks1-a1s1-low", + SourceNamespace: getNamespace("cx_vnet_a1", "s1"), + DestinationPod: "pod-c1-aks1-a1s2-high", + DestNamespace: getNamespace("cx_vnet_a1", "s2"), + Cluster: "aks-1", + Description: "Test NSG isolation: s1 -> s2 in cx_vnet_a1 (should be blocked by NSG rule)", + ShouldFail: true, + }, + { + Name: "NSGBlocked_S2toS1", + SourcePod: "pod-c1-aks1-a1s2-low", + SourceNamespace: getNamespace("cx_vnet_a1", "s2"), + DestinationPod: "pod-c1-aks1-a1s1-low", + DestNamespace: getNamespace("cx_vnet_a1", "s1"), + Cluster: "aks-1", + Description: "Test NSG isolation: s2 -> s1 in cx_vnet_a1 (should be blocked by NSG rule)", + ShouldFail: true, + }, + { + Name: "DifferentClusters_SameVNet", + SourcePod: "pod-c1-aks1-a2s1-high", + SourceNamespace: getNamespace("cx_vnet_a2", "s1"), + DestinationPod: "pod-c1-aks2-a2s1-low", + DestNamespace: getNamespace("cx_vnet_a2", "s1"), + Cluster: "aks-1", + DestCluster: "aks-2", + Description: "Test connectivity across different clusters, same customer VNet (cx_vnet_a2)", + ShouldFail: false, + }, + { + Name: "PeeredVNets", + SourcePod: "pod-c1-aks1-a1s2-low", + SourceNamespace: getNamespace("cx_vnet_a1", "s2"), + DestinationPod: "pod-c1-aks1-a2s1-high", + DestNamespace: getNamespace("cx_vnet_a2", "s1"), + Cluster: "aks-1", + Description: "Test connectivity between peered VNets (cx_vnet_a1/s2 <-> cx_vnet_a2/s1)", + 
ShouldFail: false, + }, + { + Name: "PeeredVNets_A2toA3", + SourcePod: "pod-c1-aks1-a2s1-high", + SourceNamespace: getNamespace("cx_vnet_a2", "s1"), + DestinationPod: "pod-c1-aks2-a3s1-high", + DestNamespace: getNamespace("cx_vnet_a3", "s1"), + Cluster: "aks-1", + DestCluster: "aks-2", + Description: "Test connectivity between peered VNets across clusters (cx_vnet_a2 <-> cx_vnet_a3)", + ShouldFail: false, + }, + { + Name: "DifferentCustomers_A1toB1", + SourcePod: "pod-c1-aks1-a1s2-low", + SourceNamespace: getNamespace("cx_vnet_a1", "s2"), + DestinationPod: "pod-c2-aks2-b1s1-low", + DestNamespace: getNamespace("cx_vnet_b1", "s1"), + Cluster: "aks-1", + DestCluster: "aks-2", + Description: "Test isolation: Customer 1 to Customer 2 should fail (cx_vnet_a1 -> cx_vnet_b1)", + ShouldFail: true, + }, + { + Name: "DifferentCustomers_A2toB1", + SourcePod: "pod-c1-aks1-a2s1-high", + SourceNamespace: getNamespace("cx_vnet_a2", "s1"), + DestinationPod: "pod-c2-aks2-b1s1-high", + DestNamespace: getNamespace("cx_vnet_b1", "s1"), + Cluster: "aks-1", + DestCluster: "aks-2", + Description: "Test isolation: Customer 1 to Customer 2 should fail (cx_vnet_a2 -> cx_vnet_b1)", + ShouldFail: true, + }, + } + + ginkgo.By(fmt.Sprintf("Running %d connectivity tests", len(connectivityTests))) + + successCount := 0 + failureCount := 0 + + for _, test := range connectivityTests { + ginkgo.By(fmt.Sprintf("Test: %s - %s", test.Name, test.Description)) + + err := RunConnectivityTest(test, rg, buildId) + + if test.ShouldFail { + // This test should fail (NSG blocked or customer isolation) + if err == nil { + fmt.Printf("Test %s: UNEXPECTED SUCCESS (expected to be blocked!)\n", test.Name) + failureCount++ + ginkgo.Fail(fmt.Sprintf("Test %s: Expected failure but succeeded (blocking not working!)", test.Name)) + } else { + fmt.Printf("Test %s: Correctly blocked (connection failed as expected)\n", test.Name) + successCount++ + } + } else { + // This test should succeed + if err != nil { + fmt.Printf("Test %s: FAILED - %v\n", test.Name, err) + failureCount++ + gomega.Expect(err).To(gomega.BeNil(), fmt.Sprintf("Test %s failed: %v", test.Name, err)) + } else { + fmt.Printf("Test %s: Connectivity successful\n", test.Name) + successCount++ + } + } + } + + ginkgo.By(fmt.Sprintf("Connectivity test summary: %d succeeded, %d failures", successCount, failureCount)) + }) +}) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_create_test.go b/test/integration/swiftv2/longRunningCluster/datapath_create_test.go new file mode 100644 index 0000000000..9ba860e022 --- /dev/null +++ b/test/integration/swiftv2/longRunningCluster/datapath_create_test.go @@ -0,0 +1,118 @@ +//go:build create_test +// +build create_test + +package longRunningCluster + +import ( + "fmt" + "os" + "testing" + + "github.com/onsi/ginkgo/v2" + "github.com/onsi/gomega" +) + +func TestDatapathCreate(t *testing.T) { + gomega.RegisterFailHandler(ginkgo.Fail) + suiteConfig, reporterConfig := ginkgo.GinkgoConfiguration() + suiteConfig.Timeout = 0 + ginkgo.RunSpecs(t, "Datapath Create Suite", suiteConfig, reporterConfig) +} + +var _ = ginkgo.Describe("Datapath Create Tests", func() { + rg := os.Getenv("RG") + buildId := os.Getenv("BUILD_ID") + + if rg == "" || buildId == "" { + ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) + } + + ginkgo.It("creates PodNetwork, PodNetworkInstance, and Pods", ginkgo.NodeTimeout(0), func() { + // Define all test scenarios + scenarios := []PodScenario{ + // Customer 2 scenarios 
on aks-2 with cx_vnet_b1 + { + Name: "Customer2-AKS2-VnetB1-S1-LowNic", + Cluster: "aks-2", + VnetName: "cx_vnet_b1", + SubnetName: "s1", + NodeSelector: "low-nic", + PodNameSuffix: "c2-aks2-b1s1-low", + }, + { + Name: "Customer2-AKS2-VnetB1-S1-HighNic", + Cluster: "aks-2", + VnetName: "cx_vnet_b1", + SubnetName: "s1", + NodeSelector: "high-nic", + PodNameSuffix: "c2-aks2-b1s1-high", + }, + // Customer 1 scenarios + { + Name: "Customer1-AKS1-VnetA1-S1-LowNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a1", + SubnetName: "s1", + NodeSelector: "low-nic", + PodNameSuffix: "c1-aks1-a1s1-low", + }, + { + Name: "Customer1-AKS1-VnetA1-S2-LowNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a1", + SubnetName: "s2", + NodeSelector: "low-nic", + PodNameSuffix: "c1-aks1-a1s2-low", + }, + { + Name: "Customer1-AKS1-VnetA1-S2-HighNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a1", + SubnetName: "s2", + NodeSelector: "high-nic", + PodNameSuffix: "c1-aks1-a1s2-high", + }, + { + Name: "Customer1-AKS1-VnetA2-S1-HighNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a2", + SubnetName: "s1", + NodeSelector: "high-nic", + PodNameSuffix: "c1-aks1-a2s1-high", + }, + { + Name: "Customer1-AKS2-VnetA2-S1-LowNic", + Cluster: "aks-2", + VnetName: "cx_vnet_a2", + SubnetName: "s1", + NodeSelector: "low-nic", + PodNameSuffix: "c1-aks2-a2s1-low", + }, + { + Name: "Customer1-AKS2-VnetA3-S1-HighNic", + Cluster: "aks-2", + VnetName: "cx_vnet_a3", + SubnetName: "s1", + NodeSelector: "high-nic", + PodNameSuffix: "c1-aks2-a3s1-high", + }, + } + + // Initialize test scenarios with cache + testScenarios := TestScenarios{ + ResourceGroup: rg, + BuildID: buildId, + PodImage: "nicolaka/netshoot:latest", + Scenarios: scenarios, + VnetSubnetCache: make(map[string]VnetSubnetInfo), + UsedNodes: make(map[string]bool), + } + + // Create all scenario resources + ginkgo.By(fmt.Sprintf("Creating all test scenarios (%d scenarios)", len(scenarios))) + err := CreateAllScenarios(testScenarios) + gomega.Expect(err).To(gomega.BeNil(), "Failed to create test scenarios") + + ginkgo.By("Successfully created all test scenarios") + }) +}) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go new file mode 100644 index 0000000000..72020d609b --- /dev/null +++ b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go @@ -0,0 +1,117 @@ +// +build delete_test + +package longRunningCluster + +import ( + "fmt" + "os" + "testing" + + "github.com/onsi/ginkgo/v2" + "github.com/onsi/gomega" +) + +func TestDatapathDelete(t *testing.T) { + gomega.RegisterFailHandler(ginkgo.Fail) + suiteConfig, reporterConfig := ginkgo.GinkgoConfiguration() + suiteConfig.Timeout = 0 + ginkgo.RunSpecs(t, "Datapath Delete Suite", suiteConfig, reporterConfig) +} + +var _ = ginkgo.Describe("Datapath Delete Tests", func() { + rg := os.Getenv("RG") + buildId := os.Getenv("BUILD_ID") + + if rg == "" || buildId == "" { + ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) + } + + ginkgo.It("deletes PodNetwork, PodNetworkInstance, and Pods", ginkgo.NodeTimeout(0), func() { + // Define all test scenarios (same as create) + scenarios := []PodScenario{ + // Customer 2 scenarios on aks-2 with cx_vnet_b1 + { + Name: "Customer2-AKS2-VnetB1-S1-LowNic", + Cluster: "aks-2", + VnetName: "cx_vnet_b1", + SubnetName: "s1", + NodeSelector: "low-nic", + PodNameSuffix: "c2-aks2-b1s1-low", + }, + { + Name: "Customer2-AKS2-VnetB1-S1-HighNic", + Cluster: 
"aks-2", + VnetName: "cx_vnet_b1", + SubnetName: "s1", + NodeSelector: "high-nic", + PodNameSuffix: "c2-aks2-b1s1-high", + }, + // Customer 1 scenarios + { + Name: "Customer1-AKS1-VnetA1-S1-LowNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a1", + SubnetName: "s1", + NodeSelector: "low-nic", + PodNameSuffix: "c1-aks1-a1s1-low", + }, + { + Name: "Customer1-AKS1-VnetA1-S2-LowNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a1", + SubnetName: "s2", + NodeSelector: "low-nic", + PodNameSuffix: "c1-aks1-a1s2-low", + }, + { + Name: "Customer1-AKS1-VnetA1-S2-HighNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a1", + SubnetName: "s2", + NodeSelector: "high-nic", + PodNameSuffix: "c1-aks1-a1s2-high", + }, + { + Name: "Customer1-AKS1-VnetA2-S1-HighNic", + Cluster: "aks-1", + VnetName: "cx_vnet_a2", + SubnetName: "s1", + NodeSelector: "high-nic", + PodNameSuffix: "c1-aks1-a2s1-high", + }, + { + Name: "Customer1-AKS2-VnetA2-S1-LowNic", + Cluster: "aks-2", + VnetName: "cx_vnet_a2", + SubnetName: "s1", + NodeSelector: "low-nic", + PodNameSuffix: "c1-aks2-a2s1-low", + }, + { + Name: "Customer1-AKS2-VnetA3-S1-HighNic", + Cluster: "aks-2", + VnetName: "cx_vnet_a3", + SubnetName: "s1", + NodeSelector: "high-nic", + PodNameSuffix: "c1-aks2-a3s1-high", + }, + } + + // Initialize test scenarios with cache + testScenarios := TestScenarios{ + ResourceGroup: rg, + BuildID: buildId, + PodImage: "nicolaka/netshoot:latest", + Scenarios: scenarios, + VnetSubnetCache: make(map[string]VnetSubnetInfo), + UsedNodes: make(map[string]bool), + } + + // Delete all scenario resources + ginkgo.By("Deleting all test scenarios") + err := DeleteAllScenarios(testScenarios) + gomega.Expect(err).To(gomega.BeNil(), "Failed to delete test scenarios") + + ginkgo.By("Successfully deleted all test scenarios") + }) +}) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go b/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go new file mode 100644 index 0000000000..dc77302db1 --- /dev/null +++ b/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go @@ -0,0 +1,150 @@ +//go:build private_endpoint_test +// +build private_endpoint_test + +package longRunningCluster + +import ( + "fmt" + "os" + "testing" + + "github.com/onsi/ginkgo/v2" + "github.com/onsi/gomega" +) + +func TestDatapathPrivateEndpoint(t *testing.T) { + gomega.RegisterFailHandler(ginkgo.Fail) + suiteConfig, reporterConfig := ginkgo.GinkgoConfiguration() + suiteConfig.Timeout = 0 + ginkgo.RunSpecs(t, "Datapath Private Endpoint Suite", suiteConfig, reporterConfig) +} + +var _ = ginkgo.Describe("Private Endpoint Tests", func() { + rg := os.Getenv("RG") + buildId := os.Getenv("BUILD_ID") + storageAccount1 := os.Getenv("STORAGE_ACCOUNT_1") + storageAccount2 := os.Getenv("STORAGE_ACCOUNT_2") + + ginkgo.It("tests private endpoint access and isolation", func() { + // Validate environment variables inside the It block + if rg == "" || buildId == "" { + ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) + } + + if storageAccount1 == "" || storageAccount2 == "" { + ginkgo.Fail(fmt.Sprintf("Missing storage account environment variables: STORAGE_ACCOUNT_1='%s', STORAGE_ACCOUNT_2='%s'", storageAccount1, storageAccount2)) + } + + // Initialize test scenarios with cache + testScenarios := TestScenarios{ + ResourceGroup: rg, + BuildID: buildId, + PodImage: "nicolaka/netshoot:latest", + VnetSubnetCache: make(map[string]VnetSubnetInfo), + UsedNodes: 
make(map[string]bool), + } + + // Get storage account endpoint for Tenant A (Customer 1) + storageAccountName := storageAccount1 + ginkgo.By(fmt.Sprintf("Getting private endpoint for storage account: %s", storageAccountName)) + + storageEndpoint, err := GetStoragePrivateEndpoint(testScenarios.ResourceGroup, storageAccountName) + gomega.Expect(err).To(gomega.BeNil(), "Failed to get storage account private endpoint") + gomega.Expect(storageEndpoint).NotTo(gomega.BeEmpty(), "Storage account private endpoint is empty") + + ginkgo.By(fmt.Sprintf("Storage account private endpoint: %s", storageEndpoint)) + + // Test scenarios for Private Endpoint connectivity + privateEndpointTests := []ConnectivityTest{ + // Test 1: Private Endpoint Access (Tenant A) - Pod from VNet-A1 Subnet 1 + { + Name: "Private Endpoint Access: VNet-A1-S1 to Storage-A", + SourceCluster: "aks-1", + SourcePodName: "pod-c1-aks1-a1s1-low", + SourceNS: "pn-" + testScenarios.BuildID + "-a1-s1", + DestEndpoint: storageEndpoint, + ShouldFail: false, + TestType: "storage-access", + Purpose: "Verify Tenant A pod can access Storage-A via private endpoint", + }, + // Test 2: Private Endpoint Access (Tenant A) - Pod from VNet-A1 Subnet 2 + { + Name: "Private Endpoint Access: VNet-A1-S2 to Storage-A", + SourceCluster: "aks-1", + SourcePodName: "pod-c1-aks1-a1s2-low", + SourceNS: "pn-" + testScenarios.BuildID + "-a1-s2", + DestEndpoint: storageEndpoint, + ShouldFail: false, + TestType: "storage-access", + Purpose: "Verify Tenant A pod can access Storage-A via private endpoint", + }, + // Test 3: Private Endpoint Access (Tenant A) - Pod from VNet-A2 + { + Name: "Private Endpoint Access: VNet-A2-S1 to Storage-A", + SourceCluster: "aks-1", + SourcePodName: "pod-c1-aks1-a2s1-high", + SourceNS: "pn-" + testScenarios.BuildID + "-a2-s1", + DestEndpoint: storageEndpoint, + ShouldFail: false, + TestType: "storage-access", + Purpose: "Verify Tenant A pod from peered VNet can access Storage-A", + }, + // Test 4: Private Endpoint Access (Tenant A) - Pod from VNet-A3 (cross-cluster) + { + Name: "Private Endpoint Access: VNet-A3-S1 to Storage-A (cross-cluster)", + SourceCluster: "aks-2", + SourcePodName: "pod-c1-aks2-a3s1-high", + SourceNS: "pn-" + testScenarios.BuildID + "-a3-s1", + DestEndpoint: storageEndpoint, + ShouldFail: false, + TestType: "storage-access", + Purpose: "Verify Tenant A pod from different cluster can access Storage-A", + } + } + + ginkgo.By(fmt.Sprintf("Running %d Private Endpoint connectivity tests", len(privateEndpointTests))) + + successCount := 0 + failureCount := 0 + + for _, test := range privateEndpointTests { + ginkgo.By(fmt.Sprintf("\n=== Test: %s ===", test.Name)) + ginkgo.By(fmt.Sprintf("Purpose: %s", test.Purpose)) + ginkgo.By(fmt.Sprintf("Expected: %s", func() string { + if test.ShouldFail { + return "BLOCKED" + } + return "SUCCESS" + }())) + + err := RunPrivateEndpointTest(testScenarios, test) + + if test.ShouldFail { + // Expected to fail (e.g., tenant isolation) + if err != nil { + ginkgo.By(fmt.Sprintf("Test correctly BLOCKED as expected: %s", test.Name)) + successCount++ + } else { + ginkgo.By(fmt.Sprintf("Test FAILED: Expected connection to be blocked but it succeeded: %s", test.Name)) + failureCount++ + } + } else { + // Expected to succeed + if err != nil { + ginkgo.By(fmt.Sprintf("Test FAILED: %s - Error: %v", test.Name, err)) + failureCount++ + } else { + ginkgo.By(fmt.Sprintf("Test PASSED: %s", test.Name)) + successCount++ + } + } + } + + ginkgo.By(fmt.Sprintf("\n=== Private Endpoint Test Summary ===")) + 
ginkgo.By(fmt.Sprintf("Total tests: %d", len(privateEndpointTests))) + ginkgo.By(fmt.Sprintf("Successful connections: %d", successCount)) + ginkgo.By(fmt.Sprintf("Unexpected failures: %d", failureCount)) + + gomega.Expect(failureCount).To(gomega.Equal(0), "Some private endpoint tests failed unexpectedly") + }) +}) From 873c05e37b1ace49de6567dcf09d379044f3d99d Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 24 Nov 2025 09:04:12 -0800 Subject: [PATCH 02/64] Update readme file. --- .pipelines/swiftv2-long-running/README.md | 251 ++-------------------- 1 file changed, 14 insertions(+), 237 deletions(-) diff --git a/.pipelines/swiftv2-long-running/README.md b/.pipelines/swiftv2-long-running/README.md index b513dcab00..5b47b43ce9 100644 --- a/.pipelines/swiftv2-long-running/README.md +++ b/.pipelines/swiftv2-long-running/README.md @@ -50,33 +50,11 @@ Examples: sv2-long-run-12345, sv2-long-run-67890 - **Lifecycle**: Can be cleaned up after testing completes - **Example**: PR validation run with Build ID 12345 → `sv2-long-run-12345` -**3. Parallel/Custom Environments**: -``` -Pattern: sv2-long-run-- -Examples: sv2-long-run-centraluseuap-dev, sv2-long-run-eastus-staging -``` -- **When to use**: Parallel environments, feature testing, version upgrades -- **Purpose**: Isolated environment alongside production -- **Lifecycle**: Persistent or temporary based on use case -- **Example**: Development environment in Central US EUAP → `sv2-long-run-centraluseuap-dev` - **Important Notes**: -- ⚠️ Always follow the naming pattern for scheduled runs on master: `sv2-long-run-` -- ⚠️ Do not use build IDs for production scheduled infrastructure (it breaks continuity) -- ⚠️ Region name should match the `location` parameter for consistency -- ✅ All resource names within the setup use the resource group name as BUILD_ID prefix - -### Mode 1: Scheduled Test Runs (Default) -**Trigger**: Automated cron schedule every 1 hour -**Purpose**: Continuous validation of long-running infrastructure -**Setup Stages**: Disabled -**Test Duration**: ~30-40 minutes per run -**Resource Group**: Static (default: `sv2-long-run-`, e.g., `sv2-long-run-centraluseuap`) +- Always follow the naming pattern for scheduled runs on master: `sv2-long-run-` +- Do not use build IDs for production scheduled infrastructure (it breaks continuity) +- All resource names within the setup use the resource group name as BUILD_ID prefix -```yaml -# Runs automatically every 1 hour -# No manual/external triggers allowed -``` ### Mode 2: Initial Setup or Rebuild **Trigger**: Manual run with parameter change @@ -120,15 +98,6 @@ Parameters are organized by usage: |-----------|---------|-------------| | `resourceGroupName` | `""` (empty) | **Leave empty** to auto-generate based on usage pattern. See Resource Group Naming Conventions below. 
| -**Resource Group Naming Conventions**: -- **For scheduled runs on master/main branch**: Use `sv2-long-run-<region>` (e.g., `sv2-long-run-centraluseuap`) - - This ensures consistent naming for production scheduled tests - - Example: Creating infrastructure in `centraluseuap` for scheduled runs → `sv2-long-run-centraluseuap` -- **For test/dev runs or PR validation**: Use `sv2-long-run-$(Build.BuildId)` - - Auto-cleanup after testing - - Example: `sv2-long-run-12345` (where 12345 is the build ID) -- **For parallel environments**: Use descriptive suffix (e.g., `sv2-long-run-centraluseuap-dev`, `sv2-long-run-eastus-staging`) - **Note**: VM SKUs are hardcoded as constants in the pipeline template: - Default nodepool: `Standard_D4s_v3` (low-nic capacity, 1 NIC) - NPLinux nodepool: `Standard_D16s_v3` (high-nic capacity, 7 NICs) @@ -161,21 +130,21 @@ The pipeline is organized into stages based on workload type, allowing sequentia ### Future Stages (Planned Architecture) Additional stages can be added to test different workload types sequentially: -**Example: Stage 3 - BYONodeDataPathTests** +**Example: Stage 3 - LinuxBYONodeDataPathTests** ```yaml -- stage: BYONodeDataPathTests +- stage: LinuxBYONodeDataPathTests displayName: "SwiftV2 Data Path Tests - BYO Node ID" dependsOn: ManagedNodeDataPathTests variables: - WORKLOAD_TYPE: "swiftv2-byonodeid" + WORKLOAD_TYPE: "swiftv2-linuxbyon" # Same job structure as ManagedNodeDataPathTests # Tests run on nodes labeled: workload-type=swiftv2-byonodeid ``` -**Example: Stage 4 - WindowsNodeDataPathTests** +**Example: Stage 4 - L1vhAccelnetNodeDataPathTests** ```yaml -- stage: WindowsNodeDataPathTests - displayName: "SwiftV2 Data Path Tests - Windows Nodes" +- stage: L1vhAccelnetNodeDataPathTests + displayName: "SwiftV2 Data Path Tests - Windows Nodes Accelnet" dependsOn: BYONodeDataPathTests variables: WORKLOAD_TYPE: "swiftv2-windows" # Same job structure as ManagedNodeDataPathTests # Tests run on nodes labeled: workload-type=swiftv2-windows ``` -**Benefits of Stage-Based Architecture**: -- ✅ Sequential execution: Each workload type tested independently -- ✅ Isolated node pools: No resource contention between workload types -- ✅ Same infrastructure: All stages use the same VNets, storage, NSGs -- ✅ Same test suite: Connectivity and private endpoint tests run for each workload type -- ✅ Easy extensibility: Add new stages without modifying existing ones -- ✅ Clear results: Separate test results per workload type - **Node Labeling for Multiple Workload Types**: Each node pool gets labeled with its designated workload type during setup: ```bash # During cluster creation or node pool addition: -kubectl label nodes -l agentpool=nodepool1 workload-type=swiftv2-linux -kubectl label nodes -l agentpool=byonodepool workload-type=swiftv2-byonodeid -kubectl label nodes -l agentpool=winnodepool workload-type=swiftv2-windows +kubectl label nodes -l agentpool=<nodepool> workload-type=swiftv2-linux +kubectl label nodes -l agentpool=<nodepool> workload-type=swiftv2-linuxbyon +kubectl label nodes -l agentpool=<nodepool> workload-type=swiftv2-l1vhaccelnet +kubectl label nodes -l agentpool=<nodepool> workload-type=swiftv2-l1vhib ``` ## How It Works ### Scheduled Test Flow -Every 1 hour, the pipeline: +Every 3 hours, the pipeline: 1. Skips setup stages (infrastructure already exists) 2. **Job 1 - Create Resources**: Creates 8 test scenarios (PodNetwork, PNI, Pods with HTTP servers on port 8080) +3. 
**Job 2 - Connectivity Tests**: Tests HTTP connectivity between pods (9 test cases), then waits 20 minutes @@ -361,142 +323,6 @@ pod-c1-aks1-a1s1-low **All infrastructure resources are tagged with `SkipAutoDeleteTill=2032-12-31`** to prevent automatic cleanup by Azure subscription policies. -## Resource Naming - -All test resources use the pattern: `-static-setup--` - -**Examples**: -- PodNetwork: `pn-static-setup-a1-s1` -- PodNetworkInstance: `pni-static-setup-a1-s1` -- Pod: `pod-c1-aks1-a1s1-low` -- Namespace: `pn-static-setup-a1-s1` - -VNet names are simplified: -- `cx_vnet_a1` → `a1` -- `cx_vnet_b1` → `b1` - -## Switching to a New Setup - -**Scenario**: You created a new setup in RG `sv2-long-run-eastus` and want scheduled runs to use it. - -**Steps**: -1. Go to Pipeline → Edit -2. Update location parameter default value: - ```yaml - - name: location - default: "centraluseuap" # Change this - ``` -3. Save and commit -4. RG name will automatically become `sv2-long-run-centraluseuap` - -Alternatively, manually trigger with the new location or override `resourceGroupName` directly. - -## Creating Multiple Test Setups - -**Use Case**: You want to create a new test environment without affecting the existing one (e.g., for testing different configurations, regions, or versions). - -**Steps**: -1. Go to Pipeline → Run pipeline -2. Set `runSetupStages` = `true` -3. **Set `resourceGroupName`** based on usage: - - **For scheduled runs on master/main branch**: `sv2-long-run-` (e.g., `sv2-long-run-centraluseuap`, `sv2-long-run-eastus`) - - Use this naming pattern for production scheduled tests - - **For test/dev runs**: `sv2-long-run-$(Build.BuildId)` or custom (e.g., `sv2-long-run-12345`) - - For temporary testing or PR validation - - **For parallel environments**: Custom with descriptive suffix (e.g., `sv2-long-run-centraluseuap-dev`, `sv2-long-run-centraluseuap-v2`) -4. Optionally adjust `location` -5. Run pipeline - -**After setup completes**: -- The new infrastructure will be tagged with `SkipAutoDeleteTill=2032-12-31` -- Resources are isolated by the unique resource group name -- To run tests against the new setup, the scheduled pipeline would need to be updated with the new RG name - -**Example Scenarios**: -| Scenario | Resource Group Name | Purpose | Naming Pattern | -|----------|-------------------|---------|----------------| -| Production scheduled (Central US EUAP) | `sv2-long-run-centraluseuap` | Daily scheduled tests on master | `sv2-long-run-` | -| Production scheduled (East US) | `sv2-long-run-eastus` | Regional scheduled testing on master | `sv2-long-run-` | -| Temporary test run | `sv2-long-run-12345` | One-time testing (Build ID: 12345) | `sv2-long-run-$(Build.BuildId)` | -| Development environment | `sv2-long-run-centraluseuap-dev` | Development/testing | Custom with suffix | -| Version upgrade testing | `sv2-long-run-centraluseuap-v2` | Parallel environment for upgrades | Custom with suffix | - -## Resource Naming - instead of ping use -The pipeline uses the **resource group name as the BUILD_ID** to ensure unique resource names per test setup. This allows multiple parallel test environments without naming collisions. 
- -**Generated Resource Names**: -``` -BUILD_ID = - -PodNetwork: pn--- -PodNetworkInstance: pni--- -Namespace: pn--- -Pod: pod- -``` - -**Example for `resourceGroupName=sv2-long-run-centraluseuap`**: -``` -pn-sv2-long-run-centraluseuap-b1-s1 (PodNetwork for cx_vnet_b1, subnet s1) -pni-sv2-long-run-centraluseuap-b1-s1 (PodNetworkInstance) -pn-sv2-long-run-centraluseuap-a1-s1 (PodNetwork for cx_vnet_a1, subnet s1) -pni-sv2-long-run-centraluseuap-a1-s2 (PodNetworkInstance for cx_vnet_a1, subnet s2) -``` - -**Example for different setup `resourceGroupName=sv2-long-run-eastus`**: -``` -pn-sv2-long-run-eastus-b1-s1 (Different from centraluseuap setup) -pni-sv2-long-run-eastus-b1-s1 -pn-sv2-long-run-eastus-a1-s1 -``` - -This ensures **no collision** between different test setups running in parallel. - -## Deletion Strategy -### Phase 1: Delete All Pods -Deletes all pods across all scenarios first. This ensures IP reservations are released. - -``` -Deleting pod pod-c2-aks2-b1s1-low... -Deleting pod pod-c2-aks2-b1s1-high... -... -``` - -### Phase 2: Delete Shared Resources -Groups resources by vnet/subnet/cluster and deletes PNI/PN/Namespace once per group. - -``` -Deleting PodNetworkInstance pni-static-setup-b1-s1... -Deleting PodNetwork pn-static-setup-b1-s1... -Deleting namespace pn-static-setup-b1-s1... -``` - -**Why**: Multiple pods can share the same PNI. Deleting PNI while pods exist causes "ReservationInUse" errors. - -## Troubleshooting - -### Tests are running on wrong cluster -- Check `resourceGroupName` parameter points to correct RG -- Verify RG contains aks-1 and aks-2 clusters -- Check kubeconfig retrieval in logs - -### Setup stages not running -- Verify `runSetupStages` parameter is set to `true` -- Check condition: `condition: eq(${{ parameters.runSetupStages }}, true)` - -### Schedule not triggering -- Verify cron expression: `"0 */1 * * *"` (every 1 hour) -- Check branch in schedule matches your working branch -- Ensure `always: true` is set (runs even without code changes) - -### PNI stuck with "ReservationInUse" -- Check if pods were deleted first (Phase 1 logs) -- Manual fix: Delete pod → Wait 10s → Patch PNI to remove finalizers - -### Pipeline timeout after 6 hours -- This is expected behavior (timeoutInMinutes: 360) -- Tests should complete in ~30-40 minutes -- If tests hang, check deletion logs for stuck resources ## Manual Testing @@ -544,21 +370,6 @@ kubectl label nodes -l agentpool=nodepool1 nic-capacity=low-nic --overwrite kubectl label nodes -l agentpool=nplinux nic-capacity=high-nic --overwrite ``` -**Example Node Labels**: -```yaml -# Low-NIC node (nodepool1) -labels: - agentpool: nodepool1 - workload-type: swiftv2-linux - nic-capacity: low-nic - -# High-NIC node (nplinux) -labels: - agentpool: nplinux - workload-type: swiftv2-linux - nic-capacity: high-nic -``` - ### Node Selection in Tests Tests use these labels to select appropriate nodes dynamically: @@ -588,20 +399,6 @@ Tests use these labels to select appropriate nodes dynamically: **Note**: VM SKUs are hardcoded as constants in the pipeline template and cannot be changed by users. 
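+
+To spot-check the labels before a run, list the nodes the tests would pick using the same selectors as `GetNodesByNicCount` (a minimal manual check, assuming the kubeconfig for aks-1 or aks-2 is selected and the default `swiftv2-linux` workload type):
+
+```bash
+# Nodes considered for the default Linux workload type (mirrors the Go helper's selectors).
+WORKLOAD_TYPE=swiftv2-linux
+kubectl get nodes -l "nic-capacity=low-nic,workload-type=${WORKLOAD_TYPE}" -o name
+kubectl get nodes -l "nic-capacity=high-nic,workload-type=${WORKLOAD_TYPE}" -o name
+
+# Show the scheduling-related labels side by side.
+kubectl get nodes -L agentpool,workload-type,nic-capacity
+```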
-## Schedule Modification - -To change test frequency, edit the cron schedule: - -```yaml -schedules: - - cron: "0 */1 * * *" # Every 1 hour (current) - # Examples: - # - cron: "0 */2 * * *" # Every 2 hours - # - cron: "0 */6 * * *" # Every 6 hours - # - cron: "0 0,8,16 * * *" # At 12am, 8am, 4pm - # - cron: "0 0 * * *" # Daily at midnight -``` - ## File Structure ``` @@ -639,23 +436,3 @@ test/integration/swiftv2/longRunningCluster/ - Storage accounts 5. **Avoid resource group collisions**: Always use unique `resourceGroupName` when creating new setups 6. **Document changes**: Update this README when modifying test scenarios or infrastructure - -## Resource Tags - -All infrastructure resources are automatically tagged during creation: - -```bash -SkipAutoDeleteTill=2032-12-31 -``` - -This prevents automatic cleanup by Azure subscription policies that delete resources after a certain period. The tag is applied to: -- Resource group (via create_resource_group job) -- AKS clusters (aks-1, aks-2) -- AKS cluster VNets -- Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1) -- Storage accounts (sa1xxxx, sa2xxxx) - -To manually update the tag date: -```bash -az resource update --ids --set tags.SkipAutoDeleteTill=2033-12-31 -``` From 33954156525ae67742b555a17c521125a6b87ac4 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 24 Nov 2025 09:37:55 -0800 Subject: [PATCH 03/64] fix syntax for pe test. --- .../longRunningCluster/datapath_private_endpoint_test.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go b/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go index dc77302db1..0d94087a50 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go @@ -99,7 +99,7 @@ var _ = ginkgo.Describe("Private Endpoint Tests", func() { ShouldFail: false, TestType: "storage-access", Purpose: "Verify Tenant A pod from different cluster can access Storage-A", - } + }, } ginkgo.By(fmt.Sprintf("Running %d Private Endpoint connectivity tests", len(privateEndpointTests))) From b34b332b0e55b94c6f3992463b7039be487f4637 Mon Sep 17 00:00:00 2001 From: sivakami Date: Tue, 2 Dec 2025 09:29:40 -0800 Subject: [PATCH 04/64] Create NSG rules with unique priority. --- .../scripts/create_nsg.sh | 271 ++++++++++++------ 1 file changed, 182 insertions(+), 89 deletions(-) diff --git a/.pipelines/swiftv2-long-running/scripts/create_nsg.sh b/.pipelines/swiftv2-long-running/scripts/create_nsg.sh index 34c04f5c70..09f4dade4c 100755 --- a/.pipelines/swiftv2-long-running/scripts/create_nsg.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_nsg.sh @@ -106,96 +106,189 @@ wait_for_nsg() { } # ------------------------------- -# 1. Wait for NSGs to be available +# 1. Wait for NSG to be available # ------------------------------- wait_for_nsg "$RG" "$NSG_S1_NAME" -wait_for_nsg "$RG" "$NSG_S2_NAME" -# ------------------------------- -# 2. 
Create NSG Rules on Subnet1's NSG -# ------------------------------- -# Rule 1: Deny Outbound traffic FROM Subnet1 TO Subnet2 -echo "==> Creating NSG rule on $NSG_S1_NAME to DENY OUTBOUND traffic from Subnet1 ($SUBNET1_PREFIX) to Subnet2 ($SUBNET2_PREFIX)" -az network nsg rule create \ - --resource-group "$RG" \ - --nsg-name "$NSG_S1_NAME" \ - --name deny-s1-to-s2-outbound \ - --priority 100 \ - --source-address-prefixes "$SUBNET1_PREFIX" \ - --destination-address-prefixes "$SUBNET2_PREFIX" \ - --source-port-ranges "*" \ - --destination-port-ranges "*" \ - --direction Outbound \ - --access Deny \ - --protocol "*" \ - --description "Deny outbound traffic from Subnet1 to Subnet2" \ - --output none \ - && echo "[OK] Deny outbound rule from Subnet1 → Subnet2 created on $NSG_S1_NAME." - -verify_nsg_rule "$RG" "$NSG_S1_NAME" "deny-s1-to-s2-outbound" - -# Rule 2: Deny Inbound traffic FROM Subnet2 TO Subnet1 (for packets arriving at s1) -echo "==> Creating NSG rule on $NSG_S1_NAME to DENY INBOUND traffic from Subnet2 ($SUBNET2_PREFIX) to Subnet1 ($SUBNET1_PREFIX)" -az network nsg rule create \ - --resource-group "$RG" \ - --nsg-name "$NSG_S1_NAME" \ - --name deny-s2-to-s1-inbound \ - --priority 110 \ - --source-address-prefixes "$SUBNET2_PREFIX" \ - --destination-address-prefixes "$SUBNET1_PREFIX" \ - --source-port-ranges "*" \ - --destination-port-ranges "*" \ - --direction Inbound \ - --access Deny \ - --protocol "*" \ - --description "Deny inbound traffic from Subnet2 to Subnet1" \ - --output none \ - && echo "[OK] Deny inbound rule from Subnet2 → Subnet1 created on $NSG_S1_NAME." - -verify_nsg_rule "$RG" "$NSG_S1_NAME" "deny-s2-to-s1-inbound" - -# ------------------------------- -# 3. Create NSG Rules on Subnet2's NSG -# ------------------------------- -# Rule 3: Deny Outbound traffic FROM Subnet2 TO Subnet1 -echo "==> Creating NSG rule on $NSG_S2_NAME to DENY OUTBOUND traffic from Subnet2 ($SUBNET2_PREFIX) to Subnet1 ($SUBNET1_PREFIX)" -az network nsg rule create \ - --resource-group "$RG" \ - --nsg-name "$NSG_S2_NAME" \ - --name deny-s2-to-s1-outbound \ - --priority 100 \ - --source-address-prefixes "$SUBNET2_PREFIX" \ - --destination-address-prefixes "$SUBNET1_PREFIX" \ - --source-port-ranges "*" \ - --destination-port-ranges "*" \ - --direction Outbound \ - --access Deny \ - --protocol "*" \ - --description "Deny outbound traffic from Subnet2 to Subnet1" \ - --output none \ - && echo "[OK] Deny outbound rule from Subnet2 → Subnet1 created on $NSG_S2_NAME." - -verify_nsg_rule "$RG" "$NSG_S2_NAME" "deny-s2-to-s1-outbound" - -# Rule 4: Deny Inbound traffic FROM Subnet1 TO Subnet2 (for packets arriving at s2) -echo "==> Creating NSG rule on $NSG_S2_NAME to DENY INBOUND traffic from Subnet1 ($SUBNET1_PREFIX) to Subnet2 ($SUBNET2_PREFIX)" -az network nsg rule create \ - --resource-group "$RG" \ - --nsg-name "$NSG_S2_NAME" \ - --name deny-s1-to-s2-inbound \ - --priority 110 \ - --source-address-prefixes "$SUBNET1_PREFIX" \ - --destination-address-prefixes "$SUBNET2_PREFIX" \ - --source-port-ranges "*" \ - --destination-port-ranges "*" \ - --direction Inbound \ - --access Deny \ - --protocol "*" \ - --description "Deny inbound traffic from Subnet1 to Subnet2" \ - --output none \ - && echo "[OK] Deny inbound rule from Subnet1 → Subnet2 created on $NSG_S2_NAME." - -verify_nsg_rule "$RG" "$NSG_S2_NAME" "deny-s1-to-s2-inbound" - -echo "NSG rules applied successfully on $NSG_S1_NAME and $NSG_S2_NAME with bidirectional isolation between Subnet1 and Subnet2." 
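+# Azure requires NSG rule priorities to be unique per direction within a single NSG,
+# so when s1 and s2 share one NSG the four deny rules below use 100/110 within each
+# direction; when the subnets have their own NSGs, the same priorities can safely be
+# reused on each NSG.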
+# Check if both subnets share the same NSG +if [[ "$NSG_S1_NAME" == "$NSG_S2_NAME" ]]; then + echo "==> Both subnets share the same NSG: $NSG_S1_NAME" + echo "==> Creating all NSG rules on shared NSG with unique priorities" + + # Rule 1: Deny Outbound traffic FROM Subnet1 TO Subnet2 + echo "==> Creating NSG rule on $NSG_S1_NAME to DENY OUTBOUND traffic from Subnet1 ($SUBNET1_PREFIX) to Subnet2 ($SUBNET2_PREFIX)" + az network nsg rule create \ + --resource-group "$RG" \ + --nsg-name "$NSG_S1_NAME" \ + --name deny-s1-to-s2-outbound \ + --priority 100 \ + --source-address-prefixes "$SUBNET1_PREFIX" \ + --destination-address-prefixes "$SUBNET2_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Outbound \ + --access Deny \ + --protocol "*" \ + --description "Deny outbound traffic from Subnet1 to Subnet2" \ + --output none \ + && echo "[OK] Deny outbound rule from Subnet1 → Subnet2 created on $NSG_S1_NAME." + + verify_nsg_rule "$RG" "$NSG_S1_NAME" "deny-s1-to-s2-outbound" + + # Rule 2: Deny Inbound traffic FROM Subnet2 TO Subnet1 + echo "==> Creating NSG rule on $NSG_S1_NAME to DENY INBOUND traffic from Subnet2 ($SUBNET2_PREFIX) to Subnet1 ($SUBNET1_PREFIX)" + az network nsg rule create \ + --resource-group "$RG" \ + --nsg-name "$NSG_S1_NAME" \ + --name deny-s2-to-s1-inbound \ + --priority 100 \ + --source-address-prefixes "$SUBNET2_PREFIX" \ + --destination-address-prefixes "$SUBNET1_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Inbound \ + --access Deny \ + --protocol "*" \ + --description "Deny inbound traffic from Subnet2 to Subnet1" \ + --output none \ + && echo "[OK] Deny inbound rule from Subnet2 → Subnet1 created on $NSG_S1_NAME." + + verify_nsg_rule "$RG" "$NSG_S1_NAME" "deny-s2-to-s1-inbound" + + # Rule 3: Deny Outbound traffic FROM Subnet2 TO Subnet1 + echo "==> Creating NSG rule on $NSG_S1_NAME to DENY OUTBOUND traffic from Subnet2 ($SUBNET2_PREFIX) to Subnet1 ($SUBNET1_PREFIX)" + az network nsg rule create \ + --resource-group "$RG" \ + --nsg-name "$NSG_S1_NAME" \ + --name deny-s2-to-s1-outbound \ + --priority 110 \ + --source-address-prefixes "$SUBNET2_PREFIX" \ + --destination-address-prefixes "$SUBNET1_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Outbound \ + --access Deny \ + --protocol "*" \ + --description "Deny outbound traffic from Subnet2 to Subnet1" \ + --output none \ + && echo "[OK] Deny outbound rule from Subnet2 → Subnet1 created on $NSG_S1_NAME." + + verify_nsg_rule "$RG" "$NSG_S1_NAME" "deny-s2-to-s1-outbound" + + # Rule 4: Deny Inbound traffic FROM Subnet1 TO Subnet2 + echo "==> Creating NSG rule on $NSG_S1_NAME to DENY INBOUND traffic from Subnet1 ($SUBNET1_PREFIX) to Subnet2 ($SUBNET2_PREFIX)" + az network nsg rule create \ + --resource-group "$RG" \ + --nsg-name "$NSG_S1_NAME" \ + --name deny-s1-to-s2-inbound \ + --priority 110 \ + --source-address-prefixes "$SUBNET1_PREFIX" \ + --destination-address-prefixes "$SUBNET2_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Inbound \ + --access Deny \ + --protocol "*" \ + --description "Deny inbound traffic from Subnet1 to Subnet2" \ + --output none \ + && echo "[OK] Deny inbound rule from Subnet1 → Subnet2 created on $NSG_S1_NAME." + + verify_nsg_rule "$RG" "$NSG_S1_NAME" "deny-s1-to-s2-inbound" + + echo "NSG rules applied successfully on shared NSG $NSG_S1_NAME with bidirectional isolation between Subnet1 and Subnet2." 
+else + echo "==> Subnets have different NSGs" + echo "==> Subnet s1 NSG: $NSG_S1_NAME" + echo "==> Subnet s2 NSG: $NSG_S2_NAME" + + wait_for_nsg "$RG" "$NSG_S2_NAME" + + # ------------------------------- + # 2. Create NSG Rules on Subnet1's NSG + # ------------------------------- + # Rule 1: Deny Outbound traffic FROM Subnet1 TO Subnet2 + echo "==> Creating NSG rule on $NSG_S1_NAME to DENY OUTBOUND traffic from Subnet1 ($SUBNET1_PREFIX) to Subnet2 ($SUBNET2_PREFIX)" + az network nsg rule create \ + --resource-group "$RG" \ + --nsg-name "$NSG_S1_NAME" \ + --name deny-s1-to-s2-outbound \ + --priority 100 \ + --source-address-prefixes "$SUBNET1_PREFIX" \ + --destination-address-prefixes "$SUBNET2_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Outbound \ + --access Deny \ + --protocol "*" \ + --description "Deny outbound traffic from Subnet1 to Subnet2" \ + --output none \ + && echo "[OK] Deny outbound rule from Subnet1 → Subnet2 created on $NSG_S1_NAME." + + verify_nsg_rule "$RG" "$NSG_S1_NAME" "deny-s1-to-s2-outbound" + + # Rule 2: Deny Inbound traffic FROM Subnet2 TO Subnet1 (for packets arriving at s1) + echo "==> Creating NSG rule on $NSG_S1_NAME to DENY INBOUND traffic from Subnet2 ($SUBNET2_PREFIX) to Subnet1 ($SUBNET1_PREFIX)" + az network nsg rule create \ + --resource-group "$RG" \ + --nsg-name "$NSG_S1_NAME" \ + --name deny-s2-to-s1-inbound \ + --priority 110 \ + --source-address-prefixes "$SUBNET2_PREFIX" \ + --destination-address-prefixes "$SUBNET1_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Inbound \ + --access Deny \ + --protocol "*" \ + --description "Deny inbound traffic from Subnet2 to Subnet1" \ + --output none \ + && echo "[OK] Deny inbound rule from Subnet2 → Subnet1 created on $NSG_S1_NAME." + + verify_nsg_rule "$RG" "$NSG_S1_NAME" "deny-s2-to-s1-inbound" + + # ------------------------------- + # 3. Create NSG Rules on Subnet2's NSG + # ------------------------------- + # Rule 3: Deny Outbound traffic FROM Subnet2 TO Subnet1 + echo "==> Creating NSG rule on $NSG_S2_NAME to DENY OUTBOUND traffic from Subnet2 ($SUBNET2_PREFIX) to Subnet1 ($SUBNET1_PREFIX)" + az network nsg rule create \ + --resource-group "$RG" \ + --nsg-name "$NSG_S2_NAME" \ + --name deny-s2-to-s1-outbound \ + --priority 100 \ + --source-address-prefixes "$SUBNET2_PREFIX" \ + --destination-address-prefixes "$SUBNET1_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Outbound \ + --access Deny \ + --protocol "*" \ + --description "Deny outbound traffic from Subnet2 to Subnet1" \ + --output none \ + && echo "[OK] Deny outbound rule from Subnet2 → Subnet1 created on $NSG_S2_NAME." + + verify_nsg_rule "$RG" "$NSG_S2_NAME" "deny-s2-to-s1-outbound" + + # Rule 4: Deny Inbound traffic FROM Subnet1 TO Subnet2 (for packets arriving at s2) + echo "==> Creating NSG rule on $NSG_S2_NAME to DENY INBOUND traffic from Subnet1 ($SUBNET1_PREFIX) to Subnet2 ($SUBNET2_PREFIX)" + az network nsg rule create \ + --resource-group "$RG" \ + --nsg-name "$NSG_S2_NAME" \ + --name deny-s1-to-s2-inbound \ + --priority 110 \ + --source-address-prefixes "$SUBNET1_PREFIX" \ + --destination-address-prefixes "$SUBNET2_PREFIX" \ + --source-port-ranges "*" \ + --destination-port-ranges "*" \ + --direction Inbound \ + --access Deny \ + --protocol "*" \ + --description "Deny inbound traffic from Subnet1 to Subnet2" \ + --output none \ + && echo "[OK] Deny inbound rule from Subnet1 → Subnet2 created on $NSG_S2_NAME." 
+ + verify_nsg_rule "$RG" "$NSG_S2_NAME" "deny-s1-to-s2-inbound" + + echo "NSG rules applied successfully on $NSG_S1_NAME and $NSG_S2_NAME with bidirectional isolation between Subnet1 and Subnet2." +fi From f0d74b68aaa05f55e13493fa1b48b7aee047a7f9 Mon Sep 17 00:00:00 2001 From: sivakami Date: Tue, 2 Dec 2025 10:37:41 -0800 Subject: [PATCH 05/64] Scale tests - create 15 pods across 2 clusters. --- .../long-running-pipeline-template.yaml | 77 +++++- .../pod-with-device-plugin.yaml | 74 ++++++ .../longRunningCluster/datapath_scale_test.go | 221 ++++++++++++++++++ 3 files changed, 371 insertions(+), 1 deletion(-) create mode 100644 test/integration/manifests/swiftv2/long-running-cluster/pod-with-device-plugin.yaml create mode 100644 test/integration/swiftv2/longRunningCluster/datapath_scale_test.go diff --git a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml index 7236fc8776..014b634459 100644 --- a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml +++ b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml @@ -363,7 +363,80 @@ stages: ginkgo -v -trace --timeout=30m --tags=private_endpoint_test # ------------------------------------------------------------ - # Job 4: Delete Test Resources + # Job 4: Scale Tests with Device Plugin + # ------------------------------------------------------------ + - job: ScaleTest + displayName: "Scale Test - Create and Delete 15 Pods with Device Plugin" + dependsOn: + - CreateTestResources + - ConnectivityTests + - PrivateEndpointTests + condition: succeeded() + timeoutInMinutes: 90 + pool: + vmImage: ubuntu-latest + steps: + - checkout: self + + - task: GoTool@0 + displayName: "Use Go 1.22.5" + inputs: + version: "1.22.5" + + - task: AzureCLI@2 + displayName: "Run Scale Test (Create and Delete)" + inputs: + azureSubscription: ${{ parameters.serviceConnection }} + scriptType: bash + scriptLocation: inlineScript + inlineScript: | + echo "==> Installing Ginkgo CLI" + go install github.com/onsi/ginkgo/v2/ginkgo@latest + + echo "==> Adding Go bin to PATH" + export PATH=$PATH:$(go env GOPATH)/bin + + echo "==> Downloading Go dependencies" + go mod download + + echo "==> Setting up kubeconfig for cluster aks-1" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-1 \ + --file /tmp/aks-1.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Setting up kubeconfig for cluster aks-2" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-2 \ + --file /tmp/aks-2.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Verifying cluster aks-1 connectivity" + kubectl --kubeconfig /tmp/aks-1.kubeconfig get nodes + + echo "==> Verifying cluster aks-2 connectivity" + kubectl --kubeconfig /tmp/aks-2.kubeconfig get nodes + + echo "==> Running scale test: Create 15 pods with device plugin across both clusters" + echo "NOTE: Pods are auto-scheduled by Kubernetes scheduler and device plugin" + echo " - 8 pods in aks-1 (cx_vnet_a1/s1)" + echo " - 7 pods in aks-2 (cx_vnet_a2/s1)" + echo "Pod limits per PodNetwork/PodNetworkInstance:" + echo " - Subnet IP address capacity" + echo " - Node capacity (typically 250 pods per node)" + echo " - Available device plugin resources (NICs per node)" + export RG="$(rgName)" + export BUILD_ID="$(rgName)" + export WORKLOAD_TYPE="swiftv2-linux" + cd ./test/integration/swiftv2/longRunningCluster + ginkgo -v -trace --timeout=1h --tags=scale_test + + 
# ------------------------------------------------------------ + # Job 5: Delete Test Resources # ------------------------------------------------------------ - job: DeleteTestResources displayName: "Delete PodNetwork, PNI, and Pods" @@ -371,6 +444,7 @@ stages: - CreateTestResources - ConnectivityTests - PrivateEndpointTests + - ScaleTest # Always run cleanup, even if previous jobs failed condition: always() timeoutInMinutes: 60 @@ -422,4 +496,5 @@ stages: export WORKLOAD_TYPE="swiftv2-linux" cd ./test/integration/swiftv2/longRunningCluster ginkgo -v -trace --timeout=1h --tags=delete_test + \ No newline at end of file diff --git a/test/integration/manifests/swiftv2/long-running-cluster/pod-with-device-plugin.yaml b/test/integration/manifests/swiftv2/long-running-cluster/pod-with-device-plugin.yaml new file mode 100644 index 0000000000..47031cd4b4 --- /dev/null +++ b/test/integration/manifests/swiftv2/long-running-cluster/pod-with-device-plugin.yaml @@ -0,0 +1,74 @@ +apiVersion: v1 +kind: Pod +metadata: + name: {{ .PodName }} + namespace: {{ .Namespace }} + labels: + kubernetes.azure.com/pod-network-instance: {{ .PNIName }} + kubernetes.azure.com/pod-network: {{ .PNName }} +spec: + containers: + - name: net-debugger + image: {{ .Image }} + command: ["/bin/bash", "-c"] + args: + - | + echo "Pod Network Diagnostics started on $(hostname)"; + echo "Node: $(hostname)"; + echo "Pod IP: $(hostname -i)"; + echo "Starting HTTP server on port 8080"; + + # Create a simple HTTP server directory + mkdir -p /tmp/www + cat > /tmp/www/index.html <<'EOF' + + + Network Test Pod - Auto Scheduled + +

+          <h1>Pod Network Test with Device Plugin (Auto-Scheduled)</h1>
+          <p>Hostname: $(hostname)</p>
+          <p>IP Address: $(hostname -i)</p>
+          <p>Timestamp: $(date)</p>
+ + + EOF + + # Start Python HTTP server on port 8080 in background + cd /tmp/www && python3 -m http.server 8080 & + HTTP_PID=$! + echo "HTTP server started with PID $HTTP_PID on port 8080" + + # Give server a moment to start + sleep 2 + + # Verify server is running + if netstat -tuln | grep -q ':8080'; then + echo "HTTP server is listening on port 8080" + else + echo "WARNING: HTTP server may not be listening on port 8080" + fi + + # Keep showing network info periodically + while true; do + echo "=== Network Status at $(date) ===" + ip addr show + ip route show + echo "=== Listening ports ===" + netstat -tuln | grep LISTEN || ss -tuln | grep LISTEN + sleep 300 # Every 5 minutes + done + ports: + - containerPort: 8080 + protocol: TCP + resources: + limits: + cpu: 300m + memory: 600Mi + acn.azure.com/vnet-nic: "1" + requests: + cpu: 300m + memory: 600Mi + acn.azure.com/vnet-nic: "1" + securityContext: + privileged: true + restartPolicy: Always diff --git a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go new file mode 100644 index 0000000000..d122843b40 --- /dev/null +++ b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go @@ -0,0 +1,221 @@ +//go:build scale_test +// +build scale_test + +package longRunningCluster + +import ( + "fmt" + "os" + "strings" + "sync" + "testing" + "time" + + "github.com/Azure/azure-container-networking/test/integration/swiftv2/helpers" + "github.com/onsi/ginkgo/v2" + "github.com/onsi/gomega" +) + +func TestDatapathScale(t *testing.T) { + gomega.RegisterFailHandler(ginkgo.Fail) + suiteConfig, reporterConfig := ginkgo.GinkgoConfiguration() + suiteConfig.Timeout = 0 + ginkgo.RunSpecs(t, "Datapath Scale Suite", suiteConfig, reporterConfig) +} + +var _ = ginkgo.Describe("Datapath Scale Tests", func() { + rg := os.Getenv("RG") + buildId := os.Getenv("BUILD_ID") + + if rg == "" || buildId == "" { + ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) + } + + ginkgo.It("creates and deletes 15 pods in a burst using device plugin", ginkgo.NodeTimeout(0), func() { + // NOTE: Maximum pods per PodNetwork/PodNetworkInstance is limited by: + // 1. Subnet IP address capacity + // 2. Node capacity (typically 250 pods per node) + // 3. 
Available NICs on nodes (device plugin resources) + // For this test: Creating 15 pods across aks-1 and aks-2 + // Device plugin and Kubernetes scheduler automatically place pods on nodes with available NICs + + // Define scenarios for both clusters - 8 pods on aks-1, 7 pods on aks-2 + scenarios := []struct { + cluster string + vnetName string + subnet string + podCount int + }{ + {cluster: "aks-1", vnetName: "cx_vnet_a1", subnet: "s1", podCount: 8}, + {cluster: "aks-2", vnetName: "cx_vnet_a2", subnet: "s1", podCount: 7}, + } + + // Initialize test scenarios with cache + testScenarios := TestScenarios{ + ResourceGroup: rg, + BuildID: buildId, + VnetSubnetCache: make(map[string]VnetSubnetInfo), + UsedNodes: make(map[string]bool), + PodImage: "mcr.microsoft.com/mirror/docker/library/busybox:1.36", + } + + startTime := time.Now() + var allResources []TestResources + + // Create PodNetwork and PodNetworkInstance for each scenario + for _, scenario := range scenarios { + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.cluster) + + // Get network info + ginkgo.By(fmt.Sprintf("Getting network info for %s/%s in cluster %s", scenario.vnetName, scenario.subnet, scenario.cluster)) + netInfo, err := GetOrFetchVnetSubnetInfo(testScenarios.ResourceGroup, scenario.vnetName, scenario.subnet, testScenarios.VnetSubnetCache) + gomega.Expect(err).To(gomega.BeNil(), fmt.Sprintf("Failed to get network info for %s/%s", scenario.vnetName, scenario.subnet)) + + // Create unique names + vnetShort := strings.TrimPrefix(scenario.vnetName, "cx_vnet_") + vnetShort = strings.ReplaceAll(vnetShort, "_", "-") + subnetNameSafe := strings.ReplaceAll(scenario.subnet, "_", "-") + pnName := fmt.Sprintf("pn-scale-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + pniName := fmt.Sprintf("pni-scale-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + + resources := TestResources{ + Kubeconfig: kubeconfig, + PNName: pnName, + PNIName: pniName, + VnetGUID: netInfo.VnetGUID, + SubnetGUID: netInfo.SubnetGUID, + SubnetARMID: netInfo.SubnetARMID, + SubnetToken: netInfo.SubnetToken, + PodNetworkTemplate: "../../manifests/swiftv2/long-running-cluster/podnetwork.yaml", + PNITemplate: "../../manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml", + PodTemplate: "../../manifests/swiftv2/long-running-cluster/pod-with-device-plugin.yaml", + PodImage: testScenarios.PodImage, + } + + // Step 1: Create PodNetwork + ginkgo.By(fmt.Sprintf("Creating PodNetwork: %s in cluster %s", pnName, scenario.cluster)) + err = CreatePodNetworkResource(resources) + gomega.Expect(err).To(gomega.BeNil(), "Failed to create PodNetwork") + + // Step 2: Create namespace + ginkgo.By(fmt.Sprintf("Creating namespace: %s in cluster %s", pnName, scenario.cluster)) + err = CreateNamespaceResource(resources.Kubeconfig, resources.PNName) + gomega.Expect(err).To(gomega.BeNil(), "Failed to create namespace") + + // Step 3: Create PodNetworkInstance + ginkgo.By(fmt.Sprintf("Creating PodNetworkInstance: %s in cluster %s", pniName, scenario.cluster)) + err = CreatePodNetworkInstanceResource(resources) + gomega.Expect(err).To(gomega.BeNil(), "Failed to create PodNetworkInstance") + + allResources = append(allResources, resources) + } + + // Step 4: Create pods in burst across both clusters - let scheduler place them automatically + totalPods := 0 + for _, s := range scenarios { + totalPods += s.podCount + } + ginkgo.By(fmt.Sprintf("Creating %d pods in burst (auto-scheduled by device plugin)", totalPods)) + + var wg sync.WaitGroup + errors := 
make(chan error, totalPods) + podIndex := 0 + + for i, scenario := range scenarios { + for j := 0; j < scenario.podCount; j++ { + wg.Add(1) + go func(resources TestResources, cluster string, idx int) { + defer wg.Done() + defer ginkgo.GinkgoRecover() + + podName := fmt.Sprintf("scale-pod-%d", idx) + ginkgo.By(fmt.Sprintf("Creating pod %s in cluster %s (auto-scheduled)", podName, cluster)) + + // Create pod without specifying node - let device plugin and scheduler decide + err := CreatePod(resources.Kubeconfig, PodData{ + PodName: podName, + NodeName: "", // No node specified - auto-schedule + OS: "linux", + PNName: resources.PNName, + PNIName: resources.PNIName, + Namespace: resources.PNName, + Image: resources.PodImage, + }, resources.PodTemplate) + if err != nil { + errors <- fmt.Errorf("failed to create pod %s in cluster %s: %w", podName, cluster, err) + } + }(allResources[i], scenario.cluster, podIndex) + podIndex++ + } + } + + wg.Wait() + close(errors) + + elapsedTime := time.Since(startTime) + + // Check for any errors + var errList []error + for err := range errors { + errList = append(errList, err) + } + gomega.Expect(errList).To(gomega.BeEmpty(), "Some pods failed to create") + + ginkgo.By(fmt.Sprintf("Successfully created %d pods in %s", totalPods, elapsedTime)) + + // Wait for pods to stabilize + ginkgo.By("Waiting 30 seconds for pods to stabilize") + time.Sleep(30 * time.Second) + + // Verify all pods are running + ginkgo.By("Verifying all pods are in Running state") + podIndex = 0 + for i, scenario := range scenarios { + for j := 0; j < scenario.podCount; j++ { + podName := fmt.Sprintf("scale-pod-%d", podIndex) + err := helpers.WaitForPodRunning(allResources[i].Kubeconfig, allResources[i].PNName, podName, 5, 10) + gomega.Expect(err).To(gomega.BeNil(), fmt.Sprintf("Pod %s did not reach running state in cluster %s", podName, scenario.cluster)) + podIndex++ + } + } + + ginkgo.By(fmt.Sprintf("All %d pods are running successfully across both clusters", totalPods)) + + // Cleanup: Delete all scale test resources + ginkgo.By("Cleaning up scale test resources") + podIndex = 0 + for i, scenario := range scenarios { + resources := allResources[i] + kubeconfig := resources.Kubeconfig + + for j := 0; j < scenario.podCount; j++ { + podName := fmt.Sprintf("scale-pod-%d", podIndex) + ginkgo.By(fmt.Sprintf("Deleting pod: %s from cluster %s", podName, scenario.cluster)) + err := DeletePod(kubeconfig, resources.PNName, podName) + if err != nil { + fmt.Printf("Warning: Failed to delete pod %s: %v\n", podName, err) + } + podIndex++ + } + + // Delete namespace (this will also delete PNI) + ginkgo.By(fmt.Sprintf("Deleting namespace: %s from cluster %s", resources.PNName, scenario.cluster)) + err = DeleteNamespace(kubeconfig, resources.PNName) + gomega.Expect(err).To(gomega.BeNil(), "Failed to delete namespace") + + // Delete PodNetworkInstance + ginkgo.By(fmt.Sprintf("Deleting PodNetworkInstance: %s from cluster %s", resources.PNIName, scenario.cluster)) + err = DeletePodNetworkInstance(kubeconfig, resources.PNName, resources.PNIName) + if err != nil { + fmt.Printf("Warning: Failed to delete PNI %s: %v\n", resources.PNIName, err) + } + + // Delete PodNetwork + ginkgo.By(fmt.Sprintf("Deleting PodNetwork: %s from cluster %s", resources.PNName, scenario.cluster)) + err = DeletePodNetwork(kubeconfig, resources.PNName) + gomega.Expect(err).To(gomega.BeNil(), "Failed to delete PodNetwork") + } + + ginkgo.By("Scale test cleanup completed") + }) +}) From e9f50e61fe78520b5472b80412286d5bd34833ef Mon Sep 17 
00:00:00 2001 From: sivakami-projects <126191544+sivakami-projects@users.noreply.github.com> Date: Fri, 5 Dec 2025 10:43:05 -0800 Subject: [PATCH 06/64] Update go.mod Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: sivakami-projects <126191544+sivakami-projects@users.noreply.github.com> --- go.mod | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/go.mod b/go.mod index 8096f632b3..3bafced2fa 100644 --- a/go.mod +++ b/go.mod @@ -1,8 +1,8 @@ module github.com/Azure/azure-container-networking -go 1.24.0 +go 1.22.5 -toolchain go1.24.10 +toolchain go1.22.5 require ( github.com/Azure/azure-sdk-for-go/sdk/azcore v1.19.1 From 8364bf57bba5f2492df4fc055c44e2c11a7a9ec1 Mon Sep 17 00:00:00 2001 From: sivakami-projects <126191544+sivakami-projects@users.noreply.github.com> Date: Fri, 5 Dec 2025 10:44:03 -0800 Subject: [PATCH 07/64] Update test/integration/swiftv2/longRunningCluster/datapath_create_test.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: sivakami-projects <126191544+sivakami-projects@users.noreply.github.com> --- .../longRunningCluster/datapath_create_test.go | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_create_test.go b/test/integration/swiftv2/longRunningCluster/datapath_create_test.go index 9ba860e022..a818192a37 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_create_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_create_test.go @@ -20,14 +20,13 @@ func TestDatapathCreate(t *testing.T) { } var _ = ginkgo.Describe("Datapath Create Tests", func() { - rg := os.Getenv("RG") - buildId := os.Getenv("BUILD_ID") - - if rg == "" || buildId == "" { - ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) - } ginkgo.It("creates PodNetwork, PodNetworkInstance, and Pods", ginkgo.NodeTimeout(0), func() { + rg := os.Getenv("RG") + buildId := os.Getenv("BUILD_ID") + if rg == "" || buildId == "" { + ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) + } // Define all test scenarios scenarios := []PodScenario{ // Customer 2 scenarios on aks-2 with cx_vnet_b1 From 1d2ed59ca3002a610217f3ee2dc8a4593b638e57 Mon Sep 17 00:00:00 2001 From: sivakami-projects <126191544+sivakami-projects@users.noreply.github.com> Date: Fri, 5 Dec 2025 10:44:30 -0800 Subject: [PATCH 08/64] Update test/integration/swiftv2/longRunningCluster/datapath_delete_test.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: sivakami-projects <126191544+sivakami-projects@users.noreply.github.com> --- .../longRunningCluster/datapath_delete_test.go | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go index 72020d609b..650518a3a7 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go @@ -19,14 +19,12 @@ func TestDatapathDelete(t *testing.T) { } var _ = ginkgo.Describe("Datapath Delete Tests", func() { - rg := os.Getenv("RG") - buildId := os.Getenv("BUILD_ID") - - if rg == "" || buildId == "" { - ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) - } - ginkgo.It("deletes PodNetwork, 
PodNetworkInstance, and Pods", ginkgo.NodeTimeout(0), func() { + rg := os.Getenv("RG") + buildId := os.Getenv("BUILD_ID") + if rg == "" || buildId == "" { + ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) + } // Define all test scenarios (same as create) scenarios := []PodScenario{ // Customer 2 scenarios on aks-2 with cx_vnet_b1 From efbfb029c6e3c337c3ba8fe4d0a77d2caf2fc87e Mon Sep 17 00:00:00 2001 From: sivakami-projects <126191544+sivakami-projects@users.noreply.github.com> Date: Fri, 5 Dec 2025 10:44:41 -0800 Subject: [PATCH 09/64] Update test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: sivakami-projects <126191544+sivakami-projects@users.noreply.github.com> --- .../longRunningCluster/datapath_connectivity_test.go | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go b/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go index 2852992581..e37a5368d2 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go @@ -21,14 +21,13 @@ func TestDatapathConnectivity(t *testing.T) { } var _ = ginkgo.Describe("Datapath Connectivity Tests", func() { - rg := os.Getenv("RG") - buildId := os.Getenv("BUILD_ID") - - if rg == "" || buildId == "" { - ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) - } ginkgo.It("tests HTTP connectivity between pods", ginkgo.NodeTimeout(0), func() { + rg := os.Getenv("RG") + buildId := os.Getenv("BUILD_ID") + if rg == "" || buildId == "" { + ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) + } // Helper function to generate namespace from vnet and subnet // Format: pn--- // Example: pn-sv2-long-run-centraluseuap-a1-s1 From 04a22a03170391ea05c1ba7c3ff6e5166da3524b Mon Sep 17 00:00:00 2001 From: sivakami-projects <126191544+sivakami-projects@users.noreply.github.com> Date: Fri, 5 Dec 2025 10:46:33 -0800 Subject: [PATCH 10/64] Update test/integration/swiftv2/longRunningCluster/datapath_delete_test.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: sivakami-projects <126191544+sivakami-projects@users.noreply.github.com> --- .../swiftv2/longRunningCluster/datapath_delete_test.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go index 650518a3a7..0a4b56c95e 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go @@ -1,4 +1,4 @@ -// +build delete_test +//go:build delete_test package longRunningCluster From 56fbeb208a3a6db8690fe2d16183af3a9ad58cfd Mon Sep 17 00:00:00 2001 From: sivakami Date: Sun, 7 Dec 2025 18:15:01 -0800 Subject: [PATCH 11/64] Error handling for private endpoint tests. 
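The diff below reworks SAS generation to try the storage account key first and only fall back to a user-delegation SAS, which depends on RBAC having propagated. A rough bash sketch of the two `az` paths the Go helper shells out to; the account name is a placeholder, while the `test` container and `hello.txt` blob mirror the test fixture:

```bash
STORAGE_ACCOUNT="<storage-account-name>"     # placeholder
EXPIRY=$(date -u -d "+7 days" '+%Y-%m-%d')   # the CLI caps user-delegation SAS lifetime at 7 days

# Attempt 1: account-key SAS (no RBAC propagation delay); Attempt 2: user-delegation SAS.
SAS=$(az storage blob generate-sas --account-name "$STORAGE_ACCOUNT" \
      --container-name test --name hello.txt --permissions r \
      --expiry "$EXPIRY" --output tsv) ||
SAS=$(az storage blob generate-sas --account-name "$STORAGE_ACCOUNT" \
      --container-name test --name hello.txt --permissions r \
      --expiry "$EXPIRY" --auth-mode login --as-user --output tsv)
```
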
--- .../swiftv2/longRunningCluster/datapath.go | 91 ++++++++++++++++--- 1 file changed, 79 insertions(+), 12 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 4d138dca32..d0bab3c5a0 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -2,6 +2,7 @@ package longRunningCluster import ( "bytes" + "context" "fmt" "os" "os/exec" @@ -619,19 +620,33 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) // Calculate expiry time: 7 days from now (Azure CLI limit) expiryTime := time.Now().UTC().Add(7 * 24 * time.Hour).Format("2006-01-02") + // Try account key first (more reliable, no RBAC delay) cmd := exec.Command("az", "storage", "blob", "generate-sas", "--account-name", storageAccountName, "--container-name", containerName, "--name", blobName, "--permissions", "r", "--expiry", expiryTime, - "--auth-mode", "login", - "--as-user", "--output", "tsv") out, err := cmd.CombinedOutput() if err != nil { - return "", fmt.Errorf("failed to generate SAS token: %s\n%s", err, string(out)) + // If account key fails, fall back to user delegation (requires RBAC) + fmt.Printf("Account key SAS generation failed, trying user delegation: %s\n", string(out)) + cmd = exec.Command("az", "storage", "blob", "generate-sas", + "--account-name", storageAccountName, + "--container-name", containerName, + "--name", blobName, + "--permissions", "r", + "--expiry", expiryTime, + "--auth-mode", "login", + "--as-user", + "--output", "tsv") + + out, err = cmd.CombinedOutput() + if err != nil { + return "", fmt.Errorf("failed to generate SAS token (both account key and user delegation): %s\n%s", err, string(out)) + } } sasToken := strings.TrimSpace(string(out)) @@ -657,16 +672,29 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) fmt.Printf("Testing private endpoint access from %s to %s\n", test.SourcePodName, test.DestEndpoint) - // Step 1: Verify DNS resolution + // Step 1: Verify pod is running + fmt.Printf("==> Verifying pod %s is running\n", test.SourcePodName) + podStatusCmd := fmt.Sprintf("kubectl --kubeconfig %s get pod %s -n %s -o jsonpath='{.status.phase}'", kubeconfig, test.SourcePodName, test.SourceNS) + statusOut, err := exec.Command("sh", "-c", podStatusCmd).CombinedOutput() + if err != nil { + return fmt.Errorf("failed to get pod status: %w\nOutput: %s", err, string(statusOut)) + } + podStatus := strings.TrimSpace(string(statusOut)) + if podStatus != "Running" { + return fmt.Errorf("pod %s is not running (status: %s)", test.SourcePodName, podStatus) + } + fmt.Printf("Pod is running\n") + + // Step 2: Verify DNS resolution with longer timeout fmt.Printf("==> Checking DNS resolution for %s\n", test.DestEndpoint) resolveCmd := fmt.Sprintf("nslookup %s | tail -2", test.DestEndpoint) - resolveOutput, resolveErr := helpers.ExecInPod(kubeconfig, test.SourceNS, test.SourcePodName, resolveCmd) + resolveOutput, resolveErr := ExecInPodWithTimeout(kubeconfig, test.SourceNS, test.SourcePodName, resolveCmd, 20*time.Second) if resolveErr != nil { return fmt.Errorf("DNS resolution failed: %w\nOutput: %s", resolveErr, resolveOutput) } fmt.Printf("DNS Resolution Result:\n%s\n", resolveOutput) - // Step 2: Generate SAS token for test blob + // Step 3: Generate SAS token for test blob fmt.Printf("==> Generating SAS token for test blob\n") // Extract storage account name from FQDN (e.g., 
sa106936191.blob.core.windows.net -> sa106936191) storageAccountName := strings.Split(test.DestEndpoint, ".")[0] @@ -675,16 +703,55 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) return fmt.Errorf("failed to generate SAS token: %w", err) } - // Step 3: Download test blob using SAS token (proves both connectivity AND data plane access) + // Step 4: Download test blob using SAS token with verbose output fmt.Printf("==> Downloading test blob via private endpoint\n") blobURL := fmt.Sprintf("https://%s/test/hello.txt?%s", test.DestEndpoint, sasToken) - curlCmd := fmt.Sprintf("curl -f -s --connect-timeout 5 --max-time 10 '%s'", blobURL) + // Use -v for verbose, capture stderr with 2>&1 to see HTTP response codes + curlCmd := fmt.Sprintf("curl -v -f -s --connect-timeout 10 --max-time 30 '%s' 2>&1", blobURL) - output, err := helpers.ExecInPod(kubeconfig, test.SourceNS, test.SourcePodName, curlCmd) + output, err := ExecInPodWithTimeout(kubeconfig, test.SourceNS, test.SourcePodName, curlCmd, 45*time.Second) if err != nil { - return fmt.Errorf("private endpoint connectivity test failed: %w\nOutput: %s", err, output) + // Check if it's an HTTP error (exit code 22) + if strings.Contains(err.Error(), "exit status 22") { + // Extract HTTP status code from verbose output + httpStatus := "unknown" + if strings.Contains(output, "HTTP/") { + lines := strings.Split(output, "\n") + for _, line := range lines { + if strings.Contains(line, "HTTP/") && (strings.Contains(line, " 4") || strings.Contains(line, " 5")) { + httpStatus = line + break + } + } + } + return fmt.Errorf("HTTP error from private endpoint (exit code 22): %s\nOutput: %s", httpStatus, truncateString(output, 500)) + } + return fmt.Errorf("private endpoint connectivity test failed: %w\nOutput: %s", err, truncateString(output, 500)) } - fmt.Printf("Private endpoint access successful! Blob content: %s\n", truncateString(output, 100)) - return nil + // Verify we got valid content + if strings.Contains(output, "Hello") || strings.Contains(output, "200 OK") { + fmt.Printf("Private endpoint access successful!\n") + return nil + } + + return fmt.Errorf("unexpected response from blob download (no 'Hello' or '200 OK' found)\nOutput: %s", truncateString(output, 500)) +} + +// ExecInPodWithTimeout executes a command in a pod with a custom timeout +func ExecInPodWithTimeout(kubeconfig, namespace, podName, command string, timeout time.Duration) (string, error) { + ctx, cancel := context.WithTimeout(context.Background(), timeout) + defer cancel() + + cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "exec", podName, + "-n", namespace, "--", "sh", "-c", command) + out, err := cmd.CombinedOutput() + if err != nil { + if ctx.Err() == context.DeadlineExceeded { + return string(out), fmt.Errorf("command timed out after %v in pod %s: %w", timeout, podName, ctx.Err()) + } + return string(out), fmt.Errorf("failed to exec in pod %s in namespace %s: %w", podName, namespace, err) + } + + return string(out), nil } From 0abf219decd12dce2af708eb45ba604f7f4b2bc0 Mon Sep 17 00:00:00 2001 From: sivakami Date: Sun, 7 Dec 2025 18:28:15 -0800 Subject: [PATCH 12/64] Fix scale tests. 
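PATCH 11 above also introduced `ExecInPodWithTimeout` so a hung `kubectl exec` cannot stall a spec indefinitely. The same guard can be reproduced from a shell with coreutils `timeout`; kubeconfig, namespace, pod, and command are placeholders:

```bash
# 45s matches the budget the Go helper uses for the blob-download step.
timeout 45 kubectl --kubeconfig "$KUBECONFIG" exec "$POD" -n "$NS" -- sh -c "$CMD" \
  || echo "exec failed or timed out after 45s"
```
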
--- .../swiftv2/longRunningCluster/datapath_scale_test.go | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go index d122843b40..71aace5628 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go @@ -191,7 +191,7 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { for j := 0; j < scenario.podCount; j++ { podName := fmt.Sprintf("scale-pod-%d", podIndex) ginkgo.By(fmt.Sprintf("Deleting pod: %s from cluster %s", podName, scenario.cluster)) - err := DeletePod(kubeconfig, resources.PNName, podName) + err := helpers.DeletePod(kubeconfig, resources.PNName, podName) if err != nil { fmt.Printf("Warning: Failed to delete pod %s: %v\n", podName, err) } @@ -200,19 +200,19 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { // Delete namespace (this will also delete PNI) ginkgo.By(fmt.Sprintf("Deleting namespace: %s from cluster %s", resources.PNName, scenario.cluster)) - err = DeleteNamespace(kubeconfig, resources.PNName) + err := helpers.DeleteNamespace(kubeconfig, resources.PNName) gomega.Expect(err).To(gomega.BeNil(), "Failed to delete namespace") // Delete PodNetworkInstance ginkgo.By(fmt.Sprintf("Deleting PodNetworkInstance: %s from cluster %s", resources.PNIName, scenario.cluster)) - err = DeletePodNetworkInstance(kubeconfig, resources.PNName, resources.PNIName) + err = helpers.DeletePodNetworkInstance(kubeconfig, resources.PNName, resources.PNIName) if err != nil { fmt.Printf("Warning: Failed to delete PNI %s: %v\n", resources.PNIName, err) } // Delete PodNetwork ginkgo.By(fmt.Sprintf("Deleting PodNetwork: %s from cluster %s", resources.PNName, scenario.cluster)) - err = DeletePodNetwork(kubeconfig, resources.PNName) + err = helpers.DeletePodNetwork(kubeconfig, resources.PNName) gomega.Expect(err).To(gomega.BeNil(), "Failed to delete PodNetwork") } From ad85ecc7f030cbdaa2008b7c258c4467a1528ec7 Mon Sep 17 00:00:00 2001 From: sivakami Date: Sun, 7 Dec 2025 19:59:09 -0800 Subject: [PATCH 13/64] private endpoint tests. 
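The patch that follows tightens SAS validation and swaps the in-pod download from curl to wget. End to end, the check it performs inside the source pod amounts to the following; the storage account and SAS token are placeholders:

```bash
# DNS must resolve the account FQDN to the private endpoint IP, then the blob must be readable with the SAS.
ACCOUNT_FQDN="${STORAGE_ACCOUNT}.blob.core.windows.net"
nslookup "$ACCOUNT_FQDN" | tail -2
BLOB_URL="https://${ACCOUNT_FQDN}/test/hello.txt?${SAS_TOKEN}"
wget -O- --timeout=30 --tries=1 "$BLOB_URL"   # the test treats the blob body ("Hello ...") as success
```
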
--- .../swiftv2/longRunningCluster/datapath.go | 56 ++++++++++++------- 1 file changed, 36 insertions(+), 20 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index d0bab3c5a0..0be0b98394 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -654,6 +654,14 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) return "", fmt.Errorf("generated SAS token is empty") } + // Remove any surrounding quotes that might be added by some shells + sasToken = strings.Trim(sasToken, "\"'") + + // Validate SAS token format - should start with typical SAS parameters + if !strings.Contains(sasToken, "sv=") && !strings.Contains(sasToken, "sig=") { + return "", fmt.Errorf("generated SAS token appears invalid (missing sv= or sig=): %s", sasToken) + } + return sasToken, nil } @@ -703,34 +711,42 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) return fmt.Errorf("failed to generate SAS token: %w", err) } + // Debug: Print SAS token info + fmt.Printf("SAS token length: %d\n", len(sasToken)) + if len(sasToken) > 60 { + fmt.Printf("SAS token preview: %s...\n", sasToken[:60]) + } else { + fmt.Printf("SAS token: %s\n", sasToken) + } + // Step 4: Download test blob using SAS token with verbose output fmt.Printf("==> Downloading test blob via private endpoint\n") + // Construct URL - ensure SAS token is properly formatted + // Note: SAS token should already be URL-encoded from Azure CLI blobURL := fmt.Sprintf("https://%s/test/hello.txt?%s", test.DestEndpoint, sasToken) - // Use -v for verbose, capture stderr with 2>&1 to see HTTP response codes - curlCmd := fmt.Sprintf("curl -v -f -s --connect-timeout 10 --max-time 30 '%s' 2>&1", blobURL) - - output, err := ExecInPodWithTimeout(kubeconfig, test.SourceNS, test.SourcePodName, curlCmd, 45*time.Second) - if err != nil { - // Check if it's an HTTP error (exit code 22) - if strings.Contains(err.Error(), "exit status 22") { - // Extract HTTP status code from verbose output - httpStatus := "unknown" - if strings.Contains(output, "HTTP/") { - lines := strings.Split(output, "\n") - for _, line := range lines { - if strings.Contains(line, "HTTP/") && (strings.Contains(line, " 4") || strings.Contains(line, " 5")) { - httpStatus = line - break - } - } - } - return fmt.Errorf("HTTP error from private endpoint (exit code 22): %s\nOutput: %s", httpStatus, truncateString(output, 500)) + + // Use wget instead of curl - it handles special characters better + // -O- outputs to stdout, -q is quiet mode, --timeout sets timeout + wgetCmd := fmt.Sprintf("wget -O- --timeout=30 --tries=1 '%s' 2>&1", blobURL) + + // Use wget instead of curl - it handles special characters better + // -O- outputs to stdout, -q is quiet mode, --timeout sets timeout + wgetCmd := fmt.Sprintf("wget -O- --timeout=30 --tries=1 '%s' 2>&1", blobURL) + + output, err := ExecInPodWithTimeout(kubeconfig, test.SourceNS, test.SourcePodName, wgetCmd, 45*time.Second) + if err != nil { + // Check for HTTP errors in wget output + if strings.Contains(output, "ERROR 403") || strings.Contains(output, "ERROR 401") { + return fmt.Errorf("HTTP authentication error from private endpoint\nOutput: %s", truncateString(output, 500)) + } + if strings.Contains(output, "ERROR 404") { + return fmt.Errorf("blob not found (404) on private endpoint\nOutput: %s", truncateString(output, 500)) } return fmt.Errorf("private 
endpoint connectivity test failed: %w\nOutput: %s", err, truncateString(output, 500)) } // Verify we got valid content - if strings.Contains(output, "Hello") || strings.Contains(output, "200 OK") { + if strings.Contains(output, "Hello") || strings.Contains(output, "200 OK") || strings.Contains(output, "saved") { fmt.Printf("Private endpoint access successful!\n") return nil } From 82661882535f451634f58b94e34209396a6a2017 Mon Sep 17 00:00:00 2001 From: sivakami Date: Sun, 7 Dec 2025 20:06:01 -0800 Subject: [PATCH 14/64] remove duplicate statements. --- test/integration/swiftv2/longRunningCluster/datapath.go | 4 ---- 1 file changed, 4 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 0be0b98394..339cd43c71 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -729,10 +729,6 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) // -O- outputs to stdout, -q is quiet mode, --timeout sets timeout wgetCmd := fmt.Sprintf("wget -O- --timeout=30 --tries=1 '%s' 2>&1", blobURL) - // Use wget instead of curl - it handles special characters better - // -O- outputs to stdout, -q is quiet mode, --timeout sets timeout - wgetCmd := fmt.Sprintf("wget -O- --timeout=30 --tries=1 '%s' 2>&1", blobURL) - output, err := ExecInPodWithTimeout(kubeconfig, test.SourceNS, test.SourcePodName, wgetCmd, 45*time.Second) if err != nil { // Check for HTTP errors in wget output From b50005d21f5a21fcd6fdfbeeb984dcd2d9504035 Mon Sep 17 00:00:00 2001 From: sivakami Date: Sun, 7 Dec 2025 20:31:28 -0800 Subject: [PATCH 15/64] Private endpoint tests. --- .../swiftv2/longRunningCluster/datapath.go | 24 ++++++++++++++----- 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 339cd43c71..a5a606f5e1 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -630,9 +630,20 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) "--output", "tsv") out, err := cmd.CombinedOutput() - if err != nil { + sasToken := strings.TrimSpace(string(out)) + + // Check if account key method produced valid token + accountKeyWorked := err == nil && !strings.Contains(sasToken, "WARNING") && + !strings.Contains(sasToken, "ERROR") && (strings.Contains(sasToken, "sv=") || strings.Contains(sasToken, "sig=")) + + if !accountKeyWorked { // If account key fails, fall back to user delegation (requires RBAC) - fmt.Printf("Account key SAS generation failed, trying user delegation: %s\n", string(out)) + if err != nil { + fmt.Printf("Account key SAS generation failed (error): %s\n", string(out)) + } else { + fmt.Printf("Account key SAS generation failed (no credentials): %s\n", sasToken) + } + cmd = exec.Command("az", "storage", "blob", "generate-sas", "--account-name", storageAccountName, "--container-name", containerName, @@ -642,21 +653,22 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) "--auth-mode", "login", "--as-user", "--output", "tsv") - + out, err = cmd.CombinedOutput() if err != nil { return "", fmt.Errorf("failed to generate SAS token (both account key and user delegation): %s\n%s", err, string(out)) } + + sasToken = strings.TrimSpace(string(out)) } - sasToken := 
strings.TrimSpace(string(out)) if sasToken == "" { return "", fmt.Errorf("generated SAS token is empty") } // Remove any surrounding quotes that might be added by some shells sasToken = strings.Trim(sasToken, "\"'") - + // Validate SAS token format - should start with typical SAS parameters if !strings.Contains(sasToken, "sv=") && !strings.Contains(sasToken, "sig=") { return "", fmt.Errorf("generated SAS token appears invalid (missing sv= or sig=): %s", sasToken) @@ -724,7 +736,7 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) // Construct URL - ensure SAS token is properly formatted // Note: SAS token should already be URL-encoded from Azure CLI blobURL := fmt.Sprintf("https://%s/test/hello.txt?%s", test.DestEndpoint, sasToken) - + // Use wget instead of curl - it handles special characters better // -O- outputs to stdout, -q is quiet mode, --timeout sets timeout wgetCmd := fmt.Sprintf("wget -O- --timeout=30 --tries=1 '%s' 2>&1", blobURL) From afa8280de51b6f139ee8f3d78c7b4d2d2d9d9e8f Mon Sep 17 00:00:00 2001 From: sivakami Date: Sun, 7 Dec 2025 23:01:43 -0800 Subject: [PATCH 16/64] wait for pods to be scheduled in scale tests. --- .../integration/swiftv2/helpers/az_helpers.go | 22 +++++++++++++++++++ .../longRunningCluster/datapath_scale_test.go | 8 +++++++ 2 files changed, 30 insertions(+) diff --git a/test/integration/swiftv2/helpers/az_helpers.go b/test/integration/swiftv2/helpers/az_helpers.go index c6e5d4b090..88e5f5d4bf 100644 --- a/test/integration/swiftv2/helpers/az_helpers.go +++ b/test/integration/swiftv2/helpers/az_helpers.go @@ -264,6 +264,28 @@ func DeleteNamespace(kubeconfig, namespace string) error { return nil } +// WaitForPodScheduled waits for a pod to be scheduled (assigned to a node) with retries +func WaitForPodScheduled(kubeconfig, namespace, podName string, maxRetries, sleepSeconds int) error { + for attempt := 1; attempt <= maxRetries; attempt++ { + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "pod", podName, "-n", namespace, "-o", "jsonpath={.spec.nodeName}") + out, err := cmd.CombinedOutput() + nodeName := strings.TrimSpace(string(out)) + + if err == nil && nodeName != "" { + fmt.Printf("Pod %s scheduled on node %s\n", podName, nodeName) + return nil + } + + if attempt < maxRetries { + fmt.Printf("Pod %s not scheduled yet (attempt %d/%d). 
Waiting %d seconds...\n", + podName, attempt, maxRetries, sleepSeconds) + time.Sleep(time.Duration(sleepSeconds) * time.Second) + } + } + + return fmt.Errorf("pod %s was not scheduled (no node assigned) after %d attempts", podName, maxRetries) +} + // WaitForPodRunning waits for a pod to reach Running state with retries func WaitForPodRunning(kubeconfig, namespace, podName string, maxRetries, sleepSeconds int) error { for attempt := 1; attempt <= maxRetries; attempt++ { diff --git a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go index 71aace5628..9fca653310 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go @@ -143,6 +143,14 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { }, resources.PodTemplate) if err != nil { errors <- fmt.Errorf("failed to create pod %s in cluster %s: %w", podName, cluster, err) + return + } + + // Wait for pod to be scheduled (node assignment) before considering it created + // This prevents CNS errors about missing node names + err = helpers.WaitForPodScheduled(resources.Kubeconfig, resources.PNName, podName, 10, 6) + if err != nil { + errors <- fmt.Errorf("pod %s in cluster %s was not scheduled: %w", podName, cluster, err) } }(allResources[i], scenario.cluster, podIndex) podIndex++ From fef77088afa7d1a063b1b5fd995dea40df937ac3 Mon Sep 17 00:00:00 2001 From: sivakami Date: Sun, 7 Dec 2025 23:24:43 -0800 Subject: [PATCH 17/64] update pod image. --- .../swiftv2/longRunningCluster/datapath_scale_test.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go index 9fca653310..e445e8eec9 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go @@ -56,7 +56,7 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { BuildID: buildId, VnetSubnetCache: make(map[string]VnetSubnetInfo), UsedNodes: make(map[string]bool), - PodImage: "mcr.microsoft.com/mirror/docker/library/busybox:1.36", + PodImage: "nicolaka/netshoot:latest", } startTime := time.Now() From 5abacaedd997d42dc7787d481dce59e310ec1acf Mon Sep 17 00:00:00 2001 From: sivakami Date: Sun, 7 Dec 2025 20:31:28 -0800 Subject: [PATCH 18/64] Private endpoint tests. 
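`WaitForPodScheduled`, added in PATCH 16 above, only polls `.spec.nodeName`; the equivalent shell loop with the same 10 x 6-second retry budget looks like this (kubeconfig, namespace, and pod name are placeholders):

```bash
# Wait for the scheduler to assign a node before treating the pod as created.
for attempt in $(seq 1 10); do
  NODE=$(kubectl --kubeconfig "$KUBECONFIG" get pod "$POD" -n "$NS" -o jsonpath='{.spec.nodeName}')
  if [ -n "$NODE" ]; then
    echo "pod $POD scheduled on $NODE"
    break
  fi
  echo "attempt $attempt/10: $POD not scheduled yet"
  sleep 6
done
```
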
--- .../swiftv2/longRunningCluster/datapath.go | 28 +++++++++++++++++++ 1 file changed, 28 insertions(+) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index a5a606f5e1..3c552fbe92 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -632,6 +632,13 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) out, err := cmd.CombinedOutput() sasToken := strings.TrimSpace(string(out)) + // Check if account key method produced valid token + accountKeyWorked := err == nil && !strings.Contains(sasToken, "WARNING") && + !strings.Contains(sasToken, "ERROR") && (strings.Contains(sasToken, "sv=") || strings.Contains(sasToken, "sig=")) + + if !accountKeyWorked { + sasToken := strings.TrimSpace(string(out)) + // Check if account key method produced valid token accountKeyWorked := err == nil && !strings.Contains(sasToken, "WARNING") && !strings.Contains(sasToken, "ERROR") && (strings.Contains(sasToken, "sv=") || strings.Contains(sasToken, "sig=")) @@ -644,6 +651,12 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) fmt.Printf("Account key SAS generation failed (no credentials): %s\n", sasToken) } + if err != nil { + fmt.Printf("Account key SAS generation failed (error): %s\n", string(out)) + } else { + fmt.Printf("Account key SAS generation failed (no credentials): %s\n", sasToken) + } + cmd = exec.Command("az", "storage", "blob", "generate-sas", "--account-name", storageAccountName, "--container-name", containerName, @@ -654,12 +667,15 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) "--as-user", "--output", "tsv") + out, err = cmd.CombinedOutput() if err != nil { return "", fmt.Errorf("failed to generate SAS token (both account key and user delegation): %s\n%s", err, string(out)) } sasToken = strings.TrimSpace(string(out)) + + sasToken = strings.TrimSpace(string(out)) } if sasToken == "" { @@ -674,6 +690,14 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) return "", fmt.Errorf("generated SAS token appears invalid (missing sv= or sig=): %s", sasToken) } + // Remove any surrounding quotes that might be added by some shells + sasToken = strings.Trim(sasToken, "\"'") + + // Validate SAS token format - should start with typical SAS parameters + if !strings.Contains(sasToken, "sv=") && !strings.Contains(sasToken, "sig=") { + return "", fmt.Errorf("generated SAS token appears invalid (missing sv= or sig=): %s", sasToken) + } + return sasToken, nil } @@ -741,6 +765,10 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) // -O- outputs to stdout, -q is quiet mode, --timeout sets timeout wgetCmd := fmt.Sprintf("wget -O- --timeout=30 --tries=1 '%s' 2>&1", blobURL) + // Use wget instead of curl - it handles special characters better + // -O- outputs to stdout, -q is quiet mode, --timeout sets timeout + wgetCmd := fmt.Sprintf("wget -O- --timeout=30 --tries=1 '%s' 2>&1", blobURL) + output, err := ExecInPodWithTimeout(kubeconfig, test.SourceNS, test.SourcePodName, wgetCmd, 45*time.Second) if err != nil { // Check for HTTP errors in wget output From df1dc33c3eb02308df8138e083471a9100cc07aa Mon Sep 17 00:00:00 2001 From: sivakami Date: Sun, 7 Dec 2025 23:53:32 -0800 Subject: [PATCH 19/64] update private endpoint test. 
--- .../swiftv2/longRunningCluster/datapath.go | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 3c552fbe92..ad37f125f1 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -765,16 +765,18 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) // -O- outputs to stdout, -q is quiet mode, --timeout sets timeout wgetCmd := fmt.Sprintf("wget -O- --timeout=30 --tries=1 '%s' 2>&1", blobURL) - // Use wget instead of curl - it handles special characters better - // -O- outputs to stdout, -q is quiet mode, --timeout sets timeout - wgetCmd := fmt.Sprintf("wget -O- --timeout=30 --tries=1 '%s' 2>&1", blobURL) - output, err := ExecInPodWithTimeout(kubeconfig, test.SourceNS, test.SourcePodName, wgetCmd, 45*time.Second) if err != nil { // Check for HTTP errors in wget output if strings.Contains(output, "ERROR 403") || strings.Contains(output, "ERROR 401") { return fmt.Errorf("HTTP authentication error from private endpoint\nOutput: %s", truncateString(output, 500)) } + if strings.Contains(output, "ERROR 404") { + return fmt.Errorf("blob not found (404) on private endpoint\nOutput: %s", truncateString(output, 500)) + // Check for HTTP errors in wget output + if strings.Contains(output, "ERROR 403") || strings.Contains(output, "ERROR 401") { + return fmt.Errorf("HTTP authentication error from private endpoint\nOutput: %s", truncateString(output, 500)) + } if strings.Contains(output, "ERROR 404") { return fmt.Errorf("blob not found (404) on private endpoint\nOutput: %s", truncateString(output, 500)) } @@ -782,6 +784,7 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) } // Verify we got valid content + if strings.Contains(output, "Hello") || strings.Contains(output, "200 OK") || strings.Contains(output, "saved") { if strings.Contains(output, "Hello") || strings.Contains(output, "200 OK") || strings.Contains(output, "saved") { fmt.Printf("Private endpoint access successful!\n") return nil From 03bd24b6261c50af2c085301b46d485824948313 Mon Sep 17 00:00:00 2001 From: sivakami Date: Sun, 7 Dec 2025 20:31:28 -0800 Subject: [PATCH 20/64] Private endpoint tests. 
--- .../swiftv2/longRunningCluster/datapath.go | 24 ------------------- 1 file changed, 24 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index ad37f125f1..972d5709b2 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -632,13 +632,6 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) out, err := cmd.CombinedOutput() sasToken := strings.TrimSpace(string(out)) - // Check if account key method produced valid token - accountKeyWorked := err == nil && !strings.Contains(sasToken, "WARNING") && - !strings.Contains(sasToken, "ERROR") && (strings.Contains(sasToken, "sv=") || strings.Contains(sasToken, "sig=")) - - if !accountKeyWorked { - sasToken := strings.TrimSpace(string(out)) - // Check if account key method produced valid token accountKeyWorked := err == nil && !strings.Contains(sasToken, "WARNING") && !strings.Contains(sasToken, "ERROR") && (strings.Contains(sasToken, "sv=") || strings.Contains(sasToken, "sig=")) @@ -651,12 +644,6 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) fmt.Printf("Account key SAS generation failed (no credentials): %s\n", sasToken) } - if err != nil { - fmt.Printf("Account key SAS generation failed (error): %s\n", string(out)) - } else { - fmt.Printf("Account key SAS generation failed (no credentials): %s\n", sasToken) - } - cmd = exec.Command("az", "storage", "blob", "generate-sas", "--account-name", storageAccountName, "--container-name", containerName, @@ -667,15 +654,12 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) "--as-user", "--output", "tsv") - out, err = cmd.CombinedOutput() if err != nil { return "", fmt.Errorf("failed to generate SAS token (both account key and user delegation): %s\n%s", err, string(out)) } sasToken = strings.TrimSpace(string(out)) - - sasToken = strings.TrimSpace(string(out)) } if sasToken == "" { @@ -690,14 +674,6 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) return "", fmt.Errorf("generated SAS token appears invalid (missing sv= or sig=): %s", sasToken) } - // Remove any surrounding quotes that might be added by some shells - sasToken = strings.Trim(sasToken, "\"'") - - // Validate SAS token format - should start with typical SAS parameters - if !strings.Contains(sasToken, "sv=") && !strings.Contains(sasToken, "sig=") { - return "", fmt.Errorf("generated SAS token appears invalid (missing sv= or sig=): %s", sasToken) - } - return sasToken, nil } From 35a68da97adf341c476c2c39a72c26559f612a08 Mon Sep 17 00:00:00 2001 From: sivakami Date: Sun, 7 Dec 2025 23:53:32 -0800 Subject: [PATCH 21/64] update private endpoint test. 
--- test/integration/swiftv2/longRunningCluster/datapath.go | 8 +------- 1 file changed, 1 insertion(+), 7 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 972d5709b2..a3870ee7c4 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -741,18 +741,13 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) // -O- outputs to stdout, -q is quiet mode, --timeout sets timeout wgetCmd := fmt.Sprintf("wget -O- --timeout=30 --tries=1 '%s' 2>&1", blobURL) + output, err := ExecInPodWithTimeout(kubeconfig, test.SourceNS, test.SourcePodName, wgetCmd, 45*time.Second) output, err := ExecInPodWithTimeout(kubeconfig, test.SourceNS, test.SourcePodName, wgetCmd, 45*time.Second) if err != nil { // Check for HTTP errors in wget output if strings.Contains(output, "ERROR 403") || strings.Contains(output, "ERROR 401") { return fmt.Errorf("HTTP authentication error from private endpoint\nOutput: %s", truncateString(output, 500)) } - if strings.Contains(output, "ERROR 404") { - return fmt.Errorf("blob not found (404) on private endpoint\nOutput: %s", truncateString(output, 500)) - // Check for HTTP errors in wget output - if strings.Contains(output, "ERROR 403") || strings.Contains(output, "ERROR 401") { - return fmt.Errorf("HTTP authentication error from private endpoint\nOutput: %s", truncateString(output, 500)) - } if strings.Contains(output, "ERROR 404") { return fmt.Errorf("blob not found (404) on private endpoint\nOutput: %s", truncateString(output, 500)) } @@ -760,7 +755,6 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) } // Verify we got valid content - if strings.Contains(output, "Hello") || strings.Contains(output, "200 OK") || strings.Contains(output, "saved") { if strings.Contains(output, "Hello") || strings.Contains(output, "200 OK") || strings.Contains(output, "saved") { fmt.Printf("Private endpoint access successful!\n") return nil From f74b1b2963de347f3ffee13372fabee88d8a1a17 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 09:55:30 -0800 Subject: [PATCH 22/64] update pod.yaml --- .../swiftv2/long-running-cluster/pod.yaml | 44 ++++--------------- .../swiftv2/longRunningCluster/datapath.go | 4 +- 2 files changed, 10 insertions(+), 38 deletions(-) diff --git a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml index ffb5293b18..ae3738a072 100644 --- a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml +++ b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml @@ -17,47 +17,19 @@ spec: - | echo "Pod Network Diagnostics started on $(hostname)"; echo "Pod IP: $(hostname -i)"; - echo "Starting HTTP server on port 8080"; + echo "Starting TCP listener on port 8080"; - # Create a simple HTTP server directory - mkdir -p /tmp/www - cat > /tmp/www/index.html <<'EOF' - - - Network Test Pod - -

-            Pod Network Test
-            Hostname: $(hostname)
-            IP Address: $(hostname -i)
-            Timestamp: $(date)
- - - EOF + while true; do + nc -l -p 8080 -c 'echo "TCP Connection Success from $(hostname) at $(date)"' + done & - # Start Python HTTP server on port 8080 in background - cd /tmp/www && python3 -m http.server 8080 & - HTTP_PID=$! - echo "HTTP server started with PID $HTTP_PID on port 8080" - - # Give server a moment to start + echo "TCP listener started on port 8080" sleep 2 - - # Verify server is running - if netstat -tuln | grep -q ':8080'; then - echo "HTTP server is listening on port 8080" + if netstat -tuln | grep -q ':8080'; then # Verify listener is running + echo "TCP listener is active on port 8080" else - echo "WARNING: HTTP server may not be listening on port 8080" + echo "WARNING: TCP listener may not be active on port 8080" fi - - # Keep showing network info periodically - while true; do - echo "=== Network Status at $(date) ===" - ip addr show - ip route show - echo "=== Listening ports ===" - netstat -tuln | grep LISTEN || ss -tuln | grep LISTEN - sleep 300 # Every 5 minutes - done ports: - containerPort: 8080 protocol: TCP diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index a3870ee7c4..b2cd28788f 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -594,9 +594,9 @@ func RunConnectivityTest(test ConnectivityTest, rg, buildId string) error { test.DestNamespace, test.DestinationPod, test.DestCluster, destIP) // Run curl command from source pod to destination pod using eth1 IP - // Using -m 10 for 10 second timeout, -f to fail on HTTP errors + // Using -m 10 for 10 second timeout // Using --interface eth1 to force traffic through delegated subnet interface - curlCmd := fmt.Sprintf("curl --interface eth1 -f -m 10 http://%s:8080/", destIP) + curlCmd := fmt.Sprintf("curl --interface eth1 -m 10 http://%s:8080/", destIP) output, err := helpers.ExecInPod(sourceKubeconfig, test.SourceNamespace, test.SourcePod, curlCmd) if err != nil { From e04da22b8034f71b3a0765006bca45a4b468fc38 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 10:05:36 -0800 Subject: [PATCH 23/64] Check if mtpnc is cleaned up after pods are deleted. 
--- .../integration/swiftv2/helpers/az_helpers.go | 34 +++++++++++++++++++ .../swiftv2/longRunningCluster/datapath.go | 23 +++++++++++++ 2 files changed, 57 insertions(+) diff --git a/test/integration/swiftv2/helpers/az_helpers.go b/test/integration/swiftv2/helpers/az_helpers.go index 88e5f5d4bf..ba68e9ad30 100644 --- a/test/integration/swiftv2/helpers/az_helpers.go +++ b/test/integration/swiftv2/helpers/az_helpers.go @@ -363,3 +363,37 @@ func ExecInPod(kubeconfig, namespace, podName, command string) (string, error) { return string(out), nil } + +// VerifyNoMTPNC checks if there are any pending MTPNC (MultiTenantPodNetworkConfig) resources +// associated with a specific build ID that should have been cleaned up +func VerifyNoMTPNC(kubeconfig, buildID string) error { + cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "mtpnc", "-A", "-o", "json") + out, err := cmd.CombinedOutput() + if err != nil { + // If MTPNC CRD doesn't exist, that's fine + if strings.Contains(string(out), "the server doesn't have a resource type") { + return nil + } + return fmt.Errorf("failed to get MTPNC resources: %w\nOutput: %s", err, string(out)) + } + + // Parse JSON to check for any MTPNC resources matching our build ID + output := string(out) + if strings.Contains(output, buildID) { + // Extract MTPNC names for better error reporting + lines := strings.Split(output, "\n") + var mtpncNames []string + for _, line := range lines { + if strings.Contains(line, buildID) && strings.Contains(line, "\"name\":") { + // Basic extraction - could be improved with proper JSON parsing + mtpncNames = append(mtpncNames, line) + } + } + + if len(mtpncNames) > 0 { + return fmt.Errorf("found %d MTPNC resources with build ID '%s' that should have been deleted. This may indicate stuck MTPNC deletion", len(mtpncNames), buildID) + } + } + + return nil +} diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index b2cd28788f..ca2a4cb2cb 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -514,6 +514,29 @@ func DeleteAllScenarios(testScenarios TestScenarios) error { } } + // Phase 3: Verify no MTPNC resources are stuck + fmt.Printf("\n=== Phase 3: Verifying MTPNC cleanup ===\n") + clustersChecked := make(map[string]bool) + + for _, scenario := range testScenarios.Scenarios { + // Check each cluster only once + if clustersChecked[scenario.Cluster] { + continue + } + clustersChecked[scenario.Cluster] = true + + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.Cluster) + fmt.Printf("Checking for pending MTPNC resources in cluster %s\n", scenario.Cluster) + + err := helpers.VerifyNoMTPNC(kubeconfig, testScenarios.BuildID) + if err != nil { + fmt.Printf("WARNING: Found pending MTPNC resources in cluster %s: %v\n", scenario.Cluster, err) + // Don't fail the test, just warn - MTPNC deletion might be in progress + } else { + fmt.Printf("✓ No pending MTPNC resources found in cluster %s\n", scenario.Cluster) + } + } + fmt.Printf("\n=== All scenarios deleted ===\n") return nil } From 595a3c5bb17ee7da780a85ed4253cbcf3e595222 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 10:38:14 -0800 Subject: [PATCH 24/64] Update vnet names. 
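Note: the comment inside `VerifyNoMTPNC` above flags that the line-based matching "could be improved with proper JSON parsing". A sketch of that improvement, decoding only the metadata fields needed from `kubectl get mtpnc -A -o json` (the helper name is hypothetical):

```go
package helpers

import (
	"encoding/json"
	"fmt"
	"strings"
)

// mtpncList models only the metadata fields needed from
// `kubectl get mtpnc -A -o json`.
type mtpncList struct {
	Items []struct {
		Metadata struct {
			Name      string `json:"name"`
			Namespace string `json:"namespace"`
		} `json:"metadata"`
	} `json:"items"`
}

// mtpncNamesForBuild (hypothetical) returns namespace/name pairs whose name
// contains the build ID, so leftovers can be reported precisely.
func mtpncNamesForBuild(kubectlJSON []byte, buildID string) ([]string, error) {
	var list mtpncList
	if err := json.Unmarshal(kubectlJSON, &list); err != nil {
		return nil, fmt.Errorf("parsing mtpnc list: %w", err)
	}
	var names []string
	for _, item := range list.Items {
		if strings.Contains(item.Metadata.Name, buildID) {
			names = append(names, item.Metadata.Namespace+"/"+item.Metadata.Name)
		}
	}
	return names, nil
}
```

With this, `VerifyNoMTPNC` could report exact namespace/name pairs instead of raw JSON lines.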
--- .pipelines/swiftv2-long-running/README.md | 100 +++++++++--------- .../scripts/create_nsg.sh | 2 +- .../swiftv2-long-running/scripts/create_pe.sh | 6 +- .../scripts/create_peerings.sh | 8 +- .../scripts/create_vnets.sh | 2 +- .../swiftv2/longRunningCluster/datapath.go | 2 +- .../datapath_connectivity_test.go | 90 ++++++++-------- .../datapath_create_test.go | 50 ++++----- .../datapath_delete_test.go | 50 ++++----- .../datapath_private_endpoint_test.go | 32 +++--- 10 files changed, 171 insertions(+), 171 deletions(-) diff --git a/.pipelines/swiftv2-long-running/README.md b/.pipelines/swiftv2-long-running/README.md index 5b47b43ce9..8892bc07a8 100644 --- a/.pipelines/swiftv2-long-running/README.md +++ b/.pipelines/swiftv2-long-running/README.md @@ -6,10 +6,10 @@ This pipeline tests SwiftV2 pod networking in a persistent environment with sche **Infrastructure (Persistent)**: - **2 AKS Clusters**: aks-1, aks-2 (4 nodes each: 2 low-NIC default pool, 2 high-NIC nplinux pool) -- **4 VNets**: cx_vnet_a1, cx_vnet_a2, cx_vnet_a3 (Customer 1 with PE to storage), cx_vnet_b1 (Customer 2) +- **4 VNets**: cx_vnet_v1, cx_vnet_v2, cx_vnet_v3 (Customer 1 with PE to storage), cx_vnet_v4 (Customer 2) - **VNet Peerings**: vnet mesh. -- **Storage Account**: With private endpoint from cx_vnet_a1 -- **NSGs**: Restricting traffic between subnets (s1, s2) in vnet cx_vnet_a1. +- **Storage Account**: With private endpoint from cx_vnet_v1 +- **NSGs**: Restricting traffic between subnets (s1, s2) in vnet cx_vnet_v1. - **Node Labels**: All nodes labeled with `workload-type` and `nic-capacity` for targeted test execution **Test Scenarios (8 total per workload type)**: @@ -177,14 +177,14 @@ Every 3 hour, the pipeline: | Test | Source → Destination | Expected Result | Purpose | |------|---------------------|-----------------|---------| -| SameVNetSameSubnet | pod-c1-aks1-a1s2-low → pod-c1-aks1-a1s2-high | ✓ Success | Basic connectivity in same subnet | -| NSGBlocked_S1toS2 | pod-c1-aks1-a1s1-low → pod-c1-aks1-a1s2-high | ✗ Blocked | NSG rule blocks s1→s2 in cx_vnet_a1 | -| NSGBlocked_S2toS1 | pod-c1-aks1-a1s2-low → pod-c1-aks1-a1s1-low | ✗ Blocked | NSG rule blocks s2→s1 (bidirectional) | -| DifferentVNetSameCustomer | pod-c1-aks1-a2s1-high → pod-c1-aks2-a2s1-low | ✓ Success | Cross-cluster, same customer VNet | -| PeeredVNets | pod-c1-aks1-a1s2-low → pod-c1-aks1-a2s1-high | ✓ Success | Peered VNets (a1 ↔ a2) | -| PeeredVNets_A2toA3 | pod-c1-aks1-a2s1-high → pod-c1-aks2-a3s1-high | ✓ Success | Peered VNets across clusters | -| DifferentCustomers_A1toB1 | pod-c1-aks1-a1s2-low → pod-c2-aks2-b1s1-low | ✗ Blocked | Customer isolation (C1 → C2) | -| DifferentCustomers_A2toB1 | pod-c1-aks1-a2s1-high → pod-c2-aks2-b1s1-high | ✗ Blocked | Customer isolation (C1 → C2) | +| SameVNetSameSubnet | pod-c1-aks1-v1s2-low → pod-c1-aks1-v1s2-high | ✓ Success | Basic connectivity in same subnet | +| NSGBlocked_S1toS2 | pod-c1-aks1-v1s1-low → pod-c1-aks1-v1s2-high | ✗ Blocked | NSG rule blocks s1→s2 in cx_vnet_v1 | +| NSGBlocked_S2toS1 | pod-c1-aks1-v1s2-low → pod-c1-aks1-v1s1-low | ✗ Blocked | NSG rule blocks s2→s1 (bidirectional) | +| DifferentVNetSameCustomer | pod-c1-aks1-v2s1-high → pod-c1-aks2-v2s1-low | ✓ Success | Cross-cluster, same customer VNet | +| PeeredVNets | pod-c1-aks1-v1s2-low → pod-c1-aks1-v2s1-high | ✓ Success | Peered VNets (v1 ↔ v2) | +| PeeredVNets_V2toV3 | pod-c1-aks1-v2s1-high → pod-c1-aks2-v3s1-high | ✓ Success | Peered VNets across clusters | +| DifferentCustomers_V1toV4 | pod-c1-aks1-v1s2-low → 
pod-c2-aks2-v4s1-low | ✗ Blocked | Customer isolation (C1 → C2) | +| DifferentCustomers_V2toV4 | pod-c1-aks1-v2s1-high → pod-c2-aks2-v4s1-high | ✗ Blocked | Customer isolation (C1 → C2) | **Test Results**: 4 should succeed, 5 should be blocked (3 NSG rules + 2 customer isolation) @@ -192,11 +192,11 @@ Every 3 hour, the pipeline: | Test | Source → Destination | Expected Result | Purpose | |------|---------------------|-----------------|---------| -| TenantA_VNetA1_S1_to_StorageA | pod-c1-aks1-a1s1-low → Storage-A | ✓ Success | Tenant A pod can access Storage-A via private endpoint | -| TenantA_VNetA1_S2_to_StorageA | pod-c1-aks1-a1s2-low → Storage-A | ✓ Success | Tenant A pod can access Storage-A via private endpoint | -| TenantA_VNetA2_to_StorageA | pod-c1-aks1-a2s1-high → Storage-A | ✓ Success | Tenant A pod from peered VNet can access Storage-A | -| TenantA_VNetA3_to_StorageA | pod-c1-aks2-a3s1-high → Storage-A | ✓ Success | Tenant A pod from different cluster can access Storage-A | -| TenantB_to_StorageA_Isolation | pod-c2-aks2-b1s1-low → Storage-A | ✗ Blocked | Tenant B pod CANNOT access Storage-A (tenant isolation) | +| TenantA_VNetV1_S1_to_StorageA | pod-c1-aks1-v1s1-low → Storage-A | ✓ Success | Tenant A pod can access Storage-A via private endpoint | +| TenantA_VNetV1_S2_to_StorageA | pod-c1-aks1-v1s2-low → Storage-A | ✓ Success | Tenant A pod can access Storage-A via private endpoint | +| TenantA_VNetV2_to_StorageA | pod-c1-aks1-v2s1-high → Storage-A | ✓ Success | Tenant A pod from peered VNet can access Storage-A | +| TenantA_VNetV3_to_StorageA | pod-c1-aks2-v3s1-high → Storage-A | ✓ Success | Tenant A pod from different cluster can access Storage-A | +| TenantB_to_StorageA_Isolation | pod-c2-aks2-v4s1-low → Storage-A | ✗ Blocked | Tenant B pod CANNOT access Storage-A (tenant isolation) | **Test Results**: 4 should succeed, 1 should be blocked (tenant isolation) @@ -211,14 +211,14 @@ All test scenarios create the following resources: | # | Scenario | Cluster | VNet | Subnet | Node Type | Pod Name | Purpose | |---|----------|---------|------|--------|-----------|----------|---------| -| 1 | Customer2-AKS2-VnetB1-S1-LowNic | aks-2 | cx_vnet_b1 | s1 | low-nic | pod-c2-aks2-b1s1-low | Tenant B pod for isolation testing | -| 2 | Customer2-AKS2-VnetB1-S1-HighNic | aks-2 | cx_vnet_b1 | s1 | high-nic | pod-c2-aks2-b1s1-high | Tenant B pod on high-NIC node | -| 3 | Customer1-AKS1-VnetA1-S1-LowNic | aks-1 | cx_vnet_a1 | s1 | low-nic | pod-c1-aks1-a1s1-low | Tenant A pod in NSG-protected subnet | -| 4 | Customer1-AKS1-VnetA1-S2-LowNic | aks-1 | cx_vnet_a1 | s2 | low-nic | pod-c1-aks1-a1s2-low | Tenant A pod for NSG isolation test | -| 5 | Customer1-AKS1-VnetA1-S2-HighNic | aks-1 | cx_vnet_a1 | s2 | high-nic | pod-c1-aks1-a1s2-high | Tenant A pod on high-NIC node | -| 6 | Customer1-AKS1-VnetA2-S1-HighNic | aks-1 | cx_vnet_a2 | s1 | high-nic | pod-c1-aks1-a2s1-high | Tenant A pod in peered VNet | -| 7 | Customer1-AKS2-VnetA2-S1-LowNic | aks-2 | cx_vnet_a2 | s1 | low-nic | pod-c1-aks2-a2s1-low | Cross-cluster same VNet test | -| 8 | Customer1-AKS2-VnetA3-S1-HighNic | aks-2 | cx_vnet_a3 | s1 | high-nic | pod-c1-aks2-a3s1-high | Private endpoint access test | +| 1 | Customer2-AKS2-VnetV4-S1-LowNic | aks-2 | cx_vnet_v4 | s1 | low-nic | pod-c2-aks2-v4s1-low | Tenant B pod for isolation testing | +| 2 | Customer2-AKS2-VnetV4-S1-HighNic | aks-2 | cx_vnet_v4 | s1 | high-nic | pod-c2-aks2-v4s1-high | Tenant B pod on high-NIC node | +| 3 | Customer1-AKS1-VnetV1-S1-LowNic | aks-1 | cx_vnet_v1 | s1 | 
low-nic | pod-c1-aks1-v1s1-low | Tenant A pod in NSG-protected subnet | +| 4 | Customer1-AKS1-VnetV1-S2-LowNic | aks-1 | cx_vnet_v1 | s2 | low-nic | pod-c1-aks1-v1s2-low | Tenant A pod for NSG isolation test | +| 5 | Customer1-AKS1-VnetV1-S2-HighNic | aks-1 | cx_vnet_v1 | s2 | high-nic | pod-c1-aks1-v1s2-high | Tenant A pod on high-NIC node | +| 6 | Customer1-AKS1-VnetV2-S1-HighNic | aks-1 | cx_vnet_v2 | s1 | high-nic | pod-c1-aks1-v2s1-high | Tenant A pod in peered VNet | +| 7 | Customer1-AKS2-VnetV2-S1-LowNic | aks-2 | cx_vnet_v2 | s1 | low-nic | pod-c1-aks2-v2s1-low | Cross-cluster same VNet test | +| 8 | Customer1-AKS2-VnetV3-S1-HighNic | aks-2 | cx_vnet_v3 | s1 | high-nic | pod-c1-aks2-v3s1-high | Private endpoint access test | ### Connectivity Tests (9 Test Cases in Job 2) @@ -228,23 +228,23 @@ Tests HTTP connectivity between pods using curl with 5-second timeout: | Test | Source → Destination | Validation | Purpose | |------|---------------------|------------|---------| -| SameVNetSameSubnet | pod-c1-aks1-a1s2-low → pod-c1-aks1-a1s2-high | HTTP 200 | Basic same-subnet connectivity | -| DifferentVNetSameCustomer | pod-c1-aks1-a2s1-high → pod-c1-aks2-a2s1-low | HTTP 200 | Cross-cluster, same VNet (a2) | -| PeeredVNets | pod-c1-aks1-a1s2-low → pod-c1-aks1-a2s1-high | HTTP 200 | VNet peering (a1 ↔ a2) | -| PeeredVNets_A2toA3 | pod-c1-aks1-a2s1-high → pod-c1-aks2-a3s1-high | HTTP 200 | VNet peering across clusters | +| SameVNetSameSubnet | pod-c1-aks1-v1s2-low → pod-c1-aks1-v1s2-high | HTTP 200 | Basic same-subnet connectivity | +| DifferentVNetSameCustomer | pod-c1-aks1-v2s1-high → pod-c1-aks2-v2s1-low | HTTP 200 | Cross-cluster, same VNet (v2) | +| PeeredVNets | pod-c1-aks1-v1s2-low → pod-c1-aks1-v2s1-high | HTTP 200 | VNet peering (v1 ↔ v2) | +| PeeredVNets_v2tov3 | pod-c1-aks1-v2s1-high → pod-c1-aks2-v3s1-high | HTTP 200 | VNet peering across clusters | **Expected to FAIL (5 tests)**: | Test | Source → Destination | Expected Error | Purpose | |------|---------------------|----------------|---------| -| NSGBlocked_S1toS2 | pod-c1-aks1-a1s1-low → pod-c1-aks1-a1s2-high | Connection timeout | NSG blocks s1→s2 in cx_vnet_a1 | -| NSGBlocked_S2toS1 | pod-c1-aks1-a1s2-low → pod-c1-aks1-a1s1-low | Connection timeout | NSG blocks s2→s1 (bidirectional) | -| DifferentCustomers_A1toB1 | pod-c1-aks1-a1s2-low → pod-c2-aks2-b1s1-low | Connection timeout | Customer isolation (no peering) | -| DifferentCustomers_A2toB1 | pod-c1-aks1-a2s1-high → pod-c2-aks2-b1s1-high | Connection timeout | Customer isolation (no peering) | -| UnpeeredVNets_A3toB1 | pod-c1-aks2-a3s1-high → pod-c2-aks2-b1s1-low | Connection timeout | No peering between a3 and b1 | +| NSGBlocked_S1toS2 | pod-c1-aks1-v1s1-low → pod-c1-aks1-v1s2-high | Connection timeout | NSG blocks s1→s2 in cx_vnet_v1 | +| NSGBlocked_S2toS1 | pod-c1-aks1-v1s2-low → pod-c1-aks1-v1s1-low | Connection timeout | NSG blocks s2→s1 (bidirectional) | +| DifferentCustomers_V1toV4 | pod-c1-aks1-v1s2-low → pod-c2-aks2-v4s1-low | Connection timeout | Customer isolation (no peering) | +| DifferentCustomers_V2toV4 | pod-c1-aks1-v2s1-high → pod-c2-aks2-v4s1-high | Connection timeout | Customer isolation (no peering) | +| UnpeeredVNets_V3toV4 | pod-c1-aks2-v3s1-high → pod-c2-aks2-v4s1-low | Connection timeout | No peering between v3 and v4 | **NSG Rules Configuration**: -- cx_vnet_a1 has NSG rules blocking traffic between s1 and s2 subnets: +- cx_vnet_v1 has NSG rules blocking traffic between s1 and s2 subnets: - Deny outbound from s1 to s2 (priority 100) - Deny inbound 
from s1 to s2 (priority 110) - Deny outbound from s2 to s1 (priority 100) @@ -258,21 +258,21 @@ Tests access to Azure Storage Account via Private Endpoint with public network a | Test | Source → Storage | Validation | Purpose | |------|-----------------|------------|---------| -| TenantA_VNetA1_S1_to_StorageA | pod-c1-aks1-a1s1-low → Storage-A | Blob download via SAS | Access via private endpoint from VNet A1 | -| TenantA_VNetA1_S2_to_StorageA | pod-c1-aks1-a1s2-low → Storage-A | Blob download via SAS | Access via private endpoint from VNet A1 | -| TenantA_VNetA2_to_StorageA | pod-c1-aks1-a2s1-high → Storage-A | Blob download via SAS | Access via peered VNet (A2 peered with A1) | -| TenantA_VNetA3_to_StorageA | pod-c1-aks2-a3s1-high → Storage-A | Blob download via SAS | Access via peered VNet from different cluster | +| TenantA_VNetV1_S1_to_StorageA | pod-c1-aks1-v1s1-low → Storage-A | Blob download via SAS | Access via private endpoint from VNet V1 | +| TenantA_VNetV1_S2_to_StorageA | pod-c1-aks1-v1s2-low → Storage-A | Blob download via SAS | Access via private endpoint from VNet V1 | +| TenantA_VNetV2_to_StorageA | pod-c1-aks1-v2s1-high → Storage-A | Blob download via SAS | Access via peered VNet (V2 peered with V1) | +| TenantA_VNetV3_to_StorageA | pod-c1-aks2-v3s1-high → Storage-A | Blob download via SAS | Access via peered VNet from different cluster | **Expected to FAIL (1 test)**: | Test | Source → Storage | Expected Error | Purpose | |------|-----------------|----------------|---------| -| TenantB_to_StorageA_Isolation | pod-c2-aks2-b1s1-low → Storage-A | Connection timeout/failed | Tenant isolation - no private endpoint access, public blocked | +| TenantB_to_StorageA_Isolation | pod-c2-aks2-v4s1-low → Storage-A | Connection timeout/failed | Tenant isolation - no private endpoint access, public blocked | **Private Endpoint Configuration**: -- Private endpoint created in cx_vnet_a1 subnet 'pe' +- Private endpoint created in cx_vnet_v1 subnet 'pe' - Private DNS zone `privatelink.blob.core.windows.net` linked to: - - cx_vnet_a1, cx_vnet_a2, cx_vnet_a3 (Tenant A VNets) + - cx_vnet_v1, cx_vnet_v2, cx_vnet_v3 (Tenant A VNets) - aks-1 and aks-2 cluster VNets - Storage Account 1 (Tenant A): - Public network access: **Disabled** @@ -300,17 +300,17 @@ Pod: pod- **Example** (for `resourceGroupName=sv2-long-run-centraluseuap`): ``` -pn-sv2-long-run-centraluseuap-a1-s1 -pni-sv2-long-run-centraluseuap-a1-s1 -pn-sv2-long-run-centraluseuap-a1-s1 (namespace) -pod-c1-aks1-a1s1-low +pn-sv2-long-run-centraluseuap-v1-s1 +pni-sv2-long-run-centraluseuap-v1-s1 +pn-sv2-long-run-centraluseuap-v1-s1 (namespace) +pod-c1-aks1-v1s1-low ``` **VNet Name Simplification**: -- `cx_vnet_a1` → `a1` -- `cx_vnet_a2` → `a2` -- `cx_vnet_a3` → `a3` -- `cx_vnet_b1` → `b1` +- `cx_vnet_v1` → `v1` +- `cx_vnet_v2` → `v2` +- `cx_vnet_v3` → `v3` +- `cx_vnet_v4` → `v4` ### Setup Flow (When runSetupStages = true) 1. Create resource group with `SkipAutoDeleteTill=2032-12-31` tag @@ -432,7 +432,7 @@ test/integration/swiftv2/longRunningCluster/ 4. **Tag resources appropriately**: All setup resources automatically tagged with `SkipAutoDeleteTill=2032-12-31` - AKS clusters - AKS VNets - - Customer VNets (cx_vnet_a1, cx_vnet_a2, cx_vnet_a3, cx_vnet_b1) + - Customer VNets (cx_vnet_v1, cx_vnet_v2, cx_vnet_v3, cx_vnet_v4) - Storage accounts 5. **Avoid resource group collisions**: Always use unique `resourceGroupName` when creating new setups 6. 
**Document changes**: Update this README when modifying test scenarios or infrastructure diff --git a/.pipelines/swiftv2-long-running/scripts/create_nsg.sh b/.pipelines/swiftv2-long-running/scripts/create_nsg.sh index 09f4dade4c..39c15b458d 100755 --- a/.pipelines/swiftv2-long-running/scripts/create_nsg.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_nsg.sh @@ -6,7 +6,7 @@ SUBSCRIPTION_ID=$1 RG=$2 LOCATION=$3 -VNET_A1="cx_vnet_a1" +VNET_A1="cx_vnet_v1" # Get actual subnet CIDR ranges dynamically echo "==> Retrieving actual subnet address prefixes..." diff --git a/.pipelines/swiftv2-long-running/scripts/create_pe.sh b/.pipelines/swiftv2-long-running/scripts/create_pe.sh index 4d83a8a700..697a0c1bd9 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_pe.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_pe.sh @@ -7,9 +7,9 @@ LOCATION=$2 RG=$3 SA1_NAME=$4 # Storage account 1 -VNET_A1="cx_vnet_a1" -VNET_A2="cx_vnet_a2" -VNET_A3="cx_vnet_a3" +VNET_A1="cx_vnet_v1" +VNET_A2="cx_vnet_v2" +VNET_A3="cx_vnet_v3" SUBNET_PE_A1="pe" PE_NAME="${SA1_NAME}-pe" PRIVATE_DNS_ZONE="privatelink.blob.core.windows.net" diff --git a/.pipelines/swiftv2-long-running/scripts/create_peerings.sh b/.pipelines/swiftv2-long-running/scripts/create_peerings.sh index d6655492f1..e458b4c215 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_peerings.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_peerings.sh @@ -3,10 +3,10 @@ set -e trap 'echo "[ERROR] Failed during VNet peering creation." >&2' ERR RG=$1 -VNET_A1="cx_vnet_a1" -VNET_A2="cx_vnet_a2" -VNET_A3="cx_vnet_a3" -VNET_B1="cx_vnet_b1" +VNET_A1="cx_vnet_v1" +VNET_A2="cx_vnet_v2" +VNET_A3="cx_vnet_v3" +VNET_B1="cx_vnet_v4" verify_peering() { local rg="$1"; local vnet="$2"; local peering="$3" diff --git a/.pipelines/swiftv2-long-running/scripts/create_vnets.sh b/.pipelines/swiftv2-long-running/scripts/create_vnets.sh index 4649c3aca1..77813709e5 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_vnets.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_vnets.sh @@ -10,7 +10,7 @@ BUILD_ID=$4 # --- VNet definitions --- # Create customer vnets for two customers A and B. 
# Using 172.16.0.0/12 range to avoid overlap with AKS infra 10.0.0.0/8 -VNAMES=( "cx_vnet_a1" "cx_vnet_a2" "cx_vnet_a3" "cx_vnet_b1" ) +VNAMES=( "cx_vnet_v1" "cx_vnet_v2" "cx_vnet_v3" "cx_vnet_v4" ) VCIDRS=( "172.16.0.0/16" "172.17.0.0/16" "172.18.0.0/16" "172.19.0.0/16" ) NODE_SUBNETS=( "172.16.0.0/24" "172.17.0.0/24" "172.18.0.0/24" "172.19.0.0/24" ) EXTRA_SUBNETS_LIST=( "s1 s2 pe" "s1" "s1" "s1" ) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index ca2a4cb2cb..4fad41bd4c 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -105,7 +105,7 @@ type TestResources struct { type PodScenario struct { Name string // Descriptive name for the scenario Cluster string // "aks-1" or "aks-2" - VnetName string // e.g., "cx_vnet_a1", "cx_vnet_b1" + VnetName string // e.g., "cx_vnet_v1", "cx_vnet_v4" SubnetName string // e.g., "s1", "s2" NodeSelector string // "low-nic" or "high-nic" PodNameSuffix string // Unique suffix for pod name diff --git a/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go b/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go index e37a5368d2..7822d04b63 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go @@ -30,9 +30,9 @@ var _ = ginkgo.Describe("Datapath Connectivity Tests", func() { } // Helper function to generate namespace from vnet and subnet // Format: pn--- - // Example: pn-sv2-long-run-centraluseuap-a1-s1 + // Example: pn-sv2-long-run-centraluseuap-v1-s1 getNamespace := func(vnetName, subnetName string) string { - // Extract vnet prefix (a1, a2, a3, b1, etc.) from cx_vnet_a1 -> a1 + // Extract vnet prefix (v1, v2, v3, v4, etc.) 
from cx_vnet_v1 -> v1 vnetPrefix := strings.TrimPrefix(vnetName, "cx_vnet_") return fmt.Sprintf("pn-%s-%s-%s", rg, vnetPrefix, subnetName) } @@ -42,86 +42,86 @@ var _ = ginkgo.Describe("Datapath Connectivity Tests", func() { connectivityTests := []ConnectivityTest{ { Name: "SameVNetSameSubnet", - SourcePod: "pod-c1-aks1-a1s2-low", - SourceNamespace: getNamespace("cx_vnet_a1", "s2"), - DestinationPod: "pod-c1-aks1-a1s2-high", - DestNamespace: getNamespace("cx_vnet_a1", "s2"), + SourcePod: "pod-c1-aks1-v1s2-low", + SourceNamespace: getNamespace("cx_vnet_v1", "s2"), + DestinationPod: "pod-c1-aks1-v1s2-high", + DestNamespace: getNamespace("cx_vnet_v1", "s2"), Cluster: "aks-1", - Description: "Test connectivity between low-NIC and high-NIC pods in same VNet/Subnet (cx_vnet_a1/s2)", + Description: "Test connectivity between low-NIC and high-NIC pods in same VNet/Subnet (cx_vnet_v1/s2)", ShouldFail: false, }, { Name: "NSGBlocked_S1toS2", - SourcePod: "pod-c1-aks1-a1s1-low", - SourceNamespace: getNamespace("cx_vnet_a1", "s1"), - DestinationPod: "pod-c1-aks1-a1s2-high", - DestNamespace: getNamespace("cx_vnet_a1", "s2"), + SourcePod: "pod-c1-aks1-v1s1-low", + SourceNamespace: getNamespace("cx_vnet_v1", "s1"), + DestinationPod: "pod-c1-aks1-v1s2-high", + DestNamespace: getNamespace("cx_vnet_v1", "s2"), Cluster: "aks-1", - Description: "Test NSG isolation: s1 -> s2 in cx_vnet_a1 (should be blocked by NSG rule)", + Description: "Test NSG isolation: s1 -> s2 in cx_vnet_v1 (should be blocked by NSG rule)", ShouldFail: true, }, { Name: "NSGBlocked_S2toS1", - SourcePod: "pod-c1-aks1-a1s2-low", - SourceNamespace: getNamespace("cx_vnet_a1", "s2"), - DestinationPod: "pod-c1-aks1-a1s1-low", - DestNamespace: getNamespace("cx_vnet_a1", "s1"), + SourcePod: "pod-c1-aks1-v1s2-low", + SourceNamespace: getNamespace("cx_vnet_v1", "s2"), + DestinationPod: "pod-c1-aks1-v1s1-low", + DestNamespace: getNamespace("cx_vnet_v1", "s1"), Cluster: "aks-1", - Description: "Test NSG isolation: s2 -> s1 in cx_vnet_a1 (should be blocked by NSG rule)", + Description: "Test NSG isolation: s2 -> s1 in cx_vnet_v1 (should be blocked by NSG rule)", ShouldFail: true, }, { Name: "DifferentClusters_SameVNet", - SourcePod: "pod-c1-aks1-a2s1-high", - SourceNamespace: getNamespace("cx_vnet_a2", "s1"), - DestinationPod: "pod-c1-aks2-a2s1-low", - DestNamespace: getNamespace("cx_vnet_a2", "s1"), + SourcePod: "pod-c1-aks1-v2s1-high", + SourceNamespace: getNamespace("cx_vnet_v2", "s1"), + DestinationPod: "pod-c1-aks2-v2s1-low", + DestNamespace: getNamespace("cx_vnet_v2", "s1"), Cluster: "aks-1", DestCluster: "aks-2", - Description: "Test connectivity across different clusters, same customer VNet (cx_vnet_a2)", + Description: "Test connectivity across different clusters, same customer VNet (cx_vnet_v2)", ShouldFail: false, }, { Name: "PeeredVNets", - SourcePod: "pod-c1-aks1-a1s2-low", - SourceNamespace: getNamespace("cx_vnet_a1", "s2"), - DestinationPod: "pod-c1-aks1-a2s1-high", - DestNamespace: getNamespace("cx_vnet_a2", "s1"), + SourcePod: "pod-c1-aks1-v1s2-low", + SourceNamespace: getNamespace("cx_vnet_v1", "s2"), + DestinationPod: "pod-c1-aks1-v2s1-high", + DestNamespace: getNamespace("cx_vnet_v2", "s1"), Cluster: "aks-1", - Description: "Test connectivity between peered VNets (cx_vnet_a1/s2 <-> cx_vnet_a2/s1)", + Description: "Test connectivity between peered VNets (cx_vnet_v1/s2 <-> cx_vnet_v2/s1)", ShouldFail: false, }, { - Name: "PeeredVNets_A2toA3", - SourcePod: "pod-c1-aks1-a2s1-high", - SourceNamespace: getNamespace("cx_vnet_a2", "s1"), - 
DestinationPod: "pod-c1-aks2-a3s1-high", - DestNamespace: getNamespace("cx_vnet_a3", "s1"), + Name: "PeeredVNets_v2tov3", + SourcePod: "pod-c1-aks1-v2s1-high", + SourceNamespace: getNamespace("cx_vnet_v2", "s1"), + DestinationPod: "pod-c1-aks2-v3s1-high", + DestNamespace: getNamespace("cx_vnet_v3", "s1"), Cluster: "aks-1", DestCluster: "aks-2", - Description: "Test connectivity between peered VNets across clusters (cx_vnet_a2 <-> cx_vnet_a3)", + Description: "Test connectivity between peered VNets across clusters (cx_vnet_v2 <-> cx_vnet_v3)", ShouldFail: false, }, { - Name: "DifferentCustomers_A1toB1", - SourcePod: "pod-c1-aks1-a1s2-low", - SourceNamespace: getNamespace("cx_vnet_a1", "s2"), - DestinationPod: "pod-c2-aks2-b1s1-low", - DestNamespace: getNamespace("cx_vnet_b1", "s1"), + Name: "DifferentCustomers_v1tov4", + SourcePod: "pod-c1-aks1-v1s2-low", + SourceNamespace: getNamespace("cx_vnet_v1", "s2"), + DestinationPod: "pod-c2-aks2-v4s1-low", + DestNamespace: getNamespace("cx_vnet_v4", "s1"), Cluster: "aks-1", DestCluster: "aks-2", - Description: "Test isolation: Customer 1 to Customer 2 should fail (cx_vnet_a1 -> cx_vnet_b1)", + Description: "Test isolation: Customer 1 to Customer 2 should fail (cx_vnet_v1 -> cx_vnet_v4)", ShouldFail: true, }, { - Name: "DifferentCustomers_A2toB1", - SourcePod: "pod-c1-aks1-a2s1-high", - SourceNamespace: getNamespace("cx_vnet_a2", "s1"), - DestinationPod: "pod-c2-aks2-b1s1-high", - DestNamespace: getNamespace("cx_vnet_b1", "s1"), + Name: "DifferentCustomers_v2tov4", + SourcePod: "pod-c1-aks1-v2s1-high", + SourceNamespace: getNamespace("cx_vnet_v2", "s1"), + DestinationPod: "pod-c2-aks2-v4s1-high", + DestNamespace: getNamespace("cx_vnet_v4", "s1"), Cluster: "aks-1", DestCluster: "aks-2", - Description: "Test isolation: Customer 1 to Customer 2 should fail (cx_vnet_a2 -> cx_vnet_b1)", + Description: "Test isolation: Customer 1 to Customer 2 should fail (cx_vnet_v2 -> cx_vnet_v4)", ShouldFail: true, }, } diff --git a/test/integration/swiftv2/longRunningCluster/datapath_create_test.go b/test/integration/swiftv2/longRunningCluster/datapath_create_test.go index a818192a37..d93088035f 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_create_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_create_test.go @@ -29,71 +29,71 @@ var _ = ginkgo.Describe("Datapath Create Tests", func() { } // Define all test scenarios scenarios := []PodScenario{ - // Customer 2 scenarios on aks-2 with cx_vnet_b1 + // Customer 2 scenarios on aks-2 with cx_vnet_v4 { - Name: "Customer2-AKS2-VnetB1-S1-LowNic", + Name: "Customer2-AKS2-VnetV4-S1-LowNic", Cluster: "aks-2", - VnetName: "cx_vnet_b1", + VnetName: "cx_vnet_v4", SubnetName: "s1", NodeSelector: "low-nic", - PodNameSuffix: "c2-aks2-b1s1-low", + PodNameSuffix: "c2-aks2-v4s1-low", }, { - Name: "Customer2-AKS2-VnetB1-S1-HighNic", + Name: "Customer2-AKS2-VnetV4-S1-HighNic", Cluster: "aks-2", - VnetName: "cx_vnet_b1", + VnetName: "cx_vnet_v4", SubnetName: "s1", NodeSelector: "high-nic", - PodNameSuffix: "c2-aks2-b1s1-high", + PodNameSuffix: "c2-aks2-v4s1-high", }, // Customer 1 scenarios { - Name: "Customer1-AKS1-VnetA1-S1-LowNic", + Name: "Customer1-AKS1-VnetV1-S1-LowNic", Cluster: "aks-1", - VnetName: "cx_vnet_a1", + VnetName: "cx_vnet_v1", SubnetName: "s1", NodeSelector: "low-nic", - PodNameSuffix: "c1-aks1-a1s1-low", + PodNameSuffix: "c1-aks1-v1s1-low", }, { - Name: "Customer1-AKS1-VnetA1-S2-LowNic", + Name: "Customer1-AKS1-VnetV1-S2-LowNic", Cluster: "aks-1", - VnetName: "cx_vnet_a1", + 
VnetName: "cx_vnet_v1", SubnetName: "s2", NodeSelector: "low-nic", - PodNameSuffix: "c1-aks1-a1s2-low", + PodNameSuffix: "c1-aks1-v1s2-low", }, { - Name: "Customer1-AKS1-VnetA1-S2-HighNic", + Name: "Customer1-AKS1-VnetV1-S2-HighNic", Cluster: "aks-1", - VnetName: "cx_vnet_a1", + VnetName: "cx_vnet_v1", SubnetName: "s2", NodeSelector: "high-nic", - PodNameSuffix: "c1-aks1-a1s2-high", + PodNameSuffix: "c1-aks1-v1s2-high", }, { - Name: "Customer1-AKS1-VnetA2-S1-HighNic", + Name: "Customer1-AKS1-VnetV2-S1-HighNic", Cluster: "aks-1", - VnetName: "cx_vnet_a2", + VnetName: "cx_vnet_v2", SubnetName: "s1", NodeSelector: "high-nic", - PodNameSuffix: "c1-aks1-a2s1-high", + PodNameSuffix: "c1-aks1-v2s1-high", }, { - Name: "Customer1-AKS2-VnetA2-S1-LowNic", + Name: "Customer1-AKS2-VnetV2-S1-LowNic", Cluster: "aks-2", - VnetName: "cx_vnet_a2", + VnetName: "cx_vnet_v2", SubnetName: "s1", NodeSelector: "low-nic", - PodNameSuffix: "c1-aks2-a2s1-low", + PodNameSuffix: "c1-aks2-v2s1-low", }, { - Name: "Customer1-AKS2-VnetA3-S1-HighNic", + Name: "Customer1-AKS2-VnetV3-S1-HighNic", Cluster: "aks-2", - VnetName: "cx_vnet_a3", + VnetName: "cx_vnet_v3", SubnetName: "s1", NodeSelector: "high-nic", - PodNameSuffix: "c1-aks2-a3s1-high", + PodNameSuffix: "c1-aks2-v3s1-high", }, } diff --git a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go index 0a4b56c95e..af2cbd2081 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go @@ -27,71 +27,71 @@ var _ = ginkgo.Describe("Datapath Delete Tests", func() { } // Define all test scenarios (same as create) scenarios := []PodScenario{ - // Customer 2 scenarios on aks-2 with cx_vnet_b1 + // Customer 2 scenarios on aks-2 with cx_vnet_v4 { - Name: "Customer2-AKS2-VnetB1-S1-LowNic", + Name: "Customer2-AKS2-VnetV4-S1-LowNic", Cluster: "aks-2", - VnetName: "cx_vnet_b1", + VnetName: "cx_vnet_v4", SubnetName: "s1", NodeSelector: "low-nic", - PodNameSuffix: "c2-aks2-b1s1-low", + PodNameSuffix: "c2-aks2-v4s1-low", }, { - Name: "Customer2-AKS2-VnetB1-S1-HighNic", + Name: "Customer2-AKS2-VnetV4-S1-HighNic", Cluster: "aks-2", - VnetName: "cx_vnet_b1", + VnetName: "cx_vnet_v4", SubnetName: "s1", NodeSelector: "high-nic", - PodNameSuffix: "c2-aks2-b1s1-high", + PodNameSuffix: "c2-aks2-v4s1-high", }, // Customer 1 scenarios { - Name: "Customer1-AKS1-VnetA1-S1-LowNic", + Name: "Customer1-AKS1-VnetV1-S1-LowNic", Cluster: "aks-1", - VnetName: "cx_vnet_a1", + VnetName: "cx_vnet_v1", SubnetName: "s1", NodeSelector: "low-nic", - PodNameSuffix: "c1-aks1-a1s1-low", + PodNameSuffix: "c1-aks1-v1s1-low", }, { - Name: "Customer1-AKS1-VnetA1-S2-LowNic", + Name: "Customer1-AKS1-VnetV1-S2-LowNic", Cluster: "aks-1", - VnetName: "cx_vnet_a1", + VnetName: "cx_vnet_v1", SubnetName: "s2", NodeSelector: "low-nic", - PodNameSuffix: "c1-aks1-a1s2-low", + PodNameSuffix: "c1-aks1-v1s2-low", }, { - Name: "Customer1-AKS1-VnetA1-S2-HighNic", + Name: "Customer1-AKS1-VnetV1-S2-HighNic", Cluster: "aks-1", - VnetName: "cx_vnet_a1", + VnetName: "cx_vnet_v1", SubnetName: "s2", NodeSelector: "high-nic", - PodNameSuffix: "c1-aks1-a1s2-high", + PodNameSuffix: "c1-aks1-v1s2-high", }, { - Name: "Customer1-AKS1-VnetA2-S1-HighNic", + Name: "Customer1-AKS1-VnetV2-S1-HighNic", Cluster: "aks-1", - VnetName: "cx_vnet_a2", + VnetName: "cx_vnet_v2", SubnetName: "s1", NodeSelector: "high-nic", - PodNameSuffix: "c1-aks1-a2s1-high", + PodNameSuffix: 
"c1-aks1-v2s1-high", }, { - Name: "Customer1-AKS2-VnetA2-S1-LowNic", + Name: "Customer1-AKS2-VnetV2-S1-LowNic", Cluster: "aks-2", - VnetName: "cx_vnet_a2", + VnetName: "cx_vnet_v2", SubnetName: "s1", NodeSelector: "low-nic", - PodNameSuffix: "c1-aks2-a2s1-low", + PodNameSuffix: "c1-aks2-v2s1-low", }, { - Name: "Customer1-AKS2-VnetA3-S1-HighNic", + Name: "Customer1-AKS2-VnetV3-S1-HighNic", Cluster: "aks-2", - VnetName: "cx_vnet_a3", + VnetName: "cx_vnet_v3", SubnetName: "s1", NodeSelector: "high-nic", - PodNameSuffix: "c1-aks2-a3s1-high", + PodNameSuffix: "c1-aks2-v3s1-high", }, } diff --git a/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go b/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go index 0d94087a50..6fb0131ccb 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go @@ -56,45 +56,45 @@ var _ = ginkgo.Describe("Private Endpoint Tests", func() { // Test scenarios for Private Endpoint connectivity privateEndpointTests := []ConnectivityTest{ - // Test 1: Private Endpoint Access (Tenant A) - Pod from VNet-A1 Subnet 1 + // Test 1: Private Endpoint Access (Tenant A) - Pod from VNet-V1 Subnet 1 { - Name: "Private Endpoint Access: VNet-A1-S1 to Storage-A", + Name: "Private Endpoint Access: VNet-V1-S1 to Storage-A", SourceCluster: "aks-1", - SourcePodName: "pod-c1-aks1-a1s1-low", - SourceNS: "pn-" + testScenarios.BuildID + "-a1-s1", + SourcePodName: "pod-c1-aks1-v1s1-low", + SourceNS: "pn-" + testScenarios.BuildID + "-v1-s1", DestEndpoint: storageEndpoint, ShouldFail: false, TestType: "storage-access", Purpose: "Verify Tenant A pod can access Storage-A via private endpoint", }, - // Test 2: Private Endpoint Access (Tenant A) - Pod from VNet-A1 Subnet 2 + // Test 2: Private Endpoint Access (Tenant A) - Pod from VNet-V1 Subnet 2 { - Name: "Private Endpoint Access: VNet-A1-S2 to Storage-A", + Name: "Private Endpoint Access: VNet-V1-S2 to Storage-A", SourceCluster: "aks-1", - SourcePodName: "pod-c1-aks1-a1s2-low", - SourceNS: "pn-" + testScenarios.BuildID + "-a1-s2", + SourcePodName: "pod-c1-aks1-v1s2-low", + SourceNS: "pn-" + testScenarios.BuildID + "-v1-s2", DestEndpoint: storageEndpoint, ShouldFail: false, TestType: "storage-access", Purpose: "Verify Tenant A pod can access Storage-A via private endpoint", }, - // Test 3: Private Endpoint Access (Tenant A) - Pod from VNet-A2 + // Test 3: Private Endpoint Access (Tenant A) - Pod from VNet-V2 { - Name: "Private Endpoint Access: VNet-A2-S1 to Storage-A", + Name: "Private Endpoint Access: VNet-V2-S1 to Storage-A", SourceCluster: "aks-1", - SourcePodName: "pod-c1-aks1-a2s1-high", - SourceNS: "pn-" + testScenarios.BuildID + "-a2-s1", + SourcePodName: "pod-c1-aks1-v2s1-high", + SourceNS: "pn-" + testScenarios.BuildID + "-v2-s1", DestEndpoint: storageEndpoint, ShouldFail: false, TestType: "storage-access", Purpose: "Verify Tenant A pod from peered VNet can access Storage-A", }, - // Test 4: Private Endpoint Access (Tenant A) - Pod from VNet-A3 (cross-cluster) + // Test 4: Private Endpoint Access (Tenant A) - Pod from VNet-V3 (cross-cluster) { - Name: "Private Endpoint Access: VNet-A3-S1 to Storage-A (cross-cluster)", + Name: "Private Endpoint Access: VNet-V3-S1 to Storage-A (cross-cluster)", SourceCluster: "aks-2", - SourcePodName: "pod-c1-aks2-a3s1-high", - SourceNS: "pn-" + testScenarios.BuildID + "-a3-s1", + SourcePodName: "pod-c1-aks2-v3s1-high", + SourceNS: "pn-" + 
testScenarios.BuildID + "-v3-s1", DestEndpoint: storageEndpoint, ShouldFail: false, TestType: "storage-access", From 107f2c39bc541846d18c7303d5bdccd30fbf6e58 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 10:40:05 -0800 Subject: [PATCH 25/64] add container readiness check. --- .../integration/swiftv2/helpers/az_helpers.go | 43 +++++++++++++------ 1 file changed, 29 insertions(+), 14 deletions(-) diff --git a/test/integration/swiftv2/helpers/az_helpers.go b/test/integration/swiftv2/helpers/az_helpers.go index ba68e9ad30..89645d1213 100644 --- a/test/integration/swiftv2/helpers/az_helpers.go +++ b/test/integration/swiftv2/helpers/az_helpers.go @@ -330,23 +330,38 @@ func GetPodIP(kubeconfig, namespace, podName string) (string, error) { // GetPodDelegatedIP retrieves the eth1 IP address (delegated subnet IP) of a pod // This is the IP used for cross-VNet communication and is subject to NSG rules func GetPodDelegatedIP(kubeconfig, namespace, podName string) (string, error) { - ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) - defer cancel() + // Retry logic - pod might be Running but container not ready yet + maxRetries := 5 + for attempt := 1; attempt <= maxRetries; attempt++ { + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) - // Get eth1 IP address by running 'ip addr show eth1' in the pod - cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "exec", podName, - "-n", namespace, "--", "sh", "-c", "ip -4 addr show eth1 | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1") - out, err := cmd.CombinedOutput() - if err != nil { - return "", fmt.Errorf("failed to get eth1 IP for %s in namespace %s: %w\nOutput: %s", podName, namespace, err, string(out)) - } + // Get eth1 IP address by running 'ip addr show eth1' in the pod + cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "exec", podName, + "-n", namespace, "-c", "net-debugger", "--", "sh", "-c", "ip -4 addr show eth1 | grep 'inet ' | awk '{print $2}' | cut -d'/' -f1") + out, err := cmd.CombinedOutput() + cancel() - ip := strings.TrimSpace(string(out)) - if ip == "" { - return "", fmt.Errorf("pod %s in namespace %s has no eth1 IP address (delegated subnet not configured?)", podName, namespace) + if err == nil { + ip := strings.TrimSpace(string(out)) + if ip != "" { + return ip, nil + } + return "", fmt.Errorf("pod %s in namespace %s has no eth1 IP address (delegated subnet not configured?)", podName, namespace) + } + + // Check if it's a container not found error + if strings.Contains(string(out), "container not found") { + if attempt < maxRetries { + fmt.Printf("Container not ready yet in pod %s (attempt %d/%d), waiting 3 seconds...\n", podName, attempt, maxRetries) + time.Sleep(3 * time.Second) + continue + } + } + + return "", fmt.Errorf("failed to get eth1 IP for %s in namespace %s: %w\nOutput: %s", podName, namespace, err, string(out)) } - return ip, nil + return "", fmt.Errorf("pod %s container not ready after %d attempts", podName, maxRetries) } // ExecInPod executes a command in a pod and returns the output @@ -389,7 +404,7 @@ func VerifyNoMTPNC(kubeconfig, buildID string) error { mtpncNames = append(mtpncNames, line) } } - + if len(mtpncNames) > 0 { return fmt.Errorf("found %d MTPNC resources with build ID '%s' that should have been deleted. 
This may indicate stuck MTPNC deletion", len(mtpncNames), buildID) } From 8a773ff615aba7ff3fb052f00a7337b7c5048c6a Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 10:45:56 -0800 Subject: [PATCH 26/64] update pod.yaml --- .../swiftv2/long-running-cluster/pod.yaml | 22 ++++++------------- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml index ae3738a072..5ee2cf6033 100644 --- a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml +++ b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml @@ -15,21 +15,13 @@ spec: command: ["/bin/bash", "-c"] args: - | - echo "Pod Network Diagnostics started on $(hostname)"; - echo "Pod IP: $(hostname -i)"; - echo "Starting TCP listener on port 8080"; - - while true; do - nc -l -p 8080 -c 'echo "TCP Connection Success from $(hostname) at $(date)"' - done & - - echo "TCP listener started on port 8080" - sleep 2 - if netstat -tuln | grep -q ':8080'; then # Verify listener is running - echo "TCP listener is active on port 8080" - else - echo "WARNING: TCP listener may not be active on port 8080" - fi + echo "Pod Network Diagnostics started on $(hostname)" + echo "Pod IP: $(hostname -i)" + echo "Starting TCP listener on port 8080" + + while true; do + nc -lk -p 8080 -e /bin/sh + done ports: - containerPort: 8080 protocol: TCP From e9bdcfba1c5305ea1d1cc0dd401b5d7302099338 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 15:11:07 -0800 Subject: [PATCH 27/64] Update pod.yaml --- .../long-running-pipeline-template.yaml | 17 +++++------------ .../swiftv2/long-running-cluster/pod.yaml | 7 ++++--- 2 files changed, 9 insertions(+), 15 deletions(-) diff --git a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml index 014b634459..332ab93e46 100644 --- a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml +++ b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml @@ -67,7 +67,7 @@ stages: steps: - checkout: self - task: AzureCLI@2 - displayName: "Run create_aks.sh" + displayName: "Create AKS clusters" inputs: azureSubscription: ${{ parameters.serviceConnection }} scriptType: bash @@ -172,7 +172,7 @@ stages: # Job 1: Create Test Resources and Wait # ------------------------------------------------------------ - job: CreateTestResources - displayName: "Create Resources and Wait 20 Minutes" + displayName: "Create Resources" timeoutInMinutes: 90 pool: vmImage: ubuntu-latest @@ -444,7 +444,6 @@ stages: - CreateTestResources - ConnectivityTests - PrivateEndpointTests - - ScaleTest # Always run cleanup, even if previous jobs failed condition: always() timeoutInMinutes: 60 @@ -452,12 +451,10 @@ stages: vmImage: ubuntu-latest steps: - checkout: self - - task: GoTool@0 displayName: "Use Go 1.22.5" inputs: - version: "1.22.5" - + version: "1.22.5" - task: AzureCLI@2 displayName: "Delete Test Resources" inputs: @@ -467,13 +464,11 @@ stages: inlineScript: | echo "==> Installing Ginkgo CLI" go install github.com/onsi/ginkgo/v2/ginkgo@latest - echo "==> Adding Go bin to PATH" export PATH=$PATH:$(go env GOPATH)/bin - echo "==> Downloading Go dependencies" go mod download - + echo "==> Setting up kubeconfig for cluster aks-1" az aks get-credentials \ --resource-group $(rgName) \ @@ -481,7 +476,6 @@ stages: --file /tmp/aks-1.kubeconfig \ 
--overwrite-existing \ --admin - echo "==> Setting up kubeconfig for cluster aks-2" az aks get-credentials \ --resource-group $(rgName) \ @@ -489,12 +483,11 @@ stages: --file /tmp/aks-2.kubeconfig \ --overwrite-existing \ --admin - + echo "==> Deleting test resources (8 scenarios)" export RG="$(rgName)" export BUILD_ID="$(rgName)" export WORKLOAD_TYPE="swiftv2-linux" cd ./test/integration/swiftv2/longRunningCluster ginkgo -v -trace --timeout=1h --tags=delete_test - \ No newline at end of file diff --git a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml index 5ee2cf6033..28b422d0d6 100644 --- a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml +++ b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml @@ -18,9 +18,10 @@ spec: echo "Pod Network Diagnostics started on $(hostname)" echo "Pod IP: $(hostname -i)" echo "Starting TCP listener on port 8080" - - while true; do - nc -lk -p 8080 -e /bin/sh + + # Start netcat listener that responds to connections + while true; do + echo "TCP Connection Success from $(hostname) at $(date)" | nc -l -p 8080 done ports: - containerPort: 8080 From f751a45caf658bd911a23b65f877fad85c580d9b Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 15:49:13 -0800 Subject: [PATCH 28/64] Update connectivity test. --- test/integration/swiftv2/longRunningCluster/datapath.go | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 4fad41bd4c..04c8fef3a7 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -517,7 +517,7 @@ func DeleteAllScenarios(testScenarios TestScenarios) error { // Phase 3: Verify no MTPNC resources are stuck fmt.Printf("\n=== Phase 3: Verifying MTPNC cleanup ===\n") clustersChecked := make(map[string]bool) - + for _, scenario := range testScenarios.Scenarios { // Check each cluster only once if clustersChecked[scenario.Cluster] { @@ -527,7 +527,7 @@ func DeleteAllScenarios(testScenarios TestScenarios) error { kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.Cluster) fmt.Printf("Checking for pending MTPNC resources in cluster %s\n", scenario.Cluster) - + err := helpers.VerifyNoMTPNC(kubeconfig, testScenarios.BuildID) if err != nil { fmt.Printf("WARNING: Found pending MTPNC resources in cluster %s: %v\n", scenario.Cluster, err) @@ -619,7 +619,8 @@ func RunConnectivityTest(test ConnectivityTest, rg, buildId string) error { // Run curl command from source pod to destination pod using eth1 IP // Using -m 10 for 10 second timeout // Using --interface eth1 to force traffic through delegated subnet interface - curlCmd := fmt.Sprintf("curl --interface eth1 -m 10 http://%s:8080/", destIP) + // Using --http0.9 to allow HTTP/0.9 responses from netcat (which sends raw text without proper HTTP headers) + curlCmd := fmt.Sprintf("curl --http0.9 --interface eth1 -m 10 http://%s:8080/", destIP) output, err := helpers.ExecInPod(sourceKubeconfig, test.SourceNamespace, test.SourcePod, curlCmd) if err != nil { From ff2c7b30bb93800c7e087ef2b2782a19102e499f Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 18:46:41 -0800 Subject: [PATCH 29/64] Update netcat curl test. 
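Note: because the pod's netcat loop replies with a raw text banner rather than a real HTTP response, curl needs `--http0.9` and the test treats receipt of the banner as proof of connectivity. The exchange is plain TCP; a standalone, purely illustrative Go probe of the same listener (address is a placeholder):

```go
package main

import (
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

// probe connects to the listener and reads whatever banner it sends; success
// is "we received the banner", mirroring what the curl-based test accepts.
func probe(addr string) (string, error) {
	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	if err != nil {
		return "", fmt.Errorf("dial %s: %w", addr, err)
	}
	defer conn.Close()
	_ = conn.SetReadDeadline(time.Now().Add(3 * time.Second))
	banner, err := io.ReadAll(conn)
	if len(banner) == 0 && err != nil {
		return "", fmt.Errorf("read from %s: %w", addr, err)
	}
	return string(banner), nil // hitting the deadline after data arrived still counts as success
}

func main() {
	banner, err := probe("172.16.1.10:8080") // placeholder delegated-subnet IP
	if err != nil {
		fmt.Fprintln(os.Stderr, "connectivity failed:", err)
		os.Exit(1)
	}
	fmt.Println("received:", banner)
}
```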
--- .../integration/swiftv2/helpers/az_helpers.go | 24 +++++++++++-------- .../swiftv2/longRunningCluster/datapath.go | 16 ++++++++++--- 2 files changed, 27 insertions(+), 13 deletions(-) diff --git a/test/integration/swiftv2/helpers/az_helpers.go b/test/integration/swiftv2/helpers/az_helpers.go index 89645d1213..c40718af73 100644 --- a/test/integration/swiftv2/helpers/az_helpers.go +++ b/test/integration/swiftv2/helpers/az_helpers.go @@ -330,10 +330,10 @@ func GetPodIP(kubeconfig, namespace, podName string) (string, error) { // GetPodDelegatedIP retrieves the eth1 IP address (delegated subnet IP) of a pod // This is the IP used for cross-VNet communication and is subject to NSG rules func GetPodDelegatedIP(kubeconfig, namespace, podName string) (string, error) { - // Retry logic - pod might be Running but container not ready yet + // Retry logic - pod might be Running but container not ready yet, or network interface still initializing maxRetries := 5 for attempt := 1; attempt <= maxRetries; attempt++ { - ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) // Get eth1 IP address by running 'ip addr show eth1' in the pod cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "exec", podName, @@ -349,13 +349,17 @@ func GetPodDelegatedIP(kubeconfig, namespace, podName string) (string, error) { return "", fmt.Errorf("pod %s in namespace %s has no eth1 IP address (delegated subnet not configured?)", podName, namespace) } - // Check if it's a container not found error - if strings.Contains(string(out), "container not found") { - if attempt < maxRetries { - fmt.Printf("Container not ready yet in pod %s (attempt %d/%d), waiting 3 seconds...\n", podName, attempt, maxRetries) - time.Sleep(3 * time.Second) - continue - } + // Check for retryable errors: container not found, signal killed, context deadline exceeded + errStr := strings.ToLower(err.Error()) + outStr := strings.ToLower(string(out)) + isRetryable := strings.Contains(outStr, "container not found") || + strings.Contains(errStr, "signal: killed") || + strings.Contains(errStr, "context deadline exceeded") + + if isRetryable && attempt < maxRetries { + fmt.Printf("Retryable error getting IP for pod %s (attempt %d/%d): %v. 
Waiting 5 seconds...\n", podName, attempt, maxRetries, err) + time.Sleep(5 * time.Second) + continue } return "", fmt.Errorf("failed to get eth1 IP for %s in namespace %s: %w\nOutput: %s", podName, namespace, err, string(out)) @@ -366,7 +370,7 @@ func GetPodDelegatedIP(kubeconfig, namespace, podName string) (string, error) { // ExecInPod executes a command in a pod and returns the output func ExecInPod(kubeconfig, namespace, podName, command string) (string, error) { - ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) + ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second) defer cancel() cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "exec", podName, diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 04c8fef3a7..7d03fae46e 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -617,13 +617,23 @@ func RunConnectivityTest(test ConnectivityTest, rg, buildId string) error { test.DestNamespace, test.DestinationPod, test.DestCluster, destIP) // Run curl command from source pod to destination pod using eth1 IP - // Using -m 10 for 10 second timeout + // Using -m 3 for 3 second timeout (short because netcat closes connection immediately) // Using --interface eth1 to force traffic through delegated subnet interface // Using --http0.9 to allow HTTP/0.9 responses from netcat (which sends raw text without proper HTTP headers) - curlCmd := fmt.Sprintf("curl --http0.9 --interface eth1 -m 10 http://%s:8080/", destIP) + // Exit code 28 (timeout) is OK if we received data, since netcat doesn't properly close the connection + curlCmd := fmt.Sprintf("curl --http0.9 --interface eth1 -m 3 http://%s:8080/", destIP) output, err := helpers.ExecInPod(sourceKubeconfig, test.SourceNamespace, test.SourcePod, curlCmd) - if err != nil { + + // Check if we received data even if curl timed out (exit code 28) + // Netcat closes the connection without proper HTTP close, causing curl to timeout + // But if we got the expected response, the connectivity test is successful + if err != nil { + if strings.Contains(err.Error(), "exit status 28") && strings.Contains(output, "TCP Connection Success") { + // Timeout but we got the data - this is OK with netcat + fmt.Printf("Connectivity successful (timeout OK, data received)! Response preview: %s\n", truncateString(output, 100)) + return nil + } return fmt.Errorf("connectivity test failed: %w\nOutput: %s", err, output) } From c3d2743464883dd1038e4389decf2f6de2dde936 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 19:50:53 -0800 Subject: [PATCH 30/64] Remove test changes. 
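Note: the inline retry added above (container not found, `signal: killed`, `context deadline exceeded`) follows a generic retry-on-transient-error pattern; a sketch of how it could be factored out in the helpers package (the function name is hypothetical, not part of the patch):

```go
package helpers

import "time"

// retryOnTransient generalizes the inline loop: call fn up to attempts times,
// sleeping between tries, but only while the error is classified as transient.
func retryOnTransient(attempts int, delay time.Duration, isTransient func(error) bool, fn func() error) error {
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if !isTransient(err) || i == attempts {
			break
		}
		time.Sleep(delay)
	}
	return err
}
```

`GetPodDelegatedIP` would then supply a predicate matching the three error strings it already checks for.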
--- test/integration/swiftv2/longRunningCluster/datapath.go | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 7d03fae46e..a892212dd2 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -31,7 +31,6 @@ func applyTemplate(templatePath string, data interface{}, kubeconfig string) err return fmt.Errorf("kubectl apply failed: %s\n%s", err, string(out)) } - fmt.Println(string(out)) return nil } @@ -624,7 +623,7 @@ func RunConnectivityTest(test ConnectivityTest, rg, buildId string) error { curlCmd := fmt.Sprintf("curl --http0.9 --interface eth1 -m 3 http://%s:8080/", destIP) output, err := helpers.ExecInPod(sourceKubeconfig, test.SourceNamespace, test.SourcePod, curlCmd) - + // Check if we received data even if curl timed out (exit code 28) // Netcat closes the connection without proper HTTP close, causing curl to timeout // But if we got the expected response, the connectivity test is successful From 73471f7a85db7bdd73d3be3b1dce3b1952122018 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 20:17:46 -0800 Subject: [PATCH 31/64] update datapath.go --- test/integration/swiftv2/longRunningCluster/datapath.go | 1 - 1 file changed, 1 deletion(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index a892212dd2..5ec1ccefc1 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -774,7 +774,6 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) // -O- outputs to stdout, -q is quiet mode, --timeout sets timeout wgetCmd := fmt.Sprintf("wget -O- --timeout=30 --tries=1 '%s' 2>&1", blobURL) - output, err := ExecInPodWithTimeout(kubeconfig, test.SourceNS, test.SourcePodName, wgetCmd, 45*time.Second) output, err := ExecInPodWithTimeout(kubeconfig, test.SourceNS, test.SourcePodName, wgetCmd, 45*time.Second) if err != nil { // Check for HTTP errors in wget output From e8d62bf36aba7958b7e8bad6e9781825f4bb7a64 Mon Sep 17 00:00:00 2001 From: sivakami Date: Tue, 9 Dec 2025 07:56:32 -0800 Subject: [PATCH 32/64] Fix vnet names. --- .../swiftv2/longRunningCluster/datapath_scale_test.go | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go index e445e8eec9..47d9a03faa 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go @@ -46,11 +46,9 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { subnet string podCount int }{ - {cluster: "aks-1", vnetName: "cx_vnet_a1", subnet: "s1", podCount: 8}, - {cluster: "aks-2", vnetName: "cx_vnet_a2", subnet: "s1", podCount: 7}, - } - - // Initialize test scenarios with cache + {cluster: "aks-1", vnetName: "cx_vnet_v1", subnet: "s1", podCount: 8}, + {cluster: "aks-2", vnetName: "cx_vnet_v2", subnet: "s1", podCount: 7}, + } // Initialize test scenarios with cache testScenarios := TestScenarios{ ResourceGroup: rg, BuildID: buildId, From 93fbd68d16da51c93307a802dc9bcda6a202fcd5 Mon Sep 17 00:00:00 2001 From: sivakami Date: Tue, 9 Dec 2025 08:14:09 -0800 Subject: [PATCH 33/64] Run scale tests after private endpoint tests. 
--- .../template/long-running-pipeline-template.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml index 332ab93e46..69bcadf3c0 100644 --- a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml +++ b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml @@ -444,6 +444,7 @@ stages: - CreateTestResources - ConnectivityTests - PrivateEndpointTests + - ScaleTest # Always run cleanup, even if previous jobs failed condition: always() timeoutInMinutes: 60 From 80a86c097cdb75c288337a149c40c2ed84c7c60c Mon Sep 17 00:00:00 2001 From: sivakami Date: Tue, 9 Dec 2025 16:51:45 -0800 Subject: [PATCH 34/64] start with small bursts for scale tests. --- .../swiftv2/longRunningCluster/datapath_scale_test.go | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go index 47d9a03faa..0ecd86972f 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go @@ -39,15 +39,15 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { // For this test: Creating 15 pods across aks-1 and aks-2 // Device plugin and Kubernetes scheduler automatically place pods on nodes with available NICs - // Define scenarios for both clusters - 8 pods on aks-1, 7 pods on aks-2 + // Define scenarios for both clusters - 3 pods on aks-1, 2 pods on aks-2 (5 total for testing) scenarios := []struct { cluster string vnetName string subnet string podCount int }{ - {cluster: "aks-1", vnetName: "cx_vnet_v1", subnet: "s1", podCount: 8}, - {cluster: "aks-2", vnetName: "cx_vnet_v2", subnet: "s1", podCount: 7}, + {cluster: "aks-1", vnetName: "cx_vnet_v1", subnet: "s1", podCount: 3}, + {cluster: "aks-2", vnetName: "cx_vnet_v2", subnet: "s1", podCount: 2}, } // Initialize test scenarios with cache testScenarios := TestScenarios{ ResourceGroup: rg, From b2c2faebda524fe0c1192bb0111324ad8fad880a Mon Sep 17 00:00:00 2001 From: sivakami Date: Tue, 9 Dec 2025 20:53:08 -0800 Subject: [PATCH 35/64] Set reservation size base test scenario. 
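Adds a Reservations field to TestResources so each scenario can size its PodNetworkInstance instead of always reserving 2 IPs; CreatePodNetworkInstanceResource still falls back to 2 when the field is left at zero. An illustrative sketch of a caller sizing the PNI to its planned pod count (createScalePNI is a hypothetical wrapper; the template path matches the manifests used by the tests):

    // Sketch: create a PNI sized to the planned pod count for a scale scenario.
    // Relies on the package's TestResources type and
    // CreatePodNetworkInstanceResource helper; values are illustrative.
    func createScalePNI(kubeconfig, pnName, pniName string, podCount int) error {
        res := TestResources{
            Kubeconfig:   kubeconfig,
            PNName:       pnName,
            PNIName:      pniName,
            PNITemplate:  "../../manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml",
            Reservations: podCount, // leaving this at 0 falls back to the default of 2
        }
        return CreatePodNetworkInstanceResource(res)
    }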
--- test/integration/swiftv2/longRunningCluster/datapath.go | 8 +++++++- .../swiftv2/longRunningCluster/datapath_scale_test.go | 1 + 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 5ec1ccefc1..c1884f4d87 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -98,6 +98,7 @@ type TestResources struct { PNITemplate string PodTemplate string PodImage string + Reservations int // Number of IP reservations for PodNetworkInstance } // PodScenario defines a single pod creation scenario @@ -211,12 +212,17 @@ func CreateNamespaceResource(kubeconfig, namespace string) error { // CreatePodNetworkInstanceResource creates a PodNetworkInstance func CreatePodNetworkInstanceResource(resources TestResources) error { + // Use provided reservations count, default to 2 if not specified + reservations := resources.Reservations + if reservations == 0 { + reservations = 2 + } err := CreatePodNetworkInstance(resources.Kubeconfig, PNIData{ PNIName: resources.PNIName, PNName: resources.PNName, Namespace: resources.PNName, Type: "explicit", - Reservations: 2, + Reservations: reservations, }, resources.PNITemplate) if err != nil { return fmt.Errorf("failed to create PodNetworkInstance: %w", err) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go index 0ecd86972f..3768b460c9 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go @@ -88,6 +88,7 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { PNITemplate: "../../manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml", PodTemplate: "../../manifests/swiftv2/long-running-cluster/pod-with-device-plugin.yaml", PodImage: testScenarios.PodImage, + Reservations: scenario.podCount, // Set reservations to match pod count } // Step 1: Create PodNetwork From f7dc9496b2673e8edbed61b82ea960cfef032dc1 Mon Sep 17 00:00:00 2001 From: sivakami Date: Tue, 9 Dec 2025 21:13:06 -0800 Subject: [PATCH 36/64] Delete pod resources created for scale tests. 
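The delete pass has to rebuild the exact pn-scale-*/pni-scale-* names the scale test derived from the build ID, VNet, and subnet. A sketch of that derivation as a standalone helper (scaleNames is hypothetical; the format strings and underscore replacement match the test code and keep the names RFC 1123 compliant):

    package main

    import (
        "fmt"
        "strings"
    )

    // scaleNames mirrors the naming used when the scale resources were created,
    // e.g. vnet "cx_vnet_a1", subnet "s1", build ID "rg123" ->
    // pn-scale-rg123-a1-s1 and pni-scale-rg123-a1-s1.
    func scaleNames(buildID, vnetName, subnet string) (pnName, pniName string) {
        short := strings.ReplaceAll(strings.TrimPrefix(vnetName, "cx_vnet_"), "_", "-")
        sub := strings.ReplaceAll(subnet, "_", "-")
        pnName = fmt.Sprintf("pn-scale-%s-%s-%s", buildID, short, sub)
        pniName = fmt.Sprintf("pni-scale-%s-%s-%s", buildID, short, sub)
        return pnName, pniName
    }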
--- .../datapath_delete_test.go | 58 ++++++++++++++++++- 1 file changed, 57 insertions(+), 1 deletion(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go index af2cbd2081..128eabebbd 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go @@ -5,8 +5,10 @@ package longRunningCluster import ( "fmt" "os" + "strings" "testing" + "github.com/Azure/azure-container-networking/test/integration/swiftv2/helpers" "github.com/onsi/ginkgo/v2" "github.com/onsi/gomega" ) @@ -110,6 +112,60 @@ var _ = ginkgo.Describe("Datapath Delete Tests", func() { err := DeleteAllScenarios(testScenarios) gomega.Expect(err).To(gomega.BeNil(), "Failed to delete test scenarios") - ginkgo.By("Successfully deleted all test scenarios") + // Delete scale test resources + ginkgo.By("Deleting scale test resources") + scaleScenarios := []struct { + cluster string + vnetName string + subnet string + podCount int + }{ + {cluster: "aks-1", vnetName: "cx_vnet_v1", subnet: "s1", podCount: 3}, + {cluster: "aks-2", vnetName: "cx_vnet_v2", subnet: "s1", podCount: 2}, + } + + podIndex := 0 + for _, scenario := range scaleScenarios { + kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.cluster) + vnetShort := strings.TrimPrefix(scenario.vnetName, "cx_vnet_") + vnetShort = strings.ReplaceAll(vnetShort, "_", "-") + subnetNameSafe := strings.ReplaceAll(scenario.subnet, "_", "-") + pnName := fmt.Sprintf("pn-scale-%s-%s-%s", buildId, vnetShort, subnetNameSafe) + pniName := fmt.Sprintf("pni-scale-%s-%s-%s", buildId, vnetShort, subnetNameSafe) + + // Delete pods + for j := 0; j < scenario.podCount; j++ { + podName := fmt.Sprintf("scale-pod-%d", podIndex) + ginkgo.By(fmt.Sprintf("Deleting scale test pod: %s from cluster %s", podName, scenario.cluster)) + err := helpers.DeletePod(kubeconfig, pnName, podName) + if err != nil { + fmt.Printf("Warning: Failed to delete scale pod %s: %v\n", podName, err) + } + podIndex++ + } + + // Delete PodNetworkInstance + ginkgo.By(fmt.Sprintf("Deleting scale test PodNetworkInstance: %s from cluster %s", pniName, scenario.cluster)) + err = helpers.DeletePodNetworkInstance(kubeconfig, pnName, pniName) + if err != nil { + fmt.Printf("Warning: Failed to delete scale test PNI %s: %v\n", pniName, err) + } + + // Delete PodNetwork + ginkgo.By(fmt.Sprintf("Deleting scale test PodNetwork: %s from cluster %s", pnName, scenario.cluster)) + err = helpers.DeletePodNetwork(kubeconfig, pnName) + if err != nil { + fmt.Printf("Warning: Failed to delete scale test PodNetwork %s: %v\n", pnName, err) + } + + // Delete namespace + ginkgo.By(fmt.Sprintf("Deleting scale test namespace: %s from cluster %s", pnName, scenario.cluster)) + err = helpers.DeleteNamespace(kubeconfig, pnName) + if err != nil { + fmt.Printf("Warning: Failed to delete scale test namespace %s: %v\n", pnName, err) + } + } + + ginkgo.By("Successfully deleted all test scenarios and scale test resources") }) }) From f532afa4d5f7de2c1a73265f18f7f9caa0423356 Mon Sep 17 00:00:00 2001 From: sivakami Date: Wed, 10 Dec 2025 08:39:38 -0800 Subject: [PATCH 37/64] test change. 
--- .../long-running-pipeline-template.yaml | 102 +++++++++--------- 1 file changed, 51 insertions(+), 51 deletions(-) diff --git a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml index 69bcadf3c0..090e32d283 100644 --- a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml +++ b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml @@ -438,57 +438,57 @@ stages: # ------------------------------------------------------------ # Job 5: Delete Test Resources # ------------------------------------------------------------ - - job: DeleteTestResources - displayName: "Delete PodNetwork, PNI, and Pods" - dependsOn: - - CreateTestResources - - ConnectivityTests - - PrivateEndpointTests - - ScaleTest - # Always run cleanup, even if previous jobs failed - condition: always() - timeoutInMinutes: 60 - pool: - vmImage: ubuntu-latest - steps: - - checkout: self - - task: GoTool@0 - displayName: "Use Go 1.22.5" - inputs: - version: "1.22.5" - - task: AzureCLI@2 - displayName: "Delete Test Resources" - inputs: - azureSubscription: ${{ parameters.serviceConnection }} - scriptType: bash - scriptLocation: inlineScript - inlineScript: | - echo "==> Installing Ginkgo CLI" - go install github.com/onsi/ginkgo/v2/ginkgo@latest - echo "==> Adding Go bin to PATH" - export PATH=$PATH:$(go env GOPATH)/bin - echo "==> Downloading Go dependencies" - go mod download + # - job: DeleteTestResources + # displayName: "Delete PodNetwork, PNI, and Pods" + # dependsOn: + # - CreateTestResources + # - ConnectivityTests + # - PrivateEndpointTests + # - ScaleTest + # # Always run cleanup, even if previous jobs failed + # condition: always() + # timeoutInMinutes: 60 + # pool: + # vmImage: ubuntu-latest + # steps: + # - checkout: self + # - task: GoTool@0 + # displayName: "Use Go 1.22.5" + # inputs: + # version: "1.22.5" + # - task: AzureCLI@2 + # displayName: "Delete Test Resources" + # inputs: + # azureSubscription: ${{ parameters.serviceConnection }} + # scriptType: bash + # scriptLocation: inlineScript + # inlineScript: | + # echo "==> Installing Ginkgo CLI" + # go install github.com/onsi/ginkgo/v2/ginkgo@latest + # echo "==> Adding Go bin to PATH" + # export PATH=$PATH:$(go env GOPATH)/bin + # echo "==> Downloading Go dependencies" + # go mod download - echo "==> Setting up kubeconfig for cluster aks-1" - az aks get-credentials \ - --resource-group $(rgName) \ - --name aks-1 \ - --file /tmp/aks-1.kubeconfig \ - --overwrite-existing \ - --admin - echo "==> Setting up kubeconfig for cluster aks-2" - az aks get-credentials \ - --resource-group $(rgName) \ - --name aks-2 \ - --file /tmp/aks-2.kubeconfig \ - --overwrite-existing \ - --admin + # echo "==> Setting up kubeconfig for cluster aks-1" + # az aks get-credentials \ + # --resource-group $(rgName) \ + # --name aks-1 \ + # --file /tmp/aks-1.kubeconfig \ + # --overwrite-existing \ + # --admin + # echo "==> Setting up kubeconfig for cluster aks-2" + # az aks get-credentials \ + # --resource-group $(rgName) \ + # --name aks-2 \ + # --file /tmp/aks-2.kubeconfig \ + # --overwrite-existing \ + # --admin - echo "==> Deleting test resources (8 scenarios)" - export RG="$(rgName)" - export BUILD_ID="$(rgName)" - export WORKLOAD_TYPE="swiftv2-linux" - cd ./test/integration/swiftv2/longRunningCluster - ginkgo -v -trace --timeout=1h --tags=delete_test + # echo "==> Deleting test resources (8 scenarios)" + # export RG="$(rgName)" + # export 
BUILD_ID="$(rgName)" + # export WORKLOAD_TYPE="swiftv2-linux" + # cd ./test/integration/swiftv2/longRunningCluster + # ginkgo -v -trace --timeout=1h --tags=delete_test \ No newline at end of file From 434cd067d0f08bf4f08907544f039ebc03d18df0 Mon Sep 17 00:00:00 2001 From: sivakami Date: Wed, 10 Dec 2025 09:48:51 -0800 Subject: [PATCH 38/64] Reuse pod network for creating pods for scale tests. --- .../long-running-pipeline-template.yaml | 102 +++++++++--------- .../swiftv2/longRunningCluster/datapath.go | 19 +++- .../datapath_delete_test.go | 63 ++--------- .../longRunningCluster/datapath_scale_test.go | 81 ++++++++------ 4 files changed, 120 insertions(+), 145 deletions(-) diff --git a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml index 090e32d283..69bcadf3c0 100644 --- a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml +++ b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml @@ -438,57 +438,57 @@ stages: # ------------------------------------------------------------ # Job 5: Delete Test Resources # ------------------------------------------------------------ - # - job: DeleteTestResources - # displayName: "Delete PodNetwork, PNI, and Pods" - # dependsOn: - # - CreateTestResources - # - ConnectivityTests - # - PrivateEndpointTests - # - ScaleTest - # # Always run cleanup, even if previous jobs failed - # condition: always() - # timeoutInMinutes: 60 - # pool: - # vmImage: ubuntu-latest - # steps: - # - checkout: self - # - task: GoTool@0 - # displayName: "Use Go 1.22.5" - # inputs: - # version: "1.22.5" - # - task: AzureCLI@2 - # displayName: "Delete Test Resources" - # inputs: - # azureSubscription: ${{ parameters.serviceConnection }} - # scriptType: bash - # scriptLocation: inlineScript - # inlineScript: | - # echo "==> Installing Ginkgo CLI" - # go install github.com/onsi/ginkgo/v2/ginkgo@latest - # echo "==> Adding Go bin to PATH" - # export PATH=$PATH:$(go env GOPATH)/bin - # echo "==> Downloading Go dependencies" - # go mod download + - job: DeleteTestResources + displayName: "Delete PodNetwork, PNI, and Pods" + dependsOn: + - CreateTestResources + - ConnectivityTests + - PrivateEndpointTests + - ScaleTest + # Always run cleanup, even if previous jobs failed + condition: always() + timeoutInMinutes: 60 + pool: + vmImage: ubuntu-latest + steps: + - checkout: self + - task: GoTool@0 + displayName: "Use Go 1.22.5" + inputs: + version: "1.22.5" + - task: AzureCLI@2 + displayName: "Delete Test Resources" + inputs: + azureSubscription: ${{ parameters.serviceConnection }} + scriptType: bash + scriptLocation: inlineScript + inlineScript: | + echo "==> Installing Ginkgo CLI" + go install github.com/onsi/ginkgo/v2/ginkgo@latest + echo "==> Adding Go bin to PATH" + export PATH=$PATH:$(go env GOPATH)/bin + echo "==> Downloading Go dependencies" + go mod download - # echo "==> Setting up kubeconfig for cluster aks-1" - # az aks get-credentials \ - # --resource-group $(rgName) \ - # --name aks-1 \ - # --file /tmp/aks-1.kubeconfig \ - # --overwrite-existing \ - # --admin - # echo "==> Setting up kubeconfig for cluster aks-2" - # az aks get-credentials \ - # --resource-group $(rgName) \ - # --name aks-2 \ - # --file /tmp/aks-2.kubeconfig \ - # --overwrite-existing \ - # --admin + echo "==> Setting up kubeconfig for cluster aks-1" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-1 \ + --file /tmp/aks-1.kubeconfig \ + 
--overwrite-existing \ + --admin + echo "==> Setting up kubeconfig for cluster aks-2" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-2 \ + --file /tmp/aks-2.kubeconfig \ + --overwrite-existing \ + --admin - # echo "==> Deleting test resources (8 scenarios)" - # export RG="$(rgName)" - # export BUILD_ID="$(rgName)" - # export WORKLOAD_TYPE="swiftv2-linux" - # cd ./test/integration/swiftv2/longRunningCluster - # ginkgo -v -trace --timeout=1h --tags=delete_test + echo "==> Deleting test resources (8 scenarios)" + export RG="$(rgName)" + export BUILD_ID="$(rgName)" + export WORKLOAD_TYPE="swiftv2-linux" + cd ./test/integration/swiftv2/longRunningCluster + ginkgo -v -trace --timeout=1h --tags=delete_test \ No newline at end of file diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index c1884f4d87..47fecdba0a 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -90,6 +90,7 @@ type TestResources struct { Kubeconfig string PNName string PNIName string + Namespace string // Namespace for PNI and pods (can be different from PNName for scale tests) VnetGUID string SubnetGUID string SubnetARMID string @@ -217,10 +218,15 @@ func CreatePodNetworkInstanceResource(resources TestResources) error { if reservations == 0 { reservations = 2 } + // Use Namespace field if set, otherwise default to PNName + namespace := resources.Namespace + if namespace == "" { + namespace = resources.PNName + } err := CreatePodNetworkInstance(resources.Kubeconfig, PNIData{ PNIName: resources.PNIName, - PNName: resources.PNName, - Namespace: resources.PNName, + PNName: resources.PNName, // The PodNetwork to reference + Namespace: namespace, // Where to create the PNI resource Type: "explicit", Reservations: reservations, }, resources.PNITemplate) @@ -232,13 +238,18 @@ func CreatePodNetworkInstanceResource(resources TestResources) error { // CreatePodResource creates a single pod on a specified node and waits for it to be running func CreatePodResource(resources TestResources, podName, nodeName string) error { + // Use Namespace field if set, otherwise default to PNName + namespace := resources.Namespace + if namespace == "" { + namespace = resources.PNName + } err := CreatePod(resources.Kubeconfig, PodData{ PodName: podName, NodeName: nodeName, OS: "linux", PNName: resources.PNName, PNIName: resources.PNIName, - Namespace: resources.PNName, + Namespace: namespace, Image: resources.PodImage, }, resources.PodTemplate) if err != nil { @@ -246,7 +257,7 @@ func CreatePodResource(resources TestResources, podName, nodeName string) error } // Wait for pod to be running with retries - err = helpers.WaitForPodRunning(resources.Kubeconfig, resources.PNName, podName, 10, 30) + err = helpers.WaitForPodRunning(resources.Kubeconfig, namespace, podName, 10, 30) if err != nil { return fmt.Errorf("pod %s did not reach running state: %w", podName, err) } diff --git a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go index 128eabebbd..853540be04 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go @@ -107,65 +107,14 @@ var _ = ginkgo.Describe("Datapath Delete Tests", func() { UsedNodes: make(map[string]bool), } - // Delete all scenario resources - ginkgo.By("Deleting all test 
scenarios") + // Note: Scale test now cleans up after itself (deletes PNI and namespace, keeps shared PodNetwork) + // This delete test only needs to clean up connectivity test resources + + // Delete all connectivity test scenario resources + ginkgo.By("Deleting all connectivity test scenarios") err := DeleteAllScenarios(testScenarios) gomega.Expect(err).To(gomega.BeNil(), "Failed to delete test scenarios") - // Delete scale test resources - ginkgo.By("Deleting scale test resources") - scaleScenarios := []struct { - cluster string - vnetName string - subnet string - podCount int - }{ - {cluster: "aks-1", vnetName: "cx_vnet_v1", subnet: "s1", podCount: 3}, - {cluster: "aks-2", vnetName: "cx_vnet_v2", subnet: "s1", podCount: 2}, - } - - podIndex := 0 - for _, scenario := range scaleScenarios { - kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.cluster) - vnetShort := strings.TrimPrefix(scenario.vnetName, "cx_vnet_") - vnetShort = strings.ReplaceAll(vnetShort, "_", "-") - subnetNameSafe := strings.ReplaceAll(scenario.subnet, "_", "-") - pnName := fmt.Sprintf("pn-scale-%s-%s-%s", buildId, vnetShort, subnetNameSafe) - pniName := fmt.Sprintf("pni-scale-%s-%s-%s", buildId, vnetShort, subnetNameSafe) - - // Delete pods - for j := 0; j < scenario.podCount; j++ { - podName := fmt.Sprintf("scale-pod-%d", podIndex) - ginkgo.By(fmt.Sprintf("Deleting scale test pod: %s from cluster %s", podName, scenario.cluster)) - err := helpers.DeletePod(kubeconfig, pnName, podName) - if err != nil { - fmt.Printf("Warning: Failed to delete scale pod %s: %v\n", podName, err) - } - podIndex++ - } - - // Delete PodNetworkInstance - ginkgo.By(fmt.Sprintf("Deleting scale test PodNetworkInstance: %s from cluster %s", pniName, scenario.cluster)) - err = helpers.DeletePodNetworkInstance(kubeconfig, pnName, pniName) - if err != nil { - fmt.Printf("Warning: Failed to delete scale test PNI %s: %v\n", pniName, err) - } - - // Delete PodNetwork - ginkgo.By(fmt.Sprintf("Deleting scale test PodNetwork: %s from cluster %s", pnName, scenario.cluster)) - err = helpers.DeletePodNetwork(kubeconfig, pnName) - if err != nil { - fmt.Printf("Warning: Failed to delete scale test PodNetwork %s: %v\n", pnName, err) - } - - // Delete namespace - ginkgo.By(fmt.Sprintf("Deleting scale test namespace: %s from cluster %s", pnName, scenario.cluster)) - err = helpers.DeleteNamespace(kubeconfig, pnName) - if err != nil { - fmt.Printf("Warning: Failed to delete scale test namespace %s: %v\n", pnName, err) - } - } - - ginkgo.By("Successfully deleted all test scenarios and scale test resources") + ginkgo.By("Successfully deleted all connectivity test scenarios") }) }) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go index 3768b460c9..380e3b5c9f 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go @@ -40,6 +40,7 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { // Device plugin and Kubernetes scheduler automatically place pods on nodes with available NICs // Define scenarios for both clusters - 3 pods on aks-1, 2 pods on aks-2 (5 total for testing) + // IMPORTANT: Reuse existing PodNetworks from connectivity tests to avoid "duplicate podnetwork with same network id" error scenarios := []struct { cluster string vnetName string @@ -69,17 +70,21 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { netInfo, err := 
GetOrFetchVnetSubnetInfo(testScenarios.ResourceGroup, scenario.vnetName, scenario.subnet, testScenarios.VnetSubnetCache) gomega.Expect(err).To(gomega.BeNil(), fmt.Sprintf("Failed to get network info for %s/%s", scenario.vnetName, scenario.subnet)) - // Create unique names + // REUSE existing PodNetwork name from connectivity tests (don't create duplicate) vnetShort := strings.TrimPrefix(scenario.vnetName, "cx_vnet_") vnetShort = strings.ReplaceAll(vnetShort, "_", "-") subnetNameSafe := strings.ReplaceAll(scenario.subnet, "_", "-") - pnName := fmt.Sprintf("pn-scale-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) - pniName := fmt.Sprintf("pni-scale-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) + pnName := fmt.Sprintf("pn-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) // Reuse connectivity test PN + pniName := fmt.Sprintf("pni-scale-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) // New PNI for scale test + + // Create scale-specific namespace + scaleNamespace := fmt.Sprintf("%s-scale", pnName) resources := TestResources{ Kubeconfig: kubeconfig, - PNName: pnName, - PNIName: pniName, + PNName: pnName, // References the shared PodNetwork + PNIName: pniName, // New PNI for scale test + Namespace: scaleNamespace, // Scale test specific namespace VnetGUID: netInfo.VnetGUID, SubnetGUID: netInfo.SubnetGUID, SubnetARMID: netInfo.SubnetARMID, @@ -88,21 +93,19 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { PNITemplate: "../../manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml", PodTemplate: "../../manifests/swiftv2/long-running-cluster/pod-with-device-plugin.yaml", PodImage: testScenarios.PodImage, - Reservations: scenario.podCount, // Set reservations to match pod count + Reservations: 10, // Reserve 10 IPs for scale test pods } - // Step 1: Create PodNetwork - ginkgo.By(fmt.Sprintf("Creating PodNetwork: %s in cluster %s", pnName, scenario.cluster)) - err = CreatePodNetworkResource(resources) - gomega.Expect(err).To(gomega.BeNil(), "Failed to create PodNetwork") + // Step 1: SKIP creating PodNetwork (reuse existing one from connectivity tests) + ginkgo.By(fmt.Sprintf("Reusing existing PodNetwork: %s in cluster %s", pnName, scenario.cluster)) - // Step 2: Create namespace - ginkgo.By(fmt.Sprintf("Creating namespace: %s in cluster %s", pnName, scenario.cluster)) - err = CreateNamespaceResource(resources.Kubeconfig, resources.PNName) + // Step 2: Create namespace for scale test + ginkgo.By(fmt.Sprintf("Creating scale test namespace: %s in cluster %s", scaleNamespace, scenario.cluster)) + err = CreateNamespaceResource(resources.Kubeconfig, scaleNamespace) gomega.Expect(err).To(gomega.BeNil(), "Failed to create namespace") - // Step 3: Create PodNetworkInstance - ginkgo.By(fmt.Sprintf("Creating PodNetworkInstance: %s in cluster %s", pniName, scenario.cluster)) + // Step 3: Create NEW PodNetworkInstance for scale test + ginkgo.By(fmt.Sprintf("Creating PodNetworkInstance: %s (references PN: %s) in namespace %s in cluster %s", pniName, pnName, scaleNamespace, scenario.cluster)) err = CreatePodNetworkInstanceResource(resources) gomega.Expect(err).To(gomega.BeNil(), "Failed to create PodNetworkInstance") @@ -128,7 +131,11 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { defer ginkgo.GinkgoRecover() podName := fmt.Sprintf("scale-pod-%d", idx) - ginkgo.By(fmt.Sprintf("Creating pod %s in cluster %s (auto-scheduled)", podName, cluster)) + namespace := resources.Namespace + if namespace == "" { + namespace = resources.PNName + 
} + ginkgo.By(fmt.Sprintf("Creating pod %s in namespace %s in cluster %s (auto-scheduled)", podName, namespace, cluster)) // Create pod without specifying node - let device plugin and scheduler decide err := CreatePod(resources.Kubeconfig, PodData{ @@ -137,7 +144,7 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { OS: "linux", PNName: resources.PNName, PNIName: resources.PNIName, - Namespace: resources.PNName, + Namespace: namespace, Image: resources.PodImage, }, resources.PodTemplate) if err != nil { @@ -147,7 +154,7 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { // Wait for pod to be scheduled (node assignment) before considering it created // This prevents CNS errors about missing node names - err = helpers.WaitForPodScheduled(resources.Kubeconfig, resources.PNName, podName, 10, 6) + err = helpers.WaitForPodScheduled(resources.Kubeconfig, namespace, podName, 10, 6) if err != nil { errors <- fmt.Errorf("pod %s in cluster %s was not scheduled: %w", podName, cluster, err) } @@ -178,9 +185,13 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { ginkgo.By("Verifying all pods are in Running state") podIndex = 0 for i, scenario := range scenarios { + namespace := allResources[i].Namespace + if namespace == "" { + namespace = allResources[i].PNName + } for j := 0; j < scenario.podCount; j++ { podName := fmt.Sprintf("scale-pod-%d", podIndex) - err := helpers.WaitForPodRunning(allResources[i].Kubeconfig, allResources[i].PNName, podName, 5, 10) + err := helpers.WaitForPodRunning(allResources[i].Kubeconfig, namespace, podName, 5, 10) gomega.Expect(err).To(gomega.BeNil(), fmt.Sprintf("Pod %s did not reach running state in cluster %s", podName, scenario.cluster)) podIndex++ } @@ -194,33 +205,37 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { for i, scenario := range scenarios { resources := allResources[i] kubeconfig := resources.Kubeconfig + namespace := resources.Namespace + if namespace == "" { + namespace = resources.PNName + } for j := 0; j < scenario.podCount; j++ { podName := fmt.Sprintf("scale-pod-%d", podIndex) - ginkgo.By(fmt.Sprintf("Deleting pod: %s from cluster %s", podName, scenario.cluster)) - err := helpers.DeletePod(kubeconfig, resources.PNName, podName) + ginkgo.By(fmt.Sprintf("Deleting pod: %s from namespace %s in cluster %s", podName, namespace, scenario.cluster)) + err := helpers.DeletePod(kubeconfig, namespace, podName) if err != nil { fmt.Printf("Warning: Failed to delete pod %s: %v\n", podName, err) } podIndex++ } - // Delete namespace (this will also delete PNI) - ginkgo.By(fmt.Sprintf("Deleting namespace: %s from cluster %s", resources.PNName, scenario.cluster)) - err := helpers.DeleteNamespace(kubeconfig, resources.PNName) - gomega.Expect(err).To(gomega.BeNil(), "Failed to delete namespace") - - // Delete PodNetworkInstance - ginkgo.By(fmt.Sprintf("Deleting PodNetworkInstance: %s from cluster %s", resources.PNIName, scenario.cluster)) - err = helpers.DeletePodNetworkInstance(kubeconfig, resources.PNName, resources.PNIName) + // Delete PodNetworkInstance first + ginkgo.By(fmt.Sprintf("Deleting PodNetworkInstance: %s from namespace %s in cluster %s", resources.PNIName, namespace, scenario.cluster)) + err := helpers.DeletePodNetworkInstance(kubeconfig, namespace, resources.PNIName) if err != nil { fmt.Printf("Warning: Failed to delete PNI %s: %v\n", resources.PNIName, err) } - // Delete PodNetwork - ginkgo.By(fmt.Sprintf("Deleting PodNetwork: %s from cluster %s", resources.PNName, scenario.cluster)) - err = 
helpers.DeletePodNetwork(kubeconfig, resources.PNName) - gomega.Expect(err).To(gomega.BeNil(), "Failed to delete PodNetwork") + // Delete namespace (scale test specific) + ginkgo.By(fmt.Sprintf("Deleting scale test namespace: %s from cluster %s", namespace, scenario.cluster)) + err = helpers.DeleteNamespace(kubeconfig, namespace) + if err != nil { + fmt.Printf("Warning: Failed to delete namespace %s: %v\n", namespace, err) + } + + // DO NOT delete PodNetwork - it's shared with connectivity tests + ginkgo.By(fmt.Sprintf("Keeping PodNetwork: %s (shared with connectivity tests) in cluster %s", resources.PNName, scenario.cluster)) } ginkgo.By("Scale test cleanup completed") From 8b71f93a76315426b028a4e134d0a980c484de32 Mon Sep 17 00:00:00 2001 From: sivakami Date: Wed, 10 Dec 2025 10:55:02 -0800 Subject: [PATCH 39/64] scale test update --- .../swiftv2/longRunningCluster/datapath.go | 22 ++----- .../longRunningCluster/datapath_scale_test.go | 66 ++++++------------- 2 files changed, 25 insertions(+), 63 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 47fecdba0a..d65585f025 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -85,7 +85,7 @@ func CreatePod(kubeconfig string, data PodData, templatePath string) error { // High-level orchestration // ------------------------- -// TestResources holds all the configuration needed for creating test resources +// TestResources holds all the configuration needed for creating test pods type TestResources struct { Kubeconfig string PNName string @@ -99,7 +99,7 @@ type TestResources struct { PNITemplate string PodTemplate string PodImage string - Reservations int // Number of IP reservations for PodNetworkInstance + Reservations int // Number of IP reservations for PodNetworkInstance } // PodScenario defines a single pod creation scenario @@ -225,8 +225,8 @@ func CreatePodNetworkInstanceResource(resources TestResources) error { } err := CreatePodNetworkInstance(resources.Kubeconfig, PNIData{ PNIName: resources.PNIName, - PNName: resources.PNName, // The PodNetwork to reference - Namespace: namespace, // Where to create the PNI resource + PNName: resources.PNName, + Namespace: namespace, Type: "explicit", Reservations: reservations, }, resources.PNITemplate) @@ -640,10 +640,6 @@ func RunConnectivityTest(test ConnectivityTest, rg, buildId string) error { curlCmd := fmt.Sprintf("curl --http0.9 --interface eth1 -m 3 http://%s:8080/", destIP) output, err := helpers.ExecInPod(sourceKubeconfig, test.SourceNamespace, test.SourcePod, curlCmd) - - // Check if we received data even if curl timed out (exit code 28) - // Netcat closes the connection without proper HTTP close, causing curl to timeout - // But if we got the expected response, the connectivity test is successful if err != nil { if strings.Contains(err.Error(), "exit status 28") && strings.Contains(output, "TCP Connection Success") { // Timeout but we got the data - this is OK with netcat @@ -773,18 +769,8 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) return fmt.Errorf("failed to generate SAS token: %w", err) } - // Debug: Print SAS token info - fmt.Printf("SAS token length: %d\n", len(sasToken)) - if len(sasToken) > 60 { - fmt.Printf("SAS token preview: %s...\n", sasToken[:60]) - } else { - fmt.Printf("SAS token: %s\n", sasToken) - } - // Step 4: Download test blob using SAS token with verbose 
output fmt.Printf("==> Downloading test blob via private endpoint\n") - // Construct URL - ensure SAS token is properly formatted - // Note: SAS token should already be URL-encoded from Azure CLI blobURL := fmt.Sprintf("https://%s/test/hello.txt?%s", test.DestEndpoint, sasToken) // Use wget instead of curl - it handles special characters better diff --git a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go index 380e3b5c9f..71224f7154 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go @@ -31,12 +31,12 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) } - ginkgo.It("creates and deletes 15 pods in a burst using device plugin", ginkgo.NodeTimeout(0), func() { + ginkgo.It("creates and deletes 5 pods in a burst using device plugin", ginkgo.NodeTimeout(0), func() { // NOTE: Maximum pods per PodNetwork/PodNetworkInstance is limited by: // 1. Subnet IP address capacity // 2. Node capacity (typically 250 pods per node) // 3. Available NICs on nodes (device plugin resources) - // For this test: Creating 15 pods across aks-1 and aks-2 + // For this test: Creating 5 pods across aks-1 and aks-2 // Device plugin and Kubernetes scheduler automatically place pods on nodes with available NICs // Define scenarios for both clusters - 3 pods on aks-1, 2 pods on aks-2 (5 total for testing) @@ -48,7 +48,7 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { podCount int }{ {cluster: "aks-1", vnetName: "cx_vnet_v1", subnet: "s1", podCount: 3}, - {cluster: "aks-2", vnetName: "cx_vnet_v2", subnet: "s1", podCount: 2}, + {cluster: "aks-2", vnetName: "cx_vnet_v3", subnet: "s1", podCount: 2}, } // Initialize test scenarios with cache testScenarios := TestScenarios{ ResourceGroup: rg, @@ -74,17 +74,14 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { vnetShort := strings.TrimPrefix(scenario.vnetName, "cx_vnet_") vnetShort = strings.ReplaceAll(vnetShort, "_", "-") subnetNameSafe := strings.ReplaceAll(scenario.subnet, "_", "-") - pnName := fmt.Sprintf("pn-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) // Reuse connectivity test PN + pnName := fmt.Sprintf("pn-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) // Reuse connectivity test PN pniName := fmt.Sprintf("pni-scale-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) // New PNI for scale test - // Create scale-specific namespace - scaleNamespace := fmt.Sprintf("%s-scale", pnName) - resources := TestResources{ Kubeconfig: kubeconfig, - PNName: pnName, // References the shared PodNetwork - PNIName: pniName, // New PNI for scale test - Namespace: scaleNamespace, // Scale test specific namespace + PNName: pnName, // References the shared PodNetwork (also the namespace) + PNIName: pniName, // New PNI for scale test + Namespace: pnName, // Same as PN namespace VnetGUID: netInfo.VnetGUID, SubnetGUID: netInfo.SubnetGUID, SubnetARMID: netInfo.SubnetARMID, @@ -99,13 +96,10 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { // Step 1: SKIP creating PodNetwork (reuse existing one from connectivity tests) ginkgo.By(fmt.Sprintf("Reusing existing PodNetwork: %s in cluster %s", pnName, scenario.cluster)) - // Step 2: Create namespace for scale test - ginkgo.By(fmt.Sprintf("Creating scale test namespace: %s in cluster %s", 
scaleNamespace, scenario.cluster)) - err = CreateNamespaceResource(resources.Kubeconfig, scaleNamespace) - gomega.Expect(err).To(gomega.BeNil(), "Failed to create namespace") + // Step 2: PNI namespace already exists (same as PN namespace), no need to create - // Step 3: Create NEW PodNetworkInstance for scale test - ginkgo.By(fmt.Sprintf("Creating PodNetworkInstance: %s (references PN: %s) in namespace %s in cluster %s", pniName, pnName, scaleNamespace, scenario.cluster)) + // Step 3: Create NEW PodNetworkInstance for scale test in the PN namespace + ginkgo.By(fmt.Sprintf("Creating PodNetworkInstance: %s (references PN: %s) in namespace %s in cluster %s", pniName, pnName, pnName, scenario.cluster)) err = CreatePodNetworkInstanceResource(resources) gomega.Expect(err).To(gomega.BeNil(), "Failed to create PodNetworkInstance") @@ -131,11 +125,7 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { defer ginkgo.GinkgoRecover() podName := fmt.Sprintf("scale-pod-%d", idx) - namespace := resources.Namespace - if namespace == "" { - namespace = resources.PNName - } - ginkgo.By(fmt.Sprintf("Creating pod %s in namespace %s in cluster %s (auto-scheduled)", podName, namespace, cluster)) + ginkgo.By(fmt.Sprintf("Creating pod %s in namespace %s in cluster %s (auto-scheduled)", podName, resources.PNName, cluster)) // Create pod without specifying node - let device plugin and scheduler decide err := CreatePod(resources.Kubeconfig, PodData{ @@ -144,7 +134,7 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { OS: "linux", PNName: resources.PNName, PNIName: resources.PNIName, - Namespace: namespace, + Namespace: resources.PNName, Image: resources.PodImage, }, resources.PodTemplate) if err != nil { @@ -154,7 +144,7 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { // Wait for pod to be scheduled (node assignment) before considering it created // This prevents CNS errors about missing node names - err = helpers.WaitForPodScheduled(resources.Kubeconfig, namespace, podName, 10, 6) + err = helpers.WaitForPodScheduled(resources.Kubeconfig, resources.PNName, podName, 10, 6) if err != nil { errors <- fmt.Errorf("pod %s in cluster %s was not scheduled: %w", podName, cluster, err) } @@ -185,13 +175,9 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { ginkgo.By("Verifying all pods are in Running state") podIndex = 0 for i, scenario := range scenarios { - namespace := allResources[i].Namespace - if namespace == "" { - namespace = allResources[i].PNName - } for j := 0; j < scenario.podCount; j++ { podName := fmt.Sprintf("scale-pod-%d", podIndex) - err := helpers.WaitForPodRunning(allResources[i].Kubeconfig, namespace, podName, 5, 10) + err := helpers.WaitForPodRunning(allResources[i].Kubeconfig, allResources[i].PNName, podName, 5, 10) gomega.Expect(err).To(gomega.BeNil(), fmt.Sprintf("Pod %s did not reach running state in cluster %s", podName, scenario.cluster)) podIndex++ } @@ -205,37 +191,27 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { for i, scenario := range scenarios { resources := allResources[i] kubeconfig := resources.Kubeconfig - namespace := resources.Namespace - if namespace == "" { - namespace = resources.PNName - } for j := 0; j < scenario.podCount; j++ { podName := fmt.Sprintf("scale-pod-%d", podIndex) - ginkgo.By(fmt.Sprintf("Deleting pod: %s from namespace %s in cluster %s", podName, namespace, scenario.cluster)) - err := helpers.DeletePod(kubeconfig, namespace, podName) + ginkgo.By(fmt.Sprintf("Deleting pod: %s from namespace %s in cluster %s", podName, 
resources.PNName, scenario.cluster)) + err := helpers.DeletePod(kubeconfig, resources.PNName, podName) if err != nil { fmt.Printf("Warning: Failed to delete pod %s: %v\n", podName, err) } podIndex++ } - // Delete PodNetworkInstance first - ginkgo.By(fmt.Sprintf("Deleting PodNetworkInstance: %s from namespace %s in cluster %s", resources.PNIName, namespace, scenario.cluster)) - err := helpers.DeletePodNetworkInstance(kubeconfig, namespace, resources.PNIName) + // Delete PodNetworkInstance from the PN namespace + ginkgo.By(fmt.Sprintf("Deleting PodNetworkInstance: %s from namespace %s in cluster %s", resources.PNIName, resources.PNName, scenario.cluster)) + err := helpers.DeletePodNetworkInstance(kubeconfig, resources.PNName, resources.PNIName) if err != nil { fmt.Printf("Warning: Failed to delete PNI %s: %v\n", resources.PNIName, err) } - // Delete namespace (scale test specific) - ginkgo.By(fmt.Sprintf("Deleting scale test namespace: %s from cluster %s", namespace, scenario.cluster)) - err = helpers.DeleteNamespace(kubeconfig, namespace) - if err != nil { - fmt.Printf("Warning: Failed to delete namespace %s: %v\n", namespace, err) - } - + // DO NOT delete namespace - it's shared with connectivity tests (same as PN namespace) // DO NOT delete PodNetwork - it's shared with connectivity tests - ginkgo.By(fmt.Sprintf("Keeping PodNetwork: %s (shared with connectivity tests) in cluster %s", resources.PNName, scenario.cluster)) + ginkgo.By(fmt.Sprintf("Keeping PodNetwork and namespace: %s (shared with connectivity tests) in cluster %s", resources.PNName, scenario.cluster)) } ginkgo.By("Scale test cleanup completed") From a00bb45bc10585b9adb3fdbf1adb65b2e5db12d1 Mon Sep 17 00:00:00 2001 From: sivakami Date: Wed, 10 Dec 2025 10:57:31 -0800 Subject: [PATCH 40/64] fix imports --- .../swiftv2/longRunningCluster/datapath_delete_test.go | 2 -- 1 file changed, 2 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go index 853540be04..f90d82a05c 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_delete_test.go @@ -5,10 +5,8 @@ package longRunningCluster import ( "fmt" "os" - "strings" "testing" - "github.com/Azure/azure-container-networking/test/integration/swiftv2/helpers" "github.com/onsi/ginkgo/v2" "github.com/onsi/gomega" ) From 5580154cf5f445378e567fd12ab91ce47051f388 Mon Sep 17 00:00:00 2001 From: sivakami Date: Wed, 10 Dec 2025 15:25:51 -0800 Subject: [PATCH 41/64] Specify pod count per node. 
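Threads a PODS_PER_NODE value (7) from create_aks.sh into the nplinux nodepool tags as aks-nic-secondary-count, which bounds how many secondary (multi-tenant) NICs, and therefore SwiftV2 pods, each node exposes to the device plugin. A hedged sketch of reading the tag back from Go in the same exec-based style as the az helpers; the exact query shape is an assumption:

    package main

    import (
        "fmt"
        "os/exec"
        "strings"
    )

    // nodepoolNICTag reads the aks-nic-secondary-count tag off a nodepool.
    // Command shape is assumed; adjust cluster/nodepool names to the setup.
    func nodepoolNICTag(rg, cluster, nodepool string) (string, error) {
        out, err := exec.Command("az", "aks", "nodepool", "show",
            "--resource-group", rg,
            "--cluster-name", cluster,
            "--name", nodepool,
            "--query", `tags."aks-nic-secondary-count"`,
            "--output", "tsv").CombinedOutput()
        if err != nil {
            return "", fmt.Errorf("nodepool show failed: %w\n%s", err, out)
        }
        return strings.TrimSpace(string(out)), nil
    }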
--- .pipelines/swiftv2-long-running/scripts/create_aks.sh | 2 ++ hack/aks/Makefile | 2 +- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/.pipelines/swiftv2-long-running/scripts/create_aks.sh b/.pipelines/swiftv2-long-running/scripts/create_aks.sh index 999a406900..e4c3d858c6 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_aks.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_aks.sh @@ -7,6 +7,7 @@ RG=$3 VM_SKU_DEFAULT=$4 VM_SKU_HIGHNIC=$5 +PODS_PER_NODE=7 CLUSTER_COUNT=2 CLUSTER_PREFIX="aks" @@ -94,6 +95,7 @@ for i in $(seq 1 "$CLUSTER_COUNT"); do AZCLI=az REGION=$LOCATION \ GROUP=$RG \ VM_SIZE=$VM_SKU_HIGHNIC \ + PODS_PER_NODE=$PODS_PER_NODE \ CLUSTER=$CLUSTER_NAME \ SUB=$SUBSCRIPTION_ID diff --git a/hack/aks/Makefile b/hack/aks/Makefile index 3b31345ec5..afe113fd18 100644 --- a/hack/aks/Makefile +++ b/hack/aks/Makefile @@ -447,7 +447,7 @@ linux-swiftv2-nodepool-up: ## Add linux node pool to swiftv2 cluster --os-type Linux \ --max-pods 250 \ --subscription $(SUB) \ - --tags fastpathenabled=true aks-nic-enable-multi-tenancy=true stampcreatorserviceinfo=true\ + --tags fastpathenabled=true aks-nic-enable-multi-tenancy=true stampcreatorserviceinfo=true aks-nic-secondary-count=${PODS_PER_NODE}\ --aks-custom-headers AKSHTTPCustomFeatures=Microsoft.ContainerService/NetworkingMultiTenancyPreview \ --pod-subnet-id /subscriptions/$(SUB)/resourceGroups/$(GROUP)/providers/Microsoft.Network/virtualNetworks/$(VNET)/subnets/podnet From 074d59372b83c01b57ae3545a99d017252df680e Mon Sep 17 00:00:00 2001 From: sivakami Date: Thu, 11 Dec 2025 00:26:13 -0800 Subject: [PATCH 42/64] scale test increase pod count. --- .pipelines/swiftv2-long-running/pipeline.yaml | 8 ++++---- .../longRunningCluster/datapath_scale_test.go | 12 ++++++------ 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/.pipelines/swiftv2-long-running/pipeline.yaml b/.pipelines/swiftv2-long-running/pipeline.yaml index 7abc3e1f79..f6d888587d 100644 --- a/.pipelines/swiftv2-long-running/pipeline.yaml +++ b/.pipelines/swiftv2-long-running/pipeline.yaml @@ -3,11 +3,11 @@ pr: none # Schedule: Run every 1 hour schedules: - - cron: "0 */3 * * *" # Every 3 hours at minute 0 - displayName: "Run tests every 3 hours" + - cron: "0 */2 * * *" # Every 2 hours at minute 0 + displayName: "Run tests every 2 hours" branches: include: - - sv2-long-running-pipeline-stage2 + - sv2-long-running-pipeline-scaletests always: true # Run even if there are no code changes parameters: @@ -24,7 +24,7 @@ parameters: - name: location displayName: "Deployment Region" type: string - default: "centraluseuap" + default: "eastus2" - name: runSetupStages displayName: "Create New Infrastructure Setup" diff --git a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go index 71224f7154..d7fb7ab410 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_scale_test.go @@ -31,15 +31,15 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { ginkgo.Fail(fmt.Sprintf("Missing required environment variables: RG='%s', BUILD_ID='%s'", rg, buildId)) } - ginkgo.It("creates and deletes 5 pods in a burst using device plugin", ginkgo.NodeTimeout(0), func() { + ginkgo.It("creates and deletes 15 pods in a burst using device plugin", ginkgo.NodeTimeout(0), func() { // NOTE: Maximum pods per PodNetwork/PodNetworkInstance is limited by: // 1. Subnet IP address capacity // 2. 
Node capacity (typically 250 pods per node) // 3. Available NICs on nodes (device plugin resources) - // For this test: Creating 5 pods across aks-1 and aks-2 + // For this test: Creating 15 pods across aks-1 and aks-2 // Device plugin and Kubernetes scheduler automatically place pods on nodes with available NICs - // Define scenarios for both clusters - 3 pods on aks-1, 2 pods on aks-2 (5 total for testing) + // Define scenarios for both clusters - 8 pods on aks-1, 7 pods on aks-2 (15 total for testing) // IMPORTANT: Reuse existing PodNetworks from connectivity tests to avoid "duplicate podnetwork with same network id" error scenarios := []struct { cluster string @@ -47,8 +47,8 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { subnet string podCount int }{ - {cluster: "aks-1", vnetName: "cx_vnet_v1", subnet: "s1", podCount: 3}, - {cluster: "aks-2", vnetName: "cx_vnet_v3", subnet: "s1", podCount: 2}, + {cluster: "aks-1", vnetName: "cx_vnet_v1", subnet: "s1", podCount: 8}, + {cluster: "aks-2", vnetName: "cx_vnet_v3", subnet: "s1", podCount: 7}, } // Initialize test scenarios with cache testScenarios := TestScenarios{ ResourceGroup: rg, @@ -90,7 +90,7 @@ var _ = ginkgo.Describe("Datapath Scale Tests", func() { PNITemplate: "../../manifests/swiftv2/long-running-cluster/podnetworkinstance.yaml", PodTemplate: "../../manifests/swiftv2/long-running-cluster/pod-with-device-plugin.yaml", PodImage: testScenarios.PodImage, - Reservations: 10, // Reserve 10 IPs for scale test pods + Reservations: 20, // Reserve 20 IPs for scale test pods } // Step 1: SKIP creating PodNetwork (reuse existing one from connectivity tests) From dd5beed26858ea2df4f098f7d43ae91b639e7f9a Mon Sep 17 00:00:00 2001 From: sivakami Date: Sun, 7 Dec 2025 20:31:28 -0800 Subject: [PATCH 43/64] Private endpoint tests. 
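Reworks SAS generation for the private-endpoint blob download: try an account-key SAS first and, if the pipeline identity cannot read account keys, fall back to a user-delegation SAS via --auth-mode login --as-user. A condensed sketch of the fallback; generateBlobSAS is a simplified stand-in for GenerateStorageSASToken and the expiry handling is illustrative:

    package main

    import (
        "fmt"
        "os/exec"
        "strings"
        "time"
    )

    // generateBlobSAS returns a read-only SAS for container/blob, preferring an
    // account-key SAS and falling back to a user-delegation SAS. Validation is
    // reduced to the sv=/sig= sanity check used by the test helper.
    func generateBlobSAS(account, container, blob string) (string, error) {
        expiry := time.Now().UTC().Add(2 * time.Hour).Format("2006-01-02T15:04Z")
        args := []string{
            "storage", "blob", "generate-sas",
            "--account-name", account,
            "--container-name", container,
            "--name", blob,
            "--permissions", "r",
            "--expiry", expiry,
            "--output", "tsv",
        }
        out, err := exec.Command("az", args...).CombinedOutput()
        tok := strings.Trim(strings.TrimSpace(string(out)), `"'`)
        if err != nil || !strings.Contains(tok, "sig=") {
            // No account key available: retry as the signed-in identity.
            out, err = exec.Command("az", append(args, "--auth-mode", "login", "--as-user")...).CombinedOutput()
            if err != nil {
                return "", fmt.Errorf("SAS generation failed: %w\n%s", err, out)
            }
            tok = strings.Trim(strings.TrimSpace(string(out)), `"'`)
        }
        if !strings.Contains(tok, "sv=") && !strings.Contains(tok, "sig=") {
            return "", fmt.Errorf("generated SAS token looks invalid: %q", tok)
        }
        return tok, nil
    }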
--- .../swiftv2/longRunningCluster/datapath.go | 24 +++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index d65585f025..54aaabbb05 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -678,6 +678,13 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) out, err := cmd.CombinedOutput() sasToken := strings.TrimSpace(string(out)) + // Check if account key method produced valid token + accountKeyWorked := err == nil && !strings.Contains(sasToken, "WARNING") && + !strings.Contains(sasToken, "ERROR") && (strings.Contains(sasToken, "sv=") || strings.Contains(sasToken, "sig=")) + + if !accountKeyWorked { + sasToken := strings.TrimSpace(string(out)) + // Check if account key method produced valid token accountKeyWorked := err == nil && !strings.Contains(sasToken, "WARNING") && !strings.Contains(sasToken, "ERROR") && (strings.Contains(sasToken, "sv=") || strings.Contains(sasToken, "sig=")) @@ -690,6 +697,12 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) fmt.Printf("Account key SAS generation failed (no credentials): %s\n", sasToken) } + if err != nil { + fmt.Printf("Account key SAS generation failed (error): %s\n", string(out)) + } else { + fmt.Printf("Account key SAS generation failed (no credentials): %s\n", sasToken) + } + cmd = exec.Command("az", "storage", "blob", "generate-sas", "--account-name", storageAccountName, "--container-name", containerName, @@ -700,12 +713,15 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) "--as-user", "--output", "tsv") + out, err = cmd.CombinedOutput() if err != nil { return "", fmt.Errorf("failed to generate SAS token (both account key and user delegation): %s\n%s", err, string(out)) } sasToken = strings.TrimSpace(string(out)) + + sasToken = strings.TrimSpace(string(out)) } if sasToken == "" { @@ -720,6 +736,14 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) return "", fmt.Errorf("generated SAS token appears invalid (missing sv= or sig=): %s", sasToken) } + // Remove any surrounding quotes that might be added by some shells + sasToken = strings.Trim(sasToken, "\"'") + + // Validate SAS token format - should start with typical SAS parameters + if !strings.Contains(sasToken, "sv=") && !strings.Contains(sasToken, "sig=") { + return "", fmt.Errorf("generated SAS token appears invalid (missing sv= or sig=): %s", sasToken) + } + return sasToken, nil } From 789ed5989745095c9984912c4c194398b55f2777 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 09:55:30 -0800 Subject: [PATCH 44/64] update pod.yaml --- .../swiftv2/long-running-cluster/pod.yaml | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml index 28b422d0d6..5a3d4015ab 100644 --- a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml +++ b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml @@ -23,6 +23,21 @@ spec: while true; do echo "TCP Connection Success from $(hostname) at $(date)" | nc -l -p 8080 done + echo "Pod Network Diagnostics started on $(hostname)"; + echo "Pod IP: $(hostname -i)"; + echo "Starting TCP listener on port 8080"; + + while true; do + nc -l -p 8080 -c 'echo "TCP 
Connection Success from $(hostname) at $(date)"' + done & + + echo "TCP listener started on port 8080" + sleep 2 + if netstat -tuln | grep -q ':8080'; then # Verify listener is running + echo "TCP listener is active on port 8080" + else + echo "WARNING: TCP listener may not be active on port 8080" + fi ports: - containerPort: 8080 protocol: TCP From be5612d34710eeaf8db53aff1dff40b80a6589d0 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 10:05:36 -0800 Subject: [PATCH 45/64] Check if mtpnc is cleaned up after pods are deleted. --- test/integration/swiftv2/longRunningCluster/datapath.go | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 54aaabbb05..40c92ea379 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -533,7 +533,7 @@ func DeleteAllScenarios(testScenarios TestScenarios) error { // Phase 3: Verify no MTPNC resources are stuck fmt.Printf("\n=== Phase 3: Verifying MTPNC cleanup ===\n") clustersChecked := make(map[string]bool) - + for _, scenario := range testScenarios.Scenarios { // Check each cluster only once if clustersChecked[scenario.Cluster] { @@ -543,7 +543,7 @@ func DeleteAllScenarios(testScenarios TestScenarios) error { kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.Cluster) fmt.Printf("Checking for pending MTPNC resources in cluster %s\n", scenario.Cluster) - + err := helpers.VerifyNoMTPNC(kubeconfig, testScenarios.BuildID) if err != nil { fmt.Printf("WARNING: Found pending MTPNC resources in cluster %s: %v\n", scenario.Cluster, err) From 8dc9e4e6182bbd23373249216a3d5fd276e8520d Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 10:40:05 -0800 Subject: [PATCH 46/64] add container readiness check. 
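Narrows the retry in GetPodDelegatedIP to the one genuinely transient case, a pod that is Running while its container is still starting ("container not found"), with a shorter 10 second exec timeout and a 3 second backoff. A small sketch of the retry shape, with the kubectl exec reduced to a callback (retryWhileContainerStarting is hypothetical):

    package main

    import (
        "fmt"
        "strings"
        "time"
    )

    // retryWhileContainerStarting retries fn only while its output reports the
    // transient "container not found" condition; any other error fails fast.
    // fn stands in for the kubectl exec that runs `ip addr show eth1` in the pod.
    func retryWhileContainerStarting(maxRetries int, fn func() (string, error)) (string, error) {
        for attempt := 1; attempt <= maxRetries; attempt++ {
            out, err := fn()
            if err == nil {
                return out, nil
            }
            if strings.Contains(out, "container not found") && attempt < maxRetries {
                fmt.Printf("container not ready (attempt %d/%d), waiting 3s...\n", attempt, maxRetries)
                time.Sleep(3 * time.Second)
                continue
            }
            return "", err
        }
        return "", fmt.Errorf("exhausted %d attempts", maxRetries)
    }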
--- .../integration/swiftv2/helpers/az_helpers.go | 22 ++++++++----------- 1 file changed, 9 insertions(+), 13 deletions(-) diff --git a/test/integration/swiftv2/helpers/az_helpers.go b/test/integration/swiftv2/helpers/az_helpers.go index c40718af73..e3503e3410 100644 --- a/test/integration/swiftv2/helpers/az_helpers.go +++ b/test/integration/swiftv2/helpers/az_helpers.go @@ -330,10 +330,10 @@ func GetPodIP(kubeconfig, namespace, podName string) (string, error) { // GetPodDelegatedIP retrieves the eth1 IP address (delegated subnet IP) of a pod // This is the IP used for cross-VNet communication and is subject to NSG rules func GetPodDelegatedIP(kubeconfig, namespace, podName string) (string, error) { - // Retry logic - pod might be Running but container not ready yet, or network interface still initializing + // Retry logic - pod might be Running but container not ready yet maxRetries := 5 for attempt := 1; attempt <= maxRetries; attempt++ { - ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) + ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) // Get eth1 IP address by running 'ip addr show eth1' in the pod cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "exec", podName, @@ -349,17 +349,13 @@ func GetPodDelegatedIP(kubeconfig, namespace, podName string) (string, error) { return "", fmt.Errorf("pod %s in namespace %s has no eth1 IP address (delegated subnet not configured?)", podName, namespace) } - // Check for retryable errors: container not found, signal killed, context deadline exceeded - errStr := strings.ToLower(err.Error()) - outStr := strings.ToLower(string(out)) - isRetryable := strings.Contains(outStr, "container not found") || - strings.Contains(errStr, "signal: killed") || - strings.Contains(errStr, "context deadline exceeded") - - if isRetryable && attempt < maxRetries { - fmt.Printf("Retryable error getting IP for pod %s (attempt %d/%d): %v. 
Waiting 5 seconds...\n", podName, attempt, maxRetries, err) - time.Sleep(5 * time.Second) - continue + // Check if it's a container not found error + if strings.Contains(string(out), "container not found") { + if attempt < maxRetries { + fmt.Printf("Container not ready yet in pod %s (attempt %d/%d), waiting 3 seconds...\n", podName, attempt, maxRetries) + time.Sleep(3 * time.Second) + continue + } } return "", fmt.Errorf("failed to get eth1 IP for %s in namespace %s: %w\nOutput: %s", podName, namespace, err, string(out)) From 6dff2ca6171623ea823e57b46a9a3c6ae0e1bf78 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 10:45:56 -0800 Subject: [PATCH 47/64] update pod.yaml --- .../swiftv2/long-running-cluster/pod.yaml | 22 +++---------------- 1 file changed, 3 insertions(+), 19 deletions(-) diff --git a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml index 5a3d4015ab..5ee2cf6033 100644 --- a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml +++ b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml @@ -18,26 +18,10 @@ spec: echo "Pod Network Diagnostics started on $(hostname)" echo "Pod IP: $(hostname -i)" echo "Starting TCP listener on port 8080" - - # Start netcat listener that responds to connections - while true; do - echo "TCP Connection Success from $(hostname) at $(date)" | nc -l -p 8080 + + while true; do + nc -lk -p 8080 -e /bin/sh done - echo "Pod Network Diagnostics started on $(hostname)"; - echo "Pod IP: $(hostname -i)"; - echo "Starting TCP listener on port 8080"; - - while true; do - nc -l -p 8080 -c 'echo "TCP Connection Success from $(hostname) at $(date)"' - done & - - echo "TCP listener started on port 8080" - sleep 2 - if netstat -tuln | grep -q ':8080'; then # Verify listener is running - echo "TCP listener is active on port 8080" - else - echo "WARNING: TCP listener may not be active on port 8080" - fi ports: - containerPort: 8080 protocol: TCP From c0f3841c51496089aa94712f3f335a66f8ffae33 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 15:11:07 -0800 Subject: [PATCH 48/64] Update pod.yaml --- .../long-running-pipeline-template.yaml | 170 ++++++------------ .../swiftv2/long-running-cluster/pod.yaml | 7 +- 2 files changed, 55 insertions(+), 122 deletions(-) diff --git a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml index 69bcadf3c0..0a876ef882 100644 --- a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml +++ b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml @@ -67,6 +67,7 @@ stages: steps: - checkout: self - task: AzureCLI@2 + displayName: "Create AKS clusters" displayName: "Create AKS clusters" inputs: azureSubscription: ${{ parameters.serviceConnection }} @@ -365,130 +366,61 @@ stages: # ------------------------------------------------------------ # Job 4: Scale Tests with Device Plugin # ------------------------------------------------------------ - - job: ScaleTest - displayName: "Scale Test - Create and Delete 15 Pods with Device Plugin" - dependsOn: - - CreateTestResources - - ConnectivityTests - - PrivateEndpointTests - condition: succeeded() - timeoutInMinutes: 90 - pool: - vmImage: ubuntu-latest - steps: - - checkout: self + # - job: DeleteTestResources + # displayName: "Delete PodNetwork, PNI, and Pods" + # dependsOn: + # - CreateTestResources + # - 
ConnectivityTests + # - PrivateEndpointTests + # # Always run cleanup, even if previous jobs failed + # condition: always() + # timeoutInMinutes: 60 + # pool: + # vmImage: ubuntu-latest + # steps: + # - checkout: self - - task: GoTool@0 - displayName: "Use Go 1.22.5" - inputs: - version: "1.22.5" + # - task: GoTool@0 + # displayName: "Use Go 1.22.5" + # inputs: + # version: "1.22.5" - - task: AzureCLI@2 - displayName: "Run Scale Test (Create and Delete)" - inputs: - azureSubscription: ${{ parameters.serviceConnection }} - scriptType: bash - scriptLocation: inlineScript - inlineScript: | - echo "==> Installing Ginkgo CLI" - go install github.com/onsi/ginkgo/v2/ginkgo@latest + # - task: AzureCLI@2 + # displayName: "Delete Test Resources" + # inputs: + # azureSubscription: ${{ parameters.serviceConnection }} + # scriptType: bash + # scriptLocation: inlineScript + # inlineScript: | + # echo "==> Installing Ginkgo CLI" + # go install github.com/onsi/ginkgo/v2/ginkgo@latest - echo "==> Adding Go bin to PATH" - export PATH=$PATH:$(go env GOPATH)/bin + # echo "==> Adding Go bin to PATH" + # export PATH=$PATH:$(go env GOPATH)/bin - echo "==> Downloading Go dependencies" - go mod download + # echo "==> Downloading Go dependencies" + # go mod download - echo "==> Setting up kubeconfig for cluster aks-1" - az aks get-credentials \ - --resource-group $(rgName) \ - --name aks-1 \ - --file /tmp/aks-1.kubeconfig \ - --overwrite-existing \ - --admin - - echo "==> Setting up kubeconfig for cluster aks-2" - az aks get-credentials \ - --resource-group $(rgName) \ - --name aks-2 \ - --file /tmp/aks-2.kubeconfig \ - --overwrite-existing \ - --admin + # echo "==> Setting up kubeconfig for cluster aks-1" + # az aks get-credentials \ + # --resource-group $(rgName) \ + # --name aks-1 \ + # --file /tmp/aks-1.kubeconfig \ + # --overwrite-existing \ + # --admin - echo "==> Verifying cluster aks-1 connectivity" - kubectl --kubeconfig /tmp/aks-1.kubeconfig get nodes + # echo "==> Setting up kubeconfig for cluster aks-2" + # az aks get-credentials \ + # --resource-group $(rgName) \ + # --name aks-2 \ + # --file /tmp/aks-2.kubeconfig \ + # --overwrite-existing \ + # --admin - echo "==> Verifying cluster aks-2 connectivity" - kubectl --kubeconfig /tmp/aks-2.kubeconfig get nodes - - echo "==> Running scale test: Create 15 pods with device plugin across both clusters" - echo "NOTE: Pods are auto-scheduled by Kubernetes scheduler and device plugin" - echo " - 8 pods in aks-1 (cx_vnet_a1/s1)" - echo " - 7 pods in aks-2 (cx_vnet_a2/s1)" - echo "Pod limits per PodNetwork/PodNetworkInstance:" - echo " - Subnet IP address capacity" - echo " - Node capacity (typically 250 pods per node)" - echo " - Available device plugin resources (NICs per node)" - export RG="$(rgName)" - export BUILD_ID="$(rgName)" - export WORKLOAD_TYPE="swiftv2-linux" - cd ./test/integration/swiftv2/longRunningCluster - ginkgo -v -trace --timeout=1h --tags=scale_test - - # ------------------------------------------------------------ - # Job 5: Delete Test Resources - # ------------------------------------------------------------ - - job: DeleteTestResources - displayName: "Delete PodNetwork, PNI, and Pods" - dependsOn: - - CreateTestResources - - ConnectivityTests - - PrivateEndpointTests - - ScaleTest - # Always run cleanup, even if previous jobs failed - condition: always() - timeoutInMinutes: 60 - pool: - vmImage: ubuntu-latest - steps: - - checkout: self - - task: GoTool@0 - displayName: "Use Go 1.22.5" - inputs: - version: "1.22.5" - - task: AzureCLI@2 - 
displayName: "Delete Test Resources" - inputs: - azureSubscription: ${{ parameters.serviceConnection }} - scriptType: bash - scriptLocation: inlineScript - inlineScript: | - echo "==> Installing Ginkgo CLI" - go install github.com/onsi/ginkgo/v2/ginkgo@latest - echo "==> Adding Go bin to PATH" - export PATH=$PATH:$(go env GOPATH)/bin - echo "==> Downloading Go dependencies" - go mod download - - echo "==> Setting up kubeconfig for cluster aks-1" - az aks get-credentials \ - --resource-group $(rgName) \ - --name aks-1 \ - --file /tmp/aks-1.kubeconfig \ - --overwrite-existing \ - --admin - echo "==> Setting up kubeconfig for cluster aks-2" - az aks get-credentials \ - --resource-group $(rgName) \ - --name aks-2 \ - --file /tmp/aks-2.kubeconfig \ - --overwrite-existing \ - --admin - - echo "==> Deleting test resources (8 scenarios)" - export RG="$(rgName)" - export BUILD_ID="$(rgName)" - export WORKLOAD_TYPE="swiftv2-linux" - cd ./test/integration/swiftv2/longRunningCluster - ginkgo -v -trace --timeout=1h --tags=delete_test + # echo "==> Deleting test resources (8 scenarios)" + # export RG="$(rgName)" + # export BUILD_ID="$(rgName)" + # export WORKLOAD_TYPE="swiftv2-linux" + # cd ./test/integration/swiftv2/longRunningCluster + # ginkgo -v -trace --timeout=1h --tags=delete_test \ No newline at end of file diff --git a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml index 5ee2cf6033..28b422d0d6 100644 --- a/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml +++ b/test/integration/manifests/swiftv2/long-running-cluster/pod.yaml @@ -18,9 +18,10 @@ spec: echo "Pod Network Diagnostics started on $(hostname)" echo "Pod IP: $(hostname -i)" echo "Starting TCP listener on port 8080" - - while true; do - nc -lk -p 8080 -e /bin/sh + + # Start netcat listener that responds to connections + while true; do + echo "TCP Connection Success from $(hostname) at $(date)" | nc -l -p 8080 done ports: - containerPort: 8080 From 236a874914dcdcdcabeb6f1862d5e734688b937c Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 15:49:13 -0800 Subject: [PATCH 49/64] Update connectivity test. 
--- test/integration/swiftv2/longRunningCluster/datapath.go | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 40c92ea379..df82380b2e 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -533,7 +533,7 @@ func DeleteAllScenarios(testScenarios TestScenarios) error { // Phase 3: Verify no MTPNC resources are stuck fmt.Printf("\n=== Phase 3: Verifying MTPNC cleanup ===\n") clustersChecked := make(map[string]bool) - + for _, scenario := range testScenarios.Scenarios { // Check each cluster only once if clustersChecked[scenario.Cluster] { @@ -543,7 +543,7 @@ func DeleteAllScenarios(testScenarios TestScenarios) error { kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", scenario.Cluster) fmt.Printf("Checking for pending MTPNC resources in cluster %s\n", scenario.Cluster) - + err := helpers.VerifyNoMTPNC(kubeconfig, testScenarios.BuildID) if err != nil { fmt.Printf("WARNING: Found pending MTPNC resources in cluster %s: %v\n", scenario.Cluster, err) @@ -636,8 +636,7 @@ func RunConnectivityTest(test ConnectivityTest, rg, buildId string) error { // Using -m 3 for 3 second timeout (short because netcat closes connection immediately) // Using --interface eth1 to force traffic through delegated subnet interface // Using --http0.9 to allow HTTP/0.9 responses from netcat (which sends raw text without proper HTTP headers) - // Exit code 28 (timeout) is OK if we received data, since netcat doesn't properly close the connection - curlCmd := fmt.Sprintf("curl --http0.9 --interface eth1 -m 3 http://%s:8080/", destIP) + curlCmd := fmt.Sprintf("curl --http0.9 --interface eth1 -m 10 http://%s:8080/", destIP) output, err := helpers.ExecInPod(sourceKubeconfig, test.SourceNamespace, test.SourcePod, curlCmd) if err != nil { From 6b1cb819c0e7ecc0d1120232b0e4e485277c30d4 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 18:46:41 -0800 Subject: [PATCH 50/64] Update netcat curl test. 
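Because the listener is raw netcat rather than a real HTTP server, curl can receive the banner yet still exit with status 28 when its -m timeout expires, since the connection is never closed with proper HTTP framing. The check below restores exactly that tolerance; reduced to a standalone helper (the name curlDelivered is invented for illustration), the decision is:

package main

import (
	"fmt"
	"strings"
)

// curlDelivered reports whether a curl run against the netcat listener should
// count as success: either curl exited cleanly with the banner in its output,
// or it hit its -m timeout (exit status 28) after the banner already arrived.
func curlDelivered(output string, runErr error) bool {
	gotBanner := strings.Contains(output, "TCP Connection Success")
	if runErr == nil {
		return gotBanner
	}
	return strings.Contains(runErr.Error(), "exit status 28") && gotBanner
}

func main() {
	fmt.Println(curlDelivered("TCP Connection Success from pod-a", fmt.Errorf("exit status 28"))) // true
	fmt.Println(curlDelivered("", fmt.Errorf("exit status 7")))                                   // false
}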
--- .../integration/swiftv2/helpers/az_helpers.go | 22 +++++++++++-------- .../swiftv2/longRunningCluster/datapath.go | 7 +++++- 2 files changed, 19 insertions(+), 10 deletions(-) diff --git a/test/integration/swiftv2/helpers/az_helpers.go b/test/integration/swiftv2/helpers/az_helpers.go index e3503e3410..c40718af73 100644 --- a/test/integration/swiftv2/helpers/az_helpers.go +++ b/test/integration/swiftv2/helpers/az_helpers.go @@ -330,10 +330,10 @@ func GetPodIP(kubeconfig, namespace, podName string) (string, error) { // GetPodDelegatedIP retrieves the eth1 IP address (delegated subnet IP) of a pod // This is the IP used for cross-VNet communication and is subject to NSG rules func GetPodDelegatedIP(kubeconfig, namespace, podName string) (string, error) { - // Retry logic - pod might be Running but container not ready yet + // Retry logic - pod might be Running but container not ready yet, or network interface still initializing maxRetries := 5 for attempt := 1; attempt <= maxRetries; attempt++ { - ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) + ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second) // Get eth1 IP address by running 'ip addr show eth1' in the pod cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "exec", podName, @@ -349,13 +349,17 @@ func GetPodDelegatedIP(kubeconfig, namespace, podName string) (string, error) { return "", fmt.Errorf("pod %s in namespace %s has no eth1 IP address (delegated subnet not configured?)", podName, namespace) } - // Check if it's a container not found error - if strings.Contains(string(out), "container not found") { - if attempt < maxRetries { - fmt.Printf("Container not ready yet in pod %s (attempt %d/%d), waiting 3 seconds...\n", podName, attempt, maxRetries) - time.Sleep(3 * time.Second) - continue - } + // Check for retryable errors: container not found, signal killed, context deadline exceeded + errStr := strings.ToLower(err.Error()) + outStr := strings.ToLower(string(out)) + isRetryable := strings.Contains(outStr, "container not found") || + strings.Contains(errStr, "signal: killed") || + strings.Contains(errStr, "context deadline exceeded") + + if isRetryable && attempt < maxRetries { + fmt.Printf("Retryable error getting IP for pod %s (attempt %d/%d): %v. 
Waiting 5 seconds...\n", podName, attempt, maxRetries, err) + time.Sleep(5 * time.Second) + continue } return "", fmt.Errorf("failed to get eth1 IP for %s in namespace %s: %w\nOutput: %s", podName, namespace, err, string(out)) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index df82380b2e..69717a5bdd 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -636,9 +636,14 @@ func RunConnectivityTest(test ConnectivityTest, rg, buildId string) error { // Using -m 3 for 3 second timeout (short because netcat closes connection immediately) // Using --interface eth1 to force traffic through delegated subnet interface // Using --http0.9 to allow HTTP/0.9 responses from netcat (which sends raw text without proper HTTP headers) - curlCmd := fmt.Sprintf("curl --http0.9 --interface eth1 -m 10 http://%s:8080/", destIP) + // Exit code 28 (timeout) is OK if we received data, since netcat doesn't properly close the connection + curlCmd := fmt.Sprintf("curl --http0.9 --interface eth1 -m 3 http://%s:8080/", destIP) output, err := helpers.ExecInPod(sourceKubeconfig, test.SourceNamespace, test.SourcePod, curlCmd) + + // Check if we received data even if curl timed out (exit code 28) + // Netcat closes the connection without proper HTTP close, causing curl to timeout + // But if we got the expected response, the connectivity test is successful if err != nil { if strings.Contains(err.Error(), "exit status 28") && strings.Contains(output, "TCP Connection Success") { // Timeout but we got the data - this is OK with netcat From 8498946786b241721cffcfe25c5eefcd2b1c2764 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 19:11:33 -0800 Subject: [PATCH 51/64] Enable delete pods. 
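The re-enabled cleanup job drives ginkgo with --tags=delete_test, so only files guarded by that build tag are compiled into the run; the create, connectivity, private endpoint and scale suites are selected the same way by their own tags. A skeleton of such a tagged suite file (illustrative only, not the actual datapath_delete_test.go) looks like:

//go:build delete_test

package longRunningCluster

import (
	"testing"

	"github.com/onsi/ginkgo/v2"
	"github.com/onsi/gomega"
)

func TestLongRunningDelete(t *testing.T) {
	gomega.RegisterFailHandler(ginkgo.Fail)
	ginkgo.RunSpecs(t, "SwiftV2 long-running delete suite")
}

var _ = ginkgo.Describe("Delete test resources", func() {
	ginkgo.It("removes pods, PodNetworkInstances and PodNetworks", func() {
		// The real suite calls DeleteAllScenarios(...) here and asserts on its error.
	})
})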
--- .../long-running-pipeline-template.yaml | 100 +++++++++--------- 1 file changed, 50 insertions(+), 50 deletions(-) diff --git a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml index 0a876ef882..bd9c9534d9 100644 --- a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml +++ b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml @@ -366,61 +366,61 @@ stages: # ------------------------------------------------------------ # Job 4: Scale Tests with Device Plugin # ------------------------------------------------------------ - # - job: DeleteTestResources - # displayName: "Delete PodNetwork, PNI, and Pods" - # dependsOn: - # - CreateTestResources - # - ConnectivityTests - # - PrivateEndpointTests - # # Always run cleanup, even if previous jobs failed - # condition: always() - # timeoutInMinutes: 60 - # pool: - # vmImage: ubuntu-latest - # steps: - # - checkout: self + - job: DeleteTestResources + displayName: "Delete PodNetwork, PNI, and Pods" + dependsOn: + - CreateTestResources + - ConnectivityTests + - PrivateEndpointTests + # Always run cleanup, even if previous jobs failed + condition: always() + timeoutInMinutes: 60 + pool: + vmImage: ubuntu-latest + steps: + - checkout: self - # - task: GoTool@0 - # displayName: "Use Go 1.22.5" - # inputs: - # version: "1.22.5" + - task: GoTool@0 + displayName: "Use Go 1.22.5" + inputs: + version: "1.22.5" - # - task: AzureCLI@2 - # displayName: "Delete Test Resources" - # inputs: - # azureSubscription: ${{ parameters.serviceConnection }} - # scriptType: bash - # scriptLocation: inlineScript - # inlineScript: | - # echo "==> Installing Ginkgo CLI" - # go install github.com/onsi/ginkgo/v2/ginkgo@latest + - task: AzureCLI@2 + displayName: "Delete Test Resources" + inputs: + azureSubscription: ${{ parameters.serviceConnection }} + scriptType: bash + scriptLocation: inlineScript + inlineScript: | + echo "==> Installing Ginkgo CLI" + go install github.com/onsi/ginkgo/v2/ginkgo@latest - # echo "==> Adding Go bin to PATH" - # export PATH=$PATH:$(go env GOPATH)/bin + echo "==> Adding Go bin to PATH" + export PATH=$PATH:$(go env GOPATH)/bin - # echo "==> Downloading Go dependencies" - # go mod download + echo "==> Downloading Go dependencies" + go mod download - # echo "==> Setting up kubeconfig for cluster aks-1" - # az aks get-credentials \ - # --resource-group $(rgName) \ - # --name aks-1 \ - # --file /tmp/aks-1.kubeconfig \ - # --overwrite-existing \ - # --admin + echo "==> Setting up kubeconfig for cluster aks-1" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-1 \ + --file /tmp/aks-1.kubeconfig \ + --overwrite-existing \ + --admin - # echo "==> Setting up kubeconfig for cluster aks-2" - # az aks get-credentials \ - # --resource-group $(rgName) \ - # --name aks-2 \ - # --file /tmp/aks-2.kubeconfig \ - # --overwrite-existing \ - # --admin + echo "==> Setting up kubeconfig for cluster aks-2" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-2 \ + --file /tmp/aks-2.kubeconfig \ + --overwrite-existing \ + --admin - # echo "==> Deleting test resources (8 scenarios)" - # export RG="$(rgName)" - # export BUILD_ID="$(rgName)" - # export WORKLOAD_TYPE="swiftv2-linux" - # cd ./test/integration/swiftv2/longRunningCluster - # ginkgo -v -trace --timeout=1h --tags=delete_test + echo "==> Deleting test resources (8 scenarios)" + export RG="$(rgName)" + export 
BUILD_ID="$(rgName)" + export WORKLOAD_TYPE="swiftv2-linux" + cd ./test/integration/swiftv2/longRunningCluster + ginkgo -v -trace --timeout=1h --tags=delete_test \ No newline at end of file From 2ba3f3b9645d266131df57d2011cb42afc223a94 Mon Sep 17 00:00:00 2001 From: sivakami Date: Mon, 8 Dec 2025 19:50:53 -0800 Subject: [PATCH 52/64] Remove test changes. --- test/integration/swiftv2/longRunningCluster/datapath.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 69717a5bdd..a2e167a1fd 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -640,7 +640,7 @@ func RunConnectivityTest(test ConnectivityTest, rg, buildId string) error { curlCmd := fmt.Sprintf("curl --http0.9 --interface eth1 -m 3 http://%s:8080/", destIP) output, err := helpers.ExecInPod(sourceKubeconfig, test.SourceNamespace, test.SourcePod, curlCmd) - + // Check if we received data even if curl timed out (exit code 28) // Netcat closes the connection without proper HTTP close, causing curl to timeout // But if we got the expected response, the connectivity test is successful From 59a08263f62497b5b945642323b0f1948b1254e9 Mon Sep 17 00:00:00 2001 From: sivakami Date: Tue, 9 Dec 2025 14:52:44 -0800 Subject: [PATCH 53/64] remove test changes for storage accounts. --- .../swiftv2-long-running/scripts/create_storage.sh | 9 --------- 1 file changed, 9 deletions(-) diff --git a/.pipelines/swiftv2-long-running/scripts/create_storage.sh b/.pipelines/swiftv2-long-running/scripts/create_storage.sh index fd5f7addae..53017e7f3c 100644 --- a/.pipelines/swiftv2-long-running/scripts/create_storage.sh +++ b/.pipelines/swiftv2-long-running/scripts/create_storage.sh @@ -71,15 +71,6 @@ for SA in "$SA1" "$SA2"; do && echo "[OK] Test blob 'hello.txt' uploaded to $SA/test/" done -# # Disable public network access ONLY on SA1 (Tenant A storage with private endpoint) -# echo "==> Disabling public network access on $SA1" -# az storage account update \ -# --name "$SA1" \ -# --resource-group "$RG" \ -# --public-network-access Disabled \ -# --output none \ -# && echo "[OK] Public network access disabled on $SA1" - echo "All storage accounts created and verified successfully." 
# Set pipeline output variables From 4fed155168bc607dddb72cab7655d9b02a4deef6 Mon Sep 17 00:00:00 2001 From: sivakami Date: Thu, 11 Dec 2025 10:47:42 -0800 Subject: [PATCH 54/64] update go.mod --- go.mod | 6 ++---- go.sum | 8 ++++---- 2 files changed, 6 insertions(+), 8 deletions(-) diff --git a/go.mod b/go.mod index 3bafced2fa..185f92afd6 100644 --- a/go.mod +++ b/go.mod @@ -1,8 +1,6 @@ module github.com/Azure/azure-container-networking -go 1.22.5 - -toolchain go1.22.5 +go 1.24.1 require ( github.com/Azure/azure-sdk-for-go/sdk/azcore v1.19.1 @@ -121,7 +119,7 @@ require ( github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/network/armnetwork/v5 v5.2.0 github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resources/armresources v1.2.0 github.com/cilium/cilium v1.15.16 - github.com/cilium/ebpf v0.19.0 + github.com/cilium/ebpf v0.16.0 github.com/jsternberg/zap-logfmt v1.3.0 github.com/onsi/ginkgo/v2 v2.23.4 golang.org/x/sync v0.17.0 diff --git a/go.sum b/go.sum index c1ac6b2891..286183d795 100644 --- a/go.sum +++ b/go.sum @@ -72,8 +72,8 @@ github.com/cilium/checkmate v1.0.3 h1:CQC5eOmlAZeEjPrVZY3ZwEBH64lHlx9mXYdUehEwI5 github.com/cilium/checkmate v1.0.3/go.mod h1:KiBTasf39/F2hf2yAmHw21YFl3hcEyP4Yk6filxc12A= github.com/cilium/cilium v1.15.16 h1:m27kbvRA0ynOQlm1ay+a+lNVgLCTUW5Inky9WoA3wBM= github.com/cilium/cilium v1.15.16/go.mod h1:UuiAb8fmxV/lix5cGRgiJJ7hvhRfcdF48QreqG0xTB4= -github.com/cilium/ebpf v0.19.0 h1:Ro/rE64RmFBeA9FGjcTc+KmCeY6jXmryu6FfnzPRIao= -github.com/cilium/ebpf v0.19.0/go.mod h1:fLCgMo3l8tZmAdM3B2XqdFzXBpwkcSTroaVqN08OWVY= +github.com/cilium/ebpf v0.16.0 h1:+BiEnHL6Z7lXnlGUsXQPPAE7+kenAd4ES8MQ5min0Ok= +github.com/cilium/ebpf v0.16.0/go.mod h1:L7u2Blt2jMM/vLAVgjxluxtBKlz3/GWjB0dMOEngfwE= github.com/cilium/proxy v0.0.0-20231202123106-38b645b854f3 h1:fckMszrvhMot1XdF04NUKzmGw2CBJWGc9BCpFhVPKD8= github.com/cilium/proxy v0.0.0-20231202123106-38b645b854f3/go.mod h1:cvRtoiPIT40QqsHRR77WyyMSj8prsz0/kaV0s8Q3LIA= github.com/client9/misspell v0.3.4/go.mod h1:qj6jICC3Q7zFZvVWo7KLAzC3yx5G7kyvSDkc90ppPyw= @@ -159,8 +159,8 @@ github.com/go-openapi/swag v0.23.0 h1:vsEVJDUo2hPJ2tu0/Xc+4noaxyEffXNIs3cOULZ+Gr github.com/go-openapi/swag v0.23.0/go.mod h1:esZ8ITTYEsH1V2trKHjAN8Ai7xHb8RV+YSZ577vPjgQ= github.com/go-openapi/validate v0.22.3 h1:KxG9mu5HBRYbecRb37KRCihvGGtND2aXziBAv0NNfyI= github.com/go-openapi/validate v0.22.3/go.mod h1:kVxh31KbfsxU8ZyoHaDbLBWU5CnMdqBUEtadQ2G4d5M= -github.com/go-quicktest/qt v1.101.1-0.20240301121107-c6c8733fa1e6 h1:teYtXy9B7y5lHTp8V9KPxpYRAVA7dozigQcMiBust1s= -github.com/go-quicktest/qt v1.101.1-0.20240301121107-c6c8733fa1e6/go.mod h1:p4lGIVX+8Wa6ZPNDvqcxq36XpUDLh42FLetFU7odllI= +github.com/go-quicktest/qt v1.101.0 h1:O1K29Txy5P2OK0dGo59b7b0LR6wKfIhttaAhHUyn7eI= +github.com/go-quicktest/qt v1.101.0/go.mod h1:14Bz/f7NwaXPtdYEgzsx46kqSxVwTbzVZsDC26tQJow= github.com/go-task/slim-sprig v0.0.0-20210107165309-348f09dbbbc0/go.mod h1:fyg7847qk6SyHyPtNmDHnmrv/HOrqktSC+C9fM+CJOE= github.com/go-task/slim-sprig/v3 v3.0.0 h1:sUs3vkvUymDpBKi3qH1YSqBQk9+9D/8M2mN1vB6EwHI= github.com/go-task/slim-sprig/v3 v3.0.0/go.mod h1:W848ghGpv3Qj3dhTPRyJypKRiqCdHZiAzKg9hl15HA8= From e7c99332d0365f883d1432c0be4dc641a411df0c Mon Sep 17 00:00:00 2001 From: sivakami Date: Thu, 11 Dec 2025 15:23:01 -0800 Subject: [PATCH 55/64] Make dockerfiles. 
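The storage script above uploads a hello.txt blob into each account's test container, and the private endpoint suite later fetches it from inside a pod by combining the account's blob FQDN, which the Private DNS Zone resolves to the private endpoint IP, with a SAS query string. The URL the pod effectively requests is just the following (a sketch; the https scheme, account name and SAS value shown are placeholders):

package main

import "fmt"

// blobURL builds the URL a pod fetches; the <account>.blob.core.windows.net
// name resolves to the private endpoint IP when looked up from the peered VNets.
func blobURL(account, container, blob, sasToken string) string {
	return fmt.Sprintf("https://%s.blob.core.windows.net/%s/%s?%s", account, container, blob, sasToken)
}

func main() {
	// Placeholder account name and SAS; the script uploads hello.txt into the "test" container.
	fmt.Println(blobURL("examplestorageacct", "test", "hello.txt", "sv=...&sig=..."))
}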
--- .pipelines/build/dockerfiles/cns.Dockerfile | 4 ++-- cni/Dockerfile | 4 ++-- cns/Dockerfile | 6 +++--- 3 files changed, 7 insertions(+), 7 deletions(-) diff --git a/.pipelines/build/dockerfiles/cns.Dockerfile b/.pipelines/build/dockerfiles/cns.Dockerfile index 1fc8f9d5b1..424dd37a71 100644 --- a/.pipelines/build/dockerfiles/cns.Dockerfile +++ b/.pipelines/build/dockerfiles/cns.Dockerfile @@ -11,11 +11,11 @@ ENTRYPOINT ["azure-cns.exe"] EXPOSE 10090 # mcr.microsoft.com/azurelinux/base/core:3.0 -FROM --platform=linux/${ARCH} mcr.microsoft.com/azurelinux/base/core@sha256:833693619d523c23b1fe4d9c1f64a6c697e2a82f7a6ee26e1564897c3fe3fa02 AS build-helper +FROM --platform=linux/${ARCH} mcr.microsoft.com/azurelinux/base/core@sha256:ee7f76ce3febc06e79c1a3776178b36bea62f76da43f0d58c30d5974d0ec3dbf AS build-helper RUN tdnf install -y iptables # mcr.microsoft.com/azurelinux/distroless/minimal:3.0 -FROM --platform=linux/${ARCH} mcr.microsoft.com/azurelinux/distroless/minimal@sha256:d784c8233e87e8bce2e902ff59a91262635e4cabc25ec55ac0a718344514db3c AS linux +FROM --platform=linux/${ARCH} mcr.microsoft.com/azurelinux/distroless/minimal@sha256:810f96c73cfbe47690b54eb4f3cea57ec0467e413f1fd068a234746a95a1c27e AS linux ARG ARTIFACT_DIR . COPY --from=build-helper /usr/sbin/*tables* /usr/sbin/ diff --git a/cni/Dockerfile b/cni/Dockerfile index 5867fd09b2..979f4438ee 100644 --- a/cni/Dockerfile +++ b/cni/Dockerfile @@ -6,10 +6,10 @@ ARG OS_VERSION ARG OS # mcr.microsoft.com/oss/go/microsoft/golang:1.24-azurelinux3.0 -FROM --platform=linux/${ARCH} mcr.microsoft.com/oss/go/microsoft/golang@sha256:7bbbda682ce4a462855bd8a61c5efdc1e79ab89d9e32c2610f41e6f9502e1cf4 AS go +FROM --platform=linux/${ARCH} mcr.microsoft.com/oss/go/microsoft/golang@sha256:bc4b33940e68439f4c74b70ba0d4c5a2cd5f61fd689fe36da1708ac0b1dc3a6c AS go # mcr.microsoft.com/azurelinux/base/core:3.0 -FROM --platform=linux/${ARCH} mcr.microsoft.com/azurelinux/base/core@sha256:833693619d523c23b1fe4d9c1f64a6c697e2a82f7a6ee26e1564897c3fe3fa02 AS mariner-core +FROM --platform=linux/${ARCH} mcr.microsoft.com/azurelinux/base/core@sha256:ee7f76ce3febc06e79c1a3776178b36bea62f76da43f0d58c30d5974d0ec3dbf AS mariner-core FROM go AS azure-vnet ARG OS diff --git a/cns/Dockerfile b/cns/Dockerfile index 7908371aea..9f337d1dad 100644 --- a/cns/Dockerfile +++ b/cns/Dockerfile @@ -5,13 +5,13 @@ ARG OS_VERSION ARG OS # mcr.microsoft.com/oss/go/microsoft/golang:1.24-azurelinux3.0 -FROM --platform=linux/${ARCH} mcr.microsoft.com/oss/go/microsoft/golang@sha256:7bbbda682ce4a462855bd8a61c5efdc1e79ab89d9e32c2610f41e6f9502e1cf4 AS go +FROM --platform=linux/${ARCH} mcr.microsoft.com/oss/go/microsoft/golang@sha256:bc4b33940e68439f4c74b70ba0d4c5a2cd5f61fd689fe36da1708ac0b1dc3a6c AS go # mcr.microsoft.com/azurelinux/base/core:3.0 -FROM mcr.microsoft.com/azurelinux/base/core@sha256:833693619d523c23b1fe4d9c1f64a6c697e2a82f7a6ee26e1564897c3fe3fa02 AS mariner-core +FROM mcr.microsoft.com/azurelinux/base/core@sha256:ee7f76ce3febc06e79c1a3776178b36bea62f76da43f0d58c30d5974d0ec3dbf AS mariner-core # mcr.microsoft.com/azurelinux/distroless/minimal:3.0 -FROM mcr.microsoft.com/azurelinux/distroless/minimal@sha256:d784c8233e87e8bce2e902ff59a91262635e4cabc25ec55ac0a718344514db3c AS mariner-distroless +FROM mcr.microsoft.com/azurelinux/distroless/minimal@sha256:810f96c73cfbe47690b54eb4f3cea57ec0467e413f1fd068a234746a95a1c27e AS mariner-distroless FROM --platform=linux/${ARCH} go AS builder ARG OS From 62347dcc4e4a1055327d89765da3a1f46f2443b7 Mon Sep 17 00:00:00 2001 From: sivakami Date: Thu, 11 
Dec 2025 16:02:09 -0800 Subject: [PATCH 56/64] lint fixes --- .../integration/swiftv2/helpers/az_helpers.go | 47 ++++---- .../swiftv2/longRunningCluster/datapath.go | 104 ++++++++++++------ .../datapath_connectivity_test.go | 2 +- .../datapath_private_endpoint_test.go | 4 +- 4 files changed, 102 insertions(+), 55 deletions(-) diff --git a/test/integration/swiftv2/helpers/az_helpers.go b/test/integration/swiftv2/helpers/az_helpers.go index c40718af73..54bc4cd878 100644 --- a/test/integration/swiftv2/helpers/az_helpers.go +++ b/test/integration/swiftv2/helpers/az_helpers.go @@ -2,12 +2,26 @@ package helpers import ( "context" + "errors" "fmt" "os/exec" "strings" "time" ) +var ( + // ErrPodNotRunning is returned when a pod does not reach Running state + ErrPodNotRunning = errors.New("pod did not reach Running state") + // ErrPodNoIP is returned when a pod has no IP address assigned + ErrPodNoIP = errors.New("pod has no IP address assigned") + // ErrPodNoEth1IP is returned when a pod has no eth1 IP address (delegated subnet not configured) + ErrPodNoEth1IP = errors.New("pod has no eth1 IP address (delegated subnet not configured?)") + // ErrPodContainerNotReady is returned when a pod container is not ready + ErrPodContainerNotReady = errors.New("pod container not ready") + // ErrMTPNCStuckDeletion is returned when MTPNC resources are stuck and not deleted + ErrMTPNCStuckDeletion = errors.New("MTPNC resources should have been deleted but were found") +) + func runAzCommand(cmd string, args ...string) (string, error) { out, err := exec.Command(cmd, args...).CombinedOutput() if err != nil { @@ -32,11 +46,6 @@ func GetSubnetGUID(rg, vnet, subnet string) (string, error) { return runAzCommand("az", "resource", "show", "--ids", subnetID, "--api-version", "2023-09-01", "--query", "properties.serviceAssociationLinks[0].properties.subnetId", "-o", "tsv") } -func GetSubnetToken(rg, vnet, subnet string) (string, error) { - // Optionally implement if you use subnet token override - return "", nil -} - // GetClusterNodes returns a slice of node names from a cluster using the given kubeconfig func GetClusterNodes(kubeconfig string) ([]string, error) { cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "nodes", "-o", "name") @@ -70,7 +79,7 @@ func EnsureNamespaceExists(kubeconfig, namespace string) error { cmd = exec.Command("kubectl", "--kubeconfig", kubeconfig, "create", "namespace", namespace) out, err := cmd.CombinedOutput() if err != nil { - return fmt.Errorf("failed to create namespace %s: %s\n%s", namespace, err, string(out)) + return fmt.Errorf("failed to create namespace %s: %w\nOutput: %s", namespace, err, string(out)) } return nil @@ -87,20 +96,20 @@ func DeletePod(kubeconfig, namespace, podName string) error { cmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "delete", "pod", podName, "-n", namespace, "--ignore-not-found=true") out, err := cmd.CombinedOutput() if err != nil { - if ctx.Err() == context.DeadlineExceeded { + if errors.Is(ctx.Err(), context.DeadlineExceeded) { fmt.Printf("kubectl delete pod command timed out after 90s, attempting force delete...\n") } else { - return fmt.Errorf("failed to delete pod %s in namespace %s: %s\n%s", podName, namespace, err, string(out)) + return fmt.Errorf("failed to delete pod %s in namespace %s: %w\nOutput: %s", podName, namespace, err, string(out)) } } // Wait for pod to be completely gone (critical for IP release) fmt.Printf("Waiting for pod %s to be fully removed...\n", podName) for attempt := 1; attempt <= 30; 
attempt++ { - ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second) - checkCmd := exec.CommandContext(ctx, "kubectl", "--kubeconfig", kubeconfig, "get", "pod", podName, "-n", namespace, "--ignore-not-found=true", "-o", "name") + checkCtx, checkCancel := context.WithTimeout(context.Background(), 10*time.Second) + checkCmd := exec.CommandContext(checkCtx, "kubectl", "--kubeconfig", kubeconfig, "get", "pod", podName, "-n", namespace, "--ignore-not-found=true", "-o", "name") checkOut, _ := checkCmd.CombinedOutput() - cancel() + checkCancel() if strings.TrimSpace(string(checkOut)) == "" { fmt.Printf("Pod %s fully removed after %d seconds\n", podName, attempt*2) @@ -140,7 +149,7 @@ func DeletePodNetworkInstance(kubeconfig, namespace, pniName string) error { cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "delete", "podnetworkinstance", pniName, "-n", namespace, "--ignore-not-found=true") out, err := cmd.CombinedOutput() if err != nil { - return fmt.Errorf("failed to delete PodNetworkInstance %s: %s\n%s", pniName, err, string(out)) + return fmt.Errorf("failed to delete PodNetworkInstance %s: %w\nOutput: %s", pniName, err, string(out)) } // Wait for PNI to be completely gone (it may take time for DNC to release reservations) @@ -191,7 +200,7 @@ func DeletePodNetwork(kubeconfig, pnName string) error { cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "delete", "podnetwork", pnName, "--ignore-not-found=true") out, err := cmd.CombinedOutput() if err != nil { - return fmt.Errorf("failed to delete PodNetwork %s: %s\n%s", pnName, err, string(out)) + return fmt.Errorf("failed to delete PodNetwork %s: %w\nOutput: %s", pnName, err, string(out)) } // Wait for PN to be completely gone @@ -231,7 +240,7 @@ func DeleteNamespace(kubeconfig, namespace string) error { cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "delete", "namespace", namespace, "--ignore-not-found=true") out, err := cmd.CombinedOutput() if err != nil { - return fmt.Errorf("failed to delete namespace %s: %s\n%s", namespace, err, string(out)) + return fmt.Errorf("failed to delete namespace %s: %w\nOutput: %s", namespace, err, string(out)) } // Wait for namespace to be completely gone @@ -304,7 +313,7 @@ func WaitForPodRunning(kubeconfig, namespace, podName string, maxRetries, sleepS } } - return fmt.Errorf("pod %s did not reach Running state after %d attempts", podName, maxRetries) + return fmt.Errorf("%w: pod %s after %d attempts", ErrPodNotRunning, podName, maxRetries) } // GetPodIP retrieves the IP address of a pod @@ -321,7 +330,7 @@ func GetPodIP(kubeconfig, namespace, podName string) (string, error) { ip := strings.TrimSpace(string(out)) if ip == "" { - return "", fmt.Errorf("pod %s in namespace %s has no IP address assigned", podName, namespace) + return "", fmt.Errorf("%w: pod %s in namespace %s", ErrPodNoIP, podName, namespace) } return ip, nil @@ -346,7 +355,7 @@ func GetPodDelegatedIP(kubeconfig, namespace, podName string) (string, error) { if ip != "" { return ip, nil } - return "", fmt.Errorf("pod %s in namespace %s has no eth1 IP address (delegated subnet not configured?)", podName, namespace) + return "", fmt.Errorf("%w: pod %s in namespace %s", ErrPodNoEth1IP, podName, namespace) } // Check for retryable errors: container not found, signal killed, context deadline exceeded @@ -365,7 +374,7 @@ func GetPodDelegatedIP(kubeconfig, namespace, podName string) (string, error) { return "", fmt.Errorf("failed to get eth1 IP for %s in namespace %s: %w\nOutput: %s", podName, namespace, err, 
string(out)) } - return "", fmt.Errorf("pod %s container not ready after %d attempts", podName, maxRetries) + return "", fmt.Errorf("%w: pod %s after %d attempts", ErrPodContainerNotReady, podName, maxRetries) } // ExecInPod executes a command in a pod and returns the output @@ -410,7 +419,7 @@ func VerifyNoMTPNC(kubeconfig, buildID string) error { } if len(mtpncNames) > 0 { - return fmt.Errorf("found %d MTPNC resources with build ID '%s' that should have been deleted. This may indicate stuck MTPNC deletion", len(mtpncNames), buildID) + return fmt.Errorf("%w: found %d MTPNC resources with build ID '%s'", ErrMTPNCStuckDeletion, len(mtpncNames), buildID) } } diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index a2e167a1fd..431df6c493 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -1,8 +1,9 @@ -package longRunningCluster +package longrunningcluster import ( "bytes" "context" + "errors" "fmt" "os" "os/exec" @@ -13,22 +14,47 @@ import ( "github.com/Azure/azure-container-networking/test/integration/swiftv2/helpers" ) +var ( + // ErrNoLowNICNodes is returned when no low-NIC nodes are available + ErrNoLowNICNodes = errors.New("no low-NIC nodes available") + // ErrNoHighNICNodes is returned when no high-NIC nodes are available + ErrNoHighNICNodes = errors.New("no high-NIC nodes available") + // ErrAllLowNICNodesInUse is returned when all low-NIC nodes are already in use + ErrAllLowNICNodesInUse = errors.New("all low-NIC nodes already in use") + // ErrAllHighNICNodesInUse is returned when all high-NIC nodes are already in use + ErrAllHighNICNodesInUse = errors.New("all high-NIC nodes already in use") + // ErrFailedToGenerateSASToken is returned when SAS token generation fails + ErrFailedToGenerateSASToken = errors.New("failed to generate SAS token") + // ErrSASTokenEmpty is returned when generated SAS token is empty + ErrSASTokenEmpty = errors.New("generated SAS token is empty") + // ErrSASTokenInvalid is returned when generated SAS token appears invalid + ErrSASTokenInvalid = errors.New("generated SAS token appears invalid") + // ErrPodNotRunning is returned when pod is not in running state + ErrPodNotRunning = errors.New("pod is not running") + // ErrHTTPAuthError is returned when HTTP authentication fails for private endpoint + ErrHTTPAuthError = errors.New("HTTP authentication error from private endpoint") + // ErrBlobNotFound is returned when blob is not found (404) on private endpoint + ErrBlobNotFound = errors.New("blob not found (404) on private endpoint") + // ErrUnexpectedBlobResponse is returned when blob download response is unexpected + ErrUnexpectedBlobResponse = errors.New("unexpected response from blob download (no 'Hello' or '200 OK' found)") +) + func applyTemplate(templatePath string, data interface{}, kubeconfig string) error { tmpl, err := template.ParseFiles(templatePath) if err != nil { - return err + return fmt.Errorf("failed to parse template: %w", err) } var buf bytes.Buffer - if err := tmpl.Execute(&buf, data); err != nil { - return err + if err = tmpl.Execute(&buf, data); err != nil { + return fmt.Errorf("failed to execute template: %w", err) } cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "apply", "-f", "-") cmd.Stdin = &buf out, err := cmd.CombinedOutput() if err != nil { - return fmt.Errorf("kubectl apply failed: %s\n%s", err, string(out)) + return fmt.Errorf("kubectl apply failed: %w\nOutput: 
%s", err, string(out)) } return nil @@ -130,6 +156,17 @@ type VnetSubnetInfo struct { SubnetToken string } +// isValidWorkloadType validates workload type to prevent command injection +func isValidWorkloadType(workloadType string) bool { + // Only allow alphanumeric, dash, and underscore characters + for _, r := range workloadType { + if !((r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9') || r == '-' || r == '_') { + return false + } + } + return len(workloadType) > 0 && len(workloadType) <= 64 +} + // NodePoolInfo holds information about nodes in different pools type NodePoolInfo struct { LowNicNodes []string @@ -149,11 +186,17 @@ func GetNodesByNicCount(kubeconfig string) (NodePoolInfo, error) { workloadType = "swiftv2-linux" } + // Validate workloadType to prevent command injection + if !isValidWorkloadType(workloadType) { + return NodePoolInfo{}, fmt.Errorf("invalid workload type: %s", workloadType) + } + fmt.Printf("Filtering nodes by workload-type=%s\n", workloadType) // Get nodes with low-nic capacity and matching workload-type + //#nosec G204 -- workloadType is validated above cmd := exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "nodes", - "-l", fmt.Sprintf("nic-capacity=low-nic,workload-type=%s", workloadType), "-o", "name") + "-l", "nic-capacity=low-nic,workload-type="+workloadType, "-o", "name") out, err := cmd.CombinedOutput() if err != nil { return NodePoolInfo{}, fmt.Errorf("failed to get low-nic nodes: %w\nOutput: %s", err, string(out)) @@ -167,8 +210,9 @@ func GetNodesByNicCount(kubeconfig string) (NodePoolInfo, error) { } // Get nodes with high-nic capacity and matching workload-type + //#nosec G204 -- workloadType is validated above cmd = exec.Command("kubectl", "--kubeconfig", kubeconfig, "get", "nodes", - "-l", fmt.Sprintf("nic-capacity=high-nic,workload-type=%s", workloadType), "-o", "name") + "-l", "nic-capacity=high-nic,workload-type="+workloadType, "-o", "name") out, err = cmd.CombinedOutput() if err != nil { return NodePoolInfo{}, fmt.Errorf("failed to get high-nic nodes: %w\nOutput: %s", err, string(out)) @@ -289,16 +333,11 @@ func GetOrFetchVnetSubnetInfo(rg, vnetName, subnetName string, cache map[string] return VnetSubnetInfo{}, fmt.Errorf("failed to get Subnet ARM ID: %w", err) } - subnetToken, err := helpers.GetSubnetToken(rg, vnetName, subnetName) - if err != nil { - return VnetSubnetInfo{}, fmt.Errorf("failed to get Subnet Token: %w", err) - } - info := VnetSubnetInfo{ VnetGUID: vnetGUID, SubnetGUID: subnetGUID, SubnetARMID: subnetARMID, - SubnetToken: subnetToken, + SubnetToken: "", // Token can be fetched if needed } cache[key] = info @@ -372,7 +411,7 @@ func CreateScenarioResources(scenario PodScenario, testScenarios TestScenarios) if scenario.NodeSelector == "low-nic" { if len(nodeInfo.LowNicNodes) == 0 { - return fmt.Errorf("scenario %s: no low-NIC nodes available", scenario.Name) + return fmt.Errorf("%w: scenario %s", ErrNoLowNICNodes, scenario.Name) } // Find first unused node in the pool (low-NIC nodes can only handle one pod) targetNode = "" @@ -384,11 +423,11 @@ func CreateScenarioResources(scenario PodScenario, testScenarios TestScenarios) } } if targetNode == "" { - return fmt.Errorf("scenario %s: all low-NIC nodes already in use", scenario.Name) + return fmt.Errorf("%w: scenario %s", ErrAllLowNICNodesInUse, scenario.Name) } } else { // "high-nic" if len(nodeInfo.HighNicNodes) == 0 { - return fmt.Errorf("scenario %s: no high-NIC nodes available", scenario.Name) + return fmt.Errorf("%w: scenario %s", 
ErrNoHighNICNodes, scenario.Name) } // Find first unused node in the pool targetNode = "" @@ -400,12 +439,12 @@ func CreateScenarioResources(scenario PodScenario, testScenarios TestScenarios) } } if targetNode == "" { - return fmt.Errorf("scenario %s: all high-NIC nodes already in use", scenario.Name) + return fmt.Errorf("%w: scenario %s", ErrAllHighNICNodesInUse, scenario.Name) } } // Step 6: Create pod - podName := fmt.Sprintf("pod-%s", scenario.PodNameSuffix) + podName := "pod-" + scenario.PodNameSuffix err = CreatePodResource(resources, podName, targetNode) if err != nil { return fmt.Errorf("scenario %s: %w", scenario.Name, err) @@ -426,7 +465,7 @@ func DeleteScenarioResources(scenario PodScenario, buildID string) error { subnetNameSafe := strings.ReplaceAll(scenario.SubnetName, "_", "-") pnName := fmt.Sprintf("pn-%s-%s-%s", buildID, vnetShort, subnetNameSafe) pniName := fmt.Sprintf("pni-%s-%s-%s", buildID, vnetShort, subnetNameSafe) - podName := fmt.Sprintf("pod-%s", scenario.PodNameSuffix) + podName := "pod-" + scenario.PodNameSuffix // Delete pod err := helpers.DeletePod(kubeconfig, pnName, podName) @@ -479,7 +518,7 @@ func DeleteAllScenarios(testScenarios TestScenarios) error { vnetShort = strings.ReplaceAll(vnetShort, "_", "-") subnetNameSafe := strings.ReplaceAll(scenario.SubnetName, "_", "-") pnName := fmt.Sprintf("pn-%s-%s-%s", testScenarios.BuildID, vnetShort, subnetNameSafe) - podName := fmt.Sprintf("pod-%s", scenario.PodNameSuffix) + podName := "pod-" + scenario.PodNameSuffix fmt.Printf("Deleting pod for scenario: %s\n", scenario.Name) err := helpers.DeletePod(kubeconfig, pnName, podName) @@ -611,7 +650,7 @@ type ConnectivityTest struct { } // RunConnectivityTest tests HTTP connectivity between two pods -func RunConnectivityTest(test ConnectivityTest, rg, buildId string) error { +func RunConnectivityTest(test ConnectivityTest) error { // Get kubeconfig for the source cluster sourceKubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", test.Cluster) @@ -640,7 +679,6 @@ func RunConnectivityTest(test ConnectivityTest, rg, buildId string) error { curlCmd := fmt.Sprintf("curl --http0.9 --interface eth1 -m 3 http://%s:8080/", destIP) output, err := helpers.ExecInPod(sourceKubeconfig, test.SourceNamespace, test.SourcePod, curlCmd) - // Check if we received data even if curl timed out (exit code 28) // Netcat closes the connection without proper HTTP close, causing curl to timeout // But if we got the expected response, the connectivity test is successful @@ -720,7 +758,7 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) out, err = cmd.CombinedOutput() if err != nil { - return "", fmt.Errorf("failed to generate SAS token (both account key and user delegation): %s\n%s", err, string(out)) + return "", fmt.Errorf("%w (both account key and user delegation): %s\n%s", ErrFailedToGenerateSASToken, err, string(out)) } sasToken = strings.TrimSpace(string(out)) @@ -729,7 +767,7 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) } if sasToken == "" { - return "", fmt.Errorf("generated SAS token is empty") + return "", ErrSASTokenEmpty } // Remove any surrounding quotes that might be added by some shells @@ -737,7 +775,7 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) // Validate SAS token format - should start with typical SAS parameters if !strings.Contains(sasToken, "sv=") && !strings.Contains(sasToken, "sig=") { - return "", fmt.Errorf("generated SAS token appears invalid (missing sv= or sig=): 
%s", sasToken) + return "", fmt.Errorf("%w (missing sv= or sig=): %s", ErrSASTokenInvalid, sasToken) } // Remove any surrounding quotes that might be added by some shells @@ -752,14 +790,14 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) } // GetStoragePrivateEndpoint retrieves the private IP address of a storage account's private endpoint -func GetStoragePrivateEndpoint(resourceGroup, storageAccountName string) (string, error) { +func GetStoragePrivateEndpoint(storageAccountName string) (string, error) { // Return the storage account blob endpoint FQDN // This will resolve to the private IP via Private DNS Zone - return fmt.Sprintf("%s.blob.core.windows.net", storageAccountName), nil + return storageAccountName + ".blob.core.windows.net", nil } // RunPrivateEndpointTest tests connectivity from a pod to a private endpoint (storage account) -func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) error { +func RunPrivateEndpointTest(test ConnectivityTest) error { // Get kubeconfig for the cluster kubeconfig := fmt.Sprintf("/tmp/%s.kubeconfig", test.SourceCluster) @@ -775,7 +813,7 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) } podStatus := strings.TrimSpace(string(statusOut)) if podStatus != "Running" { - return fmt.Errorf("pod %s is not running (status: %s)", test.SourcePodName, podStatus) + return fmt.Errorf("%w: pod %s (status: %s)", ErrPodNotRunning, test.SourcePodName, podStatus) } fmt.Printf("Pod is running\n") @@ -809,10 +847,10 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) if err != nil { // Check for HTTP errors in wget output if strings.Contains(output, "ERROR 403") || strings.Contains(output, "ERROR 401") { - return fmt.Errorf("HTTP authentication error from private endpoint\nOutput: %s", truncateString(output, 500)) + return fmt.Errorf("%w\nOutput: %s", ErrHTTPAuthError, truncateString(output, 500)) } if strings.Contains(output, "ERROR 404") { - return fmt.Errorf("blob not found (404) on private endpoint\nOutput: %s", truncateString(output, 500)) + return fmt.Errorf("%w\nOutput: %s", ErrBlobNotFound, truncateString(output, 500)) } return fmt.Errorf("private endpoint connectivity test failed: %w\nOutput: %s", err, truncateString(output, 500)) } @@ -823,7 +861,7 @@ func RunPrivateEndpointTest(testScenarios TestScenarios, test ConnectivityTest) return nil } - return fmt.Errorf("unexpected response from blob download (no 'Hello' or '200 OK' found)\nOutput: %s", truncateString(output, 500)) + return fmt.Errorf("%w\nOutput: %s", ErrUnexpectedBlobResponse, truncateString(output, 500)) } // ExecInPodWithTimeout executes a command in a pod with a custom timeout @@ -835,7 +873,7 @@ func ExecInPodWithTimeout(kubeconfig, namespace, podName, command string, timeou "-n", namespace, "--", "sh", "-c", command) out, err := cmd.CombinedOutput() if err != nil { - if ctx.Err() == context.DeadlineExceeded { + if errors.Is(ctx.Err(), context.DeadlineExceeded) { return string(out), fmt.Errorf("command timed out after %v in pod %s: %w", timeout, podName, ctx.Err()) } return string(out), fmt.Errorf("failed to exec in pod %s in namespace %s: %w", podName, namespace, err) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go b/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go index 7822d04b63..96f398bcf7 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go +++ 
b/test/integration/swiftv2/longRunningCluster/datapath_connectivity_test.go @@ -134,7 +134,7 @@ var _ = ginkgo.Describe("Datapath Connectivity Tests", func() { for _, test := range connectivityTests { ginkgo.By(fmt.Sprintf("Test: %s - %s", test.Name, test.Description)) - err := RunConnectivityTest(test, rg, buildId) + err := RunConnectivityTest(test) if test.ShouldFail { // This test should fail (NSG blocked or customer isolation) diff --git a/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go b/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go index 6fb0131ccb..6eca79ce4d 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go +++ b/test/integration/swiftv2/longRunningCluster/datapath_private_endpoint_test.go @@ -48,7 +48,7 @@ var _ = ginkgo.Describe("Private Endpoint Tests", func() { storageAccountName := storageAccount1 ginkgo.By(fmt.Sprintf("Getting private endpoint for storage account: %s", storageAccountName)) - storageEndpoint, err := GetStoragePrivateEndpoint(testScenarios.ResourceGroup, storageAccountName) + storageEndpoint, err := GetStoragePrivateEndpoint(storageAccountName) gomega.Expect(err).To(gomega.BeNil(), "Failed to get storage account private endpoint") gomega.Expect(storageEndpoint).NotTo(gomega.BeEmpty(), "Storage account private endpoint is empty") @@ -117,7 +117,7 @@ var _ = ginkgo.Describe("Private Endpoint Tests", func() { return "SUCCESS" }())) - err := RunPrivateEndpointTest(testScenarios, test) + err := RunPrivateEndpointTest(test) if test.ShouldFail { // Expected to fail (e.g., tenant isolation) From 58f35f3cbc0d35fcf9d077331782ca93296a3a03 Mon Sep 17 00:00:00 2001 From: sivakami Date: Thu, 11 Dec 2025 16:59:28 -0800 Subject: [PATCH 57/64] Lint fix. 
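The lint changes above follow one pattern throughout: package-level sentinel errors created with errors.New, wrapped with %w so call sites keep their formatted detail, and matched with errors.Is (including for context.DeadlineExceeded) instead of string or == comparisons. A condensed sketch of that pattern, with names shortened and the polling body elided for illustration:

package example

import (
	"errors"
	"fmt"
)

// ErrPodNotRunning is a sentinel callers can match on with errors.Is,
// independent of the formatted detail wrapped around it.
var ErrPodNotRunning = errors.New("pod did not reach Running state")

func waitForPod(podName string, attempts int) error {
	// ... polling elided ...
	return fmt.Errorf("%w: pod %s after %d attempts", ErrPodNotRunning, podName, attempts)
}

// isPodNotRunning survives any extra wrapping because errors.Is walks the %w chain.
func isPodNotRunning(err error) bool {
	return errors.Is(err, ErrPodNotRunning)
}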
--- test/integration/swiftv2/longRunningCluster/datapath.go | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 431df6c493..8b511774c0 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -37,6 +37,8 @@ var ( ErrBlobNotFound = errors.New("blob not found (404) on private endpoint") // ErrUnexpectedBlobResponse is returned when blob download response is unexpected ErrUnexpectedBlobResponse = errors.New("unexpected response from blob download (no 'Hello' or '200 OK' found)") + // ErrInvalidWorkloadType is returned when workload type is invalid + ErrInvalidWorkloadType = errors.New("invalid workload type") ) func applyTemplate(templatePath string, data interface{}, kubeconfig string) error { @@ -164,7 +166,7 @@ func isValidWorkloadType(workloadType string) bool { return false } } - return len(workloadType) > 0 && len(workloadType) <= 64 + return workloadType != "" && len(workloadType) <= 64 } // NodePoolInfo holds information about nodes in different pools @@ -188,7 +190,7 @@ func GetNodesByNicCount(kubeconfig string) (NodePoolInfo, error) { // Validate workloadType to prevent command injection if !isValidWorkloadType(workloadType) { - return NodePoolInfo{}, fmt.Errorf("invalid workload type: %s", workloadType) + return NodePoolInfo{}, fmt.Errorf("%w: %s", ErrInvalidWorkloadType, workloadType) } fmt.Printf("Filtering nodes by workload-type=%s\n", workloadType) @@ -758,7 +760,7 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) out, err = cmd.CombinedOutput() if err != nil { - return "", fmt.Errorf("%w (both account key and user delegation): %s\n%s", ErrFailedToGenerateSASToken, err, string(out)) + return "", fmt.Errorf("%w (both account key and user delegation): %w\n%s", ErrFailedToGenerateSASToken, err, string(out)) } sasToken = strings.TrimSpace(string(out)) From 0d161af253f3a2e1c254a1132394edac1548dac1 Mon Sep 17 00:00:00 2001 From: sivakami Date: Thu, 11 Dec 2025 19:14:05 -0800 Subject: [PATCH 58/64] make dockerfiles. --- cns/Dockerfile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/cns/Dockerfile b/cns/Dockerfile index b7b68cbd2b..9f337d1dad 100644 --- a/cns/Dockerfile +++ b/cns/Dockerfile @@ -38,4 +38,4 @@ FROM --platform=windows/${ARCH} mcr.microsoft.com/oss/kubernetes/windows-host-pr FROM hpc as windows COPY --from=builder /go/bin/azure-cns /azure-cns.exe ENTRYPOINT ["azure-cns.exe"] -EXPOSE 10090 \ No newline at end of file +EXPOSE 10090 From 86d1c373e64f27ce7512985bc37a552a07299613 Mon Sep 17 00:00:00 2001 From: sivakami Date: Thu, 11 Dec 2025 19:23:01 -0800 Subject: [PATCH 59/64] fix generateSaStoken method. --- .../swiftv2/longRunningCluster/datapath.go | 25 +------------------ 1 file changed, 1 insertion(+), 24 deletions(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index 8b511774c0..d4000599f3 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -705,6 +705,7 @@ func truncateString(s string, maxLen int) string { return s[:maxLen] + "..." 
} +// GenerateStorageSASToken generates a SAS token for a blob in a storage account // GenerateStorageSASToken generates a SAS token for a blob in a storage account func GenerateStorageSASToken(storageAccountName, containerName, blobName string) (string, error) { // Calculate expiry time: 7 days from now (Azure CLI limit) @@ -722,13 +723,6 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) out, err := cmd.CombinedOutput() sasToken := strings.TrimSpace(string(out)) - // Check if account key method produced valid token - accountKeyWorked := err == nil && !strings.Contains(sasToken, "WARNING") && - !strings.Contains(sasToken, "ERROR") && (strings.Contains(sasToken, "sv=") || strings.Contains(sasToken, "sig=")) - - if !accountKeyWorked { - sasToken := strings.TrimSpace(string(out)) - // Check if account key method produced valid token accountKeyWorked := err == nil && !strings.Contains(sasToken, "WARNING") && !strings.Contains(sasToken, "ERROR") && (strings.Contains(sasToken, "sv=") || strings.Contains(sasToken, "sig=")) @@ -741,12 +735,6 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) fmt.Printf("Account key SAS generation failed (no credentials): %s\n", sasToken) } - if err != nil { - fmt.Printf("Account key SAS generation failed (error): %s\n", string(out)) - } else { - fmt.Printf("Account key SAS generation failed (no credentials): %s\n", sasToken) - } - cmd = exec.Command("az", "storage", "blob", "generate-sas", "--account-name", storageAccountName, "--container-name", containerName, @@ -757,15 +745,12 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) "--as-user", "--output", "tsv") - out, err = cmd.CombinedOutput() if err != nil { return "", fmt.Errorf("%w (both account key and user delegation): %w\n%s", ErrFailedToGenerateSASToken, err, string(out)) } sasToken = strings.TrimSpace(string(out)) - - sasToken = strings.TrimSpace(string(out)) } if sasToken == "" { @@ -780,14 +765,6 @@ func GenerateStorageSASToken(storageAccountName, containerName, blobName string) return "", fmt.Errorf("%w (missing sv= or sig=): %s", ErrSASTokenInvalid, sasToken) } - // Remove any surrounding quotes that might be added by some shells - sasToken = strings.Trim(sasToken, "\"'") - - // Validate SAS token format - should start with typical SAS parameters - if !strings.Contains(sasToken, "sv=") && !strings.Contains(sasToken, "sig=") { - return "", fmt.Errorf("generated SAS token appears invalid (missing sv= or sig=): %s", sasToken) - } - return sasToken, nil } From d71032f55644139f68b89646d930c3c457c1ca24 Mon Sep 17 00:00:00 2001 From: sivakami Date: Thu, 11 Dec 2025 21:09:43 -0800 Subject: [PATCH 60/64] linter fix. 
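GenerateStorageSASToken, after the cleanup above, keeps a single fallback path: ask az storage blob generate-sas via the account key first, and only when that output is empty, carries WARNING/ERROR text, or lacks the sv=/sig= SAS parameters, retry as a user-delegation SAS with --as-user, then trim stray quotes and re-validate. A reduced sketch of that shape (helper names are invented, the flag list is abbreviated into extraArgs, and --auth-mode login on the fallback is an assumption not shown in the hunk):

package example

import (
	"errors"
	"fmt"
	"os/exec"
	"strings"
)

var errSASInvalid = errors.New("generated SAS token appears invalid")

// looksLikeSAS reports whether CLI output resembles a real SAS token rather
// than a WARNING/ERROR message: non-empty and carrying sv= or sig= parameters.
func looksLikeSAS(out string) bool {
	tok := strings.Trim(strings.TrimSpace(out), "\"'")
	return tok != "" &&
		!strings.Contains(tok, "WARNING") &&
		!strings.Contains(tok, "ERROR") &&
		(strings.Contains(tok, "sv=") || strings.Contains(tok, "sig="))
}

// generateSAS tries the account-key path first and falls back to a
// user-delegation SAS when that output is unusable. extraArgs stands in for
// --container-name, --name, --permissions and --expiry.
func generateSAS(account string, extraArgs ...string) (string, error) {
	args := append([]string{"storage", "blob", "generate-sas",
		"--account-name", account, "--output", "tsv"}, extraArgs...)

	out, err := exec.Command("az", args...).CombinedOutput()
	if err != nil || !looksLikeSAS(string(out)) {
		// Retry as a user-delegation SAS issued for the signed-in identity.
		out, err = exec.Command("az", append(args, "--as-user", "--auth-mode", "login")...).CombinedOutput()
		if err != nil {
			return "", fmt.Errorf("SAS generation failed: %w\n%s", err, out)
		}
	}
	tok := strings.Trim(strings.TrimSpace(string(out)), "\"'")
	if !looksLikeSAS(tok) {
		return "", fmt.Errorf("%w: %s", errSASInvalid, tok)
	}
	return tok, nil
}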
--- test/integration/swiftv2/helpers/az_helpers.go | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/test/integration/swiftv2/helpers/az_helpers.go b/test/integration/swiftv2/helpers/az_helpers.go index 54bc4cd878..89a903ed63 100644 --- a/test/integration/swiftv2/helpers/az_helpers.go +++ b/test/integration/swiftv2/helpers/az_helpers.go @@ -20,6 +20,8 @@ var ( ErrPodContainerNotReady = errors.New("pod container not ready") // ErrMTPNCStuckDeletion is returned when MTPNC resources are stuck and not deleted ErrMTPNCStuckDeletion = errors.New("MTPNC resources should have been deleted but were found") + // ErrPodNotScheduled is returned when pod was not scheduled (no node assigned) + ErrPodNotScheduled = errors.New("pod was not scheduled (no node assigned)") ) func runAzCommand(cmd string, args ...string) (string, error) { @@ -292,7 +294,7 @@ func WaitForPodScheduled(kubeconfig, namespace, podName string, maxRetries, slee } } - return fmt.Errorf("pod %s was not scheduled (no node assigned) after %d attempts", podName, maxRetries) + return fmt.Errorf("%w: pod %s after %d attempts", ErrPodNotScheduled, podName, maxRetries) } // WaitForPodRunning waits for a pod to reach Running state with retries From 89c79ad84a85ea7608806937aafc5c17e4787979 Mon Sep 17 00:00:00 2001 From: sivakami Date: Thu, 11 Dec 2025 21:15:44 -0800 Subject: [PATCH 61/64] Fix display name --- .../template/long-running-pipeline-template.yaml | 1 - 1 file changed, 1 deletion(-) diff --git a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml index bd9c9534d9..1302e65a07 100644 --- a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml +++ b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml @@ -67,7 +67,6 @@ stages: steps: - checkout: self - task: AzureCLI@2 - displayName: "Create AKS clusters" displayName: "Create AKS clusters" inputs: azureSubscription: ${{ parameters.serviceConnection }} From 743a3f8696f0d8f19922b72de47e7f6e12b63147 Mon Sep 17 00:00:00 2001 From: sivakami Date: Thu, 11 Dec 2025 21:31:04 -0800 Subject: [PATCH 62/64] fix package name. --- test/integration/swiftv2/longRunningCluster/datapath.go | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/test/integration/swiftv2/longRunningCluster/datapath.go b/test/integration/swiftv2/longRunningCluster/datapath.go index d4000599f3..cf18918ef4 100644 --- a/test/integration/swiftv2/longRunningCluster/datapath.go +++ b/test/integration/swiftv2/longRunningCluster/datapath.go @@ -1,4 +1,4 @@ -package longrunningcluster +package longRunningCluster import ( "bytes" From e21cbec204b03bfce1a6fe235c60a9b3d75aeeb6 Mon Sep 17 00:00:00 2001 From: sivakami Date: Thu, 11 Dec 2025 22:17:34 -0800 Subject: [PATCH 63/64] Enable scale tests. 
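The scale job below, like the create and delete jobs, passes configuration through the RG, BUILD_ID and WORKLOAD_TYPE environment variables; on the Go side the workload type defaults to swiftv2-linux and is validated character by character before being spliced into the kubectl label selector, which is what the #nosec annotations in datapath.go rely on. A condensed sketch of that read-validate-use sequence (helper names shortened for illustration):

package main

import (
	"errors"
	"fmt"
	"os"
)

var errInvalidWorkloadType = errors.New("invalid workload type")

// validWorkloadType allows only [A-Za-z0-9_-] and 1-64 characters, so the
// value is safe to splice into a kubectl label selector.
func validWorkloadType(wt string) bool {
	if wt == "" || len(wt) > 64 {
		return false
	}
	for _, r := range wt {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r >= '0' && r <= '9', r == '-', r == '_':
		default:
			return false
		}
	}
	return true
}

// workloadSelector reads WORKLOAD_TYPE (defaulting to swiftv2-linux) and
// returns the low-NIC node label selector used by the suites.
func workloadSelector() (string, error) {
	wt := os.Getenv("WORKLOAD_TYPE")
	if wt == "" {
		wt = "swiftv2-linux"
	}
	if !validWorkloadType(wt) {
		return "", fmt.Errorf("%w: %s", errInvalidWorkloadType, wt)
	}
	return "nic-capacity=low-nic,workload-type=" + wt, nil
}

func main() {
	sel, err := workloadSelector()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("kubectl get nodes -l", sel)
}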
--- .../long-running-pipeline-template.yaml | 73 +++++++++++++++++++ 1 file changed, 73 insertions(+) diff --git a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml index 1302e65a07..1a004f53d0 100644 --- a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml +++ b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml @@ -365,6 +365,79 @@ stages: # ------------------------------------------------------------ # Job 4: Scale Tests with Device Plugin # ------------------------------------------------------------ + - job: ScaleTest + displayName: "Scale Test - Create and Delete 15 Pods with Device Plugin" + dependsOn: + - CreateTestResources + - ConnectivityTests + - PrivateEndpointTests + condition: succeeded() + timeoutInMinutes: 90 + pool: + vmImage: ubuntu-latest + steps: + - checkout: self + + - task: GoTool@0 + displayName: "Use Go 1.22.5" + inputs: + version: "1.22.5" + + - task: AzureCLI@2 + displayName: "Run Scale Test (Create and Delete)" + inputs: + azureSubscription: ${{ parameters.serviceConnection }} + scriptType: bash + scriptLocation: inlineScript + inlineScript: | + echo "==> Installing Ginkgo CLI" + go install github.com/onsi/ginkgo/v2/ginkgo@latest + + echo "==> Adding Go bin to PATH" + export PATH=$PATH:$(go env GOPATH)/bin + + echo "==> Downloading Go dependencies" + go mod download + + echo "==> Setting up kubeconfig for cluster aks-1" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-1 \ + --file /tmp/aks-1.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Setting up kubeconfig for cluster aks-2" + az aks get-credentials \ + --resource-group $(rgName) \ + --name aks-2 \ + --file /tmp/aks-2.kubeconfig \ + --overwrite-existing \ + --admin + + echo "==> Verifying cluster aks-1 connectivity" + kubectl --kubeconfig /tmp/aks-1.kubeconfig get nodes + + echo "==> Verifying cluster aks-2 connectivity" + kubectl --kubeconfig /tmp/aks-2.kubeconfig get nodes + + echo "==> Running scale test: Create 15 pods with device plugin across both clusters" + echo "NOTE: Pods are auto-scheduled by Kubernetes scheduler and device plugin" + echo " - 8 pods in aks-1 (cx_vnet_a1/s1)" + echo " - 7 pods in aks-2 (cx_vnet_a2/s1)" + echo "Pod limits per PodNetwork/PodNetworkInstance:" + echo " - Subnet IP address capacity" + echo " - Node capacity (typically 250 pods per node)" + echo " - Available device plugin resources (NICs per node)" + export RG="$(rgName)" + export BUILD_ID="$(rgName)" + export WORKLOAD_TYPE="swiftv2-linux" + cd ./test/integration/swiftv2/longRunningCluster + ginkgo -v -trace --timeout=1h --tags=scale_test + + # ------------------------------------------------------------ + # Job 5: Delete Test Resources + # ------------------------------------------------------------ - job: DeleteTestResources displayName: "Delete PodNetwork, PNI, and Pods" dependsOn: From bb13603b545a8f3a046cf8e09166519253fb1ebf Mon Sep 17 00:00:00 2001 From: sivakami Date: Thu, 11 Dec 2025 22:46:42 -0800 Subject: [PATCH 64/64] set pipeline job order. 
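Scenario placement in datapath.go keeps the low-NIC and high-NIC pools separate and takes the first node of the requested pool that has not already been claimed, returning the typed pool-exhausted sentinels otherwise; the core of that selection reduces to a helper like the one below (a sketch only: the name and shape of the "used" set are assumptions, and the real code also distinguishes the two pools):

package main

import (
	"errors"
	"fmt"
)

var errPoolExhausted = errors.New("all nodes in pool already in use")

// pickUnusedNode returns the first node in pool not yet claimed in used and
// marks it, so a later scenario cannot land on the same node.
func pickUnusedNode(pool []string, used map[string]bool) (string, error) {
	for _, n := range pool {
		if !used[n] {
			used[n] = true
			return n, nil
		}
	}
	return "", fmt.Errorf("%w (pool size %d)", errPoolExhausted, len(pool))
}

func main() {
	used := map[string]bool{}
	pool := []string{"node-low-nic-0", "node-low-nic-1"} // placeholder node names
	for i := 0; i < 3; i++ {
		fmt.Println(pickUnusedNode(pool, used))
	}
}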
--- .../template/long-running-pipeline-template.yaml | 1 + 1 file changed, 1 insertion(+) diff --git a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml index 1a004f53d0..a635a9998b 100644 --- a/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml +++ b/.pipelines/swiftv2-long-running/template/long-running-pipeline-template.yaml @@ -444,6 +444,7 @@ stages: - CreateTestResources - ConnectivityTests - PrivateEndpointTests + - ScaleTest # Always run cleanup, even if previous jobs failed condition: always() timeoutInMinutes: 60