Updating Azure Plugin to Support Flexible VMSS #1106

benjamin-lykins · 2025-06-09T02:11:07Z

Reviewed contributor guide.
One note: this may not be a small first PR. It ended up noisier than expected because I migrated off the legacy Azure SDK.

Changes

Updated from the deprecated Azure Compute SDK package to the current one:

- "github.com/Azure/azure-sdk-for-go/services/compute/mgmt/2020-06-01/compute"
+ "github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute"

Updated status checks

Supports both Flexible and Uniform scale sets. The plugin queries the VMSS for its orchestration mode, then runs the matching status, scale-in, and scale-out logic.
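A minimal sketch of that dispatch, for readers who have not opened the diff. The names `modeUniform`, `modeFlexible`, `scaler`, and `scalerFor` are hypothetical and only illustrate the shape; the real plugin reads the mode from the SDK's VMSS response and uses its own types.

```go
package main

import "fmt"

// Hypothetical mode constants for illustration only.
const (
	modeUniform  = "Uniform"
	modeFlexible = "Flexible"
)

// scaler is a hypothetical stand-in for the per-mode scaling logic.
type scaler interface {
	ScaleOut(count int64) error
}

type uniformScaler struct{}
type flexibleScaler struct{}

func (uniformScaler) ScaleOut(count int64) error {
	fmt.Println("uniform path: scale out to", count)
	return nil
}

func (flexibleScaler) ScaleOut(count int64) error {
	fmt.Println("flexible path: scale out to", count)
	return nil
}

// scalerFor picks the implementation that matches the mode reported
// by the scale set, and rejects anything it does not recognize.
func scalerFor(mode string) (scaler, error) {
	switch mode {
	case modeUniform:
		return uniformScaler{}, nil
	case modeFlexible:
		return flexibleScaler{}, nil
	default:
		return nil, fmt.Errorf("unsupported orchestration mode %q", mode)
	}
}

func main() {
	s, err := scalerFor(modeFlexible)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	_ = s.ScaleOut(3)
}
```

Returning an error for an unknown mode keeps a future third mode from silently taking the wrong path.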

Flexible scale sets are a shift from Uniform: the instance view of an instance cannot be queried through the VMSS API. Instead, the plugin grabs the list of instances from the VMSS, then uses the VM (not VMSS) API to get each instance's status.
When getting instance states, since a Flexible VMSS can have up to 2000 instances, I used goroutines to iterate through them more efficiently. Concurrency is hardcoded at 5. It might be worth letting the end user set this limit, but in my testing 5 concurrent requests against 100 instances took ~1 second, versus ~30 seconds when pulling instance state sequentially for the same count. I also tested 10 concurrent requests, but that only went from ~1 second to a little less than a second, so not much gained.
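The bounded fan-out described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in the PR: `fetchState` and `statesConcurrently` are hypothetical names, and the fetch callback stands in for the per-VM instance view call.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchState is a hypothetical stand-in for a per-VM instance view lookup.
type fetchState func(name string) (string, error)

// statesConcurrently looks up the state of every VM with at most
// limit lookups in flight, and returns the first error it saw, if any.
func statesConcurrently(names []string, limit int, fetch fetchState) (map[string]string, error) {
	if limit < 1 {
		limit = 1
	}
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
	)
	states := make(map[string]string, len(names))
	sem := make(chan struct{}, limit) // buffered channel used as a semaphore

	for _, name := range names {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit lookups are already running
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()

			state, err := fetch(name)

			// The map and firstErr are shared, so guard them with the mutex.
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = fmt.Errorf("instance %s: %w", name, err)
				}
				return
			}
			states[name] = state
		}(name)
	}
	wg.Wait()
	return states, firstErr
}

func main() {
	names := []string{"vm-0", "vm-1", "vm-2"}
	states, err := statesConcurrently(names, 5, func(name string) (string, error) {
		return "PowerState/running", nil
	})
	fmt.Println(len(states), err)
}
```

A fixed limit trades throughput against API rate limits. The commit list further down records that the goroutines were later removed because of rate limiting, so treat this as a picture of the first iteration only.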

The only existing Uniform-specific function I updated was processInstanceView.

I can double-check why, but I believe one of the attributes was not available in the current API.

Testing

Admittedly, I used Copilot to refactor Test_processInstanceView for the new API.

Testing

Flexible VMSS

Scale In
Scale Out

Uniform VMSS

Scale In
Scale Out

tgross

This is great work @benjamin-lykins. I've left a bunch of comments -- some on small details and some on design. For the smaller comments like contents of debug logging, try to take a pass over the whole PR with those in mind. (Edit: it looks like GitHub is unhelpfully doing the "hidden conversations" thing so make sure you expand that as that includes some of the major design items)

We've added a lot of new code here and it doesn't seem like there's a lot of test coverage. Unfortunately we don't have integration tests so that's challenging, but maybe as you break up some of those large methods I pointed out there'd be a way to extract some of the logic into testable chunks.

go.mod

plugins/builtin/target/azure-vmss/plugin/azure.go

plugins/builtin/target/azure-vmss/plugin/plugin.go

tgross

Making good progress!

plugins/test/noop-apm/go.mod

plugins/builtin/target/azure-vmss/plugin/azure.go

tgross · 2025-06-20T16:06:46Z

plugins/builtin/target/azure-vmss/plugin/azure.go

+					} else {
+						t.logger.Debug("skipping instance", "name", *vm.Name, "instance_id", *vm.InstanceID, "code", *s.Code)
+					}


This log line seems wrong... we're not skipping the instance here, we're iterating to the next InstanceView.Status and will print this log line for each status we find.

benjamin-lykins · 2025-06-20

I ran through this a bit more in my head. This was a carryover from the previous code. I had tried to simplify it since it had multiple nested if statements, but I ended up reverting to what Luiz had previously:

    if vm.Properties != nil && vm.Properties.InstanceView != nil && vm.Properties.InstanceView.Statuses != nil {
        for _, s := range vm.Properties.InstanceView.Statuses {
            if s.Code != nil && strings.HasPrefix(*s.Code, "PowerState/") {
                if *s.Code == "PowerState/running" {
                    t.logger.Debug("found healthy instance", "name", *vm.Name, "instance_id", *vm.InstanceID)
                    vmNames = append(vmNames, *vm.Name)
                    break
                } else {
                    t.logger.Debug("skipping instance - power state is not running", "name", *vm.Name, "instance_id", *vm.InstanceID, "code", *s.Code)
                }
            }
        }
    }

tgross · 2025-10-30

I think this still has the same problem. Suppose I have the following set of InstanceViewStatus:

    []InstanceViewStatus{
        { Code: pointer.Of("ProvisioningState/succeeded"), ...},
        { Code: pointer.Of("Whatever/foo"), ...},
        { Code: pointer.Of("PowerState/running"), ...},
    }

In this case, we'll log that we're skipping the instance twice and then not skip it, because we eventually find a "PowerState/running" code. The number of times we log ends up being dependent on the order that the statuses get returned (which I can't find anywhere in the Azure docs). So we only want to log if we never find that "PowerState/running" code. Similar to how you've got isFlexibleVMReady working.

Or we can drop this else block entirely, which I think would also be fine.
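One way to log only when no "PowerState/running" code is ever found is to scan all statuses first and decide afterwards. The sketch below is an assumption-laden illustration: `isRunning` and `ptr` are hypothetical helpers, and plain `*string` codes stand in for the SDK's status type.

```go
package main

import (
	"fmt"
	"strings"
)

// isRunning reports whether any status code equals "PowerState/running".
// When none does, it also returns the last PowerState code it saw so the
// caller can log a single message per instance.
func isRunning(codes []*string) (bool, string) {
	lastPower := ""
	for _, c := range codes {
		if c == nil || !strings.HasPrefix(*c, "PowerState/") {
			continue // not a power state, so it says nothing about skipping
		}
		if *c == "PowerState/running" {
			return true, *c
		}
		lastPower = *c
	}
	return false, lastPower
}

// ptr is a small helper for building test data.
func ptr(s string) *string { return &s }

func main() {
	codes := []*string{
		ptr("ProvisioningState/succeeded"),
		ptr("Whatever/foo"),
		ptr("PowerState/running"),
	}
	if ok, code := isRunning(codes); ok {
		fmt.Println("found healthy instance")
	} else {
		fmt.Println("skipping instance - power state is not running, code:", code)
	}
}
```

Because the decision is made after the loop, the log output no longer depends on the order in which the statuses are returned.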

plugins/builtin/target/azure-vmss/plugin/azure.go

benjamin-lykins · 2025-06-23T19:08:48Z

@tgross Replied back to your latest reviews and put in a series of commits to address the recommendations.

I'm working to clean up the functions that get status and IDs for Flexible VMSS. I may consolidate those two functions, or generalize whatever parts of them I can.

plugins/builtin/target/azure-vmss/plugin/azure.go

plugins/test/noop-apm/go.mod

plugins/builtin/target/azure-vmss/plugin/azure.go

plugins/builtin/target/azure-vmss/plugin/plugin.go

benjamin-lykins · 2025-07-25T13:59:30Z

Azure API rate limiting is no longer an issue, but I'm now bumping into a Nomad rate limiting issue. The other plugins have a retry function which I'll look to incorporate. Probably not in this PR, but in a future one I will add better handling for the case where the node drain occurs on some nodes but not all, so those nodes are re-enabled for allocations.
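For context, a retry wrapper of the kind mentioned above could look roughly like this. It is a hypothetical sketch, not the helper used by the other plugins: `retry` and `errRateLimited` are invented names, and a production version would also honor context cancellation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRateLimited is a hypothetical sentinel used only in this example.
var errRateLimited = errors.New("rate limited")

// retry calls fn until it succeeds or attempts are exhausted, doubling
// the delay between tries. The last error is wrapped in the result.
func retry(attempts int, delay time.Duration, fn func() error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2 // simple exponential backoff
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := retry(4, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errRateLimited
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

Wrapping with `%w` lets callers still match the underlying error with `errors.Is`.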

tgross · 2025-08-06T18:18:01Z

> now bumping into Nomad rate limiting issue.

Once #1134 has been merged, this should reduce the impact of this and make it easier to wrap this up.

benjamin-lykins · 2025-08-07T18:48:10Z

> > now bumping into Nomad rate limiting issue.
>
> Once #1134 has been merged, this should reduce the impact of this and make it easier to wrap this up.

Ahh great! I'll hold off, since I had been trying to work at what was available to handle that. Thanks for the heads up @tgross.

- Updated Azure SDK imports to use the new `azidentity` and `armcompute` packages.
- Replaced deprecated methods for creating Azure clients with new client secret credential methods.
- Modified `setupAzureClient` to handle client creation and error handling more effectively.
- Enhanced `scaleOut` and `scaleIn` methods to support both Uniform and Flexible orchestration modes.
- Implemented concurrent processing for VM instance views in Flexible mode to improve performance.
- Updated tests to reflect changes in the Azure SDK and added new test cases for instance view processing.
- Simplified instance view processing logic to accommodate new SDK structure.

tgross

@benjamin-lykins this is really getting there. I've left a few small comments but nothing that's a major design issue at this point. We should also add a changelog entry to the Unreleased section, so that when we release this we have it noted.

plugins/builtin/target/azure-vmss/plugin/plugin.go

plugins/builtin/target/azure-vmss/plugin/azure.go

tgross · 2025-10-30T18:43:18Z

plugins/builtin/target/azure-vmss/plugin/plugin.go

benjamin-lykins · 2025-11-04T03:09:38Z

@tgross made a series of changes off of your last set of reviews and ready for rereview.

benjamin-lykins · 2025-11-04T14:06:49Z

plugins/builtin/target/azure-vmss/plugin/plugin.go


-	capacity := *currVMSS.Sku.Capacity
+	capacity := *currVMSS.SKU.Capacity



@tgross I'm going to rework this section as well. It has always pulled the total number of instances the scale set can have (the requested capacity), in either an orchestrated or flexible VMSS. I think it would be better to grab the actual count.

Updated and tested with the following successfully:

    // Recommended removing this since it only gets the requested capacity and not the actual count.
    // capacity := *currVMSS.SKU.Capacity
    vms, err := t.getVMSSVMs(ctx, resourceGroup, orchestrationMode, vmScaleSet)
    if err != nil {
        return fmt.Errorf("failed to get VMSS VMs: %w", err)
    }
    capacity := int64(len(vms))

I'll find some time to run this against a uniform set and confirm scaling actions perform without issue as well for all the changes. Last time I did, there were no issues, but wanting to double check since it has been a bit.

tgross

One last item and I think that'll do it, @benjamin-lykins!

plugins/builtin/target/azure-vmss/plugin/azure.go

tgross

LGTM! Nice work!

tgross requested changes Jun 9, 2025

benjamin-lykins marked this pull request as ready for review June 19, 2025 20:32

benjamin-lykins requested review from a team as code owners June 19, 2025 20:32

tgross mentioned this pull request Jun 19, 2025

The azure-vmss target plugin fails when used with Azure scale sets in flexible orchestration mode #1096

Closed

tgross requested changes Jun 20, 2025

tgross reviewed Jun 27, 2025

tgross added this to Nomad - Community Issues Triage Jul 25, 2025

github-project-automation bot moved this to Needs Triage in Nomad - Community Issues Triage Jul 25, 2025

tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jul 25, 2025

tgross self-assigned this Oct 27, 2025

benjamin-lykins and others added 9 commits October 29, 2025 13:59

cleaning up old comments and addiing additional comments

0b461cc

Update plugins/builtin/target/azure-vmss/plugin/azure.go

974c48b

Removing Co-authored-by: Tim Gross <[email protected]>

Apply suggestions from code review

7bab444

removing comments and cleanup Co-authored-by: Tim Gross <[email protected]>

Adding constants for orchestration mode

b61edb9

adding note

280f80a

cleaning up processInstanceView

3f10e4e

adding context check at the beginning of the loop

1f8ffb9

removing comments and redudant conditionals

e1a8b0c

tgross force-pushed the feat/azure-vmss-support-flexible branch from 39b0307 to f544494 Compare October 29, 2025 18:22

benjamin-lykins added 5 commits October 29, 2025 14:27

consolidating scale in function and generalized

7d5ed29

fix uniform logic. tested and scaling in successfully

7f08aac

adding nil check for vm.InstanceID

0c02da8

reverting to luiz's logic when checking if powerState = running

ac65f6e

fix getFlexibleReadyRemoteIDs, which could inaccurately check status.

c87411e

checking for running and provisioned vms.

benjamin-lykins and others added 12 commits October 29, 2025 14:27

rework flexible vmss instance status checks. add helper function for

72dcac4

reuse

update plugins/builtin/target/azure-vmss/plugin/azure.go

5714fbb

update from debug to error message

e372904

updated with suggested helper function. validated scale in/out working

7f66910

both uniform & flexible

cleaning up helper function for more descriptive errors

92086fb

updated error output

9ea19e0

removing goroutines due to rate limit issue.

3f35205

removing config output

e3a64dc

storing readyFlexibleInstances to ease azure api hits

6adb452

tweaks after testing for logging

cb63004

flexible vm ready test

cd434c0

add tests for idFromRemoteID function to validate orchestration modes…

01cf3bb

… and name formats

tgross force-pushed the feat/azure-vmss-support-flexible branch from f544494 to 01cf3bb Compare October 29, 2025 18:28

tgross requested changes Oct 30, 2025

benjamin-lykins added 8 commits October 31, 2025 16:43

remove debug log for context cancellation in processInstanceViewFlexible

e8237d8

updated to use errors.Is for context cancellation

f783468

remove unnecessary error handling for rate limits in scaleIn function…

318c9cb

…, better after update in hashicorp#1134

additional verification for flexible vmss vms.

4d45911

removing unneeded healthy debug log

8ae3ec2

removing redundant logging

908fd8f

removed else block when checking uniform instance set vm properties.

71c6c00

mutex changes in processInstanceViewFlexible and scaleIn

8755c3e

benjamin-lykins commented Nov 4, 2025

refactor: update capacity calculation to use actual VM count instead …

bac4e60

…of requested capacity

tgross requested changes Nov 4, 2025

plugins/builtin/target/azure-vmss/plugin/azure.go

fix: ensure thread-safe access to readyFlexibleInstances in scaleIn f…

9b777de

…unction

tgross approved these changes Nov 4, 2025

tgross merged commit 823b78f into hashicorp:main Nov 4, 2025
19 checks passed

github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Nov 4, 2025


