Skip to content

Conversation

@benjamin-lykins
Copy link
Contributor

@benjamin-lykins benjamin-lykins commented Jun 9, 2025

Fixes: #1096

  • Reviewed contributor guide.
    I will note, this might not be a small first PR to put in. Ended up being a little bit more noisy since I did migrate off the legacy Azure SDK.

Changes

  1. Updated from deprecated to current Azure Compute SDK.

"github.com/Azure/azure-sdk-for-go/services/compute/mgmt/2020-06-01/compute"
"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute"

  1. Updated status checks

Supports both flexible and uniform scale sets. Queries the VMSS to get the type and then conditionally runs the respective process status/scale in and out processes.

  • Flexible scale sets are a shift from uniform and were unable to query the instance view of the instance. Essentially, grab the list of instances from the VMSS, then use that against the VM (not VMSS) API to get instance status.

  • When getting instance states, since Flexible VMSS can have up to 2000 instances, I used a goroutine to iterate through them more effectively. Hardcoded at 5 concurrent, might be worth allowing the end user to provide a limit on this, but from my testing with 5 concurrent against 100 instances, it took ~1 second. When I was sequentially pulling instance state, it took ~30 seconds for the same count. I did test limiting to 10 concurrent, but it took it from ~1 second to a little less than a second, so not much gained.

  1. The only existing function specifically with Uniform I updated was processInstanceView.

I can double check the why, but I believe one of the attributes was not available in the current api.

  1. Testing

Admittedly Test_processInstanceView used copilot to refactor for the new api.

Testing

Flexible VMSS

  • Scale In
  • Scale Out

Uniform VMSS

  • Scale In
  • Scale Out

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great work @benjamin-lykins. I've left a bunch of comments -- some on small details and some on design. For the smaller comments like contents of debug logging, try to take a pass over the whole PR with those in mind. (Edit: it looks like GitHub is unhelpfully doing the "hidden conversations" thing so make sure you expand that as that includes some of the major design items)

We've added a lot of new code here and it doesn't seem like there's a lot of test coverage. Unfortunately we don't have integration tests so that's challenging, but maybe as you break up some of those large methods I pointed out there'd be a way to extract some of the logic into testable chunks.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making good progress!

Comment on lines 283 to 265
} else {
t.logger.Debug("skipping instance", "name", *vm.Name, "instance_id", *vm.InstanceID, "code", *s.Code)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This log line seems wrong... we're not skipping the instance here, we're iterating to the next InstanceView.Status and will print this log line for each status we find.

Copy link
Contributor Author

@benjamin-lykins benjamin-lykins Jun 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ran through this a bit more in my head. This was carry over from what it was previously. I had tried to simplify this since it had multiple nested if statement. I ended up reverting back to what Luiz had previously.

if vm.Properties != nil && vm.Properties.InstanceView != nil && vm.Properties.InstanceView.Statuses != nil {
		for _, s := range vm.Properties.InstanceView.Statuses {
			if s.Code != nil && strings.HasPrefix(*s.Code, "PowerState/") {
				if *s.Code == "PowerState/running" {
					t.logger.Debug("found healthy instance", "name", *vm.Name, "instance_id", *vm.InstanceID)
					vmNames = append(vmNames, *vm.Name)
					break
				} else {
					t.logger.Debug("skipping instance - power state is not running", "name", *vm.Name, "instance_id", *vm.InstanceID, "code", *s.Code)
				}
			}
		}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this still has the same problem. Suppose I have the following set of InstanceViewStatus:

 []InstanceViewStatus{
	{ Code: pointer.Of("ProvisioningState/succeeded"), ...}
	{ Code: pointer.Of("Whatever/foo"), ...}
	{ Code: pointer.Of("PowerState/running"), ...}
}

In this case, we'll log that we're skipping the instance twice and then not skip it, because we eventually find a "PowerState/running" code. The number of times we log ends up being dependent on the order that the statuses get returned (which I can't find anywhere in the Azure docs). So we only want to log if we never find that "PowerState/running" code. Similar to how you've got isFlexibleVMReady working.

Or we can drop this else block entirely, which I think would also be fine.

@benjamin-lykins
Copy link
Contributor Author

benjamin-lykins commented Jun 23, 2025

@tgross Replied back to your latest reviews and put in a series of commits to address the recommendations.

I'm working to clean up the functions which get both status and IDs for Flexible VMSS. I may try to consolidate those two functions, or generalize some of it that I can.

@benjamin-lykins
Copy link
Contributor Author

Azure API rate limiting no longer an issue, now bumping into Nomad rate limiting issue. The other plugins have a retry function which I'll look to incorporate. Probably not in this PR, but in a future I will add better handling if the node drain occurs on some nodes, but not all to re-enable them for allocations.

@tgross
Copy link
Member

tgross commented Aug 6, 2025

now bumping into Nomad rate limiting issue.

Once #1134 has been merged, this should reduce the impact of this and make it easier to wrap this up.

@benjamin-lykins
Copy link
Contributor Author

now bumping into Nomad rate limiting issue.

Once #1134 has been merged, this should reduce the impact of this and make it easier to wrap this up.

Ahh great! I'll hold off, since I had been trying to work at what was available to handle that. Thanks for the heads up @tgross.

@tgross tgross self-assigned this Oct 27, 2025
benjamin-lykins and others added 9 commits October 29, 2025 13:59
- Updated Azure SDK imports to use the new `azidentity` and `armcompute` packages.
- Replaced deprecated methods for creating Azure clients with new client secret credential methods.
- Modified `setupAzureClient` to handle client creation and error handling more effectively.
- Enhanced `scaleOut` and `scaleIn` methods to support both Uniform and Flexible orchestration modes.
- Implemented concurrent processing for VM instance views in Flexible mode to improve performance.
- Updated tests to reflect changes in the Azure SDK and added new test cases for instance view processing.
- Simplified instance view processing logic to accommodate new SDK structure.
removing comments and cleanup

Co-authored-by: Tim Gross <[email protected]>
@tgross tgross force-pushed the feat/azure-vmss-support-flexible branch from 39b0307 to f544494 Compare October 29, 2025 18:22
@tgross tgross force-pushed the feat/azure-vmss-support-flexible branch from f544494 to 01cf3bb Compare October 29, 2025 18:28
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benjamin-lykins this is really getting there. I've left a few small comments but nothing that's a major design issue at this point. We should also add a changelog entry to the Unreleased section, so that when we release this we have it noted.

Comment on lines 283 to 265
} else {
t.logger.Debug("skipping instance", "name", *vm.Name, "instance_id", *vm.InstanceID, "code", *s.Code)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this still has the same problem. Suppose I have the following set of InstanceViewStatus:

 []InstanceViewStatus{
	{ Code: pointer.Of("ProvisioningState/succeeded"), ...}
	{ Code: pointer.Of("Whatever/foo"), ...}
	{ Code: pointer.Of("PowerState/running"), ...}
}

In this case, we'll log that we're skipping the instance twice and then not skip it, because we eventually find a "PowerState/running" code. The number of times we log ends up being dependent on the order that the statuses get returned (which I can't find anywhere in the Azure docs). So we only want to log if we never find that "PowerState/running" code. Similar to how you've got isFlexibleVMReady working.

Or we can drop this else block entirely, which I think would also be fine.

@benjamin-lykins
Copy link
Contributor Author

benjamin-lykins commented Nov 4, 2025

@tgross made a series of changes off of your last set of reviews and ready for rereview.

Comment on lines 132 to 134

capacity := *currVMSS.Sku.Capacity
capacity := *currVMSS.SKU.Capacity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tgross I'm going to rework this section as well. It has always pulled what the total number of instances can be in either an orchestrated or flexible vmss. I think it would be better to grab the actual count.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated and tested with the following successfully:

	// Recommended removing this since it only gets the requested capacity and not the actual count.
	// capacity := *currVMSS.SKU.Capacity

	vms, err := t.getVMSSVMs(ctx, resourceGroup, orchestrationMode, vmScaleSet)

	if err != nil {
		return fmt.Errorf("failed to get VMSS VMs: %w", err)
	}

	capacity := int64(len(vms))

I'll find some time to run this against a uniform set and confirm scaling actions perform without issue as well for all the changes. Last time I did, there were no issues, but wanting to double check since it has been a bit.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last item and I think that'll do it, @benjamin-lykins!

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work!

@tgross tgross merged commit 823b78f into hashicorp:main Nov 4, 2025
19 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Nov 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

The azure-vmss target plugin fails when used with Azure scale sets in flexible orchestration mode

2 participants