-
Notifications
You must be signed in to change notification settings - Fork 95
Updating Azure Plugin to Support Flexible VMSS #1106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating Azure Plugin to Support Flexible VMSS #1106
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great work @benjamin-lykins. I've left a bunch of comments -- some on small details and some on design. For the smaller comments like contents of debug logging, try to take a pass over the whole PR with those in mind. (Edit: it looks like GitHub is unhelpfully doing the "hidden conversations" thing so make sure you expand that as that includes some of the major design items)
We've added a lot of new code here and it doesn't seem like there's a lot of test coverage. Unfortunately we don't have integration tests so that's challenging, but maybe as you break up some of those large methods I pointed out there'd be a way to extract some of the logic into testable chunks.
tgross
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making good progress!
| } else { | ||
| t.logger.Debug("skipping instance", "name", *vm.Name, "instance_id", *vm.InstanceID, "code", *s.Code) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log line seems wrong... we're not skipping the instance here, we're iterating to the next InstanceView.Status and will print this log line for each status we find.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran through this a bit more in my head. This was carry over from what it was previously. I had tried to simplify this since it had multiple nested if statement. I ended up reverting back to what Luiz had previously.
if vm.Properties != nil && vm.Properties.InstanceView != nil && vm.Properties.InstanceView.Statuses != nil {
for _, s := range vm.Properties.InstanceView.Statuses {
if s.Code != nil && strings.HasPrefix(*s.Code, "PowerState/") {
if *s.Code == "PowerState/running" {
t.logger.Debug("found healthy instance", "name", *vm.Name, "instance_id", *vm.InstanceID)
vmNames = append(vmNames, *vm.Name)
break
} else {
t.logger.Debug("skipping instance - power state is not running", "name", *vm.Name, "instance_id", *vm.InstanceID, "code", *s.Code)
}
}
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this still has the same problem. Suppose I have the following set of InstanceViewStatus:
[]InstanceViewStatus{
{ Code: pointer.Of("ProvisioningState/succeeded"), ...}
{ Code: pointer.Of("Whatever/foo"), ...}
{ Code: pointer.Of("PowerState/running"), ...}
}In this case, we'll log that we're skipping the instance twice and then not skip it, because we eventually find a "PowerState/running" code. The number of times we log ends up being dependent on the order that the statuses get returned (which I can't find anywhere in the Azure docs). So we only want to log if we never find that "PowerState/running" code. Similar to how you've got isFlexibleVMReady working.
Or we can drop this else block entirely, which I think would also be fine.
|
@tgross Replied back to your latest reviews and put in a series of commits to address the recommendations. I'm working to clean up the functions which get both status and IDs for Flexible VMSS. I may try to consolidate those two functions, or generalize some of it that I can. |
|
Azure API rate limiting no longer an issue, now bumping into Nomad rate limiting issue. The other plugins have a retry function which I'll look to incorporate. Probably not in this PR, but in a future I will add better handling if the node drain occurs on some nodes, but not all to re-enable them for allocations. |
Once #1134 has been merged, this should reduce the impact of this and make it easier to wrap this up. |
- Updated Azure SDK imports to use the new `azidentity` and `armcompute` packages. - Replaced deprecated methods for creating Azure clients with new client secret credential methods. - Modified `setupAzureClient` to handle client creation and error handling more effectively. - Enhanced `scaleOut` and `scaleIn` methods to support both Uniform and Flexible orchestration modes. - Implemented concurrent processing for VM instance views in Flexible mode to improve performance. - Updated tests to reflect changes in the Azure SDK and added new test cases for instance view processing. - Simplified instance view processing logic to accommodate new SDK structure.
Removing Co-authored-by: Tim Gross <[email protected]>
removing comments and cleanup Co-authored-by: Tim Gross <[email protected]>
39b0307 to
f544494
Compare
checking for running and provisioned vms.
both uniform & flexible
… and name formats
f544494 to
01cf3bb
Compare
tgross
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benjamin-lykins this is really getting there. I've left a few small comments but nothing that's a major design issue at this point. We should also add a changelog entry to the Unreleased section, so that when we release this we have it noted.
| } else { | ||
| t.logger.Debug("skipping instance", "name", *vm.Name, "instance_id", *vm.InstanceID, "code", *s.Code) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this still has the same problem. Suppose I have the following set of InstanceViewStatus:
[]InstanceViewStatus{
{ Code: pointer.Of("ProvisioningState/succeeded"), ...}
{ Code: pointer.Of("Whatever/foo"), ...}
{ Code: pointer.Of("PowerState/running"), ...}
}In this case, we'll log that we're skipping the instance twice and then not skip it, because we eventually find a "PowerState/running" code. The number of times we log ends up being dependent on the order that the statuses get returned (which I can't find anywhere in the Azure docs). So we only want to log if we never find that "PowerState/running" code. Similar to how you've got isFlexibleVMReady working.
Or we can drop this else block entirely, which I think would also be fine.
…, better after update in hashicorp#1134
|
@tgross made a series of changes off of your last set of reviews and ready for rereview. |
|
|
||
| capacity := *currVMSS.Sku.Capacity | ||
| capacity := *currVMSS.SKU.Capacity | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tgross I'm going to rework this section as well. It has always pulled what the total number of instances can be in either an orchestrated or flexible vmss. I think it would be better to grab the actual count.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated and tested with the following successfully:
// Recommended removing this since it only gets the requested capacity and not the actual count.
// capacity := *currVMSS.SKU.Capacity
vms, err := t.getVMSSVMs(ctx, resourceGroup, orchestrationMode, vmScaleSet)
if err != nil {
return fmt.Errorf("failed to get VMSS VMs: %w", err)
}
capacity := int64(len(vms))I'll find some time to run this against a uniform set and confirm scaling actions perform without issue as well for all the changes. Last time I did, there were no issues, but wanting to double check since it has been a bit.
…of requested capacity
tgross
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last item and I think that'll do it, @benjamin-lykins!
tgross
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Nice work!
Fixes: #1096
I will note, this might not be a small first PR to put in. Ended up being a little bit more noisy since I did migrate off the legacy Azure SDK.
Changes
"github.com/Azure/azure-sdk-for-go/services/compute/mgmt/2020-06-01/compute"
"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/compute/armcompute"
Supports both flexible and uniform scale sets. Queries the VMSS to get the
typeand then conditionally runs the respective process status/scale in and out processes.Flexible scale sets are a shift from uniform and were unable to query the instance view of the instance. Essentially, grab the list of instances from the VMSS, then use that against the VM (not VMSS) API to get instance status.
When getting instance states, since Flexible VMSS can have up to 2000 instances, I used a goroutine to iterate through them more effectively. Hardcoded at 5 concurrent, might be worth allowing the end user to provide a limit on this, but from my testing with 5 concurrent against 100 instances, it took ~1 second. When I was sequentially pulling instance state, it took ~30 seconds for the same count. I did test limiting to 10 concurrent, but it took it from ~1 second to a little less than a second, so not much gained.
processInstanceView.I can double check the why, but I believe one of the attributes was not available in the current api.
Admittedly
Test_processInstanceViewused copilot to refactor for the new api.Testing
Flexible VMSS
Uniform VMSS