We are doing some preparation for moving from Intel-based architecture to ARM-based CPUs. As part of this testing we’ve set up a parallel stack on a different queue to run our CI on ARM. The stack spins up instances and runs tests for the right queue as expected. Unfortunately we’re seeing an issue where the stack doesn’t seem to scale instances down.
At first I thought this might have been because I chose the arm option for LambdaArchitecture. However, switching it back to x86 didn’t seem to help. (Which makes sense given the documentation here: Experimental Lambda-based Scaler 🦑) As a result, we see that our stack’s ASG spins up instances to max capacity and then we never see them turn off when they are idle. I looked over the parameters in the two stacks and I can’t find anything different about them other than the architecture and queue names. The desired capacity never seems to go down even when jobs haven’t been run for hours.
Any ideas as to what we can do here to resolve this?
Hey @geoffharcourt,
Thanks for getting in touch – it sounds like you have disable_scale_in set to true in this case. If that’s correct, we’d recommend setting it to false so that the scaler is able to reduce the desired capacity.
Would you be able to share the parameters of your Stack for us to take a look at your configuration to see if we can identify any issue? We have information on this here.
We’d recommend emailing this over to our support team to ensure that there’s no sensitive information available publicly, but if you prefer to share here, that’s fine – just ensure that there’s no identifiable or sensitive information exposed. 
Hi Joe,
Thanks for responding! I forwarded our stack parameters to support@buildkite.com. I can set disable_scale_in to false, but our Intel stack has it as true and seems to scale down so I’m confused as to what’s making the idle termination behavior different on the two stacks.
I’m also wondering if this behavior Max describes is the same thing I’ve been seeing: Experimental Lambda-based Scaler 🦑 - #14 by max
Hey @geoffharcourt,
We’ve observed that the disable_scale_in setting (which is exposed by NewInstancesProtectedFromScaleIn as a parameter) was preventing instances from being scaled in by our Agent Scaler, which would result in Agents hanging. The ASG itself would be prevented from terminating these instances, too.
When this setting is disabled, this ensures that instances can be terminated by the Agent Scaler. You can test this parameter works for yourself before committing to a change by going to your ASG, navigating to the “Instance Management” tab and selecting a few instances, pressing on “Actions” and then selecting “Remove scale-in protection”.
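If you’d rather script that manual check than click through the console, something like the rough Python sketch below should do the same thing. It assumes you pass in a boto3 Auto Scaling client (`boto3.client("autoscaling")`). The function name `unprotect_instances`, the `limit` argument and the group name in the usage comment are made up for illustration and are not part of the Elastic Stack.

```python
def unprotect_instances(asg_client, group_name, limit=2):
    """Remove scale-in protection from up to `limit` protected instances.

    `asg_client` is assumed to be a boto3 Auto Scaling client.
    Returns the list of instance IDs that were unprotected.
    """
    resp = asg_client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    groups = resp["AutoScalingGroups"]
    if not groups:
        return []

    # Only instances that currently have scale-in protection are candidates.
    protected = [
        inst["InstanceId"]
        for inst in groups[0]["Instances"]
        if inst.get("ProtectedFromScaleIn")
    ]
    targets = protected[:limit]

    if targets:
        # Scripted equivalent of "Actions" > "Remove scale-in protection".
        asg_client.set_instance_protection(
            AutoScalingGroupName=group_name,
            InstanceIds=targets,
            ProtectedFromScaleIn=False,
        )
    return targets


# Example usage (hypothetical group name):
#   import boto3
#   unprotect_instances(boto3.client("autoscaling"), "my-arm-agents-asg")
```

Keeping `limit` small means only a couple of instances are affected while you watch whether they get scaled in.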
Based on the symptoms you’ve described, this seems to be the same cause I set out to fix for an issue I was running into myself, and the reason I subsequently exposed this setting. I believe that if you follow the steps above to verify manually, you’ll see the Agent count begin to decrease, with the instances that have scale-in protection disabled being the ones that drop off.
We figured out what was going on. We were re-using the instance role from two stacks and the instance role’s InstancePolicy inline policy didn’t have auto-scaling group instance termination privileges and was therefore unable to scale in on the stack. We’ve fixed that permission on our instance role and the scaling in and out happens as expected. Thanks Joe!
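For anyone who hits the same thing, here is a small Python sketch of the kind of check that would have caught it: given an IAM policy document as a dict, it reports whether any Allow statement covers the terminate action. I’m assuming the relevant action is `autoscaling:TerminateInstanceInAutoScalingGroup`. The helper name `allows_action` is hypothetical, and the check is deliberately simplified: it ignores `Resource`, `Condition`, `NotAction` and explicit `Deny` statements.

```python
import fnmatch

# Assumed action name for terminating an instance through its ASG.
TERMINATE_ACTION = "autoscaling:TerminateInstanceInAutoScalingGroup"


def allows_action(policy_document, action=TERMINATE_ACTION):
    """Return True if any Allow statement's Action list matches `action`.

    Simplified: only looks at Effect and Action, with IAM-style `*` and `?`
    wildcards, compared case-insensitively.
    """
    statements = policy_document.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        if any(
            fnmatch.fnmatchcase(action.lower(), pattern.lower())
            for pattern in actions
        ):
            return True
    return False
```

You could feed it the inline policy JSON from the instance role (loaded with `json.loads`) for each stack that shares the role.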