Autoscaling disconnects active agents

Hi team,
We're currently running into an issue where auto-scaling occasionally brings down active agents even when there are idle agents available. Are there any settings I can change to prioritize terminating idle agents instead?

Hi Sam! :wave:

You can use the ScaleInIdlePeriod property, which corresponds to the disconnect-after-idle-timeout setting in the agent configuration (see elastic-ci-stack-for-aws/packer/linux/conf/bin/bk-install-elastic-stack.sh at master · buildkite/elastic-ci-stack-for-aws · GitHub). It's the number of seconds the agent must be idle before disconnecting. Note that ScaleInIdlePeriod isn't an ASG parameter: it terminates the agent process, but the instance itself can still be up.
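For reference, here's roughly how that surfaces on an instance - a minimal sketch assuming the stock Elastic CI Stack Linux AMI, where the agent config lives at the usual /etc/buildkite-agent/buildkite-agent.cfg path and the value shown is just the 600-second default:

```sh
# Illustrative only: inspect the idle timeout the stack wrote into the agent config
# (path assumes the standard Linux agent install used by the Elastic CI Stack)
grep disconnect-after-idle-timeout /etc/buildkite-agent/buildkite-agent.cfg
# expected output with the default ScaleInIdlePeriod:
# disconnect-after-idle-timeout=600
```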

Hope this helps!

Best,

I think I have the same issue. I can't find docs about agent status, but I would assume that if there's a running job, then the agent isn't idle, and therefore shouldn't be shut down? But alas, this is what I saw. Here's the timeline from the relevant job:

and the corresponding activity history from the ASG:

Perhaps the scaling logic only tells the ASG to reduce its size, and the ASG then just picks one of the instances at random? It would be good if that took into account whether or not there was a job in progress.

Hey @jkburges!

Thanks for the message :+1:

For the stack in question above (thanks for the pictures also) - what was the ScaleInIdlePeriod? As @paula mentioned above, the parameter isn't set on the ASGs themselves, but at the agent instance level (in UserData, specifically). The scaler (by default) doesn't scale in: it's the termination script on the instance that then lowers the ASG's desired count.
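If it helps, one quick way to check the value your stack is currently using (stack name below is a placeholder):

```sh
# Read the ScaleInIdlePeriod parameter off the existing CloudFormation stack
aws cloudformation describe-stacks \
  --stack-name my-elastic-ci-stack \
  --query "Stacks[0].Parameters[?ParameterKey=='ScaleInIdlePeriod'].ParameterValue" \
  --output text
```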

Cheers!

600, which I guess is the default - pretty sure we're not setting it.

Thanks for confirming @jkburges - 600 seconds (10 minutes) is indeed the default.

Happy for this to be raised as a dedicated ticket to support@buildkite.com, but it would be worth seeing what the job's actual activity was over those last 10 minutes for the agent to have considered itself idle and shut down after the allotted time.

Cheers :+1:

It's deploying to AWS Elastic Beanstalk using the eb deploy CLI tool. The EB side takes a while (over 10 minutes), so the job is basically sitting there doing nothing, waiting for the status to change. I should say that I haven't noticed this problem before, and we do a lot of deploys.

Cheers - that could very well be a contributing factor, though of course it depends on how long the eb deploy CLI takes to wrap up. It might be worth bumping the timeout period to accommodate it - though I agree the shutdown/termination scripts should account for active agents.
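If you do go down the route of bumping it, something along these lines would do it - a sketch only, with a placeholder stack name; in practice you'd also pass UsePreviousValue=true for the stack's other parameters (or just change the parameter in the CloudFormation console), and adjust --capabilities to whatever your stack requires:

```sh
# Sketch: raise ScaleInIdlePeriod to 30 minutes on an existing stack
aws cloudformation update-stack \
  --stack-name my-elastic-ci-stack \
  --use-previous-template \
  --parameters ParameterKey=ScaleInIdlePeriod,ParameterValue=1800 \
  --capabilities CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND
```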

Another thing - has this CI stack been configured to use Spot instances via InstanceType? Not ruling it out per se, but Spot reclamation might be at play if so.

Yes, we use Spot instances. But I think the ASG message in the screenshot above, "taken out of service in response to a user request", suggests that it's not this? Could be wrong :slight_smile:

Suspected as much :slightly_smiling_face:

You're right that at first glance it looks more like a user (script) level request. Noting from the logs, though, the Deployment succeeded entry indicates the EB deployment finishing, so for the last ~7 minutes the agent could have been sitting idle. Depending on how long the deployment normally runs before the command itself wraps up, the idle period would then be what's triggering the shutdown.

Also potentially worth trialling this with an On-Demand instance - primarily to rule out reclaims (while also taking the cost differential into consideration).

Cheers!

That screenshot was a bit misleading due to the clipping; the full log line is:

2023-07-31 19:14:59 INFO Deployment succeeded. Terminating old instances and temporary Auto Scaling group.

i.e. the eb deploy command is still running, as the deploy hasn't finished - it still has some cleanup to do.

the agent could then be sitting idle

This bit I don’t understand: if eb deploy is still running, within the context of a job (also running), then the agent should not be considered idle, right? Or is idleness based on something else?

Ah - turns out it was truncation: no worries, so the actual command was still running at that point.

The agent processes have a fixed lifetime based on the parameters above, though it should be extended while they're running a job - hence crossing off other options with the Spot instance question earlier. It's all part of the agent's lifecycle (handled via lifecycled).
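If you want to dig in on the instance itself, something like the below might help correlate what the agent and lifecycled each thought was happening around that window - assuming the stock AMI where both run as systemd units (unit names and log timezone may differ on customised images):

```sh
# Pull the agent and lifecycled logs around the window from the screenshots above
journalctl -u buildkite-agent --since "2023-07-31 19:00" --until "2023-07-31 19:30"
journalctl -u lifecycled --since "2023-07-31 19:00" --until "2023-07-31 19:30"
```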

I believe what's happening is that the eb deploy command is still running, but under the hood the lifecycle logic is potentially considering the agent to be idle, even though it's actively running the deploy command.

It would be worth a deeper investigation and potentially some feedback on our side too: happy for you to raise this with support@buildkite.com so we can action it individually :slightly_smiling_face:
