Autoscaling disconnects active agents

Hi team,
We're currently running into an issue where auto-scaling occasionally brings down active agents even when there are idle agents available. Are there any settings I can change to prioritize terminating idle agents instead?

Hi Sam! :wave:

You can use the ScaleInIdlePeriod property, which corresponds to the disconnect-after-idle-timeout setting in the agent configuration (see elastic-ci-stack-for-aws/packer/linux/conf/bin/bk-install-elastic-stack.sh at master · buildkite/elastic-ci-stack-for-aws · GitHub). It's the number of seconds the agent must be idle before disconnecting. Note that ScaleInIdlePeriod isn't an ASG parameter: it terminates the agent process, but the instance itself can still be up.
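For reference, here's roughly how that surfaces on an instance - a minimal sketch assuming the stock Elastic CI Stack Linux AMI, where the agent config lives at the usual /etc/buildkite-agent/buildkite-agent.cfg path and the value shown is just the 600-second default:

```sh
# Illustrative only: inspect the idle timeout the stack wrote into the agent config
# (path assumes the standard Linux agent install used by the Elastic CI Stack)
grep disconnect-after-idle-timeout /etc/buildkite-agent/buildkite-agent.cfg
# expected output with the default ScaleInIdlePeriod:
# disconnect-after-idle-timeout=600
```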

Hope this helps!

Best,

I think I have the same issue. I can't find docs about agent status, but I would assume that if there's a running job, then the agent isn't idle, and therefore shouldn't be shut down? But alas, this is what I saw. Here's the timeline from the relevant job:

and the corresponding activity history from the ASG:

Perhaps the scaling logic only tells the ASG to reduce its size, and the ASG then just picks one of the instances at random? It would be good if that took into account whether or not there was a job in progress.

Hey @jkburges!

Thanks for the message :+1:

For the stack in question above (thanks for the pictures also) - what was the ScaleInIdlePeriod? As @paula mentioned above, the parameter isn't set on the ASGs themselves, but at the agent instance level (in UserData, specifically). The scaler (by default) doesn't scale in: it's the termination script on the instance that then lowers the ASG's desired count.
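If it helps, one quick way to check the value your stack is currently using (stack name below is a placeholder):

```sh
# Read the ScaleInIdlePeriod parameter off the existing CloudFormation stack
aws cloudformation describe-stacks \
  --stack-name my-elastic-ci-stack \
  --query "Stacks[0].Parameters[?ParameterKey=='ScaleInIdlePeriod'].ParameterValue" \
  --output text
```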

Cheers!

600, which I guess is the default - pretty sure we're not setting it.

Thanks for confirming @jkburges - 600 seconds (10 minutes) is indeed the default.

Happy for this to be raised as a dedicated ticket to support@buildkite.com, but it would be worth seeing what the job's actual activity was over those last 10 minutes for the agent to have considered itself idle and shut down after the allotted time.

Cheers :+1:

It's deploying to AWS Elastic Beanstalk using the eb deploy CLI tool. The EB side takes a while (over 10 minutes), so the job is basically sitting there doing nothing, waiting for the status to change. I should say that I haven't noticed this problem before, and we do a lot of deploys.

Cheers - that could very well be a contributing factor, though of course it depends on how long the eb deploy CLI takes to wrap up. It might be worth bumping the timeout period to accommodate it - though I agree the shutdown/termination scripts should account for active agents.
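If you do go down the route of bumping it, something along these lines would do it - a sketch only, with a placeholder stack name; in practice you'd also pass UsePreviousValue=true for the stack's other parameters (or just change the parameter in the CloudFormation console), and adjust --capabilities to whatever your stack requires:

```sh
# Sketch: raise ScaleInIdlePeriod to 30 minutes on an existing stack
aws cloudformation update-stack \
  --stack-name my-elastic-ci-stack \
  --use-previous-template \
  --parameters ParameterKey=ScaleInIdlePeriod,ParameterValue=1800 \
  --capabilities CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND
```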

Another thing - has this CI stack been configured to use Spot instances via InstanceType? Not ruling it out per se, but Spot reclamation might be at play if so.

Yes, we use Spot instances. But I think the ASG message in the screenshot above, "taken out of service in response to a user request", suggests that it's not this? Could be wrong :slight_smile:

Suspected as much :slightly_smiling_face:

You're right that at first glance it looks more like a user (script) level request. Noting from the logs, though, the Deployment succeeded entry indicates the EB deployment finishing, so for the last ~7 minutes the agent could have been sitting idle. Depending on how long the deployment normally runs before the command itself wraps up, the idle period would then be what's triggering the shutdown.

Also potentially worth trialling this with an On-Demand instance - primarily to rule out reclaims (while also taking the cost differential into consideration).

Cheers!

That screenshot was a bit misleading due to the clipping; the full log line is:

2023-07-31 19:14:59 INFO Deployment succeeded. Terminating old instances and temporary Auto Scaling group.

i.e. the eb deploy command is still running, as the deploy hasn't finished - it still has some cleanup to do.

the agent could then be sitting idle

This bit I don’t understand: if eb deploy is still running, within the context of a job (also running), then the agent should not be considered idle, right? Or is idleness based on something else?

Ah - turns out it was truncation: no worries, so the actual command was still running at that point.

The agent processes have a fixed lifetime based on the parameters above, though it should be extended while they're running a job - hence crossing off other options with the Spot instance question earlier. It's all part of the agent's lifecycle (handled via lifecycled).
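If you want to dig in on the instance itself, something like the below might help correlate what the agent and lifecycled each thought was happening around that window - assuming the stock AMI where both run as systemd units (unit names and log timezone may differ on customised images):

```sh
# Pull the agent and lifecycled logs around the window from the screenshots above
journalctl -u buildkite-agent --since "2023-07-31 19:00" --until "2023-07-31 19:30"
journalctl -u lifecycled --since "2023-07-31 19:00" --until "2023-07-31 19:30"
```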

I believe what's happening is that the eb deploy command is still running, but under the hood the lifecycle logic is potentially considering the agent to be idle, even though it's actively running the deploy command.

It would be worth a deeper investigation and potentially some feedback on our side too: happy for you to raise this with support@buildkite.com so we can action it individually :slightly_smiling_face:
