Hi team,
We're currently running into an issue where auto-scaling occasionally brings down active agents even though there are idle agents available. Are there any settings I can change to prioritize bringing down idle agents instead?
Hi Sam!
You can use the `ScaleInIdlePeriod` property, which corresponds to the `disconnect-after-idle-timeout` setting in the agent configuration (see elastic-ci-stack-for-aws/packer/linux/conf/bin/bk-install-elastic-stack.sh at master · buildkite/elastic-ci-stack-for-aws · GitHub): the number of seconds an agent must be idle before it disconnects and terminates. Note that `ScaleInIdlePeriod` isn't an ASG parameter; it terminates the agent process, but the instance itself can still be up.
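To make the relationship concrete, here's a rough sketch of how the parameter ends up on the instance. The variable name `SCALE_IN_IDLE_PERIOD` and the exact shape of the config write are placeholders - the real logic lives in bk-install-elastic-stack.sh - so treat this as illustrative only:

```bash
# Simplified sketch (not the stack's actual install script): the
# ScaleInIdlePeriod CloudFormation parameter is passed into the instance's
# UserData and written into the agent config as an idle timeout in seconds.
SCALE_IN_IDLE_PERIOD="${SCALE_IN_IDLE_PERIOD:-600}"   # placeholder variable name

cat >> /etc/buildkite-agent/buildkite-agent.cfg <<EOF
disconnect-after-idle-timeout=${SCALE_IN_IDLE_PERIOD}
EOF

# Roughly equivalent to starting the agent with:
#   buildkite-agent start --disconnect-after-idle-timeout "${SCALE_IN_IDLE_PERIOD}"
# Once the agent disconnects after being idle that long, the instance-level
# shutdown hooks take care of terminating the instance and adjusting the ASG.
```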
Hope this helps!
Best,
I have the same issue I think. I can’t find docs about agent status, but I would assume that if there’s a running job, then it’s not idle, and therefore shouldn’t be shut down? But alas, this is what I saw. Here’s the timeline from the relevant job:
and the corresponding activity history from the ASG:
Perhaps the scaling logic only tells the ASG to reduce its size, and the ASG then just picks one of the instances at random? It would be good if that took into consideration whether or not there was a job in progress.
Hey @jkburges!
Thanks for the message
For your stack in question above (thanks for the pictures also) - what was the `ScaleInIdlePeriod`? As @paula mentioned above, the parameter isn't one set on the ASGs themselves, but at the agent instance level (UserData specifically). The scaler (by default) doesn't scale in: it's based more on the termination script added to the instance, which then sets the ASG's desired count.
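To illustrate that mechanism (a sketch of the idea, not the stack's actual termination script): when the agent exits after going idle, the instance can terminate itself and decrement the desired capacity in one call, so the ASG never has to pick a victim on its own.

```bash
# Illustrative sketch only - not the real Elastic CI Stack script.
# Look up this instance's ID via IMDS (IMDSv1 shown for brevity; IMDSv2
# would need a session token first).
INSTANCE_ID="$(curl -fsS http://169.254.169.254/latest/meta-data/instance-id)"

# Terminate this instance and reduce the ASG's desired capacity at the same
# time, so the ASG doesn't go and choose another (possibly busy) instance.
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id "$INSTANCE_ID" \
  --should-decrement-desired-capacity
```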
Cheers!
600, which I guess is the default - pretty sure we're not setting it.
Thanks for confirming @jkburges - 600 seconds/10 minutes is indeed the default.
Happy for this to be sent as a dedicated item to support@buildkite.com, but it would be worth seeing what the actual activity on the job was over those last 10 minutes for the agent to have considered itself idle and shut down after the allotted time.
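One way to line the timelines up (assuming the AWS CLI is handy - the ASG name below is a placeholder) is to pull the ASG's recent scaling activity and compare the termination timestamps against the job log:

```bash
# ASG name is a placeholder - use the one from your Elastic CI Stack.
# Shows recent scaling activities, including the "taken out of service in
# response to a user request" style entries, with their timestamps.
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name my-elastic-ci-stack-AgentAutoScaleGroup \
  --max-records 20 \
  --query 'Activities[].{Start:StartTime,End:EndTime,Cause:Cause,Status:StatusCode}' \
  --output table
```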
Cheers
The job is deploying to AWS Elastic Beanstalk using the `eb deploy` CLI tool. The EB side takes a while (over 10 minutes), so the job is basically sitting there doing nothing, waiting for the status to change. I should say that I haven't noticed this problem before, and we do a lot of deploys.
Cheers - that could very well be a contributing factor there, though of course it depends on how long the `eb deploy` CLI takes to wrap up. It might be worth considering bumping the timeout period to accommodate it - though I agree the actual shutdown/termination scripts should account for active agents.
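If bumping the idle period is the route you take, that would be a stack parameter update rather than an ASG change. A minimal sketch, assuming the AWS CLI and a placeholder stack name (any parameters not listed would normally need `UsePreviousValue=true` entries so they aren't reset):

```bash
# Placeholder stack name; raise ScaleInIdlePeriod from the default 600s to
# 30 minutes so a slow `eb deploy` tail doesn't look like idleness.
aws cloudformation update-stack \
  --stack-name my-elastic-ci-stack \
  --use-previous-template \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND \
  --parameters \
    ParameterKey=ScaleInIdlePeriod,ParameterValue=1800 \
    ParameterKey=BuildkiteAgentToken,UsePreviousValue=true
    # ...plus UsePreviousValue=true for the rest of your existing parameters
```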
Another thing - has this CI stack been configured to use Spot instances for its `InstanceType`? Not ruling it out per se, but Spot reclamation might be at play if so.
Yes, we use Spot instances. But I think the ASG message in the screenshot above, “taken out of service in response to a user request”, suggests that it's not this? Could be wrong.
Suspected that might be the case.
You're right that at first glance it reads more as a user (script) level request. That said, from the logs, the `Deployment succeeded` entry indicates the EB deployment finishing, and for the last ~7 minutes the agent could then be sitting idle. Depending on how long the deployment normally takes to run before the command itself wraps up, the idle period would then be what's triggering its shutdown.
Also worth potentially trialling this with an On-Demand instance - primarily to rule out reclaims (but also taking the cost differential into consideration).
Cheers!
That screenshot was a bit misleading due to the clipping; the full log line is:
2023-07-31 19:14:59 INFO Deployment succeeded. Terminating old instances and temporary Auto Scaling group.
i.e. the `eb deploy` command is still running, as the deploy hasn't finished - it still has some cleanup to do.
> the agent could then be sitting idle

This bit I don't understand: if `eb deploy` is still running within the context of a job (which is also running), then the agent should not be considered idle, right? Or is idleness based on something else?
Ah - turns out truncation occurred: no worries, so the actual command was indeed still running.
The agent processes have a fixed lifetime based on the parameters above, though that should be extended while they are running a job - so that crosses off the options raised with the earlier Spot instance question. It's all part of the lifecycle of the agent (handled via lifecycled).
I believe what is happening is that the `eb deploy` command is still running, but under the hood the lifecycle is potentially considering the agent to be idle, even though it's actively running the deploy command.
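If it helps while reproducing, one hypothetical sanity check is to ask the Buildkite REST API what the agent thinks it's doing mid-deploy. The org slug, token variable, and agent name below are placeholders, and the field names are from memory of the agents endpoint (worth double-checking against the API docs) - if the agent reports a current job at the same moment the lifecycle treats it as idle, that's the mismatch to include in the report:

```bash
# Placeholders: BUILDKITE_API_TOKEN, my-org, my-elastic-agent.
# Lists the org's agents and pulls out the fields relevant to "is this agent
# busy right now?" (field names assumed - verify against the REST API docs).
curl -fsS -H "Authorization: Bearer ${BUILDKITE_API_TOKEN}" \
  "https://api.buildkite.com/v2/organizations/my-org/agents" |
  jq '.[] | select(.name == "my-elastic-agent")
        | {name, connection_state, current_job: .job.web_url, last_job_finished_at}'
```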
It would be worth a deeper investigation and potentially some feedback on our side too: happy for you to raise this with support@buildkite.com so we can then action it individually.