Idle Buildkite Agent is trying to terminate instance

I have an Elastic Stack deployed, configured to run 1 to 8 instances. I started noticing that the agent, when idle for 10 minutes, tries to terminate the instance anyway. Since the ASG min is set to 1, the termination fails and the agent restarts, but this takes several minutes, and if a new job gets triggered during that time it has to wait around 13 minutes to start.

In other words, when the ASG is at its minimum and that node has been idle for 10 minutes, it tries to terminate the instance and fails (the instance never gets terminated), but the agent service goes offline for around 15 minutes.

These are the logs from the agent:

Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO   Starting 5 Agent(s)
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO   You can press Ctrl-C to stop the agents
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-1 Connecting to Buildkite...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-3 Connecting to Buildkite...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-2 Connecting to Buildkite...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-5 Connecting to Buildkite...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-4 Connecting to Buildkite...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-4 Waiting for work...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-5 Waiting for work...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-1 Waiting for work...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-2 Waiting for work...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-3 Waiting for work...
Sep 06 15:17:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:37 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-5 All agents have been idle for 600 seconds. Disconnecting...
Sep 06 15:17:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:37 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-5 Disconnecting...
Sep 06 15:17:38 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:38 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-5 Disconnected
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-1 All agents have been idle for 600 seconds. Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-1 Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-4 All agents have been idle for 600 seconds. Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-4 Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-3 All agents have been idle for 600 seconds. Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-3 Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-2 All agents have been idle for 600 seconds. Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-2 Disconnecting...
Sep 06 15:17:43 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:43 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-1 Disconnected
Sep 06 15:17:43 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:43 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-3 Disconnected
Sep 06 15:17:43 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:43 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-2 Disconnected
Sep 06 15:17:43 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:43 INFO   allurion-build-agents-default-i-0000d2e5aa0c3e7a5-4 Disconnected
Sep 06 15:17:43 ip-172-16-1-226 terminate-instance[35285]: sleeping for 10 seconds before terminating instance to allow agent logs to drain to cloudwatch...
Sep 06 15:17:53 ip-172-16-1-226 terminate-instance[35285]: requesting instance termination...
Sep 06 15:17:53 ip-172-16-1-226 terminate-instance[35290]: An error occurred (ValidationError) when calling the TerminateInstanceInAutoScalingGroup operation: Currently, desiredSize equals minSize (1). Terminating in>
Sep 06 15:17:59 ip-172-16-1-226 buildkite-agent[35349]: 

And this is a job that waited for 13 minutes while the instance was running but the agent service was down:

Shouldn’t the agent check whether the ASG is already at its minimum size before disconnecting?

Hey @smoreno-allurion,

Thanks for the information.

This could happen because the template's auto scaling configuration has this setting:

ScaleInIdlePeriod: Number of seconds an agent must be idle before terminating
Default Value: 600

And that is why the logs show the following message, because your agents had indeed been idle for that period:

"All agents have been idle for 600 seconds. Disconnecting"

Hi @stephanie.atte, thanks for your answer.

Yes, I was aware of that setting, and I understand that 600 is the default value. My question wasn’t about why the node tries to terminate when idle, but why it tries to terminate and scale in to 0 when I set the minimum to 1.

When I have just 1 node, and that node is idle for 10 minutes, the agent tries to terminate the instance even though the min is set to 1. This causes a 13-minute wait on my jobs, since there are no available agents.
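
To make the check I'm asking about concrete, here is a minimal sketch of the decision I would expect before the agents disconnect. It is not how the Elastic Stack or the agent actually works; `can_scale_in` and `on_idle_timeout` are hypothetical names, and the input is assumed to be a dict shaped like the AWS `describe_auto_scaling_groups` response (with `MinSize` and `DesiredCapacity` per group).

```python
from typing import Any, Mapping


def can_scale_in(asg_description: Mapping[str, Any]) -> bool:
    """Return True only if removing one instance keeps the group above MinSize.

    `asg_description` is assumed to look like the AWS response, e.g. from
    boto3.client("autoscaling").describe_auto_scaling_groups(...).
    """
    groups = asg_description.get("AutoScalingGroups", [])
    if not groups:
        # Unknown group: be conservative and keep the agent running.
        return False
    group = groups[0]
    # Terminating with a capacity decrement is rejected by AWS when
    # desired capacity already equals the minimum size (the error in the logs).
    return group["DesiredCapacity"] > group["MinSize"]


def on_idle_timeout(asg_description: Mapping[str, Any]) -> str:
    """Hypothetical hook: decide what to do once the idle period has elapsed."""
    return "terminate" if can_scale_in(asg_description) else "keep-running"


if __name__ == "__main__":
    at_minimum = {"AutoScalingGroups": [{"MinSize": 1, "DesiredCapacity": 1}]}
    print(on_idle_timeout(at_minimum))
```

With the values from my stack (min 1, desired 1) this would keep the agents connected instead of disconnecting them and then failing the termination request.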