I have an elastic stack deployed, configured to scale between 1 and 8 instances. I've noticed that when the agent has been idle for 10 minutes, it tries to terminate the instance anyway. Because the ASG minimum is set to 1, the termination fails and the agent restarts, but this takes several minutes, and a job triggered during that window has to wait about 13 minutes to start.
In other words, when the ASG is at its minimum size and that node has been idle for 10 minutes, the agent tries to terminate the instance. The termination fails (the instance is never terminated), but the agent service goes offline for about 15 minutes.
These are the logs from the agent:
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO Starting 5 Agent(s)
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO You can press Ctrl-C to stop the agents
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-1 Connecting to Buildkite...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-3 Connecting to Buildkite...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-2 Connecting to Buildkite...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-5 Connecting to Buildkite...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-4 Connecting to Buildkite...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-4 Waiting for work...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-5 Waiting for work...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-1 Waiting for work...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-2 Waiting for work...
Sep 06 15:07:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:07:37 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-3 Waiting for work...
Sep 06 15:17:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:37 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-5 All agents have been idle for 600 seconds. Disconnecting...
Sep 06 15:17:37 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:37 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-5 Disconnecting...
Sep 06 15:17:38 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:38 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-5 Disconnected
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-1 All agents have been idle for 600 seconds. Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-1 Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-4 All agents have been idle for 600 seconds. Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-4 Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-3 All agents have been idle for 600 seconds. Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-3 Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-2 All agents have been idle for 600 seconds. Disconnecting...
Sep 06 15:17:42 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:42 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-2 Disconnecting...
Sep 06 15:17:43 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:43 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-1 Disconnected
Sep 06 15:17:43 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:43 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-3 Disconnected
Sep 06 15:17:43 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:43 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-2 Disconnected
Sep 06 15:17:43 ip-172-16-1-226 buildkite-agent[34979]: 2023-09-06 15:17:43 INFO allurion-build-agents-default-i-0000d2e5aa0c3e7a5-4 Disconnected
Sep 06 15:17:43 ip-172-16-1-226 terminate-instance[35285]: sleeping for 10 seconds before terminating instance to allow agent logs to drain to cloudwatch...
Sep 06 15:17:53 ip-172-16-1-226 terminate-instance[35285]: requesting instance termination...
Sep 06 15:17:53 ip-172-16-1-226 terminate-instance[35290]: An error occurred (ValidationError) when calling the TerminateInstanceInAutoScalingGroup operation: Currently, desiredSize equals minSize (1). Terminating in>
Sep 06 15:17:59 ip-172-16-1-226 buildkite-agent[35349]:
And this is a job that waited 13 minutes while the instance was running but the agent service was down:
Shouldn't the agent check whether the ASG is already at its minimum size before disconnecting?
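To make the idea concrete, here is a rough sketch of the kind of check I have in mind. This is not the stack's actual code: the function names `at_minimum_size` and `terminate_if_allowed` are made up for illustration, and I'm assuming the check could run before the agents disconnect or before the terminate-instance step. It takes an Auto Scaling client as a parameter (with boto3 that would be `boto3.client("autoscaling")`) and only requests termination when the desired capacity is above the minimum.

```python
from typing import Any


def at_minimum_size(asg_client: Any, asg_name: str) -> bool:
    """Return True if the ASG's desired capacity is at (or below) its minimum size.

    Hypothetical helper: asg_client is expected to behave like
    boto3.client("autoscaling").
    """
    resp = asg_client.describe_auto_scaling_groups(AutoScalingGroupNames=[asg_name])
    groups = resp["AutoScalingGroups"]
    if not groups:
        raise ValueError(f"Auto Scaling group not found: {asg_name}")
    group = groups[0]
    return group["DesiredCapacity"] <= group["MinSize"]


def terminate_if_allowed(asg_client: Any, asg_name: str, instance_id: str) -> bool:
    """Request termination only when the ASG can actually scale in.

    Returns True if termination was requested, False if it was skipped
    because the group is already at its minimum size.
    """
    if at_minimum_size(asg_client, asg_name):
        # Skip the call that would fail with
        # "desiredSize equals minSize" and leave the instance alone.
        return False
    asg_client.terminate_instance_in_auto_scaling_group(
        InstanceId=instance_id,
        ShouldDecrementDesiredCapacity=True,
    )
    return True
```

With a check like this, an idle instance in a group at its minimum would keep its agents connected instead of going through the failed termination and restart.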