Multiple Agents on a single instance don't timeout together

Mic · August 30, 2023, 10:39pm

Hi Buildkite,

Firstly, thanks for a great product. We love it.

I came across some strange behaviour in one of our queues today… We have a queue (called small) where we run 20 agents on a single EC2 instance. We do this because the small queue primarily has jobs that wait for AWS API calls to complete (i.e., upgrading an ECS service, invalidating CloudFront, etc.).

This morning our pipeline stalled as 19/20 of the agents had terminated due to idle timeout… and one was still working. In this case it seems better that either all the agents stay alive, or they all terminate (and the instance is removed from the ASG)… As it was, the instance remained, and the scaling lambda didn’t add another instance.

It seems that the agents time out after 15 minutes idle, but each maintains its own ‘countdown’. Instead, we want the agents on a shared instance to run or terminate together as a set. There is no point having an EC2 instance that can run 20 agents fine but is only running 1.

What can we do to avoid this happening again? Is there any configuration to have the agents avoid terminating if they are part of a set?

(screenshots of logs for additional context)

Seems like the log line “All agents have been idle for 1200 seconds. Disconnecting…” gives the wrong impression. It would be more correct as: “This agent has been idle for ### seconds. Disconnecting.”

james.s · August 30, 2023, 11:24pm

Hello @Mic!

Hope you are well and welcome to the Buildkite community!

You are correct about the behaviour you describe, and thank you for the screenshots too (they also help with confirming cases). As you’ve stated, each agent “worker” does maintain its own countdown poller (in this case, the ping loop). That loop determines whether the individual agent, which shares the host with the others in its “pool”, is stopping, running a job (which extends its last action time and keeps it alive), or idle. In the idle case, it checks whether the idle period has been exhausted and then shuts the agent down only if the rest of the agents in the shared pool on the instance are also idle. This is inbuilt behaviour.
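
To illustrate the difference between a per-agent countdown and a pool-wide idle check, here is a minimal Python sketch. It is a toy model under stated assumptions, not the agent's actual implementation: `IdlePool`, `touch`, `agent_idle` and `all_idle` are hypothetical names, and the timestamps are made up to mirror the 19/20 situation described above.

```python
from dataclasses import dataclass, field


@dataclass
class IdlePool:
    """Hypothetical model of agents sharing one host (not real agent code)."""
    timeout_s: float
    last_activity: dict = field(default_factory=dict)  # agent name -> timestamp

    def touch(self, agent: str, now: float) -> None:
        """Record activity, for example a job being assigned or finishing."""
        self.last_activity[agent] = now

    def agent_idle(self, agent: str, now: float) -> bool:
        """Per-agent countdown: has this one agent exceeded the timeout?"""
        return now - self.last_activity[agent] >= self.timeout_s

    def all_idle(self, now: float) -> bool:
        """Pool-wide check: true only when every agent exceeded the timeout."""
        return all(self.agent_idle(a, now) for a in self.last_activity)


pool = IdlePool(timeout_s=1200)
for i in range(20):
    pool.touch(f"agent-{i}", now=0)
pool.touch("agent-0", now=1000)  # one agent picks up a job late

# Agents that would stop if each followed only its own countdown.
per_agent = [a for a in pool.last_activity if pool.agent_idle(a, now=1200)]
print(len(per_agent))
# Whether a pool-wide rule would stop anything at the same moment.
print(pool.all_idle(1200))
```

In this model, independent countdowns stop every agent except the one that was recently active, while the pool-wide rule keeps the whole set alive until the last agent has also been idle for the full timeout.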

Going by the screenshots you’ve shared, it looks like a job assignment occurred while the agents were disconnecting, during the worker’s disconnection process. What versions of the CI stack and (corresponding) agents are you running in this setup?

Cheers!

system · September 2, 2023, 10:39pm

This topic was automatically closed after 3 days. New replies are no longer allowed.
