Multiple agents on a single instance don't time out together

Hi Buildkite,

Firstly, thanks for a great product. We love it.

I came across some strange behaviour in one of our queues today. We have a queue (called small) where we run 20 agents on a single EC2 instance. We do this because the small queue primarily has jobs that wait for AWS API calls to complete (e.g. upgrading an ECS service or invalidating CloudFront).

This morning our pipeline stalled because 19 of the 20 agents had terminated due to idle timeout, and one was still working. In this case it seems better that either all the agents stay alive, or they all terminate (and the instance is removed from the ASG). As it was, the instance remained, and the scaling lambda didn’t add another instance.

It seems that the agents time out after 15 minutes idle, but each maintains its own ‘countdown’. Instead, we want the agents on a shared instance to run and terminate together as a set. There is no point having an EC2 instance that can comfortably run 20 agents running just 1.

What can we do to avoid this happening again? Is there any configuration to have the agents avoid terminating if they are part of a set?

(screenshots of logs for additional context)

It seems the log line “All agents have been idle for 1200 seconds. Disconnecting…” gives the wrong impression. It would be more accurate as: “This agent has been idle for ### seconds. Disconnecting.”

Hello @Mic!

Hope you are well and welcome to the Buildkite community! :wave:

You are correct about the behaviour you describe, and thank you for the screenshots too, which help confirm cases like this. As you’ve stated, each agent “worker” maintains its own countdown poller (in this case, the ping loop). That loop determines whether the individual agent sharing the host in its “pool” is stopping, is running a job (which extends its last action time and keeps it alive), or has exhausted its idle period. In the last case, it shuts the agents down only if the rest of the agents in the shared pool on the instance are also idle, which is inbuilt behaviour.
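To make the distinction concrete, here is a minimal Python sketch of a per-agent countdown versus a pool-wide idle check. It is only an illustration under stated assumptions, not the agent's actual implementation: `IdlePool`, `touch`, `agent_idle`, and `all_idle` are hypothetical names, time is passed in as plain seconds, and the 900-second timeout is simply the 15 minutes mentioned earlier in the thread.

```python
from dataclasses import dataclass, field


@dataclass
class IdlePool:
    """Hypothetical shared idle tracker for agents spawned on one host."""

    timeout_s: float
    # Maps agent name -> time (in seconds) of its last activity.
    last_active: dict = field(default_factory=dict)

    def touch(self, agent: str, now: float) -> None:
        """Record activity (start-up or a job) for one agent."""
        self.last_active[agent] = now

    def agent_idle(self, agent: str, now: float) -> bool:
        """Per-agent countdown: only this agent's own clock is consulted."""
        return now - self.last_active[agent] >= self.timeout_s

    def all_idle(self, now: float) -> bool:
        """Pool-wide check: true only once every agent has exceeded the timeout."""
        return all(now - t >= self.timeout_s for t in self.last_active.values())


# 20 agents start at t=0; agent-0 picks up a job at t=600.
pool = IdlePool(timeout_s=900)
for i in range(20):
    pool.touch(f"agent-{i}", now=0)
pool.touch("agent-0", now=600)

print(pool.agent_idle("agent-1", now=900))  # True: its own countdown has expired
print(pool.all_idle(now=900))               # False: agent-0 was active at t=600
print(pool.all_idle(now=1500))              # True: every agent is past the timeout
```

With a per-agent check alone, 19 agents would disconnect at t=900 while agent-0 stays up, which matches the situation described in the original post. With the pool-wide check, nothing disconnects until t=1500.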

Going by the pictures you’ve shared, it looks like a job assignment occurred while the agents were disconnecting, during the worker’s disconnection process. What versions of the CI stack and (corresponding) agents are you running in this setup?

