Firstly, thanks for a great product. We love it.
I came across some strange behaviour in one of our queues today. We have a queue (called `small`) where we run 20 agents on a single EC2 instance. We do this because the small queue primarily has jobs that wait for AWS API calls to complete (e.g. upgrading an ECS service, invalidating CloudFront, etc.).
This morning our pipeline stalled: 19 of the 20 agents had terminated due to idle timeout, while one was still working. In this case it would be better for all the agents to stay alive, or for all of them to terminate (so the instance is removed from the ASG). As it was, the instance remained, and the scaling lambda didn't add another instance.
It seems the agents time out after 15 minutes idle, but each one maintains its own countdown. Instead, we want the agents on a shared instance to run and terminate together as a set. There's no point keeping an EC2 instance that can run 20 agents alive just to run 1.
What can we do to avoid this happening again? Is there any configuration to stop agents terminating individually when they are part of a set?
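For reference, something along these lines in the agent config is what I'd hope would do it. This is only a sketch: I'm assuming `spawn` and `disconnect-after-idle-timeout` (both documented agent settings) are the relevant knobs here, and that spawned agents can share an idle countdown. Please correct me if there's a better option:

```
# /etc/buildkite-agent/buildkite-agent.cfg — sketch, not our exact config

# Run 20 agents from a single agent process, so they can (ideally)
# be treated as one set with a shared idle countdown
spawn=20

# Disconnect after 20 minutes idle — what we'd want to apply
# to the set as a whole, not per agent
disconnect-after-idle-timeout=1200
```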
(screenshots of logs attached for additional context)
It also seems the log line "All agents have been idle for 1200 seconds. Disconnecting…" gives the wrong impression. It would be more accurate as: "This agent has been idle for ### seconds. Disconnecting."