I have a cluster of agents on physical machines and VMs which are started on demand. I use the disconnect-after-idle-timeout option to stop the agent process when there is no work, and OS-specific methods of varying hackyness to shut the machine down after that.
It would be nice if the agent process termination reason (e.g. signal, timeout, something else…) were provided to the agent-shutdown hook. Then I could simply have a hook that schedules a shutdown of the host system when the agent exits because the idle timeout expired, while manually stopping the agent (e.g. for maintenance) would skip the system shutdown.
Could you share how you’re currently running your agents? Are they part of a self-managed fleet, or are you using our Elastic CI Stack on AWS?
At the moment, the shutdown reason isn’t available to pass into the agent-shutdown hook. However, the agent logs do emit the message "All agents have been idle for N seconds. Disconnecting" before shutting down. Would it be possible for you to monitor this log line and trigger a scheduled shutdown of the host accordingly?
Just to clarify, the disconnect-after-idle-timeout setting only takes effect once all agents on a host are idle and ready to disconnect.
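As a rough illustration of the log-monitoring idea, here is a minimal Python sketch. It assumes the log line matches the wording quoted above (worth checking against your agent version), that the agent log can be fed to the script as lines of text, and that the host is Linux with a `shutdown` command available. The helper names `idle_timeout_seconds`, `watch` and `schedule_shutdown` are hypothetical and not part of the agent.

```python
import re
import subprocess
from typing import Callable, Iterable, Optional

# Assumption: wording taken from the message quoted in this thread;
# the real log line may carry a timestamp or other prefix, so we search, not match.
IDLE_PATTERN = re.compile(
    r"All agents have been idle for (\d+) seconds\. Disconnecting"
)


def idle_timeout_seconds(line: str) -> Optional[int]:
    """Return the idle duration if the line is the idle-disconnect message, else None."""
    match = IDLE_PATTERN.search(line)
    return int(match.group(1)) if match else None


def watch(lines: Iterable[str], on_idle: Callable[[int], None]) -> bool:
    """Scan log lines; call on_idle once on the first idle-disconnect message.

    Returns True if the message was seen, False if the input ended without it
    (for example, when the agent was stopped manually).
    """
    for line in lines:
        seconds = idle_timeout_seconds(line)
        if seconds is not None:
            on_idle(seconds)
            return True
    return False


def schedule_shutdown(idle_seconds: int) -> None:
    """Assumption: Linux host; halts the machine one minute from now."""
    subprocess.run(["shutdown", "-h", "+1"], check=True)
```

To use it, you could call `watch(sys.stdin, schedule_shutdown)` (after `import sys`) and pipe the agent log into the script. Because the shutdown only fires when the idle message appears, stopping the agent by hand for maintenance would leave the host running.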
The agents are part of a self-hosted cluster. I could write something to monitor the log for that specific message, but that falls under the “varying hackyness” bucket I’ve already got.