From the host running the agent, can it be determined if the agent is executing a job?

petergoldsmith-rea · February 4, 2021, 10:21am

Hi there,
I’m wondering if there is some way to inspect a running agent to determine whether it is currently executing a job. My use case is that we run a fleet of Macs, and to perform rolling updates I take the agents offline until the updates are complete.
The mechanism I currently use to take the agents offline is launchctl unload <launch-agent-plist>, which in turn sends a SIGTERM to the agent. That call is non-blocking, though, so I still don’t know when the agent has finished up gracefully.

I might be missing something obvious.
Cheers, Pete

sj26 · February 4, 2021, 11:08am

Hi @petergoldsmith-rea,

Hmm, you could send a TERM via launchctl and then loop waiting for pgrep buildkite-agent to be empty perhaps? Or for launchctl list homebrew.mxcl.buildkite-agent to stop returning a PID (... | grep -q PID)?

It would be lovely if Apple provided something like systemctl stop [--no-block] buildkite-agent, but I’m not aware of anything.

Cheers,
Sam
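For anyone wanting to try the polling idea above, here is a minimal sketch rather than a tested recipe. The function name wait_for_no_process and its interval/timeout arguments are made up for illustration; it assumes pgrep is available on the host and that the agent process is named exactly buildkite-agent.

```shell
#!/bin/sh
# Hypothetical helper: block until no process with the exact name "$1" exists.
#   $1 = process name to look for (e.g. buildkite-agent)
#   $2 = seconds between checks (default 5)
#   $3 = give up after this many seconds; 0 means wait forever (default 0)
# Returns 0 once the process is gone, 1 if the timeout is reached first.
wait_for_no_process() {
    wfnp_name=$1
    wfnp_interval=${2:-5}
    wfnp_timeout=${3:-0}
    wfnp_waited=0

    # pgrep -x matches the process name exactly and exits non-zero
    # when nothing matches, which ends the loop.
    while pgrep -x "$wfnp_name" >/dev/null 2>&1; do
        if [ "$wfnp_timeout" -gt 0 ] && [ "$wfnp_waited" -ge "$wfnp_timeout" ]; then
            return 1
        fi
        sleep "$wfnp_interval"
        wfnp_waited=$((wfnp_waited + wfnp_interval))
    done
    return 0
}
```

You would run launchctl unload <launch-agent-plist> first and then call wait_for_no_process buildkite-agent before starting the update. A timeout of 0 seems the safer choice when jobs can run for a long time, since the loop then only ends once the agent has actually exited.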

petergoldsmith-rea · February 4, 2021, 11:16pm

How do I prevent SIGTERM killing a job mid-flight? Just by having an excessively large cancel-grace-period? The jobs we run can sometimes exceed the hour mark.

sj26 · February 4, 2021, 11:44pm

Yeah, a SIGTERM will tell the agent to gracefully allow a job to finish without a timeout (docs), but launchd timeouts would also need to be considered. I don’t have enough experience to suggest how to do that, sorry! But it sounds like you’re on the right track.

petergoldsmith-rea · February 5, 2021, 12:04am

Thank you!!! I would have been pulling my hair out had you not pointed me in the direction of launchd’s own behaviour for SIGTERM → SIGKILL

For anyone else who comes across this while trying to do the same, you’ll want ExitTimeOut set to 0 or a very large value in your launch agent plist:

<key>ExitTimeOut</key>
<integer>0</integer>

From the launchd.plist man page:

 ExitTimeOut <integer>
     The amount of time launchd waits before sending a SIGKILL signal. The
     default value is 20 seconds. The value zero is interpreted as infinity.
