I’d like to retry jobs automatically when the agent is gracefully terminated (exit_status: 255). However, timed-out jobs also have exit_status: 255, so they too are retried automatically when a step has an automatic retry rule for that exit status.
Graceful termination of agents happens only in exceptional circumstances (such as infrastructure maintenance), so it is an accidental failure. Timeouts, on the other hand, are usually caused by an inappropriate timeout_in_minutes value, so they are reproducible failures. I think the two should be treated differently.
Is there any good way to exclude timed-out jobs from the retry condition?
Or we might need a new feature, for example:
Use a special exit status (e.g. -2) for timed-out jobs.
I just checked a job that had timed out, and its exit status was 143, not 255. So that we can look into this further, could you please provide the build URL of a timed-out job where you see exit code 255?
Sorry, I removed the link to your build for security reasons. You can contact support directly when you need to share private info.
Now, regarding your question. Cancellation (including timeout) causes the running process to receive a SIGINT. If the signal is unhandled, the bootstrap reports the exit status as -1 (which can sometimes wrap to 255, though I thought we’d reduced that chance a while ago)… if the signal is handled, then it reports whatever the process’s real exit status is.
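For reference, here is a small sketch of how the numbers mentioned in this thread relate to each other. It only illustrates two general facts: an exit status is reported modulo 256, so -1 shows up as 255, and shells conventionally report a process killed by signal N as 128 + N. The helper names `wrapped_status` and `signal_status` are made up for this example, and whether the agent's bootstrap follows exactly the same mapping is an assumption.

```shell
#!/usr/bin/env bash

# Hypothetical helper: show how a (possibly negative) status looks once it
# is reduced to the 0-255 range that exit statuses are reported in.
wrapped_status() {
  echo $(( ($1 % 256 + 256) % 256 ))
}

# Hypothetical helper: the conventional shell status for a process that was
# killed by an unhandled signal with the given number.
signal_status() {
  echo $(( 128 + $1 ))
}

wrapped_status -1   # prints 255
signal_status 15    # prints 143 (SIGTERM is signal 15)
signal_status 2     # prints 130 (SIGINT is signal 2)
```

Under these conventions, a 143 would point to a process killed by SIGTERM, while a 255 would be consistent with a -1 that wrapped.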
Hmm, we received a SIGTERM when the agent was stopped. Maybe you could try changing cancel-signal in the agent configuration so that it sends a signal other than SIGTERM.
I’m not sure why, but I couldn’t change the signal by setting cancel-signal on the agent… But anyway, I found a feature request that will solve my problem. I’ll wait for that.
Not sure if you are referring to the same issue as this post, but to answer your question, the following hooks are available: Buildkite Agent Hooks v3 | Buildkite Documentation. In particular, the pre-command hook runs before the build command.
Hey @Joe, you might be able to make use of BUILDKITE_RETRY_COUNT to achieve this. There isn’t a direct way to run some code before/after a retry - a new job is just executed. But that env var is incremented each time, so you could add some logic to your script that checks the value and performs any necessary tasks.
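A minimal sketch of that idea, assuming the command step runs a bash script: `job_attempt_kind` is a hypothetical helper name, and the body of the retry branch is a placeholder for whatever tasks you need. Only BUILDKITE_RETRY_COUNT comes from the agent; it is treated as 0 when unset.

```shell
#!/usr/bin/env bash

# Hypothetical helper: classify the current job as a first run or a retry,
# based on the BUILDKITE_RETRY_COUNT environment variable (default 0).
job_attempt_kind() {
  if [ "${BUILDKITE_RETRY_COUNT:-0}" -gt 0 ]; then
    echo "retry"
  else
    echo "first-run"
  fi
}

if [ "$(job_attempt_kind)" = "retry" ]; then
  echo "Retry #${BUILDKITE_RETRY_COUNT}: running retry-only tasks"
  # Placeholder: put any work that should happen only on retries here.
fi
```

The same check could live in a pre-command hook instead of the build script, if the retry-only work should run before the command starts.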