How can we retry jobs only when the agent is gracefully terminated?

tfuji · December 17, 2021, 6:08am

Hi.

I’d like to retry jobs automatically when the agent is gracefully terminated (exit_status: 255). However, timed-out jobs also have exit_status: 255, and they will be retried automatically when we have a step like this:

steps:
  - command: ...
    retry:
      automatic:
        - exit_status: 255
          limit: 1

Graceful termination of agents can happen exceptionally (like maintenance of infrastructure) and it’s a kind of accidental failure. On the other hand, timeouts are caused by the inappropriate value of timeout_in_minutes in most cases and it’s a reproducible failure. I think they should be treated differently.

Is there any good way to exclude timed-out jobs from the retry condition?

Or we might need a new feature for example:

Use a special exit status (e.g. -2) for timed-out jobs.
Add a new attribute like state (the one in the REST API response) to the automatic retry attributes.

Thank you.

anon4496704 · December 17, 2021, 7:35am

Hi tfuji! Welcome to our community

I just checked a job that has timed out and its exit status was 143 and not 255. For us to check further, Could you please provide the Build URL of a timed-out job where you see the exit code 255?

Best,
Nithya

tfuji · December 17, 2021, 7:56am

Hi, Nithya. Thank you for the quick reply.

This is the Build URL (private).

paula · December 17, 2021, 8:13pm

Hi @tfuji!

Sorry, I removed the link to your build for security reasons You can contact support directly when you need to share private info

Now, regarding your question. Cancellation (including timeout) causes the running process to receive a SIGINT. If the signal is unhandled, the bootstrap reports the exit status as -1 (which can sometimes wrap to 255, though I thought we’d reduced that chance a while ago)… if the signal is handled, then it reports whatever the process’s real exit status is.

Best!

tfuji · December 20, 2021, 7:21am

Hi @paula .

Sorry, I removed the link to your build for security reasons

Thank you!

As you mentioned, the exit status was changed to the real exit status by handling signals like this:

steps:
    commands:
      - trap "echo 'Received SIGINT'" SIGINT
      - trap "echo 'Received SIGTERM'" SIGTERM
      - trap "echo 'Received SIGKILL'" SIGKILL
      - sleep 600
    timeout_in_minutes: 1

I tried this step in the following situations:

stop by timeout
stop the agent before the timeout

In both cases, the build process received SIGTERM and the exit status was 143. The log was completely the same in both cases:

# Received cancellation signal, interrupting
Terminated
Received SIGTERM
🚨 Error: The command exited with status 143

And these are the output of the timeline.

Case 1: stop by timeout

Case 2: stop the agent before the timeout

Can we distinguish these 2 situations and retry the job only when the agent was stopped?

paula · December 20, 2021, 3:02pm

Hey!

Hmm , we received a SIGTERM when the agent is stopped. Maybe you could try changing the cancel-signal on the agent configuration to send other than the SIGTERM.

Best,

tfuji · December 22, 2021, 12:05pm

I’m not sure why but I couldn’t change the signal by setting cancel-signal to the agent… But anyway, I found a feature request which will solve my problem. I’ll wait for that.

Thank you so much for helping me!

Joe · August 8, 2022, 1:32am

Can you run a pre script before the retry?

paula · August 8, 2022, 10:52am

Hey!

Not sure if you are referring to the same issue of the post, but answering your question you have the following hooks available: Buildkite Agent Hooks v3 | Buildkite Documentation and particularly, you have the pre-command to run before the build command.

Best,

Joe · August 9, 2022, 12:36am

So like my scenario is if step Exited with status -1 (agent lost) delete previous package build then retry

paula · August 9, 2022, 6:12am

Hey @Joe, you might be able to make use of BUILDKITE_RETRY_COUNT to achieve this. There isn’t a direct way to run some code before/after a retry - a new job is just executed. But that env var will be incremented each time so you could have some logic in your script to check for the value and perform any necessary tasks

Topic		Replies	Views
Automatically retry failed steps on AGENT_STOP Features Requests	1	770	February 5, 2021
Reschedule builds on other agents rather than Fail builds when agents time out or are killed (machine shut down or put to sleep) Features Requests	5	1762	December 19, 2020
Auto-restart job on terminated spot instance? Elastic CI Stack for AWS	6	1687	June 29, 2021
Buildkite-agent command to signal it should stop after this job Features Requests	2	821	February 18, 2022
Set a fail status to a timed out job Pipelines	1	430	May 12, 2023

How can we retry jobs only when the agent is gracefully terminated?

Related topics