How can we retry jobs only when the agent is gracefully terminated?

Hi.

I’d like to retry jobs automatically when the agent is gracefully terminated (exit_status: 255). However, timed-out jobs also have exit_status: 255, and they will be retried automatically when we have a step like this:

steps:
  - command: ...
    retry:
      automatic:
        - exit_status: 255
          limit: 1

Graceful termination of agents happens only occasionally (for example, during infrastructure maintenance), so it's an incidental failure. Timeouts, on the other hand, are usually caused by an inappropriate timeout_in_minutes value, so they're a reproducible failure. I think the two should be treated differently.

Is there any good way to exclude timed-out jobs from the retry condition?

Or we might need a new feature for example:

Thank you.

Hi tfuji! Welcome to our community :slightly_smiling_face:

I just checked a job that had timed out, and its exit status was 143, not 255. So that we can look into this further, could you please provide the Build URL of a timed-out job where you see exit code 255?

Best,
Nithya

Hi, Nithya. Thank you for the quick reply.

This is the Build URL (private).

Hi @tfuji!

Sorry, I removed the link to your build for security reasons :slight_smile: You can contact support directly when you need to share private info :blush:

Now, regarding your question. Cancellation (including timeout) causes the running process to receive a SIGINT. If the signal is unhandled, the bootstrap reports the exit status as -1 (which can sometimes wrap to 255, though I thought we’d reduced that chance a while ago)… if the signal is handled, then it reports whatever the process’s real exit status is.
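To illustrate the "handled signal" half of that explanation, here is a minimal sketch in plain shell, run outside Buildkite. The exit status 42 is an arbitrary value chosen for illustration, and the child sends the signal to itself instead of being cancelled by an agent. It only shows the general shell behaviour; it does not verify what the bootstrap does with the status.

```shell
#!/bin/sh
# The child shell installs a SIGTERM handler that exits with a status of our
# choosing (42 is arbitrary), then sends SIGTERM to itself.
sh -c 'trap "echo handled SIGTERM; exit 42" TERM; kill -TERM $$; sleep 5'
handled_status=$?

# Because the signal was handled, the parent sees the handler's exit status
# rather than a signal-derived one.
echo "child exited with status $handled_status"
```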

Best!

Hi @paula .

Sorry, I removed the link to your build for security reasons :slight_smile:

Thank you!

As you mentioned, the exit status was changed to the real exit status by handling signals like this:

steps:
  - commands:
      - trap "echo 'Received SIGINT'" SIGINT
      - trap "echo 'Received SIGTERM'" SIGTERM
      # Note: SIGKILL cannot actually be trapped, so this handler never runs
      - trap "echo 'Received SIGKILL'" SIGKILL
      - sleep 600
    timeout_in_minutes: 1

I tried this step in the following situations:

  1. stop by timeout
  2. stop the agent before the timeout

In both cases, the build process received SIGTERM and the exit status was 143. The log was identical in both cases:

# Received cancellation signal, interrupting
Terminated
Received SIGTERM
🚨 Error: The command exited with status 143
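For context, 143 follows the common shell convention of 128 plus the signal number, and SIGTERM is signal 15. A minimal sketch in plain shell, run outside Buildkite, with the child terminating itself instead of being cancelled by an agent:

```shell
#!/bin/sh
# The child shell sends SIGTERM to itself with no handler installed,
# so it is terminated by the signal.
sh -c 'kill -TERM $$; sleep 5'
status=$?
echo "exit status: $status"

# Statuses above 128 conventionally mean "terminated by signal (status - 128)".
signal_number=0
if [ "$status" -gt 128 ]; then
  signal_number=$((status - 128))
  echo "terminated by signal number $signal_number"
fi
```

Since the status only encodes the signal, it cannot by itself say whether the SIGTERM came from a timeout or from an agent shutdown.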

And here is the timeline output for each case:

Case 1: stop by timeout

Case 2: stop the agent before the timeout

Can we distinguish these 2 situations and retry the job only when the agent was stopped?

Hey!

Hmm :thinking:, the job receives a SIGTERM when the agent is stopped. Maybe you could try changing cancel-signal in the agent configuration so that cancellation sends a signal other than SIGTERM.

Best,

I’m not sure why, but I couldn’t change the signal by setting cancel-signal on the agent… Anyway, I found a feature request that would solve my problem, so I’ll wait for that.

Thank you so much for helping me!


Can you run a pre script before the retry?

Hey! :wave:

Not sure if you are referring to the same issue as this post, but to answer your question: the available hooks are listed in Buildkite Agent Hooks v3 | Buildkite Documentation. In particular, the pre-command hook runs before the build command.

Best,

So my scenario is: if a step exited with status -1 (agent lost), delete the previous package build and then retry.

Hey @Joe, you might be able to make use of BUILDKITE_RETRY_COUNT to achieve this. There isn’t a direct way to run some code before or after a retry; a new job is simply executed. But that env var is incremented on each retry, so you could add some logic to your script that checks its value and performs any necessary tasks.
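A minimal sketch of that idea in shell, which could live at the top of the build script or in a pre-command hook. The function name cleanup_previous_build, the PACKAGE_BUILD_DIR variable, and its default path are hypothetical and only for illustration; BUILDKITE_RETRY_COUNT is the variable mentioned above. This assumes the retried job runs in the same checkout as the failed one.

```shell
#!/bin/sh
# Remove a stale package build when the current job is a retry.
# $1: retry count (from BUILDKITE_RETRY_COUNT), $2: directory to remove.
cleanup_previous_build() {
  if [ "$1" -gt 0 ]; then
    echo "Retry #$1: removing previous package build at $2"
    rm -rf -- "$2"
  else
    echo "First attempt: nothing to clean up"
  fi
}

# Default to 0 so the script also works outside a Buildkite job.
retry_count="${BUILDKITE_RETRY_COUNT:-0}"
# Hypothetical location of the previous package build.
package_build_dir="${PACKAGE_BUILD_DIR:-./build/package}"

cleanup_previous_build "$retry_count" "$package_build_dir"
```

The retry count alone does not say why the previous attempt failed, so this cleans up on every retry, not only after an agent was lost.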