Reschedule builds on other agents rather than failing them when agents time out or are killed (machine shut down or put to sleep)

Please can you implement build retries on other agents, to handle cases where the Buildkite agent goes away because the agent or its machine is shut down or the machine is put to sleep.

In distributed processing systems it is common to retry a task 4 times on 4 different hosts before declaring it actually failed, because such failures are often caused by temporary issues or by machines dying. In my case the machines are shut down or put to sleep, because I run Buildkite agents on my laptop.

This would also make Buildkite more suitable for use on Kubernetes, where pods get evicted, or in the cloud, where preemptible instances can be killed at short notice, without enough time for builds to finish cleanly. Both cases result in spurious build failures and red failed badges on projects, which shouldn’t happen but currently does (which is how I found this and came to raise this ticket).

Isn’t this already achieved by https://buildkite.com/docs/pipelines/command-step#automatic-retry-attributes?


@moensch’s suggestion is the way that we would recommend you handle these situations.

As in the example provided, you can utilise the exit statuses to check for particular failures:

A job will fail with an exit status of -1 if communication with the agent has been lost (e.g. the agent has been forcefully terminated, or the agent machine was shut down without allowing the agent to disconnect). See the section on Exit Codes for information on other exit codes.

@moensch / @Jason ah yes, you are correct. Thanks, I didn’t see that.

I guess I’d change this request to: make retries automatic when the failure is a Buildkite/agent issue, since I want my CI pipelines to reflect the state of the code being tested.

For now, is it possible to set this automatic exit status retry at the global or pipeline level, rather than repeating it for every step in each pipeline and ballooning out my pipeline.yml? (I have 4 such steps in each pipeline, so this is a lot of redundancy.)

- command: make
  retry:
    automatic:
      - exit_status: -1  # Agent was lost
        limit: 2
      - exit_status: 255 # Forced agent shutdown
        limit: 2

It’s not possible to set it at a global level at the moment, sorry.

But you could use YAML anchors here:

anchors:
  std_retries: &std_retries
    retry:
      automatic:
        - exit_status: -1  # Agent was lost
          limit: 2
        - exit_status: 255 # Forced agent shutdown
          limit: 2
      
steps:
  - command: exit -1
    <<: [*std_retries]
  - command: exit 255
    <<: [*std_retries]
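
If the anchors still feel repetitive, another option is to generate the pipeline instead of writing it by hand. The following is only a rough sketch, not an official Buildkite feature: it adds the same two retry rules to every command step that does not already define its own `retry` block, then prints the pipeline as JSON. The function name `with_std_retries` and the sample steps are made up for illustration, and it assumes your setup uploads the generated pipeline itself.

```python
import copy
import json
import sys

# Retry rules from this thread: -1 = agent was lost, 255 = forced agent shutdown.
STD_RETRY = {
    "automatic": [
        {"exit_status": -1, "limit": 2},
        {"exit_status": 255, "limit": 2},
    ]
}


def with_std_retries(steps):
    """Return a new list of steps; command steps without a retry block get STD_RETRY."""
    result = []
    for step in steps:
        is_command_step = isinstance(step, dict) and (
            "command" in step or "commands" in step
        )
        if is_command_step and "retry" not in step:
            # Copy so that the caller's step and the shared STD_RETRY stay untouched.
            step = {**step, "retry": copy.deepcopy(STD_RETRY)}
        result.append(step)
    return result


if __name__ == "__main__":
    # Hypothetical steps, for illustration only.
    steps = [
        {"command": "make"},
        "wait",
        {"command": "make test"},
    ]
    json.dump({"steps": with_std_retries(steps)}, sys.stdout, indent=2)
    sys.stdout.write("\n")
```

Assuming the script is saved as `generate_pipeline.py` (a made-up name), its output could be piped into `buildkite-agent pipeline upload` from a first step. Steps that already set `retry` are left alone, so an individual step can still opt out or use different limits.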

Thanks, I’ll try that!