Reschedule builds on other agents rather than failing builds when agents time out or are killed (machine shut down or put to sleep)

harisekhon · December 15, 2020, 6:31pm

Please can you implement build retries on other agents to handle cases where the Buildkite agent goes away because the agent or machine is shut down, or the machine is put to sleep.

In distributed processing systems it is common to retry a task 4 times on 4 different hosts before declaring it actually failed, because those failures are often due to temporary issues or machines dying. In my case, the machine gets shut down or put to sleep because I run Buildkite agents on my laptop.

This would also make Buildkite more suitable for use on Kubernetes, where pods get evicted, or on cloud platforms, where preemptible instances can be killed at short notice without enough time for builds to finish cleanly. Both cases result in false-negative build failures and red failed badges on projects, which shouldn’t happen but currently does (that is how I found this and came to raise this ticket).

moensch · December 16, 2020, 7:01pm

Isn’t this already achieved by https://buildkite.com/docs/pipelines/command-step#automatic-retry-attributes?

Jason · December 17, 2020, 12:41am

@moensch’s suggestion is the way we would recommend you handle these situations.

As in the example provided, you can utilise the exit statuses to check for particular failures:

A job will fail with an exit status of -1 if communication with the agent has been lost (e.g. the agent has been forcefully terminated, or the agent machine was shut down without allowing the agent to disconnect). See the section on Exit Codes for information on other exit codes.

harisekhon · December 17, 2020, 1:34pm

@moensch / @Jason ah yes, you are correct. Thanks, I hadn’t seen that.

I guess I’d change this request to make retries automatic when it is a buildkite/agent issue since I want my CI pipelines to reflect the state of the code being tested.

For now, is it possible to set this automatic exit status retry at the global or pipeline level rather than having to repeat this for every step in each pipeline, ballooning out my pipeline.yml? (I have 4 such steps in each pipeline so this is a lot of redundancy)

- command: make
  retry:
    automatic:
      - exit_status: -1  # Agent was lost
        limit: 2
      - exit_status: 255 # Forced agent shutdown
        limit: 2

Jason · December 18, 2020, 1:55am

Sorry, it’s not currently possible to set it at a global level.

But you could use YAML anchors here:

anchors:
  std_retries: &std_retries
    retry:
      automatic:
        - exit_status: -1  # Agent was lost
          limit: 2
        - exit_status: 255 # Forced agent shutdown
          limit: 2
      
steps:
  - command: exit -1
    <<: [*std_retries]
  - command: exit 255
    <<: [*std_retries]
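
For reference, here is a sketch of what the first step above should resolve to once the `<<` merge key is expanded, assuming the YAML parser supports merge keys. It is equivalent to writing the retry block inline on the step:

```yaml
steps:
  - command: exit -1
    retry:
      automatic:
        - exit_status: -1  # Agent was lost
          limit: 2
        - exit_status: 255 # Forced agent shutdown
          limit: 2
```

Every step that merges `*std_retries` picks up the same `retry` block, so the retry rules only need to be maintained in one place per pipeline file.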

harisekhon · December 19, 2020, 3:21pm

Thanks, I’ll try that!
