Auto-restart job on terminated spot instance?

ianwremmel · June 23, 2021, 4:59pm

A few times a week lately, I’ve been seeing jobs fail on Exited with status -1 (process killed or agent lost; see the timeline tab for more information). I’m pretty sure it’s just AWS terminating the spot instance. Is there a way for the stack to either detect specifically when agents are lost for spot termination or to assume that error, when sent from a spot instance, means it was a spot termination and therefor retry the job?

geoffharcourt · June 23, 2021, 5:57pm

We have gotten around this by wrapping all of our CI tasks in a script such that job failures of exit status 1 get returned as a different exit status, then we can assume that any process that exits with 1 was due to spot termination or other infrastructure stuff and can/should be safely retried.

ianwremmel · June 23, 2021, 7:39pm

I’m not sure that would work here. The agent is disappearing and the whole job dies so there’s no opportunity to run further logic.

Note that the Buildkite UI shows -1 as the agent’s exit code, not 1. There’s no exit code in the logs; the logs just stop.

paula · June 24, 2021, 3:34am

Hi @ianwremmel!

Global retry is not a functionality we have at the moment, but it’s already in our backlog. But as a short term,you could use something like this in the step:

steps:
  - label: "Tests"
    command: "tests.sh"
    retry:
      automatic:
        - exit_status: -1  # Agent was lost
          limit: 2
        - exit_status: 255 # Forced agent shutdown
          limit: 2

Another alternative could be by using YAML-named anchors. In here you can see the proposed solution from another user.
It’s not an ideal solution, but something to work on in the meantime.

Cheers!

ianwremmel · June 24, 2021, 3:54am

Oh nice, that would work, thanks!

Might this be something to build into the default behavior of Elastic CI stack? It seems like this is an expected behavior for an environment backed by spot instances

paula · June 24, 2021, 3:57am

Nice suggestion @ianwremmel! I’ll let our PM know your comments
Thanks!

scotth · June 29, 2021, 6:57am

Our solution for this was to setup the aws eventbridge integration, we then filter on the exit codes and trigger a lambda which hits the buildkite api and triggers a retry for the step.

We also setup a graceful shutdown of the agents with the ec2 lifecycle hooks, if they shutdown in the 2 min spot termination you are good, if not the first solution above will retry it

Topic		Replies	Views
Automatically retry failed steps on AGENT_STOP Features Requests	1	767	February 5, 2021
Reschedule builds on other agents rather than Fail builds when agents time out or are killed (machine shut down or put to sleep) Features Requests	5	1760	December 19, 2020
How can we retry jobs only when the agent is gracefully terminated? Pipelines	10	2592	August 9, 2022
Exited with status -1 (agent lost) Pipelines	3	420	August 21, 2024
Timeout waiting for agent Features Requests	15	4218	December 13, 2023

Auto-restart job on terminated spot instance?

Related topics