Auto-restart job on terminated spot instance?

A few times a week lately, I’ve been seeing jobs fail on Exited with status -1 (process killed or agent lost; see the timeline tab for more information). I’m pretty sure it’s just AWS terminating the spot instance. Is there a way for the stack to either detect specifically when agents are lost for spot termination or to assume that error, when sent from a spot instance, means it was a spot termination and therefor retry the job?

1 Like

We have gotten around this by wrapping all of our CI tasks in a script such that job failures of exit status 1 get returned as a different exit status, then we can assume that any process that exits with 1 was due to spot termination or other infrastructure stuff and can/should be safely retried.

1 Like

I’m not sure that would work here. The agent is disappearing and the whole job dies so there’s no opportunity to run further logic.

Note that the Buildkite UI shows -1 as the agent’s exit code, not 1. There’s no exit code in the logs; the logs just stop.

Hi @ianwremmel! :wave:

Global retry is not a functionality we have at the moment, but it’s already in our backlog. But as a short term,you could use something like this in the step:

steps:
  - label: "Tests"
    command: "tests.sh"
    retry:
      automatic:
        - exit_status: -1  # Agent was lost
          limit: 2
        - exit_status: 255 # Forced agent shutdown
          limit: 2

Another alternative could be by using YAML-named anchors. In here you can see the proposed solution from another user.
It’s not an ideal solution, but something to work on in the meantime.

Cheers!

Oh nice, that would work, thanks!

Might this be something to build into the default behavior of Elastic CI stack? It seems like this is an expected behavior for an environment backed by spot instances

Nice suggestion @ianwremmel! I’ll let our PM know your comments :slightly_smiling_face:
Thanks!

1 Like

Our solution for this was to setup the aws eventbridge integration, we then filter on the exit codes and trigger a lambda which hits the buildkite api and triggers a retry for the step.

We also setup a graceful shutdown of the agents with the ec2 lifecycle hooks, if they shutdown in the 2 min spot termination you are good, if not the first solution above will retry it