Is buildkite-agent intended to be used on preemptible instances?

onsails · October 22, 2020, 10:42pm

I am running buildkite-agent inside a Google Kubernetes cluster on a preemptible node pool. When a node goes down (or when I simulate this by just killing the pod), the agent stays in the Buildkite agents list for a very long time. I know that there is a health-check-addr option, but it’s not an option for agents running inside private clusters.
Is there any way to set some kind of timeout that would be tracked on the Buildkite API side?

onsails · October 23, 2020, 11:24am

I suppose the perfect solution would be an option like eviction-id, which tells the API that if a new agent is spawned, the old one with the same eviction-id should be stopped by the Buildkite API.

juanitofatas · October 27, 2020, 2:09am

Hi Andrey,

I am not familiar with this setup, but I think it is fine to use the agent on preemptible instances. There is similar work that adds scripting around the life cycle of the agent, for example in our Elastic CI Stack for AWS:
https://github.com/buildkite/elastic-ci-stack-for-aws/pull/737

I found that Google has an article on terminating with grace:
https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace

I cannot see a way to send a double SIGTERM to the agent directly, but you could implement a preStop hook that sends the first SIGTERM and then sleeps for 20 seconds. Once that is done, Kubernetes will send the second SIGTERM, which will force the job to stop, and the agent will be able to deregister.

I hope this helps!

Cheers,
Juanito

megan · October 30, 2020, 8:31am

Is this a suggestion, or does this option already exist to fix the issue?

juanitofatas · November 2, 2020, 3:25am

Sorry, for now you would need to implement scripting on GCP similar to https://github.com/buildkite/elastic-ci-stack-for-aws/pull/737.

harisekhon · December 15, 2020, 6:31pm

Buildkite is not well suited to preemptible instances: even with a spot termination script to shut down the agent, you usually have only 30 seconds to do so. Since most builds take longer than 30 seconds, the agent cannot wait for the job to finish, and you will incur spurious build failures.

See this ticket I’ve just raised which would make BuildKite more suitable for use with pre-emptible instances:

harisekhon · December 17, 2020, 1:36pm

Update: you will need to configure automatic retries for when the agent is lost or killed, in which case it should work. See the thread I linked, which has the details.
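
As a rough illustration of the automatic retries mentioned above, here is a minimal sketch of a dynamic pipeline generator. It assumes that an exit status of -1 is how Buildkite reports a lost agent; the function name emit_pipeline, the step label, the `make test` command, and the retry limit of 2 are all placeholders chosen for this example, not values from the linked thread.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print a pipeline whose step is retried automatically
# when the agent running it disappears (assumed to surface as exit status -1).
emit_pipeline() {
  cat <<'YAML'
steps:
  - label: "tests"          # placeholder label
    command: "make test"    # placeholder command
    retry:
      automatic:
        - exit_status: -1   # assumption: -1 means the agent was lost
          limit: 2          # placeholder retry limit
YAML
}
```

The output could then be piped into the agent from an existing pipeline step, for example with `emit_pipeline | buildkite-agent pipeline upload`.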

onsails · December 25, 2020, 5:15pm

I’ve found an easy way to keep agents running on Kubernetes from hanging on redeployment. It may work for preemption events too, but I haven’t had a chance to test that yet.

Create pre-stop.sh and add it to the container:

#!/usr/bin/env bash
set -euo pipefail

# https://buildkite.com/docs/agent/v3#signal-handling

echo "Sending SIGTERM to buildkite-agent" > /dev/termination-log
pkill -TERM buildkite-agent

# TODO: send SIGQUIT after sleep 25?

Add a preStop hook to the deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  ...
  template:
    spec:
      containers:
        - name: buildkite-agent
          image: ...
          lifecycle:
            preStop:
              exec:
                command: ["/usr/bin/env", "bash", "/pre-stop.sh"]

On the next deployment or pod deletion, buildkite-agent will notify the API that it has stopped.

Warning: for some reason it doesn’t wait for an ongoing step to finish; the step fails with "Terminating bootstrap after cancellation with terminated".
