I am running buildkite-agent inside a Google Kubernetes Engine cluster on a preemptible node pool. When a node goes down (or when I simulate this situation by just killing the pod), the agent stays in the Buildkite agents list for a very long time. I know there is a health-check-addr
option, but it is not usable for agents running inside private clusters.
Is there any way to set some kind of timeout which would be tracked on the Buildkite API side?
I suppose the perfect solution would be an option like eviction-id
which tells the API that when a new agent is spawned, any old agent with the same eviction-id
should be stopped by the Buildkite API.
Hi Andrey,
I am not familiar with this setup, but I think the agent is OK to use on preemptible instances. There is similar work that adds scripting around the life cycle of the agent. Here is an example of such scripting from our Elastic CI Stack for AWS:
https://github.com/buildkite/elastic-ci-stack-for-aws/pull/737
I found that Google has an article on terminating with grace:
https://cloud.google.com/blog/products/gcp/kubernetes-best-practices-terminating-with-grace
I cannot see a built-in way to send a double SIGTERM to the agent, but you could implement a preStop hook that sends the first SIGTERM and then sleeps for 20 seconds. Once that is done, Kubernetes will send the second SIGTERM, which will force the job to stop so the agent can deregister.
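A minimal sketch of what that hook could look like, assuming buildkite-agent is the container's main process (the pkill target and the 20-second sleep are illustrative, not documented values):

```yaml
# Hypothetical preStop hook: the first SIGTERM asks the agent to stop
# gracefully; the sleep gives it time before Kubernetes continues with
# termination of the main process.
lifecycle:
  preStop:
    exec:
      command:
        - /usr/bin/env
        - bash
        - -c
        - "pkill -TERM buildkite-agent && sleep 20"
```

Note that the preStop hook runs within the pod's termination grace period, so the sleep must fit inside terminationGracePeriodSeconds.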
I hope this helps!
Cheers,
Juanito
Is this a suggestion, or does this option already exist to fix the issue?
Sorry, you would need to implement scripting similar to https://github.com/buildkite/elastic-ci-stack-for-aws/pull/737 for now on GCP.
Buildkite is not well suited to preemptible instances because, even with a spot-termination script to shut down the agent, you usually only have 30 seconds to do so, and since most builds take longer than 30 seconds you won't be able to wait and will incur false-negative build failures.
See this ticket I've just raised, which would make Buildkite more suitable for use with preemptible instances:
Update: you will need to configure automatic retries for when the agent is lost or killed, in which case it should work; see the thread I linked, which has the details.
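For reference, automatic retries are configured per step in the pipeline YAML; an exit status of -1 is what Buildkite reports when the connection to the agent is lost (the label, command, and limit below are illustrative):

```yaml
steps:
  - label: "tests"
    command: "./run-tests.sh"
    retry:
      automatic:
        - exit_status: -1  # connection to the agent was lost (e.g. preempted node)
          limit: 2
```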
I've found an easy way for agents running on Kubernetes to avoid hanging on redeployment. Perhaps it will work for a preemption event too; I haven't had a chance to test that yet.
- Create pre-stop.sh and add it to the container:
#!/usr/bin/env bash
set -euo pipefail
# https://buildkite.com/docs/agent/v3#signal-handling
echo "Sending SIGTERM to buildkite-agent" > /dev/termination-log
pkill -TERM buildkite-agent
# TODO: send SIGQUIT after sleep 25?
- Add preStop hook to deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  ...
  template:
    spec:
      containers:
        - name: buildkite-agent
          image: ...
          lifecycle:
            preStop:
              exec:
                command: ["/usr/bin/env", "bash", "/pre-stop.sh"]
On the next deployment or pod deletion, buildkite-agent will notify the API that it has stopped.
Warning: for some reason it doesn't wait for an ongoing step to finish; the step fails with Terminating bootstrap after cancellation with terminated.
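One thing worth checking here, assuming the default pod settings: Kubernetes allows only terminationGracePeriodSeconds (30 seconds by default) for the preStop hook plus container shutdown combined, so any in-flight step longer than that will be killed regardless of what the hook does. Raising the value in the pod spec may give the agent room to finish; the 120 below is illustrative:

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 120  # default is 30; must cover preStop + agent shutdown
```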