We run Buildkite agents in our Kubernetes cluster to perform deployments.
After upgrading from version 3.115.4 to version 3.116.0 we’ve started to intermittently have Helm deployments interrupted by "Exited with status -1 (agent lost)" messages.
I collected the logs for the most recent interruption before reverting our agent back to version 3.115.4 again:
2026-02-02 03:42:05 INFO buildkite-fcc848567-gz8nb-1 Assigned job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9. Accepting...
2026-02-02 03:42:05 INFO buildkite-fcc848567-gz8nb-1 Starting job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9 for build at https://buildkite.com/healthengineau/megatron-deploy/builds/23645
2026-02-02 03:42:06 INFO buildkite-fcc848567-gz8nb-1 [Process] Process is running with PID: 1702
2026-02-02 03:44:13 INFO Received CTRL-C, send again to forcefully kill the agent(s)
2026-02-02 03:44:13 INFO buildkite-fcc848567-gz8nb-1 Gracefully stopping agent. Waiting for current job to finish before disconnecting...
2026-02-02 03:47:11 ERROR buildkite-fcc848567-gz8nb-1 Invalid access token, cancelling job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9
2026-02-02 03:47:11 INFO buildkite-fcc848567-gz8nb-1 Canceling job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9 with a signal grace period of 9s (access token is invalid)
2026-02-02 03:47:11 INFO buildkite-fcc848567-gz8nb-1 Process with PID: 1702 finished with Exit Status: -1, Signal: SIGTERM
2026-02-02 03:47:11 WARN buildkite-fcc848567-gz8nb-1 Buildkite rejected the chunk upload (POST https://agent.buildkite.com/v3/jobs/019c1c71-0021-464f-a6d0-e73b0cdb2eb9/chunks?sequence=29&offset=381783&size=300: 401 Unauthorized: Invalid access token)
2026-02-02 03:47:11 ERROR buildkite-fcc848567-gz8nb-1 Giving up on uploading chunk 29, this will result in only a partial build log on Buildkite
2026-02-02 03:47:11 WARN buildkite-fcc848567-gz8nb-1 1 chunks failed to upload for this job
2026-02-02 03:47:12 WARN buildkite-fcc848567-gz8nb-1 Buildkite rejected the call to finish the job (PUT https://agent.buildkite.com/v3/jobs/019c1c71-0021-464f-a6d0-e73b0cdb2eb9/finish: 401 Unauthorized: Invalid access token)
2026-02-02 03:47:12 ERROR buildkite-fcc848567-gz8nb-1 Couldn't mark job as finished: PUT https://agent.buildkite.com/v3/jobs/019c1c71-0021-464f-a6d0-e73b0cdb2eb9/finish: 401 Unauthorized: Invalid access token
2026-02-02 03:47:12 INFO buildkite-fcc848567-gz8nb-1 Finished job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9 for build at https://buildkite.com/healthengineau/megatron-deploy/builds/23645
I guess Kubernetes sent a SIGTERM to the agent (probably because cluster-autoscaler was rebalancing pods or something like that).
We have the terminationGracePeriodSeconds on our pods set to 1800 seconds so jobs have 30 mins to complete before Kubernetes kills the pod in this case.
The agent wants to do the right thing and wait for the current job to complete before killing it, since it has logged "Gracefully stopping agent. Waiting for current job to finish before disconnecting...". However, the next lines in the log, about three minutes later at 03:47:11, are:
Invalid access token, cancelling job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9
Canceling job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9 with a signal grace period of 9s (access token is invalid)
So it only gave 9 seconds before killing the job.
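For reference, here is a rough sketch of the timeline arithmetic using the timestamps from the log above and our 1800 second grace period. The variable names are just mine for illustration, and it assumes the "Received CTRL-C" line marks the moment Kubernetes asked the pod to stop.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the agent log above.
job_started = datetime.strptime("2026-02-02 03:42:05", FMT)
# Assumption: "Received CTRL-C" is when the pod was asked to stop.
stop_requested = datetime.strptime("2026-02-02 03:44:13", FMT)
token_rejected = datetime.strptime("2026-02-02 03:47:11", FMT)

# terminationGracePeriodSeconds configured on our pods.
pod_grace_seconds = 1800

ran_before_stop = (stop_requested - job_started).total_seconds()
waited_after_stop = (token_rejected - stop_requested).total_seconds()
unused_grace = pod_grace_seconds - waited_after_stop

print(f"job ran {ran_before_stop:.0f}s before the stop request")
print(f"token rejected {waited_after_stop:.0f}s after the stop request")
print(f"{unused_grace:.0f}s of the pod grace period was still unused")
```

So the job was cancelled roughly three minutes into the graceful stop, with most of the 30 minute grace period still remaining.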
I have no idea why the access token would become invalid.
I’ve gone back over our logs for the past 90 days and the "Invalid access token, cancelling job" log line has only appeared since the upgrade to the new agent.
Any ideas what is happening?