Started losing agents with version 3.116.0

We run Buildkite agents in our Kubernetes cluster to perform deployments.
After upgrading from version 3.115.4 to version 3.116.0, we've started to see Helm deployments intermittently interrupted by Exited with status -1 (agent lost) messages.

I collected the logs for the most recent interruption before reverting our agent back to version 3.115.4 again:

2026-02-02 03:42:05 INFO   buildkite-fcc848567-gz8nb-1 Assigned job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9. Accepting... 
2026-02-02 03:42:05 INFO   buildkite-fcc848567-gz8nb-1 Starting job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9 for build at https://buildkite.com/healthengineau/megatron-deploy/builds/23645 
2026-02-02 03:42:06 INFO   buildkite-fcc848567-gz8nb-1 [Process] Process is running with PID: 1702 
2026-02-02 03:44:13 INFO   Received CTRL-C, send again to forcefully kill the agent(s)
2026-02-02 03:44:13 INFO   buildkite-fcc848567-gz8nb-1 Gracefully stopping agent. Waiting for current job to finish before disconnecting... 
2026-02-02 03:47:11 ERROR  buildkite-fcc848567-gz8nb-1 Invalid access token, cancelling job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9 
2026-02-02 03:47:11 INFO   buildkite-fcc848567-gz8nb-1 Canceling job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9 with a signal grace period of 9s (access token is invalid) 
2026-02-02 03:47:11 INFO   buildkite-fcc848567-gz8nb-1 Process with PID: 1702 finished with Exit Status: -1, Signal: SIGTERM 
2026-02-02 03:47:11 WARN   buildkite-fcc848567-gz8nb-1 Buildkite rejected the chunk upload (POST https://agent.buildkite.com/v3/jobs/019c1c71-0021-464f-a6d0-e73b0cdb2eb9/chunks?sequence=29&offset=381783&size=300: 401 Unauthorized: Invalid access token) 
2026-02-02 03:47:11 ERROR  buildkite-fcc848567-gz8nb-1 Giving up on uploading chunk 29, this will result in only a partial build log on Buildkite 
2026-02-02 03:47:11 WARN   buildkite-fcc848567-gz8nb-1 1 chunks failed to upload for this job 
2026-02-02 03:47:12 WARN   buildkite-fcc848567-gz8nb-1 Buildkite rejected the call to finish the job (PUT https://agent.buildkite.com/v3/jobs/019c1c71-0021-464f-a6d0-e73b0cdb2eb9/finish: 401 Unauthorized: Invalid access token) 
2026-02-02 03:47:12 ERROR  buildkite-fcc848567-gz8nb-1 Couldn't mark job as finished: PUT https://agent.buildkite.com/v3/jobs/019c1c71-0021-464f-a6d0-e73b0cdb2eb9/finish: 401 Unauthorized: Invalid access token 
2026-02-02 03:47:12 INFO   buildkite-fcc848567-gz8nb-1 Finished job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9 for build at https://buildkite.com/healthengineau/megatron-deploy/builds/23645

I guess Kubernetes sent a SIGTERM to the agent (probably due to cluster-autoscaler rebalancing pods or something similar).
We have terminationGracePeriodSeconds on our pods set to 1800, so jobs have 30 minutes to complete before Kubernetes kills the pod in this case.
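For reference, a minimal sketch of the relevant part of the pod spec (the names and image tag here are illustrative, not our actual manifest):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: buildkite-agent        # illustrative name
spec:
  template:
    spec:
      # Jobs get 30 minutes to finish after Kubernetes sends SIGTERM,
      # before the pod is force-killed with SIGKILL.
      terminationGracePeriodSeconds: 1800
      containers:
        - name: agent
          image: buildkite/agent:3.115.4   # pinned back after the regression
```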

The agent wants to do the right thing and wait for the current job to complete before disconnecting, since it logs Gracefully stopping agent. Waiting for current job to finish before disconnecting... However, this is immediately followed by these lines:

Invalid access token, cancelling job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9
Canceling job 019c1c71-0021-464f-a6d0-e73b0cdb2eb9 with a signal grace period of 9s (access token is invalid)

So the job was only given a 9-second grace period before being killed.

I have no idea why the access token would become invalid.
I've gone back over our logs for the past 90 days, and the Invalid access token, cancelling job message only appears after the upgrade to the new agent version.

Any ideas what is happening?

Hi Jim, thanks for contacting us about this! Just so I'm clear: it sounds like you are not using our agent stack for Kubernetes (a custom controller for managing Buildkite agents in k8s), is that right?

No. We have our own deployment that we’ve been using for years.

Thanks for confirming that. I think you might have uncovered a bug here. This PR in the agent changed how we handle cases where the agent ping or heartbeat get into an unrecoverable state. However, as an unintended side-effect, it seems to cause the heartbeat loop to exit immediately when StopGracefully() is called, even if a job is still running. I think that explains the behavior you’re seeing, and why it went away after you stepped back one version (good on you for keeping your agents up to date!). I’ll raise this with our engineers for verification.


Confirmed that this was a bug, so thanks for reporting it! This PR has a fix, if you want to watch for that to be merged and released.
