Jobs cancelled when node nears expiry

Hi all,

Our org has seen some CI jobs get cancelled under a specific condition when running Buildkite agents on Kubernetes, and I’m wondering if this is expected behaviour or if there’s a configuration we’re missing.

What happens:

  • If an agent pod is scheduled on a node that is close to its termination time, the pod receives a SIGTERM.

  • The agent begins a graceful shutdown and cancels the job rather than letting it continue.

  • After the configured grace period, the pod then receives SIGKILL and the node is cleaned up.

Question:

  • Is it possible to configure Buildkite to ignore SIGTERM and keep running the job during the grace period?

Would appreciate any guidance or confirmation — and whether this is something addressed in newer versions of the Kubernetes agent stack.

Thanks!

Hi @byao021031 ,

Welcome to the Buildkite Support Community! :waving_hand:

To answer your question: I don’t think there is a way for the agent to ignore a SIGTERM. However, if an agent is cancelled gracefully, it will continue to run for the cancellation grace period configured via terminationGracePeriodSeconds. Maybe you can leverage this config?
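In case it helps, here’s a rough sketch of where that setting lives on a plain Kubernetes pod spec. The pod name and image tag are just placeholders, and how you actually surface the field (raw pod spec, Helm values, or a pod spec patch) will depend on how you deploy the agent stack:

```yaml
# Illustrative only: terminationGracePeriodSeconds is a standard pod-level field.
apiVersion: v1
kind: Pod
metadata:
  name: buildkite-agent                 # placeholder name
spec:
  terminationGracePeriodSeconds: 2700   # 45 minutes between SIGTERM and SIGKILL
  containers:
    - name: agent
      image: buildkite/agent:3          # example image tag
```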

Cheers,

Lizette

Hi @Lizette, thanks for clarifying! :folded_hands:

Just to make sure I understand correctly — from what we’ve observed, when the pod receives a SIGTERM because the node is close to termination, the Buildkite agent cancels the job immediately instead of letting it continue. Even though our terminationGracePeriodSeconds is set to 45m, the job doesn’t actually make use of that grace period; it just exits right away.
• Is the expectation that the agent should still run during the grace period if it was cancelled gracefully?
• Or does SIGTERM always cause the agent to cancel the job, regardless of the configured grace period?

If it’s the latter, would there be any recommended approaches to allow jobs to continue running up until SIGKILL? For example, for longer-running pipelines where the node lifecycle is shorter than the job.

Thanks again for your help!

Hi @byao021031, this is Amna jumping in for Lizette!

Your understanding is correct.

When the pod gets SIGTERM, the agent begins a graceful shutdown right away, which cancels the running job. The various “grace periods” only govern how long the agent waits before force-killing the job process, or how long Kubernetes waits before sending SIGKILL; they don’t make the agent keep running the job during that window. This behavior is separate from Kubernetes’ terminationGracePeriodSeconds.

Is the expectation that the agent should still run during the grace period if it was cancelled gracefully? No. The agent’s grace periods are only for cleanup and orderly shutdown, not continued execution.

Does SIGTERM always cause the agent to cancel the job regardless of the configured grace period? Correct. SIGTERM triggers immediate cancellation; grace settings only control how long shutdown waits before escalation, not whether the job keeps running.

There isn’t a supported way to make the agent keep running jobs after it has received SIGTERM. What you can do is increase the agent’s cancel-grace-period and ensure Kubernetes’ terminationGracePeriodSeconds is longer, so the agent has time to finish cleanup (like uploading artifacts) before Kubernetes sends SIGKILL. This avoids a hard kill mid-cleanup, but the job itself will still be cancelled.
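As a rough illustration of that relationship (not a drop-in config: the pod name and image are placeholders, and BUILDKITE_CANCEL_GRACE_PERIOD is the environment-variable form of the agent’s cancel-grace-period setting, so please double-check it against the agent version you’re running):

```yaml
# Sketch only: the key point is cancel-grace-period < terminationGracePeriodSeconds,
# so the agent finishes cleanup before Kubernetes sends SIGKILL.
apiVersion: v1
kind: Pod
metadata:
  name: buildkite-agent                 # placeholder name
spec:
  terminationGracePeriodSeconds: 600    # Kubernetes waits 10 minutes before SIGKILL
  containers:
    - name: agent
      image: buildkite/agent:3          # example image tag
      env:
        - name: BUILDKITE_CANCEL_GRACE_PERIOD
          value: "300"                  # agent gets up to 5 minutes (in seconds) for cleanup
```

The job is still cancelled either way; the extra headroom just protects the cleanup phase (artifact uploads, log flushing) from being cut off by the SIGKILL.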

Hope this helped.