Newer versions of the Buildkite agent don't respect the cancel grace period

Posting here as GitHub issues don’t seem to be monitored…

Somewhere in the last few releases, the Buildkite agent stopped respecting BUILDKITE_CANCEL_GRACE_PERIOD, which we currently set to 600. Instead, it kills the job after the default 10-second grace period.

I rolled back our agent to 3.65 and that version seems to respect it properly.

Also reported here: Newer versions don't respect `BUILDKITE_CANCEL_GRACE_PERIOD` · Issue #2748 · buildkite/agent · GitHub

Hello Evan.

Welcome to the Buildkite community! We appreciate you bringing this to our attention. Starting from agent version 3.66.0, we have expanded graceful cancellation to all job phases. This means that if a cancellation signal is sent while an agent is running a job, the agent will by default complete the current job and then shut down gracefully. This feature essentially replicates the functionality of BUILDKITE_CANCEL_GRACE_PERIOD, which ensures that the agent shuts down once the current job is finished or once the time specified in the BUILDKITE_CANCEL_GRACE_PERIOD environment variable has elapsed, whichever comes first.

I ran tests using a code snippet that traps “SIGTERM” and executes a small piece of code to guarantee a graceful termination. After multiple tests, I found that the job ended within ~10 seconds only in the following scenarios:

  • When I forced the agent to cancel by issuing another cancel signal after the initial SIGTERM.
  • When I cancelled the job through the UI.

However, in these scenarios, I noticed that BUILDKITE_CANCEL_GRACE_PERIOD was not being honored, which is expected when we forcefully stop the agent. I’m curious to know whether you experienced the same behavior or something different. Did you observe that cancelling a job on the later versions of the agent (3.66.0 or above) still cancels the job immediately after the initial cancel signal, without waiting for the currently running job to complete?

Cheers! :rocket:
Athreya.

Yes, my specific repro steps were:

  • Run buildkite agent with BUILDKITE_CANCEL_GRACE_PERIOD=600 in the environment
  • Start a job
  • Cancel the job via Buildkite UI
  • Job handles the SIGTERM and starts a graceful shutdown
  • 10 seconds later, the job is forcefully killed
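For anyone who wants to reproduce this, here is a minimal sketch of a job command that handles SIGTERM and then takes a while to shut down, as in the fourth step above. It is not the script we actually run; the class name, the 60-second cleanup duration, and the log messages are all made up for illustration, and it assumes the cancellation reaches the job process as a SIGTERM. The last "cleanup in progress" line in the job log shows roughly how long the job survived after the signal.

```python
import signal
import sys
import time

# Hypothetical cleanup duration: longer than the 10 seconds observed,
# shorter than the configured BUILDKITE_CANCEL_GRACE_PERIOD of 600.
CLEANUP_SECONDS = 60


class GracefulJob:
    """Simulated job that does slow cleanup after receiving SIGTERM."""

    def __init__(self):
        self.term_received_at = None

    def handle_sigterm(self, signum, frame):
        # Record when the signal arrived; the main loop does the cleanup.
        self.term_received_at = time.monotonic()
        print("SIGTERM received, starting graceful shutdown", flush=True)

    def install(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def run(self, cleanup_seconds=CLEANUP_SECONDS, tick=1.0):
        print("job running, waiting for cancellation", flush=True)
        while self.term_received_at is None:
            time.sleep(tick)
        while True:
            elapsed = time.monotonic() - self.term_received_at
            if elapsed >= cleanup_seconds:
                break
            # If the job is force-killed, the last of these lines shows
            # how much grace time it actually got.
            print(f"cleanup in progress, {elapsed:.0f}s since SIGTERM", flush=True)
            time.sleep(tick)
        print("cleanup finished, exiting cleanly", flush=True)
        return 0


if __name__ == "__main__":
    job = GracefulJob()
    job.install()
    sys.exit(job.run())
```

With the grace period honored, the log should end with "cleanup finished, exiting cleanly"; in the behavior described above, it stops at around the 10-second mark instead.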

From reading the pull requests, I thought the new signal-grace-period setting was intended to apply a wait period in addition to the normal cancel-grace-period. If that’s the case, then I think this is a bug.

On the other hand, if the default 9-second signal-grace-period is meant to override the cancel-grace-period, then I would have appreciated some documentation making it clear that this is a breaking change!

Hello @etoddstrongdm,

Ivanna here, stepping in for Athreya due to the timezones. Thank you for providing more details on this issue. We have tested using Buildkite agent versions v3.66.0 and v3.71.0, and confirmed that neither version respects BUILDKITE_CANCEL_GRACE_PERIOD. I will create an escalation for our internal pipeline team. Your initiative in creating a GitHub issue will certainly help in tracking and resolving this matter effectively. I will include the reference to the escalation ticket, and we will keep you updated!

Thank you once again for your contribution to the Buildkite community!

Hello Evan,

Thank you so much for raising this. My name is Sarah and I am an engineer on the Pipelines team. I am posting here to acknowledge that our escalation team has received this issue. We will post again here when we begin our investigation or have other updates :slight_smile:

We really appreciate you providing the extra details on the issue. We hope to get back to you soon.

Cheers,
Sarah

Hi Evan,

Thanks for raising this. It should be fixed in the latest version of the agent (v3.73.1).
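If you manage a fleet of agents and want to flag the ones that may still have the problem, a rough sketch like the one below could help. The version boundaries are only what this thread reports (3.65 worked, v3.66.0 and v3.71.0 were confirmed broken, v3.73.1 should contain the fix); the function names and status labels are hypothetical, and releases between v3.71.0 and v3.73.1 are assumed affected rather than confirmed.

```python
import re

# Boundaries taken from this thread, not from release notes.
FIRST_REPORTED_BROKEN = (3, 66, 0)
FIXED_IN = (3, 73, 1)


def parse_version(text):
    """Extract the first dotted version number from a string as a tuple."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def grace_period_status(text):
    """Classify an agent version against the range discussed in this thread."""
    version = parse_version(text)
    if version >= FIXED_IN:
        return "fixed"
    if version >= FIRST_REPORTED_BROKEN:
        return "affected"  # assumed for versions not explicitly tested
    return "unaffected"


if __name__ == "__main__":
    for candidate in ("3.65", "v3.66.0", "v3.71.0", "v3.73.1"):
        print(candidate, grace_period_status(candidate))
```

Tuples compare element by element in Python, so `(3, 73, 1) >= (3, 73, 1)` holds and `(3, 71, 0)` falls in the assumed-affected range.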