Post-command fails without much information

I’ve noticed that some of the newer agents are failing post-commands when a job is cancelled because of a step timeout. I always turn on set -x in the post-command script, and in most builds it prints every line when the post-command runs. But whenever a pipeline gets cancelled because of a step timeout, I only see what is shown in the screenshot, which suggests to me that the post-command step never gets run. Is that what is happening?

My post-command script looks like this:

#!/bin/bash

export PS4='= '
set -exo pipefail
...

I’ve noticed this issue with agent versions v3.66.0 and v3.74.1.

Hey @rryan :wave:

Welcome to the community! :slight_smile:

The way timeouts work on the agent is that when a job is cancelled due to a timeout, the agent cancels any hook it is currently executing and moves on to fire the pre-exit hook (you can see this logic here).

This means that the expected behaviour here is that the post-command hook won’t run if it hasn’t run yet and the job is timing out.

If you have some logic that needs to be executed even on a timeout, the suggestion would be to have that be run from your pre-exit hook - since this will be run even on timeout.
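As a rough sketch of what that could look like, here is a minimal pre-exit hook. The function name and the message are only illustrative, and where the file lives depends on how hooks are set up for your agent. It reads BUILDKITE_COMMAND_EXIT_STATUS, which the agent sets after the command phase, and falls back to "unknown" in case it is missing:

```shell
#!/bin/bash
# Sketch of a pre-exit hook. run_cleanup and its message are illustrative names,
# not part of the agent.

run_cleanup() {
  # BUILDKITE_COMMAND_EXIT_STATUS is set by the agent after the command phase;
  # default to "unknown" if it is not present.
  local status="${BUILDKITE_COMMAND_EXIT_STATUS:-unknown}"
  echo "pre-exit: command exit status was ${status}"
  # Put the work that used to live in post-command here.
}

run_cleanup
```

Leaving out set -e here is deliberate: a failing cleanup step then doesn't stop the rest of the hook from running.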

Something else to be aware of is that the agent will try to run the pre-exit hook within the cancel grace period, which is configurable on the agent and defaults to 10 seconds.
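If the cleanup can be slow, one option is to cap it yourself so the hook finishes inside that window rather than being killed partway through. This is only a sketch: it assumes GNU coreutils timeout is available on the agent host, and CLEANUP_BUDGET_SECONDS and bounded_cleanup are made-up names, not agent settings.

```shell
#!/bin/bash
# Sketch: cap slow cleanup so the hook finishes before the grace period runs out.
# CLEANUP_BUDGET_SECONDS and bounded_cleanup are hypothetical names.

# Keep the budget below the agent's cancel grace period (10 seconds by default).
CLEANUP_BUDGET_SECONDS="${CLEANUP_BUDGET_SECONDS:-8}"

bounded_cleanup() {
  # timeout (GNU coreutils) exits with status 124 when the command runs out of time.
  if timeout "${CLEANUP_BUDGET_SECONDS}" "$@"; then
    echo "cleanup finished"
  else
    echo "cleanup failed or timed out (exit $?)"
  fi
}
```

You would then call it from the pre-exit hook with the actual cleanup command as arguments, for example bounded_cleanup ./my-cleanup.sh, where my-cleanup.sh stands in for your own script.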

Out of curiosity, were you seeing this issue on older buildkite agent versions, or is this a new behaviour?

Hi @jeremy, thanks for the suggestion! I will try out pre-exit. I think this is newer behavior: as far as I remember, older agents were able to execute post-command even when a step timed out, but I’m not 100% certain. Either way, it sounds like pre-exit is the way to go.

@rryan Happy to help!

Yeah - pre-exit should help with what you’re wanting to achieve there. It’s possible the post-command hook was executing during a timeout event, for example:

If the command hook had completed and the post-command hook was already running, that hook may have finished before things moved on to the job teardown. Because it all depends on timing and execution order, it’s difficult to say without seeing the individual job logs.

That said, the pre-exit hook will be more reliable in a timeout scenario as the job will always attempt to run this script during a timeout!

Hi @jeremy, I’ve switched to pre-exit and it does get run even when a job/step times out. I did run into the grace period being exceeded, though, and it seems that my attempt to change the grace period using the env var had no effect. I am setting the grace period in the pipeline yaml:

env:
  BUILDKITE_CANCEL_GRACE_PERIOD: 60

But when a job gets cancelled, the pre-exit hook still does not get to finish. Judging from the signal (SIGKILL), the grace period was not extended and the pre-exit hook was killed at around the default 10s limit.

I also double-checked the env vars, and BUILDKITE_CANCEL_GRACE_PERIOD does exist in the environment.

Do you know what I might be doing wrong?

Hey @rryan :wave:

The value needs to be set when the agent starts up, rather than during job runtime, otherwise the default value will be used. You can do this by either ensuring that BUILDKITE_CANCEL_GRACE_PERIOD is exported before starting the agent on the host, or by setting the value of cancel-grace-period in your buildkite-agent.cfg file.
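For illustration, using the same 60-second value from your pipeline yaml, either of these should work. How the agent is actually launched on your hosts (systemd unit, container, init script) is an assumption here, so adapt this to your setup:

```shell
# Option 1: export the variable in the environment that launches the agent.
export BUILDKITE_CANCEL_GRACE_PERIOD=60
buildkite-agent start

# Option 2: set it in buildkite-agent.cfg instead, by adding this line to the file:
#   cancel-grace-period=60
```

Either way, the agent needs to be restarted for the new value to take effect, since it is only read at startup.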
