Agent 3.67.0 has incorrect pipeline headings

Hi.

We upgraded from Elasticstack version v6.17.0 to v6.18.0.
This in turn upgraded the agents from v3.66.0 to v3.67.0

Now our Buildkite pipelines have bogus headings taken from completely different branches from the one that was merged into the master/main branches.

I don’t know how to diagnose this…
For example this build:
https://buildkite.com/healthengineau/megatron/builds/187326
has the title: NOOT-640 Is this finally the end of scope creep?
But the branch/PR has the title: PK-1071 Update user data request list page

I’m just guessing, but looking at the changes for 3.67.0 it seems like this patch might be the likely cause?

Hey Jim,

Welcome to the Buildkite community!

Based on the information you’ve provided and your current pipeline step setup using the skip-checkout plugin, it seems the root of the issue lies in how metadata, specifically the buildkite:git:commit key, is being handled.

As you’ve noted, version 3.67.0 of the Buildkite agent introduces functionality where the agent sets a special metadata key buildkite:git:commit during the checkout process. This key includes vital information about the commit such as the commit message, author, and hash (details found in the respective pull request). The use of the skip-checkout plugin inherently skips the standard checkout process and as I can see from the build you have shared, this value is being set as ( buildkite-agent meta-data set buildkite:git:commit < /dev/stdin) , and the value is being read from STDIN.

This likely seems to be the cause for your issue. I do understand that you would like to perform a minimal and focused checkout of only necessary files , and then use these files to update or configure the pipeline dynamically. This is a quite interesting approach as the use of a temporary directory ensures that these operations are performed in isolation and do not interfere with the rest of the workspace or persist beyond the lifecycle of the build. However I would suggest to temporarily disable the skip-checkout plugin in a test environment to confirm the behaviour. This will help establish whether the plugin’s omission of the checkout process is the definitive cause of the issue.

Thanks,
Athreya.

I can confirm that not skipping checkout puts the correct heading as per this build
https://buildkite.com/healthengineau/megatron/builds/187407

We are skipping the default checkout and using a sparse-checkout of just the .buildkite/ directory and its contents because the repository is > 400 MB in size but for the pipeline upload step it only needs .buildkite/pipeline.yml that is only a few kilobytes to perform that step.

Hey Jim,

Thanks for sharing the details.

With your previous build using the skip-checkout, the agent is using a previously cloned/checked out version of the repository. This will be the commit with message “NOOT…”. So the agent won’t do any standard checkout steps like updating the repo to BUILDKITE_COMMIT so it is left as is. The agent then looks into the current HEAD and sets that into the special “buildkite:git:commit” meta-data - which our server uses to set the build message. This is why the message is wrong.

So, to fix this, you’ll need to implement your own agent level checkout hook which will run before the agent sets the git commit. You are currently doing the sparse-checkout as part of your command but it is too late at that point. You will need to run sparse-checkout logic as part of your agent checkout hook and add buildkite-agent meta-data set ‘buildkite:git:commit’ “$$BUILDKITE_COMMIT” to the hook.

Also, note, setting custom logic at agent hook level will impact all pipelines so make sure to use condition for your changes.

Hope this helps, let us know if any questions.

Cheers,
Priya

The thing is, this was all working properly before, and was broken with the change above…

Hey @Jim its been a while! :wave:

You’ve had us scratching our heads at this one for a while hence it taking while to respond. But I think we see the full problem now.

The agent PR you linked originally looks to be the culprit here. But at the same time, I think the PR has introduced better logic that has actually highlighted a problem with the design of your sparse-checkout anyway - that it could lead to rebuilds targeting different commits over time. Its not likely to cause an issue when builds are triggered from GitHub webhooks (I think), but moreso for manual builds.

Regardless, I think the way forward here is to not use the “skip checkout” plugin and to instead create a “sparse-checkout” plugin.

I’ve got the team looking if we can help you with a plugin for this as I think it makes sense for us to provide a spare-checkout for use as a step upload, but I’ve also found this one that exists already: GitHub - pragmaplatform/sparse-checkout-buildkite-plugin.
Can you give that a go with the .buildkite directory and let us know if it works?

If not, we will look to create our own one using our plugin template GitHub - buildkite-plugins/template-buildkite-plugin: 📝 Template repository for Buildkite Plugins

Hey Jarryd. Yeah it’s probably time for beers again :slight_smile:

Okay I’ll investigate the plugin in the near future. Early next week at the latest.

Oh also regarding this:

that it could lead to rebuilds targeting different commits over time

Would that be the case because we’ve used mktemp -d and move into that directory before performing the checkout? Seems like it should never clash.

I guess I should point out that because we’ve had out own checkout for the first step that prepares the pipeline upload, the heading the build gets seems to be from the 2nd step were we don’t do that.
And as athreya pointed out above it’s using buildkite-agent meta-data set buildkite:git:commit < /dev/stdin at that point.
The second step in the build I linked in the first post has the following output which is where it is picking up the warning about the branch being left behind and using that for the heading.

$ git checkout -f 5d12580388cbcbe8370f66f5cbd25b3833f1cee2
Warning: you are leaving 1 commit behind, not connected to
any of your branches:
  e78bcc55cb0 NOOT-640 Is this finally the end of scope creep?
If you want to keep it by creating a new branch, this may be a good time
to do so with:
 git branch <new-branch-name> e78bcc55cb0
HEAD is now at 5d12580388c PK-1071 Update user data request list page (#13078)

When instead it should be picking up the HEAD is now at 5d12580388c PK-1071 Update user data request list page (#13078) message instead.

I don’t think this would apply to your use-case as your builds are coming from GitHub where the specific commit SHA to checkout is known. So if you hit the rebuild button from such a build, it would still use that commit SHA.
The issue is more around manual builds that target HEAD. As HEAD moves from commit to commit, if you previously rebuilt a build targeting HEAD, it would end up running on the latest HEAD from a pull. Whereas with this change to the agent which sends the actual commit metadata to BK, we’ll know for a rebuild of a HEAD to target the same commit SHA


I’ll share how this whole process works in the default flow because I think things are getting mixed up.

  1. A build starts and the very first step to run will pull the repo to the given commit then extract the commit metadata
  2. The agent then checks if the buildkite:git:commit meta-data key is set on the server
    • if it is, it doesn’t overwrite it
    • if its not, it will send it up
  3. The server parses that special meta-data and will set the build message to the commit message from that meta-data key

This happens immediately after the checkout hook. So the state of the repository needs to be correct at this point. The issue with your current approach is that the checkout hook is being skipped and the state of the repo is being set by the command hook.

So when an agent goes to run this upload step (with a sparse checkout) after previously running a “standard” step, it doesn’t update the repository during the checkout hook because its skipped. So the repository state is still pointing at the previous step the agent run (the one with the NOOT… message). And so now, when the agent uploads the meta-data, it reads that from the stale repo state.

Leads us to how to fix this: the state of the repository needs to be set from the checkout hook as this is the last point before the agent will set meta-data/build message. So basically, you will need to write a plugin with the custom sparse-checkout logic you wish to achieve.

The good news is that I’ve made a start on that for you and finished the boilerplate. Its available at GitHub - buildkite-plugins/sparse-checkout-buildkite-plugin: Create a sparse checkout of the repository. It roughly works but might need some changes - I don’t know exactly what your sparse-checkout.sh is doing. And I’d appreciate any contributions you can make to improve, especially for your use-case.

Give it a try and let us know how it goes. And whether this explanation of the logic makes sense

Okay but isn’t the bug introduced by your change is that it is picking up the wrong message from this output?

The logic in our script for the sparse checkout follows

# Set some defaults, some may come from a Buildkite pipeline.
BRANCH="${BUILDKITE_BRANCH:-master}"
DIR="."
REPO="$BUILDKITE_REPO"

These defaults can be overridden via command line options…
The PATHS array is built up from command line parameters.
Then there’s stuff running the ssh-keyscan

Then finally the sparse checkout itself.

if [[ ! -d "$DIR/.git" ]]
then
	echo "- Performing empty clone of the $BRANCH branch of $REPO"
	# If the Buildkite git-mirrors feature is being used, clone from that.
	git clone \
		--branch="$BRANCH" \
		--depth 1 \
		--filter=blob:none \
		--no-checkout \
		${BUILDKITE_REPO_MIRROR:+--reference "$BUILDKITE_REPO_MIRROR"} \
		"$REPO" \
		"$DIR"
elif ! git -C "$DIR" config --get remote.origin.url | grep -Eiq "^$REPO\$"
then
	echo >&2 "Error: A different repository is already checked out to $DIR - Exiting..."
	exit 1
else
	echo "- Repo already checked out. Fetching updates"
	git -C "$DIR" fetch --depth 1 origin "$BRANCH"
fi

echo "- Configuring git sparse checkout"
git -C "$DIR" sparse-checkout set --no-cone "${PATHS[@]}"

echo "- Checking out selected files in the repository"
git -C "$DIR" checkout

The --filter=blob:none is needed otherwise the sparse checkout still ends up pulling the repo over the network even though it checks no files out.

I don’t believe so, but do you have a link to that job where you got this log from?
This is the code that runs on the agent to pull out the commit info: agent/internal/job/checkout.go at 81a3b1402613c527c1671e7ad6c15b2b96f03c2c · buildkite/agent · GitHub

Otherwise, with this example https://buildkite.com/healthengineau/megatron/builds/187407#018ee559-bc72-4374-814f-9e445bffe136/100-131. You can see a similar log output to above, but on subsequent lines (after the repository has been checked out to the correct commit) it sets the correct message.

I think the presence of the < /dev/stdin you see in the logs might be throwing off here. But its stdin from the git log process run by the agent.


Ah yes, the current plugin hook doesn’t do a keyscan so it might hang there waiting for human input to accept the keys. That will need updating.
It also doesn’t handle mirror repositories. I think the best thing would be for you to fork and apply these changes to the hook so that it works for you, then we can work on how that would best fit into a community plugin

I don’t believe so, but do you have a link to that job where you got this log from?

It’s in the build I first linked.
https://buildkite.com/healthengineau/megatron/builds/187326

In the 2nd step (the one after the pipeline upload one) that is labelled: Upload mjml html to artifacts and is in the Preparing working directory section of that.

Oh thanks. In that case you can see this line https://buildkite.com/healthengineau/megatron/builds/187326#018ee09e-a653-4662-aa32-f3bcdb4ccbbb/113-147 that it hit point 2 in the flow where it checked for the meta-data but didn’t overwrite it because the first step in the pipeline already set it (and set it incorrectly)

The sparse-checkout of your current plugin downloads a lot more than is needed.
It ends up basically pulling all the files of the repository down still.

I forked it and tried my own version which really does just check out the files needed.

That works fine, BUT because it isn’t in its own separate directory below /var/lib/buildkite-agent/builds/$BUILDKITE_AGENT_NAME/$BUILDKITE_ORGANIZATION_SLUG/$BUILDKITE_PIPELINE_SLUG the sparse checkout changes it makes affects future git clones…

So when the next step comes along that performs the default git clone steps, it still only pulls the files configured by the git sparse-checkout command.

If I add a git sparse-checkout disable at the end of the plugin, then that also defeats the purpose because at that point all files are downloaded.

At the moment the hooks/checkout command in my branch looks as follows

#!/bin/bash

set -euo pipefail

DIR="$(cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd)"

# shellcheck source=lib/plugin.bash
. "$DIR/../lib/plugin.bash"

if plugin_read_list_into_result PATHS; then
    CHECKOUT_PATHS=("${result[@]}")
else
    echo "Missing 'paths' option in the plugin"
    exit 1
fi

echo "Scanning SSH keys for remote git repository"
[[ -d ~/.ssh ]] || mkdir -p ~/.ssh
ssh-keyscan "${BUILDKITE_REPO_SSH_HOST}" >> ~/.ssh/known_hosts

echo "Creating sparse-checkout with paths: ${CHECKOUT_PATHS[*]}"

# clone the repo without checking out files if it does not exist already
if [[ ! -d .git ]]; then
    git clone \
        --depth 1 \
        --filter=blob:none \
        --no-checkout \
        ${BUILDKITE_REPO_MIRROR:+--reference "$BUILDKITE_REPO_MIRROR"} \
        -v \
        "${BUILDKITE_REPO}" .
fi

git clean -ffxdq
# fetch branch if commit is HEAD
if [[ ${BUILDKITE_COMMIT} = "HEAD" ]]; then
    git fetch --depth 1 origin "${BUILDKITE_BRANCH}"
else
    git fetch --depth 1 origin "${BUILDKITE_COMMIT}"
fi

git sparse-checkout set --no-cone "${CHECKOUT_PATHS[@]}"
git checkout

I could potentially add steps to use mktemp and then have the sparse checkout work with the content in there and it would probably solve the issue…
I’m not sure then if this plugin will be something you want or not?

Oh but anyway it still has the wrong heading…
https://buildkite.com/healthengineau/sre/builds/262

It has a heading of Merge pull request #79 from HealthengineAU/address-warnings
But my branch’s log looks like so:

commit 7a221381da8e2b54fb395fc836806a2f5f6df6e0 (HEAD -> hack, origin/hack)
Author: Jim Barber <jim.barber@healthengine.com.au>
Date:   Mon Apr 22 13:15:12 2024 +0800

    Change to test a pipeline

So it should say Change to test a pipeline instead…
So this sparse-checkout hasn’t fixed the issue anyway.

Hey Jim,

Thanks for forking the repo and adding additional changes. It looks good.

Could you create a PR for your changes? so we will review and continue our conversation in the Git for this ticket.

Also in terms of headings not changing, I wonder if BUILDKITE_BUILD_CHECKOUT_PATH env variable is not aligning with your sparse checkout changes.

We will take a look and shall get back to you. Look forward for your PR.
Thanks,
Priya.

Raise a PR even though it is broken?
As it is now, future checkouts don’t work because they still use the sparse settings and so they don’t checkout everything that they need to.

Yeah I think create a PR from the current state of your fork. It will be much easier to work together on this to create a working solution if we do it from GitHub

I think I have a solution for the subsequent checkouts problem but will leave it to collaborate on the PR

Hey Jim,

As you collaborate with Jarryd on this matter, I’ve also raised this with our pipelines team parallelly to see if they can advice us of some alternatives.

Cheers!
Athreya.

Hey Jim,

Jordan an engineer from the Pipelines team here. We’ll catch up on this thread and then see if we can figure something out to resolve this. In the meantime continue to work with @jarryd as it sounds like he has a way forward.