Improving visibility of build/step failures?

carl · June 4, 2020, 10:13am

We have some pipelines where:

Other pipelines are triggered
There are many jobs
There are jobs which have a large amount of output

In these cases, it can be time-consuming to dig into a failed step to figure out why a build failed.

e.g. Failed Build → Failed Triggered Build → Jump to failed step → Expand failed step → Explore log to find the failure

Some approaches we have taken:

Use the echo "^^^ +++" feature to expand the current log span. This doesn’t scale, as we have thousands of different jobs to add it to.
Write a Buildkite plugin which overrides the command step to pipe stdout/stderr to a file and annotates the build with it if the exit code != 0. This is awkward to test and introduces bugs that the ‘real’ Buildkite command step has already fixed. It also doesn’t handle all the potential errors that can happen in a step (e.g. pre-exit hook failure).
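For readers unfamiliar with the second approach, here is a minimal bash sketch of "capture the output and annotate on a non-zero exit code". It is not the plugin described above. The helper names (run_and_capture, build_failure_annotation, annotate_on_failure) are hypothetical, and the only Buildkite calls assumed are `buildkite-agent annotate` with `--style` and `--context`, reading the annotation body from stdin.

```shell
#!/usr/bin/env bash
# Sketch only: all function names here are hypothetical helpers.

# Run a command, mirroring combined stdout/stderr to a log file, and
# return the command's own exit status (not tee's).
run_and_capture() {
  local log_file="$1"; shift
  "$@" 2>&1 | tee "$log_file"
  return "${PIPESTATUS[0]}"
}

# Print a Markdown annotation body: a headline plus the last N log lines,
# indented by four spaces so Markdown renders them as a code block.
build_failure_annotation() {
  local label="$1" exit_code="$2" log_file="$3" lines="${4:-20}"
  printf 'Step "%s" failed with exit code %s\n\n' "$label" "$exit_code"
  tail -n "$lines" "$log_file" | sed 's/^/    /'
}

# Wrap a command: on failure, annotate the build (only when the
# buildkite-agent binary is available), then propagate the exit status.
annotate_on_failure() {
  local label="$1"; shift
  local log_file status=0
  log_file="$(mktemp)"
  run_and_capture "$log_file" "$@" || status=$?
  if [ "$status" -ne 0 ] && command -v buildkite-agent >/dev/null 2>&1; then
    build_failure_annotation "$label" "$status" "$log_file" \
      | buildkite-agent annotate --style error \
          --context "failure-${BUILDKITE_JOB_ID:-local}" || true
  fi
  rm -f "$log_file"
  return "$status"
}

# Example usage (hypothetical command):
#   annotate_on_failure "unit tests" make test
```

A wrapper like this still has to be called from each step, so it shares the scaling problem mentioned above, and it only sees failures of the wrapped command itself, not failures in hooks.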

We really like the idea of adding annotations to the build, so that the process of diagnosing failure is shortened to:
Failed Build → Failed Triggered Build → Expand failure annotation for the failed step(s)

I was wondering, what solutions might you suggest to improve the visibility of why a given pipeline/step failed?

anon57234190 · June 5, 2020, 1:41am

Great question. It sounds like using build annotations would suit your use case here. Have you given that a go?

carl · June 8, 2020, 11:26am

Yeah, as mentioned in the OP, we tried this but ran into the following caveats when writing a plugin which annotates the build with the last few lines of log output from failed commands:

Overriding the Buildkite command step was a bit awkward since we lose the built-in functionality for running commands. We aren’t sure how else to get the log output to annotate with (since only one command hook can run), but we can’t afford to lose the built-in command step behaviour.
Linking back to the logs from the annotation: working out which line of the overall build output you are at didn’t seem to be possible from the command step. Not a big deal, but it would be very neat to jump straight to the full error context. Imagine being able to annotate the build with a link to the line of log output in the step where it was decided that the build failed.

The key thing we are looking for here is a modular solution that makes step failures fast to investigate. It’s not viable for us to go into individual steps and modify them to add annotations on failures, as we have thousands of workloads across hundreds of pipelines.
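One possible shape for a modular approach is an agent-level hook that runs for every job without touching individual steps. The bash sketch below rests on several assumptions: that it is installed as a pre-exit hook on the agents, that the agent exposes the command's exit code as BUILDKITE_COMMAND_EXIT_STATUS at that point, and that `${BUILDKITE_BUILD_URL}#${BUILDKITE_JOB_ID}` links to the job within the build page. The function names are hypothetical. It does not solve the log-capture problem described above: it links to the failed job rather than quoting its output.

```shell
#!/usr/bin/env bash
# Hypothetical agent-level hook (e.g. installed as hooks/pre-exit).
# Sketch only: the env var semantics and URL format are assumptions.

# Print a one-line Markdown body that links to the failed job.
failure_annotation_body() {
  local label="$1" exit_code="$2" build_url="$3" job_id="$4"
  printf 'Step **%s** failed with exit code %s. [View job log](%s#%s)\n' \
    "$label" "$exit_code" "$build_url" "$job_id"
}

# Annotate the build when the job's command exited non-zero.
annotate_failed_job() {
  local status="${BUILDKITE_COMMAND_EXIT_STATUS:-0}"
  [ "$status" -ne 0 ] || return 0
  command -v buildkite-agent >/dev/null 2>&1 || return 0
  failure_annotation_body \
    "${BUILDKITE_LABEL:-unknown step}" "$status" \
    "${BUILDKITE_BUILD_URL:-}" "${BUILDKITE_JOB_ID:-}" \
    | buildkite-agent annotate --style error \
        --context "failed-job-${BUILDKITE_JOB_ID:-unknown}" || true
}

# Only act when running inside a Buildkite job.
if [ -n "${BUILDKITE_JOB_ID:-}" ]; then
  annotate_failed_job
fi
```

Using the job ID in the annotation context keeps one annotation per failed job instead of each failure overwriting the last.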
