We have some pipelines where:
- Other pipelines are triggered
- There are many jobs
- There are jobs which have a large amount of output
In these cases, it can be time-consuming to dig into a failed step to figure out why a build failed,
e.g. Failed Build -> Failed Triggered Build -> Jump to failed step -> Expand failed step -> Explore log to find the failure
Some approaches we have taken:
Using the echo "^^^ +++" feature to expand the current log group. This doesn’t scale, as we have thousands of different jobs to add it to.
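For context, this is roughly what that approach looks like in a single job. It is only a sketch: `run_in_group` is a hypothetical helper name, not something Buildkite provides, and each job's script would need to call it, which is exactly the scaling problem.

```shell
#!/bin/sh
# Hypothetical helper: run a command inside a collapsed log group and,
# if it fails, print "^^^ +++" so Buildkite expands that group.
run_in_group() {
  label="$1"
  shift
  echo "--- ${label}"      # start a collapsed log group
  rc=0
  "$@" || rc=$?            # run the command, remembering its exit code
  if [ "$rc" -ne 0 ]; then
    echo "^^^ +++"         # expand the current group on failure
  fi
  return "$rc"
}
```

A job would then call something like `run_in_group "Unit tests" make test` (the label and command here are placeholders).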
Writing a Buildkite plugin which overrides the command step to pipe stdout/stderr to a file and annotates the build with that output if the exit code != 0. This is awkward to test and reintroduces bugs that the ‘real’ Buildkite command step has already fixed. It also doesn’t handle all the potential errors that can happen in a step (e.g. a pre-exit hook failure).
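To make the idea concrete, here is a simplified sketch of the capture-and-annotate logic, not our actual plugin. `run_and_annotate` is a hypothetical name, the 50-line tail and the annotation context string are arbitrary choices, and it assumes `buildkite-agent` is on the PATH. It also shows one of the drawbacks: output is only printed after the command finishes rather than streamed live.

```shell
#!/bin/sh
# Hypothetical wrapper: capture a command's stdout/stderr to a file and,
# if the command fails, attach the tail of that output as an annotation.
run_and_annotate() {
  rc=0
  log="$(mktemp)"
  "$@" >"$log" 2>&1 || rc=$?   # capture combined output, remember exit code
  cat "$log"                   # replay the output into the job log
  if [ "$rc" -ne 0 ]; then
    {
      printf 'Step failed with exit code %s. Last lines of output:\n\n' "$rc"
      # Indent by four spaces so Markdown renders it as a code block.
      tail -n 50 "$log" | sed 's/^/    /'
    } | buildkite-agent annotate --style error \
          --context "failure-${BUILDKITE_JOB_ID:-unknown}"
  fi
  rm -f "$log"
  return "$rc"
}
```

Anything that fails outside the wrapped command, such as a hook, never reaches this code, so no annotation is produced for it.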
We really like the idea of adding annotations to the build, so that the process of diagnosing failure is shortened to:
Failed Build -> Failed Triggered Build -> Expand failure annotation for the failed step(s)
I was wondering, what solutions might you suggest to improve the visibility of why a given pipeline/step failed?