We have jobs which produce a large amount of log output.
In these cases, it can be time-consuming to dig into a failed step to figure out why a build failed,
e.g.: Failed Build → Failed Triggered Build → Jump to failed step → Expand failed step → Explore log to find the failure
Some approaches we have taken:
Use the echo "^^^ +++" feature to expand the current log group on failure - this doesn’t scale, as we have thousands of different jobs to add it to
Write a Buildkite plugin which overrides the command step to pipe stdout/stderr to a file and annotate the build with it if the exit code != 0.
This is awkward to test, reintroduces bugs that the ‘real’ Buildkite command step has already fixed, and doesn’t handle all the potential errors that can happen in a step (e.g. a pre-exit hook failure)
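To make the second approach concrete, here is a minimal sketch of the wrapper idea: run the step's command, capture its output, and on failure expand the log group and emit an annotation body containing the last few lines. The `buildkite-agent annotate` command and its `--style`/`--context` flags are real; the function name, the tail length, and the "step-failure" context are our assumptions for illustration.

```shell
# Hypothetical wrapper a plugin's command hook might use. Assumes the hook
# receives the step's command as $1 (real plugins would read BUILDKITE_COMMAND).
run_and_annotate() {
  local cmd="$1" tail_lines="${2:-20}"
  local log_file exit_code=0
  log_file="$(mktemp)"

  # Run the command, capturing combined stdout/stderr.
  bash -c "$cmd" >"$log_file" 2>&1 || exit_code=$?
  cat "$log_file"            # still stream the full log as usual

  if [[ "$exit_code" -ne 0 ]]; then
    echo "^^^ +++"           # expand the current log group in the Buildkite UI
    local body
    body="$(printf 'Command failed (exit %s):\n%s\n' \
      "$exit_code" "$(tail -n "$tail_lines" "$log_file")")"
    if command -v buildkite-agent >/dev/null 2>&1; then
      echo "$body" | buildkite-agent annotate --style error --context "step-failure"
    else
      echo "$body"           # outside an agent, just print what we would annotate
    fi
  fi

  rm -f "$log_file"
  return "$exit_code"
}
```

As the caveats below note, wiring this in without clobbering the built-in command hook is the hard part.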
We really like the idea of adding annotations to the build, so that the process of diagnosing failure is shortened to:
Failed Build → Failed Triggered Build → Expand the failure annotation for the failed step(s)
I was wondering: what solutions might you suggest to improve the visibility of why a given pipeline/step failed?
Yeah, as mentioned in the OP, we tried this, but ran into the following caveats when writing a plugin that annotates the last few lines of log output from failed commands:
Overriding the Buildkite command step was a bit awkward, since we lose the built-in functionality for running commands. We aren’t sure how else to get the log output to annotate with (since only one command hook can run), but we can’t afford to lose the built-in command step behaviour.
Linking back to the logs from the annotation: working out which line of the build’s overall log output you are at didn’t seem to be possible from the command step. Not a big deal, but it would be very neat to jump straight to the full error context - imagine being able to annotate the build with a link to the exact line of log output in the step where the build was decided to have failed.
The key thing we are looking for is a modular solution that makes step failures fast to investigate - it’s not viable for us to modify individual steps to add annotations on failure, as we have thousands of workloads across hundreds of pipelines.
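One modular direction that avoids touching individual steps is an agent-level pre-exit hook, since it runs after every job on agents configured with that hooks directory. A hedged sketch, assuming the documented agent environment variables (`BUILDKITE_COMMAND_EXIT_STATUS`, `BUILDKITE_LABEL`, `BUILDKITE_BUILD_URL`, `BUILDKITE_JOB_ID`); `STEP_LOG_FILE` (a file teed by a companion hook) and the `#job-id` URL anchor are our assumptions:

```shell
# Hypothetical agent-wide pre-exit hook: annotate the build whenever the
# job's command exited non-zero. No per-pipeline changes required.
pre_exit_annotate() {
  local exit_status="${BUILDKITE_COMMAND_EXIT_STATUS:-0}"
  if [[ "$exit_status" -eq 0 ]]; then return 0; fi

  local body
  body="$(printf ':warning: **%s** exited with status %s ([jump to job](%s#%s))' \
    "${BUILDKITE_LABEL:-step}" "$exit_status" \
    "${BUILDKITE_BUILD_URL:-}" "${BUILDKITE_JOB_ID:-}")"

  # If a companion command hook teed the step's output somewhere, include its tail.
  if [[ -n "${STEP_LOG_FILE:-}" && -s "${STEP_LOG_FILE:-}" ]]; then
    body+=$'\n'"$(tail -n 20 "$STEP_LOG_FILE")"
  fi

  if command -v buildkite-agent >/dev/null 2>&1; then
    echo "$body" | buildkite-agent annotate --style error \
      --context "failure-${BUILDKITE_JOB_ID:-unknown}" --append
  else
    echo "$body"   # outside an agent, just print what we would annotate
  fi
}

# When installed as the agent's pre-exit hook, run it for every job:
if [[ -n "${BUILDKITE_JOB_ID:-}" ]]; then
  pre_exit_annotate
fi
```

This doesn't solve the "link to the exact failing log line" wish, but it does centralise failure annotations in one place per agent fleet rather than per step.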