Add a `bail_early` option so that a failing step fails all sibling steps that are still running

Currently, if a build has multiple parallel steps and one of them fails in a way that will fail the build, there’s no way for that step to kill the whole build. Potentially long-running sibling steps keep running and hold on to resources unnecessarily, even though the build will fail once the last step is done.

Ideally, I would be able to set up my pipeline so that any (or all) steps, based on a configuration option, would notify the Buildkite step manager that the step failed and will cause the whole build to fail, and the step manager would then kill all sibling steps.

I know that there’s an option to "wait" between critical steps as an alternative method, but then I have to fine-tune my "wait" sequences to get this behavior. For example, a step in our first stage before a "wait" can fail in as little as 3 minutes, but the longest-running step in that stage (a build used by steps in the next stage) can take anywhere between 10-15 minutes. So the feedback loop that would ideally be 3 minutes when the shortest step fails won’t signal the user until the longest step (10-15 min) finishes.

If there were a `bail_early` option that signaled the job manager to stop all sibling steps as “neutral” (or failed), that would help us a lot, both by giving faster feedback and by reducing resource contention, since unnecessarily acquired nodes would be released.
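To make the requested behavior concrete, here is a minimal Python sketch that simulates it with `asyncio`. It is not Buildkite code and does not use any Buildkite API: `run_step`, `run_siblings`, and the `bail_early` flag are hypothetical names used only to illustrate the semantics, and the step durations are made-up stand-ins for the 3-minute and 10-15-minute steps described above.

```python
import asyncio


async def run_step(name, seconds, succeeds):
    """Simulate one build step: wait `seconds`, then pass or fail."""
    await asyncio.sleep(seconds)
    if not succeeds:
        raise RuntimeError(f"step {name!r} failed")
    return name


async def run_siblings(steps, bail_early=True):
    """Run sibling steps concurrently (hypothetical model of a step manager).

    `steps` is a list of (name, seconds, succeeds) tuples.
    With bail_early=True, the first failure cancels every sibling that is
    still running instead of waiting for the slowest step to finish.
    Returns a dict mapping step name -> "passed" | "failed" | "canceled".
    """
    tasks = {asyncio.create_task(run_step(n, s, ok)): n for n, s, ok in steps}
    results = {}
    pending = set(tasks)
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED
        )
        failed = False
        for task in done:
            if task.exception() is not None:
                results[tasks[task]] = "failed"
                failed = True
            else:
                results[tasks[task]] = "passed"
        if failed and bail_early:
            # Cancel the siblings that are still running and wait for the
            # cancellations to take effect before reporting.
            for task in pending:
                task.cancel()
            await asyncio.gather(*pending, return_exceptions=True)
            for task in pending:
                if task.cancelled():
                    results[tasks[task]] = "canceled"
                elif task.exception() is not None:
                    results[tasks[task]] = "failed"
                else:
                    results[tasks[task]] = "passed"
            break
    return results


if __name__ == "__main__":
    # Scaled-down timings: "lint" fails quickly, "build" is the slow step.
    demo = [("lint", 0.03, False), ("test", 0.5, True), ("build", 1.5, True)]
    print(asyncio.run(run_siblings(demo, bail_early=True)))
```

With `bail_early=False` the same call waits for the slowest step before the failure is reported, which is the situation this request is trying to avoid.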

I really dig this idea @chaseadamsio! Let me brew on it and we’ll see what we can do.

I’d also like to see a ‘bail early’ feature…but one that allows setting the build result to fail or pass. Granted, bailing with a pass makes less sense in the context of canceling sibling steps, but it would be very handy for a singleton step to be able to terminate early but still report success. Kind of like running make when the targets are up to date.

I was working on an automated VERSION file update step, which works great…except that when BK does the git push, that triggers another build…which updates the VERSION file again, lather, rinse, repeat. One way to get around that would be to have the first job push the update to the version file and then bail out with success on the rest of the steps, and let the second job run to completion.
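A rough sketch of the decision that version step would have to make, in Python. Everything here is an assumption for illustration: `plan_version_step` is a hypothetical helper, and detecting the bump commit by checking that it only touched `VERSION` is just one possible way to tell the first build from the second. The actual "bail out with success" is the part that would need the requested feature.

```python
def plan_version_step(changed_files, version_file="VERSION"):
    """Decide what the version-update step should do for this build.

    `changed_files` lists the paths touched by the commit that triggered
    the build (for example, the output lines of
    `git diff --name-only HEAD~1 HEAD`).

    Returns "continue" when the triggering commit only touched the version
    file (this is the second build, so run the remaining steps), and
    "bump_and_bail" otherwise (update the file, push, then end this build
    early with success).
    """
    files = {path.strip() for path in changed_files if path.strip()}
    if files == {version_file}:
        return "continue"
    return "bump_and_bail"
```

The first build would get `"bump_and_bail"`, and the build triggered by the push would get `"continue"`, which breaks the loop.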

:+1: if there are any other insights I can give (or if you need to kick around the thought with a second set of ears), please let me know.

This feature would be really valuable for my company as well. We generate our Buildkite config dynamically at runtime, based on the Nix dependency graph, so we have a lot of jobs running in parallel. I think auto-cancelling jobs using something like bail_early could save a lot of wasted CI time.