Cancel_on_build_failing with fail counter

It would be nice if cancel_on_build_failing could be configured to cancel only after a number of failures. A threshold of 1 is a bit too strict, given that a single flaky job can halt the rest of the build. Allowing 3 or so failures before cancelling would give us more confidence that the build is cancelled because of a broken change rather than a few flakes.

Hi @jl-applied,

Have you tried adding a retry attribute to the job, so that a single flaky failure is retried before the rest of the jobs in the build are canceled?
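For reference, a minimal sketch of what an automatic retry on a step can look like. The label, command, and limit below are placeholders for illustration, not values taken from your pipeline:

```yaml
steps:
  - label: "medium-long test"      # placeholder label
    command: "./run_tests.sh"      # placeholder command
    retry:
      automatic:
        - exit_status: "*"         # retry on any non-zero exit status
          limit: 2                 # retry this job at most twice
```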

Please let us know if this works for you.

Cheers,
Lizette

Ya, I think this is fundamentally different from that. We already use automatic retries on the most flaky tests, we want to use automatic retries sparingly in general, and without something like Restart step in a different agent, there is nothing stopping a single rogue agent/job from cancelling a build.

I think there is much more signal from N steps failing at least once than from a single step failing N times.

Hi @jl-applied,

Would you be able to provide more details/context, or maybe a sample scenario, showing when you propose the cancel_on_build_failing attribute should not cancel the build?

Thanks!

Our pipeline is more or less,

  • short test, reliable
  • wait
  • medium-long, mostly reliable test
  • medium-long, mostly reliable test
  • medium-long, mostly reliable test
  • long, mostly unreliable
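
Written out as pipeline YAML, that is roughly the sketch below. The labels and script names are placeholders, and putting today's boolean cancel_on_build_failing on the long step is an assumption made for illustration:

```yaml
steps:
  - label: "short test"            # reliable
    command: "./short_test.sh"     # placeholder script names throughout
  - wait
  - label: "medium test 1"         # mostly reliable
    command: "./medium_test_1.sh"
  - label: "medium test 2"         # mostly reliable
    command: "./medium_test_2.sh"
  - label: "medium test 3"         # mostly reliable
    command: "./medium_test_3.sh"
  - label: "long test"             # mostly unreliable
    command: "./long_test.sh"
    # Current behavior: cancelled as soon as the build is marked as failing,
    # i.e. after a single failed job elsewhere in the build.
    cancel_on_build_failing: true
```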

A minimal implementation I think would look like,

  • short test, reliable
  • wait
  • medium-long, mostly reliable test
  • medium-long, mostly reliable test
  • medium-long, mostly reliable test
  • long, mostly unreliable
    cancel_on_build_failing:
      limit: 3

So basically all three medium-long jobs would have to fail before we cancel the long-running job. Effectively, no single medium job is enough signal to imply the build is broken and should be halted, but together we can assume the build is broken since they’re mostly reliable.

Going even further, something analogous to the retry’s exit_status limit would be very nice,

  • short test, reliable
  • wait
  • medium-long, mostly reliable test
  • medium-long, mostly reliable test
  • medium-long, mostly reliable test
  • long, mostly unreliable
    cancel_on_build_failing:
      • exit_status: *
        limit: 3
      • exit_status: 1, 255
        limit: 1

So the two cases in which the long job is cancelled are: 1. any medium job fails with one of the well-defined exit statuses, or 2. all medium jobs fail with any status (similar to above).

Hello, @jl-applied! Thank you for your suggestion! At the moment, it is not possible for a build to contain both failing and successful steps without being marked as “failed” in the end.

One thing that I can suggest is to check out the continue_on_failure attribute of the wait step. It can help you ensure that your build does make it to the final step, even if the other steps before it fail.
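
As a rough sketch, with placeholder commands, that would look something like this:

```yaml
steps:
  - command: "./tests.sh"          # placeholder: steps that may fail
  - wait: ~
    continue_on_failure: true      # run the steps after the wait even if earlier steps failed
  - command: "./final_step.sh"     # placeholder: the final step
```

Note that this only lets the later steps run; the build as a whole would still end up marked as failed if an earlier step fails.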

Cheers!

Gotchya.

And to be clear, I wouldn’t want this to affect the overall build status; only the cancel behavior of the surrounding jobs.