We have a very large generated pipeline, most of which is set up to fast fail using cancel_on_build_failing. If you attempt to manually retry the job which failed the build, it immediately gets cancelled due to the fast fail settings. Is it possible to bypass that so you can manually retry a job even when fast fail is on?
Ultimately, I want to do something like record a video of a failing test when someone manually retries that specific job. (That avoids the recording overhead on normal runs, but makes the video easy to get when it’s needed.) I still want fast fail enabled for normal runs.
Hi @noahtallen, there is a Retry failed jobs button (besides rebuild) that triggers the failed jobs for you. This should help to rerun those failed jobs without them being cancelled immediately.
Unfortunately, that’s where I’m experiencing the problem! Let’s say there are 10 jobs, and one of them fails. This causes all other jobs to get cancelled via fast failure.
When I then go to Retry failed jobs (or retry that individual job), that new retry is instantly cancelled. I’m guessing this is due to the cancel_on_build_failing flag, since that setting applies to every job in the build, and I guess the job settings don’t change when you retry it.
I’m wondering if there’s a workaround so that the job can normally be cancelled by fast failure, but still allow manual retries.
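For reference, the setup described above can be sketched roughly as follows. This is only an illustration of a generated pipeline where every step carries cancel_on_build_failing. The test group names, the step keys, and the ./run-e2e.sh script are hypothetical, and the real generator may look quite different.

```python
import json


def build_steps(test_groups, fast_fail=True):
    """Return one command step per test group, each opted into fast fail."""
    steps = []
    for group in test_groups:
        steps.append({
            "label": f"e2e {group}",
            "key": f"e2e-{group}",
            # Hypothetical script name, used only for illustration.
            "command": f"./run-e2e.sh {group}",
            # The same flag is attached to every generated step, so a
            # retried job carries it as well.
            "cancel_on_build_failing": fast_fail,
        })
    return steps


def render_pipeline(test_groups, fast_fail=True):
    """Serialize the generated steps as a JSON pipeline document."""
    return json.dumps({"steps": build_steps(test_groups, fast_fail)}, indent=2)


if __name__ == "__main__":
    print(render_pipeline(["login", "checkout", "search"]))
```

Because the flag is baked into each step at generation time, nothing in this sketch changes when a single job is retried later, which matches the behaviour I’m seeing.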
The cancel_on_build_failing flag just doesn’t let failed jobs be retried—it cancels everything immediately, even on retries. Unfortunately, it looks like there’s no real workaround for this. If you want manual retries to work, you’d probably have to disable fast-fail entirely.
Hey @noahtallen, we can certainly raise feedback for this issue. To understand it a little better, we would like to clarify a few things, as this helps us dissect the issue and how it currently affects you:
Can you describe the problem you’re trying to solve?
How is this issue impacting your workflow or team’s productivity?
How are you currently working around this issue? Is there anything that partially meets your needs?
Hi @noahtallen, just an update on this. We have raised feedback with the product team about this issue and informed them that the attribute can be limiting when you only want to retry one job from the build.
Thank you! Fast fail and job retry are both useful features. Fast failure helps avoid running extra workflows when they aren’t strictly needed, and job retry lets you see if a specific job might pass on a re-run due to a flaky test. Currently, you can’t get both of these benefits together.
The specific workflow I’ve been looking at is a little different: I want to allow devs to retry an individual job to get some extra functionality. For example, let’s say I have a massive CI pipeline with multiple jobs which each run a few e2e tests. When a test fails, it could be useful to auto-cancel all the other jobs to help decrease CI resource utilization.
At the same time, I want a developer to be able to click “retry job,” which then enables video recording for the e2e test framework. (Video recording in, e.g., Cypress has performance overhead which isn’t needed almost all of the time… unless a test fails and a developer needs to investigate it!) This flow (manually retry a job → enable video recording in that specific job) is entirely possible to set up, but the new job gets cancelled instantly if fast failure is enabled.
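A minimal sketch of the “retry → enable video” half of this, assuming the CI agent exposes a per-job retry counter as an environment variable. I’m using the name BUILDKITE_RETRY_COUNT here, which is an assumption about your setup, and as far as I know such a counter does not distinguish manual retries from automatic ones. The helper names are hypothetical.

```python
import os


def is_retry(env=None):
    """Return True if the retry counter says this job has been retried."""
    env = os.environ if env is None else env
    try:
        # Assumed variable name; a missing value is treated as "not a retry".
        return int(env.get("BUILDKITE_RETRY_COUNT", "0")) > 0
    except ValueError:
        # A non-numeric value is treated as "not a retry".
        return False


def cypress_command(env=None):
    """Build the Cypress invocation, recording video only on retried jobs."""
    video = "true" if is_retry(env) else "false"
    return ["npx", "cypress", "run", "--config", f"video={video}"]


if __name__ == "__main__":
    print(" ".join(cypress_command()))
```

This only covers deciding when to record. It does nothing about the retried job being cancelled by fast failure, which is the part I’m stuck on.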
Doing this via “retry job” instead of “retry build” is useful because the entire build has a large number of pipeline steps. Running the individual job (the smallest possible unit of that build) is much more efficient, since a new build isn’t needed to enable video recording.
IMO, fast failure shouldn’t impact manually retried jobs in CI – the main point of fast failure is to cancel automatically started jobs which don’t necessarily need to continue. But when a developer manually retries a job, that’s a strong signal they don’t want it to be auto-cancelled.