Rerun the failed step on the same agent

It would be helpful for debugging purposes if we could rerun the failed step on the same agent. This would help determine if we have flaky build steps.

Sometimes the cause of a failed step lies in the infra provisioning, and retrying the step just runs it in another container/machine, where it passes. This masks the issue, so it never gets addressed. The next time the same step runs on that agent, it will fail again and be dismissed again as a flaky test, without calling attention to the infra problem.

This is something super easy to implement and would be quite helpful for building more robust pipelines.

Thanks @zenogueira for your suggestion.

Often, retrying the job places it back on the same agent (that has the same targeting rules), even if there are other agents available. This is partly because we give preference to agents that have most recently run a job, because they’re more likely to have warm caches etc.
If a step failed, we consider that something may be off with that agent, and it’s possible that an agent with a warmer cache takes preference.
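
To make the "most recently used agent wins" preference concrete, here is a minimal toy model of it. It is only a sketch of the heuristic described above, not the real dispatcher: the `Agent` fields, the tag strings, and `pick_agent` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    # All fields are hypothetical; they are not taken from any real API.
    name: str
    tags: frozenset            # targeting tags the agent advertises
    last_job_finished: float   # timestamp of its most recent job, 0.0 if none


def pick_agent(agents, required_tags):
    """Return the matching agent that ran a job most recently, or None."""
    # Keep only agents that satisfy every targeting rule of the job.
    matching = [a for a in agents if required_tags <= a.tags]
    if not matching:
        return None
    # Prefer the agent that finished a job most recently (likely warm caches).
    return max(matching, key=lambda a: a.last_job_finished)


agents = [
    Agent("agent-a", frozenset({"queue=default"}), last_job_finished=100.0),
    Agent("agent-b", frozenset({"queue=default"}), last_job_finished=250.0),
    Agent("agent-c", frozenset({"queue=deploy"}), last_job_finished=300.0),
]

# A job that just ran (and failed) on agent-b is retried with the same rules.
chosen = pick_agent(agents, frozenset({"queue=default"}))
print(chosen.name)
```

In this model a retry lands on `agent-b` again, because it is the matching agent that ran a job most recently. The sketch deliberately ignores any adjustment made after a failure.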

We’re discussing how to allow more granular rules for which jobs get assigned to particular agents, and situations like this one are part of that discussion.
We haven’t scheduled the work yet, but we appreciate your feedback.