With large pipelines (100+ parallel steps), it’s common for us to have more than one failed step due to, e.g., network or test flakiness. Rather than restarting the entire build, it would be much faster (and more economical) to retry only the individual steps that failed, especially because the blocking steps in the pipeline have already passed.
A simple button (“retry failed steps”) on top of the build would solve this problem. We have been asked about this many times by our developers.
We have the same issue here. Our approach is to automatically retry these flaky tests three times and to forbid manual retries for them. If somebody trips over the same stone three times in a row, they should be the one to clear the road and remove that stone.
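
For anyone who wants to try that policy outside of a specific CI system, here is a rough Python sketch of the idea, not our actual setup. It assumes "three times" means three attempts in total, and the names `run_with_auto_retry`, `StepResult`, and `make_flaky` are made up for illustration; they do not come from any CI tool.

```python
from dataclasses import dataclass
from typing import Callable

MAX_ATTEMPTS = 3  # assumption: "three times" is read as three attempts in total


@dataclass
class StepResult:
    passed: bool
    attempts: int
    manual_retry_allowed: bool = False  # policy from the post: no manual retries


def run_with_auto_retry(step: Callable[[], bool],
                        max_attempts: int = MAX_ATTEMPTS) -> StepResult:
    """Run `step` until it returns True or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        if step():
            return StepResult(passed=True, attempts=attempt)
    # Every attempt failed: report the failure and keep manual retry disabled,
    # so someone has to fix the flaky test instead of retrying again.
    return StepResult(passed=False, attempts=max_attempts)


def make_flaky(failures_before_pass: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a flaky step: fails N times, then passes."""
    calls = {"n": 0}

    def step() -> bool:
        calls["n"] += 1
        return calls["n"] > failures_before_pass

    return step


if __name__ == "__main__":
    print(run_with_auto_retry(make_flaky(2)))  # passes on the third attempt
    print(run_with_auto_retry(make_flaky(5)))  # still failing after three attempts
```

A step that is still failing after the last attempt stays failed with manual retry disabled, which is what pushes whoever hit it to fix the test.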