Yeah, I think this is fundamentally different from that. We already use automatic retries on the flakiest tests, and we want to use automatic retries sparingly in general. Also, without something like "restart step on a different agent", there is nothing stopping a single rogue agent/job from cancelling a build.
I think there is much more signal from N steps failing at least once than from a single step failing N times.
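
To make the distinction concrete, here is a rough sketch of the two ways of counting. The function names, the step keys, and the threshold are all made up for illustration; this is not any CI tool's actual API, just the counting logic I have in mind.

```python
from collections import Counter


def distinct_failed_steps(failures):
    """Number of different steps that have failed at least once.

    `failures` is a hypothetical list of step keys, one entry per failed attempt.
    """
    return len(set(failures))


def max_failures_of_one_step(failures):
    """Largest number of failed attempts recorded for any single step."""
    counts = Counter(failures)
    return max(counts.values(), default=0)


def should_cancel(failures, threshold):
    """Cancel only when at least `threshold` different steps have failed."""
    return distinct_failed_steps(failures) >= threshold


# One flaky step (or one bad agent) failing three times in a row.
flaky = ["lint", "lint", "lint"]
# Three different steps each failing once.
broad = ["lint", "unit", "e2e"]

print(should_cancel(flaky, threshold=3))  # False
print(should_cancel(broad, threshold=3))  # True
```

Both lists contain three failures, but only the second one trips the rule, which is the behaviour I would want from a cancel-on-failure threshold.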