As a software developer checking the CI results of the branch I’m working on, without necessarily having access to more advanced/admin configuration, I would like to be able to restart a job for a step in a pipeline on a different agent than the one it just ran on.
This is because, sometimes, the failure is related to the current state of the machine in which the tests ran, so it’s specific to that machine’s agent. An example could be having a full disk, not enough memory, etc.
When I restart a job, it always restarts on the same agent it just ran on.
The other option I have is to restart the full build, for the full pipeline. But that would take a lot of time, as some of the steps take several minutes, and the specific step I currently care about could still end up on the same agent/machine that is having problems.
We have a similar need, but for different reasons. In our case, we have some jobs that deploy resources to machines in China.
The Great Firewall, however, sometimes decides that the dynamic IP address assigned to our cloud build agent isn’t allowed to talk to China. Thus, retries on this machine will never work, but running the same job from a different machine would.
Right now, the best answer we have is to shut down the cloud agent that’s blacklisted and start up a new instance that gets a new IP, and just hope that one isn’t also blacklisted. This usually works, but is obviously far from ideal.
For us, seeing a subsequent failure on a different agent would be a high-quality signal for distinguishing code-specific failures from host-specific failures. Local caches could blur that distinction, but we tend to clear all local caches for each build for reproducibility.
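To make that signal concrete, here is a rough sketch of the kind of check we have in mind. It is not tied to any Buildkite API; the function name `classify_failure` and the `(agent_name, passed)` attempt format are made up for illustration, and it assumes all attempts are for the same job and commit.

```python
from typing import Iterable, Tuple


def classify_failure(attempts: Iterable[Tuple[str, bool]]) -> str:
    """Classify a failing job from (agent_name, passed) attempts.

    Hypothetical helper: the attempt format is an assumption, not
    something Buildkite provides in this shape.
    """
    failed_agents = set()
    passed_agents = set()
    for agent, passed in attempts:
        (passed_agents if passed else failed_agents).add(agent)

    # Failed on one agent, passed on an agent that never failed:
    # the problem is most likely the host.
    if failed_agents and passed_agents - failed_agents:
        return "host-specific"
    # Failed on at least two distinct agents and never passed:
    # the problem is most likely the code.
    if len(failed_agents) >= 2 and not passed_agents:
        return "code-specific"
    # Anything else (single attempt, flaky on one agent) tells us little.
    return "inconclusive"
```

A failure on one agent followed by a pass on another comes back as host-specific, while failures on two different agents with no pass come back as code-specific. Without a way to retry on a different agent, every case stays inconclusive.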
Are there any updates here? We have the same use case as OP, where agents can run out of disk space and we need a way to target a different agent on retries.
Hi again @mycarrysun, and welcome to the community!
Thank you for bumping these threads. I’ve reached out to the team to see where things are at with this request, and we’ll update the thread once we have some more information.
We raised this with the team and it’s on the roadmap, but there is no timeline for it yet. We will keep y’all updated in this thread when there is more movement on it.
+1. @Jason, do you have a timeline on this? We’ve just discussed building this feature client-side after some issues where a dodgy agent failed many jobs despite retries. I’d love to know if this is going to happen in the next few months, in which case I’ll tell my team to hold off.
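For context, the client-side version we discussed boils down to something like the sketch below: remember which agents a job has already failed on and pick any other one for the retry. The name `pick_retry_agent` and the plain lists of agent names are assumptions for illustration; how the chosen agent is actually targeted is left out, since that depends on the setup.

```python
from typing import Iterable, Optional, Sequence


def pick_retry_agent(
    available: Sequence[str], failed_on: Iterable[str]
) -> Optional[str]:
    """Return the first available agent the job has not failed on yet.

    Hypothetical helper: agent names are plain strings here, and the
    caller is responsible for routing the retry to the chosen agent.
    """
    excluded = set(failed_on)
    for agent in available:
        if agent not in excluded:
            return agent
    # Every available agent has already failed this job.
    return None
```

If it returns `None`, every agent has already failed the job, which is itself a hint that the failure is probably not host-specific.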
I’ve taken a look at this in the backlog and it is being investigated; however, it’s not something that will be released within the next few months while we work on other features and additions to Buildkite.
I’ll mark your +1 on the request though as it allows us to build a metric on how often requests come up and how they should be prioritised.