I run Buildkite in a monorepo environment. We use a lot of abstractions that trigger other steps or pipelines to keep our CI code well organized and maintainable.
It’s not uncommon for us to have 10+ steps in a build whose sole job is to trigger other steps or pipelines.
This leads me to my question: how can I decrease the wait time of these steps?
I see that the wait time is actually a few different lifecycle steps rolled up. I’d like to know which ones I have control over decreasing and how to do so.
I run plenty of agents, so I am rarely waiting on scaling activity.
Any help would be greatly appreciated.
Hi @OwenCR, the agent uses a polling architecture, and the poll interval depends on the number of agents you have registered in your organisation. It’s not configurable.
To allow us to manage the load, as the number of agents registered increases, the poll interval increases up to a maximum of 10 seconds.
So unfortunately, the only way to change this on your end would be to run fewer agents, but I suspect that would not be desirable because then pipelines would be stuck waiting in a different place.
Hope that helps!
Thanks for the context. Can you share more with me about how the polling architecture and algorithm work?
I’ve also observed long wait times that add up quickly for any given build that has many steps, especially builds with step dependencies. I’ve created a pull request to tentatively improve wait times, hopefully without any impact on aggregate server load.
@kgillette thanks for opening that PR. Someone from the team will review it and provide any feedback directly in the PR.
@OwenCR I’m not sure what else I could share, and it could even change in the future depending on load distribution, etc.
But at the moment, it’s about 2 seconds for <5 agents, 5 seconds for <10 and 10 seconds above that.
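To make those tiers concrete, here is a minimal sketch of the mapping as described above. The function name and the exact boundary handling are my assumptions; the real values are approximate and, as noted, could change.

```python
def poll_interval_seconds(registered_agents: int) -> int:
    """Approximate poll interval for an organisation (hypothetical helper).

    Mirrors the tiers quoted in this thread: about 2 seconds for fewer
    than 5 agents, 5 seconds for fewer than 10, and 10 seconds otherwise.
    """
    if registered_agents < 5:
        return 2
    if registered_agents < 10:
        return 5
    return 10  # stated maximum


if __name__ == "__main__":
    for count in (1, 4, 5, 9, 10, 200):
        print(count, "agents ->", poll_interval_seconds(count), "s")
```

With 10+ trigger steps in a build, each one can pay up to one interval before an agent picks it up, which is why the waits add up.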
My understanding is that the intent was that as you had more agents, there would be more pinging for work and that the average wait time would decrease, whilst also managing load for the API.
However, this is not how it works. In reality, the median time for agents to accept a job is approximately the same as the ping interval. The dispatcher as I remember sorts by agents that have recently performed a similar job, or failing that those that have pinged recently. It then assigns the work to that agent and waits for it to ping to pick up the work. Because it’s picking agents that have pinged most recently, it will ALWAYS take about a ping interval for the job to be accepted.
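A rough simulation of that effect, as a sketch only. It assumes every agent pings on a fixed interval with a random phase, ignores the "recently performed a similar job" preference, and uses names I made up for illustration; it is not the actual dispatcher code.

```python
import random
import statistics


def median_wait_most_recent(num_agents=50, interval=10.0, trials=2000, seed=0):
    """Median accept time when the job goes to the most recently pinged agent."""
    rng = random.Random(seed)
    waits = []
    for _ in range(trials):
        # Seconds since each agent last pinged, at the moment the job arrives.
        since_last_ping = [rng.uniform(0, interval) for _ in range(num_agents)]
        # The dispatcher picks whoever pinged most recently...
        freshest = min(since_last_ping)
        # ...so that agent's next ping is almost a full interval away.
        waits.append(interval - freshest)
    return statistics.median(waits)


if __name__ == "__main__":
    print("median wait (s):", round(median_wait_most_recent(), 2))
```

Under these assumptions the median sits just under the ping interval, and adding more agents pushes it closer to the interval rather than further from it.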
Unless I’m mistaken (which is totes possible), I reckon the best approach would be for the dispatcher to exclude agents that haven’t checked in for a long time, but otherwise assign at random if there isn’t an agent available that has previously done the same work. That should result in the Gaussian distribution of accept times that I think was the original intent.
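Here is a sketch of that idea under the same simplified model (fixed ping interval, random phases). The `stale_after` cutoff, the function names, and the single dead agent are all assumptions for illustration, and the "previously done the same work" preference is left out.

```python
import random
import statistics


def pick_agent(seconds_since_ping, stale_after, rng):
    """Pick a random agent among those that have pinged recently enough."""
    candidates = [i for i, s in enumerate(seconds_since_ping) if s <= stale_after]
    return rng.choice(candidates) if candidates else None


def median_wait_random(num_agents=50, interval=10.0, trials=2000, seed=0):
    """Median accept time when the job goes to a random non-stale agent."""
    rng = random.Random(seed)
    waits = []
    for _ in range(trials):
        since = [rng.uniform(0, interval) for _ in range(num_agents)]
        since.append(300.0)  # one agent that stopped pinging; must be skipped
        chosen = pick_agent(since, stale_after=3 * interval, rng=rng)
        # The chosen agent picks up the job at its next ping.
        waits.append(interval - since[chosen])
    return statistics.median(waits)


if __name__ == "__main__":
    print("median wait (s):", round(median_wait_random(), 2))
```

In this model the accept times spread roughly evenly between zero and one interval, so the median lands near half the interval instead of near the full interval.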
The dispatcher as I remember sorts by agents that have recently performed a similar job, or failing that those that have pinged recently
Yes, that is correct, and it continues to be the current behavior.
Your suggestion makes sense, and I like that approach.
Thanks for all the extra details @lachlandonald! This is gold.