I run Buildkite in a monorepo environment. We use a lot of abstractions that trigger other steps or pipelines to keep our CI code well organized and maintainable.
It’s not uncommon for us to have 10+ steps in a build whose sole job is to trigger other steps or pipelines.
This leads me to my question: how can I decrease the wait time of these steps?
I see that the wait time is actually a few different lifecycle steps rolled up. I’d like to know which ones I have control over decreasing and how to do so.
I run plenty of agents, so I am rarely waiting on scaling activity.
Hi @OwenCR, the agent uses a polling architecture, and the poll interval depends on the number of agents you have registered in your organisation. It’s not configurable.
To allow us to manage the load, as the number of agents registered increases, the poll interval increases up to a maximum of 10 seconds.
So unfortunately, the only way to change this on your end would be to run fewer agents, but I suspect that would not be desirable because then pipelines would be stuck waiting in a different place.
I’ve also observed long wait times that add up quickly for any given build that has many steps, especially builds with step dependencies. I’ve created a pull request to tentatively improve wait times, hopefully without any impact on aggregate server load.
My understanding is that the intent was that as you had more agents, there would be more pinging for work and that the average wait time would decrease, whilst also managing load for the API.
However, this is not how it works. In reality, the median time for agents to accept a job is approximately the same as the ping interval. The dispatcher, as I remember, prefers agents that have recently performed a similar job or, failing that, those that have pinged most recently. It then assigns the work to that agent and waits for it to ping again to pick up the work. Because it’s picking the agents that have pinged most recently, it will ALWAYS take about a full ping interval for the job to be accepted.
Unless I’m mistaken (which is totes possible), I reckon the best approach would be for the dispatcher to exclude agents that haven’t checked in for a long time but otherwise assign at random if there isn’t an agent available that has previously done the same work. That should result in the gaussian distribution of accept times that I think was the original intent.
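To make the argument above concrete, here is a rough simulation sketch. It is not Buildkite's dispatcher: the function name, the agent count of 50, and the assumption that agents ping at evenly spread random phases on a fixed 10-second interval are all mine, chosen only to compare "pick the most recently pinged agent" against "pick a random agent".

```python
import random


def mean_accept_wait(strategy, n_agents=50, interval=10.0, n_jobs=10_000, seed=0):
    """Average seconds between a job becoming ready and its agent's next ping.

    Hypothetical model: every agent pings once per `interval`, at a random phase.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_jobs):
        # Seconds since each agent last pinged, at the moment the job is ready.
        since_ping = [rng.uniform(0.0, interval) for _ in range(n_agents)]
        if strategy == "most_recent":
            chosen = min(since_ping)          # agent that pinged most recently
        elif strategy == "random":
            chosen = rng.choice(since_ping)   # any agent, uniformly at random
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        # The job is only accepted when the chosen agent pings again.
        total += interval - chosen
    return total / n_jobs


if __name__ == "__main__":
    for name in ("most_recent", "random"):
        print(name, round(mean_accept_wait(name), 2))
```

Under these assumptions the most-recent strategy lands close to the full interval, while random assignment averages about half of it, with waits spread evenly between zero and the interval.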
How does the dispatcher determine what is a similar job? Same pipeline or same repo? Agent selection filter? Some other criteria?
I could imagine that, if caching is well utilized, it could be worth waiting 10 seconds in order to save more than 10 seconds in redundant work. For pipelines that do not have opportunities for work sharing, a random agent would certainly be more effective.
However, it does not appear, even with that PR, that “waiting to accept” times have decreased, and I’m not sure why (perhaps the agent is delaying or waiting somewhere in the code that I did not see, or perhaps the dispatcher is assigning work to agents that are not immediately ready, perhaps causing a perpetual ~10 second delay due to the work allocation algorithm).
In the conversation above, “similar” may not have been the best choice of word. The dispatcher will assign a job to an agent that is able to run it, and will pick the agent that most recently ran a job.
It would really help us if you could point us to a pipeline or build URL where you observed wait times you did not expect, so we can check what happened and respond. Please send those details to support@buildkite.com so we can check what is causing the additional delay you are observing.
I came here because I am trying to crack the same problem – several very short jobs need to run in order to upload all the pipelines before the real work begins, and the wait times add up to be problematic.
A deeper change (with implications for the backend that I cannot judge from here) would be to let the agents long-poll. Aside from possible timeouts, this would be a server-side change. The API could delay responding to a ping for some time (say, 10s) before saying “nothing to do for you right now”. Then, the dispatcher could prefer those agents that are currently hanging in this state, immediately handing them work. This would drive the expected wait time towards zero: when an agent is available, it can be handed new work immediately at any point in time.
As an added benefit, you get some control over load on the server side: if you hold on to an in-flight ping for 30s, the API requests per second would decrease. However, I do recognize that holding on to these long-poll requests also takes some resources, and it can make managing a backend more challenging. Plus you’d need some way to actually notify the process that is holding on to the long-poll that it should now respond.
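As a minimal sketch of the long-poll idea, assuming a single in-process dispatcher rather than a real HTTP API: the class and function names below are hypothetical, and a standard-library `queue.Queue` stands in for the "notify the process holding the long-poll" mechanism. The agent's poll is held open until a job arrives or the hold time expires.

```python
import queue
import threading
import time


class LongPollDispatcher:
    """Toy dispatcher: a poll blocks until a job is available or the hold expires."""

    def __init__(self):
        self._jobs = queue.Queue()

    def submit(self, job):
        # Putting a job wakes one agent that is currently blocked in poll().
        self._jobs.put(job)

    def poll(self, hold_seconds):
        # Hold the request open; return None for "nothing to do right now".
        try:
            return self._jobs.get(timeout=hold_seconds)
        except queue.Empty:
            return None


def measure_handoff(hold_seconds=2.0, submit_after=0.2):
    """Return (job, seconds between submit and the waiting agent receiving it)."""
    dispatcher = LongPollDispatcher()
    result = {}

    def agent():
        result["job"] = dispatcher.poll(hold_seconds)
        result["received_at"] = time.monotonic()

    worker = threading.Thread(target=agent)
    worker.start()
    time.sleep(submit_after)          # the agent is already hanging in poll()
    submitted_at = time.monotonic()
    dispatcher.submit("upload-pipeline")
    worker.join()
    return result["job"], result["received_at"] - submitted_at


if __name__ == "__main__":
    job, delay = measure_handoff()
    print(job, f"{delay * 1000:.1f} ms")
```

In this toy setup the hand-off delay is bounded by thread wake-up time rather than by a poll interval. A real service would also have to pay for the open connections, which is the resource trade-off mentioned above.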
Hello, @mr_chronosphere! Welcome to the community and thank you for your feedback!
Could you please share the link to the pipeline(s) where you’re experiencing extended wait time to support@buildkite.com for us to see if anything could be done to optimize the process?
I have a few optimizations to work through already, thank you! There’s no single extended wait time, the seconds add up across steps. I came across this thread in my journey so just wanted to chime in. The polling adds a lower bound on the total time per step, this constrains how we can use pipelines-of-pipelines and pipelines-in-code. Bringing it close to zero would remove a bunch of constraints on that.
Hi! Agreed that bringing the time closer to zero would be the ideal scenario; unfortunately, it’s not something we can do at the moment because it would increase the load on our systems. Shortening the interval at which agents poll us increases the load we carry to support your agents.
But we are making improvements to our system to approach this issue better. There’s still a lot to do, so I don’t have a timeframe to share, but this is in our roadmap.
Running into the same problem. Almost all jobs have a 10s dispatch time. It can add 2 min of overhead to our highly optimized build process, where the total run time can be as low as 5 min…
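For a sense of scale, the arithmetic can be sketched as below. The count of 12 sequential steps is an assumption picked to match the 2 min figure, and the share calculation assumes the 5 min total already includes the dispatch overhead.

```python
def dispatch_overhead_seconds(sequential_steps, wait_per_step=10.0):
    """Total dispatch wait when steps run one after another."""
    return sequential_steps * wait_per_step


def overhead_share(sequential_steps, total_seconds, wait_per_step=10.0):
    """Fraction of the total build time spent waiting for dispatch."""
    return dispatch_overhead_seconds(sequential_steps, wait_per_step) / total_seconds


if __name__ == "__main__":
    steps = 12  # assumed number of sequential steps on the critical path
    print(dispatch_overhead_seconds(steps), "seconds of dispatch wait")
    print(f"{overhead_share(steps, 300.0):.0%} of a 5 min build")
```

Only steps on the critical path count here; steps dispatched in parallel wait at the same time, so they do not add to the total.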
Fixing this server side without hitting resource problems is close to impossible IMHO. Is there any work being done on a local dispatcher (closer to the agents)?
Idea from the top of my head:
Agents register to the local dispatcher
The local dispatcher gets lower-latency responses from the Buildkite central server because it offloads the load from all the agents it is responsible for.
Communication between agents and the local dispatcher can use a much more resource-intensive solution, since that cost is distributed to local resources.
The size of the instance and the resources allocated to the agents are crucial factors. A common issue is agents taking too long to get assigned to jobs due to running out of disk space.
As for the current inquiry, to my knowledge, we do not have that at the moment. If you can provide the agent logs, this will help us determine if there is anything unusual. For privacy concerns, please send them to support@buildkite.com.