Decrease Wait time

I run Buildkite in a monorepo environment. To keep our CI code well organized and maintainable, we use a lot of abstractions that trigger other steps or pipelines.

It’s not uncommon for us to have 10+ steps on a build whose sole job is to trigger other steps or pipelines.

This leads me to my question: how can I decrease the wait time of these steps?

I see that the wait time is actually a few different lifecycle steps rolled up. I’d like to know which ones I have control over decreasing and how to do so.

I run plenty of agents, I am rarely waiting on scaling activity.

Any help would be greatly appreciated.

Hi @OwenCR, the agent uses a polling architecture whose interval depends on the number of agents you have registered in your organisation. And it’s not configurable.

To allow us to manage the load, as the number of agents registered increases, the poll interval increases up to a maximum of 10 seconds.

So unfortunately, the only way to change this on your end would be to run fewer agents, but I suspect that would not be desirable because then pipelines would be stuck waiting in a different place.

Hope that helps!

Thanks for the context. Can you share more about how the polling architecture and algorithm work?

I’ve also observed long wait times that add up quickly for any given build that has many steps, especially builds with step dependencies. I’ve created a pull request to tentatively improve wait times, hopefully without any impact on aggregate server load.


@kgillette thanks for opening up that PR :slight_smile: someone from the team will review it and provide any feedback directly in the PR.

@OwenCR I’m not sure what else I could share and it could even change in the future depending on load distribution, etc.

But at the moment, it’s about 2 seconds for <5 agents, 5 seconds for <10 and 10 seconds above that.
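For clarity, the tiered intervals described above can be sketched as a tiny function. This is purely illustrative of the numbers given in this thread, not Buildkite's actual implementation, and the function name is hypothetical:

```python
def poll_interval_seconds(registered_agents: int) -> int:
    """Illustrative sketch of the tiered poll interval described above:
    ~2s below 5 agents, ~5s below 10 agents, capped at 10s beyond that."""
    if registered_agents < 5:
        return 2
    if registered_agents < 10:
        return 5
    return 10  # maximum poll interval
```

So an organisation with, say, 50 registered agents would see each agent poll roughly every 10 seconds.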


My understanding is that the intent was that as you added more agents, there would be more pinging for work and the average wait time would decrease, while also managing load on the API.

However, this is not how it works. In reality, the median time for agents to accept a job is approximately the same as the ping interval. The dispatcher, as I remember it, sorts for agents that have recently performed a similar job, or failing that, those that have pinged most recently. It then assigns the work to that agent and waits for it to ping again to pick up the work. Because it picks agents that have pinged most recently, it will ALWAYS take about one ping interval for the job to be accepted.

Unless I’m mistaken (which is totes possible), I reckon the best approach would be for the dispatcher to exclude agents that haven’t checked in for a long time, but otherwise assign at random if there isn’t an agent available that has previously done the same work. That should result in the gaussian distribution of accept times that I think was the original intent.
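To make the contrast concrete, here is a hypothetical sketch of the two selection strategies discussed above. The class and function names are illustrative, not Buildkite's API, and the "current" behaviour is only as described in this thread:

```python
import random

class Agent:
    def __init__(self, name, seconds_since_ping, last_job_pipeline=None):
        self.name = name
        self.seconds_since_ping = seconds_since_ping
        self.last_job_pipeline = last_job_pipeline

def pick_agent_current(agents, pipeline):
    """Prefer an agent that recently ran a job for this pipeline, otherwise
    the agent that pinged most recently. Because the chosen agent has just
    pinged, it won't poll again for roughly one full interval, so the job
    waits about that long to be accepted."""
    familiar = [a for a in agents if a.last_job_pipeline == pipeline]
    pool = familiar or agents
    return min(pool, key=lambda a: a.seconds_since_ping)

def pick_agent_suggested(agents, pipeline, stale_after=60):
    """Exclude agents that haven't checked in for a long time, then assign
    at random when no familiar agent exists. Random choice spreads the
    accept delay across the ping interval instead of always paying it in full."""
    alive = [a for a in agents if a.seconds_since_ping < stale_after]
    familiar = [a for a in alive if a.last_job_pipeline == pipeline]
    return random.choice(familiar or alive)
```

With random assignment, an agent's next ping is uniformly distributed over the interval, so the average accept time should be around half the interval rather than the whole of it.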


The dispatcher as I remember sorts by agents that have recently performed a similar job, or failing that those that have pinged recently

Yes, that is correct, and it continues to be the current behavior :slight_smile:
Your suggestion makes sense, I like that approach :ok_hand:

Thanks for all the extra details @lachlandonald! this is gold :blush:


How does the dispatcher determine what is a similar job? Same pipeline or same repo? Agent selection filter? Some other criteria?

I could imagine that, if caching is well utilized, it could be worth waiting 10 seconds in order to save more than 10 seconds of redundant work. For pipelines that do not have opportunities for work sharing, a random agent would certainly be more effective.

FWIW, the PR I linked to earlier (Ping immediately after completing a job by extemporalgenome · Pull Request #1567 · buildkite/agent · GitHub) was intended to reduce this wait time to zero in the case where the pipeline has more steps dependent upon, or unblocked by, a freshly completed step, and where the dispatcher happens to assign work to an agent that is immediately available (i.e. reporting that it completed the previous job).

However, it does not appear, even with that PR, that “waiting to accept” times have decreased, and I’m not sure why (perhaps the agent is delaying or waiting somewhere in the code that I did not see, or perhaps the dispatcher is assigning work to agents that are not immediately ready, perhaps causing a perpetual ~10 second delay due to the work allocation algorithm).

Hi @kgillette

In the conversation above, “similar” may not have been the best choice of words. The dispatcher will assign a job to an agent that can run it, and will pick the agent that most recently ran a job.

It would really help us if you could point us to a pipeline or build URL of yours where you observed wait times you did not expect, so we can check what happened and respond. Please send those details to us so we can check what is causing the additional delay you are observing.