Decrease Wait time

I run Buildkite in a monorepo environment. We use a lot of abstractions that trigger other steps or pipelines, which keeps our CI code well organized and maintainable.

It’s not uncommon for us to have 10+ steps in a build whose sole job is to trigger other steps or pipelines.

This leads me to my question: how can I decrease the wait time of these steps?


I see that the wait time is actually a few different lifecycle steps rolled up. I’d like to know which ones I have control over decreasing and how to do so.

I run plenty of agents, so I am rarely waiting on scaling activity.

Any help would be greatly appreciated.

Hi @OwenCR, the agent uses a polling architecture, and the poll interval depends on the number of agents you have registered in your organisation. It’s not configurable.

To allow us to manage the load, as the number of agents registered increases, the poll interval increases up to a maximum of 10 seconds.

So unfortunately, the only way to change this on your end would be to run fewer agents, but I suspect that would not be desirable because then pipelines would be stuck waiting in a different place.

Hope that helps!

Thanks for the context. Can you share more with me about how the polling architecture and algorithm work?

I’ve also observed long wait times that add up quickly for any given build that has many steps, especially builds with step dependencies. I’ve created a pull request to tentatively improve wait times, hopefully without any impact on aggregate server load.

@kgillette thanks for opening up that PR :slight_smile: someone from the team will review it and provide any feedback directly in the PR.

@OwenCR I’m not sure what else I could share and it could even change in the future depending on load distribution, etc.

But at the moment, it’s about 2 seconds for <5 agents, 5 seconds for <10 and 10 seconds above that.
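
For anyone who wants those tiers in code form, here is a minimal sketch. The function name is hypothetical and the thresholds are only the approximate figures quoted in this thread, so treat it as an illustration rather than how Buildkite actually computes the interval.

```python
def poll_interval_seconds(registered_agents: int) -> int:
    """Approximate agent ping interval for an organisation.

    Hypothetical helper: the tiers are the "about" figures quoted in this
    thread and may change depending on load distribution.
    """
    if registered_agents < 0:
        raise ValueError("registered_agents must be non-negative")
    if registered_agents < 5:
        return 2   # fewer than 5 agents: about 2 seconds
    if registered_agents < 10:
        return 5   # fewer than 10 agents: about 5 seconds
    return 10      # 10 or more agents: capped at about 10 seconds


if __name__ == "__main__":
    for count in (1, 4, 5, 9, 10, 200):
        print(count, poll_interval_seconds(count))
```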

My understanding is that the intent was that as you had more agents, there would be more pinging for work and that the average wait time would decrease, whilst also managing load for the API.

However, this is not how it works. In reality, the median time for agents to accept a job is approximately the same as the ping interval. The dispatcher as I remember sorts by agents that have recently performed a similar job, or failing that those that have pinged recently. It then assigns the work to that agent and waits for it to ping to pick up the work. Because it’s picking agents that have pinged most recently, it will ALWAYS take about a ping interval for the job to be accepted.

Unless I’m mistaken (which is totes possible), I reckon the best approach would be for the dispatcher to exclude agents that haven’t checked in for a long time but otherwise assign at random if there isn’t an agent available that has previously done the same work. That should result in the Gaussian distribution of accept times that I think was the original intent.
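
To make the difference between the two strategies concrete, here is a small simulation sketch. It is a toy model under my own assumptions, not the real dispatcher: every agent is idle, pings on a fixed interval with a random phase, and only learns about an assigned job at its next ping. It ignores the "recently ran a similar job" preference and the "exclude stale agents" step, and all names are made up for illustration.

```python
import random
import statistics


def simulate_accept_wait(strategy: str, agents: int = 50, interval: float = 10.0,
                         jobs: int = 2000, seed: int = 1) -> float:
    """Return the median seconds between a job arriving and the chosen agent's next ping."""
    rng = random.Random(seed)
    # Each idle agent pings at phase, phase + interval, phase + 2 * interval, ...
    phases = [rng.uniform(0, interval) for _ in range(agents)]
    waits = []
    for _ in range(jobs):
        now = rng.uniform(interval, 100 * interval)  # job arrival time
        # Seconds elapsed since each agent's last ping.
        since_last = [(now - phase) % interval for phase in phases]
        if strategy == "most_recent":
            # Assumed current behaviour: pick the agent that pinged most recently.
            chosen = min(range(agents), key=lambda i: since_last[i])
        elif strategy == "random":
            # Suggested alternative: pick any live agent at random.
            chosen = rng.randrange(agents)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        # The chosen agent only picks the job up at its next ping.
        waits.append(interval - since_last[chosen])
    return statistics.median(waits)


if __name__ == "__main__":
    for name in ("most_recent", "random"):
        print(name, round(simulate_accept_wait(name), 2))
```

In this model the "most_recent" median sits close to the full ping interval, while "random" lands around half of it, which matches the reasoning above.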

“The dispatcher as I remember sorts by agents that have recently performed a similar job, or failing that those that have pinged recently.”

Yes, that is correct, and it continues to be the current behavior :slight_smile:
Your suggestion makes sense, I like that approach :ok_hand:

Thanks for all the extra details, @lachlandonald! This is gold :blush:

How does the dispatcher determine what is a similar job? Same pipeline or same repo? Agent selection filter? Some other criteria?

I could imagine that, if caching is well utilized, it could be worth waiting 10 seconds in order to save more than 10 seconds in redundant work. For pipelines that do not have opportunities for work sharing, a random agent would certainly be more effective.

FWIW, the PR I linked to earlier (Ping immediately after completing a job by extemporalgenome · Pull Request #1567 · buildkite/agent · GitHub) was intended to reduce this wait time to zero in the case where the pipeline has more steps dependent upon or unblocked by a freshly completed step, and the dispatcher happens to assign work to an agent that is immediately available (i.e. one reporting that it completed the previous job).

However, it does not appear, even with that PR, that “waiting to accept” times have decreased, and I’m not sure why (perhaps the agent is delaying or waiting somewhere in the code that I did not see, or perhaps the dispatcher is assigning work to agents that are not immediately ready, perhaps causing a perpetual ~10 second delay due to the work allocation algorithm).

Hi @kgillette

In the conversation above, “similar” may not have been the best choice of word. The dispatcher will assign a job that can run on that agent, and will pick the agent that has recently run a job.

It would really help us if you could point us to a pipeline or build URL of yours where you observed wait times you did not expect. Please send those details to support@buildkite.com so we can check what is causing the additional delay you are observing and respond.

I came here because I am trying to crack the same problem – several very short jobs need to run in order to upload all the pipelines before the real work begins, and the wait times add up to be problematic.

A deeper change (with implications for the backend that I cannot judge from here) would be to let the agents long-poll. Aside from possible timeouts, this would be a server-side change. The API could delay responding to a ping for some time (say, 10s) before saying “nothing to do for you right now”. Then, the dispatcher could prefer those agents that are currently hanging in this state, immediately handing them work. This would drive the expected wait time towards zero: when an agent is available, it can be handed new work immediately at any point in time.

As an added benefit, you get some control over the load on the server side: if you hold on to an in-flight ping for 30s, the API requests per second would decrease. However, I do recognize that holding onto these long-poll requests also takes some resources, and it can make managing a backend more challenging. Plus you’d need some way to actually notify the process that is holding on to the long-poll that it should now respond.
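
To illustrate the idea, here is a minimal in-memory sketch of long-polling using Python's asyncio. It is not the Buildkite API: the class, the method names, and the hold time are all hypothetical, and a real backend would need the notification to work across processes. The queue plays the role of "some way to notify the process holding the long-poll".

```python
import asyncio
from typing import Optional


class LongPollDispatcher:
    """Toy dispatcher: a parked ping is answered as soon as a job arrives."""

    def __init__(self, hold_seconds: float = 10.0) -> None:
        self.hold_seconds = hold_seconds
        self._jobs: asyncio.Queue = asyncio.Queue()

    async def ping(self) -> Optional[str]:
        """Handle one agent ping: return a job id, or None once the hold expires."""
        try:
            return await asyncio.wait_for(self._jobs.get(), timeout=self.hold_seconds)
        except asyncio.TimeoutError:
            return None  # "nothing to do for you right now"

    def submit(self, job_id: str) -> None:
        """Queue a job; a parked ping (if any) wakes up and receives it."""
        self._jobs.put_nowait(job_id)


async def demo() -> tuple:
    dispatcher = LongPollDispatcher(hold_seconds=0.5)
    parked = asyncio.create_task(dispatcher.ping())  # agent starts long-polling
    await asyncio.sleep(0.1)                         # a job shows up a bit later
    dispatcher.submit("job-1")
    got = await parked                               # handed over without waiting for another ping
    idle = await dispatcher.ping()                   # no work queued: times out after the hold
    return got, idle


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Agents waiting on the queue are served in arrival order here, which is one simple way of preferring agents that are already parked.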

Hello, @mr_chronosphere! Welcome to the community and thank you for your feedback!
Could you please share the link to the pipeline(s) where you’re experiencing extended wait time to support@buildkite.com for us to see if anything could be done to optimize the process?

Many thanks!
Cheers
Karen

I have a few optimizations to work through already, thank you! There’s no single extended wait time; the seconds add up across steps. I came across this thread in my journey so just wanted to chime in. The polling adds a lower bound on the total time per step, which constrains how we can use pipelines-of-pipelines and pipelines-in-code. Bringing it close to zero would remove a bunch of constraints on that.
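
As a back-of-the-envelope sketch of how that lower bound accumulates: the helper below is hypothetical and assumes each step in a strictly sequential chain waits roughly one ping interval before an agent accepts it, as described earlier in the thread.

```python
def min_dispatch_overhead_seconds(sequential_steps: int, ping_interval: float) -> float:
    """Rough floor on time spent waiting for agents to accept jobs.

    Assumes every step waits about one ping interval before being accepted
    and that the steps run strictly one after another.
    """
    if sequential_steps < 0 or ping_interval < 0:
        raise ValueError("inputs must be non-negative")
    return sequential_steps * ping_interval


if __name__ == "__main__":
    # Ten chained trigger/upload steps at a 10 second interval.
    print(min_dispatch_overhead_seconds(10, 10.0))  # 100.0
```

So a build with ten chained trigger steps at the 10 second interval would spend on the order of 100 seconds just waiting for jobs to be accepted, before any real work is counted.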

Hi! Agree that bringing the time closer to zero would be an ideal scenario; unfortunately, it’s not something we can do at the moment because it would increase the load on our systems. Shortening the interval at which agents poll us increases the load we carry to support your agents.

But we are making improvements to our system to approach this issue better. There’s still a lot to do, so I don’t have a timeframe to share, but this is on our roadmap.

Thanks!
