Decrease Wait time

OwenCR · February 24, 2022, 1:18am

I run buildkite in a monorepo environment. We use a lot of abstractions that cause us to trigger other steps or pipelines to keep our CI code well organized and maintainable.

It’s not uncommon for us to have 10+ steps in a build whose sole job is to trigger other steps or pipelines.

This leads me to my question: how can I decrease the wait time of these steps?

I see that the wait time is actually a few different lifecycle steps rolled up. I’d like to know which ones I have control over decreasing and how to do so.

I run plenty of agents, I am rarely waiting on scaling activity.

Any help would be greatly appreciated.

paula · February 24, 2022, 2:14am

Hi @OwenCR, the agent uses a polling architecture, and the poll interval depends on the number of agents you have registered in your organisation. It’s not configurable.

To allow us to manage the load, as the number of agents registered increases, the poll interval increases up to a maximum of 10 seconds.

So unfortunately, the only way to change this on your end would be to run fewer agents, but I suspect that would not be desirable because then pipelines would be stuck waiting in a different place.

Hope that helps!

OwenCR · February 25, 2022, 2:40pm

Thanks for the context. Can you share more with me about how the polling architecture and algorithm work?

kgillette · February 25, 2022, 7:46pm

I’ve also observed long wait times that add up quickly for any given build that has many steps, especially builds with step dependencies. I’ve created a pull request to tentatively improve wait times, hopefully without any impact on aggregate server load.

jeremy · February 25, 2022, 9:46pm

@kgillette thanks for opening that PR. Someone from the team will review it and provide any feedback directly in the PR.

paula · February 28, 2022, 1:10am

@OwenCR I’m not sure what else I could share and it could even change in the future depending on load distribution, etc.

But at the moment, it’s about 2 seconds for <5 agents, 5 seconds for <10 and 10 seconds above that.
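The tiers above can be written as a small lookup. This is only a sketch of the numbers quoted in this thread, not Buildkite’s actual implementation: the function name is made up, the values are approximate ("about"), and treating exactly 10 agents as the top tier is an assumption.

```python
def poll_interval_seconds(registered_agents: int) -> int:
    """Approximate agent ping interval, using the tiers quoted in this thread.

    Hypothetical helper: the real thresholds live server-side and may change.
    """
    if registered_agents < 5:
        return 2
    if registered_agents < 10:
        return 5
    # Assumption: "above that" includes exactly 10 agents.
    return 10


for count in (1, 4, 5, 9, 10, 200):
    print(f"{count} agents -> ping every {poll_interval_seconds(count)}s")
```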

lachlandonald · March 16, 2022, 6:06am

My understanding is that the intent was that as you had more agents, there would be more pinging for work and that the average wait time would decrease, whilst also managing load for the API.

However, this is not how it works. In reality, the median time for agents to accept a job is approximately the same as the ping interval. The dispatcher as I remember sorts by agents that have recently performed a similar job, or failing that those that have pinged recently. It then assigns the work to that agent and waits for it to ping to pick up the work. Because it’s picking agents that have pinged most recently, it will ALWAYS take about a ping interval for the job to be accepted.

Unless I’m mistaken (which is totes possible), I reckon the best approach would be for the dispatcher to exclude agents that haven’t checked in for a long time, but otherwise assign at random if there isn’t an agent available that has previously done the same work. That should result in the gaussian distribution of accept times that I think was the original intent.
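The two assignment strategies can be compared with a toy simulation. This is a simplified model under stated assumptions, not the real dispatcher: every agent is idle, pings on a fixed interval with a random phase, and only learns about an assigned job on its next ping. The function and strategy names are invented for illustration.

```python
import random


def simulate(strategy: str, agents: int = 50, interval: float = 10.0,
             jobs: int = 10_000, seed: int = 0) -> float:
    """Return the mean seconds between assigning a job and the agent's next ping."""
    rng = random.Random(seed)
    total_wait = 0.0
    for _ in range(jobs):
        # Seconds since each idle agent last pinged, uniform over one interval.
        since_ping = [rng.uniform(0.0, interval) for _ in range(agents)]
        if strategy == "most_recent_ping":
            chosen = min(since_ping)         # agent that pinged most recently
        elif strategy == "random":
            chosen = rng.choice(since_ping)  # any agent, picked at random
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        # The chosen agent picks the job up on its next ping.
        total_wait += interval - chosen
    return total_wait / jobs


recent = simulate("most_recent_ping")
rand = simulate("random")
print(f"most recent ping: {recent:.1f}s, random: {rand:.1f}s")
```

In this model, picking the most recently pinged agent gives a mean wait close to the full interval, while random assignment spreads waits evenly between zero and the interval, averaging about half of it.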

paula · March 16, 2022, 10:20pm

“The dispatcher as I remember sorts by agents that have recently performed a similar job, or failing that those that have pinged recently”

Yes, that is correct, and it continues to be the current behavior.
Your suggestion makes sense; I like that approach.

Thanks for all the extra details @lachlandonald! This is gold.

kgillette · September 6, 2022, 8:44pm

How does the dispatcher determine what is a similar job? Same pipeline or same repo? Agent selection filter? Some other criteria?

I could imagine that, if caching is well utilized, it could be worth waiting 10 seconds in order to save more than 10 seconds in redundant work. For pipelines that do not have opportunities for work sharing, a random agent would certainly be more effective.

FWIW, the PR I linked to earlier (Ping immediately after completing a job by extemporalgenome · Pull Request #1567 · buildkite/agent · GitHub) was intended to reduce this wait time to zero in the case where the pipeline has more steps dependent upon or unblocked by a freshly completed step, and the dispatcher happens to assign work to an agent that is immediately available (i.e. reporting that it completed the previous job).

However, it does not appear, even with that PR, that “waiting to accept” times have decreased, and I’m not sure why (perhaps the agent is delaying or waiting somewhere in the code that I did not see, or perhaps the dispatcher is assigning work to agents that are not immediately ready, perhaps causing a perpetual ~10 second delay due to the work allocation algorithm).

suma · September 7, 2022, 12:39am

Hi @kgillette

In the conversation above, “similar” may not have been the best choice of words. The dispatcher will assign a job that can run on that agent, and will pick the agent that most recently ran a job.

It would really help us if you could point us to a pipeline or build URL of yours where you observed wait times you did not expect, so we can check what happened and respond. Please send those details to support@buildkite.com so we can check what is causing the additional delay you are observing.

mr_chronosphere · May 4, 2023, 12:32pm

I came here because I am trying to crack the same problem – several very short jobs need to run in order to upload all the pipelines before the real work begins, and the wait times add up to be problematic.

A deeper change (with implications for the backend that I cannot judge from here) would be to let the agents long-poll. Aside from possible timeouts, this would be a server-side change. The API could delay responding to a ping for some time (say, 10s) before saying “nothing to do for you right now”. Then, the dispatcher could prefer those agents that are currently hanging in this state, immediately handing them work. This would drive the expected wait time towards zero: when an agent is available, it can be handed new work immediately at any point in time.

As an added benefit, you get some control of the server load on the server side – if you hold on to an in-flight ping for 30s, the API requests per second would decrease. However I do recognize that holding onto these long-poll requests also takes some resources, and it can make managing a backend more challenging. Plus you’d need some way to actually notify the process that is holding on to the long-poll that it should now respond.
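A minimal in-process sketch of the long-poll idea, assuming a toy dispatcher rather than the real Buildkite API: `LongPollDispatcher`, `ping`, and `submit` are hypothetical names. A `queue.Queue` stands in for the "notify the waiting request" mechanism, since `get(timeout=...)` blocks until either an item arrives or the hold time expires.

```python
import queue
import threading
import time


class LongPollDispatcher:
    """Toy stand-in for an API that holds pings open (hypothetical)."""

    def __init__(self, hold_seconds: float) -> None:
        self.hold_seconds = hold_seconds
        self._jobs = queue.Queue()

    def submit(self, job: str) -> None:
        # Wakes one agent currently blocked inside ping(), if any.
        self._jobs.put(job)

    def ping(self):
        """Block until a job arrives or the hold expires; None means no work."""
        try:
            return self._jobs.get(timeout=self.hold_seconds)
        except queue.Empty:
            return None


def agent(dispatcher: LongPollDispatcher, accepted: list) -> None:
    job = None
    while job is None:  # re-ping whenever a long poll times out
        job = dispatcher.ping()
    accepted.append((job, time.monotonic()))


dispatcher = LongPollDispatcher(hold_seconds=0.5)
accepted = []
worker = threading.Thread(target=agent, args=(dispatcher, accepted))
worker.start()

time.sleep(0.2)  # the agent is now parked inside ping()
submitted_at = time.monotonic()
dispatcher.submit("upload-pipeline")
worker.join()

job, accepted_at = accepted[0]
print(f"{job} accepted after {accepted_at - submitted_at:.3f}s")
```

Here the agent accepts the job almost as soon as it is submitted, rather than at its next scheduled ping. A real server would also have to bound the number of held connections, which is the resource cost mentioned above.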

karen.sawrey · May 4, 2023, 2:42pm

Hello, @mr_chronosphere! Welcome to the community and thank you for your feedback!
Could you please share the link to the pipeline(s) where you’re experiencing extended wait time to support@buildkite.com for us to see if anything could be done to optimize the process?

Many thanks!
Cheers
Karen

mr_chronosphere · May 5, 2023, 8:02am

I have a few optimizations to work through already, thank you! There’s no single extended wait time, the seconds add up across steps. I came across this thread in my journey so just wanted to chime in. The polling adds a lower bound on the total time per step, this constrains how we can use pipelines-of-pipelines and pipelines-in-code. Bringing it close to zero would remove a bunch of constraints on that.
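As a rough illustration of how the seconds add up, here is the arithmetic for a chain of dependent steps. The figures are assumptions taken from earlier in this thread (10 trigger steps, the 10-second maximum poll interval), and only steps that run one after another contribute; steps that run in parallel wait concurrently.

```python
def dispatch_overhead_seconds(sequential_steps: int, wait_per_step: float) -> float:
    """Lower bound on time spent waiting for agents to accept jobs."""
    return sequential_steps * wait_per_step


# 10 dependent trigger steps, each waiting about one 10-second poll interval.
print(dispatch_overhead_seconds(10, 10.0))  # 100.0
```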

paula · May 5, 2023, 8:50am

Hi! Agree that bringing the time closer to zero would be an ideal scenario; unfortunately, it’s not something we can do at the moment because this would increase the load on our systems. Shortening the interval at which agents poll us increases the load we carry to support your agents.

But we are making improvements to our system to approach this issue better. There’s still a lot to do, so I don’t have a timeframe to share, but this is on our roadmap.

Thanks!

Asmund · August 1, 2024, 9:16am

hi

Running into the same problem. Almost all jobs have a 10s dispatch time. It can add 2 min of overhead to our highly optimized build process. Total run time can be as low as 5 min…

Fixing this server side without hitting resource problems is close to impossible IMHO. Is there any work being done to have a local (closer to agents) dispatcher?

Idea off the top of my head:

Agents register with the local dispatcher.
The local dispatcher gets lower-latency responses from the Buildkite central server because it offloads the load from all the agents it is responsible for.
Communication between agents and the local dispatcher can use a much more resource-intensive solution, since that load is distributed across local resources.

UI lagging behind is totally fine for us.

BR,
Åsmund

stephanie.atte · August 1, 2024, 8:01pm

Hey @Asmund

The size of the instance and the resources allocated to the agents are crucial factors. A common issue is agents taking too long to get assigned to jobs due to running out of disk space.

As for the current inquiry, to my knowledge, we do not have that at the moment. If you can provide the agent logs, this will help us determine if there is anything unusual. For privacy concerns, please send them to support@buildkite.com.

Asmund · August 2, 2024, 1:46pm

Thanks @stephanie.atte, I have started a thread with support.

zenogueira · May 27, 2025, 3:59pm

This polling interval seems to be growing with time. Currently, we have a lot of jobs with wait times between 10 and 20s. Add a few of those in dependent steps and the wait time starts to weigh heavily on the overall build time.

Are there any advancements on this front? As in having a local dispatcher or something that can coordinate resources closer to the agents?

Thank you

pete · May 27, 2025, 5:31pm

Hi @zenogueira

Many improvements have been made to the performance of the Agent API in recent months. If you are still seeing wait times of 10-20 seconds for many of your jobs we would ask that you reach out to support@buildkite.com to provide us with links to these jobs so that we can investigate further. Thanks!
