All builds "Waiting on concurrency group"

Hi, we’ve seen something very weird this morning with all our Buildkite builds.

Generally we trigger an end-to-end build across our microservices on any successful development build. Today, no e2e build is starting, and instead all builds are sitting in “Waiting on concurrency group” when they hit the trigger e2e step.

This appears to have started late on Sunday or early Monday morning. We have scheduled builds that run at the start of each day and they are currently all hanging.

We have restarted all agents and they are all active and connected.

Anyone else seeing this? It doesn’t seem like it’s our issue as we made no changes this weekend. Is this a Buildkite problem? Or if it’s us, how did we cause it?

Thanks in advance for any help you can offer!

Neil Brennan

Hi @Nello,

Thanks for reaching out to us on this one. We are currently looking into whether this is related to an ongoing incident and will get back to you when we have more info.

Cheers!

Hi, a little more info. It looks like the change that has impacted us was actually late Friday or early Saturday.

Right now all of our unclustered agents don’t seem to be picking up jobs in any concurrency group. We’ve raised the limit in our e2e group and it hasn’t helped.

We can’t see any indication of an ongoing incident on your status page, btw!

Thanks,
Neil

Hi @lizette. Thanks for responding earlier.

It looks like our “stalled” builds have all just started passing. Did you guys do something? :blush:

Hi @Nello,

Job states are quite complicated, and a job can get “stuck” in a particular state for many reasons. If a job is stuck in a concurrency group, it also causes other jobs in that group to be stuck in the limited state (waiting for the concurrency gate to open). For that reason, we give customers the ability to reprocess stuck jobs from the UI at https://buildkite.com/organizations/{organization-slug}/concurrency-groups.
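For anyone reading along, here is a rough sketch of the blocking behaviour described above. It is a toy model only, not how Buildkite implements concurrency gates: the `ConcurrencyGroup` class, its `reprocess` method, the job names, and the limit of 1 are all assumptions made for illustration. It assumes that reprocessing amounts to releasing the slot held by a job that never reported completion.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ConcurrencyGroup:
    """Toy model of a concurrency gate: at most `limit` jobs run at once."""
    name: str
    limit: int  # assumed value; the real limit is set in the pipeline
    running: set = field(default_factory=set)
    waiting: deque = field(default_factory=deque)

    def submit(self, job: str) -> None:
        # New jobs queue first, then the gate admits as many as it can.
        self.waiting.append(job)
        self._admit()

    def finish(self, job: str) -> None:
        # A job that completes normally frees its slot.
        self.running.discard(job)
        self._admit()

    def reprocess(self, stuck_job: str) -> None:
        # Hypothetical stand-in for the UI action: release the slot held
        # by a job that never reported completion.
        self.running.discard(stuck_job)
        self._admit()

    def _admit(self) -> None:
        # Move waiting jobs into free slots, oldest first.
        while self.waiting and len(self.running) < self.limit:
            self.running.add(self.waiting.popleft())


group = ConcurrencyGroup(name="trigger-e2e-build", limit=1)
group.submit("job-1")  # takes the only slot and then gets stuck
group.submit("job-2")
group.submit("job-3")
blocked = list(group.waiting)  # jobs held behind the stuck one

group.reprocess("job-1")  # frees the slot, so the next job is admitted
after_reprocess = sorted(group.running)
group.finish("job-2")  # normal completion lets the last job through
```

In this model, everything submitted after the stuck job stays in `waiting` until the slot is released, which matches the "Waiting on concurrency group" symptom in the original post.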

I have just reprocessed the stuck job in the trigger-e2e-build concurrency group for you.

Cheers!

Thanks @lizette!

I did try raising the limit on that concurrency group … which didn’t help at all.

First time we’ve seen this one in 8 years of Buildkite use. I hope recent changes haven’t made this less stable :(

Again, thank you for your help.

Neil

Hi @Nello,

When was the new concurrency group limit applied? Was it while a job was stuck in the waiting state? If so, it is possible that the concurrency group’s limit was not refreshed. I tried to reproduce this locally, but the jobs in my tests were not stuck in a waiting state, so the new limit was always applied as expected.

However, if you observe this again, do let us know and we can investigate further.

Cheers!