Queue wait times metric

The buildkite-agent-metrics application monitors queue metrics such as RunningJobsCount, ScheduledJobsCount, and UnfinishedJobsCount. These metrics can indicate a problem with the system, but more often they just show that the system is under load and running normally. My team uses these metrics to alert us when the system is having an issue, but the alerts are frequently false alarms.

Long wait times, on the other hand, always indicate a problem for our users. I believe it would be useful to track queue wait times in the same way buildkite-agent-metrics tracks the job counts.

Hey @mat! Sorry for such a delay getting back to you on this one. Did you come up with any workarounds in the meantime?

I just wanted to get a bit more info on your idea if that’s alright :grinning_face_with_smiling_eyes: By ‘queue wait times’ do you mean the time a job sits in a queue before the agent picks it up? And where/how would you want to consume this info? Would you need it from an API, or would something in the UI suit? Would you be pulling it into your own dashboards?

+1 to this.

On a related note, how is WaitJobCount defined?

Hey @clambertops, hmm, I'm not sure about that value. Whereabouts are you seeing WaitJobCount?

I see it in CloudWatch after upgrading to 5.1.0 from 4.5.0:

Skimming the release notes for the stack and the agent, I don’t see any mention of this. Nor do I see any mention of it here:

Additionally, the Total and Idle count metrics are not working. You can see all of this here:

Shared with CloudApp

I’m quite at a loss. Hoping someone here can point me in some helpful direction.

Thanks,
Chris

Ahh, I see it! You are correct - it was added into the V5 release.

It is in the V5 changelog (buildkite-agent-metrics/CHANGELOG.md at c8145a178990ff59994eb45e4e1cd4c91fc411e1 in buildkite/buildkite-agent-metrics on GitHub). It refers to jobs that are waiting behind a wait step and is used for pre-emptive scaling.

You can see it in the code here: buildkite-agent-metrics/collector.go at c8145a178990ff59994eb45e4e1cd4c91fc411e1 in buildkite/buildkite-agent-metrics on GitHub.

@clambertops If you’re having any difficulty with your stack, we can take a deeper look for you if you send us a message at support@buildkite.com.

Hello,

Apologies for digging up an old thread, but I am trying to get something similar to what the OP asked for.

I want to measure how long a job waits in the queue before an agent is assigned to it, consume that value through an API, and display it in a Grafana dashboard.

Is there a way to do this currently?

Hello, @RRosa and welcome to the Buildkite Community Forum!

To find out the time elapsed between a job being scheduled on a pipeline and the time it started, you can enable and use clusters for your Buildkite organization. That gives you access to queue metrics that display the information you are looking for. Keep in mind that enabling clusters is irreversible!

Another way to get this information is to use Amazon EventBridge with a Lambda function, as in our Lambda example, to push the metrics to CloudWatch. The example event payloads in our documentation cover Job Scheduled and Job Started. You can then bring those metrics into a Grafana dashboard.

To outline all the possible options, I should mention that you could also use Buildkite’s GraphQL API and run the following query:

query {
  # A slug of the form org/pipeline/build-number identifies a build, so use the build field
  build(slug: "your-org-slug/pipeline-slug/build-number") {
    jobs(first:1) {
      edges {
        node {
          ... on JobTypeCommand {
            scheduledAt
            startedAt
          }
        }
      }
    }
  }
}
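
In case it helps, here is a minimal Python sketch of turning that query's response into wait times in seconds. It assumes the response has already been fetched and decoded from JSON, and that scheduledAt and startedAt come back as ISO 8601 strings. The helper names and the sample response are made up for illustration and are not part of any Buildkite library.

```python
from datetime import datetime
from typing import List, Optional


def parse_timestamp(value: str) -> datetime:
    # Older Python versions reject a trailing "Z", so swap it for an explicit UTC offset.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def wait_seconds(job: dict) -> Optional[float]:
    """Seconds between scheduledAt and startedAt, or None if either is missing."""
    scheduled, started = job.get("scheduledAt"), job.get("startedAt")
    if not scheduled or not started:
        return None  # not a command job, or the job has not started yet
    return (parse_timestamp(started) - parse_timestamp(scheduled)).total_seconds()


def job_wait_times(response: dict) -> List[float]:
    """Collect wait times for every job in a decoded GraphQL response."""
    # The query has a single root field, so take whatever object sits under "data".
    root = next(iter(response["data"].values()))
    waits = (wait_seconds(edge["node"]) for edge in root["jobs"]["edges"])
    return [w for w in waits if w is not None]


# Hypothetical response, shaped like the query above would return it.
sample = {
    "data": {
        "build": {
            "jobs": {
                "edges": [
                    {"node": {"scheduledAt": "2024-01-01T10:00:00Z",
                              "startedAt": "2024-01-01T10:00:42Z"}},
                    {"node": {}},  # e.g. a wait step, which has no command fields
                    {"node": {"scheduledAt": "2024-01-01T10:01:00Z",
                              "startedAt": None}},  # still waiting for an agent
                ]
            }
        }
    }
}

print(job_wait_times(sample))
```

Jobs that are not command jobs, or that have not been picked up yet, are skipped rather than counted as zero, so they do not drag the numbers down once the values reach a dashboard.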

Best!

Karen


Hello @karen.sawrey,

Thank you for your answer, that’s exactly what I wanted!
I’ll test out the GraphQL API to get the metric.

All the best,
Rafal

