The build-agent-metrics application monitors queue metrics, like RunningJobCount, ScheduleJobsCount, UnfinishedJobsCount, etc. These metrics could indicate an issue with the system, but they usually suggest that the system is under load but running normally. The team I’m working on uses these metrics to alert us when the system is having an issue, but these alerts are frequently false alarms.
Long wait times always indicate a problem for our users. I believe a useful feature would be to track queue wait times, similar to build-agent-metrics.
Hey @mat! Sorry for such a delay getting back to you on this one. Did you come up with any workarounds in the meantime?
I just wanted to get a bit more info on your idea, if that’s alright. By ‘queue wait times’, do you mean the time a job sits in a queue before the agent picks it up? And where/how would you want to consume this info? Would you need it from an API, or would something in the UI suit? Would you be pulling it into your own dashboards?
Apologies for digging up an old thread, but I am trying to get something similar to what the OP asked for.
I want to measure how long a job waits in the queue before an agent is assigned to it, consume that through an API, and ingest it into a Grafana dashboard.
Hello, @RRosa and welcome to the Buildkite Community Forum!
To find out the time elapsed between scheduling a job on a pipeline and the time it started, you can enable and use clusters for your Buildkite organization - this way you’ll have access to queue metrics that display the information you are looking for. Keep in mind that enabling clusters is irreversible!
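If you would rather compute the wait time yourself from job data, here is a minimal sketch in Python. It assumes each job record is a dict carrying ISO 8601 timestamps under the keys `runnable_at` and `started_at`; those field names, the sample data, and the helper names `parse_ts` and `queue_wait_seconds` are assumptions for illustration, so please check them against the job payload you actually receive from the API.

```python
from datetime import datetime
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def queue_wait_seconds(
    job: dict,
    start_key: str = "runnable_at",  # assumed field: when the job became ready to run
    end_key: str = "started_at",     # assumed field: when an agent started the job
) -> Optional[float]:
    """Return the seconds a job spent waiting, or None if it has not started."""
    start = job.get(start_key)
    end = job.get(end_key)
    if not start or not end:
        return None
    return (parse_ts(end) - parse_ts(start)).total_seconds()


# Hypothetical sample records, shaped like the assumed job payload.
jobs = [
    {"id": "job-1", "runnable_at": "2024-01-01T10:00:00.000Z", "started_at": "2024-01-01T10:00:42.500Z"},
    {"id": "job-2", "runnable_at": "2024-01-01T10:01:00.000Z", "started_at": None},
]

# Keep only jobs that have actually started.
waits = {job["id"]: queue_wait_seconds(job) for job in jobs}
started = {job_id: wait for job_id, wait in waits.items() if wait is not None}
```

From there you could push the values in `started` to whatever data source your Grafana dashboard reads from. Jobs still waiting come back as `None`, so they do not skew the numbers.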