Hi,
I’m in charge of overhauling our CI platform (currently GitHub Actions) to improve the cost, reliability and scalability of our CI system, and Buildkite is one of the options I’m looking at. We currently run the majority of our builds on a GCP/GKE cluster via actions-runner-controller. Scale at peak times is ~10k vCPUs and ~30TB RAM.
A lot of our jobs compile a large Rust monorepo and require pretty beefy machines. Some require local SSDs, some are fine with small machines, but overall there is quite a bit of diversity in the requirements of each job.
For cost efficiency and flexibility, I’d like our devs to be able to precisely specify the vCPU, memory and disk requirements for each job, and then have the CI system create workers/agents on demand as GCP VMs for each job. These VMs should only be created just-in-time for each job. Note that GCP VMs can typically be provisioned within 1-2 minutes, which is an acceptable wait/cold-start time for us.
I’d really, really like to avoid the notion of “queues”/“pools”/“autoscaling groups”, as that would require me to define certain machine specs upfront and would require some kind of autoscaler that scales these pools up and down. That would introduce an (imo) unnecessary level of indirection, and scaling pools down in time, i.e. right after a job has finished, seems like a major pain point. So basically I want our CI system to just create a GCP VM for each job and shut it down immediately once the job has finished (successfully or not).
If you want a more concrete idea of how this could work, please have a look at Computing Services - Cirrus CI, which does more or less exactly that.
That being said, I’m not blind to the popularity and flexibility that Buildkite offers, in particular dynamic pipelines etc.
Reading through the Buildkite agent docs, everything seems to be written mostly with persistent workers or “autoscaling pools” of workers in mind; I have seen very few concrete examples of how one could implement on-demand/just-in-time workers.
I imagine Buildkite is flexible enough to achieve this, so I’d love to hear your thoughts on it. Are there any examples of someone in the community having implemented something like that?
The way I envision this myself: I would probably have to implement some kind of task scheduler/adapter component that periodically (every 5 seconds or so) queries the Buildkite API, gets the list of pending jobs (command steps?), and spawns a GCP VM that is bootstrapped with a Buildkite agent and tags ensuring the agent will only run the specific job it was created for. Users would submit their resource requirements via a set of tags (cpu, memory, disk size, machine family) in the “agents” field of their command step.
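To make the dev-facing side concrete, here is a sketch of what a step could look like. The tag names (cpu, memory, disk, family) are my own convention that the scheduler would have to interpret; to Buildkite itself they would just be opaque agent query rules:

```yaml
steps:
  - label: ":rust: build"
    command: "cargo build --release"
    agents:
      # hypothetical resource tags, interpreted by our scheduler,
      # not by Buildkite itself
      cpu: "32"
      memory: "128G"
      disk: "500G"
      family: "n2d"
```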
The “scheduler” would probably need to keep some ephemeral state (in Redis or similar) to track the VMs it has spawned, shut them down or recreate them if something bad happens on the infra level, and update the job status if it couldn’t successfully spawn a GCP VM with a running agent after a certain number of attempts.
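As a minimal sketch of that loop, assuming the GraphQL API’s `organization.jobs` query and `agentQueryRules` field, the `google-cloud-compute` and `redis` Python packages, a custom VM image with `buildkite-agent` preinstalled, and the agent’s `--acquire-job` flag (the exact query shape, retry counting and the reaper are glossed over; all names like `my-org` are placeholders):

```python
import os, time, requests, redis
from google.cloud import compute_v1

GRAPHQL = "https://graphql.buildkite.com/v1"
TOKEN = os.environ["BUILDKITE_GRAPHQL_TOKEN"]
ORG, PROJECT, ZONE = "my-org", "my-project", "us-central1-a"  # placeholders

# Ask Buildkite for command jobs that are scheduled but not yet picked up.
QUERY = """
query {
  organization(slug: "%s") {
    jobs(first: 100, state: [SCHEDULED], type: [COMMAND]) {
      edges { node { ... on JobTypeCommand { uuid agentQueryRules } } }
    }
  }
}""" % ORG

r = redis.Redis()
instances = compute_v1.InstancesClient()

def pending_jobs():
    resp = requests.post(GRAPHQL, json={"query": QUERY},
                         headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()
    for edge in resp.json()["data"]["organization"]["jobs"]["edges"]:
        node = edge["node"]
        if node:  # skip anything that isn't a command job
            # agentQueryRules come back as e.g. ["cpu=32", "memory=128G"]
            yield node["uuid"], dict(x.split("=", 1) for x in node["agentQueryRules"])

def spawn_vm(job_uuid, rules):
    name = f"bk-{job_uuid}"
    # Start an agent that acquires exactly this one job, then halt the VM.
    # Fetching the agent token from Secret Manager is an assumption.
    startup = f"""#!/bin/bash
buildkite-agent start --acquire-job "{job_uuid}" \\
  --token "$(gcloud secrets versions access latest --secret=bk-agent-token)"
shutdown -h now
"""
    instances.insert(project=PROJECT, zone=ZONE, instance_resource=compute_v1.Instance(
        name=name,
        machine_type=f"zones/{ZONE}/machineTypes/"
                     f"{rules.get('family', 'n2')}-standard-{rules.get('cpu', '4')}",
        disks=[compute_v1.AttachedDisk(boot=True, auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                source_image="projects/my-project/global/images/bk-agent",  # assumed image
                disk_size_gb=int(rules.get("disk", "100G").rstrip("G"))))],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
        metadata=compute_v1.Metadata(items=[
            compute_v1.Items(key="startup-script", value=startup)]),
    ))
    # Remember what we spawned so a reaper can delete stragglers and
    # retry or fail the job after too many attempts (not shown here).
    r.hset(f"vm:{job_uuid}", mapping={"name": name, "attempts": 1})
    r.expire(f"vm:{job_uuid}", 7200)

while True:
    for job_uuid, rules in pending_jobs():
        if not r.exists(f"vm:{job_uuid}"):  # don't double-spawn for the same job
            spawn_vm(job_uuid, rules)
    time.sleep(5)
```

If `--acquire-job` behaves the way I read the docs, the agent runs exactly that one job and disconnects afterwards, so no tag matching is even needed, and the `shutdown -h now` plus a reaper that deletes halted instances would cover the “shut it down immediately” requirement.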
It would basically act as an intermediary between the actual Buildkite agents and the Buildkite control plane.
Bonus points if it supports different scheduling backends, e.g. Cloud Run Jobs for small containers that need a faster bootstrap time than VMs.
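Sketched as an interface, with the GCE logic above as one implementation and a hypothetical Cloud Run Jobs one as another (the routing threshold is made up):

```python
from typing import Mapping, Protocol

class SchedulingBackend(Protocol):
    """Anything that can launch a single-job, single-use Buildkite agent."""
    def launch(self, job_uuid: str, rules: Mapping[str, str]) -> str:
        """Start a worker for this job; return a backend-specific handle."""
    def destroy(self, handle: str) -> None:
        """Tear the worker down (used by the reaper path)."""

class GceBackend:
    def launch(self, job_uuid, rules):
        ...  # the compute_v1 spawn_vm logic sketched above
    def destroy(self, handle):
        ...

class CloudRunJobsBackend:
    def launch(self, job_uuid, rules):
        ...  # e.g. run a container image that execs buildkite-agent --acquire-job
    def destroy(self, handle):
        ...

def pick_backend(rules: Mapping[str, str]) -> SchedulingBackend:
    # made-up routing rule: small jobs go to containers for the faster cold start
    return CloudRunJobsBackend() if int(rules.get("cpu", "4")) <= 8 else GceBackend()
```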
Do you think this is feasible? Has something like this been done before? Any pointers on which APIs to query for pending command steps?
Other questions are:
- How can I monitor the resource utilization of each job or VM and surface it to devs as immediately as possible (without them having to manually click through 10 links to get to a Grafana dashboard or whatever)?
- How can I give devs feedback about the scheduling status/progress of their VMs (e.g. VM created, but the agent on it hasn’t connected yet) and surface scheduling errors to them, such as exceeded GCP quotas?
Looking forward to your thoughts!