Hi,
I’m in charge of overhauling our CI platform (currently GitHub Actions) to improve the cost, reliability and scalability of our CI system, and Buildkite is one of the options I’m looking at. We currently run the majority of our builds on a GCP/GKE cluster via actions-runner-controller. Scale at peak times is ~10k vCPUs and ~30TB RAM.
A lot of our jobs compile a large Rust monorepo and require pretty beefy machines. Some require local SSDs, some are fine with small machines, but overall there is quite a bit of diversity in the requirements of each job.
For cost efficiency and flexibility, I’d like our devs to be able to precisely specify the vCPU, memory and disk requirements for each job, and then have the CI system create workers/agents on demand as GCP VMs for each job. These VMs should only be created just-in-time for each job. Note that GCP VMs can typically be provisioned within 1-2 minutes, which is an acceptable wait/cold-start time for us.
I’d really, really like to avoid the notion of “queues”/“pools”/“autoscaling groups”, as that would require me to define certain machine specs upfront and would require some kind of autoscaler that scales these pools up and down. That would introduce an (imo) unnecessary level of indirection, and scaling pools down in time, i.e. right after a job has finished, seems like a major pain point. So basically I want our CI system to just create a GCP VM for each job and shut it down immediately once the job has finished (successfully or not).
If you want a more concrete idea of how this could work, please have a look at Computing Services - Cirrus CI, which does more or less exactly that.
That being said, I’m not blind to the popularity and flexibility that Buildkite offers, in particular dynamic pipelines etc.
Reading through the Buildkite agent docs, everything seems to be written mostly with persistent workers or “autoscaling pools” of workers in mind; I have seen very few concrete examples of how one could implement on-demand/just-in-time workers.
I imagine Buildkite is flexible enough to achieve this, so I’d love to hear your thoughts on it. Are there any examples of someone in the community having implemented something like that?
The way I envision this myself: I would probably have to implement some kind of task scheduler/adapter component that periodically (every 5 seconds or so) queries the Buildkite API, gets the list of pending jobs (command steps?), and spawns a GCP VM that is bootstrapped with a Buildkite agent and tags ensuring the agent will only run the specific job it was created for. Users would submit their resource requirements via a set of tags (cpu, memory, disk size, machine family) in the “agents” field of their command step.
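To make the dev-facing side concrete, here is a sketch of what a step could look like. The tag names (cpu, memory, disk, family) are my own convention that the scheduler would have to interpret; to Buildkite itself they would just be opaque agent query rules:

```yaml
steps:
  - label: ":rust: build"
    command: "cargo build --release"
    agents:
      # hypothetical resource tags, interpreted by our scheduler,
      # not by Buildkite itself
      cpu: "32"
      memory: "128G"
      disk: "500G"
      family: "n2d"
```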
The “scheduler” would probably need to keep some ephemeral state (in Redis or similar) to track the VMs it has spawned, shut them down or recreate them if something bad happens on the infra level, and update the job status if it couldn’t successfully spawn a GCP VM with a running agent after a certain number of attempts.
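As a minimal sketch of that loop, assuming the GraphQL API’s `organization.jobs` query and `agentQueryRules` field, the `google-cloud-compute` and `redis` Python packages, a custom VM image with `buildkite-agent` preinstalled, and the agent’s `--acquire-job` flag (the exact query shape, retry counting and the reaper are glossed over; all names like `my-org` are placeholders):

```python
import os, time, requests, redis
from google.cloud import compute_v1

GRAPHQL = "https://graphql.buildkite.com/v1"
TOKEN = os.environ["BUILDKITE_GRAPHQL_TOKEN"]
ORG, PROJECT, ZONE = "my-org", "my-project", "us-central1-a"  # placeholders

# Ask Buildkite for command jobs that are scheduled but not yet picked up.
QUERY = """
query {
  organization(slug: "%s") {
    jobs(first: 100, state: [SCHEDULED], type: [COMMAND]) {
      edges { node { ... on JobTypeCommand { uuid agentQueryRules } } }
    }
  }
}""" % ORG

r = redis.Redis()
instances = compute_v1.InstancesClient()

def pending_jobs():
    resp = requests.post(GRAPHQL, json={"query": QUERY},
                         headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()
    for edge in resp.json()["data"]["organization"]["jobs"]["edges"]:
        node = edge["node"]
        if node:  # skip anything that isn't a command job
            # agentQueryRules come back as e.g. ["cpu=32", "memory=128G"]
            yield node["uuid"], dict(x.split("=", 1) for x in node["agentQueryRules"])

def spawn_vm(job_uuid, rules):
    name = f"bk-{job_uuid}"
    # Start an agent that acquires exactly this one job, then halt the VM.
    # Fetching the agent token from Secret Manager is an assumption.
    startup = f"""#!/bin/bash
buildkite-agent start --acquire-job "{job_uuid}" \\
  --token "$(gcloud secrets versions access latest --secret=bk-agent-token)"
shutdown -h now
"""
    instances.insert(project=PROJECT, zone=ZONE, instance_resource=compute_v1.Instance(
        name=name,
        machine_type=f"zones/{ZONE}/machineTypes/"
                     f"{rules.get('family', 'n2')}-standard-{rules.get('cpu', '4')}",
        disks=[compute_v1.AttachedDisk(boot=True, auto_delete=True,
            initialize_params=compute_v1.AttachedDiskInitializeParams(
                source_image="projects/my-project/global/images/bk-agent",  # assumed image
                disk_size_gb=int(rules.get("disk", "100G").rstrip("G"))))],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
        metadata=compute_v1.Metadata(items=[
            compute_v1.Items(key="startup-script", value=startup)]),
    ))
    # Remember what we spawned so a reaper can delete stragglers and
    # retry or fail the job after too many attempts (not shown here).
    r.hset(f"vm:{job_uuid}", mapping={"name": name, "attempts": 1})
    r.expire(f"vm:{job_uuid}", 7200)

while True:
    for job_uuid, rules in pending_jobs():
        if not r.exists(f"vm:{job_uuid}"):  # don't double-spawn for the same job
            spawn_vm(job_uuid, rules)
    time.sleep(5)
```

If `--acquire-job` behaves the way I read the docs, the agent runs exactly that one job and disconnects afterwards, so no tag matching is even needed, and the `shutdown -h now` plus a reaper that deletes halted instances would cover the “shut it down immediately” requirement.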
It would basically act as an intermediary between the actual Buildkite agents and the Buildkite control plane.
Bonus points if it supports different scheduling backends, e.g. Cloud Run Jobs for small containers that need a faster bootstrap time than VMs.
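Sketched as an interface, with the GCE logic above as one implementation and a hypothetical Cloud Run Jobs one as another (the routing threshold is made up):

```python
from typing import Mapping, Protocol

class SchedulingBackend(Protocol):
    """Anything that can launch a single-job, single-use Buildkite agent."""
    def launch(self, job_uuid: str, rules: Mapping[str, str]) -> str:
        """Start a worker for this job; return a backend-specific handle."""
    def destroy(self, handle: str) -> None:
        """Tear the worker down (used by the reaper path)."""

class GceBackend:
    def launch(self, job_uuid, rules):
        ...  # the compute_v1 spawn_vm logic sketched above
    def destroy(self, handle):
        ...

class CloudRunJobsBackend:
    def launch(self, job_uuid, rules):
        ...  # e.g. run a container image that execs buildkite-agent --acquire-job
    def destroy(self, handle):
        ...

def pick_backend(rules: Mapping[str, str]) -> SchedulingBackend:
    # made-up routing rule: small jobs go to containers for the faster cold start
    return CloudRunJobsBackend() if int(rules.get("cpu", "4")) <= 8 else GceBackend()
```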
Do you think this is feasible? Has something like this been done before? Any pointers on which APIs to query for pending command steps?
Other questions are:
- How can I monitor the resource utilization of each job or VM and surface it to devs as immediately as possible (without them having to manually click through 10 links to get to a Grafana dashboard or whatever)?
- How can I give devs feedback about the scheduling status/progress of their VMs (e.g. VM created, but the agent on it hasn’t connected yet) and surface scheduling errors to them, such as exceeded GCP quotas?
Looking forward to your thoughts!