I’m maintaining an open-source Buildkite-based CI cluster (for the Julia project) that supports running CI on untrusted, third-party PRs. To that end, we run buildkite-agent in a sandboxed environment, running a single job with --disconnect-after-job so that we can reset the container state after each job. This is a bit coarse, but works fine.
I now want to change the way the container is set up based on the type of job (i.e., trusted or not). This is tricky, because buildkite-agent is already running in the container when it acquires a job. My current workaround is to have the host select a job through the REST API, set up the container appropriately, and use --acquire-job to have the agent execute that job. That’s not great: it can be racy, requires a separate API key, and requires us to reimplement quite a few things from the agent in my own scheduler (polling, rate limiting, job selection logic, etc.).
I wonder if things could be improved, e.g., by only needing the agents API, and ideally re-using the existing agent. For example:
- first, re-use buildkite-agent to acquire a job, but not run it yet and simply return the job metadata (as returned by the `Acquire` API call)
- then, use buildkite-agent with a hypothetical --accept (instead of --acquire-job, which AFAIU does Acquire + Accept) to take the previously acquired job and run it in the container
This is a high-level idea; I’m not familiar enough with the agent or API to know if this is feasible. But it would greatly simplify our setup, and more generally it seems like a useful feature for running the agent in an appropriately configured container without having to split the agent pool.
Hey @maleadt 
It sounds like what you’re building here is a stack, which we now provide a proper API for.
The Stacks API would be exactly what you’re looking for here. It’ll allow you to drop the REST polling and the second token. In the Stacks API, you poll a queue for scheduled jobs, reserve the ones you’ll run, then provision and let the agent acquire them. Reserve is what kills your race: a reserved job drops out of the feed for every other stack until you run it or the reservation lapses. It’s all on a cluster token, so the separate REST key goes away too.
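To make the reserve semantics concrete, here is a rough in-memory model of the behaviour described above. It does not call the Stacks API: `FakeQueue`, `poll`, `reserve`, the `ttl` parameter, and the `Job` fields are all hypothetical names chosen for illustration, and the real endpoints and Go client types may look quite different.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Job:
    id: str
    reserved_by: Optional[str] = None  # hypothetical: which stack holds the job
    reserved_until: float = 0.0        # hypothetical: when the reservation lapses


@dataclass
class FakeQueue:
    """Hypothetical in-memory stand-in for a queue's scheduled-job feed."""

    jobs: dict[str, Job] = field(default_factory=dict)

    def _held_by_other(self, job: Job, stack: str, now: float) -> bool:
        # A reservation only counts while it has not lapsed.
        return job.reserved_by not in (None, stack) and job.reserved_until > now

    def poll(self, stack: str, now: float) -> list[str]:
        # Jobs reserved by another stack drop out of this stack's feed.
        return [j.id for j in self.jobs.values()
                if not self._held_by_other(j, stack, now)]

    def reserve(self, stack: str, job_id: str, now: float, ttl: float) -> bool:
        job = self.jobs[job_id]
        if self._held_by_other(job, stack, now):
            return False  # someone else got there first, so no double-run
        job.reserved_by, job.reserved_until = stack, now + ttl
        return True


queue = FakeQueue({"job-1": Job("job-1")})
first = queue.reserve("stack-a", "job-1", now=0.0, ttl=30.0)   # succeeds
second = queue.reserve("stack-b", "job-1", now=1.0, ttl=30.0)  # rejected
```

In this model the second stack can neither see nor reserve the job until the reservation lapses, which is the property that removes the race from a REST-polling scheduler.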
On the container side, nothing will need to change. You’ll still run buildkite-agent start --acquire-job, so your reset after each job will remain as is.
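As a small sketch of that hand-off, the host could build the agent command from the ID of the job it reserved and pass it to whatever launches the container. The helper name `acquire_command` and the example job ID are made up for illustration; only `buildkite-agent start --acquire-job` comes from the description above, and how the token reaches the container is left out.

```python
import shlex


def acquire_command(job_id: str, agent_binary: str = "buildkite-agent") -> list[str]:
    """Build the argv for running exactly one, already-chosen job.

    Hypothetical helper: the container launcher that consumes this
    argv is site-specific and not shown here.
    """
    if not job_id:
        raise ValueError("job_id must be non-empty")
    return [agent_binary, "start", "--acquire-job", job_id]


argv = acquire_command("0190-example")
print(shlex.join(argv))  # buildkite-agent start --acquire-job 0190-example
```

Returning a list rather than a shell string keeps the job ID from being re-parsed by a shell when the command is handed to the sandbox.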
We do also have a Go Client here, which allows for a StackTypeCustom. You could also use the agent-stack-k8s as a reference (or a base) as the Stacks API is used here.
Let us know if you have any questions! 
Looks perfect! I’ll take a look and let you know how it works.
Are there any ambitions to make the stacksapi example into more of a capable CLI client (e.g. supporting job selection based on queue/tags), or is that out of scope?
Honestly, the Stacks API library is designed specifically as a client library, as opposed to a CLI, so I wouldn’t build upon this expecting it to follow suit.
That said, what you’re looking to achieve here is possible: Queue is already a poll parameter, so you register and list against a queue. Tags come back on each job as agent_query_rules, so filtering by tag is a few lines on top, and the agent-stack-k8s already does this with a small agenttags helper that could be reused here. The link to agent-stack-k8s will take you straight to the code I’m referencing here.
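For a sense of what "a few lines on top" might look like, here is a minimal sketch of tag filtering. It assumes each job is a plain dict whose `agent_query_rules` is a list of `key=value` strings; that shape, the `id` field, and the function names are assumptions for illustration, and it only does exact matching rather than reproducing whatever the agenttags helper does.

```python
def parse_rules(rules: list[str]) -> dict[str, str]:
    """Turn ["queue=default", "os=linux"] into {"queue": "default", "os": "linux"}."""
    parsed = {}
    for rule in rules:
        key, sep, value = rule.partition("=")
        if sep:  # skip entries that are not key=value pairs
            parsed[key.strip()] = value.strip()
    return parsed


def job_matches(job: dict, stack_tags: dict[str, str]) -> bool:
    """A job matches when every rule it asks for is offered by the stack."""
    rules = parse_rules(job.get("agent_query_rules", []))
    return all(stack_tags.get(key) == value for key, value in rules.items())


# Hypothetical job payloads; real ones carry many more fields.
jobs = [
    {"id": "a", "agent_query_rules": ["queue=default", "os=linux"]},
    {"id": "b", "agent_query_rules": ["queue=default", "sandbox=trusted"]},
]
stack_tags = {"queue": "default", "os": "linux"}
selected = [job["id"] for job in jobs if job_matches(job, stack_tags)]
```

With these inputs only job `a` is selected, because job `b` asks for a `sandbox=trusted` tag that this stack does not offer.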
However, it’s certainly something that could be considered and baked into our Buildkite CLI. I’d suggest getting an issue raised here as a feature request, and the team can take a look into this for you. But, it’s a nice idea, for sure! 