Cross Account Deployments in AWS using Buildkite agents



We are in the process of building a CI/CD pipeline using the platform and AWS.

We would like to adhere to the principle of least privilege.

We currently have separate AWS accounts: Management, DEV, TEST, etc.

We have hosted our Buildkite agents in the Management account on EC2 instances, and in order to deploy to the other accounts we need to assume a role.

The problem here is that when an EC2 instance is provisioned it is bound to the role it was initially provisioned with, so if we need to add a new role we also need to re-provision the agent EC2 instances.

We can create a new service role and assign it appropriate permissions when needed, but over time this service role would accumulate too many policies and be able to do too many things.

We could create multiple policies and roles, which again becomes a management nightmare over time, or we could bind a build queue to a certain policy.

Another option is to use SAML and federated users, but then someone has to manage the users and take care of governance.

My question: apart from the approaches mentioned above, are there any other viable options? And what would be the best middle ground, one that enables us to lock down and secure the pipeline policies so they can't be abused, with minimal management and governance overhead?


:wave: Sorry for the slow reply! It’s a complex topic to get right, as it’s quite dependent on your AWS setup and how you’ve designed your account relationships.

Correct me if I’ve misinterpreted your question, but it seems like you are trying to restrict the AWS resources that different builds have access to? I’ll talk through some of the options we’ve seen.

Basic Policy-Per-Queue Design

The simplest way to control access to AWS resources is by mapping IAM Roles to specific queues. For the sake of an example, imagine we have some queues like:

  • A “default” queue that has very few AWS permissions other than perhaps some basic ECR access, used for low-security general builds
  • A “deployment” queue that has much more access, including to secrets in SSM

Generally each queue would be backed by an Elastic Stack, which would have a role whose permissions are inherited by the builds that run on it.

The trick here is to have an agent hook on each agent (we recommend the environment hook) that checks that the builds running on these agents are allowed to run there. This matters especially on the “deployment” queue, as otherwise it would be trivial for anyone to run a build that targeted it with arbitrary code.

Our docs on Securing your Agent go into more detail, but basically you want a script that checks the following things about each build before running it:

  • BUILDKITE_PIPELINE_SLUG: The slug of the pipeline that triggered the build. For example, a pipeline named “My Pipeline” has the slug my-pipeline.

  • BUILDKITE_REPO: The URL of the source code repository.

  • BUILDKITE_BRANCH: The branch being built, checked against a list of branches permitted for production deployment, e.g. master, preprod

  • BUILDKITE_BUILD_CREATOR_TEAMS: The teams that the build creator is in.
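To make that concrete, here’s a minimal sketch of what such an environment hook could look like. The queue name, pipeline slugs, and branch names are hypothetical examples — substitute your own:

```shell
#!/bin/bash
# Environment hook sketch: reject builds that aren't allowed on this queue.
# "deployment", "my-app", "my-app-deploy", "master" and "preprod" below are
# example values, not Buildkite defaults.
set -euo pipefail

allowed() {
  # Succeeds if $1 appears in the remaining arguments.
  local value="$1"; shift
  local candidate
  for candidate in "$@"; do
    if [[ "$candidate" == "$value" ]]; then
      return 0
    fi
  done
  return 1
}

if [[ "${BUILDKITE_AGENT_META_DATA_QUEUE:-}" == "deployment" ]]; then
  allowed "${BUILDKITE_PIPELINE_SLUG:-}" my-app my-app-deploy \
    || { echo "Pipeline not permitted on the deployment queue"; exit 1; }
  allowed "${BUILDKITE_BRANCH:-}" master preprod \
    || { echo "Branch not permitted to deploy"; exit 1; }
fi
```

Because the environment hook runs before every job, a build that fails these checks never gets as far as executing its command step.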

Whilst you can maintain the above as lists in the shell script, it’s often more scalable to use something like AWS SSM Parameter Store to track the params:

REPOSITORY=$(aws ssm get-parameter \
  --name "/buildkite/pipelines/${BUILDKITE_PIPELINE_SLUG}/repository" \
  --region us-east-1 \
  --output text \
  --query Parameter.Value 2>&1)

if [[ "$REPOSITORY" != "$BUILDKITE_REPO" ]] ; then
  echo "🙅🏼‍♂️"
  exit 1
fi

So that gets you a long way, but as you say, over time you end up with more and more access added to the per-queue roles, and it becomes hard to audit and keep track of.

Assume Roles Per-Pipeline

I’ve not used this approach personally, but I have worked with several teams that use it at scale. I’ll ask folks to chip in in case I’ve glossed over details.

So rather than adding additional permissions to the EC2 role, you give the instance very sparse permissions, except for the ability to assume roles.

You’d add an additional action to your environment hook to look up which roles a specific pipeline is supposed to assume (from SSM or similar) and then assume the role.
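A sketch of what that hook step could look like, using the AWS CLI. The SSM parameter path (/buildkite/pipelines/&lt;slug&gt;/role-arn) and the session naming are assumptions for illustration, not a Buildkite convention:

```shell
#!/bin/bash
# Environment hook sketch: assume a per-pipeline role looked up from SSM.
set -euo pipefail

assume_pipeline_role() {
  local slug="$1"
  local role_arn creds

  # Look up which role this pipeline is allowed to assume.
  role_arn=$(aws ssm get-parameter \
    --name "/buildkite/pipelines/${slug}/role-arn" \
    --output text \
    --query Parameter.Value)

  # Trade the sparse instance role for short-lived, pipeline-scoped credentials.
  creds=$(aws sts assume-role \
    --role-arn "$role_arn" \
    --role-session-name "buildkite-${BUILDKITE_BUILD_ID:-manual}" \
    --output text \
    --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]')

  # Export the temporary credentials so the rest of the build uses them.
  read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<< "$creds"
  export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
}

if [[ -n "${BUILDKITE_PIPELINE_SLUG:-}" ]]; then
  assume_pipeline_role "$BUILDKITE_PIPELINE_SLUG"
fi
```

Because the credentials come from STS, they expire on their own, which also helps with the auditability problem: CloudTrail records each AssumeRole call with the session name.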

Another option here is to use a plugin in your pipeline YAML to specify which roles are assumed.

Restrict IAM Access by Firewalling off Metadata API

So the tricky bit about the above approach is that you technically don’t have any way to prevent code in a trusted pipeline / repository from making role assumptions it isn’t supposed to, as all builds have access to the underlying IAM instance role.

In some cases this might be OK, for instance if the primary problem you are trying to solve is just partitioning your access into smaller, more easily auditable policies and roles.

You can, however, take a leaf out of ECS’s book and use a technique where you run the rest of a build in a Docker container and firewall off access to the EC2 host’s Instance Metadata API.
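As a rough sketch, the firewalling itself comes down to iptables rules on the agent host. The chain, interface, proxy address, and port below are assumptions based on a typical Docker setup, not something Buildkite configures for you:

```shell
# Block containers (via Docker's DOCKER-USER chain) from reaching the EC2
# Instance Metadata API at 169.254.169.254 directly.
iptables -I DOCKER-USER -d 169.254.169.254 -j DROP

# Alternatively, proxy-style: rewrite container metadata traffic to a local
# proxy that hands out per-container credentials (the docker0 interface and
# the 172.17.0.1:8000 proxy address are example values).
iptables -t nat -I PREROUTING -d 169.254.169.254 -p tcp --dport 80 \
  -i docker0 -j DNAT --to-destination 172.17.0.1:8000
```

With the DROP rule in place, code inside a build container can no longer fetch the instance role’s credentials from the metadata endpoint, so it only has whatever credentials were explicitly handed to it.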

Lyft has an awesome tool, metadataproxy, for doing this.

A critical part of this technique is running the entire Buildkite build in a container, which is done with a bootstrap script; you can read more about that in our agent docs.

Phew! Hope that helped!
