Cross Account Deployments in AWS using Buildkite agents

Hello,

We are in the process of building a CI/CD pipeline using the https://buildkite.com platform and AWS.

We would like to adhere to the principle of least privilege.

We currently have separate AWS accounts: Management, DEV, TEST, etc.

We host our Buildkite agents on EC2 instances in the Management account, so in order to deploy to the other accounts we need to assume a role.

The problem is that an EC2 instance is bound to the role it was initially provisioned with, so if we need to add a new role we also need to re-provision the agent EC2 instances.

We can create a new service role and assign it the appropriate permissions when needed, but over time this service role would accumulate too many policies and be able to do too many things.

We could create multiple policies and roles, which again becomes a management nightmare over time, or we could bind a build queue to a certain policy.

Another option is to use SAML and federated users, but then again someone has to manage the users and take care of governance.

My question is: apart from the approaches mentioned above, are there any other viable options, and what would be the best middle ground? What would let us lock down and secure the pipeline policies so that they can't be abused, with minimal management and governance?

:wave: Sorry for the slow reply! It’s a complex topic to get right, as it’s quite dependent on your AWS setup and how you’ve designed your account relationships.

Correct me if I’ve misinterpreted your question, but it seems like you are trying to restrict the AWS resources that different builds have access to? I’ll talk through some of the options we’ve seen.

Basic Policy-Per-Queue Design

The simplest way to control access to AWS resources is by mapping IAM Roles to specific queues. For the sake of an example, imagine we have some queues like:

  • A “default” queue that has very few AWS permissions, other than perhaps some basic ECR access, and is used for low-security general builds
  • A “deployment” queue that has much more access, including to secrets in SSM

Generally each queue would be backed by an Elastic Stack, which would have a Role with the permissions that would be inherited by builds that run on it.

The trick here is to have an Agent Hook on each agent (we recommend the environment hook) that checks that the builds running on these agents are allowed to run there. This matters most on the deployment queue, as otherwise it would be trivial for anyone to run a build that targeted the “deployment” queue with arbitrary code.

Our docs on Securing your Agent go into more detail, but basically you want a script that checks the following things about each build before running it:

  • BUILDKITE_PIPELINE_SLUG: The name of the pipeline that triggered the build. For example, if your pipeline was https://buildkite.com/acme/my-pipeline the pipeline slug is my-pipeline.

  • BUILDKITE_REPO: The URL of the source code repository, e.g. git@github.com:acme/my-pipeline.git.

  • BUILDKITE_BRANCH: The branch being built. Check it against a list of permitted branches for production deployment, e.g. master, preprod.

  • BUILDKITE_BUILD_CREATOR_TEAMS: The teams that the build creator is in.
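As a rough sketch of what such a check could look like with the lists kept in the script itself, something like the following might work. The allow-list values and the helper function names are made up for illustration, and it assumes BUILDKITE_BUILD_CREATOR_TEAMS is a colon-separated list of team slugs, so adjust to your setup:

```shell
#!/usr/bin/env bash
# Sketch of checks for an agent `environment` hook on a "deployment" queue.
# All allow-list values below are hypothetical placeholders.

ALLOWED_PIPELINES=("my-pipeline")
ALLOWED_REPOS=("git@github.com:acme/my-pipeline.git")
ALLOWED_BRANCHES=("master" "preprod")
ALLOWED_TEAMS=("deployers")

# Returns 0 if the first argument equals any of the remaining arguments.
contains() {
  local needle="$1"; shift
  local item
  for item in "$@"; do
    [[ "$item" == "$needle" ]] && return 0
  done
  return 1
}

# Returns 0 if any team in a colon-separated list is on the team allow-list.
# Assumption: BUILDKITE_BUILD_CREATOR_TEAMS is colon-separated.
any_team_allowed() {
  local teams=() team
  IFS=':' read -r -a teams <<< "$1"
  for team in "${teams[@]}"; do
    contains "$team" "${ALLOWED_TEAMS[@]}" && return 0
  done
  return 1
}

# Returns 0 only if every property of the build is on its allow-list.
build_is_allowed() {
  contains "${BUILDKITE_PIPELINE_SLUG:-}" "${ALLOWED_PIPELINES[@]}" &&
    contains "${BUILDKITE_REPO:-}" "${ALLOWED_REPOS[@]}" &&
    contains "${BUILDKITE_BRANCH:-}" "${ALLOWED_BRANCHES[@]}" &&
    any_team_allowed "${BUILDKITE_BUILD_CREATOR_TEAMS:-}"
}

# In the real hook, finish with something like:
#   build_is_allowed || { echo "Build not permitted on this queue" >&2; exit 1; }
```

Unset variables are treated as empty strings here, so a build that is missing any of these values is rejected rather than allowed.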

Whilst you can maintain the above in lists in the shell script, often it’s more scalable to use something like AWS SSM Parameter Store to track the params:

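# Look up the repository this pipeline is allowed to build from.
# stderr is not captured, so an AWS CLI error is never mistaken for a value.
if ! REPOSITORY=$(aws ssm get-parameter \
  --name "/buildkite/pipelines/${BUILDKITE_PIPELINE_SLUG}/repository" \
  --region us-east-1 \
  --output text \
  --query Parameter.Value) ; then
  echo "🙅🏼‍♂️ Could not look up the allowed repository"
  exit 1
fi

# Refuse to run builds from any other repository.
if [[ "$REPOSITORY" != "$BUILDKITE_REPO" ]] ; then
  echo "🙅🏼‍♂️"
  exit 1
fi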

So that gets you a long way but, as you say, over time you end up with more and more access added to the per-queue roles, which is hard to audit and keep track of.

Assume Roles Per-Pipeline

I’ve not used this approach personally, but have worked with several teams that use it at scale. I’ll ask folks to chip in, in case I’ve glossed over details.

So rather than adding additional permissions to the EC2 Role, you give it very sparse permissions beyond the ability to assume roles.

You’d add an additional action into your environment hook to look up what roles a specific pipeline was supposed to assume (from SSM or similar) and then assume the role.
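A minimal sketch of that hook step, assuming the role ARN is stored in SSM under a per-pipeline path. The `/buildkite/pipelines/<slug>/role-arn` parameter name and the two function names are hypothetical; the `aws ssm get-parameter` and `aws sts assume-role` calls are standard AWS CLI commands:

```shell
#!/usr/bin/env bash
# Sketch: look up and assume a per-pipeline role from an `environment` hook.

# Prints the role ARN for a pipeline slug from SSM Parameter Store.
# The parameter path is a hypothetical naming convention.
lookup_role_arn() {
  aws ssm get-parameter \
    --name "/buildkite/pipelines/$1/role-arn" \
    --output text \
    --query Parameter.Value
}

# Assumes the given role and exports temporary credentials for the build.
assume_role() {
  local role_arn="$1" creds access_key secret_key session_token
  creds=$(aws sts assume-role \
    --role-arn "$role_arn" \
    --role-session-name "buildkite-${BUILDKITE_BUILD_NUMBER:-0}" \
    --duration-seconds 900 \
    --output text \
    --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]') || return 1
  # Text output puts the three values on one whitespace-separated line.
  read -r access_key secret_key session_token <<< "$creds"
  export AWS_ACCESS_KEY_ID="$access_key"
  export AWS_SECRET_ACCESS_KEY="$secret_key"
  export AWS_SESSION_TOKEN="$session_token"
}

# In the real hook, something like:
#   role_arn=$(lookup_role_arn "$BUILDKITE_PIPELINE_SLUG") || exit 1
#   assume_role "$role_arn" || exit 1
```

Because the environment hook is sourced by the agent, the exported credentials are visible to the rest of the job. The 900-second duration is the minimum STS allows and is only a placeholder; longer builds would need a longer session.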

Another option here is to use a plugin in your pipeline YAML to specify which roles are assumed: https://github.com/cultureamp/aws-assume-role-buildkite-plugin.

Restrict IAM Access by Firewalling off Metadata API

So the tricky bit about the above approach is that you don’t have any way to prevent code in a trusted pipeline / repository from assuming roles it isn’t supposed to, as every build still has access to the underlying IAM Instance Role.

In some cases this might be ok, for instance if the primary problem you are trying to solve is just partitioning your access into smaller, more easily auditable policies / roles.

You can, however, take a leaf out of ECS’s book and use a technique where you run the rest of a build in a Docker container and firewall off access to the EC2 host’s Instance Metadata API.

Lyft has an awesome tool for doing this: https://github.com/lyft/metadataproxy

A critical part of this technique is running the entire Buildkite build in a container, which is done with a bootstrap script. You can read more about that at https://github.com/buildkite/docker-bootstrap-example.
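For the firewalling part on its own, without a proxy, a minimal sketch is an iptables rule on the agent host that drops traffic from Docker bridge interfaces to the metadata address. This is my recollection of the rule AWS suggests for ECS container instances rather than anything specific to Buildkite or metadataproxy; the function name is made up, and it assumes builds use Docker's default bridge networking and receive their credentials as environment variables:

```shell
#!/usr/bin/env bash
# Sketch: block build containers from reaching the EC2 Instance Metadata API.

# Link-local address of the EC2 Instance Metadata API.
METADATA_IP="169.254.169.254"

# Drops traffic from Docker bridge interfaces to the metadata API, so code
# inside build containers cannot fetch the instance role's credentials.
# Must run as root on the agent host, e.g. from a bootstrap/user-data script.
block_metadata_api() {
  iptables --insert FORWARD 1 \
    --in-interface docker+ \
    --destination "${METADATA_IP}/32" \
    --jump DROP
}
```

Note that this only covers containers attached to a Docker bridge; anything running with host networking, or directly on the host, can still reach the metadata API.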

Phew! Hope that helped!