Elastic CI Stack for AWS ECS/Spot Fleet


#1

We’re working on a version of the Elastic CI Stack that runs on ECS and Spot Fleet.

We’re hoping to end up with something that allows compute resources to be pooled across lots of teams, while still enforcing strong IAM role separation between pipelines and queues.
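To give a flavour of what we mean by role separation: each pipeline’s jobs run as ECS tasks registered with their own task role, so credentials are scoped per pipeline rather than per instance. Here’s a rough sketch in Go of that idea (not the stack’s actual code; the family name, role ARN and sizes are purely illustrative):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := ecs.New(sess)

	// Each pipeline gets its own task definition, and the task role on it
	// scopes what that pipeline's jobs can touch. The ARN is illustrative.
	out, err := svc.RegisterTaskDefinition(&ecs.RegisterTaskDefinitionInput{
		Family:      aws.String("buildkite-agent-frontend-pipeline"),
		TaskRoleArn: aws.String("arn:aws:iam::123456789012:role/frontend-pipeline-build"),
		ContainerDefinitions: []*ecs.ContainerDefinition{
			{
				Name:      aws.String("agent"),
				Image:     aws.String("buildkite/agent:3"),
				Memory:    aws.Int64(1024),
				Essential: aws.Bool(true),
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("registered:", aws.StringValue(out.TaskDefinition.TaskDefinitionArn))
}
```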

The project and its README are in a minimal but working state (we’re going to use it to power our open-source builds).

We’d love feedback!


#2

This is very cool, but in optimising for per-container security, I worry that you’re leaving things like shared caches (e.g. Docker layers, the Yarn cache) on the floor, and seriously complicating the stack to boot.

We’ve got a similar problem internally, and we’re looking at instead spinning up dedicated smaller stacks for sensitive and/or public pipelines. There’s some efficiency loss, but I wonder how the cost/complexity tradeoff will play out, especially if a weakness is found in the way you’re isolating containers.

In our case, though, 95% of the build jobs can run with the same set of non-elevated permissions, and are from trusted rather than public sources.


#3

> This is very cool, but in optimising for per-container security, I worry that you’re leaving things like shared caches (e.g. Docker layers, the Yarn cache) on the floor, and seriously complicating the stack to boot.

That’s a very valid concern, and definitely something I think a lot about. In terms of complexity, in a lot of ways I view this stack as simpler: it’s more modular and relies more on AWS primitives and stock ECS AMIs, so it’s easier to fork and modify. Customisation happens at the Docker image level, which means folks can easily provide their own agent Docker images rather than maintaining a custom Packer setup. This stack will have less magic baked in and less Bash, and the Lambdas are written in Go so they can be tested more thoroughly.

In terms of shared caches, I think there are a ton of ways to build those on top of strong per-container isolation, whereas trying to build shared caches securely without that isolation is really hard. We will always prioritise fast, parallel builds, so we’ll absolutely revisit this if we think the stack makes that harder. The flip side is that this stack lets you use bigger instance sizes more cost-effectively and scales much more responsively, so I think there are some big performance wins in there too.

> There’s some efficiency loss, but I wonder how the cost/complexity tradeoff will play out, especially if a weakness is found in the way you’re isolating containers.

I see the isolation that ECS provides as defence-in-depth rather than the primary defence mechanism. I still think isolated stacks, along with a variety of upcoming security initiatives around Buildkite-level isolation of queues/teams/pipelines, will be key here.

> In our case, though, 95% of the build jobs can run with the same set of non-elevated permissions, and are from trusted rather than public sources.

I can assure you that this is a primary use case for us too, and one we’ll be careful to keep solving for!


#4

Thanks for that response - it makes the motivation and tradeoffs much clearer, and I’m very keen to see where this ends up!


#5

No problems, really appreciate the feedback!