This is very cool, but in optimising for per-container security, I worry that you’re leaving things like shared caches (eg, docker layers, yarn cache) on the floor, and seriously complicating the stack to boot.
That’s a very valid thing to be concerned about, and it’s definitely something I think a lot about. In terms of complexity, in a lot of ways I view this stack as simpler: it’s more modular and relies more on AWS primitives and stock ECS AMIs, so it’s easier to fork and modify. Customization happens at the Docker image level, which means folks can easily provide their own agent Docker images instead of maintaining a custom Packer setup. This stack will have less magic baked in and less bash, and the Lambdas are written in Go so they can be tested better.
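To give a rough idea of what that looks like (purely illustrative — the base tag and the extra tooling below are placeholders, not what the stack ships with), customising an agent can be as small as a short Dockerfile:

```dockerfile
# Illustrative sketch of a custom agent image — the base tag and the
# packages below are placeholders, not the stack’s actual defaults.
FROM buildkite/agent:3-ubuntu

# Layer your pipeline’s tooling on top of the stock agent image,
# rather than baking it into an AMI with Packer.
RUN apt-get update \
 && apt-get install -y --no-install-recommends awscli jq git-lfs \
 && rm -rf /var/lib/apt/lists/*
```

The idea being that swapping out tooling becomes a docker build and push rather than a Packer run.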
In terms of shared caches, I think there are a ton of ways to build these on top of strong per-container isolation, whereas trying to build shared caches securely without that isolation is really hard. We will always prioritise fast, parallel builds, so we’ll absolutely revisit this if we find that this stack makes that harder. The flip side is that this stack lets you use bigger instance sizes more cost-effectively and scales much more responsively, so I think there are some big performance wins in there too.
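As one concrete example of what I mean by caches layered on top of isolation (a hypothetical sketch, not how we’ve built anything yet): BuildKit can keep the Docker layer cache in a registry rather than on the build host, so ephemeral, isolated containers still get a warm yarn install layer as long as the lockfile hasn’t changed:

```dockerfile
# syntax=docker/dockerfile:1
# Hypothetical sketch: the layer cache lives in a registry rather than on the
# host, e.g. via buildx:
#   docker buildx build \
#     --cache-from type=registry,ref=registry.example.com/app:cache \
#     --cache-to   type=registry,ref=registry.example.com/app:cache,mode=max .
# (registry.example.com/app is a placeholder.)
FROM node:20-alpine AS deps
WORKDIR /app
# Copy only the manifests so the dependency layer is reused from the
# shared cache whenever yarn.lock is unchanged.
COPY package.json yarn.lock ./
RUN yarn install --frozen-lockfile

FROM deps AS build
# The rest of the source only invalidates the later layers.
COPY . .
RUN yarn build   # assumes a "build" script in package.json
```

Nothing in that approach depends on shared state on the instance itself, which is what makes it compatible with strong per-container isolation.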
There’s some efficiency loss, but I wonder how the cost/complexity tradeoff will play out, especially if a weakness is found in the way you’re isolating containers.
I see the isolation that ECS provides as defence-in-depth rather than the primary defence mechanism. I still think that isolating stacks, along with a variety of other security initiatives we have coming up around Buildkite-level isolation of queues, teams and pipelines, will be key here.
In our case, though, 95% of the build jobs can run with the same set of non-elevated permissions, and are from trusted rather than public sources.
I assure you that this is a primary use case for us too, and one we’ll be careful to keep solving for!