I’m working on migrating from the CF stack to agent-stack-k8s on a local cluster. I’m still interested in using ECR because I’m not sure I’ll ever fully retire the CF stack. My builds rely on the docker-compose plugin: the first step builds a container specific to that branch/sha, and each subsequent step pulls that container and runs within it.
Unfortunately, because I’m no longer running in AWS next to ECR, this significantly increases my network delay. I think most of that delay would disappear (and in fact improve upon the CF stack) if I could reuse pulled images across agents.
Is there any way to make that work without updating my in-repo pipeline definitions (i.e., can I reconfigure either my cluster or agent-stack-k8s to have a shared, persistent image and layer cache)?
I hear your concern about sticking with ECR to serve both the CF stack and the K8s cluster. Could you clarify whether you are currently running your k8s cluster in EKS? What image registry is currently being used with K8s?
If you’re sticking with a non-ECR registry, I’m wondering whether you could use ECR pull-through caches to reduce the latency, and also look at establishing a node-level container cache. Do you have an imagePullPolicy currently defined as part of your pod spec?
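For context, a node-level cache only helps if the kubelet is allowed to reuse images it already has. A minimal sketch of what I mean (the image name is a placeholder):

```yaml
# Sketch only: let the kubelet reuse an image already present on the node
# instead of re-pulling it on every run. The image reference is a placeholder.
containers:
  - name: build
    image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/builder:latest
    imagePullPolicy: IfNotPresent   # with a :latest tag, the default policy is Always (always re-pull)
```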
We can dig deeper once we understand a bit more about your current setup.
I’m not running the cluster in ~k8s~ EKS; I’m running it on-prem. Both my k8s stack and my CloudFormation stack push to and pull from a private ECR registry. If I’m reading your link correctly, ECR pull-through caches are for using ECR as a cache, which is valuable when your builders are on AWS infrastructure, but not particularly valuable when your builders and runners are elsewhere.
I’ve been trying to set up a pull-through cache in the cluster that caches images coming from ECR, but so far, the credentials needed for a private registry have gotten in my way. I also tried to set up node-level caching, but since I’ll be running multiple agents on the same node and potentially multiple agents will be doing builds, all indications are that this will cause issues with the Docker layer cache.
Hi @ianwremmel Wanted to check if you had a chance to review this document, as it could give some insights into the approach you could try here; please let me know if this helps. Just another note: the image cache is handled by Kubernetes with imagePullPolicy in the Pod spec (https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy). For build layer caching, I believe you can create a Persistent Volume Claim and mount it at /var/lib/docker in container-0 using config.pod-spec-patch in the controller’s configuration, which would let you avoid updating the in-repo pipeline YAML files.
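To illustrate, a rough sketch of what that might look like in the controller’s values.yaml (assuming the Helm chart; the PVC name is a placeholder and would need to be created separately):

```yaml
# Rough sketch of the agent-stack-k8s controller configuration (Helm values.yaml).
# "docker-cache" is a hypothetical, pre-created PVC; note the caveat discussed
# below about multiple daemons sharing one /var/lib/docker data root.
config:
  pod-spec-patch:
    containers:
      - name: container-0              # the command container created by the controller
        volumeMounts:
          - name: docker-cache
            mountPath: /var/lib/docker
    volumes:
      - name: docker-cache
        persistentVolumeClaim:
          claimName: docker-cache
```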
Yeah, I’m currently using the DinD approach. Everything I was reading suggested that mounting /var/lib/docker is discouraged by the Docker folks, because the cache wasn’t intended for multiple writers and corruption is likely.
Does the imagePullPolicy apply to the image used by the agents, or does it also apply to Docker containers started by those agents? (I’m admittedly rather fuzzy on where the boundaries between Docker and Kubernetes are across the various configurations I’m juggling.)
Hi @ianwremmel Looking at your scenario and after reviewing it a bit further, I see your point: multiple daemons pointing to the same data root could cause corruption. Regarding imagePullPolicy, I believe it only applies to the agent pod images themselves (the Buildkite agent container, DinD sidecar, etc.) and not to images pulled by docker-compose inside the DinD container. Since you’re using DinD, there’s a separate Docker daemon running inside your pod, and that daemon pulls images completely independently of Kubernetes, so imagePullPolicy won’t help with caching the images your docker-compose plugin pulls from ECR. If you stick with your current DinD approach, you could add an in-cluster registry that acts as a caching proxy for ECR and then point your DinD daemons at it so they pull through the cache. The main thing you’d need to handle is keeping the ECR credentials refreshed (you could use a CronJob) so the proxy can continue authenticating to your registry.
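To sketch that out, assuming the CNCF Distribution registry (registry:2) as the in-cluster proxy, with placeholder account ID, region, and names:

```yaml
# config.yml for a registry:2 instance running in-cluster as a pull-through
# cache in front of ECR. The remoteurl and credentials are placeholders; the
# password is an ECR token, which expires after 12 hours.
version: 0.1
storage:
  filesystem:
    rootdirectory: /var/lib/registry
http:
  addr: :5000
proxy:
  remoteurl: https://123456789012.dkr.ecr.us-east-1.amazonaws.com
  username: AWS
  password: "<output of `aws ecr get-login-password`>"
```

A CronJob along these lines could keep the token fresh (the image and secret names are hypothetical, and the registry would still need to pick up the refreshed secret, e.g. via a restart):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ecr-token-refresh
spec:
  schedule: "0 */8 * * *"                       # ECR tokens last 12h; refresh every 8h
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ecr-token-refresh  # needs RBAC permission to update the secret
          restartPolicy: OnFailure
          containers:
            - name: refresh
              image: my-registry/aws-cli-with-kubectl:latest   # hypothetical image with aws + kubectl
              command:
                - /bin/sh
                - -c
                - |
                  TOKEN=$(aws ecr get-login-password --region us-east-1)
                  kubectl create secret generic ecr-proxy-credentials \
                    --from-literal=username=AWS \
                    --from-literal=password="$TOKEN" \
                    --dry-run=client -o yaml | kubectl apply -f -
```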
I’ve been trying to set up a caching proxy for ECR for the last few days, and it keeps either failing to authenticate or making things slower.
From what I’ve read, if I switch to buildx, I’ll get the shared build cache I’m looking for, but I think that still leaves the docker-compose plugin without a cache. Is docker-compose still recommended for k8s, or do y’all have a different approach entirely at this point?
Hi @ianwremmel There are other approaches aside from DinD. I believe you could explore using Kaniko, as it also supports ECR and could work for your use case, or any of the other approaches described here for building container images. For ECR authentication, you can review this documentation for guidance.
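For example, a standalone pod along these lines (just a sketch, not specific to agent-stack-k8s; the repository URLs, git context, and Secret name are placeholders — kaniko’s registry cache keeps layers in ECR so they can be reused across builders):

```yaml
# Rough sketch of running the kaniko executor against ECR. All names/URLs are
# placeholders; credentials come from a Docker config.json mounted from a Secret
# (ECR also works with kaniko's ecr-login credential helper).
apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build
spec:
  restartPolicy: Never
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest
      args:
        - --dockerfile=Dockerfile
        - --context=git://github.com/my-org/my-app.git#refs/heads/main
        - --destination=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
        - --cache=true                                                  # push/pull layer cache to the registry
        - --cache-repo=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app/cache
      volumeMounts:
        - name: docker-config
          mountPath: /kaniko/.docker/
  volumes:
    - name: docker-config
      secret:
        secretName: ecr-docker-config   # hypothetical Secret containing config.json for ECR
```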