Found that the environment hook runs the following:
echo "Cleaning up docker resources older than ${DOCKER_PRUNE_UNTIL:-4h}"
docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-4h}"
It only gets executed when the available disk space is low, so I don’t know if this is what’s causing the issue, but I’m going to try defining DOCKER_PRUNE_UNTIL in the agent env.
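Something along these lines, assuming the agent reads hooks from /etc/buildkite-agent/hooks (the default on Linux; adjust the path for your setup):

# /etc/buildkite-agent/hooks/environment (sketch)
export DOCKER_PRUNE_UNTIL=168h  # only prune images older than a week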
Using docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-4h}" will remove all unused images, not just dangling ones, and it only removes images created before the given cutoff (note this is the image’s creation time, not when it was pulled). Also, the script in your environment hook will always run: is the prune command inside a conditional expression?
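If it isn’t, a guard along these lines would keep the prune limited to genuinely low-disk situations (just a sketch with an example threshold, not the stack’s actual script):

# Sketch: only prune when free space on / drops below ~10GB
min_free_kb=$((10 * 1024 * 1024))
free_kb=$(df -Pk / | awk 'NR==2 {print $4}')
if [ "$free_kb" -lt "$min_free_kb" ]; then
  docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-4h}"
fi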
Were you able to run your use case without the docker image prune to see the difference in behaviour?
Please could you also send a related build URL so we can take a look?
Cheers!
It did not, but I found something else interesting while checking /var/log/elastic-stack.log
bash-5.2$ sudo cat /var/log/elastic-stack.log
Disk space free: 5.3G
Inodes free: 4.0M
Total reclaimed space: 0B
Disk space free: 27G
Inodes free: 20M
Total reclaimed space: 0B
Disk space free: 23G
Inodes free: 20M
+ [[ false != \t\r\u\e ]]
+ echo 'Skipping mounting instance storage'
Skipping mounting instance storage
+ exit 0
...
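Side note: the \t\r\u\e above is just how bash xtrace escapes the pattern true inside [[ ]]. It can be reproduced in isolation, e.g.:

bash -xc '[[ false != true ]]'
# should print: + [[ false != \t\r\u\e ]]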
It shows the prune command getting executed at the very beginning, because /usr/local/bin/bk-check-disk-space.sh is determining that there is not enough space.
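I don’t have the script source in front of me, but from those log lines it presumably compares df output against minimum disk and inode thresholds before anything gets pruned, something like this (purely a guess at its shape; the variable names and default are made up):

# Hypothetical shape of bk-check-disk-space.sh (names and values are guesses)
echo "Disk space free: $(df -Ph /var/lib/docker | awk 'NR==2 {print $4}')"
echo "Inodes free: $(df -ih /var/lib/docker | awk 'NR==2 {print $4}')"
avail_kb=$(df -Pk /var/lib/docker | awk 'NR==2 {print $4}')
[ "$avail_kb" -lt "${DISK_MIN_AVAILABLE:-5242880}" ] && exit 1  # caller prunes on failure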
No, still getting the exact same log about the disk space. I can’t find where that is getting executed or why it returns those values: the root volume on the AMI has 45 GB with more than 10 GB free, and the Elastic Stack mounts an even bigger volume.
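For anyone else trying to trace it, the places I’d look (a sketch; the paths are guesses):

# Find what references the check script
grep -r bk-check-disk-space /etc /usr/local 2>/dev/null
sudo systemctl list-timers --all   # in case it runs from a systemd timer
sudo ls /etc/cron.d/ 2>/dev/null   # or from cron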
The situation you are describing is exactly what the userns-remap documentation covers (Isolate containers with a user namespace | Docker Docs). When you configure userns-remap, Docker creates a user-specific directory under /var/lib/docker, and that user owns the namespaced storage directories beneath it.
You’ll need to check whether userns-remap was on when you did the image pull, and make sure the buildkite-agent user does the pulling. If it was not on at pull time, then you need to ensure it is not turned on when creating the stack, or this will happen.
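A quick way to verify both ends of that (a sketch; substitute your image name):

docker info --format '{{json .SecurityOptions}}'   # "name=userns" means remap is on
jq '."userns-remap"' /etc/docker/daemon.json       # what the config says
sudo -u buildkite-agent docker pull your-image:tag # pull into the remapped storage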
It doesn’t really matter, because the first thing it does is remove all the images, and only then does it change the daemon.json settings. So the images get deleted anyway.
[buildkite-agent@ip-172-16-0-201 ~]$ cat /var/log/elastic-stack.log
Disk space free: 5.3G
Inodes free: 4.0M
Total reclaimed space: 0B
Disk space free: 28G
Inodes free: 23M
...
source /usr/local/lib/bk-configure-docker.sh
++ QEMU_BINFMT_TAG=qemu-v7.0.0-28@sha256:66e11bea77a5ea9d6f0fe79b57cd2b189b5d15b93a2bdb925be22949232e4e55
+ [[ true == \t\r\u\e ]]
+ cat
++ jq '."userns-remap"="buildkite-agent"' /etc/docker/daemon.json
+ cat
++ id -u buildkite-agent
+ cat
++ getent group docker
...
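So bk-configure-docker.sh is switching userns-remap on at boot: the jq line in the trace rewrites /etc/docker/daemon.json, roughly to this effect (other keys elided):

jq '."userns-remap"="buildkite-agent"' /etc/docker/daemon.json
# result: { ..., "userns-remap": "buildkite-agent" }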
I’ve tried using a ridiculously high time window to prevent images from getting deleted, but not even that worked.
DOCKER_PRUNE_UNTIL=26280h
Any other ideas? If I can’t pre-pull the images, it doesn’t really make sense to have a custom AMI in this case.
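For what it’s worth, as far as I understand the until filter compares against the image’s build time, not when it was pulled, so with a window that large any image built in the last ~3 years should survive the prune. A quick sanity check (sketch):

docker pull alpine:3.19
docker image prune --all --force --filter "until=26280h"
docker image ls alpine   # built well within the 26280h window, so it should still be listed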
However, the first log you shared shows the images under the ec2-user user, but in the last one they are under buildkite-agent.
To rule out possible reasons, can you confirm that the images are actually being deleted, and not just that you cannot see them?
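For example, something along these lines (a sketch; the overlay2 path and the uid.gid directory layout are assumptions based on the Docker docs):

# Image metadata written by the daemon without userns-remap:
sudo ls /var/lib/docker/image/overlay2/imagedb/content/sha256/ | head
# With userns-remap on, the daemon uses a per-user subdirectory instead:
sudo ls /var/lib/docker/ | grep -E '^[0-9]+\.[0-9]+$'
# Compare with what the currently running daemon reports:
docker image ls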