Missing docker images from custom AMI

Hi,
I’m currently using the Elastic Stack to provision the build nodes, and I’ve created a custom AMI based on the latest AMI used by the stack.

During the AMI build process (using Packer), some Docker images are pulled to save time later during provisioning and builds.

If I create an instance from the resulting AMI I can see the docker images listed:

[ec2-user@ip-172-31-21-233 ~]$ docker images
REPOSITORY                     TAG              IMAGE ID       CREATED         SIZE
python                         3.9-alpine       3b2db3bf7ff4   5 weeks ago     47.8MB
python                         3.8-bullseye     21103390e13a   5 weeks ago     907MB
python                         3.7-bullseye     264922b4f7c6   5 weeks ago     904MB
licensefinder/license_finder   latest           5a9a2902c78c   9 months ago    9.9GB
node                           16.17-bullseye   613f3b69142f   10 months ago   941MB
tonistiigi/binfmt              <none>           354472a37893   13 months ago   60.2MB
softartdev/android-fastlane    30               ecdc26cb94f8   2 years ago     2.75GB

But when the instance is created by the Elastic Stack’s ASG, the images aren’t there.

I’ve checked that the AMI ID is the right one, so I don’t know why the images aren’t present.

Any ideas what I could check next?

I read in another topic that some sort of prune command runs during the bootstrap. If that’s true, can it be disabled somehow?

Found that the environment hook runs the following:

  echo "Cleaning up docker resources older than ${DOCKER_PRUNE_UNTIL:-4h}"
  docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-4h}"

It only gets executed when available disk space is low, so I don’t know if this is what’s causing the issue, but I’m going to try defining DOCKER_PRUNE_UNTIL in the agent environment.
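For anyone following along, here is a minimal sketch of how that default expands. It only prints the command rather than running it, and prune_command is a hypothetical helper name, not something the stack provides. The point is that ${DOCKER_PRUNE_UNTIL:-4h} falls back to 4h when the variable is unset or empty, so the override has to be visible to whatever shell runs the hook.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the prune command instead of executing it,
# so the effect of DOCKER_PRUNE_UNTIL can be checked without touching Docker.
prune_command() {
  printf 'docker image prune --all --force --filter until=%s\n' "${DOCKER_PRUNE_UNTIL:-4h}"
}

unset DOCKER_PRUNE_UNTIL
prune_command   # falls back to the 4h default

DOCKER_PRUNE_UNTIL=26280h
prune_command   # uses the override
```

With the variable unset the filter is until=4h; with the override it is until=26280h.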

Hey @smoreno-allurion,

Thanks for the information.

I have gone through it.

Running docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-4h}" removes all unused images, not just dangling ones, but only those created before the given cutoff. Also, the script in your environment hook will always run. Is the prune command inside a conditional expression?

Were you able to run your use case without the docker image prune to see whether the behaviour differs?

Could you please send a related build URL so we can take a look?
Cheers!
Cheers!

It did not, but I found something else interesting while checking /var/log/elastic-stack.log

h-5.2$ sudo cat /var/log/elastic-stack.log
Disk space free: 5.3G
Inodes free: 4.0M
Total reclaimed space: 0B
Disk space free:  27G
Inodes free:  20M
Total reclaimed space: 0B
Disk space free:  23G
Inodes free:  20M
+ [[ false != \t\r\u\e ]]
+ echo 'Skipping mounting instance storage'
Skipping mounting instance storage
+ exit 0
...

It shows the prune command being executed at the very beginning, because /usr/local/bin/bk-check-disk-space.sh decides that there is not enough space:

DISK_MIN_AVAILABLE=${DISK_MIN_AVAILABLE:-5242880} # 5GB
DISK_MIN_INODES=${DISK_MIN_INODES:-250000} # docker needs lots
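To make the threshold concrete: 5242880 is in 1K blocks, so it equals 5 GiB. The sketch below is not the stack's actual bk-check-disk-space.sh; it is an assumed, simplified version of that kind of check, and available_kb and has_enough_space are hypothetical names. It compares the free space reported by df against DISK_MIN_AVAILABLE.

```shell
#!/usr/bin/env bash
# Sketch only: an assumed, simplified free-space check, not the stack's script.
DISK_MIN_AVAILABLE="${DISK_MIN_AVAILABLE:-5242880}" # 1K blocks, i.e. 5 GiB

# Free space, in 1K blocks, on the filesystem that holds the given path.
available_kb() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds when the path has at least DISK_MIN_AVAILABLE 1K blocks free.
has_enough_space() {
  [ "$(available_kb "$1")" -ge "$DISK_MIN_AVAILABLE" ]
}

# /var/lib/docker is Docker's default data root; fall back to / if it is absent.
target=/var/lib/docker
[ -d "$target" ] || target=/

if has_enough_space "$target"; then
  echo "enough space on $target"
else
  echo "low disk space on $target, a cleanup would be triggered"
fi
```

Running this on an instance launched from the AMI, before the stack mounts its larger volume, should show which side of the cutoff the root volume is on.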

I’m building a new AMI with more space on the root volume now to test this out.

No, I’m still getting the exact same log about the disk space. I can’t find where that check is executed or why it returns those values. The AMI’s root volume is 45 GB with more than 10 GB free, and the Elastic Stack mounts an even bigger volume.

Any help would be much appreciated.

Hey Santi,

We have seen this in the past when namespace remapping is enabled in your AWS Elastic Stack (see “Template parameters in the Elastic CI Stack for AWS” in the Buildkite Documentation).

The situation you are describing is exactly what the userns-remap documentation describes (see “Isolate containers with a user namespace” in the Docker Docs). When userns-remap is configured, Docker creates a dedicated directory under /var/lib/docker, owned by the remapped user, and keeps its images and layers there. Images pulled while remapping was off live outside that directory, so the daemon no longer sees them.

You’ll need to check whether userns-remap was ON when you did the image pull, and use the buildkite-agent user to do the pulling. If it was not ON, then you need to make sure it is not turned on when creating the stack, or this will happen.
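As a rough way to find that directory: the Docker docs describe it as named after the remapped UID and GID (<uid>.<gid>). The sketch below assumes those are the first entries for the user in /etc/subuid and /etc/subgid. first_subid and remapped_root are hypothetical helper names, and the IDs 2000 and 993 are made-up sample values.

```shell
#!/usr/bin/env bash
# First subordinate ID listed for a user in a subuid/subgid-style file.
# Usage: first_subid FILE USER
first_subid() {
  awk -F: -v user="$2" '$1 == user { print $2; exit }' "$1"
}

# Directory assumed to hold the remapped storage: <data-root>/<uid>.<gid>.
# Usage: remapped_root SUBUID_FILE SUBGID_FILE USER [DATA_ROOT]
remapped_root() {
  printf '%s/%s.%s\n' "${4:-/var/lib/docker}" \
    "$(first_subid "$1" "$3")" "$(first_subid "$2" "$3")"
}

# Demo with sample files; the IDs below are invented for illustration.
demo_dir=$(mktemp -d)
printf 'buildkite-agent:2000:1\nbuildkite-agent:100000:65536\n' > "$demo_dir/subuid"
printf 'buildkite-agent:993:1\nbuildkite-agent:100000:65536\n' > "$demo_dir/subgid"
remapped_root "$demo_dir/subuid" "$demo_dir/subgid" buildkite-agent
# prints /var/lib/docker/2000.993
rm -rf "$demo_dir"
```

On a real instance, something like sudo ls "$(remapped_root /etc/subuid /etc/subgid buildkite-agent)" should show whether image data exists under the remapped directory or only under the top-level /var/lib/docker.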

It doesn’t really matter, because the first thing the bootstrap does is remove all the images, and only then does it change the daemon.json settings. So the images get deleted anyway.

[buildkite-agent@ip-172-16-0-201 ~]$ cat /var/log/elastic-stack.log
Disk space free: 5.3G
Inodes free: 4.0M
Total reclaimed space: 0B
Disk space free:  28G
Inodes free:  23M
...
 source /usr/local/lib/bk-configure-docker.sh
++ QEMU_BINFMT_TAG=qemu-v7.0.0-28@sha256:66e11bea77a5ea9d6f0fe79b57cd2b189b5d15b93a2bdb925be22949232e4e55
+ [[ true == \t\r\u\e ]]
+ cat
++ jq '."userns-remap"="buildkite-agent"' /etc/docker/daemon.json
+ cat
++ id -u buildkite-agent
+ cat
++ getent group docker
... 

I’ve tried using a ridiculously high time window to prevent images from getting deleted, but not even that worked.

DOCKER_PRUNE_UNTIL=26280h

Any other ideas? If I can’t pre-pull the images, it doesn’t really make sense to have a custom AMI in this case.

This one is strange. I’m not entirely sure this issue has anything to do with the low disk space script.

The log you shared shows that you have more than enough disk space and inodes, and we cannot see the log lines showing that it’s reclaiming space (https://github.com/buildkite/elastic-ci-stack-for-aws/blob/6665d04faf6eadbd3d0711bb1c6056cf71ea13a7/packer/linux/conf/docker/scripts/docker-low-disk-gc#L37), so that’s probably not the issue. If the prune had run, you would see a line saying “Cleaning up docker resources…”.

However, in the first log you shared the images were listed as the ec2-user user, while the last one was run as buildkite-agent.
To rule out possible causes, can you confirm that the images are actually being deleted, and not just that you cannot see them?

Thanks!

Thanks @paula, I got it working in the end. I had to combine both solutions.

I enabled userns-remap for the buildkite-agent user during the AMI build, and increased DOCKER_PRUNE_UNTIL to prevent the images from being deleted on startup.

This is what my shell provisioning script looks like now:

#!/usr/bin/env bash
set -o errexit
set -o nounset

# Widen the prune window so the startup cleanup keeps the pre-pulled images.
sudo sh -c "echo 'export DOCKER_PRUNE_UNTIL=26280h' > /etc/profile.d/script.sh"
sudo chmod +x /etc/profile.d/script.sh

# Temporarily enable userns-remap so pulled images land in the remapped storage directory.
sudo systemctl stop docker
DOCKER_CONFIG_TMP=$(mktemp)
sudo cp /etc/docker/daemon.json /etc/docker/daemon.json.bak
jq '. += {"userns-remap": "buildkite-agent"}' < /etc/docker/daemon.json > "${DOCKER_CONFIG_TMP}"
sudo mv "${DOCKER_CONFIG_TMP}" /etc/docker/daemon.json

# Map container root to the buildkite-agent UID, then a regular subordinate range.
cat <<EOF > /tmp/subuid
buildkite-agent:$(id -u buildkite-agent):1
buildkite-agent:100000:65536
EOF
sudo mv /tmp/subuid /etc/subuid

# Same for groups, using the docker group's GID.
cat <<EOF > /tmp/subgid
buildkite-agent:$(getent group docker | awk -F: '{print $3}'):1
buildkite-agent:100000:65536
EOF
sudo mv /tmp/subgid /etc/subgid

sudo systemctl start docker

# Pull as the buildkite-agent user (assumes it belongs to the docker group).
sudo -u buildkite-agent docker pull softartdev/android-fastlane:30

# Restore the original daemon.json.
sudo mv /etc/docker/daemon.json.bak /etc/docker/daemon.json

Thanks for all the help!

That’s great! Thanks for sharing your solution.
