Disk space cleanup on agents: the environment hook vs docker-low-disk-gc

We have had a problem with disks filling up on our agents rather quickly, leaving us needing to manually stop the agents so that they will terminate.

When the disk check fails, the environment hook tries to clean things up but can’t find enough to reclaim:

Checking disk space
Disk space free: 3.3G
Not enough disk space free, cutoff is 4194304 🚨
Cleaning up docker resources older than 30m
Deleted Images:
untagged: 759931498410.dkr.ecr.ap-southeast-2.amazonaws.com/megatron@sha256:43340be10b6950c279c787fbc2418f5346414922d6c800675d03c962c3f121d1
deleted: sha256:d3746c326ed977c2c9ce19a5cb4f89f55d4be609523400e56acd566a6c804326
deleted: sha256:bfa34b8ada2e363fbab06030d6e26c3d4966320e8e8a11cf7381b0c730341ed9
deleted: sha256:9d084627c5e6f4a4914910868426128396cb709f9b5953ed05e148ddf1d22093
deleted: sha256:21b2d5114e248e81d6d353fe2484c9c50d73c36452dbb425e13fb1fc640ed4c2
deleted: sha256:682ada4d0212df04fc654aac8e372f20f75cf38cddbf58bacbfd05d76b9f3258
deleted: sha256:e956b6b217161984fa9251e6eb995ac2fa146012010dd47179422934989ecfe9
deleted: sha256:05e3b9608f9c8da5dca3485fbb7af29bb76ca34ec75ae684726b02ca77722313
deleted: sha256:eb10cf77d1846b827982f02c9e48bde1b95b4410e014f3d78e2d64b2c9a5381d
deleted: sha256:42f16b308990ee2b237a57ae79bc569374e1434f77b7bdf310bbc3a5699e3e97
Total reclaimed space: 391.3MB
Checking disk space again
Disk space free: 3.7G
Not enough disk space free, cutoff is 4194304 🚨
Disk health checks failed

The disks in question are 64 GiB disks.

In the Elastic CI Stack we set the BootstrapScriptUrl parameter to a script that does a bit of setup on the agents when they start, including the following to try and make the cleanup more aggressive:

# Set DISK_MIN_AVAILABLE to 4GiB, down from the default of 5GiB.
# Set DOCKER_PRUNE_UNTIL to 30m, down from the default of 4h (actually 1h due to the above, which I'll remove if this works).
cat <<-EOT >> /var/lib/buildkite-agent/cfn-env
        export DISK_MIN_AVAILABLE=4194304
        export DOCKER_PRUNE_UNTIL=30m
EOT

I noticed that there are two files in your repository for handling this.
The environment hook:

And the docker-low-disk-gc script that is run via a systemd timer on an hourly basis:

In addition to the docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL}" command, the systemd timer also runs docker builder prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL}".

I connected to an agent after it failed and first ran this:

root@ip-10-128-2-237:/home/ec2-user# docker image prune --all --force --filter until=30m
Total reclaimed space: 0B

Then I tried this:

root@ip-10-128-2-237:/home/ec2-user# docker builder prune --all --force --filter until=30m

The above command cleaned up 11.62GB from the disk.

Is there any reason why the above command isn’t run from the environment hook as well?
It could potentially solve our issue.
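
For reference, a minimal sketch of what I mean, assuming a custom environment hook where the stack’s DOCKER_PRUNE_UNTIL is available (the default value below is my assumption, not the stack’s):

#!/usr/bin/env bash
set -euo pipefail

PRUNE_UNTIL="${DOCKER_PRUNE_UNTIL:-30m}"

# what the environment hook already does: prune unused images older than the cutoff
docker image prune --all --force --filter "until=${PRUNE_UNTIL}"

# the extra step I am asking about: also prune the builder/BuildKit cache,
# which is what reclaimed ~11 GB for us when run by hand
docker builder prune --all --force --filter "until=${PRUNE_UNTIL}"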

In addition to the above, I see that the docker-low-disk-gc systemd timer job will mark the host as unhealthy if it can’t free up enough disk space, like so:

mark_instance_unhealthy() {
  # cancel any running buildkite builds
  killall -QUIT buildkite-agent || true

  # mark the instance for termination
  echo "Marking instance as unhealthy"

  # shellcheck disable=SC2155
  local token=$(curl -X PUT -H "X-aws-ec2-metadata-token-ttl-seconds: 60" --fail --silent --show-error --location "http://169.254.169.254/latest/api/token")
  # shellcheck disable=SC2155
  local instance_id=$(curl -H "X-aws-ec2-metadata-token: $token" --fail --silent --show-error --location "http://169.254.169.254/latest/meta-data/instance-id")
  # shellcheck disable=SC2155
  local region=$(curl -H "X-aws-ec2-metadata-token: $token" --fail --silent --show-error --location "http://169.254.169.254/latest/meta-data/placement/region")

  aws autoscaling set-instance-health \
    --instance-id "${instance_id}" \
    --region "${region}" \
    --health-status Unhealthy
}

trap mark_instance_unhealthy ERR

Can the environment hook also do that, but use killall -TERM buildkite-agent || true instead of -QUIT so that existing jobs on the instance can at least run to completion (or run to failure) before marking the host as unhealthy for termination?
This would save me having to do it manually to stop new jobs from being scheduled onto the full instance and failing.
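
A rough sketch of the variation I have in mind (the function name and the wait loop are mine, not something from the stack):

mark_instance_unhealthy_graceful() {
  # ask the agents to finish their current jobs instead of cancelling them
  killall -TERM buildkite-agent || true

  # wait (with a timeout) for the agent processes to exit so in-flight jobs
  # can run to completion or failure before the host is replaced
  for _ in $(seq 1 60); do
    pgrep buildkite-agent > /dev/null || break
    sleep 10
  done

  echo "Marking instance as unhealthy"
  # ...then the same IMDSv2 token / instance-id / region lookups and
  # `aws autoscaling set-instance-health --health-status Unhealthy` call
  # as in docker-low-disk-gc above
}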

Hey @Jim

Thanks for the detailed message.

The environment hook in the stack should automatically run these prune tasks. In case that doesn’t happen, you can perform cleanup tasks yourself:

  • Using a pre-exit hook (this runs on every step, before the job finishes) to delete temporary files and remove containers, i.e. using the agent lifecycle hooks for those tasks instead (see the sketch after this list).

  • Using BootstrapScriptUrl, which is run at instance boot time, to append custom values for the check-disk-space script configuration to the cfn-env file (the absolute path to that file is /var/lib/buildkite-agent/cfn-env).
    In your script, you’ll have something like echo 'set always DOCKER_PRUNE_UNTIL 1h' >> /var/lib/buildkite-agent/cfn-env
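
For the pre-exit hook option, here is a minimal sketch; the hook path and what it cleans up are assumptions, so adjust them for your own stack and workloads:

#!/usr/bin/env bash
# example agent pre-exit hook, e.g. /etc/buildkite-agent/hooks/pre-exit
set -euo pipefail

# remove containers that have been stopped for a while, left behind by earlier jobs
docker container prune --force --filter "until=30m" || true

# clean up any per-job scratch directories your pipelines create (hypothetical path)
rm -rf /tmp/my-build-scratch-* || true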

I was able to find some more context in your job logs; there is:

Cleaning up docker resources older than 30m
Error response from daemon: a prune operation is already running
🚨 Elastic CI Stack environment hook failed

From the logs, you can see the BootstrapScriptUrl prune is running at the same time as the environment hook check. The prune tasks are actually being run; from what I see, the environment hook is failing because a prune is already running.

There is a lock but the prune command is not running anymore, which could be due to a dead or unresponsive container. You could try restarting Docker to see if that helps.
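
If you want to clear a stuck prune by hand, something along these lines should work (generic systemd commands, not stack-specific; note that restarting the daemon stops running containers unless live-restore is enabled):

# check what the daemon is doing
sudo systemctl status docker

# restart the daemon to clear the stuck prune
sudo systemctl restart docker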

Also, can you send an example build link to support@buildkite.com if you still need some more assistance?

Cheers!

Hi @stephanie.atte

I wasn’t saying that the environment hook isn’t running; it is.
However, it only runs a docker image prune, which doesn’t find enough to clean up to alleviate my problem.

But I also noticed that there is a systemd timer that runs a docker builder prune command hourly as well, and I was wondering why the environment hook doesn’t also try to run that command; when I ran it manually it found plenty to clean up, and our build wouldn’t have failed.

I was also wondering why the environment hook doesn’t have some logic to mark the instance as unhealthy once it is unable to clean up enough disk space.

Regarding the lock problem that you’ve pointed out, I wasn’t aware that was happening.
I’d imagine this is because sometimes the systemd timer job running docker-low-disk-gc clashes with the environment hook.
Or maybe even because there are multiple agents per instance (5 in our case) running build steps in parallel, so the environment hooks potentially clash with each other as well?
Perhaps the scripts need logic to see whether a docker prune is already running and, instead of running one themselves, just wait until that prune is done before checking the disk space again to decide whether they should bail.
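
For example, something like flock could serialize the prunes across the hook invocations and the timer; this is only a sketch, not the stack’s current code, and the lock file path is arbitrary:

#!/usr/bin/env bash
set -euo pipefail

# hold a host-wide lock so concurrent environment hooks and the
# docker-low-disk-gc timer take turns instead of failing with
# "a prune operation is already running"
exec 9> /var/lock/buildkite-docker-prune.lock
if ! flock --wait 300 9; then
  echo "Timed out waiting for another docker prune to finish" >&2
  exit 1
fi

docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-30m}"
docker builder prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-30m}"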

In summary:

  • Why doesn’t the environment hook try running a docker builder prune command?

  • Why doesn’t the environment hook try to mark the instance as unhealthy after it hasn’t managed to free up enough disk space?

  • And a new issue… It seems like the docker * prune commands run from parallel invocations of the environment hook (due to many agents per instance) can clash with each other, and could also clash with the docker-low-disk-gc script run by the systemd timer.
    It seems like the code in both the environment hook and docker-low-disk-gc should handle these locking problems.

Hey @Jim!

Ben here! Some great questions and ones I’d recommend raising an issue for on GitHub, or maybe even a PR if you’d like.

The current setup is done in a way that’ll be most useful to the most users. Depending on your instance type and size, cleaning Docker caches with the builder prune command might not be required, for example. When I run the builder command I barely recover any disk space at all (KBs), whereas the image command cleared up significantly more.

Cheers!

Hi Ben.

I have raised the following pull request that addresses the first point.


Hi @benmc

The PR I’ve raised is still waiting for some checks to complete, so I guess something manual needs to happen on your side.

Also, is there something I need to do to get reviewers assigned to it?

Thanks for letting me know @Jim! I’ll let the owners of the Elastic Stack know and give them a bit of a poke for the review.
