Disk space cleanup on agents: the environment hook vs docker-low-disk-gc

We have had a problem with disks filling up on our agents rather quickly, leaving us needing to manually stop the agents so that they will terminate.

When the disk check fails, the environment hook tries to clean things up but can’t find enough to reclaim:

Checking disk space
Disk space free: 3.3G
Not enough disk space free, cutoff is 4194304 🚨
Cleaning up docker resources older than 30m
Deleted Images:
untagged: 759931498410.dkr.ecr.ap-southeast-2.amazonaws.com/megatron@sha256:43340be10b6950c279c787fbc2418f5346414922d6c800675d03c962c3f121d1
deleted: sha256:d3746c326ed977c2c9ce19a5cb4f89f55d4be609523400e56acd566a6c804326
deleted: sha256:bfa34b8ada2e363fbab06030d6e26c3d4966320e8e8a11cf7381b0c730341ed9
deleted: sha256:9d084627c5e6f4a4914910868426128396cb709f9b5953ed05e148ddf1d22093
deleted: sha256:21b2d5114e248e81d6d353fe2484c9c50d73c36452dbb425e13fb1fc640ed4c2
deleted: sha256:682ada4d0212df04fc654aac8e372f20f75cf38cddbf58bacbfd05d76b9f3258
deleted: sha256:e956b6b217161984fa9251e6eb995ac2fa146012010dd47179422934989ecfe9
deleted: sha256:05e3b9608f9c8da5dca3485fbb7af29bb76ca34ec75ae684726b02ca77722313
deleted: sha256:eb10cf77d1846b827982f02c9e48bde1b95b4410e014f3d78e2d64b2c9a5381d
deleted: sha256:42f16b308990ee2b237a57ae79bc569374e1434f77b7bdf310bbc3a5699e3e97
Total reclaimed space: 391.3MB
Checking disk space again
Disk space free: 3.7G
Not enough disk space free, cutoff is 4194304 🚨
Disk health checks failed

The disks in question are 64 GiB disks.

In the Elastic CI Stack we set the BootstrapScriptUrl parameter to a script that does a bit of setup on the agents when they start, including the following to try and make the cleanup more aggressive:

# Set DISK_MIN_AVAILABLE to 4GiB, down from the default of 5GiB.
# Set DOCKER_PRUNE_UNTIL to 30m, down from the default of 4h (actually 1h due to the above, which I'll remove if this works).
cat <<-EOT >> /var/lib/buildkite-agent/cfn-env
        export DISK_MIN_AVAILABLE=4194304
        export DOCKER_PRUNE_UNTIL=30m
EOT

I noticed that there are two files in your repository for handling this.
The environment hook:

And the docker-low-disk-gc script that is run via a systemd timer on an hourly basis:

In addition to the docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL}" command, the systemd timer also runs docker builder prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL}".

I connected to an agent after it failed and first ran this:

root@ip-10-128-2-237:/home/ec2-user# docker image prune --all --force --filter until=30m
Total reclaimed space: 0B

Then I tried this:

root@ip-10-128-2-237:/home/ec2-user# docker builder prune --all --force --filter until=30m

The above command cleaned up 11.62GB from the disk.

Is there any reason why the above command isn’t run from the environment hook as well?
It could potentially solve our issue.
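
For reference, a minimal sketch of what I mean, assuming a custom environment hook where the stack’s DOCKER_PRUNE_UNTIL is available (the default value below is my assumption, not the stack’s):

#!/usr/bin/env bash
set -euo pipefail

PRUNE_UNTIL="${DOCKER_PRUNE_UNTIL:-30m}"

# what the environment hook already does: prune unused images older than the cutoff
docker image prune --all --force --filter "until=${PRUNE_UNTIL}"

# the extra step I am asking about: also prune the builder/BuildKit cache,
# which is what reclaimed ~11 GB for us when run by hand
docker builder prune --all --force --filter "until=${PRUNE_UNTIL}"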

In addition to the above, I see that the docker-low-disk-gc systemd timer job will mark the host as unhealthy if it can’t free up enough disk space, like so:

mark_instance_unhealthy() {
  # cancel any running buildkite builds
  killall -QUIT buildkite-agent || true

  # mark the instance for termination
  echo "Marking instance as unhealthy"

  # shellcheck disable=SC2155
  local token=$(curl -X PUT -H "X-aws-ec2-metadata-token-ttl-seconds: 60" --fail --silent --show-error --location "http://169.254.169.254/latest/api/token")
  # shellcheck disable=SC2155
  local instance_id=$(curl -H "X-aws-ec2-metadata-token: $token" --fail --silent --show-error --location "http://169.254.169.254/latest/meta-data/instance-id")
  # shellcheck disable=SC2155
  local region=$(curl -H "X-aws-ec2-metadata-token: $token" --fail --silent --show-error --location "http://169.254.169.254/latest/meta-data/placement/region")

  aws autoscaling set-instance-health \
    --instance-id "${instance_id}" \
    --region "${region}" \
    --health-status Unhealthy
}

trap mark_instance_unhealthy ERR

Can the environment hook also do that, but use killall -TERM buildkite-agent || true instead of -QUIT so that existing jobs on the instance can at least run to completion (or run to failure) before marking the host as unhealthy for termination?
This would save me having to do it manually to stop new jobs from being scheduled onto the full instance and failing.
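
A rough sketch of the variation I have in mind (the function name and the wait loop are mine, not something from the stack):

mark_instance_unhealthy_graceful() {
  # ask the agents to finish their current jobs instead of cancelling them
  killall -TERM buildkite-agent || true

  # wait (with a timeout) for the agent processes to exit so in-flight jobs
  # can run to completion or failure before the host is replaced
  for _ in $(seq 1 60); do
    pgrep buildkite-agent > /dev/null || break
    sleep 10
  done

  echo "Marking instance as unhealthy"
  # ...then the same IMDSv2 token / instance-id / region lookups and
  # `aws autoscaling set-instance-health --health-status Unhealthy` call
  # as in docker-low-disk-gc above
}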

Hey @Jim

Thanks for the detailed message.

The environment hook in the stack should automatically run these prune tasks. In case that doesn’t happen, you can perform cleanup tasks yourself:

  • Using a pre-exit hook (this runs on every step, before the job finishes) to delete temporary files and remove containers, i.e. using the agent lifecycle hooks for those tasks instead (see the sketch after this list).

  • Using BootstrapScriptUrl, which is run at instance boot time, to append custom values for the check-disk-space script configuration to the cfn-env file (the absolute path to that file is /var/lib/buildkite-agent/cfn-env).
    In your script, you’ll have something like echo 'set always DOCKER_PRUNE_UNTIL 1h' >> /var/lib/buildkite-agent/cfn-env
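
For the pre-exit hook option, here is a minimal sketch; the hook path and what it cleans up are assumptions, so adjust them for your own stack and workloads:

#!/usr/bin/env bash
# example agent pre-exit hook, e.g. /etc/buildkite-agent/hooks/pre-exit
set -euo pipefail

# remove containers that have been stopped for a while, left behind by earlier jobs
docker container prune --force --filter "until=30m" || true

# clean up any per-job scratch directories your pipelines create (hypothetical path)
rm -rf /tmp/my-build-scratch-* || true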

I was able to find some more context in your job logs; there is:

Cleaning up docker resources older than 30m
Error response from daemon: a prune operation is already running
🚨 Elastic CI Stack environment hook failed

From the logs, you can see the BootstrapScriptUrl prune is running at the same time as the environment hook check. The prune tasks are actually being run; from what I see, the environment hook is failing because a prune is already running.

There is a lock but the prune command is not running anymore, which could be due to a dead or unresponsive container. You could try restarting Docker to see if that helps.
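
If you want to clear a stuck prune by hand, something along these lines should work (generic systemd commands, not stack-specific; note that restarting the daemon stops running containers unless live-restore is enabled):

# check what the daemon is doing
sudo systemctl status docker

# restart the daemon to clear the stuck prune
sudo systemctl restart docker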

Also, can you send an example build link to support@buildkite.com if you still need some more assistance?

Cheers!

Hi @stephanie.atte

I wasn’t saying that the environment hook isn’t running; it is.
However, it only runs a docker image prune, which doesn’t find enough to clean up to alleviate my problem.

But I also noticed that there is a systemd timer that runs a docker builder prune command hourly as well, and I was wondering why the environment hook doesn’t also try to run that command; when I ran it manually it found plenty to clean up, and our build wouldn’t have failed.

I was also wondering why the environment hook doesn’t have some logic to mark the instance as unhealthy once it is unable to clean up enough disk space.

Regarding the lock problem that you’ve pointed out, I wasn’t aware that was happening.
I’d imagine this is because sometimes the systemd timer job running docker-low-disk-gc clashes with the environment hook.
Or maybe even because there are multiple agents per instance (5 in our case) running build steps in parallel, so the environment hooks potentially clash with each other as well?
Perhaps the scripts need logic to see whether a docker prune is already running and, instead of running one themselves, just wait until that prune is done before checking the disk space again to decide whether they should bail.
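
For example, something like flock could serialize the prunes across the hook invocations and the timer; this is only a sketch, not the stack’s current code, and the lock file path is arbitrary:

#!/usr/bin/env bash
set -euo pipefail

# hold a host-wide lock so concurrent environment hooks and the
# docker-low-disk-gc timer take turns instead of failing with
# "a prune operation is already running"
exec 9> /var/lock/buildkite-docker-prune.lock
if ! flock --wait 300 9; then
  echo "Timed out waiting for another docker prune to finish" >&2
  exit 1
fi

docker image prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-30m}"
docker builder prune --all --force --filter "until=${DOCKER_PRUNE_UNTIL:-30m}"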

In summary:

  • Why doesn’t the environment hook try running a docker builder prune command?

  • Why doesn’t the environment hook try to mark the instance as unhealthy after it hasn’t managed to free up enough disk space?

  • And a new issue… It seems like the docker * prune commands run from parallel invocations of the environment hook (due to many agents per instance) can clash with each other, and could also clash with the docker-low-disk-gc script run by the systemd timer.
    It seems like the code in both the environment hook and docker-low-disk-gc should handle these locking problems.

Hey @Jim!

Ben here! Some great questions and ones I’d recommend raising an issue for on GitHub, or maybe even a PR if you’d like.

The current setup is done in a way that’ll be most useful to the most users. Depending on your instance type and size, cleaning Docker caches with the builder prune command might not be required, for example. When I run the builder command I barely recover any disk space at all (KBs), whereas the image command cleared up significantly more.

Cheers!

Hi Ben.

I have raised the following pull request that addresses the first point.


Hi @benmc

The PR I’ve raised is still waiting for some checks to complete, so I guess something manual needs to happen on your side.

Also, is there something I need to do to get reviewers assigned to it?

Thanks for letting me know @Jim! I’ll let the owners of the Elastic Stack know and give them a bit of a poke for the review.
