Handling Docker Hub timeouts on docker-login

We’ve been using Buildkite for a few weeks, and have gotten a lot of benefit out of it for our CI pipeline. We have a remaining source of build flakiness that we want to figure out how to better handle.

We’re seeing a kind of timeout or response failure that seems to happen when we’re interacting with Docker Hub through the docker-login plugin. Authentication sometimes appears to time out while an agent is logging in with the plugin. Is there a way to have that plugin step, and only that step, retry if it fails? I don’t want to enable global job retries, but this has been happening a non-trivial amount in our build pipeline.

We’re using the Elastic CI stack with no customizations that I know of that should affect this. Is there anything we can do?

Hi Geoff,

It’d be nice to fix the underlying problem and reduce the number of timeouts from Docker Hub. Feel free to contact us (support@buildkite.com) if you’d like us to help debug a specific build.

Is there a way to have that plugin step, and only that step, retry if it fails?

There is! You can find examples of this in the Command step documentation.

The simplest case is:

steps:
  - label: "Tests"
    command: "tests.sh"
    retry:
      automatic: true
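If retrying on every failure is too broad, automatic also accepts a list of exit statuses with limits. A minimal sketch (the exit status and limit values here are just illustrative):

steps:
  - label: "Tests"
    command: "tests.sh"
    retry:
      automatic:
        # exit status -1 is reported when the agent is lost or disconnected
        - exit_status: -1
          limit: 2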

Hi @jhealy, thanks for your advice!

I think I didn’t explain the issue very clearly. From what I understand, retry: automatic: true will retry the job if it fails for any reason (or for any particular exit status). I only want to retry when the docker login fails (or when there’s some other build infrastructure issue unrelated to my test suite), and unfortunately that’s exit code 1, the same as when my tests fail.

Ah, I see.

I suspect it will be hard to address the timeouts in the pipeline.yml. It sounds like we might want to add some retry logic to the docker-login plugin.

We’ll try and find time to add that soon, but if you’re keen a pull request would be super useful! The docker-compose plugin has some retry logic that could probably be ported across. While you wait for us to merge and release it, you also have the option of temporarily loading the plugin from your fork, along the lines of the sketch below.
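As a rough sketch (the fork org, branch, and plugin options here are hypothetical placeholders; a your-org/docker-login#your-branch reference resolves to a docker-login-buildkite-plugin repository under that GitHub org):

steps:
  - label: "Build"
    command: "build.sh"
    plugins:
      # hypothetical fork and branch of the docker-login plugin
      - your-org/docker-login#your-branch:
          username: your-docker-hub-user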

Sounds good!

Until we get the PR going, we’re getting around this by wrapping our CI scripts so that they return 3 instead of 1 when the test suite fails, and then retrying automatically on exit status 1.
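Concretely, something like this (tests.sh and the retry limit are just placeholders):

steps:
  - label: "Tests"
    # remap test-suite failures to exit 3 so they can't be confused with
    # infrastructure failures (e.g. the docker login), which exit 1
    command: "./tests.sh || exit 3"
    retry:
      automatic:
        - exit_status: 1   # only retry on the "infrastructure" exit status
          limit: 2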