Handling Docker Hub timeouts on docker-login

We’ve been using Buildkite for a few weeks, and have gotten a lot of benefit out of it for our CI pipeline. We have a remaining source of build flakiness that we want to figure out how to better handle.

We’re seeing a kind of timeout or response failure that seems to happen when we’re interacting with Docker Hub through the docker-login plugin. Authentication sometimes appears to time out while an agent is logging in with the plugin. Is there a way to have that plugin step, and only that step, retry if it fails? I don’t want to enable global job retries, but this has been happening a non-trivial amount in our build pipeline.

We’re using the Elastic CI stack with no customizations that I know of that should affect this. Is there anything we can do?

Hi Geoff,

It’d be nice to fix the underlying problem and reduce the number of timeouts from Docker Hub. Feel free to contact us (support@buildkite.com) if you’d like us to help debug a specific build.

Is there a way to have that plugin step, and only that step, retry if it fails?

There is! You can find examples of this in the Command step documentation.

The simplest case is:

steps:
  - label: "Tests"
    command: "tests.sh"
    retry:
      automatic: true
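If retrying on every failure is too broad, automatic also accepts a list of exit statuses with limits. A minimal sketch (the exit status and limit values here are just illustrative):

steps:
  - label: "Tests"
    command: "tests.sh"
    retry:
      automatic:
        # exit status -1 is reported when the agent is lost or disconnected
        - exit_status: -1
          limit: 2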

Hi @jhealy, thanks for your advice!

I think I didn’t explain the issue very clearly. From what I understand, retry: automatic: true will retry the job if it fails for any reason (or for any particular exit status). I only want to retry when the docker login fails (or when there’s some other build infrastructure issue unrelated to my test suite), and unfortunately that’s exit code 1, the same as when my tests fail.

Ah, I see.

I suspect it will be hard to address the timeouts in the pipeline.yml. It sounds like we might want to add some retry logic to the docker-login plugin.

We’ll try and find time to add that soon, but if you’re keen a pull request would be super useful! The docker-compose plugin has some retry logic that could probably be ported across. While you wait for us to merge and release it, you also have the option of temporarily loading the plugin from your fork, along the lines of the sketch below.
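As a rough sketch (the fork org, branch, and plugin options here are hypothetical placeholders; a your-org/docker-login#your-branch reference resolves to a docker-login-buildkite-plugin repository under that GitHub org):

steps:
  - label: "Build"
    command: "build.sh"
    plugins:
      # hypothetical fork and branch of the docker-login plugin
      - your-org/docker-login#your-branch:
          username: your-docker-hub-user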

Sounds good!

Until we get the PR going, we’re getting around this by wrapping our CI scripts so that they return 3 instead of 1 when the test suite fails, and then retrying automatically on exit status 1.
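Concretely, something like this (tests.sh and the retry limit are just placeholders):

steps:
  - label: "Tests"
    # remap test-suite failures to exit 3 so they can't be confused with
    # infrastructure failures (e.g. the docker login), which exit 1
    command: "./tests.sh || exit 3"
    retry:
      automatic:
        - exit_status: 1   # only retry on the "infrastructure" exit status
          limit: 2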