Build stays in 'Running' forever for `depends_on: ["<unknown>"]`

The more precise dependencies of the DAG feature are awesome, and we’ve been excited to try it. However, we’ve found what seems to be a bug.

Minimal Example

Given

dag: true

steps:
  - label: "test it"
    command: "true"
    depends_on:
      - "not-known"

A build using that pipeline will sit in Running forever, even though there are no running steps at all, and so no chance of any further progress being made:

[screenshot: the build stuck in “Running” with no steps running]

Background

We have a dynamic trigger step: the configuration used to trigger it is computed at build time, so we have to pipeline-upload the trigger step. We still want later steps to wait for the trigger step to be successful, which, with the DAG, means depending on it (we give the dynamically uploaded step the same ID, no matter what). However, the upload step itself depends on earlier testing, so it may not run, and thus may never create the step upon which the later step depends.

Simplified significantly, it looks something like:

dag: true

steps:
  - id: "run-tests"
    command: "true"

  - id: "upload-step"
    command: >-
      echo '{"steps":[{"id":"dynamic-step","command":"true"}]}' | buildkite-agent pipeline upload
    depends_on:
      - "run-tests"

  - id: "after-trigger"
    command: "true"
    depends_on:
      - "dynamic-step"

If run-tests fails (e.g. command: "false"), upload-step never runs, and so after-trigger is depending on an ID that doesn’t exist. The build sits in Running forever: run-tests failed, upload-step never ran, dynamic-step doesn’t exist, and after-trigger waits.
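For reference, the stuck state reproduces with just that single change to the first step (everything else exactly as above):

  - id: "run-tests"
    command: "false"    # now fails, so upload-step is skipped and dynamic-step is never uploaded

(other steps unchanged)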

Attempted workaround

A workaround I tried was to have the after-trigger step also depend on an additional earlier step, e.g.

  - id: "after-trigger"
    command: "true"
    depends_on:
      - "run-tests"
      - "dynamic-step"

or

  - id: "after-trigger"
    command: "true"
    depends_on:
      - "upload-step"
      - "dynamic-step"

Or both.

The thinking was that if some of its dependencies fail (or otherwise don’t run), then the pipeline would register that after-trigger can never run and so fail the whole build, but it seems the unknown dependency still “wins”.

For instance, with the second adjustment above, the build still sticks at Running:

[screenshot: the build still stuck in “Running” despite the extra dependencies]

@huon :wave: thanks for the bug report!

The fact that missing dependencies essentially block forever is kinda by design. The thinking was that you may depend on “foo”, and a later build step may pipeline upload that “foo” step, at which point the dependency exists and can do a thing.
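As a minimal sketch of that intended pattern (the step IDs here are purely illustrative):

steps:
  - label: "uploader"
    command: >-
      echo '{"steps":[{"id":"foo","command":"true"}]}' | buildkite-agent pipeline upload

  - label: "needs foo"
    command: "true"
    depends_on:
      - "foo"    # unknown when the build starts, but exists once "uploader" runs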

I didn’t anticipate this particular scenario though…

I wonder what we should do. I’d still like to maintain the “wait for a dependency to appear” behaviour, but having a build like that stuck in Running forever doesn’t seem desirable either.

Maybe, as we go to finish off a build, we look at the scheduled steps and say “this is waiting on a dependency, but there are no more jobs to schedule, so let’s just finish the build because this isn’t going to run”.

What do you think @matthewd?

Yeah, I think we can keep the dependency “open” as long as there is some job currently scheduled/running (or an otherwise-unimpeded block step waiting for input), but fail it if the build has run out of things to do. If there are no jobs left that are capable of running, then there’s nothing left that could supply the missing dependency.

@huon for now, I think the best workaround I could suggest would be to only upload after-trigger at the same time you upload dynamic-step. Sorry that’s not a very good solution, but it’s the only one I can think of to allow your builds to properly fail while we’re working on getting this bug fixed.
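Concretely, that would look something like this sketch (reusing the IDs from above): drop after-trigger from the static pipeline and have upload-step inject it together with its dependency, so it only ever exists alongside dynamic-step:

dag: true

steps:
  - id: "run-tests"
    command: "true"

  - id: "upload-step"
    command: >-
      echo '{"steps":[{"id":"dynamic-step","command":"true"},{"id":"after-trigger","command":"true","depends_on":["dynamic-step"]}]}' | buildkite-agent pipeline upload
    depends_on:
      - "run-tests"

# If run-tests fails, upload-step is skipped and neither dynamic-step nor
# after-trigger is ever created, so the build can finish as failed instead of hanging.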

There is a general feature request: Can pipelines get an overall end-to-end timeout parameter? That way steps that never schedule are at least reaped after some time.
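For what it’s worth, the existing per-step timeout_in_minutes only kicks in once a job has started running, so (as far as I can tell) it wouldn’t help here; a step that never schedules is never timed out:

  - id: "after-trigger"
    command: "true"
    timeout_in_minutes: 30    # only counts once the job starts running
    depends_on:
      - "dynamic-step"        # never appears, so the timeout never starts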

It’s a great piece of design, and we’re trying to use it; the only problem is this bug. :smile:

Thanks for a workaround that actually works. That unblocks us!