Visualising/debugging build time with DAGs is difficult

We’ve got some builds with many steps (20+), and half of those are precisely linked together via key/depends_on fine-grained dependencies to maximise our parallelism, rather than having to do the steps in 3 separate stages with waits between them.

However, with a maximum depth of 3, and many different stages linked together, it’s now much harder for us to understand the critical path of our build time. With coarse-grained dependencies via wait, one can just look at the longest-running step within each “wait stage”, but with fine-grained dependencies via depends_on, we’re forced to click through the timelines of many steps to find the one that finished last, trying to remember the dependency graph so we can focus on the leaves/sinks of the graph (or consulting the <build url>/dag endpoint).

Better visualisation of the build steps in relation to each other, rather than just a statement of the time each one took, would be extremely helpful for improving our build times. For instance, a Gantt chart, or something like the /dag graph view with more timing info displayed, or even just some sort of highlighting/noting of whether each step is on the critical path of the build.

Is there any way to get this sort of info at the moment without manually pulling it out of the API and processing it?
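For reference, the manual processing I have in mind is roughly the following: start from the step that finished last, then repeatedly walk back to whichever dependency finished last. This is only a sketch under assumptions: the records and field names (key, depends_on, started_at, finished_at) are made up for illustration and are not the shape of any real API response, and the timings are invented.

```python
from datetime import datetime

# Hypothetical step records. The field names and timings are assumptions
# made for illustration, not actual API output.
steps = [
    {"key": "build", "depends_on": [],
     "started_at": "2020-01-01T10:00:00", "finished_at": "2020-01-01T10:06:09"},
    {"key": "lint", "depends_on": [],
     "started_at": "2020-01-01T10:00:05", "finished_at": "2020-01-01T10:02:00"},
    {"key": "test", "depends_on": ["build"],
     "started_at": "2020-01-01T10:06:10", "finished_at": "2020-01-01T10:15:20"},
    {"key": "package", "depends_on": ["build", "lint"],
     "started_at": "2020-01-01T10:06:10", "finished_at": "2020-01-01T10:13:25"},
    {"key": "deploy", "depends_on": ["test", "package"],
     "started_at": "2020-01-01T10:15:25", "finished_at": "2020-01-01T10:17:00"},
]


def finished(step):
    """Parse a step's finish time so steps can be compared."""
    return datetime.fromisoformat(step["finished_at"])


def critical_path(steps):
    """Return the keys on the critical path, first step to last.

    Starts at the step that finished last, then follows the dependency
    that finished last at each hop until a step with no dependencies.
    """
    by_key = {s["key"]: s for s in steps}
    current = max(steps, key=finished)
    path = [current["key"]]
    while current["depends_on"]:
        current = max((by_key[k] for k in current["depends_on"]), key=finished)
        path.append(current["key"])
    path.reverse()
    return path


print(critical_path(steps))
```

With the made-up data above, the walk goes from deploy back through test (which finished after package) to build. It ignores time spent waiting for an agent, so it only says which steps gate the end of the build, not where the time inside each step went.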

Hi @huon! There’s also a /waterfall view you can add onto your build page, similar to /dag. Have you seen that? We’d like to get that integrated into the build page directly, but for the moment we’ve got that undocumented URL you can use.

I had not; that’s almost perfect! Thanks.

Two questions about reading it. First, the bar lengths seem to be approximately, but not exactly, proportional to the time for a step, e.g. the 6:09 one is much shorter than 2/3rds of the 9:10 one, and the 7:15 one is shorter than the 6:09 one:

Additionally, it seems like the positions are approximately “real time” but not quite, e.g. the first two jobs started 5 seconds apart, but the gap between their bars seems to be more than that, and, similarly, the last job starts right after (5 seconds) the first one:

image

Am I misinterpreting what the lengths and positions are representing?

@keithpitt might have some more insight into that for you @huon!