Identify signal in Error: The command was interrupted by a signal

We have a job failing with “Error: The command was interrupted by a signal”. Is there a way to identify which signal is causing this failure?

The docs state that in this case the exit code should be 128 + signal number, but the timeline states the command exit code is 255 (success).
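For reference, here is a minimal Python sketch of how the 128 + signal number convention can be decoded. It assumes a POSIX system; `describe_exit_status` is a hypothetical helper name and not part of the agent or its docs.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret a shell-style exit status using the 128 + N convention."""
    if status == 255:
        # 255 does not map to a real signal, so it cannot be decoded this way.
        return "255: ambiguous, no signal can be derived from this status"
    if 128 < status < 255:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            return f"terminated by unknown signal {signum}"
        return f"terminated by {name} (signal {signum})"
    return f"exited normally with status {status}"


for status in (0, 137, 143, 255):
    print(status, "->", describe_exit_status(status))
```

With this convention a status of 137 would point at SIGKILL and 143 at SIGTERM, whereas 255 carries no signal information at all.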

The job is here:

Hi @simonbyrne!

Sorry, it doesn’t look like the agent reported which signal interrupted the command there. Sometimes it’s not possible to observe the signal, but the agent will do its best.

An exit code of 255 can mean that the agent or some intermediate part of the bootstrap and plugins didn’t receive an exit code and so reported -1 (which wraps around to 255 as an unsigned integer). It shouldn’t be reported as a success, sorry! Where are you seeing that? I can only see:
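To illustrate the wrap-around, here is a small Python sketch of what happens when -1 is squeezed into the 8 bits available for an exit status. The function name `as_exit_byte` is made up for this example; it is not how the agent itself is written.

```python
def as_exit_byte(code: int) -> int:
    """Keep only the low 8 bits, as happens to a process exit status."""
    return code & 0xFF


# -1 is all ones in two's complement, so the low byte is 0xFF.
print(as_exit_byte(-1))
# Ordinary small codes pass through unchanged.
print(as_exit_byte(1))
```

So a reported 255 may simply be a "no exit code available" placeholder of -1 after truncation, rather than anything the command itself returned.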

These docs:

state that 255 means “The agent was gracefully terminated”

255 The agent was gracefully terminated

I’ll double check this, but I think this might mean “the agent received a signal to gracefully terminate, and so it stopped the running job(s)”.

I can see that the agent didn’t run any further jobs. Is it possible the instance shut down for some reason?

Unfortunately, 255 can also happen in the scenario @sj26 highlighted. This is probably a bug in the agent (it should be possible to detect when it happens and not report 255 to the Buildkite API); however, we haven’t managed to find and squash the root cause.

We spawn agents on demand, and run them with --acquire-job, so they will disconnect once completed.

In this case the cause was apparently an OOM-killed process.
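For anyone who wants to see what an OOM-style kill looks like from the parent's side: the Linux OOM killer sends SIGKILL, and the rough Python sketch below simulates that by having a child send SIGKILL to itself. It assumes a POSIX system; `run_and_report` is a hypothetical helper, and this says nothing about how the agent itself observes its child processes.

```python
import signal
import subprocess
import sys


def run_and_report(argv):
    """Run a command and return (signal name or None, shell-style exit status)."""
    result = subprocess.run(argv)
    if result.returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        signum = -result.returncode
        return signal.Signals(signum).name, 128 + signum
    return None, result.returncode


# The child kills itself with SIGKILL, standing in for the OOM killer.
child = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"
name, status = run_and_report([sys.executable, "-c", child])
print(name, status)
```

When the direct parent waits on the process, the signal is visible like this. If an intermediate wrapper swallows it, only a generic status may reach the layers above.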

Oh, that’s interesting.

Given that, it does seem like there’s a gap in the way the agent detects and reports signals in some cases. We’ll dig into it and see what we can find. Thanks!