I’m seeing an issue when my build exceeds the capabilities of the EC2 instance I’m running my containerized builds on (I’m assuming the instances are running out of memory?). The build output in the Buildkite UI simply halts, and the build fails about 5 minutes later with an “Exited with status -1 (process killed or agent lost)” error. When I look through the EC2 logs I don’t see any errors or any indication of what went wrong or why. Is there a way to detect that the agent crashed? Is there any other diagnostic I can look at to gain insight into the problem? Currently the only way I can fix this is by experimenting. Previously I upped the size of my build instance to t2.micro, but now (as the project progresses and there’s more code to build) I have to remove build parallelism to work around the problem (i.e. remove -j4 from my make invocation).
I’m okay with running beefier hosts, but I first want to know what the problem is so I don’t waste more time trying to figure out why builds start failing every 3 months or so.