Elastic CI Stack EC2 instance restarting and terminating

Hi Team

I have been facing an issue since yesterday.
After creating a new Buildkite Elastic CI stack, the stack creation completes correctly, but when I trigger a job, the instance launches and starts initialising, then terminates automatically without ever starting the job, and keeps recreating itself.
I am not able to find anything specific in the logs.
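
For anyone retracing the debugging, the Auto Scaling activity history is another place to look for the termination reason (a sketch; the ASG name here is hypothetical, yours is listed in the stack's resources):

    # Sketch: list recent Auto Scaling activity and the reason the group
    # gives for each launch/termination. The ASG name is hypothetical.
    aws autoscaling describe-scaling-activities \
      --auto-scaling-group-name buildkite-elastic-ci-AgentAutoScaleGroup \
      --max-records 10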

I have used stack version v5.21.0 (https://s3.amazonaws.com/buildkite-aws-stack/v5.21.0/aws-stack.yml)
and also tried the latest. In both cases, I am getting the same issue.

The only error I could find in the logs when running v5.21.0 is below:

Oct 13 04:52:28 ip-172-31-2-39 cloud-init: Oct 13 04:52:28 cloud-init[2586]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-003 [1]
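
The failing script is written to disk on the instance, so it can be inspected and re-run by hand (before the ASG terminates the instance) to get the real error. Roughly like this (a sketch; these are the standard cloud-init locations, and it assumes SSH or SSM access to the instance):

    # Sketch: inspect and re-run the cloud-init user-data part that failed.
    # See the full cloud-init output, not just the WARNING line:
    sudo less /var/log/cloud-init-output.log

    # Look at the script cloud-init was running when it failed:
    sudo cat /var/lib/cloud/instance/scripts/part-003

    # Re-run it by hand to capture the real error and exit code:
    sudo bash -x /var/lib/cloud/instance/scripts/part-003; echo "exit: $?"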

I have used many parameters, but two of them are below.

    {
        "ParameterKey": "MaxSize",
        "ParameterValue": "1"
    },
    {
        "ParameterKey": "MinSize",
        "ParameterValue": "0"
    },
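
For context, the parameter file is passed to CloudFormation when creating the stack, along these lines (a sketch; the stack name and file path are hypothetical):

    # Sketch: create the Elastic CI stack from the v5.21.0 template
    # using a parameter file. Stack name and file path are hypothetical.
    aws cloudformation create-stack \
      --stack-name buildkite-elastic-ci \
      --template-url https://s3.amazonaws.com/buildkite-aws-stack/v5.21.0/aws-stack.yml \
      --parameters file://params.json \
      --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM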

The same version with the same parameters above was working fine until now. I deleted the stack, tried to recreate it, and started getting this issue.

Any ideas?

Thanks
Regards
Suraj

@surajthakur :wave:

Firstly, I’d advise removing any screenshots that contain any value named token, just in case.

Are you using any custom variables or is this an out-of-the-box install? Are there any errors in the actual CloudFormation stack output, rather than the EC2 logs?
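
For example, something like this will surface any failure reasons from the stack's event history (a sketch; the stack name is hypothetical):

    # Sketch: list CloudFormation events that failed, with their reasons.
    # The stack name is hypothetical.
    aws cloudformation describe-stack-events \
      --stack-name buildkite-elastic-ci \
      --query "StackEvents[?contains(ResourceStatus, 'FAILED')].[LogicalResourceId, ResourceStatusReason]" \
      --output table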

Cheers!

Hi @benmc

I am not using any custom variables.
My list of parameters is the same one I was using before.
I know a set of parameter names changed in v6.x,
but for v5.21.0 I used the same parameter file I had used before.

The strange thing is that this happens with the latest version as well. Both were working two days ago, but yesterday when I tried recreating the stacks, the instance started recreating itself again and again.

I do use some custom variables in the /environment hook. The variables are passed in when a build job is triggered.
They were working as expected before. If they are unset, the job fails, which is the expected behaviour.
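
The check in the /environment hook is roughly this shape (a sketch; the variable name is hypothetical):

    #!/usr/bin/env bash
    # Sketch of a Buildkite agent "environment" hook that fails the job
    # early when a required variable is unset. The variable name is
    # hypothetical.
    set -euo pipefail

    if [[ -z "${MY_REQUIRED_VAR:-}" ]]; then
      echo "MY_REQUIRED_VAR is not set, failing the job" >&2
      exit 1
    fi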

I don't use a custom AMI. The stack uses the AMI specified in the stack template.

Are you enabling any experimental features?

Not that I am aware of. I assume they are disabled by default.

Thanks @surajthakur!

If you could email any logs you have to support@buildkite.com, we'll be able to dive deeper into the issue and see what the cause is. It's not clear from the error snippet, but it looks like a file may be missing.

Cheers!

Thanks, I have sent the logs from /buildkite/elastic-ci/{instance-id} over email.
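
For anyone following along, those instance logs live in CloudWatch Logs and can be pulled with something like this (a sketch; requires AWS CLI v2, and the instance ID is hypothetical):

    # Sketch: tail the instance's log group from CloudWatch Logs.
    # The instance ID is hypothetical.
    aws logs tail "/buildkite/elastic-ci/i-0123456789abcdef0" --since 2h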

With regards to the latest-version stack errors, I have fixed them. A file was missing, which was causing the failures.

But I came across an interesting case which I could not understand.
I have two stacks
Stack1:
    queue: frontend
    tags: build=true

Stack2:
    queue: frontend
    tags: deploy=true

Now, when a job with tags
queue: frontend
build=true
is triggered, Stack1 obviously had another issue that kept its instance restarting, but an instance of Stack2 was also created by elastic-ci. The job was never placed on Stack2's instance (which I understand to be expected, since its tags don't match).

Can having the same queue name on two stacks cause this confusion?

I am still looking into the v5.21.0 stack issue I have been facing.

Thanks

The original issue is solved.
It turned out to be a problem with the bootstrap script. The logs didn't clearly say what the error was, but some non-zero exit code was marking the instance unhealthy, which caused the instances to be terminated. Removing a block of code from the bootstrap script worked for me. I might need to investigate the root cause, but that's a separate issue.
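
For context, the failure mode is roughly this pattern: any step in the boot script exiting non-zero trips an error handler that marks the instance unhealthy, after which the Auto Scaling group terminates and replaces it. A sketch of the mechanism (not the stack's actual script; the handler here is illustrative):

    #!/usr/bin/env bash
    # Illustrative sketch of the failure mode, not the stack's real
    # bootstrap script: a failing command trips the ERR trap, the instance
    # is marked unhealthy, and the ASG terminates and replaces it.
    set -euo pipefail

    mark_instance_unhealthy() {
      local instance_id
      instance_id="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
      aws autoscaling set-instance-health \
        --instance-id "$instance_id" \
        --health-status Unhealthy
    }
    trap mark_instance_unhealthy ERR

    # ... bootstrap steps; any non-zero exit here triggers the trap ...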
