Autoscaling group kept creating instances in a loop

Hi Team,

While testing a bootstrap script, I found something weird. I had 3 steps in my pipeline, so the elastic stack would create 3 instances using the autoscaling group. I am not sure if there is an issue in the bootstrap script, but since the instances were unhealthy (or could not pass the health check), the autoscaling group kept terminating the unhealthy instances and creating 3 new ones every minute.

Here is the bootstrap script I added using the parameter BootstrapScriptUrl. It was an S3 URL. I used the same bucket where I keep the secrets, and uploaded the bootstrap script encrypted using --sse aws:kms.

#!/bin/bash

set -euo pipefail

NODE_VERSION=v18.16.0
NVM_VERSION=v0.39.0


curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/${NVM_VERSION}/install.sh | bash

export NVM_DIR="$HOME/.nvm" && \
    [ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" && \
    nvm install $NODE_VERSION && \
    nvm alias default $NODE_VERSION && \
    nvm use default
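
For reference, the upload looked roughly like this (the file name and bucket name are placeholders, not my actual values):

aws s3 cp bootstrap.sh s3://<managed-secrets-bucket>/bootstrap.sh --sse aws:kms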

Attaching the messages from the autoscaling group events here.


Hello @surajthakur !

Thanks for the question and all the info. We’re taking a look at this for you and will get back to you soon!

Have a great day!
Michelle

Hello again @surajthakur!

I’ve been doing a pretty deep dive on this one and was able to reproduce the problem you are having.

I took a look at the logs emitted by the elastic stack for the instance used in my test setup (you can find your own via the location pointed to at the bottom of this page) and found the following lines:

2023-05-18T13:02:55.389-04:00	export NVM_DIR="$HOME/.nvm"
2023-05-18T13:02:55.389-04:00	[ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
2023-05-18T13:02:55.389-04:00	bash: line 13: HOME: unbound variable

So it appears the script is failing because the $HOME variable is unset while the script is running. Since this script runs as root, you could likely just replace $HOME with /root. That’s likely where your problem is, since that’s exactly where it failed in my test setup.
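
As a minimal sketch of a fix (untested on my side, and assuming you keep using nvm in the script), you could either give the script a HOME explicitly or hard-code root's home for nvm:

# Option 1: set HOME near the top of the bootstrap script so the existing lines keep working
export HOME=/root

# Option 2: skip HOME entirely and point nvm at root's home directly
export NVM_DIR="/root/.nvm"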

Hi @Michelle

Thanks for your immediate response and for finding the solution. I will give it a try today.

I have a question outside of this: how can I set the environment variables for the agent that are relevant from the Buildkite configuration file? Environment variables | Buildkite Documentation

Thanks
Regards
Suraj

Also, should this be considered a bug? It went into a loop and kept creating and terminating instances; within 15 minutes I had 60 instances created and terminated.
I was worried that AWS might raise an exception for such frequent instance creation.
I see it as a good feature of the autoscaling group that it creates a new instance when one is unhealthy, but at the same time, should there be some conditions or limits?

Hello @surajthakur!

James here, jumping in for Paula due to TZ differences :wave:

For the first part - you can use the AgentEnvFileUrl parameter, which points to an HTTPS/S3 URL of a config file to load in (which could, for example, be version controlled in a Git repo). Alternatively, you could create a standalone AMI to use with your stack, which would give you full control over its versions/env.
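
For example, an agent env file kept in S3 could look something like this (a rough sketch; the variable names below are common agent-level settings used purely as illustrations, not taken from your setup):

export BUILDKITE_AGENT_TAGS="queue=default,os=linux"
export BUILDKITE_AGENT_PRIORITY="5"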

On the second part of the question, I’d point you to what Michelle was saying about the $HOME environment variable not being bound: the behaviour you saw comes down to what happens in the script given in BootstrapScriptUrl. The stock scaling parameters could also be worth checking (e.g. ScaleOutFactor and ScaleInIdlePeriod) to control the number of instances you are seeing.

Hope that also assists :+1:

James

Hi @james.s

Thanks for the response.
I tried AgentEnvFileUrl and set it to an S3 file (I used the bucket for managed secrets).

These are the file contents:

export BUILDKITE_GIT_CLEAN_FLAGS="-fdq"

But this didn’t take effect when I ran the build, maybe because the docs say AgentEnvFileUrl variables are not passed to the builds.
So I tried the agent hooks instead, putting the same contents in the S3 file /env, and the build picked up the git clean flags.

I have two questions.

  1. What is the difference between the hooks /env and /environment? I understand /environment is used for secrets.
  2. What does ScaleOutFactor mean? I could not understand it from its description.

Thanks in advance.

Hi @surajthakur!

You are correct. The AgentEnvFileUrl is meant for Agent environment variables but not for job environment variables.

Sounds like you got it working via the /env file in your S3 bucket though, which is a great solution to your problem.

  1. There really isn’t a difference between /env and /environment. It’s all down to personal preference.
  2. ScaleOutFactor basically controls how fast the agents are scaled out. 1.0 is the default, so setting it higher means your agents will be provisioned more quickly, and setting it lower means they will be scaled out more slowly relative to how many agents are required.

Having said that about ScaleOutFactor, I’m not sure how applicable this option is to your situation, since your instances are being terminated because they are not responding properly.
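
As a rough worked example of my understanding of the maths (my own illustration, not numbers from your stack): the factor simply multiplies the calculated scale-out change, so

requested instances = required instances × ScaleOutFactor
e.g. 10 required × 1.5 = 15 requested, or 10 × 0.5 = 5 requested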

To answer your original question (“I was worried that AWS might raise an exception for such frequent instance creation”): this is what the Auto Scaling group, an AWS service that our template configures, is designed to do. It wants to make sure there is an operable instance; if there isn’t, it will try to provision a new one. A new instance has some time to respond to a simple query, usually an HTTP request, and if it doesn’t, it is considered to be in a fault condition and is replaced. This is really useful if, say, the provisioning depends on a third party and that third party is having issues of its own: your instances wouldn’t be responsive because they are missing that dependency, so they rotate and reboot until they are operational.

This is normal operating behaviour for AWS. Some ASGs (Auto Scaling Groups) can have upwards of 100 or more instances cycling like this in some circumstances. AWS is used to this behaviour and shouldn’t take any adverse action because of it. If you are concerned, you can always talk to AWS support to clarify.

Have a great day! :slight_smile:
Michelle

Hi @Michelle ,

Thanks for your detailed response.

I was confused earlier because the git flags live in buildkite.cfg, so I tried to set them with AgentEnvFileUrl, but since the git clean flags need to be passed to the job, they cannot be set up with AgentEnvFileUrl.

From the instance logs I also got confused by the log line

+ /usr/local/bin/bk-fetch.sh s3://<buildkite-managedsecretsbucket-name>/agent-env-file /var/lib/buildkite-agent/env  

which I initially thought replaced the use case of s3://bucketname/env, but on testing I found that s3://bucketname/env is pulled and set at the job level.

Thanks for clarifying how AWS Auto Scaling groups function; I learned something here.

For copying the hooks, should I be using the bootstrap script, or is there some other functionality that I haven’t explored?

Thanks
Regards
Suraj

Hello again @surajthakur!

Thanks for the question, and I also hope what @Michelle discussed was equally helpful.

If what you mean is creating the agent-level hooks at instance boot - the BootstrapScriptUrl would suffice if that workflow suits you (i.e. creating/moving the agent hooks into their relevant location from the defined script; Ubuntu assumed).
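
A minimal sketch of what that could look like in the bootstrap script (the bucket name, hook names and hooks path are assumptions based on a stock Linux agent install, not your actual setup):

#!/bin/bash
set -euo pipefail

# Copy agent hooks from S3 into the agent's hooks directory (path assumed)
HOOKS_DIR=/etc/buildkite-agent/hooks
aws s3 cp s3://<managed-secrets-bucket>/hooks/environment "$HOOKS_DIR/environment"
aws s3 cp s3://<managed-secrets-bucket>/hooks/pre-command "$HOOKS_DIR/pre-command"

# Hooks need to be executable and owned by the agent user
chmod +x "$HOOKS_DIR/environment" "$HOOKS_DIR/pre-command"
chown buildkite-agent:buildkite-agent "$HOOKS_DIR/environment" "$HOOKS_DIR/pre-command"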

I’d also suggest potentially looking at building a custom AMI with the hooks already in place (instead of having to bootstrap them on launch as above).

Hope that also assists!

Cheers

Thanks @james.s
I did realise that creating a custom AMI would be the best way here.

No worries @surajthakur!

Agreed also - both should meet what you are after, but an AMI built with exactly what you require (i.e. the agent hook setup) would be more efficient.

Please let us know if anything else comes to mind - thanks too :slightly_smiling_face:

Cheers