AWS Stack upgrade from v5.21.0 to v6.4.0

Agents are getting terminated immediately after getting assigned to a job. Please refer to UI and ASG snapshots.

I also looked into the Lambda autoscaler logs, and it seems like it's using a 1.5.0 dev version:
2023/08/31 00:29:58 buildkite-agent-scaler version 1.5.0 dev.

I have the scale-in idle period set to 2700s, max size set to 2, and min size set to 0.


Hi @farhan ,

Welcome to the Buildkite Support Community! :wave:

We’ve had a similar issue reported before when upgrading a v5.21.0 stack to v6.0. However, creating a new stack with v6.4.0 should work fine. Please let us know if this works for you.

Cheers!

@lizette
Thanks for the reply. I didn’t create a change set; I deployed a new v6.4.0 stack with the same parameter settings we are using for v5.21.0 (of course with updated names).
Please find the parameter settings below:

[
  {
    "ParameterKey": "AgentsPerInstance",
    "ParameterValue": "1"
  },
  {
    "ParameterKey": "ArtifactsBucket",
    "ParameterValue": "artifacts-bucket"
  },
  {
    "ParameterKey": "AssociatePublicIpAddress",
    "ParameterValue": "true"
  },
  {
    "ParameterKey": "BuildkiteAgentRelease",
    "ParameterValue": "stable"
  },
  {
    "ParameterKey": "BuildkiteAgentTags",
    "ParameterValue": "autoscale=true"
  },
  {
    "ParameterKey": "BuildkiteAgentTimestampLines",
    "ParameterValue": "false"
  },
  {
    "ParameterKey": "BuildkiteAgentTokenParameterStorePath",
    "ParameterValue": "token_path"
  },
  {
    "ParameterKey": "BuildkiteQueue",
    "ParameterValue": "test-v6"
  },
  {
    "ParameterKey": "BuildkiteTerminateInstanceAfterJob",
    "ParameterValue": "false"
  },
  {
    "ParameterKey": "BuildkiteAgentEnableGitMirrors",
    "ParameterValue": "false"
  },
  {
    "ParameterKey": "CostAllocationTagName",
    "ParameterValue": "CostCenter"
  },
  {
    "ParameterKey": "CostAllocationTagValue",
    "ParameterValue": "AutoscaleCI"
  },
  {
    "ParameterKey": "ECRAccessPolicy",
    "ParameterValue": "poweruser"
  },
  {
    "ParameterKey": "EnableCostAllocationTags",
    "ParameterValue": "true"
  },
  {
    "ParameterKey": "EnableDockerExperimental",
    "ParameterValue": "false"
  },
  {
    "ParameterKey": "EnableDockerLoginPlugin",
    "ParameterValue": "true"
  },
  {
    "ParameterKey": "EnableDockerUserNamespaceRemap",
    "ParameterValue": "false"
  },
  {
    "ParameterKey": "EnableECRPlugin",
    "ParameterValue": "true"
  },
  {
    "ParameterKey": "EnableSecretsPlugin",
    "ParameterValue": "true"
  },
  {
    "ParameterKey": "InstanceCreationTimeout",
    "ParameterValue": "PT5M"
  },
  {
    "ParameterKey": "InstanceTypes",
    "ParameterValue": "m6i.large"
  },
  {
    "ParameterKey": "OnDemandPercentage",
    "ParameterValue": "100"
  },
  {
    "ParameterKey": "MaxSize",
    "ParameterValue": "2"
  },
  {
    "ParameterKey": "MinSize",
    "ParameterValue": "0"
  },
  {
    "ParameterKey": "ScaleOutFactor",
    "ParameterValue": "1.0"
  },
  {
    "ParameterKey": "ScaleInIdlePeriod",
    "ParameterValue": "2700"
  },
  {
    "ParameterKey": "ScaleOutForWaitingJobs",
    "ParameterValue": "true"
  },
  {
    "ParameterKey": "RootVolumeName",
    "ParameterValue": "/dev/xvda"
  },
  {
    "ParameterKey": "RootVolumeSize",
    "ParameterValue": "75"
  },
  {
    "ParameterKey": "RootVolumeType",
    "ParameterValue": "gp3"
  },
  {
    "ParameterKey": "RootVolumeEncrypted",
    "ParameterValue": "true"
  },
  {
    "ParameterKey": "SecretsBucket",
    "ParameterValue": "our-bucket"
  },
  {
    "ParameterKey": "AvailabilityZones",
    "ParameterValue": "us-east-1a,us-east-1d"
  },
  {
    "ParameterKey": "ManagedPolicyARNs",
    "ParameterValue": "our policies"
  },
  {
    "ParameterKey": "EnableInstanceStorage",
    "ParameterValue": "true"
  }
]

Hi @farhan ,

I do not see any issues with the above parameters. You mentioned that the agents are terminated after being assigned jobs. Are you able to provide us with some agent logs? You can read about how to get the agent logs here. You can send these through to support@buildkite.com and we’ll have a closer look at the errors you are facing.

Hi @farhan, after some further experimentation, we have found an issue with your parameters.

You have EnableInstanceStorage=true, but the instance type, m6i.large, does not have instance storage. We recommend you use an instance type like m6id.large instead if you want to use instance storage, or disable instance storage by setting EnableInstanceStorage=false.

We’ve tried the following combinations of parameters:

EnableInstanceStorage  InstanceTypes
true                   m6i.large
true                   m6id.large
false                  m6i.large

and we could only reproduce the issue you’ve encountered with the first one (EnableInstanceStorage=true with m6i.large).
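For anyone who wants to catch this before deploying, here is a minimal Python sketch of a pre-deploy check over a parameter file in the format posted above. It is not part of the stack. The function names and the `HAS_INSTANCE_STORAGE` lookup are made up for illustration, and the lookup only covers the two instance types discussed in this thread. In practice it could be filled from the EC2 DescribeInstanceTypes API, which reports an InstanceStorageSupported flag per instance type.

```python
from typing import Dict, List

# Assumption: hand-maintained lookup for illustration only, covering just the
# instance types mentioned in this thread.
HAS_INSTANCE_STORAGE: Dict[str, bool] = {
    "m6i.large": False,
    "m6id.large": True,
}


def params_to_dict(params: List[dict]) -> Dict[str, str]:
    """Flatten a CloudFormation-style parameter list into a key/value dict."""
    return {p["ParameterKey"]: p["ParameterValue"] for p in params}


def storage_misconfigurations(params: List[dict]) -> List[str]:
    """Return the instance types known to lack instance storage while
    EnableInstanceStorage=true. Unknown types are skipped, not guessed."""
    values = params_to_dict(params)
    if values.get("EnableInstanceStorage", "false").lower() != "true":
        return []
    types = [t.strip() for t in values.get("InstanceTypes", "").split(",") if t.strip()]
    return [t for t in types if HAS_INSTANCE_STORAGE.get(t) is False]


def make_params(enable: str, instance_types: str) -> List[dict]:
    """Build a two-entry parameter list for the combinations tried above."""
    return [
        {"ParameterKey": "EnableInstanceStorage", "ParameterValue": enable},
        {"ParameterKey": "InstanceTypes", "ParameterValue": instance_types},
    ]


if __name__ == "__main__":
    for enable, types in [("true", "m6i.large"), ("true", "m6id.large"), ("false", "m6i.large")]:
        print(enable, types, storage_misconfigurations(make_params(enable, types)))
```

Only the first combination should come back with a non-empty list, which matches what we saw when reproducing the issue.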

We think the experience could be improved, however, so we are going to adjust the stack’s code and documentation to warn the user when they’ve made such a misconfiguration, rather than erroring and causing the instances to terminate early.

@triarius Of course. However, just for your information: even with EnableInstanceStorage=true, stack v5.21.0 was working fine for us on m6i.large instances. I'm curious to know what changed in v6.4.0 so that this parameter can no longer be ignored when it is set to true and the instance doesn’t have NVMe storage.

Ok, thanks for letting us know. We have fixed this, and it will be out in the next release.

What changed between v5 and v6 is that we started to exit the start-up scripts when they errored. Previously, they were erroring silently when the stack was misconfigured like this. After the change in #1206, when you misconfigure a stack with instance types that don’t have instance storage but set EnableInstanceStorage=true, there will be CloudWatch logs that inform you of the misconfiguration, but otherwise the stack should operate as if EnableInstanceStorage=false.
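To make the difference concrete, here is a rough Python model of the two behaviours described above. This is only an illustration and not the stack's actual start-up code. The function name, the `exit_on_error` flag, and the device path in the usage note are all hypothetical.

```python
import logging
from typing import List

log = logging.getLogger("startup-sketch")


class StartupError(RuntimeError):
    """Raised when a start-up step fails and the script exits on errors."""


def setup_instance_storage(enabled: bool, nvme_devices: List[str], exit_on_error: bool) -> bool:
    """Return True if instance storage ends up in use.

    exit_on_error=True models the v6.4.0 behaviour (the error aborts start-up);
    exit_on_error=False models the behaviour after the fix (warn and carry on
    as if EnableInstanceStorage=false).
    """
    if not enabled:
        return False
    if nvme_devices:
        return True

    message = "EnableInstanceStorage=true but the instance has no instance storage"
    if exit_on_error:
        # Start-up aborts here, so the instance never becomes a healthy agent.
        raise StartupError(message)

    # After the fix: record the misconfiguration and fall back.
    log.warning("%s; continuing as if EnableInstanceStorage=false", message)
    return False
```

With an empty device list, `setup_instance_storage(True, [], exit_on_error=True)` raises, while `setup_instance_storage(True, [], exit_on_error=False)` logs a warning and returns False. A call such as `setup_instance_storage(True, ["/dev/nvme1n1"], exit_on_error=True)` returns True in either mode.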
