The latest version of the stack, v4.3.1, introduces an
EnableExperimentalLambdaBasedAutoscaling parameter, which when set to
true disables the default Amazon AutoScaling powered scaling behaviour in favour of a Lambda that handles the scale-out. By avoiding the intrinsic wait-times of native autoscaling and polling at a much faster rate, the stack scales up from zero to whatever capacity you need much, much faster. We are seeing wait time reductions of up to 50% on builds with cold stacks.
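As a rough sketch of the sort of calculation a scale-out lambda makes, here's a minimal Python version. The function and parameter names below are illustrative, not the stack's actual internals:

```python
import math

def desired_capacity(scheduled_jobs: int, running_jobs: int,
                     agents_per_instance: int,
                     min_size: int, max_size: int) -> int:
    """Instances needed to service the current job backlog,
    clamped to the Auto Scaling group's bounds.
    (Illustrative only; the stack's lambda may compute this differently.)"""
    needed = math.ceil((scheduled_jobs + running_jobs) / agents_per_instance)
    return max(min_size, min(max_size, needed))

# 25 jobs across instances running 4 agents each needs 7 instances
print(desired_capacity(20, 5, 4, min_size=0, max_size=10))  # 7
```

Because the lambda polls and sets the desired count directly, there's no waiting on CloudWatch alarm evaluation periods before capacity appears.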
Scale down is handled with the new
--disconnect-after-idle-timeout flag that was added to the agent in v3.10.0. After the agent has been idle for the configured period (ScaleDownPeriod), it disconnects, then terminates its instance and decrements the autoscaling group's desired count atomically.
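Terminating the instance and decrementing the desired count in one atomic step maps to the Auto Scaling API's TerminateInstanceInAutoScalingGroup call with ShouldDecrementDesiredCapacity set. A hedged sketch, assuming boto3; the agent's actual implementation may differ:

```python
# Sketch of the atomic terminate-and-decrement step. This just shows
# the API shape; the real agent may implement it differently.
def terminate_request(instance_id: str) -> dict:
    """Parameters for TerminateInstanceInAutoScalingGroup. Decrementing
    the desired count in the same call prevents the ASG from immediately
    launching a replacement instance."""
    return {
        "InstanceId": instance_id,
        "ShouldDecrementDesiredCapacity": True,
    }

# Usage (requires AWS credentials; not run here):
# import boto3
# boto3.client("autoscaling").terminate_instance_in_auto_scaling_group(
#     **terminate_request("i-0123456789abcdef0"))
```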
The result is much faster scale-out and a much simpler scale-in process that no longer requires lifecycled.
We’d love to hear how it works for your stacks and what issues you encounter. The plan is to make this the default for v5.0.0.
Why not handle scale-in with the lambda too?
We tried! For reasons unknown to us, ASGs don't fire lifecycle hooks when scaling in by directly setting
DesiredCount. They do, however, fire them when terminating an instance with
TerminateInstanceInAutoScalingGroup. We were perplexed too.
What about the other scaling configuration options?
With the new autoscaler enabled, the following options are respected:
Conversely, the following options are completely ignored:
We might consider implementing
ScaleUpAdjustment if there is interest; it could provide a min-bound for scale up.
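If ScaleUpAdjustment were implemented as a min-bound, the effect might look something like this. This is purely hypothetical behaviour, sketched to illustrate the idea; the parameter currently does nothing with the new autoscaler:

```python
def apply_scale_up_adjustment(current: int, computed: int,
                              scale_up_adjustment: int) -> int:
    """Hypothetical min-bound: whenever we scale up at all, grow by at
    least scale_up_adjustment instances, even if the backlog only calls
    for a smaller increase."""
    if computed > current:
        return max(computed, current + scale_up_adjustment)
    return computed

# Backlog calls for 3 instances, but with an adjustment of 4 we jump to 6
print(apply_scale_up_adjustment(current=2, computed=3, scale_up_adjustment=4))  # 6
```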
How about all the metrics the stack used to publish?
With the new scaling enabled, we disable the old buildkite-agent-metrics lambda in favour of a smaller, nimbler all-in-one lambda that collects metrics and does the scaling. The only metric it publishes is
RunningJobsCount. You can still run buildkite-agent-metrics directly; it works nicely without a queue, so a single lambda can publish metrics for your whole organization.
Let us know if there are any metrics you really miss and we’ll consider adding them back.
Is anything broken?
I think I might have broken
Help! My instances are occasionally very slow!
If you are using an instance type that has burstable CPU credits, you might be running into this: https://serverfault.com/questions/740498/why-do-ec2-t2-instances-sometimes-start-with-zero-cpu-credits
You can avoid cycling instances so often by using a much longer
ScaleDownPeriod.
Why won’t my ASGs provision more than 10 instances at a time?
It turns out there is a hidden setting on ASGs that limits increases to batches of 10. If you email AWS support, they will change this for you.