Experimental Lambda-based Scaler 🦑

The latest version of the stack, v4.3.1, introduces an EnableExperimentalLambdaBasedAutoscaling parameter which, when set to true, disables the default Amazon AutoScaling-powered scaling behaviour in favour of a Lambda that handles the scale-out. By avoiding the intrinsic wait times of native autoscaling and polling at a much faster rate, the stack scales up from zero to whatever capacity you need much, much faster. We are seeing wait-time reductions of up to 50% on builds with cold stacks.
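If you’re updating an existing stack from the CLI, enabling it might look something like this (a sketch only — the stack name here is a placeholder, and every other parameter you’ve customised needs its own UsePreviousValue entry):

# Sketch: stack name is a placeholder; add UsePreviousValue entries for
# any other parameters your stack already sets.
aws cloudformation update-stack \
  --stack-name buildkite-elastic-stack \
  --use-previous-template \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM \
  --parameters \
    ParameterKey=EnableExperimentalLambdaBasedAutoscaling,ParameterValue=true \
    ParameterKey=BuildkiteAgentToken,UsePreviousValue=true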

Scale-down is handled with the new --disconnect-after-idle-timeout flag that was added to the agent in v3.10.0. After the agent has been idle for a while (configured with ScaleDownPeriod), it disconnects, and the instance then terminates itself and decrements the autoscaling group’s desired count atomically.
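As an illustration, starting an agent with the flag looks like this (the value is in seconds; the 600 here is just an example — in the stack it’s driven by ScaleDownPeriod):

# Illustrative only: disconnect after 10 idle minutes.
buildkite-agent start --disconnect-after-idle-timeout 600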

The result is a much, much faster scale-out and a much simpler scale-in process that no longer requires lifecycled.

We’d love to hear how it works for your stacks and what issues you encounter. The plan is to make this the default for v5.0.0.

FAQ

Why not handle scale-in with the lambda too?

We tried! For some unknown reason, ASGs don’t fire lifecycle hooks when scaling in via directly setting DesiredCapacity. They do, however, fire them when terminating an instance with TerminateInstanceInAutoScalingGroup. We were perplexed too.
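To illustrate the difference (the ASG name and instance ID here are placeholders):

# Scaling in by setting the desired capacity directly: no lifecycle hooks fire.
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name my-buildkite-agents \
  --desired-capacity 0

# Terminating a specific instance through the ASG API: lifecycle hooks do fire.
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id i-0123456789abcdef0 \
  --should-decrement-desired-capacity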

What about the other scaling configuration options?

With the new autoscaler enabled, the following options are respected:

  • MinSize
  • MaxSize
  • ScaleDownPeriod
  • InstanceCreationTimeout

Conversely, the following options are completely ignored:

  • ScaleUpAdjustment
  • ScaleDownAdjustment
  • ScaleCooldownPeriod

We might consider implementing ScaleUpAdjustment if there is interest; it could provide a minimum bound for scale-out.

How about all the metrics the stack used to publish?

With the new scaling enabled, we disable the old buildkite-agent-metrics Lambda in favour of a smaller, nimbler all-in-one Lambda that collects metrics and does the scaling. It still publishes ScheduledJobsCount and RunningJobsCount, but that’s all. You can still run buildkite-agent-metrics directly; it works nicely without a queue, so a single Lambda can power metrics for your whole organization.
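If you do run it directly, invoking the binary looks roughly like this (a sketch — check the buildkite-agent-metrics README for the flags your version supports):

# Poll every 30 seconds; omitting -queue collects metrics across all queues.
buildkite-agent-metrics -token "$BUILDKITE_AGENT_TOKEN" -interval 30s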

Let us know if there are any metrics you really miss and we’ll consider adding them back.

Is anything broken?

I think I might have broken BuildkiteTerminateInstanceAfterJob :thinking:

Help! My instances are occasionally very slow!

If you are using an instance type that has burstable CPU credits, you might be running into this: https://serverfault.com/questions/740498/why-do-ec2-t2-instances-sometimes-start-with-zero-cpu-credits
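One way to check whether you’ve run out of credits is to look at the instance’s CPUCreditBalance metric (a sketch — the instance ID is a placeholder, and the date invocation assumes GNU date):

# Average CPU credit balance over the last hour for a burstable (T2/T3) instance.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average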

You can avoid cycling instances so often by using a much longer ScaleDownPeriod.

Why won’t my ASGs provision more than 10 instances at a time?

It turns out there is a hidden setting on ASGs that limits increases to batches of 10. If you email AWS support, they will change this for you.


I’ve been using this scaler for a couple of weeks now and I really like the decrease in allocation time for build instances; we’ve seen a decrease from about 3.5 mins to 1.5 mins (a qualitative assessment, I didn’t actually go back and measure too carefully).

However, our build times are not decreasing because it doesn’t respect ScaleUpAdjustment. Our builds generally follow a fan-out/fan-in approach where we build a Docker image and then run multiple tests on it simultaneously. Previously we’d wait 3.5+ minutes for the first set of images to become available, but then tests would immediately have an agent available as soon as the build finished. Now only a single agent is scheduled for the build phase, but we also have to wait on agents in the test phase, so the net effect is about the same overall. It would be really useful for us if you implemented ScaleUpAdjustment.

Thanks for all the great work!


Good idea, I’ve got a PR up at https://github.com/buildkite/buildkite-agent-scaler/pull/12. Feedback welcome.

We’ve seen some behaviour where MinSize doesn’t look like it’s being respected. Had it set to ‘8’ in the stack, yet all our build agent instances were terminated. Suggestions as to how to debug?

Which stack version or autoscaler version?

Hah, sorry!
I’ve just had a conversation with our contractor-ops person and turns out they were trying something out and had terminated all the instances manually!

No bugs to see here!

Thanks :-)

No problems! Glad it’s all working!

FYI: the fastest a CloudWatch trigger can invoke the scaler Lambda is once per minute, but you can invoke it more frequently than that by utilizing a Step Function (the alternative being to keep the Lambda running).

[flowchart of the Step Function invocation loop]

I’ve implemented this for our scaler and it works like a charm.
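Roughly like this (a sketch rather than my exact definition; the names, role ARN, and Lambda ARN are placeholders). Trigger the state machine once per minute from CloudWatch, and each execution invokes the scaler a couple of times, giving you sub-minute polling:

# Each execution invokes the scaler Lambda twice, 10 seconds apart; extend
# the chain (or add a Choice-based loop) for finer granularity.
aws stepfunctions create-state-machine \
  --name buildkite-scaler-loop \
  --role-arn arn:aws:iam::123456789012:role/stepfunctions-scaler-role \
  --definition '{
    "StartAt": "InvokeScaler",
    "States": {
      "InvokeScaler": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:buildkite-agent-scaler",
        "Next": "WaitTenSeconds"
      },
      "WaitTenSeconds": {
        "Type": "Wait",
        "Seconds": 10,
        "Next": "InvokeScalerAgain"
      },
      "InvokeScalerAgain": {
        "Type": "Task",
        "Resource": "arn:aws:lambda:us-east-1:123456789012:function:buildkite-agent-scaler",
        "End": true
      }
    }
  }'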


Yeah, I opted for running the function for longer in the new autoscaler. It seemed simpler than Step Functions.

Have you seen any behavior where:

  1. A MinSize is set (e.g. 1)
  2. An idling agent exits and its host is removed from the ASG, causing the desired size (0) to fall below the min size (1)
  3. A new machine is created to fill the gap.

So machines keep being shut down (because of idle agents) and spinning up.


@phan.le sorry, we haven’t — the agent shouldn’t exit by itself when idle, only if you ask it to (or it gets scaled in by AWS). Want to send us a message at support@buildkite.com with the specifics of your case?

Edit: unless you’re using --disconnect-after-idle-timeout? In which case the idle timeout and the min size might be fighting!

Yeah, apologies @phan.le, this is unfortunately something that is very hard to do with the lambda scaler. Maintaining a minimum set of agents is really hard as they terminate when idle and can’t co-ordinate with each other to keep at least N running.

I’d recommend that you use a much longer disconnect after idle timeout for now!

Got it, thanks for responding!

The reason we want to keep a small minimum of agents is to reduce latency when a commit is pushed.


In case others run into anything similar in the future:

For some reason, setting disconnect-after-idle-timeout did not fully work for me (at least, as I interpreted its functionality from this forum post). It would stop the buildkite-agent service on the underlying server/instance, but it would not ‘terminate the instance and decrement the autoscaling group desired count atomically’ as stated above.

I’m not sure why this is the case, but regardless, a workaround is as follows:
In your Buildkite agent’s cloud-init (or similar) script, after you start the buildkite-agent service, run a background script that does something like this:

#!/bin/bash
# Give the agent time to register and pick up its first job.
sleep 300
# Poll until the buildkite-agent service is no longer running (match
# case-insensitively, since the status wording varies between init systems).
while service buildkite-agent status | grep -qi 'running'; do
  sleep 60
done
# Look up this instance's ID and region from the EC2 instance metadata service.
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
AWS_REGION=$(curl -s http://169.254.169.254/latest/meta-data/placement/region)
# Terminate ourselves once the agent has stopped.
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID" --region "$AWS_REGION"

Depending on your ASG config, you may need additional commands before the terminate to decrement the ASG’s desired count accordingly. Feels pretty janky, but it works.
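One possible way to do the terminate and the desired-count decrement in a single call (untested in this particular setup) is the ASG-level terminate command:

# Terminate via the ASG API instead of EC2: atomically decrements the desired
# capacity so the ASG doesn't immediately replace the instance.
aws autoscaling terminate-instance-in-auto-scaling-group \
  --instance-id "$INSTANCE_ID" \
  --should-decrement-desired-capacity \
  --region "$AWS_REGION"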


Hey @max!

Thank you for sharing your workaround with the Buildkite community!

Can you confirm that you are using the Elastic Stack to manage your Buildkite agents? If so, what version? The --disconnect-after-idle-timeout flag is designed only to stop the Buildkite agent when it’s idle; it doesn’t handle terminating the instance or adjusting the autoscaling group’s desired count on its own. Those actions are managed automatically when you’re using the Elastic Stack. To dig in, can you share your custom configuration, including the bootstrap and hooks, to help identify any underlying issues?

You can send those details to support@buildkite.com.

Thanks,