Agents not attaching

I’ve got two queues provisioned via the Elastic Stack, both set to scale up from zero agents on demand. The default queue works consistently, but the other queue often boots agents that fail to register with Buildkite. The one time I was able to connect to an affected instance, I found logs indicating the agent couldn’t reach its metadata service. Is it possible there’s a race condition in the AMI that causes the agent to query the metadata service before that service has started? The only difference I can see between the queues is that the one that works consistently runs on c5.large and the one that doesn’t runs on t2.small.

I see this when the Systems Manager Agent (SSM Agent) attempts to start:

2019-11-04 23:01:49 INFO Entering SSM Agent hibernate - EC2RoleRequestError: no EC2 instance role found
caused by: EC2MetadataError: failed to make EC2Metadata request
caused by: <?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
                 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title>404 - Not Found</title>
 </head>
 <body>
  <h1>404 - Not Found</h1>
 </body>
</html>

2019-11-04 23:02:30 INFO Got signal:terminated value:0xbb5670
2019-11-04 23:02:30 INFO Stopping agent
2020-04-21 01:17:44 INFO Entering SSM Agent hibernate - RequestError: send request failed
caused by: Post https://ssm.us-east-1.amazonaws.com/: dial tcp 52.46.157.70:443: i/o timeout

And I see this in the agent logs:

Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: _           _ _     _ _    _ _                                _
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: | |         (_) |   | | |  (_) |                              | |
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: | |__  _   _ _| | __| | | ___| |_ ___    __ _  __ _  ___ _ __ | |_
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: | '_ \| | | | | |/ _` | |/ / | __/ _ \  / _` |/ _` |/ _ \ '_ \| __|
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: | |_) | |_| | | | (_| |   <| | ||  __/ | (_| | (_| |  __/ | | | |_
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: |_.__/ \__,_|_|_|\__,_|_|\_\_|\__\___|  \__,_|\__, |\___|_| |_|\__|
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: __/ |
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: http://buildkite.com/agent                    |___/
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:15:54 NOTICE Starting buildkite-agent v3.16.0 with PID: 5010
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:15:54 NOTICE The agent source code can be found here: https://github.com/buildkite/agent
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:15:54 NOTICE For questions and support, email us at: hello@buildkite.com
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:15:54 INFO   Configuration loaded path=/etc/buildkite-agent/buildkite-agent.cfg
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:15:54 INFO   Agents will disconnect after 900 seconds of inactivity
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:15:54 INFO   Fetching EC2 meta-data...
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:15:54 INFO   Successfully fetched EC2 meta-data
Apr 21 01:15:54 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:15:54 INFO   Registering agent with Buildkite...
Apr 21 01:16:24 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:16:24 WARN   Post https://agent.buildkite.com/v3/register: dial tcp 54.165.113.253:443: i/o timeout (Attempt 1/30 Retrying in 10s)
Apr 21 01:17:04 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:17:04 WARN   Post https://agent.buildkite.com/v3/register: dial tcp 3.224.111.118:443: i/o timeout (Attempt 2/30 Retrying in 10s)
Apr 21 01:17:44 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:17:44 WARN   Post https://agent.buildkite.com/v3/register: dial tcp 3.224.111.118:443: i/o timeout (Attempt 3/30 Retrying in 10s)
Apr 21 01:18:24 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:18:24 WARN   Post https://agent.buildkite.com/v3/register: dial tcp 3.224.111.118:443: i/o timeout (Attempt 4/30 Retrying in 10s)
Apr 21 01:19:04 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:19:04 WARN   Post https://agent.buildkite.com/v3/register: dial tcp 54.165.113.253:443: i/o timeout (Attempt 5/30 Retrying in 10s)
Apr 21 01:19:44 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:19:44 WARN   Post https://agent.buildkite.com/v3/register: dial tcp 54.165.113.253:443: i/o timeout (Attempt 6/30 Retrying in 10s)
Apr 21 01:20:24 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:20:24 WARN   Post https://agent.buildkite.com/v3/register: dial tcp 3.224.111.118:443: i/o timeout (Attempt 7/30 Retrying in 10s)
Apr 21 01:21:04 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:21:04 WARN   Post https://agent.buildkite.com/v3/register: dial tcp 3.224.111.118:443: i/o timeout (Attempt 8/30 Retrying in 10s)
Apr 21 01:21:44 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:21:44 WARN   Post https://agent.buildkite.com/v3/register: dial tcp 54.165.113.253:443: i/o timeout (Attempt 9/30 Retrying in 10s)
Apr 21 01:22:24 ip-10-0-22-223 buildkite-agent: 2020-04-21 01:22:24 WARN   Post https://agent.buildkite.com/v3/register: dial tcp 54.165.113.253:443: i/o timeout (Attempt 10/30 Retrying in 10s)
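
The timeouts above are plain TCP dial failures, so one way to narrow this down is to probe the same endpoints from the affected instance and record how long after boot they first become reachable. The following is only a minimal sketch, not anything shipped with the agent or the Elastic Stack: `wait_for_tcp` is a hypothetical helper name, and the default of 30 attempts with a 10-second delay simply mirrors the retry pattern in the log.

```python
import socket
import time
from typing import Callable


def wait_for_tcp(
    host: str,
    port: int,
    attempts: int = 30,
    delay: float = 10.0,
    timeout: float = 5.0,
    connect: Callable = socket.create_connection,  # injectable so it can be tested offline
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Try to open a TCP connection to host:port.

    Returns the 1-based attempt number that succeeded, or 0 if every
    attempt failed. (Hypothetical helper for diagnosis only.)
    """
    for attempt in range(1, attempts + 1):
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError as err:  # covers timeouts and refused connections
            print(f"{host}:{port} attempt {attempt}/{attempts} failed: {err}")
            if attempt < attempts:
                sleep(delay)
            continue
        conn.close()
        return attempt
    return 0
```

Calling `wait_for_tcp("agent.buildkite.com", 443)` and `wait_for_tcp("169.254.169.254", 80)` early in boot would show whether outbound connectivity and the metadata endpoint come up late, or never come up at all, on the t2.small instances.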

I increased the instance size from t2.small to c5.large and I don’t see this issue nearly as often. Luckily, this queue doesn’t spin up much, so the extra cost isn’t a big deal, but it’s disappointing to have to run such large instances just to work around what looks like a boot-time race condition.

Hmmm, that’s pretty weird behaviour! Would you be able to send this through to support@buildkite.com so I can get a few extra eyes on it for you?

Cheers.
Jason