Slow 'Preparing working directory' step with git mirrors and EFS

Hi Buildkite (and other community members),

I’m looking for advice on how to speed up our ‘Preparing working directory’ step.

At best it’s 2-5 seconds, but sometimes it slows to more than 30 seconds.

A bit about our configuration/setup:

Biggest repo is ~120 MB.

We are using the git-mirrors feature and running agents in ASGs on EC2. We have several queues: small (20 agents per instance, to run pipeline expansion and wait on AWS CLI calls), medium (1 agent/instance) to run Docker builds, and large (1 agent/instance) to run tests.
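For context, each instance runs a single buildkite-agent process configured roughly like this (values are illustrative; `spawn` controls how many agents one process runs, and `tags` routes jobs to the queue):

```
# /etc/buildkite-agent/buildkite-agent.cfg on a "small" instance (illustrative)
tags="queue=small"
spawn=20

# "medium" and "large" instances use spawn=1 with the matching queue tag
```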

Our agents on EC2 come up as jobs become available and terminate after a couple of minutes idle. It is common for an agent to be ‘fresh’, although we do get some agent re-use when several PRs are being built at once.

We use an EFS mount to share the ‘git-mirrors’ folder across agents. We wanted to reduce internet traffic (downloads from GitHub), as this incurs a high NAT data-transfer cost. Once one agent has updated the mirror (so EFS contains the desired commits), it’s available to the other agents. I believe there is a filesystem lock, so the other agents block until the lock is released.
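The relevant wiring looks roughly like this (file-system ID and mount point are illustrative):

```
# /etc/fstab: mount the shared EFS volume (illustrative file-system ID)
fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /mnt/efs nfs4 nfsvers=4.1,hard,timeo=600,retrans=2 0 0

# /etc/buildkite-agent/buildkite-agent.cfg: point the agent's mirrors at EFS
git-mirrors-path="/mnt/efs/git-mirrors"
```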

Our usual pattern is that the first few pipeline steps expand into a set of jobs (dynamic pipeline). Each of these jobs runs concurrently and does its own checkout on a (potentially new) agent/instance. These parallel checkouts seem to exhibit the majority of the slowdown.

Is the issue the large number of small files being transferred from EFS to local disk when cloning from the mirror into the job directory?
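One detail that may be relevant: cloning with `--reference` doesn’t actually copy the mirror’s objects into the new repo, it records the mirror as an “alternate” object store, so object reads keep going back to the mirror (on EFS in our setup) during checkout. A small self-contained experiment, with throwaway local repos standing in for GitHub and the EFS mirror:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# A tiny "upstream" repo and a bare mirror of it, standing in for
# GitHub and the EFS git-mirror respectively.
git init -q upstream
git -C upstream -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m init
git clone -q --mirror upstream mirror

# Clone with --reference, as the agent does for the job checkout.
git clone -q --reference "$tmp/mirror" upstream checkout

# The clone records the mirror as an alternate object store rather than
# copying its objects; this prints the path to the mirror's objects dir.
cat checkout/.git/objects/info/alternates
```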

I’ll include a screenshot of our EFS mounts and their spec.



Happy to answer any questions. Interested to hear any advice.

Thanks,
Michael

I connected to an EC2 instance hosting the agents and ran a parallel git clone command.

It did demonstrate a slowdown… I wonder if I need to scale up EFS throughput.

seq 21 41 | xargs -P 20 -I % sh -c "time git clone -q --reference /var/lib/buildkite-agent/git-mirrors/git-github-com-repo -- git@github.com:org/repo.git test%"
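To isolate EFS from network effects, a variation of the same experiment clones straight from the mirror rather than from GitHub. This sketch uses a throwaway local mirror; substituting the real path under /var/lib/buildkite-agent/git-mirrors (and wrapping the pipeline in `time`) would measure pure EFS read performance:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Throwaway local mirror; in real use this is the repo under
# /var/lib/buildkite-agent/git-mirrors on the EFS mount.
git init -q src
git -C src -c user.email=ci@example.com -c user.name=ci \
  commit -q --allow-empty -m init
git clone -q --mirror src mirror

# 20 parallel clones straight from the mirror, no GitHub involved.
seq 1 20 | xargs -P 20 -I % git clone -q "$tmp/mirror" "test-%"

ls -d test-* | wc -l   # count of clones created
```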

Hey @Mic!

Welcome back to the Buildkite Support Community! :wave:

I think you’re right: EFS is likely struggling with many small file reads and filesystem locks when repos are cloned in parallel.

There aren’t any obvious signs from the metrics you’ve attached, but I would need to investigate further to be sure. If there’s significant latency when accessing the git-mirrors folder this should be visible from CloudWatch.
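For example, on a General Purpose file system the PercentIOLimit CloudWatch metric shows how close you are to that mode’s I/O ceiling; sustained values near 100% would support this theory. A sketch of the query (the file-system ID is illustrative and the command needs AWS credentials):

```shell
aws cloudwatch get-metric-statistics \
  --namespace AWS/EFS \
  --metric-name PercentIOLimit \
  --dimensions Name=FileSystemId,Value=fs-0123456789abcdef0 \
  --start-time "$(date -u -d '1 day ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 --statistics Maximum
```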

What I can tell is that your current EFS share has its Performance Mode set to General Purpose, which may not be the best option for how you are using it. I would suggest moving to the Max I/O option, as it allows greater aggregate throughput for these parallel clones. One caveat: Performance Mode can only be chosen when a file system is created, so in practice this means creating a new Max I/O file system and migrating the git-mirrors folder across.

You can read more about the Performance Modes here: Amazon EFS performance - Amazon Elastic File System

Do you want to give that a go and see if you notice any improvement?

Thanks!