Any experience setting up a shared remote build cache using Bazel?

Horizontally scaling with Buildkite is what the Elastic-CI stack was designed for. With Bazel, it requires setting up a persistent remote cache that all your agents can access. When we wrote the Bazel docs article, we published a first pass, intending to follow up later with details on how to set up a remote build cache that horizontally scaled agents can access. We could use your help with that. Do you know of a reference implementation we can use as a starting point?


We used a small bazel-remote instance; it handles keeping the cache size down, etc. A single small instance has been easily enough for our team so far.


We did ours in stages.

  1. Turn on the local filesystem disk cache (which bazel actually treats as a remote cache).

Do this via a checked in .bazelrc and now every computer, CI or workstation, gets a disk cache.

Effort: about the same amount of time it takes to read this and the paragraph of docs.

  2. Make a GCS bucket, since Bazel supports GCS out of the box. To authenticate, you need either a GCS keypair file or, if you’re running inside GCE, an instance service account with permissions on the GCS bucket: storage.objectViewer + storage.objectCreator, or storage.objectAdmin.

Make sure that you choose a cache silo key that corresponds to something unique about your build environment. For example, if building inside a Docker container, calculate a stable hash of the image’s inputs. The plugin has a good setup for doing that. If building without a container, perhaps use the unique ID of your machine image.
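As a rough illustration of "a stable hash of the image’s inputs", here is a minimal Python sketch. The function name `cache_silo_key`, the choice of SHA-256, the 16-character truncation and the `extra` parameter are all assumptions made for this example, not something taken from the plugin or from our setup.

```python
import hashlib
from pathlib import Path


def cache_silo_key(root, relative_paths, extra=()):
    """Return a short, stable key derived from the files that define the image.

    Hypothetical helper: `relative_paths` are paths relative to `root`
    (for example the Dockerfile and any pinned dependency lists).
    """
    digest = hashlib.sha256()
    # Sort so the key does not depend on the order the inputs were listed in.
    for rel in sorted(relative_paths):
        content = (Path(root) / rel).read_bytes()
        # Hash the relative name (not the absolute path) so that the key is
        # the same regardless of where the repository is checked out.
        digest.update(rel.encode("utf-8") + b"\0")
        digest.update(hashlib.sha256(content).digest())
    # Optional extra strings, e.g. a machine-image ID or a toolchain version.
    for item in sorted(extra):
        digest.update(item.encode("utf-8") + b"\0")
    return digest.hexdigest()[:16]
```

Calling it with something like `cache_silo_key(".", ["Dockerfile"])` gives a key that only changes when one of the listed inputs changes, which is the property you want from a silo.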

You absolutely don’t want to pollute your build cache by allowing differently set up Bazel hosts to write to it. If you do, you lose your hermeticity, which is a lot of the point of using Bazel in the first place.

If you want to use AWS S3, https, or Azure, there are wrappers for those, see the docs. GCS is trivial.

Make sure that workstations don’t get to write to the cache; same reason as above, except workstations are even more likely to pollute it since developers install all sorts of things (I certainly do).
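One way to keep workstations read-only from the Bazel side is to have a wrapper script choose the upload flag based on whether it is running in CI. The sketch below is an assumption-laden example: `remote_cache_write_flags` is a made-up name, and it relies on Buildkite agents setting `BUILDKITE=true` for jobs. `--remote_upload_local_results` is a standard Bazel flag. The flag is only advisory, so granting developers just storage.objectViewer on the bucket is the stronger guard.

```python
import os


def remote_cache_write_flags(env=None):
    """Return Bazel flags that allow cache writes only on CI agents.

    Hypothetical helper for an in-repo wrapper script.
    """
    env = os.environ if env is None else env
    # Buildkite sets BUILDKITE=true inside jobs; anything else is treated
    # as a workstation and may only read from the remote cache.
    in_ci = env.get("BUILDKITE") == "true"
    if in_ci:
        return ["--remote_upload_local_results=true"]
    return ["--remote_upload_local_results=false"]
```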

This gets you to a remote cache. Took us a day or so. Builds at the time (circa mid 2018?) went from 2-10min to flat-line 2min running within GCE.

Effort: this took us a day or two. We did it with an in-repo .bazelrc, and then an in-repo tools/bazel script to deal with the cache-silo-key calculation to supply to the flag that is the bucket address.

  3. Tune and optimise both. The above sets you up so that on each build, if your local disk cache lacks objects, those will be fetched regardless of need. This is expensive and wasteful.

I personally know less about this; a teammate did it.

Effort: a few days off and on last month. The flags interact in tricky ways, plus we want to not use the remote cache when building on master, since post-merge we want to do a from-clean build to prove to ourselves we definitely still can.
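To give a flavour of the "don’t use the remote cache on master" part, here is a minimal Python sketch of how a wrapper could pick flags per branch. It is not our actual script: `remote_cache_read_flags` is a hypothetical name, and using `--noremote_accept_cached` (a standard Bazel flag that stops Bazel accepting remotely cached action results) is one possible choice among several. It only covers the remote cache; clearing local state for a truly from-clean build is a separate concern.

```python
def remote_cache_read_flags(branch, clean_branches=("master",)):
    """Return extra Bazel flags for the given branch name.

    Hypothetical helper: on the listed branches, remote cache hits are
    ignored so every action is rebuilt; other branches use the cache.
    """
    if branch in clean_branches:
        # Results can still be uploaded, so a master build can repopulate
        # the cache even though it does not read from it.
        return ["--noremote_accept_cached"]
    return []
```

On Buildkite the branch name is available as the `BUILDKITE_BRANCH` environment variable, so a wrapper could call `remote_cache_read_flags(os.environ.get("BUILDKITE_BRANCH", ""))`.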

Our builds (which in the meantime have crept up to ~25m in some cases) are now ~5min.

  4. After this, venture into remote build execution (please tell us how you go, we haven’t so far).

Hey there - I’m the teammate that @petemounce referenced above. :slight_smile:

We’ve done a whole lot of work here, and now we have around 8-10 different teams sharing a single GCS-bucket-based Bazel cache. Pete’s notes above are good places to start, and I would add a few notes and specific shortcuts that might help in particular:

  1. If possible, run your workloads inside highly-similar Docker containers, as this makes it much simpler to make guarantees about system configuration that might cause cache poisoning.

    • Ideally, you would capture anything your build depends on that isn’t listed as an explicit Bazel dependency in your cache key.
    • Even though builds of higher-level languages like Go, JS/TS, Java, etc. in Bazel are themselves hermetic and self-hosted, C/C++ builds will frequently pull in build-host system libraries for linking.
    • There are frequently C/C++ targets implicitly included in even higher level language builds - eg for building host-system tools like protoc or rules_docker helpers, or if you use CGO - and Bazel can’t guarantee that these builds are 100% hermetic. Ensuring that the system’s glibc, etc are the same is the only way to guarantee that these will always be compatible.
  2. The fewer different cache shards you need to maintain, the higher your cache hit rate, and the bigger your performance gains will be, so try to minimize drift in important dependencies that require

  3. Whether you’re in Docker or not, having a safe cache key like Pete mentioned is really critical - here’s how the Kubernetes project generates their unique cache keys for Docker images, and we’ve used a similar approach as well for our Docker environments.

  4. Especially if using an HTTP-based cache, you want your cache server to be as close to co-located with your build agents as possible, since the agent<->cache communication can be very chatty, especially for large projects (you’ll have at minimum one cache request per action in your build graph, plus in many cases a blob fetch if the output is already cached).

    • We run our Bazel CI workloads in GCP, so GCS is very low-latency and high-bandwidth for us, but if you’re running agents in AWS, you’ll want to have your cache live in AWS as well.
    • There are some more advanced features you can get by using a gRPC-based cache instead of an HTTP cache (batch requests, metrics, monitoring, statistics, Bazel-level authentication, etc), but the setup and maintenance can be more complex. We went with an HTTP cache because we could just set up a cloud bucket and point at that, with no further server-side configuration necessary, but depending on the use case, YMMV.
  5. You’ve got multiple choices of remote cache hosts, with different tradeoffs.

    • Bazel-remote is an unofficial HTTP/gRPC remote server built by the tech lead of Bazel - it’s well-maintained, and built by people who fully understand Bazel from the inside, so it’s a pretty good default choice unless you want something specific
    • Greenhouse is the HTTP Bazel cache service built for Kubernetes, and as such is designed from the ground up to run as a Kubernetes service.
    • BuildBuddy is a new commercial startup that’s offering a more full-featured, supported service for both remote caching and remote execution that includes some nice side benefits like built-in analysis of Bazel log and performance data. I’ve only briefly looked at it, and it’s brand new, but it looks like it could be promising.
    • And of course, as mentioned, anything that acts as an HTTPS server (including various cloud buckets like S3, GCS, etc) can serve as an HTTP-based cache server.
  6. Lastly, one neat trick that we’re using to minimize the runtime configuration of Bazel for using these caches is to add a system-level .bazelrc to any Docker containers or CI agents to configure them to always use the cache without ever even needing to modify any actual build processes or scripts. Ours includes minimal configuration, like this:

build:ci --google_default_credentials
build:ci --remote_http_cache=https://path/to/bazel/remote/cache/with/cache/key
# If this config file is present, then we're definitely in a CI context
build --config=ci

Hope some of that helps!


This is awesome, useful info, folks! Thanks. I’ll see if I can get the docs updated.

Remote execution sounds like it’s the fun part :blush:. Do you have any ideas around how you would attack that or is that too far in the future? I just watched BazelCon 2018 Day 1: Faster Builds With Remote Execution and Caching

Our first stop would be the Google Cloud Remote Build Execution alpha program, ahead of trying to set up our own infrastructure for it. We’re fans of … not running our own infra.


@petemounce @SeanR How do I apply the silo key properly? I found only info about it here:

I used a very basic setup:

build --remote_http_cache=
build --google_default_credentials
build --remote_default_platform_properties="properties:{name:\"cache-silo-key\" value:\"1\"}"

When I remove the --remote_default_platform_properties option or replace it with --remote_default_exec_properties=cache-silo-key=1, the same artifacts are used.

If you’re using an HTTP cache, it’s pretty easy in this case - you can use a subdirectory of your http cache per silo key.

EG your example would become:

build --remote_http_cache=
build --google_default_credentials

This is how we run ours, and it works great. This does not translate to gRPC-based remote caches, but for a GCS/HTTP cache, it’s the simplest way to go by far.
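To make the subdirectory-per-silo idea concrete, here is a small Python sketch that builds the cache URL and the matching .bazelrc lines. The helper names and the bucket name `example-bazel-cache` are hypothetical. `https://storage.googleapis.com/<bucket>` is the usual HTTP endpoint for a GCS bucket, and the flag is spelled `--remote_http_cache` as elsewhere in this thread (newer Bazel releases use `--remote_cache` instead).

```python
from urllib.parse import quote


def siloed_cache_url(base_url, silo_key):
    """Append the silo key to the cache URL as a single path segment."""
    # Percent-encode the key so a stray "/" cannot create nested paths.
    return base_url.rstrip("/") + "/" + quote(silo_key, safe="")


def siloed_bazelrc_lines(base_url, silo_key):
    """Return .bazelrc lines pointing Bazel at the siloed cache path."""
    return [
        "build --remote_http_cache=" + siloed_cache_url(base_url, silo_key),
        "build --google_default_credentials",
    ]
```

For example, `siloed_cache_url("https://storage.googleapis.com/example-bazel-cache", "1")` returns `https://storage.googleapis.com/example-bazel-cache/1`, so builds with a different key never see each other’s artifacts.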

Yeah, I ended up with this after reading this thread:
Thanks anyway!
