Best practices for handling failed "apply" when Terraform runs in Buildkite?

Hi Everyone,

I currently have a Buildkite pipeline that automatically runs terraform validate and terraform plan every time we commit to a branch. Once the branch is merged into master, the pipeline is triggered again; a user then has to manually unlock it, and the change gets deployed. Sometimes terraform plan works just fine, but terraform apply fails after we merge into master. How do other people deal with this? Are there any best practices to follow? Also, is there a quick and easy way to redeploy the last successful pipeline?

Hi @egibert!

I have some experience with running Terraform in Buildkite at my own company. What specific failure are you seeing when running apply after merge?

Regards,
Kevin Gillette

Hi @kgllette,

The one that comes to mind is an EC2 instance that has API termination disabled, which the reviewers didn’t notice, so Terraform can’t destroy it. Another is trying to terminate a service that depends on another service. Is there a way to roll back the last pipeline? How do you prevent other people from deploying after you when the pipeline is broken? Thanks

Hi @egibert

Welcome to the Buildkite community.

Regarding question “We had problems where the terraform plan works just fine but when we merge it into master the terraform apply fails. How do other people deal with this, are there any best practices to follow?”

One option I can think of is to run terraform plan again after the merge into master, to make sure the plan succeeds before triggering terraform apply. Or is it that you see no issues when running terraform plan, but terraform apply then fails because of an underlying EC2 API or service constraint that terraform plan cannot validate?
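For the termination-protection case specifically, a check on the plan itself might catch the problem before apply runs. Below is a minimal sketch, assuming the post-merge step saves the plan with terraform plan -out=tfplan and converts it with terraform show -json tfplan, and that the refreshed state in the plan reflects the instance's current disable_api_termination value. The function name is hypothetical and the check only covers aws_instance resources; it is not something Buildkite or Terraform provides out of the box.

```python
import json


def protected_instance_destroys(plan: dict) -> list:
    """Return addresses of aws_instance resources that the plan would
    delete while termination protection is still enabled.

    `plan` is the parsed output of `terraform show -json <planfile>`.
    """
    blocked = []
    for resource in plan.get("resource_changes", []):
        change = resource.get("change", {})
        # "before" is null (None) for resources that are only being created.
        before = change.get("before") or {}
        will_delete = "delete" in change.get("actions", [])
        if (
            resource.get("type") == "aws_instance"
            and will_delete
            and before.get("disable_api_termination") is True
        ):
            blocked.append(resource["address"])
    return blocked


def load_plan(path: str) -> dict:
    """Read a plan JSON file, e.g. one written by
    `terraform show -json tfplan > plan.json` (file name is an assumption)."""
    with open(path) as handle:
        return json.load(handle)
```

A pipeline step could call protected_instance_destroys(load_plan("plan.json")) and fail the build if the returned list is not empty, so the problem shows up at review time rather than during apply. Dependency failures between services are harder to detect this way, since they often only surface when the provider API rejects the call.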

Regarding the questions “Is there a way to do a rollback of the last pipeline? How do you avoid other people from deploying after you when the pipeline is broken?”

It depends on what you mean by “rollback” of a pipeline. A pipeline runs in response to changes in a repository, so if you want to roll back what a pipeline did, you should roll back the changes made to the repo and let the pipeline run again. However, if the pipeline triggered Terraform changes that terminate instances or other infrastructure services, reverting the changes might recreate those instances or services, but you might still need to take extra steps to recover their data.

So this looks more like a question on the Terraform side: how to recover the resources that were changed when terraform apply fails halfway through its execution.

As for how to prevent other people from deploying after you when the pipeline is broken: one option is to archive the pipeline when it encounters a failure, so that no further builds can be triggered on it until the fix is in and the pipeline is unarchived.
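If you want to script the archive step rather than do it by hand in the UI, something like the sketch below could work. It assumes the Buildkite REST API exposes archive and unarchive endpoints under /v2/organizations/{org}/pipelines/{pipeline}/, so please check the current API docs before relying on it. The function name, slugs, and token are placeholders, and the code only builds the request without sending it.

```python
import urllib.request

# Base URL of the Buildkite REST API (v2).
API_ROOT = "https://api.buildkite.com/v2"


def build_archive_request(org_slug, pipeline_slug, api_token, archive=True):
    """Build (but do not send) a POST request that archives or unarchives
    a pipeline. The endpoint paths are an assumption to verify against the
    Buildkite REST API documentation.
    """
    action = "archive" if archive else "unarchive"
    url = f"{API_ROOT}/organizations/{org_slug}/pipelines/{pipeline_slug}/{action}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {api_token}"},
    )
```

Passing the returned request to urllib.request.urlopen would send it, using a token that is allowed to modify pipelines. Calling the function again with archive=False builds the matching unarchive request for when the fix has landed.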

Please let us know your thoughts on this, and once again, welcome to the Buildkite community.