Terraform Drift Detection and Remediation

Spacelift team - Jun 17 - Dev Community

Organizations using Terraform to manage their infrastructure as code (IaC) need a reliable solution to ensure their infrastructure's actual state aligns with its intended state.

Terraform stores information about the infrastructure it manages in state files, and any change to the infrastructure Terraform manages that Terraform has not triggered is called "drift."

In this post, we will explore the reasons why drift happens, its associated risks, and the options available to remediate it.

What is Terraform drift?

Infrastructure drift refers to changes made outside of your Infrastructure as Code (IaC) tool to resources that the tool manages. By extension, Terraform drift refers to resources that are managed by Terraform but have been changed outside the Terraform workflow.

This can happen for several reasons, such as:

  • Manual changes - During a severity-one incident, a DevOps engineer may make manual changes just to get the systems up and running again. Those changes then have to be ported into the code afterward. If that step is forgotten, the configuration drifts.
  • External processes - You may have automated processes outside Terraform's control, such as autoscaling actions triggered by cloud providers or external scripts that make changes to your infrastructure.
  • Resource eviction - Resources can be evicted or deleted because of cost-saving measures or policy violations, which causes drift.

Drift is a significant concern and can lead to inconsistencies that complicate infrastructure management.

Common sources of infrastructure drift

Consistency is a key goal when managing infrastructure using Terraform. With IaC, you can keep multiple environments consistent, irrespective of how many times they are recreated.

Infrastructure drift undermines that consistency. Here are some of its common sources:

Manual changes

Manual changes are a primary cause of infrastructure drift. These can be made either deliberately or unintentionally.

If a deployed system's configuration needs to be changed to address a critical production incident, doing it manually can be the fastest way to fix it. Similarly, network configurations are sometimes tweaked by hand to test a fix for a network security vulnerability. These are examples of intentional manual changes to the infrastructure.

However, sometimes users are not even aware they have made a manual change to the infrastructure. Identifying the components managed by Terraform is not always intuitive. When users log into the web console, they may modify resources without realizing that Terraform tracks them, so the change never reaches Terraform's state file. Executing scripts that make API calls to the cloud platform is another possible source of unintentional change.

Irrespective of whether the change is deliberate or unintentional, if it is not ported back into the Terraform configuration, the result is drift.

Automation tools

Organizations and large teams implement multiple automation tools to streamline operations. These tools all have specific workflows and lifecycle management capabilities, and responsibilities can overlap if boundaries of influence for these tools are unclear or wrongly implemented.

For example, using Terraform for infrastructure management alongside a configuration management tool like Ansible creates a high possibility of infrastructure drift. Although Ansible is responsible for managing the application layer of a business service, it also has infrastructure provisioning capabilities.

Ironically, the more automation tools you implement, the more manual effort is required to reconcile the changes they create in Terraform state files.

User scripts

Cloud platforms facilitate the triggering of certain event-driven user-defined scripts. These scripts allow users to perform actions on a resource or execute API calls to modify another resource.

For example, when creating Linux-based EC2 instances in AWS, it is possible to execute bash/shell scripts when the instance boots. These scripts are provided in the user_data field when creating an instance from the web console. Similarly, Terraform provides a way to supply the same using IaC.

Although providing user_data is not mandatory, it enables various automation capabilities to manage virtual machines. User_data scripts are used to run upgrades, install security patches, install dependencies, invoke various system processes, etc., as soon as the system boots.

Bash and shell scripts are powerful because they can change any network configuration of the system and execute API calls to modify other resources. This has the potential to introduce drift in infrastructure.

Risks and impact of infrastructure drift

Terraform IaC manages the infrastructure's end-to-end lifecycle. It is responsible for creating and recreating cloud resources and consistently introducing changes. To do this successfully, up-to-date information is saved in the state files.

Essentially, infrastructure drift is a set of untracked changes. These untracked changes pose risks of varying severity and could have a drastic impact on the system. Conversely, some changes may be beneficial for the system, improving attributes like reliability, security, and performance.

Given the nature of infrastructure drift in the context of Terraform IaC, if it's not addressed, it can create blind spots while managing the infrastructure in scope. Infrastructure changes that fall out of the scope of Terraform management go unnoticed.

Security vulnerabilities

Infrastructure drift can open security holes that attackers are able to exploit. This has the potential to cause serious damage not just to the system but to the business in general. For example, when security group rules are manually modified to test a use case that needs public access and are never reverted, the consequences can range from data breaches to the entire system being compromised.

Compliance violations

Automated policy execution or manual configuration changes can lead to breaches of regulatory requirements. Examples include drift that exposes users' personal data to the public, or actions that enable unauthorized access to data and resources.

Performance and operational difficulties

Infrastructure drift can impair system performance because of latency or reduced network throughput, underprovisioning of resources, disabling of auto-scaling configurations, etc. Drift also makes it challenging to identify, analyze, and investigate the root cause of the issues. Unknown and untracked changes introduce challenges that increase downtime and also impact the mean time to resolution.

Higher costs

Changes caused by infrastructure drift can have wide-ranging financial implications. Provisioning of unutilized cloud resources generates unnecessary cloud platform costs, and, because the changes are not tracked, the cost of remediation and maintenance also increases.

You can learn more about drift in this article: Infrastructure Drift Detection and How to Fix It With IaC Tools.

Terraform drift example

As a simple example, suppose you have a Terraform configuration that creates three EC2 instances. After these changes are deployed, someone goes into the AWS console and deletes one of the instances manually.

This causes drift because the actual state of your infrastructure no longer matches your Terraform state or your Terraform configuration.

To solve the drift, you have two options:

  • Reapply the Terraform code to recreate the missing instance.
  • Change the Terraform configuration to reflect the current state of your infrastructure.
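The scenario above can be modeled in a few lines. The following is a minimal Python sketch, not a Terraform API: the instance IDs are made up, and in practice the two sets would come from the Terraform state and from the cloud provider's API.

```python
from __future__ import annotations


def find_drift(state_ids: set[str], actual_ids: set[str]) -> dict[str, set[str]]:
    """Compare resource IDs recorded in state with IDs that actually exist."""
    return {
        # Tracked by Terraform but no longer present in the cloud account.
        "missing": state_ids - actual_ids,
        # Present in the cloud account but unknown to Terraform.
        "unmanaged": actual_ids - state_ids,
    }


# Hypothetical IDs: Terraform created three instances, one was deleted by hand.
state_ids = {"i-aaa111", "i-bbb222", "i-ccc333"}
actual_ids = {"i-aaa111", "i-ccc333"}

drift = find_drift(state_ids, actual_ids)
print(drift["missing"])  # the instance that was deleted in the console
```

A non-empty "missing" set corresponds to the first remediation option (recreate the instance), while accepting the deletion corresponds to the second (reduce the instance count in the configuration).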

How to detect Terraform drift

When infrastructure drift occurs, the first challenge is to identify it. As we have seen, drift has multiple sources, so it is not possible to track where and when the drift happens without a monitoring mechanism.

You can identify the existence of drift by running a couple of Terraform commands. The terraform refresh command updates the state file to match the real infrastructure (recent Terraform versions deprecate it in favor of the -refresh-only option of terraform plan and terraform apply). The terraform plan command compares the refreshed state with the current configuration and proposes a plan of action. The output of the plan command helps us identify drift.

If the Terraform configuration has not changed and a plan still proposes modifying or recreating a resource, something else has modified the infrastructure. However, drift is only discovered at the moment the commands are run. In practice, that is often when we check the status before implementing other intended changes.
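One way to script this check is to rely on the -detailed-exitcode flag of terraform plan, which returns 0 when there are no changes, 1 on error, and 2 when changes are present. The sketch below is a minimal Python wrapper, assuming the terraform binary is on the PATH and the working directory has already been initialized. The function names are illustrative, not part of any Terraform tooling.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
EXIT_MEANINGS = {0: "in_sync", 1: "error", 2: "changes_detected"}


def classify_plan_exit(code: int) -> str:
    """Translate a `terraform plan -detailed-exitcode` exit code into a label."""
    return EXIT_MEANINGS.get(code, "unknown")


def check_for_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report whether Terraform sees differences.

    Assumes the configuration itself is unchanged since the last apply,
    so any proposed change points to drift.
    """
    cmd = ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"]
    result = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
    return classify_plan_exit(result.returncode)
```

Calling check_for_drift("path/to/stack") from a scheduled job gives a simple yes/no drift signal, although it does not say which resources changed.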
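To see which resources a plan would touch, the plan can be saved with terraform plan -out=tfplan and rendered as JSON with terraform show -json tfplan. The sketch below assumes a document that follows Terraform's JSON plan format, where each entry in resource_changes carries an address and a change.actions list. The sample data is invented for illustration.

```python
from __future__ import annotations

import json


def changed_resources(plan_json: str) -> dict[str, list[str]]:
    """Map each resource address to its planned actions, skipping no-ops and reads."""
    plan = json.loads(plan_json)
    changes = {}
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions not in (["no-op"], ["read"]):
            changes[rc["address"]] = actions
    return changes


# Invented sample: one instance must be recreated, one security group was edited.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_instance.web[0]", "change": {"actions": ["no-op"]}},
        {"address": "aws_instance.web[1]", "change": {"actions": ["create"]}},
        {"address": "aws_security_group.app", "change": {"actions": ["update"]}},
    ]
})

for address, actions in changed_resources(sample).items():
    print(address, actions)
```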

Monitoring drift with Spacelift

Periodic monitoring of IaC-managed infrastructure to proactively check for drift is challenging. Drift detection provided by Spacelift helps to identify and highlight infrastructure drift promptly. Configuring a drift monitor is as simple as configuring a cron job.

To start, select the stack you wish to configure drift detection for and navigate to Settings > Scheduling. A couple of notable control options are provided here:

  1. Reconcile: When this is enabled, Spacelift automatically remediates the drift. When infrastructure drift is identified, Spacelift triggers the "terraform apply" workflow to restore the original state of infrastructure as per the Terraform configuration.
  2. Schedule: This is a simple cron job notation that determines the scanning frequency and compares the state of deployment. In the example below, the drift detection happens every 15 minutes.

[Screenshot: scheduling drift detection]
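In standard cron notation, "every 15 minutes" is usually written as */15 * * * *. The following Python sketch only illustrates which run times such a schedule produces. It is not how Spacelift evaluates schedules, and it assumes the step divides 60 evenly.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def next_runs(start: datetime, step_minutes: int = 15, count: int = 4) -> list[datetime]:
    """Return the next `count` run times of a `*/step_minutes * * * *` schedule."""
    # Drop seconds, then move forward to the next minute that is a multiple of the step.
    base = start.replace(second=0, microsecond=0)
    minutes_to_add = step_minutes - (base.minute % step_minutes)
    first = base + timedelta(minutes=minutes_to_add)
    return [first + timedelta(minutes=step_minutes * i) for i in range(count)]


for run in next_runs(datetime(2024, 6, 17, 10, 7)):
    print(run.strftime("%H:%M"))
```

A shorter interval surfaces drift sooner at the cost of more frequent runs, so the right value depends on how quickly unexpected changes need to be noticed.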

When drift is detected, it is represented in a very intuitive way, making it easy to interpret its impact.

The screenshot below shows that one of the network components has drifted.

[Screenshot: drift overview]

Clicking on the drifted block quickly reveals details of the drift.

[Screenshot: details of the drift]

Read more in the documentation.

Understanding and reconciling drift with Spacelift

Whenever drift is detected, it is important to identify the factors that caused it. As discussed previously, the changes introduced out of the scope of Terraform could be either desirable or unwanted. Here are some examples:

  1. Changes caused by overlapping scopes of automation tools are usually desirable. However, they show that responsibility for the resource is not clearly defined.
  2. Changes introduced manually to troubleshoot a related issue elsewhere but overlooked and not reverted are unwanted because they expose the system to various vulnerabilities and could have cost implications.
  3. Hotfixes implemented to address issues in critical services can be either permanent fixes or temporary workarounds, which makes it difficult to classify them as either desirable or unwanted. Further investigation is needed to decide.

As seen from the examples above, understanding drift needs some analysis. The course of remediation action usually boils down to the following:

  1. If the changes are desirable, import the configuration under Terraform management scope.
  2. If the changes are not desired, reinstate the original state by running "terraform apply".
  3. If a resource is not supposed to be managed by Terraform, disassociate it from Terraform state and configuration.
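The three courses of action above map roughly onto three Terraform CLI commands: terraform import, terraform apply, and terraform state rm. The Python sketch below only encodes that mapping as a lookup. The DriftKind names and the suggest_command helper are hypothetical, and the resource address and ID are placeholders.

```python
from enum import Enum


class DriftKind(Enum):
    DESIRABLE = "desirable"        # keep the change and bring it under Terraform
    UNWANTED = "unwanted"          # revert to the configured state
    OUT_OF_SCOPE = "out_of_scope"  # resource should not be managed by Terraform


def suggest_command(kind: DriftKind, address: str, resource_id: str = "<resource-id>") -> str:
    """Suggest a Terraform CLI command for a classified drift (illustrative only)."""
    if kind is DriftKind.DESIRABLE:
        # The configuration must be updated to match as well; import is only
        # needed when the resource is not yet tracked in state.
        return f"terraform import {address} {resource_id}"
    if kind is DriftKind.UNWANTED:
        return "terraform apply"
    # Removes the resource from state without destroying it.
    return f"terraform state rm {address}"


print(suggest_command(DriftKind.OUT_OF_SCOPE, "aws_instance.legacy"))
```

Classifying the drift still requires human judgment. The code only makes the follow-up step explicit once that decision has been made.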

When drift detection is enabled, Spacelift highlights drift in the next scheduled detection run, so how quickly it surfaces depends on how frequently those runs are configured. When we enable the "Reconcile" option in the drift detection schedule, Spacelift automatically triggers Terraform runs to reinstate the original configuration. This is appropriate when the scope is clearly defined and resource management policies are in place.

However, if the boundaries are not clearly defined, you should turn off the "Reconcile" option. This is because there may be a need to either import drift or disassociate infrastructure from the current Terraform configuration. The drift detection schedule again plays an important role in confirming mitigation actions post-import/disassociation.

Terraform drift detection tools

Several tools can help you identify drift and some of them can even remediate the drift for you. Below is a list of these tools:

  • Terraform drift detection documentation
  • Brainboard
  • Terratest
  • Driftctl
  • TestInfra
  • Kitchen-Terraform

Terraform drift detection documentation

The Terraform drift detection documentation offers a comprehensive guide to identifying and managing drift within your Terraform configurations. It outlines how to use some of Terraform's native features, such as plan and apply, to detect changes not reflected in your Terraform configuration.

Brainboard

Brainboard is a cloud architecture design tool that offers several features to manage and visualize your cloud infrastructure effectively. It can help you identify discrepancies between your deployed resources and your Terraform configurations, thus making it easier for engineers to address drift and enforce compliance in their IaC definitions.

Terratest

Terratest is a Go library for testing infrastructure code. It can be used to automate testing of infrastructure states, which indirectly helps to identify drift by validating the deployed resources against the Terraform configuration.

Driftctl

Driftctl is a dedicated drift detection tool for Terraform that scans your Terraform state and compares it with the actual state of your resources. This approach helps to quickly identify and address drift, ensuring your infrastructure aligns with the IaC definition.

TestInfra

TestInfra is another testing framework for infrastructure. Even though it is not dedicated to Terraform, it can be used to test the state of the infrastructure managed by Terraform. It helps to identify configuration drift by asserting the actual state of your infrastructure against the expected configuration.

Kitchen-Terraform

Kitchen-Terraform integrates the Test Kitchen automation tool with Terraform, allowing you to define tests for your Terraform configurations. Similar to Terratest and TestInfra, it can verify your configurations against the actual state of your infrastructure, thus detecting drift.

Key points

Managing infrastructure drift is challenging because it may originate from any source. It is difficult to be absolutely certain who changed what and when. The risk of such drift can range from low to critical, and the impact can affect the system's security, cost, and reliability.

Terraform IaC and state files are the only reliable and predictable sources of information about the managed infrastructure. Spacelift's drift detection encompasses monitoring and an intuitive UI to highlight the drift (and optionally automate the reconciliation). This makes it easy to identify what has changed and how to proceed with investigating it.

Written by Sumeet Ninawe and Flavius Dinu
