Optimizing AWS Infrastructure Deployment: Terraform, Sentinel, and CI/CD Best Practices

Nikolai Main - Oct 6 - Dev Community

This project follows on from a previous post where I built AWS infrastructure solely in the AWS console. In it I cover the following topics:

  • Centralized Terraform state management
  • Terraform code validation with Sentinel
  • CI/CD Pipeline deployment
  • AWS Infrastructure

Less focus is placed on the application design itself, which may be covered in a later post.

Project Overview

In my initial project, I spent about an hour building the infrastructure and quickly realized how easy it is to make even minor mistakes that can lead to system failures. This often resulted in spending an additional 10 minutes here and there, sifting through each component to identify the error.

Recognizing this challenge in a relatively small project made me acutely aware of the potential headaches that could arise when managing larger systems.

To address this issue, I turned to Terraform. I spent a similar amount of time, approximately 1-2 hours, defining my infrastructure. The benefits were substantial: instead of spending 1-2 hours each time I need to deploy, I can now bring my entire infrastructure up in about 10 minutes, with a comparable teardown time.

This cut each deployment by roughly 50 minutes or more. I also have much more confidence in the security of my application and infrastructure, because both are scanned before deployment:

  • Infrastructure Validation: My infrastructure is validated with Sentinel in my cloud workspace. If my Infrastructure as Code (IaC) contains misconfigurations (such as poor naming and tagging conventions, overly permissive IAM policies, or insecure VPC designs), the run fails and I am notified of the necessary changes.
  • Application Security Scans: For my application, I use GitLab's built-in suite of security tools to scan for code and dependency vulnerabilities, as well as exposed secrets. If GitLab isn’t an option, there are several other security scanning tools available, such as CodeQL, SonarQube, and Trivy. Once the application image is built, it is scanned again with Trivy for known vulnerabilities.

Infrastructure Overview

[Diagram: infrastructure overview]

Frontend Infrastructure (Repo 1)

  • ECR (Elastic Container Registry)
  • ECS (Elastic Container Service)
  • Application Load Balancer

Backend Infrastructure (Repo 2)

  • VPC (Virtual Private Cloud)
  • RDS (Relational Database Service)
  • API Gateway
  • AWS Lambda
  • Secrets Manager

Security Checks and Scans

Pipeline Scans

  1. Secret Detection
  2. SAST (Static Application Security Testing) Scanning
  3. Dependency Scanning
  4. SCA (Software Composition Analysis) Scanning

Sentinel Scans

  1. Appropriate IAM Permissions
  2. General Configuration Checks
  3. VPC Traffic Flows
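
As an illustration of what an "appropriate IAM permissions" check might look for, here is a minimal Python sketch that flags policy statements allowing every action on every resource. It is not the Sentinel policy used in this project; the function names and the sample policy are hypothetical, and a real check would likely cover more cases (for example service-level wildcards such as `s3:*`).

```python
def as_list(value):
    """IAM allows a single string or a list for Action/Resource; normalise to a list."""
    return value if isinstance(value, list) else [value]


def overly_permissive_statements(policy):
    """Return the Allow statements that grant "*" on "*"."""
    flagged = []
    for statement in as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        actions = as_list(statement.get("Action", []))
        resources = as_list(statement.get("Resource", []))
        if "*" in actions and "*" in resources:
            flagged.append(statement)
    return flagged


# Hypothetical policy document used only for demonstration.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
        {"Effect": "Allow", "Action": ["logs:PutLogEvents"], "Resource": "*"},
    ],
}

for statement in overly_permissive_statements(sample_policy):
    print("Overly permissive statement:", statement)
```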

Deployment Workflow Overview

[Diagram: deployment workflow overview]

Workflow 1 (Backend Configuration)

  1. Backend code is pushed to GitLab.
  2. A Terraform run is triggered in the cloud workspace.
  3. Sentinel policies check the code for misconfigurations:
    • VPC: Naming conventions and private subnet configuration
    • Security Groups: Only allowing traffic over necessary ports
    • Lambda: IAM permissions and general configuration
    • RDS: Encryption, public accessibility, and default credentials
    • Secrets Manager: Secret rotation and read replicas
  4. Once validation passes, the infrastructure can be applied. Note the relevant outputs.
    • The RDS endpoint and secret name are needed for Lambda to work in this project. (I later came back to this and retrieved those values dynamically from within the Lambda function.)

Example Sentinel Policy - VPC Checks

import "tfplan/v2" as tfplan
import "tfrun" as run
import "strings"

// Define variables

messages = []
resource = "VPC"

// Define main function
checks = func() {
  // Skip all checks on destroy runs
  if run.is_destroy == true {
    return true
  }

  // Retrieve resource info
  vpcs = filter tfplan.resource_changes as _, rc {
    rc.mode is "managed" and
    rc.type is "aws_vpc"
  }
  subnets = filter tfplan.resource_changes as _, rc {
    rc.mode is "managed" and
    rc.type is "aws_subnet"
  }

  // Checking if resource exists.
  if length(vpcs) == 0 {
    append(messages, "No vpc found.")
  }
  if length(subnets) == 0 {
    append(messages, "No subnets found.")
  }

  // Iterate over subnets
  for subnets as address, subnet {
    // Check number of available addresses.
    // A prefix length below 24 means the block is larger than a /24.
    if int(strings.split(subnet.change.after.cidr_block, "/")[1]) < 24 {
      append(messages, subnet.address + " CIDR block too large. Prefix length must be at least 24.")
    }
    if strings.has_prefix(subnet.address, "aws_subnet.private") {

      // Check subnet CIDR block
      if subnet.change.after.cidr_block == "0.0.0.0/0" {
        append(messages, "Subnet not private. Edit CIDR block")
      }

      // Check if subnet has a public IP enabled.
      if subnet.change.after.map_public_ip_on_launch == true {
        append(messages, "Subnet not private. Public IP enabled")
      }
    }
  }

  // Run VPC checks
  for vpcs as address, vpc {

    // Retrieve the VPC tags (empty map if none are set)
    tags = vpc.change.after.tags else {}

    // Check VPC name/tags: fail when there are no tags or the name is "main-vpc"
    if length(tags) == 0 or tags.Name == "main-vpc" {
      append(messages, "VPC must follow proper naming conventions. Current name: " + string(tags["Name"] else "none"))
    }
  }

  // Checking if any error messages have been produced
  // If messages is empty, the policy returns true and passes.
  if length(messages) != 0 {
    print(resource + " misconfigurations:")
    counter = 1
    for messages as message {
      print(string(counter) + ". " + message)
      counter += 1
    }
    return false
  }
  return true
}

// Main rule
main = rule {
  checks()
}
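
To make the subnet rules in the policy easier to reason about outside of Sentinel, here is a rough Python sketch of the same two checks (minimum prefix length and no public IP mapping on private subnets). The function name and the sample data are hypothetical; the dictionaries only imitate the shape of `change.after` in a Terraform plan.

```python
import ipaddress

MIN_PREFIX_LENGTH = 24  # same threshold as the Sentinel policy


def check_subnet(address, after):
    """Return a list of violation messages for one planned aws_subnet."""
    messages = []
    network = ipaddress.ip_network(after["cidr_block"], strict=False)
    # A shorter prefix (for example /16) means a larger address block.
    if network.prefixlen < MIN_PREFIX_LENGTH:
        messages.append(
            f"{address}: CIDR block too large, prefix length must be at least {MIN_PREFIX_LENGTH}."
        )
    # Only subnets whose address starts with aws_subnet.private must stay private.
    if address.startswith("aws_subnet.private"):
        if after.get("map_public_ip_on_launch") is True:
            messages.append(f"{address}: subnet not private, public IP enabled.")
    return messages


# Hypothetical planned subnets used only for demonstration.
sample = {
    "aws_subnet.private_a": {"cidr_block": "10.0.1.0/24", "map_public_ip_on_launch": False},
    "aws_subnet.private_b": {"cidr_block": "10.0.0.0/16", "map_public_ip_on_launch": True},
}

for address, after in sample.items():
    for message in check_subnet(address, after):
        print(message)
```

In this sample, the first subnet passes both checks and the second one fails both.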

Workflow 2 (Frontend Configuration)

  1. Application code is developed on a local machine and pushed to GitLab.
  2. The pipeline is triggered (more details below):
    • Scan application code
    • Build image, scan it, and push it to ECR
    • Retrieve relevant outputs from backend infrastructure
    • Create terraform.tfvars file and push it back to GitLab
  3. A second Terraform workspace is triggered by the tagged push to the repo.
  4. A similar plan > Sentinel scan > apply process takes place.

GitLab Pipeline

Stage 1: Test - SAST, Dependency Scanning, Secret Detection, etc.

image: docker:latest
services:
- docker:dind
variables:
  DOCKER_HOST: tcp://docker:2375/
  DOCKER_DRIVER: overlay2
  REPO_NAME: gitlab-cicd

# Declaring the required GitLab scans.
include:
  - template: Jobs/Dependency-Scanning.gitlab-ci.yml
  - template: Jobs/SAST.gitlab-ci.yml
  - template: Jobs/Secret-Detection.gitlab-ci.yml

# All included templates run during 'test' stage.
stages:
  - test
  - build-image
  - fetch-terraform-outputs
  - update-terraform

Stage 2: Build, Scan, Push

build:
  stage: build-image
  before_script:
  - apk add --no-cache aws-cli
  - apk add --no-cache curl
  script:

  # Building Docker image
  - echo "Building Docker image..."
  - docker build -t $REPO_NAME:latest .

  # Scanning Docker image with Trivy
  # (report only: --exit-code 0 and '|| true' mean findings never fail the job)
  - echo "Running Trivy scan on Docker image"
  - curl -sSL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -
  - export PATH=$PATH:$(pwd)/bin
  - trivy image --exit-code 0 --severity HIGH,CRITICAL $REPO_NAME:latest || true
  - trivy image --format json --output trivy-results.json $REPO_NAME:latest

  # Retrieving ECR repo credentials
  - echo "Logging in to Amazon ECR..."
  - aws ecr get-login-password --region $AWS_DEFAULT_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com

  # Pushing Docker image to ECR
  - echo "Pushing Docker image to ECR..."
  - TIMESTAMP=$(date +%Y%m%d%H%M%S)
  - IMAGE_TAG="$REPO_NAME:$TIMESTAMP"
  - docker tag $REPO_NAME:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_TAG
  - docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_TAG
  - echo "TF_VAR_image_uri=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_TAG" >> build.env
  artifacts:
    paths:
    - build.env
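
The build job writes the full Trivy report to trivy-results.json but does not act on it. As a rough sketch of how that report could be summarised or used as a gate, the Python below counts findings per severity. It assumes the report has the usual `Results[].Vulnerabilities[].Severity` layout; the function names and the sample report are hypothetical and not part of the pipeline above.

```python
from collections import Counter


def count_by_severity(report):
    """Count vulnerabilities per severity in a parsed Trivy JSON report."""
    counts = Counter()
    # "Results" and "Vulnerabilities" may be missing or null when nothing is found.
    for result in report.get("Results") or []:
        for vulnerability in result.get("Vulnerabilities") or []:
            counts[vulnerability.get("Severity", "UNKNOWN")] += 1
    return counts


def should_block(report, blocking=("HIGH", "CRITICAL")):
    """Return True if the report contains any finding at a blocking severity."""
    counts = count_by_severity(report)
    return any(counts[severity] > 0 for severity in blocking)


# Hypothetical report used only for demonstration; in the pipeline it would be
# loaded with json.load(open("trivy-results.json")).
sample_report = {
    "Results": [
        {
            "Target": "gitlab-cicd:latest",
            "Vulnerabilities": [
                {"VulnerabilityID": "CVE-0000-0001", "Severity": "HIGH"},
                {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
            ],
        },
        {"Target": "app/requirements.txt", "Vulnerabilities": None},
    ]
}

print(dict(count_by_severity(sample_report)))
print("Block deployment:", should_block(sample_report))
```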

Stage 3: Fetch TF outputs

fetch-terraform-outputs:
  stage: fetch-terraform-outputs
  image: alpine:latest
  script:
  - apk add --no-cache curl jq
  - echo "Creating variables for specific outputs..."

  # Retrieving outputs via Terraform Cloud API
  # Saving outputs as environment variables to be passed to the next stage.
  - |
    curl -s -X GET \
      "https://app.terraform.io/api/v2/workspaces/${HCP_WORKSPACE_ID}/current-state-version-outputs" \
      -H "Authorization: Bearer ${HCP_TOKEN}" \
      -H 'Content-Type: application/vnd.api+json' | \
    jq -r '.data[] | select(.attributes.name | test("public_subnet_ids|alb-sg-id|container-sg-id|vpc_id")) |
      if .attributes.name == "public_subnet_ids" then
        "PUBLIC_SUBNET_IDS=\(.attributes.value)"
      elif .attributes.name == "alb-sg-id" then
        "ALB_SG_ID=\(.attributes.value)"
      elif .attributes.name == "container-sg-id" then
        "CONTAINER_SG_ID=\(.attributes.value)"
      elif .attributes.name == "vpc_id" then
        "VPC_ID=\(.attributes.value)"
      else
        empty
      end' > terraform_outputs.env
  artifacts:
    reports:
      dotenv: terraform_outputs.env
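
The jq filter in this stage is dense, so here is a small Python sketch of the same transformation: it takes a parsed response from the outputs endpoint and turns the four outputs of interest into NAME=value lines. The function name and the sample response are hypothetical; only the `data[].attributes.name` and `data[].attributes.value` fields used by the jq filter are assumed.

```python
import json

# Terraform output name -> pipeline variable name, as in the jq filter.
OUTPUT_TO_ENV = {
    "public_subnet_ids": "PUBLIC_SUBNET_IDS",
    "alb-sg-id": "ALB_SG_ID",
    "container-sg-id": "CONTAINER_SG_ID",
    "vpc_id": "VPC_ID",
}


def outputs_to_dotenv(response):
    """Convert a parsed outputs response into dotenv-style lines."""
    lines = []
    for item in response.get("data", []):
        attributes = item["attributes"]
        env_name = OUTPUT_TO_ENV.get(attributes["name"])
        if env_name is None:
            continue  # ignore outputs the pipeline does not need
        value = attributes["value"]
        # Like jq string interpolation: strings are written bare,
        # other values (such as lists) are written as compact JSON.
        if isinstance(value, str):
            rendered = value
        else:
            rendered = json.dumps(value, separators=(",", ":"))
        lines.append(f"{env_name}={rendered}")
    return lines


# Hypothetical response used only for demonstration.
sample_response = {
    "data": [
        {"attributes": {"name": "vpc_id", "value": "vpc-0abc"}},
        {"attributes": {"name": "public_subnet_ids", "value": ["subnet-1", "subnet-2"]}},
        {"attributes": {"name": "rds_endpoint", "value": "db.example.internal"}},
    ]
}

print("\n".join(outputs_to_dotenv(sample_response)))
```

Because the subnet IDs are written as a JSON list, the value can later be placed into terraform.tfvars without surrounding quotes.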

Stage 4: Update terraform.tfvars

update-terraform:
  stage: update-terraform
  image: alpine:latest
  dependencies:
  - build
  - fetch-terraform-outputs
  before_script:
  - apk add --no-cache git
  - git config --global user.email "${USER_EMAIL}"
  - git config --global user.name "${USER_NAME}"
  script:
  - echo "Contents of current directory:"
  - ls -la
  - echo "Contents of build.env:"
  - cat build.env || echo "build.env not found"
  - echo "Contents of terraform_outputs.env:"
  - cat terraform_outputs.env || echo "terraform_outputs.env not found"
  - export $(cat build.env | xargs)
  - export $(cat terraform_outputs.env | xargs)
  - echo "Cloning repository..."
  - git clone https://<username>:${GITLAB_PAT}@gitlab.com/<project_id>/<repo.git> || exit 1
  - cd Test

  # Create terraform.tfvars file
  - echo "Creating/Updating terraform.tfvars file..."
  - |
    cat << EOF > terraform.tfvars
    image_uri = "${TF_VAR_image_uri}"
    public_subnet_ids = ${PUBLIC_SUBNET_IDS}
    alb_sg_id = "${ALB_SG_ID}"
    container_sg_id = "${CONTAINER_SG_ID}"
    vpc_id = "${VPC_ID}"
    EOF

  # Commit and push terraform.tfvars to repo
  - git add terraform.tfvars
  - git commit -m "Update image URI and Terraform outputs in terraform.tfvars [ci skip]" || echo "No changes to commit"
  - TAG_NAME="$(date +%Y.%m.%d-%H%M%S)"
  - echo "Creating a new tag $TAG_NAME"

  # Creating a tag to trigger TF cloud only from pushes from this pipeline.
  # 'ci skip' tells the repo not to run the pipeline again on this push.
  - git tag -a $TAG_NAME -m "Release version $TAG_NAME [ci skip]"
  - git push origin HEAD:main --tags || exit 1
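
For reference, the heredoc in this stage produces a terraform.tfvars file in which string values are quoted and the subnet list is left unquoted. A minimal Python sketch of the same rendering is shown below. The function name and sample values are hypothetical, and it assumes the values contain nothing that Terraform would treat as template syntax (such as `${`).

```python
import json


def render_tfvars(values):
    """Render a dict as terraform.tfvars text.

    json.dumps quotes strings and renders lists in bracket syntax,
    both of which are valid in a .tfvars file for simple values.
    """
    lines = [f"{name} = {json.dumps(value)}" for name, value in values.items()]
    return "\n".join(lines) + "\n"


# Hypothetical values used only for demonstration.
sample_values = {
    "image_uri": "123456789012.dkr.ecr.eu-west-2.amazonaws.com/gitlab-cicd:20240101120000",
    "public_subnet_ids": ["subnet-1", "subnet-2"],
    "alb_sg_id": "sg-0alb",
    "container_sg_id": "sg-0ctr",
    "vpc_id": "vpc-0abc",
}

print(render_tfvars(sample_values), end="")
```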

Final Notes

In conclusion, I now have an end-to-end deployment solution that ensures my application is both secure and robust. This streamlined process has significantly reduced my mean time to deployment, allowing me to reallocate time and resources to other areas.

By identifying potential issues much earlier in the deployment process, I can mitigate risks that previously led to delays and unnecessary costs. This proactive approach not only makes my development cycle more efficient but also improves the quality of my releases.

Looking ahead, I plan to add further security testing, introduce separate testing and production environments, and integrate a monitoring tool such as Grafana.
