Ensuring Disaster Recovery and High Availability in AWS EKS: Best Practices

Supratip Banerjee - Aug 30 - Dev Community

When managing applications on AWS Elastic Kubernetes Service (EKS), ensuring disaster recovery and high availability is crucial. These practices help protect your applications from failures and ensure they remain accessible even during unexpected incidents. In this guide, we will explore best practices for achieving disaster recovery and high availability in EKS.

1. Understanding High Availability and Disaster Recovery

High availability (HA) means that your applications and services are always accessible and operational, even if parts of the infrastructure fail. Disaster recovery (DR) involves strategies and processes to restore your application to a normal state after a major failure or disaster. Both are essential for maintaining business continuity and minimizing downtime.

2. Multi-AZ Deployments for High Availability

AWS EKS supports running Kubernetes clusters across multiple Availability Zones (AZs). By spreading your EKS nodes across multiple AZs, you can ensure that if one AZ experiences issues, your application can still run on nodes in other AZs.

Configuring Multi-AZ Clusters

To set up a multi-AZ EKS cluster, create a cluster and node groups that span multiple AZs. Here is an example eksctl configuration for a cluster that spans three AZs:

# eks-cluster.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-cluster
  region: us-west-2
  version: "1.30"

availabilityZones:
  - us-west-2a
  - us-west-2b
  - us-west-2c

nodeGroups:
  - name: ng-1
    desiredCapacity: 3
    minSize: 2
    maxSize: 4
    availabilityZones:
      - us-west-2a
      - us-west-2b
      - us-west-2c
    instanceType: t3.medium

Explanation: This configuration file sets up an EKS cluster named my-cluster across three Availability Zones (us-west-2a, us-west-2b, and us-west-2c). The nodeGroups section defines a node group that spans these AZs, with a desired capacity of 3 nodes, a minimum of 2, and a maximum of 4. Distributing the nodes across multiple AZs means that the loss of a single AZ still leaves the cluster with capacity to run your workloads.
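
Spreading nodes across AZs only pays off if your pods end up spread across them as well. One way to encourage this is a topology spread constraint in the pod template of your Deployments. The excerpt below is a minimal sketch that assumes the app: my-app labels used later in this guide and would be merged under the pod template's spec. You can create the cluster itself from the file above with eksctl create cluster -f eks-cluster.yaml.

# deployment excerpt (illustrative sketch) - add under the pod template's spec
topologySpreadConstraints:
  - maxSkew: 1                                # keep per-zone pod counts within 1 of each other
    topologyKey: topology.kubernetes.io/zone  # spread pods across Availability Zones
    whenUnsatisfiable: ScheduleAnyway         # prefer spreading, but still schedule if it is not possible
    labelSelector:
      matchLabels:
        app: my-app                           # matches the my-app Deployment shown later in this guide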

3. Implementing Automated Backups

Automated backups are vital for disaster recovery. In EKS, you can rely on AWS services such as Amazon RDS automated backups for databases, or implement your own backup strategy for Kubernetes resources and persistent volumes.
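
If your cluster runs the Amazon EBS CSI driver together with the external snapshotter, you can also take snapshots declaratively through the Kubernetes VolumeSnapshot API. The example below is a minimal sketch under that assumption; the my-app-data PVC and the csi-aws-vsc snapshot class are illustrative names.

# volume-snapshot.yaml (illustrative sketch)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-app-data-snapshot
spec:
  volumeSnapshotClassName: csi-aws-vsc        # a VolumeSnapshotClass backed by the EBS CSI driver (assumed)
  source:
    persistentVolumeClaimName: my-app-data    # the PVC to snapshot (assumed name)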

Creating Regular Snapshots

For persistent data stored in Amazon EBS volumes, you can create automated snapshots. Below is an example of how to set up automated snapshots for EBS volumes using AWS Lambda:

# create_snapshot.py
import boto3
from datetime import datetime

def lambda_handler(event, context):
    ec2 = boto3.client('ec2')

    # Find all EBS volumes tagged Backup=True
    volumes = ec2.describe_volumes(Filters=[{'Name': 'tag:Backup', 'Values': ['True']}])

    # Create a timestamped snapshot of each matching volume
    for volume in volumes['Volumes']:
        snapshot = ec2.create_snapshot(
            VolumeId=volume['VolumeId'],
            Description='Automated backup - {}'.format(datetime.now())
        )
        print(f'Snapshot created: {snapshot["SnapshotId"]}')

Explanation: This Python script, to be run as an AWS Lambda function, creates snapshots of EBS volumes tagged with Backup=True. The snapshots are created with a description including the current date and time, helping you keep track of backup versions.
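
To run this function on a schedule, one option is an EventBridge rule that invokes it periodically. Below is a minimal CloudFormation sketch under that assumption; the <lambda-function-arn> and <lambda-function-name> placeholders refer to the function you deploy, and its execution role also needs ec2:DescribeVolumes and ec2:CreateSnapshot permissions.

# snapshot-schedule.yaml (illustrative sketch)
Resources:
  DailySnapshotRule:
    Type: AWS::Events::Rule
    Properties:
      Description: Invoke the EBS snapshot Lambda once a day
      ScheduleExpression: rate(1 day)
      State: ENABLED
      Targets:
        - Arn: <lambda-function-arn>
          Id: SnapshotLambdaTarget
  SnapshotLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: <lambda-function-name>
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt DailySnapshotRule.Arn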

4. Configuring Health Checks and Auto-Scaling

Health checks and auto-scaling are crucial for maintaining high availability and performance. Kubernetes supports liveness and readiness probes to monitor the health of your pods, and EKS integrates with AWS Auto Scaling to handle load changes automatically.

Setting Up Liveness and Readiness Probes

Here is an example of a Kubernetes deployment configuration with liveness and readiness probes:

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app-container
        image: my-app-image:latest
        ports:
        - containerPort: 80
        livenessProbe:
          httpGet:
            path: /healthz
            port: 80
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /readiness
            port: 80
          initialDelaySeconds: 60
          periodSeconds: 30

Explanation: In this configuration, liveness and readiness probes are set up for the my-app-container container. The liveness probe checks the /healthz endpoint; if it fails repeatedly, Kubernetes restarts the container. The readiness probe checks the /readiness endpoint; while it fails, the pod is removed from Service endpoints and receives no traffic. The initial delay and period values help avoid false negatives while the application is still starting up.

Auto-Scaling

To automatically adjust the number of pods based on demand, you can use the Horizontal Pod Autoscaler (HPA):

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50

Explanation: This HPA configuration scales the my-app Deployment based on CPU utilization, keeping at least 2 replicas and scaling up to 10 as average CPU usage crosses the 50% target. Note that CPU-based scaling requires the Kubernetes Metrics Server to be running in the cluster. This helps maintain performance and availability during varying loads.

5. Implementing Multi-Region Deployment

For comprehensive disaster recovery, consider deploying your application across multiple AWS regions. This approach protects against regional outages and ensures that your application remains available even if an entire region experiences issues.

Setting Up Multi-Region Deployment

Deploy your EKS clusters in different regions and use AWS Global Accelerator or Route 53 for traffic routing. Here’s a simplified example of using Route 53 for multi-region failover:

# route53-failover.yaml
Resources:
  PrimaryRecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: <your-hosted-zone-id>
      Name: my-app.example.com
      Type: A
      Failover: PRIMARY
      AliasTarget:
        DNSName: <primary-region-load-balancer-dns>
        HostedZoneId: <primary-region-hosted-zone-id>
        EvaluateTargetHealth: true
      SetIdentifier: Primary
      HealthCheckId: <health-check-id>
  SecondaryRecordSet:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: <your-hosted-zone-id>
      Name: my-app.example.com
      Type: A
      Failover: SECONDARY
      AliasTarget:
        DNSName: <secondary-region-load-balancer-dns>
        HostedZoneId: <secondary-region-hosted-zone-id>
        EvaluateTargetHealth: true
      SetIdentifier: Secondary
      HealthCheckId: <health-check-id>

Explanation: This Route 53 configuration sets up DNS failover between primary and secondary regions. If the primary region fails, Route 53 will automatically route traffic to the secondary region based on health checks. This ensures that your application remains accessible even if there’s a regional failure.
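
The <health-check-id> placeholders refer to Route 53 health checks that you create separately. As an illustration, a health check resource in the same template might look like the sketch below; the protocol, path (reusing the /healthz endpoint from the probes section), and thresholds are assumptions to adapt to your setup.

# route53-healthcheck.yaml (illustrative sketch)
Resources:
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS                                             # or HTTP, depending on your listener
        FullyQualifiedDomainName: <primary-region-load-balancer-dns>
        ResourcePath: /healthz                                  # health endpoint to probe (assumed)
        RequestInterval: 30                                     # seconds between checks
        FailureThreshold: 3                                     # consecutive failures before marked unhealthy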

Conclusion

Implementing disaster recovery and high availability for your AWS EKS applications involves setting up multi-AZ deployments, automating backups, configuring health checks and auto-scaling, and considering multi-region deployments. By following these best practices and using the provided code examples, you can ensure that your EKS applications remain resilient and available, even in the face of unexpected challenges.
