Self-Healing and Monitoring: A Comprehensive Guide to Revolutionizing System Resilience Through Automation

SeunB - Sep 2 - - Dev Community

In today’s fast-paced digital world, maintaining system reliability and minimizing downtime are critical for business success. This comprehensive guide explores how to enhance system resilience through advanced monitoring and self-healing mechanisms.
We will walk you through integrating Datadog for monitoring, setting up automated recovery scripts, and leveraging Node.js and webhooks to create a reliable self-healing system (with a focus on disk management).
By the end of this guide, you'll have a fully automated setup that can proactively manage system issues, ensuring smooth and uninterrupted operations.

Prerequisites

  • A Datadog Account: Create an active Datadog account for monitoring and alerts.
  • A Linux Server: This can be either on-premises or a cloud-based instance.
  • Internet Access: Required for installing software, setting up integrations, and accessing Datadog.

1. Create a Datadog Account

  1. Sign Up for Datadog:
    • Visit Datadog's sign-up page and create a free trial account by entering your email and setting a password.
    • Complete the registration process, create a name for your organization, and verify your email if required.
    • Log in to your Datadog dashboard.

2. Deploy Your First Datadog Agent

Installation guides for each supported operating system are listed in the Integrations > Agents section of Datadog.

For a Local Ubuntu Server:

  1. Install the Datadog Agent:

    • Obtain Your Datadog API Key:
      • Navigate to Organization settings > API Keys to find your API key.
    • Install the Agent:
      • Run the following command on your Ubuntu server, replacing the placeholder key with your own API key:

         DD_API_KEY=8axxxxxxxxxxxxxxxxxxxxxxxxxxxf43 DD_SITE="datadoghq.com" bash -c "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"

      • This command installs the agent and automatically configures it with your API key.
  2. Verify Agent Installation:

    • Check the agent status to ensure it is running:

         systemctl status datadog-agent
         sudo datadog-agent status

    • You should see output indicating that the agent is running and sending data.
    • To check logs, use:

         tail -f /var/log/datadog/agent.log

For Ubuntu Server Managing a Kubernetes Cluster with kubectl and helm already installed:

  1. Create a Datadog API Key Secret:

    • Execute the following command to create a Kubernetes secret containing your Datadog API key:

         kubectl create secret generic datadog-secret --from-literal api-key=8axxxxxxxxxxxxxxxxxxxxxxxxxxxf43

  2. Deploy the Datadog Agent in the Cluster:

    helm repo add datadog https://helm.datadoghq.com
    helm repo update

  • Create and configure datadog-values.yaml:

     nano datadog-values.yaml
    

    Add the following content:

     datadog:
       apiKeyExistingSecret: datadog-secret
    
  • Deploy the Datadog Agent:

     helm install datadog-agent -f datadog-values.yaml datadog/datadog
    

  • Confirm agents are running:

     kubectl get all
    

3. Prepare a Disk for Monitoring

An alert will be triggered when usage of this disk reaches a defined threshold.

  1. Add a New Disk to the Server:
    • In your VMware or AWS cloud instance, add a new 2GB disk.

Adding a 2GB Disk in VMware Workstation

  1. Open VMware Workstation:

    • Launch VMware Workstation and select your virtual machine.
  2. Open VM Settings:

    • Right-click on the virtual machine and select Settings.

  3. Add a New Disk:

    • Click Add to open the Add Hardware Wizard.
    • Choose Hard Disk and click Next.
    • Select SCSI (recommended) or IDE and click Next.
    • Choose Create a new virtual disk and click Next.
    • Specify the disk size as 2 GB.
    • Choose the location to store the virtual disk file and click Next.
    • Click Finish to create the disk.

Adding a 2GB Disk in AWS

  1. Log in to AWS Management Console:

  2. Navigate to EC2 Dashboard:

    • Go to Services > EC2.
  3. Create a New EBS Volume:

    • In the left sidebar, click on Volumes under Elastic Block Store.
    • Click Create Volume.
    • Configure the volume:
      • Volume Type: Choose General Purpose SSD (gp3), Provisioned IOPS SSD (io1), etc.
      • Size: Enter 2 GiB.
      • Availability Zone: Select the same availability zone as your EC2 instance.
    • Click Create Volume.

  4. Attach the EBS Volume to an EC2 Instance:

    • Go back to Volumes.
    • Select the volume you created.
    • Click Actions > Attach Volume.
    • Choose the instance you want to attach the volume to from the drop-down list.
    • Click Attach.
  5. Log in to Your EC2 or VMware Instance.

  6. Prepare the Volume for Use:

    • Verify Disk:

         lsblk

      The new disk should be listed as /dev/xvdf, /dev/sdb, or similar. The commands below use /dev/sdb; substitute the device name that lsblk reports on your server.

  7. Install LVM:
    LVM (Logical Volume Manager) is used to manage disk volumes because it offers flexibility, efficiency, and scalability. It allows dynamic resizing of volumes, easy addition and removal of disks, and improved storage utilization through aggregation and thin provisioning. LVM can improve performance with striping, supports snapshots for backups, simplifies administration, and can be combined with mirroring or RAID for high availability, which makes it a good fit for environments with changing storage needs.

    • Install LVM tools:

         sudo apt update
         sudo apt install -y lvm2

  8. Set Up LVM:

    • Create a Physical Volume (PV):

         sudo pvcreate /dev/sdb

    • Create a Volume Group (VG):

         sudo vgcreate demoVG /dev/sdb

    • Create a Logical Volume (LV) using all available space:

         sudo lvcreate -n demoLV -l 100%FREE demoVG

  9. Format and Mount the Logical Volume:

    • Format the LV with the ext4 filesystem:

         sudo mkfs.ext4 /dev/demoVG/demoLV

    • Create a mount point and mount the LV:

         sudo mkdir /demo
         sudo mount /dev/demoVG/demoLV /demo

    • Verify the Mount:

         df -h /demo

  10. Configure Automatic Mounting:

    • Add the following entry to /etc/fstab for automatic mounting at boot:

         echo '/dev/demoVG/demoLV /demo ext4 defaults 0 2' | sudo tee -a /etc/fstab

    • Verify fstab Configuration:

         cat /etc/fstab
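
The fstab entry above packs six whitespace-separated fields into one line. As a rough illustration (it is not part of the setup itself), the Node.js sketch below splits such a line into named fields. The function name `parseFstabLine` and the property names are my own choices, loosely following the column descriptions in the fstab man page.

```javascript
// Hypothetical helper: split one /etc/fstab line into its six fields.
function parseFstabLine(line) {
  const trimmed = line.trim();
  // Blank lines and comments carry no mount information.
  if (trimmed === '' || trimmed.startsWith('#')) return null;

  const fields = trimmed.split(/\s+/);
  if (fields.length < 4) {
    throw new Error(`Expected at least 4 fields, got ${fields.length}`);
  }
  // The last two fields are optional and default to 0.
  const [device, mountPoint, fsType, options, dump = '0', pass = '0'] = fields;
  return {
    device,                      // what to mount
    mountPoint,                  // where to mount it
    fsType,                      // filesystem type
    options: options.split(','), // mount options
    dump: Number(dump),          // 0 = skipped by the dump utility
    pass: Number(pass),          // fsck order at boot: 0 = never, 1 = root, 2 = others
  };
}

console.log(parseFstabLine('/dev/demoVG/demoLV /demo ext4 defaults 0 2'));
```

For the entry used in this guide, the trailing `0 2` means the volume is skipped by dump and is checked by fsck after the root filesystem.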

4. Set Up the Webhook HTTP Listener Using Node.js

If you prefer not to use Node.js, you could use a Python Flask service, or a serverless function such as AWS Lambda (if you use a cloud provider like AWS) to run the remediation directly. The method below, however, gives you direct control over the infrastructure.

First, create the script that Datadog's Webhook integration will trigger. It clears the files in the /demo directory.

Example script (purge_demo.sh):

  nano /tmp/purge_demo.sh

Add the following content:

  #!/bin/bash
  # Purge everything under /demo and record what happened in a log file.

  LOGFILE="/tmp/purge_demo.log"

  echo "Running purge script at $(date)" >> "$LOGFILE"

  # Directory to purge
  TARGET_DIR="/demo"

  # Check if the directory exists
  if [ -d "$TARGET_DIR" ]; then
      echo "Purging all files in $TARGET_DIR..." >> "$LOGFILE"
      # ${TARGET_DIR:?} aborts if the variable is empty, so this never expands to "rm -rf /*"
      if rm -rf "${TARGET_DIR:?}"/* >> "$LOGFILE" 2>&1; then
          echo "All files in $TARGET_DIR have been purged." >> "$LOGFILE"
      else
          echo "Purge of $TARGET_DIR failed; see the errors above." >> "$LOGFILE"
          exit 1
      fi
  else
      echo "Directory $TARGET_DIR does not exist." >> "$LOGFILE"
  fi

  • Make the Script Executable:

         chmod +x /tmp/purge_demo.sh

  1. Install Node.js:

    • Add the NodeSource package repository and install Node.js:

         curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
         sudo apt install -y nodejs

    • Verify Installation:

         node -v
         npm -v

  2. Create a Simple Node.js Webhook Listener:

    • Create a directory for the webhook listener and navigate to it:

         mkdir ~/webhook_listener
         cd ~/webhook_listener

    • Initialize a new Node.js project:

         npm init -y

    • Install Express:

         npm install express

    • Create the webhook_listener.js file:

         nano webhook_listener.js

 - Add the following code to `webhook_listener.js`:
   ```javascript
   const express = require('express');
   const { exec } = require('child_process');

   const app = express();
   const port = 6060;

   app.post('/purge', (req, res) => {
       // Execute the purge script
       exec('/tmp/purge_demo.sh', (error, stdout, stderr) => {
           if (error) {
               console.error(`Error executing script: ${error.message}`);
               res.status(500).send('Internal Server Error');
               return;
           }
           if (stderr) {
               console.error(`Script stderr: ${stderr}`);
           }
           console.log(`Script output: ${stdout}`);
           res.send('Purge script executed');
       });
   });

   app.listen(port, () => {
       console.log(`Webhook listener running at http://localhost:${port}`);
   });
   ```

  • The service listens for incoming HTTP POST requests on port 6060. When the Datadog monitor fires, Datadog sends a POST request to the /purge URL (over HTTPS if the listener sits behind a tunnel or a TLS-terminating proxy), and that request runs the purge script.
  3. Make the Listener Persistent with PM2 (the service will run continuously in the background):

    • Install PM2:

         sudo npm install -g pm2

    • Start the webhook listener with PM2:

         pm2 start webhook_listener.js

    • Verify the PM2 process, and stop or restart it when needed:

         pm2 list
         pm2 stop webhook_listener.js
         pm2 restart webhook_listener.js

Test the Webhook Listener

  1. Simulate a Webhook Request:

You can use curl to simulate a POST request to your webhook:

   curl -X POST http://localhost:6060/purge

If everything is set up correctly, the listener should run /tmp/purge_demo.sh and respond with "Purge script executed".
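
As written, the listener runs the purge script for any POST that reaches /purge, so anyone who can reach the port can empty /demo. One possible hardening, sketched below, is to require a shared secret in a request header and configure the same value as a custom header on the Datadog webhook. The header name `x-webhook-secret`, the `WEBHOOK_SECRET` environment variable, and both function names are assumptions made for this illustration, not part of the original setup.

```javascript
const { createHash, timingSafeEqual } = require('node:crypto');

// Hypothetical check: does the request carry the expected shared secret?
function isAuthorized(headers, expectedSecret) {
  const provided = headers['x-webhook-secret']; // header name is an assumption
  if (typeof provided !== 'string') return false;
  if (typeof expectedSecret !== 'string' || expectedSecret === '') return false;
  // Hash both values so timingSafeEqual always compares equal-length buffers.
  const a = createHash('sha256').update(provided).digest();
  const b = createHash('sha256').update(expectedSecret).digest();
  return timingSafeEqual(a, b);
}

// Express-style middleware built on the check above.
function requireSecret(expectedSecret) {
  return (req, res, next) => {
    if (isAuthorized(req.headers, expectedSecret)) return next();
    res.status(401).send('Unauthorized');
  };
}

console.log(isAuthorized({ 'x-webhook-secret': 's3cret' }, 's3cret')); // true
console.log(isAuthorized({}, 's3cret')); // false
```

In the listener, this could be wired in as `app.post('/purge', requireSecret(process.env.WEBHOOK_SECRET), handler)`, where `handler` is the existing callback that runs the script.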

5. [Optional] Expose the Local Server with a Tunnel

If the server is not a cloud instance with a public address, Datadog cannot reach it directly, so it needs temporary inbound access from the internet through a tunnel such as Localtunnel.

  1. Install Localtunnel:

         sudo npm install -g localtunnel

  2. Start a Tunnel:

    • Start a Localtunnel to expose your local webhook listener:

         lt --port 6060 --subdomain trigger-xxxx

    • The URL printed on screen will look like https://trigger-xxxx.loca.lt.
  3. Get Tunnel Password:

    • When you open the URL in a browser for the first time, you may be asked to enter the tunnel password. Retrieve it with:

         wget -q -O - https://loca.lt/mytunnelpassword

    • The public IP address displayed is your password.

6. Configure Datadog Webhook

  1. Create a Webhook in Datadog:

    • Log in to Datadog, navigate to Integrations, and search for Webhooks.
    • Click New Webhook and configure it:
      • Name: Run_Purge_Script
      • URL: https://trigger-xxxx.loca.lt/purge (when using the tunnel URL)
      • OR
      • URL: http://server_domainIP:6060/purge (for a cloud instance reached directly on the listener port; use https:// without the port only if a reverse proxy or load balancer terminates TLS in front of the listener)
      • Additional Options: Set as needed. [optional]
    • Click Save.

  2. Test that Datadog Can Send a POST Request:

    • On the monitoring page, navigate to Synthetic Monitoring and Testing > New Test.
    • Click New API test, choose HTTP, set the method to POST, enter the same URL you configured for the webhook, and click Send.
    • You should get a success response. Note that this request runs the purge script.

  3. Set Up a Datadog Monitor to Trigger the Webhook:

    • Create a Monitor:
      • Navigate to Infrastructure in Datadog.
      • Hover over the host and click View host dashboard.
      • The dashboard shows graphs of the metrics you can monitor.
      • Click the metric you want to monitor (e.g. disk usage by device), then click Create Monitor.
    • Configure the Monitor:
      • Set the query to trigger an alert when disk usage exceeds 90%:

           max(last_5m):max:system.disk.in_use{device:/dev/mapper/demoVG-demoLV} by {device} > 0.9

      • Set the alert message:

           Alert: Disk usage on /demo has exceeded 90%. Triggering purge script.

      • Add recipients:

           @your_email@domain.com @webhook-Run_Purge_Script

        The webhook is a recipient too: type @ in the message box and a list of recipients pops up for you to select from.
    • Save and Activate the Monitor:
      • Click Save to activate the monitor.
      • Navigate to the Monitors section to see a list of your configured monitors.
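
Because the webhook is a recipient of the monitor, it is notified on every state change, including recovery, and the listener as written would run the purge each time. A minimal sketch of one way to filter this is shown below. It assumes you set a custom payload on the Datadog webhook that includes the `$ALERT_TRANSITION` template variable, for example `{"transition": "$ALERT_TRANSITION"}`; the `transition` field name and the `shouldPurge` function are my own, and the listener would also need `app.use(express.json())` to parse the request body.

```javascript
// Transitions that should start a purge. Recoveries and other
// notifications are ignored. The exact strings are assumed to match
// what Datadog substitutes for $ALERT_TRANSITION.
const PURGE_TRANSITIONS = new Set(['Triggered', 'Re-Triggered']);

// Hypothetical filter: decide from the parsed webhook body whether to purge.
function shouldPurge(payload) {
  if (payload === null || typeof payload !== 'object') return false;
  return PURGE_TRANSITIONS.has(payload.transition);
}

console.log(shouldPurge({ transition: 'Triggered' })); // true
console.log(shouldPurge({ transition: 'Recovered' })); // false
```

Inside the `/purge` handler, a call such as `if (!shouldPurge(req.body)) return res.send('Ignored');` placed before `exec` would skip the script for recovery notifications.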

7. Verify the Self-Healing Process

  1. Populate the /demo Directory:
    • Open a separate session to the server.
    • Write a large file to the /demo directory until it reaches about 95% capacity to simulate a full disk:

         dd if=/dev/zero of=/demo/testfile bs=1M count=1900


  2. Monitor Disk Usage:

    • Open a new session to the server and change to the /demo directory.
    • Run a continuous loop to watch the directory:

         while true; do date && ls -l && pwd && du -ms; sleep 2; done

    • Every two seconds, this prints the current date and time, lists the files, shows the current working directory, and reports the total size of the files in megabytes.
    • Ensure that the disk usage exceeds 90% to trigger the Datadog monitor.
  3. Check Webhook Execution:

    • Verify that the webhook is called and the purge script executes as expected.

    • An email is sent automatically about the filled-up disk, and the /demo partition is cleaned up.

      The email is triggered at the same time the disk fills up. Within seconds the script has run, averting any impending disruption.

    • Check the /demo directory to confirm that the files are deleted once the threshold is crossed.
    • Another email arrives to confirm that the alert has recovered and been closed.
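
While testing, it can help to see usage as a fraction that is comparable to the 0.9 threshold in the monitor query. The sketch below computes it from filesystem block counts in the same way `df` derives its Use% column. Treat it as an approximation: the Datadog Agent's own calculation of `system.disk.in_use` may differ slightly. The function names are hypothetical, and `statfsSync` is only available in recent Node.js releases such as the 20.x line installed earlier.

```javascript
const { statfsSync } = require('node:fs');

// Fraction of the filesystem in use: used / (used + available to
// unprivileged users), computed from block counts.
function usedFraction({ blocks, bfree, bavail }) {
  const used = blocks - bfree;
  const total = used + bavail;
  return total > 0 ? used / total : 0;
}

// True when usage is above the threshold (0.9 mirrors the monitor query).
function exceedsThreshold(stats, threshold = 0.9) {
  return usedFraction(stats) > threshold;
}

// Example with made-up block counts.
const sample = { blocks: 1000, bfree: 50, bavail: 40 };
console.log(usedFraction(sample).toFixed(3), exceedsThreshold(sample));

// Live check against a mount point; '/demo' is the mount point from this guide.
function checkMount(path = '/demo') {
  return usedFraction(statfsSync(path));
}
```

On the server, `console.log(checkMount())` prints the current fraction for /demo, which you can compare with what the monitor graph shows.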

By following these steps, you'll establish a working self-healing system. While the script provided focuses on clearing files from /demo, it can be adapted to perform other actions, such as restarting a service, scaling an instance, rolling back a Kubernetes update, or any other task. With Datadog monitoring disk usage and triggering a Node.js webhook, the system automatically executes the necessary script, keeping the management of your infrastructure responsive and efficient.
Please feel free to leave a like, comment, or ask a question if you need clarity on any of the steps. Happy Learning!
