In today’s fast-paced digital world, maintaining system reliability and minimizing downtime are critical for business success. This comprehensive guide explores how to enhance system resilience through advanced monitoring and self-healing mechanisms.
We will walk you through integrating Datadog for monitoring, setting up automated recovery scripts, and leveraging Node.js and webhooks to create a reliable self-healing system (with a focus on disk management).
By the end of this guide, you'll have a fully automated setup that can proactively manage system issues, ensuring smooth and uninterrupted operations.
Prerequisites
- A Datadog Account: Create an active Datadog account for monitoring and alerts.
- A Linux Server: This can be either on-premises or a cloud-based instance.
- Internet Access: Required for installing software, setting up integrations, and accessing Datadog.
1. Create a Datadog Account
- Sign Up for Datadog:
  - Visit Datadog's sign-up page and create a free trial account by entering your email and setting a password.
  - Complete the registration process, create a name for your organization, and verify your email if required.
  - Log in to your Datadog dashboard.
2. Deploy Your First Datadog Agent
Installation guides for your preferred operating system are listed in the Integrations > Agents section.
For a Local Ubuntu Server:
- Install the Datadog Agent:
  - Obtain Your Datadog API Key:
    - Navigate to Organization settings > API Keys to find your API key.
  - Install the Agent:
    - Run the following command on your Ubuntu server to install the Datadog Agent, replacing the masked key with your own:
      ```bash
      DD_API_KEY=8axxxxxxxxxxxxxxxxxxxxxxxxxxxf43 DD_SITE="datadoghq.com" bash -c "$(curl -L https://install.datadoghq.com/scripts/install_script_agent7.sh)"
      ```
    - This command sets up the agent and automatically configures it with your API key.
  - Verify Agent Installation:
    - Check the agent status to ensure it is running:
      ```bash
      systemctl status datadog-agent
      sudo datadog-agent status
      ```
    - You should see output indicating that the agent is running and sending data.
    - To check logs, use:
      ```bash
      tail -f /var/log/datadog/agent.log
      ```
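If the agent installs but no data arrives, one thing worth ruling out is a mistyped API key. The sketch below builds a request for Datadog's key validation endpoint (`GET /api/v1/validate`). It assumes Node.js 18 or newer for the global `fetch`; the function names are illustrative and not part of any Datadog library.

```javascript
// Sketch: check that a Datadog API key is accepted by the API.
// Assumes Node.js 18+ (global fetch). Function names are hypothetical.

// Build the request for the key validation endpoint (GET /api/v1/validate).
// `site` mirrors the DD_SITE value used in the install command.
function buildValidateRequest(apiKey, site = 'datadoghq.com') {
  if (!apiKey) {
    throw new Error('A Datadog API key is required');
  }
  return {
    url: `https://api.${site}/api/v1/validate`,
    options: {
      method: 'GET',
      headers: { 'DD-API-KEY': apiKey, Accept: 'application/json' },
    },
  };
}

// Send the request; resolves to true when Datadog reports the key as valid.
async function validateApiKey(apiKey, site) {
  const { url, options } = buildValidateRequest(apiKey, site);
  const response = await fetch(url, options);
  if (!response.ok) {
    return false; // a rejected key comes back with a non-2xx status
  }
  const body = await response.json();
  return body.valid === true;
}

// Example call (left commented out because it needs network access):
// validateApiKey(process.env.DD_API_KEY, 'datadoghq.com').then(console.log);
```

Reading the key from an environment variable, as in the commented example, keeps it out of shell history and source files.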
For an Ubuntu Server Managing a Kubernetes Cluster (with kubectl and helm already installed):
- Create a Datadog API Key Secret:
  - Execute the following command to create a Kubernetes secret containing your Datadog API key:
    ```bash
    kubectl create secret generic datadog-secret --from-literal api-key=8axxxxxxxxxxxxxxxxxxxxxxxxxxxf43
    ```
- Deploy the Datadog Agent in the Cluster:
  - If Helm is not installed yet, install it first:
    ```bash
    curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
    chmod 700 get_helm.sh
    ./get_helm.sh
    ```
  - Add the Datadog Helm repository and update:
    ```bash
    helm repo add datadog https://helm.datadoghq.com
    helm repo update
    ```
  - Create and configure datadog-values.yaml:
    ```bash
    nano datadog-values.yaml
    ```
    Add the following content:
    ```yaml
    datadog:
      apiKeyExistingSecret: datadog-secret
    ```
  - Deploy the Datadog Agent:
    ```bash
    helm install datadog-agent -f datadog-values.yaml datadog/datadog
    ```
  - Confirm agents are running:
    ```bash
    kubectl get all
    ```
3. Prepare a Disk for Monitoring
An alert will be triggered when usage of this disk reaches a set threshold.
- Add a New Disk to the Server:
  - In your VMware or AWS cloud instance, add a new 2GB disk.
Adding a 2GB Disk in VMware Workstation
- Open VMware Workstation:
  - Launch VMware Workstation and select your virtual machine.
- Open VM Settings:
  - Right-click on the virtual machine and select Settings.
- Add a New Disk:
  - Click Add to open the Add Hardware Wizard.
  - Choose Hard Disk and click Next.
  - Select SCSI (recommended) or IDE and click Next.
  - Choose Create a new virtual disk and click Next.
  - Specify the disk size as 2 GB.
  - Choose the location to store the virtual disk file and click Next.
  - Click Finish to create the disk.
Adding a 2GB Disk in AWS
- Log in to AWS Management Console:
  - Navigate to AWS Management Console.
  - Log in with your credentials.
- Navigate to EC2 Dashboard:
  - Go to Services > EC2.
- Create a New EBS Volume:
  - In the left sidebar, click on Volumes under Elastic Block Store.
  - Click Create Volume.
  - Configure the volume:
    - Volume Type: Choose General Purpose SSD (gp3), Provisioned IOPS SSD (io1), etc.
    - Size: Enter 2 GiB.
    - Availability Zone: Select the same availability zone as your EC2 instance.
  - Click Create Volume.
- Attach the EBS Volume to an EC2 Instance:
  - Go back to Volumes.
  - Select the volume you created.
  - Click Actions > Attach Volume.
  - Choose the instance you want to attach the volume to from the drop-down list.
  - Click Attach.
Log in to Your EC2 or VMware Instance.
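- Prepare the Volume for Use:
  - Verify Disk:
    ```bash
    lsblk
    ```
    The new disk should be listed as /dev/xvdf or similar (for example /dev/sdb on a VMware guest). Note the device name that lsblk reports, because the LVM commands that follow must use it.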
Install LVM:
LVM (Logical Volume Management) is used to manage disk volumes because it offers flexibility, efficiency, and scalability. It allows dynamic resizing of partitions, easy addition and removal of disks, and improved storage utilization through aggregation and thin provisioning. LVM enhances performance with striping, supports snapshots for backups, simplifies administration, and can be combined with mirroring or RAID for high availability, making it an ideal choice for environments with dynamic storage needs.
- Install LVM tools:
  ```bash
  sudo apt update
  sudo apt install -y lvm2
  ```
- Set Up LVM (the commands assume the new disk is /dev/sdb; substitute the device name reported by lsblk):
  - Create a Physical Volume (PV):
    ```bash
    sudo pvcreate /dev/sdb
    ```
  - Create a Volume Group (VG):
    ```bash
    sudo vgcreate demoVG /dev/sdb
    ```
  - Create a Logical Volume (LV) using all available space:
    ```bash
    sudo lvcreate -n demoLV -l 100%FREE demoVG
    ```
- Format and Mount the Logical Volume:
  - Format the LV with the ext4 filesystem:
    ```bash
    sudo mkfs.ext4 /dev/demoVG/demoLV
    ```
  - Create a mount point and mount the LV:
    ```bash
    sudo mkdir /demo
    sudo mount /dev/demoVG/demoLV /demo
    ```
  - Verify the Mount:
    ```bash
    df -h /demo
    ```
- Configure Automatic Mounting:
  - Add the following entry to /etc/fstab for automatic mounting at boot:
    ```bash
    echo '/dev/demoVG/demoLV /demo ext4 defaults 0 2' | sudo tee -a /etc/fstab
    ```
  - Verify fstab Configuration:
    ```bash
    cat /etc/fstab
    ```
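The mount can also be confirmed from Node.js, which is convenient because the rest of this guide runs on Node. The sketch below parses text in the `/proc/mounts` format and reports which device backs a mount point. The function names are illustrative, and the sample lines are made-up data. On a typical system the LVM volume appears under its device-mapper name, `/dev/mapper/demoVG-demoLV`, which is the name the Datadog monitor query uses later.

```javascript
const fs = require('fs');

// Parse text in /proc/mounts format ("device mountpoint fstype options dump pass")
// and return the first entry for a mount point, or null if nothing is mounted there.
function findMount(mountsText, mountPoint) {
  for (const line of mountsText.split('\n')) {
    const [device, target, fsType] = line.trim().split(/\s+/);
    if (target === mountPoint) {
      return { device, mountPoint: target, fsType };
    }
  }
  return null;
}

// Read the live mount table (Linux only) and look up a mount point.
function checkMounted(mountPoint) {
  return findMount(fs.readFileSync('/proc/mounts', 'utf8'), mountPoint);
}

// Demonstration on sample text, so the sketch runs on any machine.
const sampleMounts = [
  '/dev/sda1 / ext4 rw,relatime 0 0',
  '/dev/mapper/demoVG-demoLV /demo ext4 rw,relatime 0 0',
].join('\n');
console.log(findMount(sampleMounts, '/demo'));
```

On the server itself, `checkMounted('/demo')` returns `null` when the volume is not mounted, which makes it usable as a precondition before filling or purging the directory.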
4. Set Up the Webhook HTTPS Listener Using Node.js
If you prefer not to use Node.js, you could use a Python Flask service instead, or, on a cloud provider such as AWS, a serverless function (AWS Lambda or an equivalent service) to trigger the script. The method below, however, gives you direct control over the infrastructure.
First, create the script that Datadog's Webhook integration will trigger. It clears the files in the /demo directory.
Example script (purge_demo.sh):
```bash
nano /tmp/purge_demo.sh
```
```bash
#!/bin/bash
# Log file that records each purge run
LOGFILE="/tmp/purge_demo.log"
echo "Running purge script at $(date)" >> "$LOGFILE"
# Directory to purge
TARGET_DIR="/demo"
# Check if the directory exists
if [ -d "$TARGET_DIR" ]; then
    echo "Purging all files in $TARGET_DIR..." >> "$LOGFILE"
    # ${TARGET_DIR:?} aborts if the variable is empty, so this can never expand to "rm -rf /*"
    rm -rf -- "${TARGET_DIR:?}"/* >> "$LOGFILE" 2>&1
    echo "All files in $TARGET_DIR have been purged." >> "$LOGFILE"
else
    echo "Directory $TARGET_DIR does not exist." >> "$LOGFILE"
fi
```
- Make the Script Executable:
  ```bash
  chmod +x /tmp/purge_demo.sh
  ```
- Install Node.js:
  - Install the Node.js package repository and Node.js:
    ```bash
    curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
    sudo apt install -y nodejs
    ```
  - Verify Installation:
    ```bash
    node -v
    npm -v
    ```
- Create a Simple Node.js Webhook Listener:
  - Create a directory for the webhook listener and navigate to it:
    ```bash
    mkdir ~/webhook_listener
    cd ~/webhook_listener
    ```
  - Initialize a new Node.js project:
    ```bash
    npm init -y
    ```
  - Install Express:
    ```bash
    npm install express
    ```
  - Create the webhook_listener.js file:
    ```bash
    nano webhook_listener.js
    ```
- Add the following code to `webhook_listener.js`:
```javascript
const express = require('express');
const { exec } = require('child_process');

const app = express();
const port = 6060;

// Datadog calls this endpoint when the monitor fires
app.post('/purge', (req, res) => {
  // Execute the purge script
  exec('/tmp/purge_demo.sh', (error, stdout, stderr) => {
    if (error) {
      console.error(`Error executing script: ${error.message}`);
      res.status(500).send('Internal Server Error');
      return;
    }
    if (stderr) {
      console.error(`Script stderr: ${stderr}`);
    }
    console.log(`Script output: ${stdout}`);
    res.send('Purge script executed');
  });
});

// Start listening for incoming webhook requests
app.listen(port, () => {
  console.log(`Webhook listener running at http://localhost:${port}`);
});
```
<img width="450" alt="image" src="https://github.com/user-attachments/assets/a63ad15f-795c-4f7e-bb92-c6a001abd848">
- The service will be actively listening for incoming HTTP POST requests on port 6060. When Datadog triggers the webhook, it will send an HTTP or HTTPS POST request to this specific URL. This request will prompt the execution of the purge script.
- Make the Listener Persistent with PM2 (the service will run continuously in the background):
  - Install PM2:
    ```bash
    sudo npm install -g pm2
    ```
  - Start the webhook listener with PM2:
    ```bash
    pm2 start webhook_listener.js
    ```
  - Verify PM2 Process (the stop and restart commands are there for managing the listener later):
    ```bash
    pm2 list
    pm2 stop webhook_listener.js
    pm2 restart webhook_listener.js
    ```
Test the Webhook Listener
- Simulate a Webhook Request:
  You can use curl to simulate a POST request to your webhook:
  ```bash
  curl -X POST http://localhost:6060/purge
  ```
  If everything is set up correctly, the Node.js listener should execute /tmp/purge_demo.sh and return a confirmation message.
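The /purge endpoint deletes data, and once it is reachable from the internet anyone who finds the URL can trigger it. One possible safeguard, sketched below, is a shared secret sent in a request header and checked before the script runs. The header name `x-webhook-token` and the `PURGE_TOKEN` environment variable are assumptions made for this example. Datadog's webhook configuration offers a custom headers field where such a header could be set.

```javascript
const crypto = require('crypto');

// Compare two secrets in constant time. Hashing first gives both buffers
// the same length, which crypto.timingSafeEqual requires.
function tokensMatch(provided, expected) {
  if (typeof provided !== 'string' || typeof expected !== 'string' || expected === '') {
    return false;
  }
  const providedHash = crypto.createHash('sha256').update(provided).digest();
  const expectedHash = crypto.createHash('sha256').update(expected).digest();
  return crypto.timingSafeEqual(providedHash, expectedHash);
}

// Express-style middleware factory. The header name is hypothetical;
// Node.js lowercases incoming header names, so the default is lowercase too.
function requireToken(expectedToken, headerName = 'x-webhook-token') {
  return (req, res, next) => {
    if (tokensMatch(req.headers[headerName], expectedToken)) {
      next();
      return;
    }
    res.status(401).send('Unauthorized');
  };
}

// Usage in webhook_listener.js (assumes PURGE_TOKEN is set in the environment):
// app.post('/purge', requireToken(process.env.PURGE_TOKEN), (req, res) => { /* run the script */ });
```

With the middleware in place, the curl test needs the header as well, for example `curl -X POST -H "x-webhook-token: $PURGE_TOKEN" http://localhost:6060/purge`.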
5. [Optional] Expose the Local Server with a Tunnel
If the server is not a cloud instance that is reachable from the internet, it needs temporary public access through a tunnel so that Datadog can reach the webhook listener.
- Install Localtunnel:
  ```bash
  sudo npm install -g localtunnel
  ```
- Start a Tunnel:
  - Start a Localtunnel to expose your local webhook listener:
    ```bash
    lt --port 6060 --subdomain trigger-xxxx
    ```
  - The URL printed on screen will look like https://trigger-xxxx.loca.lt.
- Get Tunnel Password:
  - The first time you open the URL in a browser, you may be asked to enter the tunnel password. Retrieve it with:
    ```bash
    wget -q -O - https://loca.lt/mytunnelpassword
    ```
  - The public IP address displayed is your password.
6. Configure Datadog Webhook
- Create a Webhook in Datadog:
  - Log in to Datadog, navigate to Integrations, and search for Webhooks.
  - Click New Webhook and configure it:
    - Name: Run_Purge_Script
    - URL: https://trigger-xxxxx.loca.lt/purge (for the tunnel URL), OR
    - URL: https://server_domainIP/purge (for a cloud instance)
    - Additional Options: Set as needed. [optional]
  - Click Save.
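A monitor normally notifies its recipients when the alert recovers as well as when it triggers, so the webhook may be called a second time after the disk has been cleaned. If that is not wanted, the listener can inspect the request body before running the script. The sketch below assumes the webhook's custom payload is set to something like `{"alert_transition": "$ALERT_TRANSITION"}`. The key name `alert_transition` is a choice made for this example, and the transition values are assumed to match what Datadog substitutes for that variable.

```javascript
// Transitions that should start a purge. The values are assumed to match what
// Datadog substitutes for $ALERT_TRANSITION in the webhook payload.
const PURGE_TRANSITIONS = new Set(['Triggered', 'Re-Triggered']);

// Decide whether a parsed webhook body should run the purge script.
// `alert_transition` is a hypothetical key chosen in the custom payload.
function shouldPurge(payload) {
  if (!payload || typeof payload.alert_transition !== 'string') {
    return false;
  }
  return PURGE_TRANSITIONS.has(payload.alert_transition.trim());
}

console.log(shouldPurge({ alert_transition: 'Triggered' })); // true
console.log(shouldPurge({ alert_transition: 'Recovered' })); // false
```

To use it in the listener, register `app.use(express.json())` so that `req.body` is populated, then return early from the `/purge` handler when `shouldPurge(req.body)` is false.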
Test that Datadog can send a POST request to the listener, then set up a Datadog monitor to trigger the webhook:
- On the monitoring page, navigate to Synthetic Monitoring and Testing > New Test.
- Click New API test, choose HTTP, set the method to POST and the URL to https://server_domainIP/purge, then click Send.
- You should get a success response, as in the screenshot below.
Create a Monitor:
- Navigate to Infrastructure in Datadog page.
- Hover your mouse on the host and click on
view host dashboard
- A graphic display of some metrics you can monitor will be shown here.
- Click on the metrics you want to monitor (e.g.
disk usage by device
), click Create Monitor.
-
Configure the Monitor:
- Set the query to trigger an alert when disk usage exceeds 90%:
bash max(last_5m):max:system.disk.in_use{device:/dev/mapper/demoVG-demoLV} by {device} > 0.9
- Set the alert message:
vbnet Alert: Disk usage on /demo has exceeded 90%. Triggering purge script.
- Add recipients:
less @your_email@domain.com @webhook-Run_Purge_Script
Here, your webhook will also be a recipient, by simply typing the
@
key, in the message tab, a list of recipients will pop up for you to select. - Set the query to trigger an alert when disk usage exceeds 90%:
-
Save and Activate the Monitor:
- Click Save to activate the monitor.
-
Navigate to
monitor section
to see a list of your configured monitors.
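The same monitor can be described as data, which helps if you later want to create it through Datadog's monitor API (`POST /api/v1/monitor`) instead of the UI. The sketch below only builds the definition and sends nothing. The function names are illustrative. It also shows where the device name in the query comes from: device-mapper exposes an LVM volume as `/dev/mapper/<VG>-<LV>` and doubles any hyphen inside either name.

```javascript
// Device-mapper exposes an LVM volume as /dev/mapper/<VG>-<LV>; hyphens inside
// either name are doubled so the separator stays unambiguous.
function mapperPath(volumeGroup, logicalVolume) {
  const escapeHyphens = (name) => name.replace(/-/g, '--');
  return `/dev/mapper/${escapeHyphens(volumeGroup)}-${escapeHyphens(logicalVolume)}`;
}

// Build a monitor definition with the fields used in the UI steps above.
// threshold is a fraction (0.9 means 90%); webhookName and email are placeholders.
function buildDiskMonitor({ volumeGroup, logicalVolume, threshold, webhookName, email }) {
  const device = mapperPath(volumeGroup, logicalVolume);
  const percent = Math.round(threshold * 100);
  return {
    name: `Disk usage on ${device}`,
    type: 'metric alert',
    query: `max(last_5m):max:system.disk.in_use{device:${device}} by {device} > ${threshold}`,
    message: `Alert: Disk usage on /demo has exceeded ${percent}%. Triggering purge script. @${email} @webhook-${webhookName}`,
    options: { thresholds: { critical: threshold } },
  };
}

const monitorDefinition = buildDiskMonitor({
  volumeGroup: 'demoVG',
  logicalVolume: 'demoLV',
  threshold: 0.9,
  webhookName: 'Run_Purge_Script',
  email: 'your_email@domain.com',
});
console.log(JSON.stringify(monitorDefinition, null, 2));
```

Sending the definition to the API would also require an API key and an application key, which this sketch deliberately leaves out.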
7. Verify the Self-Healing Process
- Populate the /demo Directory:
  - Open a separate session to the server.
  - Write a test file to the /demo directory until it reaches about 95% capacity to simulate a full disk:
    ```bash
    dd if=/dev/zero of=/demo/testfile bs=1M count=1900
    ```
- Monitor Disk Usage:
  - Open a New Session: Start by opening a new session to the server.
  - Run a Continuous Loop: From inside the /demo directory, run the following command:
    ```bash
    while true; do date && ls -l && pwd && du -ms; sleep 2; done
    ```
  - Every 2 seconds, this command prints the current date and time, lists the files, shows the current working directory, and reports the total size of the files in MB.
  - Ensure that the disk usage reaches 90% to trigger the Datadog monitor.
- Check Webhook Execution:
  - Verify that the webhook is called and the purge script executes as expected.
  - An email is sent automatically about the filled-up disk, and the /demo partition is cleaned up. You will notice that the time the email was triggered and the time the disk filled up are the same. Within seconds, the script has run and the impending disruption has been averted.
  - Check the /demo directory to ensure that files are deleted when the threshold is crossed.
  - Another email arrives to report that the alert has been resolved and closed.
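To watch the usage figure itself during the test, rather than file sizes, the used fraction can be computed from the filesystem's block counts. The sketch below follows the df convention (used space divided by used plus available space), which may differ slightly from the value Datadog reports for `system.disk.in_use`. The function names are illustrative, and `fs.statfsSync` requires Node.js 18.15 or newer.

```javascript
const fs = require('fs');

// Compute the used fraction the way df does: used / (used + available to
// unprivileged users). Takes the block counts reported by statfs.
function usedFraction({ blocks, bfree, bavail }) {
  const used = blocks - bfree;
  const usable = used + bavail;
  return usable > 0 ? used / usable : 0;
}

// Read live numbers for a mounted path, for example readUsedFraction('/demo').
function readUsedFraction(path) {
  return usedFraction(fs.statfsSync(path));
}

// Example with fixed numbers so it runs anywhere:
// 1000 blocks in total, 100 free, 50 of those available to regular users.
console.log(usedFraction({ blocks: 1000, bfree: 100, bavail: 50 }).toFixed(3)); // 0.947
```

Comparing the result of `readUsedFraction('/demo')` with 0.9 shows locally whether the monitor's threshold has been crossed.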
By following these steps, you establish a working self-healing system. While the script provided here clears files from a directory, it can be adapted to perform other actions, such as restarting a service, scaling an instance, rolling back a Kubernetes update, or any other task. With Datadog monitoring disk usage and triggering a Node.js webhook, the system automatically executes the necessary script and keeps your infrastructure responsive with little manual intervention.
Please feel free to leave a like, comment, or ask a question if you need clarity on any of the steps. Happy Learning!