AIOps is no longer the next big thing; the journey has already started, and you need to get on board as quickly as possible. I'm writing a four-part series on how you can implement a comprehensive AIOps framework in AWS. The series will consist of:
Part 1: AIOps Powered by AWS: Developing Intelligent Alerting with CloudWatch & Built-In Capabilities
This is the post I’m going to walk you through today.
Part 2: AIOps with AWS: Building Custom Machine Learning Models for Enhanced Alerting and Insights
This post explores how to implement AIOps using your own data and custom ML models for tailored intelligence.
Part 3: AIOps via AWS: Enabling Intelligent Resolution with Self-Healing Bots and Automation
This post covers the use of self-healing bots and other automated solutions to resolve issues without human intervention.
Part 4: AIOps with AWS: Leveraging GenAI for Smarter, AI-Powered Solutions in IT Operations
This post focuses on how Generative AI can offer advanced solutions for operational intelligence and automation.
AIOps Powered by AWS: Developing Intelligent Alerting with CloudWatch & Built-In Capabilities
AIOps stands for Artificial Intelligence for IT Operations. Distributed systems have grown increasingly complex as monolithic applications are broken up into microservices and hosted in the cloud. The result is many more telemetry sources, a surge in data volume, and a sharp rise in possible failure scenarios. Humans can no longer manage these systems alone, and support from AI is needed.
AWS offers a range of services for implementing AIOps. Before we go further, let's identify some of the most common AIOps use cases.
Anomaly Detection:
- Metric Anomaly Detection – This involves identifying anomalies in metrics (e.g., anomalies in order failure rate or API response time).
- Log Anomaly Detection – This involves identifying new error messages appearing in logs or a rise in the occurrence of errors.
Forecasting:
- Metric Forecasting – This involves forecasting a metric value, such as predicting when you will run out of open sessions for a particular service or when you will run out of system resources.
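To make the forecasting idea concrete, here is a minimal sketch that fits a straight-line trend to evenly spaced samples and estimates how many sampling intervals remain before a limit is reached. It is only an illustration of the concept, not how any AWS service forecasts; the function names are made up for this post, and a real forecaster would also handle seasonality and noise.

```python
from typing import Optional, Sequence, Tuple


def fit_linear_trend(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept, with x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    return slope, mean_y - slope * mean_x


def intervals_until_limit(values: Sequence[float], limit: float) -> Optional[float]:
    """Sampling intervals after the last sample until the trend reaches `limit`.

    Returns None when the trend is flat or falling (the limit is never reached),
    and 0.0 when the trend has already reached the limit.
    """
    slope, intercept = fit_linear_trend(values)
    if slope <= 0:
        return None
    crossing_x = (limit - intercept) / slope
    return max(0.0, crossing_x - (len(values) - 1))
```

For example, with hourly open-session counts of 100, 120, 140, 160, 180 and a limit of 300 sessions, `intervals_until_limit` returns 6.0, meaning roughly six more hours at the current growth rate.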
Correlation:
Your system produces a lot of metrics, logs, traces, other telemetry data, and alerts. Sometimes, finding the root cause is like finding a needle in a haystack. We need a way to reduce noise and pinpoint the actual problem. AI can correlate alerts, reduce noise, and guide us to the likely cause.
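As a rough illustration of noise reduction, the sketch below groups alerts that fire close together in time, so that a burst of related alerts is reviewed as one incident. This is a deliberately naive heuristic that I am using for illustration, not the correlation logic of any AWS service; the `Alert` type, its fields, and the five-minute window are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Alert:
    name: str
    resource: str
    timestamp: float  # seconds since epoch


def group_alerts(alerts: List[Alert], window_seconds: float = 300.0) -> List[List[Alert]]:
    """Group alerts so each one is within `window_seconds` of the previous alert in its group."""
    groups: List[List[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            # Close enough to the last alert of the current group: same incident.
            groups[-1].append(alert)
        else:
            # Too far from the previous alert: start a new incident.
            groups.append([alert])
    return groups
```

The earliest alert in each group is a reasonable place to start looking for the root cause, since later alerts in the same burst are often downstream symptoms.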
These are some of the most widely used AIOps use cases.
As you may have noticed, a key part of AIOps is intelligent alerting. Static threshold-based alerts no longer serve our purpose. Instead, for traffic, errors, latency, and resource usage, we need to establish a baseline and receive an alert whenever that baseline is breached. This is where AI provides intelligent alerting.
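Here is a minimal sketch of what a baseline breach means, assuming the baseline is simply the mean of recent values plus or minus a multiple of their standard deviation. Managed anomaly detection also accounts for trend and seasonality, which this toy version ignores; the function names and the default width of 2 are my own choices.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def baseline_band(history: Sequence[float], width: float = 2.0) -> Tuple[float, float]:
    """Return (lower, upper) bounds: mean of history +/- width * sample standard deviation."""
    center = mean(history)
    spread = stdev(history)  # requires at least two data points
    return center - width * spread, center + width * spread


def breaches_baseline(history: Sequence[float], latest: float, width: float = 2.0) -> bool:
    """True when the latest value falls outside the band learned from history."""
    lower, upper = baseline_band(history, width)
    return latest < lower or latest > upper
```

Unlike a static threshold, the band moves with the data: the same latency value can be normal for one service and an anomaly for another.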
Once the alert is triggered:
Self-Healing / Remediation Bots: We can develop self-healing or remediation bots that provide solutions. These bots can be rule-based, or we can leverage GenAI to provide smarter solutions as well.
Now that you have a good understanding of what AIOps is and its use cases, let’s look at how we can leverage AWS to implement a comprehensive AIOps framework.
Instrumentation and Collection
Regardless of whether you choose workload-based solutions (EC2, ECS, or EKS) or serverless solutions (AWS Lambda), you can use CloudWatch Agent, AWS Distro for OpenTelemetry, or X-Ray to get your application to emit telemetry data such as metrics, logs, traces, and events.
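To show what an emitted metric looks like, here is a small sketch that publishes a custom metric through the CloudWatch `PutMetricData` API using boto3. Note that this is a direct API call rather than one of the agents mentioned above; I use it only because it is the shortest way to see the shape of a metric. The namespace, metric name, and `Service` dimension are hypothetical, and `publish_metric` assumes boto3 is installed and AWS credentials are configured.

```python
from typing import Any, Dict


def build_metric_request(
    namespace: str,
    metric_name: str,
    value: float,
    service: str,
    unit: str = "Count",
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch put_metric_data (one data point)."""
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": metric_name,
                # Hypothetical dimension used to slice the metric per service.
                "Dimensions": [{"Name": "Service", "Value": service}],
                "Value": value,
                "Unit": unit,
            }
        ],
    }


def publish_metric(request: Dict[str, Any]) -> None:
    """Send the request to CloudWatch. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so building requests works without the SDK

    boto3.client("cloudwatch").put_metric_data(**request)
```

A call such as `publish_metric(build_metric_request("MyShop/Orders", "OrderFailures", 3, "checkout"))` would record three failed orders for a hypothetical checkout service.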
Visualizations
On top of the foundational telemetry data, you can build your observability dashboards.
Insights and Analytics
AWS provides various out-of-the-box insights, such as:
- Container Insights
- Lambda Insights
- CloudWatch Logs Insights
- Application Insights
- EC2 Health
- AWS CloudTrail
Digital Experience Monitoring
To enable digital experience monitoring, you can leverage tools such as RUM (Real User Monitoring) and Synthetics.
All of the above are hooked into AWS CloudWatch for centralized observability.
AWS Built-in AIOps Capabilities Integrated with CloudWatch
The beauty of AWS is that it provides key AIOps use cases out of the box, such as:
- Metric anomaly detection
- Log anomaly detection
- AI-driven natural language query generation
- Intelligent insights
That’s it. You can leverage these capabilities to develop the intelligent alerting we discussed earlier.
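As a sketch of intelligent alerting built on metric anomaly detection, the code below prepares a CloudWatch alarm that fires when a metric leaves its expected band, using the `ANOMALY_DETECTION_BAND` metric math function with `put_metric_alarm`. The alarm name, namespace, metric name, and the choice of three evaluation periods are assumptions for illustration; `create_alarm` assumes boto3 is installed and AWS credentials are configured.

```python
from typing import Any, Dict


def build_anomaly_alarm(
    alarm_name: str,
    namespace: str,
    metric_name: str,
    band_width: float = 2.0,
    period: int = 300,
    stat: str = "Average",
) -> Dict[str, Any]:
    """Build keyword arguments for put_metric_alarm using an anomaly detection band."""
    return {
        "AlarmName": alarm_name,
        # Alarm when the metric goes above or below the expected band.
        "ComparisonOperator": "LessThanLowerOrGreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        # Points at the band expression below instead of a static threshold.
        "ThresholdMetricId": "band",
        "TreatMissingData": "missing",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {"Namespace": namespace, "MetricName": metric_name},
                    "Period": period,
                    "Stat": stat,
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": f"{metric_name} (expected)",
                "ReturnData": True,
            },
        ],
    }


def create_alarm(params: Dict[str, Any]) -> None:
    """Create or update the alarm. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so building the request works without the SDK

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

The second argument of `ANOMALY_DETECTION_BAND` controls how wide the band is: a larger value tolerates more deviation before alerting, so it is the main knob for trading sensitivity against noise.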
Metric Forecasting
For metric forecasting, you can leverage Amazon Forecast, the time-series forecasting service provided by AWS. CloudWatch metric data can be exported and fed into Amazon Forecast to meet our forecasting needs.
AWS DevOps Guru
Now that you’ve built some of the most common AIOps use cases, wouldn’t it be cool if AWS could monitor your entire AWS account and provide insights? Well, AWS provides AWS DevOps Guru, which can do just that. It’s based on machine learning, and some of the key use cases DevOps Guru brings to the table are:
- Anomaly Detection: Automatically detects unusual patterns in metrics, logs, and events using machine learning.
- Root Cause Analysis: Identifies the root cause of operational issues by correlating data from multiple sources, reducing resolution time.
- Proactive Insights: Offers recommendations to prevent potential issues based on best practices and historical data.
- Resource Optimization: Suggests ways to optimize resource utilization to lower costs and improve performance.
- Database Monitoring: Provides performance insights for both relational (e.g., RDS, Redshift) and non-relational databases (e.g., DynamoDB, ElastiCache).
- Capacity Planning: Forecasts future resource needs based on traffic patterns and usage trends.
Yes, DevOps Guru is your one-stop shop to get most of your AIOps requirements done.
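Once DevOps Guru is enabled, its insights can also be pulled programmatically. The sketch below lists ongoing reactive insights with the boto3 `devops-guru` client and turns the response into short, severity-ordered lines. The response fields used here (`ReactiveInsights`, `ProactiveInsights`, `Severity`, `Name`) reflect my reading of the `ListInsights` API and should be checked against the current documentation; the helper names are my own, and `fetch_ongoing_insights` assumes boto3 is installed and AWS credentials are configured.

```python
from typing import Any, Dict, List

# Lower number sorts first; unknown severities go last.
SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def summarize_insights(response: Dict[str, Any]) -> List[str]:
    """Turn a ListInsights-style response into one line per insight, most severe first."""
    insights = response.get("ReactiveInsights", []) + response.get("ProactiveInsights", [])
    ordered = sorted(insights, key=lambda i: SEVERITY_ORDER.get(i.get("Severity"), 3))
    return [f"[{i.get('Severity', 'UNKNOWN')}] {i.get('Name', '(unnamed)')}" for i in ordered]


def fetch_ongoing_insights() -> Dict[str, Any]:
    """Fetch ongoing reactive insights. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the summarizer works without the SDK

    client = boto3.client("devops-guru")
    return client.list_insights(StatusFilter={"Ongoing": {"Type": "REACTIVE"}})
```

Feeding `summarize_insights(fetch_ongoing_insights())` into a chat channel or ticketing system is one simple way to route these insights to the on-call engineer.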
What’s New from AWS at re:Invent 2024?
Yes, these are exciting times! I’ve been tracking the following awesome capabilities released by AWS, which will greatly enhance your AIOps implementation journey.
Amazon CloudWatch Enhancements:
- Contextual Observability Data – Automatically visualizes relationships between metrics, logs, and AWS resources, improving troubleshooting and root cause analysis.
- Network Performance Monitoring – Provides near real-time monitoring of network performance across workloads using flow monitors.
- Database Insights for Amazon Aurora – Offers deeper insights for Amazon Aurora PostgreSQL and MySQL, designed for DevOps engineers and DBAs.
- Enhanced Observability for ECS – Adds detailed metrics from cluster to container level to improve troubleshooting for ECS workloads.
- CloudWatch Observability Solutions for AWS Services – Pre-configured solutions for AWS services and for common workloads such as JVM, Apache Kafka, and NGINX.
- Centralized Telemetry Configuration Visibility – Provides centralized auditing and visibility for AWS telemetry configurations (e.g., VPC Flow Logs, EC2 metrics) to ensure complete monitoring coverage.
- Amazon CloudWatch Application Signals – Provides complete visibility into application transaction spans, enhancing performance analysis and root cause identification.
That's a wrap for the series opener. With these capabilities, you can build a comprehensive AIOps framework to elevate your application reliability to the next level.