TL;DR
You can monitor your serverless (Open)API with the help of CloudWatch Alarms. Please refer to this GitHub repository for the code.
The CloudWatch Alarms for API Gateway are detailed in the directory ./modules/cloudwatch-alarms-apigateway, and those for AWS Lambda in ./modules/cloudwatch-alarms-lambda.
Next week I'll discuss making sense of logs and failures with AWS CloudWatch, and enabling tracing with AWS X-Ray.
Thanks for reading!
Topic break-down
- TL;DR
- Topic break-down
- How to monitor your serverless (Open)API?
- How to set up CloudWatch Alarms for API Gateway?
- How to set up CloudWatch Alarms for AWS Lambda?
- The result
- What's next
- Further reading
How to monitor your serverless (Open)API?
In previous articles we've discussed the advantages of OpenAPI, how we can utilize OpenAPI for deployment on AWS API Gateway, and how to apply input/output verification & validation. This week we're tackling a piece of the serverless observability problem: monitoring.
The AWS service to do just that is AWS CloudWatch. CloudWatch collects metrics and logs across all services so that we can program actions or simply create dashboards that display that data as useful graphs. This gives us insight into the real-time status of our APIs.
This blog will show you how to implement and configure monitoring for AWS API Gateway and AWS Lambda.
Please refer to this GitHub repository to view the code and get a more in-depth understanding of the setup.
How to set up CloudWatch Alarms for API Gateway?
Which metrics are we tracking?
The metrics that are available for API Gateway are summarized in great detail here. We will cover these metrics (quoted in part from AWS docs):
- Latency: the time between when API Gateway receives a request from a client and when it returns a response to the client. The latency includes the integration latency and other API Gateway overhead. Specifically, we want to monitor:
  - P95/P99: the 95th and 99th percentile latency. The outliers are what we're interested in measuring, as they give us the best possible measurement of the real-world performance our end users experience. A sketch of such an alarm follows this list.
- 4XX Errors: a count of all 4XX errors (given as either an average or a sum total).
- 5XX Errors: a count of all 5XX errors, also given as either an average or a sum total.
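To make the Latency metric concrete, here's a minimal sketch of what a p99 latency alarm could look like in Terraform. The namespace, metric name and dimensions are the documented AWS/ApiGateway ones; the alarm name, threshold and API/stage/resource values are illustrative assumptions, not the repository's exact configuration:

# Illustrative p99 latency alarm for a single API method (values are assumptions).
resource "aws_cloudwatch_metric_alarm" "latency_p99" {
  alarm_name          = "example-api-latency-p99"
  namespace           = "AWS/ApiGateway"
  metric_name         = "Latency"
  extended_statistic  = "p99"          # use p95/p99 instead of a plain average
  comparison_operator = "GreaterThanThreshold"
  threshold           = 2000           # milliseconds; tune to your SLO
  evaluation_periods  = 5
  period              = 60

  dimensions = {
    ApiName  = "example-api"           # assumed API name
    Stage    = "dev"                   # assumed stage
    Resource = "/identity/authenticate"
    Method   = "POST"
  }
}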
Prerequisites
The prerequisite for per-API-method alarms is to enable Detailed CloudWatch Metrics. Note that this will incur extra costs!
We do this in the aws_api_gateway_method_settings resource in the apigateway module by setting metrics_enabled to true:
resource "aws_api_gateway_method_settings" "_" {
rest_api_id = aws_api_gateway_rest_api._.id
stage_name = aws_api_gateway_stage._.stage_name
method_path = "*/*"
settings {
throttling_burst_limit = var.api_throttling_burst_limit
throttling_rate_limit = var.api_throttling_rate_limit
metrics_enabled = var.api_metrics_enabled
}
}
In your AWS API Gateway console, under Stages > Logs/Tracing, you should now see a tick next to Enable Detailed CloudWatch Metrics.
The Terraform details
The configuration for the CloudWatch Alarms for API Gateway is located in a new module, cloudwatch-alarms-apigateway, which you can find in the directory ./modules/cloudwatch-alarms-apigateway.
In our deployment's main.tf file, call the module with all the API resources you want to enable alarms for, passed in via the api_resources variable:
module "cloudwatch_alarms_apigateway" {
source = "../../modules/cloudwatch-alarms-apigateway"
namespace = var.namespace
region = var.region
resource_tag_name = var.resource_tag_name
api_name = module.apigateway.api_name
api_stage = module.apigateway.api_stage
resources = var.api_resources
}
The api_resources variable is defined in dev.tfvars; it specifies which API resources and methods we want monitored:
api_resources = {
  "/identity/authenticate" = "POST",
  "/identity/register"     = "POST",
  "/identity/reset"        = "POST",
  "/identity/verify"       = "POST",
  "/user"                  = "GET"
}
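The per-method alarm resources themselves live inside the module. As an illustration of how a map like this can drive them, here's a minimal sketch using for_each; the variable, alarm names, API name and stage are assumptions, not the module's exact internals:

# Illustrative only: one 5XX alarm per monitored path/method pair.
variable "resources" {
  type = map(string)                   # path => HTTP method, e.g. "/user" => "GET"
}

resource "aws_cloudwatch_metric_alarm" "api_5xx" {
  for_each = var.resources

  alarm_name          = "example-5xx${replace(each.key, "/", "-")}-${each.value}"
  namespace           = "AWS/ApiGateway"
  metric_name         = "5XXError"
  statistic           = "Sum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0              # any 5XX triggers the alarm
  evaluation_periods  = 1
  period              = 60

  dimensions = {
    ApiName  = "example-api"           # assumed; the module receives this via api_name
    Stage    = "dev"                   # assumed; the module receives this via api_stage
    Resource = each.key
    Method   = each.value
  }
}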
How to set up CloudWatch Alarms for AWS Lambda?
Which metrics are we tracking?
The metrics that are available for AWS Lambda are summarized in great detail here. We will cover the following metrics (quoted in part from AWS docs); a sketch of the corresponding alarms follows the list:
- Errors: measures the number of invocations that failed due to errors (4XX)
- Throttles: Measures the number of Lambda function invocation attempts that were throttled due to invocation rates exceeding the customer’s concurrent limits (error code 429).
- IteratorAge: Emitted for stream-based invocations only (functions triggered by an Amazon DynamoDB stream or Kinesis stream). Measures the age of the last record for each batch of records processed. Age is the difference between the time Lambda received the batch, and the time the last record in the batch was written to the stream.
- DeadLetterErrors: Incremented when Lambda is unable to write the failed event payload to your function's dead-letter queue. This could be due to the following: Permissions errors, Throttles from downstream services, Misconfigured resources, Timeouts
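As with API Gateway, each of these metrics can back an aws_cloudwatch_metric_alarm. Here's a minimal sketch of the two alarms we actually create here (Errors and Throttles), using an illustrative function name and thresholds rather than the module's exact values:

# Illustrative Errors alarm for a single function (name/threshold are assumptions).
resource "aws_cloudwatch_metric_alarm" "lambda_errors" {
  alarm_name          = "example-identity-errors"
  namespace           = "AWS/Lambda"
  metric_name         = "Errors"
  statistic           = "Sum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0              # any function error raises the alarm
  evaluation_periods  = 1
  period              = 60

  dimensions = {
    FunctionName = "example-identity"  # assumed function name
  }
}

# Illustrative Throttles alarm: any throttled invocation (429) raises the alarm.
resource "aws_cloudwatch_metric_alarm" "lambda_throttles" {
  alarm_name          = "example-identity-throttles"
  namespace           = "AWS/Lambda"
  metric_name         = "Throttles"
  statistic           = "Sum"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  evaluation_periods  = 1
  period              = 60

  dimensions = {
    FunctionName = "example-identity"
  }
}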
The Terraform details
In each of the service directories (e.g. identity and user) we apply the configuration shown below to enable the AWS Lambda CloudWatch alarms for Errors and Throttles.
Since we do not have a dead-letter queue, nor are we integrated with a stream (DynamoDB or Kinesis), we do not need to create those alarms.
Everything else we do not need to set explicitly; sensible defaults are configured in the cloudwatch-alarms-lambda module's variables.tf file.
locals {
  resource_name_prefix = "${var.namespace}-${var.resource_tag_name}"
  lambda_function_name = "identity"
}

module "cloudwatch-alarms-lambda" {
  source = "../../cloudwatch-alarms-lambda"

  namespace         = var.namespace
  region            = var.region
  resource_tag_name = var.resource_tag_name

  create_iteratorAge_alarm     = false
  create_deadLetterQueue_alarm = false

  function_name = "${local.resource_name_prefix}-${local.lambda_function_name}"
}
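For reference, the "sensible defaults" mentioned above are simply module variables with default values, so callers only override what they need. A minimal sketch of what such definitions could look like in the module's variables.tf; the names and values here are illustrative assumptions, not the repository's exact definitions:

# Illustrative defaults (names and values are assumptions).
variable "errors_threshold" {
  description = "Number of function errors per period that triggers the alarm"
  type        = number
  default     = 1
}

variable "throttles_threshold" {
  description = "Number of throttled invocations per period that triggers the alarm"
  type        = number
  default     = 1
}

variable "create_iteratorAge_alarm" {
  description = "Whether to create the IteratorAge alarm (stream-based invocations only)"
  type        = bool
  default     = true   # assumed default; we override it to false above
}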
The result
Once both AWS Lambda and API Gateway alarms have been configured, we can deploy this with npm run dev-infra. If you need to initialize Terraform again, run: npm run dev-init.
Once deployed, we'll see the alarms listed in our CloudWatch console like below:
Note: most will initially read Insufficient data; over time they will turn to OK.
What's next
We've created all the alarms in CloudWatch. The next logical addition would be to integrate them with your own team tooling, such as a channel on Slack, or perhaps a dedicated administrator portal/mobile application, to increase the visibility of your API status.
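Wiring the alarms into such tooling usually starts with an SNS topic whose ARN is listed in each alarm's alarm_actions; Slack delivery is then typically handled by AWS Chatbot or a Lambda function subscribed to the topic. Here's a minimal sketch, assuming you extend the alarm modules (or their resources) to accept the topic ARN:

# Illustrative SNS topic for alarm notifications (the name is an assumption).
resource "aws_sns_topic" "alarms" {
  name = "example-api-alarms"
}

# On each aws_cloudwatch_metric_alarm, point alarm_actions (and optionally
# ok_actions) at the topic so that state changes are published to subscribers:
#
#   alarm_actions = [aws_sns_topic.alarms.arn]
#   ok_actions    = [aws_sns_topic.alarms.arn]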
Next week I'll attempt to showcase how errors & failures can be made more transparent with AWS CloudWatch and AWS X-Ray with these goals in mind:
- How to make sense of logs & failures,
- and how we can trace execution across multiple services.
Thanks for reading!