Amazon DevOps Guru for the Serverless applications - Part 13 Anomaly detection on Aurora Serverless v2 with Data API (kind of)

Vadym Kazulkin - Nov 28 - - Dev Community

Introduction

In my article Amazon DevOps Guru for the Serverless applications - Part 10 Anomaly detection on Aurora Serverless v2 we learned that DevOps Guru was able to successfully detect anomalies on an Aurora (Serverless v2) PostgreSQL database when a Lambda function with the Java 21 managed runtime was connected to it via JDBC. We let our database scale only from 0.5 to 1 ACU and created a very high load on it by invoking the Lambda function that retrieves a product by id several hundred times concurrently for multiple minutes. We saw that DevOps Guru correctly pointed to the increased sum of database connections and the constantly high database (CPU) load. In this article I'd like to figure out whether DevOps Guru will detect the anomaly when we run the same experiment but use Data API for Aurora Serverless v2 with AWS SDK for Java instead of JDBC.

This article is a copy of my article Aurora Serverless v2 Data API meets DevOps Guru or not? which I published as a part of the Data API for Amazon Aurora Serverless v2 with AWS SDK for Java series.

Anomaly detection on Aurora Serverless v2 with Data API

Let's look at our sample application and use the SAM template to create the infrastructure and deploy the application shown in the following picture:

Image description

The application creates products stored in the Aurora Serverless v2 PostgreSQL database and retrieves them by id using Data API. The relevant Lambda function which we'll use to retrieve product by its id is GetProductByIdViaAuroraServerlessV2DataApi and its handler implementation is GetProductByIdViaAuroraServerlessV2DataApiHandler.

As in the previous article, we use the hey tool to perform the stress test like this:

hey -z 15m -c 300 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/productsWithDataApi/1   

In this example we invoke the API Gateway endpoint with 300 concurrent workers (the -c option) for 15 minutes (the -z option). Behind the prod/productsWithDataApi endpoint, the Lambda function GetProductByIdViaAuroraServerlessV2DataApi is invoked, which retrieves the product with id 1 from the Aurora Serverless v2 PostgreSQL database.
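For readers who prefer to stay in Java, the load pattern of the hey command can be approximated with the JDK's built-in HTTP client. This is only a rough sketch, not how hey itself is implemented: the class name HeyLikeLoadSketch, the 20-second request timeout and the use of -1 as a marker for failed calls are my assumptions for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.IntSupplier;

// Hypothetical helper, not part of the sample application.
final class HeyLikeLoadSketch {

    // Starts `workers` threads that call `request` in a loop until `duration` has elapsed.
    // Returns how often each status code was seen; -1 stands for a failed call.
    static Map<Integer, Long> run(IntSupplier request, int workers, Duration duration)
            throws InterruptedException {
        Map<Integer, LongAdder> counts = new ConcurrentHashMap<>();
        long deadline = System.nanoTime() + duration.toNanos();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                while (System.nanoTime() - deadline < 0) {
                    int status;
                    try {
                        status = request.getAsInt();
                    } catch (RuntimeException e) {
                        status = -1;
                    }
                    counts.computeIfAbsent(status, k -> new LongAdder()).increment();
                }
            });
        }
        pool.shutdown();
        // Give in-flight requests some time to finish after the deadline.
        pool.awaitTermination(duration.toSeconds() + 60, TimeUnit.SECONDS);
        Map<Integer, Long> result = new TreeMap<>();
        counts.forEach((status, n) -> result.put(status, n.sum()));
        return result;
    }

    // Builds a call that sends one GET with the X-API-Key header and returns the HTTP status.
    static IntSupplier httpGet(String url, String apiKey) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("X-API-Key", apiKey)
                .timeout(Duration.ofSeconds(20)) // assumed value, a bit above the Lambda timeout
                .GET()
                .build();
        return () -> {
            try {
                return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
            } catch (Exception e) {
                return -1; // network error or client-side timeout
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        // args[0] = endpoint URL, args[1] = API key; 300 workers for 15 minutes as in the hey call
        System.out.println(run(httpGet(args[0], args[1]), 300, Duration.ofMinutes(15)));
    }
}
```

The request logic is passed in as an IntSupplier so that the worker loop can be tried out with a stub before pointing it at a real endpoint.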

In our [SAM template](https://github.com/Vadym79/AWSLambdaJavaAuroraServerlessV2DataApi/blob/master/template.yaml) we configured the Aurora database cluster to scale from a minimal capacity of 0.5 ACU to a maximal capacity of 1 ACU under increased load. This is a very small database size, which we chose to save costs.

  AuroraServerlessV2Cluster:
    Type: 'AWS::RDS::DBCluster'
...
      ServerlessV2ScalingConfiguration:
        MinCapacity: 0.5
        MaxCapacity: 1

An Aurora (Serverless v2) database limits the maximal number of available database connections proportionally to the database size (in our case the ACU setting), and this also applies when using Data API for Aurora Serverless v2. This is a huge difference to v1, which goes out of support at the end of 2024 and had a hard quota of 1000 database connections per second. For more information, please read the documentation about Maximum connections for Aurora Serverless v2.

So, with the increased number of invocations, we expect to soon reach the maximal number of available database connections and a high database (CPU) load, so that the database won't be able to respond to new Lambda function requests to retrieve the product by id (the Lambda function will then also run into its timeout). With that we will provoke the anomaly and would like to figure out whether DevOps Guru will be able to detect it. And it was able to, kind of... The following insight was generated:

Image description

And the following aggregated anomalous metrics have been identified:

Image description

Compared to the aggregated anomalous metrics identified when using JDBC instead of Data API, as described in my article Amazon DevOps Guru for the Serverless applications - Part 10 Anomaly detection on Aurora Serverless v2, we completely miss the anomalous metrics of the Aurora database: database connection sum and database (CPU) load. We do correctly see the error in the Lambda function, which ran into its configured timeout of 15 seconds because the database couldn't respond.

Image description.

So, what's the difference? Let's explore both incidents that we reproduced on the Aurora Serverless v2 PostgreSQL cluster, one with JDBC (non-Data API) and one with Data API:

In terms of ACU utilization/scaling they both look the same:

Image description

In terms of other database metrics like CPU Utilization, DatabaseConnection and DBLoad(CPU), there are huge differences:

Image description

  • CPU Utilization looks the same for the JDBC (non-Data API) and Data API cases. But DevOps Guru doesn't seem to consider this metric, as we didn't see it even in the JDBC experiment.
  • DBLoad(CPU) is very low for Data API usage. It seems that for Data API there is some load balancer in front of the Aurora Serverless v2 database which monitors the connection usage and protects the database from being overloaded.
  • The DatabaseConnection metric is not shown (or shown as 0) for Data API usage. The reason is that we don't manage database connections with Data API; it's done for us on the service side. Of course, connections still play an important role, as we learned in Maximum connections for Aurora Serverless v2, but this metric doesn't seem to be exposed in the CloudWatch metrics, and even DevOps Guru doesn't have access to the real numbers.

With no visible database connections and a very low DBLoad(CPU), DevOps Guru reported no anomalous metrics for the Aurora Serverless v2 cluster itself when using Data API, in contrast to the JDBC use case.

I did a second experiment by connecting to the Aurora Serverless v2 cluster directly. I wrote a script that creates load by fetching the product by id several hundred times in the standard way (non-Data API). This is similar to what we did with the hey tool, but talking to the database directly instead of invoking API Gateway. After I had put the database under load, I started the same experiment with the hey tool as described above and wanted to see what would happen. The same insight was generated, but this time with the following anomalous metrics:
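The original script is not shown here, so the following is only a minimal sketch of what such a direct load test could look like with plain JDBC. The class name DirectJdbcLoadSketch, the table and column names in the SQL statement, the DB_USER and DB_PASSWORD environment variables and the numbers of connections and iterations are all assumptions for illustration. Running it requires the PostgreSQL JDBC driver on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper, not the script used in the experiment.
final class DirectJdbcLoadSketch {

    // Builds a PostgreSQL JDBC URL for the cluster endpoint.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:postgresql://" + host + ":" + port + "/" + database;
    }

    // Opens one connection and fetches the product with the given id `times` times.
    // Returns the total number of rows read.
    static int fetchRepeatedly(String url, String user, String password, long id, int times)
            throws SQLException {
        int rows = 0;
        // Table and column names are assumed; adjust them to the real schema.
        String sql = "SELECT id, name, price FROM products WHERE id = ?";
        try (Connection connection = DriverManager.getConnection(url, user, password);
             PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setLong(1, id);
            for (int i = 0; i < times; i++) {
                try (ResultSet resultSet = statement.executeQuery()) {
                    while (resultSet.next()) {
                        rows++;
                    }
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) throws InterruptedException {
        // args[0] = cluster endpoint host, args[1] = database name
        String url = jdbcUrl(args[0], 5432, args[1]);
        String user = System.getenv("DB_USER");         // assumed variable name
        String password = System.getenv("DB_PASSWORD"); // assumed variable name
        int connections = 100;                          // placeholder value
        Thread[] threads = new Thread[connections];
        for (int i = 0; i < connections; i++) {
            threads[i] = new Thread(() -> {
                try {
                    fetchRepeatedly(url, user, password, 1L, 500);
                } catch (SQLException e) {
                    // Expected once the cluster runs out of connections.
                    System.err.println(e.getMessage());
                }
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
    }
}
```

Each thread holds its own connection for the whole run, so the connections opened this way are ordinary client connections to the cluster, unlike those that Data API manages on our behalf.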

Image description

Now we at least see the additional Aurora Serverless v2 database connection sum anomalous metric, but the DBLoad(CPU) metrics are still missing.

Graphed anomalies look like this:

Image description

Of course, the experiment wasn't clean, as I ran two load tests one after another and partially in parallel: the first one connecting to the database directly without API Gateway, and the second one using Data API. This confirmed my initial assumption that the database connection sum metric is a very important criterion for generating a DevOps Guru insight for Aurora Serverless v2 (and for RDS in general), and that it generally isn't exposed when using Data API.

I have already contacted the DevOps Guru team and shared my findings with them, in the expectation that they will improve the service. Alternatively, and first of all, the exposure of database connections as a CloudWatch metric should be fixed for Aurora Serverless v2 with Data API.

Conclusion

In this article we learned that DevOps Guru could successfully detect anomalies on an Aurora (Serverless v2) PostgreSQL database when a Lambda function with the Java 21 managed runtime was connected to it via Data API, but it only showed the anomalous metrics related to the Lambda function timing out because the database didn't respond. The main reason seems to be that the database connection CloudWatch metric isn't exposed (or is always displayed as 0) when using Aurora Serverless v2 with Data API. An Aurora Serverless v2 database metric (database connection sum) was only shown during the second, artificial experiment.
