Amazon DevOps Guru for the Serverless applications - Part 12 Anomaly detection on Lambda consuming from DynamoDB Streams

Vadym Kazulkin - Sep 30 - Dev Community

Introduction

In the 1st part of the series we introduced the Amazon DevOps Guru service, described its value proposition and the benefits of using it, and explained how to configure it. You also need to go through all the steps in the 2nd part of the series to set everything up. In the subsequent parts we saw DevOps Guru in action detecting anomalies on DynamoDB, Aurora Serverless v2, API Gateway, and on Lambda both alone and in conjunction with other AWS Serverless services like SQS, Kinesis, Step Functions and SNS.

In this part of the series I'd like to explore whether DevOps Guru will recognize anomalies on a Lambda function consuming from DynamoDB Streams.

Detecting anomalies on Lambda consuming from DynamoDB Streams

Let's enhance our architecture so that when a new product is created and persisted in DynamoDB, DynamoDB Streams triggers the UpdateProduct Lambda function. Here is the link to the AWS SAM template. The UpdateProduct Lambda function is defined with the DynamoStream event of type DynamoDB. We also added a Dead Letter Queue as the Lambda failure destination in the DestinationConfig of the DynamoStream event.

UpdatedProductFunction:
    Type: AWS::Serverless::Function
    Properties:
      FunctionName: UpdatedProduct
      .....
      Events:
        DynamoStream:
          Type: DynamoDB 
          Properties:
            Stream: !GetAtt ProductsTable.StreamArn
            DestinationConfig:
              OnFailure:
                Type: SQS
                Destination: !GetAtt OnFailureQueue.Arn
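            StartingPosition: LATEST            # read only records written after the mapping is created
            BatchSize: 50                       # up to 50 stream records per invocation
            MaximumRetryAttempts: 5             # retry a failed batch up to 5 times
            MaximumRecordAgeInSeconds: 3600     # give up on records older than 1 hour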

And here is how the DynamoDB ProductsTable is defined together with its StreamSpecification:

  ProductsTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: "ProductsTable"
      AttributeDefinitions:
        - AttributeName: 'PK'
          AttributeType: 'S'
      KeySchema:
        - AttributeName: 'PK'
          KeyType: 'HASH'
      BillingMode: PAY_PER_REQUEST
      StreamSpecification:
        StreamViewType: NEW_IMAGE

This is what the final architecture looks like:

Image description

Now let's imagine that the UpdateProduct Lambda function always runs into an error while processing a DynamodbStreamRecord (simply throw some error there), which is a part of the DynamodbEvent. Lambda then retries the batch 5 times according to our configuration of the DynamoStream event, and after that a message about the failed batch is placed into the Dead Letter Queue.

MaximumRetryAttempts: 5
MaximumRecordAgeInSeconds: 3600 
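To make the failure scenario concrete, here is a minimal sketch of a handler that always fails on the first stream record. It is written in Python purely for illustration; the class names mentioned above suggest the real function is a Java one, and `ProductUpdateError` and `handler` are hypothetical names, not taken from the original project.

```python
# Hypothetical stand-in for the failing UpdateProduct function (illustration only).
class ProductUpdateError(Exception):
    """Raised on purpose to simulate a processing failure."""


def handler(event, context):
    # A DynamoDB Streams event delivers a batch of records under "Records".
    for record in event.get("Records", []):
        # With StreamViewType NEW_IMAGE, the item as it looks after the write
        # is available under dynamodb.NewImage.
        new_image = record.get("dynamodb", {}).get("NewImage", {})
        pk = new_image.get("PK", {}).get("S", "<unknown>")
        # Always fail: an unhandled exception marks the whole batch as failed,
        # so Lambda retries it according to the event source mapping settings.
        raise ProductUpdateError(f"simulated failure for product {pk}")
    # Only reached when the batch is empty.
    return {"processed": 0}
```

Because the exception is never caught, every batch fails, which is what drives both the Errors and the IteratorAge metrics up.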

We can reproduce the failure with curl or the hey tool, so that we get many failed invocations of the UpdateProduct Lambda function.

hey -q 1 -z 15m -c 1 -m PUT -d '{"id": 1, "name": "Print 10x13", "price": 0.15}' -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products

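If hey is not at hand, roughly the same load (one worker, one PUT request per second for 15 minutes) can be sketched with the Python standard library. The helper names `build_request`, `run_load` and `main` are made up for this sketch, the URL and API key are the placeholders from the command above, and the `Content-Type` header is an addition that the hey command does not set.

```python
import json
import time
import urllib.error
import urllib.request


def build_request(url, api_key, product):
    """Build the PUT request that the hey command above sends."""
    body = json.dumps(product).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        # Content-Type is an assumption of this sketch, not part of the hey call.
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


def run_load(send, duration_s, rate_per_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Call send() from a single worker at a fixed rate until duration_s has elapsed."""
    interval = 1.0 / rate_per_s
    deadline = clock() + duration_s
    sent = 0
    while clock() < deadline:
        send()
        sent += 1
        sleep(interval)
    return sent


def main():
    # Placeholders copied from the command above; replace them with real values.
    url = "https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products"
    request = build_request(url, "XXXa6XXXX", {"id": 1, "name": "Print 10x13", "price": 0.15})

    def send():
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                response.read()
        except urllib.error.URLError as exc:
            # Keep going on HTTP or network errors, just report them.
            print(f"request failed: {exc}")

    run_load(send, duration_s=15 * 60, rate_per_s=1.0)
```

Calling `main()` starts the 15-minute run. The clock and sleep functions are parameters only so that the loop can be tried out without waiting.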

I wanted to figure out whether DevOps Guru would detect this incident and what information it would give us.

First of all, the incident was recognized by DevOps Guru:

Image description

with the following anomalous metrics "Errors Sum" and "IteratorAge Maximum" on the UpdateProduct Lambda function:

Image description

and the following graphed anomalies:

Image description

Interestingly, compare these anomalous metrics with the ones we reproduced for Kinesis Data Streams (which work similarly to DynamoDB Streams) in the article Amazon DevOps Guru for the Serverless applications - Part 6 Continuing with anomaly detection on Lambda invocations. There we saw additional anomalous metrics on the Kinesis Data Streams like "GetRecords.Byte Sum" and "GetRecords.Records Maximum", which both indicate that Kinesis Data Streams records have remained unprocessed for a long period of time.

CloudWatch also showed me that "GetRecords.Byte Sum" and "GetRecords.Records Maximum" increased on DynamoDB Streams during the incident, but they were not listed among the anomalous metrics. Generally, that's not wrong, as the error is in the Lambda function itself and not in DynamoDB Streams. The value of the "IteratorAge Maximum" anomalous metric increases when the Lambda function can't efficiently process the data that's written to the Kinesis/DynamoDB streams that invoke the function. So there is still enough information in place to investigate the incident and understand which AWS Serverless services are involved in it, but DevOps Guru behaves a bit differently for incidents with Kinesis Data Streams than for incidents with DynamoDB Streams.
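To look at the IteratorAge values behind such an insight directly in CloudWatch, a query along the following lines could be used. This is a sketch under assumptions: `iterator_age_query` and `fetch_iterator_age` are hypothetical helper names, the 30-minute window and 60-second period are arbitrary choices, and the actual call needs boto3 plus AWS credentials.

```python
from datetime import datetime, timedelta, timezone


def iterator_age_query(function_name, end_time, minutes=30, period_s=60):
    """Build GetMetricStatistics arguments for the Lambda IteratorAge metric."""
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": "IteratorAge",  # reported in milliseconds
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": end_time - timedelta(minutes=minutes),
        "EndTime": end_time,
        "Period": period_s,
        "Statistics": ["Maximum"],
    }


def fetch_iterator_age(function_name):
    """Fetch the datapoints with boto3, oldest first."""
    import boto3  # imported lazily so the query builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    query = iterator_age_query(function_name, datetime.now(timezone.utc))
    datapoints = cloudwatch.get_metric_statistics(**query)["Datapoints"]
    return sorted(datapoints, key=lambda point: point["Timestamp"])
```

With the template above, `fetch_iterator_age("UpdatedProduct")` would return the per-minute maximum iterator age for the last 30 minutes.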

I reproduced this type of anomaly several times and DevOps Guru occasionally created another type of insight for the same anomaly instead:

Image description

pointing to other anomalous metrics, "NumberOfMessagesSent Sum" and "ApproximateAgeOfOldestMessage Maximum", on the Dead Letter Queue used as the Lambda failure destination, but surprisingly with no anomalous metrics listed from the UpdateProduct Lambda function:

Image description

and the following graphed anomalies:

Image description
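When investigating such an insight, it helps to look at what actually landed in the Dead Letter Queue. For stream sources the message describes the failed batch rather than containing the records themselves. The sketch below summarizes such a message; the field names (`requestContext`, `DDBStreamBatchInfo` and so on) reflect my understanding of the Lambda on-failure record format and should be verified against a real message from your queue, and `summarize_failure` is a hypothetical helper.

```python
import json


def summarize_failure(message_body):
    """Extract the key fields from an on-failure record for a stream batch."""
    record = json.loads(message_body)
    # Field names below are assumptions; missing fields simply yield None.
    context = record.get("requestContext", {})
    batch = record.get("DDBStreamBatchInfo", {})
    return {
        "condition": context.get("condition"),
        "attempts": context.get("approximateInvokeCount"),
        "shard_id": batch.get("shardId"),
        "first_sequence": batch.get("startSequenceNumber"),
        "last_sequence": batch.get("endSequenceNumber"),
        "batch_size": batch.get("batchSize"),
    }
```

The shard ID and sequence numbers can then be used to read the affected records from the stream again, as long as they have not expired.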

Conclusion

In this article we explored whether DevOps Guru recognizes anomalies on a Lambda function consuming from DynamoDB Streams when this Lambda function runs into errors. The general answer was yes, but we experienced 2 different flavors of anomalous metrics: "Errors Sum" and "IteratorAge Maximum" on the Lambda function, and "NumberOfMessagesSent Sum" and "ApproximateAgeOfOldestMessage Maximum" on the Dead Letter Queue used as the Lambda failure destination. I'd personally expect all these anomalous metrics to be presented together in one DevOps Guru insight and not separated into different DevOps Guru insights.

I will approach the DevOps Guru team with my findings so that they can verify the experiment, look behind the scenes at what's happening, and hopefully improve the DevOps Guru service to handle this anomaly correctly as well.
