Intro
When building a software system, error handling is one of the most critical components, because “everything fails all the time”. While it's impossible to anticipate every failure scenario, having a plan in place for when something unexpected happens improves the robustness and resilience of the system.
This ‘plan’ can be a simple Dead Letter Queue (DLQ)!
Dead letter queues act as a safety net where you can keep messages when unexpected issues arise. This helps you isolate problematic messages, debug the issue without disrupting the rest of your workflow, and then retry or reprocess the messages as needed.
Many AWS Serverless services support SQS queues as the dead letter queue natively. However, Step Functions - one of the main Serverless services offered by AWS for workflow orchestration - does not support dead letter queues natively (yet).
In this blog post, I am going to discuss a couple of workarounds to safely capture, in a dead letter queue, any messages that a Step Functions execution failed to process.
Scenario
Let’s consider a very common use case where we have messages in an SQS source queue that need to be processed by Step Functions. The messages are first read by a Lambda function that starts a Step Functions execution for each message.
Here, once the Lambda function reads a message from the source queue and successfully starts the Step Functions execution, the message is marked as successfully processed and is deleted from the queue. If there are any errors later in the Step Functions execution, the message is lost.
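To make the problem concrete, here is a minimal sketch of such a Lambda function in Python. The names make_start_handler, sfn_client and state_machine_arn are my own for illustration and are not taken from any repository; sfn_client is assumed to be a boto3 Step Functions client (or anything with a compatible start_execution method).

```python
def make_start_handler(sfn_client, state_machine_arn):
    """Build a Lambda handler that starts one execution per SQS record.

    sfn_client is assumed to behave like boto3.client("stepfunctions").
    """

    def handler(event, context=None):
        started = []
        for record in event["Records"]:
            # The SQS message body is passed through unchanged as the execution input.
            response = sfn_client.start_execution(
                stateMachineArn=state_machine_arn,
                input=record["body"],
            )
            started.append(response["executionArn"])
        # Returning normally tells Lambda the whole batch succeeded, so the
        # messages are deleted from the queue even if an execution fails later.
        return {"started": started}

    return handler
```

In a deployed function you would create the client once with boto3.client("stepfunctions") and read the state machine ARN from configuration. The point of the sketch is the last comment: the handler only knows that the execution started, not how it ends.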
Solution 01
To retain the message even when the Step Functions execution fails, we can add a second SQS queue that acts as a dead letter queue, as follows.
The state machine will look like this:
How it works
Within the state machine, we can create a step named “Send message to DLQ”, which is an SDK integration for the SQS SendMessage action.
In this step, the message to be sent to the DLQ is built from the execution input, which is retrieved from the context object.
In the other steps of the state machine, where required, we can configure Catch in the error handling settings and use the “Send message to DLQ” step as the catcher.
This way, when an error happens in a state, we can send the message to the DLQ and re-process from there.
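As a rough illustration of the points above, the following Python sketch builds an Amazon States Language definition with one worker step and the “Send message to DLQ” catcher. The worker state name, its Lambda ARN and the function name build_dlq_definition are placeholders I made up; the definition in the sample repository may well look different.

```python
import json


def build_dlq_definition(dlq_url):
    """Return a state machine definition (ASL) as a Python dict."""
    return {
        "StartAt": "Process message",
        "States": {
            "Process message": {
                "Type": "Task",
                # Placeholder worker; replace with the real task resource.
                "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-message",
                # Any error in this state is routed to the DLQ step.
                "Catch": [
                    {"ErrorEquals": ["States.ALL"], "Next": "Send message to DLQ"}
                ],
                "End": True,
            },
            "Send message to DLQ": {
                "Type": "Task",
                # SDK integration for SQS SendMessage.
                "Resource": "arn:aws:states:::aws-sdk:sqs:sendMessage",
                "Parameters": {
                    "QueueUrl": dlq_url,
                    # $$ refers to the context object; the original execution
                    # input is serialized back into a string message body.
                    "MessageBody.$": "States.JsonToString($$.Execution.Input)",
                },
                "End": True,
            },
        },
    }


def to_asl_json(definition):
    """Serialize the definition to the JSON text Step Functions expects."""
    return json.dumps(definition, indent=2)
```

Note that in this sketch the execution ends successfully after the message is sent to the DLQ. If you want the execution itself to show as failed, you could route the DLQ step to a Fail state instead of ending there.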
Try this yourself
Here is a GitHub repository with a sample project I created to try out this solution.
https://github.com/pubudusj/dlq-for-stepfunctions
You can deploy this to your AWS account using CDK.
Once deployed, you can test the functionality by sending a message to the source SQS queue in the expected format below.
{
  "metadata": {},
  "data": {
    "foo": "bar"
  }
}
You will see that the message ends up in the DLQ and that its "metadata" object now includes an "attempt_number" as well.
As an alternative to using the metadata section of the message to hold the attempt number, you may use an SQS message attribute.
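To show what the two options could look like, here is a small Python sketch. It assumes the attempt number starts at 1 and grows by one each time the message is re-sent; the function names are hypothetical, and the sample project may compute the value differently (for example inside the state machine itself).

```python
import copy


def next_attempt(message):
    """Return a copy of the message with metadata.attempt_number incremented.

    Assumption: a message without an attempt number has not been tried yet.
    """
    updated = copy.deepcopy(message)
    metadata = updated.setdefault("metadata", {})
    metadata["attempt_number"] = int(metadata.get("attempt_number", 0)) + 1
    return updated


def attempt_attribute(attempt_number):
    """Express the attempt number as an SQS message attribute instead.

    The returned dict has the shape SQS expects for MessageAttributes.
    """
    return {
        "attempt_number": {
            "DataType": "Number",
            "StringValue": str(attempt_number),
        }
    }
```

Keeping the counter in a message attribute leaves the message body untouched, which can be convenient when the consumer validates the body against a fixed schema.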
Please note: In this approach, the DLQ is not a "real" DLQ, because it is not configured as the dead letter queue of the source SQS queue. However, it still captures any messages that the Step Functions execution failed to process.
Solution 02
In this method, we will use a real DLQ that is configured with the source queue.
The state machine will look like this:
How it works
There is a source SQS queue with a DLQ configured for it.
In the DLQ settings, the max receive count is set to 1, so after the first failed attempt the message is moved to the DLQ as soon as it becomes visible again, instead of being delivered a second time.
There is a Lambda function with a trigger set up to process messages from the source queue and start the Step Functions execution.
In the event source mapping settings of this Lambda function, report_batch_item_failures must be set to True.
First, when the message is processed by the lambda function, we set the visibility timeout of the message to a larger value. This must be larger than the time it takes to complete the Step Function execution.
Then the Step Functions execution is started. Here, we pass the SQS queue URL and the message receipt handle along with the original values from the SQS message.
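The Lambda side of this could be sketched in Python as follows. Everything named here is illustrative: the clients are assumed to be boto3 SQS and Step Functions clients, the input field names (queue_url, receipt_handle, message) are my own, and I am assuming the handler reports every message as a batch item failure so that Lambda leaves it on the queue for the state machine to delete later.

```python
import json


def make_tracking_handler(sqs_client, sfn_client, queue_url,
                          state_machine_arn, visibility_timeout=300):
    """Build a handler that hands each SQS message over to Step Functions.

    visibility_timeout (seconds) must exceed the longest expected execution.
    """

    def handler(event, context=None):
        failures = []
        for record in event["Records"]:
            # 1. Keep the message hidden while the execution is running.
            sqs_client.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=record["receiptHandle"],
                VisibilityTimeout=visibility_timeout,
            )
            # 2. Start the execution with everything it needs to delete
            #    or release the message later.
            execution_input = {
                "queue_url": queue_url,
                "receipt_handle": record["receiptHandle"],
                "message": json.loads(record["body"]),
            }
            sfn_client.start_execution(
                stateMachineArn=state_machine_arn,
                input=json.dumps(execution_input),
            )
            # 3. Report the item as failed so that Lambda does not delete it.
            failures.append({"itemIdentifier": record["messageId"]})
        # Partial batch response understood when report_batch_item_failures is on.
        return {"batchItemFailures": failures}

    return handler
```

The default of 300 seconds is only a placeholder; pick a value based on how long your executions actually run.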
In the example above, in order to determine if we need to simulate the failure, we use a simple choice state.
If it is a successful execution, we will call the step - Delete SQS Message. Here we use the SQS SDK integration to delete the message using the SQS queue url and the receipt handle values received in the input payload.
If it is a failure, we will call a step named - “Set message visibility timeout 5s”. Here we will use SQS SDK integration for the action: “changeMessageVisibility” to set the SQS message’s visibility to 5 seconds. For this SDK integration, we use the SQS queue url and the SQS receipt handle values passed in the execution input.
Once the visibility timeout is set to 5 seconds, the message becomes visible on the source queue again after 5 seconds. However, since the max receive count is set to 1 and the message has already been received once, it is moved to the DLQ of the source queue instead of being delivered again.
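The state machine described above could be expressed roughly like this, again as a Python dict that serializes to Amazon States Language. The choice state name, the Fail state and the input field names (queue_url, receipt_handle, message.failed) are assumptions made for this sketch, not the exact definition from the sample project.

```python
import json


def build_native_dlq_definition():
    """Return a state machine definition (ASL) as a Python dict."""
    # Both SQS steps address the message via values from the execution input.
    message_ref = {
        "QueueUrl.$": "$.queue_url",
        "ReceiptHandle.$": "$.receipt_handle",
    }
    return {
        "StartAt": "Should fail?",
        "States": {
            "Should fail?": {
                "Type": "Choice",
                "Choices": [
                    {
                        # IsPresent guards against inputs without the flag.
                        "And": [
                            {"Variable": "$.message.failed", "IsPresent": True},
                            {"Variable": "$.message.failed", "BooleanEquals": True},
                        ],
                        "Next": "Set message visibility timeout 5s",
                    }
                ],
                "Default": "Delete SQS Message",
            },
            "Delete SQS Message": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:sqs:deleteMessage",
                "Parameters": dict(message_ref),
                "End": True,
            },
            "Set message visibility timeout 5s": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:sqs:changeMessageVisibility",
                "Parameters": dict(message_ref, VisibilityTimeout=5),
                "Next": "Execution failed",
            },
            "Execution failed": {
                "Type": "Fail",
                "Error": "SimulatedFailure",
                "Cause": "The message was released back to the source queue.",
            },
        },
    }


def to_asl_json(definition):
    """Serialize the definition to the JSON text Step Functions expects."""
    return json.dumps(definition, indent=2)
```

In a real workflow the choice state would be replaced by your processing steps, with their Catch blocks pointing at the visibility timeout step.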
Try this yourself
I have another GitHub repository for you to try this in your own AWS environment. You can set it up using CDK/Python.
https://github.com/pubudusj/dlq-for-stepfunctions-2
To simulate a failure scenario, send a message to the source queue with "failed" set to true.
{
  "foo": "bar",
  "failed": true
}
This will make the Step Functions execution fail, and the message will be available in the DLQ of the source queue shortly afterwards, once the 5-second visibility timeout has expired.
With this approach, you can use the native DLQ functionality of SQS when a message cannot be processed by the Step Functions execution.
Summary
Step Functions is one of the most widely used AWS Serverless services. However, it doesn’t support dead letter queues (DLQs) natively yet. Still, there are workarounds to achieve this in a few simple steps. This blog post explained two such workarounds, which help you build a more resilient system.