AWS Step Functions just announced enhanced error handling and a new retry mechanism for state machine executions that enable more fine-grained control over retry rules. This post also uses the enhanced Workflow Studio authoring experience to build the workflow.
You can read more about how error handling works on Step Functions elsewhere; in this blog, we will focus on the new parameters for error handling with catch and retries on Step Functions, namely MaxDelaySeconds and JitterStrategy.
Errors in Step Functions
During the execution of a state machine, the execution can be interrupted by various errors. For example, States.Timeout occurs when a task runs longer than its configured TimeoutSeconds, or when it fails to send a heartbeat within the configured HeartbeatSeconds.
Some of the possible errors are listed below.
The States.ALL wildcard is available on Step Functions to match all the errors encountered during the execution.
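To make the wildcard behaviour concrete, here is a minimal Python sketch of how a list of retriers could be matched against an error name. The function name find_retrier and the sample retriers are hypothetical, and the sketch deliberately ignores special cases in the real service; it only illustrates the idea that the first retrier whose ErrorEquals matches is the one that applies.

```python
from typing import Optional


def find_retrier(retriers: list, error_name: str) -> Optional[dict]:
    """Return the first retrier whose ErrorEquals matches error_name.

    Simplified assumption: "States.ALL" matches every error name.
    """
    for retrier in retriers:
        names = retrier.get("ErrorEquals", [])
        if error_name in names or "States.ALL" in names:
            return retrier
    return None


# Hypothetical retriers: a specific rule first, the wildcard last.
retriers = [
    {"ErrorEquals": ["States.Timeout"], "MaxAttempts": 1},
    {"ErrorEquals": ["States.ALL"], "MaxAttempts": 3},
]

print(find_retrier(retriers, "States.Timeout")["MaxAttempts"])           # 1
print(find_retrier(retriers, "Lambda.ServiceException")["MaxAttempts"])  # 3
```

Placing the wildcard retrier last lets more specific rules take precedence for the errors they name.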
Deploying a State Machine
Navigate to AWS Step Functions, choose Create state machine, and then select the Orchestrate Lambda Functions template in the new console experience.
Once the template is selected, Workflow Studio will give you more details of the template.
The Orchestrate Lambda Functions template showcases a stock buy/sell recommendation workflow based on the stock price. Choose the Run a demo option to deploy the state machine and its other resources, such as Lambda functions, SNS, and SQS, to your AWS account.
To test out the state machine, you can run a sample execution.
Updating the State Machine with error retries
Enabling retry
Let's update the Check Stock Price state, which invokes a Lambda function, with an error retry and a few retry options:
{
"Retry": [
{
"ErrorEquals": [
"States.ALL"
],
"BackoffRate": 2,
"IntervalSeconds": 1,
"MaxAttempts": 3,
"Comment": "Check Stock Price Lambda error",
"MaxDelaySeconds": 2
}
]
}
In this retry snippet, the States.ALL wildcard matches every error raised in this state and triggers a retry. The retrier sets a few other options:
IntervalSeconds is an integer that specifies the number of seconds before the first retry.
BackoffRate is the multiplier applied to the retry interval after each attempt, which determines when the next retry occurs.
MaxAttempts defines the maximum number of retries.
MaxDelaySeconds defines the maximum value, in seconds, that the retry interval can grow to.
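The way these options interact can be sketched in a few lines of Python. This is an illustration of exponential backoff with a cap, not the service's actual implementation; the function name retry_intervals is hypothetical. The values are waits between attempts, so they will not line up exactly with observed timings in an actual execution, which also include the time the task itself takes to run.

```python
def retry_intervals(interval_seconds, backoff_rate, max_attempts, max_delay_seconds=None):
    """Return the wait (in seconds) before each retry attempt.

    Each wait is the previous one multiplied by backoff_rate, starting at
    interval_seconds, and is capped at max_delay_seconds when that is set.
    """
    intervals = []
    for attempt in range(max_attempts):
        delay = interval_seconds * backoff_rate ** attempt
        if max_delay_seconds is not None:
            delay = min(delay, max_delay_seconds)
        intervals.append(delay)
    return intervals


# Values from the snippet above: IntervalSeconds=1, BackoffRate=2,
# MaxAttempts=3, MaxDelaySeconds=2.
print(retry_intervals(1, 2, 3, max_delay_seconds=2))  # [1, 2, 2]

# The same retrier without MaxDelaySeconds keeps doubling.
print(retry_intervals(1, 2, 3))  # [1, 2, 4]
```

Without the cap, a long retry sequence with a high BackoffRate can end up waiting far longer than intended, which is the case MaxDelaySeconds guards against.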
When the state machine is executed with an error retry in the first state, the retries occur as follows:
Retry | Retry attempt at (seconds) |
---|---|
1st retry | 1s (IntervalSeconds) |
2nd retry | 3s |
3rd retry | 6s |
Enabling retry with JitterStrategy
In the second state, Generate Buy/Sell recommendation, we enable retry for the States.ALL wildcard with the retry options below:
{
"Retry": [
{
"ErrorEquals": [
"States.ALL"
],
"JitterStrategy": "FULL",
"Comment": "Buy/Sell recommendation error",
"MaxAttempts": 5
}
]
}
Along with the options described earlier, this retrier sets one additional property: JitterStrategy with the value FULL. With JitterStrategy set to FULL, Step Functions randomizes the delay intervals so that retries are spread out rather than all firing at the same moment. This is especially powerful when the Lambda function is invoked concurrently.
When the error occurs in the Generate Buy/Sell recommendation state, the retries start as follows:
Retry | Retry attempt started after |
---|---|
1st retry | 00:00:01.298 |
2nd retry | 00:00:01.479 |
3rd retry | 00:00:01.752 |
4th retry | 00:00:02.986 |
5th retry | 00:00:04.213 |
Notice that the five retry attempts (the MaxAttempts value) start between 1s and 5s, at random intervals, unlike the previous case without JitterStrategy.
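For intuition, full jitter can be sketched as picking a random wait between zero and the backoff interval for each attempt. The sketch below assumes a uniform distribution and the defaults IntervalSeconds=1 and BackoffRate=2, since the snippet above does not set them; the function name full_jitter_delays is hypothetical, and the real service may randomize differently.

```python
import random


def full_jitter_delays(interval_seconds=1, backoff_rate=2.0, max_attempts=5, rng=None):
    """Return one randomized wait per retry attempt.

    Assumption: each wait is drawn uniformly from 0 up to the plain
    exponential backoff interval for that attempt.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        ceiling = interval_seconds * backoff_rate ** attempt
        delays.append(rng.uniform(0, ceiling))
    return delays


# A seeded generator makes the run repeatable for demonstration purposes.
delays = full_jitter_delays(rng=random.Random(42))
for number, delay in enumerate(delays, start=1):
    print(f"retry {number}: wait {delay:.3f}s")
```

Because each execution draws its own random waits, many concurrent executions that fail at the same moment no longer retry in lockstep.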
Enabling Fail flow
When a state machine fails, it reports an error and a cause, both of which can be customized in the Fail state. For this flow, we enable a custom error message, defined in the state machine, for the Choice state.
{
"Fail": {
"Type": "Fail",
"Error": "Stock prediction failed",
"Cause": "Unable to proceed as stock prediction to buy or sell stock failed"
}
}
When the Choice state falls through to its default state, the state machine execution fails.
When the error occurs, the Fail flow captures the error as Stock prediction failed with the cause Unable to proceed as stock prediction to buy or sell stock failed.
Keep a watch on retries
When working with state machines that have retries defined in many or all of their states, keep in mind that each state keeps retrying until it reaches MaxAttempts.
Q: It's good to retry multiple times
Well, not if the underlying compute or the state would just reproduce the same error. However, it is definitely good when retrying would result in success.
Q: Multiple retries shouldn't cause interruptions in my state machine execution
If the retries result in a success, they won't interrupt the state machine execution. But if the same error keeps occurring, it is better to fail the execution and pass the event to EventBridge or a DLQ.
Q: Would this be expensive?
In a Standard workflow, state machine execution is priced per state transition, and when retry is configured, each retry attempt counts as a state transition, so many retries can become expensive.
Also, when the retry attempt invokes another AWS resource, that execution is billed too. In this workflow, the Lambda function was invoked multiple times, which also adds to the total workflow cost.
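A rough back-of-the-envelope estimate can be sketched as below. The function names, the number of states visited, and the price per transition are all hypothetical placeholders; check the AWS pricing page for the actual rate in your region, and remember that the Lambda invocations are billed separately.

```python
def count_transitions(states_visited, retries_per_state):
    """Total billable transitions: one per state visited plus one per retry attempt."""
    return states_visited + sum(retries_per_state.values())


def estimate_cost(transitions, price_per_transition):
    """Estimated state-transition cost for a single execution."""
    return transitions * price_per_transition


# Hypothetical worst case: both retriers above exhaust their MaxAttempts.
retries = {"Check Stock Price": 3, "Generate Buy/Sell recommendation": 5}
transitions = count_transitions(states_visited=4, retries_per_state=retries)
print(transitions)  # 12

# Placeholder price, not an official figure.
PRICE_PER_TRANSITION = 0.000025
print(f"{estimate_cost(transitions, PRICE_PER_TRANSITION):.6f}")
```

In this made-up example, retries account for two thirds of the transitions, which is why it is worth keeping MaxAttempts modest on states that are unlikely to recover.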
Wrap up!
The retry mechanism helps when working with resource-based errors such as Lambda.ServiceException or any other AWS service error, depending on the task. Enabling JitterStrategy addresses the problem of concurrent retries.
Note that this example was meant to showcase the error retry and fail flow features.