In this post, I want to share a story about how significant downtime could have impacted the application if I hadn't used a multi-region setup.
Over the last few years, I have gone multi-region for all my applications. I am currently leading the work on a new backend for a streaming application, built on AWS's serverless offering. In this article, I do not want to talk about why Serverless is better at scale than any cluster configuration, why Serverless makes it simple to have a multi-region infrastructure, or how its total cost of ownership is lower than that of a multi-region cluster infrastructure. This article focuses on something I discovered recently that could cause major downtime in production.
All my infrastructure is deployed using the AWS Serverless Application Model (AWS SAM). This tool consists of two parts:
- AWS SAM templates
- AWS SAM CLI
Without going into details, I enable versioning when I use Lambda. Each published version is immutable: its code and configuration do not change once it is published.
Example:
AWSTemplateFormatVersion: 2010-09-09
Transform:
  - AWS::Serverless-2016-10-31
Mappings:
  myURL:
    test:
      url: https://test.com
    stage:
      url: https://stage.com
    prod:
      url: https://prod.com
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: ../src/handlers/something
      # .....
      AutoPublishAlias: live
      Environment:
        Variables:
          SOMETHING: !FindInMap [myURL, !Ref StageName, 'url']
      # .....
AWS THEORY
Lambda creates a new version of your function each time that you publish the function. The new version is a copy of the unpublished version of the function. The unpublished version is named $LATEST.
AutoPublishAliasAllProperties - Specifies when a new AWS::Lambda::Version is created. When true, a new Lambda version is created when any property in the Lambda function is modified. When false, a new Lambda version is created only when any of the following properties are modified:
- Environment, MemorySize, or SnapStart
- Any change that results in an update to the Code property, such as CodeDict, ImageUri, or InlineCode
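The rule above can be restated as a small decision function. This is only an illustrative model of the documented behaviour, not SAM's implementation; the name `publishesNewVersion` and the way changed properties are passed in are assumptions made for the example.

```typescript
// Illustrative model of the documented rule, not SAM's actual implementation.
// Properties that trigger a new version when AutoPublishAliasAllProperties is false.
const VERSION_TRIGGERS: readonly string[] = [
  "Environment",
  "MemorySize",
  "SnapStart",
  // Examples of changes that update the Code property, as listed above.
  "CodeDict",
  "ImageUri",
  "InlineCode",
];

// Hypothetical helper: given the names of the function properties that changed,
// decide whether a new AWS::Lambda::Version would be created.
function publishesNewVersion(
  changedProperties: string[],
  autoPublishAliasAllProperties: boolean
): boolean {
  if (changedProperties.length === 0) {
    return false; // nothing changed, nothing to publish
  }
  if (autoPublishAliasAllProperties) {
    return true; // any modified property creates a new version
  }
  return changedProperties.some((name) => VERSION_TRIGGERS.indexOf(name) !== -1);
}
```

In this model, a change to Timeout alone publishes nothing while the flag is false, and publishes a new version once it is true.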
PROBLEM
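Updating only the value of an environment variable (without any other changes in the template or code) can result in a successful deployment that is never picked up by the function behind the alias. This happens when the value comes from an intrinsic function, like the !FindInMap lookup in the example: the function property itself does not change in the template, so no new version is published. Because I deployed from CI and saw that everything had been deployed, I mistakenly thought it was all updated. Sadly, no new version exists, and the changes are only applied to $LATEST.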
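One way to catch this in CI is to compare the environment variables of $LATEST with those of the version behind the alias after each deployment. The following is a minimal sketch, assuming both variable maps have already been fetched (for example with the Lambda GetFunctionConfiguration API, once without a qualifier and once with the alias as qualifier). The function `findStaleVariables` is hypothetical.

```typescript
type EnvVars = Record<string, string>;

// Hypothetical CI check: returns the names of variables whose values differ
// between $LATEST and the version the alias points to.
function findStaleVariables(latest: EnvVars, aliased: EnvVars): string[] {
  const names = Object.keys(latest).concat(Object.keys(aliased));
  const stale: string[] = [];
  for (const name of names) {
    // A variable missing on one side compares as undefined and counts as different.
    if (latest[name] !== aliased[name] && stale.indexOf(name) === -1) {
      stale.push(name);
    }
  }
  return stale.sort();
}
```

A non-empty result means the deployment succeeded but the alias is still serving the old configuration, which is a good reason to fail the pipeline.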
SOLUTION
I think there is only one real solution. I ruled out the following workarounds:
- Forcing a new deployment by making a small, inconsequential change to the Lambda function code (why would anyone do that?)
- Adding a parameter to the SAM template that changes with each deployment (it does not work)
- Explicitly publishing a new version in the deployment pipeline after the SAM deployment, using the AWS CLI or SDK to do this programmatically (too much effort)
The correct solution is:
- AWS::LanguageExtensions transform
I did not know about it, and neither did many of the people I asked, but apparently AWS::LanguageExtensions was announced back in 2022.
It MUST be listed before the SAM transform:
AWSTemplateFormatVersion: 2010-09-09
Transform:
  - AWS::LanguageExtensions
  - AWS::Serverless-2016-10-31
With LanguageExtensions, AWS fixed the previous limitation of being unable to dynamically resolve intrinsic functions or parameter references.
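To see why the order matters, here is a simplified model. It assumes that the decision to publish a new version is based on a fingerprint of the function's properties as the SAM transform sees them. The real fingerprinting is internal to SAM, and the names `resolveFindInMap` and `fingerprint`, as well as the new URL value, are made up for this sketch. The stage is fixed to prod instead of going through !Ref StageName.

```typescript
type Mappings = Record<string, Record<string, Record<string, string>>>;
type FindInMap = { "Fn::FindInMap": [string, string, string] };

// Simplified stand-in for resolving the intrinsic function before the SAM transform runs.
function resolveFindInMap(ref: FindInMap, mappings: Mappings): string {
  const [mapName, topKey, secondKey] = ref["Fn::FindInMap"];
  return mappings[mapName][topKey][secondKey];
}

// Simplified stand-in for the fingerprint used to decide whether to publish a version.
function fingerprint(value: unknown): string {
  return JSON.stringify(value);
}

// The Environment value as written in the template: identical in both deployments.
const findInMapRef: FindInMap = { "Fn::FindInMap": ["myURL", "prod", "url"] };
const oldMappings: Mappings = { myURL: { prod: { url: "https://prod.com" } } };
const newMappings: Mappings = { myURL: { prod: { url: "https://new.prod.com" } } };

// Unresolved: only the mapping changed, so the fingerprint stays the same.
const unresolvedChanged = fingerprint(findInMapRef) !== fingerprint(findInMapRef);

// Resolved first: the fingerprint now reflects the new value.
const resolvedChanged =
  fingerprint(resolveFindInMap(findInMapRef, oldMappings)) !==
  fingerprint(resolveFindInMap(findInMapRef, newMappings));
```

In this model, `unresolvedChanged` stays false while `resolvedChanged` is true, which mirrors the behaviour seen with and without the extra transform.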
SO WHAT?
Adding LanguageExtensions requires the STACK to be removed.
Removing a stack, especially for the APIs of a streaming application with continuous 24/7 traffic, is a recipe for disaster, and the multi-region infrastructure saved me.
In my architecture, I have:
- CloudFront - for API acceleration, with or without caching
- Route 53 - as the router, to create an Active-Active or Active-Passive infrastructure
- API Gateway - as the regional front door. An ALB or Lambda function URLs can be used as well
- Lambda - as my serverless compute service
With this architecture, I can remove a stack in Region A without downtime because I only have to shift the traffic to Region B with Route 53. Once the stack is replaced, I repeat the operation for each region in my architecture until everything is replaced, and only then restore the original Route 53 configuration.
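The rollout described above can be sketched as a plan generator. This is a minimal sketch: the step wording, the region names in the usage note, and the function `planRollingReplacement` are hypothetical, and a real pipeline would replace each step with the corresponding Route 53 and deployment operations. It assumes at least two regions, so that one is always left to serve traffic.

```typescript
// Hypothetical helper: builds the ordered steps for replacing the stack
// region by region while the remaining regions keep serving traffic.
function planRollingReplacement(regions: string[]): string[] {
  if (regions.length < 2) {
    throw new Error("At least two regions are needed to replace a stack without downtime");
  }
  const steps: string[] = [];
  for (const region of regions) {
    const others = regions.filter((r) => r !== region);
    steps.push("Shift Route 53 traffic away from " + region + " to " + others.join(", "));
    steps.push("Remove and redeploy the stack in " + region);
  }
  // Only after every region has been replaced is the original routing restored.
  steps.push("Restore the original Route 53 configuration");
  return steps;
}
```

Calling it with two regions, for example `planRollingReplacement(["eu-west-1", "us-east-1"])`, yields five steps: two per region, followed by the final restore of the routing configuration.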
I wondered whether CDK has the same problem. The answer is YES if you are on a version that is a few years old, but the latest versions do not have it:
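// Assumes AWS CDK v2. This runs inside a Stack or Construct constructor,
// and `config` is the application's own settings object.
import * as path from "path";
import { RemovalPolicy } from "aws-cdk-lib";
import * as apigw from "aws-cdk-lib/aws-apigateway";
import * as lambda from "aws-cdk-lib/aws-lambda";

const handler = new lambda.Function(this, "handler", {
  code: lambda.Code.fromAsset(path.resolve(__dirname, "dist")),
  handler: `index.${config.api.handler}`,
  runtime: lambda.Runtime.NODEJS_20_X,
  architecture: lambda.Architecture.ARM_64,
  // Options applied to the version created through `currentVersion`.
  currentVersionOptions: {
    removalPolicy: RemovalPolicy.RETAIN,
    retryAttempts: 1,
  },
  environment: {
    TABLE_NAME: "xxxx",
  },
});

// `currentVersion` yields a new version whenever the function's code or configuration changes.
const alias = new lambda.Alias(this, "LambdaAlias", {
  aliasName: "live",
  version: handler.currentVersion,
});

// Route API traffic to the alias, not to $LATEST.
new apigw.LambdaRestApi(this, config.apiName, {
  handler: alias,
  description: config.apiDescription,
});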
CONCLUSION
Since we all rely on some SDK or tool to perform our tasks, similar issues can arise. If the infrastructure is critical, transitioning to a multi-region setup with a serverless architecture is quite simple. The pay-per-use model helps keep costs under control, allowing us to pay only for active usage. For instance, an Active-Passive design can help mitigate similar problems without the burden of paying for unnecessary infrastructure.