Tips to prevent a serverless wreck

Matt Lewis - Nov 15 '21 - Dev Community

Lately I've been involved in reviewing services developed using 'serverless' that had been struggling with performance issues. Rather than talk about all the benefits of serverless, this post looks at some top tips for keeping your service operational and production-ready, gained from analysing these real-life issues.

TIP 1: Focus on observability not logs

As an engineer, or someone responsible for monitoring a service, the easiest way to find and fix issues is through observability and serverless monitoring tools rather than by focusing on logs. Although AWS provide a service called AWS X-Ray for distributed tracing, there are better tools in the partner ecosystem such as those provided by Lumigo, Thundra, Instana and several others. For example, Lumigo provides some key benefits over X-Ray such as (thanks to Yan Cui for highlighting some of these):

  • auto-instrumentation
  • support for streams
  • dashboard with overview of environment - lambda invocations, most errors, cold starts
  • issues page showing recent failures including timeouts
  • captures request and response for every HTTP request
  • supports auto-scrubbing of sensitive data
  • supports searching for specific transactions
  • built in alerting

Although there is still value in logging when stepping through complex business logic, you will soon find that these observability tools are your first port of call for understanding the behaviour and current state of your service.

TIP 2: Alert from metrics

Historically, I'd been involved in writing systems where you would log a failure using an ERROR log level. These logs would be shipped, and another component would scrape the files, and alert based on log patterns. Unfortunately, it is still common to see this same approach replicated in a cloud environment.

My recommended approach today is to use metrics. For example, when your function has finished processing, Lambda sends metrics about the invocation to Amazon CloudWatch. You can set alarms to respond to changes in these metrics, and out of the box Lambda provides a number of Invocation, Performance and Concurrency metrics. You can also create your own custom metrics using the CloudWatch embedded metric format, taking into account your own specific context, which can be alerted on in the same way.
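As a minimal sketch of what a custom metric in the embedded metric format can look like, the function below builds the structured log line that CloudWatch extracts a metric from. The namespace `OrdersApp`, the `Service` dimension and the `PaymentFailures` metric are hypothetical names chosen for illustration.

```typescript
// Units used in this sketch; the embedded metric format supports more.
type Unit = "Count" | "Milliseconds";

// Builds one embedded metric format (EMF) log line as a JSON string.
function buildEmfLog(
  namespace: string,
  service: string,
  metricName: string,
  value: number,
  unit: Unit,
  timestamp: number = Date.now()
): string {
  return JSON.stringify({
    _aws: {
      Timestamp: timestamp, // milliseconds since the epoch
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [["Service"]], // refers to the "Service" key below
          Metrics: [{ Name: metricName, Unit: unit }],
        },
      ],
    },
    Service: service, // dimension value
    [metricName]: value, // metric value
  });
}

// Hypothetical usage: in a Lambda function this line is written to stdout
console.log(buildEmfLog("OrdersApp", "checkout", "PaymentFailures", 1, "Count"));
```

An alarm can then be configured on the resulting metric in the same way as on the built-in Lambda metrics.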

A starting point for what alerts to configure for common AWS services is provided in this post by Yan Cui.

Although it is possible to ingest metrics into external systems, this should not be done for the purpose of alerting. This is because there is often a delay with the asynchronous nature of ingesting metrics, by which time, you are already aware of the errors from disappointed customers.

TIP 3: Focus on async

The Serverless Application Lens, part of the AWS Well-Architected Framework, notes that:

you can no longer sustain complex workflows using synchronous transactions. The more service interaction you need the more you end up chaining calls that may end up increasing the risk on service stability as well as response time.

The first two fallacies of "the 8 fallacies of distributed computing" are that "the network is reliable" and "latency is zero". Chaining together synchronous HTTP calls leads to a brittle service that is prone to break. This led Tim Bray, a former distinguished engineer at AWS, to state that:

“if your application is cloud-native, or large-scale, or distributed, and doesn’t include a messaging component, that’s probably a bug.”

There are plenty of options available to decouple components in the AWS ecosystem. To get a better understanding of the differences between Events, Queues, Topics and Streams it's worth watching this TechTalk by Julian Wood.

Personally, I'm excited at the growing adoption of event-driven applications using an event bus provided by Amazon EventBridge. Rather than orchestrating calls between components, you can publish an event that represents a significant change in state, and other components can subscribe to these events and respond accordingly. A good place to start here is looking at the talks and blogs by Sheen Brisals on how Lego have adopted EventBridge.
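To make the idea of publishing a state change concrete, here is a small sketch of an event shaped as an entry for the EventBridge PutEvents API, together with an event pattern a rule could use to route it to a subscriber. The source, detail type, bus name and payload fields are all hypothetical, and the actual SDK call is left out so the sketch stays self-contained.

```typescript
// Shape of a single entry as accepted by the EventBridge PutEvents API.
interface EventEntry {
  Source: string;
  DetailType: string;
  Detail: string; // JSON-encoded payload
  EventBusName: string;
}

// Hypothetical domain event: every name below is illustrative only.
function orderPlacedEvent(orderId: string, totalPence: number): EventEntry {
  return {
    Source: "shop.orders",
    DetailType: "OrderPlaced",
    Detail: JSON.stringify({ orderId, totalPence }),
    EventBusName: "shop-bus",
  };
}

// An event pattern that a rule on the bus could use to select this event
const orderPlacedPattern = {
  source: ["shop.orders"],
  "detail-type": ["OrderPlaced"],
};

console.log(orderPlacedEvent("o-123", 4999), orderPlacedPattern);
```

The publisher does not know who consumes the event. New subscribers are added by creating rules rather than by changing the publishing code.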

TIP 4: Understand your timings

There is a fantastic article in the Amazon Builders' Library that talks about 'Timeouts, retries and backoff with jitter'.
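As a rough sketch of the technique named in that title, the code below retries a failing operation with capped exponential backoff and "full jitter", where each wait is a random duration between zero and the current backoff ceiling. The function names and default values are assumptions made for this example, not taken from the article.

```typescript
// Delay before the next retry: a random value in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Runs the operation up to maxAttempts times, sleeping with jitter between attempts.
async function retryWithJitter<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 3,
  baseMs: number = 100,
  capMs: number = 2000
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        const delay = backoffDelayMs(attempt, baseMs, capMs);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller
  throw lastError;
}
```

The jitter spreads retries from many clients over time so that they do not all hit a recovering dependency at the same instant.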

When you design a service, you need to be aware of the timeout settings and retry behaviour of all components in the flow of data from the frontend all the way to the backend. There is no point setting a Lambda timeout of 60 seconds if it is triggered by API Gateway with a default timeout of 29 seconds that cannot be increased.
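One way to reason about this is to write the timeouts down as a chain and check that no component can outlive the component that calls it. The sketch below does that for a hypothetical chain. The 29 and 60 second values come from the example above and the 120 second value is only a placeholder.

```typescript
interface Hop {
  name: string;
  timeoutMs: number;
}

// Returns a warning for every hop whose timeout is not shorter than its caller's.
// The chain is ordered from the frontend towards the backend.
function findTimeoutMismatches(chain: Hop[]): string[] {
  const warnings: string[] = [];
  for (let i = 1; i < chain.length; i++) {
    const caller = chain[i - 1];
    const callee = chain[i];
    if (callee.timeoutMs >= caller.timeoutMs) {
      warnings.push(
        `${callee.name} (${callee.timeoutMs} ms) can outlive its caller ${caller.name} (${caller.timeoutMs} ms)`
      );
    }
  }
  return warnings;
}

// Hypothetical chain for illustration
const warnings = findTimeoutMismatches([
  { name: "API Gateway", timeoutMs: 29_000 },
  { name: "Lambda", timeoutMs: 60_000 },
  { name: "Downstream call", timeoutMs: 120_000 },
]);
warnings.forEach((warning) => console.log(warning));
```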

Alongside the timeout within a service, there are also timeout settings within the SDK. The default behaviour for the AWS SDK is different based on the service being invoked and the specific driver being used as highlighted in this knowledge centre article.

AWS SDK Timeouts

In the cases I looked at, no socket timeout had been set whilst using the AWS SDK for JavaScript. This meant that when a network problem was encountered, the API call would hang for 120 seconds.
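Setting the SDK's own connection and socket timeouts is the direct fix. As a complementary, SDK-independent sketch, a deadline can also be placed around any promise-returning call at the application level. The helper name `withTimeout` is an assumption for this example. It stops the caller waiting but does not cancel the underlying request.

```typescript
// Rejects if the given promise has not settled within timeoutMs.
// The underlying work keeps running; only the caller stops waiting for it.
function withTimeout<T>(work: Promise<T>, timeoutMs: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      }
    );
  });
}
```

A call that would otherwise hang for two minutes can then fail after a deadline that fits inside the function's own timeout, leaving time to handle the error.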

TIP 5: Know your design patterns

Although you should focus on asynchronous communication, there are times when you have no choice but to make a synchronous call. This may be because you are reliant on an external interface. In that case, if the external API is failing, best practice is to respond to the calling client immediately by opening the circuit, whilst retrying the API in the background until it is healthy enough for the circuit to be closed again. This is the basis of the Circuit Breaker design pattern. Jeremy Daly has written up a number of Serverless Microservice Patterns for AWS.
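The following is a simplified, in-memory sketch of the Circuit Breaker pattern, not a production implementation. The class name, thresholds and injected clock are assumptions for illustration. In a Lambda-based service the state would usually need to live in a shared store, because each execution environment has its own memory.

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;
  private failureThreshold: number;
  private resetAfterMs: number;
  private now: () => number;

  constructor(failureThreshold: number, resetAfterMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetAfterMs = resetAfterMs;
    this.now = now;
  }

  getState(): BreakerState {
    return this.state;
  }

  async call<T>(operation: () => Promise<T>): Promise<T> {
    let trial = false;
    if (this.state === "OPEN") {
      if (this.now() - this.openedAt < this.resetAfterMs) {
        // Fail fast without touching the failing dependency
        throw new Error("Circuit open: failing fast");
      }
      // Wait period is over: let one trial call through
      trial = true;
      this.state = "HALF_OPEN";
    }
    try {
      const result = await operation();
      this.failures = 0;
      this.state = "CLOSED";
      return result;
    } catch (error) {
      this.failures += 1;
      if (trial || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

While the circuit is open, callers get an immediate error instead of waiting on a dependency that is known to be failing.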

TIP 6: Optimise your service

The bottom line is the more you understand both the programming language and the AWS services you are using, the better the chances of delivering a resilient, secure and highly available service.

At a high level, I will break down optimising your service into three main areas:

a) General Optimisations

A set of optimisations for working with AWS Lambda functions has emerged over time. AWS publish a set of best practices for working with AWS Lambda functions.

Having a good knowledge of the AWS Lambda lifecycle alongside associated configurations is essential. Some examples include:

  • Ensure HTTP Keep-Alive is enabled if using Node.js
  • Reduce package size to reduce cold start times using tools like webpack or esbuild
  • Use the AWS Lambda Power Tuning tool by Alex Casalboni to determine the right memory/power configuration and instruction set architecture
  • Use reserved concurrency to limit the blast radius of a function where appropriate. I have seen examples where leaving this unset during a batch load resulted in all of the concurrent executions available to an entire account being consumed
  • Initialise SDK clients and database connections outside of the function handler
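The last bullet can be sketched as follows. `FakeClient` is a hypothetical stand-in for an SDK client or database connection, used so that the example runs without any dependencies. The point is only where the client is constructed.

```typescript
// Hypothetical stand-in for an SDK client or database connection.
class FakeClient {
  static created = 0;

  constructor() {
    FakeClient.created += 1; // count constructions to show reuse
  }

  async getGreeting(name: string): Promise<string> {
    return `hello ${name}`;
  }
}

// Constructed once, when the execution environment initialises
const client = new FakeClient();

// The handler reuses the same client on every warm invocation
async function handler(event: { name: string }): Promise<string> {
  return client.getGreeting(event.name);
}
```

Calling `handler` repeatedly in the same execution environment constructs the client only once, so the setup cost is paid on the cold start rather than on every request.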

b) Reduce network calls

Each time a call is made over the network, additional latency and error handling is introduced. These calls should be reduced to the minimum possible, and two specific examples come to mind.

The first is where a call was made to DynamoDB to check a record doesn't exist, before a second call was made to put an Item. In this case, using a Condition Expression enabled this to be done in a single call.
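As an illustration of that single call, the sketch below builds the parameters for a conditional put. It assumes a table whose partition key attribute is named `pk`. The table name and item are hypothetical, and the call to the SDK itself is omitted to keep the example self-contained.

```typescript
interface PutParams {
  TableName: string;
  Item: Record<string, unknown>;
  ConditionExpression: string;
}

// Builds a put request that only succeeds if no item with the same key exists.
function putIfAbsent(
  tableName: string,
  item: { pk: string } & Record<string, unknown>
): PutParams {
  return {
    TableName: tableName,
    Item: item,
    // If an item with this key already exists, DynamoDB rejects the write
    // with a ConditionalCheckFailedException instead of overwriting it.
    ConditionExpression: "attribute_not_exists(pk)",
  };
}

// Hypothetical usage
console.log(putIfAbsent("orders", { pk: "ORDER#123", status: "NEW" }));
```

The existence check and the write happen atomically on the server, which also removes the race between the read and the put in the two-call version.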

The second involved multiple calls to DynamoDB to retrieve data one after the other. In this case, the challenge was down to poor schema design. The growing adoption of Single Table Design in DynamoDB is driven by the desire to reduce the number of calls needed.

If you plan on using DynamoDB in production, 'The DynamoDB Book' by Alex DeBrie is essential reading.

c) Parallelise where possible

Although we want to reduce the number of network calls, sometimes there is no other option. In these cases, you should look to run the calls in parallel. This reduces both the function duration and the latency experienced by the calling consumer. In asynchronous JavaScript, it is easy to run tasks in parallel and wait for all of them to complete using Promise.all.
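A small sketch of this with Promise.all is shown here. `fakeLookup` is a hypothetical stand-in for a network call that takes around 50 ms.

```typescript
// Hypothetical stand-in for a network call that resolves after delayMs.
function fakeLookup(id: string, delayMs: number): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve(`record-${id}`), delayMs));
}

async function loadInParallel(ids: string[]): Promise<string[]> {
  // All lookups start immediately, so the total time is roughly that of
  // the slowest call rather than the sum of all calls.
  return Promise.all(ids.map((id) => fakeLookup(id, 50)));
}

loadInParallel(["a", "b", "c"]).then((records) => console.log(records));
```

Promise.all keeps results in the same order as the inputs and rejects as soon as any one call fails. Promise.allSettled is an alternative when partial results are acceptable.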

NOTE: some optimisations are good to always do since they are essentially free to carry out. Others will take up engineering time to set up, run, analyse and refactor. It makes sense to focus on the most frequently invoked services or functions, or those that observability tools have highlighted as poorly performing. Always be wary of over-optimising without taking the engineering effort into account.

TIP 7: Serverless != Lambda

Finally, remember that AWS Lambda is not a silver bullet.

When AWS Lambda was first launched, the configuration options were limited, which made it a simple service to use. Now, with Lambda extensions, layers, up to 15 minute timeout, and up to 10 GB memory allocation, all kinds of use cases have been opened up. However, this often leads to the default choice being writing a new function. If you are writing an ETL batch job using AWS Lambda where you are having to chunk down files to process and allocate a 15 minute timeout and 10 GB of RAM, it is likely there are better "serverless" options out there such as AWS Glue.

One thing better than writing a function which you have to manage and maintain is not writing a function at all, and this way we are moving to an approach of "favouring configuration over code".

At the last AWS re:Invent, there was an announcement for AWS Glue Elastic Views which is currently in preview. This removes the need to write custom functions to combine and replicate data. More recent is the announcement that Step Functions now has direct integration for over 200 AWS Services. Again, this removes the need for writing custom code.
