API resilience is the ability of an API to fail fast or to keep functioning when faced with errors, high traffic, or partial system failures. This involves implementing common API resilience design patterns such as retries, timeouts, circuit breakers, failover, and fallbacks. A fallback at the API Gateway is a plan B for an API: when the primary API service fails, the API Gateway can redirect traffic to a secondary service or return a predefined response. In this article, we will explore how to implement fallbacks efficiently using the APISIX API Gateway and review the challenges that come with fallback techniques.
Implementing Fallbacks with APISIX
To implement a fallback mechanism with APISIX, you can use its built-in upstream priorities feature, or use the response-rewrite plugin to return a predefined response when a service call fails. Here’s a step-by-step example of how to set up both fallback methods.
Prerequisites
Before you start, make sure you have the following:
- A basic understanding of APISIX. Familiarity with API gateways and their key concepts, such as routes, upstreams, the Admin API, plugins, and the HTTP protocol, will also be beneficial.
- Docker, which is used to run the containerized etcd and APISIX.
- cURL, which is used to send requests to the services for validation.
Start the APISIX Docker Project
This project uses the pre-defined Docker Compose configuration file to set up, deploy, and run APISIX, etcd, Prometheus, and other services with a single command. First, clone the apisix-docker repo on GitHub, open it in your favorite editor, navigate to the example folder, and start the project by running the docker compose up command in a terminal from that folder.
When you start the project, Docker downloads any images it needs to run. It also runs two example backend services, web1 and web2. You can see the full list of services in the docker-compose.yaml file.
Fallback with APISIX Upstream Priorities Enabled
You can assign a priority to each upstream node. When the node with the higher priority fails, the API Gateway redirects traffic to a node with a lower priority. The default priority for all nodes is 0, and nodes with a negative priority can be configured as backups.
Create a route to the two services and configure the priority attribute for each upstream service node:
curl "http://127.0.0.1:9180/apisix/admin/routes" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
"id":"backend-service-route",
"methods":[
"GET"
],
"uri":"/get",
"upstream":{
"nodes":[
{
"host":"web1",
"port":80,
"weight":1,
"priority":0
},
{
"host":"web2",
"port":80,
"weight":1,
"priority":-1
}
]
}
}'
- methods: This specifies the HTTP methods that this route will match. In this case, it's set to match GET requests.
- uri: This is the path that the route will match. So, any GET request to /get will be handled by this route.
- nodes: This is an array of backend servers. Each object in the array represents a server with its host, port, and weight. The weight is used for load balancing; in this case, both servers have a weight of 1, which would typically mean they'd share traffic equally.
- priority: This is an additional configuration for the two nodes (web1, web2). The priority field determines the order in which nodes are selected. A node with a lower priority value (here, -1 for web2) is used only if all nodes with a higher priority value (here, 0 for web1) are unavailable. As a result, web2 receives traffic only when web1 is down, despite the equal weights.
Verify that you only get a response from the web1 service when you send a request to the route:
curl "http://127.0.0.1:9080/get"
You should see a response similar to the following:
hello web1
This means the request was served by web1, because it is the higher-priority node and it is functioning. Now stop the web1 service container to verify that APISIX falls back to the web2 service.
docker container stop example-web1-1
Now if you send another request to the route, you will get a response from the fallback service we specified.
curl "http://127.0.0.1:9080/get"
hello web2
By default, APISIX waits up to 60 seconds for the primary node before giving up on it and falling back to the secondary node. You can change this time by setting the timeout attribute of the Upstream object. Fallbacks can also help during releases: if a new version of the API is buggy, you can use APISIX's traffic split feature to route traffic back to the old version, which is kept on standby. This fallback method works well with Upstream health checks too.
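The timeout and health check settings mentioned above could be applied to the existing route roughly as follows. This is a minimal sketch, assuming the route from the previous step exists and that the Admin API's PATCH method is used to merge the new attributes into it. The 3-second timeouts, the / probe path, and the check intervals are illustrative values, not recommendations.

```shell
# Assumed values: 3-second timeouts, "/" as the probe path, and the
# intervals below are illustrative only.
payload='{
  "upstream": {
    "timeout": { "connect": 3, "send": 3, "read": 3 },
    "checks": {
      "active": {
        "type": "http",
        "http_path": "/",
        "healthy": { "interval": 2, "successes": 1 },
        "unhealthy": { "interval": 1, "http_failures": 2, "tcp_failures": 2 }
      }
    }
  }
}'

# PATCH merges these attributes into the route created earlier.
curl -sS "http://127.0.0.1:9180/apisix/admin/routes/backend-service-route" \
  -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PATCH -d "$payload" \
  || echo "Admin API not reachable; is APISIX running?"
```

With a shorter timeout, a request that hits an unresponsive web1 should move on to web2 sooner, and the active health check lets APISIX stop sending traffic to a node it already knows is down.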
Fallback with APISIX Response Rewrite Plugin
The APISIX response-rewrite plugin allows you to modify the response status code, body, and headers before returning it to the client. This can be particularly useful for implementing a fallback mechanism by providing a default response when the upstream service fails.
If you followed the first approach, restart the web1 service container in Docker:
docker container start example-web1-1
To use the response-rewrite plugin for a fallback, you need to configure it in a route. Here's an example of how you might enable the plugin using a curl command:
curl "http://127.0.0.1:9180/apisix/admin/routes" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
"id":"backend-service-route",
"methods":[
"GET"
],
"uri":"/get",
"plugins":{
"response-rewrite":{
"status_code":200,
"body":"{\"message\":\"This is a fallback response when the service is unavailable\"}",
"vars":[
[
"status",
"==",
503
]
]
}
},
"upstream":{
"nodes":{
"web1:80":1
}
}
}'
In the above example, we defined a single backend service (web1:80) where the traffic should be directed when this route is matched. If the upstream service (web1:80) responds with a 503 Service Unavailable status, the response-rewrite plugin will modify the response to have a 200 OK status with a custom JSON body. This effectively creates a fallback response when the upstream service is not available.
- "vars": [["status", "==", 503]]: This condition tells the plugin to apply the rewrite only when the original status code of the response is 503 Service Unavailable.
Now, whenever web1 answers a request to the route with a 503, you should get the modified response instead:
curl "http://127.0.0.1:9080/get"
{"message":"This is a fallback response when the service is unavailable"}
Challenges of Implementing a Fallback Mechanism
Fallbacks are a critical component of a resilient system design. However, they can introduce more problems when they are implemented incorrectly. The challenges also differ between single-machine environments and distributed systems. Let’s review them with examples and learn how to avoid them using APISIX.
Difficulty in Testing Fallback Logic
It's hard to accurately simulate application failure conditions, such as a database failure, in a single-machine context. Testing fallback strategies in distributed systems gets even more complex due to the involvement of multiple machines and services, making it difficult to replicate all possible failure modes. For example, an API on a local server has a fallback to a cached response when the database is unreachable. Testing this scenario requires simulating database downtime, which might not be part of regular testing, so the fallback code ends up running untested under actual production load.
APISIX can be configured to route traffic to simulate various scenarios, including fallback conditions. This allows for more realistic testing of fallback logic under controlled conditions, ensuring that the fallback services can handle production traffic.
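One way to create such controlled conditions is the fault-injection plugin, which can abort a share of requests with a chosen status code. The sketch below is an assumption about how this could look, not a configuration from the setup above: the route ID fault-test-route, the /fault-test path, the response body, and the 50 percent ratio are all hypothetical.

```shell
# Hypothetical test route: the ID, path, body, and percentage are
# illustrative assumptions.
payload='{
  "id": "fault-test-route",
  "methods": ["GET"],
  "uri": "/fault-test",
  "plugins": {
    "fault-injection": {
      "abort": {
        "http_status": 503,
        "body": "injected failure",
        "percentage": 50
      }
    }
  },
  "upstream": {
    "type": "roundrobin",
    "nodes": { "web1:80": 1 }
  }
}'

# Create the test route; about half of its requests are aborted with a 503.
curl -sS "http://127.0.0.1:9180/apisix/admin/routes" \
  -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d "$payload" \
  || echo "Admin API not reachable; is APISIX running?"
```

Sending repeated requests to the test route should then show a mix of normal and 503 responses, which lets you observe how clients and downstream fallbacks react without stopping a container.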
Fallbacks Themselves Can Fail
If a fallback solution is not as resilient as expected, it might fail under the increased load that occurs when it is called into action, causing a cascading failure. Also, a fallback to a less efficient service can increase response times and load, potentially leading to a system-wide slowdown or outage. For example, an API might have a fallback to write logs to a local file system when a remote logging service is unavailable. This could lead to slower performance due to synchronous file I/O operations.
With APISIX, you can prioritize traffic to ensure that critical requests are processed first. This can prevent a fallback service from becoming overwhelmed and worsening the system's performance.
Fallbacks Have Operational Risks
Implementing a fallback might introduce new points of failure, such as a secondary database that isn't kept in sync with the primary, leading to data inconsistencies.
APISIX's observability features, like logging, metrics, and tracing, can monitor the health and performance of both primary and fallback services. This real-time monitoring can help identify and mitigate risks associated with fallback strategies.
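As a rough illustration of the metrics part, the prometheus plugin could be enabled on the route and its metrics inspected as sketched below. This assumes the route from the earlier steps exists and that the metrics endpoint is exposed at the default address 127.0.0.1:9091, which may differ in your deployment.

```shell
# Enable the prometheus plugin on the existing route (merged via PATCH).
payload='{ "plugins": { "prometheus": {} } }'

curl -sS "http://127.0.0.1:9180/apisix/admin/routes/backend-service-route" \
  -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PATCH -d "$payload" \
  || echo "Admin API not reachable; is APISIX running?"

# Assumed default export address; look for per-node status code counters.
curl -sS "http://127.0.0.1:9091/apisix/prometheus/metrics" | grep "apisix_http_status" \
  || echo "No apisix_http_status samples yet; send some requests to the route first."
```

The node label on these counters shows which upstream node served each request, so a rising count for the fallback node is a signal that the primary is failing.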
Fallbacks Have Latent and Amplified Bugs
Fallback code paths may contain dormant bugs that only surface under specific failure conditions. These conditions are rare and hard to predict, so the bugs might not be discovered for months or years. For example, a fallback mechanism in an API that switches to a different authentication method during an identity service outage might contain a bug that only appears when the fallback is triggered, which could be a rare event.
APISIX supports continuous A/B testing and canary releases, allowing teams to test fallback paths in production with a small percentage of traffic. This continuous exposure can help uncover latent bugs before they become critical.
Fallbacks Are Infrequently Exercised
Fallback mechanisms are infrequently used, so when they are triggered, they may not perform as expected due to a lack of regular testing and updates. For example, an API that serves geographic data might have a fallback to a static dataset when the dynamic data source is unavailable. If this fallback is rarely used, it might serve outdated information when activated because it's not regularly updated or tested.
APISIX allows for the configuration of dynamic routing and can be used to continuously redirect a portion of traffic to the fallback services. This ensures that the fallback paths are exercised regularly and remain ready for use.
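Redirecting a portion of traffic to the fallback service could be sketched with the traffic-split plugin as shown below. The route ID fallback-exercise-route, the /exercise path, the upstream name, and the 1:9 ratio are hypothetical choices for illustration; web1 and web2 are the example services used earlier.

```shell
# Hypothetical route: the ID, path, upstream name, and weights are
# illustrative assumptions.
payload='{
  "id": "fallback-exercise-route",
  "methods": ["GET"],
  "uri": "/exercise",
  "plugins": {
    "traffic-split": {
      "rules": [
        {
          "weighted_upstreams": [
            {
              "upstream": {
                "name": "fallback-upstream",
                "type": "roundrobin",
                "nodes": { "web2:80": 1 }
              },
              "weight": 1
            },
            { "weight": 9 }
          ]
        }
      ]
    }
  },
  "upstream": {
    "type": "roundrobin",
    "nodes": { "web1:80": 1 }
  }
}'

# The entry with only a weight refers to the route's own upstream (web1).
curl -sS "http://127.0.0.1:9180/apisix/admin/routes" \
  -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d "$payload" \
  || echo "Admin API not reachable; is APISIX running?"
```

With these weights, roughly one request in ten should be served by web2, so the fallback path sees real traffic even while the primary service is healthy.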
Conclusion
Fallbacks are a safety net for when things go wrong with APIs. By using APISIX's upstream configuration or response-rewrite plugin, developers can provide thoughtful, user-friendly responses that keep the system functional and maintain trust with users. The key is to anticipate potential points of failure and to design fallbacks that provide the best possible experience under the circumstances.
About the author
Visit my blog: www.iambobur.com