Implement Fallback with API Gateway

Bobur Umurzokov - Nov 26 '23 - Dev Community

API resilience is the ability of an API to fail fast, or to keep functioning after a failure, when faced with errors, high traffic, or partial system failures. This involves implementing common API resilience design patterns such as retries, timeouts, circuit breakers, failover, and fallbacks. A fallback using an API Gateway is a plan B for an API: when the primary API service fails, the API Gateway can redirect traffic to a secondary service or return a predefined response. In this article, we will explore the challenges with existing fallback techniques and how to implement them efficiently using the APISIX API Gateway.

Fallback at the APISIX Gateway

Implementing Fallbacks with APISIX

To implement a fallback mechanism with APISIX, you can use its built-in upstream priorities feature, or use the response-rewrite plugin to return a predefined response when a service call fails. Here's a step-by-step example of how to set up both fallback methods.

Prerequisite(s)

This guide assumes the following:

  • A basic understanding of APISIX. Familiarity with API gateways and key concepts such as routes, upstreams, the Admin API, plugins, and the HTTP protocol will also be beneficial.
  • Docker, which is used to install the containerized etcd and APISIX.
  • cURL, which is used to send requests to the services for validation.

Start the APISIX Docker Project

This project leverages a pre-defined Docker Compose configuration file to set up, deploy, and run APISIX, etcd, Prometheus, and other services with a single command. First, clone the apisix-docker repo on GitHub, open it in your favorite editor, navigate to the example folder, and start the project by running the docker compose up command in a terminal, as sketched below.
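The following commands sketch those steps, assuming the repository is the apache/apisix-docker project on GitHub and that the Compose file lives in its example folder:

# Clone the APISIX Docker project (official apache/apisix-docker repository assumed)
git clone https://github.com/apache/apisix-docker.git
# The example folder contains the pre-defined docker-compose.yaml
cd apisix-docker/example
# Start APISIX, etcd, Prometheus, and the sample web1/web2 backends in the background
docker compose up -d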

When you start the project, Docker downloads any images it needs to run. It also runs two example backend services, web1 and web2. You can see the full list of services in the docker-compose.yaml file.

Fallback with APISIX Upstream Priorities Enabled

You can assign each upstream node a certain level of priority. When the node endpoint with the higher priority fails, the API Gateway can redirect traffic to a secondary node with a lower priority. The default priority for all nodes is 0; nodes with a negative priority can be configured as a backup.

Fallback with APISIX upstream priorities enabled

Create a route to the two services and configure the priority attribute for each upstream service node:

curl "http://127.0.0.1:9180/apisix/admin/routes" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
   "id":"backend-service-route",
   "methods":[
      "GET"
   ],
   "uri":"/get",
   "upstream":{
      "nodes":[
         {
            "host":"web1",
            "port":80,
            "weight":1,
            "priority":0
         },
         {
            "host":"web2",
            "port":80,
            "weight":1,
            "priority":-1
         }
      ]
   }
}'
  • methods: This specifies the HTTP method that this route will match. In this case, it's set to match GET requests.
  • uri: This is the path that the route will match. So, any GET request to /get will be handled by this route.
  • nodes: This is an array of backend servers. Each object in the array represents a server with its host, port, and weight. The weight is used for load balancing; in this case, both servers have a weight of 1, which would typically mean they'd share traffic equally.
  • priority: This is an additional configuration for the two nodes (web1, web2). The priority field determines the order in which nodes are selected. A node with a lower priority (a more negative number, such as -1) is used only if all nodes with a higher priority (such as the default 0) are unavailable.

Verify that you get a response only from the web1 service when you send a request to the route:

curl "http://127.0.0.1:9080/get"

You should see a response similar to the following:

hello web1

This means the request was served by web1 because it is up and running. Now stop the web1 service container to verify that APISIX falls back to the web2 service.

docker container stop example-web1-1

Now, if you send another request to the route, you will get a response from the fallback service we specified.

curl "http://127.0.0.1:9080/get"
hello web2

By default, APISIX waits up to 60 seconds for service one before it gives up and falls back to service two; you can change this by setting the timeout attribute of the Upstream object, as sketched below. Another fallback strategy applies during releases: if a new version of the API is buggy, you can route traffic to the old version, kept on standby, using APISIX's traffic-split feature, and fall back to the previous version if the new one has issues. This fallback method works well with Upstream health checks too.
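As a minimal sketch, here is how the timeout attribute could be added to the route created earlier. The call assumes the backend-service-route from the previous step still exists, and the one-second values are illustrative only:

# Illustrative only: shorten the upstream timeouts (in seconds) so the fallback
# to web2 is triggered sooner when web1 is unreachable.
curl "http://127.0.0.1:9180/apisix/admin/routes/backend-service-route" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PATCH -d '
{
   "upstream":{
      "timeout":{
         "connect":1,
         "send":1,
         "read":1
      },
      "nodes":[
         {
            "host":"web1",
            "port":80,
            "weight":1,
            "priority":0
         },
         {
            "host":"web2",
            "port":80,
            "weight":1,
            "priority":-1
         }
      ]
   }
}'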

Fallback with APISIX Response Rewrite Plugin

The APISIX response-rewrite plugin allows you to modify the response status code, body, and headers before returning it to the client. This can be particularly useful for implementing a fallback mechanism by providing a default response when the upstream service fails.

Fallback with APISIX response rewrite plugin

If you followed the first approach, rerun the web1 service container in Docker:

docker container start example-web1-1

To use the response-rewrite plugin for a fallback, you need to configure it in a route. Here's an example of how you might enable the plugin using a curl command:

curl "http://127.0.0.1:9180/apisix/admin/routes" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
   "id":"backend-service-route",
   "methods":[
      "GET"
   ],
   "uri":"/get",
   "plugins":{
      "response-rewrite":{
         "status_code":200,
         "body":"{\"message\":\"This is a fallback response when the service is unavailable\"}",
         "vars":[
            [
               "status",
               "==",
               503
            ]
         ]
      }
   },
   "upstream":{
      "nodes":{
         "web1:80":1
      }
   }
}'

In the above example, we defined a single backend service (web1:80) where the traffic should be directed when this route is matched. If the upstream service (web1:80) responds with a 503 Service Unavailable status, the response-rewrite plugin will modify the response to have a 200 OK status with a custom JSON body. This effectively creates a fallback response when the upstream service is not available.

"vars": [["status", "==", 503]]: This condition tells the plugin to apply the rewrite only when the original status code of the response is 503 Service Unavailable.

Now if you send a request to the route, you should get a modified response:

curl "http://127.0.0.1:9080/get"

{"message":"This is a fallback response when the service is unavailable"}

Challenges of Implementing a Fallback Mechanism

Fallbacks are a critical component of a resilient system design. However, they can introduce more problems when they are implemented incorrectly. When discussing fallback strategies, the challenges can differ between single-machine environments and distributed systems. Let's review these challenges with examples and learn how to avoid them using APISIX.

Difficulty in Testing Fallback Logic

It's hard to accurately simulate application failure conditions, such as a database failure, in a single-machine context. Testing fallback strategies in distributed systems gets even more complex due to the involvement of multiple machines and services, making it difficult to replicate all possible failure modes. For example, an API on a local server may fall back to a cached response when the database is unreachable. Testing this scenario requires simulating database downtime, which might not be part of regular testing, leaving the fallback code untested under actual production load.

APISIX can be configured to route traffic to simulate various scenarios, including fallback conditions. This allows for more realistic testing of fallback logic under controlled conditions, ensuring that the fallback services can handle production traffic.
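For instance, a hedged sketch of this idea is to put APISIX's fault-injection plugin on a dedicated test route so that a failure can be simulated on demand; the route id, the /simulate-failure path, and the 50 percent abort rate below are hypothetical test values, not part of the example project:

# Hypothetical test route: abort roughly half of the requests with a 503
# so that fallback behavior can be exercised without stopping any container.
curl "http://127.0.0.1:9180/apisix/admin/routes" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
   "id":"fallback-test-route",
   "uri":"/simulate-failure",
   "plugins":{
      "fault-injection":{
         "abort":{
            "http_status":503,
            "body":"simulated outage",
            "percentage":50
         }
      }
   },
   "upstream":{
      "nodes":{
         "web1:80":1
      }
   }
}'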

Fallbacks themselves can fail

If a fallback solution is not as resilient as expected, it might fail under the increased load that occurs when it is called into action, causing a cascading failure. Also, a fallback to a less efficient service can increase response times and load, potentially leading to a system-wide slowdown or outage. For example, an API might have a fallback to write logs to a local file system when a remote logging service is unavailable. This could lead to slower performance due to synchronous file I/O operations.

With APISIX, you can prioritize traffic to ensure that critical requests are processed first. This can prevent a fallback service from becoming overwhelmed and worsening the system's performance.
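One concrete way to protect a fallback service, shown as a rough sketch below, is to cap the request rate on the route that points at it using the limit-count plugin (rate limiting rather than strict prioritization); the route id, the /fallback-demo path, and the limit values are hypothetical:

# Hypothetical route: allow at most 100 requests per 60 seconds per client IP
# to the fallback service, rejecting the rest with 429 instead of overloading it.
curl "http://127.0.0.1:9180/apisix/admin/routes" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
   "id":"fallback-protected-route",
   "uri":"/fallback-demo",
   "plugins":{
      "limit-count":{
         "count":100,
         "time_window":60,
         "rejected_code":429
      }
   },
   "upstream":{
      "nodes":{
         "web2:80":1
      }
   }
}'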

Fallbacks have operational risks

Implementing a fallback might introduce new points of failure, such as a secondary database that isn't kept in sync with the primary, leading to data inconsistencies.

APISIX's observability features, like logging, metrics, and tracing, can monitor the health and performance of both primary and fallback services. This real-time monitoring can help identify and mitigate risks associated with fallback strategies.
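As a rough sketch, the prometheus plugin can be enabled on the route from the earlier example so that request counts and status codes for the primary and fallback paths appear in the metrics output; this assumes that route still exists and that the metrics exporter is reachable on port 9091, the APISIX default:

# Enable per-route Prometheus metrics (prefer_name reports the route name instead of its id)
curl "http://127.0.0.1:9180/apisix/admin/routes/backend-service-route" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PATCH -d '
{
   "plugins":{
      "prometheus":{
         "prefer_name":true
      }
   }
}'

# Scrape the metrics endpoint to inspect status-code counts per route
curl "http://127.0.0.1:9091/apisix/prometheus/metrics"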

Fallbacks have latent and amplified bugs

Fallback code paths may contain dormant bugs that only surface under specific failure conditions; because those conditions occur rarely and are hard to predict, the bugs might not be discovered for months or years. For example, a fallback mechanism in an API that switches to a different authentication method during an identity service outage might contain a bug that only appears when the fallback is triggered, which could be a rare event.

APISIX supports continuous A/B testing and canary releases, allowing teams to test fallback paths in production with a small percentage of traffic. This continuous exposure can help uncover latent bugs before they become critical.
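The following sketch uses the traffic-split plugin to send a small share of requests to a canary (or fallback) upstream while the rest go to the route's default upstream; the route id, the /canary-demo path, and the 5/95 weights are hypothetical:

# Hypothetical canary route: about 5% of requests go to web2, the rest to web1.
curl "http://127.0.0.1:9180/apisix/admin/routes" -H "X-API-KEY: edd1c9f034335f136f87ad84b625c8f1" -X PUT -d '
{
   "id":"canary-route",
   "uri":"/canary-demo",
   "plugins":{
      "traffic-split":{
         "rules":[
            {
               "weighted_upstreams":[
                  {
                     "upstream":{
                        "type":"roundrobin",
                        "nodes":{
                           "web2:80":1
                        }
                     },
                     "weight":5
                  },
                  {
                     "weight":95
                  }
               ]
            }
         ]
      }
   },
   "upstream":{
      "nodes":{
         "web1:80":1
      }
   }
}'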

Fallbacks are infrequently exercised

Fallback mechanisms are infrequently used, so when they are triggered, they may not perform as expected due to a lack of regular testing and updates. For example, an API that serves geographic data might have a fallback to a static dataset when the dynamic data source is unavailable. If this fallback is rarely used, it might serve outdated information when activated because it's not regularly updated or tested.

To address this, APISIX allows for the configuration of dynamic routing and can be used to periodically redirect a portion of traffic to the fallback services. This ensures that the fallback paths are exercised regularly and remain ready for use.

Conclusion

Fallbacks are a safety net for when things go wrong with APIs. By using APISIX's upstream configuration or response-rewrite plugin, developers can provide thoughtful, user-friendly responses that keep the system functional and maintain trust with users. The key is to anticipate potential points of failure and to design fallbacks that provide the best possible experience under the circumstances.

Community

🙋 Join the Apache APISIX Community

🐦 Follow us on Twitter

📝 Find us on Slack

💁 How to contribute page

About the author

Visit my blog: www.iambobur.com
