Building a multi-region highly available identity provider with the AWS cloud and Ory Hydra

Derek Berger - Nov 7 '23 - Dev Community

AsurionID is an OpenID Connect (OIDC) compatible identity provider. It allows Asurion developers to integrate identity and access management into their applications using a standard protocol (OIDC) and open-source libraries. Our team worked from specific requirements, including a custom user experience and low cost, so we decided to build our own solution rather than adopt an off-the-shelf product. We built AsurionID on AWS using open-source Ory Hydra and custom microservices.

High availability using multi-AZ in a single region

As shown in the diagram below, in AsurionID's initial architecture, its microservices ran on Amazon Elastic Kubernetes Service (EKS) across three Availability Zones (AZs) in a single region. Amazon ElastiCache for Redis, used for storing temporary session data, was deployed in two AZs (primary in one AZ and replica in another). We used Amazon Aurora multi-AZ features to protect the database against AZ-level failures.

Figure: Multi-AZ high availability


This provided AsurionID with availability of up to three nines (99.9%) in a single region. As more and more applications adopted AsurionID for identity and access management, it became more critical to our business. We wanted to protect AsurionID against region-level service disruptions, which are less frequent than AZ-level failures but can have a greater impact. That's what led us to a multi-region architecture.

Designed for protection against regional service disruptions

In our latest architecture, all microservices now run in active-active mode, in two EKS clusters, across two AWS regions. With active-active, both regions' services are always live and taking traffic, and we use Route 53 weighted routing to distribute customer traffic between the two regions.
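
As a rough illustration of how weighted routing splits traffic, the sketch below models Route 53's documented rule that each record receives a share equal to its weight divided by the sum of the weights of the records in the group. The region labels and the 50/50 weights are assumptions made for the example, not AsurionID's actual configuration, and the model covers only the basic rules (healthy records, the all-unhealthy fallback, and all-zero weights).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightedRecord:
    """One weighted DNS record pointing at a regional load balancer."""
    region: str           # hypothetical region label
    weight: int           # Route 53 weights range from 0 to 255
    healthy: bool = True  # result of the record's associated health check


def traffic_shares(records):
    """Return the expected fraction of DNS answers for each region."""
    if not records:
        return {}
    eligible = [r for r in records if r.healthy]
    if not eligible:
        # When every record is unhealthy, Route 53 treats them all as healthy.
        eligible = list(records)
    total = sum(r.weight for r in eligible)
    shares = {r.region: 0.0 for r in records}
    for r in eligible:
        # If every eligible weight is 0, answers are spread equally.
        shares[r.region] = r.weight / total if total else 1 / len(eligible)
    return shares


if __name__ == "__main__":
    both_up = [WeightedRecord("region-a", 50), WeightedRecord("region-b", 50)]
    print(traffic_shares(both_up))

    one_down = [WeightedRecord("region-a", 50, healthy=False),
                WeightedRecord("region-b", 50)]
    print(traffic_shares(one_down))
```

With both records healthy, the two regions split traffic evenly. Once one record's health check fails, the remaining region receives all of it without any change to the records themselves.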

Figure: Multi-region, active-active microservices


We leverage Route 53 inverted health checks, following the Secondary Takes Over Primary (STOP) pattern, to handle failover if our microservices encounter a region-level disruption.

In our implementation of STOP, we associate the weighted DNS records with the inverted health checks, and those health checks with S3 objects. Because each health check is inverted, it passes while its object is absent and fails once the object exists. We invoke health check failure for a particular DNS record by uploading its associated object. The failing health check stops Route 53 from forwarding requests to the associated regional ALB.
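
The switch logic can be sketched in a few lines. This is a simplified in-memory model, not our production code: `FakeBucket` stands in for S3, and the object keys and region labels are hypothetical. It also ignores Route 53's fallback of treating all records as healthy when every check fails.

```python
class FakeBucket:
    """In-memory stand-in for an S3 bucket that holds the STOP switch objects."""

    def __init__(self):
        self._keys = set()

    def upload(self, key):
        self._keys.add(key)

    def delete(self, key):
        self._keys.discard(key)

    def exists(self, key):
        return key in self._keys


def inverted_check_is_healthy(bucket, key):
    """Model an inverted health check that probes for an object.

    The underlying probe succeeds when the object exists; inverting it
    means the check reports healthy only while the object is absent.
    """
    probe_succeeded = bucket.exists(key)
    return not probe_succeeded


def regions_in_rotation(bucket, switch_keys):
    """Return the regions whose DNS records still receive traffic.

    switch_keys maps a region label to the object key its health check probes.
    """
    return [region for region, key in switch_keys.items()
            if inverted_check_is_healthy(bucket, key)]


if __name__ == "__main__":
    bucket = FakeBucket()
    switch_keys = {"region-a": "stop/region-a", "region-b": "stop/region-b"}

    print(regions_in_rotation(bucket, switch_keys))  # both regions serve traffic
    bucket.upload("stop/region-a")                   # trip the switch for region-a
    print(regions_in_rotation(bucket, switch_keys))  # only region-b remains
    bucket.delete("stop/region-a")                   # fail back
    print(regions_in_rotation(bucket, switch_keys))
```

The point of the design is that failing over means writing an object rather than editing DNS records, so the records stay exactly as they were configured.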

Figure: STOP pattern for failing over microservices


With this approach, we have achieved static stability and independence from the Route 53 control plane for failing over our microservices, which has resulted in higher availability for AsurionID microservices, up to four nines (99.99%).
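
To put the two availability figures in perspective, the short calculation below converts an availability percentage into the downtime it permits over a year. It is plain arithmetic under the assumption of a 365-day year, not a measurement of AsurionID: three nines allows roughly 8.76 hours of downtime per year, while four nines allows roughly 52.6 minutes.

```python
def allowed_downtime_minutes(availability_percent, period_hours=365 * 24):
    """Return the downtime, in minutes, permitted by an availability target.

    period_hours defaults to one 365-day year.
    """
    unavailable_fraction = 1 - availability_percent / 100
    return period_hours * 60 * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes per year")
```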

We have taken a slightly different approach for the caching layer. Since we cache only ephemeral data like one-time passcodes (OTP), we don't replicate this data to the secondary region. Instead, we keep a second ElastiCache for Redis cluster always running in the secondary region. If an AWS regional service interruption impairs our caching layer, we would invoke failover using STOP, just as we do for our microservices.

Figure: Multi-region caching architecture


This new architecture has helped us achieve static stability and control plane independence for the caching layer as well as the application layer.

For the database, we use Aurora Global Database with a read replica in the secondary region.

Figure: Aurora Global Database


In case of a region-level Aurora impairment, we would promote the secondary region's instance to primary.

Future Enhancements

We now strive for the same static stability and control-plane independence in the database layer as we have for our microservices and caching layers. In our current database architecture, the promotion of the read replica triggers a Lambda that updates Route 53 CNAME values (a control plane function) to route all application traffic to the new primary database cluster. We are looking for new approaches to database failover that use data plane operations.
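
For readers unfamiliar with this kind of DNS update, here is a minimal sketch of what such a Lambda might do, assuming it uses the Route 53 `ChangeResourceRecordSets` API through boto3. The function names, record name, endpoint, hosted zone ID, and TTL are all hypothetical and are not taken from our implementation.

```python
def build_cname_upsert(record_name, new_target, ttl=60):
    """Build a ChangeBatch that repoints a CNAME at the promoted cluster."""
    return {
        "Comment": "Route application traffic to the promoted Aurora cluster",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": new_target}],
                },
            }
        ],
    }


def apply_change(hosted_zone_id, change_batch, client=None):
    """Submit the change to Route 53. This is a control plane call."""
    if client is None:
        # Imported lazily so the builder above can be used without AWS access.
        import boto3
        client = boto3.client("route53")
    return client.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=change_batch,
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    batch = build_cname_upsert(
        "db.example.internal.",
        "example-cluster.cluster-abc123.us-west-2.rds.amazonaws.com",
    )
    print(batch)
```

The `apply_change` step is the dependency in question: if the Route 53 control plane is impaired during a regional event, the record cannot be updated, which is why a data plane alternative is attractive.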

One potential option is Amazon Route 53 Application Recovery Controller (ARC). Route 53 ARC works with Route 53 health checks to enable failover using the data plane, with the extra capability of checking the standby database to ensure it is ready for failover. ARC can also fail over an entire application stack in one operation, so we could extend it to our cache and microservice layers.

Conclusion

In this article, we have walked you through how AsurionID started out with a multi-AZ approach to high availability and how we further improved availability with a multi-region architecture. Our architecture protects AsurionID against regional AWS service disruptions, achieves static stability, and uses data plane functions for failing over the microservices and caching layers.

While the primary goals of our multi-region architecture were improved availability and resiliency, the architecture has provided the team with even more benefits. We can now perform releases and infrastructure upgrades during business hours without impacting customers by routing traffic to one region while performing tasks in the other. The ability to perform critical operations during the day has improved the quality of life for the engineers. Of course, we could have realized these capabilities with a single-region architecture, but for us, they became additional benefits of a multi-region architecture.


Asurion is a leading tech care company that provides device protection, tech support, repair, and replacement services to 300 million customers worldwide. It partners with mobile carriers, retailers, and device manufacturers to deliver innovative solutions for smartphones, tablets, computers, and home appliances in over 20 countries. Asurion is headquartered in Nashville, TN.
