How Ably delivers on message durability and QoS across a large-scale distributed system

Ably Blog - Sep 9 - Dev Community

Ably is a distributed pub/sub messaging platform that acts as the broker in realtime data streaming pipelines. Publishers send messages to Ably, and we deliver those messages to subscribers.

Beyond the core pub/sub capability, Ably has built highly reliable and dependable infrastructure to ensure message durability and QoS. Through this we are able to offer:

  • Highly available infrastructure to safeguard against data center outages: Ably boasts a geographically distributed infrastructure with redundant servers to ensure messages are never lost during outages.
  • Guaranteed message delivery: Ably prioritizes message delivery success to ensure messages reach their destination even if data centers experience problems.
  • Message integrity: Ably ensures messages are delivered in the correct order and without duplicates. Our delivery guarantee ensures subscribers receive messages “exactly once”.
  • Connection state and recovery: Ably offers seamless user experience through robust connection management. Additionally, Ably can persistently store message history for retrieval, further enhancing data reliability.

In this article, we cover the redundancy built into our server clusters and the design we have used to ensure message ordering and exactly-once delivery. We also explain how publishers and subscribers can recover gracefully following network disconnection.

Fault tolerance for data center outages

Ably is hosted on AWS EC2 infrastructure and uses its associated services. Our main realtime production cluster is deployed to seven regions across the US, Europe, and Asia. We select two servers to host each channel for fault tolerance so all message operations continue seamlessly with no loss of data in the event of an outage.


We also store each message to a channel in at least one other region. Should service in an entire region and its data centers be disrupted, such as in the case of the AWS US-EAST-1 outage in December 2021, there is no risk of message loss. Even if an instance or data center fails, Ably can provide 99.99999% message availability and survivability.

[Figure: fault tolerance of a stateful datastore]

When a publisher sends a message to the Ably platform, the message is stored and, only then, does the publisher receive an acknowledgment to indicate publication. This prevents the false-positive situation where a publisher believes a message to have been published, but it is lost before it is recorded.

If a problem arises in a data center, and it is impossible for Ably to store a message in those multiple locations, then the message will be rejected via an exception or failure callback to its publisher. This way, the publisher knows that the message failed to publish, and can make a subsequent attempt.
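The store-before-acknowledge flow can be sketched as follows. This is an illustrative model, not Ably's implementation: the `Broker` class, region names, and the `min_copies` parameter are all assumptions, with in-memory dicts standing in for regional datastores.

```python
class RegionUnavailable(Exception):
    """Raised when a message cannot be stored durably enough to acknowledge."""
    pass

class Broker:
    def __init__(self, regions):
        # Each region is a dict acting as a stand-in for a regional datastore.
        self.regions = {name: {} for name in regions}
        self.down = set()  # regions currently failing

    def publish(self, msg_id, payload, min_copies=2):
        # Store the message in multiple regions BEFORE acknowledging.
        stored = 0
        for name, store in self.regions.items():
            if name in self.down:
                continue
            store[msg_id] = payload
            stored += 1
        if stored < min_copies:
            # Not durable enough: reject so the publisher knows to retry.
            raise RegionUnavailable(f"only {stored} copies stored")
        return "ack"  # the acknowledgment implies durability
```

A publisher would wrap `publish` in a retry loop: an exception means the message was never acknowledged, so retrying cannot create a false positive.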

The next attempt would be successful because Ably has implemented various strategies to provide message reliability and work around faulty data centers. Let's look at how these work in practice.

Guaranteed message delivery

When a client connects to Ably, it is automatically routed to the closest data center through our latency-based DNS routing. When a client queries our DNS servers, they determine the client's location and respond with one or more IP addresses from the closest data center (that is, the one with the lowest latency).

If a data center is unhealthy, Ably automatically stops routing traffic to it at the DNS level and ensures that any DNS requests are directed to the next closest healthy data center. All DNS records have a TTL of 60 seconds, so when a data center becomes unavailable, its traffic is rerouted to one of our other data centers within a minute.

Ably’s client libraries provide an additional level of DNS redundancy. If our DNS latency-based routing or the entire ably.io domain becomes unavailable, the client libraries connect to a data center from a backup domain.
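The routing behavior described above can be modeled as a simple selection function. This is a hedged sketch: the latency figures, hostnames, and the `pick_endpoint` helper are invented for illustration; in reality the primary selection happens inside DNS, and only the fallback logic lives in the client libraries.

```python
def pick_endpoint(latencies_ms, healthy, fallbacks):
    """Return the healthy primary endpoint with the lowest latency,
    or the first fallback host if no primary endpoint is healthy."""
    candidates = [(ms, host) for host, ms in latencies_ms.items() if host in healthy]
    if candidates:
        # Latency-based routing: lowest round-trip time wins.
        return min(candidates)[1]
    # Primary routing failed entirely: fall back to a backup domain.
    return fallbacks[0]
```

The key property is that the fallback list is served from an independent domain, so even a failure of the primary DNS zone leaves clients with a path to a healthy data center.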

All of this makes it possible to offer best-in-class reliability, with a 99.999% uptime SLA on the Ably realtime service. We have built the system to handle server hardware and software failures without any loss of data or service reliability, and to ensure that disruption to an AWS service has no impact on the Ably service.

Message integrity

The discussion above focuses on reliability, which means that a publisher can be sure that nothing they send through Ably is lost. Equally important is the integrity of the published messages because duplicated or out-of-order messages can have significant consequences. For example, chat messaging relies on a well-defined sequence of messages.

Ably ensures that messages are delivered to subscribers in the order they were published. This is achieved by assigning every message a unique, incrementing serial number. The serial number is based on a timestamp, and it's used to disambiguate messages published in the same millisecond.

This unique serial number is what makes it possible for Ably to offer an exactly-once guarantee. It means that when the publisher sends a message, we can perform an existence check for already-accepted messages with the same identifier and discard any duplicates.

[Figure: discarding duplicate messages based on message ID]

Having checked for duplicate messages, we then go on to the slightly more complex task of ensuring that subscribers receive messages precisely once. Let's look at how this is handled.

Suppose a subscriber client disconnects and later reconnects. In that case, the stream of messages from the publisher must resume precisely where it left off without repetition of any message or loss of others. The subscriber needs to indicate the last message received to resume receiving messages from the stream at the correct point.

[Figure: the serial number helps ensure message ordering and exactly-once delivery]

From an engineering perspective, numerical ordering simplifies the implementation of exactly-once semantics. When subscribers reconnect, they only need to send Ably the serial number of the last message they received. Based on that serial number, which specifies a position in an ordered sequence of messages, we can resume the stream, ensuring that the subscriber neither receives any message twice nor misses any. Furthermore, when the stream resumes, the subscriber receives all messages in the correct order.
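The resume step reduces to a single ordered filter. In this simplified sketch, serials are plain integers rather than Ably's timestamp-based serials, and `resume_from` is an invented helper name, but the logic is the same idea.

```python
def resume_from(stream, last_serial):
    """Return the messages strictly after last_serial, in order.

    stream is an ordered list of (serial, message) pairs; last_serial is
    the serial of the last message the subscriber acknowledged receiving.
    """
    return [msg for serial, msg in stream if serial > last_serial]
```

Because the comparison is strict, the last-received message is never redelivered, and because the stream is ordered, nothing after it can be skipped.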

| Related article: Achieving exactly-once delivery with Ably |

Connection state and recovery

Achieving message integrity is relatively easy over reliable networks - but what happens when a network failure causes a client to disconnect?

It is inevitable for a user to lose connection every so often, such as when they are using a phone and move into an area with no coverage, or if their computer goes offline because of a network issue. When a connection drops, the apps and services a user depends upon should gracefully handle the outage. And when the connection resumes, the user should be able to pick up where they left off, as the apps they are using recover smoothly.

Ably has designed robust connection management into its client libraries. Once a client has connected to Ably, the server sends a WebSocket ping frame every 15 seconds as a keep-alive. This ‘heartbeat’ allows both ends of the connection to detect changes in connection state.

An Ably client library cannot rely on the WebSocket API to propagate transport-layer information when the socket closes. Instead, the client library explicitly monitors the connection state by watching for ping frames: if they stop arriving, the socket has closed. Because the Ably client library has complete information about the WebSocket connection, any app that uses it can respond to changes in connection state accordingly.

The client library responds to the ping frames, so the server also detects a change to the network state if the heartbeat of responses ceases. Ably’s client libraries attempt to reconnect every 15 seconds after a disconnection, and whenever the OS notifies the client library that the internet connection has resumed. If a client library remains disconnected for under 2 minutes, it can retrieve any messages missed during the disconnection once the connection is re-established.
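The liveness check itself can be expressed as a tiny predicate. The 15-second interval matches the article; the grace period and the `connection_alive` helper are assumptions added for the sketch.

```python
HEARTBEAT_INTERVAL = 15  # seconds between server ping frames (per the article)
GRACE = 5                # assumed allowance for network jitter

def connection_alive(last_ping_at, now):
    """True while ping frames keep arriving on schedule.

    Both arguments are timestamps in seconds; if the gap since the last
    ping exceeds the interval plus a grace period, treat the socket as dead.
    """
    return (now - last_ping_at) <= HEARTBEAT_INTERVAL + GRACE
```

Either side of the connection can run this check, which is why both the client library and the server detect a dead transport without waiting for the OS to report a closed socket.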

Once the client has been disconnected for more than 2 minutes, the client library moves into the suspended state to indicate that all connection state has been lost, and any channels are automatically suspended. Once the connection is re-established, the client library automatically reattaches the suspended channels and emits an attached event indicating that the disconnection lasted longer than two minutes. Apps can then let the user know that they were disconnected, and manage expectations accordingly.
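The two reconnection outcomes described above can be modeled as a small state machine. This is a simplified sketch, not the client libraries' actual implementation: the `Connection` class and return values are illustrative, and only the 2-minute threshold comes from the article.

```python
SUSPEND_AFTER = 120  # seconds of disconnection before connection state is lost

class Connection:
    def __init__(self):
        self.state = "connected"
        self.disconnected_at = None

    def on_disconnect(self, now):
        self.state = "disconnected"
        self.disconnected_at = now

    def on_reconnect(self, now):
        gap = now - self.disconnected_at
        self.state = "connected"
        if gap <= SUSPEND_AFTER:
            return "resume"    # server replays the messages missed in the gap
        return "attached"      # state was lost; channels reattach fresh
```

An app can key its UX off the return value: a `"resume"` needs no user-visible action, while an `"attached"` after a long gap is the cue to tell the user they were offline.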

Ably also offers message history persistence for apps that need to avoid message loss. If this is enabled, messages are stored on disk for 24–72 hours and can be retrieved using the History API. To ensure persistence succeeds, messages are stored in three regions.

This historical data is also used by Ably’s Rewind feature so that a client can retrieve messages sent prior to its connection point. For example, when a user first joins a room, they can load the chat stream and see messages exchanged by members of a room prior to the point that they entered.
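The rewind behavior boils down to serving a suffix of the channel's message log. In this toy sketch, an in-memory list stands in for Ably's persisted, multi-region store, and the `Channel`/`rewind` names are invented for illustration.

```python
class Channel:
    def __init__(self):
        self.log = []  # stand-in for the persisted message history

    def publish(self, msg):
        self.log.append(msg)

    def rewind(self, n):
        """What a client attaching with rewind=n sees: the last n messages
        published before it connected, in their original order."""
        return self.log[-n:] if n > 0 else []
```

This is how a user joining a chat room can immediately see the recent conversation: the client attaches with a rewind parameter and receives the tail of the history before live messages begin.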

Looking for a reliable pub/sub solution?

Predictability and reliability are crucial to the user experience of any online app or service. The four pillars of Ably engineering enable us to guarantee that our customers can build and deliver a predictable experience to their users.

If you are looking for a reliable pub/sub solution that can operate at scale, why not try Ably for free? You can enjoy all platform capabilities and:

  • 6M monthly messages
  • 200 concurrent channels
  • 200 concurrent connections
  • Live chat & email support

Sign up for free
