In the last few months at Ably we’ve spoken with hundreds of candidates for our Lead Distributed Systems Engineer and Distributed Systems Engineering roles. We’ve been surprised by how varied each candidate’s knowledge has been. It got us wondering if the challenge in finding the right people is that there is no clear definition of what skills are required to excel in this role.
Give that we've been working on our distributed realtime messaging platform, global cloud network, and realtime APIs for over six years, we think we're qualified enough to take a stab at defining what a distributed systems engineer needs.
If you want to become a distributed systems engineer, believe you are one, or want to recruit one for your team then here’s Ably's opinionated guide on the concepts you should have a thorough understanding of.
The concepts a Distributed Systems Engineer needs to know
- Microservices or SoA is not a distributed system
- Understanding hash rings is a pre-requisite
- Gossip protocols and consensus algorithms underpin everything
- Eventually consistent data types and read/write consistencies
- Deep understanding of network protocols
Microservices or SoA is not a distributed system
Here's an example of a simplistic design of a service based architecture with horizontal scalability:
There’s not much “distributed” about this system. There are multiple hosts and network interconnections but they are tightly coupled. And their network interactions are reliable, have low-latency, and are predictable. Genuinely distributed, in our view, means:
- Systems where nodes are distributed globally
- Network interactions are unpredictable and can create partitions
- Nonetheless those nodes work together to create a predictable outcome
Distributed systems, at scale, involve state being distributed and re-balanced across the system, reacting as nodes are added and removed, and they do this in spite of the unpredictability that is inherent in a global system.
Understanding hash rings is a pre-requisite
If you think a hash ring has something to do with a criminal cannabis organization, then that’s certainly amusing, but unfortunately means you’re missing knowledge of a common pattern used for distributed systems.
If the above doesn’t look familiar, then I recommend you start by diving into how popular distributed systems work, all of which rely on the ideas behind a consistent hash ring. See:
Gossip protocols and consensus algorithms underpin everything
Large distributed systems usually have to track changes in cluster topology in response to network partitions, failures, and scaling events. Various protocols exist to ensure that this can happen, with varying levels of consistency and complexity. This needs to be dynamic and real time because:
- Nodes come and go in elastic systems
- Failures need to be detected quickly
- Load and state need to be rebalanced in real time
With a stateful system like Ably, state also needs to be moved in real time between new and old nodes whilst providing continuity throughout.
If you have never worked with Gossip or consensus algorithms, then I recommend you read up on:
- Gossip protocol
- Paxos protocol
- Raft consensus algorithm
- Popular consensus backed systems like etcd and Zookeeper, and gossip backed systems like Serf
Eventually consistent data types and read and write consistencies
Generally in a distributed system, locks are impractical to implement and impossible to scale. As a result, trade-offs need to be made between the consistency and availability of data. In many cases, availability can be prioritised and consistency guarantees weakened to eventual consistency, with data structures such as CRDTs.
If you’re not familiar with CRDT or Operational Transform, the concepts of variable consistencies for queries or writes to data in a distributed data store, then you’ve got some reading to do:
- Operational Transform — originally implemented by Google in their Wave product and now in Google Docs. It has uses in collaboration apps, but OTs are complex and not widely implemented.
- Conflict-free Replicated Data Types or CRDT provides an eventually consistent result so long as the data types available are used. Used by Riak distributed database and presence in Phoenix.
- Consistency levels for both read and writes in distributed databases like Cassandra.
Deep understanding of network protocols
In a distributed system, you’ll almost certainly be working within all layers of the networking stack. Ably relies extensively on various higher level protocols such as HTTP, WebSockets, gRPC, and TCP sockets. But without a deep understanding of those protocols and the full stack of protocols they rely on all the way down to the OS itself, you’ll likely struggle to solve problems in a distributed system when things go wrong.
Take for example the following request or WebSocket connection which involves all of the following. At each layer you should be confident in your understanding and ability to debug problems at a packet or frame level:
- DNS protocol and UDP for address lookup
- File descriptors (on *nix) and buffers used for connections, NAT tables, conntrack tables etc.
- IP to route packets between hosts
- TCP to establish a connection
- TLS handshakes, termination and certificate authentication
- HTTP/1.1 or more recently 2.0 used extensively by gRPC
- WebSocket upgrades over HTTP
And that’s not all…
From our perspective of operating a truly global and distributed system, a working understanding of the specific concepts described above is what we expect from a distributed systems engineer. Before that you need to also be a solid systems engineer. This requires you to have fundamentals in place such as programming languages, general design patterns, version control, infrastructure management, and continuous integration and deployment systems.
Further reading
- An introduction to distributed systems by Kyle Kingsbury
- Hidden scaling issues of distributed systems
- How Ably efficiently implemented consistent hashing
- Implementing connection state and stream continuity over a distributed realtime messaging system
- Building a distributed rate limiter that scales horizontally
About the author
I’m Matt, technical CEO and co-founder of Ably. I'm very interested in realtime problems, realtime engineering, and where the industry is headed. That’s the reason I co-founded Ably, which provides cloud infrastructure and APIs to help developers simplify complex realtime engineering. Organizations build with Ably because we make it easy to power and scale realtime features in apps, or distribute data streams to third-party developers as realtime APIs.
We're hiring across our engineering and commercial teams, including Distributed Systems Engineering roles so check out our job board. Not looking for a role but know someone who is? Refer an engineer that we employ and we'll send you $3k as a thanks. One email = $3k.
And if you’re interested in chatting to me about realtime problems, distributed systems, or this article, please do reach out to me @mattheworiordan or @ablyrealtime on Twitter.