Building Resilient Distributed Systems

Forthcoming book focusing on making both teams and software more resilient

Distributed systems come in many shapes and sizes, and are increasingly a major part of how we build and deliver software. But distributed systems create challenges. On the face of it, they appear to offer the ability to create more stable software. No longer should a single machine dying take our system offline, right? Unfortunately, the world is not that simple.

This book is all about navigating the complex world of distributed systems, with the aim of helping you keep your users happy without going crazy in the process.

Status

This book is currently in early access. You can read the draft chapters available so far on O'Reilly's Online Platform. I plan to release new chapters every couple of months, with a planned publication date of August 2025.

Read The Early Draft Now

Want To Get Updates?

If you'd like to follow the progress of the book, and be notified when new updates are available to read, please sign up to my newsletter. Expect it to be a low-traffic list, posting around once or twice a month.

Who Is This Book For?

This book is primarily for anyone in a technical role who is helping build a distributed system. Developers, architects, testers, operations folks and SREs alike will find a lot here that will help them in their day job.

For the non-technical folks reading, there is still a lot in here for you. The book covers cultural and social aspects and how they can positively impact the resiliency of your distributed systems, and the second half of the book is focused on these aspects.

Whether you are struggling to achieve acceptable resiliency for an existing system, hoping to avoid making too many mistakes as you start your own journey into microservices, or looking to understand what resilience even is, this is the book for you.

Feedback

If you have any feedback after reading the early access draft, then please contact me and let me know!

Planned Contents

This book is still being written. As such, the table of contents is subject to change. If there is something missing here that I should include, please feel free to send me some feedback.

The main body of the book is broken into three separate parts—Foundation, Implementation, and People. Let’s look at what each part covers.

Part 1: Foundation

The first half of this book focuses on the technical aspects of making systems more resilient.

  • Chapter 1: What Is Resilience? - In Early Access
    Resiliency can mean different things to different people, so it's important to start with a shared understanding. This chapter looks at resiliency from a number of angles, exploring concepts from the wider resilience engineering space, and also introducing the concept of sociotechnical systems.
  • Chapter 2: Fundamental Concepts Of Resilience
    There is a lot of conventional wisdom in computing, and often you are just expected to somehow learn it through osmosis. That sucks - so in this chapter I distill down some fundamental concepts which will stand you in good stead when trying to make systems more resilient.
  • Chapter 3: Timeouts - In Early Access
    Networks and computers can be frustrating, and they can stop working (or go slow) at the worst time. Dealing with this fundamental truth starts with knowing when to give up - in this chapter you'll learn all about timeouts, including how to set them correctly and the importance of randomness.
  • Chapter 4: Retries & Idempotency - In Early Access
    If at first you don't succeed, try again! Or maybe, give up? When computers or network calls start failing, trying again often makes sense, and that is the thrust of this chapter. However, trying again means we have to deal with what happens if we end up doing the same work more than once - so I'll also take you on a deep dive into the topic of idempotency.
  • Chapter 5: Rate Limiting
    This chapter explores how to reduce the amount of work your system is doing to keep it stable. It covers back pressure, load shedding, circuit breakers and more.
  • Chapter 6: Queueing
    Queueing can be an effective way to absorb work to be processed later, rather than overwhelming a system with large spikes of traffic. This chapter looks at how queues can be implemented to deal with larger loads, what happens when your queue fills up, the role of message brokers, and also how to balance your queue processing with Little's Law.
  • Chapter 7: Scaling For Resilience
    Often the answer to keeping a system stable when load is increasing is to make the system bigger. In this chapter, you'll see how throwing computing resources at the problem can often help - but also see where it breaks down. This chapter looks at different forms of scaling, including dynamic autoscaling, and gives some concrete tips for when just getting a bigger box is the right answer.
  • Chapter 8: Observing Resilience
    Wanting your system to be resilient is one thing; knowing it's resilient is something else. In this chapter I'll take you through how to collect the information you need to ensure you're meeting your targets. I'll also cover how to define Service Level Objectives (SLOs), explore team vs system targets, and look at why error budgets can be useful.

Part 2: People, Process, and Culture

When considering the resiliency of a distributed system, we have to go beyond the technical and explore the behaviors, culture and processes of the people building and maintaining the system itself.

  • Chapter 9: The Sociotechnical System
    The concept of the sociotechnical system has been around for decades, but only fairly recently has this school of thinking come to the fore in digital system resiliency. In this chapter, I'll take you through the implications of approaching resiliency through a sociotechnical lens, along with some helpful models to make sure you're taking a holistic view of resiliency.
  • Chapter 10: Incident Management
    When things go wrong, what do you do? In this chapter I'll explore how to send good alerts, and how these can then be handled effectively. You'll see how to avoid operator burnout, how to manage support rotas, and also the impact of sending too many alerts.
  • Chapter 11: Constant Learning
    To keep a system resilient, you need to continually improve. At the most basic level, this means making sure that you learn as much as possible in the wake of incidents. So in this chapter, I start by taking you through the importance of post mortems, and give you tips on how to make sure you translate what you learn into action. Learning goes beyond this, however, so you'll see how nurturing the right culture in your organization is vital, as well as how activities like game days can be a useful part of building a learning organization.