A Best Practices Guide to High Availability Design
High availability (HA) design focuses on building systems that remain operational despite hardware, software, or infrastructure failures. The goal is to minimize downtime and ensure service continuity, as even short outages can result in significant financial and operational costs.
This article covers how to design for high availability in modern systems: defining HA, distinguishing between types of failures, and explaining key metrics. It also discusses core design principles, failover strategies, and architectural patterns. You’ll learn how to balance availability with durability, manage complexity, and avoid over-engineering. You’ll also explore testing and monitoring techniques, chaos engineering, and incident analysis for high-availability systems.
Summary of key high availability design concepts
The table below summarizes the high availability design concepts this article will explore in more detail.
| Concept | Description |
| --- | --- |
| Planned and unplanned failures | Planned failures include maintenance and upgrades of infrastructure. Unplanned failures stem from hardware faults, network issues, software bugs, etc. HA systems must handle both. |
| Downtime | Any period when a service is unavailable. Downtime impacts revenue, user trust, and operational continuity. Its business impact can be significant, making HA a key architectural goal. |
| Reliability metrics | High availability is quantified using metrics such as availability percentage, SLIs, and SLOs. These measurements help assess system reliability and inform design targets for service performance. |
| Core principles of high-availability design | Design for redundancy across all layers (compute, network, storage). Use load balancing, replication, and failover mechanisms. Choose between active-passive (simpler) and active-active (higher uptime, more complexity) architectures based on trade-offs. |
| Load balancing | Distributes traffic across instances or regions to prevent service overload and support resilience under peak demand. |
| Replication | Protects against local failures and ensures continuity. |
| Failover | Automatic switching to backup systems during failures. |
| Active-passive | Offers simpler consistency, easier recovery, and simpler state management. |
| Active-active | Provides higher availability but requires conflict resolution and advanced integrations. Improves uptime but increases coordination complexity. |
| Best practices for designing a high-availability system | Key practices include prioritizing simplicity over zero-downtime perfection, minimizing dependencies, understanding the trade-offs between consistency, durability, and availability, using health checks, monitoring, self-healing, and chaos engineering, and running post-incident reviews to improve future reliability. |
Understanding high availability
High availability (HA) refers to the architectural design that ensures a system remains accessible and functional for a specific, quantified percentage of time, often expressed using the 'nines' concept (e.g., 99.99% or 'four nines' uptime). High availability design aims to withstand hardware failures, software bugs, network issues, and routine maintenance without causing user-facing outages. Achieving HA involves eliminating single points of failure, incorporating redundancy, automating failover, and monitoring system health continuously.
Even brief disruptions can damage customer trust, result in lost revenue, and breach service-level agreements (SLAs). High availability design reduces these risks by improving fault tolerance and enabling fast, reliable recovery.
Difference between planned and unplanned failures
High-availability systems face both predictable failures that can be planned for and unexpected ones that demand an instant response. Planned and unplanned failures require distinct architectural strategies, tooling, and operational responses. The table below breaks down the characteristics, causes, and mitigation techniques, highlighting how a resilient system must be engineered to handle failures gracefully.
| Aspect | Planned | Unplanned |
| --- | --- | --- |
| Definition | Controlled, intentional interruptions initiated by the engineering or operations team to perform system changes. | Unexpected, involuntary disruptions caused by faults, external conditions, or internal errors. |
| Typical causes | Infrastructure upgrades, database schema migrations, application deployments, OS patching. | Hardware faults (disk, memory, NIC), kernel panics, application crashes, DNS failures, data corruption, power loss, etc. |
| Failure domain | Narrow and scoped to specific services, nodes, or regions under maintenance. | May cascade across services or regions depending on system coupling and fault isolation. |
| Predictability | Fully predictable and scheduled; can be planned during low-traffic windows. | Unpredictable and often time-sensitive; requires real-time detection and automated mitigation. |
| Availability strategy | Achieved through zero-downtime deployment strategies. | Requires architectural resilience such as redundancy, statelessness, automated failover, and self-healing mechanisms. |
| Blast radius control | Can be tightly controlled via circuit breakers, feature flags, rate limiting, and staged rollouts. | Depends on fault isolation (e.g., cell-based architecture, bulkheads, sharded infrastructure) and fail-safes. |
| Recovery process | Rollback mechanisms, versioned deployments, or failover to standby infrastructure. | Failover to redundant components, node replacement, and state restoration from backups or replicated nodes. |
| Impact on SLA | Should not impact the SLA if done correctly; the service remains operational during the change. | Directly threatens SLOs and therefore SLAs. Rapid response and graceful degradation are key to preserving contractual uptime targets. |
The relationship between availability and reliability
People often use the terms "availability" and "reliability" interchangeably, but they measure different aspects of system behavior. Availability tracks a system's uptime and reachability. It’s typically expressed as a percentage (e.g., 99.99%) and reflects the total uptime over a given period.
Reliability measures how consistently a system performs its intended function without failure over a defined period. Teams usually measure reliability with metrics like error rates, success rates, and latency distributions, rather than total downtime.
For example, a server might be responding to requests, but if it has unpredictable latency and periodic errors, it would be considered unreliable.
What impacts overall availability
Designing for availability requires addressing both reliability and resilience. Reliability affects availability by influencing the frequency of system failures, while resilience affects how quickly systems recover from those failures. A reliable system experiences fewer failures, which means less frequent downtime and higher availability. A resilient system recovers quickly when failures do occur, minimizing downtime duration. If a system fails frequently or takes too long to recover, its availability drops, regardless of the amount of redundancy in place.
When to prioritize each in the system design
Prioritize reliability when the system must perform consistently over time without errors. Reliability is particularly crucial for applications that require data integrity and transaction accuracy, such as payment processing, medical records, or industrial control systems. In such cases, even brief errors can have severe consequences.
Prioritize availability when the system must remain accessible regardless of transient failures. This applies to services like video streaming, search engines, or social media platforms, where occasional glitches are acceptable as long as users can continue to interact with the system.
In many cases, you need both, but the emphasis depends on what failure looks like to the end users. If a failure means downtime, focus on availability. If a failure means incorrect behavior, focus on reliability. Design trade-offs should reflect what the end users value most.
Key components of reliability measurement
To design and operate a reliable system, you need to measure its behavior under both normal and failure conditions. The following components help you measure and validate that behavior:
- Service Level Indicators (SLIs): These are the raw metrics that reflect user experience, capturing the user-impacting behavior you care about. Common SLIs include availability percentage, request latency, and error rate. For example, an SLI might track the percentage of successful HTTP responses over a five-minute window.
- Service Level Objectives (SLOs): SLOs define the target values for SLIs, balancing user expectations, risk tolerance, and development velocity via an error budget. If your SLI shows 99.95% availability, your SLO might require that value to stay above 99.9%. These targets help you quantify the level of reliability your service needs to achieve.
- Mean Time to Recovery (MTTR): This tells you how quickly the system can recover after a failure. A lower MTTR means faster recovery, which reduces downtime and helps you stay within SLO limits. Note that tracking MTTR as a reliability metric can lead to false conclusions, as outlier scenarios can skew the mean. Consider evaluating each incident in terms of SLO impact and operational response time instead.
Each of these components works together to achieve high availability. SLIs and SLOs help you define and monitor reliability, while MTTR helps you understand how your system behaves when things go wrong and how long it takes to recover. By selecting meaningful SLIs, aligning SLOs with business goals, and tracking MTTR, you create a feedback loop that both prevents failures and accelerates recovery when they happen. This integrated approach turns raw metrics into actionable reliability engineering.
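To make these definitions concrete, here is a minimal Python sketch of how an availability SLI and its remaining error budget might be computed from request counts. The function names and figures are illustrative and not tied to any particular monitoring tool.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """SLI: fraction of successful requests, expressed as a percentage."""
    if total_requests == 0:
        return 100.0  # no traffic in the window: treat it as fully available
    return successful_requests / total_requests * 100


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget left: 1.0 means untouched, below 0 means breached."""
    allowed_error = 100.0 - slo_target   # e.g., 0.1% of requests for a 99.9% SLO
    observed_error = 100.0 - sli
    return 1.0 - observed_error / allowed_error


# Example: 999,340 successful requests out of 1,000,000 against a 99.9% SLO
sli = availability_sli(999_340, 1_000_000)                     # 99.934%
print(f"SLI: {sli:.3f}%")
print(f"Error budget remaining: {error_budget_remaining(sli, 99.9):.0%}")  # about 34%
```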
Availability formula calculation
Availability measures the percentage of time your system is operational and accessible to users. It’s calculated using this formula:
Availability (%) = (Uptime / (Uptime + Downtime)) × 100
Say your system is expected to run for a full 30-day month, which is 43,200 minutes. If it experiences 10 minutes of downtime, uptime is 43,190 minutes, and you calculate availability as:
Availability = (43,190 / (43,190 + 10)) × 100 ≈ 99.977%
This number provides a baseline for tracking performance against your targets. It also helps you assess how much downtime is acceptable for a given availability target. If you aim for 99.9%, then 43 minutes of downtime per month is acceptable. Anything beyond that means you're off target.
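The same arithmetic is easy to script. The sketch below applies the availability formula and computes the downtime budget for a given target; the helper names and the 30-day period are assumptions for illustration.

```python
def availability_pct(uptime_minutes: float, downtime_minutes: float) -> float:
    """Availability (%) = uptime / (uptime + downtime) * 100."""
    return uptime_minutes / (uptime_minutes + downtime_minutes) * 100


def downtime_budget_minutes(target_pct: float, period_minutes: float = 43_200) -> float:
    """Maximum downtime allowed over the period (default: a 30-day month)."""
    return period_minutes * (1 - target_pct / 100)


# 10 minutes of downtime in a 30-day month (43,190 minutes of uptime)
print(f"{availability_pct(43_190, 10):.4f}%")           # 99.9769%
print(f"{downtime_budget_minutes(99.9):.1f} minutes")   # 43.2 minutes/month for a 99.9% target
```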
MTTR vs SLO
Mean Time to Recovery (MTTR) measures how long it takes to restore service after a failure. While useful, MTTR alone doesn’t reflect whether users are experiencing acceptable service. It only tells how fast you fix issues, not how often those issues occur or how they impact the user experience.
This is where Service Level Objectives (SLOs) offer a stronger lens. Instead of chasing faster incident response times across the board, SLOs help you define what matters to your users. For example, “99.9% of requests should return within 300 ms” or “99.95% uptime over 30 days.”
They allow teams to prioritize incidents based on real user impact and the speed of recovery. For example, a system with a fast MTTR might still breach its SLOs if it fails too frequently. In some cases, a slower recovery is acceptable if outages are rare and don’t violate the agreed-upon objective.
In practice:
- MTTR helps your team monitor internal performance.
- SLOs align reliability work with business goals and user satisfaction.
By prioritizing SLOs, you ensure you're restoring service quickly and maintaining the level of service your users care about.
The nines of availability and what they mean in practice
Availability is often expressed in terms of “nines”, which is a shorthand for how much uptime a system delivers over a given period. In the context of availability, five nines is considered "near-perfect" uptime. It allows for no more than 5 minutes and 15.6 seconds of downtime per year, or roughly 6 seconds per week. Note that it does not have to be expressed as a number of nines, and it’s perfectly acceptable to express an SLA as something like 99.5%.
Number of 9s vs Cost
Each additional nine requires exponentially more effort in terms of infrastructure, testing, automation, redundancy, and monitoring. Achieving three nines might be feasible for most production workloads. Pushing for five nines involves significant trade-offs, including cost, complexity, and diminishing returns.
The target depends on what your users expect. A social media feed going down for a minute won’t carry the same weight as a payment API failing. SLOs help you define what “good enough” means in the context of user experience and business priorities without unnecessarily chasing higher availability.
Aiming for more nines is valuable when those extra decimal points matter to your users and only if you’re equipped to support the engineering investment required to sustain them.
Core principles of high-availability design
Designing for high availability means accepting that failure is inevitable and building systems that can absorb it without downtime. Below are the foundational principles that support this goal.
Redundancy at all layers
Redundancy prevents single points of failure by duplicating critical components across the stack:
- Network: Utilize multiple network paths and redundant routers to ensure connectivity remains intact in the event of hardware issues or link drops.
- Compute: Run workloads across multiple instances. In Kubernetes, this might mean pod replication; with traditional virtual machines, it could mean spreading instances across multiple availability zones.
- Storage: Protect against data loss and hardware failure with RAID for local redundancy and replication for distributed durability.
- Databases: Use read replicas, clustering, and multi-primary configurations to allow systems to fail over without disrupting writes.
The goal is to ensure that no single hardware or software failure can cause the system to fail.
Load balancing, circuit breakers, and exponential backoff
- Load balancing distributes traffic across multiple nodes, improving resilience and ensuring no single component is overloaded.
- Circuit breakers monitor failure rates, timeouts, and error patterns from the client side to prevent the calling service from being overwhelmed by a failing dependency.
- Exponential backoff protects your infrastructure by spacing out retries, which prevents surges of retry storms during partial outages.
These patterns help systems degrade gracefully under pressure, rather than failing outright.
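As a rough illustration of the last two patterns, here is a minimal Python sketch of retries with exponential backoff and jitter, plus a simple consecutive-failure circuit breaker. Production-grade implementations (resilience libraries, service meshes) handle more states and edge cases; the class and parameter names here are illustrative.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=0.2, max_delay=10.0):
    """Retry a flaky call with exponential backoff and jitter to avoid retry storms."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to an exponentially growing cap.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))


class CircuitBreaker:
    """Open the circuit after consecutive failures; probe again after a cooldown."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow a single probe request
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result
```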
Replication across locations
Running services in multiple regions reduces the blast radius of localized failures. Even if one data center goes offline, another can continue serving requests. This is essential for systems that need to maintain global uptime targets.
The effects of a single-region outage on the availability in a multi-region architecture. (Source)
The diagram above illustrates a regional outage in a highly available, multi-region deployment on Google Cloud. Traffic is routed through global Cloud DNS and regional HTTPS load balancers to redundant web and app tiers across multiple zones. The backend uses Cloud Spanner with multi-region replicas to ensure data availability and consistency even during regional failures.
Failover mechanisms
Failover allows a backup component to take over automatically when the primary fails. This could be DNS-based failover between regions, database failover from primary to replica, or application-level failover using active-active or active-passive models.
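The sketch below shows one way health-check-driven failover might look in application code, assuming hypothetical primary and replica endpoints. Real deployments typically delegate this to DNS, a load balancer, or the database's own failover tooling, and must also fence the failed primary.

```python
import time
import urllib.request

# Hypothetical endpoints; in practice these come from service discovery or configuration.
PRIMARY = "https://db-primary.example.internal/healthz"
REPLICA = "https://db-replica.example.internal/healthz"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Treat any HTTP 200 within the timeout as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False


def select_endpoint(failures_before_failover: int = 3) -> str:
    """Fail over to the replica after several consecutive failed primary checks."""
    consecutive_failures = 0
    while True:
        if is_healthy(PRIMARY):
            return PRIMARY
        consecutive_failures += 1
        if consecutive_failures >= failures_before_failover:
            return REPLICA  # real systems also fence or demote the old primary
        time.sleep(1)
```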
Architectural patterns
The two main categories, active-passive and active-active, cover a spectrum of strategies, each balancing cost, complexity, and recovery performance (RPO/RTO).
Active-passive
The most basic form of active-passive failover is the Pilot Light approach, where the core infrastructure, such as a minimal set of servers or container images, is pre-provisioned and kept in an idle state. Only essential services are deployed, reducing recovery time to tens of minutes. For example, using Infrastructure-as-Code (IaC) tools like Terraform, you can quickly spin up a complete production environment when disaster strikes.
Pilot light approach implemented on AWS cloud. (Source)
As you can see, a full application stack is active in one region, while a minimal, scaled-down replica is maintained in a secondary region. In a failover, traffic is rerouted, and the scaled-down components in the pilot light region are rapidly brought online to restore service. The downside is that during a significant regional disruption, many other organizations may be competing for the same capacity as they spin up replacement infrastructure. For critical applications, this approach may not be sufficient because such scenarios are difficult to test.
Warm Standby brings systems closer to readiness by running services at reduced capacity in a secondary region or environment. Databases remain live and replicated, and application servers are partially scaled out. This strategy, often used with auto-scaling groups and cross-region replication (e.g., Amazon RDS + EC2), can reduce RTO and RPO to a few minutes but incurs higher costs and complexity.
Warm-standby pattern implemented on GCP. (Source)
In the above scenario, the primary application resides on-premises. Snapshots of web and application servers are stored in GCP, while the on-premises database replicates to a Google Cloud database server, with connectivity via dedicated Interconnect or IPsec VPN. These snapshots are regularly updated from reference servers already running in GCP, allowing for rapid activation and a balance between recovery speed and cost.
Active-active
Active-active is a fully redundant approach. It runs multiple nodes or regions in parallel, distributing traffic across them via load balancers or DNS-based routing. This model is common in global systems like Cloudflare’s edge network, Google's Spanner, or services using multi-region Kubernetes clusters with global load balancers. These systems achieve near real-time recovery but require complex coordination, global data replication (often with eventual or quorum-based consistency), and significantly higher operational costs. Active-active setups must also address challenges like split-brain scenarios, where isolated regions continue to accept writes during network partitions, leading to data divergence and consistency conflicts that require conflict resolution strategies or strong consensus protocols.
Active-active architectural pattern implementation on AWS cloud (source)
The above diagram illustrates an active-active architectural pattern that utilizes multiple AWS regions, where both regions simultaneously serve traffic via Route 53. DynamoDB global tables provide automatic, continuous data replication between regions, ensuring low-latency data access and resilience for the application's backend.
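To illustrate the conflict-resolution problem mentioned above, here is a minimal last-write-wins merge in Python. The record shape and timestamps are assumptions for the sketch; many systems instead use vector clocks, CRDTs, or consensus protocols when silently discarding a concurrent write is unacceptable.

```python
from dataclasses import dataclass


@dataclass
class VersionedRecord:
    value: str
    timestamp_ms: int   # writer-assigned wall-clock timestamp
    region: str         # tie-breaker so every region converges on the same winner


def last_write_wins(a: VersionedRecord, b: VersionedRecord) -> VersionedRecord:
    """Deterministic merge for divergent writes accepted in different regions.

    Simple, but it can drop a concurrent write; stricter systems use CRDTs,
    vector clocks, or consensus instead.
    """
    return max(a, b, key=lambda r: (r.timestamp_ms, r.region))


# Both regions accepted a write for the same key during a network partition.
us = VersionedRecord("shipped", 1_700_000_000_000, region="us-east-1")
eu = VersionedRecord("cancelled", 1_700_000_000_250, region="eu-west-1")
print(last_write_wins(us, eu).value)  # "cancelled": the later timestamp wins in both regions
```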
Best practices for high availability design
This section examines key high availability design strategies, with a focus on practices that ensure continuous service and minimize downtime.
Prioritize simplicity and avoid over-engineering
One of the most common pitfalls in HA system design is chasing theoretical perfection through excessive complexity. Going from 99.9% to 99.99% uptime might be worth it, but pushing for five or six nines often leads to diminishing returns. The cost in engineering time, infrastructure overhead, and operational complexity rises exponentially. At the same time, the benefit to users can be marginal, especially if those users aren't affected by rare edge-case outages.
Emphasize simplicity over zero-downtime perfection
Each additional "nine" of uptime requires increasingly elaborate redundancy strategies: multi-region failover, real-time replication, quorum consensus, and automated recovery from rare failure modes. These layers increase the surface area for bugs and can create new avenues for failure, ironically undermining the goal of high availability. Instead, focus on designing for graceful degradation. If a subsystem fails, ensure the rest of the system continues to operate. Not every failure needs to be masked entirely, but it should be isolated.
Minimize moving parts and manage complexity
Every additional dependency, such as external APIs, queues, and storage layers, increases the number of failure points. Aim for a stateless design: Stateless services are easier to scale, restart, and replace. Prefer a few well-tested components over many loosely coupled ones. If a feature requires multiple services to coordinate, ask whether the complexity is justified.
Consider cost, performance, and practical limits
Design around your actual service-level agreement (SLA) instead of a hypothetical ideal. For most businesses, four nines (99.99%) is a stretch goal; anything beyond that is likely overkill unless you’re running a financial exchange or a medical monitoring system. Factor in operational realities: latency budgets, failover timing, and the cost of redundancy. Balance performance with resilience; faster isn’t always better if it increases brittleness.
When operating at scale, reliability targets often need to reflect the real-world interdependencies between services. Composite SLOs allow you to build hierarchical models that aggregate the behavior of multiple services or components, each weighted according to its contribution to the user experience. This enables a reliability envelope that’s aligned with the actual structure of your system, rather than assuming uniform importance across all parts.
Nobl9’s composite SLO overview tiles. (Source)
Instead of tuning each SLO to an artificially strict threshold, you can define a broader reliability goal and delegate that goal across services, accounting for trade-offs between performance, risk, and cost. If a non-critical backend service burns more error budget than usual but the end-user experience is unaffected, the composite can absorb that variance without triggering false alarms. You gain the flexibility to express nuanced service objectives while maintaining a realistic, system-level view of availability.
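As a simplified illustration of the idea (not Nobl9's actual algorithm), the sketch below aggregates component error budgets into a weighted composite; the service names, weights, and budget figures are made up.

```python
# Each component reports the fraction of its error budget remaining (1.0 = untouched).
# Weights reflect how much each service contributes to the user experience.
components = {
    "checkout-api":   {"weight": 0.5, "budget_remaining": 0.80},
    "search":         {"weight": 0.3, "budget_remaining": 0.90},
    "recommendation": {"weight": 0.2, "budget_remaining": -0.10},  # breached, but low weight
}


def composite_budget_remaining(parts: dict) -> float:
    """Weighted average of remaining error budget across components."""
    total_weight = sum(c["weight"] for c in parts.values())
    return sum(c["weight"] * c["budget_remaining"] for c in parts.values()) / total_weight


print(f"{composite_budget_remaining(components):.0%}")  # about 65%: the low-weight breach is absorbed
```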
Test disaster scenarios frequently
Reliability comes from proving your recovery strategy under controlled, repeatable failure scenarios. Run failure drills at every level: automated chaos tests for service restarts, scheduled region failovers, and full disaster recovery rehearsals periodically. Track recovery against RTO and RPO targets, and monitor user-facing SLIs to reveal hidden weaknesses. Also, validate alerts and team response under realistic pressure. Design recovery processes to be safe, predictable, and easy to exercise, so confidence comes from practice, not assumptions.
Balance availability and durability
Let’s examine how to make informed trade-offs based on your system's needs to strike the right balance between availability and durability.
Choose downtime over potential data loss
Data integrity should take priority over uptime. Temporary unavailability is usually recoverable. Lost or corrupted data isn't. When faced with a trade-off, it's better to serve errors or throttle traffic than to risk inconsistent writes with partial transactions. A brief outage impacts user experience, but silent data loss breaks user trust.
Understand trade-offs in consistency
Designing distributed systems means making deliberate trade-offs, especially under network partitions. According to the CAP theorem, during a network partition, you must choose between consistency and availability. Since partition tolerance is essential in distributed systems, the real trade-off arises only when partitions occur, and even then, it's about balancing degrees of consistency and availability.
Some systems, such as Dynamo or Cassandra, prioritize availability and eventual consistency, while others (e.g., ZooKeeper) prioritize strict consistency at the cost of occasional unavailability. Choose based on your application’s tolerance for stale data.
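A compact way to reason about this trade-off is quorum arithmetic: with N replicas, a write quorum W, and a read quorum R, every read overlaps the latest acknowledged write only when R + W > N. A small sketch:

```python
def is_strongly_consistent(n_replicas: int, write_quorum: int, read_quorum: int) -> bool:
    """With R + W > N, every read quorum overlaps every write quorum, so reads
    see the latest acknowledged write, at the cost of availability when too
    many replicas are unreachable."""
    return read_quorum + write_quorum > n_replicas


print(is_strongly_consistent(3, write_quorum=2, read_quorum=2))  # True:  consistency-leaning
print(is_strongly_consistent(3, write_quorum=1, read_quorum=1))  # False: availability-leaning
```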
Be cautious with distributed synchronization
Distributed coordination, such as distributed locks and leader election, introduces many failure points, including clock drift, complex consensus algorithms struggling with edge cases, and misconfigurations that appear in production. If you must use distributed coordination, keep the critical section small and ensure it is failure-tolerant. Avoid synchronizing systems that don't need strong consistency guarantees.
Automate recovery with Closed Loop Incident Management
Closed Loop Incident Management aims to automate the entire recovery pipeline, e.g., detecting a failure, executing remediation, and confirming resolution without any human intervention. This includes self-healing mechanisms such as restarting failed pods, rebalancing traffic, and rolling back bad deployments. It reduces time-to-recovery and makes the system more resilient during off-hours and scaling events. Always design automation to be inherently safe, including checks to prevent uncontrolled remediation loops based on false alerts.
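A minimal sketch of such a detect-remediate-verify loop, with a cap on automated restarts as the safety check, might look like the following; the injected `is_healthy` and `restart` callables are placeholders for whatever platform API (kubectl, systemd, a cloud SDK) you actually use.

```python
import time

MAX_RESTARTS_PER_HOUR = 3          # guard against uncontrolled remediation loops
restart_history: list[float] = []  # timestamps of recent automated restarts


def remediate(service: str, is_healthy, restart) -> str:
    """Detect -> remediate -> verify, with a cap on automated actions."""
    if is_healthy(service):
        return "healthy"
    now = time.monotonic()
    recent = [t for t in restart_history if now - t < 3600]
    if len(recent) >= MAX_RESTARTS_PER_HOUR:
        return "escalate-to-human"   # automation has hit its safety limit
    restart(service)
    restart_history.append(now)
    time.sleep(30)                   # give the service time to come back
    return "healthy" if is_healthy(service) else "escalate-to-human"
```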
Test and monitor for resilience
Even well-architected systems can fail under pressure. Learn how to proactively test and monitor your system’s resilience before users feel the impact.
Health checks and failure detection logic
Implement both application-level and system-level health checks. For applications, use endpoints like /healthz (HTTP) or gRPC health probes to expose liveness and readiness. At the infrastructure level, monitor disk space, memory usage, CPU saturation, and I/O bottlenecks. These signals should feed directly into orchestration platforms to remove unhealthy instances before users are impacted.
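For example, a bare-bones liveness/readiness endpoint built only on the Python standard library might look like this; the `/healthz` and `/readyz` paths and the disk-space threshold are illustrative choices.

```python
import json
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":          # liveness: the process is up and serving
            self._respond(200, {"status": "ok"})
        elif self.path == "/readyz":         # readiness: dependencies and resources look usable
            disk = shutil.disk_usage("/")
            ready = disk.free / disk.total > 0.05   # example check: more than 5% disk free
            self._respond(200 if ready else 503,
                          {"disk_free_ratio": round(disk.free / disk.total, 3)})
        else:
            self._respond(404, {"error": "not found"})

    def _respond(self, code: int, body: dict):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```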
Incident-aware SLO management
During major incidents, operational priorities shift from strict SLO adherence to service recovery. Not all incidents are equal. A flexible system should be able to adapt its Service Level Objectives (SLOs) based on real-time conditions.
For example, during a major incident or under extreme load, you might temporarily accept degraded performance (e.g., higher latency or error rates) as an unavoidable consequence while you work to restore service stability and functionality. Your focus moves from strictly meeting the SLO in that moment to recovering to a state where the SLO can be met again, understanding that the incident itself consumes part of your error budget. This is an operational decision about how to manage a crisis, not a redefinition of the SLO itself.
Nobl9’s Service Health Dashboard complements this by giving you a real-time, high-level view of where those shifts are impacting service health. Instead of chasing static thresholds, you can prioritize recovery efforts where error budgets are burning fastest and SLOs are under stress, without losing sight of the big picture.
Service Health Dashboard: Error budget (Source)
You can correlate operational changes with organization-wide reliability metrics without needing to inspect individual SLO definitions in isolation. You can also avoid alert fatigue by contextualizing meaningful deviations based on current conditions and temporary trade-offs made to preserve broader system stability.
Having a consolidated and adaptive view of service health ensures you’re making decisions with full awareness of shifting reliability posture. You can use dynamic SLOs to automate reliability tuning while relying on the dashboard to validate that the system is still operating within acceptable bounds.
Real-time monitoring and alerting
Combine metrics, structured logging, and distributed tracing to gain a complete view of system health. Use Prometheus, OpenTelemetry, or similar tools to track request latency, error rates, resource usage, and queue depths. Set alerts based on symptoms, e.g., a sustained increase in error rate combined with a drop in traffic should trigger escalation. Distributed tracing is mandatory for diagnosing bottlenecks and pinpointing failures in distributed microservice architectures.
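One common symptom-based approach is multi-window burn-rate alerting, sketched below. The 14.4 burn-rate threshold is a commonly cited example (roughly the monthly budget gone in two days), and the window sizes are assumptions you would tune to your own SLO.

```python
def burn_rate(observed_error_ratio: float, slo_target: float = 99.9) -> float:
    """How fast the error budget is burning: 1.0 means exactly on budget."""
    allowed_error_ratio = 1 - slo_target / 100        # 0.001 for a 99.9% SLO
    return observed_error_ratio / allowed_error_ratio


def should_page(error_ratio_1h: float, error_ratio_5m: float) -> bool:
    """Page only when both a long and a short window burn fast: the symptom is
    real (long window) and still ongoing (short window), filtering brief blips."""
    return burn_rate(error_ratio_1h) > 14.4 and burn_rate(error_ratio_5m) > 14.4


print(should_page(error_ratio_1h=0.02, error_ratio_5m=0.03))    # True: sustained burn, page
print(should_page(error_ratio_1h=0.0005, error_ratio_5m=0.03))  # False: short spike only
```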
Self-healing mechanisms
Build your system to recover from failure automatically. Use circuit breakers, retries with backoff, and graceful degradation to absorb transient failures. Stateless services can be redeployed or replaced on the fly. These techniques reduce downtime and limit blast radius when failures occur. For example, Kubernetes makes this easier with liveness and readiness probes, restart policies, and automatic rescheduling.
Chaos engineering and failure simulation
Intentionally inject failures to uncover brittle assumptions. Use chaos engineering to simulate availability zone (AZ) failures, kill random services, throttle network connections, and corrupt disk I/O. This forces your system to confront real-world stress and exposes weak points before users do. Additionally, run backup and restore drills, validate failover paths, and ensure observability remains intact during chaos. If you can't simulate it, you probably can't recover from it.
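A chaos experiment is essentially a hypothesis test against a steady-state metric. The sketch below frames a latency-injection experiment that way; the injection and measurement functions are stubs standing in for your chaos tooling and monitoring APIs.

```python
import random
import time


def inject_latency(service: str, extra_ms: int) -> None:
    # Stub: in practice, call your chaos tooling (e.g., service-mesh fault injection).
    print(f"injecting {extra_ms}ms of latency into {service}")


def remove_latency(service: str) -> None:
    print(f"removing injected latency from {service}")


def measure_p99_latency_ms(service: str) -> float:
    # Stub: in practice, query your monitoring system for the current p99.
    return random.uniform(120, 280)


def run_latency_experiment(service: str, extra_ms: int = 200, slo_p99_ms: float = 300) -> dict:
    """Steady-state hypothesis: p99 latency stays within the SLO while a dependency is slowed."""
    baseline = measure_p99_latency_ms(service)
    inject_latency(service, extra_ms)
    try:
        time.sleep(5)  # let the fault propagate (shortened for the sketch)
        degraded = measure_p99_latency_ms(service)
        return {
            "baseline_ms": round(baseline, 1),
            "degraded_ms": round(degraded, 1),
            "hypothesis_holds": degraded <= slo_p99_ms,
        }
    finally:
        remove_latency(service)  # always roll the injected fault back


print(run_latency_experiment("checkout-api"))
```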
Post-incident analysis and learning
Every outage is an opportunity to improve. This section covers how to turn incidents into long-term reliability gains through structured analysis and learning.
Blameless root cause analysis
Blameless postmortems create a safe environment where engineers can speak honestly about what they saw, did, missed, or misunderstood without fear of reprisal. This encourages transparency and accelerates organizational learning. Focus on what failed in the process, not who failed.
Make your analysis structured:
- Start with a clear timeline: what happened, when, and how the system responded.
- Use data: metrics, logs, traces, screenshots, or whatever shows what happened.
- Patch the architectural weak spots, for example: missing retries, unbounded queues, poor failover logic, etc.
- Fix detection gaps, i.e., maybe the alert fired but didn’t reach the right team.
- Inspect whether SLOs performed as expected, and consider adding new SLOs or updating existing ones.
Use structured methods like the “Five Whys”
Effective post-incident reviews dig past surface-level symptoms. Techniques like the Five Whys help uncover deeper, systemic causes. For example:
- Why did the service crash? → It ran out of memory.
- Why did it run out of memory? → A memory leak in a background worker.
- Why wasn’t the leak caught earlier? → No monitoring on heap usage.
- Why wasn’t heap monitoring configured? → Monitoring templates don’t cover this service type.
- Why not? → Template governance is inconsistent.
This kind of reasoning reveals design gaps, tooling weaknesses, and process oversights that can be addressed to reduce future risk.
Feed insights back into design
Tie remediation items to your backlog and track them like any other engineering work. When insights from failures are directly integrated into system design and operations, each incident becomes a lever for recovery and resilience. Here are six tips to help ensure your team gets the most out of its feedback loops:
- Prioritize by impact: Fix what threatens availability first, and ensure remediation efforts target breached SLOs.
- Patch weak links: Revisit your architecture diagram and ask: Did this failure come from a single point of failure, lack of retry logic, or an unbounded queue? Apply targeted fixes by introducing redundancy, backpressure, circuit breakers, and autoscaling where needed.
- Automate recovery paths: Embed recovery steps into orchestration (e.g., Kubernetes jobs or Lambdas) where possible. For example, if a human had to SSH into a box to fix something, write the script and integrate it into your incident response workflow.
- Harden alerts: Refine noisy alerts by tying them to user-facing symptoms and high-cardinality metrics. Move toward SLO-based alerting instead of threshold-based noise.
- Update operational docs: Improve runbooks based on responders' needs. Then, validate those updates in chaos simulations.
- Track remediation like product work: Assign owners, add deadlines, and integrate items into your incident review cadence. Use tags like resilience and availability to group systemic improvements in your backlog.
Last thoughts
Availability and reliability are two sides of the same coin when it comes to high availability design. Each is essential to building performant HA systems that help teams meet or exceed their SLAs and SLOs. The strategies and design patterns (e.g., active-active vs. active-passive) best suited to a specific use case will vary with business requirements, the cost of downtime, and system criticality. Regardless of the techniques you implement, prioritize simplicity, automate recovery safely, test rigorously with chaos engineering, and continuously learn from every incident, feeding improvements back into system design.