Service availability refers to the percentage of time a service is accessible and functions as expected. When systems perform inconsistently or behave unexpectedly, users lose trust. And lost trust often translates to lost sales and productivity.

Engineering teams measure system reliability through Service Level Indicators (SLIs). SLIs are specific metrics like request success rate, response time, and error rate. Teams then set Service Level Objectives (SLOs), which define acceptable thresholds for these SLI metrics over specific time periods. For example: “99.9% of API requests must return successfully within 300ms, measured over a rolling 30-day period.”

Meeting these targets requires systems built for scalability, maintainability, and fault tolerance. However, improvements in availability become exponentially harder and more expensive at each incremental “nine” of uptime (e.g., 99.9% uptime is significantly cheaper and easier than 99.99% uptime). Teams must balance engineering effort, cost, and user needs.

Summary of service availability best practices

The table below summarizes the nine service availability best practices this article will explore in detail.  

| Best practice | Description |
| --- | --- |
| Set effective SLOs | Choose SLOs that reflect user experience, not internal metrics. User response times matter more than CPU usage spikes that don't affect users. |
| Monitor error budgets | Track how much failure your SLOs allow within the time window before taking action. For a 99.9% latency SLO over 30 days, you have a 0.1% error budget. Monitor the burn rate: if you consume 50% of your monthly budget in 3 days, take immediate action before violating the SLO. |
| Implement redundancy and failover to improve service availability | Use active-active or active-passive infrastructure to eliminate single points of failure, ensuring services remain available during a fault. |
| Allow for graceful degradation | Enable partial functionality when components fail. Serve cached results instead of error messages when personalization services are unavailable. |
| Enable CI/CD and immutable infrastructure | Build infrastructure using IaC for reproducibility and quick recovery. Use blue/green or canary deployments to reduce risk during changes. |
| Create liveness, readiness, and startup probes | Use liveness probes to restart failed containers, readiness probes to route traffic only to healthy services, and startup probes to verify initialization completion. |
| Align service availability targets with business objectives | Base SLOs on user impact and business costs of downtime. Targets that exceed historical performance without investment create unrealistic expectations. |
| Maintain operational runbooks for incident response | Document failure modes and recovery steps. Keep runbooks in version control where on-call teams can find them quickly. |
| Implement incident retrospectives | Review what broke, why it broke, and how to prevent recurrence. Focus on systems and processes, not individuals, to encourage honest analysis. |


Set effective SLOs

To set meaningful Service Level Objectives (SLOs), it’s important to understand these three core concepts:

  • Service Level Indicators (SLIs) are quantitative measures of service performance, such as the percentage of successful HTTP requests or the 95th percentile response latency.
  • Service Level Objectives (SLOs) are specific targets for those SLIs over a defined period. For example, “99.9% of requests must succeed over a rolling 30-day window.”
  • Service Level Agreements (SLAs) are formal commitments to customers, often including financial penalties for non-compliance, which are typically built on top of SLOs.

Establishing effective SLOs begins with choosing SLIs that reflect the user experience, not just internal system metrics. While a high CPU load might concern engineers, it may not warrant inclusion in an SLO if end users are unaffected. Focus on indicators like error rates, latency, and availability that directly impact users.

Setting SLOs requires collaboration between engineers, product managers, and business stakeholders to define acceptable levels of reliability. Engineers may want conservative targets to avoid failure, while others might push for optimistic targets in pursuit of “perfect” reliability. Effective SLOs must be meaningful (measure user impact), achievable (based on historical performance and planned investment), and actionable (trigger specific responses when breached).

SLOs should evolve alongside your system and customer expectations through the SLO Development Lifecycle (SLODLC):

  1. Define – Understand your users and what matters to them.
  2. Design – Choose appropriate SLIs and set initial SLO targets.
  3. Experiment – Try them out in a non-critical setting if possible.
  4. Observe – Collect data and assess how well the system meets the objectives.
  5. Refine – Adjust SLIs or SLOs based on real-world results.
  6. Reassess – Periodically revisit SLOs as both the product and users change.

For availability-related SLOs, these SLIs are frequently used (a short computation sketch follows the list):

  • Request success rate – Percentage of requests that return successful (2xx) responses rather than server errors (5xx).
  • Endpoint uptime – Measured via synthetic or real-user monitoring.
  • Latency thresholds – Percentage of requests completing within a target latency, such as 300ms.
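
To make these SLIs concrete, the minimal sketch below derives a success-rate SLI and a latency SLI from a batch of request records; the sample data and the 300ms threshold are illustrative assumptions:

# Illustrative SLI computation over a batch of request records.
# Each record is (http_status, latency_ms); the data is made up.
requests = [(200, 120), (200, 240), (500, 90), (200, 310), (204, 180)]

total = len(requests)
successful = sum(1 for status, _ in requests if 200 <= status < 300)
within_threshold = sum(1 for _, latency in requests if latency <= 300)

success_rate_sli = successful / total      # request success rate
latency_sli = within_threshold / total     # share of requests completing <= 300 ms

print(f"Success rate SLI: {success_rate_sli:.1%}")       # 80.0%
print(f"Latency SLI (<= 300 ms): {latency_sli:.1%}")     # 80.0%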

When setting availability targets, consider historical performance data, system limitations, and user expectations. Each additional "nine" of availability requires exponentially more engineering effort and cost:

| Availability target | Allowed downtime (monthly) |
| --- | --- |
| 99.9% (“three nines”) | ~43.8 minutes |
| 99.99% (“four nines”) | ~4.38 minutes |
| 99.999% (“five nines”) | ~26.3 seconds |


For example, if historical data shows 99.9% of user requests complete in under 300ms, aiming for 99.999% without significant engineering investment may be unrealistic, while targeting 99.99% with additional budget allocated could be achievable.
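
The downtime figures in the table follow directly from the availability percentage. A quick sketch of the arithmetic, assuming an average month of 365.25 / 12 ≈ 30.44 days:

# Allowed monthly downtime for a given availability target,
# assuming an average month of 365.25 / 12 days.
AVG_MONTH_SECONDS = 365.25 / 12 * 24 * 60 * 60

def allowed_downtime_seconds(availability: float) -> float:
    return (1.0 - availability) * AVG_MONTH_SECONDS

for target in (0.999, 0.9999, 0.99999):
    seconds = allowed_downtime_seconds(target)
    print(f"{target:.3%}: {seconds / 60:.1f} minutes ({seconds:.1f} s)")
# 99.900%: 43.8 minutes, 99.990%: 4.4 minutes, 99.999%: 26.3 seconds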

Monitor error budgets

Once SLOs are in place, error budgets become powerful decision-making tools. An error budget is the inverse of your SLO. If your service has a 99.9% target for requests completing in under 300ms, then 0.1% of requests can exceed 300ms; that’s your error budget.

Error budgets give teams a margin for imperfection. Instead of alerting on every latency spike, you alert based on the error budget burn rate. If a CPU spike causes some requests to take longer than 300ms, this may be acceptable. You only alert when the number of slow responses approaches your 0.1% budget.

This approach reduces alert noise and supports more informed incident response. For example, burning error budget at five times the sustainable rate over the past hour would trigger an alert, while a burn rate of only 1.2x might simply be logged for later review.
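
As a minimal sketch of this logic, the snippet below computes a burn rate as the ratio of the observed error rate to the rate the SLO allows; the 99.9% target and the 5x/1.2x alert thresholds mirror the examples above, and everything else is an assumption:

# Minimal burn-rate sketch; the 99.9% target and the alert thresholds
# mirror the examples in the text and are otherwise assumptions.
SLO_TARGET = 0.999                  # 99.9% of requests within 300 ms
ERROR_BUDGET = 1.0 - SLO_TARGET     # 0.1% of requests may be slow or fail

def burn_rate(good: int, total: int) -> float:
    """Ratio of the observed error rate to the allowed error rate.
    1.0 means the budget is spent at exactly the sustainable pace;
    higher values mean it will run out before the window ends."""
    observed_error_rate = 1.0 - (good / total)
    return observed_error_rate / ERROR_BUDGET

# Example: in the last hour, 60 of 10,000 requests exceeded 300 ms.
rate = burn_rate(good=9_940, total=10_000)
if rate >= 5.0:       # fast burn over a short window: page immediately
    print(f"ALERT: error budget burning at {rate:.1f}x the sustainable rate")
elif rate >= 1.2:     # slow burn: record for later review
    print(f"Note: error budget burning at {rate:.1f}x the sustainable rate")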

Error budgets also serve as governance tools for decision-making. The table below demonstrates three common use cases where error budgets can directly inform real-world decisions. 

| Decision | Description |
| --- | --- |
| Deployment freezes | Stop new feature deployments when the error budget is exhausted. If more than 0.1% of requests have exceeded 300ms this month, postpone releases until reliability improves. |
| Engineering prioritization | Reallocate effort to stability work when burning budget quickly. If you're consuming 50% of your monthly error budget in the first week, shift focus from features to performance optimization. |
| Objective adjustments | Tighten targets if error budgets go consistently unused. If you regularly achieve 99.95% instead of your 99.9% target, consider raising the bar to 99.95%. |

Modern SLO management tools, such as Nobl9, integrate with observability platforms to automatically track burn rates. Rather than alerting on raw metrics, these systems alert when SLOs are at risk based on current consumption patterns.

Nobl9 error budget monitoring dashboard.

Error budgets shift teams from reactive firefighting to data-driven reliability management. When implemented properly, they provide an objective framework for balancing feature delivery with system reliability.


Implement redundancy and failover to improve service availability 

Beyond monitoring SLIs and managing error budgets, design and implement architectures that eliminate single points of failure to maximize service availability.

Active-active clusters distribute traffic across multiple zones or regions using strategies like weighted routing, geographic distribution, or round-robin, providing resilience and performance. They, however, introduce complexity around data consistency and potential split-brain scenarios where disconnected nodes make conflicting decisions. Active-active works best for stateless services or when you can accept eventual consistency.

Active-passive configurations keep a secondary instance idle until failure is detected. While this avoids consistency issues and reduces resource costs, it introduces failover delays and requires maintaining data synchronization, plus automating the failover process. Expect some service disruption during the transition from primary to secondary.

Both approaches require automated failover logic. Manual intervention during outages increases downtime and introduces human error.
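
For illustration, a bare-bones version of that failover logic might look like the sketch below, which health-checks the active endpoint and promotes the standby after repeated failures; the endpoints, threshold, and check interval are hypothetical:

import time
import urllib.request

# Hypothetical active-passive failover loop; endpoints, threshold, and
# check interval are assumptions for illustration.
PRIMARY = "https://primary.internal.example.com/healthz"
SECONDARY = "https://secondary.internal.example.com/healthz"
FAILURE_THRESHOLD = 3          # consecutive failed checks before failing over

def is_healthy(url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:            # connection errors, timeouts, HTTP errors
        return False

def monitor() -> None:
    active, standby = PRIMARY, SECONDARY
    failures = 0
    while True:
        if is_healthy(active):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                # In practice this step would update DNS records or
                # load-balancer target groups, not a local variable.
                active, standby = standby, active
                failures = 0
                print(f"Failed over; new active endpoint: {active}")
        time.sleep(10)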

A microservice architecture designed across multiple AWS regions might use:

  • Asynchronous messaging with regional queues and cross-region replication for durability and decoupled service communication
  • Read-only failover nodes in secondary regions to ensure degraded service rather than total failure during primary region outages

Without automated load balancing to distribute traffic, the infrastructure remains vulnerable to downtime during regional failures.

Active-active setup with load balancer and two regions. (Source)

Allow for graceful degradation

Highly available systems aim to stay useful even when things go wrong. Delivering partial functionality when full functionality isn’t possible, known as graceful degradation, helps to achieve this. 

Rather than failing when a dependency is unavailable, allow your system to degrade in a controlled way. For example:

  • If a personalization service is down, serve cached recommendations instead of nothing.
  • If a payment gateway is unreachable, allow users to continue browsing and save their cart for later checkout.

This keeps users engaged and prevents temporary backend issues from escalating into poor user experiences.

Content Delivery Networks (CDNs) serve cached responses for static or semi-dynamic content, reducing load on origin services. When an origin service fails, users continue to receive cached content from their local CDN, ensuring functionality remains intact even during central system outages.

Circuit breakers automatically detect failing components and cut them off temporarily to prevent cascading failures. This prevents wasting resources on calls that cannot be completed and provides the downstream service time to recover.
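
A minimal circuit-breaker sketch is shown below; it short-circuits calls after repeated failures and serves a cached fallback while the dependency recovers. The threshold, cooldown, and the stand-in recommendation functions are assumptions:

import time

class CircuitBreaker:
    """Minimal circuit breaker: opens after repeated failures and
    short-circuits calls until a cooldown period has passed."""

    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()      # circuit open: skip the remote call
            self.opened_at = None      # cooldown elapsed: allow a retry
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()

# Stand-ins for a flaky personalization service and its cache.
def fetch_recommendations(user_id: str) -> list:
    raise TimeoutError("personalization service unavailable")

def cached_recommendations(user_id: str) -> list:
    return ["fallback-item-1", "fallback-item-2"]

breaker = CircuitBreaker()
items = breaker.call(
    fn=lambda: fetch_recommendations("user-42"),
    fallback=lambda: cached_recommendations("user-42"),
)
print(items)   # cached results instead of an error page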

Feature flags enable two types of degradation strategies, illustrated in the sketch after this list:

  • Operational toggles turn non-essential features off during incidents to preserve performance and availability for core functionality. If resource usage spikes, these flags temporarily remove the functionality causing the spike.
  • Kill switches quickly disable problematic features without requiring deployments. If a new feature causes issues in production, kill switches can remove it immediately while the team investigates and fixes the problem.
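
The sketch below shows how such flags might gate a non-essential feature at request time; the in-memory flag store and flag names are purely illustrative:

# Purely illustrative in-memory flag store; a real system would read flags
# from a feature-flag service or configuration store at runtime.
FLAGS = {
    "recommendations_enabled": True,   # operational toggle for a non-core feature
    "new_checkout_flow": False,        # kill switch flipped off after an incident
}

def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)

def render_homepage(user_id: str) -> dict:
    page = {"core_content": f"catalog for {user_id}"}
    # Degrade gracefully: skip personalization while its toggle is off.
    if is_enabled("recommendations_enabled"):
        page["recommendations"] = ["item-a", "item-b"]
    return page

print(render_homepage("user-42"))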

These approaches enhance user experience, as users perceive the app as functional, even if some features are temporarily unavailable. Pressure on backend systems during incidents is reduced, and overall system availability improves without requiring 100% backend reliability.

Graceful degradation acknowledges the realities of distributed systems. Perfect reliability is impossible, but well-designed systems can still deliver a quality user experience even in the event of failures.

Enable CI/CD and immutable infrastructure

High availability isn’t just about how systems behave at runtime but also how they are built, deployed, and changed over time. CI/CD practices and immutable infrastructure help reduce risk, improve consistency, and enable faster recovery.

Build your environments using Infrastructure as Code to ensure reproducibility, reduce configuration drift, and support quick recovery during failures.

Canary deployments reduce the blast radius of changes by gradually shifting traffic to new versions, starting with a small subset of users and validating success before rolling out to more users. Teams should monitor real-time metrics, such as error rates, response times, and throughput, during rollouts. If these metrics show problems, halt the deployment or roll back before a broad impact occurs.
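
As an illustration of that gating logic, the sketch below steps canary traffic up in stages and aborts if the observed error rate crosses a threshold; the traffic steps, the 1% limit, and the metrics source are assumptions:

import random
import time

# Hypothetical canary gating logic; the traffic steps, error-rate limit,
# and the metrics source are all assumptions for illustration.
TRAFFIC_STEPS = [1, 5, 25, 50, 100]    # percent of traffic on the new version
ERROR_RATE_LIMIT = 0.01                # abort if the canary error rate exceeds 1%

def canary_error_rate() -> float:
    """Stand-in for querying a metrics backend for the canary's error rate."""
    return random.uniform(0.0, 0.02)

def run_canary() -> bool:
    for percent in TRAFFIC_STEPS:
        print(f"Shifting {percent}% of traffic to the new version")
        time.sleep(1)                          # in practice: a soak period
        if canary_error_rate() > ERROR_RATE_LIMIT:
            print("Error rate above limit; rolling back")
            return False
    print("Canary healthy at 100%; promotion complete")
    return True

run_canary()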

Canary deployment in CI/CD. (Source)

Blue/green deployments involve deploying a complete replacement infrastructure, testing it thoroughly, and then instantly cutting over traffic once confidence is established that the new environment is fully operational. This maintains availability by ensuring the new environment is functional before receiving traffic.
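
A simplified model of the cutover mechanism is sketched below: all traffic follows a single active-environment pointer, so switching (or rolling back) is one atomic update. The environment URLs are placeholders:

# Simplified blue/green cutover: all traffic follows a single "active"
# pointer, so switching environments (or rolling back) is one atomic update.
ENVIRONMENTS = {
    "blue": "https://blue.internal.example.com",    # currently serving traffic
    "green": "https://green.internal.example.com",  # new version, fully deployed
}

class BlueGreenRouter:
    def __init__(self, active: str = "blue"):
        self.active = active

    def backend_url(self) -> str:
        """All production traffic is routed to the active environment."""
        return ENVIRONMENTS[self.active]

    def cut_over(self) -> None:
        """Switch traffic once the idle environment has passed its checks."""
        self.active = "green" if self.active == "blue" else "blue"

router = BlueGreenRouter(active="blue")
# ... run smoke tests and health checks against ENVIRONMENTS["green"] ...
router.cut_over()     # traffic now flows to green
# Rolling back is the same single switch: router.cut_over()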

Blue/Green deployment in CI/CD

Teams can combine these strategies by using blue/green infrastructure deployment with canary traffic patterns for maximum safety, but should consider the operational overhead.

Implementing immutable infrastructure in your CI/CD pipeline, that is, replacing your components instead of reconfiguring them, streamlines operations and builds a foundation for resilience, reliability, and velocity to coexist. Consistent environments across dev, staging, and production reduce surprises during deployment, enabling faster, safer rollouts and the confident shipping of new features. Recovery is also simplified because environments are predictable and declarative.


Create liveness, readiness, and startup probes

To maintain service availability, your services must communicate their health to users and the systems that manage them. This can be achieved through liveness, readiness, and startup probes. Although these serve similar purposes, they trigger different responses when they fail.

Liveness probes detect when a process is stuck or has failed at a basic level. Implement endpoints such as ‘/healthz’ that return a 2xx response confirming the service is alive. Confirming a service is listening on a known TCP port can also help verify its liveness. When liveness probes fail, orchestrators like Kubernetes restart the container.

Readiness probes assess whether a service functions correctly and can handle requests. Create endpoints such as '/health' that check database connectivity, verify dependencies are reachable, and confirm the application can process requests. When readiness probes fail, the service is removed from load balancer endpoints but not restarted, allowing it to recover without disruption.

Startup probes indicate whether the service has completed initialization and is prepared to handle requests. These should verify that the application has finished its startup sequence: loading configuration, establishing database connections, warming caches, etc. Unlike readiness probes, startup probes are one-time checks that run only during container startup. Failures prevent the container from starting but don't affect running containers.
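
To show what the probed endpoints themselves might look like, here is a minimal sketch using Flask (chosen only for brevity); the startup flag and database check are placeholders for real initialization and dependency checks:

from flask import Flask, jsonify

app = Flask(__name__)
startup_complete = False    # flipped to True once config, connections, and caches are ready

def database_reachable() -> bool:
    """Placeholder for a real dependency check (e.g., a cheap SELECT 1)."""
    return True

@app.route("/startup")
def startup():
    # Startup probe: succeeds only once initialization has finished.
    status = 200 if startup_complete else 503
    return jsonify(started=startup_complete), status

@app.route("/healthz")
def healthz():
    # Liveness probe: the process can serve this, so it is not stuck.
    return jsonify(alive=True), 200

@app.route("/health")
def health():
    # Readiness probe: verify the service can actually handle requests.
    ready = startup_complete and database_reachable()
    status = 200 if ready else 503
    return jsonify(ready=ready), status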

All three probe types enable self-healing systems and smooth deployment workflows. Combined, they support zero-downtime deployments: rolling updates receive real-time feedback on service health, traffic is drained gracefully from old instances only after new ones are ready, and load balancers and orchestrators can automate recovery and scale safely.

Implementing health checks correctly ensures that your platform responds intelligently to failures and state transitions, rather than relying on reactive human intervention.

The YAML below shows a typical Kubernetes Deployment configuration making use of these three probes:


apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web-app
        image: web-app:latest
        ports:
        - containerPort: 8080
        
        # Startup probe - runs only during container startup
        startupProbe:
          httpGet:
            path: /startup
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 6  # 30 seconds total before giving up
        
        # Liveness probe - restarts container if fails
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        
        # Readiness probe - removes from service if fails
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2

Align service availability targets with business objectives

A common failure in a reliability strategy is treating SLOs as purely technical goals disconnected from what the business or users need. Effective SLOs should be grounded in real-world impact and business value.

Designing for 99.999% availability (26 seconds of monthly downtime) requires significant complexity and cost increases. For most consumer applications, this investment may not deliver proportionate value. However, for critical systems like medical devices, financial trading platforms, or emergency services, such targets may be justified by regulatory requirements or user safety needs.

Recovery time objective (RTO) visualized. (Source)

Targets that exceed system capability or historical performance without corresponding investment set teams up for failure and undermine trust in the process. Conversely, when error budgets go consistently unused, this may indicate either highly reliable service delivery or SLOs that have become disconnected from actual system performance and are too lenient.

When creating SLOs, begin by identifying specific user impact scenarios and quantifying the business costs of different failure types (a simple cost sketch follows the list):

  • User research: What response times do users notice? At what point do they abandon tasks?
  • Business impact analysis: What does one hour of downtime cost in lost revenue, customer support load, or reputation?
  • Regulatory requirements: Are there compliance standards that mandate a specific availability level?
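
For the business impact question in particular, even a rough model helps anchor the discussion, as in the sketch below; every figure in it is an illustrative assumption:

# Rough, illustrative downtime cost model; every input is an assumption.
revenue_per_hour = 12_000        # average revenue processed per hour ($)
conversion_loss_factor = 0.8     # share of that revenue lost during an outage
support_cost_per_hour = 1_500    # extra support staffing and ticket handling ($)

def downtime_cost(hours: float) -> float:
    return hours * (revenue_per_hour * conversion_loss_factor + support_cost_per_hour)

# Worst-case monthly cost if the full error budget is spent as downtime
# (43.8 and 4.38 minutes per month, from the availability table above).
for target, minutes in [("99.9%", 43.8), ("99.99%", 4.38)]:
    print(f"{target}: up to ${downtime_cost(minutes / 60):,.0f} per month")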

This data should inform SLO targets. Collaborate across engineering, product, and business teams, treating SLOs and error budgets as tools for prioritization decisions rather than just reporting metrics.

When SLOs reflect business goals, they become strategic tools that help balance feature delivery with operational excellence and make reliability a product-oriented decision, rather than an infrastructure one.

Maintain operational runbooks for incident response

When failures occur, preparedness often determines the outcome. Actionable runbooks empower teams to respond to and recover from incidents quickly and consistently.

Document known failure modes so your team can handle failure effectively. For known issues, include symptoms, diagnostics, and step-by-step recovery actions. Store your runbooks in source control or an internal knowledge base, ensuring that on-call teams can quickly locate them.

Run incident response drills to simulate outages and recovery procedures, such as service restarts, traffic rerouting, and deployment rollbacks. Treat these as learning opportunities for engineers responsible for responding to real incidents, and as an opportunity to improve your systems and documentation. At the same time, test backup restoration, and define Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs).

Feed learnings back into system design and SLO adjustments, particularly during the Refine and Reassess phases of the SLO Development Lifecycle. Look for opportunities to automate failover and failback procedures to reduce human error and time to recovery.

A well-maintained runbook culture turns unknowns into knowns, reduces stress during high-pressure moments, and shortens the path to recovery while closing the loop back into system improvement.


Implement incident retrospectives

An incident that impacted service availability ends when you understand what happened and improve from it, not when the service is restored. Retrospectives (or postmortems) are critical to building resilient systems and teams.

Retrospectives identify what broke, why, and how to prevent recurrence. They validate or adjust SLOs to ensure they accurately reflect user expectations and system capabilities. They also reinforce a culture of ownership without fear.

Teams should strive to conduct “blameless retrospectives” that focus on systems and processes, not individuals. Ask "what happened and why?" rather than "who caused this?" This encourages honest reporting and deeper insights into root causes.

SLO review is essential in any retrospective. Examine whether the incident breached an SLO, whether alerting captured the right signals, and whether user-facing impacts were detected or missed. Document timelines, contributing factors, and lessons learned in a shared wiki or incident management system where other teams can access them.

Use retrospectives as input for concrete improvements: updating runbooks and automations, refining alert thresholds and SLIs, adjusting SLOs during the Refine and Reassess phases, and prioritizing engineering work to reduce risk.

Effective incident retrospectives transform failures into learning opportunities that drive continuous improvement for systems and processes.

Conclusion

Service availability requires more than maximizing uptime. Systems must handle failure gracefully, and teams must plan for, detect, and recover from problems effectively. This means building systems that fail predictably, recover quickly, and degrade gracefully when components are unavailable.

Resilience requires a continuous effort that combines thoughtful design, systematic monitoring, and disciplined operational practices. By following the nine best practices we’ve covered in this article, engineering teams can strike a balance between availability, cost, and effort that reliably meets their business objectives.
