
Service level objectives (SLOs) are a core component of site reliability engineering (SRE). They help SRE teams define and maintain reliability based on service-level indicators (SLIs). Creating meaningful SLOs is essential for their success and requires a structured approach that aligns with user expectations and business goals. Implementing well-designed SLOs delivers tangible business benefits, including reduced downtime costs, improved customer satisfaction, and more efficient resource allocation.

The SLO development lifecycle (SLODLC) provides a framework for creating metrics that matter to service-centric organizations. Practitioners in the DevOps community created the SLODLC framework to provide practical, actionable templates and examples to help organizations in their journey to adopt SLOs.

The SLODLC framework has six phases: initiate, discover, design, implement, operate, and periodic review. This article discusses each phase and provides real-world examples of how the framework helps deliver a service and manage the tradeoffs involved in running a digital application.

Summary of key concepts as context for the service level objective examples

| Concept | Description |
| --- | --- |
| SLO development lifecycle | A repeatable methodology for creating metrics that matter |
| Phase 0: Initiate | Teams need alignment on why reliability matters, who the key stakeholders are, and how SLOs will impact decision-making. |
| Phase 1: Discover—Identifying user expectations and SLIs | Focus on understanding how users experience reliability. |
| Phase 2: Design—Defining achievable vs. aspirational SLOs | Once SLIs are identified, set SLO targets that balance realistic reliability with aspirational goals for improvement. |
| Phase 3: Implement—Instrumenting and collecting SLIs | With the SLOs defined, monitor them in production and ensure that teams have visibility into performance. |
| Phase 4: Operate—Using SLOs for incident response and scaling | SLOs guide operational decisions and help teams prioritize reliability efforts vs. new feature work. |
| Periodic review—Adapting SLOs based on performance trends | SLOs aren’t static: They should be regularly reviewed to account for infrastructure changes, user growth, and changes in expectations. |

Phase 0: Initiate

Before setting any SLOs, teams must understand why reliability matters to the relevant stakeholders and how SLOs will impact decision-making. Buy-in from leadership, engineering teams, and other stakeholders is essential for success, and Phase 0 helps achieve this.

The Business Case Worksheet is pivotal to Phase 0. It aims to summarize why you are considering this journey, which helps confirm buy-in from all involved. It establishes the business objectives the SLOs will support, the critical services that require reliability tracking, and how the SLOs will influence engineering trade-offs, among other goals.

Here is an example excerpt from the Business Case Worksheet that could apply to a commerce business. It shows Section 1.2, which lists the achievable goals anticipated from adopting the SLODLC:

Achievable Goals
Goal: Maintain a high-performance and reliable online presence.
Rationale: Downtime or drops in performance can lead to loss of sales and customer disappointment.
Owner: Operations Team

Goal: Improve inventory tracking accuracy.
Rationale: Discrepancies lead to loss of sales and customer disappointment.
Owner: Operations Team

Goal: Reduce stock replacement times.
Rationale: Lack of stock leads to lost sales.
Owner: Supply Chain Team

Goal: Enable real-time stock visibility.
Rationale: End-of-day reporting shows out-of-date stock information.
Owner: IT Team

Goal: Reduce system maintenance costs.
Rationale: Cloud-based systems reduce infrastructure costs and increase scalability.
Owner: Finance Team

Keep each item concise, similar to the example above. Along with the rest of the worksheet, this should ensure that the reasoning for implementing SLOs and SLIs is clear to all stakeholders and that everyone is invested in their success.

Phase 0 should be less about specific SLI targets and more about the high-level SLOs being considered. These can include both highly technical objectives (“Maintaining a high-performance and reliable online presence”) and less technical ones (“Reducing stock replacement times”). What matters in this phase is that the objectives are measurable and can improve business performance, not how they will be measured or what levels they are expected to reach.


Phase 1: Discover—Identifying user expectations and SLIs 

With buy-in confirmed during Phase 0, Phase 1 uses the Discovery Worksheet to establish the specific purpose and impact of the SLOs. Concentrating on the services the SLOs relate to, their context within the business, and the expectations of various stakeholders produces more formal, measurable targets that form the basis of the SLOs.

Example 1

An example of Section 1.4 of the Discovery Worksheet, “Service Expectations,” that could apply to a commerce business is shown below:

Service level agreements with their levels, in order of criticality:
Availability: 99.99% availability across all locations
Response time: 99% of requests completed within 500 ms
Data accuracy: 99.5% inventory accuracy
Data durability: Zero data loss outside of scheduled maintenance windows
Caching: Maintain ≥90% cache hit ratio across all PoPs

Consequences of not meeting SLAs:
Lost sales
Customer dissatisfaction
Staff frustration
Defined by: Executive Leadership and IT Department
Responsible: IT Operations Team

Informal expectations toward services
Expectation: Seamless integration with POS systems

Who defined reliability expectations?
Store Managers

As with Phase 0, this can include both technical and non-technical items. Doing so highlights the difference between users’ and engineers’ expectations, ensuring correct prioritization. While engineers may consider response time or availability the highest priority, users may care more that the data is accurate and durable: a delayed response may be acceptable if the result is always correct. Phase 1 surfaces these differences in expectation and formally prioritizes them.
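It can also help at this stage to translate percentage targets like those above into concrete allowances. The short Python sketch below is illustrative only; it assumes a 30-day window and uses the availability levels from the worksheet examples to show how little downtime each permits:

```python
# Translate availability targets into an allowed-downtime budget.
# Illustrative only: the 30-day window and the targets come from the
# worksheet examples above.

def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of downtime permitted over the window at a given availability target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - target)

for target in (0.999, 0.9999):
    print(f"{target:.2%} over 30 days -> "
          f"{allowed_downtime_minutes(target):.1f} minutes of downtime budget")

# 99.90% over 30 days -> 43.2 minutes of downtime budget
# 99.99% over 30 days -> 4.3 minutes of downtime budget
```

Making the budget explicit like this often prompts useful discussion about whether a stated target (such as 99.99%) is genuinely required or merely aspirational.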

Section 2 of the Discovery Worksheet, “Prioritize User Journeys and Expectations,” captures user expectations more deeply. Here is an example of Section 2.1 capturing a variety of users with their specific expectations for functionality and performance.

2.1. Define the Users of the Service
Store managers:
Need real-time alerts for low-stock items
Need access to data to make ordering decisions
Expect accurate data in real-time
Inventory staff:
Need the ability to update stock levels and manage discrepancies
Expect a user-friendly interface
Need barcode scanning for efficient stock management
Retail customers:
Want to check product availability online before visiting stores
Expect accurate stock information online
Want notifications for restocked items
Expect a reliable and high-performance online presence

Understanding the variety of system users, what they use the system for, and their expectations for successful operation helps identify which SLIs matter most to real-world reliability. Although every business’s prioritization will differ, the example excerpt above could be used to argue that retail customers’ expectations should take priority over store managers’: being unable to check stock availability before visiting a store, or missing a notification that an item is back in stock, could cost the business sales. In contrast, a store manager receiving low-stock alerts daily rather than in real time may be nothing more than a minor inconvenience if stock is dispatched less often than daily.

Note that some users’ expectations may overlap. Thoroughly documenting each user group’s expectations ensures that these overlaps are uncovered and helps with prioritization.

Example 2

With the service level and user expectations agreed on, Section 4, “Observe System Behavior,” establishes how the expectations can be observed and monitored. For example:

4.3.1. Data sources
Uptime monitoring: Use services such as Pingdom or Datadog Synthetic Monitoring for availability monitoring and reporting.
Response times: Web server and API gateway metrics report the latency of all requests.
Reconciliation: Perform checks between the inventory database and ERP source-of-truth.
Backups: Monitor backup success rate alongside RPO and RTO metrics.
Stock movement patterns: Monitor SKU flows among warehouses, stores, and customers.
User interaction logs: Track how customers, store managers, inventory staff, and suppliers interact with the system.
Anomalies and alerts: Detect unexpected stock shortages, overstock situations, or system failures.

The examples above cover just part of the Discovery Worksheet, but when it is worked to completion, a picture develops of the user experience, user expectations, the appropriate SLIs, and where to find the data to measure them.
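For example, once the response-time data source above is feeding in, the SLI from Section 1.4 (“99% of requests completed within 500 ms”) can be computed directly from the collected latencies. Here is a minimal sketch, assuming latencies are available as raw millisecond values:

```python
# Compute the ratio-style latency SLI from raw request latencies.
# Minimal sketch; assumes latencies are gathered in milliseconds from the
# web server / API gateway metrics listed above.

def latency_sli(latencies_ms: list[float], threshold_ms: float = 500.0) -> float:
    """Fraction of requests completed within the threshold (good events / total events)."""
    if not latencies_ms:
        return 1.0  # no traffic in the window: treat the objective as met
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)

sample = [120.0, 340.0, 480.0, 950.0, 210.0]
print(f"SLI: {latency_sli(sample):.2%} of requests under 500 ms")  # SLI: 80.00% of requests under 500 ms
```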


Phase 2: Design—Defining achievable vs. aspirational SLOs

Once the SLIs to be measured are identified, Phase 2 aims to design realistic targets and aspirational goals for them, balancing user expectations against system constraints and differentiating between short-term and long-term goals.

This is achieved using the SLI/SLO Specification Template.

Some example SLI/SLO specifications that could come from the previous discovery phase include the following:

Web Applications—Page Load Time (Latency)
Helps enable the response time identified in Section 1.4
SLI: Percentage of requests completed within 500 ms
SLO: 99% of requests load in <500 ms
Achievable Target: 500 ms
Aspirational Target: 200 ms
Measurement: Web Vitals instrumentation (e.g., Largest Contentful Paint [LCP]) 
Database and Storage—Backup Reliability
Helps enable the data durability requirement identified in Section 1.4
SLI: Percentage of scheduled backups completed successfully
SLO: 99.99% backup success rate
Achievable Target: 99.99%
Aspirational Target: 100%
Measurement: Backup job logs and success notifications
Background Jobs and Event Processing—Task Execution
Helps achieve the stock pattern analysis and anomaly detection identified in Section 4
SLI: Percentage of scheduled jobs executed within 1 minute
SLO: 99.9% of cron jobs execute within 1 minute of schedule
Achievable Target: 1 minute
Aspirational Target: 30 seconds
Measurement: Job execution logs and delay metrics
Web Applications Uptime
Helps achieve the availability/uptime targets identified in Sections 1 and 4
SLI: Percentage of time the web applications are available over a 30-day period
SLO: 99.9% availability across all regions over 30 days
Achievable Target: 99.9%
Aspirational Target: 99.999%
Measurement: Pingdom availability metrics

Each example references the item from the previous phases that justifies measuring the SLO, sets measurable achievable and aspirational targets so it can be confirmed with certainty whether the SLO has been met, and names the measurement used to verify the metric.
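One way to keep these specifications unambiguous is to capture them as structured data. The sketch below is illustrative only (the field names are assumptions, not the SLODLC template schema); it records the page load time example and checks a sample of measurements against both the achievable and aspirational thresholds:

```python
# Capture an SLI/SLO specification as structured data and check measurements
# against both the achievable and aspirational thresholds. Field names are
# illustrative assumptions, not the SLODLC template schema.
from dataclasses import dataclass

@dataclass
class SloSpec:
    name: str
    objective: float               # required good-event ratio, e.g., 0.99
    achievable_threshold_ms: float
    aspirational_threshold_ms: float

def attainment(latencies_ms: list[float], threshold_ms: float) -> float:
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms) if latencies_ms else 1.0

page_load = SloSpec("Web Applications - Page Load Time", 0.99, 500.0, 200.0)
latencies = [120, 340, 180, 480, 950, 210, 190, 450, 230, 160]

for label, threshold in (("achievable", page_load.achievable_threshold_ms),
                         ("aspirational", page_load.aspirational_threshold_ms)):
    ratio = attainment(latencies, threshold)
    print(f"{label} ({threshold:.0f} ms): {ratio:.0%} good, "
          f"objective met: {ratio >= page_load.objective}")
```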

Tips:

  • The SLI, SLO, targets, and measurement should clearly define what is considered an acceptable level of service or an aspiration for improvement.
  • Targets should be set to reasonable levels and may already be regularly achieved. Setting unattainable or overly ambitious targets from the outset does little to assist in genuinely measuring performance.
  • Aspirational targets should be set to a higher standard that is rarely met, if ever, to ensure that improvements are always worked toward.

Phase 3: Implement—Instrumenting and collecting SLIs

The next step is monitoring the defined SLOs in production and ensuring that teams have real-time visibility into performance. With that in mind, the SLODLC Implement Worksheet used in Phase 3 specifies what tools should track SLIs, how to alert teams when SLOs are at risk, and how to collect meaningful metrics without adding unnecessary overhead.

Here is an example excerpt from Section 2 of an “Implement Worksheet” related to the page load time SLO. It defines several ways that statistics relevant to the SLO could be collected.

Server-Side Reporting
Purpose: Data is collected server-side to report on request rates, errors, and user journeys around the application.
Implementation: Consider application performance management tools.
Client-Side Reporting
Purpose: Measure performance from the client’s perspective to ensure no disparity between the server response and client processing.
Implementation: First Contentful Paint (FCP), Time to First Byte (TTFB), and the time between load end and navigation start can all be measured client-side and reported back.
Aggregate and Analyze the Data
Purpose: Transform raw data into actionable insights to monitor and improve user experience.
Implementation: Store metrics in a structured format. Calculate the percentage of page loads meeting the SLO. Visualize this data through dashboards to track performance trends over time.
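As a concrete illustration of the aggregation step just described, the following sketch (assuming page load samples are collected in milliseconds against the 99%-under-500 ms objective) computes the attainment percentage and the share of the error budget consumed, the two figures a dashboard typically surfaces:

```python
# Aggregate page-load samples into SLO attainment and error-budget consumption.
# Minimal sketch; assumes load times are collected in milliseconds and an
# objective of 99% of loads under 500 ms over the reporting window.

SLO_OBJECTIVE = 0.99
THRESHOLD_MS = 500.0

def summarize(load_times_ms: list[float]) -> dict:
    total = len(load_times_ms)
    good = sum(1 for t in load_times_ms if t <= THRESHOLD_MS)
    budget = (1 - SLO_OBJECTIVE) * total        # bad events the SLO allows
    consumed = (total - good) / budget if budget else 0.0
    return {
        "attainment": good / total if total else 1.0,
        "error_budget_consumed": consumed,      # values above 1.0 mean the SLO is breached
        "at_risk": consumed >= 0.8,             # assumed alerting threshold for illustration
    }

samples = [180.0] * 995 + [950.0] * 5
print(summarize(samples))
# attainment 0.995, roughly half the error budget consumed, not yet at risk
```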

Recording, storing, and analyzing metrics is only useful if stakeholders have visibility into the analysis. Visualizations and dashboards, such as those available as part of Nobl9 Reliability Center (shown below), provide clear and concise views of when an SLO is at risk or has been breached, prompting quick reactions to potential issues.

Service Health Dashboard: Error budget

Phase 4: Operate—Using SLOs for incident response and scaling

SLOs should not be treated as just metrics. They should guide operational decisions and help teams prioritize reliability efforts relative to new feature work. They may even influence on-call alerting and decisions about whether the deployment of new features should be reduced to preserve reliability.

Given the example from Phase 3 above, on-call alerting is activated if the aggregated data suggests that fewer than 99% of pages load in under 500 ms over some time interval, indicating that performance has dropped below the previously agreed acceptable level. The Nobl9 platform offers rich features and integrations with various alert systems.
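A minimal sketch of that alerting check, assuming page load times arrive as a stream and evaluating the most recent window of loads (the window size is an arbitrary illustration):

```python
# Evaluate the alerting condition described above over a sliding window of
# recent page loads: fire when fewer than 99% completed in under 500 ms.
# Illustrative sketch; the window size and traffic pattern are assumptions.
from collections import deque

WINDOW = 1000            # most recent page loads considered
OBJECTIVE = 0.99
THRESHOLD_MS = 500.0

recent: deque[float] = deque(maxlen=WINDOW)

def record(load_time_ms: float) -> bool:
    """Record one page load; return True if the on-call alert should fire."""
    recent.append(load_time_ms)
    good = sum(1 for t in recent if t <= THRESHOLD_MS)
    return len(recent) == WINDOW and good / len(recent) < OBJECTIVE

# Simulate mostly fast loads with roughly 2% exceeding the threshold.
for i in range(5000):
    load_time = 900.0 if i % 50 == 0 else 250.0
    if record(load_time):
        print(f"alert: SLO at risk after load {i}")
        break
```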

Depending on the outcome of the on-call investigation, a decision could be made to roll back a recent deployment deemed to have introduced the drop in performance and to restrict further planned deployments until a full investigation into the cause has been completed.

Alternatively, if performance is known to drop during the deployment process, on-call alerting may be paused during a planned deployment, and the temporary drop in performance may be accepted if the data shows that far more than 99% of page loads have recently completed in under 500 ms. In that case, the risk of breaching the SLO’s target is significantly reduced.
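This kind of trade-off can be made explicit. Below is a purely illustrative sketch (the thresholds are assumptions, not SLODLC or Nobl9 guidance) that gates release decisions on how much of the error budget remains:

```python
# Use the remaining error budget to guide release decisions, as described
# above. Purely illustrative; the thresholds are assumptions, not SLODLC or
# Nobl9 guidance.

def release_decision(budget_remaining: float, deployment_planned: bool = False) -> str:
    """budget_remaining: fraction of the error budget left (1.0 = untouched)."""
    if budget_remaining <= 0.0:
        return "SLO breached: freeze feature deployments and prioritize reliability work"
    if budget_remaining < 0.2:
        return "Budget nearly exhausted: restrict deployments and keep alerting active"
    if deployment_planned and budget_remaining > 0.5:
        return "Ample budget: proceed; a brief dip during rollout is acceptable"
    return "Within budget: continue the normal release cadence"

print(release_decision(0.65, deployment_planned=True))
```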

Furthermore, if the data shows that 100% of page loads complete within 500 ms and 98% complete in close to 200 ms, that is a prompt to adjust expectations and work toward the aspirational target of 99% of loads in under 200 ms.


Periodic Review—Adapting SLOs based on performance trends

SLOs aren’t static. As the example above suggests, they should be reviewed regularly to account for infrastructure changes, user growth, and changes in expectations. Periodically review whether SLOs are too aggressive or too lenient, whether changes in traffic patterns require adjustments, and whether you are merely meeting SLOs or comfortably exceeding them; also review whenever a change is made that is known to have a potential impact. The Periodic Review Checklist and Nobl9 consolidated SLO reports can help with this.
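One lightweight way to support such a review is to compare recent attainment with the current objective and flag candidates for tightening or loosening, along the lines of the following sketch (the margins used are illustrative assumptions):

```python
# Flag SLOs for review by comparing recent attainment with the current
# objective. Sketch only; the margins are illustrative assumptions.

def review_hint(objective: float, recent_attainment: float) -> str:
    if recent_attainment < objective:
        return "Missed: investigate, or consider whether the target is too aggressive"
    slack = round(recent_attainment - objective, 6)       # round away float noise
    half_budget = round(0.5 * (1 - objective), 6)
    if slack >= half_budget:  # beat the target by at least half the error budget
        return "Comfortably exceeded: consider proposing a tighter SLO"
    return "On target: no change proposed"

# 99% objective, 99.5% recently observed -> candidate for tightening
print(review_hint(0.99, 0.995))
```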

An example excerpt from a Periodic Review Checklist after reviewing SLOs related to page load times is shown below:

SLIs reviewed: Percentage of page loads completed within the target time.
Review conclusions: Since the implementation of recent optimizations, 99.5% of page loads are completed in under 400 ms, surpassing the current SLO of 99% under 500 ms.
SLOs reviewed: 99% of requests load in under 500 ms.
Review conclusions: Given the consistent achievement of page loads under 400 ms, the SLO could be revised to a more ambitious target, such as 99% under 400 ms, to reflect current performance levels.

SLI/SLO operation health check
Area: Error budget events and alerts
Conclusions, issues, and lessons learned: The frequency of alerts has decreased due to improved performance, indicating better system stability. Recalibrate alert thresholds to prevent complacency.
Area: SLI data cleanliness
Conclusions, issues, and lessons learned: SLI data remains accurate and reliable. 
Area: SLI/SLO adjustments
Conclusions, issues, and lessons learned: With recent performance gains, the current SLO no longer represents a challenging target. To maintain high standards, it should be adjusted to 99% under 400 ms.
Area: SLO insights
Conclusions, issues, and lessons learned: Better performance has improved user satisfaction and engagement. Monitoring user behavior and performance metrics provides insight into the relationship between performance and user experience.
Action items:
Propose and document a revised SLO of 99% of page loads completed in under 400 ms.
Adjust alerting thresholds in monitoring systems to align with the proposed SLO.
Communicate proposed SLO changes to all relevant stakeholders and gather feedback.
Analyze user engagement metrics to assess the impact of improved page load times on behavior.


Conclusion

Implementing effective SLOs involves more than just picking numbers. Following a structured framework that ties business needs, user expectations, and technical realities together, such as the SLODLC methodology and the wide variety of templates and worksheets it offers, helps create actionable, measurable, and adaptable SLOs. This, in turn, helps teams define, measure, and maintain reliability.
