A Guide to the Continuous Delivery Maturity Model
Most teams understand that they should deploy faster and more frequently, but speed without the proper foundation can lead to disaster. Consider what happened at Knight Capital in 2012, when a deployment error cost the firm $440 million in just 45 minutes. The firm had deployment velocity but lacked the maturity to support it: missing automated rollbacks, insufficient monitoring, and gaps in its quality gates turned a routine deployment into a catastrophe.
Continuous delivery (CD) is the engineering practice of creating a reliable, automated pipeline that ensures that code is always in a deployable state. It answers the technical question “Can we deploy safely?” Mature CD decouples the ability to deploy from the decision of when to release while shifting focus from technical risk to business strategy.
A continuous delivery maturity model (CDMM) gives you a framework for assessing your team's standing across the software delivery process. It helps you evaluate your current position to make smarter investment decisions about which capabilities to build next. One way to assess such maturity is to examine it across four dimensions (frequency and speed, quality and risk, observability, and experimentation), each with four progressive levels (beginner, intermediate, advanced, and expert).
This article explains each dimension, describes the characteristics of each maturity level, and provides guidance on assessing your current standing and where to go next.
Summary of key continuous delivery maturity model concepts
| Concept | Description |
| --- | --- |
| Continuous delivery maturity model (CDMM) | A framework to assess CD capabilities across dimensions and progressive levels, allowing teams to create maturity profiles that reflect business priorities and constraints. |
| Four maturity dimensions | Frequency and speed, quality and risk, observability, and experimentation |
| Four maturity levels | Progressive levels, each building on the previous one: beginner, intermediate, advanced, and expert |
| Dimension dependencies | Observability typically comes first as a foundation. Quality automation enables frequency increases. Experimentation builds on having both metrics and deployment automation in place. |
| Assessment approach | Teams ideally should examine each dimension individually, involve people across the organization, focus on current capabilities rather than aspirational plans, and watch for gaps between dimensions. |
| Prioritization strategy | Plan progressive improvements one level at a time, match maturity profile to business needs rather than arbitrary ideals, and reassess regularly as the organization grows and the context changes. |
The four maturity dimensions
The CDMM assesses maturity across four dimensions that work together to create a practical continuous delivery approach. Each dimension measures different capabilities, and advancing in one often depends on progress in others. For instance, you can't safely increase deployment frequency without observability to catch problems, and you can't run meaningful experiments without metrics to measure results.

Four maturity dimensions: frequency and speed, quality and risk, observability, and experimentation.
Frequency and speed
Frequency and speed describe how quickly you can incorporate changes into deployable packages. The goal is to transition from weekly builds that require manual coordination to on-demand builds triggered by commits. Faster cycles and smaller code changes create tighter feedback loops and help developers catch bugs while they are still fresh.
The speed of your build pipeline significantly affects the frequency of your builds. Developers usually stop waiting for feedback if builds (or tests) take two hours instead of ten minutes. Deployment speed is also affected by the way you handle build failures: Teams that immediately revert bad commits maintain a higher frequency than those that tolerate broken builds while fixes are being made.
The challenge is that you can't simply decide to deploy more often. Frequency depends on having quality automation and observability in place. Without automated testing, faster deployments simply mean shipping bugs more quickly, and without observability, you're deploying blindly with no way to catch problems before they cascade.
Quality and risk
Quality encompasses test coverage, test reliability, quality gates that block bad builds, and rollback capabilities for quick recovery. The progression typically moves from manual testing with basic automation through comprehensive automated test suites, then to automated rollbacks, and finally to predictive quality signals that catch issues before users encounter them.
Often, teams think of test coverage first when defining quality. However, they should also consider timing because a test that finds a bug two days after someone wrote the code provides less value than one that catches it ten minutes later. Also, flaky tests that fail randomly undermine confidence because developers usually learn to ignore failures, assuming they're just noise. Most importantly, quality gates must be in place to block bad builds without creating false positives that slow down legitimate deployments.
Risk management is therefore critical as deployment frequency increases. Rollback capabilities must work reliably under pressure, and understanding the blast radius is essential. For example, choosing to deploy to 5% of production traffic limits the impact of bugs compared to an all-at-once release to the entire user base.
Quality is about catching bugs early enough that they don't become expensive production incidents.
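To make the quality gate idea concrete, here is a minimal Python sketch of a gate that blocks a build on failed tests, low coverage, or excessive flakiness. The thresholds and the `BuildResult` fields are illustrative assumptions rather than any particular CI tool's API.

```python
from dataclasses import dataclass

@dataclass
class BuildResult:
    tests_passed: int
    tests_failed: int
    coverage: float        # fraction of lines covered, 0.0-1.0
    flaky_failures: int    # failures that passed on retry

def quality_gate(build: BuildResult,
                 min_coverage: float = 0.70,
                 max_flaky: int = 2) -> tuple[bool, str]:
    """Return (allowed, reason). Blocks bad builds before they reach production."""
    if build.tests_failed > 0:
        return False, f"{build.tests_failed} test(s) failed"
    if build.coverage < min_coverage:
        return False, f"coverage {build.coverage:.0%} below target {min_coverage:.0%}"
    if build.flaky_failures > max_flaky:
        # Too much flakiness erodes trust in the gate itself.
        return False, f"{build.flaky_failures} flaky failures exceed budget of {max_flaky}"
    return True, "all quality checks passed"

allowed, reason = quality_gate(BuildResult(tests_passed=412, tests_failed=0,
                                           coverage=0.78, flaky_failures=1))
print(allowed, "-", reason)  # True - all quality checks passed
```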
Observability
Observability gives you visibility into what's happening in your systems and how deployments affect behavior. Teams usually start with basic monitoring that tells them when services are down, then advance to comprehensive metrics tracking errors, latency, and business outcomes, and then add proactive alerting with clear signals. You know your observability has reached maturity when you have business-aligned SLOs that automatically enforce reliability targets.
Observability is the foundation for the other three dimensions:
- Without visibility into what breaks, you can't safely increase deployment frequency.
- Without metrics to measure results, you can't run meaningful experiments.
- Without signals to trigger on, you can't automate rollbacks.
When a deployment causes problems, your observability maturity determines whether you spend a few minutes identifying the issue or hours hunting through logs across multiple systems. The progression from basic to advanced observability shows up in how quickly you can answer critical questions like these:
- Can we correlate a spike in errors with the specific deployment that caused it?
- Can we trace a slow request through all the services it touched?
- Can we demonstrate to executives how reliability improvements protected revenue?
Basic monitoring answers “Is it broken?” while mature observability answers “What broke, why, and what's the business impact?”
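As a rough illustration of the first question above, the sketch below correlates an error-rate spike with the most recent deployment that preceded it. The data structures, timestamps, and the 30-minute lookback window are assumptions made for the example, not the schema of any real monitoring system.

```python
from datetime import datetime, timedelta

# Illustrative data: deployment events and per-minute error-rate samples.
deployments = [
    {"version": "v1.41", "time": datetime(2024, 5, 2, 9, 10)},
    {"version": "v1.42", "time": datetime(2024, 5, 2, 14, 5)},
]
error_rates = [  # (timestamp, fraction of failed requests)
    (datetime(2024, 5, 2, 13, 58), 0.002),
    (datetime(2024, 5, 2, 14, 8), 0.019),
    (datetime(2024, 5, 2, 14, 9), 0.021),
]

def deployment_behind_spike(spike_time, window=timedelta(minutes=30)):
    """Return the most recent deployment within `window` before the spike, if any."""
    candidates = [d for d in deployments
                  if timedelta(0) <= spike_time - d["time"] <= window]
    return max(candidates, key=lambda d: d["time"], default=None)

spike = next(t for t, rate in error_rates if rate > 0.01)  # first sample over 1% errors
suspect = deployment_behind_spike(spike)
print(suspect["version"] if suspect else "no recent deployment")  # v1.42
```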
Experimentation
Experimentation tracks your ability to test changes safely and learn from real user data. The starting point is having static releases, where everyone gets the same version simultaneously. From there, you can add feature flags for controlled rollouts, then A/B testing to compare variants, and, finally, deploy automated systems that progress or halt rollouts based on measured outcomes.
Experimentation decouples deployment from release. It means you can deploy code to production but keep it hidden behind a flag until you're ready to enable it. This separation reduces risk because engineers are not forced into all-or-nothing decisions. For example, testing a new checkout flow with 5% of users and measuring the conversion rate difference gives you data to decide whether the change is effective.
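A minimal sketch of how a percentage-based feature flag might work follows; the hashing scheme, flag name, and bucket count are illustrative assumptions, not any specific feature-flag product's behavior.

```python
import hashlib

def is_enabled(flag_name: str, user_id: str, rollout_percent: float) -> bool:
    """Deterministically assign each user into or out of a rollout bucket.

    Hashing the flag name together with the user ID keeps assignment stable
    across requests, so a given user sees a consistent experience.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000           # 0..9999
    return bucket < rollout_percent * 100           # e.g., 5% -> buckets 0..499

# Roll the new checkout flow out to roughly 5% of users.
enabled_users = sum(is_enabled("new-checkout-flow", f"user-{i}", 5.0)
                    for i in range(100_000))
print(f"{enabled_users / 1000:.1f}% of users see the new flow")  # roughly 5%
```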
Experimentation builds on the other dimensions through specific dependencies:
- You need deployment automation to ship flagged code efficiently.
- You need observability to measure experiment results.
- You need good software engineering practices to make sure your experiment infrastructure doesn't become a source of bugs.
Teams that rush into experimentation without these foundations end up with unmeasurable experiments or feature flag systems that create more problems than they solve.
The four maturity levels
The CDMM defines four progressive levels that build on each other. Each level adds capabilities that weren't possible without the foundation from the previous stage. The idea is that teams progress through these levels at their own rates in different dimensions, creating maturity profiles that reflect their specific priorities and constraints.
Beginner maturity level: Building the foundation
Teams at the beginner level are transitioning from manual processes to automated ones. This transition produces a characteristic set of patterns:
- Inconsistent practices across the team
- Heavy reliance on tribal knowledge
- Reactive rather than proactive responses to problems
- Significant manual coordination for each deployment
Usually, the infrastructure exists but is fragile and dependent on specific individuals who know the workarounds and edge cases.
Example scenario: A mid-sized SaaS company with about 20 developers runs weekly deployments timed to when QA completes manual regression testing. A Confluence page lists 35 deployment steps, but only three people actually know how to execute the process because several steps require undocumented tribal knowledge about configuration quirks. Testing mixes automated core test suites with manual QA verification that takes two days. Unit test coverage is inconsistent because not all developers write tests. Monitoring consists of health checks that ping services, but logs aren't aggregated. When something breaks, someone gets paged, fixes it by restarting a service, then spends the morning SSHing into servers to figure out what happened. Releases are all-or-nothing events where every customer gets the new version simultaneously with no way to test features with subsets of users or roll back individual features.
This level requires establishing fundamental capabilities like:
- Version control for everything that affects deployments to enable reliable build reproduction
- Basic CI automation running on every commit to surface integration issues early rather than late
- Simple monitoring with health checks and basic alerts so you’re not deploying blind
- Repeatable deployment scripts to reduce manual steps and prevent deployment knowledge from staying locked in a few individuals' heads as the team grows
Intermediate maturity level: Orchestrating workflows
Moving from beginner to intermediate maturity requires establishing the automation foundation that enables daily deployments. Once you have version control, basic CI, and scripts in place, you can build orchestration on top of them. Teams at the intermediate level shift from sequential, manual processes to parallel, automated workflows that remove coordination bottlenecks.
Example scenario: A growing fintech startup with 30 developers has moved beyond their beginner-level weekly deployments. Their automated pipeline now triggers on every commit, running unit tests, integration tests, and security scans in parallel. The full test suite completes in 12 minutes, providing rapid feedback. They maintain three environments: development, staging, and production. Code that passes tests in development automatically promotes to staging. After passing smoke tests in staging and getting QA sign-off, the deployment to production requires just a button click rather than a 35-step checklist.
Deployments typically shift to a daily or weekly cadence because automated pipelines handle the work that previously required manual coordination. Builds trigger on every commit, running tests in parallel to provide feedback within minutes rather than hours. Multiple environments exist with automated promotion between them based on quality gates. Testing becomes comprehensive and reliable, with unit test coverage typically reaching 60% to 80%, integration tests covering critical system interactions, end-to-end tests validating key user workflows, and quality gates automatically blocking bad builds before they reach production.
Observability also expands beyond basic health checks to comprehensive metrics collection. Error rates, latency percentiles, and business metrics get tracked across all services. At this level, centralized logging makes debugging practical when requests span multiple services. Teams begin defining SLOs for critical services, typically targeting thresholds like 99.5% success rate or 95th percentile latency under 500 ms.
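To ground those thresholds, here is a small sketch that computes a success-rate SLI against a 99.5% target, the remaining error budget, and a 95th percentile latency. All request counts and latency samples are invented for illustration, and the percentile calculation is deliberately simplified.

```python
# Illustrative numbers for a 30-day window.
total_requests = 4_200_000
failed_requests = 14_700
slo_target = 0.995  # 99.5% success rate

sli = 1 - failed_requests / total_requests                 # observed success rate
error_budget = (1 - slo_target) * total_requests           # failures allowed by the SLO
budget_remaining = 1 - failed_requests / error_budget      # fraction of budget left

print(f"SLI: {sli:.4%}")                                   # 99.6500%
print(f"Error budget remaining: {budget_remaining:.0%}")   # 30%

# 95th percentile latency from a sorted list of samples (in milliseconds).
latencies_ms = sorted([120, 135, 180, 210, 240, 260, 310, 420, 480, 510])
p95_index = int(0.95 * (len(latencies_ms) - 1))
p95 = latencies_ms[p95_index]
print(f"p95 latency: {p95} ms, target: under 500 ms -> {'OK' if p95 < 500 else 'violated'}")
```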

Nobl9 SLO dashboard showing error budget and reliability trends over time
The example Nobl9 SLO dashboard above shows what intermediate observability looks like in practice. Error budget tracking reveals the amount of reliability margin remaining before violating the SLO target. Deployment events are annotated in monitoring dashboards, allowing you to correlate metric changes with specific releases. When error rates spike after a deployment, you know exactly which version caused the problem and when it shipped.
Feature flags are also typically introduced at this level, decoupling deployment from release. You can merge code to main and deploy it to production while keeping new features hidden behind toggles. This way, you can test in production with internal users and gradually roll out features to customers to validate behavior before expanding.
Advanced maturity level: Progressive and intelligent delivery
This level focuses on transitioning from automated workflows to intelligent, data-driven deployment strategies. Teams at the intermediate level can deploy daily with confidence, but they still deploy to everyone at once. Advanced teams deploy multiple times per day using progressive strategies that limit blast radius and enable quick recovery when things go wrong.
Example scenario: Consider deploying a new recommendation algorithm to an ecommerce platform. At the intermediate level, you'd deploy it to everyone at once after it passed your test suite. At the advanced level, the new algorithm is released as a canary deployment to 5% of production traffic while the system monitors error rates and latency in real time. Suppose your baseline shows 0.2% errors and 180 ms p95 latency. During the canary phase, errors increase to 1.5% and latency spikes to 450 ms. After monitoring the canary for 10 minutes and confirming sustained degradation (a 1.5% error rate across multiple time windows), automated rollback triggers fire, reverting the 5% of canary traffic to the old algorithm before the rollout ever expands. The entire process happens with minimal human intervention.
This scenario illustrates the core capabilities at the advanced level:
- Quality practices shift to progressive strategies that limit blast radius through canary and blue-green deployments.
- Canary deployments test new releases with small user subsets, monitoring error rates and latency before expanding rollout.
- Blue-green deployments maintain two complete environments, enabling instant rollback by switching traffic back to the previous version.
These strategies are especially valuable for database migrations or significant architectural changes where rolling back individual components isn't practical.
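Using the numbers from the recommendation-algorithm scenario above, the canary evaluation logic might look roughly like the sketch below. The degradation thresholds and the requirement for sustained degradation across consecutive windows are assumptions for illustration, not the behavior of any particular progressive-delivery tool.

```python
from dataclasses import dataclass

@dataclass
class WindowMetrics:
    error_rate: float   # fraction of failed requests in the window
    p95_ms: float       # 95th percentile latency in the window

BASELINE = WindowMetrics(error_rate=0.002, p95_ms=180)

def evaluate_canary(windows: list[WindowMetrics],
                    max_error_ratio: float = 3.0,
                    max_latency_ratio: float = 2.0) -> str:
    """Decide whether to promote or roll back after several observation windows.

    Degradation must be sustained across every window (not a single blip)
    before an automated rollback is triggered.
    """
    degraded = [
        w.error_rate > BASELINE.error_rate * max_error_ratio
        or w.p95_ms > BASELINE.p95_ms * max_latency_ratio
        for w in windows
    ]
    if all(degraded):
        return "rollback"          # sustained degradation: revert canary traffic
    if not any(degraded):
        return "promote"           # healthy: expand rollout to more traffic
    return "hold"                  # mixed signal: keep observing

# Canary shows 1.5% errors and 450 ms p95 across three consecutive windows.
canary = [WindowMetrics(0.015, 450)] * 3
print(evaluate_canary(canary))     # rollback
```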
At this level, distributed tracing connects requests across service boundaries, showing exactly where latency is added or where failures occur. In the recommendation algorithm example above, distributed tracing would reveal that the latency spike comes from the new algorithm making synchronous calls to a recommendation service that times out under load. The trace would show the full request path:

Another feature of progressive delivery is A/B testing infrastructure, which enables running multiple experiments simultaneously. Statistical analysis determines winners based on conversion rates, engagement metrics, or revenue impact. For example, progressive rollouts begin with 5% of traffic, expand to 25%, and then to 50%, with metrics guiding each expansion. The system can halt rollouts automatically if key metrics degrade.
During this entire process, deployment gates would automatically check metrics before allowing releases to proceed, but these checks occur quickly enough to prevent bottlenecks.
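The statistical analysis behind such rollout decisions can be as simple as a two-proportion z-test on conversion rates, sketched here with Python's standard library. The sample sizes, conversion counts, and significance threshold are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
    """Compare conversion rates of variants A and B; return (z score, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Variant A (control): 1,200 conversions out of 10,000 users (12.0%)
# Variant B (new checkout flow): 1,290 conversions out of 10,000 users (12.9%)
z, p = two_proportion_z_test(1200, 10_000, 1290, 10_000)
print(f"z = {z:.2f}, p = {p:.3f}")
if p < 0.05:
    print("Difference is statistically significant; expand the rollout.")
else:
    print("No significant difference yet; keep collecting data.")
```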
Expert maturity level: Autonomous systems
Autonomous systems are ones that learn, adapt, and correct themselves with minimal human intervention. Advanced teams use progressive strategies and automated rollbacks, but humans still make key decisions about when to deploy and how to respond to issues. Expert teams delegate these decisions to systems that understand both technical and business constraints, creating truly autonomous delivery pipelines.
At this level, releases occur on demand without coordination because the system handles safety checks automatically. Error budgets gate deployments, preventing new releases when reliability is already degraded. Predictive quality identifies issues early, leveraging machine learning models trained on historical deployment data. Chaos testing runs regularly (daily or weekly) in production, injecting failures to validate that automated recovery works as expected. The system not only detects problems but also remediates common failures automatically, performing operations such as scaling up resources, clearing caches, or rerouting traffic to healthy regions.
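A minimal sketch of an error budget gate along these lines is shown below. The budget accounting is simplified, and a production implementation would pull these numbers from an SLO platform rather than hard-coded values.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent for the current SLO window."""
    allowed_failures = (1 - slo_target) * total_events
    actual_failures = total_events - good_events
    return max(0.0, 1 - actual_failures / allowed_failures)

def deployment_allowed(remaining: float, risk: str) -> bool:
    """Gate deployments on remaining budget; riskier changes need more margin."""
    required = {"low": 0.10, "medium": 0.25, "high": 0.50}[risk]
    return remaining >= required

remaining = error_budget_remaining(slo_target=0.999,
                                   good_events=2_997_900, total_events=3_000_000)
print(f"Error budget remaining: {remaining:.0%}")                              # 30%
print("Deploy medium-risk change:", deployment_allowed(remaining, "medium"))   # True
print("Deploy high-risk change:", deployment_allowed(remaining, "high"))       # False
```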
Example scenario: Let’s look at a significant pricing algorithm change for a subscription service. The error budget gate verifies sufficient reliability margin (45% remaining). The predictive quality model flags it as medium risk based on historical pricing changes. The system deploys as a progressive rollout starting at 5% of traffic and monitors both technical metrics (error rates, latency) and business metrics (conversion rate, revenue per user). Technical metrics appear perfect, with 0.1% errors and normal latency. However, after six hours, the conversion rate drops from 12% to 9.5%, and revenue per user decreases by 7%. Automated rollback triggers fire based on business metric degradation, reverting the change even though technical metrics remained healthy. The system has protected approximately $150K in potential lost monthly recurring revenue by catching this within hours rather than days.

Nobl9 error budget visualization showing burn rate and remaining budget
SLO management operates at both the business and technical levels. Executive dashboards correlate reliability metrics with revenue impact, showing how error budget depletion affects conversion rates or customer retention.
Unified SLO tracking, a key Nobl9 feature, aggregates reliability across dozens of services, calculating whether the overall system meets business requirements even when individual services have temporary issues. Experimentation becomes the default operating mode, with multi-armed bandit algorithms automatically optimizing between feature variants and shifting traffic toward better-performing options.

Nobl9 oversight dashboard showing a high-level overview of the current SLO state
Assessing your position and planning improvements
To progress through the maturity model, you need to evaluate your current capabilities and identify any significant gaps.
How to assess yourself
Start by examining each dimension individually. Review the maturity level descriptions to determine which one most closely matches your current capabilities. Most teams find they're at different levels across dimensions, which reflects intentional trade-offs based on business priorities and resource constraints.
The assessment works best when you involve people across your organization, like developers who understand build and test automation or product managers who see how feature releases actually work. Getting input from different perspectives helps prevent blind spots where you overestimate your capabilities in areas you don't work with daily.
Try to answer the following example questions across the four dimensions we have discussed throughout this article.
| Dimension | Example questions to answer |
| --- | --- |
| Frequency and speed | How often do you deploy today, and how long does a change take to go from commit to production? Do builds trigger automatically on every commit? |
| Quality and risk | How much of your testing is automated, and do quality gates block bad builds? Can you roll back a failed deployment quickly and reliably? |
| Observability | Can you correlate an error spike with the deployment that caused it? Do you track errors, latency, and business metrics, and have you defined SLOs for critical services? |
| Experimentation | Can you release a change to a subset of users behind a feature flag? Can you measure the impact of an experiment with real user data? |
Focus on what you can actually do now without new tools or processes; aspirational answers about what you plan to build are not useful. For example, if your monitoring dashboard exists but no one looks at it during deployments, you don't really have observability. If you have feature flags but they're so fragile that teams avoid using them, you don't have real experimentation capability.
Also, watch for red flag combinations that create serious risk:
- If you’re deploying at high frequency with low observability, you’re deploying blindly, and you’re probably unable to quickly identify the root causes of system failures.
- If you’re deploying at high frequency with low quality, you’re shipping bugs fast and turning increased velocity into more incidents.
- If you’re doing advanced experimentation with no metrics, you’re creating unmeasurable results from A/B test variants.
- If you’re deploying daily and testing manually, you’re creating QA bottlenecks, forcing you to either skip testing or slow down deployments.
Using your assessment to prioritize
Once you know where you are, the gaps become more visible. Beware of imbalances in your maturity level and the associated implications. For instance, if you're deploying multiple times per day (advanced frequency) but have only basic monitoring (beginner observability), you’re in the danger zone. When production breaks, you'll spend hours hunting for the cause across your many deployments from the day before. Conversely, if you have comprehensive testing with 80% coverage and full CI automation (intermediate quality) but only weekly deployments (beginner frequency), you're not capitalizing on your quality investments. Your test suite can support daily deployments, but coordination overhead will hold you back.
Dependencies between dimensions determine your path to improvement. Observability typically comes first because you need visibility before you can move quickly and safely. Without metrics showing what breaks after deployments, increasing frequency just accelerates your incident rate. Quality automation enables frequency increases by catching bugs before production rather than after. Experimentation builds on having both metrics and deployment automation in place, since you need infrastructure to toggle features and data to measure their impact.

Dimension dependency flow showing observability as a foundation, enabling both quality and frequency, which together enable experimentation
Plan progressive improvements rather than trying to jump levels. For instance, moving from beginner to intermediate is achievable within a few quarters, while jumping from beginner to expert requires years of investment. Focus on moving one level at a time in your priority dimensions, since each level builds on the previous one.
A small team might advance experimentation quickly since feature flags are relatively simple to implement. However, they often lack the resources for sophisticated observability platforms that aggregate metrics across dozens of services. On the other hand, a large enterprise might have comprehensive observability with dedicated SRE teams maintaining it but struggle to increase frequency due to coordination overhead across multiple product teams. Neither situation is a failure; they're simply different contexts with different constraints.
Your maturity profile should match your business needs rather than an arbitrary ideal. For example, a team supporting high-traffic ecommerce during the holiday season might prioritize observability and quality over experimentation. They need to know immediately when something breaks and prevent incidents entirely. A product team exploring new markets might invest heavily in experimentation, even with basic observability, because learning what customers want matters more than the sophisticated monitoring of features that might be scrapped anyway.
| If you're at this level | Focus your next improvements on |
| --- | --- |
| Beginner in most dimensions | Observability foundation (metrics, logging, basic SLOs) and test automation |
| Intermediate observability, beginner quality | Automated testing and quality gates in CI |
| Intermediate quality, beginner frequency | Deployment automation and reducing manual coordination |
| Advanced in frequency/quality/observability, beginner experimentation | Feature flag infrastructure and progressive rollout strategies |
Conclusion
The continuous delivery maturity model discussed in this article provides a framework for understanding your current position and future direction across four key dimensions: frequency and speed, quality and risk, observability, and experimentation. Teams advance at different rates across these dimensions, creating maturity profiles that reflect their business priorities and constraints rather than following a single prescribed path.
To maximize the benefits of this framework, reassess your maturity regularly as your organization evolves. The maturity level that represents “advanced” for a 10-person startup looks different from what it means for a 500-person enterprise. As you add customers, services, and team members, your context changes, which means your assessment needs to evolve in tandem with that growth. Knowing your current position and using the right tools lets you make better investment decisions.