A Best Practices Guide to Continuous Delivery

An organization’s level of continuous delivery (CD) maturity determines how quickly it can improve its software, but competence in this area is not achieved overnight. Much like software development itself, teams progress incrementally by building capabilities in the right order. Just jumping straight to “automate everything” or “deploy multiple times daily” without the proper practices in place sets teams up for failure. As an example, in 2012, Knight Capital deployed faulty code without adequate deployment safeguards and lost about $440 million in 45 minutes. This serves as a useful reminder that speedy releases without testing, observability, and rollback controls can lead to direct business losses.

Teams mature CD practices across four dimensions: frequency and speed, quality and risk management, observability, and experimentation. These form the basis of the continuous delivery maturity model (CDMM). These dimensions have dependencies; for example, teams deploying multiple times daily without automated testing usually ship avoidable bugs, and teams running A/B tests without the necessary metrics infrastructure struggle to measure which version won. Advancing one dimension without the others creates gaps that can turn into incidents.

This article presents a practical progression of best practices for continuous delivery: Build baseline capabilities, automate testing, add deployment-aware metrics, use feature flags, and then add automated rollbacks before advanced quality gates. We use a checkout service as the running example because it illustrates the core tradeoff: faster releases can increase reliability risk when they lack the right safeguards.

Summary of continuous delivery best practices

Best practice	Description
Build baseline capabilities before specializing	Weekly automated deploys with manual testing and basic monitoring establish a foundation before advancing to multiple daily deploys or advanced practices.
Automate testing before increasing deployment frequency	Automate test suites with 70%+ coverage to catch bugs before production, giving you confidence to speed up deployments safely.
Implement comprehensive metrics before deploying multiple times daily	Comprehensive metrics enable both identifying which deployment broke things and measuring A/B test results.
Use feature flags to decouple deployment from release	Feature flags enable safer progressive rollouts and A/B testing by decoupling deployment from release.
Establish automated rollbacks before advanced quality gates	Use automated rollbacks for basic triggers (e.g., error rates and latency spikes), and make sure they work reliably before building advanced quality gates that use error budgets on top of them.

Build baseline capabilities before specializing

Most teams start where they need to start: shipping what works to get the product out the door. Deployments happen weekly because that's when QA finishes their manual test pass, or configs live in a wiki because that's where someone put them three years ago. Maybe deployments are a 47-step checklist that only two people actually know how to run.

Building a baseline means getting these fundamentals in place:

Version-control everything that affects how your system runs (code, configs, infrastructure scripts, tests, and deployment scripts) so changes are tracked, reviewable, and recoverable when someone inevitably breaks something.
Automate CI so every commit triggers a build and test. You get immediate feedback when something breaks rather than finding out during Friday's deployment.
Set up deployment automation through scripts that run the same way every time, not checklists that vary by who's running them. Every environment gets the same steps in the same order.
Establish basic monitoring so you know when something's broken, e.g., health checks, simple alerts, and a dashboard showing that your services are up.
Keep integrations to your main branch small and frequent, so that when something breaks after a 50-line change, you know where to look instead of digging through thousands of lines.

How long it takes to reach this baseline depends on team capacity, service count, and existing tooling; some teams get there in a few months, while others need longer. You know it is established when every commit runs automated tests, deployments are executed from version-controlled scripts or pipelines, and changes are small enough to isolate and debug quickly. At that point, increasing deployment frequency is lower risk and easier to control.

Automate testing before increasing deployment frequency

Manual testing becomes a bottleneck the moment you try to deploy more than once a day. Your QA team can only work so fast, and if every deployment requires a two-hour manual regression pass, releases start queuing up behind QA time. Automated testing breaks this bottleneck by letting machines handle repetitive verification work while humans focus on exploratory testing and edge cases that require judgment.

Automated testing doesn't eliminate manual testing; it automates the repetitive verification that blocks deployments, freeing you to move faster without shipping bugs. The test pyramid shown below is the standard approach. The bulk of your tests are fast and focused, with fewer slow, comprehensive tests at the top.

Sample test pyramid diagram (source)

The pyramid shape matters because if you invert it and rely mostly on slow end-to-end tests, your CI pipeline takes forever, and developers stop waiting for results. Teams that gate deployments on CI test results usually catch broken code before it reaches production.

Before increasing deployment frequency, make sure three things are true:

Required CI checks finish fast.
Failed checks block releases.
Flaky failures are rare.

Metric	Target	Why it matters
Code coverage	70%+	Provides confidence that changes are actually tested.
Test duration	< 10 minutes	Developers wait for results instead of ignoring them.
Flaky rate	< 1%	Test failures mean something is actually wrong, not random noise you learn to ignore.

These targets are informed by AWS's CI/CD best practices. Use them as operating targets and tune them to your system and team; once you hit them consistently, you have the quality foundation that makes increasing deployment frequency safe rather than reckless.
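
The three targets in the table can be checked mechanically before a team raises its deployment frequency. The following is a minimal sketch in Go, assuming the measurements are already collected somewhere; the `ciStats` type, its field names, and `unmetTargets` are hypothetical and not tied to any particular CI system.

```go
package main

import "fmt"

// ciStats holds hypothetical pipeline measurements for a sampling window.
type ciStats struct {
    CoveragePercent float64 // code coverage, 0-100
    DurationMinutes float64 // wall-clock time of the required checks
    TotalRuns       int     // test runs in the window
    FlakyRuns       int     // runs that failed, then passed on retry with no code change
}

// unmetTargets returns the targets from the table that are not satisfied.
// An empty result means all three targets are met.
func unmetTargets(s ciStats) []string {
    var unmet []string
    if s.CoveragePercent < 70 {
        unmet = append(unmet, "coverage below 70%")
    }
    if s.DurationMinutes >= 10 {
        unmet = append(unmet, "test duration not under 10 minutes")
    }
    // With no recorded runs, a flaky rate under 1% cannot be demonstrated.
    if s.TotalRuns == 0 || float64(s.FlakyRuns)/float64(s.TotalRuns) >= 0.01 {
        unmet = append(unmet, "flaky rate not under 1%")
    }
    return unmet
}

func main() {
    stats := ciStats{CoveragePercent: 74.2, DurationMinutes: 8.5, TotalRuns: 400, FlakyRuns: 6}
    for _, problem := range unmetTargets(stats) {
        fmt.Println("not ready:", problem)
    }
}
```

In this example, coverage and duration pass, but 6 flaky runs out of 400 is a 1.5% flaky rate, so the check reports that one target as unmet.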

Implement comprehensive metrics before deploying multiple times daily

When your checkout service starts failing after the third deployment of the day, you need to know which of those three deployments caused the problem. That means having metrics that correlate issues with specific deployments rather than just showing you that something is broken somewhere.

Service-level indicators (SLIs) measure what actually matters to users rather than just whether your servers are responding to health checks. Combined with service-level objectives (SLOs) that set reliability targets, these metrics provide the foundation for data-driven deployment decisions.

SLI	What it measures	Example SLO
Error rate	Percentage of requests failing	< 0.5% of checkout requests return errors
Latency percentiles	How long successful requests take	95th percentile checkout latency < 2 seconds
Checkout completion rate	Whether users accomplish their goals	99.5% of checkout attempts complete successfully

For a checkout service, the primary SLI is the checkout completion rate. The SLO is the target for that metric, for example, 99.5% of checkout attempts complete successfully. Error rate and latency are supporting SLIs that help explain what changed when the completion rate drops.
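
As a rough illustration of how the primary SLI relates to its SLO, the sketch below computes a completion rate from two counts and compares it with the 99.5% target. The function names and the sample counts are assumptions made for this example; in practice the counts would come from your metrics backend over the SLO window.

```go
package main

import "fmt"

// completionRate returns completed / (completed + failed).
// The second return value is false when there were no attempts to measure.
func completionRate(completed, failed uint64) (float64, bool) {
    total := completed + failed
    if total == 0 {
        return 0, false
    }
    return float64(completed) / float64(total), true
}

// meetsSLO reports whether the measured rate is at or above the target.
func meetsSLO(rate, target float64) bool {
    return rate >= target
}

func main() {
    const target = 0.995 // the 99.5% checkout completion SLO

    // Sample counts for illustration only.
    rate, ok := completionRate(19950, 50)
    if !ok {
        fmt.Println("no checkout attempts in the window")
        return
    }
    fmt.Printf("completion rate %.2f%%, SLO met: %t\n", rate*100, meetsSLO(rate, target))
}
```

Returning a separate "no data" flag keeps an idle window from being reported as either a perfect or a failing result.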

Your deployment pipeline needs SLIs, too, because it can bottleneck everything else. At the intermediate level, you're deploying daily; to move to advanced (multiple times per day), track these pipeline metrics.

Metric	Target	Why it matters
Build time	< 10 minutes	Developers stop waiting for feedback if builds take longer.
Deploy frequency	3+ per day	Higher values indicate healthy flow through the pipeline.
Commit to production	< 2 hours	Fast feedback loops catch bugs sooner.
Rollback time	< 5 minutes	Quick recovery when deployments go wrong.

When build times suddenly jump from 8 minutes to 20, treat it as an incident rather than just accepting slower builds; slow pipelines compound into slower everything else. You can still run A/B tests without deployment or variant tagging, but the results are harder to trust when behavior shifts. To test two checkout flows and see which converts better, tag metrics with both deployment version and experiment variant so you can actually determine a winner.

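import "github.com/prometheus/client_golang/prometheus"

// checkoutAttempts counts checkout attempts, labeled so each data point can be
// traced to a deployment version and an experiment variant.
var checkoutAttempts = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "checkout_attempts_total",
        Help: "Total checkout attempts",
    },
    []string{"status", "deployment_version", "experiment_variant"},
)

func init() {
    // Register the counter so it is exposed by the default registry.
    prometheus.MustRegister(checkoutAttempts)
}

// recordCheckout increments the counter for one finished checkout attempt.
func recordCheckout(success bool, version string, variant string) {
    status := "completed"
    if !success {
        status = "failed"
    }
    checkoutAttempts.WithLabelValues(status, version, variant).Inc()
}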

The deployment version and experiment variant labels give teams the context they need to correlate release changes with SLI movement and A/B test outcomes. In Nobl9, deployment annotations add that context directly to the SLO timeline, making it easier to see when a release or hotfix lines up with a reliability shift.

Nobl9 SLO timeline showing an annotated hotfix deployment and its reliability impact

When checkout completion drops after a deployment, the version and variant labels show exactly where the change started. Instead of guessing across multiple deploys in a day, you can quickly decide whether to keep rolling out or pause to investigate.

Use feature flags to decouple deployments from releases

Deploying code and releasing functionality can be two different things, but most teams treat them as the same event: Code gets merged, the pipeline runs successfully, and users see the change. Feature flags break this coupling by allowing teams to deploy code that remains hidden until explicitly enabled.

For example, a checkout change gated with a feature flag can be merged and deployed with the flag initially off. The rollout then starts by enabling the new flow for a small cohort (e.g., 25%) through a configuration parameter like the one shown below. Key metrics are then monitored, and exposure is either expanded or disabled based on those metrics. Meta has described this same staged approach with Gatekeeper, where feature exposure is managed separately from code deployment.

{
  "new_payment_flow": {
    "enabled": true,
    "variants": [
      { "name": "control", "weight": 0.75 },
      { "name": "treatment", "weight": 0.25 }
    ]
  }
}

The service reads the config above and assigns users to variants using the configured weights for each variant. In this section, control refers to the current checkout flow, while treatment refers to the new checkout flow being tested against it. The code example below uses a simplified binary split (control vs treatment) to show the assignment mechanics.

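func getRequestVariant(r *http.Request, experimentKey string) string {
    // Allow explicit override for testing. Only known variant names are
    // accepted so arbitrary request values never become metric labels.
    if v := r.Header.Get("X-Experiment-Variant"); v == "control" || v == "treatment" {
        return v
    }
    if v := r.URL.Query().Get("variant"); v == "control" || v == "treatment" {
        return v
    }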

    // Use a stable identity so the same user stays in one variant.
    id := getStableExperimentID(r)
    if id == "" {
        // Safe fallback when identity is unavailable.
        return "control"
    }

    // Pull treatment weight from config for this experiment
    // Example: control=0.75, treatment=0.25 -> treatmentWeight=0.25
    treatmentWeight := getVariantWeight(experimentKey, "treatment")
    if treatmentWeight <= 0 {
        return "control"
    }

    // Deterministic bucket in the 0-100 range.
    // Include experimentKey so different experiments do not share buckets.
    bucket := hashToPercent(id + ":" + experimentKey)

    if bucket < treatmentWeight*100 {
        return "treatment"
    }
    return "control"
}

This basic implementation supports intermediate-level feature flags for controlled rollouts. At the advanced maturity level, teams usually add stronger experiment analysis, segment targeting, and automated winner selection. The header and query-parameter overrides let teams force specific variants during development and QA.
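
The assignment function calls `hashToPercent` without showing it. One possible sketch is below, assuming an FNV-1a hash from Go's standard library; the original helper may be implemented differently, and the choice of 10,000 buckets is an assumption made here so that fractional weights can be expressed.

```go
package main

import (
    "fmt"
    "hash/fnv"
)

// hashToPercent maps a key to a deterministic bucket in the range [0, 100).
// This is an illustrative implementation, not the article's original helper.
func hashToPercent(key string) float64 {
    h := fnv.New32a()
    h.Write([]byte(key)) // Write on a hash.Hash never returns an error
    // 10,000 buckets give a resolution of 0.01 percentage points.
    return float64(h.Sum32()%10000) / 100.0
}

func main() {
    key := "user-42" + ":" + "new_payment_flow"
    first := hashToPercent(key)
    second := hashToPercent(key)
    // The same key always maps to the same bucket, so a user stays in one variant.
    fmt.Println(first == second) // prints "true"
}
```

Because the bucket depends only on the key, raising the treatment weight from 25% to 50% keeps every user who was already in treatment there and only moves users from control into treatment.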

Once traffic splits across variants, you need the metrics infrastructure from the previous section to see which variant actually performs better. This is why metrics come first: Without variant labels on your metrics, you can't reliably tell which version won. A properly configured experiment dashboard should show performance differences between variants clearly enough to support data-driven decisions.

For teams starting out, a simple config file and percentage-based assignment covers most use cases. When you need user segments, geographic rollouts, or multiple concurrent experiments, tools like LaunchDarkly or Flagsmith can cover those needs.

Establish automated rollback before advanced quality gates

Automated rollbacks need health signals to trigger on, and those signals need to mean something before you can build more advanced quality gates on top of them. Error-budget deployment gates work best after basic rollback thresholds are reliable, and those thresholds depend on solid metrics that show when something is actually wrong.

Basic rollback triggers

Start with these three triggers as baseline examples; they catch most deployment problems without any fancy analysis. They are common rollback signals, and the exact values should ultimately be tuned to each service's baseline behavior and SLO targets. When any of these thresholds is crossed, the system rolls back without waiting for someone to notice and intervene.

Trigger	Example threshold	What it catches
Error rate	> 5% for 5 minutes	Bad code, dependency failures
Latency degradation	p99 > 4 seconds	Performance regressions, resource exhaustion
Health check failures	3 consecutive failures	Service crashes, infrastructure issues
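
The three triggers in the table can be expressed as a single decision function. The sketch below is one way to do that in Go, using the example thresholds as constants; the `healthSnapshot` type and its fields are hypothetical, and a real system would fill them from its monitoring stack.

```go
package main

import (
    "fmt"
    "time"
)

// healthSnapshot is a hypothetical summary of a deployment's current health.
type healthSnapshot struct {
    ErrorRate                      float64       // fraction of failed requests, 0-1
    ErrorRateElevatedFor           time.Duration // how long the error rate has stayed above 5%
    P99Latency                     time.Duration
    ConsecutiveHealthCheckFailures int
}

// shouldRollBack applies the example thresholds and returns the first
// trigger that fired, if any.
func shouldRollBack(s healthSnapshot) (bool, string) {
    switch {
    case s.ErrorRate > 0.05 && s.ErrorRateElevatedFor >= 5*time.Minute:
        return true, "error rate above 5% for 5 minutes"
    case s.P99Latency > 4*time.Second:
        return true, "p99 latency above 4 seconds"
    case s.ConsecutiveHealthCheckFailures >= 3:
        return true, "3 consecutive health check failures"
    }
    return false, ""
}

func main() {
    canary := healthSnapshot{
        ErrorRate:            0.08,
        ErrorRateElevatedFor: 6 * time.Minute,
        P99Latency:           900 * time.Millisecond,
    }
    if rollback, reason := shouldRollBack(canary); rollback {
        fmt.Println("rolling back:", reason)
    }
}
```

Requiring the error rate to stay elevated for the full five minutes avoids rolling back on a brief spike, at the cost of a slower reaction to a genuinely bad release.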

Here’s a rollback scenario using a canary deployment (an advanced-level progressive delivery strategy). You're running stable on v1.0.0 with normal error rates, then deploy v2.0.0-canary to 5% of production traffic to test a new payment flow. A few minutes in, the canary starts showing elevated errors and latency spikes as requests time out.

Nobl9 error-budget and burn-rate view showing reliability degradation after a canary rollout

The Nobl9 view shows the rollback signal clearly. Error budget remaining drops, burn rate rises above the target line, and reliability falls below the service objective. Instead of debating whether the canary is bad enough to stop, the team has a shared signal that says the rollout should pause and the previous version should take traffic again.

Automating the rollback

The automation that catches this can start simply as a script that queries Prometheus for the canary's error rate, compares it against a threshold, and triggers rollback when exceeded:

# Check the error rate of the current version once; run this on a schedule
# (for example, every 15 seconds) after a deployment.
: "${CURRENT_VERSION:?CURRENT_VERSION must be set}"
: "${PREVIOUS_VERSION:?PREVIOUS_VERSION must be set}"

ERROR_THRESHOLD="0.05"
PROMETHEUS_URL="http://prometheus.internal.com:9090"

# Share of checkout requests that failed over the last minute for this version.
QUERY="sum(rate(checkout_errors_total{deployment_version=\"$CURRENT_VERSION\"}[1m])) / sum(rate(checkout_requests_total{deployment_version=\"$CURRENT_VERSION\"}[1m]))"

# --data-urlencode keeps the braces, quotes, and brackets in the query intact.
# An empty result (no traffic yet) is treated as a 0 error rate.
error_rate=$(curl -s -G "$PROMETHEUS_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1] // "0"')

if (( $(echo "$error_rate > $ERROR_THRESHOLD" | bc -l) )); then
  echo "!!! Error rate exceeds threshold ($ERROR_THRESHOLD)"
  echo "!!! Triggering automatic rollback..."
  DEPLOYMENT_VERSION=$PREVIOUS_VERSION docker-compose up -d checkout-service
fi

Argo Rollouts does the same thing with built-in analysis templates:

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: checkout-error-rate
  namespace: checkout
spec:
  args:
    - name: service-name
  metrics:
    - name: error-rate
      interval: 30s
      count: 3
      successCondition: result[0] < 0.05
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            (sum(rate(checkout_errors_total{error_type!="none"}[2m])) or vector(0))
            /
            (sum(rate(checkout_request_latency_seconds_count[2m])) + 0.001)

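The template checks the error rate every 30 seconds and takes up to three measurements. failureLimit is the number of failed measurements the analysis tolerates, so with failureLimit set to 2, the analysis fails (and the rollout aborts) only when failures exceed two, which here means all three measurements fail; lower it to abort sooner. The successCondition: result[0] < 0.05 means a measurement fails when the error rate is at or above 5%.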

Implementing deployment gates with error budgets

Once basic rollbacks work reliably, you can build error-budget-based rollout gates on top of them, which is an advanced-maturity practice. Instead of checking whether the error rate exceeds a fixed threshold, you check whether the deployment is burning through your error budget faster than expected. If your checkout service has an SLO of 99.5% and a monthly error budget, the SLO allows 0.5% of requests to fail. A canary running at 95% success is failing 5% of requests, so it burns through the budget at 10x the sustainable rate, even though 95% might look acceptable on its own.
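
The burn-rate arithmetic can be written out as a short calculation: burn rate is the observed failure rate divided by the failure rate the SLO allows. The Go sketch below uses the figures from the paragraph above; the function name and the 30-day window are assumptions for illustration.

```go
package main

import "fmt"

// burnRate returns how many times faster than sustainable the error budget
// is being consumed. It returns false when the SLO leaves no budget at all.
func burnRate(sloTarget, observedSuccess float64) (float64, bool) {
    allowedFailure := 1 - sloTarget
    if allowedFailure <= 0 {
        return 0, false
    }
    observedFailure := 1 - observedSuccess
    return observedFailure / allowedFailure, true
}

func main() {
    const windowDays = 30.0 // assumed length of the monthly budget window

    rate, ok := burnRate(0.995, 0.95)
    if !ok {
        fmt.Println("SLO of 100% leaves no error budget")
        return
    }
    fmt.Printf("burn rate: %.1fx\n", rate)
    // At a constant burn rate, a full budget lasts windowDays / rate.
    fmt.Printf("a full budget would last about %.1f days\n", windowDays/rate)
}
```

At 10x, a budget meant to cover 30 days would be gone in roughly 3 days if all traffic behaved like the canary, which is why burn rate is a more useful gate than the raw success percentage.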

Nobl9 calculates burn rates across your SLOs and can feed those signals into your deployment pipeline. A GitHub Actions workflow might check the error budget before allowing a deployment to proceed, as shown below.


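jobs:
  check-error-budget:
    runs-on: ubuntu-latest
    steps:
      - name: Check for hotfix bypass
        id: bypass
        # BRANCH_NAME and BYPASS_LABELED are placeholder variables; populate
        # them in this step's env block from your workflow context.
        run: |
          if [[ "$BRANCH_NAME" == hotfix/* ]] || [[ "$BYPASS_LABELED" == "true" ]]; then
            echo "bypass=true" >> "$GITHUB_OUTPUT"
          fi

      - name: Check SLO Budget
        if: steps.bypass.outputs.bypass != 'true'
        # NOBL9_TOKEN is a placeholder for an API token supplied as a secret.
        run: |
          # Round down so the integer comparison below also handles decimals.
          BUDGET=$(curl -s -H "Authorization: Bearer $NOBL9_TOKEN" \
            "https://api.nobl9.com/v1/slos/checkout-service/status" | jq -r '.errorBudgetRemaining | floor')

          if [ "$BUDGET" -lt "25" ]; then
            echo "❌ Error budget below 25% - deployment blocked"
            exit 1
          fi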

This blocks deployments when less than 25% of the error budget remains, preventing teams from shipping new changes on top of existing reliability problems. Teams often add bypass logic for hotfix branches or labeled PRs so urgent fixes can still get through. The deployment gate doesn't replace automated rollbacks; it complements them by preventing bad situations from getting worse.

Last thoughts

Each practice in this article builds on the ones before it. Automated testing enables faster deployments. Metrics enable both identifying problems and measuring experiments. Feature flags enable safe rollouts. Rollbacks enable recovery when things go wrong. Skipping steps creates gaps that increase the risk of incidents.

If you are figuring out where to start, begin with baseline controls and automate deployments to make releases repeatable. From there, add automated testing, deployment-aware metrics, feature flags, and automated rollbacks, in that order.

Teams may sequence these steps differently based on pressure and context. What matters is understanding the dependencies so that each step makes the next one safer and easier.

That same dependency mindset applies to observability as well. Once teams define a small set of SLOs and track error budgets against them, rollout and rollback decisions become simpler to make. Nobl9 can consolidate that data and feed it into deployment gates.
