You're sitting in a weekly postmortem review, and your product manager asks: “Customers complained about slow checkout during yesterday's flash sale. Did we find out why?” How quickly you can answer, by pinning down the cause and a fix, depends on the team's observability setup. With Prometheus, you check metrics dashboards, then hop to separate tools like Jaeger for traces and Loki for logs to piece together what happened. With an OpenTelemetry setup, you can track the entire user journey from click to confirmation using a shared context.

Prometheus is a comprehensive monitoring solution that comes as a single binary. Install it, configure your scrape targets, and you immediately have metrics collection, storage, querying, and alerting working together. OpenTelemetry takes a different approach: it is a standardized framework for instrumenting applications to emit metrics, traces, and logs, but you still need to assemble your own backend stack for storage and analysis. While Prometheus integrates everything into one platform, OpenTelemetry focuses solely on data collection and export, leaving you to choose the rest of your observability tools.

This article examines both tools and the observability capabilities they unlock from a practical perspective and then discusses how each affects your ability to implement service-level objectives that reflect your customers' experiences.

 

Core features

  • Prometheus: A complete metrics platform with built-in time-series storage, the PromQL query engine, and a scraping architecture
  • OpenTelemetry: A telemetry collection framework providing standardized APIs, SDKs, and a collector for unified metrics, traces, and logs

Data collection

  • Prometheus: Pull-based metrics collection; the server scrapes endpoints exposed by your applications and exporters on a scheduled interval
  • OpenTelemetry: Flexible telemetry collection that supports pushing, pulling, processing, and exporting to multiple backends, including Prometheus

Implementation approach

  • Prometheus: Implementation decisions mostly involve built-in components: configuring scraping, storage, and querying within Prometheus itself
  • OpenTelemetry: Implementation decisions span multiple systems: instrumentation setup, the collector pipeline, and backend coordination

Data correlation

  • Prometheus: Manual correlation of metrics using labels and timestamps
  • OpenTelemetry: Built-in correlation: trace spans, related metrics, and logs share the same context

Ecosystem

  • Prometheus: An integrated ecosystem with established tooling and extensive community knowledge
  • OpenTelemetry: A distributed ecosystem with growing vendor support that requires backend selection and component assembly

Resource requirements

  • Prometheus: Resource planning starts with a single statically linked binary; federation and remote storage add complexity when scaling
  • OpenTelemetry: Resource planning spans multiple systems from the start: collectors, storage backends, and visualization tools, each with different scaling characteristics

SLO data quality

  • Prometheus: A strong foundation for infrastructure SLOs; business SLOs spanning multiple services are limited by the metrics-only data model and require greater instrumentation effort
  • OpenTelemetry: Business SLOs benefit from automatic trace correlation across services, and infrastructure SLOs are handled equally well


Core features

Prometheus delivers a complete monitoring solution in one statically linked binary. It includes a time-series database, an HTTP scraping engine, the PromQL query language, an alerting system, and a web interface. Install it on a server and start monitoring immediately.

For example, Go code to expose and monitor checkout_requests_total would look like this:


import "github.com/prometheus/client_golang/prometheus"

var (
    checkoutRequests = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "checkout_requests_total",
            Help: "Total checkout requests",
        }, []string{"status"})
)

func handleCheckout(w http.ResponseWriter, r *http.Request) {    
    // Your checkout logic here
    
    checkoutRequests.WithLabelValues("success").Inc()
}

// Expose metrics at /metrics endpoint
http.Handle("/metrics", promhttp.Handler())

The following diagram shows a typical Prometheus integration:

Architecture diagram of a Prometheus monitoring solution

OpenTelemetry (OTel) standardizes how applications emit metrics, traces, and logs through language SDKs and a collector architecture. It handles data collection and export, but you choose your storage and analysis tools. For a checkout service, for example, you would instrument the code with OTel SDKs, configure the collector to receive the data, and then export to backends such as Prometheus for metrics, Jaeger for traces, and Loki for logs. You can also choose to send everything to a unified platform like ClickHouse.

The following diagram shows a typical OpenTelemetry setup:

Architectural diagram of an OpenTelemetry monitoring solution

Data collection

Prometheus uses a pull-based model where the server scrapes HTTP endpoints exposed by your applications and exporters on scheduled intervals. Your checkout service exposes a /metrics endpoint with request counts, latency, and error rates in the Prometheus text exposition format. The format is human-readable: each metric has a name and optional labels (key-value pairs), like checkout_requests_total{method="POST",status="200"} 42. Every 15 seconds (or whatever interval you configure), Prometheus pulls this data and stores it in its time-series database.

Prometheus dashboard showing a simple metric

Prometheus controls the collection schedule and can detect when services go down (failed scrapes become data points themselves). You configure scrape targets or use service discovery, and Prometheus handles collection.

OpenTelemetry applications push data to the OTel collector or provide endpoints for pulling. OTel collectors handle metrics, traces, and logs in a unified pipeline that maintains correlation across all telemetry signals using shared context. Your checkout service instrumented with OTel SDKs might push trace spans showing the complete user journey from cart to payment confirmation, with metrics and logs sharing the same context. The OTel collector receives these signals, processes them, and exports them to your configured backends.
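
To make the push model concrete, here is a minimal sketch of wiring the Go SDK to push spans to a collector over OTLP/gRPC. The collector endpoint (localhost:4317), the insecure connection, and the hard-coded service name are assumptions for illustration; metric and log exporters are configured along the same lines.

import (
    "context"
    "log"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracing wires the SDK to push spans to an OTel collector over OTLP/gRPC.
func initTracing(ctx context.Context) (func(context.Context) error, error) {
    // Exporter that pushes spans to a collector assumed to listen on localhost:4317.
    exporter, err := otlptracegrpc.New(ctx,
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }

    // The tracer provider batches spans and labels them with the service name.
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewSchemaless(
            attribute.String("service.name", "checkout-service"),
        )),
    )
    otel.SetTracerProvider(tp)

    // The returned function flushes remaining spans on shutdown.
    return tp.Shutdown, nil
}

func main() {
    ctx := context.Background()
    shutdown, err := initTracing(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer shutdown(ctx)
    // ... register handlers and start the HTTP server here ...
}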

The practical difference shows up when debugging a slow checkout issue. With Prometheus by itself, you manually correlate metrics with separate trace and log systems. With OTel, you see a slow database query in traces, error spikes in metrics, and relevant log entries connected by a shared context.

Implementation approach

A Prometheus implementation centers on built-in components. Developers add client libraries and expose metrics endpoints, and SREs configure scraping, storage, and querying within the Prometheus server. For a checkout service, you add counters and histograms for request counts and latencies, configure Prometheus to scrape the endpoint, and write PromQL queries for dashboards. All decisions happen within one unified system. Scaling beyond single instances requires federated hierarchies and remote storage, which add operational complexity.
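
For instance, a latency histogram for the checkout endpoint could look like the sketch below. The metric name and bucket boundaries are illustrative choices rather than values prescribed by Prometheus, and the snippet assumes the promauto helper for registration.

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// Histogram of checkout latencies in seconds; the buckets are illustrative.
var checkoutDuration = promauto.NewHistogramVec(
    prometheus.HistogramOpts{
        Name:    "checkout_request_duration_seconds",
        Help:    "Checkout request latency in seconds",
        Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2.5, 5},
    }, []string{"status"})

func instrumentedCheckout(w http.ResponseWriter, r *http.Request) {
    start := time.Now()

    // Your checkout logic here

    // Record the observed latency under the appropriate status label.
    checkoutDuration.WithLabelValues("success").Observe(time.Since(start).Seconds())
}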

The dashboard below shows a PromQL query tracking successful request rates over a ten-minute window. Everything happens within a single, self-contained monitoring system.

Prometheus dashboard showing a PromQL query and the resulting graph

OpenTelemetry implementation involves multiple systems from the start. Developers configure SDKs across services for instrumentation, while SREs coordinate collectors, data pipelines, storage, and visualization. For the same checkout service, you instrument with OTel SDKs, deploy collectors, route metrics to Prometheus or a cloud provider, send traces to Jaeger or a managed service, handle logs separately, and ensure that everything connects for end-to-end visibility.

Here is a Go code snippet showing how a shared trace context ties together traces, metrics, and logs for a single checkout request.


import (
    "net/http"
    "os"
    "time"

    kitlog "github.com/go-kit/log"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

var tracer = otel.Tracer("checkout-service")
var meter = otel.Meter("checkout-service")
var requestDuration, _ = meter.Float64Histogram("checkout_request_duration")

// Structured logger (go-kit log, as one example) for trace-correlated log lines.
var logger = kitlog.NewLogfmtLogger(os.Stdout)

func handleCheckout(w http.ResponseWriter, r *http.Request) {
    start := time.Now()

    // Start a span; the returned context carries the trace ID downstream.
    ctx, span := tracer.Start(r.Context(), "checkout_request")
    defer span.End()

    traceID := trace.SpanContextFromContext(ctx).TraceID().String()

    // Your checkout logic here
    itemCount := 3 // example value; in practice this comes from the cart

    // Tag the span with business context for later debugging.
    span.SetAttributes(
        attribute.String("payment.method", "credit_card"),
        attribute.Int("cart.items", itemCount),
    )

    // Record the request duration; the active context ties it to the trace.
    requestDuration.Record(ctx, time.Since(start).Seconds())

    // Include the trace ID in the log line so it can be found again later.
    logger.Log("msg", "checkout completed", "traceID", traceID,
        "duration", time.Since(start))
}

Here is what is happening:

  • tracer.Start() creates a new trace (a record that tracks a user's request through your system) for this checkout request and generates a unique trace ID that will follow this customer's journey.
  • The ctx (context) carries that trace ID through every part of the checkout process.
  • traceID := trace.SpanContextFromContext(ctx)... extracts the trace ID so it can be used directly, as in the log entry below.
  • span.SetAttributes() tags the span with business details like payment method and cart size. This contextual information appears alongside timing data when you debug issues later.
  • requestDuration.Record(ctx, ...) records a timing metric that measures how long the checkout took, and the OTel SDK automatically tags this metric with the trace ID from the context.
  • logger.Log("traceID", traceID, ...) writes a log entry that includes the same trace ID, so when you search logs later, you can find all entries related to this specific customer's checkout attempt.
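
The same context can also propagate across service boundaries so that downstream calls join the trace. The sketch below is a minimal illustration, assuming a W3C Trace Context propagator has been registered globally (for example, via otel.SetTextMapPropagator) and using a hypothetical internal payments URL.

import (
    "context"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

// callPaymentService forwards the active trace context to a downstream service
// so its spans join the same trace. The URL is a hypothetical example.
func callPaymentService(ctx context.Context) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodPost,
        "http://payments.internal/charge", nil)
    if err != nil {
        return nil, err
    }

    // Inject traceparent/tracestate headers derived from the active context.
    otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))

    return http.DefaultClient.Do(req)
}

On the receiving side, the payment service extracts the same headers to continue the trace, which is what makes the end-to-end checkout journey visible as one record.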


Data correlation

With Prometheus, correlation happens through labels and timestamps matched across separate systems. When investigating that slow checkout issue, you start with metrics showing service="checkout" and endpoint="/payment", then manually search Jaeger traces for the same time window and labels, and finally search the logs for matching timestamps and the same labels. The correlation happens in your head, matching service labels and time windows across multiple tools.

Grafana dashboard visualizing Prometheus metrics for a FastAPI application (source)

The OpenTelemetry framework embeds correlation directly in the telemetry metadata. Each request gets a context that flows through every service call, automatically tagging related metrics and logs. When a customer's checkout fails, the trace ID appears in the span (a single unit of work, such as an API call or database query, within the trace) that shows the slow database query, the related metric increments carry the same trace ID, and the error logs include it as well.

Grafana dashboard showing telemetry signals with shared context (source)

Ecosystem

Prometheus benefits from years of community-built tooling around a single cohesive monitoring system with well-established patterns. Grafana dashboards, AlertManager configurations, and exporters like node_exporter for infrastructure metrics have extensive documentation. When you need to monitor MySQL, there's a standard mysqld_exporter with proven dashboard templates and alerting rules that thousands of teams have used. Similar exporters exist for most applications and infrastructure.

OpenTelemetry, like Prometheus, has strong community documentation, plus backing from major cloud providers (Google, Microsoft, AWS) and observability companies. Unlike Prometheus's integrated approach, these vendors offer separate backends. The collector's receiver and exporter model lets it ingest data from legacy systems and export to multiple backends over OTLP and other protocols. However, assembling your stack requires researching compatibility and integration requirements for your specific combination.

The table below shows an example framework for such research:

Trace backend

  • Key questions: Can it handle your trace volume? Does it support your sampling techniques?
  • Complexity factors: Version compatibility and sampling configuration

Metrics storage

  • Key questions: Does the setup support self-hosted solutions as well as platforms like Datadog?
  • Complexity factors: Multi-backend routing and format differences

Log aggregator

  • Key questions: Will log aggregation work across all services, and does it integrate with the trace backend?
  • Complexity factors: Context propagation and query capabilities

Visualization

  • Key questions: Can the visualization layer connect all the backends, or do we need separate tools for each backend?
  • Complexity factors: Data source configuration and data correlation capabilities

Resource requirements

A Prometheus deployment starts as a straightforward single binary that you can scale vertically by adding more memory or CPU. However, scaling beyond single-instance limits requires federation hierarchies, where global instances pull data from smaller Prometheus servers covering different parts of your infrastructure, or remote writes to long-term storage systems like Cortex or Thanos. What began as one application becomes a significant operation: coordinating multiple Prometheus instances, managing federation configurations, and ensuring that remote storage systems can handle your write volume.

The diagram below shows multi-region Prometheus scaling with federation, remote storage, and global coordination.

Multi-region Prometheus deployment with federation and remote storage

OpenTelemetry resource planning involves multiple systems from day one. Collectors need CPU and memory to process telemetry pipelines. You plan separate capacity for storage backends: Prometheus for metrics, Jaeger or Tempo for traces, Elasticsearch or Loki for logs, each with different scaling characteristics and resource requirements. When traffic spikes hit, you scale collector instances while ensuring that backends can handle the load.

OpenTelemetry deployment with load-balanced collectors and scaled backends for each signal type


SLO data quality

Prometheus makes infrastructure SLOs straightforward. Response times, error rates, and availability metrics become the foundation for SLIs like “99.9% of API requests complete under 200ms” or “99.5% database uptime over 30 days.” The data you collect transforms into SLO measurements without additional instrumentation.

However, business SLOs across multiple services hit the metrics-only limitation. When you need to measure “95% of users who add items to cart will complete checkout within 20 seconds,” Prometheus can tell you each service's performance individually, but cannot easily track complete user journeys across services. You might see payment service latency and order service errors during the same timeframe, but proving they're part of the same customer journey requires detective work. Composite SLOs combine individual service metrics into business-focused measurements and help you reduce this detective work.

OpenTelemetry's distributed tracing provides timing data for every step of a user's journey with automatic correlation. When a customer's checkout takes 45 seconds, you see the exact sequence: 

  • 2 seconds for cart validation
  • 38 seconds waiting for the payment gateway
  • 5 seconds for inventory updates

These are all connected by the same telemetry context. This granular visibility enables precise business SLOs that measure user experience rather than individual service performance.

You can implement SLOs like the one discussed above more easily than with Prometheus because traces track complete user journeys across all services. The same trace context that helps with debugging becomes the foundation for measuring complex business processes end-to-end.

Composite SLOs combine multiple individual SLOs into a single business-focused measurement. Instead of tracking cart service availability (99.9%), payment latency (200ms), and inventory uptime (99.5%) as separate reliability metrics, you get one checkout flow SLO that shows overall business health. When payment service degrades, the composite immediately reflects the real impact on customers rather than requiring you to interpret multiple separate metrics. This works with both Prometheus individual metrics and OpenTelemetry correlated data.
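
As a toy illustration only, and not how any SLO platform (including Nobl9) actually computes composites, the sketch below treats a checkout-flow composite as met when every component SLI meets its objective; real implementations weight components and track error budgets over time. All names and numbers are made up for the example.

import "fmt"

// componentSLI pairs a measured SLI value with its target objective.
type componentSLI struct {
    name      string
    measured  float64 // e.g., 0.9992 means 99.92% good events
    objective float64 // e.g., 0.999 means a 99.9% target
}

// compositeMet reports whether every component meets its objective: a deliberately
// naive way to roll several reliability signals into one checkout-flow answer.
func compositeMet(components []componentSLI) bool {
    for _, c := range components {
        if c.measured < c.objective {
            return false
        }
    }
    return true
}

func main() {
    checkoutFlow := []componentSLI{
        {"cart availability", 0.9993, 0.999},
        {"payment latency under 200ms", 0.996, 0.995},
        {"inventory uptime", 0.9991, 0.995},
    }
    fmt.Println("checkout flow SLO met:", compositeMet(checkoutFlow))
}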

Recommendations

Monolithic applications

If your service is monolithic, start with Prometheus. It lets you instrument basic application metrics and infrastructure components, such as your database, fairly quickly. Deploy the single binary, add a metrics exporter, and create infrastructure SLOs within days.

Why Prometheus works:

  • The manual correlation limitation won't hurt you because failures are easier to trace in simpler architectures.
  • Service-level SLOs map directly to simple metrics like “99.5% of checkout requests complete successfully” or “95% of checkout requests finish in less than 500 ms.”
  • No distributed tracing complexity is required.

Implementation approach:

  • If budget permits, use managed Prometheus services (AWS AMP, GCP Managed Prometheus) to avoid operational overhead.
  • Balance observability engineering time between product features and monitoring infrastructure.
  • When you're ready for SLO management, use platforms like Nobl9 that can consume your Prometheus metrics directly using PromQL queries.

Microservices architectures

Consider OpenTelemetry when working with microservices. If debugging a slow checkout requires jumping between Prometheus metrics, Jaeger traces, and Loki logs to piece together what happened manually, you're ready for automatic correlation.

Strategic hybrid approach:

  • Use Prometheus for infrastructure monitoring (CPU, memory, request rates) because it is simple, and your team might already be comfortable with it.
  • Add OpenTelemetry instrumentation using distributed tracing to critical user journeys where you need business SLOs like “80% of users exposed to the new one-click checkout feature will show 25% faster completion times and 15% higher repeat purchase rates within 14 days.”

Plan for complexity early:

  • Manage collectors and configure multiple backends (Prometheus for metrics, Tempo for traces, Loki for logs, etc.).
  • Ensure that your team understands distributed tracing concepts.
  • Budget for both infrastructure costs and the learning curve.

Key takeaways 

Match your tool choice to actual constraints, not ideal observability vision. Consider team bandwidth, system architecture, and SLO requirements. Monolithic applications benefit from the simplicity of Prometheus, while microservices benefit from OpenTelemetry's cross-service visibility. 

Prometheus gets you “checkout service worked, and it took X amount of time.” With OpenTelemetry, you get “this specific feature change improved customer retention by X%, and here's the exact user journey data to prove it.” Evaluate SLO platforms that handle both data sources; you don't want to rebuild SLO calculation logic as your observability stack evolves.

How Nobl9 helps with SLO management

Nobl9 bridges Prometheus and OpenTelemetry by supporting both data sources for SLO creation. With Prometheus integration, you create infrastructure SLOs using PromQL queries directly; “99.9% of API requests succeed” can be mapped to your existing HTTP server metrics. With OpenTelemetry integration, you can create sophisticated web-based SLOs using browser traces for real user monitoring, tracking complete user journeys across multiple services. 

The Nobl9 platform integrates seamlessly with your current stack, providing a holistic view of system performance without vendor lock-in, whether you're running pure Prometheus, pure OpenTelemetry, or the hybrid approach most teams eventually adopt. Instead of rebuilding SLO logic as your observability evolves from simple metrics to correlated telemetry, Nobl9 handles the complexity of turning both Prometheus metrics and OpenTelemetry traces into actionable reliability targets that reflect customer experience.


Conclusion

Prometheus works well because it is simple, easy to deploy, and lets you start monitoring without much up-front planning. OpenTelemetry came along to solve the correlation problem when microservices made separate tools painful, but it trades simplicity for comprehensive telemetry collection that requires you to assemble everything else yourself.

Most teams today use them together rather than choosing between them: OpenTelemetry for standardized instrumentation and Prometheus for proven metrics storage and alerting capabilities. The decision isn't which tool to pick, but how much assembly complexity you want to manage versus using managed observability platforms.

In practical terms, whether you use Prometheus alone, OpenTelemetry with Prometheus backends, or fully managed solutions, SLO platforms like Nobl9 can help you focus on the customer outcomes that matter to your business.
