7 ready-to-use templates built on the same best practices you just read — including the HTTP 500 runbook from the guide. Stop reading about runbooks. Start having them.
Get All 7 Templates Free
Check your inbox.
The Runbook Template Kit is on its way.
Didn't get it? Check your spam folder.
Error budget burning 10x. An engineer is paged. In a moment of confusion and chaos, what happens next depends entirely on whether a runbook exists.
"Write for someone who's never seen this system, at 2 AM, during an outage."
-- Nobl9 Runbook Template Kit
Without a runbook: improvised response, repeated searches, inconsistent escalation
With a runbook: pre-defined steps, clear escalation path, faster resolution
| | Single-metric SLOs | Single-tool SLOs | Cross-tool & multi-metric SLOs | Full-stack data normalization | Composite SLO logic | Prescriptive SLOs based on historical data |
|---|---|---|---|---|---|---|
| Your Existing Monitoring Tools | ✔ | ✔ | | | | |
| Your Existing Monitoring Tools + Nobl9 | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
You just read about 8 best practices for building runbooks. Here is how the kit's assets put them into practice.
The master template's ten pre-built sections cover the full incident lifecycle: triggers tied to specific alert names and thresholds, numbered diagnostic steps with decision points and expected command outputs, a 4-step rollback procedure (Stop → Restore → Verify Stability → Document), escalation contacts organized by tier, and copy-paste communication templates for both initial notification and resolution. Every section includes a prompt so no engineer ever faces a blank page at 2 AM.
The HTTP 500 runbook triggers on HTTP_500_Error_Rate_Critical when the error rate exceeds 0.5% of requests for 3+ minutes. Step 1 verifies scope with a curl test. Step 2 inspects logs, with exact grep commands for both Apache (/var/log/apache2/error.log) and Nginx. Step 3 checks infrastructure health. Resolution branches into four paths: database connections, memory exhaustion, application code exceptions, and disk space, each with specific commands and time estimates. If a recent deployment is the cause, the runbook calls for kubectl rollout undo after 15 minutes without resolution.
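The trigger condition described above can be sketched in a few lines of Python. This is a minimal illustration, not the kit's actual alert rule: the function name, the one-sample-per-minute input, and the reading of "3+ minutes" as three consecutive samples are all assumptions.

```python
from typing import Sequence

ERROR_RATE_THRESHOLD = 0.005  # 0.5% of requests, from the runbook trigger
SUSTAINED_MINUTES = 3         # "3+ minutes", read here as 3 consecutive samples (assumption)


def should_fire(error_rates_per_minute: Sequence[float]) -> bool:
    """Return True if the most recent samples all exceed the threshold.

    Assumes one error-rate sample per minute, oldest first (hypothetical input shape).
    """
    if len(error_rates_per_minute) < SUSTAINED_MINUTES:
        return False  # not enough history to call the condition sustained
    recent = error_rates_per_minute[-SUSTAINED_MINUTES:]
    return all(rate > ERROR_RATE_THRESHOLD for rate in recent)
```

A single spike above 0.5% does not fire under this reading. Only a run of consecutive samples above the threshold does.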
The service outage runbook assigns severity by impact (SEV1 = complete outage, immediate; SEV2 = >50% affected, 15 minutes) and runs a 5-minute immediate response checklist: acknowledge in PagerDuty, run a health check, open an incident Slack channel, page the Incident Manager for SEV1. Diagnosis routes through infrastructure, network, and application layers with a decision tree to crash loop, capacity, network, or deployment rollback procedures. The database failover runbook covers AWS RDS, PostgreSQL manual promotion, and Patroni clusters — and checks replica lag before every failover to flag data loss risk if lag exceeds 5 minutes.
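The replica-lag check described above could look roughly like the following. It is a sketch under stated assumptions: the function name and return shape are hypothetical, and how the lag is measured (RDS metrics, a PostgreSQL query, the Patroni API) is deliberately left to the caller.

```python
MAX_SAFE_LAG_SECONDS = 5 * 60  # 5-minute threshold from the failover runbook


def failover_precheck(replica_lag_seconds: float) -> dict:
    """Summarize the data-loss risk of promoting a replica right now.

    The caller supplies the measured lag; this function only applies the threshold.
    """
    at_risk = replica_lag_seconds > MAX_SAFE_LAG_SECONDS
    return {
        "replica_lag_seconds": replica_lag_seconds,
        "data_loss_risk": at_risk,
        "note": (
            "Lag exceeds 5 minutes: flag data loss risk before failing over."
            if at_risk
            else "Lag is within the 5-minute threshold."
        ),
    }
```

Running a check like this before every failover makes the data-loss risk an explicit, recorded decision.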
The escalation template defines four severity levels with concrete criteria (SEV1 = complete outage; SEV4 = UI glitch, no user impact) and starts the escalation clock at declaration: L2 at 15 minutes for SEV1/2, Incident Manager at 30 minutes, Engineering Leadership at 45, Executive Team at 60. A separate condition table handles non-time triggers: code changes, database access, customer data, vendor escalations. It also includes a 3-tier contact directory with PagerDuty routes and response SLAs, plus a structured escalation message template requiring incident summary, severity, impact, duration, actions taken, and specific ask.
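The escalation clock above is a lookup from elapsed time to the roles that should already be engaged. The sketch below assumes the whole schedule applies to SEV1 and SEV2 only, which the text does not state outright. The function and constant names are hypothetical.

```python
# Minutes since incident declaration -> who should be engaged by then.
ESCALATION_SCHEDULE = [
    (15, "L2"),
    (30, "Incident Manager"),
    (45, "Engineering Leadership"),
    (60, "Executive Team"),
]

TIMED_SEVERITIES = {"SEV1", "SEV2"}  # assumption: the clock only runs for SEV1/2


def escalations_due(severity: str, minutes_since_declaration: float) -> list[str]:
    """List every role whose escalation deadline has already passed."""
    if severity not in TIMED_SEVERITIES:
        return []
    return [
        role
        for deadline, role in ESCALATION_SCHEDULE
        if minutes_since_declaration >= deadline
    ]
```

For example, `escalations_due("SEV1", 31)` returns `["L2", "Incident Manager"]`. Non-time triggers such as customer data exposure would need separate handling.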
The 55-point checklist is organized into 12 sections: document metadata, overview, triggers, diagnostic steps, resolution actions, rollback, escalation, communication, references, formatting, testing, and version control. It checks whether alert names and thresholds are quantified, whether commands are copy-pasteable, whether escalation contacts have been verified in the last 30 days, and whether the runbook has been tested by someone unfamiliar with the system. Scoring: 50–55 = ready for production, 40–49 = minor improvements, 30–39 = significant gaps, below 30 = major revision required.
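The scoring bands above map directly to a small function. This is an illustrative sketch. The function name is hypothetical, and treating the score as an integer from 0 to 55 is an assumption based on the 55-point checklist.

```python
MAX_SCORE = 55  # the checklist has 55 points


def readiness_band(score: int) -> str:
    """Translate a checklist score into the readiness band quoted in the text."""
    if not 0 <= score <= MAX_SCORE:
        raise ValueError(f"score must be between 0 and {MAX_SCORE}, got {score}")
    if score >= 50:
        return "ready for production"
    if score >= 40:
        return "minor improvements"
    if score >= 30:
        return "significant gaps"
    return "major revision required"
```

A runbook scoring 42, for instance, falls in the "minor improvements" band.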
The README covers four integration points: trigger sections (document which SLO alerts invoke the runbook), decision points (use error budget remaining to guide next steps), escalation criteria (replace arbitrary time limits with burn rate: 10x demands immediate escalation, 2x allows investigation time), and resolution verification (confirm fixes by checking SLO recovery). It includes the full burn rate action table: burn rate >5x = begin investigation, error budget <50% = notify Incident Manager, error budget <10% = page Engineering Lead.
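The page does not spell out how burn rate is calculated. A common definition is the observed error rate divided by the error rate the SLO allows, and the sketch below assumes that definition. The function name and parameters are hypothetical.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """Compute burn rate as observed error rate over allowed error rate.

    Both arguments are fractions, e.g. slo_target=0.999 for a 99.9% SLO.
    A result of 1.0 means the budget is consumed exactly at the sustainable pace.
    """
    error_budget = 1.0 - slo_target  # fraction of requests allowed to fail
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return observed_error_rate / error_budget
```

Under this definition, a service with a 99.9% SLO that is failing 1% of requests is burning its budget at roughly 10x.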
When your SLO fires, every second counts. Without runbooks anchored to real SLO data, your team falls back on gut instinct — and inconsistent response makes outages worse.
| Burn Rate | Budget Left | Required Action |
|---|---|---|
| >10x | Any | All-hands page — immediate response |
| >5x | Any | Begin immediate investigation |
| >5x | <50% | Notify Incident Manager |
| >2x | <10% | Page Engineering Lead |
| >2x | Any | Schedule investigation |
| <1x | Any | Monitor only |
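The table can be encoded as data and evaluated mechanically. Several rows can match at once and the table does not say which takes precedence, so this sketch returns every matching action in table order rather than guessing. The names and the representation of budget left as a fraction between 0 and 1 are assumptions.

```python
# Each rule: (burn-rate test, budget-left test, action), in the table's row order.
RULES = [
    (lambda burn: burn > 10, lambda left: True, "All-hands page: immediate response"),
    (lambda burn: burn > 5, lambda left: True, "Begin immediate investigation"),
    (lambda burn: burn > 5, lambda left: left < 0.50, "Notify Incident Manager"),
    (lambda burn: burn > 2, lambda left: left < 0.10, "Page Engineering Lead"),
    (lambda burn: burn > 2, lambda left: True, "Schedule investigation"),
    (lambda burn: burn < 1, lambda left: True, "Monitor only"),
]


def required_actions(burn: float, budget_left: float) -> list[str]:
    """Return every action whose row matches the given burn rate and budget left."""
    return [
        action
        for burn_matches, budget_matches, action in RULES
        if burn_matches(burn) and budget_matches(budget_left)
    ]
```

Burn rates from 1x to 2x match no row, because the table leaves that range unspecified. The function returns an empty list there instead of inventing a rule.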
7 ready-to-use templates and checklists built around SLO burn rates and error budget thresholds.
The master fillable template — your team's incident response source of truth.
Step-by-step response for your most common application-layer error — the same example covered in the guide.
Full outage protocol from first detection to stakeholder communication.
Structured failover process engineered to minimize data loss and MTTR.
55-point checklist to validate every runbook before it gets tested in prod.
Map the right person to the right burn rate threshold — no guesswork.
README covering how to embed SLO data and burn rates into every runbook.
7 SLO-aware templates and checklists, ready to customize for your team. Used by reliability engineers who don't have time to improvise at 2 AM.
No Nobl9 account required · Download and use immediately