SRE Incident Management¶
Incident management covers on-call structure, escalation, mitigation, and blameless postmortems. The goal is to minimize MTTR (mean time to recovery) while maintaining sustainable on-call practices.
On-Call Structure¶
- Primary + secondary rotation
- Duration: 1-2 weeks typical
- Maximum: 2 incidents per 12-hour shift (sustainable pace)
- Follow-the-sun for global teams
- Tools: PagerDuty, Opsgenie, VictorOps (now Splunk On-Call)
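A primary + secondary rotation like the one above can be computed deterministically from a roster and a shift length, so everyone can answer "who is on call on date X" without consulting the paging tool. This is a minimal sketch; the roster names, the epoch date, and the one-week shift length are illustrative assumptions:

```python
from datetime import date, timedelta

# Hypothetical roster and parameters - adjust to your team.
ROSTER = ["alice", "bob", "carol", "dave"]
SHIFT_DAYS = 7                 # one-week shifts (typical range is 1-2 weeks)
EPOCH = date(2026, 1, 5)       # a Monday chosen as the rotation start

def on_call(today: date) -> tuple[str, str]:
    """Return (primary, secondary) for a given date.

    The secondary is the next person in the roster, so the engineer who
    just finished a primary shift is not immediately paged again.
    """
    shift = (today - EPOCH).days // SHIFT_DAYS
    primary = ROSTER[shift % len(ROSTER)]
    secondary = ROSTER[(shift + 1) % len(ROSTER)]
    return primary, secondary
```

Making the secondary "primary's successor" also means each engineer warms up as secondary during the week before their primary shift.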
On-Call Responsibilities¶
- Acknowledge alert within SLA (e.g., 5 min for P1)
- Triage - assess severity, determine escalation need
- Mitigate - reduce user impact (not root cause yet)
- Communicate - update status page, notify stakeholders
- Resolve - full remediation
- Handoff - document for next shift, file postmortem
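The acknowledge-within-SLA step above is easy to audit mechanically. A small sketch, assuming an SLA table where only the 5-minute P1 value comes from the text (the P2 value mirrors the escalation table below; the P3 value is an assumption):

```python
from datetime import datetime, timedelta

# Acknowledgement SLAs per severity. P1 = 5 min is from the text;
# P2 = 30 min matches the escalation table; P3 is an assumed value.
ACK_SLA = {
    "P1": timedelta(minutes=5),
    "P2": timedelta(minutes=30),
    "P3": timedelta(hours=8),
}

def ack_within_sla(severity: str, fired: datetime, acked: datetime) -> bool:
    """True if the on-call acknowledged the alert inside its SLA window."""
    return acked - fired <= ACK_SLA[severity]
```

Running this over paging-tool exports gives an objective view of ack performance for on-call retrospectives.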
Essential On-Call Knowledge¶
- Reading traces - how to debug distributed systems
- Fallback switching - existing fallback mechanisms, manual procedures
- Deploy rollback - rolling back bad deploys and configurations
- Rate limiting - handling malicious DoS and accidental self-DoS
- Impact limitation - containing a broken component's blast radius so it does not cascade to the rest of the system
Key principle: most incidents trace back to a recent change, typically a bad deploy or a misconfiguration. Deploy and configuration rollback procedures must therefore be documented and practiced.
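The rate-limiting knowledge listed above usually comes down to a token bucket: tokens refill at a fixed rate up to a cap, so short bursts are absorbed while sustained throughput is bounded. A minimal sketch, with illustrative rate and capacity values:

```python
import time

class TokenBucket:
    """Minimal token-bucket limiter: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; otherwise reject the request."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The same mechanism protects against both malicious DoS and accidental self-DoS (e.g. a retry storm from your own clients); only the limits differ.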
Escalation¶
| Severity | Response Time | Action |
|---|---|---|
| P1 (service down) | Page immediately | War room within 15 min |
| P2 (degraded) | Page, 30 min | Focused investigation |
| P3 (minor impact) | Ticket | Next business day |
When in doubt, escalate.
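Alert-routing automation can encode the escalation table directly, including the "when in doubt, escalate" rule for anything unrecognized. A sketch; the dictionary values restate the table above and the field names are illustrative:

```python
# Severity routing, restating the escalation table above.
ESCALATION = {
    "P1": {"notify": "page", "action": "war room within 15 min"},
    "P2": {"notify": "page", "action": "focused investigation within 30 min"},
    "P3": {"notify": "ticket", "action": "next business day"},
}

def escalate(severity: str) -> dict:
    """Return routing for a severity.

    Unknown or ambiguous severities fall through to the P1 path,
    implementing "when in doubt, escalate".
    """
    return ESCALATION.get(severity, ESCALATION["P1"])
```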
Key Incident Roles¶
IMOC (Incident Manager On-Call)¶
- Leads and coordinates SEV resolution
- Focused on TTR (time to recovery); target for the highest severity (SEV-0): recovery within 15 min
- Single point of accountability
- Keeps team calm, manages communication to leadership
- Engineers focus on technical work, report status to IMOC
TLOC (Technical Lead On-Call)¶
- Technical expert for diagnosis and mitigation
- Reports status to IMOC
- After SEV: works on root cause, leads chaos experiments
Postmortems¶
Principles¶
- Blame-free - "what about the system made the error possible?"
- Written document - permanent, shared record
- Action items with owners - assigned, tracked, prioritized
- Shared broadly - organization-wide learning
Template¶
## Incident: [Title]
**Date**: YYYY-MM-DD | **Severity**: P1 | **Duration**: 2h 15m
## Summary
2-3 sentences of what happened.
## Impact
- Users affected: X
- Revenue impact: $Y
- SLO burn: Z%
## Timeline
- HH:MM - Alert fired
- HH:MM - On-call acknowledged
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Full resolution
## Root Cause Analysis (5 Whys)
1. Why did the service fail? -> OOM kill
2. Why was memory exhausted? -> Memory leak in handler
...
## What Went Well
- Detection was fast (< 2 min)
- Runbook was accurate
## What Went Wrong
- Monitoring gap: no alert on memory trend
- Rollback took 20 min (should be < 5)
## Action Items
| Action | Owner | Priority | Deadline | Status |
|--------|-------|----------|----------|--------|
| Add memory trend alert | @eng | P1 | 2026-04-10 | TODO |
| Automate rollback | @sre | P2 | 2026-04-17 | TODO |
## Lessons Learned
What this incident taught us beyond the specific fixes above.
Trigger Criteria¶
Write a postmortem when any of the following holds:
- User-visible impact
- Error budget consumed
- Data loss
- Manual intervention required
- Monitoring gap discovered
- On-call escalation
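These trigger criteria can be checked mechanically when closing an incident, so the decision is not left to mood or workload. A sketch; the flag names on the incident record are illustrative assumptions:

```python
# Postmortem trigger criteria as boolean flags on an incident record.
# Flag names are hypothetical - map them to your incident tracker's fields.
POSTMORTEM_TRIGGERS = (
    "user_visible_impact",
    "error_budget_consumed",
    "data_loss",
    "manual_intervention",
    "monitoring_gap",
    "oncall_escalated",
)

def needs_postmortem(incident: dict) -> bool:
    """A postmortem is required if any single trigger criterion is met."""
    return any(incident.get(flag, False) for flag in POSTMORTEM_TRIGGERS)
```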
Diagnostics Methodology¶
- Observe - what symptoms? When started? What changed?
- Hypothesize - possible causes based on symptoms
- Test - verify or eliminate with data
- Fix - apply remediation
- Prevent - add monitoring, tests, automation
Common Debugging Dimensions¶
- Time correlation: deploy? Config change? Traffic pattern?
- Scope: one host? One AZ? One service? All users or some?
- Dependency graph: upstream/downstream services?
- Resource exhaustion: CPU, memory, disk, file descriptors, connections?
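The scope dimension above ("one host? One AZ? One service?") can be answered quickly by grouping error events along each dimension and checking whether one value dominates. A minimal sketch, assuming error events are dicts with fields like `host`, `az`, `service`; the 90% threshold is an illustrative choice:

```python
from collections import Counter

def scope_of_failure(errors: list[dict], dimension: str, share: float = 0.9) -> str:
    """Classify error scope along one dimension (e.g. "host", "az", "service").

    Returns "localized:<value>" if a single value accounts for at least
    `share` of the errors, else "widespread". Threshold and field names
    are illustrative.
    """
    counts = Counter(e[dimension] for e in errors)
    value, n = counts.most_common(1)[0]
    if n / len(errors) >= share:
        return f"localized:{value}"
    return "widespread"
```

A "localized" answer (one host, one AZ) points toward draining or isolating the culprit; "widespread" points toward a shared cause such as a deploy, config change, or common dependency.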
Tools Hierarchy¶
Dashboards (Grafana) -> Logs (Loki/ELK) -> Traces (Jaeger/Tempo) -> Profiling (pprof) -> Core dumps. Start broad, narrow down.
Knowledge Sharing¶
- Document diagnostic procedures
- Shadow on-call for training
- Wheel of Misfortune (simulated incidents)
- Game days (planned reliability exercises)
Gotchas¶
- Postmortem without follow-through on action items is worthless
- Timeline reconstruction is critical - exactly when each event occurred
- Runbooks must be kept updated after each incident
- "One person knows how to switch" is a critical risk - document and cross-train
- Incident coordinator communicates status, does NOT debug
See Also¶
- sre principles - SRE culture and error budgets
- monitoring and observability - alerting and SLI/SLO
- chaos engineering and testing - proactive reliability testing
- deployment strategies - rollback strategies