Belajar SRE #12: On-Call Best Practices
Pelajari on-call best practices: rotation design, escalation policy, runbook creation, dan alert quality management untuk sustainable operations.
On-call adalah komponen kritis dalam SRE yang memastikan sistem production mendapat respons cepat saat terjadi incident. Effective on-call practices bukan hanya tentang “siapa yang menjawab alert”, tetapi juga tentang membangun sustainable system yang menjaga keseimbangan antara reliability dan kesejahteraan tim. Artikel ini membahas rotation design, escalation policy, runbook creation, dan alert quality management.
Jika Anda belum membaca artikel sebelumnya, mulai dari Advanced SRE: Capacity Planning.
Prerequisites
- Pemahaman SLI/SLO/SLA — baca: Advanced SRE: SLI, SLO, dan SLA
- Error Budget Policy — baca: Advanced SRE: Error Budget
- Chaos Engineering — baca: Advanced SRE: Chaos Engineering
- Monitoring stack (Prometheus, Grafana, Alertmanager)
- Incident management tool (PagerDuty atau equivalent)
On-Call Rotation Design
flowchart LR
A[Alert Fires] --> B[Primary On-Call]
B -->|No Response 5min| C[Secondary On-Call]
C -->|No Response 10min| D[Engineering Manager]
D -->|No Response 15min| E[CTO/VP Engineering]
Rotation Principles
- Duration: 1 week per rotation (avoid > 2 weeks — burnout risk)
- Coverage: Primary + Secondary model untuk redundancy
- Minimum team size: 5 engineers untuk sustainable rotation (1 week on, 4 weeks off)
- Handoff: Formal handoff meeting setiap rotation change
Rotation Schedule (5 Engineers)
| Week | Primary | Secondary |
|---|---|---|
| Week 1 | Engineer A | Engineer B |
| Week 2 | Engineer B | Engineer C |
| Week 3 | Engineer C | Engineer D |
| Week 4 | Engineer D | Engineer E |
| Week 5 | Engineer E | Engineer A |
Escalation Policy
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
escalation_policy:
name: "Production Services"
tiers:
- tier: 1
name: "Primary On-Call"
timeout_minutes: 5
notification: [push_notification, sms]
- tier: 2
name: "Secondary On-Call"
timeout_minutes: 10
notification: [push_notification, sms, phone_call]
- tier: 3
name: "Engineering Manager"
timeout_minutes: 15
notification: [phone_call, sms]
- tier: 4
name: "CTO / VP Engineering"
timeout_minutes: 30
notification: [phone_call]
Runbook Creation
Setiap alert harus memiliki runbook yang actionable. Template:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# Runbook: [Alert Name]
## Alert Information
- **Alert Name:** HighCPUUtilization
- **Severity:** Critical
- **Threshold:** CPU > 80% for 5 minutes
- **Service:** api-gateway
- **Dashboard:** [Link to Grafana]
## Impact Assessment
- User Impact: API response times increase
- Business Impact: Checkout flow affected
## Investigation Steps
### Step 1: Verify Alert
kubectl top pods -n production -l app=api-gateway
### Step 2: Check Traffic
# Prometheus query for traffic spike
sum(rate(http_requests_total{app="api-gateway"}[5m]))
### Step 3: Mitigation
# Scale up if traffic spike
kubectl scale deployment api-gateway --replicas=10
# Rollback if recent deployment
kubectl rollout undo deployment/api-gateway
## Escalation Criteria
- Escalate to Secondary if not resolved in 15 minutes
- Escalate to Manager if customer-facing impact > 30 minutes
Alert Quality Management
| Metric | Target | Description |
|---|---|---|
| Actionable Rate | > 80% | Setiap alert harus memerlukan action |
| Noise Ratio | < 20% | False positives harus minimal |
| MTTA | < 5 min | Mean Time to Acknowledge |
| Runbook Coverage | > 80% | Alerts dengan runbook |
Alert Quality Review Process
Lakukan weekly alert review:
- Review semua alerts yang fired minggu ini
- Classify: actionable vs noise
- Tune thresholds untuk noisy alerts
- Create runbooks untuk alerts tanpa runbook
- Delete alerts yang tidak pernah actionable
On-Call Compensation
| Model | Description | Pros | Cons |
|---|---|---|---|
| Flat rate | Fixed amount per on-call week | Simple, predictable | Unfair jika incident-heavy |
| Per-incident | Bonus per incident handled | Fair distribution | Complex tracking |
| Hybrid | Base + per-incident bonus | Balanced | Moderate complexity |
| Time-off | Comp day after on-call week | Good work-life balance | Scheduling challenges |
Studi Kasus: TechStartup Indonesia
Konteks
TSI pada Scale Phase (2022 Q1) memiliki 5 DevOps engineers yang harus cover 24/7 on-call untuk 15+ microservices.
Kondisi sebelumnya:
- On-call bersifat ad-hoc — tidak ada rotation formal
- Alert fatigue parah (50+ alerts/day)
- Runbooks tidak ada atau outdated
- MTTA rata-rata 18 menit karena unclear ownership
- Satu engineer resign karena burnout
Apa yang Dilakukan
TSI mengimplementasikan structured on-call:
- Formal Weekly Rotation — Primary/secondary model dengan minimum 5 engineers
- PagerDuty Integration — Clear escalation path dan automatic notification
- Runbook Requirement — Setiap alert wajib punya runbook sebelum di-enable
- Weekly Alert Quality Review — Tune thresholds, delete noise, improve signal
- Fair Compensation — Hybrid model (base + per-incident bonus)
Metrics Improvement
| Metric | Sebelum | Sesudah | Perubahan |
|---|---|---|---|
| MTTA | 18 min | 3 min | -83% |
| MTTR | 95 min | 25 min | -74% |
| Alerts/day | 50+ | 8 | -84% |
| Actionable Rate | 20% | 85% | +325% |
| Revenue Loss/month | $45K | $8K | -82% |
| Team Satisfaction | 3.2/5 | 4.1/5 | +28% |
| Runbook Coverage | 15% | 85% | +467% |
Lessons Learned
Yang Berhasil:
- Formal rotation dengan PagerDuty — clear ownership menghilangkan “bukan tanggung jawab saya” mindset
- Weekly alert review — secara konsisten mengurangi noise dari 50+ ke 8 alerts/day dalam 3 bulan
- Runbook requirement — setiap alert baru wajib punya runbook sebelum di-enable di production
- Fair compensation — hybrid model (base + per-incident) mengubah on-call dari beban menjadi tanggung jawab yang dihargai
Yang Perlu Dihindari:
- Jangan biarkan on-call tanpa rotation formal — leads to burnout dan attrition
- Jangan ignore alert fatigue — 50+ alerts/day berarti semua alerts diabaikan
- Jangan buat runbooks yang terlalu panjang — on-call engineer butuh quick actionable steps, bukan documentation
- Jangan skip handoff meeting — context transfer penting untuk continuity
Best Practices
- Implementasikan formal rotation — minimum 5 engineers, 1 week on / 4 weeks off
- Require runbooks untuk setiap alert — no runbook = no alert in production
- Review alert quality weekly — tune thresholds, delete noise, improve signal
- Compensate fairly — on-call adalah extra responsibility yang harus dihargai
- Validate runbooks dengan chaos engineering — untested runbooks adalah false confidence
- Track on-call health metrics — MTTA, MTTR, pages/week, satisfaction score
- Automate common mitigations — jika runbook step selalu sama, automate it
Selanjutnya
Artikel berikutnya: Advanced SRE: Postmortem Culture — setelah membangun on-call yang sustainable, langkah selanjutnya adalah membangun budaya blameless postmortem untuk belajar dari setiap incident.
Topik terkait yang bisa Anda eksplorasi:
- Postmortem Culture — blameless postmortems dan continuous learning
- Toil Reduction — mengurangi repetitive on-call tasks melalui automation
- On-Call Automation & Runbook — advanced automation untuk incident response
References
- Google SRE Book - Being On-Call
- Google SRE Book - Effective Troubleshooting
- PagerDuty Incident Response Guide
- On-Call Compensation Best Practices
Navigasi Series
⬅️ Sebelumnya: Advanced SRE: Capacity Planning
➡️ Selanjutnya: Advanced SRE: Postmortem Culture
