
Learn SRE #12: On-Call Best Practices

Learn on-call best practices: rotation design, escalation policy, runbook creation, and alert quality management for sustainable operations.

On-call is a critical part of SRE: it ensures that production systems get a fast response when an incident occurs. Effective on-call practice is not only about "who answers the alert" but also about building a sustainable system that balances reliability with the team's well-being. This article covers rotation design, escalation policy, runbook creation, and alert quality management.

If you have not read the previous article, start with Advanced SRE: Capacity Planning.

Prerequisites

On-Call Rotation Design

flowchart LR
    A[Alert Fires] --> B[Primary On-Call]
    B -->|No Response 5min| C[Secondary On-Call]
    C -->|No Response 10min| D[Engineering Manager]
    D -->|No Response 15min| E[CTO/VP Engineering]

Rotation Principles

  • Duration: 1 week per rotation (avoid more than 2 weeks because of burnout risk)
  • Coverage: primary + secondary model for redundancy
  • Minimum team size: 5 engineers for a sustainable rotation (1 week as primary, then 4 weeks off primary duty; each engineer also covers one week as secondary)
  • Handoff: a formal handoff meeting at every rotation change

Rotation Schedule (5 Engineers)

| Week | Primary | Secondary |
|---|---|---|
| Week 1 | Engineer A | Engineer B |
| Week 2 | Engineer B | Engineer C |
| Week 3 | Engineer C | Engineer D |
| Week 4 | Engineer D | Engineer E |
| Week 5 | Engineer E | Engineer A |
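The schedule above follows a simple pattern: this week's secondary becomes next week's primary. A minimal Python sketch of that pattern is shown below. The function name `build_rotation` is hypothetical, and real scheduling tools also handle overrides, holidays, and time zones, which this sketch ignores.

```python
from typing import List, Tuple


def build_rotation(engineers: List[str], weeks: int) -> List[Tuple[int, str, str]]:
    """Return (week_number, primary, secondary) rows.

    The secondary for a given week is the next engineer in the list,
    who then becomes primary the following week.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    n = len(engineers)
    rows = []
    for week in range(weeks):
        primary = engineers[week % n]
        secondary = engineers[(week + 1) % n]
        rows.append((week + 1, primary, secondary))
    return rows


team = ["Engineer A", "Engineer B", "Engineer C", "Engineer D", "Engineer E"]
schedule = build_rotation(team, weeks=5)
for week, primary, secondary in schedule:
    print(f"Week {week}: primary={primary}, secondary={secondary}")
```

Because the secondary shadows the week before taking over as primary, the handoff meeting can double as a context transfer between the two roles.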

Escalation Policy

escalation_policy:
  name: "Production Services"
  tiers:
    - tier: 1
      name: "Primary On-Call"
      timeout_minutes: 5
      notification: [push_notification, sms]
    - tier: 2
      name: "Secondary On-Call"
      timeout_minutes: 10
      notification: [push_notification, sms, phone_call]
    - tier: 3
      name: "Engineering Manager"
      timeout_minutes: 15
      notification: [phone_call, sms]
    - tier: 4
      name: "CTO / VP Engineering"
      timeout_minutes: 30
      notification: [phone_call]
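To make the timeouts concrete, here is a small Python sketch that works out which tier should be holding an unacknowledged page. It assumes each `timeout_minutes` is counted from the moment that tier was paged (so the thresholds accumulate) and that the last tier keeps the page once reached. The policy format above does not state this, and actual paging tools may interpret timeouts differently. The names `Tier` and `current_tier` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tier:
    name: str
    timeout_minutes: int


# Mirrors the tiers in the policy above.
TIERS = [
    Tier("Primary On-Call", 5),
    Tier("Secondary On-Call", 10),
    Tier("Engineering Manager", 15),
    Tier("CTO / VP Engineering", 30),
]


def current_tier(tiers: List[Tier], minutes_unacknowledged: float) -> Tier:
    """Return the tier that should be holding the page right now.

    Assumption: timeouts are per tier and cumulative, so with the values
    above the page moves at 5, 15, and 30 minutes after the alert fires.
    """
    if not tiers:
        raise ValueError("escalation policy has no tiers")
    if minutes_unacknowledged < 0:
        raise ValueError("elapsed time cannot be negative")
    elapsed = 0.0
    for tier in tiers[:-1]:
        elapsed += tier.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return tier
    return tiers[-1]  # final tier keeps the page


for minutes in (0, 5, 15, 30):
    print(minutes, "->", current_tier(TIERS, minutes).name)
```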

Runbook Creation

Every alert should have an actionable runbook. Template:

# Runbook: [Alert Name]

## Alert Information
- **Alert Name:** HighCPUUtilization
- **Severity:** Critical
- **Threshold:** CPU > 80% for 5 minutes
- **Service:** api-gateway
- **Dashboard:** [Link to Grafana]

## Impact Assessment
- User Impact: API response times increase
- Business Impact: Checkout flow affected

## Investigation Steps

### Step 1: Verify Alert
kubectl top pods -n production -l app=api-gateway

### Step 2: Check Traffic
# Prometheus query for traffic spike
sum(rate(http_requests_total{app="api-gateway"}[5m]))

### Step 3: Mitigation
# Scale up if traffic spike (same namespace as Step 1)
kubectl scale deployment api-gateway -n production --replicas=10

# Rollback if recent deployment
kubectl rollout undo deployment/api-gateway -n production

## Escalation Criteria
- Escalate to Secondary if not resolved in 15 minutes
- Escalate to Manager if customer-facing impact > 30 minutes
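One way to keep runbooks consistent with the template is a small lint check that reports missing sections. The sketch below is an illustration, not part of any existing tool: the function name `missing_sections` is hypothetical, and the required headings are taken directly from the template above.

```python
from typing import List

# Second-level headings used by the runbook template above.
REQUIRED_SECTIONS = [
    "## Alert Information",
    "## Impact Assessment",
    "## Investigation Steps",
    "## Escalation Criteria",
]


def missing_sections(runbook_text: str) -> List[str]:
    """Return the required headings that do not appear in the runbook."""
    headings = {
        line.strip()
        for line in runbook_text.splitlines()
        if line.startswith("## ")  # "### Step ..." lines do not match
    }
    return [section for section in REQUIRED_SECTIONS if section not in headings]


draft = """# Runbook: HighCPUUtilization

## Alert Information
- **Severity:** Critical

## Investigation Steps
### Step 1: Verify Alert
"""

print(missing_sections(draft))
```

A check like this could run in CI for the runbook repository, so an incomplete runbook is caught before the matching alert is enabled.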

Alert Quality Management

| Metric | Target | Description |
|---|---|---|
| Actionable Rate | > 80% | Every alert should require action |
| Noise Ratio | < 20% | False positives should be minimal |
| MTTA | < 5 min | Mean Time to Acknowledge |
| Runbook Coverage | > 80% | Share of alerts that have a runbook |
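The four metrics in the table can be computed from a log of fired alerts. The Python sketch below shows one possible calculation. It assumes that every non-actionable alert counts as noise and that runbook coverage is measured per distinct alert name. The `AlertEvent` record and its fields are hypothetical, and the sample data is made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlertEvent:
    name: str
    actionable: bool     # did the responder need to take action?
    ack_minutes: float   # time from firing to acknowledgement
    has_runbook: bool


def quality_metrics(events: List[AlertEvent]) -> Dict[str, float]:
    """Compute actionable rate, noise ratio, MTTA, and runbook coverage."""
    if not events:
        raise ValueError("no alert events to evaluate")
    total = len(events)
    actionable = sum(1 for e in events if e.actionable)
    all_names = {e.name for e in events}
    names_with_runbook = {e.name for e in events if e.has_runbook}
    return {
        "actionable_rate": actionable / total,
        # Assumption: anything not actionable is treated as noise.
        "noise_ratio": (total - actionable) / total,
        "mtta_minutes": sum(e.ack_minutes for e in events) / total,
        "runbook_coverage": len(names_with_runbook) / len(all_names),
    }


# Illustrative sample data only.
events = [
    AlertEvent("HighCPUUtilization", True, 3.0, True),
    AlertEvent("HighCPUUtilization", True, 5.0, True),
    AlertEvent("DiskAlmostFull", False, 2.0, False),
    AlertEvent("PodRestart", True, 6.0, True),
]
metrics = quality_metrics(events)
for key, value in metrics.items():
    print(f"{key}: {value:.2f}")
```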

Alert Quality Review Process

Run a weekly alert review:

  1. Review all alerts that fired this week
  2. Classify each one as actionable or noise
  3. Tune thresholds for noisy alerts
  4. Create runbooks for alerts that lack one
  5. Delete alerts that are never actionable
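Steps 2, 3, and 5 of the review can be partly automated by grouping the week's alerts by name. The sketch below is one possible triage rule, not a standard: the function name `triage` and the 50% `noisy_threshold` are assumptions, and a single week of data is rarely enough to justify deleting an alert, so the output only suggests candidates.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def triage(fired: Iterable[Tuple[str, bool]],
           noisy_threshold: float = 0.5) -> Dict[str, List[str]]:
    """Group alert names into keep / tune / delete_candidate buckets.

    `fired` holds (alert_name, was_actionable) pairs from the review window.
    """
    counts = defaultdict(lambda: [0, 0])  # name -> [actionable, total]
    for name, actionable in fired:
        counts[name][1] += 1
        if actionable:
            counts[name][0] += 1

    result: Dict[str, List[str]] = {"keep": [], "tune": [], "delete_candidate": []}
    for name in sorted(counts):
        actionable, total = counts[name]
        if actionable == 0:
            result["delete_candidate"].append(name)  # never actionable
        elif actionable / total < noisy_threshold:
            result["tune"].append(name)              # mostly noise
        else:
            result["keep"].append(name)
    return result


# Illustrative sample data only.
week = [
    ("HighCPUUtilization", True),
    ("HighCPUUtilization", True),
    ("DiskLatency", False),
    ("DiskLatency", False),
    ("DiskLatency", True),
    ("CertExpiry", False),
]
report = triage(week)
print(report)
```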

On-Call Compensation

| Model | Description | Pros | Cons |
|---|---|---|---|
| Flat rate | Fixed amount per on-call week | Simple, predictable | Unfair when a week is incident-heavy |
| Per-incident | Bonus per incident handled | Fair distribution | Complex tracking |
| Hybrid | Base + per-incident bonus | Balanced | Moderate complexity |
| Time-off | Comp day after on-call week | Good work-life balance | Scheduling challenges |

Case Study: TechStartup Indonesia

Context

During its Scale Phase (2022 Q1), TSI had 5 DevOps engineers covering 24/7 on-call for 15+ microservices.

Before the change:

  • On-call was ad hoc, with no formal rotation
  • Severe alert fatigue (50+ alerts/day)
  • Runbooks were missing or outdated
  • MTTA averaged 18 minutes because of unclear ownership
  • One engineer resigned due to burnout

What They Did

TSI implemented structured on-call:

  1. Formal Weekly Rotation: primary/secondary model with a minimum of 5 engineers
  2. PagerDuty Integration: clear escalation path and automatic notification
  3. Runbook Requirement: every alert must have a runbook before it is enabled
  4. Weekly Alert Quality Review: tune thresholds, delete noise, improve signal
  5. Fair Compensation: hybrid model (base + per-incident bonus)

Metrics Improvement

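| Metric | Before | After | Change |
|---|---|---|---|
| MTTA | 18 min | 3 min | -83% |
| MTTR | 95 min | 25 min | -74% |
| Alerts/day | 50+ | 8 | -84% |
| Actionable Rate | 20% | 85% | +325% |
| Revenue Loss/month | $45K | $8K | -82% |
| Team Satisfaction | 3.2/5 | 4.1/5 | +28% |
| Runbook Coverage | 15% | 85% | +467% |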
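The change column is a relative change, (after - before) / before. The short Python check below reproduces it from the before and after values in the table. It treats "50+" alerts/day as exactly 50, which is an assumption, so the real reduction in alerts may be slightly larger.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("relative change is undefined when before is 0")
    return (after - before) / before * 100


# (before, after) pairs copied from the table; "50+" is taken as 50.
rows = {
    "MTTA (min)": (18, 3),
    "MTTR (min)": (95, 25),
    "Alerts/day": (50, 8),
    "Actionable Rate (%)": (20, 85),
    "Revenue Loss/month ($K)": (45, 8),
    "Team Satisfaction (/5)": (3.2, 4.1),
    "Runbook Coverage (%)": (15, 85),
}

for name, (before, after) in rows.items():
    print(f"{name}: {percent_change(before, after):+.0f}%")
```

Note that the two rate metrics are relative changes of a percentage: the actionable rate rose by 65 percentage points, which is +325% relative to its starting value of 20%.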

Lessons Learned

What Worked:

  • Formal rotation with PagerDuty: clear ownership removed the "not my responsibility" mindset
  • Weekly alert review: consistently reduced noise from 50+ to 8 alerts/day within 3 months
  • Runbook requirement: every new alert must have a runbook before it is enabled in production
  • Fair compensation: the hybrid model (base + per-incident) turned on-call from a burden into a responsibility that is valued

What to Avoid:

  • Do not run on-call without a formal rotation: it leads to burnout and attrition
  • Do not ignore alert fatigue: at 50+ alerts/day, engineers end up ignoring alerts altogether
  • Do not write runbooks that are too long: the on-call engineer needs quick, actionable steps, not documentation
  • Do not skip the handoff meeting: context transfer matters for continuity

Best Practices

  • Implement a formal rotation: minimum 5 engineers, 1 week on / 4 weeks off
  • Require a runbook for every alert: no runbook = no alert in production
  • Review alert quality weekly: tune thresholds, delete noise, improve signal
  • Compensate fairly: on-call is an extra responsibility that should be rewarded
  • Validate runbooks with chaos engineering: an untested runbook gives false confidence
  • Track on-call health metrics: MTTA, MTTR, pages/week, satisfaction score
  • Automate common mitigations: if a runbook step is always the same, automate it

Next

Next article: Advanced SRE: Postmortem Culture. With a sustainable on-call in place, the next step is to build a blameless postmortem culture so the team learns from every incident.

Related topics you can explore:

  • Postmortem Culture: blameless postmortems and continuous learning
  • Toil Reduction: reducing repetitive on-call tasks through automation
  • On-Call Automation & Runbook: advanced automation for incident response


This post is licensed under CC BY 4.0 by the author.