Dashboard as Code: Managing Grafana Dashboards with Terraform
A comprehensive guide to building a Dashboard-as-Code system using Terraform and Grafana, from local setup to a CI/CD pipeline for production.
Introduction
Picture this scenario: Monday morning, 3:00 AM. An alert fires — there’s an anomaly in production. You open Grafana, and the dashboard that was perfectly fine yesterday is now displaying incorrect data. Someone modified the query on the “HTTP Error Rate” panel directly from the UI on Friday afternoon, without telling anyone. No history, no review, no way to roll back except relying on memory.
This isn’t hypothetical. It happens regularly on teams that manage dozens to hundreds of dashboards without version control.
Over two years working as a Site Reliability Engineer at a large fintech company, I managed 50+ Grafana dashboards across 4 environments (including dev, staging, and production). Each dashboard had 10-20 panels. Do the math: that’s 500+ panels that anyone could modify, at any time, with no clear audit trail.
Dashboard-as-Code is the solution to this problem. The concept is straightforward: treat dashboards the same way you treat application code, that is, stored in Git, reviewed through Pull Requests, and deployed through CI/CD pipelines.
What We’ll Build
In this article, we’ll build a complete Dashboard-as-Code system:
- Local development stack — Prometheus + Grafana + sample app using Docker Compose
- Terraform configurations — Managing Grafana folders and dashboards declaratively
- Dashboard templates — Reusable JSON templates with variables
- CI/CD pipeline — Automated validation and deployment via GitHub Actions
Who Should Read This?
- SRE/DevOps engineers who are familiar with Grafana but haven’t yet applied IaC to observability
- Teams with more than 10 dashboards that are starting to struggle with management
- Anyone who’s ever lost a dashboard because someone “accidentally” deleted it
Prerequisites
Before starting, make sure the following tools are installed:
- Docker & Docker Compose v2+ — for the local stack
- Terraform >= 1.5.0 — for Infrastructure as Code
- Git — for version control
- curl & jq — for interacting with APIs
Note: This article targets macOS/Linux. If you’re on Windows, use WSL2.
Architecture
Before diving into the implementation, let’s understand the overall architecture of the system we’ll build:
┌─────────────────────────────────────────────────────────────────────────┐
│                            Local Development                            │
│                                                                         │
│  ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐    │
│  │   Prometheus    │────▶│   Sample App    │     │     Grafana     │    │
│  │     :9090       │     │     :8080       │     │     :3000       │    │
│  │                 │────▶│                 │     │                 │    │
│  │ - scrape jobs   │     │ - /metrics      │     │ - dashboards    │    │
│  │ - 15s interval  │     │ - HTTP metrics  │     │ - datasources   │    │
│  └────────┬────────┘     └─────────────────┘     └────────┬────────┘    │
│           │                                               │             │
│           └───────────────── datasource ──────────────────┘             │
│                                    ▲                                    │
│                                    │                                    │
│                          ┌─────────┴────────┐                           │
│                          │    Terraform     │                           │
│                          │                  │                           │
│                          │ - folders        │                           │
│                          │ - dashboards     │                           │
│                          │ - API calls      │                           │
│                          └──────────────────┘                           │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────┐
│                             CI/CD Pipeline                              │
│                                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────────────┐   │
│  │   Push   │───▶│ Validate │───▶│   Plan   │───▶│ Apply (on main)  │   │
│  │  to Git  │    │ fmt+val  │    │ (on PR)  │    │ (auto-approve)   │   │
│  └──────────┘    └──────────┘    └──────────┘    └──────────────────┘   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
The flow:
- Prometheus scrapes metrics from all targets (itself, Grafana, sample app)
- Grafana queries Prometheus as a datasource to render dashboards
- Terraform communicates with the Grafana API to create/update folders and dashboards
- The CI/CD pipeline automatically validates and deploys changes to Grafana
Step 1: Repository Setup
A consistent project structure makes onboarding new team members easier and ensures everyone knows where to find things. We separate concerns: local infrastructure (Docker), Terraform configuration, dashboard templates, and documentation.
Implementation
# Clone repository
git clone https://github.com/ahakimx/observability-as-code.git
cd observability-as-code
Or if starting from scratch:
mkdir -p observability-as-code/{prometheus,grafana/provisioning/datasources,terraform/environments,dashboards/templates,.github/workflows}
cd observability-as-code
git init
The target directory structure:
observability-as-code/
├── README.md
├── docker-compose.yaml
├── prometheus/
│   └── prometheus.yaml
├── grafana/
│   └── provisioning/datasources/prometheus.yaml
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── folders.tf
│   ├── dashboards.tf
│   ├── outputs.tf
│   └── environments/
│       ├── local.tfvars
│       └── prod.tfvars
├── dashboards/
│   └── templates/
│       └── infrastructure-overview.json
├── .github/
│   └── workflows/
│       └── deploy.yaml
└── .gitignore
Explanation
- prometheus/: Prometheus configuration (scrape targets)
- grafana/provisioning/: Auto-provisioning of the datasource when Grafana starts
- terraform/: All Terraform configuration for Grafana
- terraform/environments/: Variable files per environment
- dashboards/templates/: Dashboard JSON templates (separated from Terraform so they can be shared)
- .github/workflows/: CI/CD pipeline
Note: We deliberately separate dashboard JSON from Terraform config, so that JSON templates can later be generated by other tools (Grafonnet, grafana-dashboard-builder) without modifying the Terraform config.
Step 2: Local Observability Stack
Before managing dashboards as code, we need a Grafana instance to target. Docker Compose provides a reproducible environment: every team member gets an identical setup with a single command.
We also need a metrics source (Prometheus) and a sample application that exposes metrics, so the dashboard we create displays real data rather than blank panels.
Implementation
Create the docker-compose.yaml file:
# docker-compose.yaml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.53.0
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yaml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=15d"
      - "--web.enable-lifecycle"
    networks:
      - observability
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus
    networks:
      - observability
    restart: unless-stopped

  sample-app:
    image: quay.io/brancz/prometheus-example-app:v0.5.0
    container_name: sample-app
    ports:
      - "8080:8080"
    networks:
      - observability
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

networks:
  observability:
    driver: bridge
Prometheus configuration to scrape all targets:
# prometheus/prometheus.yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
        labels:
          environment: "local"

  - job_name: "grafana"
    static_configs:
      - targets: ["grafana:3000"]
        labels:
          environment: "local"

  - job_name: "sample-app"
    static_configs:
      - targets: ["sample-app:8080"]
        labels:
          environment: "local"
          service: "example-app"
Auto-provision Prometheus as a datasource in Grafana:
# grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      httpMethod: POST
    version: 1
Verification
# Start all services
docker compose up -d
# Wait until all containers are running
docker compose ps
# Expected output:
# NAME IMAGE STATUS
# grafana grafana/grafana:11.1.0 Up (healthy)
# prometheus prom/prometheus:v2.53.0 Up
# sample-app quay.io/brancz/prometheus-example-app:v0.5.0 Up
Verify each service:
# Prometheus - check targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets | length'
# Expected: 3
# Grafana - check that the datasource is provisioned
curl -s http://localhost:3000/api/datasources \
-u admin:admin123 | jq '.[].name'
# Expected: "Prometheus"
# Sample app - check metrics endpoint
curl -s http://localhost:8080/metrics | head -5
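Counting targets confirms that Prometheus knows about them, but not that they are healthy. As a minimal sketch, the following Python helper lists every target whose health is not "up". It assumes the usual shape of the /api/v1/targets response (data.activeTargets with labels, health and lastError); the function names and the sample payload are illustrative and not part of the repository.

```python
import json
from urllib.request import urlopen


def unhealthy_targets(payload: dict) -> list[str]:
    """Describe every active target whose health is not "up"."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            job = labels.get("job", "?")
            instance = labels.get("instance", "?")
            error = target.get("lastError") or "no error reported"
            problems.append(f"{job}/{instance}: {error}")
    return problems


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the target list from a running Prometheus (not called in this demo)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


# Offline demo: a hand-written payload shaped like the API response.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus", "instance": "localhost:9090"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "grafana", "instance": "grafana:3000"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}
print(unhealthy_targets(sample))
```

Against the local stack, unhealthy_targets(fetch_targets()) should return an empty list once all three containers have been scraped at least once.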
Explanation
A few important design decisions:
- --web.enable-lifecycle on Prometheus: Allows config reload without restart (curl -X POST http://localhost:9090/-/reload)
- GF_AUTH_ANONYMOUS_ENABLED=true: For development, this makes access easier without repeated logins. Do not enable this in production.
- Read-only volume mounts (:ro): Config files are mounted read-only. If Grafana/Prometheus needs to modify config, that’s a code smell.
- Named volumes (prometheus_data, grafana_data): Data persists across docker compose down and up. Use docker compose down -v for a clean slate.
Step 3: Grafana API Key
Terraform communicates with Grafana through its HTTP API. This requires authentication. Grafana supports several methods:
- API Key — Simple, scoped per organization (legacy, but still widely used)
- Service Account Token — Recommended for Grafana 9+. Can be assigned to a specific role.
- Basic Auth — Username/password. Not recommended for automation.
For this lab we’ll use an API Key because it’s the most straightforward. In production, use a Service Account Token.
Implementation
# Create an API key with Admin role
API_KEY=$(curl -s -X POST http://localhost:3000/api/auth/keys \
-H "Content-Type: application/json" \
-u admin:admin123 \
-d '{"name":"terraform-local","role":"Admin","secondsToLive":86400}' \
| jq -r '.key')
echo "API Key: $API_KEY"
# Store as an environment variable
export TF_VAR_grafana_api_key="$API_KEY"
⚠️ Warning: The API key is only displayed once at creation time. Store it securely. For production, use a secret manager (Vault, AWS Secrets Manager, etc.).
Verification
# Test the API key
curl -s http://localhost:3000/api/org \
-H "Authorization: Bearer $TF_VA...key" | jq '.name'
# Expected: "Main Org."
Explanation
The secondsToLive: 86400 parameter makes the key expire in 24 hours. This is a best practice for development — you won’t forget to delete unused keys because they automatically expire.
In production, Service Account Tokens can be created without expiry (for CI/CD) or with a longer expiry (30-90 days) with automated rotation.
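For the Service Account route recommended above, token creation takes two API calls: create the service account, then create a token for it. The sketch below builds those two requests with the Python standard library. It assumes Grafana's /api/serviceaccounts and /api/serviceaccounts/{id}/tokens endpoints and basic-auth admin credentials; the helper names, the account name "terraform" and the token name are placeholders chosen for illustration.

```python
import base64
import json
from urllib.request import Request, urlopen


def _post(url: str, body: dict, user: str, password: str) -> Request:
    """Build (but do not send) a JSON POST request with basic auth."""
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        url,
        data=json.dumps(body).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )


def service_account_request(base_url, user, password, name="terraform", role="Admin"):
    """Request that creates the service account itself."""
    return _post(f"{base_url}/api/serviceaccounts",
                 {"name": name, "role": role}, user, password)


def token_request(base_url, user, password, sa_id, name="terraform-local", ttl=86400):
    """Request that creates a token for an existing service account."""
    return _post(f"{base_url}/api/serviceaccounts/{sa_id}/tokens",
                 {"name": name, "secondsToLive": ttl}, user, password)


def create_token(base_url: str, user: str, password: str) -> str:
    """Send both requests to a live Grafana and return the new token."""
    with urlopen(service_account_request(base_url, user, password)) as resp:
        sa_id = json.load(resp)["id"]
    with urlopen(token_request(base_url, user, password, sa_id)) as resp:
        return json.load(resp)["key"]
```

With the local stack running, create_token("http://localhost:3000", "admin", "admin123") would return a token that can be exported as TF_VAR_grafana_api_key in place of the API key.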
Step 4: Terraform Configuration
Terraform is an Infrastructure as Code tool that uses a declarative approach: you define the desired state, and Terraform determines the steps needed to achieve it. This is different from an imperative approach (scripting curl commands against the Grafana API).
Advantages of Terraform for Grafana:
- State management — Terraform knows which resources have already been created
- Plan before apply — You can always preview changes before executing them
- Dependency graph — Terraform knows the order of resource creation (folder first, then dashboard)
- Idempotent — Applying multiple times produces the same state
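To make the declarative idea concrete, here is a toy model in Python. It is not Terraform's algorithm, only an illustration of the principle: compare the desired state with the tracked state, derive the actions, and observe that a second run finds nothing to do. The resource keys and the plan/apply function names are invented for this sketch.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare desired and tracked state; return the actions needed to converge."""
    return {
        "add": sorted(k for k in desired if k not in actual),
        "change": sorted(k for k in desired if k in actual and desired[k] != actual[k]),
        "destroy": sorted(k for k in actual if k not in desired),
    }


def apply(desired: dict, actual: dict) -> dict:
    """Converge: in this toy model the new state is simply a copy of desired."""
    return dict(desired)


desired = {
    "folder.infrastructure": {"title": "Infrastructure"},
    "dashboard.infra-overview": {"title": "Infrastructure Overview"},
}
actual = {
    "dashboard.infra-overview": {"title": "Modified Title"},  # edited in the UI
    "folder.old": {"title": "Old"},  # tracked, but no longer in code
}

first = plan(desired, actual)    # one add, one change, one destroy
actual = apply(desired, actual)
second = plan(desired, actual)   # nothing left to do: idempotent
print(first)
print(second)
```

Note that real Terraform only plans to destroy resources it tracks in its state; objects created by hand in Grafana and never imported are invisible to it.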
Implementation
Provider Configuration
# terraform/main.tf
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 3.7.0"
    }
  }

  # Uncomment for remote state in production
  # backend "s3" {
  #   bucket         = "your-terraform-state-bucket"
  #   key            = "observability/grafana/terraform.tfstate"
  #   region         = "ap-southeast-1"
  #   encrypt        = true
  #   dynamodb_table = "terraform-locks"
  # }
}

provider "grafana" {
  url  = var.grafana_url
  auth = var.grafana_api_key
}
Variables
# terraform/variables.tf
variable "grafana_url" {
  description = "URL of the Grafana instance"
  type        = string
  default     = "http://localhost:3000"
}

variable "grafana_api_key" {
  description = "API key or service account token for Grafana"
  type        = string
  sensitive   = true
}

variable "environment" {
  description = "Deployment environment (local, staging, prod)"
  type        = string
  default     = "local"

  validation {
    condition     = contains(["local", "staging", "prod"], var.environment)
    error_message = "Environment must be one of: local, staging, prod."
  }
}

variable "dashboard_folder_prefix" {
  description = "Prefix for dashboard folder names to support multi-env"
  type        = string
  default     = ""
}
Folders
# terraform/folders.tf
resource "grafana_folder" "infrastructure" {
  title = "${var.dashboard_folder_prefix}Infrastructure"
}

resource "grafana_folder" "application" {
  title = "${var.dashboard_folder_prefix}Application"
}

resource "grafana_folder" "slos" {
  title = "${var.dashboard_folder_prefix}SLOs"
}
Dashboards
# terraform/dashboards.tf
resource "grafana_dashboard" "infrastructure_overview" {
  folder      = grafana_folder.infrastructure.id
  config_json = file("${path.module}/../dashboards/templates/infrastructure-overview.json")
  overwrite   = true
  message     = "Deployed via Terraform - ${var.environment}"
}
Outputs
# terraform/outputs.tf
output "infrastructure_overview_url" {
  description = "URL to the Infrastructure Overview dashboard"
  value       = "${var.grafana_url}${grafana_dashboard.infrastructure_overview.url}"
}

output "folder_ids" {
  description = "Map of folder names to their IDs"
  value = {
    infrastructure = grafana_folder.infrastructure.id
    application    = grafana_folder.application.id
    slos           = grafana_folder.slos.id
  }
}

output "environment" {
  description = "Current deployment environment"
  value       = var.environment
}
Environment Files
# terraform/environments/local.tfvars
grafana_url             = "http://localhost:3000"
environment             = "local"
dashboard_folder_prefix = ""
# terraform/environments/prod.tfvars
grafana_url             = "https://grafana.example.com"
environment             = "prod"
dashboard_folder_prefix = ""
Verification
cd terraform
# Initialize - download provider
terraform init
# Validate syntax
terraform validate
# Expected: Success! The configuration is valid.
# Format check
terraform fmt -check
# No output = all files are already formatted correctly
# Preview changes
terraform plan -var-file=environments/local.tfvars
The plan output should show:
Plan: 4 to add, 0 to change, 0 to destroy.
Changes to Outputs:
+ environment = "local"
+ folder_ids = {
+ application = (known after apply)
+ infrastructure = (known after apply)
+ slos = (known after apply)
}
+ infrastructure_overview_url = (known after apply)
Explanation
A few important points:
- sensitive = true on grafana_api_key: Terraform will not display this value in logs/output
- validation block on environment: Prevents typos that could cause deployment to the wrong environment
- overwrite = true on the dashboard: If someone modifies the dashboard via the UI, Terraform will revert it to the state defined in code
- file() function: Reads the JSON template from a separate file, keeping Terraform config clean
Step 5: Dashboard Template
Context (Why)
Dashboard JSON is Grafana’s native format for defining dashboards. Every panel, query, and visualization is defined in a single JSON file. The advantages of storing them as templates:
- Version controllable (diff-able)
- Reusable across environments with variables
- Can be generated programmatically (Grafonnet, Go SDK)
- Can be validated before deploy (valid JSON, required fields)
Implementation
Our “Infrastructure Overview” dashboard has 7 panels:
- Targets Status (Stat) — Displays UP/DOWN status for all scrape targets
- Scrape Duration (Stat) — How long Prometheus takes to scrape each target
- HTTP Request Rate (Time Series) — Rate of HTTP requests to the sample app
- Memory Usage (Time Series) — Resident memory (RSS) per process
- CPU Usage (Time Series) — CPU seconds consumed per process
- Goroutines (Time Series) — Number of active goroutines (Go runtime health)
- Scrape Samples (Time Series) — Number of samples scraped per target
Tip: Dashboard JSON files can be quite long. The best approach for creating dashboard JSON:
- Build the dashboard in the Grafana UI
- Export as JSON (Share → Export → Save to file)
- Clean up auto-generated fields (id, version, etc.)
- Add template variables ($datasource, $job)
- Save to dashboards/templates/
The full file is in the repository: dashboards/templates/infrastructure-overview.json
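The clean-up step in the tip above is easy to script. The following is a minimal sketch, assuming that only the top-level id and version fields need to go and that panel-level ids should be kept; the function name and the sample export are illustrative.

```python
import json


def clean_dashboard(exported: dict) -> dict:
    """Return a copy of an exported dashboard without instance-specific fields."""
    # The HTTP API wraps the model as {"dashboard": {...}, "meta": {...}},
    # while the UI export is the bare model. Accept both shapes.
    model = exported.get("dashboard", exported)
    return {key: value for key, value in model.items() if key not in ("id", "version")}


# Hand-written example shaped like an API export.
exported = {
    "meta": {"folderTitle": "Infrastructure"},
    "dashboard": {
        "id": 12,
        "uid": "infra-overview",
        "version": 5,
        "title": "Infrastructure Overview",
        "panels": [{"id": 1, "title": "Targets Status"}],
    },
}
cleaned = clean_dashboard(exported)
print(json.dumps(cleaned, indent=2))
```

The uid is kept on purpose: it is what gives the dashboard a stable URL and identity across deployments.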
Key sections of the dashboard JSON:
{
  "templating": {
    "list": [
      {
        "name": "datasource",
        "type": "datasource",
        "query": "prometheus"
      },
      {
        "name": "job",
        "type": "query",
        "datasource": {"uid": "${datasource}"},
        "definition": "label_values(up, job)",
        "includeAll": true,
        "multi": true
      }
    ]
  }
}
Verification
# Validate JSON syntax
python3 -m json.tool dashboards/templates/infrastructure-overview.json > /dev/null
echo $? # Expected: 0
# Check required fields
jq '{title: .title, uid: .uid, panels_count: (.panels | length), tags: .tags}' \
dashboards/templates/infrastructure-overview.json
Expected output:
{
"title": "Infrastructure Overview",
"uid": "infra-overview",
"panels_count": 7,
"tags": ["infrastructure", "overview", "terraform-managed"]
}
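The same checks can be turned into a small lint function that fails loudly instead of relying on a visual comparison. This is a sketch of one possible rule set (title, uid, at least one panel, the terraform-managed tag, no populated id or version); the rules and names are assumptions for illustration, not something Grafana or the repository enforces.

```python
import json


def lint_dashboard(model: dict) -> list[str]:
    """Return a list of problems found in a dashboard model (empty means OK)."""
    problems = []
    for field in ("title", "uid"):
        if not model.get(field):
            problems.append(f"missing required field: {field}")
    if not model.get("panels"):
        problems.append("dashboard has no panels")
    if "terraform-managed" not in model.get("tags", []):
        problems.append("missing tag: terraform-managed")
    for field in ("id", "version"):
        # "id": null is harmless in an export; only flag populated values.
        if model.get(field) is not None:
            problems.append(f"remove auto-generated field: {field}")
    return problems


def lint_file(path: str) -> list[str]:
    """Load a template from disk and lint it (not called in this demo)."""
    with open(path, encoding="utf-8") as handle:
        return lint_dashboard(json.load(handle))


good = {
    "title": "Infrastructure Overview",
    "uid": "infra-overview",
    "panels": [{"id": 1}],
    "tags": ["infrastructure", "overview", "terraform-managed"],
}
bad = {"title": "", "id": 12, "version": 3, "panels": []}
print(lint_dashboard(good))
print(lint_dashboard(bad))
```

A CI step could call lint_file for every file under dashboards/templates/ and exit non-zero when any list is non-empty.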
Explanation
Some best practices for dashboard templates:
- Descriptive UID (infra-overview): Makes referencing in Terraform and URLs easier
- terraform-managed tag: Visual indicator in Grafana UI that this dashboard is managed by code
- $datasource template variable: Dashboard can be used in any environment (each env may have a different datasource)
- $job template variable: Filter per service, very useful when troubleshooting a specific target
- $__rate_interval: A Grafana built-in variable that automatically adjusts the interval based on the scrape interval and resolution
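To see what "automatically adjusts" means for $__rate_interval, it helps to work through the arithmetic. Grafana's documentation defines it as max($__interval + scrape interval, 4 × scrape interval). The sketch below assumes the 15-second scrape interval configured in the datasource earlier; the function name is illustrative.

```python
def rate_interval(interval_s: float, scrape_interval_s: float = 15.0) -> float:
    """Compute $__rate_interval in seconds from $__interval and the scrape interval."""
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


# Zoomed in: $__interval is small, so the 4x scrape interval floor applies.
print(rate_interval(15))   # 60.0
# Zoomed out: $__interval dominates, plus one scrape interval of headroom.
print(rate_interval(120))  # 135.0
```

The floor of four scrape intervals is what keeps rate() from returning empty results on short time ranges, because each window is then wide enough to contain several samples.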
Step 6: Deploy with Terraform
Context (Why)
This is the moment of truth — we’ll apply all the configuration we’ve built to the Grafana instance. Terraform apply will:
- Create 3 folders (Infrastructure, Application, SLOs)
- Upload the dashboard JSON to the Infrastructure folder
- Store the state for future tracking
Implementation
cd terraform
# Make sure the API key is set
echo $TF_VAR_grafana_api_key | head -c 10
# Should display some characters (not empty)
# Apply!
terraform apply -var-file=environments/local.tfvars
Terraform will display the plan and ask for confirmation:
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
Type yes and press enter.
Verification
# Check Terraform outputs
terraform output
# Verify via Grafana API - folders
curl -s http://localhost:3000/api/folders \
-H "Authorization: Bearer $TF_VA...key" | jq '.[].title'
# Expected: "Infrastructure", "Application", "SLOs"
# Verify via Grafana API - dashboard
curl -s http://localhost:3000/api/dashboards/uid/infra-overview \
-H "Authorization: Bearer $TF_VA...key" | jq '.meta.folderTitle'
# Expected: "Infrastructure"
Open your browser to the URL shown in the output (typically http://localhost:3000/d/infra-overview/infrastructure-overview). The dashboard should display real-time data from Prometheus.
Explanation
After apply, Terraform creates a terraform.tfstate file containing the mapping between resources in code and actual resources in Grafana. This file is critically important:
- Don’t commit it to Git (already in .gitignore)
- In production, store it in a remote backend (S3, GCS, Terraform Cloud)
- Don’t edit it manually — let Terraform manage it
State allows Terraform to know that the “Infrastructure” folder it created has a specific ID in Grafana. Without state, Terraform would try to create a new folder on every apply.
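The mapping that state holds can be inspected without opening the file by hand, using the JSON form of terraform show. The sketch below assumes the documented layout of that output (values.root_module.resources, each with an address and a values object) and ignores child modules for brevity; the IDs in the sample are made up.

```python
def resource_ids(show_json: dict) -> dict[str, str]:
    """Map each root-module resource address to the ID recorded in state."""
    resources = (
        show_json.get("values", {}).get("root_module", {}).get("resources", [])
    )
    return {r["address"]: r["values"].get("id", "") for r in resources}


# Hand-written excerpt shaped like `terraform show -json` output.
show_output = {
    "format_version": "1.0",
    "values": {
        "root_module": {
            "resources": [
                {
                    "address": "grafana_folder.infrastructure",
                    "type": "grafana_folder",
                    "name": "infrastructure",
                    "values": {"id": "1:abc123", "title": "Infrastructure"},
                },
                {
                    "address": "grafana_dashboard.infrastructure_overview",
                    "type": "grafana_dashboard",
                    "name": "infrastructure_overview",
                    "values": {"id": "1:infra-overview", "uid": "infra-overview"},
                },
            ]
        }
    },
}
for address, grafana_id in resource_ids(show_output).items():
    print(f"{address} -> {grafana_id}")
```

In practice the input would come from running terraform show -json in the terraform directory and parsing its standard output with json.loads.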
Step 7: CI/CD Pipeline
Context (Why)
With everything working locally, we need an automated pipeline so that dashboard changes can be reviewed and deployed consistently. Our pipeline follows this pattern:
- Pull Request → validate + plan (preview changes)
- Merge to main → auto-apply to production
This provides a safety net: every change is reviewed by the team before reaching production, and the plan output shows exactly what will change.
Implementation
# .github/workflows/deploy.yaml
name: Deploy Grafana Dashboards

on:
  pull_request:
    branches: [main]
    paths:
      - "terraform/**"
      - "dashboards/**"
  push:
    branches: [main]
    paths:
      - "terraform/**"
      - "dashboards/**"

env:
  TF_VERSION: "1.7.5"
  TF_WORKING_DIR: "./terraform"

jobs:
  validate:
    name: Validate & Plan
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: $

      - name: Terraform Format Check
        working-directory: $
        run: terraform fmt -check -recursive

      - name: Terraform Init
        working-directory: $
        run: terraform init -backend=false

      - name: Terraform Validate
        working-directory: $
        run: terraform validate

      - name: Validate Dashboard JSON
        run: |
          for f in dashboards/templates/*.json; do
            echo "Validating $f..."
            python3 -m json.tool "$f" > /dev/null || exit 1
          done
          echo "All dashboard JSON files are valid."

  plan:
    name: Terraform Plan
    runs-on: ubuntu-latest
    needs: validate
    if: github.event_name == 'pull_request'
    permissions:
      contents: read
      pull-requests: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: $

      - name: Terraform Init
        working-directory: $
        run: terraform init

      - name: Terraform Plan
        working-directory: $
        id: plan
        run: |
          terraform plan \
            -var-file=environments/prod.tfvars \
            -var="grafana_api_key=$" \
            -no-color \
            -out=tfplan

      - name: Comment Plan on PR
        uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const output = `#### Terraform Plan 📋
            \`\`\`
            $
            \`\`\`
            *Pushed by: @$*`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

  deploy:
    name: Deploy to Production
    runs-on: ubuntu-latest
    needs: validate
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: $

      - name: Terraform Init
        working-directory: $
        run: terraform init

      - name: Terraform Apply
        working-directory: $
        run: |
          terraform apply \
            -var-file=environments/prod.tfvars \
            -var="grafana_api_key=$" \
            -auto-approve

      - name: Output Dashboard URLs
        working-directory: $
        run: terraform output -json
GitHub Setup
Before the pipeline can run, you need to configure:
- Repository Secrets:
  - GRAFANA_API_KEY: Service account token for the production Grafana instance
  - TF_API_TOKEN: (Optional) If using Terraform Cloud as a backend
- Environment Protection:
  - Create a “production” environment under Settings → Environments
  - Enable “Required reviewers” and add at least 1 reviewer
  - This provides an approval gate before deploying to production
Verification
After pushing to GitHub, create a PR with a small change:
git checkout -b test/verify-pipeline
# Make a small edit, for example add a tag to the dashboard
git add .
git commit -m "test: verify CI pipeline"
git push -u origin test/verify-pipeline
On GitHub, create a Pull Request. The pipeline will automatically:
- Validate format and syntax
- Validate dashboard JSON
- Post the plan output as a PR comment
Explanation
Some design decisions:
- paths filter: The pipeline only triggers when there are changes in terraform/ or dashboards/. README changes don’t need to trigger a deploy.
- -backend=false in validate: For the validation job, we don’t need to connect to the remote backend. This makes validation faster and doesn’t require credentials.
- Environment protection: environment: production on the deploy job ensures there’s manual approval before applying to production.
- Plan as PR comment: Reviewers can see exactly what will change without having to run Terraform locally.
Step 8: End-to-End Verification
Complete Workflow
Let’s verify the entire workflow from start to finish:
# 1. Make sure the stack is running
docker compose ps
# 2. Open Grafana
open http://localhost:3000
# 3. Navigate to the Infrastructure folder
# Click the hamburger menu → Dashboards → Infrastructure → Infrastructure Overview
# 4. The dashboard should display:
# - Target Status: all UP (green)
# - Scrape Duration: < 1 second
# - HTTP Request Rate: traffic from Prometheus scraping
# - Memory Usage: stable upward graph
# - CPU Usage: low but with activity
# - Goroutines: stable
# 5. Test idempotency - re-apply should result in no changes
cd terraform
terraform apply -var-file=environments/local.tfvars
# Expected: "No changes. Your infrastructure matches the configuration."
# 6. Test drift detection - modify the dashboard via UI, then apply
# (Open Grafana UI, edit a panel title to "Modified Title", save)
terraform plan -var-file=environments/local.tfvars
# Expected: "1 to change" - Terraform detects the drift!
terraform apply -var-file=environments/local.tfvars
# Dashboard reverts to the state defined in code
Generate Traffic for the Dashboard
To see the HTTP Request Rate panel working, generate some requests to the sample app:
# Simple load test
for i in $(seq 1 100); do
curl -s http://localhost:8080/ > /dev/null
curl -s http://localhost:8080/err > /dev/null 2>&1
sleep 0.1
done
After a few minutes, the HTTP Request Rate panel will display the traffic pattern.
Real-World Considerations
Now that you understand the fundamentals, here are important considerations when implementing Dashboard-as-Code in a real production environment.
State Management
In production, Terraform state must be stored in a remote backend. Some options:
| Backend | Pros | Cons |
|---|---|---|
| S3 + DynamoDB | Cheap, reliable, native locking | Manual setup |
| Terraform Cloud | Free tier, built-in locking + UI | Vendor lock-in |
| GCS | Reliable, built-in locking | GCP only |
Example S3 backend configuration:
backend "s3" {
  bucket         = "mycompany-terraform-state"
  key            = "observability/grafana/terraform.tfstate"
  region         = "ap-southeast-1"
  encrypt        = true
  dynamodb_table = "terraform-locks"
}
Important: The state file can contain sensitive data. Always enable encryption (encrypt = true) and restrict access to the bucket.
Secrets Management
Never hardcode API keys in the repository. Recommended options:
- Environment Variables (Development)
export TF_VAR_grafana_api_key="glsa_xxxx"
- CI/CD Secrets (GitHub Actions, GitLab CI)
-var="grafana_api_key=$"
- Secret Manager (Production)
# Terraform data source from AWS Secrets Manager
data "aws_secretsmanager_secret_version" "grafana" {
  secret_id = "observability/grafana-api-key"
}

provider "grafana" {
  auth = data.aws_secretsmanager_secret_version.grafana.secret_string
}
Drift Detection
“Drift” occurs when the state in Grafana differs from what’s defined in code. Common causes:
- Someone edits a dashboard via the Grafana UI
- An automated process (provisioning plugin) modifies resources
- A Grafana upgrade changes the schema
How to detect drift:
# Scheduled check (can be run via cron in CI)
terraform plan -var-file=environments/prod.tfvars -detailed-exitcode
# Exit code 0 = no changes
# Exit code 2 = drift detected!
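A scheduled job needs to tell three outcomes apart, because -detailed-exitcode also returns 1 when the plan itself fails. The following sketch wraps the command in Python and classifies the result. The labels "in-sync", "drift" and "error" and the function names are choices made for this example; the working directory and var file follow the layout used in this article.

```python
import subprocess


def classify(exit_code: int) -> str:
    """Translate a `terraform plan -detailed-exitcode` status into a label."""
    return {0: "in-sync", 2: "drift"}.get(exit_code, "error")


def check_drift(workdir: str = "terraform",
                var_file: str = "environments/prod.tfvars") -> str:
    """Run the plan and classify it (requires Terraform and credentials)."""
    result = subprocess.run(
        [
            "terraform", "plan",
            f"-var-file={var_file}",
            "-detailed-exitcode",
            "-input=false",
            "-no-color",
        ],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify(result.returncode)


for code in (0, 1, 2):
    print(code, classify(code))
```

Treating "error" separately from "drift" matters for alerting: an expired token should page whoever owns the pipeline, not start a discussion about dashboard changes.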
Example cron job in GitHub Actions for drift detection:
on:
  schedule:
    - cron: '0 8 * * 1-5' # Every weekday at 08:00 UTC (GitHub Actions schedules run in UTC)
Tip from experience: Don’t auto-fix drift immediately. Notify the team first via Slack/Teams. Sometimes there are legitimate urgent changes made via the UI (during an incident). A good process: detect → notify → discuss → decide (fix or adopt).
Team Workflow
When working in a team, a few conventions help:
- Branch naming: feat/add-payment-dashboard, fix/adjust-cpu-threshold
- PR template with checklist:
  - Dashboard JSON valid
  - Terraform plan reviewed
  - Tested locally with docker compose up
  - No sensitive data in code
- Code owners for the dashboard directory:
# .github/CODEOWNERS
/dashboards/ @sre-team
/terraform/ @sre-team @platform-team
- Dashboard UID convention: <team>-<service>-<purpose>
  - sre-infra-overview
  - backend-payment-latency
  - platform-k8s-cluster
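A naming convention only helps if it is checked. As a minimal sketch, the validator below accepts lowercase, hyphen-separated UIDs with at least three segments and rejects anything longer than 40 characters, which is the maximum UID length Grafana allows. The exact pattern is an assumption about how strictly a team wants to read the convention.

```python
import re

# At least three lowercase alphanumeric segments: <team>-<service>-<purpose>.
UID_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+){2,}")
MAX_UID_LENGTH = 40  # Grafana rejects longer UIDs


def is_valid_uid(uid: str) -> bool:
    """Check a dashboard UID against the team convention and length limit."""
    return len(uid) <= MAX_UID_LENGTH and UID_PATTERN.fullmatch(uid) is not None


for uid in ("sre-infra-overview", "backend-payment-latency", "infra-overview"):
    print(uid, is_valid_uid(uid))
```

Under this reading, a two-segment UID such as infra-overview does not pass, so a team adopting the convention would either rename such dashboards or relax the pattern.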
Scaling Tips
When the dashboard count grows (50+), these patterns help:
- Modularize Terraform: Split by team/domain:
terraform/
├── modules/
│   ├── sre-dashboards/
│   ├── backend-dashboards/
│   └── platform-dashboards/
└── main.tf
- Dashboard Generation: For repetitive dashboards (per-service), use templating:
# Using Jsonnet/Grafonnet
jsonnet -J vendor service-dashboard.jsonnet \
  --tla-str service=payment > dashboards/templates/payment.json
- Import Existing Dashboards: If you already have many dashboards in Grafana:
# Export all dashboards (type=dash-db excludes folders from the search results)
for uid in $(curl -s "$GRAFANA_URL/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s "$GRAFANA_URL/api/dashboards/uid/$uid" | \
    jq '.dashboard' > "dashboards/templates/${uid}.json"
done
# Import into Terraform state (run once per dashboard, each with its own resource block)
terraform import grafana_dashboard.existing_dashboard "$uid"
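Running terraform import by hand does not scale to dozens of dashboards. Since this article already requires Terraform 1.5 or newer, one option is to generate declarative import blocks instead. The sketch below produces one block per UID; deriving the resource name from the UID is an assumption made here, and each block still needs a matching grafana_dashboard resource (which terraform plan -generate-config-out=generated.tf can draft).

```python
import re


def resource_name(uid: str) -> str:
    """Derive a Terraform-safe resource name from a dashboard UID."""
    name = re.sub(r"[^A-Za-z0-9_]", "_", uid)
    # Terraform identifiers must not start with a digit.
    return f"dashboard_{name}" if name[:1].isdigit() else name


def import_block(uid: str) -> str:
    """Render one Terraform import block for a dashboard UID."""
    return (
        "import {\n"
        f"  to = grafana_dashboard.{resource_name(uid)}\n"
        f'  id = "{uid}"\n'
        "}\n"
    )


def import_blocks(uids: list[str]) -> str:
    """Render import blocks for several dashboards, separated by blank lines."""
    return "\n".join(import_block(uid) for uid in uids)


print(import_blocks(["sre-infra-overview", "backend-payment-latency"]))
```

The output can be saved as, say, terraform/imports.tf, reviewed in a Pull Request like any other change, and deleted once the import has been applied.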
Troubleshooting
Common Issues
1. “Error: Provider produced inconsistent result”
Error: Provider produced inconsistent result after apply
Cause: Grafana modifies the dashboard JSON after save (adding version, id, etc.).
Solution: Make sure the JSON template doesn’t include id and version fields. Use lifecycle { ignore_changes = [...] } if needed:
resource "grafana_dashboard" "example" {
  config_json = file("...")

  lifecycle {
    ignore_changes = [config_json] # Be careful, this disables drift detection
  }
}
2. “Error: status: 401 Unauthorized”
Cause: API key expired or incorrect.
Solution:
# Check if the key is still valid
curl -s http://localhost:3000/api/org \
-H "Authorization: Bearer $TF_VA...key"
# If 401, create a new key
curl -s -X POST http://localhost:3000/api/auth/keys \
-H "Content-Type: application/json" \
-u admin:admin123 \
-d '{"name":"terraform-new","role":"Admin"}'
3. “Error: Folder not found” when deploying dashboard
Cause: Dashboard is being deployed before the folder is created.
Solution: Make sure the dependency is correct. Terraform should handle this automatically since grafana_folder.infrastructure.id is referenced in the dashboard resource. If the error persists, add an explicit dependency:
resource "grafana_dashboard" "infrastructure_overview" {
  depends_on = [grafana_folder.infrastructure]
  # ...
}
4. Docker Compose: Grafana “datasource not found”
Cause: Grafana starts before Prometheus is ready.
Solution: The depends_on entry in docker-compose.yaml already controls start order, but it does not wait for Prometheus to be ready. If the issue persists, restart Grafana:
docker compose restart grafana
5. Dashboard panels showing “No Data”
Common causes:
- Prometheus hasn’t scraped the target yet (wait 15-30 seconds)
- Datasource misconfigured
- Query syntax error
Debug steps:
# 1. Check if Prometheus has data
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result'
# 2. Check the datasource in Grafana
curl -s http://localhost:3000/api/datasources \
-u admin:admin123 | jq '.[0].url'
# Should be "http://prometheus:9090" (not localhost!)
# 3. Test query via Grafana proxy
curl -s 'http://localhost:3000/api/datasources/proxy/1/api/v1/query?query=up' \
-u admin:admin123 | jq '.data.result | length'
Cleanup
When you’re done with this lab:
# Remove Terraform resources from Grafana
cd terraform
terraform destroy -var-file=environments/local.tfvars
# Stop and remove Docker containers + volumes
cd ..
docker compose down -v
# (Optional) Remove the entire project
cd ..
rm -rf observability-as-code
Summary
In this article, we’ve built a complete Dashboard-as-Code system:
| Component | Tool | Purpose |
|---|---|---|
| Local Stack | Docker Compose | Development environment |
| Metrics | Prometheus | Data source |
| Visualization | Grafana | Dashboard rendering |
| IaC | Terraform + Grafana Provider | Dashboard management |
| CI/CD | GitHub Actions | Automated deployment |
Key takeaways:
- Dashboard = Code — Treated the same as application code (versioned, reviewed, tested)
- Terraform state — The single source of truth for the mapping between code and reality
- CI/CD pipeline — A safety net that prevents unreviewed changes from reaching production
- Drift detection — A mechanism to detect changes made outside of Terraform
What’s Next: Alerts as Code (Part 2)
In the next article, we’ll continue with Alerts-as-Code - managing Grafana alerting rules using Terraform. We’ll cover:
- Grafana Alerting (Unified Alerting) architecture
- Contact points & notification policies as code
- Alert rules with multi-dimensional queries
- Silence and mute timing management
- Integration with PagerDuty/Slack/OpsGenie
- Testing alerts before deployment (alert simulation)
Stay tuned! Follow this blog or star the repository for update notifications.
Resources
- Repository: github.com/ahakimx/observability-as-code
- Terraform Grafana Provider: registry.terraform.io/providers/grafana/grafana
- Grafana Provisioning: grafana.com/docs/grafana/latest/administration/provisioning
- Prometheus Configuration: prometheus.io/docs/prometheus/latest/configuration
- Grafonnet (Jsonnet): grafana.github.io/grafonnet/
This article is part of the Observability as Code series on ahakimx.id. This series covers how to apply Infrastructure-as-Code principles to the entire observability stack: dashboards, alerts, SLOs, and monitoring configuration.
