# Monitoring Setup Guide Skill

Produce a complete monitoring setup guide for a service — defining exactly what to measure, how to structure logs, how to configure alerts with actionable thresholds, and how to build dashboards that answer real operational questions. A good monitoring guide eliminates "we don't know what's happening in production" as a root cause category, and gives on-call engineers a single source of truth for what healthy looks like.

## Required Inputs

Ask for these if not already provided:
- **Service name and description** — what the service does and its role in the system
- **Tech stack** — language, framework, and infrastructure (e.g. Go/gRPC on Kubernetes, Python/FastAPI on ECS)
- **Current monitoring tooling** — Datadog, Prometheus + Grafana, CloudWatch, New Relic, Honeycomb, or none yet
- **Key user journeys** — the 2–4 most important things a user or consumer does with the service (these drive what to alert on)
- **Existing alerts** — paste any existing alert configurations or describe what's currently monitored

## Output Format

---

# Monitoring Setup Guide: [Service Name]

**Team:** [Team name] | **Tech lead:** [Name]
**Stack:** [Language/Framework] on [Infrastructure]
**Monitoring platform:** [Datadog / Prometheus+Grafana / CloudWatch / etc.]
**Date:** [Date] | **Review cycle:** Quarterly

---

## 1. Monitoring Philosophy

Good monitoring answers three questions:
1. **Is the service healthy right now?** (alerting)
2. **Was it healthy in the past, and is it trending worse?** (dashboards + SLO tracking)
3. **Why did something fail?** (logs + traces)

This guide defines the answers for [Service Name]. Every alert must be actionable — if an on-call engineer cannot take a specific action in response to the alert, the alert should not exist.

**Key user journeys monitored:**
- Journey 1: [e.g. "User submits a payment — POST /charges, receives confirmation"]
- Journey 2: [e.g. "User views transaction history — GET /transactions"]
- Journey 3: [e.g. "Subscription renewal job runs — background worker processes billing events"]

---

## 2. The Four Golden Signals

Apply the four golden signals specifically to [Service Name]:

### Latency

Latency measures how long requests take to complete. Track it separately for successful and failed requests: errors that fail fast pull aggregate latency down and can mask slow successful requests, while slow errors (for example, timeouts) inflate it.
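
To see why the split matters, the sketch below computes a percentile over all requests and then per outcome. It is a minimal standard-library illustration: the function names, the `(duration_ms, status_code)` tuple layout, and the choice to treat any status below 500 as a success are assumptions, not part of any monitoring platform's API.

```python
from statistics import quantiles


def percentile(samples, pct):
    """Return the pct-th percentile (1 to 99) of durations in ms, or None if empty."""
    if not samples:
        return None
    if len(samples) == 1:
        return samples[0]
    # quantiles(n=100) returns the 99 cut points p1..p99.
    return quantiles(samples, n=100, method="inclusive")[pct - 1]


def latency_by_outcome(requests, pct=95):
    """requests: iterable of (duration_ms, status_code) tuples."""
    requests = list(requests)
    ok = [d for d, status in requests if status < 500]
    failed = [d for d, status in requests if status >= 500]
    return {
        "all": percentile(ok + failed, pct),
        "success": percentile(ok, pct),
        "failure": percentile(failed, pct),
    }


# Hypothetical sample: 60 requests fail fast (5 ms), 40 succeed slowly (400 ms).
sample = [(5, 503)] * 60 + [(400, 200)] * 40
result = latency_by_outcome(sample, pct=50)
# The aggregate median sits with the fast failures and hides the slow successes.
```

In this sample the aggregate median matches the fast failures, while the success-only median exposes the slow path, which is the case the `status_code` dimension is meant to catch.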

| Metric | Description | Source | Dimensions |
|---|---|---|---|
| `[service].request.duration_ms` | End-to-end request latency | Application instrumentation | `endpoint`, `method`, `status_code` |
| `[service].db.query_duration_ms` | Database query latency | ORM / query instrumentation | `query_name`, `table` |
| `[service].external.request_duration_ms` | Outbound call latency to dependencies | HTTP client instrumentation | `target_service`, `endpoint` |
| `[service].queue.processing_duration_ms` | Time to process one message (if applicable) | Consumer instrumentation | `queue_name`, `message_type` |

**Latency SLO targets:**

| Endpoint / operation | p50 target | p95 target | p99 target |
|---|---|---|---|
| `GET /api/v1/[resource]` | < [50] ms | < [200] ms | < [500] ms |
| `POST /api/v1/[resource]` | < [100] ms | < [400] ms | < [1000] ms |
| `GET /health` | < [10] ms | < [20] ms | < [50] ms |
| [Background job name] | < [5] sec | < [15] sec | < [60] sec |
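
One way to make the targets table machine-checkable is to store it as data and compare measured percentiles against it. The sketch below assumes a hypothetical `SLO_TARGETS_MS` mapping and a `slo_breaches` helper; the numbers are the bracketed placeholder defaults from the table and should be replaced with calibrated values.

```python
# Placeholder targets in milliseconds, mirroring the bracketed defaults in the table.
SLO_TARGETS_MS = {
    "GET /api/v1/resource": {"p50": 50, "p95": 200, "p99": 500},
    "POST /api/v1/resource": {"p50": 100, "p95": 400, "p99": 1000},
    "GET /health": {"p50": 10, "p95": 20, "p99": 50},
}


def slo_breaches(endpoint, measured_ms, targets=SLO_TARGETS_MS):
    """Return the percentile names whose measured value is not below its target."""
    endpoint_targets = targets[endpoint]
    return sorted(
        name
        for name, target in endpoint_targets.items()
        # Targets are strict upper bounds ("< 500 ms"), so equality counts as a breach.
        if name in measured_ms and measured_ms[name] >= target
    )


# Hypothetical measurements for one endpoint.
breaches = slo_breaches("GET /api/v1/resource", {"p50": 42, "p95": 180, "p99": 640})
```

Here only the p99 measurement exceeds its target, so `breaches` contains just that percentile.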

### Traffic

Traffic measures demand on the system. Use it to detect unexpected spikes and traffic drops (which can indicate upstream failures), and to plan capacity.

| Metric | Description | Source |
|---|---|---|
| `[service].request.count` | Requests per second | Application / load balancer |
| `[service].request.count_by_endpoint` | RPS broken down by endpoint | Application |
| `[service].queue.messages_consumed_per_second` | Consumer throughput | Queue consumer |
| `[service].queue.depth` | Messages waiting in queue | Queue metrics |

**Traffic baselines (update after observing production for 2+ weeks):**

| Time period | Expected RPS | Low-traffic floor | Spike ceiling |
|---|---|---|---|
| Peak (weekday business hours) | [N] RPS | [N × 0.5] RPS | [N × 5] RPS |
| Off-peak (nights/weekends) | [N × 0.2] RPS | [N × 0.05] RPS | [N] RPS |
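
The floor and ceiling columns can be expressed as ratios of the expected rate, which makes the baseline easy to re-tune once production data is available. The following is a minimal sketch; `TrafficBaseline`, its field names, and the sample value of 200 RPS are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficBaseline:
    expected_rps: float
    floor_ratio: float    # low-traffic floor as a fraction of expected RPS
    ceiling_ratio: float  # spike ceiling as a multiple of expected RPS

    def classify(self, observed_rps: float) -> str:
        """Classify an observed rate as 'drop', 'spike', or 'normal'."""
        if observed_rps < self.expected_rps * self.floor_ratio:
            return "drop"
        if observed_rps > self.expected_rps * self.ceiling_ratio:
            return "spike"
        return "normal"


# Peak row of the table with N = 200: floor is N x 0.5, ceiling is N x 5.
peak = TrafficBaseline(expected_rps=200, floor_ratio=0.5, ceiling_ratio=5.0)
# Off-peak row: expected N x 0.2 = 40, floor N x 0.05 = 10, ceiling N = 200.
off_peak = TrafficBaseline(expected_rps=40, floor_ratio=0.25, ceiling_ratio=5.0)
```

With these values, 90 RPS during peak hours is classified as a drop, while the same 90 RPS off-peak is normal.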

### Errors

Errors measure the fraction of requests that fail. Distinguish between client errors (4xx — caller is doing something wrong) and server errors (5xx — the service is broken).

| Metric | Description | Alert on? |
|---|---|---|
| `[service].request.error_rate` | 5xx errors / total requests | Yes — see alert rules |
| `[service].request.client_error_rate` | 4xx errors / total requests | Threshold alert — sudden spike may indicate API misuse |
| `[service].dependency.error_rate` | Errors calling downstream dependencies | Yes (dependency health signal) |
| `[service].queue.dlq_depth` | Messages in dead-letter queue | Yes — indicates processing failures |
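
The two HTTP error rates in the table share a denominator (total requests) and differ only in the status class they count. A minimal sketch, assuming a hypothetical `error_rates` helper fed with raw status codes:

```python
from collections import Counter


def error_rates(status_codes):
    """Return server (5xx) and client (4xx) error rates as fractions of all requests."""
    total = len(status_codes)
    if total == 0:
        # No traffic: report zero rather than dividing by zero.
        return {"server_error_rate": 0.0, "client_error_rate": 0.0}
    classes = Counter(code // 100 for code in status_codes)
    return {
        "server_error_rate": classes[5] / total,
        "client_error_rate": classes[4] / total,
    }


# Hypothetical window: 95 successes, 3 client errors, 2 server errors.
rates = error_rates([200] * 95 + [404] * 3 + [500, 503])
```

Keeping the two rates separate means a burst of 4xx responses from a misbehaving caller does not trip the 5xx alert.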

### Saturation

Saturation measures how "full" the service is: how close its constrained resources are to maximum capacity.

| Resource | Metric | Alert threshold | Source |
|---|---|---|---|
| CPU | `[service].cpu.utilisation_pct` | >80% sustained 5 min | Container / VM metrics |
| Memory | `[service].memory.utilisation_pct` | >85% sustained 5 min | Container / VM metrics |
| DB connections | `[service].db.connection_pool.utilisation_pct` | >75% | Application / DB metrics |
| Thread pool / goroutines | `[service].runtime.goroutine_count` / `thread_count` | >N (establish baseline) | Runtime metrics |
| Disk (if applicable) | `[service].disk.utilisation_pct` | >75% | Infrastructure |
| Queue depth (if applicable) | `[service].queue.depth` | >[backlog threshold] | Queue metrics |
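
Thresholds such as ">80% sustained 5 min" only fire when every sample in the window is above the limit, which filters out short bursts. The sketch below shows one possible reading of "sustained"; the function name and the `(unix_ts, utilisation_pct)` sample layout are assumptions, and real platforms implement this with their own `for` or evaluation-window settings.

```python
def sustained_breach(samples, threshold_pct, window_seconds):
    """samples: list of (unix_ts, utilisation_pct) tuples, oldest first.

    True only if the samples cover the whole trailing window and every
    sample inside it is above the threshold.
    """
    if not samples:
        return False
    window_start = samples[-1][0] - window_seconds
    # Without data reaching back to the window start, "sustained" cannot be claimed.
    if samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return all(value > threshold_pct for value in recent)


# Hypothetical CPU samples, one per minute for five minutes.
steady_high = [(t, 85.0) for t in range(0, 301, 60)]
with_dip = steady_high[:3] + [(180, 70.0)] + steady_high[4:]
```

`steady_high` breaches an 80% threshold over 300 seconds, while `with_dip` does not because one sample falls below the limit.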

---

## 3. Business Metrics

Beyond the golden signals, track metrics that measure whether the service is delivering business value. These matter for SLO reporting and product dashboards.

| Metric | Description | Source | Alert? |
|---|---|---|---|
| `[service].[primary_action].success_rate` | [e.g. "Payment success rate"] | Application | Yes — if drops >5% vs 1h average |
| `[service].[primary_action].count` | [e.g. "Payments processed per minute"] | Application | Yes — sudden drop (traffic anomaly) |
| `[service].[resource].created_per_hour` | [e.g. "New accounts created"] | Application / DB | No — informational |
| `[service].cache.hit_rate` | Fraction of requests served from cache | Cache instrumentation | Yes — if drops below [60]% |
| `[service].job.[name].success_rate` | [Background job success rate] | Job framework | Yes — if drops below [99]% |

---

## 4. Log Strategy

### Structured Logging Schema

All logs must be structured JSON. Do not emit unstructured text logs in production. Every log line must include the mandatory fields.

**Mandatory fields (every log line):**

```json
{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "level": "info",
  "service": "[service-name]",
  "version": "[git-sha-short]",
  "trace_id": "[uuid-from-request-context]",
  "span_id": "[span-uuid]",
  "request_id": "[uuid-per-request]",
  "message": "[human readable description]"
}
```
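
As an illustration of how the mandatory fields could be produced, here is a minimal sketch built on Python's standard `logging` module. `JsonFormatter` and `build_logger` are hypothetical names, and passing correlation IDs through `extra=` is an assumption; in a real service they would usually come from request or tracing context.

```python
import json
import logging
from datetime import datetime, timezone

# Map stdlib level names onto the level names used in this guide.
LEVEL_NAMES = {"WARNING": "warn", "CRITICAL": "error"}


class JsonFormatter(logging.Formatter):
    def __init__(self, service, version):
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            "service": self.service,
            "version": self.version,
            # Correlation IDs are read from `extra=`; None when not supplied.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service, version, stream=None):
    """Return a logger that writes one JSON object per line to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, version))
    logger = logging.getLogger(service)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `build_logger("billing-api", "abc1234").info("payment processed", extra={"trace_id": "...", "span_id": "...", "request_id": "..."})` would then emit a single JSON line containing all eight mandatory fields.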

**Request log (emit for every HTTP request):**

```json
{
  "timestamp": "...",
  "level": "info",
  "service": "[service-name]",
  "version": "[git-sha-short]",
  "event": "http_request",
  "message": "POST /api/v1/[resource] completed with 201",
  "method": "POST",
  "path": "/api/v1/[resource]",
  "status_code": 201,
  "duration_ms": 45,
  "user_id": "[opaque uuid; DO NOT log PII directly]",
  "request_id": "[uuid]",
  "trace_id": "[uuid]",
  "span_id": "[span-uuid]"
}
```

**Error log (emit for every error with context):**

```json
{
  "timestamp": "...",
  "level": "error",
  "service": "[service-name]",
  "version": "[git-sha-short]",
  "event": "error",
  "message": "[human readable description]",
  "error_code": "[application-error-code]",
  "error_message": "[description, no sensitive data]",
  "stack_trace": "[stack trace]",
  "request_id": "[uuid]",
  "trace_id": "[uuid]",
  "span_id": "[span-uuid]",
  "context": {
    "[key]": "[relevant context without PII]"
  }
}
```

### Log Levels — When to Use Each

| Level | Use when | Example |
|---|---|---|
| `error` | Something failed and needs attention; a sustained rate of these should trip an alert that pages on-call | Database query failed, external API returned 5xx, required config missing |
| `warn` | Something unexpected happened but service is still functioning | Retry succeeded after failure, cache miss on expected hit, rate limit approaching |
| `info` | Significant business events and request lifecycle | Request received, payment processed, user authenticated, job started/completed |
| `debug` | Detailed diagnostic information — off in production by default | Query parameters, intermediate computation results, cache key lookups |

### What NOT to Log

**Never log:**
- Passwords, tokens, API keys, or secrets (even hashed)
- Full credit card numbers or PAN data
- Social security numbers or government IDs
- Full names + dates of birth + contact info in the same log line (PII aggregation)
- Request/response bodies in full (use field-level extraction instead)
- Health check requests (too noisy — exclude `GET /health` from access logs)
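
A deny-list is easier to enforce when it also exists in code as a last line of defence before a log entry is serialized. The sketch below is one possible approach; the `SENSITIVE_KEYS` set and the `redact` helper are hypothetical, and key-name matching cannot catch sensitive values stored under innocuous keys, so it complements review rather than replacing it.

```python
# Hypothetical deny-list of field names; extend it to match the service's own schema.
SENSITIVE_KEYS = {
    "password", "token", "api_key", "secret",
    "authorization", "card_number", "ssn",
}


def redact(value):
    """Recursively replace values stored under sensitive keys with a marker."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


# Hypothetical log context containing fields that must never reach the log pipeline.
context = {
    "order_id": "ord_123",
    "Authorization": "Bearer abc",
    "items": [{"sku": "A1", "card_number": "4111111111111111"}],
}
safe_context = redact(context)
```

The original `context` is left unchanged; only the returned copy has the sensitive values replaced.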

---

## 5. Distributed Tracing Setup

Distributed tracing is mandatory for any service that calls other services. It enables root-cause analysis across service boundaries.

### Instrumentation Checklist

```
[ ] Tracing library installed:
    - Go: go.opentelemetry.io/otel
    - Python: opentelemetry-sdk, opentelemetry-instrumentation
    - Node: @opentelemetry/sdk-node
    - Java: opentelemetry-java-instrumentation

[ ] Tracer initialized at service startup with service name and version

[ ] Trace context propagated via W3C Trace Context headers:
    traceparent: 00-[trace-id]-[span-id]-01
    tracestate: [optional vendor-specific]

[ ] Automatic instrumentation enabled for:
    [ ] Inbound HTTP/gRPC requests (creates root span)
    [ ] Outbound HTTP/gRPC calls (creates child spans)
    [ ] Database queries (creates child spans with sanitized query)
    [ ] Cache operations (Redis, Memcached)
    [ ] Message queue produce/consume

[ ] Custom spans added for:
    [ ] Key business operations ([e.g. payment processing, user lookup])
    [ ] Background jobs (each job execution = root span)
    [ ] Third-party API calls with custom attributes

[ ] Span attributes to capture on all spans:
    - user.id (if authenticated — no PII)
    - deployment.environment (production/staging)
    - service.version (git SHA)
    - [service-specific key attributes]

[ ] Trace exporter configured to: [Datadog / Jaeger / Tempo / OTLP endpoint]

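[ ] Sampling rate configured:
    - Production: [1–10]% of requests (adjust based on volume and cost)
    - Always sample: errors, slow requests (>p99 threshold), and 100% of [critical endpoint]
      (keeping errors and slow requests requires tail-based sampling, because a
      head-based sampler decides before the outcome of the request is known)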
```
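
For reference when debugging propagation, the `traceparent` header in the checklist has four dash-separated hex fields: version, trace ID (32 characters), parent span ID (16 characters), and flags. The sketch below parses and regenerates that layout by hand purely for illustration; in practice the OpenTelemetry propagators handle this, and the helper names here are hypothetical.

```python
import re
import secrets

# Four-field layout only: version, trace-id, parent-id, trace-flags.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid under W3C Trace Context.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(parent):
    """Build the header for an outbound call: same trace ID, new span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes as 16 hex characters
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming)
```

When verifying propagation with a test request, the trace ID in `outgoing` should match the one in the incoming header at every hop, while the span ID changes.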

### Trace Instrumentation Examples

```python
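# Python: OpenTelemetry example
from opentelemetry import trace

tracer = trace.get_tracer("[service-name]")


class PaymentError(Exception):
    """Placeholder for the service's own payment failure exception."""


def _do_process(payment_data):
    """Placeholder for the real payment logic."""
    raise NotImplementedError


def process_payment(payment_data):
    # One span per business operation; it becomes a child of the current request span.
    with tracer.start_as_current_span("process_payment") as span:
        # Attach non-sensitive attributes that help when filtering traces.
        span.set_attribute("payment.amount_cents", payment_data["amount"])
        span.set_attribute("payment.currency", payment_data["currency"])
        # Never: span.set_attribute("payment.card_number", ...)
        try:
            result = _do_process(payment_data)
            span.set_status(trace.StatusCode.OK)
            return result
        except PaymentError as e:
            # Mark the span as failed and keep the exception details on it.
            span.set_status(trace.StatusCode.ERROR, str(e))
            span.record_exception(e)
            raise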

---

## 6. Alert Rules Specification

Every alert must have: a name, a condition, a threshold, a severity, and a clear on-call action. Alerts without a clear action should not exist.

### Alert Definitions

| Alert name | Condition | Threshold | Severity | On-call action |
|---|---|---|---|---|
| `[Service]HighErrorRate` | 5xx error rate, 5-min rolling window | >1% for 2 consecutive windows | P1 | Check recent deploys; inspect error logs; see runbook [link] |
| `[Service]CriticalErrorRate` | 5xx error rate, 2-min rolling window | >5% | P1 — immediate | Same as above — page immediately, do not wait |
| `[Service]HighP99Latency` | p99 latency on key endpoints | >2× SLO target for 3 min | P2 | Check DB latency, cache hit rate, and downstream dependencies |
| `[Service]LatencySLOBreach` | p99 latency | >SLO target for 5 consecutive minutes | P1 | SLO burn — page on-call, escalate if not resolved in 20 min |
| `[Service]HighCPU` | CPU utilisation | >80% sustained for 5 min | P2 | Check for traffic spike; scale up if needed; check for runaway processes |
| `[Service]HighMemory` | Memory utilisation | >85% sustained for 5 min | P2 | Check for memory leak (especially after deploys); restart pod if OOM imminent |
| `[Service]DBConnectionPoolHigh` | DB connection pool utilisation | >75% | P2 | Check for long-running queries; consider scaling service or increasing pool size |
| `[Service]DLQDepthHigh` | Dead-letter queue depth | >10 messages | P2 | Inspect DLQ messages for error pattern; fix bug and replay if safe |
| `[Service]TrafficDropAnomaly` | RPS, compared to same hour yesterday | >50% drop sustained 5 min | P1 | Upstream may be down; check caller health; check load balancer |
| `[Service]PrimaryActionSuccessRateDrop` | [Business metric success rate] | <[95]% over 10 min | P1 | [Service-specific action — e.g. "Check payment provider status"] |
| `[Service]DownstreamDependencyErrors` | Error rate calling [dependency] | >5% over 5 min | P2 | Check [dependency] status page; enable fallback if available |

### Alert Configuration Examples

```yaml
# Prometheus / Grafana alerting rules (adapt for your platform)
groups:
  - name: [service-name]-alerts
    rules:

      - alert: [Service]HighErrorRate
        expr: |
          (
            sum(rate([service]_http_requests_total{status=~"5.."}[5m]))
            /
            sum(rate([service]_http_requests_total[5m]))
          ) > 0.01
        for: 2m
        labels:
          severity: critical
          team: [team-name]
        annotations:
          summary: "High error rate on [Service Name]"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "[runbook link]"

      - alert: [Service]HighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate([service]_http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          ) > [0.5]
        for: 3m
        labels:
          severity: warning
          team: [team-name]
        annotations:
          summary: "p99 latency elevated on [Service Name]"
          description: "p99 latency on {{ $labels.endpoint }} is {{ $value | humanizeDuration }}"
          runbook_url: "[runbook link]"
```

```python
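# Datadog monitor configuration (Python SDK shown; the same monitor can be defined in Terraform)
import os

import datadog

# Read credentials from the environment instead of hard-coding them.
# The variable names DD_API_KEY and DD_APP_KEY are an assumption; use your own.
datadog.initialize(api_key=os.environ["DD_API_KEY"], app_key=os.environ["DD_APP_KEY"])

METRIC_PREFIX = "[service]"      # metric namespace placeholder
SERVICE_TAG = "[service-name]"   # value of the `service` tag

# Doubled braces in an f-string emit the literal braces Datadog uses for tag scopes.
query = (
    f"sum(last_5m):sum:{METRIC_PREFIX}.http.errors{{service:{SERVICE_TAG}}}"
    f" / sum:{METRIC_PREFIX}.http.requests{{service:{SERVICE_TAG}}} > 0.01"
)

datadog.api.Monitor.create(
    type="metric alert",
    query=query,
    name="[Service] High Error Rate",
    message="Error rate exceeded 1%. @pagerduty-[service-oncall]\n\nRunbook: [link]",
    tags=["service:[service-name]", "team:[team-name]"],
    options={
        "thresholds": {"critical": 0.01, "warning": 0.005},
        "notify_no_data": False,
        "evaluation_delay": 60,
    }
)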
```

---

## 7. Dashboard Layout Specification

The primary service dashboard must answer "is the service healthy right now?" at a glance. Use this layout:

```
┌─────────────────────────────────────────────────────────────────────┐
│  [SERVICE NAME] — Service Health Dashboard           [Time range ▼] │
├───────────────┬───────────────┬───────────────┬─────────────────────┤
│  Error rate   │  p99 Latency  │  RPS (current)│ SLO budget remaining│
│  [BIG NUMBER] │  [BIG NUMBER] │  [BIG NUMBER] │ [BIG NUMBER / days] │
│  vs SLO: 0.1% │  vs SLO: 500ms│  vs avg: [N]  │ [Error budget gauge]│
├───────────────┴───────────────┴───────────────┴─────────────────────┤
│                   Error rate over time (24h)                        │
│  [Time series: 5xx rate line, SLO threshold line]                   │
├─────────────────────────────────┬───────────────────────────────────┤
│  Latency percentiles over time  │  Request throughput over time     │
│  [Lines: p50, p95, p99, p999]   │  [Bars: RPS by endpoint]          │
│  [SLO threshold horizontal line]│                                   │
├─────────────────────────────────┴───────────────────────────────────┤
│  Latency heatmap (all requests — shows distribution shape)          │
├─────────────────────────────────┬───────────────────────────────────┤
│  CPU utilisation over time      │  Memory utilisation over time     │
│  [All instances/pods — lines]   │  [All instances/pods — lines]     │
│  [Alert threshold: 80%]         │  [Alert threshold: 85%]           │
├─────────────────────────────────┼───────────────────────────────────┤
│  DB: connection pool utilisation│  DB: query latency (p99 per query)│
├─────────────────────────────────┼───────────────────────────────────┤
│  [Business metric 1 over time]  │  [Business metric 2 over time]    │
│  e.g. Payment success rate      │  e.g. Orders created/min          │
└─────────────────────────────────┴───────────────────────────────────┘
```
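
The "SLO budget remaining" panel can be derived from two counters: total requests and failed requests over the SLO window. A minimal sketch of the arithmetic, assuming a hypothetical `error_budget_remaining` helper and the 0.1% error-rate SLO shown in the layout:

```python
def error_budget_remaining(allowed_error_rate, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent in the window.

    1.0 means nothing has been spent; 0.0 means the budget is exhausted;
    negative values mean the SLO has been violated.
    """
    if total_requests == 0:
        return 1.0
    allowed_failures = allowed_error_rate * total_requests
    if allowed_failures == 0:
        # A zero-tolerance SLO: any failure overspends the budget.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests, 250 failures, 0.1% allowed error rate.
remaining = error_budget_remaining(0.001, 1_000_000, 250)
```

With 1,000 failures allowed and 250 observed, roughly three quarters of the budget remains, which is the number the gauge would display.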

**Second dashboard — Dependency Health:**

```
┌─────────────────────────────────────────────────────────────────────┐
│  [SERVICE NAME] — Dependency Health                                 │
├─────────────────────────────────────────────────────────────────────┤
│  For each dependency: error rate | latency | current status         │
│  [Database]    [N]% errors | [N]ms p99 | ● Healthy / ⚠ Degraded    │
│  [Redis]       [N]% errors | [N]ms p99 | ● Healthy                 │
│  [External API][N]% errors | [N]ms p99 | ● Healthy                 │
├─────────────────────────────────────────────────────────────────────┤
│  Outbound call latency over time (one line per dependency)          │
├─────────────────────────────────────────────────────────────────────┤
│  Circuit breaker / fallback state (if implemented)                  │
└─────────────────────────────────────────────────────────────────────┘
```

---

## 8. Observability Debt Analysis

An honest assessment of what is missing today and how urgently each gap should be closed:

| Gap | Impact | Priority | Effort | Owner | Target date |
|---|---|---|---|---|---|
| [e.g. No distributed tracing — can't see cross-service latency] | High — blind to dependency issues | P1 | [2 days] | [Name] | [Date] |
| [e.g. No business metric alerts — only infra alerts] | High — silent business failures | P1 | [1 day] | [Name] | [Date] |
| [e.g. Logs are unstructured text — not searchable] | Medium — slow incident investigation | P2 | [3 days] | [Name] | [Date] |
| [e.g. No dead-letter queue monitoring] | Medium — failed messages go unnoticed | P2 | [4 hours] | [Name] | [Date] |
| [e.g. Alert thresholds not calibrated to production baseline] | Medium — alert fatigue or missed alerts | P2 | [1 day] | [Name] | [Date] |
| [e.g. No latency heatmap — outliers invisible in averages] | Low — harder to spot tail latency issues | P3 | [2 hours] | [Name] | [Date] |

**Total observability debt: [N] items | Estimated effort: [N days]**

---

## Quality Checks

- [ ] Every alert has a named on-call action — no alert says "investigate" without specifying what to investigate first
- [ ] Alert thresholds are calibrated against production baselines, not set to default values from a template
- [ ] Structured logging is implemented — no unstructured text log lines in production
- [ ] PII is explicitly excluded from logs — a named engineer has verified this
- [ ] Distributed tracing is propagating trace IDs across all service boundaries (verify with a test request)
- [ ] The primary dashboard answers "is the service healthy?" in under 10 seconds — no hunting for the right panel
- [ ] Business metrics are tracked alongside infrastructure metrics, not just the four golden signals
- [ ] Observability debt items have owners and dates — not just "would be nice to have"

## Anti-Patterns

- [ ] Do not create alerts without a specific on-call action — an alert that just says "investigate" trains engineers to ignore it
- [ ] Do not set alert thresholds from a template without calibrating against production baselines — uncalibrated thresholds cause either alert fatigue or missed incidents
- [ ] Do not log PII, tokens, or secrets — a logging standard is incomplete without an explicit list of what must never be logged
- [ ] Do not measure only the four golden signals without adding at least one business metric alert — infrastructure health can be green while the business-critical path is silently failing
- [ ] Do not deploy distributed tracing without verifying that trace IDs propagate across all service boundaries — partial tracing is worse than no tracing because it produces misleading incomplete traces