Monitoring and Observability¶

Monitoring answers "what is broken?" (symptoms) and "why?" (causes). Observability means building a system you can ask any questions about and discover new correlations. Goal: maximize signal, minimize noise.

Four Golden Signals¶

Latency - time to serve requests. Distinguish successful vs failed request latency
Traffic - demand on system (requests/sec, transactions/sec)
Errors - rate of failed requests (explicit 5xx + implicit slow responses)
Saturation - how "full" the service is (CPU, memory, I/O, queue depth)

Measuring Saturation¶

Determine acceptable latency threshold (e.g., 200ms)
Saturation can be caused by equipment degradation, not just load
Sawtooth pattern: periodic saturation (e.g., Elasticsearch index merging) causes throughput drops then spikes

Three Pillars of Observability¶

Pillar	What	Tools
Logging	What happened	Loki, ELK, CloudWatch Logs
Metrics	How system performs	Prometheus, CloudWatch, Datadog
Tracing	Request flow across services	Jaeger, Tempo, Zipkin

Prometheus¶

Pull-based time-series metrics collection:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'app'
    static_configs:
      - targets: ['app:8080']

PromQL queries: rate(http_server_requests_seconds_count[5m]), jvm_memory_used_bytes{area="heap"}

Service discovery: Kubernetes, Consul, file-based. Alertmanager handles alert routing, grouping, silencing.

Grafana¶

Visualization platform for Prometheus, Loki, Tempo, Elasticsearch data:

Import pre-built dashboards (e.g., Node Exporter Full: ID 1860)
Panels: graph, stat, gauge, table
Dashboard variables for filtering
Alerting rules with contact points (email, Slack, PagerDuty)

Grafana + Loki (Centralized Logging)¶

Loki indexes labels only (not content) - much lighter than ELK:

LogQL: {container_name="accounts"} |= "error"
Rate:  rate({container_name="accounts"}[5m])

Promtail agent tails log files on each node, sends to Loki.

Grafana Tempo (Distributed Tracing)¶

Receives traces via OTLP. Visualizes request flow as waterfall diagram. Link from Loki logs to Tempo traces via traceId correlation.

OpenTelemetry¶

Vendor-neutral instrumentation standard for traces, metrics, logs. Each request gets: - traceId - unique per request chain - spanId - unique per service hop

W3C Trace Context header propagated across services automatically.

SLI, SLO, SLA¶

SLI (Service Level Indicator)¶

Quantitative measure: latency p95/p99, availability %, error rate, throughput. Measured at user-facing boundary.

SLO (Service Level Objective)¶

Target for SLI: "99.9% of requests in < 200ms over 30 days." Internal engineering target.

SLA (Service Level Agreement)¶

SLO + consequences (refunds, credits). Typically more lenient than SLO.

Error Budget¶

100% - SLO = error budget. SLO 99.9% = 0.1% budget = 43.2 min/month. Budget exhausted -> freeze releases, focus on reliability. Budget remaining -> deploy faster.

Problem Lifecycle¶

OK -> Soft -> Hard -> Notify -> Escalate -> Acknowledge -> Resolve

Every notification must lead to an action
Alerts via various transports: email, SMS, IM (offband delivery)
Context must be attached to every alert

Alerting Philosophy¶

Alert on symptoms, not causes (user impact, not CPU usage)
Every alert should be actionable
Every page should require human intelligence to resolve
Distinguish ticket-worthy vs page-worthy
Multi-window, multi-burn-rate: fast burn = page, slow burn = ticket

Data Sources for SLI¶

Application metrics: implemented in code. Competes with main tasks for resources
Infrastructure APIs: cloud, WAF, network equipment. Great detail, historical data. Knows nothing about app logic
Third-party monitoring: external perspective. Useful for SLA validation
Synthetic probes: black-box checks simulating user behavior

Cloud-Specific Monitoring¶

AWS CloudWatch¶

aws cloudwatch put-metric-alarm \
  --alarm-name "HighCPU" \
  --metric-name "CPUUtilization" \
  --namespace "AWS/EC2" \
  --statistic "Average" \
  --period 300 --threshold 80 \
  --comparison-operator "GreaterThanThreshold" \
  --alarm-actions arn:aws:sns:us-east-1:123456:alert-topic

Azure Monitor (Container Insights)¶

KQL queries on container logs, metrics, performance data. OMS Agent as DaemonSet on each node.

Gotchas¶

Monitoring != Observability. Monitoring checks known issues; observability discovers unknowns
Percentiles (p50, p95, p99) over averages for latency - averages hide outliers
APDEX (Application Performance Index) standardizes user satisfaction: Satisfied (4T)
Spring Boot: spring-boot-starter-actuator + micrometer-registry-prometheus for /actuator/prometheus
Error budget burn rate tracks how fast budget is consumed - a single incident can consume minutes of budget