Skip to content

Ch4 β€” Observability 🟩

A student bragged: β€œMy dashboard has 200 panels.” Budo asked: β€œWhich one is lying to you right now?” The student is still looking.

Status: outline. Lab scaffolding in labs/ch04-observability/.

The problem

β€œp99 on checkout doubled at 14:00, why?” β€” answering this means writing PromQL, correlating deploy events, restarts, saturation, and downstream latency. The skill is mechanical given fluency; the agent has fluency.

What you’ll build

budo why "<symptom>" β€” tools: promql_query, promql_range, loki_query, list_metrics, label_values, recent_deploys. Investigation in, evidence-ranked hypothesis out.

Key concepts introduced

  • Grounding via discovery: the single biggest local-model failure here is hallucinated metric names. Fix is architectural, not prompt-level: list_metrics/label_values tools + a system prompt that mandates discovery-before-query.
  • Time-window discipline: agents that query [7d] by reflex blow context and Prometheus alike.
  • Correlation method as prompt: USE/RED encoded as the investigation procedure.

The scenario

chaos-mesh (or toxiproxy) injects 300ms latency into productcatalogservice β†’ symptom surfaces as frontend/checkout p99. Agent must walk the dependency chain to the true culprit, not blame the symptom service.

Break it

Rename a recording rule the agent learned to rely on; watch it hallucinate the old name. Then harden: discovery-first becomes mandatory, queries validated against list_metrics in code before execution.

Belt test

3 injected scenarios (downstream latency, memory leak β†’ OOM cascade, deploy regression): culprit service named correctly in β‰₯2, with PromQL evidence quoted in the verdict.