Ch4 β Observability π©
A student bragged: βMy dashboard has 200 panels.β Budo asked: βWhich one is lying to you right now?β The student is still looking.
Status: outline. Lab scaffolding in labs/ch04-observability/.
The problem
βp99 on checkout doubled at 14:00, why?β β answering this means writing PromQL, correlating deploy events, restarts, saturation, and downstream latency. The skill is mechanical given fluency; the agent has fluency.
What youβll build
budo why "<symptom>" β tools: promql_query, promql_range, loki_query, list_metrics, label_values, recent_deploys. Investigation in, evidence-ranked hypothesis out.
Key concepts introduced
- Grounding via discovery: the single biggest local-model failure here is hallucinated metric names. Fix is architectural, not prompt-level:
list_metrics/label_valuestools + a system prompt that mandates discovery-before-query. - Time-window discipline: agents that query
[7d]by reflex blow context and Prometheus alike. - Correlation method as prompt: USE/RED encoded as the investigation procedure.
The scenario
chaos-mesh (or toxiproxy) injects 300ms latency into productcatalogservice β symptom surfaces as frontend/checkout p99. Agent must walk the dependency chain to the true culprit, not blame the symptom service.
Break it
Rename a recording rule the agent learned to rely on; watch it hallucinate the old name. Then harden: discovery-first becomes mandatory, queries validated against list_metrics in code before execution.
Belt test
3 injected scenarios (downstream latency, memory leak β OOM cascade, deploy regression): culprit service named correctly in β₯2, with PromQL evidence quoted in the verdict.