
Ch3: CI/CD 🟧

“It works on my machine,” said the student. “Then we shall ship your machine,” said Budo, inventing containers. The tests were still flaky.

Status: outline. Lab scaffolding in labs/ch03-cicd/.

The problem

Pipeline fails. Someone reruns it. It passes. Forty engineer-minutes evaporate, the flake survives to kill again. Classification (flaky / dependency break / infra / real regression) is exactly the judgment-over-evidence task agents are good at.

What you’ll build

budo ci <run-url-or-id> plus a webhook-driven service (TS/Node glue, Python agent core): on workflow failure → pull job logs via GitHub API → diff against last green run (deps lockfile, base SHA, runner image) → classify with evidence → comment on the PR. This is your first agent-as-a-service, not agent-as-REPL.
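
The "diff against last green run" step could look roughly like the sketch below. It is an illustration, not the lab's implementation: RunContext, changed_context and new_log_lines are hypothetical names, the logs and run metadata are assumed to be fetched already, and the timestamp pattern is an assumption about the log format that real logs may not match.

```python
import re
from dataclasses import dataclass, fields


# Hypothetical container for the facts compared between two workflow runs.
@dataclass(frozen=True)
class RunContext:
    lockfile_sha256: str  # hash of the dependency lockfile
    base_sha: str         # commit the PR branch was based on
    runner_image: str     # runner image name and version


def changed_context(green: RunContext, failed: RunContext) -> dict[str, tuple[str, str]]:
    """Return only the fields that differ, as {field: (green_value, failed_value)}."""
    return {
        f.name: (getattr(green, f.name), getattr(failed, f.name))
        for f in fields(RunContext)
        if getattr(green, f.name) != getattr(failed, f.name)
    }


# Assumed log noise: a leading ISO-8601 UTC timestamp on each line.
_TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z\s*")


def _normalize(line: str) -> str:
    """Strip the timestamp and trailing whitespace so identical steps compare equal."""
    return _TIMESTAMP.sub("", line).rstrip()


def new_log_lines(green_log: str, failed_log: str) -> list[str]:
    """Lines that appear in the failed log but not in the green one, in order."""
    seen = {_normalize(line) for line in green_log.splitlines()}
    out: list[str] = []
    for line in failed_log.splitlines():
        norm = _normalize(line)
        if norm and norm not in seen:
            out.append(norm)
            seen.add(norm)  # report each new line only once
    return out
```

The model then receives the changed fields and the handful of new lines as evidence, rather than two full logs.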

Key concepts introduced

  • Event-driven agents: webhooks, queueing, idempotency (Actions redelivers; your agent must not double-comment)
  • Log-diff as a tool: give the model differences, not 20k-line logs
  • The flaky-test corpus: we seed the shipd repo with the real kinds: port collision, time-dependence, test-order dependence, OOM-on-shared-runner
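
One way to meet the idempotency requirement from the list above is to claim each failed run in a store with a uniqueness constraint before commenting. The sketch below is an assumption-laden illustration: the (repo, run_id, run_attempt) key, the table layout, and the names open_store, claim and handle_failure are all hypothetical, and post_comment stands in for whatever actually calls the GitHub API.

```python
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the table that records which failed runs were handled."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS handled ("
        " repo TEXT NOT NULL,"
        " run_id INTEGER NOT NULL,"
        " run_attempt INTEGER NOT NULL,"
        " PRIMARY KEY (repo, run_id, run_attempt))"
    )
    return conn


def claim(conn: sqlite3.Connection, repo: str, run_id: int, run_attempt: int) -> bool:
    """Return True only for the first caller to claim this failed run."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO handled VALUES (?, ?, ?)",
            (repo, run_id, run_attempt),
        )
    # An ignored insert changes zero rows, so a redelivery sees rowcount == 0.
    return cur.rowcount == 1


def handle_failure(
    conn: sqlite3.Connection,
    repo: str,
    run_id: int,
    run_attempt: int,
    post_comment: Callable[[str], None],
) -> bool:
    """Comment at most once per (repo, run_id, run_attempt); report whether we did."""
    if not claim(conn, repo, run_id, run_attempt):
        return False  # redelivered event: already handled
    post_comment(f"Run {run_id} (attempt {run_attempt}) failed; classification to follow.")
    return True
```

Claiming before commenting trades a possibly missing comment (if the process dies between the two steps) for a guarantee of no duplicates, which is the property the belt test checks.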

Break it

A failure whose log contains a contributor-controlled string (test names are attacker-controlled!) that tries to make the agent approve the PR.
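
A common defence is to treat the model's output as untrusted too, and accept nothing from it except a label from a closed set plus free-text evidence. The sketch below assumes the agent core is prompted to return a JSON object with exactly the keys "classification" and "evidence"; that schema and the names Classification, RejectedOutput and parse_verdict are hypothetical, not part of the lab scaffolding.

```python
import json
from enum import Enum


class Classification(Enum):
    FLAKY = "flaky"
    DEPENDENCY_BREAK = "dependency_break"
    INFRA = "infra"
    REAL_REGRESSION = "real_regression"


class RejectedOutput(ValueError):
    """Raised when the model's output is anything other than a plain verdict."""


# Assumed output schema: exactly these two keys, nothing else.
ALLOWED_KEYS = {"classification", "evidence"}


def parse_verdict(raw: str) -> tuple[Classification, str]:
    """Validate the model's raw output and return (label, evidence)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise RejectedOutput("output is not valid JSON") from exc

    # Extra keys (for example an injected "action") are refused, not ignored.
    if not isinstance(data, dict) or set(data) != ALLOWED_KEYS:
        raise RejectedOutput("output does not match the expected shape")

    label, evidence = data["classification"], data["evidence"]
    if not isinstance(label, str) or not isinstance(evidence, str):
        raise RejectedOutput("classification and evidence must be strings")

    try:
        return Classification(label), evidence
    except ValueError as exc:
        raise RejectedOutput(f"unknown classification: {label!r}") from exc
```

Validation alone does not stop the injected text from skewing the label. The stronger guarantee comes from the service holding no approve capability at all, so the worst a hostile test name can do is produce a wrong comment.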

Belt test

10 historical failures (provided), ≥8 correctly classified; zero duplicate comments under webhook redelivery.