Ch3: CI/CD 🚧
“It works on my machine,” said the student. “Then we shall ship your machine,” said Budo, inventing containers. The tests were still flaky.
Status: outline. Lab scaffolding in labs/ch03-cicd/.
The problem
Pipeline fails. Someone reruns it. It passes. Forty engineer-minutes evaporate, and the flake survives to kill again. Classification (flaky / dependency break / infra / real regression) is exactly the judgment-over-evidence task agents are good at.
What you’ll build
budo ci <run-url-or-id> plus a webhook-driven service (TS/Node glue, Python agent core): on workflow failure → pull job logs via the GitHub API → diff against the last green run (deps lockfile, base SHA, runner image) → classify with evidence → comment on the PR. This is your first agent-as-a-service, not agent-as-REPL.
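The "diff against the last green run" step can be sketched as a field-by-field comparison of two run snapshots. This is a minimal illustration, not the lab's actual code: RunContext, its field names, and diff_context are hypothetical, and how each value is collected from the GitHub API is left out.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunContext:
    """Hypothetical snapshot of the inputs that often explain a red build."""
    lockfile_sha256: str  # hash of the dependency lockfile at that commit
    base_sha: str         # base commit the PR was built against
    runner_image: str     # runner image label reported for the job


def diff_context(failed: RunContext, green: RunContext) -> dict[str, tuple[str, str]]:
    """Return only the fields that changed, as {field: (green_value, failed_value)}."""
    changed = {}
    for f in fields(RunContext):
        before = getattr(green, f.name)
        after = getattr(failed, f.name)
        if before != after:
            changed[f.name] = (before, after)
    return changed
```

With made-up values, a failed run that differs from the green one only in its lockfile hash yields a single entry keyed lockfile_sha256, which is evidence for "dependency break". An empty result means the inputs match, which points toward flaky or infra rather than a changed dependency.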
Key concepts introduced
- Event-driven agents: webhooks, queueing, idempotency (GitHub webhook deliveries can be redelivered; your agent must not double-comment)
- Log-diff as a tool: give the model differences, not 20k-line logs
- The flaky-test corpus: we seed the shipd repo with the real kinds: port collision, time-dependence, test-order dependence, OOM-on-shared-runner
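One way to picture the log-diff tool: normalize away the parts of a log line that change on every run, then keep only the lines the green run never produced. The sketch below assumes each raw log line starts with an ISO-8601 UTC timestamp; the function names, the regexes, and the 200-line cap are illustrative choices, not part of the lab scaffolding.

```python
import re

# Volatile fragments that differ between any two runs and carry no signal.
_TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z\s*")
_DURATION = re.compile(r"\b\d+(?:\.\d+)?\s?(?:ms|s)\b")
_HEX_ID = re.compile(r"\b[0-9a-f]{7,40}\b")  # commit SHAs, container ids, etc.


def normalize(line: str) -> str:
    """Strip the leading timestamp and mask durations and hex identifiers."""
    line = _TIMESTAMP.sub("", line.rstrip())
    line = _DURATION.sub("<dur>", line)
    return _HEX_ID.sub("<hex>", line)


def new_lines(failed_log: str, green_log: str, limit: int = 200) -> list[str]:
    """Return normalized lines from the failed log that the green log lacks."""
    seen = {normalize(line) for line in green_log.splitlines()}
    out: list[str] = []
    for raw in failed_log.splitlines():
        norm = normalize(raw)
        if norm and norm not in seen:
            out.append(norm)
            seen.add(norm)  # report each novel line once
            if len(out) >= limit:
                break
    return out
```

This is a set difference, not a positional diff, so it ignores reordering between runs. That is usually what you want for classification, but it will hide a test-order problem whose only symptom is the order itself.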
Break it
A failure whose log contains a contributor-controlled string (test names are attacker-controlled!) that tries to make the agent approve the PR.
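A rough sketch of one defensive shape for this exercise: mark log text as data before it reaches the prompt, and accept only a closed set of labels back from the model. The names Verdict, wrap_untrusted, and parse_verdict and the <untrusted_log> tag are hypothetical. Fencing by itself should not be assumed to stop an injection; the firmer control here is that "approve" is not a value the service can act on.

```python
from enum import Enum


class Verdict(Enum):
    """The only outputs the service will act on (labels from 'The problem')."""
    FLAKY = "flaky"
    DEPENDENCY_BREAK = "dependency_break"
    INFRA = "infra"
    REAL_REGRESSION = "real_regression"


def wrap_untrusted(log_excerpt: str) -> str:
    """Fence log text so the prompt can refer to it as data, not instructions."""
    # Neutralize the closing tag so the excerpt cannot end its own fence early.
    body = log_excerpt.replace("</untrusted_log>", "<\\/untrusted_log>")
    return f"<untrusted_log>\n{body}\n</untrusted_log>"


def parse_verdict(model_output: str) -> Verdict:
    """Accept exactly one known label; anything else raises ValueError."""
    return Verdict(model_output.strip().lower())
```

If a hostile test name talks the model into answering "approve", parse_verdict raises instead of passing the word along, and the worst outcome is a failed classification rather than an approved PR.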
Belt test
10 historical failures (provided), ≥8 correctly classified; zero duplicate comments under webhook redelivery.
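The zero-duplicates half of the belt test comes down to an atomic "claim before you comment" step. Below is a minimal sketch using SQLite from the Python standard library. The table layout and the choice to key on (repo, run_id, run_attempt) are assumptions for illustration; a multi-instance service would need a shared store rather than a local file.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the hypothetical dedupe table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS commented ("
        " repo TEXT NOT NULL,"
        " run_id INTEGER NOT NULL,"
        " run_attempt INTEGER NOT NULL,"
        " PRIMARY KEY (repo, run_id, run_attempt))"
    )
    return conn


def claim_comment(conn: sqlite3.Connection, repo: str, run_id: int, run_attempt: int) -> bool:
    """Return True for the first claim of a run attempt, False for every repeat."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO commented VALUES (?, ?, ?)",
            (repo, run_id, run_attempt),
        )
    # The primary key makes the insert a no-op on repeats, so rowcount is 0.
    return cur.rowcount == 1
```

The handler posts a comment only when claim_comment returns True. Because the key is the run attempt rather than the delivery, a redelivered webhook and a duplicated queue message both collapse onto the same row.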