Warm-up — Talking to a local LLM
In this chapter
You’ll write the smallest possible “provider” — a thirty-line Python file that talks to any
OpenAI-compatible chat-completion endpoint. No SDK. No frameworks. Just httpx.
By the end you’ll have:
- Written chat(messages, tools=None): one round-trip over HTTP, returns the assistant message.
- Sent a real curl against Ollama and read the JSON it sends back.
- Wired the tools field and seen tool_calls come back on the assistant message.
- Parsed tool arguments tolerantly: clean JSON, single-quoted JSON, and trailing-comma JSON.
- Confirmed you can swap to Anthropic or OpenAI by changing env vars, not code.
Time: ~30 minutes. Hardware: anything that runs Ollama.
Optional — skip this and Ch1 still works. A reference provider.py already lives in the tree;
Ch1’s loop runs against either yours or that one. The point of the warm-up is to take the magic
out of “calling an LLM” before you build a loop around it.
“Show me your agent,” said the student. Budo handed him a curl command. “First show me you can speak.” The student blinked. “That’s just HTTP.” “Yes,” said Budo. “And that’s just the trick.”
The problem
Before you write a loop, you must speak to the model. The model is an HTTP endpoint. That is the whole secret. Every “AI SDK” is a thin wrapper around a POST that returns JSON.
We will skip the wrappers and do the POST ourselves — once. After that, you’ll never again wonder what an agent framework is doing under the hood, because there is no under the hood.
What you’ll build
A file called provider.py with exactly two functions:

    def chat(messages, tools=None) -> dict: ...
    def parse_tool_args(raw) -> dict: ...

chat does one round-trip to a chat-completion endpoint and returns the assistant message.
parse_tool_args reads the JSON string the model emits as tool arguments and survives the
small sins local models commit.
That’s it. Two functions. No classes. No retries. No streaming. Ch1 builds the loop on top.
sequenceDiagram
autonumber
participant Code as your code
participant Chat as chat()
participant LLM as LLM endpoint
participant Parse as parse_tool_args()
Code->>Chat: messages, tools
Chat->>LLM: POST /chat/completions
LLM-->>Chat: JSON response
Chat-->>Code: assistant message<br/>(content + tool_calls)
Code->>Parse: tool_calls[i].arguments (raw string)
Parse-->>Code: dict of args
Concepts
An OpenAI-compatible chat-completion endpoint is one URL:

    POST {BASE_URL}/chat/completions

The request body is a small JSON object:

    {
      "model": "qwen2.5:14b",
      "messages": [{"role": "user", "content": "hello"}],
      "temperature": 0,
      "tools": [ /* optional, tool specs */ ]
    }

The response wraps the assistant’s reply:

    {
      "choices": [{
        "message": {"role": "assistant", "content": "Hi there!"}
      }]
    }

When you pass tools and the model decides to call one, message.content is usually empty and a
tool_calls list appears instead:

    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc",
        "type": "function",
        "function": {"name": "get_pods", "arguments": "{\"namespace\":\"shop\"}"}
      }]
    }

Two things to notice:

- arguments is a string, not a parsed object. The endpoint hands you JSON-inside-JSON. You parse it yourself. That’s parse_tool_args.
- The shape is identical across providers. Ollama, OpenAI, Anthropic-via-OpenAI-compat, vLLM, and LM Studio all return the same JSON. That’s why we read everything from env vars: same code, different endpoint.
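To make the envelope concrete, here is a minimal sketch that unwraps a response shaped like the one above. The sample dict is hand-written for illustration, not captured from a real endpoint, and first_tool_call is a hypothetical helper name, not part of provider.py.

```python
import json

# Hand-written sample shaped like the tool-call response above (illustrative only).
response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_abc",
                "type": "function",
                "function": {"name": "get_pods", "arguments": "{\"namespace\":\"shop\"}"},
            }],
        }
    }]
}


def first_tool_call(resp):
    """Return (name, raw_arguments) for the first tool call, or None if there is none."""
    message = resp["choices"][0]["message"]
    calls = message.get("tool_calls") or []  # key may be absent or null on plain replies
    if not calls:
        return None
    fn = calls[0]["function"]
    return fn["name"], fn["arguments"]


name, raw_args = first_tool_call(response)
print(name)                  # get_pods
print(type(raw_args))        # <class 'str'>: still JSON text, not a dict
print(json.loads(raw_args))  # {'namespace': 'shop'}
```

The second print is the point: the arguments arrive as text and need their own parse step.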
Two design choices we make once and never revisit:
- temperature=0. Agents pick tools. Tool choices need to be reproducible or every debugging session becomes a séance.
- Env vars, never code. OPENAI_BASE_URL, BUDO_MODEL, OPENAI_API_KEY. Swap provider → export different values → same code runs.
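A minimal sketch of what "env vars, never code" can look like in Python. The three variable names come from this chapter; the fallback defaults (the local Ollama URL, qwen2.5:14b, and the placeholder key "ollama") are borrowed from the curl example and are an assumption about what the skeleton does, and load_config is a hypothetical helper name.

```python
import os


def load_config(env=os.environ):
    """Read provider settings from a mapping of env vars.

    The defaults are assumptions taken from this chapter's curl example.
    """
    return {
        "base_url": env.get("OPENAI_BASE_URL", "http://localhost:11434/v1"),
        "model": env.get("BUDO_MODEL", "qwen2.5:14b"),
        "api_key": env.get("OPENAI_API_KEY", "ollama"),
    }


# No variables set: falls back to the local defaults.
local = load_config({})
print(local["base_url"])   # http://localhost:11434/v1

# Same function, different environment: nothing in the code changes.
hosted = load_config({
    "OPENAI_BASE_URL": "https://api.openai.com/v1",
    "BUDO_MODEL": "gpt-4o-mini",
})
print(hosted["model"])     # gpt-4o-mini
```

Passing the mapping in as a parameter keeps the function easy to exercise without touching the real environment.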
Build
The lab lives at labs/warmup-llm-client/. Open it in another pane.
Step 1 — Prove the endpoint with curl
Before any Python, prove the endpoint is up. From the lab dir:

    just curl-test

Under the hood, that’s just:

    curl -sS http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ollama" \
      -d '{"model":"qwen2.5:14b","messages":[{"role":"user","content":"Say hi in one short sentence."}],"temperature":0}'

You should see a JSON blob with a sentence inside choices[0].message.content. If you don’t,
fix Ollama before touching Python. Plumbing errors are easier to debug at the HTTP layer.
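For readers who want to see the same request from Python before writing chat(), here is a rough sketch that builds it with the standard library only. This is not the chapter's method (the provider uses httpx); urllib is swapped in so the snippet runs with nothing installed. build_request is a hypothetical helper, and the snippet only constructs the request; the send step is left commented out because it needs a running endpoint.

```python
import json
import urllib.request


def build_request(base_url, model, prompt, api_key="ollama"):
    """Build the same POST the curl command sends, without sending it."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_request("http://localhost:11434/v1", "qwen2.5:14b",
                    "Say hi in one short sentence.")
print(req.get_method(), req.full_url)

# To actually send it (requires a running endpoint):
# with urllib.request.urlopen(req, timeout=300) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```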
Budo says: if curl works and your Python doesn’t, the bug is in your Python. If curl doesn’t work, the bug is in your laptop. Always know which one.
Step 2 — Write chat()
Open labs/warmup-llm-client/starter/provider_skeleton.py. The skeleton already has imports
and env-var reads. You fill in the body:

    def chat(messages, tools=None, temperature=0.0):
        body = {"model": MODEL, "messages": messages, "temperature": temperature}
        if tools:
            body["tools"] = tools
        r = httpx.post(
            f"{BASE_URL}/chat/completions",
            json=body,
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=300,
        )
        r.raise_for_status()
        return r.json()["choices"][0]["message"]

Four decisions worth naming:
| Line | Why |
|---|---|
| if tools: body["tools"] = tools | An empty tools field confuses some endpoints. Omit it unless you actually have tools. |
| Authorization: Bearer … | Required by paid APIs, ignored by Ollama. Always send it; portability is cheap. |
| timeout=300 | Local 14B models can take 30–60s on cold cache. Generous, not magical. |
| r.raise_for_status() | Turn 4xx/5xx into a real Python exception. Silent failure is the worst failure. |
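The first row of the table is easy to check in isolation. The sketch below pulls the body-building step into a hypothetical helper, build_body (the chapter's chat() does this inline), so the effect of the truthiness test is visible without a network call.

```python
def build_body(model, messages, tools=None, temperature=0.0):
    """Assemble the request body; the tools key is added only when tools is non-empty."""
    body = {"model": model, "messages": messages, "temperature": temperature}
    if tools:  # None and [] are both falsy, so both leave the key out
        body["tools"] = tools
    return body


messages = [{"role": "user", "content": "hello"}]

print("tools" in build_body("qwen2.5:14b", messages))            # False
print("tools" in build_body("qwen2.5:14b", messages, tools=[]))  # False

one_tool = [{"type": "function", "function": {"name": "get_time"}}]
print("tools" in build_body("qwen2.5:14b", messages, tools=one_tool))  # True
```

Testing truthiness rather than "is not None" is what keeps an accidental empty list from reaching the endpoint.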
Test it now:

    just test

You should see a sentence printed. If you see a stack trace, read it: raise_for_status
reports the status code and the URL that failed.
Step 3 — Pass tools through and look at the response
You don’t write tool definitions yet — Ch1 does that. For now, just confirm the field passes
through. Drop this in a scratch session (python3 REPL, from the starter/ dir):

    from provider_skeleton import chat

    tools = [{
        "type": "function",
        "function": {
            "name": "get_time",
            "description": "Returns the current time.",
            "parameters": {"type": "object", "properties": {}},
        },
    }]

    msg = chat([{"role": "user", "content": "What time is it? Use the tool."}], tools=tools)
    print(msg)

If the model is tool-capable (qwen2.5:14b is), msg looks like:

    {'role': 'assistant', 'content': '', 'tool_calls': [{'id': '...', 'type': 'function', 'function': {'name': 'get_time', 'arguments': '{}'}}]}

arguments is a string. Parsing it is the next function.
Step 4 — Write parse_tool_args()
Local 14B models are mostly good at JSON. Mostly is not good enough. The three things they do:
| Input | What it is | Strategy |
|---|---|---|
| {"namespace": "shop", "tail": 100} | Clean JSON. 95% of calls. | json.loads(raw), done. |
| {'namespace': 'shop', 'tail': 100} | Single quotes. Python-flavored. | Replace ' with ", parse. |
| {"namespace": "shop", "tail": 100},\n | Trailing comma + newline. Model dribbled extra characters. | .rstrip(", \n"), parse. |
Two-tier strategy: strict first, lenient on the fallback, raise loudly if both fail.

    def parse_tool_args(raw):
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            cleaned = raw.replace("'", '"').rstrip(", \n")
            return json.loads(cleaned)  # if this raises, let it

That last comment matters. Don’t swallow the second exception. Ch1’s loop catches it and feeds the error message back to the model so it can re-emit valid JSON. Self-correction is half of robustness.
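Here is a quick standalone check of the two tiers against the three inputs from the table, plus one input the lenient tier cannot rescue. The function is repeated so the snippet runs on its own; the fourth input is a made-up example chosen to show the limit of the blanket quote swap.

```python
import json


def parse_tool_args(raw):
    # Same two-tier logic as above, repeated so this snippet is self-contained.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        cleaned = raw.replace("'", '"').rstrip(", \n")
        return json.loads(cleaned)  # if this raises, let it


cases = {
    "clean JSON": '{"namespace": "shop", "tail": 100}',
    "single-quoted": "{'namespace': 'shop', 'tail': 100}",
    "trailing comma+nl": '{"namespace": "shop", "tail": 100},\n',
}
for label, raw in cases.items():
    assert parse_tool_args(raw) == {"namespace": "shop", "tail": 100}
    print("PASS", label)

# Known limit: an apostrophe inside a single-quoted value becomes a stray
# double quote after the swap, so the second parse fails and the error propagates.
try:
    parse_tool_args("{'reason': 'pod isn't ready'}")
except json.JSONDecodeError as err:
    print("still raises:", err.msg)
```

That last case is exactly the situation where letting the exception escape pays off: the caller gets a real error to hand back to the model.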
Step 5 — Verify with the test recipes
    just test        # chat() returns a sentence
    just parse-test  # parse_tool_args handles all three cases

Both should pass:

    PASS  clean JSON
    PASS  single-quoted
    PASS  trailing comma+nl

If parse-test fails on case 3, you forgot the .rstrip. If it fails on case 2, you forgot the
quote replacement.
Step 6 — Swap providers without changing code
Optional but worth doing once. Set OpenAI env vars and re-run just test:

    export OPENAI_BASE_URL=https://api.openai.com/v1
    export OPENAI_API_KEY=sk-...
    export BUDO_MODEL=gpt-4o-mini
    just test

Same code, different endpoint, same output shape. That’s the entire point of an OpenAI-compatible API and the entire reason we never hardcoded a provider.
Break it
Three small failures, each one a future debugging story:
| Break | What you’ll see | Fix |
|---|---|---|
| Ollama is down (pkill ollama) | httpx.ConnectError: All connection attempts failed | Start Ollama. The error is honest, so read it. |
| Model not pulled (BUDO_MODEL=qwen2.5:72b just test on a laptop without it) | httpx.HTTPStatusError: 404 ... model not found | ollama pull <model> or pick a smaller one. |
| API key missing on a paid endpoint | 401 Unauthorized | Export OPENAI_API_KEY. Don’t commit it. |
The two HTTP errors are raise_for_status() paying its rent; the ConnectError comes from httpx
itself, before any response exists. Without raise_for_status() you’d get an empty string or a
KeyError deep inside the response, and good luck finding the cause.
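To see what the missing check would cost, here is a small sketch of unwrapping an error payload as if it were a success. The error body is an illustrative assumption (exact fields vary by provider), and unwrap is a hypothetical stand-in for the last line of chat().

```python
# Illustrative error body; real providers differ in the exact fields (assumption).
error_body = {"error": {"message": "model not found"}}


def unwrap(payload):
    """What chat() does with a successful response."""
    return payload["choices"][0]["message"]


try:
    unwrap(error_body)
except KeyError as err:
    # The traceback points at the indexing, not at the HTTP failure behind it.
    print("KeyError:", err)  # KeyError: 'choices'
```

The KeyError says nothing about a 404 or a missing model, which is why the status check belongs before the unwrap.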
Harden it
There is not much to harden here. Three rules and you’re done:
- r.raise_for_status() so HTTP errors are loud.
- timeout=300 so a hung server doesn’t hang your agent.
- Read every credential from env. Never commit one, never log one.
That is the whole hardening list for a provider. Save complexity for the loop.
Belt test
- just curl-test returns a JSON object with a sentence inside choices[0].message.content.
- just test prints a sentence from the model.
- Calling chat(messages, tools=[...]) returns an assistant message with a tool_calls field.
- just parse-test prints PASS for all three cases.
- You can point OPENAI_BASE_URL and BUDO_MODEL at a different provider and just test still works, with no code edits.
Where this lands in Ch1
If you finished the warm-up, drop your file into the live tree:

    mv budo/budo/core/provider.py budo/budo/core/provider.reference.py
    cp labs/warmup-llm-client/starter/provider_skeleton.py budo/budo/core/provider.py

Ch1’s loop calls your chat and your parse_tool_args. If your version works, the white belt
is yours to earn.
If you’d rather skip — also fine. The reference provider.py already in the tree is equivalent.
Go straight to Ch1 — The Naked Loop and build the loop on top.
Budo says: the loop is where every interesting decision lives. The provider is just plumbing. Knowing that is the point.