How to Build a Production
In July 2025, a Claude Code recursion loop burned between 16,000 USD and 50,000 USD in five hours. There was no crash or error, just agents doing exactly what they were told, indefinitely, because nobody told them when to stop.

Four months later, a four-agent LangChain loop ran for eleven days and cost 47,000 USD. Nobody noticed until the invoice arrived. The pipeline worked correctly in testing, and the agents were doing exactly what they were told. Same pattern.

This tutorial is about that missing instruction. You'll build five small Python primitives that catch most agent loop failures before they ship:

- A spec writer that forces you to define done before the loop starts
- A circuit breaker that kills the loop when it exceeds hard limits
- A ledger that records every turn in an append-only SQLite audit trail
- An agent loop that ties all three together
- A review surface that forces human attestation before downstream systems receive anything

By the end you'll have a working repo you can drop into any agent project. The full code is at github.com/dannwaneri/production-safe-agent-loop.

- Why This Keeps Happening
- Prerequisites
- Phase 1: Define Done Before You Build
- Phase 2: Enforce Done at Runtime
- Phase 3: Record Everything
- Phase 4: The Loop That Respects Its Boundaries
- Phase 5: The Review Surface
- Phase 6: A Real Example, SEO Audit Agent
- Pluggable LLM Client
- Running the Tests
- What You've Built
- Next Steps

The math that got companies into trouble was simple. A chatbot costs roughly 0.04 USD per interaction. An orchestrated multi-agent workflow costs 1.20 USD. That's a 30x multiplier, and production benchmarks show it can reach 70x on complex tasks.

The problem isn't that agents are expensive. The problem is that most teams budgeted for chatbot costs and deployed agent architectures. Gartner found the token consumption gap between pilot chatbots and production agent workflows sits at 5-30x.
The FinOps Foundation's 2026 State of FinOps report found 73% of enterprises say AI costs exceeded original projections.

The mechanism is straightforward once you see it. When an agent fails a task and retries, it doesn't start fresh. It re-reads the entire context window (every prior failed attempt) before trying again. Iteration one costs 100 tokens. Iteration two costs 200. By iteration ten, a single attempt costs about 1,000 tokens and the run has consumed several thousand. You're paying for every failure, over and over, in milliseconds.

That question mark is where the money goes.

The other thing making it worse: agents don't fail loudly. Traditional code hits an undefined state and crashes. An LLM hits ambiguity and tries to be helpful. It retries. It reformats the tool call. It spins up a verification agent. The verification agent finds something. A correction agent fires. Nobody defined what "correct" means. The loop looks beautiful on every dashboard you have (activity, tool calls, completion rate) while quietly burning through your budget.

Gartner predicts that 40% of agentic projects will be scrapped by 2027 due to economic failure. Most of that failure is preventable. Not with better models, but with exit conditions.

- Python 3.10+
- An Anthropic API key (or any provider, more on that later)
- Basic familiarity with Python classes and SQLite

The most expensive mistake in agent development isn't a bad model choice or a missing retry limit. It's starting the build before you can answer one question in one sentence: What does done look like?

Most teams can't answer it. Not because they're careless, but because nothing forces them to before they open the terminal. The spec writer is that forcing function. When you call

- What does this do?
- What does this NOT do?
- What does done look like in one sentence?

The third question is the one that matters. It's also the hardest. "The agent audits the site" is not an answer.
"The agent crawls the target URL, extracts all

The spec stores to SQLite and returns a

For testing,

Defining the exit condition upstream is discipline. The circuit breaker is enforcement. Two ceilings. Both hard.

The boundary is strict:

The critical rule: call

The defaults (5 turns, 15,000 tokens) match a tight tutorial demo. Your production budget is different. Tune at instantiation:

The circuit breaker protects your bank account. The ledger protects your understanding of what happened.

Most teams log for debugging: they want to know what went wrong after it went wrong. The ledger has a different purpose. It's governance. Every row is proof that the loop stayed within its boundaries, or didn't, and exactly when.

One row per turn. Append-only, no updates, and no deletes. The immutability is the point: a ledger you can edit isn't a ledger, it's a notebook.

The schema:

The index makes

Three decisions worth explaining:

Retrieve by session:

The agent loop wires the three primitives together. It's the only component that calls the LLM. Everything else is local.

The anatomy of a turn, in order:

If

Pre-flight checking before every LLM call, with no exceptions.

A few choices worth calling out in this body:

The whole

The circuit breaker protects your bank account. The ledger records what happened. But neither tells you whether what happened matched what you promised.

That gap is where bad loops get approved. Polished output, green dashboard, missed commitment. A reviewer sees the artifact, decides it looks acceptable, and signs off. Nobody asked whether the original promise was kept.

The review surface closes that gap. It reads the session from SQLite, assembles the five-element frame, and forces a comparison before anything downstream receives the output.
Here's the five-element frame, in order:

- Original promise: pulled from the spec table (what it does, what it doesn't do, what done looks like)
- Acceptance criteria: the
- Diff: first turn input vs final turn output, turns completed, total tokens, whether the loop breached
- Evidence: all ledger rows for the session (turn-by-turn pass/fail, token delta, execution time)
- Unresolved assumptions: derived from breach rows and failed turns. Empty when clean.

When the reviewer is satisfied, they attest:

Approval confirms the process ran. Attestation confirms the reviewer compared output to commitment. When the loop touches something regulated, those are different legal documents.

The full end-to-end flow with the review surface lives in

The loop runs to you. Downstream systems get nothing until someone signs.

The pattern only makes sense against a real problem. This is the same agent architecture behind my seo-agent project.

SEO audits have a natural cadence: crawl, surface what's broken, fix, wait for reindex. Running the agent continuously doesn't change that cadence. It just burns tokens in the empty space between the moments that matter. A cron job wired to the loop is the honest architecture.

Run it:

The spec writer prompts you. The loop runs, the circuit breaker fires if the limits are exceeded, and the ledger records every turn. The output lands in front of you and you decide what to fix. The loop runs to you, not into a void.

The loop works with any client that satisfies the

Here's an OpenAI adapter example (this is illustrative; a production adapter would also map streaming, tool-use, and error shapes):

The adapter pattern is worth teaching explicitly. Provider APIs don't share a shape. Anthropic puts

With coverage:

80 tests, 100% coverage on all five core modules. The loop is exercised against a

In this tutorial, you've built five small primitives, each independently usable.

The pattern: upstream discipline defines the boundaries.
Downstream enforcement breaks the circuit. Neither trusts the model to police itself.

A loop that runs without an exit condition isn't autonomous. It's a billing event waiting to happen. Define what done looks like before you start. That's the job, and always has been.

The repo is at github.com/dannwaneri/production-safe-agent-loop. There are three natural extensions if you want to go further:

The SQLite ledger works for isolated sequential loops. The moment you run multiple agents against shared state, you need serializable isolation: concurrent writes to flat JSON corrupt silently. The README documents the three tipping points where a flat ledger needs to graduate.

For compliance-scale systems where the auditor wasn't present when the loop ran, SQLite rows aren't enough. A database admin can run an

The honest architecture for the SEO audit agent isn't 24/7 autonomous operation. It's a cron job that runs on schedule, surfaces what's broken, and stops.

If you need this architecture built for your own stack (circuit breakers, audit trails, production-safe agent loops), I do freelance work. dannwaneri.com/ai-agents/

Why This Keeps Happening
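To make the retry arithmetic concrete, here is a minimal sketch. It assumes each attempt adds a fixed 100 tokens to the context and that every retry re-reads everything before it. The function name and the 100-token figure are illustrative assumptions, not measurements from a real agent.

```python
def retry_costs(attempts: int, tokens_per_attempt: int = 100) -> list[tuple[int, int, int]]:
    """Return (attempt, tokens_this_attempt, cumulative_tokens) for each attempt.

    Assumption: attempt n re-reads the n-1 earlier attempts plus its own input,
    so it costs n * tokens_per_attempt.
    """
    rows = []
    cumulative = 0
    for n in range(1, attempts + 1):
        this_attempt = n * tokens_per_attempt
        cumulative += this_attempt
        rows.append((n, this_attempt, cumulative))
    return rows


if __name__ == "__main__":
    for attempt, cost, total in retry_costs(10):
        print(f"attempt {attempt:>2}: {cost:>5} tokens (running total {total:>5})")
```

Under these assumptions the per-attempt cost grows linearly, so the running total grows quadratically: ten failed attempts cost 55 times as much as the first one.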
```python
# This is the entire problem in three lines
while True:
    result = agent.run(task)
    # done when...?
```

Prerequisites
```bash
git clone https://github.com/dannwaneri/production-safe-agent-loop
cd production-safe-agent-loop
pip install -r requirements.txt
export ANTHROPIC_API_KEY=sk-...
```

Phase 1: Define Done Before You Build
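One testing detail from this phase can be checked directly with the standard library: every `sqlite3.connect(":memory:")` call creates a separate, private database. The sketch below uses a hypothetical one-column `spec` table (not the repo's schema) to show why code that opens a fresh connection per method needs a file-backed database in tests.

```python
import os
import sqlite3
import tempfile


def table_visible_from_second_connection(db_path: str) -> bool:
    """Create a table on one connection, then look for it from another."""
    first = sqlite3.connect(db_path)
    first.execute("CREATE TABLE spec (session_id TEXT)")  # hypothetical table
    first.commit()

    second = sqlite3.connect(db_path)
    found = second.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'spec'"
    ).fetchone()

    first.close()
    second.close()
    return found is not None


if __name__ == "__main__":
    # Two :memory: connections are two unrelated databases.
    print(table_visible_from_second_connection(":memory:"))

    # Two connections to the same file share one database.
    with tempfile.TemporaryDirectory() as tmp:
        print(table_visible_from_second_connection(os.path.join(tmp, "spec.db")))
```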
```python
# spec_writer.py
from spec_writer import SpecWriter

spec = SpecWriter(db_path="spec.db").run()
```

.run(), it won't return until you've answered three questions:

<title> and <meta description> tags, flags any missing or over-length, and stops" is an answer. One of those gives the circuit breaker something to enforce.

SpecResult dataclass with a session_id. That ID becomes the thread connecting your spec, your ledger rows, and your loop result. One session, traceable end to end.

```python
@dataclass(frozen=True)
class SpecResult:
    what_it_does: str
    what_it_does_not: str
    done_looks_like: str
    session_id: str
```

frozen=True matters. The spec is a commitment, not a draft. Once it's written, the loop runs against it. No mid-run revisions.

SpecWriter accepts injectable input_fn and output_fn callables. No stdin monkey-patching required. See tests/test_spec_writer.py for working examples: the suite uses a small scripted_input helper that returns answers from a generator, and writes to a per-test SQLite file via pytest's tmp_path fixture. SQLite's :memory: isn't safe here, because SpecWriter opens a fresh connection per method and each :memory: connection is its own isolated database.

Phase 2: Enforce Done at Runtime
```python
# circuit_breaker.py
from circuit_breaker import CircuitBreaker, CircuitBreakerError

breaker = CircuitBreaker(turn_limit=5, token_limit=15000)
breaker.check(turn_count, accumulated_tokens)  # raises on breach
```

turn_limit caps how many times the loop can call the LLM. token_limit caps total token consumption across all turns. Either one tripping raises CircuitBreakerError immediately.

turn_count == turn_limit is allowed. turn_count == turn_limit + 1 trips. No grace periods or warnings. A hard stop forces a human checkpoint.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreakerError(Exception):
    reason: str  # "turn_ceiling" or "token_ceiling"
    turn_count: int
    accumulated_tokens: int

    def __post_init__(self) -> None:
        super().__init__(
            f"circuit breaker tripped: {self.reason} "
            f"(turn={self.turn_count}, tokens={self.accumulated_tokens})"
        )


class CircuitBreaker:
    def __init__(self, turn_limit: int = 5, token_limit: int = 15000) -> None:
        self.turn_limit = turn_limit
        self.token_limit = token_limit

    def check(self, turn_count: int, accumulated_tokens: int) -> None:
        if turn_count > self.turn_limit:
            self._trip("turn_ceiling", turn_count, accumulated_tokens)
        if accumulated_tokens > self.token_limit:
            self._trip("token_ceiling", turn_count, accumulated_tokens)

    def _trip(self, reason: str, turn_count: int, accumulated_tokens: int) -> None:
        print(
            "\n=== CIRCUIT BREAKER CHECKPOINT ===\n"
            f"reason : {reason}\n"
            f"turn_count : {turn_count} / limit {self.turn_limit}\n"
            f"tokens_used : {accumulated_tokens} / limit {self.token_limit}\n"
            "action : halt loop, surface to human reviewer\n"
            "=================================="
        )
        raise CircuitBreakerError(
            reason=reason,
            turn_count=turn_count,
            accumulated_tokens=accumulated_tokens,
        )
```

CircuitBreakerError is an exception, not a return code. That's intentional. A return code can be ignored. An uncaught exception can't. Silent breach is impossible.
The human-readable checkpoint banner is printed to stdout by _trip() before the exception is raised, so even if a caller swallows the exception the operator still sees state.

.check() before every LLM call, not after. Post-flight checking means you've already burned the tokens before you knew the limit was exceeded.

```python
# Wrong: post-flight
result = client.messages.create(...)
breaker.check(turn_count, accumulated_tokens)  # too late

# Right: pre-flight
breaker.check(turn_count, accumulated_tokens)  # raises before any spend
result = client.messages.create(...)
```

```python
# Production example: larger token budget, more turns
breaker = CircuitBreaker(turn_limit=10, token_limit=50000)
```

Phase 3: Record Everything
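Append-only can be a convention in the Python API, or it can be enforced by the database itself. The sketch below shows the second option using SQLite triggers on a cut-down, hypothetical `ledger` table. This is an assumption about one way to do it, not a description of how the repo's `Ledger` class works.

```python
import sqlite3

# Hypothetical two-column ledger plus triggers that reject edits and deletes.
SCHEMA = """
CREATE TABLE ledger (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL,
    turn_count INTEGER NOT NULL
);
CREATE TRIGGER ledger_no_update BEFORE UPDATE ON ledger
BEGIN
    SELECT RAISE(ABORT, 'ledger is append-only');
END;
CREATE TRIGGER ledger_no_delete BEFORE DELETE ON ledger
BEGIN
    SELECT RAISE(ABORT, 'ledger is append-only');
END;
"""


def open_ledger() -> sqlite3.Connection:
    """Open an in-memory database with the append-only schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def is_rejected(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True if the statement is refused by the database."""
    try:
        conn.execute(sql)
    except sqlite3.DatabaseError:
        return True
    return False


if __name__ == "__main__":
    conn = open_ledger()
    conn.execute("INSERT INTO ledger (session_id, turn_count) VALUES ('s1', 1)")
    print(is_rejected(conn, "UPDATE ledger SET turn_count = 99"))
    print(is_rejected(conn, "DELETE FROM ledger"))
    print(conn.execute("SELECT COUNT(*) FROM ledger").fetchone()[0])
```

Triggers like these stop accidental edits made through the application. Anyone with write access to the database file can still drop them, which is the gap the signing extension under Next Steps is meant to address.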
```python
# ledger.py
from ledger import Ledger

ledger = Ledger(db_path="ledger.db")
ledger.write(
    session_id=spec.session_id,
    turn_count=1,
    state_origin="llm",
    input_str=task,
    token_delta=523,
    execution_time_ms=1240,
    pass_fail=True,
)
```

```sql
CREATE TABLE IF NOT EXISTS ledger (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL,
    turn_count INTEGER NOT NULL,
    state_origin TEXT NOT NULL,
    input_hash TEXT NOT NULL,
    token_delta INTEGER NOT NULL,
    execution_time_ms INTEGER NOT NULL,
    pass_fail INTEGER NOT NULL,  -- 1=pass, 0=fail
    breach_reason TEXT,          -- NULL unless circuit breaker fired
    created_at TEXT NOT NULL     -- ISO 8601, UTC
);
CREATE INDEX IF NOT EXISTS idx_ledger_session ON ledger(session_id);
```

get_session(session_id), the primary read path, an index lookup (logarithmic in table size) rather than a full table scan as the ledger grows.

- input_hash not input_text. The raw input string never persists. Only its SHA-256 hash does. There are two benefits to this: identical inputs across runs are detectable, and PII never enters the audit trail.
- pass_fail as INTEGER not BOOLEAN. SQLite has no boolean type. 1 and 0 are canonical. Clean Python ergonomics at the API edge, correct SQL types on disk.
- created_at as datetime.now(timezone.utc).isoformat(). datetime.utcnow() was deprecated in Python 3.12. Timezone-aware timestamps avoid the footgun in any system that crosses timezones.

```python
rows = ledger.get_session(spec.session_id)
for row in rows:
    print(
        f"Turn {row.turn_count}: {'PASS' if row.pass_fail else 'FAIL'} "
        f"| {row.token_delta} tokens | {row.execution_time_ms}ms"
    )
```

Phase 4: The Loop That Respects Its Boundaries
```python
# agent_loop.py
from agent_loop import AgentLoop

loop = AgentLoop(spec, breaker, ledger, client)
result = loop.run(task)
# LoopResult(success, turns, total_tokens, session_id, breach_reason)
```

1. circuit_breaker.check(turn_count, accumulated_tokens): raises if either ceiling is exceeded
2. client.messages.create(...): the actual LLM call
3. ledger.write(...): one row, append-only

stop_reason == "end_turn", return. Otherwise loop.

```python
def run(self, task: str) -> LoopResult:
    session_id = self.spec.session_id
    messages: list[dict] = [{"role": "user", "content": task}]
    turn = 0
    total_tokens = 0
    try:
        while True:
            turn += 1
            self.circuit_breaker.check(turn, total_tokens)
            started = time.perf_counter()
            response = self.client.messages.create(
                model=self.model,
                max_tokens=self.max_tokens,
                system=self._system_prompt(),
                messages=messages,
            )
            elapsed_ms = int((time.perf_counter() - started) * 1000)
            turn_tokens = (
                getattr(response.usage, "input_tokens", 0)
                + getattr(response.usage, "output_tokens", 0)
            )
            total_tokens += turn_tokens
            text = self._text_from(response)
            messages.append({"role": "assistant", "content": text})
            self.ledger.write(
                session_id=session_id,
                turn_count=turn,
                state_origin="llm",
                input_str=task,
                token_delta=turn_tokens,
                execution_time_ms=elapsed_ms,
                pass_fail=True,
            )
            if getattr(response, "stop_reason", "end_turn") == "end_turn":
                return LoopResult(
                    success=True,
                    turns=turn,
                    total_tokens=total_tokens,
                    session_id=session_id,
                )
            messages.append({"role": "user", "content": "continue"})
    except CircuitBreakerError as err:
        self.ledger.write(
            session_id=session_id,
            turn_count=turn,
            state_origin="circuit_breaker",
            input_str=task,
            token_delta=0,
            execution_time_ms=0,
            pass_fail=False,
            breach_reason=err.reason,
        )
        return LoopResult(
            success=False,
            turns=turn,
            total_tokens=total_tokens,
            session_id=session_id,
            breach_reason=err.reason,
        )


def _system_prompt(self) -> str:
    return (
        "You are an agent working on a tightly-scoped task.\n\n"
        f"What this does: {self.spec.what_it_does}\n"
        f"What this does NOT do: {self.spec.what_it_does_not}\n"
        f"Done looks like: {self.spec.done_looks_like}\n"
    )


@staticmethod
def _text_from(response) -> str:
    content = getattr(response, "content", None)
    if not content:
        return ""
    block = content[0]
    return getattr(block, "text", "") or ""
```

- while True: is wrapped in one try/except CircuitBreakerError. The check happens at the top of every turn, so a breach is caught the same way whether it fires on turn 1 or turn 6.
- input_str=task on every ledger row: the original task, not the last assistant message. The input_hash column then groups rows that share the same starting input across the run.
- pass_fail=True for every LLM turn that returns, False only on breach. The pass/fail flag tracks whether the loop reached the row legitimately, not whether the model's output was good. Quality scoring is a separate concern.
- _system_prompt() uses all three spec fields, not just done_looks_like. The model needs the negative scope (what_it_does_not) at least as much as the positive scope.
- time.perf_counter() not time.time(): monotonic, immune to wall-clock adjustments mid-run.
- LoopResult.session_id is inherited from spec.session_id. The ledger rows tie back to the spec without a join. One session ID, one traceable run, start to finish.

Phase 5: The Review Surface
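A hash only works as a receipt if the same frame always produces the same digest. Here is a minimal sketch of one way to get that, assuming the frame can be reduced to JSON-serialisable data: serialise with sorted keys and fixed separators, then take a SHA-256. The `canonical_hash` helper and the sample fields are hypothetical, and the repo's actual canonical form may differ.

```python
import hashlib
import json


def canonical_hash(frame: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the frame data."""
    canonical = json.dumps(
        frame,
        sort_keys=True,          # key order no longer affects the bytes
        separators=(",", ":"),   # no incidental whitespace
        ensure_ascii=True,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    frame_a = {"session_id": "abc", "turns": 3, "breached": False}
    frame_b = {"breached": False, "turns": 3, "session_id": "abc"}  # same data, other order
    print(canonical_hash(frame_a) == canonical_hash(frame_b))
    print(canonical_hash({**frame_a, "turns": 4}) == canonical_hash(frame_a))
```

For a digest like this to match across reviewers, reviewer-specific values such as the reviewer's name or the attestation time would have to stay out of the hashed data.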
```python
from review_surface import ReviewSurface

rs = ReviewSurface(spec_db_path="spec.db", ledger_db_path="ledger.db")
print(rs.render(session_id))
```

done_looks_like field rendered as the explicit benchmark

```python
attestation = rs.attest(
    session_id=result.session_id,
    reviewer="daniel",
    notes="Output matches spec. Approved.",
)
print(attestation.frame_hash)
```

.attest() writes to the attestations table in ledger.db. The frame_hash is a SHA-256 of the canonical frame data, deterministic across reviewers attesting the same session. It's the audit receipt. It proves the reviewer saw the exact frame as rendered, not a summary or a paraphrase.

```python
@dataclass(frozen=True)
class ReviewFrame:
    session_id: str
    original_promise: SpecResult
    acceptance_criteria: str
    diff: DiffResult
    evidence: tuple  # tuple[LedgerRow, ...]
    unresolved_assumptions: tuple  # tuple[str, ...]
    created_at: str
```

ReviewFrame is frozen for the same reason SpecResult is: the frame is evidence, not a draft. evidence and unresolved_assumptions are tuples because lists are mutable and unhashable, and a frozen dataclass instance can only be hashed if every field is hashable.

examples/review_example.py in the repo. Run it after any completed session: it renders the five-element frame, prompts for attestation, and writes the receipt if you approve.

Phase 6: A Real Example, SEO Audit Agent
```python
# examples/seo_audit_example.py
import requests
from bs4 import BeautifulSoup
import anthropic

from spec_writer import SpecWriter
from circuit_breaker import CircuitBreaker
from ledger import Ledger
from agent_loop import AgentLoop


def crawl_url(url: str) -> str:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("title")
    meta_desc = soup.find("meta", attrs={"name": "description"})
    h1_tags = soup.find_all("h1")
    return (
        f"URL: {url}\n"
        f"Title: {title.text if title else 'MISSING'}\n"
        f"Meta description: "
        f"{meta_desc['content'] if meta_desc else 'MISSING'}\n"
        f"H1 count: {len(h1_tags)}\n"
        f"H1 tags: {[h.text[:50] for h in h1_tags]}"
    )


def run_seo_audit(url: str) -> None:
    # Step 1: Define done before the loop starts
    spec = SpecWriter(db_path="spec.db").run()

    # Step 2: Initialise circuit breaker and ledger
    breaker = CircuitBreaker(turn_limit=5, token_limit=15000)
    ledger = Ledger(db_path="ledger.db")
    client = anthropic.Anthropic()

    # Step 3: Crawl the URL
    site_data = crawl_url(url)

    # Step 4: Run the loop
    # AgentLoop catches CircuitBreakerError internally and returns
    # LoopResult(success=False, breach_reason=...). Branch on the
    # result; do NOT wrap loop.run() in try/except CircuitBreakerError.
    loop = AgentLoop(spec, breaker, ledger, client)
    result = loop.run(
        f"Audit this page for SEO issues:\n\n{site_data}"
    )

    # Step 5: Print the ledger
    print(f"\nResult: {'SUCCESS' if result.success else 'BREACH'}")
    if not result.success:
        print(f"Breach reason: {result.breach_reason}")
    print(f"Turns: {result.turns} | Tokens: {result.total_tokens}")
    print("\nAudit trail:")
    for row in ledger.get_session(result.session_id):
        status = "PASS" if row.pass_fail else "FAIL"
        print(f"  Turn {row.turn_count}: {status} | "
              f"{row.token_delta} tokens | {row.execution_time_ms}ms")


if __name__ == "__main__":
    import sys
    run_seo_audit(sys.argv[1] if len(sys.argv) > 1 else "https://example.com")
```

```bash
python examples/seo_audit_example.py https://yourdomain.com
```

Pluggable LLM Client
LLMClient protocol (Anthropic by default). Bring your own via a ~20-line adapter.

```python
# agent_loop.py
from typing import Protocol, runtime_checkable


@runtime_checkable
class MessagesEndpoint(Protocol):
    def create(self, *, model: str, max_tokens: int, system: str, messages: list) -> object: ...


@runtime_checkable
class LLMClient(Protocol):
    messages: MessagesEndpoint
```

messages is an instance attribute (not a nested class) because that's how the real Anthropic SDK exposes it: anthropic.Anthropic().messages.create(...). Modeling it as a nested class would mean the real client wouldn't satisfy the Protocol. The @runtime_checkable decorator lets you sanity-check conformance with isinstance(client, LLMClient), and the repo's test suite uses exactly that assertion against the FakeClient test double.

```python
# openai_adapter.py: illustrative pseudocode, not production-ready.
from openai import OpenAI as _OpenAI


class _MessagesAdapter:
    def __init__(self, client):
        self._client = client

    def create(self, *, model, max_tokens, system, messages):
        completion = self._client.chat.completions.create(
            model=model,
            max_tokens=max_tokens,
            messages=[{"role": "system", "content": system}] + messages,
        )
        # Reshape OpenAI's response into the Anthropic-shaped surface
        # AgentLoop reads: response.usage.{input,output}_tokens,
        # response.content[0].text, response.stop_reason.
        return _adapt_response(completion)


class OpenAIAdapter:
    def __init__(self, api_key: str):
        self._client = _OpenAI(api_key=api_key)
        self.messages = _MessagesAdapter(self._client)  # instance attr, not a nested class
```

system at the top level. OpenAI puts it inside the messages array. An adapter shim is ~20 lines and makes the loop provider-agnostic without rewriting anything. Note that self.messages is assigned in __init__ so it's a real attribute on each adapter instance, the same shape as the actual SDK.

Running the Tests
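The test double described in this section can be sketched in a few lines. This is an illustrative reconstruction based on the description (`messages` set to `self`, scripted responses), not the code in `tests/test_agent_loop.py`. The `scripted` parameter, the `calls` counter, and the `response` helper are hypothetical names.

```python
from types import SimpleNamespace
from typing import Protocol, runtime_checkable


@runtime_checkable
class MessagesEndpoint(Protocol):
    def create(self, *, model: str, max_tokens: int, system: str, messages: list) -> object: ...


@runtime_checkable
class LLMClient(Protocol):
    messages: MessagesEndpoint


def response(text: str, stop_reason: str, tokens: int) -> SimpleNamespace:
    """Build an Anthropic-shaped response object for tests."""
    return SimpleNamespace(
        content=[SimpleNamespace(text=text)],
        stop_reason=stop_reason,
        usage=SimpleNamespace(input_tokens=tokens, output_tokens=0),
    )


class FakeClient:
    """Returns scripted responses in order; never touches the network."""

    def __init__(self, scripted: list) -> None:
        self._scripted = iter(scripted)
        self.calls = 0
        self.messages = self  # client.messages.create(...) routes back to this object

    def create(self, *, model, max_tokens, system, messages):
        self.calls += 1
        return next(self._scripted)


if __name__ == "__main__":
    client = FakeClient([
        response("working", "max_tokens", 40),
        response("done", "end_turn", 60),
    ])
    print(isinstance(client, LLMClient))
    first = client.messages.create(model="fake", max_tokens=16, system="", messages=[])
    print(first.stop_reason, client.calls)
```

Scripting a non-`end_turn` response first is what lets a test drive the loop through more than one turn, and a long enough script of such responses is one way to exercise the breach path.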
```bash
python -m pytest tests/
```

```bash
python -m coverage run --source=circuit_breaker,ledger,spec_writer,agent_loop,review_surface -m pytest tests/
python -m coverage report -m
```

FakeClient test double defined inline in tests/test_agent_loop.py. It satisfies the LLMClient protocol via duck typing: messages is set to self, so client.messages.create(...) routes back to the same object and ships with scripted responses for each test scenario. Clone the repo and run pytest to see all 80 tests pass without touching the network or needing an API key.

circuit_breaker.py has 100% coverage, with no untested paths. It's the financial safety component. Every path through it is exercised.

What You've Built
| Module | Role | Lines |
| --- | --- | --- |
| spec_writer.py | Forces three answers before the loop runs | 104 |
| circuit_breaker.py | Hard ceilings on turns and tokens | 41 |
| ledger.py | Append-only SQLite audit trail | 113 |
| agent_loop.py | The loop that respects both | 128 |
| review_surface.py | Assembles the five-element frame, records human attestation | 114 |

Next Steps
1. Graduation to Distributed Systems
2. Cryptographic Signing
UPDATE query. Ed25519 signing wraps each ledger row in a receipt that proves the log wasn't altered after execution. But that's a different tutorial.

3. Wiring a Cron Job
```
0 3 * * 2 python examples/seo_audit_example.py https://yourdomain.com
```

is the whole thing. The loop runs to you, not into a void.
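For readers who don't read crontab daily: the five fields are minute, hour, day of month, month, and day of week (0 is Sunday), so this entry fires at 03:00 every Tuesday. A small sketch that checks a timestamp against such an entry follows. The `cron_matches` helper is hypothetical and supports only plain numbers and `*`, not ranges, lists, steps, or the alternative `7` for Sunday.

```python
from datetime import datetime


def cron_matches(schedule: str, moment: datetime) -> bool:
    """Return True if `moment` matches a five-field cron schedule.

    Only plain integers and '*' are understood in each field.
    """
    fields = schedule.split()
    if len(fields) != 5:
        raise ValueError("expected five fields: minute hour day month weekday")
    actual = [
        moment.minute,
        moment.hour,
        moment.day,
        moment.month,
        moment.isoweekday() % 7,  # cron convention: 0 = Sunday ... 6 = Saturday
    ]
    return all(field == "*" or int(field) == value for field, value in zip(fields, actual))


if __name__ == "__main__":
    schedule = "0 3 * * 2"
    print(cron_matches(schedule, datetime(2025, 7, 1, 3, 0)))  # a Tuesday at 03:00
    print(cron_matches(schedule, datetime(2025, 7, 2, 3, 0)))  # a Wednesday at 03:00
```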