How to Ship a Production
Most LLM applications look great in a high-fidelity demo. Then they hit the hands of real users and start failing in very predictable yet damaging ways. They answer questions they should not, they break when document retrieval is weak, they time out due to network latency, and nobody can tell exactly what happened because there are no logs and no tests.

In this tutorial, you’ll build a beginner-friendly Retrieval Augmented Generation (RAG) application designed to survive production realities. This isn’t just a script that calls an API. It’s a system featuring a FastAPI backend, a persisted FAISS vector store, and essential safety guardrails (including a retrieval gate and fallbacks).

- Why RAG Alone Does Not Equal Production-Ready
- The Architecture You Are Building
- Project Setup and Structure
- How to Build the RAG Layer with FAISS
- How to Add the LLM Call with Structured Output
- How to Add Guardrails: Retrieval Gate and Fallbacks
- FastAPI App: Creating the /answer Endpoint
- How to Add Beginner-Friendly Evals
- What to Improve Next: Realistic Upgrades

Why RAG Alone Does Not Equal Production-Ready

Retrieval Augmented Generation (RAG) is often hailed as the hallucination killer. By grounding the model in retrieved text, we provide it with the facts it needs to be accurate. But simply connecting a vector database to an LLM isn’t enough for a production environment.

Production issues usually arise from the silent failures in the system surrounding the model:

- Weak retrieval: If the app retrieves irrelevant chunks of text, the model tries to bridge the gap by inventing an answer anyway. Without a designated “I do not know” path, the model is essentially forced to hallucinate.
- Lack of visibility: Without structured outputs and basic logging, you can’t tell if bad retrieval, a confusing prompt, or a model update caused a wrong answer.
- Fragility: A simple API timeout or malformed provider response becomes a user-facing outage if you don’t implement fallbacks.
- No regression testing: In traditional software, we have unit tests.
In AI, we need evals. Without them, a small tweak to your prompt might fix one issue but break ten others without you realising it.

We’ll solve each of these issues systematically in this guide.

Prerequisites

This tutorial is beginner-friendly, but it assumes you have a few basics in place so you can focus on building a robust RAG system instead of getting stuck on setup issues.

Knowledge

You should be comfortable with:

- Python fundamentals (functions, modules, virtual environments)
- Basic HTTP + JSON (requests, response payloads)
- APIs with FastAPI (what an endpoint is and how to run a server)
- High-level LLM concepts (prompting, temperature, structured outputs)

Tools + Accounts

You’ll need:

- Python 3.10+
- A working OpenAI-compatible API key (OpenAI or any provider that supports the same request/response shape)
- A local environment where you can run a FastAPI app (Mac/Linux/Windows)

What This Tutorial Covers (and What It Doesn’t)

We’ll build a production-minded baseline:

- A FAISS-backed retriever with a persisted index + metadata
- A retrieval gate to prevent “forced hallucination”
- Structured JSON outputs so your backend is stable
- Fallback behavior for timeouts and provider errors
- A small eval harness to prevent regressions

We won’t implement advanced upgrades such as rerankers, semantic chunking, auth, or background jobs, beyond a roadmap at the end.

The Architecture You Are Building

The flow of our application follows a disciplined path so every answer is grounded in evidence:

1. User query: The user submits a question via a FastAPI endpoint.
2. Retrieval: The system embeds the question and retrieves the top-k most similar document chunks.
3. The retrieval gate: We evaluate the similarity score. If the context is not relevant enough, we stop immediately and refuse the query.
4. Augmentation and generation: If the gate passes, we send a context-augmented prompt to the LLM.
5. Structured response: The model returns a JSON object containing the answer, sources used, and a confidence level.

Project Setup and Structure

To keep things organized and maintainable, we’ll use a modular structure.
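This allows you to swap out your LLM provider or your vector database without rewriting your entire core application.

First, create a virtual environment to isolate your project and install the dependencies; the commands are listed under Install Dependencies.

Create a .env file in the root directory. We are targeting OpenAI-compatible providers.

Important note on compatibility: The code below assumes an OpenAI-style API. If you use a provider that is not compatible, you must change the URL, headers (for example X-API-Key), and the way you extract embeddings and final message content in embed_texts() and call_llm().

How to Build the RAG Layer with FAISS

In rag.py, we handle the “Retriever” part of RAG. This involves turning raw text into mathematical vectors that the computer can compare.

What is FAISS (and What Does It Do)?

FAISS (Facebook AI Similarity Search) is a fast library for vector similarity search. In a RAG system, each chunk of text becomes an embedding vector (a list of floats). FAISS stores those vectors in an index so you can quickly ask: “Given this question embedding, which document chunks are closest to it?”

In this tutorial, we use IndexFlatIP (inner product) and normalise vectors with faiss.normalize_L2(...). With normalised vectors, the inner product behaves like cosine similarity, giving us a stable score we can use for a retrieval gate.

Chunking Strategy With Overlap

We’ll use chunking with overlap. If we split a document at exactly 1,000 characters, we might cut a sentence in half, losing its meaning. By using an overlap of, for example, 200 characters, we ensure that the end of one chunk and the beginning of the next share context.

How to Add the LLM Call with Structured Output

A major failure point in AI apps is the “chatty” nature of LLMs. If your backend expects a list of sources but the LLM returns conversational filler, your code will crash. We solve this with structured output: instruct the model to return a strict JSON object, then parse it safely.

How to Add Guardrails: Retrieval Gate and Fallbacks

Guardrails are interceptors. They sit between the user and the model to prevent predictable failures.

The Retrieval Gate: How It Works and How to Add It

In a standard RAG pipeline, the system always calls the LLM. If the user asks an irrelevant question, the retriever will still return the “closest” (but wrong) chunks. The solution is the retrieval gate:

- Retrieve top-k chunks and get the top similarity score
- If the score is below a threshold (for example 0.30), refuse immediately
- Only call the LLM when retrieval is strong enough to ground the answer

A threshold of 0.30 is a reasonable starting point when using normalised cosine similarity, but you should tune it using evals (covered in the evals section later in this tutorial).

Fallbacks and Why They Matter

Fallbacks ensure that if an API fails or times out, the user gets a helpful message instead of a crash. They also keep your API response shape consistent, which prevents frontend errors and makes logging meaningful.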
In this tutorial, fallbacks are implemented inside call_llm() so your FastAPI layer stays simple.

FastAPI App: Creating the /answer Endpoint

The app.py file is the conductor. It ties retrieval, guardrails, prompting, and generation together.

Centralized Prompt – Template: prompts.py

A small but important habit: keep prompts centralised so they’re versionable and easy to evaluate.

How to Add Beginner-Friendly Evals

In AI systems, outputs are probabilistic. This makes testing harder than in traditional software. Evals (evaluations) are a set of “golden questions” and “expected behaviours” you run repeatedly to detect regressions. Instead of “does it output exactly this string,” you test:

- Should the app refuse when the retrieval is weak?
- When it answers, does it include sources?
- Is the behaviour stable across prompt tweaks and model changes?

Step 1: Create evals/eval_set.json

This should contain both positive and negative cases.

Step 2: Create evals/run_evals.py

This runner calls your API endpoint (end-to-end) and checks expected behaviours.

How to Use Evals in Practice

Run your server with uvicorn app:app --reload, then, in another terminal, run the evals with python evals/run_evals.py. If an eval fails, you have a concrete signal that something changed in retrieval, gating, prompting, or provider behaviour.

What to Improve Next: Realistic Upgrades

Building a reliable RAG app is iterative. Here are realistic next steps:

- Semantic chunking: Break text based on meaning instead of character count.
- Reranking: Use a cross-encoder reranker to reorder the top-k chunks for higher precision.
- Metadata filtering: Filter results by category, date, or department to reduce false positives.
- Better citations: Store chunk IDs and show exactly which chunk(s) the answer came from.
- Observability: Add request IDs, structured logs, and traces so “what happened?” is answerable.
- Async + background indexing: Move index building to a background job and keep the API responsive.

Final Thoughts: Production-Ready Is a Set of Habits

Building an AI application that survives in the real world is about building a system that is predictable, measurable, and safe.

- Retrieval quality is measurable: Use similarity scores to gate your LLM.
- Refusal is a feature: It is better to say “I do not know” than to lie.
- Fallbacks are mandatory: Design for the moment the API goes down.
- Evals prevent regressions: Never deploy a change without running your tests.
I am Chidozie Managwu, an award-winning AI Product Architect and founder focused on helping global tech talent build real, production-ready skills. I contribute to global AI initiatives as a GAFAI Delegate and lead AI Titans Network, a community for developers learning how to ship AI products. My work has been recognized with the Global Tech Hero award and featured on platforms like HackerNoon.

Why RAG Alone Does Not Equal Production-Ready
Prerequisites
Knowledge
Tools + Accounts
What This Tutorial Covers (and What It Doesn’t)
The Architecture You Are Building
Project Setup and Structure
Project Structure
    .
    ├── app.py          # FastAPI entry point and API logic
    ├── rag.py          # FAISS index, persistence, and document retrieval
    ├── llm.py          # LLM API interface and JSON parsing
    ├── prompts.py      # Centralized prompt templates
    ├── data/           # Source .txt documents
    ├── index/          # Persisted FAISS index and metadata
    └── evals/          # Evaluation dataset and runner script
        ├── eval_set.json
        └── run_evals.py

Install Dependencies
    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install fastapi uvicorn faiss-cpu numpy pydantic requests python-dotenv

Configure the Environment
Create a .env file in the root directory. We are targeting OpenAI-compatible providers:

    OPENAI_API_KEY=your_actual_api_key_here
    OPENAI_BASE_URL=https://api.openai.com/v1
    OPENAI_MODEL=gpt-4o-mini

If your provider is not OpenAI-compatible, change the URL, the headers (for example X-API-Key), and the way you extract embeddings and final message content in embed_texts() and call_llm().

How to Build the RAG Layer with FAISS
In rag.py, we handle the “Retriever” part of RAG. This involves turning raw text into mathematical vectors that the computer can compare.

What is FAISS (and What Does It Do)?
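An inner-product index only gives a comparable relevance score when the vectors are scaled to unit length first, because the dot product of two unit vectors is their cosine similarity. The sketch below illustrates that idea in pure Python, without FAISS; the helper names `l2_normalize` and `cosine_scores` and the toy three-dimensional "embeddings" are assumptions for illustration, not part of the tutorial's code.

```python
import math
from typing import List


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def cosine_scores(query: List[float], docs: List[List[float]]) -> List[float]:
    """Inner product of unit vectors, which equals cosine similarity."""
    q = l2_normalize(query)
    return [sum(a * b for a, b in zip(q, l2_normalize(d))) for d in docs]


# Toy vectors: real embeddings have hundreds or thousands of dimensions.
docs = [[1.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 2.0]]
query = [2.0, 0.0, 0.0]

scores = cosine_scores(query, docs)
# Rank document positions from most to least similar.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Vector length no longer matters after normalisation: the query `[2, 0, 0]` and the document `[1, 0, 0]` point the same way, so they score as a perfect match even though their magnitudes differ.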
In this tutorial, we use IndexFlatIP (inner product) and normalise vectors with faiss.normalize_L2(...). With normalised vectors, the inner product behaves like cosine similarity, giving us a stable score we can use for a retrieval gate.

Chunking Strategy With Overlap
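To see what overlap does in practice, here is a small sketch that uses the same sliding-window logic as the tutorial's chunk_text(), but with a tiny size and overlap so the shared characters are visible. The `.strip()` call is left out here (an intentional simplification) so the overlap is exact.

```python
from typing import List


def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    step = max(1, size - overlap)  # guard against a zero or negative step
    chunks = []
    for i in range(0, len(text), step):
        chunk = text[i : i + size]
        if chunk:
            chunks.append(chunk)
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, size=10, overlap=4)
for c in chunks:
    print(c)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
# yz
```

Each chunk begins with the last four characters of the previous full-size chunk. Note that the final window ("yz") is already fully covered by the one before it; with this simple approach, a short trailing chunk like that is expected.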
Implementation of rag.py

    import json
    import os
    from typing import Dict, List

    import faiss
    import numpy as np
    import requests
    from dotenv import load_dotenv

    load_dotenv()

    INDEX_PATH = "index/faiss.index"
    META_PATH = "index/meta.json"


    def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> List[str]:
        """Split text into fixed-size character chunks that overlap."""
        chunks = []
        step = max(1, size - overlap)  # guard against a zero or negative step
        for i in range(0, len(text), step):
            chunk = text[i : i + size].strip()
            if chunk:
                chunks.append(chunk)
        return chunks


    def embed_texts(texts: List[str]) -> np.ndarray:
        """Embed a list of strings via an OpenAI-compatible /embeddings endpoint."""
        # Note: If your provider is not OpenAI-compatible, change this URL and headers
        url = f"{os.getenv('OPENAI_BASE_URL')}/embeddings"
        headers = {"Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}"}
        payload = {"input": texts, "model": "text-embedding-3-small"}
        resp = requests.post(url, headers=headers, json=payload, timeout=30)
        resp.raise_for_status()
        # If your provider uses a different response format, change the line below
        vectors = np.array(
            [item["embedding"] for item in resp.json()["data"]], dtype="float32"
        )
        return vectors


    def build_index() -> None:
        """Chunk every .txt file in data/, embed the chunks, and persist index + metadata."""
        all_chunks: List[str] = []
        metadata: List[Dict] = []

        if not os.path.exists("data"):
            os.makedirs("data")
            return  # nothing to index yet

        for file in os.listdir("data"):
            if not file.endswith(".txt"):
                continue
            with open(f"data/{file}", "r", encoding="utf-8") as f:
                text = f.read()
            chunks = chunk_text(text)
            all_chunks.extend(chunks)
            for c in chunks:
                metadata.append({"source": file, "text": c})

        if not all_chunks:
            return

        embeddings = embed_texts(all_chunks)
        faiss.normalize_L2(embeddings)  # in place; inner product now equals cosine
        dim = embeddings.shape[1]
        index = faiss.IndexFlatIP(dim)
        index.add(embeddings)

        os.makedirs("index", exist_ok=True)
        faiss.write_index(index, INDEX_PATH)
        with open(META_PATH, "w", encoding="utf-8") as f:
            json.dump(metadata, f, ensure_ascii=False)


    def load_index():
        """Load the persisted FAISS index and its metadata from disk."""
        if not (os.path.exists(INDEX_PATH) and os.path.exists(META_PATH)):
            raise FileNotFoundError(
                "FAISS index not found. Add .txt files to data/ and run build_index()."
            )
        index = faiss.read_index(INDEX_PATH)
        with open(META_PATH, "r", encoding="utf-8") as f:
            metadata = json.load(f)
        return index, metadata


    def retrieve(query: str, k: int = 5) -> List[Dict]:
        """Return up to k chunks, best match first, each with its similarity score."""
        index, metadata = load_index()  # reloaded on every call; simple, not fast
        q_emb = embed_texts([query])
        faiss.normalize_L2(q_emb)
        scores, ids = index.search(q_emb, k)

        results = []
        for score, idx in zip(scores[0], ids[0]):
            if idx == -1:  # FAISS pads with -1 when fewer than k vectors exist
                continue
            m = metadata[idx]
            results.append(
                {
                    "score": float(score),
                    "source": m["source"],
                    "text": m["text"],
                    "id": int(idx),
                }
            )
        return results

How to Add the LLM Call with Structured Output
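Parsing safely means treating the model's text as untrusted input: it may not be JSON at all, or it may be JSON with missing or wrongly typed fields. The sketch below shows one way to coerce whatever comes back into the four-key shape this tutorial uses. The function name `parse_llm_json` and the `parse_error` label are hypothetical, and the choice to downgrade unknown confidence values to "low" is an assumption rather than the tutorial's behaviour.

```python
import json
from typing import Any, Dict

ALLOWED_CONFIDENCE = {"low", "medium", "high"}


def parse_llm_json(content: str) -> Dict[str, Any]:
    """Turn raw model text into a dict with answer/refusal/confidence/sources."""
    try:
        data = json.loads(content)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        data = None

    if not isinstance(data, dict):
        # Conversational filler or a bare list/string: refuse with a stable shape.
        return {
            "answer": "The model returned an unreadable response.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "error_type": "parse_error",  # hypothetical label
        }

    answer = data.get("answer")
    confidence = data.get("confidence")
    sources = data.get("sources")
    return {
        "answer": answer if isinstance(answer, str) else "",
        # Only a real JSON `true` counts; the string "false" must not become True.
        "refusal": data.get("refusal") is True,
        "confidence": confidence
        if isinstance(confidence, str) and confidence in ALLOWED_CONFIDENCE
        else "low",
        "sources": [str(s) for s in sources] if isinstance(sources, list) else [],
    }


good = parse_llm_json(
    '{"answer": "Use a gate.", "refusal": false, "confidence": "high", "sources": ["a.txt"]}'
)
chatty = parse_llm_json("Sure! Here is your answer...")
partial = parse_llm_json('{"answer": "x", "confidence": "very"}')
print(good, chatty, partial, sep="\n")
```

Whatever the model emits, the caller always receives the same keys, which keeps the API layer free of defensive checks.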
Implementation of llm.py

    import json
    import os
    from typing import Any, Dict

    import requests


    def call_llm(system_prompt: str, user_prompt: str) -> Dict[str, Any]:
        """Call an OpenAI-compatible chat endpoint and return a parsed JSON dict.

        Never raises: on any failure it returns a refusal with the same keys.
        """
        # Note: Change URL/Headers if using a non-OpenAI compatible provider
        url = f"{os.getenv('OPENAI_BASE_URL')}/chat/completions"
        headers = {
            "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
            "Content-Type": "application/json",
        }
        payload = {
            "model": os.getenv("OPENAI_MODEL"),
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
            "response_format": {"type": "json_object"},
            "temperature": 0,
        }

        try:
            resp = requests.post(url, headers=headers, json=payload, timeout=30)
            resp.raise_for_status()
            content = resp.json()["choices"][0]["message"]["content"]
            parsed = json.loads(content)
            # Fill in any keys the model left out so callers see a stable shape
            parsed.setdefault("answer", "")
            parsed.setdefault("refusal", False)
            parsed.setdefault("confidence", "medium")
            parsed.setdefault("sources", [])
            return parsed
        except (requests.Timeout, requests.ConnectionError):
            # Fallback 1: the provider could not be reached in time
            return {
                "answer": "The system is temporarily unavailable (network issue). Please try again.",
                "refusal": True,
                "confidence": "low",
                "sources": [],
                "error_type": "network_error",
            }
        except Exception:
            # Fallback 2: HTTP errors, malformed JSON, unexpected response shape
            return {
                "answer": "A system error occurred while generating the answer.",
                "refusal": True,
                "confidence": "low",
                "sources": [],
                "error_type": "unknown_error",
            }

How to Add Guardrails: Retrieval Gate and Fallbacks
The Retrieval Gate: How It Works and How to Add It
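The gate itself is only a few lines of logic. As a minimal sketch, it can be pulled out into a standalone function that takes retrieval results (dicts with a "score" key, as returned by retrieve()) and either returns a refusal payload or `None` to let the request continue. The function name `retrieval_gate` is hypothetical; the tutorial applies the same check inline in app.py.

```python
from typing import Dict, List, Optional

GATE_THRESHOLD = 0.30  # the tutorial's starting point; tune it with evals


def retrieval_gate(
    results: List[Dict], threshold: float = GATE_THRESHOLD
) -> Optional[Dict]:
    """Return a refusal payload when retrieval is too weak, otherwise None."""
    # No results at all is treated as the weakest possible retrieval.
    top_score = max((r["score"] for r in results), default=0.0)
    if top_score < threshold:
        return {
            "answer": "I do not have documents to answer that question.",
            "refusal": True,
            "confidence": "low",
            "sources": [],
            "retrieval": {"top_score": top_score},
        }
    return None  # strong enough: the caller may go on to call the LLM


weak = retrieval_gate([{"score": 0.12}, {"score": 0.08}])
strong = retrieval_gate([{"score": 0.82}, {"score": 0.40}])
empty = retrieval_gate([])
print(weak, strong, empty, sep="\n")
```

Keeping the gate as a pure function makes it easy to unit test without an index, an API key, or a running server.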
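If the top similarity score is below a threshold (for example 0.30), refuse immediately. A threshold of 0.30 is a reasonable starting point when using normalised cosine similarity, but you should tune it using evals (covered in the evals section later in this tutorial).

Fallbacks and Why They Matter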
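The core of a fallback is that every failure path returns the same keys as a success. The sketch below shows that pattern in isolation, with a simulated provider so it runs offline. The names `call_with_fallback` and `fallback_response` are hypothetical, and the built-in `TimeoutError` and `ConnectionError` stand in for `requests.Timeout` and `requests.ConnectionError`, which are separate exception classes in the real code.

```python
from typing import Any, Callable, Dict


def fallback_response(message: str, error_type: str) -> Dict[str, Any]:
    """Build a refusal that has the same keys as a normal answer."""
    return {
        "answer": message,
        "refusal": True,
        "confidence": "low",
        "sources": [],
        "error_type": error_type,
    }


def call_with_fallback(generate: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Run `generate` and convert any failure into a consistent payload."""
    try:
        return generate()
    except (TimeoutError, ConnectionError):
        return fallback_response(
            "The system is temporarily unavailable (network issue). Please try again.",
            "network_error",
        )
    except Exception:
        return fallback_response(
            "A system error occurred while generating the answer.",
            "unknown_error",
        )


def slow_provider() -> Dict[str, Any]:
    raise TimeoutError("simulated timeout")


def broken_provider() -> Dict[str, Any]:
    raise ValueError("simulated malformed response")


timed_out = call_with_fallback(slow_provider)
broken = call_with_fallback(broken_provider)
print(timed_out["error_type"], broken["error_type"])
```

Because both failure payloads share the same keys, a frontend or eval runner can read `refusal` and `sources` without checking which path produced the response.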
In this tutorial, fallbacks are implemented inside call_llm() so your FastAPI layer stays simple.

FastAPI App: Creating the /answer Endpoint
The app.py file is the conductor. It ties retrieval, guardrails, prompting, and generation together.

Implementation of app.py

    import logging
    import time

    from fastapi import FastAPI
    from pydantic import BaseModel

    import prompts
    from llm import call_llm
    from rag import retrieve

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("rag_app")

    app = FastAPI(title="Production-Ready RAG")

    TOP_K = 5
    GATE_THRESHOLD = 0.30  # starting point; tune with evals


    class QueryRequest(BaseModel):
        question: str


    # Declared with plain `def` rather than `async def`: retrieve() and call_llm()
    # make blocking HTTP calls, so FastAPI should run this handler in its threadpool.
    @app.post("/answer")
    def get_answer(req: QueryRequest):
        start_time = time.time()
        question = (req.question or "").strip()
        if not question:
            return {
                "answer": "Please provide a non-empty question.",
                "refusal": True,
                "confidence": "low",
                "sources": [],
                "latency_sec": round(time.time() - start_time, 2),
            }

        # 1) Retrieval. retrieve() can raise (missing index, embedding API error),
        #    so return the usual response shape instead of an unhandled 500.
        try:
            results = retrieve(question, k=TOP_K)
        except Exception:
            logger.exception("retrieval failed for query=%r", question)
            return {
                "answer": "The system is temporarily unavailable. Please try again.",
                "refusal": True,
                "confidence": "low",
                "sources": [],
                "error_type": "retrieval_error",
                "latency_sec": round(time.time() - start_time, 2),
            }

        top_score = results[0]["score"] if results else 0.0
        logger.info(
            "query=%r top_score=%.3f num_results=%d", question, top_score, len(results)
        )

        # 2) Retrieval Gate (Guardrail)
        if top_score < GATE_THRESHOLD:
            return {
                "answer": "I do not have documents to answer that question.",
                "refusal": True,
                "confidence": "low",
                "sources": [],
                "latency_sec": round(time.time() - start_time, 2),
                "retrieval": {"top_score": top_score, "k": TOP_K},
            }

        # 3) Augment
        context_text = "\n\n".join(
            f"Source {r['source']}: {r['text']}" for r in results
        )
        user_prompt = f"Context:\n{context_text}\n\nQuestion: {question}"

        # 4) Generation with Fallback
        response = call_llm(prompts.SYSTEM_PROMPT, user_prompt)

        # 5) Attach debug metadata
        response["latency_sec"] = round(time.time() - start_time, 2)
        response["retrieval"] = {"top_score": top_score, "k": TOP_K}
        return response

Centralized Prompt – Template: prompts.py
Example prompts.py

    SYSTEM_PROMPT = """You are a RAG assistant. Use ONLY the provided Context to answer.
    If the context does not contain the answer, respond with refusal=true.
    Return a valid JSON object with exactly these keys:
    - answer: string
    - refusal: boolean
    - confidence: "low" | "medium" | "high"
    - sources: array of strings (source filenames you used)
    Do not include any extra keys. Do not include markdown. Do not include commentary."""

How to Add Beginner-Friendly Evals
Step 1: Create evals/eval_set.json

    [
      {
        "id": "in_scope_01",
        "question": "What is a retrieval gate and why is it important?",
        "expect_refusal": false,
        "notes": "Should explain gating and relate it to hallucination prevention."
      },
      {
        "id": "out_of_scope_01",
        "question": "What is the capital of France?",
        "expect_refusal": true,
        "notes": "If the knowledge base only includes our docs, the app should refuse."
      },
      {
        "id": "edge_01",
        "question": "",
        "expect_refusal": true,
        "notes": "Empty input should not call the LLM."
      }
    ]

Step 2: Create evals/run_evals.py

    import json

    import requests

    API_URL = "http://127.0.0.1:8000/answer"


    def run():
        with open("evals/eval_set.json", "r", encoding="utf-8") as f:
            cases = json.load(f)

        passed = 0
        failed = 0

        for case in cases:
            resp = requests.post(API_URL, json={"question": case["question"]}, timeout=60)
            resp.raise_for_status()
            out = resp.json()

            got_refusal = bool(out.get("refusal", False))
            expect_refusal = bool(case["expect_refusal"])
            ok = got_refusal == expect_refusal

            # Beginner-friendly: if it answers, sources should exist and be a list
            if not got_refusal:
                ok = ok and isinstance(out.get("sources"), list)

            if ok:
                passed += 1
                print(f"PASS {case['id']}")
            else:
                failed += 1
                print(
                    f"FAIL {case['id']} expected_refusal={expect_refusal} "
                    f"got_refusal={got_refusal}"
                )
                print("Output:", json.dumps(out, indent=2))

        print(f"\nDone. Passed={passed} Failed={failed}")
        if failed:
            raise SystemExit(1)  # non-zero exit code so CI can fail the build


    if __name__ == "__main__":
        run()

How to Use Evals in Practice
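Because the /answer response includes retrieval.top_score, an eval run can also help you choose the gate threshold instead of guessing it. The sketch below sweeps a few candidate thresholds over labelled (top_score, expect_refusal) pairs and reports how often the gate would have made the expected decision. The scores are made-up numbers for illustration only, and the helper names `gate_accuracy` and `sweep` are hypothetical.

```python
from typing import List, Tuple

# (top_score, expect_refusal) pairs. Hypothetical values, not measured results.
LABELED: List[Tuple[float, bool]] = [
    (0.12, True),
    (0.22, True),
    (0.28, True),
    (0.35, True),
    (0.41, False),
    (0.55, False),
]


def gate_accuracy(labeled: List[Tuple[float, bool]], threshold: float) -> float:
    """Fraction of cases where `score < threshold` matches the expected refusal."""
    correct = sum((score < threshold) == expect for score, expect in labeled)
    return correct / len(labeled)


def sweep(
    labeled: List[Tuple[float, bool]], candidates: List[float]
) -> List[Tuple[float, float]]:
    """Return (threshold, accuracy) for every candidate threshold."""
    return [(t, gate_accuracy(labeled, t)) for t in candidates]


table = sweep(LABELED, [0.2, 0.3, 0.4, 0.5])
best_threshold, best_accuracy = max(table, key=lambda row: row[1])
for threshold, accuracy in table:
    print(f"threshold={threshold:.2f} accuracy={accuracy:.2f}")
print("best:", best_threshold)
```

With only a handful of cases the "best" value is fragile, so treat the result as a hint and keep adding eval cases before you rely on it.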
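Build the index once after adding .txt files to data/ (load_index() raises FileNotFoundError until you do):

    python -c "from rag import build_index; build_index()"

Run your server:

    uvicorn app:app --reload

In another terminal, run evals:

    python evals/run_evals.py

What to Improve Next: Realistic Upgrades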
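As one example of these upgrades, better citations can build on the chunk id that retrieve() already returns. The sketch below is a rough illustration, assuming the prompt is changed to ask the model for the chunk ids it used (the tutorial's prompt asks for filenames instead). The function names `build_context` and `resolve_citations` and the sample data are hypothetical.

```python
from typing import Dict, List


def build_context(results: List[Dict]) -> str:
    """Label every chunk with its id and source so the model can cite it."""
    return "\n\n".join(
        f"[chunk {r['id']} | {r['source']}] {r['text']}" for r in results
    )


def resolve_citations(cited_ids: List[int], results: List[Dict]) -> List[Dict]:
    """Map cited chunk ids back to chunks, ignoring ids that were never retrieved."""
    by_id = {r["id"]: r for r in results}
    return [
        {"id": i, "source": by_id[i]["source"], "text": by_id[i]["text"]}
        for i in cited_ids
        if i in by_id
    ]


# Hypothetical retrieval results in the shape returned by retrieve().
results = [
    {"id": 3, "source": "handbook.txt", "text": "Refunds take 5 days.", "score": 0.81},
    {"id": 7, "source": "faq.txt", "text": "Support is open on weekdays.", "score": 0.64},
]

context = build_context(results)
# Suppose the model claims it used chunks 7 and 99; id 99 was never retrieved.
citations = resolve_citations([7, 99], results)
print(context)
print(citations)
```

Dropping ids that were never retrieved means a citation the model invented cannot reach the user.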
Final Thoughts: Production-Ready Is a Set of Habits
About Me