# Production LLM Patterns
Battle-tested patterns for deploying LLM systems in production. These patterns address the gap between demo-quality RAG and reliable business applications, focusing on deterministic context injection, workflow decomposition, and human-in-the-loop automation.
## Key Facts
- Three business archetypes: data extraction (simplest), AI search/assistants (most popular), AI platforms (largest)
- Deterministic context injection often outperforms vector search for known question categories
- Workflow decomposition: simpler steps = fewer cognitive demands on the LLM
- Copilot pattern (partial automation) delivers 2-3x productivity gains
- Collect everything: request ID, timestamp, query, full trace - disk is cheap
## Patterns

### Pattern 1: Deterministic Context Injection (No RAG)
Instead of vector search, prepare data and feed directly to LLM:
```python
# For a liquidity comparison question, inject the prepared table directly:
context = """
| Company        | Liquidity (EUR) |
|----------------|-----------------|
| Dior Group     | 7,040,000,000   |
| Bellevue Group | 890,000,000     |
| Unica Group    | 2,100,000,000   |
"""
# LLM gives perfect answers every time -- no retrieval lottery
```
Use for high-accuracy categories while RAG handles general questions. This "cheating" scales better than you'd think.
### Pattern 2: Knowledge Base as Structured Files
Not a vector DB - folders with markdown files, Google Sheets, Confluence pages, databases. Information retrieved by known paths, not semantic search.
Example - Business Translator:

1. Script examines the source document structure
2. Pulls translation guidelines for the target language
3. Pulls the domain-specific terminology dictionary
4. Pulls error history (past mistakes and corrections)
5. Assembles everything into the context/prompt for the LLM
6. Result: accurate domain-specific translations at low cost
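A minimal sketch of path-based retrieval for the translator example. The directory layout and file names here are assumptions, not a prescribed format; the point is that every lookup is a known path, never a semantic search:

```python
from pathlib import Path

# Assumed knowledge-base layout: plain folders of markdown files.
KB = Path("knowledge_base")

def read_if_present(path: Path) -> str:
    return path.read_text() if path.exists() else ""

def build_translation_context(source_doc: str, target_lang: str, domain: str) -> str:
    """Assemble the LLM context from known file paths."""
    parts = [
        read_if_present(KB / "guidelines" / f"{target_lang}.md"),             # translation guidelines
        read_if_present(KB / "terminology" / f"{domain}.md"),                 # domain dictionary
        read_if_present(KB / "error_history" / f"{domain}_{target_lang}.md"), # past mistakes
        source_doc,
    ]
    return "\n\n---\n\n".join(p for p in parts if p)
```

Swapping the file reads for Google Sheets or Confluence API calls changes nothing about the pattern: retrieval stays deterministic.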
### Pattern 3: Instruction Distillation
GPT-4 writes compressed instructions for weaker/cheaper models:

1. Describe the business process to GPT-4
2. GPT-4 distills it into executable instructions
3. Instructions are human-readable - reviewable by non-technical stakeholders
4. If a step fails, feed the errors back: "rewrite so this doesn't happen"
5. Iterate until the error rate is acceptable
Cross-model distillation: GPT-4 writing instructions for local Mistral, iteratively simplifying until the weak model executes correctly.
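The iteration loop can be sketched as follows; `strong_llm` and `weak_llm` stand in for whatever model clients you use (both assumed to map a prompt string to a completion string), and the substring check is a placeholder for your real acceptance test:

```python
def distill_instructions(strong_llm, weak_llm, process_description, test_cases,
                         max_rounds=5, target_error_rate=0.05):
    """Compress a business process into instructions a weak model can execute.

    test_cases is a list of (input, expected_substring) pairs.
    """
    instructions = strong_llm(
        f"Distill this process into step-by-step instructions:\n{process_description}")
    for _ in range(max_rounds):
        # Run the weak model on every test case, collecting failures.
        failures = [(inp, expected, out)
                    for inp, expected in test_cases
                    if expected not in (out := weak_llm(f"{instructions}\n\nInput: {inp}"))]
        if len(failures) / len(test_cases) <= target_error_rate:
            break
        # Feed the errors back: "rewrite so this doesn't happen".
        report = "\n".join(f"input={i!r} expected={e!r} got={o!r}"
                           for i, e, o in failures)
        instructions = strong_llm(
            f"These instructions failed:\n{instructions}\n\nFailures:\n{report}\n"
            "Rewrite the instructions so these errors do not happen.")
    return instructions
```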
### Pattern 4: Dedicated Agent Abstraction
Wrap prompt + context + knowledge base into a "dedicated agent" with a name, role, specialization, and example queries:

- Understandable by business stakeholders ("virtual specialist")
- Simplifies routing (the router holds a list of agent descriptions)
- Enables dynamic scaling (new agents via configuration)
- Users can "train" new agents via simple forms
Router pattern: LLM classifies request against list of agents (name + specialization + examples). Routes to best match. Handles hundreds of specialists.
Hybrid: RAG handles 80% general queries. Router intercepts 20% high-accuracy categories to dedicated agents.
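A sketch of the router; `llm` is an assumed callable from prompt to completion, and a `None` return signals fall-through to the general RAG path:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    specialization: str
    examples: list  # example queries

def route(llm, agents, user_query):
    """Classify the request against the agent catalog; None -> general RAG."""
    catalog = "\n".join(
        f"- {a.name}: {a.specialization} (e.g. {'; '.join(a.examples)})"
        for a in agents)
    answer = llm(
        "Pick the single best specialist for the request, or answer NONE.\n"
        f"Specialists:\n{catalog}\n\nRequest: {user_query}\nSpecialist name:"
    ).strip()
    return next((a for a in agents if a.name == answer), None)
```

Because the catalog is just text, adding the hundredth specialist is a configuration change, not a code change.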
### Pattern 5: Workflow Decomposition
Break complex processes into simple sequential steps:
Example - B2B Sales Assistant:

1. Query expansion: user request -> search queries
2. Internet search + document download
3. Index the downloaded content
4. Full-text + vector search
5. Result review (loop back to step 1 if insufficient)
6. Answer synthesis
Advantages: simpler steps, isolated debugging, human-verifiable, clear failure tracing.
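The six steps above can be sketched as one loop. Every argument is an assumed step function you supply; keeping each one small is exactly what makes failures traceable to a single step:

```python
def run_workflow(request, expand, search, index, retrieve, review, synthesize,
                 max_rounds=3):
    """Run the decomposed pipeline, looping back to query expansion
    while the review step judges the evidence insufficient."""
    evidence = []
    for _ in range(max_rounds):
        queries = expand(request, evidence)      # 1. query expansion
        docs = search(queries)                   # 2. internet search + download
        store = index(docs)                      # 3. index downloaded content
        evidence = retrieve(store, request)      # 4. full-text + vector search
        if review(request, evidence):            # 5. result review
            break
    return synthesize(request, evidence)         # 6. answer synthesis
```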
### Pattern 6: Copilot / Human Envelope
Automate boring/repetitive parts, keep humans for judgment:
Marketing Content Generator:

- Human picks topic -> LLM discovers materials -> human verifies
- LLM generates 3 plans -> human picks best -> LLM writes draft -> human reviews
- 2-3x productivity (5-6 articles/day vs 1-2)

Support Copilot:

- Customer calls -> speech-to-text -> LLM processes in parallel
- LLM provides: customer profile, likely issue, proposed answer
- Human verifies and responds (70-80% faster) or overrides
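A sketch of the content-generator envelope; `llm` and `ask_human` are assumed callables, the latter standing in for whatever UI lets a person verify, pick, or edit at each checkpoint:

```python
def write_article(llm, ask_human, topic):
    """Automate the repetitive work; keep a human at every judgment call."""
    materials = llm(f"Collect and summarize source material on: {topic}")
    materials = ask_human("Verify these materials:", materials)       # human verifies
    plans = [llm(f"Article plan #{i + 1} for {topic} using:\n{materials}")
             for i in range(3)]
    plan = ask_human("Pick the best of these plans:", plans)          # human picks
    draft = llm(f"Write the article following this plan:\n{plan}")
    return ask_human("Review and edit this draft:", draft)            # human reviews
```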
### Pattern 7: Logging and Evaluation
Collect everything: request ID, timestamp, query, full trace (prompts, tokens, outputs, logprobs).
Evaluation pipeline:

1. Evaluate end-to-end quality
2. Trace issues to the specific step
3. Identify the problem (e.g., router misclassifying)
4. Build a dataset of correct/incorrect examples
5. Modify the prompt -> run against the dataset -> measure
6. If better, deploy; if worse, roll back
This is regular engineering - incremental improvements with measurements.
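The deploy-or-rollback step reduces to a measurement gate. In this sketch, `make_runner` is an assumed factory that turns a prompt into a query -> answer callable, and exact-match scoring stands in for whatever metric fits your task:

```python
def evaluate(run, dataset):
    """Fraction of (query, expected) pairs answered correctly."""
    return sum(run(query) == expected for query, expected in dataset) / len(dataset)

def try_prompt_change(make_runner, old_prompt, new_prompt, dataset):
    """Deploy the new prompt only if it measurably beats the old one."""
    old_score = evaluate(make_runner(old_prompt), dataset)
    new_score = evaluate(make_runner(new_prompt), dataset)
    return new_prompt if new_score > old_score else old_prompt  # rollback otherwise
```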
## LLM Known Limitations

| Limitation | Mitigation |
|---|---|
| Math errors | Extract to a Python/calculator tool |
| String manipulation | Use Python string operations |
| Physical-world reasoning | Add chain-of-thought |
| Niche domains | Provide domain context explicitly |
| Rhyme detection | Largely unsolved: text models lack phonetic training |
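For the math-errors row, the fix is to have the model emit an arithmetic expression and evaluate it in code rather than in the model. A minimal safe evaluator (supporting `+ - * /` and unary minus; anything fancier is out of scope for this sketch) might look like:

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str) -> float:
    """Safely evaluate an arithmetic expression extracted by the LLM."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression: {expr!r}")
    return ev(ast.parse(expr, mode="eval"))
```

Walking the AST (instead of calling `eval`) keeps arbitrary code out of the tool path.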
## Gotchas
- AI search/assistant projects fail statistically more often than data extraction because hallucinations reach users directly
- Don't default to RAG when deterministic injection works for your category
- "Improving the system through training" is better business communication than "editing prompts"
- Always have a human fallback path for when the LLM fails
- Workflow decomposition adds latency but dramatically improves reliability
- Log storage is cheap - over-log rather than under-log
## See Also
- [[rag-pipeline]] - RAG approach for the general case
- [[prompt-engineering]] - Instruction distillation and prompt patterns
- [[agent-fundamentals]] - Agent abstraction concepts
- [[llmops]] - Monitoring and evaluation infrastructure
- [[agent-memory]] - Copilot and human-in-the-loop patterns