# Production LLM Patterns
Battle-tested patterns for deploying LLM systems in production. These patterns address the gap between demo-quality RAG and reliable business applications, focusing on deterministic context injection, workflow decomposition, and human-in-the-loop automation.
## Key Facts
- Three business archetypes: data extraction (simplest), AI search/assistants (most popular), AI platforms (largest)
- Deterministic context injection often outperforms vector search for known question categories
- Workflow decomposition: simpler steps = fewer cognitive demands on the LLM
- Copilot pattern (partial automation) delivers 2-3x productivity gains
- Collect everything: request ID, timestamp, query, full trace - disk is cheap
## Patterns

### Pattern 1: Deterministic Context Injection (No RAG)
Instead of vector search, prepare data and feed directly to LLM:
```python
# For a liquidity comparison question, inject the prepared table directly:
context = """
| Company        | Liquidity (EUR) |
|----------------|-----------------|
| Dior Group     | 7,040,000,000   |
| Bellevue Group | 890,000,000     |
| Unica Group    | 2,100,000,000   |
"""
# LLM gives perfect answers every time -- no retrieval lottery
```
Use for high-accuracy categories while RAG handles general questions. This "cheating" scales better than you'd think.
### Pattern 2: Knowledge Base as Structured Files
Not a vector DB - folders with markdown files, Google Sheets, Confluence pages, databases. Information retrieved by known paths, not semantic search.
Example - Business Translator:

1. Script examines the source document structure
2. Pulls translation guidelines for the target language
3. Pulls the domain-specific terminology dictionary
4. Pulls error history (past mistakes and corrections)
5. Assembles everything into the context/prompt for the LLM
6. Result: accurate domain-specific translations at low cost
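A minimal sketch of path-based retrieval for the translator example. The directory layout and file names here are assumptions, not a prescribed format; the point is that every lookup is a known path, never a semantic search:

```python
from pathlib import Path

# Assumed knowledge-base layout: plain folders of markdown files.
KB = Path("knowledge_base")

def read_if_present(path: Path) -> str:
    return path.read_text() if path.exists() else ""

def build_translation_context(source_doc: str, target_lang: str, domain: str) -> str:
    """Assemble the LLM context from known file paths."""
    parts = [
        read_if_present(KB / "guidelines" / f"{target_lang}.md"),             # translation guidelines
        read_if_present(KB / "terminology" / f"{domain}.md"),                 # domain dictionary
        read_if_present(KB / "error_history" / f"{domain}_{target_lang}.md"), # past mistakes
        source_doc,
    ]
    return "\n\n---\n\n".join(p for p in parts if p)
```

Swapping the file reads for Google Sheets or Confluence API calls changes nothing about the pattern: retrieval stays deterministic.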
### Pattern 3: Instruction Distillation
GPT-4 writes compressed instructions for weaker/cheaper models:

1. Describe the business process to GPT-4
2. GPT-4 distills it into executable instructions
3. Instructions are human-readable - reviewable by non-technical stakeholders
4. If a step fails, feed the errors back: "rewrite so this doesn't happen"
5. Iterate until the error rate is acceptable
Cross-model distillation: GPT-4 writing instructions for local Mistral, iteratively simplifying until the weak model executes correctly.
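The iteration loop can be sketched as follows; `strong_llm` and `weak_llm` stand in for whatever model clients you use (both assumed to map a prompt string to a completion string), and the substring check is a placeholder for your real acceptance test:

```python
def distill_instructions(strong_llm, weak_llm, process_description, test_cases,
                         max_rounds=5, target_error_rate=0.05):
    """Compress a business process into instructions a weak model can execute.

    test_cases is a list of (input, expected_substring) pairs.
    """
    instructions = strong_llm(
        f"Distill this process into step-by-step instructions:\n{process_description}")
    for _ in range(max_rounds):
        # Run the weak model on every test case, collecting failures.
        failures = [(inp, expected, out)
                    for inp, expected in test_cases
                    if expected not in (out := weak_llm(f"{instructions}\n\nInput: {inp}"))]
        if len(failures) / len(test_cases) <= target_error_rate:
            break
        # Feed the errors back: "rewrite so this doesn't happen".
        report = "\n".join(f"input={i!r} expected={e!r} got={o!r}"
                           for i, e, o in failures)
        instructions = strong_llm(
            f"These instructions failed:\n{instructions}\n\nFailures:\n{report}\n"
            "Rewrite the instructions so these errors do not happen.")
    return instructions
```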
### Pattern 4: Dedicated Agent Abstraction
Wrap prompt + context + knowledge base into a "dedicated agent" with a name, role, specialization, and example queries:

- Understandable by business stakeholders ("virtual specialist")
- Simplifies routing (the router holds a list of agent descriptions)
- Enables dynamic scaling (new agents via configuration)
- Users can "train" new agents via simple forms
Router pattern: LLM classifies request against list of agents (name + specialization + examples). Routes to best match. Handles hundreds of specialists.
Hybrid: RAG handles 80% general queries. Router intercepts 20% high-accuracy categories to dedicated agents.
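A sketch of the router; `llm` is an assumed callable from prompt to completion, and a `None` return signals fall-through to the general RAG path:

```python
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    specialization: str
    examples: list  # example queries

def route(llm, agents, user_query):
    """Classify the request against the agent catalog; None -> general RAG."""
    catalog = "\n".join(
        f"- {a.name}: {a.specialization} (e.g. {'; '.join(a.examples)})"
        for a in agents)
    answer = llm(
        "Pick the single best specialist for the request, or answer NONE.\n"
        f"Specialists:\n{catalog}\n\nRequest: {user_query}\nSpecialist name:"
    ).strip()
    return next((a for a in agents if a.name == answer), None)
```

Because the catalog is just text, adding the hundredth specialist is a configuration change, not a code change.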
### Pattern 5: Workflow Decomposition
Break complex processes into simple sequential steps:
Example - B2B Sales Assistant:

1. Query expansion: user request -> search queries
2. Internet search + document download
3. Index the downloaded content
4. Full-text + vector search
5. Result review (loop back to step 1 if insufficient)
6. Answer synthesis
Advantages: simpler steps, isolated debugging, human-verifiable, clear failure tracing.
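The six steps above can be sketched as one loop. Every argument is an assumed step function you supply; keeping each one small is exactly what makes failures traceable to a single step:

```python
def run_workflow(request, expand, search, index, retrieve, review, synthesize,
                 max_rounds=3):
    """Run the decomposed pipeline, looping back to query expansion
    while the review step judges the evidence insufficient."""
    evidence = []
    for _ in range(max_rounds):
        queries = expand(request, evidence)      # 1. query expansion
        docs = search(queries)                   # 2. internet search + download
        store = index(docs)                      # 3. index downloaded content
        evidence = retrieve(store, request)      # 4. full-text + vector search
        if review(request, evidence):            # 5. result review
            break
    return synthesize(request, evidence)         # 6. answer synthesis
```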
### Pattern 6: Copilot / Human Envelope
Automate boring/repetitive parts, keep humans for judgment:
Marketing Content Generator:

- Human picks topic -> LLM discovers materials -> human verifies
- LLM generates 3 plans -> human picks best -> LLM writes draft -> human reviews
- 2-3x productivity (5-6 articles/day vs 1-2)

Support Copilot:

- Customer calls -> speech-to-text -> LLM processes in parallel
- LLM provides: customer profile, likely issue, proposed answer
- Human verifies and responds (70-80% faster) or overrides
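A sketch of the content-generator envelope; `llm` and `ask_human` are assumed callables, the latter standing in for whatever UI lets a person verify, pick, or edit at each checkpoint:

```python
def write_article(llm, ask_human, topic):
    """Automate the repetitive work; keep a human at every judgment call."""
    materials = llm(f"Collect and summarize source material on: {topic}")
    materials = ask_human("Verify these materials:", materials)       # human verifies
    plans = [llm(f"Article plan #{i + 1} for {topic} using:\n{materials}")
             for i in range(3)]
    plan = ask_human("Pick the best of these plans:", plans)          # human picks
    draft = llm(f"Write the article following this plan:\n{plan}")
    return ask_human("Review and edit this draft:", draft)            # human reviews
```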
### Pattern 7: Logging and Evaluation
Collect everything: request ID, timestamp, query, full trace (prompts, tokens, outputs, logprobs).
Evaluation pipeline:

1. Evaluate end-to-end quality
2. Trace issues to the specific step
3. Identify the problem (e.g., router misclassifying)
4. Build a dataset of correct/incorrect examples
5. Modify the prompt -> run against the dataset -> measure
6. If better, deploy; if worse, roll back
This is regular engineering - incremental improvements with measurements.
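The deploy-or-rollback step reduces to a measurement gate. In this sketch, `make_runner` is an assumed factory that turns a prompt into a query -> answer callable, and exact-match scoring stands in for whatever metric fits your task:

```python
def evaluate(run, dataset):
    """Fraction of (query, expected) pairs answered correctly."""
    return sum(run(query) == expected for query, expected in dataset) / len(dataset)

def try_prompt_change(make_runner, old_prompt, new_prompt, dataset):
    """Deploy the new prompt only if it measurably beats the old one."""
    old_score = evaluate(make_runner(old_prompt), dataset)
    new_score = evaluate(make_runner(new_prompt), dataset)
    return new_prompt if new_score > old_score else old_prompt  # rollback otherwise
```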
## LLM Known Limitations

| Limitation | Mitigation |
|---|---|
| Math errors | Extract to a Python/calculator tool |
| String manipulation | Use Python string operations |
| Physical-world reasoning | Add chain-of-thought |
| Niche domains | Provide domain context explicitly |
| Rhyme detection | Largely unsolved: text models lack phonetic training |
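For the math-errors row, the fix is to have the model emit an arithmetic expression and evaluate it in code rather than in the model. A minimal safe evaluator (supporting `+ - * /` and unary minus; anything fancier is out of scope for this sketch) might look like:

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str) -> float:
    """Safely evaluate an arithmetic expression extracted by the LLM."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression: {expr!r}")
    return ev(ast.parse(expr, mode="eval"))
```

Walking the AST (instead of calling `eval`) keeps arbitrary code out of the tool path.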
## Gotchas
- AI search/assistant projects fail statistically more often than data extraction because hallucinations reach users directly
- Don't default to RAG when deterministic injection works for your category
- "Improving the system through training" is better business communication than "editing prompts"
- Always have a human fallback path for when the LLM fails
- Workflow decomposition adds latency but dramatically improves reliability
- Log storage is cheap - over-log rather than under-log
## See Also
- [[rag-pipeline]] - RAG approach for the general case
- [[prompt-engineering]] - Instruction distillation and prompt patterns
- [[agent-fundamentals]] - Agent abstraction concepts
- [[llmops]] - Monitoring and evaluation infrastructure
- [[agent-memory]] - Copilot and human-in-the-loop patterns