LLM API Integration

Practical guide to integrating with LLM provider APIs. Covers authentication, message structure, key parameters, streaming, and cost management across OpenAI, Anthropic, and other providers.

Key Facts

API keys must NEVER be hardcoded - use environment variables and .env files
LLM APIs are stateless - each call is independent, so the client must send the full conversation history every time
Token-based pricing: input tokens (cheaper) + output tokens (more expensive)
Streaming improves perceived latency by sending response chunks as they're generated
If a key is exposed, revoke immediately and generate a new one

API Key Management

# .env file (never commit to version control)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...

# Load in Python
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
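
A missing key otherwise tends to surface later as an opaque authentication error. A minimal sketch of failing fast at startup, assuming a hypothetical helper named `require_env` (an illustration, not part of any SDK):

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    `require_env` is a hypothetical helper for illustration, not an SDK function.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your environment or .env file")
    return value
```

After `load_dotenv()`, a call such as `api_key = require_env("OPENAI_API_KEY")` would then stop the program early with a readable message when the key is absent.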

OpenAI Chat Completions

Basic Call

from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from env

completion = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is a black hole?"}
    ]
)
print(completion.choices[0].message.content)

Message Roles

system: defines model purpose, persona, constraints. Set once at start.
user: human input (questions, tasks, content)
assistant: model responses OR few-shot examples for guiding style

Key Parameters

Parameter	Values	Effect
`temperature`	0-2 (default 1)	0 = near-deterministic, 2 = very random. Values above ~1.5 often produce incoherent output
`max_tokens`	integer	Cap on completion tokens. Too low = truncated output (`finish_reason` is `"length"`)
`n`	integer (default 1)	Number of response variants
`seed`	integer	Pseudo-reproducibility (not guaranteed)
`stream`	bool	Stream response as chunks
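
Because a low `max_tokens` cuts the reply off without raising an error, it can help to check why generation stopped. A minimal sketch, assuming the Chat Completions response shape in which each choice carries a `finish_reason` field (`"length"` when the cap was hit); `was_truncated` is a hypothetical helper name, and the `fake` object is a stand-in so the example runs offline:

```python
from types import SimpleNamespace


def was_truncated(completion) -> bool:
    """True if the first choice stopped because it hit the max_tokens cap."""
    return completion.choices[0].finish_reason == "length"


# Stand-in for a real response object, used here so the sketch runs offline.
fake = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(was_truncated(fake))  # prints True
```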

Streaming

completion = client.chat.completions.create(
    model="gpt-4",
    messages=messages,
    stream=True
)
for chunk in completion:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="")
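
Printing chunks as they arrive discards the text, so applications usually also need the assembled reply (for example, to append it to the conversation history). A sketch of collecting the deltas, assuming chunks shaped like the ones in the loop above; `collect_stream` and `fake_chunk` are hypothetical names used only for this illustration:

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Join the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks (e.g. usage-only) carry no choices
            continue
        content = chunk.choices[0].delta.content
        if content:  # skips None and empty deltas
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in mimicking the shape of a streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


stream = [fake_chunk("Black "), fake_chunk(None), fake_chunk("hole")]
print(collect_stream(stream))  # prints "Black hole"
```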

Few-Shot Example Pattern

messages = [
    {"role": "system", "content": "Classify tweet sentiment."},
    {"role": "user", "content": "This movie is extraordinary."},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "This album is alright."},
    {"role": "assistant", "content": "neutral"},
    {"role": "user", "content": "This new song blew my mind."}
]
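
When the examples live in a dataset rather than in hand-written code, the same alternating user/assistant structure can be generated. A small sketch, assuming examples are stored as `(input, label)` pairs; `build_few_shot` is a hypothetical helper name:

```python
def build_few_shot(system_prompt, examples, query):
    """Build a chat message list from (input, label) example pairs."""
    messages = [{"role": "system", "content": system_prompt}]
    for text, label in examples:
        # Each example becomes a user turn followed by the desired assistant reply.
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})  # the real question goes last
    return messages


messages = build_few_shot(
    "Classify tweet sentiment.",
    [("This movie is extraordinary.", "positive"), ("This album is alright.", "neutral")],
    "This new song blew my mind.",
)
```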

Anthropic Messages API

from anthropic import Anthropic
client = Anthropic()

message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    system="You are a helpful assistant.",
    messages=[{"role": "user", "content": "What is a black hole?"}]
)
print(message.content[0].text)
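
Note the structural differences from the OpenAI call: the system prompt is a top-level `system` argument rather than a message with the `system` role, and `max_tokens` is required. A sketch of converting an OpenAI-style message list, assuming plain string contents; `to_anthropic` is a hypothetical helper name:

```python
def to_anthropic(messages):
    """Split OpenAI-style messages into (system_text, messages) for Anthropic."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return "\n\n".join(system_parts), rest


openai_style = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a black hole?"},
]
system_text, chat_messages = to_anthropic(openai_style)
# These would then be passed as system=system_text, messages=chat_messages.
```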

Pricing Model

Cost = (input_tokens * input_price) + (output_tokens * output_price)

Example (GPT-4o):
  Input:  1000 tokens * $2.50/1M = $0.0025
  Output: 500 tokens  * $10.00/1M = $0.005
  Total: ~$0.0075 per request
  At 10,000 req/day: ~$75/day
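
The same arithmetic as a reusable function, as a rough sketch. The prices are the example figures above, not current list prices, and `estimate_cost` is a hypothetical helper name:

```python
def estimate_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Return the cost in dollars for one request; prices are per 1M tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


per_request = estimate_cost(1000, 500, 2.50, 10.00)
per_day = per_request * 10_000
print(f"${per_request:.4f} per request, ${per_day:.2f} per day")
# prints "$0.0075 per request, $75.00 per day"
```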

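Prompt caching (Anthropic, OpenAI): on Anthropic, writing context to the cache costs about a 25% premium over the base input price, and subsequent reads of the same context cost about 10% of it (~10x cheaper). OpenAI caches long prompt prefixes automatically and discounts cached input tokens without a write premium. Exact rates change, so check current pricing. Caching transforms the economics of repeated document analysis.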
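
To see where caching starts to pay off, the multipliers quoted above can be plugged into a quick comparison. This is a simplified sketch: it assumes a 1.25x write and a 0.10x read multiplier, counts only the shared context (not the per-query question or the output tokens), and ignores cache expiry. `relative_cost` is a hypothetical helper name:

```python
def relative_cost(n_queries, write_multiplier=1.25, read_multiplier=0.10):
    """Cost of n queries over one shared context, in units of one uncached read.

    Returns (uncached, cached). The multipliers are assumptions, not quoted prices.
    """
    if n_queries < 1:
        return 0.0, 0.0
    uncached = float(n_queries)
    # First query pays the write premium; every later query pays the read rate.
    cached = write_multiplier + read_multiplier * (n_queries - 1)
    return uncached, cached


for n in (1, 2, 10):
    uncached, cached = relative_cost(n)
    print(f"{n} queries: uncached={uncached:.2f}, cached={cached:.2f}")
```

Under these assumptions a single query is slightly more expensive with caching, and the cached path is already cheaper from the second query onward.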

LogProbs - Hallucination Detection

Request log probabilities per token to see model confidence:

- Color-code tokens by confidence: red = uncertain, green = confident
- When the model hallucinates, the first tokens often show high uncertainty
- When the model blindly trusts wrong input, it is confidently wrong, so high confidence does not imply correctness
- This is a development debugging tool, not a feature for end users
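
A log probability converts to a probability with `exp`, which makes thresholding straightforward. A minimal sketch, assuming the per-token values have already been extracted from the response into `(token, logprob)` pairs; the 0.5 threshold, the sample values, and the name `flag_uncertain` are all illustrative assumptions:

```python
import math


def flag_uncertain(tokens_with_logprobs, threshold=0.5):
    """Return (token, probability, is_uncertain) for each (token, logprob) pair.

    The 0.5 threshold is an arbitrary assumption; tune it per use case.
    """
    flagged = []
    for token, logprob in tokens_with_logprobs:
        prob = math.exp(logprob)  # logprobs are natural logs of the token probability
        flagged.append((token, prob, prob < threshold))
    return flagged


# Made-up values for illustration, not real model output.
sample = [("Paris", -0.01), ("1887", -1.6)]
for token, prob, uncertain in flag_uncertain(sample):
    print(f"{token!r}: p={prob:.2f} {'UNCERTAIN' if uncertain else 'ok'}")
```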

Conversation History Management

# Must send full history each time (stateless API)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "First question"},
    {"role": "assistant", "content": "First answer"},
    {"role": "user", "content": "Follow-up"}
]
# Track token count, trim when approaching limit
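
One simple trimming strategy is to keep the system prompt and drop the oldest turns first. A rough sketch, assuming string message contents and a crude estimate of about 4 characters per token (a real tokenizer such as `tiktoken` would be more accurate); `approx_tokens` and `trim_history` are hypothetical helper names:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate (about 4 characters per token for English text)."""
    return max(1, len(text) // 4)


def trim_history(messages, max_tokens):
    """Drop the oldest non-system messages until the estimate fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(approx_tokens(m["content"]) for m in msgs)

    # Always keep at least the most recent message, even if it exceeds the budget.
    while len(rest) > 1 and total(system + rest) > max_tokens:
        rest.pop(0)
    return system + rest
```

Dropping turns loses context, so longer-running conversations may prefer summarizing the removed messages instead; that choice is outside the scope of this sketch.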

Gotchas

Temperature 0 is not truly deterministic - small variations can occur
max_tokens too low truncates responses without raising an error - check finish_reason ("length" on OpenAI) or stop_reason ("max_tokens" on Anthropic)
Streaming responses require different parsing (delta.content vs message.content)
API rate limits vary by tier - implement exponential backoff
Model names change over time (GPT-4 -> GPT-4o -> GPT-4o-mini) - check current docs
Hidden token costs: system prompts and function schemas count as input tokens
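
The exponential backoff mentioned above can be sketched as a small retry wrapper. This is a generic illustration, not an SDK feature: `with_backoff` is a hypothetical helper, the delays and retry count are arbitrary defaults, and in real use `retry_on` would be narrowed to the provider's rate-limit exception (for example `openai.RateLimitError` in the OpenAI Python SDK) rather than every exception:

```python
import random
import time


def with_backoff(call, max_retries=5, base_delay=1.0, retry_on=(Exception,), sleep=time.sleep):
    """Call `call()` and retry with exponential backoff plus jitter on failure."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt; jitter spreads out simultaneous clients.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

A call site might look like `with_backoff(lambda: client.chat.completions.create(model="gpt-4", messages=messages))`. The `sleep` parameter exists only so the waiting can be replaced in tests.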