Fine-Tuning and LoRA¶

Fine-tuning adapts a pre-trained model to specific tasks, domains, or output styles. LoRA and PEFT methods make this practical on consumer hardware by training only a tiny fraction of parameters.

Key Facts¶

RAG for adding knowledge, fine-tuning for changing behavior/style - often combined in production
Before fine-tuning, establish baselines: zero-shot, few-shot, RAG performance
100 high-quality examples > 10,000 noisy examples (quality >> quantity)
LoRA trains 0.1-1% of total parameters, reducing GPU memory by 4-8x
QLoRA combines LoRA with 4-bit quantization: 7B model fine-tuning on ~6GB VRAM

When to Fine-Tune vs RAG¶

Approach	Best For	Not For
RAG	Domain knowledge, frequently updated data	Changing model behavior/style
Fine-tuning	Behavior, output format, domain adaptation	Real-time knowledge updates
Both	Complex production systems needing both

OpenAI Fine-Tuning¶

# 1. Prepare JSONL training data
# Each line: {"messages": [{"role": "system",...}, {"role": "user",...}, {"role": "assistant",...}]}

# 2. Upload training file
file = client.files.create(file=open("training.jsonl"), purpose="fine-tune")

# 3. Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3}
)

# 4. Use fine-tuned model
response = client.chat.completions.create(
    model="ft:gpt-4o-mini:my-org::abc123",
    messages=[...]
)

Data requirements: minimum 10 examples (50-100+ recommended), diverse, consistent format.

LoRA (Low-Rank Adaptation)¶

Full fine-tuning updates ALL parameters. For a 7B model, that's 7 billion weights requiring massive GPU memory. LoRA decomposes weight updates into two small matrices:

W_new = W_original + A * B

W_original: frozen (e.g., 4096 x 4096)
A: trainable (e.g., 4096 x 16) - rank=16
B: trainable (e.g., 16 x 4096)

Result: ~130K trainable parameters per layer instead of 16M. 99% fewer parameters.

Rank (r): controls expressiveness. Typical: 8, 16, 32, 64. Higher = more capacity, more memory.

LoRA with HuggingFace PEFT¶

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable: 4,194,304 || all: 8,030,261,248 || trainable%: 0.05

Target Modules¶

q_proj, v_proj (attention queries/values) - most common, good default
k_proj (attention keys) - added for more expressiveness
o_proj (attention output)
gate_proj, up_proj, down_proj (FFN) - for deeper adaptation

Training Configuration¶

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch"
)

QLoRA (Quantized LoRA)¶

Combines LoRA with quantization: 1. Quantize base model to 4-bit (NF4 format) 2. Add LoRA adapters in FP16 3. Train only LoRA parameters

Memory savings: 7B model goes from ~28GB (full) to ~6GB (QLoRA). Enables fine-tuning on consumer GPUs.

PEFT Methods Comparison¶

Method	Approach	Trainable Params
LoRA	Low-rank weight update decomposition	0.1-1%
QLoRA	LoRA + 4-bit quantization	0.1-1%
Prefix Tuning	Trainable prefix tokens per layer	Very small
Prompt Tuning	Trainable soft prompt vectors	Tiny
Adapter Layers	Small trainable layers between frozen layers	1-5%

Data Quality Guidelines¶

Each example should demonstrate the exact behavior you want
Remove duplicates, contradictions, low-quality samples
Hold out 10-20% as test set
Measure task-specific metrics (accuracy, BLEU, F1)
Compare against baseline to verify improvement
Check for overfitting (training metric improves but test doesn't)

Gotchas¶

Fine-tuning on small datasets risks overfitting - always validate on held-out set
Fine-tuned models inherit the base model's limitations (hallucination, reasoning failures)
LoRA adapters can be composed (merge multiple LoRA) but quality may degrade
Hyperparameter tuning (rank, learning rate, epochs) significantly affects results
Fine-tuned model quality degrades if training data format doesn't match inference format
Always measure: sometimes prompt engineering + RAG outperforms fine-tuning