Transfer Learning

Use knowledge from one task (source) to improve learning on another (target). The dominant paradigm in modern deep learning - almost never train from scratch.

Core Idea

Pre-trained models learn general features on large datasets. Lower layers capture broadly useful patterns (edges and textures in vision, surface-level word and syntax features in language); higher layers become increasingly task-specific.
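
To see that hierarchy concretely, you can list a ResNet-50's top-level blocks (a quick inspection sketch; no pre-trained weights are needed just to look at the structure):

import torchvision.models as models

model = models.resnet50()  # architecture only; the layer layout is the same with or without weights
for name, module in model.named_children():
    print(name, type(module).__name__)
# conv1/bn1/maxpool and layer1-layer2 hold the most generic features;
# layer3-layer4 grow increasingly task-specific; fc is the ImageNet classification head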

Strategies

Feature Extraction

Freeze all pre-trained layers. Replace and train only the final classification head.

import torchvision.models as models
import torch.nn as nn

num_classes = 10  # number of target classes (example value)
model = models.resnet50(pretrained=True)  # ImageNet weights; newer torchvision prefers weights=models.ResNet50_Weights.DEFAULT
for param in model.parameters():
    param.requires_grad = False  # freeze everything
model.fc = nn.Linear(model.fc.in_features, num_classes)  # replace head; the new layer is trainable by default
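
With the backbone frozen, only the new head needs to go to the optimizer. A minimal sketch (the learning rate is an illustrative choice):

import torch

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)  # only the replaced head is trained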

Use when: small dataset (< 1000 samples), target task similar to source.

Fine-Tuning

Unfreeze some layers and train with low learning rate.

import torch

# Unfreeze the last residual block
for param in model.layer4.parameters():
    param.requires_grad = True

# Lower LR for pre-trained layers, higher for the new head
optimizer = torch.optim.Adam([
    {'params': model.layer4.parameters(), 'lr': 1e-5},
    {'params': model.fc.parameters(), 'lr': 1e-3}
])

Use when: moderate dataset, target task somewhat different from source.

Full Fine-Tuning

Unfreeze all layers, very low learning rate throughout.
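
Continuing the ResNet example, a sketch with everything trainable and a single low learning rate (the value is illustrative):

for param in model.parameters():
    param.requires_grad = True  # unfreeze the whole network

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # low LR so pre-trained features are adjusted gently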

Use when: large dataset, task significantly different from source.

Vision Transfer Learning

Source: ImageNet (1.2M images, 1000 classes). Most torchvision models provide pre-trained weights.

Rule of thumb: smaller target dataset = freeze more layers; larger = fine-tune more.
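
One way to act on this rule of thumb is a small helper that freezes the first n top-level blocks of a backbone and leaves the rest trainable (a hypothetical utility, not part of torchvision):

def freeze_first_n_children(model, n):
    """Freeze the first n top-level child modules; leave the rest trainable."""
    for i, child in enumerate(model.children()):
        trainable = i >= n
        for param in child.parameters():
            param.requires_grad = trainable

freeze_first_n_children(model, n=7)  # for ResNet-50 this leaves layer4, avgpool, and fc trainable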

Standard normalization (ImageNet stats):

from torchvision import transforms
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
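
A hedged usage example, assuming an ImageFolder-style directory of target images (the 'data/train' path is a placeholder):

from torchvision import datasets

train_set = datasets.ImageFolder('data/train', transform=transform)  # applies the ImageNet preprocessing above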

NLP Transfer Learning

Source: large text corpora (Wikipedia, BooksCorpus, Common Crawl).

Pre-training objectives:

  • BERT: Masked Language Model + Next Sentence Prediction
  • GPT: Autoregressive language modeling
  • T5: text-to-text (all tasks framed as generation)
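
BERT's masked-LM objective can be probed directly with a fill-mask pipeline (a small illustration; the model is downloaded on first use):

from transformers import pipeline

fill_mask = pipeline('fill-mask', model='bert-base-uncased')
print(fill_mask('Transfer learning reduces [MASK] time.'))  # BERT scores candidate tokens for the mask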

Fine-tuning with HuggingFace:

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
args = TrainingArguments(
    output_dir='./results',
    learning_rate=2e-5,  # much lower than training from scratch
    num_train_epochs=3,
    per_device_train_batch_size=16
)
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
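
train_ds above is assumed to already be tokenized with the matching tokenizer. A minimal sketch of building it with the datasets library (the two example sentences are placeholders):

from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')  # must match the model checkpoint
raw = Dataset.from_dict({'text': ['great movie', 'terrible plot'], 'label': [1, 0]})
train_ds = raw.map(lambda ex: tokenizer(ex['text'], truncation=True, padding='max_length', max_length=128))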

When Transfer Learning Helps

Source/Target Similarity    Target Data Size    Strategy
Similar                     < 1K                Feature extraction
Similar                     1K-10K              Fine-tune last layers
Different                   > 10K               Full fine-tuning
Very different              < 1K                May not help; try anyway

Domain Adaptation

When source and target domains differ (e.g., photos -> medical images):

  • Gradual unfreezing: unfreeze one layer at a time from the top (sketched below)
  • Discriminative LR: lower LR for earlier layers
  • Data augmentation: bridge the domain gap
  • Domain-specific pre-training: pre-train on in-domain unlabeled data first
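
A sketch of gradual unfreezing for the ResNet example, opening up one more block each epoch from the top (block order, epoch count, and LR are illustrative):

blocks = [model.fc, model.layer4, model.layer3, model.layer2]  # head first, then top-down
for epoch, block in enumerate(blocks, start=1):
    for param in block.parameters():
        param.requires_grad = True  # one more block becomes trainable this epoch
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=1e-5)
    # ... run one training epoch here with the current optimizer ...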

Gotchas

  • Pre-trained model expects specific input format (size, normalization, tokenization)
  • Fine-tuning LR too high destroys pre-trained features ("catastrophic forgetting")
  • Always use the matching tokenizer for NLP models
  • Transfer from ImageNet may not help for non-natural images (medical, satellite)
  • For tabular data, transfer learning rarely helps - gradient boosting usually wins

See Also

  • [[neural-networks]] - foundation architectures
  • [[cnn-computer-vision]] - vision architectures for transfer
  • [[nlp-text-processing]] - BERT and transformer fine-tuning
  • [[model-evaluation]] - evaluating fine-tuned models