Model Evaluation and Validation

Choosing the right metric and validation strategy is as important as choosing the model. The wrong metric means optimizing for the wrong thing. The wrong validation scheme means overestimating performance.

Regression Metrics

| Metric | Formula | When to Use |
|---|---|---|
| MAE | mean(abs(actual - predicted)) | Intuitive, in original units |
| MAPE | mean(abs(actual - predicted) / abs(actual)) | Need percentage error |
| MSE | mean((actual - predicted)^2) | Penalize large errors more |
| RMSE | sqrt(MSE) | Same units as target, penalizes large errors |
| R^2 | 1 - SS_res/SS_tot | Fraction of variance explained |

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

Gotcha: MAPE is undefined when an actual value is zero (division by zero) and blows up when actual values are near zero. R^2 can be negative, which means the model fits worse than always predicting the mean.
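To make both gotchas concrete, here is a minimal pure-Python sketch of the formulas from the table. The helper names (`mae`, `mape`, `rmse`, `r2`) and the toy numbers are illustrative assumptions, not scikit-learn functions.

```python
import math

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    # Raises ZeroDivisionError if any actual value is exactly zero.
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

def r2(actual, predicted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [100.0, 200.0, 0.01]     # the last target is near zero
predicted = [110.0, 190.0, 1.01]  # absolute errors are 10, 10 and 1

print(mae(actual, predicted))   # stays modest, in the target's units
print(mape(actual, predicted))  # dominated by the near-zero target

# Predictions that run opposite to the trend are worse than the mean.
print(r2([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # -3.0
```

The single near-zero target contributes a relative error of about 100 (that is, 10,000%) to MAPE, even though its absolute error is the smallest of the three.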

Classification Metrics

Confusion Matrix

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | TN | FP |
| Actual 1 | FN | TP |

from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_true, y_pred)
print(classification_report(y_true, y_pred))

Key Metrics

| Metric | Formula | Optimize When |
|---|---|---|
| Precision | TP / (TP + FP) | FP costly (spam filter, fraud accusation) |
| Recall | TP / (TP + FN) | FN costly (disease detection, fraud detection) |
| F1 | 2PR / (P+R) | Balance precision/recall |
| Accuracy | (TP+TN) / total | Classes are balanced; avoid with imbalanced classes |
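The formulas above can be computed directly from the four confusion-matrix counts. The sketch below is a from-scratch illustration with hypothetical helper names (`confusion_counts`, `safe_div`, `binary_metrics`). Returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary 0/1 labels."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1:
            fn += 1
        elif p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def safe_div(num, den):
    # Convention for this sketch: report 0.0 when the denominator is 0.
    return num / den if den else 0.0

def binary_metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

# 99 negatives and 1 positive; the "model" predicts 0 for everything.
y_true = [0] * 99 + [1]
y_pred = [0] * 100
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

On this 99/1 split the all-zeros predictor scores 0.99 accuracy while precision, recall and F1 are all 0.0, which is why accuracy alone is misleading on imbalanced classes.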

ROC AUC

Threshold-independent metric. Evaluates ranking quality of model scores.

from sklearn.metrics import roc_auc_score, roc_curve

auc = roc_auc_score(y_true, scores)  # scores, not predictions!
fpr, tpr, thresholds = roc_curve(y_true, scores)

import matplotlib.pyplot as plt
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0,1], [0,1], 'k--')  # random baseline
plt.xlabel('FPR'); plt.ylabel('TPR')
plt.legend()

Interpretation: probability that a random positive has higher score than a random negative. 0.5 = random, 1.0 = perfect.
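The pairwise interpretation can be checked by brute force. This is a minimal sketch, not how scikit-learn computes AUC internally. `pairwise_auc` is a hypothetical helper that counts correctly ordered (positive, negative) pairs and gives ties half credit.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, scores))  # 0.75: 3 of the 4 pairs are ordered correctly
```

The double loop is quadratic in the number of samples, so it is only meant for building intuition on small inputs.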

Precision-Recall Curve

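Often more informative than ROC for highly imbalanced data, because precision shows how many flagged items are false alarms, while FPR is diluted by the large number of negatives: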

from sklearn.metrics import precision_recall_curve, average_precision_score

prec, rec, thresholds = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)
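Average precision summarizes the PR curve as the sum of precision values weighted by the gain in recall at each step. The sketch below is a simplified illustration that assumes all scores are distinct. `average_precision` is a hypothetical helper, not the library function.

```python
def average_precision(y_true, scores):
    """AP = sum over ranks of (recall gain) * (precision at that rank).

    Assumes distinct scores; tied scores would need grouping per threshold.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive")
    tp = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # Recall rises by 1 / n_pos here; precision at this rank is tp / rank.
            ap += (tp / rank) / n_pos
    return ap

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(average_precision(y_true, scores))  # (1/1 + 2/3) / 2, about 0.833
```

A useful reference point: a random ranking has an expected AP of roughly the positive rate, whereas the ROC AUC baseline is 0.5 regardless of class balance.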

Threshold Selection

Models output scores (probabilities). Converting to binary predictions requires choosing a threshold.

# Business context determines threshold
threshold = 0.3  # lower = higher recall, lower precision
y_pred = (scores >= threshold).astype(int)

A higher threshold yields fewer positive predictions. Recall can only fall, and precision usually (but not always) rises.
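A quick way to see the tradeoff is to sweep a few thresholds over the same scores. The data and the helper `precision_recall_at` below are made up for illustration.

```python
def precision_recall_at(y_true, scores, threshold):
    """Binarize scores at the threshold and return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.05, 0.20, 0.30, 0.45, 0.55, 0.70, 0.80, 0.95]

for threshold in (0.25, 0.50, 0.75):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

On this toy data recall falls steadily (1.00, 0.75, 0.25) as the threshold rises. Precision goes 0.67, 0.75, 0.50: the high-scoring negative at 0.80 shows that precision is not guaranteed to increase with the threshold.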

Ranking Metrics

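| Metric | Description |
|---|---|
| precision@k | Fraction of the top-k items that are relevant |
| mAP@k | Mean over queries of the average precision computed on each query's top-k results |
| nDCG@k | Position-weighted graded relevance: DCG@k = sum over i = 1..k of (2^y_i - 1) / log2(i + 1), divided by the DCG@k of the ideal ordering |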
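As a minimal sketch of the definitions in the table, assuming 1-based positions, a log base 2 discount, and that the ideal ordering is built from the same list of items. The function names are hypothetical.

```python
import math

def precision_at_k(relevance, k):
    """Share of the top-k ranked items that are relevant (relevance > 0)."""
    return sum(1 for r in relevance[:k] if r > 0) / k

def dcg_at_k(relevance, k):
    # Position i is 1-based, so the discount is log2(i + 1).
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(relevance[:k], start=1))

def ndcg_at_k(relevance, k):
    # Ideal ordering: the same items sorted by decreasing relevance.
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0

# Graded relevance of the results, in the order the model ranked them.
ranked_relevance = [3, 0, 2, 1]
print(precision_at_k(ranked_relevance, 2))  # 0.5: one of the top two is relevant
print(ndcg_at_k(ranked_relevance, 4))       # below 1.0 because of the misplaced 0
```

Sorting the same list by relevance gives an nDCG of exactly 1.0, which is the normalization that makes scores comparable across queries.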

Cross-Validation

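A single train/test split gives an estimate that depends on the random seed. K-fold cross-validation averages over k different splits, which gives a more robust estimate and a measure of its spread.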

from sklearn.model_selection import cross_val_score, StratifiedKFold

# Integer cv: 5 folds (stratified automatically when model is a classifier)
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"Mean AUC: {scores.mean():.3f} +/- {scores.std():.3f}")

# Stratified (preserves class ratio) - use for classification
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='f1')

# Time series: use TimeSeriesSplit (no future leakage)
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
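The "no future leakage" idea behind a time series split is that every test fold lies strictly after the data used for training. The sketch below illustrates that expanding-window scheme. `expanding_window_splits` is a hypothetical helper, and it may not reproduce scikit-learn's exact indices in every configuration.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) where each test fold follows its train fold."""
    fold = n_samples // (n_splits + 1)
    if fold == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(1, n_splits + 1):
        # The training window grows by one fold each round.
        train_end = n_samples - (n_splits - i + 1) * fold
        train_idx = list(range(0, train_end))
        test_idx = list(range(train_end, train_end + fold))
        yield train_idx, test_idx

for train_idx, test_idx in expanding_window_splits(n_samples=12, n_splits=3):
    print(train_idx, test_idx)
```

With 12 samples and 3 splits, the training window grows from indices 0-2 to 0-5 to 0-8, and each test fold is the next three indices. Rows must already be sorted by time for this to be meaningful.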

Train / Validation / Test Split

from sklearn.model_selection import train_test_split

# 60/20/20 split
train, test = train_test_split(df, test_size=0.2, random_state=42)
train, val = train_test_split(train, test_size=0.25, random_state=42)
# 0.25 of 0.8 = 0.2

  • Train: fit model
  • Validation: tune hyperparameters, select features
  • Test: final evaluation ONCE (never tune on test)
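The two-step call produces 60/20/20 because 0.25 × 0.8 = 0.2. An equivalent way to see the proportions is to shuffle the row indices once and cut them into three parts. This is a sketch with a hypothetical helper `three_way_split`, using only the standard library.

```python
import random

def three_way_split(n_samples, val_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle indices once, then cut them into train / validation / test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test = indices[:n_test]
    val = indices[n_test:n_test + n_val]
    train = indices[n_test + n_val:]
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
```

The three index sets are disjoint by construction, so no row can appear in both the training data and the final test set.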

Bias-Variance Tradeoff

  • High bias (underfitting): model too simple, misses patterns. Both train and val error high
  • High variance (overfitting): model too complex, memorizes noise. Train error low, val error high
  • Diagnosing: large gap between train and val performance = overfitting

Remedies for overfitting: regularization, a simpler model, more data, early stopping, dropout. Cross-validation does not reduce overfitting by itself, but it detects it and guides these choices.
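The train/validation gap can be illustrated with a tiny k-nearest-neighbors regressor, where a small k is the complex, high-variance model and k equal to the training set size is the simplest, high-bias one (it always predicts the training mean). Everything below is an assumption made for this sketch: the noisy sine data, the seed, and the helper names.

```python
import math
import random

def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k

def mse(xs, ys, train_x, train_y, k):
    return sum((knn_predict(train_x, train_y, x, k) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
n = 40

def noisy_sine(x):
    return math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)

train_x = [i / n for i in range(n)]
train_y = [noisy_sine(x) for x in train_x]
val_x = [(i + 0.5) / n for i in range(n)]
val_y = [noisy_sine(x) for x in val_x]

errors = {}
for k in (1, 5, n):
    errors[k] = (mse(train_x, train_y, train_x, train_y, k),
                 mse(val_x, val_y, train_x, train_y, k))
    print(f"k={k:>2}  train MSE={errors[k][0]:.3f}  val MSE={errors[k][1]:.3f}")
```

With k=1 the model memorizes the training set, so train error is exactly zero while validation error is not: the overfitting signature. With k=40 both errors are high: the underfitting signature. An intermediate k sits between the two extremes.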

Gotchas

  • Accuracy trap: with 99/1 class split, predicting all-zeros gives 99% accuracy
  • Data leakage: preprocessing fit on full dataset (including test) inflates metrics
  • Test set contamination: if you tune anything on test set, it's no longer unbiased
  • Class imbalance: use F1, PR-AUC, or ROC AUC instead of accuracy
  • Metric selection: optimize the metric that aligns with business goal, not the one that looks best
  • Overfitting to validation: heavy hyperparameter tuning can overfit to validation set too
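The data leakage gotcha is easiest to see with a scaling step. The sketch below uses made-up numbers and hypothetical helpers (`fit_standardizer`, `standardize`). It learns the mean and standard deviation from the training rows only, then reuses them on the test rows.

```python
import statistics

def fit_standardizer(values):
    """Learn mean and standard deviation from the given values."""
    return statistics.fmean(values), statistics.pstdev(values)

def standardize(values, mean, std):
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leakage-free: statistics come from train, then get reused on test.
mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)

# Leaky variant, shown only for contrast: test rows shift the statistics.
leaky_mean, leaky_std = fit_standardizer(train + test)

print(mean, leaky_mean)  # the leaky mean is pulled toward the test values
```

In scikit-learn, wrapping the preprocessing step and the model in a `Pipeline` and passing that pipeline to `cross_val_score` applies the same fit-on-train-only rule inside every fold.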

See Also