Model Evaluation and Validation¶

Choosing the right metric and validation strategy is as important as choosing the model. The wrong metric means optimizing for the wrong thing; the wrong validation scheme means overestimating performance.

Regression Metrics¶

Metric	Formula	When to Use
MAE	mean(|actual - predicted|)	Intuitive, in original units
MAPE	mean(|actual - predicted| / |actual|)	Need percentage error (multiply by 100 for percent)
MSE	mean((actual - predicted)^2)	Penalize large errors more
RMSE	sqrt(MSE)	Same units as target, penalizes large errors
R^2	1 - SS_res/SS_tot	Fraction of variance explained

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

Gotcha: MAPE is undefined when an actual value is zero (division by zero) and explodes when actual values are near zero. R^2 can be negative, which means the model is worse than always predicting the mean.
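Both gotchas are easy to reproduce. The following is a minimal NumPy sketch that computes the metrics straight from the formulas in the table; the helper name `regression_report` and all data values are made up for illustration.

```python
import numpy as np

def regression_report(actual, predicted):
    """Compute MAE, MAPE, RMSE and R^2 directly from their formulas."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)   # total sum of squares
    return {
        "mae": float(np.mean(np.abs(err))),
        "mape": float(np.mean(np.abs(err) / np.abs(actual))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "r2": float(1.0 - ss_res / ss_tot),
    }

# Hypothetical targets; the second one is close to zero.
good = regression_report([100.0, 0.01, 50.0], [98.0, 1.0, 51.0])
print(good)  # MAE stays small, MAPE is dominated by the near-zero target

# Predictions that are worse than always predicting the mean.
bad = regression_report([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(bad["r2"])  # negative
```

In the first case a single near-zero target contributes an absolute percentage error of about 99 (that is, 9900%), which swamps the other two observations.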

Classification Metrics¶

Confusion Matrix¶

	Predicted 0	Predicted 1
Actual 0	TN	FP
Actual 1	FN	TP

from sklearn.metrics import confusion_matrix, classification_report
cm = confusion_matrix(y_true, y_pred)
print(classification_report(y_true, y_pred))

Key Metrics¶

Metric	Formula	Optimize When
Precision	TP / (TP + FP)	FP costly (spam filter, fraud accusation)
Recall	TP / (TP + FN)	FN costly (disease detection, fraud detection)
F1	2PR / (P+R)	Balance precision/recall
Accuracy	(TP+TN) / total	Avoid with imbalanced classes
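As a rough illustration of why accuracy misleads on imbalanced data, the sketch below computes the four metrics from confusion-matrix counts. The function name `classification_metrics` and the counts are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical 99/1 dataset; the model predicts the negative class for everything.
all_negative = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# A model that catches 8 of the 10 positives at the cost of 20 false alarms.
useful = classification_metrics(tp=8, fp=20, fn=2, tn=970)

print(all_negative)
print(useful)
```

The all-negative model wins on accuracy while having zero recall and zero F1, so accuracy alone would pick the useless model.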

ROC AUC¶

Threshold-independent metric. Evaluates ranking quality of model scores.

from sklearn.metrics import roc_auc_score, roc_curve

auc = roc_auc_score(y_true, scores)  # pass scores (e.g. predict_proba(X)[:, 1]), not hard 0/1 predictions
fpr, tpr, thresholds = roc_curve(y_true, scores)

import matplotlib.pyplot as plt
plt.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
plt.plot([0,1], [0,1], 'k--')  # random baseline
plt.xlabel('FPR'); plt.ylabel('TPR')
plt.legend()

Interpretation: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. 0.5 = random, 1.0 = perfect.
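The pairwise interpretation can be checked directly. This is a small sketch, not how scikit-learn computes AUC internally: it compares every positive with every negative and counts ties as half a win. The function name `pairwise_auc` is an assumption for illustration, and the approach needs memory proportional to positives times negatives.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs. every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_value = pairwise_auc(y_true, scores)
print(auc_value)  # 3 of the 4 positive/negative pairs are ordered correctly
```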

Precision-Recall Curve¶

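More informative than ROC for highly imbalanced data, because precision is measured against the small set of predicted positives while FPR is diluted by the large negative class: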

from sklearn.metrics import precision_recall_curve, average_precision_score

prec, rec, thresholds = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)
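To show what average precision summarizes, here is a hand-rolled sketch that averages precision at the rank of each positive. It assumes no tied scores, and the name `average_precision` and the toy data are illustrative only. A useful reference point is the positive prevalence, which is roughly what a random ranking would achieve.

```python
import numpy as np

def average_precision(y_true, scores):
    """Mean of precision@rank, taken at the rank of each positive (no tied scores assumed)."""
    y_true = np.asarray(y_true)
    order = np.argsort(-np.asarray(scores, dtype=float))   # highest score first
    hits = y_true[order]
    precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_rank[hits == 1].mean())

y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5]

ap_value = average_precision(y_true, scores)   # (1/1 + 2/3) / 2
baseline = float(np.mean(y_true))              # positive prevalence
print(ap_value, baseline)
```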

Threshold Selection¶

Models output scores (probabilities). Converting to binary predictions requires choosing a threshold.

# Business context determines threshold
threshold = 0.3  # lower = higher recall, lower precision
y_pred = (scores >= threshold).astype(int)

Higher threshold = fewer but more confident predictions (higher precision, lower recall).
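The trade-off is easy to see on a toy example. The sketch below binarizes the same scores at two thresholds; the helper `precision_recall_at` and the data are hypothetical, and returning precision 1.0 when nothing is predicted positive is a convention chosen here.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall after binarizing scores at the given threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(scores) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 1.0   # convention: no positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.35, 0.4, 0.6, 0.7, 0.9]

low = precision_recall_at(y_true, scores, 0.3)     # catches every positive, with false alarms
high = precision_recall_at(y_true, scores, 0.65)   # no false alarms, misses one positive
print(low, high)
```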

Ranking Metrics¶

Metric	Description
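precision@k	Fraction of the top-k items that are relevant
mAP@k	Mean over queries of the average precision computed on the top-k results
nDCG@k	Position-weighted relevance: DCG@k = sum over i = 1..k of (2^y_i - 1) / log2(i + 1); nDCG@k = DCG@k / IDCG@k, where IDCG@k is the DCG@k of the ideal ordering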
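A minimal pure-Python sketch of precision@k and nDCG@k, assuming the exponential-gain, log2-discount form of DCG with positions starting at 1. The function names and relevance grades are made up for illustration.

```python
import math

def precision_at_k(relevant, k):
    """Fraction of the top-k items that are relevant (relevant: 0/1 flags in ranked order)."""
    return sum(relevant[:k]) / k

def dcg_at_k(gains, k):
    """DCG with gain 2^y - 1 and discount log2(position + 1); positions start at 1."""
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance, in the order the model ranked the items.
ranked_gains = [3, 2, 0, 1]
print(precision_at_k([1, 1, 0, 1], k=3))
print(ndcg_at_k(ranked_gains, k=4))   # below 1.0: the last two items are swapped
```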

Cross-Validation¶

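A single train/test split gives one estimate that depends on the random seed. K-fold cross-validation averages over k different splits, which gives a more stable estimate and a sense of its spread.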

from sklearn.model_selection import cross_val_score, StratifiedKFold

# cv=5: when model is a classifier, scikit-learn uses unshuffled StratifiedKFold here
cv_scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
print(f"Mean AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Explicit stratified folds with shuffling (preserves class ratio) - use for classification
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(model, X, y, cv=skf, scoring='f1')

# Time series: use TimeSeriesSplit (no future leakage); rows must be in time order
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
ts_scores = cross_val_score(model, X, y, cv=tscv)
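To see why `TimeSeriesSplit` avoids future leakage, it helps to print the fold indices. This is a small sketch on 12 synthetic, time-ordered rows: each fold trains on an expanding window of past observations and tests on the block that follows it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 synthetic observations, already in time order
tscv = TimeSeriesSplit(n_splits=3)

folds = [(train_idx.tolist(), test_idx.tolist())
         for train_idx, test_idx in tscv.split(X)]

for train_idx, test_idx in folds:
    # Every training index precedes every test index
    print(f"train={train_idx} test={test_idx}")
```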

Train / Validation / Test Split¶

from sklearn.model_selection import train_test_split

# 60/20/20 split
train, test = train_test_split(df, test_size=0.2, random_state=42)
train, val = train_test_split(train, test_size=0.25, random_state=42)
# 0.25 of 0.8 = 0.2

Train: fit model
Validation: tune hyperparameters, select features
Test: final evaluation ONCE (never tune on test)
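For classification it is often worth stratifying both splits so that each part keeps the class ratio. The following sketch uses synthetic arrays with a hypothetical 90/10 label distribution; the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)    # hypothetical 90/10 labels

# First carve out the 20% test set, then take 25% of the remainder for validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)

sizes = (len(X_train), len(X_val), len(X_test))
print(sizes)                                          # 60/20/20 of 1000 rows
print(y_train.mean(), y_val.mean(), y_test.mean())    # class ratio per part
```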

Bias-Variance Tradeoff¶

High bias (underfitting): model too simple, misses patterns. Both train and val error are high.
High variance (overfitting): model too complex, memorizes noise. Train error is low, val error is high.
Diagnosing: a large gap between train and val performance indicates overfitting; both errors high and close together indicates underfitting.

Solutions for overfitting: regularization, simpler model, more data, early stopping, dropout. Cross-validation does not reduce overfitting by itself, but it helps detect it.
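The diagnosis can be sketched on synthetic data by fitting polynomials of increasing degree to noisy samples of a sine wave and comparing train and validation error. Everything here (data generator, degrees, noise level) is an assumption chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a sine wave (synthetic, for illustration only)."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.3, n)
    return x, y

x_train, y_train = make_data(20)
x_val, y_val = make_data(500)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

results = {}
for degree in (1, 3, 12):
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_val, y_val))
    print(f"degree={degree:2d} train_mse={results[degree][0]:.3f} "
          f"val_mse={results[degree][1]:.3f}")
```

The linear fit should show both errors high (underfitting), while the degree-12 fit should show a low train error with a clearly higher validation error (overfitting).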

Gotchas¶

Accuracy trap: with 99/1 class split, predicting all-zeros gives 99% accuracy
Data leakage: preprocessing (scaling, imputation, feature selection) fit on the full dataset, including test rows, inflates metrics. Fit it on the training data only
Test set contamination: if you tune anything on test set, it's no longer unbiased
Class imbalance: use F1, PR-AUC, or ROC AUC instead of accuracy; prefer PR-AUC when positives are very rare
Metric selection: optimize the metric that aligns with business goal, not the one that looks best
Overfitting to validation: heavy hyperparameter tuning can overfit to validation set too
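One common way to avoid the preprocessing-leakage gotcha is to put the preprocessing inside a scikit-learn `Pipeline`, so that cross-validation refits it on each fold's training portion only. A minimal sketch on synthetic data; the dataset, the model choice, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data purely for illustration.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.9, 0.1], random_state=0)

# The scaler lives inside the pipeline, so each CV fold fits it on that fold's
# training portion only; the held-out portion never influences the scaling.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pr_auc = cross_val_score(pipeline, X, y, cv=cv, scoring="average_precision")
print(f"PR-AUC: {pr_auc.mean():.3f} +/- {pr_auc.std():.3f}")
```

Scaling `X` once before calling `cross_val_score` would instead let statistics from the held-out rows leak into training.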