Skip to content

Gradient Boosting and Tree-Based Models¶

Gradient boosting is the dominant algorithm for tabular data. CatBoost, LightGBM, and XGBoost are implementations of the same idea. Tree-based models handle mixed data types, non-linear relationships, and interactions without manual feature engineering.

Decision Trees¶

Base building block. Recursively split data on feature thresholds to minimize impurity.

Splitting criteria: - Gini impurity: sum(p_i * (1 - p_i)). Measures probability of misclassification - Entropy: -sum(p_i * log(p_i)). Information theory measure of disorder - MSE (regression): variance within each split

Pros: interpretable, handles mixed types, no scaling needed. Cons: high variance, prone to overfitting, unstable (small data changes -> different tree).

Random Forest¶

Ensemble of decision trees. Each tree trained on bootstrap sample with random feature subset.

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
rf.feature_importances_  # built-in importance scores

Bagging reduces variance by averaging decorrelated predictions
Feature importance: decrease in impurity when splitting on feature
Embarrassingly parallel - each tree trains independently

Gradient Boosting¶

Sequentially build trees, each correcting errors of previous ones.

Idea: fit tree to residuals (errors) of current model, add it with a small learning rate.

CatBoost¶

Best out-of-box performance. Handles categoricals natively.

from catboost import CatBoostRegressor, CatBoostClassifier

model = CatBoostRegressor(cat_features=['transmission', 'fuel_type'])
model.fit(
    train[features], train[target],
    eval_set=(val[features], val[target]),
    verbose=100
)

# Predictions
y_pred = model.predict(test[features])

# Classification: get probabilities
model = CatBoostClassifier(cat_features=cat_features)
model.fit(train[features], train[target],
          eval_set=(val[features], val[target]))
scores = model.predict_proba(test[features])[:, 1]

# Feature importance
model.get_feature_importance(prettified=True)

Validation set is mandatory: CatBoost trains iteratively. Without validation, it memorizes training data. Early stopping halts at validation error minimum.

CatBoost Cross-Validation¶

from catboost import cv, Pool

cv_data = cv(
    pool=Pool(X, y, cat_features=cat_features),
    params={'loss_function': 'Logloss', 'eval_metric': 'AUC'},
    fold_count=5, shuffle=True, stratified=True,
    partition_random_seed=42, verbose=False
)
best_iter = cv_data['test-AUC-mean'].idxmax()

Workflow: CV to find optimal iterations -> train final model on full train+val with that count -> evaluate on test.

XGBoost¶

import xgboost as xgb
model = xgb.XGBClassifier(
    n_estimators=500, max_depth=6, learning_rate=0.1,
    subsample=0.8, colsample_bytree=0.8,
    early_stopping_rounds=50, eval_metric='auc'
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=50)

LightGBM¶

Fastest for large datasets. Leaf-wise growth (vs level-wise).

import lightgbm as lgb
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, num_leaves=31)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)],
          callbacks=[lgb.early_stopping(50)])

Key Hyperparameters¶

Parameter	Effect	Typical Range
n_estimators / iterations	Number of trees	100-5000
learning_rate	Step size per tree	0.01-0.3
max_depth	Tree depth	3-10
subsample	Fraction of rows per tree	0.5-1.0
colsample_bytree	Fraction of features per tree	0.5-1.0
min_child_weight	Minimum samples in leaf	1-100
reg_lambda (L2)	L2 regularization	0-10
reg_alpha (L1)	L1 regularization	0-10

Rule of thumb: lower learning_rate + more estimators = better but slower. Start with defaults, tune learning_rate and max_depth first.

Handling Imbalanced Classes¶

# CatBoost
model = CatBoostClassifier(scale_pos_weight=neg_count/pos_count)

# XGBoost
model = xgb.XGBClassifier(scale_pos_weight=neg_count/pos_count)

# Or: tune threshold on validation set

Comparison¶

Aspect	CatBoost	XGBoost	LightGBM	Random Forest
Default performance	Best	Good	Good	Good
Categorical handling	Native	Manual encoding	Native	Manual encoding
Speed	Moderate	Moderate	Fastest	Fast (parallel)
Overfitting risk	Low	Moderate	Moderate	Low
GPU support	Yes	Yes	Yes	No

Gotchas¶

Don't use accuracy on imbalanced datasets - a model predicting all-zeros gets 80% accuracy with 80/20 split
CatBoost with no eval_set will overfit silently - always provide validation
Feature importance varies across runs and methods - use permutation importance for reliability
XGBoost requires manual one-hot encoding for categoricals (unless using enable_categorical=True)
Start simple (defaults) before hyperparameter tuning - tuning doesn't compensate for bad features

See Also¶

model evaluation - metrics for model comparison
feature engineering - less critical for trees but still valuable
linear models - simpler alternative, useful baseline
neural networks - when tabular models aren't enough