
Mathematics for Machine Learning

The essential math underlying ML algorithms. Focus on what you need to understand WHY algorithms work, not just how to call them.

Calculus for ML

Derivatives

f'(x) = lim(h->0) [f(x+h) - f(x)] / h. Rate of change. Slope of tangent line.

Key rules:

  • Power rule: (x^n)' = n*x^(n-1)
  • Chain rule: (f(g(x)))' = f'(g(x)) * g'(x)
  • Product rule: (fg)' = f'g + fg'
  • Exponential: (e^x)' = e^x
  • Natural log: (ln(x))' = 1/x
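
These rules can be sanity-checked with a finite difference; a minimal sketch (Python/numpy, example function chosen arbitrarily) verifying the chain rule on f(x) = e^(x^2):

```python
import numpy as np

def numerical_derivative(f, x, h=1e-5):
    # Central difference: more accurate than the one-sided limit definition
    return (f(x + h) - f(x - h)) / (2 * h)

# Chain rule example: f(x) = exp(x^2)  =>  f'(x) = 2x * exp(x^2)
f = lambda x: np.exp(x ** 2)
f_prime = lambda x: 2 * x * np.exp(x ** 2)

x = 0.7
print(numerical_derivative(f, x))  # ~2.285
print(f_prime(x))                  # ~2.285, matches the numerical estimate
```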

Partial Derivatives and Gradient

For f(x1, ..., xn): gradient = vector of partial derivatives.

nabla f = [df/dx1, df/dx2, ..., df/dxn]

Gradient points in direction of steepest ascent. Gradient descent moves opposite: w -= lr * nabla_f.
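A minimal sketch of this update rule (Python/numpy; the objective f(w) = ||w||^2, learning rate, and iteration count are made up for illustration):

```python
import numpy as np

def f(w):
    return np.sum(w ** 2)          # simple convex bowl

def grad_f(w):
    return 2 * w                   # nabla f = 2w

w = np.array([3.0, -2.0])
lr = 0.1
for _ in range(50):
    w = w - lr * grad_f(w)         # step opposite to the gradient
print(w)                           # close to the minimum at [0, 0]
```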

Critical Points and Optimization

  • Critical point: gradient = 0
  • Second derivative test: f''(x) > 0 -> minimum, f''(x) < 0 -> maximum
  • Hessian matrix: matrix of second partial derivatives. Its eigenvalues determine the nature of a critical point: all positive -> minimum, all negative -> maximum, mixed signs -> saddle point (see the sketch after this list)
  • Convex functions: any local minimum is global minimum. MSE is convex for linear regression
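
A small sketch of the eigenvalue test (numpy; the example f(x, y) = x^2 - y^2 and its Hessian are chosen for illustration):

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a critical point at (0, 0)
H = np.array([[2.0,  0.0],
              [0.0, -2.0]])        # Hessian of f (constant here)

eigvals = np.linalg.eigvalsh(H)
if np.all(eigvals > 0):
    print("local minimum")
elif np.all(eigvals < 0):
    print("local maximum")
else:
    print("saddle point")          # mixed signs -> saddle (this case)
```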

Backpropagation

Chain rule through computation graph. For layer output = sigma(Wx + b):

d(Loss)/dW = d(Loss)/d(output) * d(output)/d(Wx+b) * d(Wx+b)/dW

Computational graph + automatic differentiation make this tractable for deep networks.
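
A hand-rolled sketch of that chain-rule product for one sigmoid layer with a squared-error loss (numpy; shapes, loss, and variable names are illustrative, not any framework's API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4)); b = np.zeros(3)
x = rng.normal(size=4);      y = np.array([0.0, 1.0, 0.0])

# Forward pass
z = W @ x + b                      # pre-activation Wx + b
out = sigmoid(z)                   # layer output
loss = 0.5 * np.sum((out - y) ** 2)

# Backward pass: chain rule, applied right to left
dL_dout = out - y                  # d(Loss)/d(output)
dout_dz = out * (1 - out)          # sigmoid'(z)
dL_dz = dL_dout * dout_dz          # d(Loss)/d(Wx+b)
dL_dW = np.outer(dL_dz, x)         # d(Loss)/dW
dL_db = dL_dz                      # d(Loss)/db
print(dL_dW.shape, dL_db.shape)    # (3, 4) (3,)
```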

Taylor Series

f(x) = f(a) + f'(a)(x-a) + f''(a)(x-a)^2/2! + ...

Key expansions: e^x = 1 + x + x^2/2! + ..., ln(1+x) = x - x^2/2 + ...

In ML: understanding loss behavior near optima, Newton's method (second-order approximation).
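A quick numerical check of the ln(1+x) expansion above (numpy; truncating after three terms is an arbitrary choice):

```python
import numpy as np

x = 0.1
exact = np.log1p(x)                      # ln(1 + x)
taylor = x - x**2 / 2 + x**3 / 3         # first three terms of the expansion
print(exact, taylor)                     # ~0.09531 vs ~0.09533
```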

Optimization

Gradient Descent

w_(t+1) = w_t - alpha * nabla f(w_t)

  • Learning rate alpha: too large -> diverge, too small -> slow
  • Stochastic GD: single-sample gradient (noisy but fast)
  • Mini-batch GD: small batch (balance)
  • Momentum: v_t = beta*v_(t-1) + nabla f(w_t), then w_(t+1) = w_t - alpha*v_t. Dampens oscillations and accelerates through narrow valleys (see the sketch after this list)
  • Convergence: O(1/t) for convex with Lipschitz gradient
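
A heavy-ball momentum sketch on a narrow-valley quadratic (numpy; the objective, alpha, and beta are arbitrary illustrative choices):

```python
import numpy as np

def grad_f(w):
    # Gradient of f(w) = 0.5*(100*w1^2 + w2^2): an elongated, narrow valley
    return np.array([100.0 * w[0], w[1]])

w = np.array([1.0, 1.0])
v = np.zeros_like(w)
alpha, beta = 0.009, 0.9
for _ in range(200):
    v = beta * v + grad_f(w)       # velocity accumulates a running gradient
    w = w - alpha * v
print(w)                           # approaches the minimum at [0, 0]
```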

Newton's Method

w_new = w_old - H^(-1) * gradient. Uses second derivatives (Hessian). Converges quadratically near the optimum, but computing and inverting H is expensive.

Quasi-Newton (L-BFGS): approximate Hessian inverse cheaply.
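
A minimal Newton step on a quadratic, where a single step lands exactly at the minimum (numpy; A and b are made up):

```python
import numpy as np

# Quadratic f(w) = 0.5 w^T A w - b^T w, so grad = A w - b and Hessian = A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

w = np.zeros(2)
grad = A @ w - b
H = A
w = w - np.linalg.solve(H, grad)   # solve H d = grad rather than forming H^(-1)
print(w, A @ w - b)                # gradient is ~0: a quadratic is solved in one step
```

In practice, scipy.optimize.minimize with method='L-BFGS-B' is a common off-the-shelf quasi-Newton option.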

Lagrange Multipliers

Optimize f(x) subject to g(x) = 0: at the constrained optimum, nabla f = lambda * nabla g.

In ML: SVM optimization (maximize margin subject to constraints).
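
A small symbolic sketch of the stationarity conditions (sympy; the toy problem, maximize x + y on the unit circle, is made up):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x + y                          # objective
g = x**2 + y**2 - 1                # constraint g = 0

# Stationarity: grad f = lambda * grad g, plus the constraint itself
eqs = [sp.diff(f, x) - lam * sp.diff(g, x),
       sp.diff(f, y) - lam * sp.diff(g, y),
       g]
print(sp.solve(eqs, [x, y, lam]))  # (+-1/sqrt(2), +-1/sqrt(2), +-1/sqrt(2))
```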

Log-Sum-Exp Trick

Avoid numerical overflow in softmax/log-likelihood:

log(sum(exp(x_i))) = c + log(sum(exp(x_i - c)))  where c = max(x_i)
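
A direct translation of the identity (numpy; scipy.special.logsumexp implements the same trick):

```python
import numpy as np

def logsumexp(x):
    c = np.max(x)                            # shift by the max so exp() cannot overflow
    return c + np.log(np.sum(np.exp(x - c)))

x = np.array([1000.0, 1001.0, 1002.0])
print(logsumexp(x))                          # ~1002.41; naive np.log(np.sum(np.exp(x))) overflows to inf
```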

Integration in ML

  • Computing expected values and probabilities (area under PDF); see the Monte Carlo sketch after this list
  • Normalization constants for distributions
  • KL-divergence between distributions
  • Lebesgue integral: foundation of modern probability theory
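
When an expectation has no closed form, the integral E[f(X)] = integral of f(x)p(x) dx is often estimated by Monte Carlo sampling; a minimal sketch (numpy, toy example where the true answer is known):

```python
import numpy as np

# E[X^2] for X ~ N(0, 1) is exactly 1 (the variance)
rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)
estimate = np.mean(samples ** 2)   # sample average approximates the integral
print(estimate)                    # ~1.0
```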

Gotchas

  • Gradient descent on non-convex functions (neural nets) finds local minima or saddle points, not the global minimum - but in practice this is usually fine
  • Numerical stability matters: use log-space for products of probabilities
  • The Hessian has O(n^2) entries in the number of parameters (and inverting it is O(n^3)) - impractical for large models
  • Chain rule errors propagate - always verify gradients numerically
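
One way to verify gradients numerically, as the last point suggests (numpy; the MSE loss, sizes, and tolerance are illustrative):

```python
import numpy as np

def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)       # MSE for linear regression

def analytic_grad(w, X, y):
    return X.T @ (X @ w - y) / len(y)

def numerical_grad(w, X, y, h=1e-6):
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = h
        g[i] = (loss(w + e, X, y) - loss(w - e, X, y)) / (2 * h)  # central difference per coordinate
    return g

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(20, 3)), rng.normal(size=20), rng.normal(size=3)
diff = np.max(np.abs(analytic_grad(w, X, y) - numerical_grad(w, X, y)))
print(diff)                                      # should be tiny (~1e-9 or less)
```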

See Also