stat940F24
Regularization in Deep Learning
Introduction
Regularization is a fundamental concept in machine learning, and particularly in deep learning, where models with a large number of parameters are prone to overfitting. Overfitting occurs when a model learns the noise in the training data rather than the underlying distribution, leading to poor generalization on unseen data. Regularization techniques constrain the model's capacity, preventing overfitting and improving generalization. This chapter explores various regularization methods in detail, with mathematical formulations, intuitive explanations, and practical implementations.
Classical Regularization: Parameter Norm Penalties
L2 Parameter Regularization (Weight Decay)
Overview
L2 parameter regularization, commonly known as weight decay, is a technique for preventing overfitting in machine learning models by penalizing large weights; the penalty constrains the model's complexity.
The regularization term is given by:
[math]\displaystyle{ \mathcal{R}(w) = \frac{\lambda}{2} \|w\|_2^2 }[/math]
where:
- [math]\displaystyle{ \lambda }[/math] is the regularization strength (a hyperparameter),
- [math]\displaystyle{ w }[/math] represents the model weights,
- [math]\displaystyle{ \|w\|_2 }[/math] denotes the L2 norm of the weight vector.
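As a concrete illustration, the penalty and its gradient can be computed directly. The sketch below is a minimal NumPy example; the names lam and w are illustrative choices rather than anything defined elsewhere on this page.
<syntaxhighlight lang="python">
import numpy as np

def l2_penalty(w, lam):
    """Return R(w) = (lam / 2) * ||w||_2^2 and its gradient lam * w."""
    return 0.5 * lam * np.dot(w, w), lam * w

w = np.array([0.5, -1.2, 3.0])      # illustrative weight vector
penalty, grad = l2_penalty(w, lam=0.1)
</syntaxhighlight>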
Gradient of the Total Objective Function
The gradient of the total objective function, which includes both the loss and the regularization term, is given by:
[math]\displaystyle{ \nabla_w \mathcal{L}_{\text{total}}(w; X, y) = \lambda w + \nabla_w \mathcal{L}(w; X, y) }[/math]
The weight update rule with L2 regularization using gradient descent is:
[math]\displaystyle{ w := w - \eta (\lambda w + \nabla_w \mathcal{L}(w; X, y)) }[/math]
where [math]\displaystyle{ \eta }[/math] is the learning rate.
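A minimal sketch of this update rule, assuming an illustrative mean-squared-error loss so that the gradient of the loss can be written explicitly:
<syntaxhighlight lang="python">
import numpy as np

def loss_grad(w, X, y):
    """Gradient of an illustrative mean squared error loss 0.5 * ||X w - y||^2 / n."""
    return X.T @ (X @ w - y) / len(y)

def sgd_step_with_weight_decay(w, X, y, eta=0.1, lam=0.01):
    """One gradient-descent update: w := w - eta * (lam * w + grad_w L(w; X, y))."""
    return w - eta * (lam * w + loss_grad(w, X, y))

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
w = np.zeros(3)
for _ in range(100):
    w = sgd_step_with_weight_decay(w, X, y)
</syntaxhighlight>
In practice the same coupled update is usually obtained by passing a weight-decay coefficient directly to the optimizer, for example the weight_decay argument of torch.optim.SGD.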
Quadratic Approximation to the Objective Function
Consider a quadratic approximation to the objective function:
[math]\displaystyle{ \mathcal{L}(w) \approx \mathcal{L}(w^*) + \frac{1}{2} (w - w^*)^\top H (w - w^*) }[/math]
where:
- [math]\displaystyle{ w^* }[/math] is the optimum weight vector,
- [math]\displaystyle{ H }[/math] is the Hessian matrix of second derivatives.
The modified gradient equation becomes:
[math]\displaystyle{ \lambda w + H (w - w^*) = 0 }[/math]
Solving for [math]\displaystyle{ w }[/math], we get:
[math]\displaystyle{ w = (H + \lambda I)^{-1} H w^* }[/math]
where [math]\displaystyle{ I }[/math] is the identity matrix.
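This closed-form solution can be checked numerically. The sketch below constructs a small random symmetric positive-definite H purely for illustration and verifies that the modified gradient vanishes at the regularized solution.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + np.eye(5)          # random symmetric positive-definite Hessian
w_star = rng.normal(size=5)      # unregularized optimum
lam = 1.0

# w = (H + lam * I)^{-1} H w*
w_reg = np.linalg.solve(H + lam * np.eye(5), H @ w_star)

# The modified gradient lam * w + H (w - w*) should vanish at w_reg.
residual = lam * w_reg + H @ (w_reg - w_star)
assert np.allclose(residual, 0.0)
</syntaxhighlight>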
Eigenvalue Decomposition
Assume [math]\displaystyle{ H = Q \Lambda Q^\top }[/math] where [math]\displaystyle{ Q }[/math] is the orthogonal matrix of eigenvectors and [math]\displaystyle{ \Lambda }[/math] is the diagonal matrix of eigenvalues.
Then the weight vector can be expressed as:
[math]\displaystyle{ w = Q(\Lambda + \lambda I)^{-1} \Lambda Q^\top w^* }[/math]
The effect of weight decay is to rescale the coefficients of the eigenvectors. The [math]\displaystyle{ i }[/math]-th component is rescaled by a factor of [math]\displaystyle{ \frac{\lambda_i}{\lambda_i + \lambda} }[/math], where [math]\displaystyle{ \lambda_i }[/math] is the [math]\displaystyle{ i }[/math]-th eigenvalue.
- If [math]\displaystyle{ \lambda_i \gg \lambda }[/math], the eigenvalue dominates the penalty and the effect of regularization on that component is relatively small.
- If [math]\displaystyle{ \lambda_i \ll \lambda }[/math], the corresponding component is shrunk to nearly zero magnitude.
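The rescaling interpretation can be verified with a short sketch, again using an illustrative random positive-definite Hessian; it checks that the eigendecomposition form reproduces the closed-form solution and exposes the per-component shrinkage factors.
<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
H = A @ A.T + np.eye(5)          # illustrative symmetric positive-definite Hessian
w_star = rng.normal(size=5)
lam = 1.0

# H = Q diag(eigvals) Q^T; eigh is the appropriate routine for symmetric matrices.
eigvals, Q = np.linalg.eigh(H)

# Each eigen-component of w* is rescaled by lambda_i / (lambda_i + lam).
factors = eigvals / (eigvals + lam)
w_eig = Q @ (factors * (Q.T @ w_star))

# Agrees with the closed-form solution (H + lam * I)^{-1} H w*.
w_closed = np.linalg.solve(H + lam * np.eye(5), H @ w_star)
assert np.allclose(w_eig, w_closed)
</syntaxhighlight>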
Effective Number of Parameters
Directions along which the parameters contribute significantly to reducing the objective function are preserved. A small eigenvalue of the Hessian indicates that moving along the corresponding direction does not significantly increase the objective function, so weight decay is free to shrink those components.
The effective number of parameters can be defined as:
[math]\displaystyle{ \text{Effective Number of Parameters} = \sum_i \frac{\lambda_i}{\lambda_i + \lambda} }[/math]
As [math]\displaystyle{ \lambda }[/math] increases, the effective number of parameters decreases, which reduces the model's complexity.
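A quick numerical check of this behaviour, using a set of illustrative Hessian eigenvalues:
<syntaxhighlight lang="python">
import numpy as np

def effective_num_params(eigvals, lam):
    """Effective number of parameters: sum_i lambda_i / (lambda_i + lam)."""
    return float(np.sum(eigvals / (eigvals + lam)))

eigvals = np.array([25.0, 9.0, 1.0, 0.1, 0.01])   # illustrative Hessian eigenvalues
for lam in (0.0, 0.1, 1.0, 10.0):
    print(lam, effective_num_params(eigvals, lam))
# lam = 0 counts every direction fully (value equals the number of parameters);
# increasing lam shrinks the count toward zero.
</syntaxhighlight>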
(Figure placeholder: effect of weight decay on the Hessian eigenvalues and the effective number of parameters.)