stat940F24
Regularization in Deep Learning
Introduction
Regularization is a fundamental concept in machine learning, particularly in deep learning, where models with a high number of parameters are prone to overfitting. Overfitting occurs when a model learns the noise in the training data rather than the underlying distribution, leading to poor generalization on unseen data. Regularization techniques aim to constrain the model’s capacity, thus preventing overfitting and improving generalization. This chapter will explore various regularization methods in detail, complete with mathematical formulations, intuitive explanations, and practical implementations.
Classical Regularization: Parameter Norm Penalties
L2 Regularization (Weight Decay)
L2 Parameter Regularization (Weight Decay)
Overview
L2 parameter regularization, commonly known as weight decay, is a technique used to prevent overfitting in machine learning models by penalizing large weights. This penalty helps in constraining the model's complexity.
The regularization term is given by:
[math]\displaystyle{ \mathcal{R}(w) = \frac{\lambda}{2} \|w\|_2^2 }[/math]
where:
- [math]\displaystyle{ \lambda }[/math] is the regularization strength (a hyperparameter),
- [math]\displaystyle{ w }[/math] represents the model weights,
- [math]\displaystyle{ \|w\|_2 }[/math] denotes the L2 norm of the weight vector.
Gradient of the Total Objective Function
The gradient of the total objective function, which includes both the loss and the regularization term, is given by:
[math]\displaystyle{ \nabla_w \mathcal{L}_{\text{total}}(w; X, y) = \lambda w + \nabla_w \mathcal{L}(w; X, y) }[/math]
The weight update rule with L2 regularization using gradient descent is:
[math]\displaystyle{ w := w - \eta (\lambda w + \nabla_w \mathcal{L}(w; X, y)) }[/math]
where [math]\displaystyle{ \eta }[/math] is the learning rate.
Quadratic Approximation to the Objective Function
Consider a quadratic approximation to the objective function:
[math]\displaystyle{ \mathcal{L}(w) \approx \mathcal{L}(w^*) + \frac{1}{2} (w - w^*)^\top H (w - w^*) }[/math]
where:
- [math]\displaystyle{ w^* }[/math] is the optimum weight vector,
- [math]\displaystyle{ H }[/math] is the Hessian matrix of second derivatives.
The modified gradient equation becomes:
[math]\displaystyle{ \lambda w + H (w - w^*) = 0 }[/math]
Solving for [math]\displaystyle{ w }[/math], we get:
[math]\displaystyle{ w = (H + \lambda I)^{-1} H w^* }[/math]
where [math]\displaystyle{ I }[/math] is the identity matrix.
Eigenvalue Decomposition
Assume [math]\displaystyle{ H = Q \Lambda Q^\top }[/math] where [math]\displaystyle{ Q }[/math] is the orthogonal matrix of eigenvectors and [math]\displaystyle{ \Lambda }[/math] is the diagonal matrix of eigenvalues.
Then the weight vector can be expressed as:
[math]\displaystyle{ w = Q(\Lambda + \lambda I)^{-1} \Lambda Q^\top w^* }[/math]
The effect of weight decay is to rescale the coefficients of the eigenvectors. The [math]\displaystyle{ i }[/math]-th component is rescaled by a factor of [math]\displaystyle{ \frac{\lambda_i}{\lambda_i + \lambda} }[/math], where [math]\displaystyle{ \lambda_i }[/math] is the [math]\displaystyle{ i }[/math]-th eigenvalue.
- If [math]\displaystyle{ \lambda_i \gt \lambda }[/math], the effect of regularization is relatively small.
- Components with [math]\displaystyle{ \lambda_i \lt \lambda }[/math] will be shrunk to have nearly zero magnitude.
Effective Number of Parameters
Directions along which the parameters contribute significantly to reducing the objective function are preserved. A small eigenvalue of the Hessian indicates that movement in this direction will not significantly increase the gradient.
The effective number of parameters can be defined as:
[math]\displaystyle{ \text{Effective Number of Parameters} = \sum_i \frac{\lambda_i}{\lambda_i + \lambda} }[/math]
As [math]\displaystyle{ \lambda }[/math] increases, the effective number of parameters decreases, which reduces the model's complexity.
(Placeholder for Image) (Include an image illustrating the effect of weight decay on the eigenvalues and the effective number of parameters)
Dataset Augmentation
Overview
Dataset augmentation is a technique used to improve the generalization ability of machine learning models by artificially increasing the size of the training dataset. This is particularly useful when the amount of available data is limited. The idea is to create new, synthetic data by applying various transformations to the original dataset.
- Key Idea: The best way to make a machine learning model generalize better is to train it on more data. When the amount of available data is limited, creating synthetic data (e.g., by applying transformations like rotation, translation, and noise addition) and adding it to the training set can be effective.
- Practical Example: Operations like translating training images a few pixels in each direction can greatly improve generalization. Another approach is to train neural networks with random noise applied to their inputs, which also serves as a form of dataset augmentation. This technique can be applied not only to the input layer but also to hidden layers, effectively performing dataset augmentation at multiple levels of abstraction.
---
Noise Injection
Overview
Noise injection is a regularization strategy that can be applied in two main ways:
1. Adding Noise to the Input: This method can be interpreted as a form of dataset augmentation and also has a direct connection to traditional regularization methods. 2. Adding Noise to the Weights: This method is primarily used in the context of recurrent neural networks and can be viewed as a stochastic implementation of Bayesian inference over the weights.
Mathematical Proof for Injecting Noise at the Input
Consider a regression setting where we have an input-output pair \( (x, y) \) and the goal is to minimize the expected loss function:
[math]\displaystyle{ J = \mathbb{E}_{x, y} \left[(f(x) - y)^2\right] }[/math]
Now, suppose we inject noise into the input \( x \), where the noise \( \epsilon \) is drawn from a distribution with mean zero (e.g., Gaussian noise \( \epsilon \sim \mathcal{N}(0, \sigma^2) \)). The modified objective function with noise-injected inputs becomes:
[math]\displaystyle{ J_{\text{noise}} = \mathbb{E}_{x, y, \epsilon} \left[(f(x + \epsilon) - y)^2\right] }[/math]
To understand the effect of noise injection, we can expand the function \( f(x + \epsilon) \) around \( x \) using a Taylor series:
[math]\displaystyle{ f(x + \epsilon) = f(x) + \epsilon^\top \nabla_x f(x) + \frac{1}{2} \epsilon^\top \nabla_x^2 f(x) \epsilon + \mathcal{O}(\|\epsilon\|^3) }[/math]
Since the expectation of the noise \( \epsilon \) is zero:
[math]\displaystyle{ \mathbb{E}[\epsilon] = 0 }[/math]
and assuming that the noise is isotropic with covariance matrix \( \sigma^2 I \), the expectation of the second-order term becomes:
[math]\displaystyle{ \mathbb{E}[\epsilon \epsilon^\top] = \sigma^2 I }[/math]
Substituting the Taylor expansion into the objective function:
[math]\displaystyle{ J_{\text{noise}} = \mathbb{E}_{x, y} \left[(f(x) - y)^2\right] + \frac{\sigma^2}{2} \mathbb{E}_{x, y} \left[\|\nabla_x f(x)\|^2\right] + \mathcal{O}(\sigma^4) }[/math]
This shows that the objective function with noise injection is equivalent to the original objective function plus a regularization term that penalizes large gradients of the function \( f(x) \). Specifically, the added term [math]\displaystyle{ \frac{\sigma^2}{2} \mathbb{E}_{x, y} \left[\|\nabla_x f(x)\|^2\right] }[/math] reduces the sensitivity of the network's output to small variations in its input.
Key Result:
For small noise variance \( \sigma^2 \), the minimization of the loss function with noise-injected input is equivalent to minimizing the original loss function with an additional regularization term that penalizes large gradients:
[math]\displaystyle{ J_{\text{noise}} \approx J + \frac{\sigma^2}{2} \mathbb{E}_{x, y} \left[\|\nabla_x f(x)\|^2\right] }[/math]
This regularization term effectively reduces the sensitivity of the output with respect to small changes in the input \( x \), which is beneficial in avoiding overfitting.
Connection to Weight Decay:
In linear models, where \( f(x) = w^\top x \), the gradient \( \nabla_x f(x) \) is simply the weight vector \( w \). Therefore, the regularization term becomes:
[math]\displaystyle{ \frac{\sigma^2}{2} \|w\|^2 }[/math]
which is equivalent to L2 regularization or weight decay.