== Regularization in Deep Learning ==

=== Introduction ===

Regularization is a fundamental concept in machine learning, particularly in deep learning, where models with a large number of parameters are prone to overfitting. Overfitting occurs when a model learns the noise in the training data rather than the underlying distribution, leading to poor generalization on unseen data. Regularization techniques aim to constrain the model's capacity, thus preventing overfitting and improving generalization. This chapter explores various regularization methods in detail, complete with mathematical formulations, intuitive explanations, and practical implementations.

=== Classical Regularization: Parameter Norm Penalties ===
==== L2 Regularization (Weight Decay) ====

'''Mathematical Formulation:'''
L2 parameter regularization, commonly known as weight decay, prevents overfitting by penalizing large weights, thereby constraining the model's complexity. It adds a penalty term to the loss function proportional to the sum of the squared weights:

<math>\mathcal{L}_{\text{total}} = \mathcal{L} + \frac{\lambda}{2} \|w\|_2^2 = \mathcal{L} + \frac{\lambda}{2} \sum_{i=1}^{N} w_i^2</math>

where:
* <math>\mathcal{L}</math> is the original loss function (e.g., mean squared error for regression or cross-entropy for classification),
* <math>\lambda</math> is the regularization strength (a hyperparameter),
* <math>w_i</math> represents the model weights,
* <math>N</math> is the total number of weights,
* <math>\|w\|_2</math> denotes the L2 norm of the weight vector.

'''Intuition:'''
The L2 penalty discourages large weights, encouraging the model to distribute weight more evenly across all features. This leads to simpler models that are less likely to overfit the training data.

'''Implementation:'''
L2 regularization is typically applied during optimization. The gradient of the total objective function, which includes both the loss and the regularization term, is

<math>\nabla_w \mathcal{L}_{\text{total}}(w; X, y) = \nabla_w \mathcal{L}(w; X, y) + \lambda w</math>

so the weight update rule under stochastic gradient descent (SGD) becomes

<math>w \leftarrow w - \eta \left( \nabla_w \mathcal{L}(w; X, y) + \lambda w \right)</math>

where <math>\eta</math> is the learning rate.
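A minimal sketch of this update in plain NumPy (the quadratic loss and the random data are illustrative stand-ins, not part of the original notes):

<syntaxhighlight lang="python">
import numpy as np

def sgd_step_l2(w, grad_loss, lr=0.1, lam=0.01):
    """One SGD step with L2 regularization (weight decay).

    w         : current weight vector
    grad_loss : gradient of the unregularized loss at w
    lr        : learning rate (eta)
    lam       : regularization strength (lambda)
    """
    # Total gradient = loss gradient + lambda * w
    return w - lr * (grad_loss + lam * w)

# Toy example: L2-regularized (ridge) linear regression
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
w = np.zeros(5)
for _ in range(200):
    grad = X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
    w = sgd_step_l2(w, grad, lr=0.1, lam=0.5)
</syntaxhighlight>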
''(Placeholder for Image)''
(Include an image illustrating the effect of L2 regularization on the weight distribution and the loss function)

'''Quadratic Approximation to the Objective Function:'''
Near the optimum, consider a quadratic approximation to the objective function:

<math>\mathcal{L}(w) \approx \mathcal{L}(w^*) + \frac{1}{2} (w - w^*)^\top H (w - w^*)</math>

where:
* <math>w^*</math> is the optimum weight vector,
* <math>H</math> is the Hessian matrix of second derivatives, evaluated at <math>w^*</math>.

Under this approximation, setting the gradient of the regularized objective to zero gives:

<math>\lambda w + H (w - w^*) = 0</math>

Solving for <math>w</math>, we get:

<math>w = (H + \lambda I)^{-1} H w^*</math>

where <math>I</math> is the identity matrix.

'''Eigenvalue Decomposition:'''
Assume <math>H = Q \Lambda Q^\top</math>, where <math>Q</math> is the orthogonal matrix of eigenvectors and <math>\Lambda</math> is the diagonal matrix of eigenvalues. Then the weight vector can be expressed as:

<math>w = Q (\Lambda + \lambda I)^{-1} \Lambda Q^\top w^*</math>

The effect of weight decay is therefore to rescale the coefficients of <math>w^*</math> along the eigenvectors: the <math>i</math>-th component is rescaled by a factor of <math>\frac{\lambda_i}{\lambda_i + \lambda}</math>, where <math>\lambda_i</math> is the <math>i</math>-th eigenvalue.
* If <math>\lambda_i > \lambda</math>, the effect of regularization is relatively small.
* Components with <math>\lambda_i < \lambda</math> are shrunk to nearly zero magnitude.

'''Effective Number of Parameters:'''
Directions along which the parameters contribute significantly to reducing the objective function are preserved, while a small eigenvalue of the Hessian indicates that movement along that direction does not significantly change the objective. The effective number of parameters can accordingly be defined as:

<math>\text{Effective Number of Parameters} = \sum_i \frac{\lambda_i}{\lambda_i + \lambda}</math>

As <math>\lambda</math> increases, the effective number of parameters decreases, which reduces the model's complexity.

''(Placeholder for Image)''
(Include an image illustrating the effect of weight decay on the eigenvalues and the effective number of parameters)
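The rescaling and the effective-parameter count can be checked numerically. A small sketch, assuming a synthetic positive-definite Hessian (the true Hessian of a deep network is rarely available in closed form):

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)
lam = 1.0                                  # regularization strength lambda

# Synthetic positive-definite Hessian H and unregularized optimum w*
A = rng.normal(size=(4, 4))
H = A @ A.T + 0.1 * np.eye(4)
w_star = rng.normal(size=4)

# Regularized solution: w = (H + lam I)^{-1} H w*
w = np.linalg.solve(H + lam * np.eye(4), H @ w_star)

# Same result via the eigendecomposition H = Q Lambda Q^T
eigvals, Q = np.linalg.eigh(H)
w_eig = Q @ np.diag(eigvals / (eigvals + lam)) @ Q.T @ w_star
assert np.allclose(w, w_eig)

# Effective number of parameters: sum_i lambda_i / (lambda_i + lam)
eff = np.sum(eigvals / (eigvals + lam))
print(f"effective parameters: {eff:.2f} out of {len(eigvals)}")
</syntaxhighlight>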
==== L1 Regularization ====

'''Mathematical Formulation:'''
L1 regularization adds a penalty proportional to the sum of the absolute values of the weights. The modified loss function is:

<math>\mathcal{L}_{\text{total}} = \mathcal{L} + \lambda \sum_{i=1}^{N} |w_i|</math>

'''Intuition:'''
L1 regularization induces sparsity in the model parameters, driving some weights to exactly zero. This is particularly useful for feature selection, since irrelevant features are effectively eliminated from the model.

'''Implementation:'''
The weight update rule for L1 regularization using SGD is:

<math>w_i \leftarrow w_i - \eta \frac{\partial \mathcal{L}}{\partial w_i} - \eta \lambda \, \text{sign}(w_i)</math>

where <math>\text{sign}(w_i)</math> is the sign of the weight.

''(Placeholder for Image)''
(Include an image showing how L1 regularization leads to sparse weight distributions)
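A minimal sketch of the subgradient update above (plain NumPy; the function name is illustrative):

<syntaxhighlight lang="python">
import numpy as np

def sgd_step_l1(w, grad_loss, lr=0.1, lam=0.01):
    """One SGD step with an L1 penalty (subgradient update)."""
    # np.sign(0) = 0, the conventional subgradient choice at w_i = 0
    return w - lr * grad_loss - lr * lam * np.sign(w)
</syntaxhighlight>

In practice, proximal methods (soft-thresholding) are often preferred over this plain subgradient step, since they drive weights exactly to zero rather than oscillating around it.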
=== Dataset Augmentation ===
 
==== Overview of Dataset Augmentation ====
 
'''Mathematical Formulation:''' 
Let <math>x</math> be an input image, and <math>f(x)</math> the corresponding label. Dataset augmentation involves generating new data points <math>\tilde{x}_i</math> through transformations <math>T_i</math>, such that:
 
<math>\tilde{x}_i = T_i(x), \quad f(\tilde{x}_i) = f(x)</math>
 
where <math>T_i</math> could be any transformation such as rotation, translation, scaling, or adding noise. The augmented dataset <math>\tilde{X}</math> is then:
 
<math>\tilde{X} = \{T_1(x), T_2(x), \dots, T_n(x)\} \text{ for each } x \in X</math>
 
'''Intuition:''' 
Augmentation increases the diversity of the training data, making the model more robust to variations in the input data. It essentially allows the model to "see" more data during training without actually collecting more data.
 
'''Implementation:''' 
Common transformations include:
* '''Rotation:''' Rotating the image by a small angle.
* '''Translation:''' Shifting the image by a few pixels in any direction.
* '''Scaling:''' Resizing the image while maintaining aspect ratio.
* '''Flipping:''' Horizontally or vertically flipping the image.
* '''Adding Noise:''' Injecting Gaussian noise into the image.
 
''(Placeholder for Image)'' 
(Include an image showing different types of dataset augmentation applied to the same image)
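A minimal sketch of a few such label-preserving transformations for a grayscale image stored as a 2-D NumPy array (the image and parameters are illustrative):

<syntaxhighlight lang="python">
import numpy as np

def augment(x, rng):
    """Return a randomly transformed copy of image x (an H x W array)."""
    t = rng.integers(3)
    if t == 0:                                # horizontal flip
        return x[:, ::-1]
    if t == 1:                                # translation by up to 2 pixels
        return np.roll(x, shift=rng.integers(-2, 3), axis=1)
    return x + rng.normal(0, 0.05, x.shape)   # additive Gaussian noise

rng = np.random.default_rng(0)
x = rng.random((28, 28))                      # stand-in for one image
augmented = [augment(x, rng) for _ in range(4)]   # all share x's label
</syntaxhighlight>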
 
=== Noise Injection ===
 
==== Injecting Noise at the Input Level ====
 
'''Mathematical Formulation:''' 
Let <math>x</math> be an input feature vector, and <math>\epsilon</math> be noise drawn from a Gaussian distribution <math>\mathcal{N}(0, \sigma^2)</math>. The noisy input <math>\tilde{x}</math> is:
 
<math>\tilde{x} = x + \epsilon</math>
 
where <math>\epsilon \sim \mathcal{N}(0, \sigma^2)</math>.
 
'''Intuition:''' 
Adding noise to the input data forces the model to learn to be robust to small perturbations, which improves its generalization ability. This technique can also help in reducing overfitting by preventing the model from relying too heavily on specific features.
 
'''Implementation:''' 
Noise can be injected at various stages of the model:
* '''Input Noise:''' Directly added to the input features.
* '''Weight Noise:''' Added to the weights of the model during training.
* '''Activation Noise:''' Injected into the activation outputs of hidden layers.
 
''(Placeholder for Image)'' 
(Include an image demonstrating the effect of noise injection on a simple linear model)
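A minimal sketch of input-level noise injection (plain NumPy; weight and activation noise follow the same pattern, applied to the weights or to hidden activations instead):

<syntaxhighlight lang="python">
import numpy as np

def noisy_inputs(X, sigma, rng):
    """Add i.i.d. Gaussian noise N(0, sigma^2) to each input feature."""
    return X + rng.normal(0.0, sigma, size=X.shape)

rng = np.random.default_rng(0)
X = rng.random((32, 10))                        # one mini-batch of inputs
X_tilde = noisy_inputs(X, sigma=0.1, rng=rng)   # fed to the model instead of X
</syntaxhighlight>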
 
=== Manifold Tangent Classifier ===
 
==== Concept and Mathematics ====
 
'''Mathematical Formulation:''' 
The idea behind manifold learning is that high-dimensional data often lie on a low-dimensional manifold. A tangent vector to the manifold at a point <math>x</math> is given by the derivative of the function defining the manifold:
 
<math>\mathbf{t} = \frac{\partial f}{\partial x} \Bigg|_{x=x_0}</math>
 
The goal is to learn a classifier that is invariant to movements along the tangent directions of the manifold.
 
'''Intuition:''' 
By learning the tangent directions to the manifold, the classifier can better generalize to variations in the data that lie within the manifold, while being more sensitive to variations that lie outside the manifold (which are more likely to be noise).


'''Implementation:'''
Manifold tangent classifiers typically require the computation of Jacobians or other differential operators to capture the local geometry of the data.

''(Placeholder for Image)''
(Include an image showing a data manifold with tangent vectors at various points)
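A sketch of such an invariance penalty using finite differences (assuming a classifier <code>f</code> and a tangent vector <code>t</code> at input <code>x</code>; a real implementation would use automatic differentiation for the Jacobian-vector product):

<syntaxhighlight lang="python">
import numpy as np

def tangent_penalty(f, x, t, eps=1e-3):
    """Penalize the change of f along the tangent direction t at x.

    Approximates the directional derivative of f at x along t by finite
    differences; adding its squared norm to the loss encourages the
    classifier to be invariant to movements along the manifold.
    """
    jvp = (f(x + eps * t) - f(x)) / eps   # directional derivative of f
    return float(np.sum(jvp ** 2))
</syntaxhighlight>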


=== Early Stopping as a Form of Regularization ===

==== Early Stopping Mechanism ====

'''Mathematical Formulation:'''
Early stopping involves monitoring the validation loss <math>\mathcal{L}_{\text{val}}</math> during training. The training process is halted when

<math>\frac{\partial \mathcal{L}_{\text{val}}}{\partial t} > 0</math>

where <math>t</math> is the training epoch. A sustained increase in validation loss indicates that the model has started to overfit the training data.

'''Intuition:'''
Early stopping prevents the model from learning too much noise from the training data. By stopping at the optimal point, the model retains the best generalization performance on unseen data.

'''Implementation:'''
* '''Monitor Validation Loss:''' Keep track of the validation loss after each epoch.
* '''Patience:''' Define a patience parameter: the number of epochs to wait for an improvement before stopping (see the sketch below).

''(Placeholder for Image)''
(Include an image showing training and validation loss curves with the point of early stopping indicated)
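A minimal, framework-agnostic sketch of patience-based early stopping (<code>train_epoch</code> and <code>val_loss</code> are hypothetical callbacks standing in for a real training loop):

<syntaxhighlight lang="python">
def train_with_early_stopping(train_epoch, val_loss,
                              max_epochs=100, patience=5):
    """Stop once validation loss fails to improve for `patience` epochs."""
    best, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()                        # one pass over the training set
        loss = val_loss()                    # current validation loss
        if loss < best:
            best, best_epoch = loss, epoch   # improvement: reset the clock
        elif epoch - best_epoch >= patience:
            break                            # patience exhausted: stop
    return best_epoch                        # epoch of best generalization
</syntaxhighlight>

In practice the model weights are also checkpointed at the best epoch and restored after stopping.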


=== Parameter Tying and Parameter Sharing ===

==== Concepts and Applications ====

'''Mathematical Formulation:'''
Parameter tying involves using the same set of parameters across different parts of a model. In a convolutional neural network (CNN), for example, the same filter is applied across different regions of the input image. Let <math>W</math> be a shared weight matrix; the convolution operation is:

<math>f(x) = W \ast x</math>

where <math>\ast</math> denotes the convolution operation.

'''Intuition:'''
Parameter tying reduces the number of parameters in the model, making it less prone to overfitting. It also allows the model to detect features in a translation-invariant way, since the same parameters are used across different locations in the input.

''(Placeholder for Image)''
(Include an image showing shared weights in a CNN)
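A small sketch of weight sharing in a one-dimensional convolution (plain NumPy): the same three-tap filter slides across every position, so the layer has three parameters regardless of the input length.

<syntaxhighlight lang="python">
import numpy as np

W = np.array([0.25, 0.5, 0.25])        # one shared filter (3 parameters)
x = np.arange(10, dtype=float)         # input signal of length 10

# f(x) = W * x : the same weights are reused at every input position
f_x = np.convolve(x, W, mode="valid")  # 8 outputs, still only 3 weights
</syntaxhighlight>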
