
Regularization in Deep Learning

Introduction

Regularization is a fundamental concept in machine learning, particularly in deep learning, where models with a large number of parameters are prone to overfitting. Overfitting occurs when a model learns the noise in the training data rather than the underlying distribution, leading to poor generalization on unseen data. Regularization techniques constrain the model’s effective capacity, preventing overfitting and improving generalization. This section explores various regularization methods in detail, with mathematical formulations, intuitive explanations, and practical implementations.

Classical Regularization: Parameter Norm Penalties

L2 Regularization (Weight Decay)

Mathematical Formulation: L2 regularization adds a penalty term to the loss function, which is proportional to the sum of the squared weights in the model. The loss function with L2 regularization can be written as:

[math]\displaystyle{ \mathcal{L}_{\text{total}} = \mathcal{L} + \frac{\lambda}{2} \sum_{i=1}^{N} w_i^2 }[/math]

where:

  • [math]\displaystyle{ \mathcal{L} }[/math] is the original loss function (e.g., mean squared error for regression or cross-entropy for classification),
  • [math]\displaystyle{ \lambda }[/math] is the regularization strength (a hyperparameter),
  • [math]\displaystyle{ w_i }[/math] represents the model weights,
  • [math]\displaystyle{ N }[/math] is the total number of weights.

Intuition: The L2 regularization term penalizes large weights, encouraging the model to distribute weights more evenly across all features. This leads to simpler models that are less likely to overfit the training data.

Implementation: L2 regularization is typically implemented during the optimization process. For example, when using stochastic gradient descent (SGD), the weight update rule with L2 regularization becomes:

[math]\displaystyle{ w_i \leftarrow w_i - \eta \frac{\partial \mathcal{L}}{\partial w_i} - \eta \lambda w_i }[/math]

where [math]\displaystyle{ \eta }[/math] is the learning rate.
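
As a concrete illustration, the following minimal PyTorch sketch applies this update rule through the weight_decay argument of SGD, which plays the role of [math]\displaystyle{ \lambda }[/math] above. The linear model and random data are toy stand-ins, not part of any particular experiment.

<pre>
import torch
import torch.nn as nn

# Toy model and data, used only to illustrate the update rule
model = nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)

# weight_decay adds the eta * lambda * w_i term to every SGD step,
# i.e. the gradient of the penalty (lambda / 2) * sum(w_i^2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()  # w_i <- w_i - eta * dL/dw_i - eta * lambda * w_i
</pre>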

(Placeholder for Image) (Include an image illustrating the effect of L2 regularization on the weight distribution and the loss function)

L1 Regularization

Mathematical Formulation: L1 regularization adds a penalty proportional to the sum of the absolute values of the weights. The modified loss function is:

[math]\displaystyle{ \mathcal{L}_{\text{total}} = \mathcal{L} + \lambda \sum_{i=1}^{N} |w_i| }[/math]

Intuition: L1 regularization leads to sparsity in the model parameters, meaning that it drives some weights to exactly zero. This can be particularly useful in feature selection, where irrelevant features are eliminated from the model.

Implementation: The weight update rule for L1 regularization using SGD is:

[math]\displaystyle{ w_i \leftarrow w_i - \eta \frac{\partial \mathcal{L}}{\partial w_i} - \eta \lambda \text{sign}(w_i) }[/math]

where [math]\displaystyle{ \text{sign}(w_i) }[/math] is the sign of the weight.
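
Since standard optimizers do not expose an L1 analogue of weight_decay, the penalty is usually added to the loss directly, letting automatic differentiation supply the [math]\displaystyle{ \text{sign}(w_i) }[/math] term. A minimal PyTorch sketch with a toy model and random data:

<pre>
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x, y = torch.randn(64, 10), torch.randn(64, 1)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
lam = 1e-3  # regularization strength lambda

for epoch in range(100):
    optimizer.zero_grad()
    # lambda * sum |w_i|; its (sub)gradient is lambda * sign(w_i)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    loss = loss_fn(model(x), y) + lam * l1_penalty
    loss.backward()
    optimizer.step()
</pre>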

(Placeholder for Image) (Include an image showing how L1 regularization leads to sparse weight distributions)

Dataset Augmentation

Overview of Dataset Augmentation

Mathematical Formulation: Let [math]\displaystyle{ x }[/math] be an input image, and [math]\displaystyle{ f(x) }[/math] the corresponding label. Dataset augmentation involves generating new data points [math]\displaystyle{ \tilde{x}_i }[/math] through transformations [math]\displaystyle{ T_i }[/math], such that:

[math]\displaystyle{ \tilde{x}_i = T_i(x), \quad f(\tilde{x}_i) = f(x) }[/math]

where [math]\displaystyle{ T_i }[/math] could be any transformation such as rotation, translation, scaling, or adding noise. The augmented dataset [math]\displaystyle{ \tilde{X} }[/math] is then:

[math]\displaystyle{ \tilde{X} = \{T_1(x), T_2(x), \dots, T_n(x)\} \text{ for each } x \in X }[/math]

Intuition: Augmentation increases the diversity of the training data, making the model more robust to variations in the input data. It essentially allows the model to "see" more data during training without actually collecting more data.

Implementation: Common transformations include the following (a torchvision-based sketch follows the list):

  • Rotation: Rotating the image by a small angle.
  • Translation: Shifting the image by a few pixels in any direction.
  • Scaling: Resizing the image while maintaining aspect ratio.
  • Flipping: Horizontally or vertically flipping the image.
  • Adding Noise: Injecting Gaussian noise into the image.
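
A sketch of such a pipeline using torchvision; the specific parameter values (angles, shift fractions, noise scale) are illustrative choices, and pil_image stands for an image loaded elsewhere:

<pre>
import torch
from torchvision import transforms

# Each transform leaves the label f(x) unchanged
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),                        # rotation
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),     # translation
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),          # scaling
    transforms.RandomHorizontalFlip(p=0.5),                       # flipping
    transforms.ToTensor(),
    transforms.Lambda(lambda t: t + 0.05 * torch.randn_like(t)),  # Gaussian noise
])

# augmented = augment(pil_image)  # pil_image: a PIL image, assumed loaded elsewhere
</pre>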

(Placeholder for Image) (Include an image showing different types of dataset augmentation applied to the same image)

Noise Injection

Injecting Noise at the Input Level

Mathematical Formulation: Let [math]\displaystyle{ x }[/math] be an input feature vector, and [math]\displaystyle{ \epsilon }[/math] be noise drawn from a Gaussian distribution [math]\displaystyle{ \mathcal{N}(0, \sigma^2) }[/math]. The noisy input [math]\displaystyle{ \tilde{x} }[/math] is:

[math]\displaystyle{ \tilde{x} = x + \epsilon }[/math]

where [math]\displaystyle{ \epsilon \sim \mathcal{N}(0, \sigma^2) }[/math].

Intuition: Adding noise to the input data forces the model to learn to be robust to small perturbations, which improves its generalization ability. This technique can also help in reducing overfitting by preventing the model from relying too heavily on specific features.
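
The input-noise case can be written in a couple of lines of PyTorch; the batch below is random toy data, and in practice the noise would be added only during training:

<pre>
import torch

sigma = 0.1                  # noise standard deviation
x = torch.randn(32, 10)      # a batch of input feature vectors (toy data)

# x_tilde = x + epsilon, with epsilon ~ N(0, sigma^2)
x_tilde = x + sigma * torch.randn_like(x)
</pre>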

Implementation: Noise can be injected at various stages of the model:

  • Input Noise: Directly added to the input features.
  • Weight Noise: Added to the weights of the model during training.
  • Activation Noise: Injected into the activation outputs of hidden layers.

(Placeholder for Image) (Include an image demonstrating the effect of noise injection on a simple linear model)

Manifold Tangent Classifier

Concept and Mathematics

Mathematical Formulation: The idea behind manifold learning is that high-dimensional data often lie on a low-dimensional manifold. If the manifold is locally parameterized by a smooth map [math]\displaystyle{ g(\theta) }[/math], a tangent vector at the point [math]\displaystyle{ x_0 = g(\theta_0) }[/math] is given by the derivative of the parameterization:

[math]\displaystyle{ \mathbf{t} = \frac{\partial g}{\partial \theta} \Bigg|_{\theta=\theta_0} }[/math]

The goal is to learn a classifier that is invariant to movements along the tangent directions of the manifold.

Intuition: By learning the tangent directions to the manifold, the classifier can better generalize to variations in the data that lie within the manifold, while being more sensitive to variations that lie outside the manifold (which are more likely to be noise).

Implementation: Manifold tangent classifiers typically require the computation of Jacobians or other differential operators to capture the local geometry of the data.
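
A minimal PyTorch sketch of a tangent-propagation style penalty, assuming a tangent vector t at a point x is already available (in practice it would be estimated from known transformations or a learned model of the data manifold):

<pre>
import torch
import torch.nn as nn
from torch.autograd.functional import jvp

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 3))
x = torch.randn(10)          # one input point (toy data)
t = torch.randn(10)          # assumed tangent vector at x (illustrative)
t = t / t.norm()

# Jacobian-vector product J_f(x) t: how the classifier output changes
# along the tangent direction; penalizing its norm encourages invariance.
_, directional_deriv = jvp(lambda inp: model(inp), (x,), (t,), create_graph=True)
tangent_penalty = directional_deriv.pow(2).sum()
# total_loss = task_loss + mu * tangent_penalty   (mu: penalty weight)
</pre>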

(Placeholder for Image) (Include an image showing a data manifold with tangent vectors at various points)

Early Stopping as a Form of Regularization

Early Stopping Mechanism

Mathematical Formulation: Early stopping involves monitoring the validation loss [math]\displaystyle{ \mathcal{L}_{\text{val}} }[/math] during training. The training process is halted when:

[math]\displaystyle{ \frac{\partial \mathcal{L}_{\text{val}}}{\partial t} \gt 0 }[/math]

where [math]\displaystyle{ t }[/math] is the training epoch. The condition indicates that the model has started to overfit to the training data, as the validation loss is increasing.

Intuition: Early stopping prevents the model from learning too much noise from the training data. By stopping at the optimal point, the model retains the best generalization performance on unseen data.

Implementation (a schematic training-loop sketch follows the list):

  • Monitor Validation Loss: Keep track of the validation loss after each epoch.
  • Patience: Define a patience parameter, which is the number of epochs to wait before stopping if no improvement is seen.
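
A schematic Python sketch of the patience-based loop; train_one_epoch and evaluate are hypothetical helpers, and model, train_loader, and val_loader are assumed to exist:

<pre>
import copy

max_epochs = 100
patience = 5                       # epochs to wait for an improvement
best_val_loss = float("inf")
epochs_without_improvement = 0

for epoch in range(max_epochs):
    train_one_epoch(model, train_loader)      # hypothetical helper
    val_loss = evaluate(model, val_loader)    # hypothetical helper

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())  # keep best weights
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                              # validation loss stopped improving

model.load_state_dict(best_state)              # restore the best model
</pre>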

(Placeholder for Image) (Include an image showing training and validation loss curves with the point of early stopping indicated)

Parameter Tying and Parameter Sharing

Concepts and Applications

Mathematical Formulation: Parameter tying involves using the same set of parameters across different parts of the model. In a convolutional neural network (CNN), for example, the same filter is applied across different regions of the input image. Letting [math]\displaystyle{ W }[/math] be the shared weight matrix (the filter), the convolution operation is:

[math]\displaystyle{ f(x) = W \ast x }[/math]

where [math]\displaystyle{ \ast }[/math] denotes the convolution operation.

Intuition: Parameter tying reduces the number of parameters in the model, making it less prone to overfitting. It also allows the model to detect the same features wherever they appear in the input (translation equivariance), since the same parameters are applied at every location.
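
A small PyTorch sketch of weight sharing in a convolutional layer, using a toy single-channel image: a single 3x3 filter is reused at every spatial position.

<pre>
import torch
import torch.nn as nn

# One 3x3 filter W shared across all spatial locations: f(x) = W * x
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 1, 28, 28)    # toy image batch
y = conv(x)                      # the same 9 weights are applied at every position

print(conv.weight.shape)         # torch.Size([1, 1, 3, 3]) -- only 9 parameters
</pre>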

(Placeholder for Image) (Include an image showing shared weights in a CNN)