<div style="border: 2px solid #0073e6; background-color: #f0f8ff; padding: 10px; margin: 10px 0; border-radius: 5px;"> | |||
== Exercise 7.8 == | |||
'''Level:''' * (Easy) | |||
'''Exercise Types:''' Novel | |||
==== Question ==== | |||
Answer whether the following statements about Convolutional Neural Networks (CNNs) are <b>True</b> or <b>False</b>. <br> | |||
1.CNNs are designed to work only with grayscale images.<br> | |||
2.Pooling layers in CNNs help reduce the spatial dimensions of the input while preserving important features.<br> | |||
3.Fully connected layers in a CNN help maintain spatial relationships between pixels. | |||
==== Solution ==== | |||
1.False – CNNs can process images with multiple channels, such as RGB images (three channels) or even hyperspectral images with many channels. | |||
2.True – Pooling operations, such as max pooling, downsample the feature maps, making the network more efficient while retaining essential information. | |||
3.False – Fully connected layers discard spatial relationships since they treat the input as a single vector, unlike convolutional layers that preserve spatial structure. | |||
</div> | |||
<div style="border: 2px solid #0073e6; background-color: #f0f8ff; padding: 10px; margin: 10px 0; border-radius: 5px;"> | |||
== Exercise 7.9 == | |||
'''Level:''' * (Easy) | |||
'''Exercise Types:''' Novel | |||
==== Question ==== | |||
In the context of Convolutional Neural Networks (CNNs), explain the following concepts and their significance in image-based deep learning tasks: | |||
(a) How do convolutional layers reduce the number of parameters compared to fully connected layers, and why is this advantageous? | |||
(b) What role do pooling layers (e.g., max-pooling) play in achieving translation invariance? | |||
(c) How does the hierarchical structure of CNNs (e.g., stacking convolutional and pooling layers) enable the learning of complex features from raw pixel data? | |||
==== Solution ==== | |||
(a) Convolutional layers apply small filters (kernels) that slide across the input image, computing dot products locally. These filters are shared across all spatial positions (e.g., a 3×3 kernel uses the same weights for every patch of the image). | |||
(b)Pooling downsamples feature maps by aggregating local regions (e.g., 2×2 windows). Max-pooling selects the maximum activation in each window. It reduces overfitting by compressing spatial dimensions and prioritizes the presence of features over their exact location. | |||
(c) Early Layers: Detect low-level features (edges, corners, colors). | |||
Example: A 3×3 filter might activate for horizontal edges. | |||
Middle Layers: Combine edges into textures or shapes (e.g., circles, stripes). | |||
Example: A filter might respond to "eye-like" patterns. | |||
Deep Layers: Assemble shapes into high-level semantic features (e.g., faces, objects). | |||
</div> | |||
<div style="border: 2px solid #0073e6; background-color: #f0f8ff; padding: 10px; margin: 10px 0; border-radius: 5px;"> | <div style="border: 2px solid #0073e6; background-color: #f0f8ff; padding: 10px; margin: 10px 0; border-radius: 5px;"> | ||
Latest revision as of 00:28, 3 February 2025
Notes on Exercises
Exercises are numbered using a two-part system, where the first number represents the lecture number and the second number represents the exercise number. For example:
- 1.1 refers to the first exercise in Lecture 1.
- 2.3 refers to the third exercise in Lecture 2.
Students are encouraged to complete these exercises as they follow the lecture content to deepen their understanding.
Each exercise you contribute should fall into one of the following categories:
- Novel: Preferred – an original exercise created by you.
- Modified: Valued – an exercise adapted or significantly altered from an existing source.
- Copied: Permissible – an exercise reproduced exactly as it appears in the source.
References should give the source (e.g., a book or other resource, with a URL for webpages), chapter, and page number.
Exercise 1.1
Level: ** (Moderate)
Exercise Types: Novel
Question
Prove that the Perceptron Learning Algorithm converges in a finite number of steps if the dataset is linearly separable.
Hint: Assume that the dataset [math]\displaystyle{ \{(\mathbf{x}_i, y_i)\}_{i=1}^N }[/math] is linearly separable, where [math]\displaystyle{ \mathbf{x}_i \in \mathbb{R}^d }[/math] are the input vectors, and [math]\displaystyle{ y_i \in \{-1, 1\} }[/math] are their corresponding labels. Show that there exists a weight vector [math]\displaystyle{ \mathbf{w}^* }[/math] and a bias [math]\displaystyle{ b^* }[/math] such that [math]\displaystyle{ y_i (\mathbf{w}^* \cdot \mathbf{x}_i + b^*) \gt 0 }[/math] for all [math]\displaystyle{ i }[/math], and use this assumption to bound the number of updates made by the algorithm.
Solution
Step 1: Linear Separability Assumption
If the dataset is linearly separable, there exists a weight vector [math]\displaystyle{ \mathbf{w}^* }[/math] and a bias [math]\displaystyle{ b^* }[/math] such that: [math]\displaystyle{ y_i (\mathbf{w}^* \cdot \mathbf{x}_i + b^*) \gt 0 \quad \forall i = 1, 2, \dots, N. }[/math] Without loss of generality, let [math]\displaystyle{ \| \mathbf{w}^* \| = 1 }[/math] (normalize [math]\displaystyle{ \mathbf{w}^* }[/math]).
Step 2: Perceptron Update Rule
The Perceptron algorithm updates the weight vector [math]\displaystyle{ \mathbf{w} }[/math] and bias [math]\displaystyle{ b }[/math] as follows:
- Initialize [math]\displaystyle{ \mathbf{w}_0 = 0 }[/math] and [math]\displaystyle{ b_0 = 0 }[/math].
- For each misclassified point [math]\displaystyle{ (\mathbf{x}_i, y_i) }[/math], update:
[math]\displaystyle{ \mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i, \quad b \leftarrow b + y_i. }[/math]
Define the margin [math]\displaystyle{ \gamma }[/math] of the dataset as: [math]\displaystyle{ \gamma = \min_{i} \frac{y_i (\mathbf{w}^* \cdot \mathbf{x}_i + b^*)}{\| \mathbf{x}_i \|}. }[/math] Since the dataset is linearly separable, [math]\displaystyle{ \gamma \gt 0 }[/math].
Step 3: Bounding the Number of Updates
Let [math]\displaystyle{ \mathbf{w}_t }[/math] be the weight vector after the [math]\displaystyle{ t }[/math]-th update. Define: [math]\displaystyle{ M = \max_i \| \mathbf{x}_i \|^2, }[/math] the maximum squared norm of any input vector.
Growth of [math]\displaystyle{ \| \mathbf{w}_t \|^2 }[/math]
After [math]\displaystyle{ t }[/math] updates, the norm of [math]\displaystyle{ \mathbf{w}_t }[/math] satisfies: [math]\displaystyle{ \| \mathbf{w}_{t+1} \|^2 = \| \mathbf{w}_t + y_i \mathbf{x}_i \|^2 = \| \mathbf{w}_t \|^2 + 2 y_i (\mathbf{w}_t \cdot \mathbf{x}_i) + \| \mathbf{x}_i \|^2. }[/math] Since the point is misclassified, [math]\displaystyle{ y_i (\mathbf{w}_t \cdot \mathbf{x}_i) \lt 0 }[/math]. Thus: [math]\displaystyle{ \| \mathbf{w}_{t+1} \|^2 \leq \| \mathbf{w}_t \|^2 + \| \mathbf{x}_i \|^2 \leq \| \mathbf{w}_t \|^2 + M. }[/math] By induction, after [math]\displaystyle{ t }[/math] updates: [math]\displaystyle{ \| \mathbf{w}_t \|^2 \leq tM. }[/math]
Lower Bound on [math]\displaystyle{ \mathbf{w}_t \cdot \mathbf{w}^* }[/math]
Each update increases [math]\displaystyle{ \mathbf{w}_t \cdot \mathbf{w}^* }[/math] by at least [math]\displaystyle{ \gamma }[/math]: [math]\displaystyle{ \mathbf{w}_{t+1} \cdot \mathbf{w}^* = (\mathbf{w}_t + y_i \mathbf{x}_i) \cdot \mathbf{w}^* = \mathbf{w}_t \cdot \mathbf{w}^* + y_i (\mathbf{x}_i \cdot \mathbf{w}^*). }[/math] Since [math]\displaystyle{ y_i (\mathbf{x}_i \cdot \mathbf{w}^*) \geq \gamma }[/math], we have: [math]\displaystyle{ \mathbf{w}_{t+1} \cdot \mathbf{w}^* \geq \mathbf{w}_t \cdot \mathbf{w}^* + \gamma. }[/math] By induction: [math]\displaystyle{ \mathbf{w}_t \cdot \mathbf{w}^* \geq t \gamma. }[/math]
Combining the Results
The Cauchy-Schwarz inequality gives: [math]\displaystyle{ \mathbf{w}_t \cdot \mathbf{w}^* \leq \| \mathbf{w}_t \| \| \mathbf{w}^* \| = \| \mathbf{w}_t \|. }[/math] Thus: [math]\displaystyle{ t \gamma \leq \| \mathbf{w}_t \| \leq \sqrt{tM}. }[/math] Squaring both sides: [math]\displaystyle{ t^2 \gamma^2 \leq tM. }[/math] Dividing through by [math]\displaystyle{ t }[/math] (assuming [math]\displaystyle{ t \gt 0 }[/math]): [math]\displaystyle{ t \leq \frac{M}{\gamma^2}. }[/math]
Step 4: Conclusion
The Perceptron Learning Algorithm converges after at most [math]\displaystyle{ \frac{M}{\gamma^2} }[/math] updates, which is finite. This proves that the algorithm terminates when the dataset is linearly separable.
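As an optional illustration of the update rule analyzed above, here is a minimal NumPy sketch of the Perceptron Learning Algorithm; the toy dataset and the function name perceptron_train are assumptions made for demonstration only.
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    # X: (N, d) inputs; y: (N,) labels in {-1, +1}
    w = np.zeros(X.shape[1])
    b = 0.0
    updates = 0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # w <- w + y_i x_i
                b += yi                  # b <- b + y_i
                updates += 1
                errors += 1
        if errors == 0:                  # no mistakes in a full pass: converged
            break
    return w, b, updates

# Linearly separable toy data (assumed for illustration)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron_train(X, y))   # the update count is bounded by M / gamma^2
On this data the algorithm stops after a handful of updates, consistent with the bound.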
Exercise 1.2
Level: * (Easy)
Exercise Types: Modified
References: Simon J.D. Prince. Understanding Deep learning. 2024
This problem generalizes Problem 4.10 of this textbook to [math]\displaystyle{ N }[/math] inputs and [math]\displaystyle{ M }[/math] outputs.
Question
(a) Consider a deep neural network with a single input, a single output, and [math]\displaystyle{ K }[/math] hidden layers, each containing [math]\displaystyle{ D }[/math] hidden units. How many parameters does this network have in total?
(b) Now, generalize the problem: if the number of inputs is [math]\displaystyle{ N }[/math] and the number of outputs is [math]\displaystyle{ M }[/math], how many parameters does this network have in total?
Solution
(a) Total number of parameters when there is a single input and output:
For the first layer, the input size is [math]\displaystyle{ 1 }[/math] and the output size is [math]\displaystyle{ D }[/math]. Therefore, the number of weights is [math]\displaystyle{ 1D }[/math], and the number of biases is [math]\displaystyle{ D }[/math].
Number of parameters: [math]\displaystyle{ D + D = 2D }[/math]
For hidden layers [math]\displaystyle{ i \longrightarrow i+1,i\in1,...,K-1 }[/math]: Each hidden layer connects [math]\displaystyle{ D }[/math] units to another [math]\displaystyle{ D }[/math] units. Therefore, for each layer, the number of weights is [math]\displaystyle{ D^2 }[/math], and the number of biases is [math]\displaystyle{ D }[/math].
Number of parameters for all [math]\displaystyle{ K-1 }[/math] hidden layers: [math]\displaystyle{ (K-1)(D^2 + D) }[/math]
For the output layer, the number of weights is [math]\displaystyle{ D }[/math], and the number of biases is [math]\displaystyle{ 1 }[/math].
Number of parameters: [math]\displaystyle{ D + 1 }[/math]
Therefore, the total number of parameters is [math]\displaystyle{ 2D + (K-1)(D^2 + D) + D + 1 }[/math].
(b) Total number of parameters for [math]\displaystyle{ N }[/math] inputs and [math]\displaystyle{ M }[/math] outputs:
In this case, the number of parameters for the first layer becomes [math]\displaystyle{ ND+D }[/math], while the number of parameters for the output layer becomes [math]\displaystyle{ DM+M }[/math].
Therefore, in total, the number of parameters is [math]\displaystyle{ ND+D+(K-1)(D^2+D)+MD+M }[/math]
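The counting argument above can be sanity-checked with a tiny helper; the function name count_params and the example values of D and K are assumptions made for illustration.
def count_params(N, M, D, K):
    first = N * D + D                 # input layer -> first hidden layer (weights + biases)
    hidden = (K - 1) * (D * D + D)    # K-1 transitions between hidden layers of width D
    output = D * M + M                # last hidden layer -> output layer
    return first + hidden + output

# Part (a): N = M = 1, e.g. D = 4, K = 3 gives 2D + (K-1)(D^2 + D) + D + 1 = 53
print(count_params(1, 1, 4, 3))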
Exercise 1.3
Level: * (Easy)
Exercise Types: Modified
References: Simon J.D. Prince. Understanding Deep learning. MIT Press, 2023
This problem modified from the background mathematics problem chap01 Question1.
Question
A single linear equation with three inputs associates a value [math]\displaystyle{ y }[/math] with each point in a 3D space [math]\displaystyle{ (x_1,x_2,x_3) }[/math]. Is it possible to visualize this? What value is at position [math]\displaystyle{ (0,0,0) }[/math]?
We add an inverse problem: If [math]\displaystyle{ y, \omega_1, \omega_2, \omega_3 }[/math] and [math]\displaystyle{ \beta }[/math] are known, derive a system of equations to solve for the input values [math]\displaystyle{ x_1, x_2, x_3 }[/math] that produce a specific output value of [math]\displaystyle{ y }[/math]. Under what conditions is this problem solvable?
Solution
A single linear equation with three inputs is of the form:
[math]\displaystyle{ y = \beta + \omega_1 x_1 + \omega_2 x_2 + \omega_3 x_3 }[/math]
where [math]\displaystyle{ \beta }[/math] is the offset, and [math]\displaystyle{ \omega_1, \omega_2, \omega_3 }[/math] are weights for the inputs [math]\displaystyle{ x_1, x_2, x_3 }[/math].
We can define the code as follows:
def linear_function_3D(x1, x2, x3, beta, omega1, omega2, omega3):
    y = beta + omega1 * x1 + omega2 * x2 + omega3 * x3
    return y
Given [math]\displaystyle{ \beta = 0.5, \omega_1 = -1.0, \omega_2 = 0.4 }[/math] and [math]\displaystyle{ \omega_3 = -0.3 }[/math],
[math]\displaystyle{ y = \beta + \omega_1 \cdot 0 + \omega_2 \cdot 0 + \omega_3 \cdot 0 }[/math]
Thus, [math]\displaystyle{ y(0, 0, 0) = 0.5. }[/math]
To visualize, we can fix [math]\displaystyle{ x_3 = 0 }[/math] and let [math]\displaystyle{ x_1, x_2 }[/math] vary, and generate the [math]\displaystyle{ y }[/math]-values using the equation.
Here is the code:
import numpy as np
import matplotlib.pyplot as plt

# Generate grid for x1 and x2, fix x3 = 0
x1 = np.linspace(-10, 10, 100)
x2 = np.linspace(-10, 10, 100)
x1, x2 = np.meshgrid(x1, x2)
x3 = 0

# Define coefficients
beta = 0.5
omega1 = -1.0
omega2 = 0.4
omega3 = -0.3

# Compute y-values
y = linear_function_3D(x1, x2, x3, beta, omega1, omega2, omega3)

# Visualization
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x1, x2, y, cmap='viridis')
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('y')
plt.title('3D Linear Function with Fixed x3=0')
plt.show()
The resulting plot is a flat plane in the [math]\displaystyle{ (x_1, x_2, y) }[/math] space, as expected for a linear function.
For the inverse problem, given [math]\displaystyle{ y, \beta, \omega_1, \omega_2, \omega_3, }[/math] we can solve for [math]\displaystyle{ x_1, x_2, x_3 }[/math] as follows:
[math]\displaystyle{ \begin{bmatrix} \omega_1 & \omega_2 & \omega_3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = y - \beta }[/math]
This is a single equation in three unknowns, so it is solvable whenever [math]\displaystyle{ \omega }[/math] is not the zero vector; in that case there are infinitely many solutions (a plane of solutions) rather than a unique one.
y = 10.0
beta = 1.0
omega = [2, -1, 0.5]
rhs = y - beta

# Solve using least squares; for this underdetermined system lstsq returns
# the minimum-norm solution among the infinitely many possibilities
x_vec = np.linalg.lstsq(np.array([omega]), [rhs], rcond=None)[0]
print(f"Solution for x: {x_vec}")
Exercise 1.4
Level: * (Easy)
Exercise Types: Novel
Question
Consider a single-layer feedforward neural network with 3 inputs and 1 output. Compute the network's output under each of the activation functions listed below.
Assuming:
- Input vector: [math]\displaystyle{ x = (0.1, 0.4, 0.6) }[/math]
- weights: [math]\displaystyle{ w = (0.2, 0.3, 0.5) }[/math]
- Bias: [math]\displaystyle{ b = 0.1 }[/math]
- a). Sigmoid activation function: [math]\displaystyle{ f(z) = \frac{1}{1 + e^{-z}} }[/math]
- b). ReLU activation function: [math]\displaystyle{ f(z) = \max(0, z) }[/math]
- c). Tanh activation function: [math]\displaystyle{ f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} }[/math]
Solution
1. Compute the weighted sum: [math]\displaystyle{ z = w \cdot x + b = (0.2)(0.1) + (0.3)(0.4) + (0.5)(0.6) + 0.1 }[/math]
Breaking this down step-by-step: [math]\displaystyle{ z = 0.02 + 0.12 + 0.3 + 0.1 = 0.54 }[/math]
2. a). Apply the sigmoid activation function: [math]\displaystyle{ f(z) = \frac{1}{1 + e^{-z}} }[/math]
Substituting [math]\displaystyle{ z = 0.54 }[/math]: [math]\displaystyle{ f(z) = \frac{1}{1 + e^{-0.54}} \approx \frac{1}{1 + 0.582} \approx 0.632 }[/math]
Thus, the final output is 0.632.
b). Similarly, apply the ReLU activation function: [math]\displaystyle{ f(z) = \max(0, z) }[/math]
Substituting [math]\displaystyle{ z = 0.54 }[/math]: [math]\displaystyle{ f(z) = \max(0, 0.54) = 0.54 }[/math]
c). Finally, apply the Tanh activation function: [math]\displaystyle{ f(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} }[/math]
Substituting [math]\displaystyle{ z = 0.54 }[/math]: [math]\displaystyle{ f(z) = \frac{e^{0.54} - e^{-0.54}}{e^{0.54} + e^{-0.54}} \approx \frac{1.716 - 0.583}{1.716 + 0.583} \approx \frac{1.133}{2.299} \approx 0.493 }[/math]
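The three results can be reproduced with a few lines of NumPy; this sketch simply evaluates the formulas above and is not part of the original solution.
import numpy as np

x = np.array([0.1, 0.4, 0.6])
w = np.array([0.2, 0.3, 0.5])
b = 0.1
z = w @ x + b                          # 0.54

sigmoid = 1.0 / (1.0 + np.exp(-z))     # ~0.632
relu = max(0.0, z)                     # 0.54
tanh = np.tanh(z)                      # ~0.493
print(z, sigmoid, relu, tanh)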
Exercise 1.5
Level: * (Easy)
Exercise Types: Novel
Question
1. 2012: ________'s ImageNet victory brings mainstream attention to deep learning.
2. 2016: Google's ________ uses deep reinforcement learning to defeat a Go world champion.
3. 2017: The ________ architecture revolutionizes Natural Language Processing.
Solution
1. AlexNet
2. AlphaGo
3. Transformer
Key Milestones in Deep Learning
•2006: Deep Belief Networks – The modern era of deep learning begins.
•2012: AlexNet's ImageNet victory brings mainstream attention.
•2014-2015: Introduction of Generative Adversarial Networks (GANs).
•2016: Google's AlphaGo uses deep learning to defeat a Go world champion.
•2017: Transformer architecture revolutionizes Natural Language Processing.
•2018-2019: BERT and GPT-2 set new benchmarks in NLP.
•2020: GPT-3 demonstrates advanced language understanding and generation.
•2021: AlphaFold 2 achieves breakthroughs in protein structure prediction.
•2021-2022: Diffusion Models (e.g., DALL-E 2, Stable Diffusion) achieve state-of-the-art in image and video generation.
•2022: ChatGPT popularizes conversational AI and large language models (LLMs).
Exercise 1.6
Level: * (Easy)
Exercise Type: Novel
Question
a) What are some common examples of first-order search strategies in neural network optimization, and why are first-order methods generally preferred over second-order methods?
b) What is the difference between a deep neural network and a shallow neural network, and how many hidden layers does each typically have?
c) Prove that a perceptron cannot converge for the XOR problem.
Solution
a)
Common examples of first-order search strategies in neural network optimization include Gradient Descent (GD), Stochastic Gradient Descent (SGD), Momentum, and Adam. These methods rely on gradients (first derivatives) of the loss function to update model parameters, making them computationally efficient and scalable. First-order methods are preferred due to their efficiency, scalability to large datasets, and lower memory requirements compared to second-order methods. While second-order methods can converge faster, first-order methods like Adam balance performance and resource usage well, especially in large-scale networks.
b)
A deep neural network typically has more than 2 hidden layers, allowing it to learn complex, abstract features at each layer. A shallow neural network usually has 1 or 2 hidden layers. Therefore, networks with more than 2 hidden layers are considered deep, while those with fewer layers are considered shallow.
c)
Step 1: XOR Dataset
The XOR problem has the following data points and labels:
x₁ | x₂ | y |
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |
Step 2: Perceptron Decision Boundary
The perceptron decision boundary is defined as:
[math]\displaystyle{ z = w_1 x_1 + w_2 x_2 + b }[/math]
A point is classified as:
- y = 1 if z > 0
- y = 0 if z < 0
For the XOR dataset, we derive inequalities for each data point.
Step 3: Derive Inequalities
1. For (x₁, x₂) = (0, 0), y = 0:
b < 0
2. For (x₁, x₂) = (0, 1), y = 1:
w₂ + b > 0
3. For (x₁, x₂) = (1, 0), y = 1:
w₁ + b > 0
4. For (x₁, x₂) = (1, 1), y = 0:
w₁ + w₂ + b < 0
Step 4: Attempt to Solve
From the inequalities:
1. [math]\displaystyle{ b \lt 0 }[/math]
2. [math]\displaystyle{ w_2 + b \gt 0 \Rightarrow w_2 \gt -b }[/math]
3. [math]\displaystyle{ w_1 + b \gt 0 \Rightarrow w_1 \gt -b }[/math]
4. [math]\displaystyle{ w_1 + w_2 + b \lt 0 \Rightarrow w_1 + w_2 \lt -b }[/math]
Now, add inequalities (2) and (3):
[math]\displaystyle{ w_1 + w_2 \gt -2b }[/math]
But compare this with inequality (4):
[math]\displaystyle{ w_1 + w_2 \lt -b }[/math]
Since [math]\displaystyle{ b \lt 0 }[/math], we have [math]\displaystyle{ -2b \gt -b }[/math], so inequalities (2) and (3) force [math]\displaystyle{ w_1 + w_2 \gt -2b \gt -b }[/math], which contradicts inequality (4).
Therefore, the XOR dataset is not linearly separable, and the perceptron cannot converge for the XOR problem.
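As an optional numerical illustration of part (c), the sketch below runs the perceptron update rule on the XOR data (with labels recoded to {-1, +1}); the epoch limit is an arbitrary assumption. The error count never reaches zero, matching the contradiction derived above.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])   # XOR labels recoded to {-1, +1}

w, b = np.zeros(2), 0.0
for epoch in range(50):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # misclassified point
            w += yi * xi
            b += yi
            errors += 1
    if errors == 0:
        break
print(errors)   # stays positive: the weights keep cycling and never separate XOR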
Exercise 1.7
Level: * (Easy)
Exercise Type: Novel
Question
The sigmoid activation function is defined as: [math]\displaystyle{ \sigma(x) = \frac{1}{1 + e^{-x}}. }[/math]
(a) Derive the derivative of [math]\displaystyle{ \sigma(x) }[/math] with respect to [math]\displaystyle{ x }[/math], and show that: [math]\displaystyle{ \sigma'(x) = \sigma(x)(1 - \sigma(x)). }[/math]
(b) Use this property to explain why sigmoid activation is suitable for modeling probabilities in binary classification tasks.
Solution
(a) Derivative: Starting with [math]\displaystyle{ \sigma(x) = \frac{1}{1 + e^{-x}} }[/math], we compute: [math]\displaystyle{ \sigma'(x) = \frac{d}{dx} \left( \frac{1}{1 + e^{-x}} \right) = \frac{e^{-x}}{(1 + e^{-x})^2}. }[/math]
By noting that [math]\displaystyle{ \sigma(x) = \frac{1}{1 + e^{-x}} }[/math] and [math]\displaystyle{ 1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}, }[/math] we simplify to: [math]\displaystyle{ \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr). }[/math]
(b) Why sigmoid for probabilities: The sigmoid function maps any real [math]\displaystyle{ x }[/math] into [math]\displaystyle{ (0,1) }[/math], which aligns with the range of valid probabilities in binary classification. Moreover, its derivative [math]\displaystyle{ \sigma'(x) = \sigma(x)(1 - \sigma(x)) }[/math] makes gradient-based optimization naturally scale updates based on “confidence.” When [math]\displaystyle{ \sigma(x) }[/math] is near 0 or 1, the gradient becomes small, preventing large adjustments once the model is fairly certain in its prediction.
A closely related function is the softmax, which generalizes the same probabilistic interpretation to multi-class settings. For two classes, softmax is essentially the same as the sigmoid function, so it can also be suitable for binary classification problems.
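A quick numerical check of part (a) is possible by comparing the analytic derivative with a central finite difference; this snippet is illustrative only.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
analytic = sigmoid(x) * (1 - sigmoid(x))
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central finite difference
print(np.max(np.abs(analytic - numeric)))               # agreement up to ~1e-10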
Exercise 1.8
Level: * (Easy)
Exercise Types: Novel
Question
In classification, it is possible to minimize the number of misclassifications directly by using:
[math]\displaystyle{ \sum_{i=1}^n \mathbf{1}\Bigl(\text{sign}(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) \neq y_i\Bigr) }[/math]
where [math]\displaystyle{ \mathbf{1}(\cdot) }[/math] is the indicator function, [math]\displaystyle{ \boldsymbol{\beta} }[/math] is the weight vector, and [math]\displaystyle{ \beta_0 }[/math] is the bias term. So, the loss function gives 1 for each incorrect response and 0 for each correct one.
(a) Why is this approach not commonly used in practice?
(b) Name and give formulas for two differentiable loss functions commonly employed in practice for binary classification tasks, explaining why they are more popular.
Solution
(a): The expression [math]\displaystyle{ \mathbf{1}\left(\text{sign}(\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0) \neq y_i \right) }[/math] gives only 0 or 1. Small changes in [math]\displaystyle{ \boldsymbol{\beta} }[/math] or [math]\displaystyle{ \beta_0 }[/math] can suddenly change the loss for a sample from 0 to 1 (or vice versa). Because the loss is piecewise constant, its gradient with respect to [math]\displaystyle{ \boldsymbol{\beta} }[/math] or [math]\displaystyle{ \beta_0 }[/math] is zero almost everywhere and undefined at the jumps, so it provides no useful direction for updates. Standard optimization techniques like gradient descent rely on differentiable, continuous loss functions whose partial derivatives can be used to update the parameters.
(b): Two alternative loss functions:
Hinge Loss: [math]\displaystyle{ \sum_{i=1}^n \max(0, 1 - y_i (\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0)) }[/math]
The hinge loss is often used in Support Vector Machines (SVMs) and works well when the data is linearly separable.
Logistic (Cross-Entropy) Loss: [math]\displaystyle{ \sum_{i=1}^n \log\left( 1 + \exp(-y_i (\boldsymbol{\beta}^T \mathbf{x}_i + \beta_0)) \right) }[/math]
The logistic loss (or cross-entropy loss) is commonly used in logistic regression and neural networks. It is differentiable so it works well for gradient-based optimization methods.
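For concreteness, the two losses can be evaluated on a tiny made-up dataset; the data, parameter values, and function names below are assumptions for illustration.
import numpy as np

def hinge_loss(beta, beta0, X, y):
    margins = y * (X @ beta + beta0)
    return np.sum(np.maximum(0.0, 1.0 - margins))

def logistic_loss(beta, beta0, X, y):
    margins = y * (X @ beta + beta0)
    return np.sum(np.log1p(np.exp(-margins)))   # log(1 + exp(-margin))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5]])
y = np.array([1, -1, -1])
beta, beta0 = np.array([0.5, -0.5]), 0.1
print(hinge_loss(beta, beta0, X, y), logistic_loss(beta, beta0, X, y))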
Exercise 1.9
Level: ** (Easy)
Exercise Types: Novel
Question
How are neural networks modeled? Use an example to explain.
Solution
Neural networks are modeled from biological neurons. A neural network consists of layers of interconnected neurons where each connection has associated weights. The input layer receives the data features, and each neuron corresponds to one feature from the dataset. The hidden layer consists of multiple neurons that transform the input data into intermediate representations, using a combination of weights, biases, and activation functions which allows the network to learn complex patterns, like the Sigmoid function [math]\displaystyle{ S(x) = \frac {1}{1+e^{-x}} }[/math]. The output layer generates the final prediction, such as probabilities for classification or continuous values for regression. Neural networks learn to map inputs to outputs by adjusting weights during training to minimize the error between predicted and actual outputs.
For example, in the lecture note, inputs [math]\displaystyle{ x_1 = 0.5 , x_2 = 0.9, x_3 = -0.3 }[/math] are passed through a hidden layer with specific weights:
[math]\displaystyle{ H_1 ~ weight = (1.0,-2.0,2.0) }[/math],
[math]\displaystyle{ H_2 ~ weight= (2.0, 1.0, -4.0) }[/math],
[math]\displaystyle{ H_3 ~ weight = (1.0,-1.0,0.0) }[/math].
The pre-activations are passed through the sigmoid, yielding hidden neuron values of
[math]\displaystyle{ H_1 = S(0.5\times 1.0 + 0.9\times -2.0 + -0.3 \times 2.0) = S(-1.9) \approx 0.13 }[/math],
[math]\displaystyle{ H_2 = S(0.5\times 2.0 + 0.9\times 1.0 + -0.3\times -4.0) = S(3.1) \approx 0.96 }[/math],
[math]\displaystyle{ H_3 = S(0.5\times 1.0 + 0.9\times -1.0 + -0.3\times 0.0) = S(-0.4) \approx 0.40 }[/math],
which are then processed by the output layer to produce the final predictions.
This process demonstrates how neural networks learn and transform input data step by step.
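The computation in this example can be verified with a short NumPy sketch (illustrative, not part of the lecture note):
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.9, -0.3])
W = np.array([[1.0, -2.0,  2.0],    # weights feeding H1
              [2.0,  1.0, -4.0],    # weights feeding H2
              [1.0, -1.0,  0.0]])   # weights feeding H3
pre_activation = W @ x              # [-1.9, 3.1, -0.4]
hidden = sigmoid(pre_activation)    # approximately [0.13, 0.96, 0.40]
print(pre_activation, hidden)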
Exercise 1.10
Level: ** (Moderate)
Exercise Types: Novel
Question
Biological neurons in the human brain have the following characteristics:
1. A neuron fires an electrical signal only when its membrane potential exceeds a certain threshold. Otherwise, it remains inactive.
2. Neurons are connected to one another through dendrites (input) and axons (outputs), forming a highly interconnected network.
3. The intensity of the signal passed between neurons depends on the strength of the connection, which can change over time due to learning and adaptation.
Considering the above points, answer the following questions:
Explain how these biological properties of neurons might inspire the design and functionality of nodes in artificial neural networks.
Solution
1. Threshold Behavior: The concept of a neuron firing only when its membrane potential exceeds a threshold is mirrored in neural networks through activation functions. These functions decide whether a node "fires" by producing a significant output.
2. Connectivity: The connections between biological neurons via dendrites and axons inspire the weighted connections in artificial neural networks. Each node receives inputs, processes them, and sends weighted outputs to subsequent nodes, similar to how signals propagate in the brain.
3. Learning and Adaptation: Biological neurons strengthen or weaken their connections based on experience (neuroplasticity). This is similar to how artificial networks adjust weights during training using backpropagation and optimization algorithms. The dynamic modification of weights allows artificial networks to learn from data.
Extra 4. Sparsity of Activation: Biologically, only a small fraction of neurons in the brain are active at any given time, which is energy-efficient and reduces redundancy. ReLU mimics this sparsity by zeroing out negative pre-activations, which reduces computational cost and can improve generalization.
Exercise 1.11
Level: * (Easy)
Exercise Type: Novel
Question
If the pre-activation is 20, what are the outputs of the following activation functions: ReLU, Leaky ReLU, logistic, and hyperbolic?
Choose the correct answer:
a) 20, 20, 1, 1
b) 20, 0, 1, 1
c) 20, -20, 1, 1
d) 20, 20, -1, 1
e) 20, -20, 1, -1
Solution
The correct answer is a): 20, 20, 1, 1.
Calculation
[math]\displaystyle{ \text{ReLU}(20) = \max(0, 20) = 20 }[/math]
[math]\displaystyle{ \text{LeakyReLU}(20) = \begin{cases} 20 & \text{if } 20 \geq 0 \\ \alpha \cdot 20 & \text{if } 20 \lt 0 \end{cases} = 20 }[/math] where [math]\displaystyle{ \alpha }[/math] is a small constant (typically [math]\displaystyle{ 0.01 }[/math]).
[math]\displaystyle{ \sigma(20) = \frac{1}{1 + e^{-20}} \approx 1 }[/math]
[math]\displaystyle{ \tanh(20) = \frac{e^{20} - e^{-20}}{e^{20} + e^{-20}} \approx 1 }[/math]
Exercise 1.12
Level: * (Easy)
Exercise Type: Novel
Question
Imagine a simple feedforward neural network with a single hidden layer. The network structure is as follows:
- Linear activation function.
- The input layer has 2 neurons.
- The hidden layer has 2 neurons.
- The output layer has 1 neuron.
- There are no biases in the network.
If the weights from the input layer to the hidden layer are given by: [math]\displaystyle{ W^{(1)} = \begin{bmatrix} 0.5 & -0.6 \\ 0.1 & 0.8 \end{bmatrix} }[/math] and the weights from the hidden layer to the output layer are given by: [math]\displaystyle{ W^{(2)} = \begin{bmatrix} 0.3 \\ -0.2 \end{bmatrix} }[/math]
Calculate the output of the network for the input vector [math]\displaystyle{ \mathbf{x} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} }[/math] using a linear activation function for all neurons.
Hint
- The output of each layer is calculated by multiplying the input of that layer by the layer's weight matrix.
- Use matrix multiplication to compute the outputs step-by-step.
Solution
Step 1: Calculate the hidden layer output
The input to the hidden layer is the initial input [math]\displaystyle{ \mathbf{x} }[/math]: [math]\displaystyle{ h^{(1)} = W^{(1)} \mathbf{x} = \begin{bmatrix} 0.5 & -0.6 \\ 0.1 & 0.8 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0.1 \end{bmatrix} }[/math]
Step 2: Calculate the output layer output
The input to the output layer is the output from the hidden layer. Since [math]\displaystyle{ W^{(2)} }[/math] is a column vector, the output is its dot product with [math]\displaystyle{ h^{(1)} }[/math]: [math]\displaystyle{ y = (W^{(2)})^T h^{(1)} = \begin{bmatrix} 0.3 & -0.2 \end{bmatrix} \begin{bmatrix} 0.5 \\ 0.1 \end{bmatrix} = 0.3 \times 0.5 + (-0.2) \times 0.1 = 0.15 - 0.02 = 0.13 }[/math]
Thus, the output of the network for the input vector [math]\displaystyle{ \mathbf{x} = \begin{bmatrix} 1 \\ 0 \end{bmatrix} }[/math] is [math]\displaystyle{ 0.13 }[/math].
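The same forward pass can be reproduced in NumPy (a small illustrative check):
import numpy as np

W1 = np.array([[0.5, -0.6],
               [0.1,  0.8]])
W2 = np.array([0.3, -0.2])   # output weights, treated as a vector
x = np.array([1.0, 0.0])

h = W1 @ x                   # hidden layer output (linear activation): [0.5, 0.1]
y = W2 @ h                   # network output: 0.3*0.5 + (-0.2)*0.1 = 0.13
print(h, y)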
Exercise 1.13
Level: * (Easy)
Exercise Types: Novel
Question
Explain whether this is a classification, regression, or clustering task each time. If the task is either classification or regression, also comment on whether the focus is prediction or explanation.
1. Stock Market Trends:
A financial analyst wants to predict the future stock prices of a company based on historical trends, economic indicators, and company performance metrics.
2. Customer Segmentation:
A retail company wants to group its customers based on their purchasing behaviour, including transaction frequency, product categories, and total spending, to design targeted marketing campaigns.
3. Medical Diagnosis:
A hospital wants to develop a model to determine whether a patient has a specific disease based on symptoms, medical history, and lab test results.
4. Predicting Car Fuel Efficiency:
An automotive researcher wants to understand how engine size, weight, and aerodynamics affect a car's fuel efficiency (miles per gallon).
Solution
1. Stock Market Trends
- Task Type: Regression
- Focus: Prediction
- Reasoning: Stock prices are continuous numerical values, making this a regression task. The goal is to predict future prices rather than explain past fluctuations.
2. Customer Segmentation
- Task Type: Clustering
- Focus: —
- Reasoning: Customers are grouped based on their purchasing behaviour without predefined labels, making this a clustering task.
3. Medical Diagnosis
- Task Type: Classification
- Focus: Prediction
- Reasoning: The disease status is a categorical outcome (Has disease: Yes/No), making this a classification problem. The goal is to predict a diagnosis for future patients.
4. Predicting Car Fuel Efficiency
- Task Type: Regression
- Focus: Explanation
- Reasoning: Fuel efficiency (miles per gallon) is a continuous variable. The researcher is interested in understanding how different factors influence efficiency, so the focus is on explanation.
Summary
Task | Type | Focus | Reasoning |
---|---|---|---|
Stock Market Trends | Regression | Prediction | Predict future stock prices (continuous variable). |
Customer Segmentation | Clustering | — | Group customers based on purchasing behaviour. |
Medical Diagnosis | Classification | Prediction | Determine if a patient has a disease (Yes/No). |
Predicting Car Fuel Efficiency | Regression | Explanation | Understand how factors affect fuel efficiency. |
Exercise 1.14
Level: ** (Easy)
Exercise Types: Novel
Question
You are given a set of real-world scenarios. Your task is to identify the most suitable fundamental machine learning approach for each scenario and justify your choice.
Scenarios:
1. Loan Default Prediction:
A bank wants to predict whether a loan applicant will default on their loan based on their credit history, income, and employment status.
2. House Price Estimation:
A real estate company wants to estimate the price of a house based on features such as location, size, and number of bedrooms.
3. User Grouping for Advertising:
A social media platform wants to group users with similar interests and online behavior for targeted advertising.
4. Dimensionality Reduction in Medical Data:
A medical researcher wants to reduce the number of variables in a dataset containing hundreds of patient health indicators while retaining the most important information.
Tasks:
- For each scenario, classify the problem into one of the four fundamental categories: Classification, Regression, Clustering, or Dimensionality Reduction.
- Explain why you selected that category for each scenario.
- Suggest a possible algorithm that could be used to solve each problem.
Solution
1. Loan Default Prediction
- Task Type: Classification
- Reasoning: The target variable (loan default) is categorical (Yes/No), making this a classification problem. The goal is to predict whether an applicant will default based on their financial history.
- Possible Algorithm: Logistic Regression, Random Forest, or Gradient Boosting.
2. House Price Estimation
- Task Type: Regression
- Reasoning: House prices are continuous numerical values, making this a regression task. The goal is to estimate a house's price based on features like location and size.
- Possible Algorithm: Linear Regression, Decision Trees, or XGBoost.
3. User Grouping for Advertising
- Task Type: Clustering
- Reasoning: The goal is to group users based on their behavior without predefined labels, making this a clustering task.
- Possible Algorithm: K-Means, DBSCAN, or Hierarchical Clustering.
4. Dimensionality Reduction in Medical Data
- Task Type: Dimensionality Reduction
- Reasoning: The goal is to reduce the number of variables while preserving essential information, making this a dimensionality reduction task.
- Possible Algorithm: Principal Component Analysis (PCA), t-SNE, or Autoencoders.
Exercise 1.15
Level: ** (Easy)
Exercise Types: Novel
Question
Define what machine learning is and how it is different from classical statistics. Provide the three learning methods used in machine learning, briefly define each and give an example of where each of them can be used. Include some common algorithms for each of the learning methods.
Solution
Machine Learning Definition
– Machine learning is the ability to teach a computer without explicitly programming it.
– Examples are used to train computers to perform tasks that would be difficult to program directly.
A practical difference between classical statistics and machine learning lies in the scale of the data they typically work with: classical statistics usually draws inference from relatively small datasets (often too little data), while machine learning is designed to learn from very large datasets (often more data than can be inspected by hand).
Supervised Learning
Supervised learning is a type of machine learning where the model is trained on a labeled dataset, meaning each training example has input features and a corresponding correct output. The algorithm learns the relationship between inputs and outputs to make predictions on new, unseen data.
Examples: Predicting house prices based on location, size, and other features (Regression). Identifying whether an email is spam or not (Classification).
Common Algorithms: Linear Regression, Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), Neural Networks.
Unsupervised Learning
Unsupervised learning involves training a model on data without labeled outputs. The algorithm attempts to discover patterns, structures, or relationships within the data.
Examples: Grouping customers with similar purchasing behaviors for targeted marketing (Clustering). Identifying important features in a high-dimensional dataset (Dimensionality Reduction).
Common Algorithms: K-Means, Hierarchical Clustering, DBSCAN (Clustering). Principal Component Analysis (PCA), t-SNE, Autoencoders (Dimensionality Reduction).
Reinforcement Learning
Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize cumulative rewards. The agent interacts with the environment, receives feedback in the form of rewards or penalties, and improves its strategy over time.
Examples: Training a robot to walk by rewarding successful movements. Teaching an AI to play chess or video games by rewarding wins and penalizing losses.
Common Algorithms: Q-Learning, Deep Q Networks (DQN), Policy Gradient Methods, Proximal Policy Optimization (PPO).
Summary
Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
---|---|---|---|
Definition | Learning from labeled data where inputs are paired with outputs. | Learning patterns or structures from unlabeled data. | Learning by interacting with an environment to maximize cumulative rewards. |
Key Characteristics | Trains on known inputs and outputs to predict outcomes for unseen data. | No predefined labels; discovers hidden structures in the data. | Agent learns through trial and error by receiving rewards or penalties for its actions. |
Examples | - Predicting house prices (Regression). - Classifying emails as spam or not (Classification). | - Grouping customers by behavior (Clustering). - Reducing variables in large datasets (Dimensionality Reduction). | - Training robots to walk. - Teaching AI to play chess or video games. |
Common Algorithms | - Linear Regression - Logistic Regression - Decision Trees - Random Forest - SVM - Neural Networks | - K-Means - Hierarchical Clustering - PCA - t-SNE - Autoencoders | - Q-Learning - Deep Q Networks (DQN) - Policy Gradient Methods - Proximal Policy Optimization (PPO) |
Exercise 1.16
Level: * (Easy)
Exercise Types: Novel
Question
Categorize each of these machine learning scenarios into supervised learning, unsupervised learning, or reinforcement learning. Justify your reasoning for each case.
(a) A neural network is trained to classify handwritten digits using the MNIST dataset, which contains 60 000 images of handwritten digits, along with the correct answer for each image.
(b) A robot is programmed to learn how to play a video game. It does not have access to the game’s rules, but it can observe its current score after each action. Over time, it learns to play better by maximizing its score.
(c) A deep learning model is designed to segment medical images into different sections corresponding to specific organs. The training data consists of medical scans that have been annotated by experts to mark the boundaries of the organs.
(d) A machine learning model is given 100 000 astronomical images of unknown stars and galaxies. Using dimensionality reduction techniques, it groups similar-looking objects based on their features, such as size and shape.
Solution
(a) Supervised learning: The model is trained with labeled data, where each image has a corresponding digit label.
(b) Reinforcement learning: The model learns by interacting with an environment and receiving feedback in the form of rewards or penalties. It explores different actions to maximize cumulative rewards over time.
(c) Supervised learning: The model uses labeled data where professionals annotated each region of the image.
(d) Unsupervised learning: The model works with unlabeled data to find patterns and group similar objects.
Exercise 1.17
Level: * (Easy)
Exercise Types: Novel
Question
How does the introduction of ReLU as an activation function address the vanishing gradient problem observed in early deep learning models using sigmoid or tanh functions?
Solution
The vanishing gradient problem occurs when activation functions like sigmoid or tanh compress their inputs into small ranges, resulting in gradients that become very small during backpropagation. This hinders learning, particularly in deeper networks.
The ReLU (Rectified Linear Unit), defined as [math]\displaystyle{ f(x) = \max(0, x) }[/math], addresses this issue effectively:
(a) Non-Saturating Gradients: For positive input values, ReLU's gradient remains constant (equal to 1), preventing gradients from vanishing.
(b) Efficient Computation: The simplicity of the ReLU function makes it computationally faster than the sigmoid or tanh functions, which involve more complex exponential calculations.
(c) Sparse Activations: ReLU outputs zero for negative inputs, leading to sparse activations, which can improve computational efficiency and reduce overfitting.
However, ReLU can experience the "dying ReLU" problem, where neurons output zero for all inputs and effectively become inactive. Variants like Leaky ReLU and Parametric ReLU address this by allowing small, non-zero gradients for negative inputs, ensuring neurons remain active.
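To make point (a) concrete, the sketch below compares the gradient magnitudes of the two activations at a few input values; it is an illustrative aside rather than part of the solution.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
sigmoid_grad = sigmoid(x) * (1 - sigmoid(x))   # at most 0.25, nearly 0 for large |x|
relu_grad = (x > 0).astype(float)              # exactly 1 for every positive input
print(sigmoid_grad)
print(relu_grad)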
Exercise 1.18
Level: * (Easy)
Exercise Types: Novel
Question
What is the general concept of text generation in deep learning, and how does it work?
Solution
Text generation in deep learning refers to the process of automatically creating coherent and contextually relevant text based on input data or a learned language model. The goal is to produce text that mimics human-written content, maintaining grammatical structure, logical flow, and contextual relevance.
There are five steps.
1. Training on a Language Corpus: A deep learning model, such as a Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), or Transformer, is trained on a large dataset of text. During training, the model learns patterns, relationships between words, and context within sentences and across paragraphs.
2. Tokenization and Embeddings: Input text is broken into smaller units, such as words or subwords (tokens). These tokens are converted into numerical vectors (embeddings) that capture semantic and syntactic relationships.
3. The model predicts the probability of the next word or token in a sequence based on the context provided by the preceding words. It uses conditional probability, such as: [math]\displaystyle{ P(w_t \mid w_1, w_2, ..., w_{t-1}) }[/math] to determine the likelihood of the next token.
4. Once the model generates probabilities for the next token, decoding strategies are used to construct text.
5. Generated text is evaluated for coherence, fluency, and relevance. Techniques such as fine-tuning on specific domains or datasets improve the model's performance for targeted applications.
Exercise 1.19
Level: * (Easy)
Exercise Types: Novel
Question
Supervised learning and unsupervised learning are two of the main types of machine learning, and they differ mainly in how the models are trained and the type of data used. Briefly state their differences.
Solution
Supervised Learning:
Data: Requires labeled data.
Goal: The model learns a mapping from inputs to the correct output.
Example Tasks: Classification and regression.
Training Process: The model is provided with both input data and corresponding labels during training, allowing it to learn from these examples to make predictions on new, unseen data.
Common Algorithms: Linear regression, decision trees, random forests, support vector machines, and neural networks.
Unsupervised Learning:
Data: Does not require labeled data.
Goal: The model tries to find hidden patterns or structure in the data.
Example Tasks: Clustering and dimensionality reduction.
Training Process: The model analyzes the input data without being told the correct answer, and it organizes or structures the data in meaningful ways.
Common Algorithms: K-means clustering, hierarchical clustering, principal component analysis (PCA), and autoencoders.
Exercise 1.20
Level: * (Easy)
Exercise Types: Novel
Question
It was mentioned in lecture that the step function had previously been used as an activation function, but we now commonly use the sigmoid function as an activation function. Highlight the key differences between these functions.
Solution
- The step function takes a single real-valued number and outputs 0 if the number is negative and 1 if the number is 0 or positive.
- The sigmoid activation function is an S-shaped curve with the output spanning between 0 and 1 (not inclusive).
- The equation for the sigmoid function is [math]\displaystyle{ f(x) = \frac{1}{1+ e^{-x}} }[/math].
- The step activation function only produces two values as output, 0 or 1, whereas the sigmoid activation function produces a continuous range of values between 0 and 1.
- The smoothness (differentiability) of the sigmoid activation function makes it more suitable for gradient-based learning in neural networks, allowing for efficient backpropagation.
Exercise 1.21
Level: * (Easy)
Exercise Types: Novel
Question
Consider a linear regression model where we aim to estimate the weight vector [math]\displaystyle{ w }[/math] by minimizing the Residual Sum of Squares (RSS), defined as:
[math]\displaystyle{ \text{RSS}(w) = \frac{1}{2} \sum_{n=1}^{N} (y_n - w^T x_n)^2 = \frac{1}{2} \|Xw - y\|_2^2 = \frac{1}{2} (Xw - y)^T (Xw - y). }[/math]
- Compute the gradient: Derive the gradient of [math]\displaystyle{ \text{RSS}(w) }[/math] with respect to [math]\displaystyle{ w }[/math].
- Find the optimal [math]\displaystyle{ w }[/math]: Solve for [math]\displaystyle{ w }[/math] by setting the gradient to zero.
- Interpretation: What is the significance of the solution you obtained in terms of ordinary least squares (OLS)?
Provide your answers with clear derivations and explanations.
Solution
To find the optimal weight vector [math]\displaystyle{ w }[/math], we first compute the gradient of the Residual Sum of Squares (RSS):
[math]\displaystyle{ \nabla_w \text{RSS}(w) = X^T X w - X^T y. }[/math]
Setting the gradient to zero and solving for [math]\displaystyle{ w }[/math] gives:
[math]\displaystyle{ X^T X w = X^T y. }[/math]
These are known as the normal equations, since at the optimal solution, [math]\displaystyle{ y - Xw }[/math] is orthogonal to the range of [math]\displaystyle{ X }[/math].
The corresponding solution [math]\displaystyle{ \hat{w} }[/math] is the ordinary least squares (OLS) solution, given by:
[math]\displaystyle{ \hat{w} = (X^T X)^{-1} X^T y. }[/math]
The matrix [math]\displaystyle{ (X^T X)^{-1} X^T }[/math] is known as the (left) pseudo-inverse of [math]\displaystyle{ X }[/math], which generalizes matrix inversion for non-square matrices.
To ensure the solution is unique, we examine the Hessian matrix:
[math]\displaystyle{ H(w) = \frac{\partial^2}{\partial w^2} \text{RSS}(w) = X^T X. }[/math]
If [math]\displaystyle{ X }[/math] has full column rank (i.e., its columns are linearly independent), then [math]\displaystyle{ H }[/math] is positive definite, as shown by:
[math]\displaystyle{ v^T (X^T X) v = (X v)^T (X v) = \|X v\|^2 \gt 0, \quad \text{for any nonzero vector } v. }[/math]
Since the Hessian is positive definite in this case, the least squares objective has a unique global minimum.
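As a quick numerical confirmation, the normal-equation solution can be compared with a library least-squares solver on synthetic data; the data-generating choices below are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                     # full column rank with overwhelming probability
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

w_normal = np.linalg.solve(X.T @ X, X.T @ y)      # solve the normal equations directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # library least-squares solver
print(np.allclose(w_normal, w_lstsq))             # True: both recover the OLS solution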
Exercise 1.22
Level: * (Easy)
Exercise Types: Novel
Question
Which of the following best highlights the key difference between Machine Learning and Deep Learning?
A. Machine Learning is only suitable for small datasets, while Deep Learning can handle datasets of any size.
B. Machine Learning is restricted to regression and classification, whereas Deep Learning is used for image and text processing.
C. Deep Learning can model without any data, while Machine Learning requires large datasets.
D. Machine Learning relies on manually extracted features, while Deep Learning can automatically learn feature representations.
Solution
Answer: D;
Explanation: Machine Learning algorithms often require manual feature engineering, whereas Deep Learning can automatically extract features from data through multi-layered neural networks. This is a significant distinction between the two.
Machine learning techniques include decision trees, SVMs, gradient boosting (e.g., XGBoost), etc. These techniques typically work better on smaller datasets due to their simpler structure, but they often struggle to match the flexibility and scalability of deep neural networks as the data becomes more complex.
Deep neural networks work well on large datasets with more complex structure, such as images or sentences (as in LLMs). They often require more data in order to learn effectively and avoid overfitting. Their complex architecture allows them to learn feature representations from raw data, without the need for manual feature engineering.
In summary,
Machine Learning: Simpler models, better suited for structured/tabular data.
Deep Learning: Automatically extracts features, and excels in unstructured data like images, audio, and text.
Exercise 1.23
Level: * (Easy)
Exercise Types: Novel
Question
Pros and Cons of supervised learning and unsupervised learning?
Solution
Supervised learning learns from labelled data. Its benefits include a clear objective and direct evaluation through performance metrics, such as MSE, that compare model predictions with the known labels. Its drawbacks include the time-consuming (and often costly) process of obtaining large labelled datasets and the risk of overfitting. Unsupervised learning instead detects patterns or structure in unlabelled data. Its benefits include usefulness for data preprocessing; for example, dimensionality reduction techniques can be used to simplify the data structure. Its drawback is that, without labels, the results are harder to evaluate and interpret than in supervised learning.
Exercise 1.24
Level: ** (Easy)
Exercise Types: Novel
Question
Consider the dataset: [math]\displaystyle{ \{(x_1, y_1), (x_2, y_2), (x_3, y_3)\} = \{([1, 2], 1), ([2, 3], 1), ([4, 5], -1)\}, }[/math] and a linear decision boundary defined as: [math]\displaystyle{ w_1x_1 + w_2x_2 + b = 0, }[/math] where the classifier predicts [math]\displaystyle{ y = 1 }[/math] if [math]\displaystyle{ f(x) \gt 0 }[/math], and [math]\displaystyle{ y = -1 }[/math] if [math]\displaystyle{ f(x) \leq 0 }[/math].
Given the weights and bias: [math]\displaystyle{ w_1 = 1, \, w_2 = -1, \, b = 0, }[/math] determine whether all points in the dataset are correctly classified.
Solution
The decision function is: [math]\displaystyle{ f(x) = w_1x_1 + w_2x_2 + b. }[/math] Substituting [math]\displaystyle{ w_1 = 1 }[/math], [math]\displaystyle{ w_2 = -1 }[/math], and [math]\displaystyle{ b = 0 }[/math], we evaluate [math]\displaystyle{ f(x) }[/math] for each point in the dataset.
For [math]\displaystyle{ x_1 = [1, 2] }[/math]: [math]\displaystyle{ f(x_1) = (1)(1) + (-1)(2) + 0 = -1 \quad \Rightarrow \, y = -1 \, (\text{incorrect, since } y_1 = 1). }[/math]
For [math]\displaystyle{ x_2 = [2, 3] }[/math]: [math]\displaystyle{ f(x_2) = (1)(2) + (-1)(3) + 0 = -1 \quad \Rightarrow \, y = -1 \, (\text{incorrect, since } y_2 = 1). }[/math]
For [math]\displaystyle{ x_3 = [4, 5] }[/math]: [math]\displaystyle{ f(x_3) = (1)(4) + (-1)(5) + 0 = -1 \quad \Rightarrow \, y = -1 \, (\text{correct, since } y_3 = -1). }[/math]
Therefore, not all points are correctly classified: with these weights and bias, only [math]\displaystyle{ x_3 }[/math] is classified correctly, while [math]\displaystyle{ x_1 }[/math] and [math]\displaystyle{ x_2 }[/math] are misclassified.
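The three evaluations can also be checked in one shot with NumPy (illustrative only):
import numpy as np

X = np.array([[1, 2], [2, 3], [4, 5]], dtype=float)
y = np.array([1, 1, -1])
w = np.array([1.0, -1.0])
b = 0.0

f = X @ w + b                     # [-1, -1, -1]
pred = np.where(f > 0, 1, -1)
print(pred, pred == y)            # only the third point matches its label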
Exercise 1.25
Level: * (Easy)
Exercise Types: Novel
Question
Given a dataset with three samples: [math]\displaystyle{ {(x_1,y_1)=(1,2),(x_2,y_2)=(2,3),(x_3,y_3)=(3,5)} }[/math].
Assume a hypothesis class [math]\displaystyle{ F }[/math] consisting of functions [math]\displaystyle{ f(x) = w\cdot x+b }[/math], where [math]\displaystyle{ w }[/math] and [math]\displaystyle{ b }[/math] are parameters. Use the MSE as the loss function: [math]\displaystyle{ L(y,f(x))=(y−f(x))^2 }[/math].
-a). For the score criterion, compute the sample score: [math]\displaystyle{ S(f)= \frac{1}{n}\sum_{i=1}^{n}L(y_i,f(x_i)) }[/math].
-b). For the search strategy, find the optimal parameters [math]\displaystyle{ w }[/math] and [math]\displaystyle{ b }[/math] that minimizes [math]\displaystyle{ S(f) }[/math].
Solution
-a). The loss for three samples:
[math]\displaystyle{ L(y_1, f(x_1))=(2-(w\cdot 1+b))^2 }[/math]
[math]\displaystyle{ L(y_2, f(x_2))=(3-(w\cdot 2+b))^2 }[/math]
[math]\displaystyle{ L(y_3, f(x_3))=(5-(w\cdot 3+b))^2 }[/math]
The sample score:
[math]\displaystyle{ S(f)=\frac{1}{3}[(2-(w\cdot 1+b))^2+(3-(w\cdot 2+b))^2+(5-(w\cdot 3+b))^2] }[/math]
Simplify the sample score formula:
[math]\displaystyle{ S(f)=\frac{1}{3}[(2-w-b)^2+(3-2w-b)^2+(5-3w-b)^2] }[/math]
-b). In order to minimize [math]\displaystyle{ S(f) }[/math], first differentiate [math]\displaystyle{ S(f) }[/math] with respect to [math]\displaystyle{ w }[/math] and [math]\displaystyle{ b }[/math]:
[math]\displaystyle{ \frac{\partial S(f)}{\partial w}=-\frac{2}{3}[(2-w-b)+2(3-2w-b)+3(5-3w-b)] }[/math]
[math]\displaystyle{ \frac{\partial S(f)}{\partial b}=-\frac{2}{3}[(2-w-b)+(3-2w-b)+(5-3w-b)] }[/math]
Setting [math]\displaystyle{ \frac{\partial S(f)}{\partial w} }[/math] and [math]\displaystyle{ \frac{\partial S(f)}{\partial b} }[/math] equal to 0 gives the system [math]\displaystyle{ 23 - 14w - 6b = 0 }[/math] and [math]\displaystyle{ 10 - 6w - 3b = 0 }[/math].
Solving these two equations yields [math]\displaystyle{ w=\tfrac{3}{2} }[/math], [math]\displaystyle{ b=\tfrac{1}{3} }[/math].
Therefore, [math]\displaystyle{ S(f) }[/math] is minimized at these parameters (the fitted line is [math]\displaystyle{ f(x) = 1.5x + \tfrac{1}{3} }[/math]).
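The closed-form answer can be verified numerically with a least-squares fit; the use of numpy.linalg.lstsq here is an illustrative choice.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 5.0])

A = np.column_stack([x, np.ones_like(x)])   # design matrix [x, 1]
w, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(w, b)                                 # 1.5 and 0.333..., i.e. w = 3/2, b = 1/3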
Exercise 1.26
Level: * (Easy)
Exercise Types: Novel
Question
Given the weights and bias of a single neuron, classify several input points using a step function as the activation method.
Details:
- Weights: [math]\displaystyle{ w_1 = 0.6, w_2 = -0.8 }[/math]
- Bias: [math]\displaystyle{ b = 0.1 }[/math]
- Activation: Step function where the output is 1 if the input is non-negative, and 0 otherwise.
Input Points:
1. [math]\displaystyle{ (1, 1) }[/math]
2. [math]\displaystyle{ (-1, 1) }[/math]
3. [math]\displaystyle{ (0.5, -0.5) }[/math]
4. [math]\displaystyle{ (0, 0) }[/math]
Solution
Calculating the outputs: for each input point [math]\displaystyle{ (x_1, x_2) }[/math], calculate the linear combination [math]\displaystyle{ z = w_1 x_1 + w_2 x_2 + b }[/math], then apply the step function.
- Point 1 [math]\displaystyle{ (1, 1) }[/math]: [math]\displaystyle{ z = 0.6 \times 1 - 0.8 \times 1 + 0.1 = -0.1 \rightarrow \text{step}(-0.1) = 0 }[/math] (Class 0)
- Point 2 [math]\displaystyle{ (-1, 1) }[/math]: [math]\displaystyle{ z = 0.6 \times -1 - 0.8 \times 1 + 0.1 = -1.3 \rightarrow \text{step}(-1.3) = 0 }[/math] (Class 0)
- Point 3 [math]\displaystyle{ (0.5, -0.5) }[/math]: [math]\displaystyle{ z = 0.6 \times 0.5 - 0.8 \times -0.5 + 0.1 = 0.8 \rightarrow \text{step}(0.8) = 1 }[/math] (Class 1)
- Point 4 [math]\displaystyle{ (0, 0) }[/math]: [math]\displaystyle{ z = 0.6 \times 0 - 0.8 \times 0 + 0.1 = 0.1 \rightarrow \text{step}(0.1) = 1 }[/math] (Class 1)
Conclusion: This exercise demonstrates how a neuron uses its weights and bias to compute outputs for given inputs and classify them using a step function based on a threshold.
Exercise 1.27
Level: * (Easy)
Exercise Types: Novel
Question
Consider the following 2D and 3D datasets and determine whether a perceptron solution exists for each. If a solution exists, visually prove it by plotting the data points and a possible decision boundary. Use Python to accomplish this task.
Part 1: Given the dataset:
\begin{array}{|c|c|c|} \hline x_1 & x_2 & y \\ \hline 3 & 5 & -1 \\ 3 & 8 & +1 \\ 7 & 7 & +1 \\ 6 & 5 & +1 \\ \hline \end{array}
Show visually whether there exists a perceptron solution that correctly classifies all points.
Part 2: Given the dataset:
\begin{array}{|c|c|c|} \hline x_1 & x_2 & y \\ \hline 3 & 5 & -1 \\ 3 & 8 & +1 \\ 7 & 7 & -1 \\ 6 & 5 & +1 \\ \hline \end{array}
Show visually whether there exists a perceptron solution that correctly classifies all points.
Part 3: Given the 3D dataset:
\begin{array}{|c|c|c|c|} \hline x_1 & x_2 & x_3 & y \\ \hline 3 & 5 & 2 & -1 \\ 3 & 8 & 6 & +1 \\ 7 & 7 & 5 & -1 \\ 6 & 5 & 4 & +1 \\ \hline \end{array}
Show visually whether there exists a perceptron solution that correctly classifies all points.
Solution
Part 1:
import numpy as np
import matplotlib.pyplot as plt

part1_data = np.array([
    [3, 5, -1],
    [3, 8, 1],
    [7, 7, 1],
    [6, 5, 1]
])
part1_X = part1_data[:, :2]  # First two columns for features
part1_y = part1_data[:, 2]   # Last column for classes

# Define colors based on classes: red for -1, green for +1
colors = ['red' if label == -1 else 'green' for label in part1_y]
plt.scatter(part1_X[:, 0], part1_X[:, 1], c=colors, edgecolors='black', s=100)

# Candidate decision boundary: x1 + x2 - 9 = 0
part1_x1_vals = np.linspace(2, 7, 100)
part1_x2_vals = -part1_x1_vals + 9
plt.plot(part1_x1_vals, part1_x2_vals, 'b--')
plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Part 1: 2D Classification Data')
plt.grid(True)
plt.show()
There exists a perceptron solution that correctly classifies all points.
[math]\displaystyle{ x_1 + x_2 - 9 = 0 }[/math]
where the perceptron parameters are:
[math]\displaystyle{ w_1 = 1 }[/math]
[math]\displaystyle{ w_2 = 1 }[/math]
[math]\displaystyle{ b = -9 }[/math]
Part 2:
part2_data = np.array([
    [3, 5, -1],
    [3, 8, 1],
    [7, 7, -1],
    [6, 5, 1]
])
part2_X = part2_data[:, :2]  # First two columns hold the features
part2_y = part2_data[:, 2]   # Last column holds the class labels

# Colour the points by class: red for -1, green for +1
colors = ['red' if label == -1 else 'green' for label in part2_y]
plt.scatter(part2_X[:, 0], part2_X[:, 1], c=colors, edgecolors='black', s=100)

plt.xlabel('x1')
plt.ylabel('x2')
plt.title('Part 2: 2D Classification Data')
plt.grid(True)
plt.show()
There does not exist a perceptron solution that correctly classifies all points: the two classes are not linearly separable, since the segment joining the two positive points crosses the segment joining the two negative points.
Part 3:
from mpl_toolkits import mplot3d  # needed for 3D plotting

part3_data = np.array([
    [3, 5, 2, -1],
    [3, 8, 6, 1],
    [7, 7, 5, -1],
    [6, 5, 4, 1]
])
part3_X = part3_data[:, :3]  # First three columns hold the features
part3_y = part3_data[:, 3]   # Last column holds the class labels

# Colour the points by class: red for -1, green for +1
colors = ['red' if label == -1 else 'green' for label in part3_y]

# Use projection='3d' to create a 3D scatter plot
ax = plt.axes(projection='3d')
ax.scatter(part3_X[:, 0], part3_X[:, 1], part3_X[:, 2], c=colors, edgecolors='black', s=100)

# Separating plane -3*x1 - 2*x2 + 6*x3 + 4.5 = 0, drawn as x3 = (3*x1 + 2*x2 - 4.5) / 6
part3_x1, part3_x2 = np.meshgrid(np.linspace(2, 8, 10), np.linspace(4, 9, 10))
part3_x3 = (3 * part3_x1 + 2 * part3_x2 - 4.5) / 6
ax.plot_surface(part3_x1, part3_x2, part3_x3, color='blue', alpha=0.5)

ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('x3')
ax.set_title('3D Perceptron Solution Visualization')
plt.show()
There exists a perceptron solution that correctly classifies all points.
[math]\displaystyle{ -3x_1 - 2x_2 + 6x_3 + 4.5 = 0 }[/math]
where the perceptron parameters are:
[math]\displaystyle{ w_1 = -3 }[/math]
[math]\displaystyle{ w_2 = -2 }[/math]
[math]\displaystyle{ w_3 = 6 }[/math]
[math]\displaystyle{ b = 4.5 }[/math]
(Substituting the four data points gives the signed values [math]\displaystyle{ -2.5, 15.5, -0.5, 0.5 }[/math], which agree with the labels [math]\displaystyle{ -1, +1, -1, +1 }[/math].)
Exercise 1.28
Level: * (Easy)
Exercise Types: Novel
References: A. Ghodsi, STAT 940 Deep Learning: Lecture 1, University of Waterloo, Winter 2025.
Question
Artificial Intelligence can be applied to a wide variety of fields. Give an example where AI has been used as a tool for scientific discovery.
Solution
Artificial intelligence has been applied in climate science, where deep learning models are used to predict climate patterns, simulate climate models, and forecast extreme weather events. These AI models have demonstrated high accuracy, enabling better disaster preparedness efforts.
Additionally, AI has accelerated scientific research by facilitating faster and more efficient simulations of complex physical systems. It has been used to study black holes and particle physics, advancing our understanding of fundamental sciences.
Exercise 2.1
Level: * (Easy)
Exercise Types: Novel
References: Calin, Ovidiu. Deep learning architectures: A mathematical approach. Springer, 2020
This problem is coincidentally similar to Exercise 5.10.1 (page 163) in this textbook, although that exercise was not used as the basis for this question.
Question
This problem is about using perceptrons to implement logic functions. Assume a dataset of the form [math]\displaystyle{ x_1, x_2 \in \{0, 1\} }[/math], and a perceptron defined as: [math]\displaystyle{ y = H(\beta_0 + \beta_1 x_1 + \beta_2 x_2), }[/math] where [math]\displaystyle{ H }[/math] is the Heaviside step function, defined as: [math]\displaystyle{ H(z) = \begin{cases} 1, & \text{if } z \geq 0, \\ 0, & \text{if } z \lt 0. \end{cases} }[/math]
(a)* Find weights [math]\displaystyle{ \beta_1, \beta_2 }[/math] and bias [math]\displaystyle{ \beta_0 }[/math] for a single perceptron that implements the AND function.
(b)* Find the weights [math]\displaystyle{ \beta_1, \beta_2 }[/math] and bias [math]\displaystyle{ \beta_0 }[/math] for a single perceptron that implements the OR function.
(c)** Given the truth table for the XOR function:
[math]\displaystyle{ \begin{array}{|c|c|c|} \hline x_1 & x_2 & x_1 \oplus x_2 \\ \hline 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ \hline \end{array} }[/math]
Show that it cannot be learned by a single perceptron. Find a small neural network of multiple perceptrons that can implement the XOR function. (Hint: a hidden layer with 2 perceptrons).
Solution
(a) A perceptron that implements the AND function:
[math]\displaystyle{ y = H(-1.5 + x_1 + x_2). }[/math]
Here:
[math]\displaystyle{ \beta_0 = -1.5, \quad \beta_1 = 1, \quad \beta_2 = 1. }[/math] This works because the AND function returns 1 only when both inputs are 1. For a perceptron, the condition for activation is: [math]\displaystyle{ \beta_0 + \beta_1 x_1 + \beta_2 x_2 \geq 0. }[/math] This must hold for (1, 1) but fail for all other combinations. Substituting values leads to the choice of [math]\displaystyle{ \beta_0 = -1.5 }[/math] and [math]\displaystyle{ \beta_1 = \beta_2 = 1 }[/math].
(b) A perceptron that implements the OR function:
[math]\displaystyle{ y = H(-0.5 + x_1 + x_2). }[/math]
Here:
[math]\displaystyle{ \beta_0 = -0.5, \quad \beta_1 = 1, \quad \beta_2 = 1. }[/math] This works because the OR function returns 1 if either or both inputs are 1. Using similar logic to the AND case, the decision boundary conditions lead to these parameters.
(c) XOR is not linearly separable, so it cannot be implemented by a single perceptron.
The XOR function returns 1 when the following are true:
- Either [math]\displaystyle{ x_1 }[/math] or [math]\displaystyle{ x_2 }[/math] is 1. In other words, the expression [math]\displaystyle{ x_1 }[/math] OR [math]\displaystyle{ x_2 }[/math] returns 1.
- [math]\displaystyle{ x_1 }[/math] and [math]\displaystyle{ x_2 }[/math] are not both 1. In other words, the expression [math]\displaystyle{ x_1 }[/math] NAND [math]\displaystyle{ x_2 }[/math] returns 1.
To implement this, the outputs of an OR and a NAND perceptron can be taken as inputs to an AND perceptron. (The NAND perceptron was derived by multiplying the weights and bias of the AND perceptron by -1.)
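This construction can be checked with a short script. The following is a minimal sketch using the Heaviside step function and the OR, NAND, and AND parameters derived above:

```python
import itertools

def H(z):
    # Heaviside step function: 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

def perceptron(x1, x2, b0, b1, b2):
    return H(b0 + b1 * x1 + b2 * x2)

def xor(x1, x2):
    o = perceptron(x1, x2, -0.5, 1, 1)    # OR perceptron from part (b)
    n = perceptron(x1, x2, 1.5, -1, -1)   # NAND: AND perceptron with weights and bias negated
    return perceptron(o, n, -1.5, 1, 1)   # AND perceptron from part (a) applied to the hidden outputs

for x1, x2 in itertools.product((0, 1), repeat=2):
    print(x1, x2, xor(x1, x2))            # reproduces the XOR truth table: 0, 1, 1, 0
```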
Why can't the perceptron converge in the case of linear non-separability?
In linearly separable data, there exists a weight vector [math]\displaystyle{ w }[/math] and bias [math]\displaystyle{ b }[/math] such that:
[math]\displaystyle{ y_i(w \cdot x_i + b) \gt 0 \quad \forall i }[/math]
But in the case of linear non-separability, w and b satisfying this condition do not exist, so the perceptron cannot satisfy the convergence condition.
Exercise 2.2
Level: * (Easy)
Exercise Types: Novel
Question
1. How do feedforward neural networks use backpropagation to adjust weights and improve the accuracy of predictions during training?
2. How would the training process be affected if the learning rate in the optimization algorithm were too high or too low?
Solution
1. After a forward pass, where inputs are processed to generate an output, the error between the prediction and the actual values is calculated. This error is then propagated backward through the network, and the gradients of the loss function with respect to the weights are computed. Using these gradients, the weights are updated with an optimization algorithm such as stochastic gradient descent, gradually minimizing the error and improving the network's performance.
2. If the learning rate is too high, the weights might overshoot the optimal values, leading to oscillations or divergence. If it is too low, training becomes very slow and may get stuck in a local minimum.
Calculations
Step 1: Forward Propagation
Each neuron computes:
[math]\displaystyle{ z = W \cdot x + b }[/math]
[math]\displaystyle{ a = f(z) }[/math]
where:
- W = weights, b = bias
- f(z) = activation function (e.g., sigmoid, ReLU)
- a = neuron’s output
Step 2: Compute Loss
The error between predicted [math]\displaystyle{ \hat{y} }[/math] and actual [math]\displaystyle{ y }[/math] is calculated using a loss function, such as **Mean Squared Error (MSE)**:
[math]\displaystyle{ L = \frac{1}{n} \sum (y - \hat{y})^2 }[/math]
For classification, **Cross-Entropy Loss** is commonly used.
Step 3: Backward Propagation
Using the **chain rule**, gradients are computed:
[math]\displaystyle{ \frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W} }[/math]
These gradients guide weight updates to minimize loss.
Step 4: Weight Update using Gradient Descent
Weights are updated using:
[math]\displaystyle{ W = W - \alpha \frac{\partial L}{\partial W} }[/math]
where [math]\displaystyle{ \alpha }[/math] is the **learning rate**.
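These four steps can be made concrete with a tiny numerical example. The following is a minimal sketch (not from the lecture) of gradient descent for a single neuron with a linear activation and squared-error loss; the input, target, and initial weights are illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0])     # input features
y = 1.0                      # target output
W = np.array([0.5, -0.5])    # initial weights
b = 0.0                      # initial bias
alpha = 0.1                  # learning rate

for step in range(3):
    z = W @ x + b                  # Step 1: forward propagation (linear activation)
    L = 0.5 * (y - z) ** 2         # Step 2: squared-error loss
    dL_dz = -(y - z)               # Step 3: chain rule, dL/dz
    dL_dW = dL_dz * x              #         dz/dW = x
    dL_db = dL_dz                  #         dz/db = 1
    W -= alpha * dL_dW             # Step 4: gradient-descent update
    b -= alpha * dL_db
    print(f"step {step}: loss = {L:.4f}")   # the loss decreases at each step
```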
Exercise 2.3
Level: * (Easy)
Exercise Types: Modified
References: Simon J.D. Prince. Understanding Deep learning. 2024
This problem comes from Problem 3.5 in this textbook. In addition to the proof, I explained why this property is important to learning neural networks.
Question
Prove that the following property holds for [math]\displaystyle{ \alpha \in \mathbb{R}^+ }[/math]:
[math]\displaystyle{ \text{ReLU}[\alpha \cdot z] = \alpha \cdot \text{ReLU}[z] }[/math]
Explain why this property is important in neural networks.
Solution
This is known as the non-negative homogeneity property of the ReLU function.
Recall the definition of the ReLU function:
[math]\displaystyle{ \text{ReLU}(z) = \begin{cases} z & \text{if } z \geq 0, \\ 0 & \text{if } z \lt 0. \end{cases} }[/math]
We prove the property by considering the two possible cases for [math]\displaystyle{ z }[/math].
Case 1: [math]\displaystyle{ z \geq 0 }[/math]
If [math]\displaystyle{ z \geq 0 }[/math], then by the definition of the ReLU function:
[math]\displaystyle{ \text{ReLU}(z) = z }[/math]
Therefore:
[math]\displaystyle{ \text{ReLU}(\alpha \cdot z) = \alpha \cdot z }[/math]
and:
[math]\displaystyle{ \alpha \cdot \text{ReLU}(z) = \alpha \cdot z }[/math]
Hence, in this case:
[math]\displaystyle{ \text{ReLU}(\alpha \cdot z) = \alpha \cdot \text{ReLU}(z) }[/math]
Case 2: [math]\displaystyle{ z\lt 0 }[/math]
If [math]\displaystyle{ z \lt 0 }[/math], then [math]\displaystyle{ \alpha \cdot z \lt 0 }[/math].
Therefore:
[math]\displaystyle{ \text{ReLU}(\alpha \cdot z) = \text{ReLU}(z) = 0 }[/math]
and:
[math]\displaystyle{ \alpha \cdot \text{ReLU}(z) = \alpha \cdot 0 = 0 }[/math]
Hence, in this case:
[math]\displaystyle{ \text{ReLU}(\alpha \cdot z) = \alpha \cdot \text{ReLU}(z) }[/math]
Since the property holds in both cases, this completes the proof.
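The property can also be checked numerically; a minimal sketch (the test values are arbitrary):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)
z = rng.normal(size=1000)            # a mix of positive and negative inputs
for alpha in (0.5, 2.0, 10.0):       # any positive scaling factor
    assert np.allclose(relu(alpha * z), alpha * relu(z))
print("ReLU(alpha * z) == alpha * ReLU(z) for all tested alpha > 0")
```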
Why is this property important in neural networks?
In a neural network, the input to a neuron is often a linear combination of the weights and inputs.
When training neural networks, scaling the inputs or weights can affect the activations of neurons. However, because ReLU satisfies the homogeneity property, the output of the ReLU function scales proportionally with the input. This means that scaling the inputs by a positive constant (like a learning rate or normalization factor) does not change the overall pattern of activations — it only scales them. This stability in scaling is important during optimization because it makes the network's output more predictable and ensures that scaling transformations don't break the network's functionality.
Additionally, because of the non-negative homogeneity property, the gradients also scale proportionally, the scale of the gradient changes proportionally with the input scale, which ensures that the optimization process remains stable. It helps prevent exploding gradients when the inputs are scaled by large positive values.
The homogeneity property of ReLU also helps the network to perform well on different types of data. By keeping the scaling of activations consistent, it helps maintain the connection between inputs and outputs during training, even when the data is adjusted or scaled. This makes ReLU useful when input values vary a lot, and it simplifies the network's response to changes in input distributions, which is especially valuable when transferring a trained model to new data or domains.
Although ReLU works well in most cases, several variants modify this behaviour to yield better results in specific settings. The "dying ReLU" problem occurs when neurons output zero for all inputs after their weights are pushed to values that keep the pre-activation negative; such neurons become inactive, which can hinder learning and reduce model capacity. ReLU's unbounded positive range can also contribute to very large activations for large inputs. Proposed alternatives include Leaky ReLU, which allows a small, non-zero gradient for negative inputs to mitigate neuron death; ELU (Exponential Linear Unit), which smooths gradients and improves convergence; and GELU (Gaussian Error Linear Unit), which combines smoothness and non-linearity for improved performance in certain cases.
Exercise 2.4
Level: * (Easy)
Exercise Types: Novel
Question
Train a perceptron on the given dataset using the following initial settings, and ensure it classifies all data points correctly.
- Initial weights: [math]\displaystyle{ w_0 = 0, w_1 = 0, w_2 = 0 }[/math]
- Learning rate: [math]\displaystyle{ \eta = 0.1 }[/math]
- Training dataset:
(x₁ = 1, x₂ = 2, y = 1) (x₁ = -1, x₂ = -1, y = -1) (x₁ = 2, x₂ = 1, y = 1)
[math]\displaystyle{ y = 1 }[/math] if the output [math]\displaystyle{ z = w_1 \cdot x_1 + w_2 \cdot x_2 + w_0 \geq 0 }[/math], otherwise [math]\displaystyle{ y = -1 }[/math].
Solution
Iteration 1
1. First data point (x₁ = 1, x₂ = 2) with label 1:
- Weighted sum: [math]\displaystyle{ \hat{y} = w_0 + w_1 x_1 + w_2 x_2 = 0 + 0(1) + 0(2) = 0 }[/math]
- Predicted label: [math]\displaystyle{ \hat{y} = 1 }[/math]
- Actual label: 1 → No misclassification
2. Second data point (x₁ = -1, x₂ = -1) with label -1:
- Weighted sum: [math]\displaystyle{ \hat{y} = w_0 + w_1 x_1 + w_2 x_2 = 0 + 0(-1) + 0(-1) = 0 }[/math]
- Predicted label: [math]\displaystyle{ \hat{y} = 1 }[/math]
- Actual label: -1 → Misclassified
3. Third data point (x₁ = 2, x₂ = 1) with label 1:
- Weighted sum: [math]\displaystyle{ \hat{y} = w_0 + w_1 x_1 + w_2 x_2 = 0 + 0(2) + 0(1) = 0 }[/math]
- Predicted label: [math]\displaystyle{ \hat{y} = 1 }[/math]
- Actual label: 1 → No misclassification
Update the weights (using the perceptron rule, where the cost is the sum of the distances of all misclassified points to the decision boundary)
For the misclassified point (x₁ = -1, x₂ = -1):
- Updated weights:
- [math]\displaystyle{ w_0 = w_0 + \eta y = 0 + 0.1(-1) = -0.1 }[/math]
- [math]\displaystyle{ w_1 = w_1 + \eta y x_1 = 0 + 0.1(-1)(-1) = 0.1 }[/math]
- [math]\displaystyle{ w_2 = w_2 + \eta y x_2 = 0 + 0.1(-1)(-1) = 0.1 }[/math]
Updated weights after first iteration: [math]\displaystyle{ w_0 = -0.1, w_1 = 0.1, w_2 = 0.1 }[/math]
Iteration 2
1. First data point (x₁ = 1, x₂ = 2) with label 1:
- Weighted sum: [math]\displaystyle{ \hat{y} = -0.1 + 0.1(1) + 0.1(2) = -0.1 + 0.1 + 0.2 = 0.2 }[/math]
- Predicted label: [math]\displaystyle{ \hat{y} = 1 }[/math]
- Actual label: 1 → No misclassification
2. Second data point (x₁ = -1, x₂ = -1) with label -1:
- Weighted sum: [math]\displaystyle{ \hat{y} = -0.1 + 0.1(-1) + 0.1(-1) = -0.1 - 0.1 - 0.1 = -0.3 }[/math]
- Predicted label: [math]\displaystyle{ \hat{y} = -1 }[/math]
- Actual label: -1 → No misclassification
3. Third data point (x₁ = 2, x₂ = 1) with label 1:
- Weighted sum: [math]\displaystyle{ \hat{y} = -0.1 + 0.1(2) + 0.1(1) = -0.1 + 0.2 + 0.1 = 0.2 }[/math]
- Predicted label: [math]\displaystyle{ \hat{y} = 1 }[/math]
- Actual label: 1 → No misclassification
Since there are no misclassifications in the second iteration, the perceptron has converged!
Final Result
- Weights after convergence: [math]\displaystyle{ w_0 = -0.1, w_1 = 0.1, w_2 = 0.1 }[/math]
- Total cost after convergence: [math]\displaystyle{ Cost = 0 }[/math], since no misclassified points.
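The two iterations above can be reproduced with a short script. This is a minimal sketch using per-sample perceptron updates (which give the same result on this dataset); the variable names are illustrative:

```python
import numpy as np

X = np.array([[1, 2], [-1, -1], [2, 1]], dtype=float)
y = np.array([1, -1, 1], dtype=float)
w = np.zeros(2)    # (w1, w2)
b = 0.0            # w0
eta = 0.1

converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        pred = 1.0 if (w @ xi + b) >= 0 else -1.0
        if pred != yi:              # misclassified point: apply the perceptron update
            w += eta * yi * xi
            b += eta * yi
            converged = False

print("w0 =", b, "w1, w2 =", w)     # expected: w0 = -0.1, w1 = 0.1, w2 = 0.1
```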
Exercise 2.5
Level: * (Moderate)
Exercise Types: Novel
Question
Consider a Feed-Forward Neural Network (FFN) with one or more hidden layers. Answer the following questions:
(a) Describe how a feed-forward neural network (FFN) works in general, and describe the components of the network.
(b) How does the forward pass work? Provide the relevant formulas for each step.
(c) How does the backward pass (backpropagation) work? Explain and provide the formulas for each step.
Solution
(a): A feed-forward neural network (FFN) consists of an input layer, one or more hidden layers, and one output layer. Each layer transforms the input data, and each neuron's output is fed to the next layer as input. Each neuron in a layer is a perceptron, a basic computational unit with weights, a bias, and an activation function: it computes a weighted sum of its inputs, adds the bias term, and passes the result through the activation function. The final output is the network's prediction, and the loss function uses the prediction together with the true label to compute the loss. The backward pass computes the gradients of the loss with respect to every weight and bias in the network and then updates the weights and biases to reduce the loss. This process is repeated for each sample (or mini-batch) of data until the loss converges and the weights are optimized.
(b): The forward pass involves computing the output for each layer in the network. For each layer, the algorithm performs the following steps:
1. Compute the weighted sum of inputs to the layer:
[math]\displaystyle{ z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} }[/math]
2. Apply the activation function to obtain the output of the layer:
[math]\displaystyle{ a^{(l)} = \sigma(z^{(l)}) }[/math]
3. Repeat these steps for each layer until the output layer is reached; its activation is the network's prediction, [math]\displaystyle{ \hat{y} = a^{(L)} = \sigma(z^{(L)}) }[/math].
(c): The backward pass (backpropagation) computes the gradients of the loss function with respect to each weight and bias, and then uses gradient descent to update the weights and biases.
1. Calculate the errors for each layer:
Error at the output layer: The error term at the output layer has this formula:[math]\displaystyle{ \delta^{(L)} = \frac{\partial \mathcal{L}}{\partial a^{(L)}} \cdot \sigma'(z^{(L)}) }[/math]
Error for the hidden layers: The error for each hidden layer is:
[math]\displaystyle{ \delta^{(l)} = \left( W^{(l+1)} \right)^T \delta^{(l+1)} \cdot \sigma'(z^{(l)}) }[/math]
2. Gradient of the loss with respect to weights and biases. Compute the gradients for the weights and biases:
The gradient for weights is:
[math]\displaystyle{ \frac{\partial \mathcal{L}}{\partial W^{(l)}} = \delta^{(l)} \left( a^{(l-1)} \right)^T }[/math]
The gradient for the biases is:
[math]\displaystyle{ \frac{\partial \mathcal{L}}{\partial b^{(l)}} = \delta^{(l)} }[/math]
3. Update the weights and biases using gradient descent:
[math]\displaystyle{ W^{(l)} \leftarrow W^{(l)} - \rho \cdot \frac{\partial \mathcal{L}}{\partial W^{(l)}} }[/math]
Where [math]\displaystyle{ \rho }[/math] is the learning rate.
[math]\displaystyle{ b^{(l)} \leftarrow b^{(l)} - \rho \cdot \frac{\partial \mathcal{L}}{\partial b^{(l)}} }[/math]
Repeat these steps for each layer, from the output layer back to the input layer, to update all the weights and biases.
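For concreteness, here is a minimal NumPy sketch of one forward and one backward pass for a single hidden layer, assuming a sigmoid activation and a squared-error loss (the shapes and random values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 1))          # input a^(0)
y = np.array([[1.0]])                # target

W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # hidden layer parameters
W2, b2 = rng.normal(size=(1, 4)), np.zeros((1, 1))   # output layer parameters
rho = 0.1                                            # learning rate

# Forward pass
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
loss = 0.5 * np.sum((a2 - y) ** 2)

# Backward pass: errors (deltas) for each layer
delta2 = (a2 - y) * a2 * (1 - a2)            # output-layer error
delta1 = (W2.T @ delta2) * a1 * (1 - a1)     # hidden-layer error

# Gradients and gradient-descent updates
W2 -= rho * (delta2 @ a1.T)
b2 -= rho * delta2
W1 -= rho * (delta1 @ x.T)
b1 -= rho * delta1
print("loss before update:", loss)
```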
Exercise 2.6
Level: * (Easy)
Exercise Types: Novel
Question
A single neuron takes an input vector [math]\displaystyle{ x=[2,-3] }[/math], with weights [math]\displaystyle{ w=[0.4,-0.6] }[/math]. The target output is [math]\displaystyle{ y_{\text{true}}=1 }[/math].
1. Calculate the weighted sum [math]\displaystyle{ z = w \cdot x }[/math].
2. Compute the squared error loss: [math]\displaystyle{ L = 0.5 \cdot (z - y_{\text{true}})^2 }[/math]
3. Find the gradient of the loss with respect to the weights [math]\displaystyle{ w }[/math] and perform one step of gradient descent with a learning rate [math]\displaystyle{ \eta = 0.01 }[/math].
4.Provide the updated weights and the error after the update.
5. Compare the result of the previous step with the case of a learning rate of [math]\displaystyle{ \eta = 0.1 }[/math]
Solution
1. [math]\displaystyle{ z = w \cdot x = (0.4 \cdot 2) + (-0.6 \cdot -3) = 0.8 + 1.8 = 2.6 }[/math]
2. [math]\displaystyle{ L = 0.5 \cdot (z - y_{\text{true}})^2 = 0.5 \cdot (2.6 - 1)^2 = 0.5 \cdot (1.6)^2 = 0.5 \cdot 2.56 = 1.28 }[/math]
3. The gradient of the loss with respect to [math]\displaystyle{ w_i }[/math] is: [math]\displaystyle{ \frac{\partial L}{\partial w_i} = (z - y_{\text{true}}) \cdot x_i }[/math]
For [math]\displaystyle{ w_1 }[/math] (associated with [math]\displaystyle{ x_1 = 2 }[/math]): [math]\displaystyle{ \frac{\partial L}{\partial w_1} = (2.6 - 1) \cdot 2 = 1.6 \cdot 2 = 3.2 }[/math]
For [math]\displaystyle{ w_2 }[/math] (associated with [math]\displaystyle{ x_2 = -3 }[/math]): [math]\displaystyle{ \frac{\partial L}{\partial w_2} = (2.6 - 1) \cdot (-3) = 1.6 \cdot -3 = -4.8 }[/math] The updated weights are: [math]\displaystyle{ w_i = w_i - \eta \cdot \frac{\partial L}{\partial w_i} }[/math]
For [math]\displaystyle{ w_1 }[/math]: [math]\displaystyle{ w_1 = 0.4 - 0.01 \cdot 3.2 = 0.4 - 0.032 = 0.368 }[/math]
For [math]\displaystyle{ w_2 }[/math]: [math]\displaystyle{ w_2 = -0.6 - 0.01 \cdot (-4.8) = -0.6 + 0.048 = -0.552 }[/math]
4. Recalculate [math]\displaystyle{ z }[/math] with updated weights: [math]\displaystyle{ z = (0.368 \cdot 2) + (-0.552 \cdot -3) = 0.736 + 1.656 = 2.392 }[/math]
Recalculate the error: [math]\displaystyle{ L = 0.5 \cdot (z - y_{\text{true}})^2 = 0.5 \cdot (2.392 - 1)^2 = 0.5 \cdot (1.392)^2 = 0.5 \cdot 1.937 = 0.968 }[/math]
5. Compare the result of the previous step with the case of a learning rate of [math]\displaystyle{ \eta = 0.1 }[/math]:
For [math]\displaystyle{ w_1 }[/math]: [math]\displaystyle{ w_1 = 0.4 - 0.1 \cdot 3.2 = 0.4 - 0.32 = 0.08 }[/math]
For [math]\displaystyle{ w_2 }[/math]: [math]\displaystyle{ w_2 = -0.6 - 0.1 \cdot (-4.8) = -0.6 + 0.48 = -0.12 }[/math]
Recalculate [math]\displaystyle{ z }[/math] with the updated weights: [math]\displaystyle{ z = (0.08 \cdot 2) + (-0.12 \cdot -3) = 0.16 + 0.36 = 0.52 }[/math]
Recalculate the error: [math]\displaystyle{ L = 0.5 \cdot (z - y_{\text{true}})^2 = 0.5 \cdot (0.52 - 1)^2 = 0.5 \cdot (-0.48)^2 = 0.5 \cdot 0.2304 = 0.1152 }[/math]
Comparison:
- With a learning rate of [math]\displaystyle{ \eta = 0.01 }[/math], the error after one update is [math]\displaystyle{ L = 0.968 }[/math].
- With [math]\displaystyle{ \eta = 0.1 }[/math], the error after one update is [math]\displaystyle{ L = 0.1152 }[/math].
The error is much lower when using a larger learning rate [math]\displaystyle{ \eta = 0.1 }[/math] compared to a smaller learning rate [math]\displaystyle{ \eta = 0.01 }[/math]. However, large learning rates can sometimes cause overshooting of the optimal solution, so care must be taken when selecting a learning rate.
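Both cases can be verified with a few lines of NumPy; a minimal sketch of the single update step above:

```python
import numpy as np

x = np.array([2.0, -3.0])
w0 = np.array([0.4, -0.6])
y_true = 1.0

for eta in (0.01, 0.1):
    z = w0 @ x                          # weighted sum with the initial weights
    grad = (z - y_true) * x             # dL/dw for L = 0.5 * (z - y_true)^2
    w = w0 - eta * grad                 # one gradient-descent step
    L_new = 0.5 * (w @ x - y_true) ** 2
    print(f"eta = {eta}: w = {w}, loss = {L_new:.4f}")
    # expected: eta = 0.01 -> loss ~ 0.968, eta = 0.1 -> loss ~ 0.1152
```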
Exercise 2.7
Level: * (Easy)
Exercise Types: Copied
This problem comes from Exercise 2: Perceptron Learning.
Question
Given two single perceptrons [math]\displaystyle{ a }[/math] and [math]\displaystyle{ b }[/math], each defined by the inequality [math]\displaystyle{ w_0 + w_1 x_1 + w_2 x_2 \geq 0 }[/math]:
- Perceptron [math]\displaystyle{ a }[/math]: [math]\displaystyle{ w_0 = 1 }[/math], [math]\displaystyle{ w_1 = 2 }[/math], [math]\displaystyle{ w_2 = 1 }[/math]
- Perceptron [math]\displaystyle{ b }[/math]: [math]\displaystyle{ w_0 = 0 }[/math], [math]\displaystyle{ w_1 = 2 }[/math], [math]\displaystyle{ w_2 = 1 }[/math]
Is perceptron [math]\displaystyle{ a }[/math] more general than perceptron [math]\displaystyle{ b }[/math]?
Solution
To decide if perceptron [math]\displaystyle{ a }[/math] is more general than perceptron [math]\displaystyle{ b }[/math], we compare their respective decision boundaries and positive regions:
- **Perceptron [math]\displaystyle{ a }[/math]:** The positive region satisfies [math]\displaystyle{ 1 + 2 x_1 + x_2 \ge 0 }[/math]. The decision boundary is [math]\displaystyle{ x_2 = -2 x_1 - 1 }[/math].
- **Perceptron [math]\displaystyle{ b }[/math]:** The positive region satisfies [math]\displaystyle{ 2 x_1 + x_2 \ge 0 }[/math]. The decision boundary is [math]\displaystyle{ x_2 = -2 x_1 }[/math].
A perceptron [math]\displaystyle{ a }[/math] is called "more general" than perceptron [math]\displaystyle{ b }[/math] if:
1. Every point that [math]\displaystyle{ b }[/math] classifies as positive is also classified as positive by [math]\displaystyle{ a }[/math].
2. There exist points that [math]\displaystyle{ a }[/math] classifies as positive which [math]\displaystyle{ b }[/math] does not.
Observe that for any [math]\displaystyle{ x_1 }[/math]: [math]\displaystyle{ -2 x_1 - 1 \leq -2 x_1 }[/math]. Hence, the set of points [math]\displaystyle{ \{(x_1, x_2) : x_2 \ge -2 x_1\} }[/math] (perceptron [math]\displaystyle{ b }[/math]'s positive region) is contained in the set [math]\displaystyle{ \{(x_1, x_2) : x_2 \ge -2 x_1 - 1\} }[/math] (perceptron [math]\displaystyle{ a }[/math]'s positive region). Therefore:
1. If [math]\displaystyle{ b }[/math] classifies a point as positive, [math]\displaystyle{ a }[/math] also does.
2. There are additional points (namely those where [math]\displaystyle{ -2 x_1 - 1 \le x_2 \lt -2 x_1 }[/math]) that [math]\displaystyle{ a }[/math] classifies as positive but [math]\displaystyle{ b }[/math] does not.
Hence, perceptron [math]\displaystyle{ a }[/math] is indeed more general than perceptron [math]\displaystyle{ b }[/math].
Additional Subquestion
For a random sample of points, verify empirically that any point classified as positive by perceptron [math]\displaystyle{ b }[/math] is also classified as positive by perceptron [math]\displaystyle{ a }[/math].
Additional Solution (Sample Code)
Below is a short Python snippet that generates random points and checks their classification according to each perceptron. (No visual output is included.)
```python
import numpy as np

def perceptron_output(x1, x2, w0, w1, w2):
    return (w0 + w1 * x1 + w2 * x2) >= 0

# Weights for perceptrons a and b: (w0, w1, w2)
w_a = (1, 2, 1)
w_b = (0, 2, 1)

n_points = 1000
x1_vals = np.random.uniform(-10, 10, n_points)
x2_vals = np.random.uniform(-10, 10, n_points)

count_b_positive_also_positive_in_a = 0
count_b_positive = 0

for x1, x2 in zip(x1_vals, x2_vals):
    output_b = perceptron_output(x1, x2, *w_b)
    output_a = perceptron_output(x1, x2, *w_a)
    if output_b:
        count_b_positive += 1
        if output_a:
            count_b_positive_also_positive_in_a += 1

print("Out of", count_b_positive, "points that B classified as positive,")
print(count_b_positive_also_positive_in_a, "were also positive for A.")
print("Hence, all B-positive points lie in A's positive region.")
```
Exercise 2.10
Level: * (Easy)
Exercise Type: Novel
Question
In a simple linear regression
a). Derive the vectorized form of the SSE (loss function) in terms of [math]\displaystyle{ Y }[/math], [math]\displaystyle{ X }[/math] and [math]\displaystyle{ \theta }[/math].
b). Find the optimal value of [math]\displaystyle{ \theta }[/math] that minimizes the SSE (Recall that this is the weights of the linear regression).
Solution
a).
[math]\displaystyle{ \begin{align*} SSE &= \|Y - \hat{Y}\|^2 \\ &= (Y - \hat{Y})^T (Y - \hat{Y}) \\ &= Y^TY - Y^TX\theta - (X\theta)^T Y + (X\theta)^T(X\theta) \\ &= Y^TY - 2Y^TX\theta + (X\theta)^T(X\theta) \\ \end{align*} }[/math]
b).
[math]\displaystyle{ \begin{align*} 0 &= \frac{\partial SSE}{\partial \theta} \\ 0 &= \frac{\partial}{\partial \theta} \Big[ Y^TY - 2Y^TX\theta + (X\theta)^T(X\theta)\Big] \\ 0 &= -2X^TY + 2X^TX\theta \\ X^TX\theta &= X^TY \\ \theta &= (X^TX)^{-1}X^TY \end{align*} }[/math]
Note that [math]\displaystyle{ X^TX }[/math] is positive semidefinite, so [math]\displaystyle{ \frac{\partial^2}{\partial \theta^2} SSE = 2 X^TX \geq 0. }[/math] Hence, the SSE is convex in [math]\displaystyle{ \theta }[/math] and the stationary point derived above is a minimum.
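The closed-form solution can be sanity-checked with NumPy on a synthetic dataset; a minimal sketch (the data and true parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # design matrix with an intercept column
theta_true = np.array([1.0, 2.0, -0.5])
Y = X @ theta_true + 0.01 * rng.normal(size=50)               # nearly noiseless targets

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)                 # theta = (X^T X)^{-1} X^T Y
print(theta_hat)                                              # close to [1.0, 2.0, -0.5]
```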
Exercise 2.11
Level: Moderate
Exercise Types: Novel
Question
Deep learning models often face challenges during training due to the vanishing gradient problem, especially when using sigmoid or tanh activation functions.
(a) Describe the vanishing gradient problem and its impact on the training of deep networks.
(b) Explain how the introduction of ReLU (Rectified Linear Unit) activation function mitigates this problem.
(c) Discuss one potential downside of using ReLU and propose an alternative activation function that addresses this limitation.
Solution
(a) Vanishing Gradient Problem: The vanishing gradient problem occurs when the gradients of the loss function become extremely small as they are propagated back through the layers of a deep network. This leads to:
(i)Slow or stagnant weight updates in early layers;
(ii)Difficulty in effectively training deep models. This issue is particularly pronounced with activation functions like sigmoid and tanh, where gradients approach zero as inputs saturate.
(b) Role of ReLU in Mitigation: ReLU, defined as [math]\displaystyle{ f(x) = \max(0, x) }[/math], mitigates the vanishing gradient problem by:
(i)Producing non-zero gradients for positive inputs, maintaining effective weight updates;
(ii)Introducing sparsity, as neurons deactivate (output zero) for negative inputs, which improves model efficiency.
(c) Downside of ReLU and Alternatives: One downside of ReLU is the "dying ReLU" problem, where neurons output zero for all inputs, effectively becoming inactive. This can happen when weights are poorly initialized or pushed to unfavourable values during training: the weighted sum of inputs stays below zero, so the gradient of the ReLU function is zero. A high learning rate may push neurons into this inactive state.
Alternative: Leaky ReLU allows a small gradient for negative inputs, defined as [math]\displaystyle{ f(x) = x }[/math] for [math]\displaystyle{ x \gt 0 }[/math] and [math]\displaystyle{ f(x) = \alpha x }[/math] for [math]\displaystyle{ x \leq 0 }[/math], where [math]\displaystyle{ \alpha }[/math] is a small positive constant. This prevents neurons from dying, ensuring all neurons contribute to learning.
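For reference, ReLU and Leaky ReLU can be written in a couple of lines; a minimal sketch, with the common (but arbitrary) choice [math]\displaystyle{ \alpha = 0.01 }[/math]:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))         # [0.    0.    0.    1.5 ] -- flat (zero gradient) for negative inputs
print(leaky_relu(z))   # [-0.02 -0.005 0.    1.5 ] -- small slope keeps negative units alive
```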
Exercise 2.12
Level: * (Easy)
Exercise Type: Novel
Question
Consider the following data:
[math]\displaystyle{ x = \begin{bmatrix} 1 \\ 1 \\ 2 \\ 2 \\ 3 \\ 3 \end{bmatrix}, \quad y = \begin{bmatrix} 1 \\ 2 \\ 2 \\ 3 \\ 2 \\ 3 \end{bmatrix} }[/math]
This data is fitted by a linear regression model with no bias term. What is the loss for the first four data points? Use the L2 loss defined as:
[math]\displaystyle{ L = \frac{1}{2}(y - \hat{y})^2 }[/math].
Solution
The correct losses for the first four data points are 0, 0.5, 0, and 0.5.
Calculation
Step 1: Fit the linear regression model.
Since there is no bias term, the model is [math]\displaystyle{ \hat{y} = \beta x }[/math], where [math]\displaystyle{ \beta = \frac{\sum x_i y_i}{\sum x_i^2} }[/math].
Compute [math]\displaystyle{ \beta }[/math]: [math]\displaystyle{ \beta = \frac{1 \cdot 1 + 1 \cdot 2 + 2 \cdot 2 + 2 \cdot 3 + 3 \cdot 2 + 3 \cdot 3}{1^2 + 1^2 + 2^2 + 2^2 + 3^2 + 3^2} = \frac{28}{28} = 1. }[/math]
Step 2: Calculate [math]\displaystyle{ \hat{y} }[/math] for each [math]\displaystyle{ x }[/math].
For the first four data points:
[math]\displaystyle{ \hat{y}_1 = 1 \cdot 1 = 1 }[/math]
[math]\displaystyle{ \hat{y}_2 = 1 \cdot 1 = 1 }[/math]
[math]\displaystyle{ \hat{y}_3 = 1 \cdot 2 = 2 }[/math]
[math]\displaystyle{ \hat{y}_4 = 1 \cdot 2 = 2 }[/math]
Step 3: Compute the L2 loss for each point.
- For [math]\displaystyle{ (x_1, y_1) }[/math]: [math]\displaystyle{ L_1 = \frac{1}{2}(1 - 1)^2 = 0 }[/math].
- For [math]\displaystyle{ (x_2, y_2) }[/math]: [math]\displaystyle{ L_2 = \frac{1}{2}(2 - 1)^2 = 0.5 }[/math].
- For [math]\displaystyle{ (x_3, y_3) }[/math]: [math]\displaystyle{ L_3 = \frac{1}{2}(2 - 2)^2 = 0 }[/math].
- For [math]\displaystyle{ (x_4, y_4) }[/math]: [math]\displaystyle{ L_4 = \frac{1}{2}(3 - 2)^2 = 0.5 }[/math].
Thus, the losses are: 0, 0.5, 0, 0.5.
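The fitted slope and the per-point losses can be verified with a short script; a minimal sketch of the computation above:

```python
import numpy as np

x = np.array([1, 1, 2, 2, 3, 3], dtype=float)
y = np.array([1, 2, 2, 3, 2, 3], dtype=float)

beta = (x @ y) / (x @ x)             # least-squares slope for a model with no bias term
losses = 0.5 * (y - beta * x) ** 2   # L2 loss per data point
print(beta, losses[:4])              # 1.0 and [0.  0.5 0.  0.5]
```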
Exercise 2.13
Level: ** (Moderate)
Exercise Types: Novel
References: A. Ghodsi, STAT 940 Deep Learning: Lecture 2, University of Waterloo, Winter 2025.
Question
In Lecture 2, we derived a formula for a simple perceptron to determine a hyperplane which splits a set of linearly separable data.
In class, we defined the error as
[math]\displaystyle{ err = \sum_{i \in M} -y_i(\beta^Tx_i+\beta_0) }[/math]
Where [math]\displaystyle{ M }[/math] is the set of points which have been misclassified
The function for the simple perceptron was defined as
[math]\displaystyle{ y_i = \beta^Tx_i + \beta_0 }[/math]
Where a point belongs to Set 1 if it is above the line, and Set 2 if it is below the line.
In lecture 2, we found the partial derivatives as
[math]\displaystyle{ \frac{\partial err}{\partial \beta} = \sum_{i \in M}-y_ix_i }[/math]
and
[math]\displaystyle{ \frac{\partial err}{\partial \beta_0} = \sum_{i \in M}-y_i }[/math]
And defined the formula for gradient descent as
[math]\displaystyle{ \begin{bmatrix} \beta \\ \beta_0 \end{bmatrix} = \begin{bmatrix} \beta \\ \beta_0 \end{bmatrix} + \eta \begin{bmatrix} \sum_{i \in M} y_ix_i \\ \sum_{i \in M} y_i \end{bmatrix} }[/math]
Generate a random set of points and a line in 2D using python, and sort the set of points into 2 sets above and below the line. Then use gradient descent and the formulas above to find a hyperplane which also sorts the set with no prior knowledge of the line used for sorting.
Solution
The code below is a sample implementation that applies the 2D gradient descent algorithm derived in Lecture 2 to a random dataset.
import numpy as np
import matplotlib.pyplot as plt

# Generate a random array of points in the unit square
random_array = np.random.random((40, 2))

# Linearly separate the points by a random hyperplane (in this case, a 2D line y = ax + b)
a = np.random.random()
b = np.random.random() * 0.25

# Create the ground truth: classify points into Set 1 (above the line) or Set 2 (below the line)
set_1_indices = random_array[:, 1] >= (a * random_array[:, 0] + b)
set_2_indices = random_array[:, 1] < (a * random_array[:, 0] + b)
set_1 = random_array[set_1_indices]
set_2 = random_array[set_2_indices]

# Labels y_i = +1 for Set 1 and y_i = -1 for Set 2
labels = np.where(set_1_indices, 1.0, -1.0)

plt.figure(1, figsize=(6, 6))
plt.scatter(set_1[:, 0], set_1[:, 1], c='r')
plt.scatter(set_2[:, 0], set_2[:, 1], c='b')
plt.xlabel('x')
plt.ylabel('y')
x = np.linspace(0, 1, 2)
plt.plot(x, a * x + b, 'g')
plt.legend(['Set 1', 'Set 2', 'Initial Hyperplane Used to Split Sets'])
plt.title("Generated Dataset")
plt.show()

# Now use a simple perceptron and the gradient descent update derived in Lecture 2 to find,
# with no prior knowledge of the original line, a hyperplane beta^T x + beta_0 = 0 that
# separates the two sets.

# Randomly initialize beta (a 2-vector) and beta_0
beta = np.random.random(2)
beta_0 = np.random.random() * 0.25

def update_weights(beta, beta_0, points, labels, learning_rate):
    # Current predictions: +1 if beta^T x + beta_0 >= 0, otherwise -1
    predictions = np.where(points @ beta + beta_0 >= 0, 1.0, -1.0)
    # Set M: misclassified points (predicted class disagrees with the ground truth)
    M = predictions != labels
    # Gradient descent step from the lecture:
    # beta <- beta + eta * sum_{i in M} y_i x_i,  beta_0 <- beta_0 + eta * sum_{i in M} y_i
    beta = beta + learning_rate * (labels[M] @ points[M])
    beta_0 = beta_0 + learning_rate * np.sum(labels[M])
    return beta, beta_0

# Run for 1000 epochs
for i in range(1000):
    beta, beta_0 = update_weights(beta, beta_0, random_array, labels, 0.01)

plt.figure(2, figsize=(6, 6))
plt.scatter(set_1[:, 0], set_1[:, 1], c='r')
plt.scatter(set_2[:, 0], set_2[:, 1], c='b')
plt.xlabel('x')
plt.ylabel('y')
x = np.linspace(0, 1, 2)
# Decision boundary beta_1*x + beta_2*y + beta_0 = 0, drawn as y = -(beta_1*x + beta_0) / beta_2
plt.plot(x, -(beta[0] * x + beta_0) / beta[1], c='g')
plt.legend(['Set 1', 'Set 2', 'Gradient Descent Hyperplane'])
plt.title("Gradient Descent Generated Hyperplane")
plt.show()
The figure below shows a sample of the randomly generated linearly separated sets, and the lines used to separate the sets.
The figure below shows the predicted hyperplane found using gradient descent
Exercise 2.14
Level: Easy
Exercise Types: Novel
Question
Consider a simple neural network model with a single hidden layer to classify input data into two categories based on their features. Address the following points:
- Describe the process of input data transformation through a single hidden layer.
- Identify the role of activation functions in neural networks.
- Explain the importance of the learning rate in the neural network training process.
Solution
- Input Processing:
The network receives input features and feeds them through a hidden layer, where each input is subject to a weighted sum, addition of a bias, and application of an activation function. This series of operations transforms the input data into a representation that captures complex relationships within the data.
Mathematically, the output [math]\displaystyle{ h }[/math] of the hidden layer for an input vector [math]\displaystyle{ x }[/math] is given by: [math]\displaystyle{ h = f(W^{(1)}x + b^{(1)}) }[/math], where [math]\displaystyle{ W^{(1)} }[/math] represents the weights, [math]\displaystyle{ b^{(1)} }[/math] the biases, and [math]\displaystyle{ f }[/math] the activation function.
- Role of Activation Functions:
Activation functions such as sigmoid, ReLU, or tanh introduce necessary non-linearities into the model, enabling it to learn more complex patterns and relationships in the data. These functions are applied to each neuron's output and help regulate the neural network's overall output, ensuring predictability and differentiation between different types of outputs.
- Importance of Learning Rate:
The learning rate [math]\displaystyle{ \eta }[/math] is a critical parameter that determines the extent to which the weights in the network are updated during training. An optimal learning rate ensures efficient convergence to a minimum, whereas an excessively high rate can lead to overshooting and an excessively low rate to slow convergence or getting stuck in local minima.
Exercise 2.15
Level: ** (Moderate)
Exercise Types: Novel
Question
Consider a simple feedforward neural network with one input layer, one hidden layer, and one output layer. The network structure is as follows:
- Input layer: 2 neurons (features [math]\displaystyle{ x_1 }[/math] and [math]\displaystyle{ x_2 }[/math]).
- Hidden layer: 3 neurons with ReLU activation ([math]\displaystyle{ (a = \max(0, z)) }[/math]).
- Output layer: 1 neuron with no activation (linear output).
The weights ([math]\displaystyle{ W }[/math]) and biases ([math]\displaystyle{ b }[/math]) for the layers are:
Hidden layer:
[math]\displaystyle{ W_{\text{hidden}} = \begin{bmatrix} 0.5 & -0.2 \\ 0.8 & 0.3 \\ -0.5 & 0.7 \end{bmatrix}, \quad b_{\text{hidden}} = \begin{bmatrix} 0.1 \\ -0.4 \\ 0.2 \end{bmatrix} }[/math]
Output layer:
[math]\displaystyle{ W_{\text{output}} = \begin{bmatrix} 0.6 & -0.1 & 0.3 \end{bmatrix}, \quad b_{\text{output}} = 0.5 }[/math]
For input values [math]\displaystyle{ x_1 = 1.5 }[/math] and [math]\displaystyle{ x_2 = -0.5 }[/math]:
- Calculate the output of the hidden layer ([math]\displaystyle{ a_{\text{hidden}} }[/math]).
- Calculate the final output of the network ([math]\displaystyle{ y_{\text{output}} }[/math]).
Solution
Step 1: Calculate [math]\displaystyle{ z_{\text{hidden}} = W_{\text{hidden}} \cdot \mathbf{x} + b_{\text{hidden}} }[/math]
The input vector is: [math]\displaystyle{ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1.5 \\ -0.5 \end{bmatrix} }[/math]
[math]\displaystyle{ z_{\text{hidden}} = \begin{bmatrix} 0.5 & -0.2 \\ 0.8 & 0.3 \\ -0.5 & 0.7 \end{bmatrix} \cdot \begin{bmatrix} 1.5 \\ -0.5 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.4 \\ 0.2 \end{bmatrix} }[/math]
[math]\displaystyle{
z_{\text{hidden}} = \begin{bmatrix} (0.5 \cdot 1.5) + (-0.2 \cdot -0.5) \\ (0.8 \cdot 1.5) + (0.3 \cdot -0.5) \\ (-0.5 \cdot 1.5) + (0.7 \cdot -0.5) \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.4 \\ 0.2 \end{bmatrix}
}[/math]
[math]\displaystyle{ z_{\text{hidden}} = \begin{bmatrix} 0.75 + 0.10 \\ 1.2 - 0.15 \\ -0.75 - 0.35 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.4 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 0.85 \\ 1.05 \\ -1.10 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.4 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 0.95 \\ 0.65 \\ -0.9 \end{bmatrix} }[/math]
Step 2: Apply ReLU activation
The ReLU activation is defined as:
[math]\displaystyle{ a_{\text{hidden}} = \max(0, z_{\text{hidden}}) }[/math]
[math]\displaystyle{
a_{\text{hidden}} = \begin{bmatrix} \max(0, 0.95) \\ \max(0, 0.65) \\ \max(0, -0.9) \end{bmatrix} = \begin{bmatrix} 0.95 \\ 0.65 \\ 0 \end{bmatrix}
}[/math]
Step 3: Calculate [math]\displaystyle{ z_{\text{output}} = W_{\text{output}} \cdot a_{\text{hidden}} + b_{\text{output}} }[/math]
[math]\displaystyle{ z_{\text{output}} = \begin{bmatrix} 0.6 & -0.1 & 0.3 \end{bmatrix} \cdot \begin{bmatrix} 0.95 \\ 0.65 \\ 0 \end{bmatrix} + 0.5 }[/math]
[math]\displaystyle{ z_{\text{output}} = (0.6 \cdot 0.95) + (-0.1 \cdot 0.65) + (0.3 \cdot 0) + 0.5 }[/math]
[math]\displaystyle{ z_{\text{output}} = 0.57 - 0.065 + 0 + 0.5 = 1.005 }[/math]
Step 4: Final Output
The final output of the network is: [math]\displaystyle{ y_{\text{output}} = z_{\text{output}} = 1.005 }[/math]
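These computations can be verified with NumPy; a minimal sketch of the forward pass above:

```python
import numpy as np

W_hidden = np.array([[0.5, -0.2],
                     [0.8,  0.3],
                     [-0.5, 0.7]])
b_hidden = np.array([0.1, -0.4, 0.2])
W_output = np.array([0.6, -0.1, 0.3])
b_output = 0.5

x = np.array([1.5, -0.5])
z_hidden = W_hidden @ x + b_hidden       # [0.95, 0.65, -0.9]
a_hidden = np.maximum(0.0, z_hidden)     # ReLU: [0.95, 0.65, 0.0]
y_output = W_output @ a_hidden + b_output
print(y_output)                          # 1.005
```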
Exercise 2.16
Level: * (Easy)
Exercise Types: Novel
Question
Consider the following binary classification task, where the goal is to classify points into two categories: +1 or -1.
Training Data: \begin{array}{|c|c|c|} \hline x_1 & x_2 & y \\ \hline 2 & 3 & 1 \\ 1 & 2 & 1 \\ 3 & 1 & -1 \\ 4 & 5 & -1 \\ \hline \end{array} Where [math]\displaystyle{ x_1 }[/math] and [math]\displaystyle{ x_2 }[/math] are the input features, [math]\displaystyle{ y }[/math] is the target label (+1 or -1).
Task: Train a perceptron model using gradient descent, updating the weights with the perceptron update rule on the first data point (2, 3, 1). Initialize the weights as [math]\displaystyle{ \beta_0 = 0 }[/math], [math]\displaystyle{ \beta_1 = 0 }[/math], and [math]\displaystyle{ \beta_2 = 0 }[/math]. Choose a learning rate of [math]\displaystyle{ \eta = 0.1 }[/math].
Solution
Step 1: Compute the Prediction
The perceptron model predicts the label using the following equation:
[math]\displaystyle{ \hat{y} = \text{sign}(\beta_0 + \beta_1 x_1 + \beta_2 x_2) }[/math]
Substituting the initial weights and the input values:
[math]\displaystyle{ \hat{y} = \text{sign}(0 + 0 \cdot 2 + 0 \cdot 3) = \text{sign}(0) = 0 }[/math]
Since the predicted label [math]\displaystyle{ \hat{y} = 0 }[/math] is not equal to the true label [math]\displaystyle{ y = 1 }[/math], we need to update the weights.
Step 2: Update the Weights
[math]\displaystyle{ \beta_0 = \beta_0 + \eta \cdot y = 0 + 0.1 \cdot 1 = 0.1 }[/math]
[math]\displaystyle{ \beta_1 = \beta_1 + \eta \cdot y \cdot x_1 = 0 + 0.1 \cdot 1 \cdot 2 = 0.2 }[/math]
[math]\displaystyle{ \beta_2 = \beta_2 + \eta \cdot y \cdot x_2 = 0 + 0.1 \cdot 1 \cdot 3 = 0.3 }[/math]
Exercise 2.17
Level: * (Easy)
Exercise Types: Novel
Question
Consider a binary classification problem where a perceptron is used to separate two linearly separable classes in [math]\displaystyle{ \mathbb{R}^2 }[/math]. The perceptron updates its weight vector w using the following update rule:
[math]\displaystyle{ w^{(t+1)} = w^{(t)} + \eta y^{(t)} x^{(t)} }[/math]
where:
- [math]\displaystyle{ w^{(t)} }[/math] is the weight vector at iteration [math]\displaystyle{ t }[/math],
- [math]\displaystyle{ x^{(t)} \in \mathbb{R}^2 }[/math] is the feature vector of the misclassified sample at iteration [math]\displaystyle{ t }[/math],
- [math]\displaystyle{ y^{(t)} \in \{+1, -1\} }[/math] is the corresponding label,
- [math]\displaystyle{ \eta \gt 0 }[/math] is the learning rate.
Prove that if the data is **linearly separable**, the perceptron algorithm **converges in a finite number of steps**.
Solution
Assume the data is linearly separable with margin [math]\displaystyle{ \gamma \gt 0 }[/math]: there exists a unit-norm weight vector [math]\displaystyle{ w^* }[/math], which perfectly separates the data, such that [math]\displaystyle{ y_i (w^* \cdot x_i) \geq \gamma }[/math] for all samples. Let [math]\displaystyle{ R = \max_i \|x_i\| }[/math], start from [math]\displaystyle{ w^{(0)} = 0 }[/math], and take [math]\displaystyle{ \eta = 1 }[/math] without loss of generality (the learning rate only rescales the weights). The perceptron algorithm updates the weight vector in the direction of each misclassified sample, gradually aligning it with [math]\displaystyle{ w^* }[/math].
Each update uses a misclassified sample, so [math]\displaystyle{ w^{(t+1)} \cdot w^* = w^{(t)} \cdot w^* + y^{(t)} (x^{(t)} \cdot w^*) \geq w^{(t)} \cdot w^* + \gamma }[/math]. After [math]\displaystyle{ t }[/math] updates,
[math]\displaystyle{ w^{(t)} \cdot w^* \geq t \gamma }[/math]
which grows linearly with [math]\displaystyle{ t }[/math]. Meanwhile, because the sample used in each update was misclassified, [math]\displaystyle{ y^{(t)} (w^{(t)} \cdot x^{(t)}) \leq 0 }[/math], so [math]\displaystyle{ \|w^{(t+1)}\|^2 = \|w^{(t)}\|^2 + 2 y^{(t)} (w^{(t)} \cdot x^{(t)}) + \|x^{(t)}\|^2 \leq \|w^{(t)}\|^2 + R^2 }[/math], and the norm of [math]\displaystyle{ w^{(t)} }[/math] is bounded as:
[math]\displaystyle{ \|w^{(t)}\|^2 \leq t R^2 }[/math]
By the Cauchy-Schwarz inequality, [math]\displaystyle{ t \gamma \leq w^{(t)} \cdot w^* \leq \|w^{(t)}\| \, \|w^*\| = \|w^{(t)}\| \leq \sqrt{t}\, R }[/math]. Combining these inequalities leads to the bound on the number of updates [math]\displaystyle{ T }[/math]:
[math]\displaystyle{ T \leq \frac{R^2}{\gamma^2} }[/math]
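The bound can be illustrated empirically on synthetic separable data; a minimal sketch (the data distribution and margin threshold are arbitrary, and [math]\displaystyle{ \gamma }[/math] and [math]\displaystyle{ R }[/math] are estimated from the sample):

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = np.array([1.0, -1.0]) / np.sqrt(2)     # unit-norm separating direction (no bias term)
X = rng.uniform(-1, 1, size=(200, 2))
margins = X @ w_star
keep = np.abs(margins) > 0.1                    # discard points too close to the boundary
X, y = X[keep], np.sign(margins[keep])

R = np.max(np.linalg.norm(X, axis=1))           # radius of the data
gamma = np.min(y * (X @ w_star))                # margin of w_star on this sample

w, updates, changed = np.zeros(2), 0, True
while changed:
    changed = False
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:                  # misclassified sample (eta = 1)
            w += yi * xi
            updates += 1
            changed = True

print(f"updates = {updates}, bound R^2/gamma^2 = {R**2 / gamma**2:.1f}")
```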
Exercise 2.18
Level: * (Easy)
Exercise Types: Novel
Question
In a binary classification problem, assume the input data is a 2-dimensional vector [math]\displaystyle{ x = [x_1, x_2] }[/math]. The perceptron model is defined with the following parameters:
Weights: [math]\displaystyle{ w = [2, -1] }[/math]; Bias: [math]\displaystyle{ b = -0.5 }[/math]
The perceptron output is given by [math]\displaystyle{ y = \text{sign}(w \cdot x + b) }[/math], where [math]\displaystyle{ \text{sign}(z) }[/math] outputs +1 if [math]\displaystyle{ z \gt 0 }[/math], and -1 otherwise.
1. Compute the Perceptron output y for the input sample x = [1,2];
2. Assume the target label is t = +1. Determine if the Perceptron classifies the sample correctly. If it misclassifies, provide the update rule for the weights w and biases b, and compute their updated values after one step.
Solution
1. Compute the predicted output y and determine if classification is correct:
[math]\displaystyle{ z = \textbf{w*x} + b = (2*1) + (-1*2) + (-0.5) = 2 - 2 - 0.5 = -0.5 \lt 0 }[/math]
since [math]\displaystyle{ z\lt 0 }[/math], the Perceptron output is
[math]\displaystyle{ y = \text{sign}(-0.5) = -1 }[/math]
2. The target label t = +1, but the Perceptron output y = -1. Hence, the classification is incorrect.
Update rule for the weights and bias (with learning rate [math]\displaystyle{ \rho = 0.1 }[/math]):
Update increments:
[math]\displaystyle{ \Delta w = \rho (t-y) \, x = 0.1 \cdot (1-(-1)) \cdot [1,2] = [0.2, 0.4] }[/math]
[math]\displaystyle{ \Delta b = \rho (t-y) = 0.1 \cdot (1-(-1)) = 0.2 }[/math]
Update:
[math]\displaystyle{ w^{new} = w^{old} + \Delta w = [2, -1] + [0.2, 0.4] = [2.2, -0.6] }[/math]
[math]\displaystyle{ b^{new} = b^{old} + \Delta b = -0.5 + 0.2 = -0.3 }[/math]
Exercise 2.19
Level: ** (Moderate)
Exercise Types: Novel
Question
Given a simple feedforward neural network with one hidden layer and a softmax output, the network has the following structure:
- Input layer: [math]\displaystyle{ x \in \mathbb{R}^n }[/math],
- Hidden layer: Linear transformation [math]\displaystyle{ h = W_1 x + b_1 }[/math], followed by a ReLU activation, i.e., [math]\displaystyle{ h_+ = \max(0, h) }[/math],
- Output layer: Softmax output [math]\displaystyle{ y_{output} = \text{Softmax}(W_2 h_+ + b_2) }[/math].
Let the true label be [math]\displaystyle{ y \in \mathbb{R}^k }[/math], where [math]\displaystyle{ k }[/math] is the number of output classes and [math]\displaystyle{ y }[/math] is a one-hot vector.
1. Derive the gradient of the loss function (cross-entropy loss) with respect to the output logits [math]\displaystyle{ z = W_2 h_+ + b_2 }[/math] in terms of the softmax output and the true label [math]\displaystyle{ y }[/math]. Write the expression for [math]\displaystyle{ \frac{\partial L}{\partial y_{output}} }[/math]. Here we assume the loss function is the cross-entropy loss, i.e. [math]\displaystyle{ L = - \sum_{i} y_i \log(y_{output,i}) }[/math], where [math]\displaystyle{ y_i }[/math] is the true label and [math]\displaystyle{ y_{output,i} }[/math] is the predicted probability from the softmax.
2. Using the chain rule, derive the gradient of the loss function with respect to the hidden layer [math]\displaystyle{ h_+ }[/math]. How does the ReLU activation affect the gradient computation?
Solution
1. Let the output logits be represented as: [math]\displaystyle{ z = W_2 h_+ + b_2 }[/math] where [math]\displaystyle{ h_+ = \max(0, W_1 x + b_1) }[/math] is the output of the hidden layer after the ReLU activation.
The softmax output [math]\displaystyle{ y_{output} = Softmax(z) }[/math] is given by: [math]\displaystyle{ y_{output, i} = \frac{e^{z_i}}{\sum_j e^{z_j}} }[/math]
The cross-entropy loss [math]\displaystyle{ L = - \sum_{i} y_i \log(y_{output,i}) }[/math] where [math]\displaystyle{ y_i }[/math] is the true label and [math]\displaystyle{ y_{output,i} }[/math] is the predicted probability from the softmax.
The gradient of the loss with respect to the output logits [math]\displaystyle{ z_i }[/math] is: [math]\displaystyle{ \frac{\partial L}{\partial z_i} = y_{output,i} - y_i }[/math]
The gradient with respect to the output [math]\displaystyle{ y_{output} }[/math] is: [math]\displaystyle{ \frac{\partial L}{\partial y_{output}} = y_{output} - y }[/math]
2. Now, we compute the gradient of the loss with respect to the hidden layer [math]\displaystyle{ h_+ }[/math].
Using the chain rule, we have: [math]\displaystyle{ \frac{\partial L}{\partial h_+} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial h_+} }[/math]
From part 1, we already computed the derivative of the loss with respect to the output logits: [math]\displaystyle{ \frac{\partial L}{\partial z} = y_{output} - y }[/math]
The derivative of the logits with respect to the hidden layer is: [math]\displaystyle{ \frac{\partial z}{\partial h_+} = W_2 }[/math]
Therefore, the gradient with respect to the hidden layer is: [math]\displaystyle{ \frac{\partial L}{\partial h_+} = W_2^T (y_{output} - y) }[/math]
The ReLU activation introduces a nonlinearity in how this gradient propagates further back. Specifically, ReLU is defined as: [math]\displaystyle{ ReLU(x) = \max(0, x) }[/math]
The gradient of the ReLU function is: [math]\displaystyle{ ReLU'(x) = 1 \text{ if } x \gt 0 \text{ else } 0 }[/math]
Therefore, the gradient with respect to the pre-activation [math]\displaystyle{ h = W_1 x + b_1 }[/math] becomes: [math]\displaystyle{ \frac{\partial L}{\partial h} = \left( W_2^T (y_{output} - y) \right) \cdot ReLU'(h) }[/math] (applied elementwise)
This means that if an element of [math]\displaystyle{ h_+ }[/math] is zero (its pre-activation was negative), the corresponding gradient is also zero, which effectively prevents updates to the weights feeding that particular unit during backpropagation.
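The identity [math]\displaystyle{ \frac{\partial L}{\partial z} = y_{output} - y }[/math] can be checked numerically with finite differences; a minimal sketch (the number of classes and the random logits are arbitrary):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.normal(size=5)                   # logits
y = np.eye(5)[2]                         # one-hot true label

analytic = softmax(z) - y                # claimed gradient
numeric = np.zeros_like(z)
eps = 1e-6
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```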
Exercise 2.20
Level: * (Easy)
Exercise Types: Novel
Question
Compute the distance of the following misclassified points to the hyperplane [math]\displaystyle{ L }[/math]: [math]\displaystyle{ x_1 = [2, 3] }[/math], [math]\displaystyle{ y_1 = 1 }[/math], [math]\displaystyle{ x_2 = [4, -1] }[/math], [math]\displaystyle{ y_2 = -1 }[/math].
The hyperplane parameters are: [math]\displaystyle{ \beta = [1, -2]^T, \quad \beta_0 = 3. }[/math]
Solution
The hyperplane equation is: [math]\displaystyle{ \beta^T x + \beta_0 = x_1 - 2x_2 + 3. }[/math]
For [math]\displaystyle{ x_1 = [2, 3] }[/math], [math]\displaystyle{ y_1 = 1 }[/math]:
[math]\displaystyle{
\beta^T x_1 + \beta_0 = (1)(2) + (-2)(3) + 3 = 2 - 6 + 3 = -1.
}[/math]
The distance is:
[math]\displaystyle{
d_1 = -y_1 (\beta^T x_1 + \beta_0) = -(1)(-1) = 1.
}[/math]
For [math]\displaystyle{ x_2 = [4, -1] }[/math], [math]\displaystyle{ y_2 = -1 }[/math]: [math]\displaystyle{ \beta^T x_2 + \beta_0 = (1)(4) + (-2)(-1) + 3 = 4 + 2 + 3 = 9. }[/math] The distance is: [math]\displaystyle{ d_2 = -y_2 (\beta^T x_2 + \beta_0) = -(-1)(9) = 9. }[/math]
Exercise 2.21
Level: * (Easy)
Exercise Types: Novel
Question
Train a perceptron using gradient descent for the following settings:
- Input Data: [math]\displaystyle{ \mathbf{x}_i = [x_{i1}, x_{i2}]^T }[/math].
- Labels: [math]\displaystyle{ y_i \in \{-1, +1\} }[/math].
- Initial Weights: [math]\displaystyle{ \mathbf{w} = [w_1, w_2]^T = [0, 0]^T }[/math], [math]\displaystyle{ b = 0 }[/math].
- Learning Rate: [math]\displaystyle{ \eta = 0.1 }[/math].
The perceptron update rules are:
[math]\displaystyle{ \mathbf{w} \leftarrow \mathbf{w} + \eta \cdot y_i \cdot \mathbf{x}_i }[/math]
[math]\displaystyle{ b \leftarrow b + \eta \cdot y_i }[/math]
Dataset:
[math]\displaystyle{ \mathbf{x}_1 = [2, 1]^T }[/math], [math]\displaystyle{ y_1 = 1 }[/math]
[math]\displaystyle{ \mathbf{x}_2 = [1, -1]^T }[/math], [math]\displaystyle{ y_2 = -1 }[/math]
Tasks:
- Perform one iteration of gradient descent (update weights and bias).
- Determine whether both points are correctly classified after the update.
Solution
Initialization: [math]\displaystyle{ \mathbf{w} = [0, 0]^T }[/math], [math]\displaystyle{ b = 0 }[/math], [math]\displaystyle{ \eta = 0.1 }[/math].
Step 1: Update for Point [math]\displaystyle{ \mathbf{x}_1 = [2, 1]^T }[/math], [math]\displaystyle{ y_1 = 1 }[/math]
Compute perceptron output: [math]\displaystyle{ \text{Output} = \mathbf{w}^T \mathbf{x}_1 + b = 0 }[/math].
Misclassification occurs ([math]\displaystyle{ \text{Output} \cdot y_1 \leq 0 }[/math]).
Update weights and bias:
[math]\displaystyle{ \mathbf{w} \leftarrow [0, 0]^T + 0.1 \cdot 1 \cdot [2, 1]^T = [0.2, 0.1]^T }[/math]
[math]\displaystyle{ b \leftarrow 0 + 0.1 \cdot 1 = 0.1 }[/math]
Step 2: Update for Point [math]\displaystyle{ \mathbf{x}_2 = [1, -1]^T }[/math], [math]\displaystyle{ y_2 = -1 }[/math]
Compute perceptron output: [math]\displaystyle{ \text{Output} = \mathbf{w}^T \mathbf{x}_2 + b = 0.2 }[/math].
Misclassification occurs ([math]\displaystyle{ \text{Output} \cdot y_2 \leq 0 }[/math]).
Update weights and bias:
[math]\displaystyle{ \mathbf{w} \leftarrow [0.2, 0.1]^T + 0.1 \cdot (-1) \cdot [1, -1]^T = [0.1, 0.2]^T }[/math]
[math]\displaystyle{ b \leftarrow 0.1 + 0.1 \cdot (-1) = 0 }[/math]
Results:
Final weights: [math]\displaystyle{ \mathbf{w} = [0.1, 0.2]^T }[/math]
Final bias: [math]\displaystyle{ b = 0 }[/math]
Classification Check:
For [math]\displaystyle{ \mathbf{x}_1 }[/math]: [math]\displaystyle{ \mathbf{w}^T \mathbf{x}_1 + b = 0.4 }[/math] (correctly classified).
For [math]\displaystyle{ \mathbf{x}_2 }[/math]: [math]\displaystyle{ \mathbf{w}^T \mathbf{x}_2 + b = -0.1 }[/math] (correctly classified).
Conclusion:
After one iteration of gradient descent, both points are correctly classified.
Exercise 2.22
Level: * (Moderate)
Exercise Types: Novel
Question
For the following neural network, do one forward pass through the network using the point [math]\displaystyle{ (x_1, x_2, x_3) = (1, 2, -1) }[/math]. The hidden layer uses the ReLU activation function and the output layer uses the sigmoid activation function. Calculate [math]\displaystyle{ p_{final} }[/math].
Solution
1) Hidden layer calculations: The equations for the hidden layer are [math]\displaystyle{ z1 = x1*w1 + x2*w2 + x3*w3 + w4 }[/math] and [math]\displaystyle{ z2 = x1*w5 + x2*w6 + x3*w7 + w8 }[/math]. Plugging in the values x and the weights we obtain [math]\displaystyle{ z1 = (1)*(0.5) + (2)*(0.3) + (-1)*(0.4) - 1 = -0.3 }[/math] and [math]\displaystyle{ z2 = (1)*(-0.2) + (2)*(1) + (-1)*(0.7) + 0.9 = 2 }[/math]. The equations to activate the hidden layers are [math]\displaystyle{ za = max(0, z1) }[/math] and [math]\displaystyle{ zb = max(0, z2) }[/math]. Plugging in z1 and z2 we obtain [math]\displaystyle{ za = max(0, -0.3) = 0 }[/math] and [math]\displaystyle{ zb = max(0, 2) = 2 }[/math].
2) Output layers calculations: The equations for the output layer is [math]\displaystyle{ zfinal = za*w9 + zb*w10 + w11 }[/math]. Plugging in za, zb and the weights we obtain [math]\displaystyle{ zfinal = 0*0.6 + 2*0.5 - 0.4 = 0.6 }[/math]. To calculate pfinal we need to use the sigmoid activation function so the equation is [math]\displaystyle{ pfinal = \frac{1}{1 + e^{-zfinal}} }[/math]. Plugging in the value for zfinal we obtain [math]\displaystyle{ pfinal = \frac{1}{1 + e^{-0.6}} = 0.646 }[/math].
Exercise 2.23
Level: * (Easy)
Exercise Types: Novel
Question
You are given a perceptron with weights [math]\displaystyle{ w = [2, -3] }[/math] and bias [math]\displaystyle{ b = 5 }[/math]. The perceptron uses the activation function [math]\displaystyle{ f(x) = \text{sign}(w \cdot x + b) }[/math], where [math]\displaystyle{ x =[x_1,x_2] }[/math] is the input and [math]\displaystyle{ \text{sign}(z) }[/math] outputs 1 if [math]\displaystyle{ z\gt 0 }[/math] and -1 otherwise.
1) Write the equation of the hyperplane that separates the two classes defined by this perceptron.
2) Determine whether the following points are classified as 1 or -1 by the perceptron: [math]\displaystyle{ x_1 =[1,1] }[/math], [math]\displaystyle{ x_2=[2,-1] }[/math], [math]\displaystyle{ x_3=[0,2] }[/math].
Solution
1) The hyperplane is defined by the equation [math]\displaystyle{ w \cdot x + b = 0 }[/math]. Substituting the given weights and bias gives [math]\displaystyle{ 2x_1 - 3x_2 + 5 = 0. }[/math] This is the equation of the hyperplane that separates the two classes.
2) For [math]\displaystyle{ x_1 =[1,1] }[/math]: [math]\displaystyle{ w \cdot x_1 + b = 2(1) - 3(1) + 5 = 4 \gt 0 }[/math], so [math]\displaystyle{ x_1 }[/math] is classified as 1 by the perceptron.
For [math]\displaystyle{ x_2 =[2,-1] }[/math]: [math]\displaystyle{ w \cdot x_2 + b = 2(2) - 3(-1) + 5 = 12 \gt 0 }[/math], so [math]\displaystyle{ x_2 }[/math] is classified as 1 by the perceptron.
For [math]\displaystyle{ x_3 =[0,2] }[/math]: [math]\displaystyle{ w \cdot x_3 + b = 2(0) - 3(2) + 5 = -1 \lt 0 }[/math], so [math]\displaystyle{ x_3 }[/math] is classified as -1 by the perceptron.
Exercise 2.24
Level: * (Easy)
Exercise Types: Novel
Question
Given a dataset [math]\displaystyle{ D = \left\{ (x_1,y_1)=([1,2],1), (x_2,y_2)=([2,-1],-1), (x_3,y_3)=([3,1],1), (x_4,y_4)=([1,-2],-1)\right\} }[/math]
-a). Show that the dataset [math]\displaystyle{ D }[/math] is linearly separable by finding the weight vector [math]\displaystyle{ \beta=[w_1,w_2] }[/math] and bias [math]\displaystyle{ b }[/math] such that: [math]\displaystyle{ y_i\cdot (\beta^Tx_i+b)\gt 0 , \forall (x_i,y_i)\in D }[/math]
-b). Train the perceptron starting with [math]\displaystyle{ w_1=0, w_2 = 0, b = 0 }[/math]. Use the update rule for misclassification points: [math]\displaystyle{ \beta\gets \beta+y_i\cdot x_i,b\gets b+y_i }[/math] Show the full process and update [math]\displaystyle{ \beta }[/math] and [math]\displaystyle{ b }[/math].
-c). Write the final values of [math]\displaystyle{ \beta }[/math] and [math]\displaystyle{ b }[/math].
Solution
-a). When [math]\displaystyle{ y=1 }[/math], the points are [math]\displaystyle{ [1,2] }[/math] and [math]\displaystyle{ [3,1] }[/math].
When [math]\displaystyle{ y=-1 }[/math], the points are [math]\displaystyle{ [2,-1] }[/math] and [math]\displaystyle{ [1,-2] }[/math].
Consider the line [math]\displaystyle{ w_1x_1+w_2x_2+b=0 }[/math] with [math]\displaystyle{ \beta=[1,1],b=-2 }[/math].
For [math]\displaystyle{ [1,2]:1\cdot 1+1\cdot 2-2=1\gt 0 }[/math](correctly separated).
For [math]\displaystyle{ [3,1]:1\cdot 3+1\cdot 1-2=2\gt 0 }[/math](correctly separated).
For [math]\displaystyle{ [2,-1]:1\cdot 2+1\cdot (-1)-2=-1\lt 0 }[/math](correctly separated).
For [math]\displaystyle{ [1,-2]:1\cdot 1+1\cdot (-2)-2=-3\lt 0 }[/math](correctly separated).
-b).
As [math]\displaystyle{ w_1=0,w_2=0,b=0 }[/math],
For [math]\displaystyle{ (x_1,y_1)=([1,2],1) }[/math], [math]\displaystyle{ f(x_1)=sign(0\cdot1+0\cdot2+0)=0 }[/math]
This is incorrect, thus update into: [math]\displaystyle{ w_1\gets 0+1\cdot 1=1, w_2\gets 0+1\cdot 2=2, b\gets 0+1=1 }[/math]
For [math]\displaystyle{ (x_2,y_2)=([2,-1],-1) }[/math], [math]\displaystyle{ f(x_2)=sign(1\cdot2+2\cdot(-1)+1)=1 }[/math]
This is incorrect, thus update into: [math]\displaystyle{ w_1\gets 1+(-1)\cdot 2=-1, w_2\gets 2+(-1)\cdot (-1)=3, b\gets 1+(-1)=0 }[/math]
For [math]\displaystyle{ (x_3,y_3)=([3,1],1) }[/math], [math]\displaystyle{ f(x_3)=sign(-1\cdot3+3\cdot1+0)=0 }[/math]
This is incorrect, thus update into: [math]\displaystyle{ w_1\gets -1+1\cdot 3=2, w_2\gets 3+1\cdot 1=4, b\gets 0+1=1 }[/math]
For [math]\displaystyle{ (x_4,y_4)=([1,-2],-1) }[/math], [math]\displaystyle{ f(x_4)=sign(2\cdot1+4\cdot(-2)+1)=-1 }[/math]
This is correct, thus no update.
-c).
Therefore, final weight is [math]\displaystyle{ w_1=2,w_2=4,b=1 }[/math]
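The updates in part (b) can be reproduced with a short NumPy loop; this is a sketch assuming a single pass over the data in the order given, with sign(0) treated as a misclassification:
import numpy as np

X = np.array([[1, 2], [2, -1], [3, 1], [1, -2]], dtype=float)
y = np.array([1, -1, 1, -1])

w = np.zeros(2)
b = 0.0
for i in range(len(X)):
    pred = np.sign(w @ X[i] + b)       # sign(0) = 0, which never matches +-1
    if pred != y[i]:
        w = w + y[i] * X[i]
        b = b + y[i]
    print(f"after x{i + 1}: w = {w}, b = {b}")
# Final values: w = [2. 4.], b = 1.0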
Exercise 2.25
Level: * (Easy)
Exercise Types: Novel
Question
Explain the mechanism of self-attention used in transformers, focusing on the computation of attention scores. Use a small example with the input vectors corresponding to three words of your choice to demonstrate how self-attention weights are calculated and used to produce the output vectors.
Solution
Self-Attention Mechanism: Self-attention allows a model to dynamically weigh the importance of each word in a sequence relative to others, which is crucial for understanding the context and relationships in sentences.
Example Setup: - Assume we have three words in our input: "Star Wars," "Star Trek," and "Star Gate". - Each word is initially represented by a simple 2-dimensional embedding for simplicity.
Step 1: Compute Q, K, V vectors For each word, compute the Query (Q), Key (K), and Value (V) vectors using learned weight matrices (assume identity matrices for simplicity): - Q, K, V for "Star Wars" = [1, 0] - Q, K, V for "Star Trek" = [0, 1] - Q, K, V for "Star Gate" = [1, 1]
Step 2: Compute Attention Scores Calculate the dot product of the query vector of each word with the key vector of every word, including itself:
- Scores for "Star Wars" (query) against keys "Star Wars", "Star Trek", "Star Gate": 1, 0, 1
- Scores for "Star Trek": 0, 1, 1
- Scores for "Star Gate": 1, 1, 2
Step 3: Normalize Scores using Softmax Apply the softmax function to each row of scores so that the attention weights are positive and sum to one (the usual scaling by the square root of the key dimension is omitted for simplicity):
- Weights for "Star Wars": softmax(1, 0, 1) ≈ [0.422, 0.155, 0.422]
- Weights for "Star Trek": softmax(0, 1, 1) ≈ [0.155, 0.422, 0.422]
- Weights for "Star Gate": softmax(1, 1, 2) ≈ [0.212, 0.212, 0.576]
Step 4: Compute Output Vectors Multiply each value vector by the corresponding attention weight and sum them to produce the output vector for each word:
- Output for "Star Wars": 0.422*[1, 0] + 0.155*[0, 1] + 0.422*[1, 1] ≈ [0.845, 0.578]
- Output for "Star Trek": 0.155*[1, 0] + 0.422*[0, 1] + 0.422*[1, 1] ≈ [0.578, 0.845]
- Output for "Star Gate": 0.212*[1, 0] + 0.212*[0, 1] + 0.576*[1, 1] ≈ [0.788, 0.788]
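The same toy example can be written as a few lines of NumPy; this is a sketch in which the embeddings double as Q, K and V (identity projection matrices assumed) and the 1/sqrt(d_k) scaling is again omitted:
import numpy as np

E = np.array([[1., 0.],   # "Star Wars"
              [0., 1.],   # "Star Trek"
              [1., 1.]])  # "Star Gate"
Q, K, V = E, E, E

scores = Q @ K.T                                        # 3x3 dot-product scores
weights = np.exp(scores)
weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
outputs = weights @ V

print(np.round(weights, 3))
print(np.round(outputs, 3))   # first row ~ [0.845, 0.578] for "Star Wars"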
Exercise 3.1
Level: ** (Moderate)
Exercise Types: Novel
Question
Implement the perceptron learning algorithm with momentum for the AND function. Plot the decision boundary after 10 epochs of training with a learning rate of 0.1 and a momentum of 0.9.
Solution
import numpy as np
import matplotlib.pyplot as plt
# Dataset
data = np.array([
[0, 0, -1],
[0, 1, -1],
[1, 0, -1],
[1, 1, 1]
])
# Initialize weights, learning rate, and momentum
weights = np.random.rand(3)
learning_rate = 0.1
momentum = 0.9
epochs = 10
previous_update = np.zeros(3)
# Add bias term to data
X = np.hstack((np.ones((data.shape[0], 1)), data[:, :2]))
y = data[:, 2]
# Training loop
for epoch in range(epochs):
    for i in range(len(X)):
        prediction = np.sign(np.dot(weights, X[i]))
        if prediction != y[i]:
            update = learning_rate * y[i] * X[i] + momentum * previous_update
            weights += update
            previous_update = update
# Plot final decision boundary
x_vals = np.linspace(-0.5, 1.5, 100)
y_vals = -(weights[1] * x_vals + weights[0]) / weights[2]
plt.plot(x_vals, y_vals, label='Final Decision Boundary')
# Plot dataset
for point in data:
    color = 'blue' if point[2] == 1 else 'red'
    plt.scatter(point[0], point[1], color=color)
plt.title('Perceptron with Momentum')
plt.legend()
plt.show()
Exercise 3.2
Level: ** (Moderate)
Exercise Types: Novel
Question
Write a Python program showing how the backpropagation algorithm works for a neural network with 2 inputs, one hidden layer of 2 neurons, and 1 output neuron. Train the network on the XOR problem:
- Input[0,0] -> Output 0
- Input [0,1] -> Output 1
- Input [1,0] -> Output 1
- Input [1,1] -> Output 0
Use the sigmoid activation function and mean squared error (MSE) as the loss function.
Solution
import numpy as np
# -------------------------
# 1. Define Activation Functions
# -------------------------
def sigmoid(x):
    """
    Sigmoid activation function.
    :param x: A decimal value
    :return: The sigmoid activation of given value
    """
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """
    Derivative of the sigmoid function.
    Here, 'x' is assumed to be sigmoid(x),
    meaning x is already the output of the sigmoid.
    :param x: A decimal value
    :return: The derivative of sigmoid activation of given value
    """
    return x * (1 - x)
# -------------------------
# 2. Prepare the Training Data (XOR)
# -------------------------
# Input data (4 samples, each with 2 features)
X = np.array([
[0, 0],
[0, 1],
[1, 0],
[1, 1]
])
# Target labels (4 samples, each is a single output)
y = np.array([
[0],
[1],
[1],
[0]
])
# -------------------------
# 3. Initialize Network Parameters
# -------------------------
# Weights for input -> hidden (shape: 2x2)
W1 = np.random.randn(2, 2)
# Bias for hidden layer (shape: 1x2)
b1 = np.random.randn(1, 2)
# Weights for hidden -> output (shape: 2x1)
W2 = np.random.randn(2, 1)
# Bias for output layer (shape: 1x1)
b2 = np.random.randn(1, 1)
# Hyperparameters
learning_rate = 0.1
num_epochs = 10000
# -------------------------
# 4. Training Loop
# -------------------------
for epoch in range(num_epochs):
    # 4.1. Forward Pass
    # - Compute hidden layer output
    hidden_input = np.dot(X, W1) + b1              # shape: (4, 2)
    hidden_output = sigmoid(hidden_input)
    # - Compute final output
    final_input = np.dot(hidden_output, W2) + b2   # shape: (4, 1)
    final_output = sigmoid(final_input)

    # 4.2. Compute Loss (Mean Squared Error)
    error = y - final_output                       # shape: (4, 1)
    loss = np.mean(error**2)

    # 4.3. Backpropagation
    # - Gradient of loss w.r.t. final_output
    d_final_output = error * sigmoid_derivative(final_output)                 # shape: (4, 1)
    # - Propagate error back to hidden layer
    error_hidden_layer = np.dot(d_final_output, W2.T)                         # shape: (4, 2)
    d_hidden_output = error_hidden_layer * sigmoid_derivative(hidden_output)  # shape: (4, 2)

    # 4.4. Gradient Descent Updates
    # - Update W2, b2
    W2 += learning_rate * np.dot(hidden_output.T, d_final_output)             # shape: (2, 1)
    b2 += learning_rate * np.sum(d_final_output, axis=0, keepdims=True)       # shape: (1, 1)
    # - Update W1, b1
    W1 += learning_rate * np.dot(X.T, d_hidden_output)                        # shape: (2, 2)
    b1 += learning_rate * np.sum(d_hidden_output, axis=0, keepdims=True)      # shape: (1, 2)

    # Print loss every 1000 epochs
    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.6f}")
# -------------------------
# 5. Testing / Final Outputs
# -------------------------
print("\nTraining complete.")
print("Final loss:", loss)
# Feedforward one last time to see predictions
hidden_output = sigmoid(np.dot(X, W1) + b1)
final_output = sigmoid(np.dot(hidden_output, W2) + b2)
print("\nOutput after training:")
for i, inp in enumerate(X):
    print(f"Input: {inp} -> Predicted: {final_output[i][0]:.4f} (Target: {y[i][0]})")
Output:
Epoch 0, Loss: 0.257193
Epoch 1000, Loss: 0.247720
Epoch 2000, Loss: 0.226962
Epoch 3000, Loss: 0.191367
Epoch 4000, Loss: 0.162169
Epoch 5000, Loss: 0.034894
Epoch 6000, Loss: 0.012459
Epoch 7000, Loss: 0.007127
Epoch 8000, Loss: 0.004890
Epoch 9000, Loss: 0.003687
Training complete.
Final loss: 0.0029435579049382756
Output after training:
Input: [0 0] -> Predicted: 0.0598 (Target: 0)
Input: [0 1] -> Predicted: 0.9461 (Target: 1)
Input: [1 0] -> Predicted: 0.9506 (Target: 1)
Input: [1 1] -> Predicted: 0.0534 (Target: 0)
Exercise 3.3
Level: * (Easy)
Exercise Types: Novel
Question
Implement 4 iterations of gradient descent with and without momentum for the function [math]\displaystyle{ f(x) = x^2 + 2 }[/math] with learning rate [math]\displaystyle{ \eta=0.1 }[/math], momentum [math]\displaystyle{ \gamma=0.9 }[/math], starting value of [math]\displaystyle{ x_0=2 }[/math], starting velocity of [math]\displaystyle{ v_0=0 }[/math]. Comment on the differences.
Solution
Note that [math]\displaystyle{ f'(x) = 2x }[/math]
Without momentum:
Iteration 1: [math]\displaystyle{ x_1 = x_0 - \eta* f'(x_0) = 2 - 0.1*2*2 = 1.6 }[/math]
Iteration 2: [math]\displaystyle{ x_2 = x_1 - \eta* f'(x_1) = 1.6 - 0.1*2*1.6 = 1.28 }[/math]
Iteration 3: [math]\displaystyle{ x_3 = x_2 - \eta* f'(x_2) = 1.28 - 0.1*2*1.28 = 1.024 }[/math]
Iteration 4: [math]\displaystyle{ x_4 = x_3 - \eta* f'(x_3) =1.024 - 0.1*2*1.024 = 0.8192 }[/math]
With momentum:
Iteration 1: [math]\displaystyle{ v_1 = \gamma*v_0 + \eta * f'(x_0) = 0.9*0 + 0.1*2*2 = 0.4, x_1 = x_0-v_1 = 2-0.4 = 1.6 }[/math]
Iteration 2: [math]\displaystyle{ v_2 = \gamma*v_1 + \eta * f'(x_1) = 0.9*0.4+0.1*2*1.6 = 0.68, x_2 = x_1-v_2 = 1.6-0.68 = 0.92 }[/math]
Iteration 3: [math]\displaystyle{ v_3 = \gamma*v_2 + \eta * f'(x_2) = 0.9*0.68 + 0.1*2*0.92 = 0.796, x_3 = x_2-v_3 = 0.92-0.796 = 0.124 }[/math]
Iteration 4: [math]\displaystyle{ v_4 = \gamma*v_3 + \eta * f'(x_3) = 0.9*0.796 + 0.1*2*0.124 = 0.7412, x_4 = x_3-v_4 = 0.124 - 0.7412 = -0.6172 }[/math]
By observation, we know that the minimum of [math]\displaystyle{ f(x)=x^2+2 }[/math] occurs at [math]\displaystyle{ x=0 }[/math]. We can see that with momentum, the algorithm moves towards the minimum much faster than without momentum as past gradients are accumulated, leading to larger steps. However, we also can see that momentum can cause the algorithm to overshoot the minimum since we are taking larger steps.
Benefits for momentum: Momentum is a technique used in optimization to accelerate convergence. Inspired by physical momentum, it helps in navigating the optimization landscape.
By remembering the direction of previous gradients, which are accumulated into a running average (the velocity), momentum helps guide the updates more smoothly, leading to faster progress. This running average allows the optimizer to maintain a consistent direction even if individual gradients fluctuate. Additionally, momentum can help the algorithm escape from shallow local minima by carrying the updates through flat regions. This prevents the optimizer from getting stuck in small, unimportant minima and helps it continue moving toward a better local minimum.
Additional Comment: The running-average picture is only there to help with intuition. At time t, the velocity is a linear combination of all previous gradients in which older gradients receive geometrically smaller weights; with the update [math]\displaystyle{ v_t = \gamma v_{t-1} + \eta f'(x_{t-1}) }[/math] used here, the [math]\displaystyle{ \nabla Q(w_0) }[/math] term carries a coefficient of [math]\displaystyle{ \eta\gamma^{t-1} }[/math].
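A few lines of Python reproduce the iterations above, using exactly the update rules stated in this solution:
eta, gamma = 0.1, 0.9

def grad(x):
    return 2 * x   # derivative of f(x) = x^2 + 2

x = 2.0
for t in range(1, 5):                    # plain gradient descent
    x = x - eta * grad(x)
    print(f"GD iteration {t}: x = {x:.4f}")

x, v = 2.0, 0.0
for t in range(1, 5):                    # gradient descent with momentum
    v = gamma * v + eta * grad(x)
    x = x - v
    print(f"Momentum iteration {t}: x = {x:.4f}")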
Exercise 3.4
Level: ** (Moderate)
Exercise Types: Novel
Question
Perform one iteration of forward pass and backward propagation for the following network:
- Input layer: 2 neurons (x₁, x₂)
- Hidden layer: 2 neurons (h₁, h₂)
- Output layer: 1 neuron (ŷ)
- Input-to-Hidden Layer:
w₁₁¹ = 0.15, w₂₁¹ = 0.20, w₁₂¹ = 0.25, w₂₂¹ = 0.30 Bias: b¹ = 0.35 Activation function: sigmoid
- Hidden-to-Output Layer:
w₁₁² = 0.40, w₂₁² = 0.45 Bias: b² = 0.60 Activation function: sigmoid
Input:
- x₁ = 0.05, x₂ = 0.10
- Target output: y = 0.01
- Learning rate: η = 0.5
Solution
Step 1: Forward Pass
1. Hidden Layer Outputs
For neuron h₁:
- [math]\displaystyle{ z₁¹ = w₁₁¹ \cdot x₁ + w₂₁¹ \cdot x₂ + b¹ = 0.15(0.05) + 0.20(0.10) + 0.35 = 0.3775 }[/math]
- [math]\displaystyle{ h₁ = \sigma(z₁¹) = \frac{1}{1 + e^{-0.3775}} \approx 0.5933 }[/math]
For neuron h₂:
- [math]\displaystyle{ z₂¹ = w₁₂¹ \cdot x₁ + w₂₂¹ \cdot x₂ + b¹ = 0.25(0.05) + 0.30(0.10) + 0.35 = 0.3925 }[/math]
- [math]\displaystyle{ h₂ = \sigma(z₂¹) = \frac{1}{1 + e^{-0.3925}} \approx 0.5968 }[/math]
2. Output Layer
- [math]\displaystyle{ z² = w₁₁² \cdot h₁ + w₂₁² \cdot h₂ + b² = 0.40(0.5933) + 0.45(0.5968) + 0.60 = 1.1051 }[/math]
- [math]\displaystyle{ \hat{y} = \sigma(z²) = \frac{1}{1 + e^{-1.1051}} \approx 0.7511 }[/math]
Step 2: Compute Error
- [math]\displaystyle{ E = \frac{1}{2} (\hat{y} - y)^2 = \frac{1}{2} (0.7511 - 0.01)^2 \approx 0.2738 }[/math]
Step 3: Backpropagation
3.1: Gradients for Output Layer
1. Gradient w.r.t. output neuron:
- [math]\displaystyle{ \delta² = (\hat{y} - y) \cdot \hat{y} \cdot (1 - \hat{y}) }[/math]
- [math]\displaystyle{ \delta² = (0.7511 - 0.01) \cdot 0.7511 \cdot (1 - 0.7511) = 0.1381 }[/math]
2. Update weights and bias for hidden-to-output layer:
- [math]\displaystyle{ w₁₁² = w₁₁² - \eta \cdot \delta² \cdot h₁ = 0.40 - 0.5 \cdot 0.1381 \cdot 0.5933 = 0.359 }[/math]
- [math]\displaystyle{ w₂₁² = w₂₁² - \eta \cdot \delta² \cdot h₂ = 0.45 - 0.5 \cdot 0.1381 \cdot 0.5968 = 0.409 }[/math]
- [math]\displaystyle{ b² = b² - \eta \cdot \delta² = 0.60 - 0.5 \cdot 0.1381 = 0.53095 }[/math]
3.2: Gradients for Hidden Layer
1. Gradients for hidden layer neurons:
For h₁:
- [math]\displaystyle{ \delta₁ = \delta² \cdot w₁₁² \cdot h₁ \cdot (1 - h₁) }[/math]
- [math]\displaystyle{ \delta₁ = 0.1381 \cdot 0.40 \cdot 0.5933 \cdot (1 - 0.5933) = 0.0138 }[/math]
For h₂:
- [math]\displaystyle{ \delta₂ = \delta² \cdot w₂₁² \cdot h₂ \cdot (1 - h₂) }[/math]
- [math]\displaystyle{ \delta₂ = 0.1381 \cdot 0.45 \cdot 0.5968 \cdot (1 - 0.5968) = 0.0148 }[/math]
2. Update weights and bias for input-to-hidden layer:
For w₁₁¹:
- [math]\displaystyle{ w₁₁¹ = w₁₁¹ - \eta \cdot \delta₁ \cdot x₁ = 0.15 - 0.5 \cdot 0.0138 \cdot 0.05 = 0.14965 }[/math]
For w₂₁¹:
- [math]\displaystyle{ w₂₁¹ = w₂₁¹ - \eta \cdot \delta₁ \cdot x₂ = 0.20 - 0.5 \cdot 0.0138 \cdot 0.10 = 0.19931 }[/math]
For b¹:
- [math]\displaystyle{ b¹ = b¹ - \eta \cdot (\delta₁ + \delta₂) = 0.35 - 0.5 \cdot (0.0138 + 0.0148) = 0.3357 }[/math]
This completes one iteration of forward and backward propagation.
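The step above can be checked with a small NumPy sketch; it follows the same conventions as the hand calculation, including updating the shared hidden-layer bias with the sum of the two hidden deltas:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.05, 0.10])
W1 = np.array([[0.15, 0.20],   # weights into h1
               [0.25, 0.30]])  # weights into h2
b1 = 0.35
W2 = np.array([0.40, 0.45])
b2 = 0.60
y, eta = 0.01, 0.5

h = sigmoid(W1 @ x + b1)             # [0.5933, 0.5969]
y_hat = sigmoid(W2 @ h + b2)         # ~0.751

delta_out = (y_hat - y) * y_hat * (1 - y_hat)
W2_new = W2 - eta * delta_out * h
b2_new = b2 - eta * delta_out
delta_hidden = delta_out * W2 * h * (1 - h)
W1_new = W1 - eta * np.outer(delta_hidden, x)
b1_new = b1 - eta * delta_hidden.sum()
print(y_hat, W2_new, b2_new, W1_new, b1_new)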
Exercise 3.5
Level: * (Easy)
Exercise Types: Novel
Question
Consider the loss function [math]\displaystyle{ Q(w) = w^2 + 2w + 1 }[/math]. Compute the gradient of [math]\displaystyle{ Q(w) }[/math]. Starting from [math]\displaystyle{ w_0 = 2 }[/math], perform two iterations of stochastic gradient descent using a learning rate [math]\displaystyle{ \rho = 0.1 }[/math].
Solution
Compute the gradient at [math]\displaystyle{ w_0 }[/math]:[math]\displaystyle{ \nabla Q(w_0) = \frac{d}{dw_0}(w_0^2 + 2w_0 + 1) = 2w_0 + 2 = 2(2) + 2 = 4 + 2 = 6 }[/math]
Update the weight using SGD: [math]\displaystyle{ w_1 = w_0 - \rho \cdot \nabla Q(w_0) = 2 - 0.1 \cdot 6 = 2 - 0.6 = 1.4 }[/math]
Compute the gradient at [math]\displaystyle{ w_1 }[/math]: [math]\displaystyle{ \nabla Q(w_1) = \frac{d}{dw_1}(w_1^2 + 2w_1 + 1) = 2w_1 + 2 = 2(1.4) + 2 = 2.8 + 2 = 4.8 }[/math]
Update the weight again: [math]\displaystyle{ w_2 = w_1 - \rho \cdot \nabla Q(w_1) = 1.4 - 0.1 \cdot 4.8 = 1.4 - 0.48 = 0.92 }[/math]
Exercise 3.6
Level: * (Easy)
Exercise Types: Novel
Question
What is the prediction of the following MLP for [math]\displaystyle{ x = \begin{bmatrix} 2 \\ 2 \\ 1 \end{bmatrix} }[/math]?
Both layers are using sigmoid activation. The weight matrices connecting the input and hidden layer, and the hidden layer and output are respectively: [math]\displaystyle{ V = \begin{bmatrix} 1 & 0 & 1 \\ 1 & -1 & 0 \end{bmatrix}, \quad W = \begin{bmatrix} 0 & 1 \end{bmatrix}. }[/math]
Choose the correct answer:
a) [math]\displaystyle{ \sigma(0) }[/math]
b) [math]\displaystyle{ \sigma(\sigma(0)) }[/math]
c) [math]\displaystyle{ \sigma(-1) }[/math]
d) [math]\displaystyle{ \sigma(\sigma(3)) }[/math]
Solution
The correct answer is b): [math]\displaystyle{ \sigma(\sigma(0)) }[/math].
Calculation
Step 1: Compute the hidden layer output [math]\displaystyle{ h }[/math].
[math]\displaystyle{ h = \sigma(Vx) = \begin{bmatrix} \sigma(3) \\ \sigma(0) \end{bmatrix} }[/math]
Step 2: Compute the output layer prediction [math]\displaystyle{ y }[/math].
[math]\displaystyle{ y = \sigma(Wh) = \sigma(\sigma(0)) }[/math]
Thus, the prediction of the MLP is [math]\displaystyle{ \sigma(\sigma(0)) }[/math].
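A quick numerical check of this answer (a minimal sketch):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([2, 2, 1])
V = np.array([[1, 0, 1],
              [1, -1, 0]])
W = np.array([[0, 1]])

h = sigmoid(V @ x)     # [sigma(3), sigma(0)]
y = sigmoid(W @ h)     # sigma(sigma(0)) ~ 0.622
print(h, y)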
Exercise 3.7
Level: * (Easy)
Exercise Types: Novel
Question
Feedforward Neural Networks (FNNs) are among the most commonly used models in deep learning. In the context of training such networks, there are several important design components and techniques to consider.
(a) Give three commonly used activation functions. For each function, provide the formula and comment on its usage.
(b) Write down a widely used loss function for classification and explain why it is popular. Provide an example.
(c) Explain what adaptive learning methods are and how they help optimize and speed up neural network training.
Solution
(a)
- Sigmoid Activation Function
[math]\displaystyle{ f(z) = \frac{1}{1 + e^{-z}} }[/math]
The output of the sigmoid function ranges from 0 to 1, making it suitable for estimating probabilities. For example, it can be used in the final layer of a binary classification task.
- ReLU (Rectified Linear Unit) Activation Function
[math]\displaystyle{ f(z) = \max(0, z) = \begin{cases} z, & \text{if } z \gt 0, \\ 0, & \text{if } z \leq 0. \end{cases} }[/math]
ReLU outputs zero or a positive number, enabling some weights to be set to 0, promoting sparse representation. Since it only requires a comparison, and possibly setting a number to zero, it is more computationally efficient than other activation functions. An activation function similar to ReLU is the Gaussian error linear unit (GELU), which has a very similar shape except it is smooth at [math]\displaystyle{ z=0 }[/math].
- Tanh (Hyperbolic Tangent)
[math]\displaystyle{ f(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}} }[/math]
Advantages: Its output is zero-centered, which helps optimization by producing more balanced gradient updates; it has a smooth gradient that works well in many cases; and it squashes extreme values into the range [-1,1], reducing the influence of outlying data points.
Disadvantages: Outputs close to -1 or 1 have near-zero gradients, leading to slower learning (the vanishing gradient issue).
(b)
Cross-Entropy Loss
[math]\displaystyle{ \mathcal{L}_{\text{CE}} = -\frac{1}{n}\sum_{i=1}^{n} \bigl[\, y_i \log\bigl(y_{\text{pred},i}\bigr)\ +\ (1 - y_i)\,\log\bigl(1 - y_{\text{pred},i}\bigr) \bigr] }[/math] (written here as the mean over the [math]\displaystyle{ n }[/math] examples)
It provides a smooth and continuous gradient, and it penalizes incorrect predictions more heavily when they are made with high confidence. It is well suited for classification problems, and it generalizes naturally to the multi-class case when used together with a softmax output.
Consider a binary classification problem where [math]\displaystyle{ y_i=[1,0,1],y_{\text{pred},i}=[0.9,0.2,0.7] }[/math]. Using [math]\displaystyle{ \mathcal{L}_{\text{CE}} }[/math], we calculate:
[math]\displaystyle{ \mathcal{L}_{\text{CE}} = -\frac{1}{3}\bigl[1\cdot \log\bigl(0.9\bigr)\ +\ 1\cdot \log\bigl(1-0.2\bigr)\ +\ 1\cdot \log\bigl(0.7\bigr) \bigr]\approx 0.228 }[/math]
The gradient of the loss with respect to [math]\displaystyle{ y_{\text{pred},i} }[/math] is:
[math]\displaystyle{ \frac{\partial \mathcal{L}_{\text{CE}}}{\partial y_{\text{pred},i}}=-\frac{1}{n}[\frac{y_i}{y_{\text{pred},i}}-\frac{1-y_i}{1-y_{\text{pred},i}}]
}[/math]
Calculate the gradients, [math]\displaystyle{ \frac{\partial \mathcal{L}_{\text{CE}}}{\partial y_{\text{pred},1}}\approx -0.370,\frac{\partial \mathcal{L}_{\text{CE}}}{\partial y_{\text{pred},2}}\approx 0.417,\frac{\partial \mathcal{L}_{\text{CE}}}{\partial y_{\text{pred},3}}\approx -0.476 }[/math], which is a relatively smooth gradient.
The binary cross-entropy is zero for perfectly confident, correct predictions and grows without bound for confident mistakes. A value of about 0.228 indicates that the model performs well on these three examples.
When [math]\displaystyle{ y_{\text{pred},i} }[/math] is close to 1 but the true label [math]\displaystyle{ y_i=0 }[/math], the term [math]\displaystyle{ \log\bigl(1 - y_{\text{pred},i}\bigr)\to -\infty }[/math], so the loss contribution [math]\displaystyle{ -(1 - y_i)\,\log\bigl(1 - y_{\text{pred},i}\bigr) }[/math] grows without bound. The correspondingly large gradient forces the model to correct such overconfident mistakes.
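The numbers in this example can be verified with a few lines of NumPy (a sketch using the mean binary cross-entropy written above):
import numpy as np

y_true = np.array([1, 0, 1])
y_pred = np.array([0.9, 0.2, 0.7])

ce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
grad = -(y_true / y_pred - (1 - y_true) / (1 - y_pred)) / len(y_true)
print(ce)     # ~0.228
print(grad)   # ~[-0.370, 0.417, -0.476]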
(c)
Adaptive learning-rate methods (for example AdaGrad, RMSProp, and Adam) are variants of SGD that adjust the learning rate during training based on the magnitudes of past gradients: parameters with consistently large gradients take smaller steps, while parameters with small gradients take larger ones. This helps accelerate convergence and makes training less sensitive to the choice of a single global learning rate.
Exercise 3.8
Level: * (Easy)
Exercise Types: Novel
Question
Consider a Feedforward Neural Network (FNN) with a single neuron, where the loss function is given by: [math]\displaystyle{ L(w)=(w-3)^2 }[/math]. Compute the gradient of [math]\displaystyle{ L(w) }[/math]. Starting from [math]\displaystyle{ w_{0}=0 }[/math] perform two iterations of Stochastic Gradient Descent (SGD) using a learning rate [math]\displaystyle{ \eta = 0.5 }[/math]
Solution
Gradient of [math]\displaystyle{ L(w) }[/math]:
[math]\displaystyle{
\frac{dL}{dw}=2(w-3) }[/math]
[math]\displaystyle{ w_{0}=0 }[/math]
[math]\displaystyle{ w_{1}=w_{0}-\eta \frac{dL}{dw}=3 }[/math]
[math]\displaystyle{ w_{2}=w_{1}-\eta \frac{dL}{dw}=3 }[/math]
So, after 2 iterations, [math]\displaystyle{ w_{2}=3 }[/math].
Additional Expanded Question
What if the loss function were [math]\displaystyle{ L(w) = (w - 3)^4 }[/math]? Compute its gradient and perform two iterations of SGD starting from [math]\displaystyle{ w_{0} = 0 }[/math] using the same learning rate [math]\displaystyle{ \eta = 0.5 }[/math]. Do we still reach [math]\displaystyle{ w = 3 }[/math] in two steps?
Additional Solution
For [math]\displaystyle{ L(w) = (w - 3)^4 }[/math], the gradient is: [math]\displaystyle{ \frac{dL}{dw} = 4(w - 3)^3 }[/math].
1. Iteration 1 ([math]\displaystyle{ k=0 }[/math]): [math]\displaystyle{ \frac{dL}{dw}\big\rvert_{w=0} = 4(0 - 3)^3 = 4 \times (-27) = -108 }[/math]. [math]\displaystyle{ w_{1} = 0 \;-\; 0.5 \times (-108) = 54 }[/math].
2. Iteration 2 ([math]\displaystyle{ k=1 }[/math]): [math]\displaystyle{ \frac{dL}{dw}\big\rvert_{w=54} = 4(54 - 3)^3 = 4 \times 51^3 = 4 \times 132651 = 530604 }[/math]. [math]\displaystyle{ w_{2} = 54 \;-\; 0.5 \times 530604 = 54 - 265302 = -265248 }[/math].
Clearly, [math]\displaystyle{ w }[/math] does not converge to [math]\displaystyle{ 3 }[/math] in just two steps. Because [math]\displaystyle{ w }[/math] is far from [math]\displaystyle{ 3 }[/math] initially, the gradient is extremely large and causes a massive overshoot. In practice, you would reduce [math]\displaystyle{ \eta }[/math] or use more sophisticated optimization methods to handle the higher-order curvature of this loss function.
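The contrast between the two loss functions can be seen directly in code (a minimal sketch of the two update rules above):
eta = 0.5

w = 0.0
for t in range(2):
    w = w - eta * 2 * (w - 3)        # gradient of (w - 3)^2
    print("quadratic:", w)           # reaches 3 after one step and stays there

w = 0.0
for t in range(2):
    w = w - eta * 4 * (w - 3) ** 3   # gradient of (w - 3)^4
    print("quartic:", w)             # 54, then -265248: the iterate overshoots badly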
Exercise 3.9
Level: ** (Moderate)
Exercise Types: Modified
References: Source: Schonlau, M., Applied Statistical Learning. With Case Studies in Stata, Springer. ISBN 978-3-031-33389-7 (Chapter 14, page 318).
Question
Consider the feedforward neural network with initial weights shown in Figure 3.9 (For simplicity, there are no biases). Use sigmoid activation functions ([math]\displaystyle{ \sigma(x) = \frac{1}{1+e^{-x}} }[/math]) for both the hidden and the output layers. Use a learning rate of [math]\displaystyle{ \rho= 0.5 }[/math].
(a): Compute a forward pass through the network for the observation (y,x1,x2,x3) = (1,3,-2,5). That is, compute the predicted probability [math]\displaystyle{ p_1 }[/math].
(b): Using the result from (a), make a backward pass to compute the revised w7 and w1. Use squared error loss: [math]\displaystyle{ E = 0.5 * (y-p_1)^2 }[/math], where [math]\displaystyle{ y\in \{0,1\} }[/math] is the true value of the response and [math]\displaystyle{ p_1 }[/math] is the predicted probability.
Solution
(a)
[math]\displaystyle{ z_A = x_1w_1 + x_2w_2 + x_3w_3 = 3*0.8+(-2)*(-1)+5*0.1=4.9 }[/math]
[math]\displaystyle{ z_B = x_1w_4 + x_2w_5 + x_3w_6 = 3*0.3+(-2)*0.5+5*(-0.2)=-1.1 }[/math]
[math]\displaystyle{ out_A = \frac{1}{1+e^{-z_A}} = \frac{1}{1+e^{-4.9}} = 0.9926 }[/math]
[math]\displaystyle{ out_B = \frac{1}{1+e^{-z_B}} = \frac{1}{1+e^{1.1}} = 0.2497 }[/math]
[math]\displaystyle{ z_1 = out_Aw_7+out_Bw_8=0.9926*1.3+0.2497*0.4=1.3903 }[/math]
[math]\displaystyle{ p_1 = \frac{1}{1+e^{-z_1}} = \frac{1}{1+e^{-1.3903}} =0.8006 }[/math]
(b)
Using [math]\displaystyle{ w_i^{new}=w_i^{old} - \rho \frac{\partial{E}}{\partial{w_i}} }[/math]
[math]\displaystyle{ \frac{\partial{E}}{\partial{w_7}}=\frac{\partial{E}}{\partial{p_1}}\frac{\partial{p_1}}{\partial{z_1}}\frac{\partial{z_1}}{\partial{w_7}}=(p_1-y)*p_1*(1-p_1)*out_A=-0.0317 }[/math]
[math]\displaystyle{ w_7^{new} = 1.3-0.5*(-0.0317)=1.3159 }[/math]
[math]\displaystyle{ \frac{\partial{E}}{\partial{w_1}}=\frac{\partial{E}}{\partial{p_1}}\frac{\partial{p_1}}{\partial{z_1}}\frac{\partial{z_1}}{\partial{out_A}}\frac{\partial{out_A}}{\partial{z_A}}\frac{\partial{z_A}}{\partial{w_1}}=(p_1-y)*p_1*(1-p_1)*w_7*out_A*(1-out_A)*x_1=-0.0009 }[/math]
[math]\displaystyle{ w_1^{new} = 0.8-0.5*(-0.0009)=0.8005 }[/math]
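A short NumPy sketch, using the weight values that appear in the calculations above, reproduces both the forward pass and the two updated weights:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([3, -2, 5])
w_A = np.array([0.8, -1.0, 0.1])   # w1, w2, w3
w_B = np.array([0.3, 0.5, -0.2])   # w4, w5, w6
w7, w8 = 1.3, 0.4
y, rho = 1, 0.5

out_A = sigmoid(w_A @ x)                 # sigma(4.9)  ~ 0.9926
out_B = sigmoid(w_B @ x)                 # sigma(-1.1) ~ 0.2497
p1 = sigmoid(w7 * out_A + w8 * out_B)    # ~0.8006

dE_dw7 = (p1 - y) * p1 * (1 - p1) * out_A
dE_dw1 = (p1 - y) * p1 * (1 - p1) * w7 * out_A * (1 - out_A) * x[0]
print(p1, 1.3 - rho * dE_dw7, 0.8 - rho * dE_dw1)   # ~0.8006, ~1.3159, ~0.8005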
Exercise 3.10
Level: * (Easy)
Exercise Types: Novel
Question
What is vanishing gradient and how is it caused? What is the solution that fixes vanishing gradient?
Solution
The vanishing gradient problem occurs when the gradients used to update weights in a neural network become extremely small as they propagate backward through the layers. Saturating activation functions such as the sigmoid or tanh compress their inputs into a narrow output range, so their derivatives are small (at most 0.25 for the sigmoid), particularly for inputs far from zero. Backpropagation multiplies many of these small per-layer derivatives together, so the gradient shrinks roughly exponentially with depth, and in finite-precision arithmetic the resulting updates for the early layers can effectively become zero.
A common fix is the ReLU activation function, whose derivative is either 0 or 1. Because ReLU does not squash positive inputs into a narrow range, its gradient is exactly 1 for positive pre-activations, so the compounding effect of small derivatives across layers is avoided and weight updates remain significant during backpropagation, making it an effective remedy for the vanishing gradient problem in deep networks. (Other common mitigations include careful weight initialization, batch normalization, and residual connections.)
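A simplified illustration in Python: treating only the per-layer factor contributed by the activation derivative (weights ignored), the product of sigmoid derivatives shrinks rapidly with depth, while the ReLU factor stays at 1 for positive pre-activations:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = 2.0
sigmoid_factor = sigmoid(z) * (1 - sigmoid(z))   # ~0.105 (always <= 0.25)
relu_factor = 1.0                                # ReLU derivative for z > 0

for depth in (5, 20, 50):
    print(depth, sigmoid_factor ** depth, relu_factor ** depth)
# The sigmoid product decays towards 0 as depth grows; the ReLU product stays 1.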
Exercise 3.11
Level: ** (Moderate)
Exercise Types: Novel
Question
Given a two-layer neural network with the following specifications:
Inputs: Represented as [math]\displaystyle{ \mathbf{x} = [x_1, x_2]^T }[/math], a [math]\displaystyle{ 2\times1 }[/math] column vector.
Weights: First layer weight matrix: [math]\displaystyle{ W_1 }[/math], a [math]\displaystyle{ 2\times2 }[/math] matrix; Second layer weight vector: [math]\displaystyle{ W_2 }[/math], a [math]\displaystyle{ 1\times2 }[/math] row vector.
Activations: The activation function for all layers is the sigmoid function, defined as: [math]\displaystyle{ \sigma(z) = \frac{1}{1 + e^{-z}} }[/math].
Loss Function: The cross-entropy loss, given by [math]\displaystyle{ L = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}) }[/math] where [math]\displaystyle{ y }[/math] is the true label (0 or 1) and [math]\displaystyle{ \hat{y} }[/math] is the predicted output.
(a). Write out the forward propagation steps.
(b). Calculate the derivative of the loss with respect to weights in [math]\displaystyle{ W_1 }[/math] and [math]\displaystyle{ W_2 }[/math] using the chain rule.
Solution
(a). Forward Propagation Steps:
For the first layer pre-activation: [math]\displaystyle{ \mathbf{z}_1 = W_1 \cdot \mathbf{x} }[/math] where [math]\displaystyle{ \mathbf{z}_1 }[/math] is a [math]\displaystyle{ 2\times1 }[/math] column vector.
For the first layer activation: [math]\displaystyle{ \mathbf{a}_1 = \sigma(\mathbf{z}_1) }[/math] where the sigmoid function is applied element-wise, resulting in [math]\displaystyle{ \mathbf{a}_1 }[/math], a [math]\displaystyle{ 2\times1 }[/math] column vector.
For the second layer pre-activation: [math]\displaystyle{ z_2 = W_2 \cdot \mathbf{a}_1 }[/math] where [math]\displaystyle{ z_2 }[/math] is a scalar value.
For the second layer activation (output): [math]\displaystyle{ \hat{y} = \sigma(z_2) }[/math] where [math]\displaystyle{ \hat{y} }[/math] represents the predicted probability.
For the loss calculation: [math]\displaystyle{ L = -y \log(\hat{y}) - (1 - y) \log(1 - \hat{y}) }[/math].
(b). Derivative of the Loss
Gradients for the Second Layer [math]\displaystyle{ W_2 }[/math]: using the chain rule: [math]\displaystyle{ \frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_2} \cdot \frac{\partial z_2}{\partial W_2}. }[/math]
For the loss gradient w.r.t. output:[math]\displaystyle{ \frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}. }[/math]
For the gradient of sigmoid output w.r.t. pre-activation:[math]\displaystyle{ \frac{\partial \hat{y}}{\partial z_2} = \hat{y} (1 - \hat{y}). }[/math]
For the gradient of pre-activation w.r.t. weights:[math]\displaystyle{ \frac{\partial z_2}{\partial W_2} = \mathbf{a}_1. }[/math]
For the combine terms:[math]\displaystyle{ \frac{\partial L}{\partial W_2} = (\hat{y} - y) \cdot \mathbf{a}_1. }[/math]
Gradients for the First Layer [math]\displaystyle{ W_1 }[/math]: using the chain rule:[math]\displaystyle{ \frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial \mathbf{a}_1} \cdot \frac{\partial \mathbf{a}_1}{\partial \mathbf{z}_1} \cdot \frac{\partial \mathbf{z}_1}{\partial W_1}. }[/math]
For the gradient of loss w.r.t. first layer activation:[math]\displaystyle{ \frac{\partial L}{\partial \mathbf{a}_1} = \frac{\partial L}{\partial z_2} \cdot \frac{\partial z_2}{\partial \mathbf{a}_1} }[/math]; from the second layer:[math]\displaystyle{ \frac{\partial L}{\partial z_2} = (\hat{y} - y), \quad \frac{\partial z_2}{\partial \mathbf{a}_1} = W_2^T. }[/math] So:[math]\displaystyle{ \frac{\partial L}{\partial \mathbf{a}_1} = (\hat{y} - y) \cdot W_2^T. }[/math]
For the gradient of activation w.r.t. pre-activation: [math]\displaystyle{ \frac{\partial \mathbf{a}_1}{\partial \mathbf{z}_1} = \mathbf{a}_1 \odot (1 - \mathbf{a}_1), }[/math] where [math]\displaystyle{ \odot }[/math] denotes element-wise multiplication.
For the gradient of pre-activation w.r.t. first layer weights: [math]\displaystyle{ \frac{\partial \mathbf{z}_1}{\partial W_1} = \mathbf{x}^T. }[/math]
For the combine terms: [math]\displaystyle{ \frac{\partial L}{\partial W_1} = \left[ (\hat{y} - y) \cdot W_2^T \right] \odot \left[ \mathbf{a}_1 \odot (1 - \mathbf{a}_1) \right] \cdot \mathbf{x}^T. }[/math]
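The expressions in part (b) can be sanity-checked numerically with a finite-difference comparison. This is a sketch with arbitrary example values for the weights and input; bias terms are omitted, matching the network defined in the question:
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(W1, W2, x, y):
    a1 = sigmoid(W1 @ x)
    y_hat = sigmoid(W2 @ a1)[0]
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))
W2 = rng.normal(size=(1, 2))
x, y = np.array([0.5, -1.0]), 1

a1 = sigmoid(W1 @ x)
y_hat = sigmoid(W2 @ a1)[0]
dW2 = (y_hat - y) * a1                                         # gradient for W2
dW1 = np.outer((y_hat - y) * W2.flatten() * a1 * (1 - a1), x)  # gradient for W1

eps = 1e-6
W1_perturbed = W1.copy()
W1_perturbed[0, 0] += eps
numeric = (loss(W1_perturbed, W2, x, y) - loss(W1, W2, x, y)) / eps
print(dW1[0, 0], numeric)   # the two values should agree closely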
Exercise 3.12
Level: ** (Moderate)
Exercise Types: Novel
Question
Consider the Bent Identity loss function, a smooth approximation of the absolute loss, defined as:
[math]\displaystyle{ l(y, \hat{y}) = \sqrt{1 + (\hat{y} - y)^2} - 1 }[/math]
The following identities may be useful:
[math]\displaystyle{ \frac{d}{dx} \sqrt{1 + x^2} = \frac{x}{\sqrt{1 + x^2}}, \quad \frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{\partial u}{\partial x}. }[/math]
where [math]\displaystyle{ y }[/math] is the true response, and [math]\displaystyle{ \hat{y} = w^T x + b }[/math] is the predicted response for a feature vector [math]\displaystyle{ x }[/math] given model parameters [math]\displaystyle{ w }[/math] and [math]\displaystyle{ b }[/math].
Part (a): Compute the Gradient
Find the gradient of the loss function with respect to [math]\displaystyle{ w }[/math] and [math]\displaystyle{ b }[/math].
Part (b): Implement Gradient Descent
Using your result from Part (a), write the update rules for gradient descent and implement the iterative optimization process.
Solution
Step 1: Computing the Gradient
We differentiate the loss function:
[math]\displaystyle{ l(y, \hat{y}) = \sqrt{1 + (\hat{y} - y)^2} - 1. }[/math]
Differentiating with respect to [math]\displaystyle{ w }[/math] and [math]\displaystyle{ b }[/math], we obtain:
[math]\displaystyle{ \nabla_w l = \frac{(\hat{y} - y) x}{\sqrt{1 + (\hat{y} - y)^2}} }[/math]
[math]\displaystyle{ \nabla_b l = \frac{\hat{y} - y}{\sqrt{1 + (\hat{y} - y)^2}} }[/math]
Step 2: Gradient Descent Update Rules
We update the parameters using gradient descent:
[math]\displaystyle{ w_{t+1} = w_t - \eta \nabla_w l }[/math]
[math]\displaystyle{ b_{t+1} = b_t - \eta \nabla_b l }[/math]
where [math]\displaystyle{ \eta }[/math] is the learning rate.
Step 3: Algorithm Implementation
Initialize w, b randomly
Set learning rate η
For t = 1 to max_iterations:
    Compute predicted value: y_hat = w^T * x + b
    Compute gradients:
        grad_w = ((y_hat - y) / sqrt(1 + (y_hat - y)^2)) * x
        grad_b = (y_hat - y) / sqrt(1 + (y_hat - y)^2)
    Update parameters:
        w = w - η * grad_w
        b = b - η * grad_b
    Check for convergence
Exercise 3.13
Level: ** (Difficult)
Exercise Types: Novel
Question
Consider training a deep neural network with momentum-based SGD on a quadratic approximation of the loss near a local minimum, [math]\displaystyle{ L(\theta) = \tfrac{1}{2}\theta^\top H \theta }[/math], where [math]\displaystyle{ H }[/math] is the Hessian. Explain how the momentum term modifies the effective condition number of [math]\displaystyle{ H }[/math], and why this can speed convergence in directions with small curvature while controlling overshoot in directions with large curvature. Provide a brief analysis of the discrete iteration dynamics that illustrates this effect.
Solution
Let [math]\displaystyle{ \theta_t }[/math] be the parameters at iteration [math]\displaystyle{ t }[/math], and [math]\displaystyle{ v_t }[/math] be the velocity term. The momentum-based SGD updates for a quadratic loss can be written as:
[math]\displaystyle{ v_{t+1} = \beta\,v_t + \eta\,H\,\theta_t, \quad \theta_{t+1} = \theta_t - v_{t+1}, }[/math]
where [math]\displaystyle{ \beta }[/math] is the momentum coefficient and [math]\displaystyle{ \eta }[/math] is the learning rate.
In the eigenbasis of [math]\displaystyle{ H }[/math], let [math]\displaystyle{ \lambda_i }[/math] be an eigenvalue and [math]\displaystyle{ u_i }[/math] the corresponding eigenvector.
Projecting the iteration onto the direction [math]\displaystyle{ u_i }[/math] yields a scalar recurrence of the form:
[math]\displaystyle{ \theta_{t+1}^{(i)} \;=\; \theta_t^{(i)} \;-\; \bigl[\beta\,v_t^{(i)} \;+\;\eta\,\lambda_i\,\theta_t^{(i)}\bigr]. }[/math]
Because [math]\displaystyle{ v_t^{(i)} }[/math] itself depends on past gradients, the combined effect of [math]\displaystyle{ \beta }[/math] and [math]\displaystyle{ \eta\lambda_i }[/math] modifies the “effective” eigenvalue seen in that direction. Specifically:
Small [math]\displaystyle{ \lambda_i }[/math] (Flat Directions)
When [math]\displaystyle{ \lambda_i }[/math] is small, repeated gradient directions are reinforced by [math]\displaystyle{ \beta }[/math], accelerating convergence compared to vanilla SGD. Effectively, momentum increases the update step in directions that change slowly.
Large [math]\displaystyle{ \lambda_i }[/math] (Steep Directions)
If [math]\displaystyle{ \lambda_i }[/math] is large, the term [math]\displaystyle{ \beta\,v_t^{(i)} }[/math] moderates the sudden jumps, helping to avoid overshooting. The velocity “remembers” past updates, dampening abrupt swings caused by steep curvature.
Overall, momentum alters the eigenvalues of [math]\displaystyle{ H }[/math] into a more favorable spectrum, reducing the effective condition number. In practice, this translates into faster convergence along flat directions and controlled progress in steep directions, both of which are crucial in the highly non-convex landscapes typical of deep neural networks.
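A small numerical sketch of the scalar recurrence above illustrates both effects; the values of η, β and the two curvatures λ below are arbitrary choices for illustration:
def run(lam, beta, eta=0.1, steps=50):
    theta, v = 1.0, 0.0
    for _ in range(steps):
        v = beta * v + eta * lam * theta   # v_{t+1} = beta * v_t + eta * lambda_i * theta_t
        theta = theta - v                  # theta_{t+1} = theta_t - v_{t+1}
    return abs(theta)

for lam in (0.1, 21.0):          # a flat direction and a steep direction
    for beta in (0.0, 0.9):      # no momentum vs. momentum
        print(f"lambda = {lam:5.1f}, beta = {beta}: |theta_50| = {run(lam, beta):.3e}")
# With lambda = 0.1, momentum converges much faster than plain gradient descent.
# With lambda = 21 and this learning rate, plain gradient descent oscillates and
# diverges, while the momentum iterate remains stable.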
Exercise 3.14
Level: * (Easy)
Exercise Types: Novel
Question
Perform one step of backpropagation for the following network:
- Input layer: 2 neurons
- Hidden layer: 2 neurons
- Output layer: 1 neuron
- Activation function: sigmoid
- Weight matrix between the input and hidden layer: [math]\displaystyle{ W_1=\begin{bmatrix} 0.4 & -0.2 \\ 0.6 & 0.3 \end{bmatrix} }[/math]
- Weight matrix between the hidden layer and output layer: [math]\displaystyle{ W_2=\begin{bmatrix} 0.5 \\ -0.3 \end{bmatrix} }[/math]
- Input:[math]\displaystyle{ X=\begin{bmatrix} 0.5 \\ 0.1 \end{bmatrix} }[/math]
- Target output: y = 0.8
- Learning rate: η = 0.1
Solution
Step 1: Forward Pass
Hidden Layer Outputs
[math]\displaystyle{ a_1= 0.4*0.5 +(-0.20)*0.10 = 0.18 }[/math]
[math]\displaystyle{ z_1 = \sigma(a_1) = \frac{1}{1 + e^{-0.18}} \approx 0.545 }[/math]
[math]\displaystyle{ a_2 =0.5*0.6 + 0.3*0.1 = 0.33 }[/math]
[math]\displaystyle{ z_2 = \sigma(a_2) = \frac{1}{1 + e^{-0.33}} \approx 0.582 }[/math]
Output Layer
[math]\displaystyle{ a_3 = 0.5*(0.545) +(-0.3)*0.582=0.0979 }[/math]
[math]\displaystyle{ \hat{y} = \sigma(a_3) = \frac{1}{1 + e^{-0.0979}} \approx 0.5245 }[/math]
Step 2: Compute Error
[math]\displaystyle{ L = \frac{1}{2}(\hat{y} - y)^2 = \frac{1}{2}(0.5245 - 0.8)^2 \approx 0.0380 }[/math]
Step 3: Backpropagation
Gradients for Output Layer
[math]\displaystyle{ \delta_3 = \hat{y} - y = -0.2755 }[/math] (for simplicity, the derivative of the output sigmoid is not carried through here; including it would scale every gradient below by [math]\displaystyle{ \hat{y}(1-\hat{y}) }[/math])
Update weights for hidden-to-output layer:
[math]\displaystyle{ W’_{2[1]} = W_{2[1]}- \eta \cdot \delta_3 \cdot z_1 = 0.5 - 0.1 \cdot (-0.2755) \cdot 0.545 = 0.515 }[/math] [math]\displaystyle{ W’_{2[2]} = W_{2[2]} - \eta \cdot \delta_3 \cdot z_2 = -0.3 - 0.1 \cdot (-0.2755) \cdot 0.582 = -0.284 }[/math]
Gradients for Hidden Layer
[math]\displaystyle{ \delta_1 = \sigma’(a_1) \cdot \delta_3 \cdot W_{2[1]} = 0.545 \cdot (1 - 0.545) \cdot (-0.2755) \cdot 0.5 = -0.0342 }[/math]
[math]\displaystyle{ \delta_2 = \sigma’(a_2) \cdot \delta_3 \cdot W_{2[2]} = 0.582 \cdot (1 - 0.582) \cdot (-0.2755) \cdot (-0.3) = 0.0201 }[/math]
Update weights for the input-to-hidden layer:
[math]\displaystyle{ W’_{1[11]} = W_{1[11]} - \eta \cdot \delta_1 \cdot X_{1} = 0.4 - 0.1 \cdot (-0.0342) \cdot 0.5 = 0.4017 }[/math]
[math]\displaystyle{ W’_{1[12]} = W_{1[12]} - \eta \cdot \delta_1 \cdot X_{2} = -0.2 - 0.1 \cdot (-0.0342) \cdot 0.1 = -0.1997 }[/math]
[math]\displaystyle{ W’_{1[21]} = W_{1[21]} - \eta \cdot \delta_2 \cdot X_{1} = 0.6 - 0.1 \cdot 0.0201 \cdot 0.5 = 0.5990 }[/math]
[math]\displaystyle{ W’_{1[22]} = W_{1[22]} - \eta \cdot \delta_2 \cdot X_{2} = 0.3 - 0.1 \cdot 0.0201 \cdot 0.1 = 0.2998 }[/math]
Exercise 3.15
Level: ** (Moderate)
Exercise Types: Modified (Based on ME 780 Assignment 2, University of Waterloo, Fall 2024 - in the assignment, backpropagation was programmed for a 2D vector field with a deeper network using MATLAB not Python. This was modified to use Python with a simpler network as an easier exercise to understand backpropagation. A sigmoid activation function is used rather than tanh)
References:
A. Ghodsi, STAT 940 Deep Learning: Lecture 3, University of Waterloo, Winter 2025.
W. Melek, ME 780 Computational Intelligence Chapter 6 Neural Network Parameter Learning Algorithms Course Notes, University of Waterloo, Fall 2024
Question
Define a function [math]\displaystyle{ f(x) = x^2 }[/math] on the domain (0,1).
Develop a 3 layer feedforward neural network with 10 neurons in the second layer to predict the output of this function. Use a sigmoid activation function for the hidden layer, and a linear activation function for the output layer.
Manually code backpropagation to learn the weights for this network using Python.
Use stochastic gradient descent, as discussed in Lecture 3 of STAT 940.
Solution
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 50).reshape(50, 1)
y = x**2

#Define a matrix with the weights and biases for the hidden layer
w_1 = np.random.rand(10, 1)  #10 is the number of neurons in the hidden layer, 1 is the number of neurons in the input layer
b_1 = np.random.rand(10, 1)
#Define a matrix with weights and bias for the output layer
w_2 = np.random.rand(1, 10)  #1 is the number of neurons in the output layer, 10 is the number of neurons in the hidden layer
b_2 = np.random.rand(1, 1)

#Sigmoid function in Python, credit https://stackoverflow.com/questions/60746851/sigmoid-function-in-numpy
def sigmoid(z):
    return 1/(1 + np.exp(-z))

def nn_output(w_1, b_1, w_2, b_2, x):
    hidden_layer_output = sigmoid(np.matmul(w_1, np.transpose(x)) + b_1)
    return (np.matmul(w_2, hidden_layer_output) + b_2), hidden_layer_output

#Plot the desired function and the neural network output before training
plt.figure(1, figsize=(7, 6))
plt.plot(x, y)
plt.plot(x, nn_output(w_1, b_1, w_2, b_2, x)[0].flatten())
plt.title('Neural Network Output Before Training')
plt.legend(['Desired Function', 'Neural Network Output'])
plt.show()

learning_rate = 0.1
x2 = np.linspace(0, 1, 50)
#Do 1000 epochs
for i in range(1000):
    #Shuffle x2 for each epoch
    np.random.shuffle(x2)
    #For each epoch, do stochastic gradient descent, looping through each element in x
    for j in range(x2.shape[0]):
        #Evaluate the neural network output at x to get the error of the output
        y_nn = nn_output(w_1, b_1, w_2, b_2, x2[j].reshape(1, 1))
        #Update the weights and bias in the output layer
        #Note - these equations for the output layer only work for a linear activation function, otherwise you'd have to use the chain rule
        #delta is the error in the output
        delta_output = (x2[j]**2 - y_nn[0])
        #Bias update is previous bias + learning rate * delta
        b_2 = b_2 + learning_rate*delta_output
        #Weight update is previous weight + learning rate * delta * input (input is the hidden layer neuron output)
        w_2 = w_2 + np.matmul(learning_rate*delta_output, np.transpose(y_nn[1]))
        #Update the weights and bias in the hidden layer
        #The derivative of the sigmoid function is f'(x) = y(1-y)
        delta_hidden = y_nn[1]*(1 - y_nn[1])*np.matmul(np.transpose(w_2), delta_output)
        #Bias update as above
        b_1 = b_1 + learning_rate*delta_hidden
        #Weight update as above (the input to the hidden layer is x itself)
        w_1 = w_1 + learning_rate*np.matmul(delta_hidden, x2[j].reshape(1, 1))

#Plot the desired function and the neural network output after training
plt.figure(1, figsize=(7, 6))
plt.plot(x, y)
plt.plot(x, nn_output(w_1, b_1, w_2, b_2, x)[0].flatten())
plt.title('Neural Network Output After Training')
plt.legend(['Desired Function', 'Neural Network Output'])
plt.show()
Exercise 3.16
Level: * (Moderate)
Exercise Types: Copied
Reference: Calin, Ovidiu. Deep learning architectures: A mathematical approach. Springer, 2020
This question is from exercise 6.6.9 on page 198.
Question
Consider a one-hidden layer neural network with sigmoid neurons in the hidden layer. Given that the input is normally distributed, [math]\displaystyle{ X \sim N(0,1) }[/math], and the output is [math]\displaystyle{ Y=\sum_{i=1}^N\alpha_i\sigma(w_i X+b_i) }[/math]. Show that [math]\displaystyle{ Var(Y) }[/math] is approximately [math]\displaystyle{ \sum_i \sigma '(b_i)^2\alpha_i^2w_i^2 }[/math].
Solution
The variance of Y is given by [math]\displaystyle{ Var(Y) = Var\left(\sum_{i=1}^N\alpha_i\sigma(w_i X+b_i)\right) }[/math]
Since the [math]\displaystyle{ \sigma(w_i X+b_i) }[/math] terms are not independent, it can be hard to decompose. However, we can use a linear approximation as follows for small values [math]\displaystyle{ w_i X }[/math] around [math]\displaystyle{ b_i }[/math]:
[math]\displaystyle{ \sigma(w_i X+b_i) \approx \sigma(b_i) + \sigma '(b_i) w_iX }[/math]
Therefore, we have [math]\displaystyle{ Var(Y) \approx Var\left(\sum_{i=1}^N\alpha_i(\sigma(b_i) + \sigma '(b_i) w_iX)\right) = Var\left(\sum_{i=1}^N\alpha_i\sigma '(b_i) w_iX\right) = \left(\sum_{i=1}^N\alpha_i\sigma '(b_i) w_i\right)^2 }[/math] since [math]\displaystyle{ X \sim N(0,1) }[/math]
This is approximately equal to [math]\displaystyle{ \sum_i \sigma '(b_i)^2\alpha_i^2w_i^2 }[/math] if we ignore the cross terms. Note that the squared terms dominate the cross terms since the squared terms are always positive, and we assume weights are small.
Exercise 3.17
Level: * (Moderate)
Exercise Types: Modified
Question
Consider the loss function [math]\displaystyle{ Q(w) = w^2 + 3w + 5 }[/math].
Compute the gradient of [math]\displaystyle{ Q(w) }[/math]. Starting from [math]\displaystyle{ w_0 = -2 }[/math], perform three iterations of stochastic gradient descent using a learning rate [math]\displaystyle{ \rho = 0.15 }[/math]. Explain how the choice of [math]\displaystyle{ \rho }[/math] influences the stability and speed of convergence for this loss function.
Solution
Compute the gradient of [math]\displaystyle{ Q(w) }[/math]: [math]\displaystyle{ \nabla Q(w) = \frac{d}{dw}(w^2 + 3w + 5) = 2w + 3 }[/math]
At [math]\displaystyle{ w_0 = -2 }[/math]: [math]\displaystyle{ \nabla Q(w_0) = 2(-2) + 3 = -4 + 3 = -1 }[/math] Update [math]\displaystyle{ w }[/math]: [math]\displaystyle{ w_1 = w_0 - \rho \cdot \nabla Q(w_0) = -2 - 0.15 \cdot (-1) = -2 + 0.15 = -1.85 }[/math]
At [math]\displaystyle{ w_1 = -1.85 }[/math]: [math]\displaystyle{ \nabla Q(w_1) = 2(-1.85) + 3 = -3.7 + 3 = -0.7 }[/math] Update [math]\displaystyle{ w }[/math]: [math]\displaystyle{ w_2 = w_1 - \rho \cdot \nabla Q(w_1) = -1.85 - 0.15 \cdot (-0.7) = -1.85 + 0.105 = -1.745 }[/math]
At [math]\displaystyle{ w_2 = -1.745 }[/math]: [math]\displaystyle{ \nabla Q(w_2) = 2(-1.745) + 3 = -3.49 + 3 = -0.49 }[/math] Update [math]\displaystyle{ w }[/math]: [math]\displaystyle{ w_3 = w_2 - \rho \cdot \nabla Q(w_2) = -1.745 - 0.15 \cdot (-0.49) = -1.745 + 0.0735 = -1.6715 }[/math]
Effect of [math]\displaystyle{ \rho }[/math]: With [math]\displaystyle{ \rho = 0.15 }[/math], the convergence is stable, but it might require more iterations for steeper gradients. A smaller [math]\displaystyle{ \rho }[/math] (e.g., [math]\displaystyle{ \rho = 0.05 }[/math]) would slow down the updates, increasing the number of iterations required to reach the minimum. A larger [math]\displaystyle{ \rho }[/math] (e.g., [math]\displaystyle{ \rho = 0.5 }[/math]) could lead to overshooting or divergence, especially near sharp curvatures in the loss function.
Exercise 3.18
Level: * (Easy)
Exercise Types: Novel
Question
In a feedforward neural network, explain why introducing more hidden layers can potentially improve the network's capacity to model complex functions. However, why might adding too many hidden layers degrade the model's performance or make training difficult?
Solution
Adding more hidden layers increases the expressive power of the network, enabling it to approximate more complex functions. This stems from the Universal Approximation Theorem, which states that a sufficiently large feedforward neural network with non-linear activation functions can approximate any continuous function on a compact subset of ℝ^n. Each additional layer allows the network to learn and represent features at different levels of abstraction, with earlier layers capturing simpler patterns and deeper layers identifying more complex relationships.
However, adding too many hidden layers introduces several challenges:
- Vanishing/Exploding Gradients: During backpropagation, the gradients of the loss function with respect to weights in earlier layers can diminish (vanishing) or grow uncontrollably (exploding). This makes it difficult to update weights effectively and slows down or destabilizes training.
- Overfitting: Excessively deep networks with many parameters are prone to overfitting, especially if the training data is insufficient or noisy. The network may memorize the training data instead of generalizing well to unseen data.
- Computational Cost: Deeper networks require more computation, leading to longer training times and higher resource demands, which might be inefficient for certain applications.
- Optimization Challenges: Deep networks create highly non-convex loss landscapes, increasing the risk of getting stuck in poor local minima or saddle points, making convergence to a good solution more challenging.
- Diminishing Returns: Beyond a certain depth, additional layers may no longer contribute significantly to the network's ability to learn, resulting in wasted computational resources.
Exercise 3.19
Level: * (Easy)
Exercise Types: Modified
References: Calin, Ovidiu. Deep learning architectures: A mathematical approach. Springer, 2020, Exercise 4.17.2, Page 130.
Question
Consider the quadratic function [math]\displaystyle{ Q(\mathbf{x})=\frac{1}{2}\mathbf{x}^T A \mathbf{x}-\mathbf{b}^T \mathbf{x} }[/math], with [math]\displaystyle{ A }[/math] a nonsingular, symmetric square matrix of order [math]\displaystyle{ n }[/math].
(1) Find the gradient.
(2) Write down the update equation using standard gradient descent and momentum
Solution
(1) [math]\displaystyle{ \nabla Q(x)= A \mathbf{x}-b }[/math]
(2) Update equation given by gradient descent
[math]\displaystyle{ \mathbf{x}_{t+1}=\mathbf{x}_{t}-\rho\nabla Q(x_t)=\mathbf{x}_{t}-\rho (A\mathbf{x}_t-b)=(I-\rho A)\mathbf{x}_t+\rho b }[/math]
Update equation given by momentum [math]\displaystyle{ \mathbf{v}_{t+1}=\beta\mathbf{v}_{t}+(1-\beta)\nabla Q(\mathbf{x}_t) = \beta \mathbf{v}_{t}+(1-\beta)(A\mathbf{x}_t-b) }[/math]
[math]\displaystyle{ \mathbf{x}_{t+1}=\mathbf{x}_{t}-\rho\mathbf{v}_{t+1}=(I-\rho(1-\beta)A)\mathbf{x}_t-\rho\beta\mathbf{v}_t+\rho(1-\beta)b }[/math]
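As an illustration, the two update equations can be run side by side on an assumed example; the matrix A, vector b, and hyperparameters below are arbitrary choices, with A symmetric positive definite so the quadratic has a unique minimiser:
import numpy as np

A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -1.0])
rho, beta = 0.1, 0.9

x_gd = np.zeros(2)
x_mom, v = np.zeros(2), np.zeros(2)
for t in range(200):
    x_gd = x_gd - rho * (A @ x_gd - b)              # standard gradient descent
    v = beta * v + (1 - beta) * (A @ x_mom - b)     # momentum (the form used above)
    x_mom = x_mom - rho * v

print(np.linalg.solve(A, b))   # exact minimiser A^{-1} b
print(x_gd, x_mom)             # both iterates approach it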
Exercise 3.20
Level: * (Easy)
Exercise Types: Novel
Question
Consider a simple feedforward neural network with one hidden layer. The network takes an input vector [math]\displaystyle{ x = [x_1, x_2, ..., x_n] }[/math], passes it through a hidden layer with activation function [math]\displaystyle{ \sigma }[/math], and produces an output [math]\displaystyle{ y_{\text{pred}} }[/math] via a linear output layer.
The architecture of the network is as follows:
- Input Layer: [math]\displaystyle{ x = [x_1, x_2, ..., x_n] }[/math]
- Hidden Layer: [math]\displaystyle{ h = \sigma(W_1 x + b_1) }[/math], where [math]\displaystyle{ W_1 \in \mathbb{R}^{m \times n} }[/math] and [math]\displaystyle{ b_1 \in \mathbb{R}^m }[/math] are the weights and bias of the hidden layer,
- Output Layer: [math]\displaystyle{ y_{\text{pred}} = W_2 h + b_2 }[/math], where [math]\displaystyle{ W_2 \in \mathbb{R}^{1 \times m} }[/math] and [math]\displaystyle{ b_2 \in \mathbb{R} }[/math] are the weights and bias of the output layer.
Given the mean squared error (MSE) loss function: [math]\displaystyle{ L = \frac{1}{2} (y_{\text{true}} - y_{\text{pred}})^2 }[/math] where [math]\displaystyle{ y_{\text{true}} }[/math] is the true target value.
Derive the backpropagation equations for updating the weights [math]\displaystyle{ W_1, W_2 }[/math] and biases [math]\displaystyle{ b_1, b_2 }[/math] using gradient descent.
Solution
Step 1: Gradients with respect to [math]\displaystyle{ W_2 }[/math] and [math]\displaystyle{ b_2 }[/math]
The predicted output [math]\displaystyle{ y_{\text{pred}} }[/math] is given by: [math]\displaystyle{ y_{\text{pred}} = W_2 h + b_2 }[/math]
The derivative of the loss function with respect to [math]\displaystyle{ y_{\text{pred}} }[/math] is: [math]\displaystyle{ \frac{\partial L}{\partial y_{\text{pred}}} = -(y_{\text{true}} - y_{\text{pred}}) }[/math]
Now, compute the gradient of the loss with respect to [math]\displaystyle{ W_2 }[/math] and [math]\displaystyle{ b_2 }[/math]:
For [math]\displaystyle{ W_2 }[/math]: [math]\displaystyle{ \frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial W_2} = -(y_{\text{true}} - y_{\text{pred}}) \cdot h }[/math]
For [math]\displaystyle{ b_2 }[/math]: [math]\displaystyle{ \frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial b_2} = -(y_{\text{true}} - y_{\text{pred}}) }[/math]
Step 2: Gradients with respect to [math]\displaystyle{ W_1 }[/math] and [math]\displaystyle{ b_1 }[/math]
Next, we compute the gradients with respect to the hidden layer parameters.
The hidden layer activation is: [math]\displaystyle{ h = \sigma(W_1 x + b_1) }[/math]
Using the chain rule, we first compute [math]\displaystyle{ \frac{\partial L}{\partial h} }[/math]: [math]\displaystyle{ \frac{\partial L}{\partial h} = \frac{\partial L}{\partial y_{\text{pred}}} \cdot \frac{\partial y_{\text{pred}}}{\partial h} = -(y_{\text{true}} - y_{\text{pred}}) \cdot W_2^\top }[/math]
Then, we compute the gradients with respect to [math]\displaystyle{ W_1 }[/math] and [math]\displaystyle{ b_1 }[/math], where [math]\displaystyle{ \odot }[/math] denotes element-wise multiplication:
For [math]\displaystyle{ W_1 }[/math]: [math]\displaystyle{ \frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h} \cdot \frac{\partial h}{\partial W_1} = \left[ -(y_{\text{true}} - y_{\text{pred}}) \cdot W_2^\top \odot \sigma'(W_1 x + b_1) \right] x^\top }[/math]
For [math]\displaystyle{ b_1 }[/math]: [math]\displaystyle{ \frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial h} \cdot \frac{\partial h}{\partial b_1} = -(y_{\text{true}} - y_{\text{pred}}) \cdot W_2^\top \odot \sigma'(W_1 x + b_1) }[/math]
Step 3: Update Equations
Using the gradients derived above, the updates for the weights and biases using gradient descent are:
For [math]\displaystyle{ W_2 }[/math]: [math]\displaystyle{ W_2^{(t+1)} = W_2^{(t)} - \eta \frac{\partial L}{\partial W_2} }[/math]
For [math]\displaystyle{ b_2 }[/math]: [math]\displaystyle{ b_2^{(t+1)} = b_2^{(t)} - \eta \frac{\partial L}{\partial b_2} }[/math]
For [math]\displaystyle{ W_1 }[/math]: [math]\displaystyle{ W_1^{(t+1)} = W_1^{(t)} - \eta \frac{\partial L}{\partial W_1} }[/math]
For [math]\displaystyle{ b_1 }[/math]: [math]\displaystyle{ b_1^{(t+1)} = b_1^{(t)} - \eta \frac{\partial L}{\partial b_1} }[/math]
Where [math]\displaystyle{ \eta }[/math] is the learning rate.
Exercise 3.21
Level: ** (Moderate)
Exercise Types: Novel
Question
There are many choices of activation functions in Feedforward Neural Networks, such as the sigmoid functions and hyperbolic tangent. This question considers other possibilities.
(a) Suppose a and b are constants. Is [math]\displaystyle{ a \tanh(b x) }[/math] a good candidate?
(b) Is [math]\displaystyle{ \sin(x) }[/math] a good activation function?
Solution
(a) It could be a good candidate, but the main effect would be similar to [math]\displaystyle{ \tanh(x) }[/math]. The addition of constants can scale the intermediate values within the networks and therefore can affect the convergence rate, akin to the effect of batch normalization.
(b) It may not be a good choice of activation function. Although it introduces nonlinearity, its periodicity also introduces many local minima into the loss surface, making it harder for the optimizer to escape them.
(As an aside, smaller mini-batch sizes tend to help in such highly non-convex problems, because their noisier gradient estimates make it easier to escape local minima; however, batches that are too small may prevent the model from converging stably.)
Exercise 3.22
Level: * (Easy)
Exercise Types: Novel
Question
Assume you are training a neural network model with gradient descent. There is a dataset with 1000 samples.
1. If you choose a mini-batch size of 100, how many times will the weights be updated during one epoch of training?
2. If the mini-batch size is 50, how will the number of updates change? Why?
Solution
1. The dataset has 1000 samples, and each mini-batch has 100 samples. Thus, in one pass over the data (one epoch), the number of weight updates will be:
[math]\displaystyle{ \frac{1000}{100} = 10 }[/math] updates.
2. If the mini-batch size is 50, the number of weight updates will be:
[math]\displaystyle{ \frac{1000}{50} = 20 }[/math] updates.
Therefore, the number of updates increases: a smaller mini-batch results in more frequent updates, but each update uses fewer samples.
Exercise 3.23
Level: ** (Moderate)
Exercise Types: Novel
Question
Given a simple linear regression model [math]\displaystyle{ y = w x + b }[/math] and a single training example [math]\displaystyle{ (x=2, y=4) }[/math], show how to perform one Stochastic Gradient Descent update step for [math]\displaystyle{ w }[/math] and [math]\displaystyle{ b }[/math]. Suppose: [math]\displaystyle{ w=1, \quad b=0, \quad L = \frac{1}{2}\bigl(y_{\mathrm{pred}} - y\bigr)^{2}, \quad \eta = 0.1. }[/math]
Solution
[math]\displaystyle{ \begin{aligned} y_{\text{pred}} & = w \cdot x + b = 1 \cdot 2 + 0 = 2, \\ \text{error} & = y_{\text{pred}} - y = 2 - 4 = -2, \\ \frac{\partial L}{\partial w} & = (y_{\text{pred}} - y) \cdot x = (-2) \cdot 2 = -4, \\ \frac{\partial L}{\partial b} & = (y_{\text{pred}} - y) = -2, \\ w_{\text{new}} & = w - \eta \cdot \frac{\partial L}{\partial w} = 1 - 0.1 \cdot (-4) = 1.4, \\ b_{\text{new}} & = b - \eta \cdot \frac{\partial L}{\partial b} = 0 - 0.1 \cdot (-2) = 0.2. \end{aligned} }[/math]
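The same step can be checked numerically; a short Python sketch reproducing the numbers above:

# One SGD step for y = w*x + b with L = 0.5*(y_pred - y)^2
w, b, eta = 1.0, 0.0, 0.1
x, y = 2.0, 4.0

y_pred = w * x + b          # 2.0
error = y_pred - y          # -2.0
grad_w = error * x          # -4.0
grad_b = error              # -2.0
w_new = w - eta * grad_w    # 1.4
b_new = b - eta * grad_b    # 0.2
print(w_new, b_new)         # 1.4 0.2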
Exercise 3.24
Level: ** (Moderate)
Exercise Types: Modified
References: Deep Learning - Foundations and Concepts, by Christopher M. Bishop and Hugh Bishop, Exercise 7.1, Page 212.
Question
Prove that the gradient descent update used in Stochastic Gradient Descent (SGD) reduces the value of a differentiable objective function [math]\displaystyle{ Q(w) }[/math] with gradient [math]\displaystyle{ \nabla Q(w) }[/math]. The goal of the proof is to show that for a sufficiently small learning rate [math]\displaystyle{ \eta }[/math], the weight update rule: [math]\displaystyle{ w_{t+1} = w_{t} - \eta \nabla Q(w_{t}) }[/math] ensures that the objective function [math]\displaystyle{ Q(w) }[/math] decreases monotonically.
Solution
1. Change in the Objective Function:
The change in the objective function value can be expressed as:
[math]\displaystyle{ Q(w_{t+1}) - Q(w_{t}) }[/math]
Using the first-order Taylor expansion of Q(w), we can approximate it as:
[math]\displaystyle{ Q(w_{t+1}) \approx Q(w_{t}) + \nabla Q(w_{t})^T(w_{t+1}-w_{t}) }[/math]
2. Substitute the update rule: Substituting the weight update rule [math]\displaystyle{ w_{t+1} = w_{t} - \eta \nabla Q(w_{t}) }[/math], we get:
[math]\displaystyle{ Q(w_{t+1}) \approx Q(w_{t}) - \eta \nabla Q(w_{t})^T \nabla Q(w_{t}) }[/math]
3. Gradient Property:
The quadratic term involving the gradient can be written as:
[math]\displaystyle{ \nabla Q(w_{t})^T \nabla Q(w_{t}) = \| \nabla Q(w_{t}) \|^2 }[/math]
Therefore, [math]\displaystyle{ Q(w_{t+1}) - Q(w_{t}) \approx - \eta \| \nabla Q(w_{t}) \|^2 }[/math]
4. Conclusion:
Since the learning rate [math]\displaystyle{ \eta \gt 0 }[/math] and [math]\displaystyle{ \| \nabla Q(w_{t}) \|^2 \geq 0 }[/math],
we have: [math]\displaystyle{ Q(w_{t+1}) - Q(w_{t}) \leq 0 }[/math].
As long as the learning rate [math]\displaystyle{ \eta }[/math] is sufficiently small, the objective function Q(w) will decrease monotonically, and SGD will converge to a local minimum of Q(w).
This proof shows that with an appropriately chosen learning rate, SGD guarantees a reduction in the objective function at each step, making it a reliable optimization method for machine learning tasks.
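As a quick numerical illustration (not part of the proof), the following Python sketch applies the update rule to the convex, smooth objective [math]\displaystyle{ Q(w) = \|w\|^2 }[/math]; with a small learning rate the objective value decreases at every step:

import numpy as np

def Q(w):
    return np.sum(w ** 2)

def grad_Q(w):
    return 2 * w

w = np.array([3.0, -2.0])
eta = 0.1  # small relative to the smoothness constant L = 2
values = [Q(w)]
for _ in range(20):
    w = w - eta * grad_Q(w)
    values.append(Q(w))

# Each step decreases the objective
assert all(b <= a for a, b in zip(values, values[1:]))
print(values[:5])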
Exercise 3.25
Level: ** (Moderate)
Exercise Types: Modified (Reference: Probabilistic Machine Learning: An Introduction, 13.2.3 Activation Functions)
Question
Consider a single-layer neural network with the ReLU activation function defined as: [math]\displaystyle{ \text{ReLU}(x) = \max(0, x). }[/math]
Given a weight matrix [math]\displaystyle{ W }[/math], an input vector [math]\displaystyle{ x }[/math], and a bias [math]\displaystyle{ b }[/math], the output of the layer is: [math]\displaystyle{ h = \text{ReLU}(Wx + b). }[/math]
Let: [math]\displaystyle{ W = \begin{bmatrix} 1 & -2 \\ 3 & 0 \end{bmatrix}, }[/math] [math]\displaystyle{ x = \begin{bmatrix} 2 \\ -1 \end{bmatrix}, }[/math] [math]\displaystyle{ b = \begin{bmatrix} 1 \\ -3 \end{bmatrix}. }[/math]
1. Compute the output [math]\displaystyle{ h }[/math] of the layer.
2. Derive the gradient of the ReLU activation for the given [math]\displaystyle{ x }[/math] and explain how it behaves for positive and negative values of the input.
Solution
1. The output is computed as: [math]\displaystyle{ h = \text{ReLU}(Wx + b), }[/math] where: [math]\displaystyle{ Wx = \begin{bmatrix} 1 & -2 \\ 3 & 0 \end{bmatrix} \begin{bmatrix} 2 \\ -1 \end{bmatrix} = \begin{bmatrix} 1(2) + (-2)(-1) \\ 3(2) + 0(-1) \end{bmatrix} = \begin{bmatrix} 2 + 2 \\ 6 + 0 \end{bmatrix} = \begin{bmatrix} 4 \\ 6 \end{bmatrix}. }[/math]
Adding the bias: [math]\displaystyle{ Wx + b = \begin{bmatrix} 4 \\ 6 \end{bmatrix} + \begin{bmatrix} 1 \\ -3 \end{bmatrix} = \begin{bmatrix} 5 \\ 3 \end{bmatrix}. }[/math]
Applying ReLU: [math]\displaystyle{ h = \text{ReLU}\left(\begin{bmatrix} 5 \\ 3 \end{bmatrix}\right) = \begin{bmatrix} \max(0, 5) \\ \max(0, 3) \end{bmatrix} = \begin{bmatrix} 5 \\ 3 \end{bmatrix}. }[/math]
2. Gradient of ReLU:
The derivative of ReLU is defined as: [math]\displaystyle{ \text{ReLU}'(x) = \begin{cases} 1 & \text{if } x \gt 0, \\ 0 & \text{if } x \leq 0. \end{cases} }[/math]
For example, take [math]\displaystyle{ x = \begin{bmatrix} 2 \\ -1 \end{bmatrix} }[/math]: since [math]\displaystyle{ 2 \gt 0 }[/math], the gradient for the first component is [math]\displaystyle{ 1 }[/math]; since [math]\displaystyle{ -1 \leq 0 }[/math], the gradient for the second component is [math]\displaystyle{ 0 }[/math].
Hence, the gradient of ReLU activation behaves as a binary switch, passing gradients only for positive values of the input and blocking gradients for non-positive values.
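A small NumPy check of both parts, reproducing the forward pass and the elementwise ReLU derivative from the solution:

import numpy as np

W = np.array([[1.0, -2.0], [3.0, 0.0]])
x = np.array([2.0, -1.0])
b = np.array([1.0, -3.0])

z = W @ x + b                       # [5., 3.]
h = np.maximum(0.0, z)              # ReLU forward: [5., 3.]
relu_grad = (x > 0).astype(float)   # elementwise derivative at x: [1., 0.]
print(h, relu_grad)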
Exercise 3.26
Level: ** (Moderate)
Exercise Types: Novel
Question
Given a single-layer neural network with a sigmoid activation function used for binary classification, the network is trained using stochastic gradient descent (SGD) with the cross-entropy loss function:
[math]\displaystyle{ L = -[y \cdot \log(\hat{y}) + (1 - y) \cdot \log(1 - \hat{y})], }[/math]
where [math]\displaystyle{ y }[/math] is the binary label (0 or 1) and [math]\displaystyle{ \hat{y} }[/math] is the predicted probability.
Assuming the input vector [math]\displaystyle{ x = [0.5, 1.2, -0.3] }[/math], label [math]\displaystyle{ y = 1 }[/math], and learning rate [math]\displaystyle{ \eta = 0.01 }[/math], derive and apply the SGD update rules for a single training instance. Include calculations for the weights and bias from initial random values.
Solution
1. '''Forward Pass Calculation:''' Calculate the output before activation, [math]\displaystyle{ z }[/math], by [math]\displaystyle{ z = w \cdot x + b }[/math] where initial weights [math]\displaystyle{ w = [0.2, -0.1, 0.1] }[/math] and bias [math]\displaystyle{ b = 0.01 }[/math]. Then apply the sigmoid function to obtain the predicted probability [math]\displaystyle{ \hat{y} = \frac{1}{1 + e^{-z}} }[/math].
2. '''Loss and Gradient Calculation:''' Calculate the loss using the cross-entropy formula. Derive the gradient with respect to [math]\displaystyle{ \hat{y} }[/math] as [math]\displaystyle{ \frac{\partial L}{\partial \hat{y}} = -[y \cdot \frac{1}{\hat{y}} - (1 - y) \cdot \frac{1}{1 - \hat{y}}] }[/math] and chain it to get [math]\displaystyle{ \frac{\partial L}{\partial z} = \hat{y} - y }[/math].
3. '''Update Weights and Bias:''' Use the gradient and learning rate to update each weight [math]\displaystyle{ w_i }[/math] and the bias [math]\displaystyle{ b }[/math]: [math]\displaystyle{ w_i = w_i - \eta \cdot \frac{\partial L}{\partial z} \cdot x_i }[/math], [math]\displaystyle{ b = b - \eta \cdot \frac{\partial L}{\partial z} }[/math].
For example, the weight update for [math]\displaystyle{ w_1 }[/math] would be: [math]\displaystyle{ w_1 = 0.2 - 0.01 \cdot (\hat{y} - 1) \cdot 0.5 }[/math], and similarly for other weights and bias.
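Carrying the arithmetic through with the stated initial values, here is a short Python sketch of the full update (the initial weights and bias are those assumed in the solution):

import numpy as np

x = np.array([0.5, 1.2, -0.3])
y = 1.0
w = np.array([0.2, -0.1, 0.1])
b = 0.01
eta = 0.01

z = w @ x + b                          # -0.04
y_hat = 1.0 / (1.0 + np.exp(-z))       # approximately 0.490
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
grad_z = y_hat - y                     # approximately -0.510
w = w - eta * grad_z * x               # e.g. w[0] becomes about 0.2026
b = b - eta * grad_z
print(y_hat, loss, w, b)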
Exercise 3.27
Level: * (Easy)
Exercise Types: Novel
Question
Consider the following two optimization update rules:
Standard Gradient Descent (GD): [math]\displaystyle{ w_{t+1} = w_t - \rho\nabla Q(w_t) }[/math]
Momentum-Based Update: [math]\displaystyle{ v_{t+1} = \beta v_t + (1 - \beta)\nabla Q(w_t) }[/math], [math]\displaystyle{ w_{t+1} = w_t - \rho v_{t + 1} }[/math]
Explain the key differences between standard gradient descent and the momentum-based update in terms of:
1) How the gradient information is used.
2) The behaviour of the optimization process, particularly in flat regions and regions with high curvature.
3) The convergence speed to the minimum.
Solution
Gradient Information Usage:
Standard Gradient Descent: Each update is based solely on the current gradient [math]\displaystyle{ \nabla Q(w_t) }[/math]. It does not account for the gradients from previous steps.
Momentum: Uses a running average of past gradients, [math]\displaystyle{ v_t }[/math], to smooth out updates. This accumulates past gradient information, making the optimization less sensitive to short-term noise in the gradient.
Behaviour in Flat and High-Curvature Regions:
Standard Gradient Descent: Progress in flat regions (e.g., plateaus or saddle points) is slow since updates rely only on the small gradients at each step. In high-curvature regions, it may oscillate across the curvature due to abrupt changes in gradient direction.
Momentum: Momentum accelerates progress in flat regions by building up velocity from consistent gradients, helping the optimizer avoid stalling there. It also reduces oscillations in high-curvature regions by smoothing out the updates, leading to more stable convergence.
Convergence Speed:
Standard Gradient Descent: Typically slower, especially in scenarios where gradients are small or noisy, as it lacks the mechanism to "remember" past gradients.
Momentum: Often converges faster, especially when [math]\displaystyle{ \beta }[/math] is tuned appropriately. The accumulated velocity helps overcome small local minima and speeds up optimization in flat regions.
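For experimentation, a toy Python sketch of both update rules on an elongated quadratic (the objective, learning rate, and [math]\displaystyle{ \beta }[/math] are illustrative choices only):

import numpy as np

# Q(w) = 0.5 * (w1^2 + 25 * w2^2): much higher curvature along w2
def grad(w):
    return np.array([w[0], 25.0 * w[1]])

rho, beta = 0.03, 0.9
w_gd = np.array([5.0, 1.0])
w_mom = np.array([5.0, 1.0])
v = np.zeros(2)

for _ in range(100):
    w_gd = w_gd - rho * grad(w_gd)              # standard GD
    v = beta * v + (1 - beta) * grad(w_mom)     # running average of gradients
    w_mom = w_mom - rho * v                     # momentum update

print("GD:      ", w_gd)
print("Momentum:", w_mom)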
Exercise 3.28
Level: ** (Moderate)
Exercise Types: Novel
References: Adapted from Hastie, T., Tibshirani, R., & Friedman, J. (2021). The Elements of Statistical Learning, pages 174-180.
Question
(a) Starting with the given smoothing matrix formulation for the Reinsch form: \[ S_\lambda = N(N^T N + \lambda \Omega_N)^{-1}N^T, \] derive the simplified Reinsch form under the assumption that \( N \) is invertible. Show step-by-step how this leads to: \[ S_\lambda = (N^{-T}(N^T N + \lambda \Omega_N)N^{-1})^{-1} = (I + \lambda N^{-T} \Omega_N N^{-1})^{-1}. \]
(b) Discuss how wavelet smoothing can be applied to feedforward neural networks to help manage overfitting, especially in scenarios where the data is highly noisy. How does introducing the smoothing parameter \( \lambda \) and the penalty matrix \( \Omega_N \) affect the generalization ability of the neural network model?
Solution
Part (a) Derivation
Given the matrix \( S_\lambda \) as defined above, use the invertibility of \( N \) to factor it: \[ S_\lambda = N(N^T N + \lambda \Omega_N)^{-1}N^T = \bigl(N^{-T}(N^T N + \lambda \Omega_N)N^{-1}\bigr)^{-1}. \] Expanding the inner product gives: \[ N^{-T}(N^T N + \lambda \Omega_N)N^{-1} = I + \lambda N^{-T} \Omega_N N^{-1}, \] so that: \[ S_\lambda = (I + \lambda N^{-T} \Omega_N N^{-1})^{-1}, \] where \( I \) denotes the identity matrix.
Part (b) Application to Neural Networks
In feedforward neural networks, overfitting is a significant challenge when dealing with complex models and noisy data. Wavelet smoothing, applied through the Reinsch form, offers a method to control model complexity by smoothing the learned functions.
The matrix \( N \) typically represents the network's weight matrix, and \( \Omega_N \) acts as a regularization term that penalizes the weight configurations based on their complexity. The smoothing parameter \( \lambda \) adjusts the trade-off between the training data's fidelity and the solution's smoothness.
By incorporating the Reinsch form into the network's training process, the effective degrees of freedom are reduced, leading to smoother function estimates. This reduction helps prevent the network from capturing noise as signal, enhancing its ability to generalize from the training data to unseen data.
The mathematical formulation provided in the exercise guides the understanding of how different components of the regularization term and smoothing parameter interact to influence the network’s learning process, potentially improving prediction accuracy on new, unseen data.
Exercise 3.29
Level: * (Easy)
Exercise Types: Modified - Problem 3.16, Prince, Simon JD. Understanding Deep Learning. MIT Press, 2023
Question
Draw a fully-connected neural network with 2 inputs, 3 hidden units in the first hidden layer, 2 hidden units in the second hidden layer, and 2 outputs. Then, write out the general equations for each layer (i.e. [math]\displaystyle{ h^{(1)}, h^{(2)}, }[/math] and [math]\displaystyle{ y }[/math]), where [math]\displaystyle{ \sigma }[/math] is the activation function used for each layer.
Solution
Below is the drawing of the neural network:
Equations:
[math]\displaystyle{ h^{(1)} = \sigma (W^{(1)}x + b^{(1)}) }[/math]
[math]\displaystyle{ h^{(2)} = \sigma (W^{(2)}h^{(1)} + b^{(2)}) }[/math]
[math]\displaystyle{ y = \sigma (W^{(3)}h^{(2)} + b^{(3)}) }[/math]
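A minimal NumPy sketch of the forward pass through this 2-3-2-2 architecture with a sigmoid activation; the random weights are purely illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 1))                          # 2 inputs
W1, b1 = rng.normal(size=(3, 2)), np.zeros((3, 1))   # first hidden layer: 3 units
W2, b2 = rng.normal(size=(2, 3)), np.zeros((2, 1))   # second hidden layer: 2 units
W3, b3 = rng.normal(size=(2, 2)), np.zeros((2, 1))   # output layer: 2 outputs

h1 = sigmoid(W1 @ x + b1)
h2 = sigmoid(W2 @ h1 + b2)
y = sigmoid(W3 @ h2 + b3)
print(y.ravel())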
Exercise 3.30
Level: ** (Moderate)
Exercise Types: Novel
Question
Suppose [math]\displaystyle{ f: \mathbb{R}^n \to \mathbb{R} }[/math] is differentiable, convex, and [math]\displaystyle{ L }[/math]-smooth. Suppose we are given a starting point [math]\displaystyle{ x^0 }[/math]. Assume that there exists an [math]\displaystyle{ r }[/math] such that [math]\displaystyle{ \{x \in \mathbb{R}^n: f(x) \leq f(x^0)\} \subseteq B(0,r) }[/math].
(a) Show that [math]\displaystyle{ f }[/math] has a minimizer [math]\displaystyle{ x^* }[/math].
Proof. Let [math]\displaystyle{ f: \mathbb{R}^n \to \mathbb{R} }[/math] be differentiable, convex, and [math]\displaystyle{ L }[/math]-smooth. Differentiability implies [math]\displaystyle{ f }[/math] is continuous, and convexity of [math]\displaystyle{ f }[/math] implies that any local minimum is a global minimum for [math]\displaystyle{ f }[/math]. Thus it is enough to show that a local minimum exists. By hypothesis, there exists [math]\displaystyle{ r }[/math] such that [math]\displaystyle{ S = \{x \in \mathbb{R}^n: f(x) \leq f(x^0)\} \subseteq B(0,r) }[/math].
[math]\displaystyle{ S }[/math] is bounded: By definition of a bounded set, a set [math]\displaystyle{ A \subseteq X }[/math] is bounded if there exist [math]\displaystyle{ a \in X }[/math] and [math]\displaystyle{ r \gt 0 }[/math] such that [math]\displaystyle{ A \subseteq B(a,r) }[/math]. Thus, by hypothesis [math]\displaystyle{ S }[/math] is bounded.
[math]\displaystyle{ S }[/math] is closed: A function [math]\displaystyle{ g: X \to Y }[/math] is said to be continuous if for every closed set [math]\displaystyle{ V \subseteq Y }[/math], the inverse image [math]\displaystyle{ g^{-1}(V) = \{x \in X : g(x) \in V\} }[/math] is also a closed subset of [math]\displaystyle{ X }[/math]. Since [math]\displaystyle{ f }[/math] is continuous, the pre-image of a closed set is closed. Setting [math]\displaystyle{ c = f(x^0) }[/math], we have [math]\displaystyle{ S = f^{-1}\bigl((-\infty, c]\bigr) }[/math], which is therefore closed in [math]\displaystyle{ \mathbb{R}^n }[/math].
Now that we have [math]\displaystyle{ S }[/math] is bounded and closed, it follows from the ''Heine–Borel Theorem'' that [math]\displaystyle{ S }[/math] is compact, and then by the ''Extreme Value Theorem'' there exists [math]\displaystyle{ x^* \in S }[/math] such that [math]\displaystyle{ f(x^*) \leq f(x) }[/math] for all [math]\displaystyle{ x \in S }[/math]. By convexity of [math]\displaystyle{ f }[/math], we have that [math]\displaystyle{ f(x^*) \leq f(x) }[/math] for all [math]\displaystyle{ x \in \mathbb{R}^n }[/math], as needed.
(b) Show the following inequality: For any [math]\displaystyle{ x }[/math] such that [math]\displaystyle{ f(x) \leq f(x^0) }[/math], [math]\displaystyle{ f(x) - f(x^*) \leq 2r\|\nabla f(x)\|. }[/math]
Proof. Suppose [math]\displaystyle{ f }[/math] has a minimizer [math]\displaystyle{ x^* }[/math] and let [math]\displaystyle{ f(x) \leq f(x^0) }[/math], so that both [math]\displaystyle{ x }[/math] and [math]\displaystyle{ x^* }[/math] lie in [math]\displaystyle{ S }[/math]. We use the sub-gradient inequality for convex functions: [math]\displaystyle{ f(x^*) \geq f(x) + \nabla f(x)^T (x^* - x), }[/math] which implies: [math]\displaystyle{ f(x^*) - f(x) \geq \nabla f(x)^T (x^* - x). }[/math] Multiplying both sides by [math]\displaystyle{ -1 }[/math] gives: [math]\displaystyle{ f(x) - f(x^*) \leq \nabla f(x)^T (x - x^*). }[/math] Applying the Cauchy-Schwarz inequality: [math]\displaystyle{ f(x) - f(x^*) \leq \nabla f(x)^T (x - x^*) \leq \|\nabla f(x)\| \cdot \|x - x^*\|. }[/math] Applying the triangle inequality: [math]\displaystyle{ \|x - x^*\| \leq \|x\| + \|x^*\|. }[/math]
Since [math]\displaystyle{ x, x^* \in S \subseteq B(0,r) }[/math], we have [math]\displaystyle{ \|x\| \leq r }[/math] and [math]\displaystyle{ \|x^*\| \leq r }[/math]. Thus,
[math]\displaystyle{ f(x) - f(x^*) \leq \|\nabla f(x)\| \cdot (r + r) = 2r\|\nabla f(x)\|. }[/math] This completes the proof.
Exercise 4.1
Level: * (Easy)
Exercise Types: Novel
Question
Explain why SURE is rarely used in large-scale deep learning in practice.
Solution
Note that to compute the model complexity part of the SURE estimator, we have to compute the divergence term:
[math]\displaystyle{ 2\sigma^2 \sum_{i=1}^{n} D_i }[/math]
where
[math]\displaystyle{ D_i=\frac{\partial \hat{f}_i(y)}{\partial y_i} }[/math]
For modern deep learning models, both the number of parameters (weights) and the number of data dimensions are huge (in the millions or billions), which makes computing the divergence term extremely expensive. Note that with stochastic gradient descent the weights are estimated through an iterative process; if we add an inner loop to compute all the divergence terms at every iteration, the computational cost becomes prohibitive.
Additionally, the high-dimensional nature of deep learning models means that computing the divergence requires handling large matrices or tensors, making the task even more resource-intensive. Even though parallel computation techniques like GPU acceleration can reduce the time for individual gradient calculations, the divergence computation still adds significant overhead when applied across all data points and iterations. Moreover, the computational complexity becomes even more challenging when dealing with large-scale datasets and real-time training, where the time needed to calculate divergence can slow down the training process substantially. This makes methods like SURE impractical for real-time or large-scale applications compared to simpler and more efficient alternatives like cross-validation, which does not require computing high-dimensional divergence and is easy to implement, making it a more suitable approach in practice.
However, SURE gives very good insight into the behaviour of the true error, and the divergence term can be viewed as a regularization term. For simpler models such as linear regression, the divergence term is easy to compute, and perturbing a training point does not change the estimated function by much. In more complex models such as deep networks, however, perturbing a training point can change the estimated function substantially, meaning the divergence term will be larger. Often, we add a regularization term to the loss function that mimics the behaviour of the divergence term, so that the loss increases with model complexity. Techniques such as weight decay or penalties on gradient magnitudes are often used for this purpose.
Additional Subquestion
Could we use SURE in a simplified or approximate way for deep learning models, perhaps by estimating only a subset of partial derivatives or using a smaller batch of data? What would be the potential trade-offs?
Solution for Additional Question
One possible approach to making SURE more tractable is to approximate the divergence term using a small subset of data points or partial derivatives. For instance:
- '''Subsampled Divergence''': Compute [math]\displaystyle{ D_i }[/math] for only a small mini-batch of the dataset and use this to extrapolate the full divergence.
- '''Block-Diagonal or Low-Rank Approximation''': Instead of computing the full Jacobian, approximate it by a block-diagonal or low-rank structure, significantly reducing the computational cost.
However, these approximations introduce additional variance or bias into the divergence estimate. If the mini-batch is not representative, the divergence (and thus the SURE estimate) might be inaccurate. Although such tricks can offer partial relief, they still tend to be more computationally demanding than alternatives like cross-validation. As a result, most large-scale deep learning pipelines avoid SURE in favor of simpler, widely supported methods.
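To illustrate the approximation idea, the divergence can also be estimated stochastically rather than through all [math]\displaystyle{ n }[/math] partial derivatives. Below is a hedged sketch of a Hutchinson-style, finite-difference estimator for a generic black-box estimator; the function name mc_divergence and the linear-smoother example are placeholders, not part of the lecture material:

import numpy as np

def mc_divergence(f_hat, y, n_probes=10, eps=1e-3, rng=None):
    # Monte Carlo estimate of sum_i d f_hat(y)_i / d y_i, using
    # E_v[ v^T (f_hat(y + eps*v) - f_hat(y)) / eps ] with v ~ N(0, I),
    # which approximates the trace of the Jacobian of f_hat at y.
    if rng is None:
        rng = np.random.default_rng(0)
    f0 = f_hat(y)
    total = 0.0
    for _ in range(n_probes):
        v = rng.standard_normal(y.shape)
        total += v @ (f_hat(y + eps * v) - f0) / eps
    return total / n_probes

# Example: a linear smoother f_hat(y) = H y, whose true divergence is trace(H)
rng = np.random.default_rng(0)
H = rng.standard_normal((50, 50)) * 0.05
y = rng.standard_normal(50)
print(mc_divergence(lambda u: H @ u, y, n_probes=200), np.trace(H))  # should be close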
Exercise 4.2
Level: * (Easy)
Exercise Types: Novel
Question
Use SURE to analyze how the bias-variance tradeoff is reflected in the risk of an estimator. Consider two scenarios:
- A high-bias, low-variance estimator (e.g., a constant estimate [math]\displaystyle{ \hat{f}(y) = c }[/math] for all [math]\displaystyle{ y }[/math]).
- A high-variance, low-bias estimator (e.g., [math]\displaystyle{ \hat{f}(y) = y }[/math]).
- For the estimator defined by the equation:
[math]\displaystyle{ \hat{f}(y) = cy + d }[/math],
where c and d are constants:
a. Derive the divergence [math]\displaystyle{ \text{D}(\hat{f}). }[/math],
b. Use the derived divergence to compute the SURE formula for the risk.
Show how the SURE formula quantifies the risk in both cases.
Solution
High-bias, low-variance estimator:
- Since [math]\displaystyle{ \hat{f}(y) = c }[/math], the divergence [math]\displaystyle{ \text{D}(\hat{f}) = 0 }[/math] (no dependence on [math]\displaystyle{ y }[/math]).
- The SURE risk simplifies to [math]\displaystyle{ \text{Risk} = (c - y)^2 + 2\sigma^2 \cdot 0 = (c - y)^2 }[/math]
- The expected risk is influenced entirely by the choice of [math]\displaystyle{ c }[/math] relative to [math]\displaystyle{ f }[/math] (bias).
High-variance, low-bias estimator:
- For [math]\displaystyle{ \hat{f}(y) = y }[/math], the divergence [math]\displaystyle{ \text{D}(\hat{f}) = 1 }[/math] (derivative of [math]\displaystyle{ y }[/math] w.r.t. itself is 1).
- The SURE risk becomes [math]\displaystyle{ \text{Risk} = |y - y|^2 + 2\sigma^2 \cdot 1 = 2\sigma^2 }[/math].
- The risk reflects only the variance, as bias is negligible.
Divergence and SURE formula: a. The divergence is [math]\displaystyle{ \text{D}(\hat{f}) = c }[/math] because the derivative of [math]\displaystyle{ cy + d }[/math] with respect to [math]\displaystyle{ y }[/math] is [math]\displaystyle{ c }[/math].
b. The SURE formula for the risk becomes:
[math]\displaystyle{ \text{Risk} = \mathbb{E}\left[(cy + d - y)^2\right] + 2\sigma^2 \cdot c = \mathbb{E}\left[(c - 1)^2 y^2 + 2(c - 1)d y + d^2\right] + 2\sigma^2 \cdot c }[/math].
Exercise 4.3
Level: ** (Moderate)
Exercise Types: Novel
Question
How does SURE explain why cross-validation and regularization are effective for estimating true error?
Hint: Consider the cases when a data point is not in the training set and when it is included in the training set.
Solution
Let’s first recall the SURE (Stein's Unbiased Risk Estimate) formula:
[math]\displaystyle{ E[(\hat{y_0}-y_0)^2] = E[(\hat{f_0} - f_0)^2] + E[\epsilon_0^2] - 2 E[\epsilon_0 (\hat{f_0} - f_0)] }[/math]
Case 1: Data Point Not in the Training Set
When the data point is not in the training set, the covariance term [math]\displaystyle{ E[\epsilon_0 (\hat{f_0} - f_0)] }[/math] becomes zero, since the model does not have access to that particular point during training. This simplifies the formula to:
[math]\displaystyle{ E[(\hat{y_0}-y_0)^2] = E[(\hat{f_0} - f_0)^2] + E[\epsilon_0^2] }[/math]
Now, when summing over all [math]\displaystyle{ m }[/math] points, we obtain:
[math]\displaystyle{ \sum_{i=1}^{m} (\hat{y_i} - y_i)^2 = \sum_{i=1}^{m} (\hat{f_i} - f_i)^2 + m \sigma^2 }[/math]
Where [math]\displaystyle{ \sigma^2 }[/math] is noise. The total error ([math]\displaystyle{ Err }[/math]) can be written as:
[math]\displaystyle{ Err = err - m \sigma^2 }[/math]
Here, [math]\displaystyle{ m \sigma^2 }[/math] is a constant, which means the true error differs from the empirical error by a constant value only. Therefore, the empirical error ([math]\displaystyle{ err }[/math]) provides a good estimate of the true error ([math]\displaystyle{ Err }[/math]) when the point is not in the training set.
This explains why cross-validation is effective. Cross-validation essentially evaluates the model on data points that were not part of the training set, and since the empirical error is only offset by a constant, it provides a reliable estimate of the true error.
Case 2: Data Point in the Training Set
When the data point is part of the training set, the covariance term [math]\displaystyle{ E[\epsilon_0 (\hat{f_0} - f_0)] }[/math] is no longer zero. We have:
[math]\displaystyle{ \sum_{i=1}^{n} (\hat{y_i} - y_i)^2 = \sum_{i=1}^{n} (\hat{f_i} - f_i)^2 + n \sigma^2 - 2 \sigma^2 \sum_{i=1}^{n} D_i }[/math]
Where [math]\displaystyle{ D_i = \frac{\partial \hat{f_i}}{\partial y_i} }[/math] measures the sensitivity of the fitted value to its own observation. This equation can be further simplified to:
[math]\displaystyle{ Err = err - n \sigma^2 + 2 \sigma^2 \sum_{i=1}^{n} D_i }[/math]
In this case, the additional term [math]\displaystyle{ \sum_{i=1}^{n} D_i }[/math] reflects the model’s complexity. The complexity term increases with the capacity of the model and is often difficult to calculate directly. To handle this, instead of directly calculating the complexity term, we can use a function that increases with respect to model capacity, and treat it as the regularization term. The regularization term penalizes model complexity to avoid overfitting, thus helping to prevent the model from fitting noise in the training set.
This explains the need for regularization techniques. Regularization helps to control model complexity and ensures that the model generalizes better to unseen data, improving the estimation of the true error.
Thus, both cross-validation and regularization are grounded in the same principle of improving the estimation of true error by adjusting for complexity and noise.
(Note: the formulas are all from Lecture 4 content, STAT 940)
Exercise 4.4
Level: ** (Moderate)
Exercise Types: Novel
Question
Assume there is a point set [math]\displaystyle{ (x_i,y_i) }[/math] satisfying [math]\displaystyle{ y_i=2x_i+3\sin(x_i)+n_i }[/math], where [math]\displaystyle{ n_i \sim \mathcal{N}(0, 4) }[/math], for i = 1, 2, ...
Now we fit the relationship between y and x, using polynomial linear models with order from 1 to 10. Show the MSE of models of different complexity.
Solution
library(ggplot2)

# Simulate data: y = 2x + 3*sin(x) + noise with sd = 2 (variance 4)
x <- seq(0, 10, length.out = 100)
y <- 2 * x + 3 * sin(x) + rnorm(100, mean = 0, sd = 2)
data <- data.frame(x = x, y = y)

# Fit polynomial models of degree 1 to 10 and record the MSE of each
degrees <- 1:10
mse_values <- numeric(length(degrees))
for (i in seq_along(degrees)) {
  degree <- degrees[i]
  model <- lm(y ~ poly(x, degree), data = data)
  predicted <- predict(model, data)
  mse_values[i] <- mean((data$y - predicted)^2)
}
mse_results <- data.frame(Degree = degrees, MSE = mse_values)
print(mse_results)

# Plot MSE as a function of model complexity
ggplot(mse_results, aes(x = Degree, y = MSE)) +
  geom_line(color = "blue") +
  geom_point(size = 3, color = "red") +
  labs(title = "MSE vs. Model Complexity", x = "Polynomial Degree", y = "Mean Squared Error (MSE)") +
  theme_minimal()

# Plot fitted curves for degrees 1, 3, and 10
ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "black", alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 1), color = "red", se = FALSE) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), color = "blue", se = FALSE) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 10), color = "green", se = FALSE) +
  labs(title = "Polynomial Regression Fits", x = "x", y = "y") +
  theme_minimal()
Output:
Additional Comment: This graph clearly shows the importance of choosing a good space of model candidates. When the model is allowed too much complexity, the MSE barely decreases beyond a polynomial degree of 4, while there is a large drop from degree 3 to 4. Based on the graph, a degree-4 polynomial is a sensible choice: it achieves a low MSE while being more robust than models of higher polynomial order.
Exercise 4.5
Level: * (Easy)
Exercise Types: Novel
Question
Use the SURE equation to explain the difference between using data inside versus outside the training dataset when estimating the true error from the empirical error (for datasets of the same size).
Solution
The key is the term [math]\displaystyle{ 2 \sigma^2 \sum_{i=1}^{n} D_i }[/math]. Two main factors contribute to this value: the model complexity and the variance of the error. If the variance is sufficiently small and the model complexity is not too large (e.g., [math]\displaystyle{ p \lt 3 }[/math]), the difference between using training points and using test points is not as significant. Thus, if the complexity of the model is low (small [math]\displaystyle{ D_i }[/math]) and the sample size is small, estimating the true error from the training dataset will not be extremely different from using a separate dataset.
When the model complexity increases ([math]\displaystyle{ p }[/math] becomes larger), the model becomes more sensitive to small changes in the training data. This sensitivity can be written as [math]\displaystyle{ \frac{\partial \hat{y}_i}{\partial y_i} }[/math]. The derivative grows as the fit tracks the training data more closely, inflating the penalty term [math]\displaystyle{ 2\sigma^2\sum_{i=1}^{n}D_i }[/math] in SURE.
Exercise 4.6
Level: * (Easy)
Exercise Types: Novel
Question
For a momentum factor [math]\displaystyle{ \beta }[/math], explain the impact of increasing or decreasing [math]\displaystyle{ \beta }[/math] (e.g., [math]\displaystyle{ \beta = 0.9 }[/math] vs. [math]\displaystyle{ \beta = 0.5 }[/math]) on the learning process.
Solution
The momentum factor [math]\displaystyle{ \beta }[/math] determines how much of the velocity contributes to the current step during gradient descent.
When [math]\displaystyle{ \beta }[/math] is high (e.g., [math]\displaystyle{ \beta = 0.9 }[/math]), momentum retains most of the previous velocity, resulting in smoother and more consistent updates. In situations where gradients do not change direction significantly, high [math]\displaystyle{ \beta }[/math] accelerates convergence as momentum accumulates in the correct direction. However, on highly non-convex surfaces or steep gradients, high [math]\displaystyle{ \beta }[/math] may cause oscillations or overshooting as the momentum term dominates gradient changes.
When [math]\displaystyle{ \beta }[/math] is low (e.g., [math]\displaystyle{ \beta = 0.5 }[/math]), the updates rely more heavily on the current gradient rather than accumulated momentum, which can introduce more noise into the updates. In flat regions or long valleys, low [math]\displaystyle{ \beta }[/math] leads to slower progress as previous updates fade quickly. However, a lower [math]\displaystyle{ \beta }[/math] adapts better to situations where the loss landscape changes direction frequently, reducing the risk of overshooting.
Exercise 4.7
Level: ** (Moderate)
Exercise Types: Novel
Question
Suppose we have a linear regression model [math]\displaystyle{ y = X \beta + \varepsilon }[/math] with [math]\displaystyle{ \varepsilon \sim \mathcal{N}\bigl(0, \sigma^2 I\bigr) }[/math]. We use a ridge estimator with penalty [math]\displaystyle{ \lambda }[/math]:
[math]\displaystyle{ \hat{\beta}_\lambda = \bigl(X^\top X + \lambda I\bigr)^{-1} X^\top y, \quad \hat{y}_\lambda = X\,\hat{\beta}_\lambda. }[/math] Write down the form of Stein’s Unbiased Risk Estimator (SURE) for this ridge estimator in terms of [math]\displaystyle{ \hat{y}_\lambda }[/math] and the hat matrix. Briefly explain why the term involving the trace of the hat matrix [math]\displaystyle{ H_\lambda }[/math] adjusts for overfitting, and how that adjustment depends on [math]\displaystyle{ \lambda }[/math]. Describe how you would use SURE to select an optimal value of [math]\displaystyle{ \lambda }[/math].
Solution
1. SURE for the Ridge Estimator Define the hat matrix for ridge regression as:
[math]\displaystyle{ H_\lambda = X \bigl(X^\top X + \lambda I\bigr)^{-1} X^\top, \quad \hat{y}_\lambda = H_\lambda\,y. }[/math] Stein’s Unbiased Risk Estimator (SURE) for this setting is:
[math]\displaystyle{ \mathrm{SURE}(\lambda) = \|y - \hat{y}_\lambda\|^2 + 2\,\sigma^2 \,\mathrm{trace}\bigl(H_\lambda\bigr). }[/math] Here, [math]\displaystyle{ \|y - \hat{y}_\lambda\|^2 }[/math] represents the (in-sample) residual sum of squares, and [math]\displaystyle{ \mathrm{trace}(H_\lambda) }[/math] measures the effective degrees of freedom used by the ridge estimator.
2. Role of the Hat Matrix Trace
The matrix [math]\displaystyle{ H_\lambda }[/math] maps the observed data [math]\displaystyle{ y }[/math] to the fitted values [math]\displaystyle{ \hat{y}_\lambda }[/math]. Its trace, [math]\displaystyle{ \mathrm{trace}(H_\lambda) }[/math], indicates how sensitive the fitted values are to the observed data. A larger [math]\displaystyle{ \mathrm{trace}(H_\lambda) }[/math] means the model is using more degrees of freedom and risks overfitting, causing the naive residual sum of squares [math]\displaystyle{ \|y - \hat{y}_\lambda\|^2 }[/math] to underestimate the true prediction error. As [math]\displaystyle{ \lambda }[/math] increases, the estimator shrinks coefficients more aggressively and [math]\displaystyle{ \mathrm{trace}(H_\lambda) }[/math] typically decreases, reflecting a simpler (less flexible) model.
3. Using SURE to Select [math]\displaystyle{ \lambda }[/math]
To choose an optimal [math]\displaystyle{ \lambda }[/math] from the perspective of unbiased risk estimation, one can:
Compute [math]\displaystyle{ \mathrm{SURE}(\lambda) }[/math] over a grid of possible [math]\displaystyle{ \lambda }[/math] values and select the [math]\displaystyle{ \lambda }[/math] that minimizes [math]\displaystyle{ \mathrm{SURE}(\lambda) }[/math]. Because SURE includes the penalty [math]\displaystyle{ 2\,\sigma^2\,\mathrm{trace}(H_\lambda) }[/math], it corrects for the bias introduced by fitting the same data used in the residual calculation. This makes SURE a more reliable guide to out-of-sample error than just looking at [math]\displaystyle{ \|y - \hat{y}_\lambda\|^2 }[/math].
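A short sketch of this selection procedure on simulated data, assuming the noise level [math]\displaystyle{ \sigma^2 }[/math] is known (the data-generating process and grid are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 20, 1.0
X = rng.standard_normal((n, p))
beta_true = np.concatenate([np.ones(5), np.zeros(p - 5)])
y = X @ beta_true + sigma * rng.standard_normal(n)

def sure_ridge(lmbda):
    # Hat matrix of the ridge fit and the corresponding SURE value
    H = X @ np.linalg.solve(X.T @ X + lmbda * np.eye(p), X.T)
    y_hat = H @ y
    return np.sum((y - y_hat) ** 2) + 2 * sigma ** 2 * np.trace(H)

lambdas = np.logspace(-2, 3, 30)
scores = [sure_ridge(l) for l in lambdas]
best = lambdas[int(np.argmin(scores))]
print("lambda minimizing SURE:", best)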
Exercise 4.8
Level: * (Easy)
Exercise Type: Novel
Question
You are tasked with understanding the Mean Squared Error (MSE) and its components: bias and variance, using the provided diagram as a reference.
Setup:
1. You have a true underlying function:
[math]\displaystyle{ f(x) = 2x + 3 }[/math]
2. A model is used to estimate [math]\displaystyle{ f(x) }[/math], given as: [math]\displaystyle{ \hat{f}(x) = ax + b }[/math], where [math]\displaystyle{ a }[/math] and [math]\displaystyle{ b }[/math] are estimated from data.
3. Your model has the following properties:
- Expected estimate: [math]\displaystyle{ \mathbb{E}[\hat{f}(x)] = 1.8x + 2.5 }[/math]
- Variance: [math]\displaystyle{ \text{Var}[\hat{f}(x)] = 0.1 }[/math] (constant for all [math]\displaystyle{ x }[/math]).
(a) Calculate the bias at [math]\displaystyle{ x = 2 }[/math].
(b) Given that [math]\displaystyle{ \text{Var}[\hat{f}(x)] = 0.1 }[/math], compute the Mean Squared Error (MSE) at [math]\displaystyle{ x = 2 }[/math].
Solution
1. Bias Calculation:
The bias is defined as:
[math]\displaystyle{ \text{Bias} = \mathbb{E}[\hat{f}(x)] - f(x). }[/math]
Substitute the values for [math]\displaystyle{ f(x) }[/math] and [math]\displaystyle{ \mathbb{E}[\hat{f}(x)] }[/math] at [math]\displaystyle{ x = 2 }[/math]:
- [math]\displaystyle{ f(2) = 2(2) + 3 = 7 }[/math]
- [math]\displaystyle{ \mathbb{E}[\hat{f}(2)] = 1.8(2) + 2.5 = 3.6 + 2.5 = 6.1. }[/math]
Thus, [math]\displaystyle{ \text{Bias} = \mathbb{E}[\hat{f}(2)] - f(2) = 6.1 - 7 = -0.9. }[/math]
2. Variance Contribution:
The variance is directly given as: [math]\displaystyle{ \text{Var}[\hat{f}(x)] = 0.1. }[/math]
3. MSE Decomposition:
The Mean Squared Error (MSE) is defined as: [math]\displaystyle{ \text{MSE} = \text{Bias}^2 + \text{Var}. }[/math]
Substitute the values:
- [math]\displaystyle{ \text{Bias}^2 = (-0.9)^2 = 0.81, }[/math]
- [math]\displaystyle{ \text{MSE} = 0.81 + 0.1 = 0.91. }[/math]
Final Answer:
- Bias: [math]\displaystyle{ -0.9 }[/math].
- Variance: [math]\displaystyle{ 0.1 }[/math].
- MSE: [math]\displaystyle{ 0.91 }[/math].
Exercise 4.9
Level: * (Easy)
Exercise Types: Novel
Question
Suppose you are developing a predictive model for house prices using a dataset with features like square footage, number of bedrooms, and location. You decide to compare three models with increasing complexity:
1. A simple linear regression model.
2. A polynomial regression model with degree 3.
3. A neural network with multiple layers.
Using Stein's Unbiased Risk Estimator (SURE), explain how you would determine which model is expected to generalize best to unseen data. What practical challenges might you encounter?
Solution
Approach
1. For each model, compute the empirical error on the training data. For instance, compute the Mean Squared Error: [math]\displaystyle{ err = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat y_i)^2. }[/math]
2. Estimate [math]\displaystyle{ \sigma^2 }[/math], analyze the variance of residuals from a simple baseline model like linear regression, which provides a good approximation of the noise in the data.
3. For each model, calculate the complexity term [math]\displaystyle{ 2 \sigma^2 \sum_{i=1}^n D_i }[/math], where [math]\displaystyle{ D_i = \frac{\partial{\hat{f_i}}}{\partial{y_i}} }[/math]
4. Compute the Stein's Unbiased Risk Estimator for each model by [math]\displaystyle{ Err = err - n \sigma^2 + 2 \sigma^2 \sum_{i=1}^n D_i }[/math]
5. Compare the SURE values for the three models. The model with the lowest SURE is expected to generalize best to unseen data.
Practical Challenges
1. Using a simple model like linear regression might provide a biased estimate if the data is highly nonlinear. This will suggest an inaccurate variance [math]\displaystyle{ \sigma^2 }[/math].
2. While SURE penalizes complexity, it may favor simpler models if the noise variance is high. However, overly simplistic models might underfit.
3. If the best model is neural networks, then it will be less interpretable in this case and calculating [math]\displaystyle{ D_i }[/math] becomes computationally expensive.
4. For neural networks, estimating [math]\displaystyle{ D_i }[/math] is challenging due to the model's intricate structure and lack of interpretability, and the computation is prone to numerical instability.
Exercise 4.10
Level: ** (Moderate)
Exercise Types: Novel
Question
Derive Stein’s Unbiased Risk Estimator (SURE) for the training set and the testing set.
Solution
Notation
We have observations [math]\displaystyle{ y_i = f(x_i) + \varepsilon_i }[/math] where [math]\displaystyle{ \varepsilon_i \sim \mathcal{N}(0, \sigma^2) }[/math] and [math]\displaystyle{ \hat{f}(x_i) }[/math] is the model’s prediction at [math]\displaystyle{ x_i }[/math]. We write [math]\displaystyle{ \hat{y}_i = \hat{f}(x_i). }[/math]
Test Data
For a test point [math]\displaystyle{ (x_0, y_0) }[/math] not used in fitting [math]\displaystyle{ \hat{f} }[/math]:
[math]\displaystyle{ \hat{y}_0 - y_0 \;=\; \hat{f}(x_0) - \bigl(f(x_0) + \varepsilon_0\bigr). }[/math]
[math]\displaystyle{ (\hat{y}_0 - y_0)^2 \;=\; (\hat{f}(x_0) - f(x_0))^2 + \varepsilon_0^2 - 2\,\varepsilon_0\,(\hat{f}(x_0) - f(x_0)). }[/math]
Since [math]\displaystyle{ \hat{f}(x_0) }[/math] is independent of [math]\displaystyle{ \varepsilon_0 }[/math], we have [math]\displaystyle{ E\bigl[\varepsilon_0\,(\hat{f}(x_0) - f(x_0))\bigr] = 0 }[/math], and therefore:
[math]\displaystyle{ E\bigl[(\hat{y}_0 - y_0)^2\bigr] \;=\; E\bigl[(\hat{f}(x_0) - f(x_0))^2\bigr] + \sigma^2. }[/math]
Summing over [math]\displaystyle{ M }[/math] test points:
[math]\displaystyle{ \sum_{i=1}^M (\hat{y}_i - y_i)^2 \;=\; \sum_{i=1}^M (\hat{f}(x_i) - f(x_i))^2 \;+\; M\,\sigma^2. }[/math]
Training Data
For a point [math]\displaystyle{ (x_0, y_0) }[/math] used in training, [math]\displaystyle{ \hat{f}(x_0) }[/math] depends on [math]\displaystyle{ y_0. }[/math] We have:
[math]\displaystyle{ (\hat{y}_0 - y_0)^2 \;=\; \bigl(\hat{f}(x_0) - (f(x_0) + \varepsilon_0)\bigr)^2. }[/math]
Define [math]\displaystyle{ D_0 := \frac{\partial\,\hat{f}(x_0)}{\partial\,y_0}. }[/math]
By Stein’s lemma,
[math]\displaystyle{ E\bigl[\varepsilon_0\,(\hat{f}(x_0) - f(x_0))\bigr] \;=\; \sigma^2 \, E\bigl[\tfrac{\partial\,\hat{f}(x_0)}{\partial\,\varepsilon_0}\bigr]. }[/math]
But [math]\displaystyle{ \tfrac{\partial\,\hat{f}(x_0)}{\partial\,\varepsilon_0} = \tfrac{\partial\,\hat{f}(x_0)}{\partial\,y_0}, }[/math] so [math]\displaystyle{ E\bigl[\varepsilon_0(\hat{f}(x_0) - f(x_0))\bigr] = \sigma^2 E[D_0]. }[/math]
[math]\displaystyle{ E\bigl[(\hat{y}_0 - y_0)^2\bigr] \;=\; E\bigl[(\hat{f}(x_0) - f(x_0))^2\bigr] + \sigma^2 - 2\,\sigma^2 \, E[D_0]. }[/math]
Summing over [math]\displaystyle{ n }[/math] training points:
[math]\displaystyle{ \sum_{i=1}^n (\hat{y}_i - y_i)^2 \;=\; \sum_{i=1}^n (\hat{f}_i - f_i)^2 + n\,\sigma^2 - 2\,\sigma^2 \sum_{i=1}^n D_i. }[/math]
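For a linear smoother [math]\displaystyle{ \hat{y} = H y }[/math] the quantities [math]\displaystyle{ D_i }[/math] are the diagonal entries of [math]\displaystyle{ H }[/math], so the training-data identity above can be verified by simulation; here is a hedged sketch using polynomial least squares as one convenient choice of smoother:

import numpy as np

rng = np.random.default_rng(0)
n, sigma = 60, 0.5
x = np.linspace(0, 1, n)
f = np.sin(2 * np.pi * x)                      # true function values
X = np.vander(x, 6, increasing=True)           # degree-5 polynomial design
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix; sum of D_i = trace(H)

lhs, rhs = [], []
for _ in range(2000):
    y = f + sigma * rng.standard_normal(n)
    y_hat = H @ y
    lhs.append(np.sum((y_hat - y) ** 2))
    rhs.append(np.sum((y_hat - f) ** 2) + n * sigma**2 - 2 * sigma**2 * np.trace(H))

print(np.mean(lhs), np.mean(rhs))              # the two averages should agree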
Exercise 4.11
Level: * (Easy)
Exercise Types: Novel
Question
How does the choice of momentum factor [math]\displaystyle{ \beta }[/math] influence the ability of gradient descent to escape saddle points or local minima during optimization?
Solution
The momentum factor [math]\displaystyle{ \beta }[/math] plays a significant role in determining how gradient descent navigates complex loss landscapes.
For reference, the formula for momentum-based SGD is as follows:
[math]\displaystyle{ v_{t+1} = \beta\,v_t + \eta\,\Delta W_t, }[/math]
where [math]\displaystyle{ \beta }[/math] is the momentum coefficient and [math]\displaystyle{ \eta }[/math] is the learning rate.
When [math]\displaystyle{ \beta }[/math] is high (e.g., [math]\displaystyle{ \beta }[/math] = 0.9), the optimizer builds up velocity over time, which helps push it through flat regions like saddle points more effectively. This accumulated momentum can prevent the algorithm from getting stuck in shallow local minima by carrying the updates forward despite weak gradients. However, in very sharp local minima, high [math]\displaystyle{ \beta }[/math] may cause overshooting, making it harder to settle at the true minima.
When [math]\displaystyle{ \beta }[/math] is low (e.g., [math]\displaystyle{ \beta }[/math] = 0.5), the optimizer relies more on the current gradient, which reduces the accumulated velocity. While this makes the updates more adaptive to the local landscape, it also increases the likelihood of getting trapped in saddle points or small local minima, as the velocity is insufficient to escape flat or weak gradient regions.
Exercise 4.12
Level: ** (Moderate)
Exercise Types: Novel
Question
Explain the key idea behind Stein's Unbiased Risk Estimator (SURE) and how it can be applied to choose an optimal parameter or model in high-dimensional estimation problems. Why is it preferred over traditional risk estimation methods in some scenarios?
Solution
Stein's Unbiased Risk Estimator (SURE) provides an unbiased estimate of the risk (expected squared error) of an estimator using only the observed data, without requiring knowledge of the true parameter. It is particularly useful in high-dimensional problems, where it balances bias and variance effectively, aiding in tasks like model selection and hyperparameter tuning for methods such as LASSO or ridge regression. SURE is computationally efficient, often providing closed-form expressions, and is preferred over traditional methods like cross-validation in scenarios with Gaussian noise and differentiable estimators. However, its applicability depends on specific assumptions (e.g., noise distribution and estimator smoothness), and it may not always perform well in small sample sizes or under non-standard conditions.
Exercise 4.13
Level: * (Easy)
Exercise Types: Novel
Question
In order for Stein's unbiased risk estimator to be useful, we need to figure out how to compute [math]\displaystyle{ D_i }[/math], which can be challenging. Then what is the advantage of SURE despite the challenge?
Solution
1. SURE is useful because it provides an unbiased estimate of the risk of an estimator without needing the true parameter. Typically, to evaluate the MSE of an estimator, you would need to know the true parameter value but SURE allows you to assess the performance of an estimator without this information.
2. Even if computing the sum of the partial derivatives directly is complex, SURE can be applied to compare different estimators. The goal is often to choose an estimator that minimizes the risk, so SURE helps identify more efficient estimators, especially in high-dimensional settings.
3. In many practical applications, it's not necessary to compute the exact sum of partial derivatives. Instead, we can approximate or use simplified models for the estimator and the data.
Exercise 4.14
Level: * (Easy)
Exercise Types: Novel
Question
Prove that MSE = Bias[math]\displaystyle{ ^2 }[/math]+Var
Solution
Let [math]\displaystyle{ f }[/math] denote the true function, [math]\displaystyle{ \hat{f} }[/math] denote the estimated function , [math]\displaystyle{ x }[/math] denote the data
[math]\displaystyle{ \begin{align*} \text{MSE} &= \mathbb{E}[(f(x)-\hat{f}(x))^2] \\ &= \mathbb{E}\Big[(f(x)-\mathbb{E}[\hat{f}(x)] + \mathbb{E}[\hat{f}(x)]-\hat{f}(x))^2\Big] \\ &= (f(x)-\mathbb{E}[\hat{f}(x)])^2 + \mathbb{E}\Big[(\mathbb{E}[\hat{f}(x)]-\hat{f}(x))^2\Big] \\ &= \text{Bias}^2+\text{Var} \end{align*} }[/math]
where the second-to-last equality holds because the cross term vanishes: [math]\displaystyle{ \mathbb{E}[(f(x)-\mathbb{E}[\hat{f}(x)])(\mathbb{E}[\hat{f}(x)]-\hat{f}(x))] = \mathbb{E}[f(x)\mathbb{E}[\hat{f}(x)]-f(x)\hat{f}(x)-\mathbb{E}[\hat{f}(x)]^2+\hat{f}(x)\mathbb{E}[\hat{f}(x)]] = f(x)\mathbb{E}[\hat{f}(x)]-f(x)\mathbb{E}[\hat{f}(x)]-\mathbb{E}[\hat{f}(x)]^2+\mathbb{E}[\hat{f}(x)]^2 = 0 }[/math]
Alternatively,
[math]\displaystyle{ \begin{align*} \mathbb{E}\Big[(f(x)-\mathbb{E}[\hat{f}(x)])(\mathbb{E}[\hat{f}(x)]-\hat{f}(x))\Big] &= (f(x)-\mathbb{E}[\hat{f}(x)])\mathbb{E}\Big[\mathbb{E}[\hat{f}(x)]-\hat{f}(x)\Big] \\ &= (f(x)-\mathbb{E}[\hat{f}(x)])\Big(\mathbb{E}[\mathbb{E}[\hat{f}(x)]]-\mathbb{E}[\hat{f}(x)]\Big) \\ &= (f(x)-\mathbb{E}[\hat{f}(x)])\Big(\mathbb{E}[\hat{f}(x)]-\mathbb{E}[\hat{f}(x)]\Big) \\ &= 0, \end{align*} }[/math]
since the bias [math]\displaystyle{ (f(x)-\mathbb{E}[\hat{f}(x)]) }[/math] is a constant.
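A quick Monte Carlo check of this decomposition for a simple shrinkage estimator [math]\displaystyle{ \hat{f} = a\,y }[/math] with [math]\displaystyle{ y = f + \varepsilon }[/math] (the constants below are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
f_true, sigma, a = 2.0, 1.0, 0.7
n_rep = 200_000

y = f_true + sigma * rng.standard_normal(n_rep)
f_hat = a * y                                   # one estimate per replicate

mse = np.mean((f_true - f_hat) ** 2)
bias = np.mean(f_hat) - f_true
var = np.var(f_hat)
print(mse, bias ** 2 + var)                     # the two numbers should match closely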
Exercise 4.15
Level: ** (Moderate)
Exercise Types: Novel
Question
When applying numerical methods, it is essential to constantly remind ourselves of the assumptions behind the model. What assumptions are required for SURE to provide an unbiased estimate of the true risk? How does it compare to cross-validation as a model selection technique?
Solution
Assumptions:
1. The noise follows a normal distribution with constant variance.
2. The response must be linearly related to the predictors.
3. The effective degrees of freedom must be correctly computed.
Comparisons with Cross-validation:
1. While the assumptions of SURE could be unrealistic to hold, cross validation may suffer from higher variance due to the random splitting.
2. SURE is computationally cheaper than cross validation.
3. Cross validation does not impose normality on the noise, giving a more general account.
Exercise 4.16
Level: * (Easy)
Exercise Types: Modified (Reference: Lecture 04)
Question
Suppose [math]\displaystyle{ y \sim N(\theta, \sigma^2I) }[/math], where [math]\displaystyle{ \theta \in \mathbb{R}^n }[/math] is the true parameter, [math]\displaystyle{ y }[/math] is the observed data, and [math]\displaystyle{ \sigma^2 }[/math] is the variance of the noise. Consider the following estimator for [math]\displaystyle{ \theta }[/math]: [math]\displaystyle{ \hat{\theta} = a y, }[/math] where [math]\displaystyle{ a \in \mathbb{R} }[/math] is a scalar constant.
Derive the Stein's Unbiased Risk Estimator (SURE) for the estimator [math]\displaystyle{ \hat{\theta} }[/math].
Solution
The true risk is given by: [math]\displaystyle{ R = \mathbb{E}[\|\hat{\theta} - \theta\|^2]. }[/math] Substituting [math]\displaystyle{ \hat{\theta} = a y }[/math] and [math]\displaystyle{ y \sim N(\theta, \sigma^2I) }[/math], we expand the risk: [math]\displaystyle{ R = \mathbb{E}[\|a y - \theta\|^2] = \mathbb{E}[\|a(y - \theta) + (a - 1)\theta\|^2]. }[/math]
Using [math]\displaystyle{ \mathbb{E}[y - \theta] = 0 }[/math] and [math]\displaystyle{ \mathbb{E}[\|y - \theta\|^2] = n\sigma^2 }[/math], we get: [math]\displaystyle{ R = a^2 \mathbb{E}[\|y - \theta\|^2] + (a - 1)^2 \|\theta\|^2 = a^2 n\sigma^2 + (a - 1)^2 \|\theta\|^2. }[/math]
To derive SURE, we estimate the true risk using: [math]\displaystyle{ \text{SURE} = \|\hat{\theta} - y\|^2 - n\sigma^2 + 2\sigma^2 \text{div}(f), }[/math] where [math]\displaystyle{ f(y) = a y }[/math] and [math]\displaystyle{ \text{div}(f) = a n }[/math] because: [math]\displaystyle{ \frac{\partial f_i(y)}{\partial y_i} = a \quad \text{for all } i. }[/math]
First term: [math]\displaystyle{ \|\hat{\theta} - y\|^2 = \|a y - y\|^2 = (a - 1)^2 \|y\|^2. }[/math]
Therefore: [math]\displaystyle{ \text{SURE} = (a - 1)^2 \|y\|^2 - n\sigma^2 + 2\sigma^2 n a. }[/math] Taking expectations and using [math]\displaystyle{ \mathbb{E}[\|y\|^2] = \|\theta\|^2 + n\sigma^2 }[/math] recovers the true risk [math]\displaystyle{ a^2 n\sigma^2 + (a - 1)^2 \|\theta\|^2 }[/math], confirming that the estimator is unbiased.
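This unbiasedness can also be checked by simulation; a small Monte Carlo sketch (the dimension, [math]\displaystyle{ \sigma }[/math], [math]\displaystyle{ a }[/math], and [math]\displaystyle{ \theta }[/math] below are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
n, sigma, a = 50, 1.0, 0.6
theta = rng.standard_normal(n)

risks, sures = [], []
for _ in range(20_000):
    y = theta + sigma * rng.standard_normal(n)
    theta_hat = a * y
    risks.append(np.sum((theta_hat - theta) ** 2))
    sures.append(np.sum((theta_hat - y) ** 2) - n * sigma**2 + 2 * sigma**2 * n * a)

# Both averages approximate a^2 * n * sigma^2 + (a-1)^2 * ||theta||^2
print(np.mean(risks), np.mean(sures))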
Exercise 4.17
Level: * (Easy)
Exercise Types: Novel
Question
Stein's Unbiased Risk Estimator (SURE) is often discussed in the context of its effectiveness in model selection. Explain how SURE can be used to select the best model from a set of candidates, specifically focusing on its capacity to balance model complexity against prediction error.
Solution
SURE is particularly valuable in model selection because it provides an unbiased estimate of the risk (expected prediction error) for different models, even without knowing the true underlying function. It does this by combining the empirical error (residuals) with a penalty proportional to the model's complexity.
For instance, if we have multiple regression models of varying complexity, SURE helps identify the model that optimally balances low empirical errors with manageable complexity. This is crucial because overly complex models might fit the training data very well (low empirical error) but perform poorly on new data due to overfitting. Conversely, overly simple models might not capture the necessary patterns in the data, leading to high empirical errors.
The SURE formula includes a term that adjusts the observed empirical error by adding a complexity penalty. This penalty is usually proportional to the trace of the hat matrix (a measure of leverage or influence of each observation in linear models), which scales with the flexibility or number of parameters in the model. By evaluating SURE for each model, we can choose the model that minimizes this unbiased estimate of the risk, thereby selecting the model with the best generalization performance expected on new, unseen data.
Exercise 4.18
Level: * (Easy)
Exercise Types: Novel
Question
Why does Stein's Unbiased Risk Estimator (SURE) serve as a good basis for regularization?
Solution
SURE directly estimates the risk (mean squared error) of an estimator [math]\displaystyle{ \hat{\theta} }[/math]. Thus, we can compare different estimators or penalization strategies by explicitly minimizing the estimated risk. By selecting the model or parameters that minimize the SURE criterion, we can effectively regularize the problem and achieve better generalization. Moreover, it can handle high-dimensional problems because it takes [math]\displaystyle{ D_i }[/math] into consideration.
Exercise 4.19
Level: ** (Moderate)
Exercise Types: Novel
References: Adapted from [https://arxiv.org/pdf/1805.10531 Unsupervised Learning with Stein's Unbiased Risk Estimator].
Question
Critically evaluate the application of Stein's Unbiased Risk Estimator (SURE) for image denoising in scenarios where ground truth data is unavailable. Discuss the implementation of SURE within convolutional neural networks (CNNs) for image recovery, focusing on the use of the divergence term: \[ \text{div} \, \mathbf{J} = \operatorname{trace}(\nabla \mathbf{f}(\mathbf{x})) \] from the trace of the Jacobian matrix of the estimator with respect to observed data. Explain how this approach, as explored in the referenced paper, assists in parameter optimization and mitigates the challenges posed by the absence of clean reference images.
Solution
The paper implements Stein's Unbiased Risk Estimator (SURE) for unsupervised image denoising, demonstrating its practical use in settings devoid of ground truth data. SURE estimates the risk associated with a denoising estimator via: \[ \text{SURE} = \| \mathbf{y} - \mathbf{f}(\mathbf{x}) \|^2 - n\sigma^2 + 2\sigma^2 \operatorname{trace}(\nabla \mathbf{f}(\mathbf{x})), \] where \( \mathbf{y} \) is the observed noisy image, \( \mathbf{f}(\mathbf{x}) \) is the denoised output, \( n \) is the number of pixels, and \( \sigma^2 \) is the noise variance. This approach allows for the optimization of CNN parameters so that the model self-evaluates its performance based on the observed data alone, enhancing its ability to generalize from noisy observations to cleaner reconstructions.
Exercise 4.20
Level: * (Easy)
Exercise Types: Novel
References:
A. Ghodsi, STAT 940 Deep Learning: Lecture 4, University of Waterloo, Winter 2025.
Question
In STAT 940 Lecture 4, it was shown that the expectation [math]\displaystyle{ \mathbb{E}[(\mathcal{N}(0,\sigma^2))^2]=\sigma^2 }[/math]
Using Python and NumPy, draw a large number of Gaussian samples with mean 0 and a randomly generated variance, and numerically demonstrate that this expectation holds.
Solution
import numpy as np
import matplotlib.pyplot as plt

# Generate a random standard deviation (so the variance is std**2)
std = np.random.rand()

# Generate 1000 random numbers with mean 0 and standard deviation std
sampled_gaussian_numbers = np.random.normal(0, std, 1000)

# Plot a histogram of the sampled numbers
plt.figure(1, figsize=(8, 6))
plt.hist(sampled_gaussian_numbers, bins=30)
plt.title("Histogram of gaussian samples, mean = 0, std = " + str(np.round(std, 3)))
plt.show()

# Calculate the expectation of the sampled numbers squared
expectation = np.mean(sampled_gaussian_numbers**2)
print("The randomly generated variance is: ", std**2)
print("The expectation of the randomly generated gaussian samples squared is: ", expectation)
Exercise 5.1
Level: * (Easy)
Exercise Type: Novel
Question
1) Which of the following options can be used to regularize an MLP (choose all that apply)?
a) Add noise to data
b) Use full batch gradient descent
c) Add dropout
d) Use early stopping
2) Using Stochastic Gradient Descent with a small minibatch when training an MLP helps with generalization:
a) True
b) False
3) Explain why adding noise to the input data helps improve generalization in a neural network.
Solution
1) The correct answers are a), c), and d).
- a) Adding noise to data is a regularization technique that can improve generalization by forcing the model to learn robust patterns. Types of noise used in MLPs include: input noise, weight noise, activation noise and dropouts.
- c) Dropout randomly deactivates neurons during training, reducing overfitting.
- d) Early stopping prevents overfitting by halting training once the validation error stops improving.
b) is incorrect because using full batch gradient descent does not help regularize the model.
2) The correct answer is a) True.
Using Stochastic Gradient Descent with a small minibatch size introduces noise into the optimization process, which can act as a form of regularization and improve generalization.
3) Explanation: Adding noise to the input data acts as a form of data augmentation. It forces the model to learn features that are robust to variations in the input, reducing reliance on specific details of the training data. This helps the network generalize better to unseen data, as it becomes more adaptable to small changes and less prone to overfitting.
Additional Note: One form of adding noise is data smoothing, which prevents the network from becoming overconfident on the examples it was trained on.
Exercise 5.2
Level: * (Easy)
Exercise Type: Novel
Question
Derive the gradient of the loss function with respect to the weights [math]\displaystyle{ \mathbf{w} }[/math] for the following regularized quadratic loss:
[math]\displaystyle{ L(\mathbf{w}) = \frac{1}{2n} \sum_{i=1}^n \left( y_i - \mathbf{w}^T \mathbf{x}_i \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2 }[/math]
where [math]\displaystyle{ \lambda \gt 0 }[/math] is the regularization parameter.
Solution
The gradient of the loss function [math]\displaystyle{ L(\mathbf{w}) }[/math] with respect to [math]\displaystyle{ \mathbf{w} }[/math] can be computed as follows:
[math]\displaystyle{ \nabla_{\mathbf{w}} L(\mathbf{w}) = \nabla_{\mathbf{w}} \left( \frac{1}{2n} \sum_{i=1}^n \left( y_i - \mathbf{w}^T \mathbf{x}_i \right)^2 \right) + \nabla_{\mathbf{w}} \left( \frac{\lambda}{2} \|\mathbf{w}\|^2 \right) }[/math]
For the first term: [math]\displaystyle{ \nabla_{\mathbf{w}} \left( \frac{1}{2n} \sum_{i=1}^n \left( y_i - \mathbf{w}^T \mathbf{x}_i \right)^2 \right) = -\frac{1}{n} \sum_{i=1}^n \left( y_i - \mathbf{w}^T \mathbf{x}_i \right) \mathbf{x}_i }[/math]
For the second term: [math]\displaystyle{ \nabla_{\mathbf{w}} \left( \frac{\lambda}{2} \|\mathbf{w}\|^2 \right) = \lambda \mathbf{w} }[/math]
Combining both terms: [math]\displaystyle{ \nabla_{\mathbf{w}} L(\mathbf{w}) = -\frac{1}{n} \sum_{i=1}^n \left( y_i - \mathbf{w}^T \mathbf{x}_i \right) \mathbf{x}_i + \lambda \mathbf{w} }[/math]
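A quick numerical check of this gradient on a small randomly generated problem (the data and λ below are arbitrary choices for illustration):

import numpy as np

# Compare the analytic gradient derived above against central finite differences.
rng = np.random.default_rng(0)
n, d, lam = 50, 5, 0.3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

def loss(w):
    # (1/2n) * sum of squared residuals + (lambda/2) * ||w||^2
    return np.mean((y - X @ w) ** 2) / 2 + lam / 2 * np.sum(w ** 2)

analytic = -(X.T @ (y - X @ w)) / n + lam * w

eps = 1e-6
numeric = np.array([(loss(w + eps * np.eye(d)[j]) - loss(w - eps * np.eye(d)[j])) / (2 * eps)
                    for j in range(d)])

print(np.max(np.abs(analytic - numeric)))   # should be on the order of 1e-7 or smaller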
Exercise 5.3
Level: ** (Moderate)
Exercise Types: Novel
Question
What are the mathematical formulas for Ridge Regression and Lasso Regression, including their respective penalty terms? Additionally, write a Python script that visualizes the effect of these regularization techniques on a selected loss function (e.g., Mean Squared Error) for a simple linear regression model. Once you have the graph, interpret the graph by explaining how the shapes show the differences between Lasso and ridge regression.
Solution
Ridge Regression and Lasso Regression are both forms of linear regression with regularization to prevent overfitting.
Ridge regression minimizes the following objective function:
[math]\displaystyle{ \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - X_i \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 }[/math]
where:
- [math]\displaystyle{ y_i }[/math] are the target values,
- [math]\displaystyle{ X_i }[/math] represents the input features,
- [math]\displaystyle{ \beta }[/math] are the regression coefficients,
- [math]\displaystyle{ \lambda }[/math] is the regularization parameter controlling the strength of the penalty.
The penalty term [math]\displaystyle{ \lambda \sum_{j=1}^{p} \beta_j^2 }[/math] shrinks the coefficients towards zero, reducing overfitting without setting any coefficient exactly to zero. For small [math]\displaystyle{ \lambda }[/math], the model behaves like an unregularized model (ordinary least squares); for larger [math]\displaystyle{ \lambda }[/math], strong regularization pushes the coefficients towards zero. Lasso may set some coefficients exactly to zero, while Ridge reduces their size but does not zero them out.
Lasso regression introduces an L1 penalty term:
[math]\displaystyle{ \hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} (y_i - X_i \beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| }[/math]
where the L1 penalty term [math]\displaystyle{ \lambda \sum_{j=1}^{p} |\beta_j| }[/math] encourages sparsity in the coefficients. Some coefficients will be set exactly to zero, making Lasso useful for feature selection.
Visualization:
The following Python script visualizes the impact of Ridge and Lasso regularization by plotting the contours of the Mean Squared Error (MSE) loss function along with the constraint regions imposed by Ridge (L2 ball) and Lasso (L1 diamond):
import numpy as np
import matplotlib.pyplot as plt
# Define the loss function
def mse_loss(beta1, beta2):
return beta1**2 + beta2**2 # Simplified MSE (for visualization)
# Generate grid for visualization
beta1_vals = np.linspace(-2, 2, 100)
beta2_vals = np.linspace(-2, 2, 100)
B1, B2 = np.meshgrid(beta1_vals, beta2_vals)
Loss = mse_loss(B1, B2)
# Plot contours of the loss function
plt.figure(figsize=(10, 6))
contour = plt.contour(B1, B2, Loss, levels=20, cmap='viridis')
plt.colorbar(contour)
# Plot Ridge constraint (L2 ball)
ridge_circle = plt.Circle((0, 0), radius=1.2, color='red', fill=False, linestyle='dashed', label="Ridge Constraint (L2)")
# Plot Lasso constraint (L1 diamond)
lasso_diamond = np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [1, 0]]) * 1.2
plt.plot(lasso_diamond[:, 0], lasso_diamond[:, 1], 'b--', label="Lasso Constraint (L1)")
# Add constraints to the plot
ax = plt.gca()
ax.add_patch(ridge_circle)
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.axhline(0, color='black', linewidth=0.5)
plt.axvline(0, color='black', linewidth=0.5)
# Labels and legend
plt.xlabel(r'$\beta_1$')
plt.ylabel(r'$\beta_2$')
plt.title("Effect of Ridge and Lasso Regularization on Loss Function")
plt.legend()
plt.grid(True)
plt.show()
Output:
1. Ridge Regression (L2) – Red Circle
Ridge regression enforces a circular constraint on the coefficients ([math]\displaystyle{ \beta_1, \beta_2 }[/math]). The solution is found where the contours of the loss function (MSE) first touch this L2 constraint region. Since the constraint is smooth (no sharp corners), the coefficients are shrunk gradually towards zero but rarely exactly zero. This means Ridge keeps all features in the model but reduces their impact by shrinking their values.
2. Lasso Regression (L1) – Blue Diamond
Lasso regression imposes a diamond-shaped constraint on the coefficients. The sharp corners of the L1 constraint (at the axes) make it more likely that the optimal solution lies exactly on one of these axes. This means Lasso sets some coefficients to exactly zero, effectively performing feature selection by eliminating less important variables. The reason for this behavior is that the contours of the loss function often first touch the constraint at the corners, leading to sparsity in the solution.
Conclusion
Ridge Regression (L2) shrinks coefficients continuously but keeps all variables. Lasso Regression (L1) forces some coefficients to exactly zero, performing automatic feature selection.
Exercise 5.4
Level: ** (Moderate)
Exercise Types: Novel
Question
Label smoothing modifies the standard one-hot ground truth labels in multi-class classification by assigning a small amount of probability mass to incorrect classes. Suppose we replace each one-hot label [math]\displaystyle{ y }[/math] with a “smoothed” label [math]\displaystyle{ \tilde{y} }[/math] that assigns [math]\displaystyle{ 1 - \alpha }[/math] to the true class and [math]\displaystyle{ \alpha / (C-1) }[/math] to each of the other [math]\displaystyle{ C - 1 }[/math] classes, where [math]\displaystyle{ C }[/math] is the total number of classes and [math]\displaystyle{ \alpha }[/math] is a small constant.
Write the modified cross-entropy loss using [math]\displaystyle{ \tilde{y} }[/math] and a predicted probability vector [math]\displaystyle{ p }[/math]. Explain how this modification helps reduce model overconfidence and potentially improves calibration. Give an example scenario (e.g., large-scale image classification) where label smoothing has been shown to be particularly beneficial.
Solution
Modified cross-entropy loss: standard cross-entropy for one-hot labels [math]\displaystyle{ y }[/math] and predicted probabilities [math]\displaystyle{ p }[/math] is [math]\displaystyle{ -\sum_{c=1}^C y_c \log p_c }[/math]. With label smoothing, each target label becomes [math]\displaystyle{ \tilde{y}_c }[/math]. Hence, the loss is:

[math]\displaystyle{ \ell_{\mathrm{smooth}} = - \sum_{c=1}^C \tilde{y}_c \,\log p_c, \quad \text{where} \quad \tilde{y}_c = \begin{cases} 1 - \alpha, & \text{for the correct class},\\ \frac{\alpha}{C - 1}, & \text{for other classes}. \end{cases} }[/math]

Reduction of overconfidence and improved calibration: in a one-hot scheme, the model is strongly penalized if it does not put near-total probability mass on the correct class. Label smoothing distributes a fraction [math]\displaystyle{ \alpha }[/math] of the probability across incorrect classes, preventing the model from becoming overly confident. By avoiding extreme outputs (probabilities close to 0 or 1), the model tends to produce more calibrated predictions, often generalizing better to unseen data.

Example scenario: in large-scale image classification (e.g., ImageNet), label smoothing has demonstrated notable gains:
Models converge more smoothly, especially when data is abundant but still noisy in certain classes. Overconfidence is reduced, leading to improved validation accuracy and calibration metrics.
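A minimal sketch of the smoothed loss in PyTorch (the batch, α, and C below are arbitrary; note that the built-in label_smoothing argument available in recent PyTorch versions spreads α over all C classes, a slightly different convention than α/(C-1) over only the incorrect classes):

import torch
import torch.nn.functional as F

alpha, C = 0.1, 5
logits = torch.randn(4, C)                      # a batch of 4 predictions
targets = torch.tensor([0, 2, 1, 4])

# Build smoothed targets: 1 - alpha on the true class, alpha/(C-1) on every other class
smooth = torch.full((4, C), alpha / (C - 1))
smooth.scatter_(1, targets.unsqueeze(1), 1 - alpha)

# Cross-entropy against the smoothed distribution
loss = -(smooth * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# For comparison: the built-in option (slightly different smoothing convention)
loss_builtin = F.cross_entropy(logits, targets, label_smoothing=alpha)
print(loss.item(), loss_builtin.item())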
Exercise 5.5
Level: * (Easy)
Exercise Types: Modified
References: https://www.cs.toronto.edu/~lczhang/321/files/midterm_b.pdf
Question
Which of the following about weight decay is true?
(A) Including weight decay generally reduces the training cost.
(B) Weight decay directly penalizes large activations.
(C) Weight decay can help revive a “dead” or “saturated” neuron.
(D) Weight decay helps get out of saddle points.
Solution
C) is correct. Weight decay adds an extra (positive) penalty term to the training cost, so it generally increases rather than reduces it, ruling out (A). It penalizes large weights, not activations, ruling out (B). Weight decay can revive a “dead” or “saturated” neuron by shrinking the large weights and biases that keep its activation stuck in a plateau of the activation function, so (C) is true. It does not help escape saddle points: weight decay only regularizes the weights, while techniques such as momentum or adaptive gradients are more relevant for escaping saddle points, ruling out (D).
Additional note: Dead/saturated neurons have activations that are always in the plateau area of the activation function (e.g. ReLU or sigmoid or tanh), caused by large (positive or negative) weights/biases. Weight decay reduces those parameters, so that the activations will not be consistently large. It might be useful to note that weight decay can be used in combination with various optimizers like SGD (Stochastic Gradient Descent), Adam, and RMSprop. While SGD can benefit from weight decay, optimizers like Adam and RMSprop with adaptive learning rates might help mitigate issues like saddle points and local minima, where weight decay alone might be insufficient.
Exercise 5.6
Level: ** (Moderate)
Exercise Types: Novel
Question
Train a neural network with both L2 regularization and early stopping on UCI dataset. Vary the regularization strength and patience, and observe how they interact to affect model generalization.
Solution
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load UCI dataset (using the diabetes dataset as an example)
data = load_diabetes()
X, y = data.data, data.target

# Standardize features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32).to(device)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1).to(device)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32).to(device)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32).view(-1, 1).to(device)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32).to(device)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1).to(device)

# Prepare DataLoader
train_loader = DataLoader(TensorDataset(X_train_tensor, y_train_tensor), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val_tensor, y_val_tensor), batch_size=32)
test_loader = DataLoader(TensorDataset(X_test_tensor, y_test_tensor), batch_size=32)

# Define neural network model
class SimpleNN(nn.Module):
    def __init__(self, input_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Training function with early stopping
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs, patience):
    train_loss_history = []
    val_loss_history = []
    best_val_loss = float('inf')
    best_model_state = None
    patience_counter = 0

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_loss_history.append(train_loss)

        # Validation phase
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                y_pred = model(X_batch)
                loss = criterion(y_pred, y_batch)
                val_loss += loss.item()
        val_loss /= len(val_loader)
        val_loss_history.append(val_loss)

        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model_state = model.state_dict()
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print(f"Early stopping triggered at epoch {epoch+1}.")
                break

        print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")

    # Load the best model state
    model.load_state_dict(best_model_state)
    return train_loss_history, val_loss_history

# Define L2 regularization strengths and patience values
l2_strengths = [0.0, 0.01, 0.1]
patience_values = [3, 5]
results = {}

# Train models with different regularization and early stopping settings
for l2_strength in l2_strengths:
    for patience in patience_values:
        print(f"\nTraining with L2 regularization = {l2_strength}, Patience = {patience}")
        model = SimpleNN(input_dim=X.shape[1]).to(device)
        optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=l2_strength)
        criterion = nn.MSELoss()
        train_loss, val_loss = train_model(
            model, train_loader, val_loader, criterion, optimizer, num_epochs=100, patience=patience
        )

        # Evaluate the model on the test set
        model.eval()
        test_loss = 0.0
        with torch.no_grad():
            for X_batch, y_batch in test_loader:
                y_pred = model(X_batch)
                loss = criterion(y_pred, y_batch)
                test_loss += loss.item()
        test_loss /= len(test_loader)

        results[(l2_strength, patience)] = {
            "train_loss": train_loss,
            "val_loss": val_loss,
            "test_loss": test_loss
        }

# Plot results
plt.figure(figsize=(12, 8))
for (l2_strength, patience), result in results.items():
    plt.plot(result["val_loss"], label=f"L2={l2_strength}, Patience={patience}")
plt.title("Validation Loss with Different L2 Regularization and Patience")
plt.xlabel("Epochs")
plt.ylabel("Validation Loss")
plt.legend()
plt.grid(True)
plt.savefig('validation_loss.jpg')

# Display summary results
print("\nSummary of Results:")
for (l2_strength, patience), result in results.items():
    print(f"L2 Regularization: {l2_strength}, Patience: {patience}")
    print(f"Final Train Loss: {result['train_loss'][-1]:.4f}")
    print(f"Final Validation Loss: {result['val_loss'][-1]:.4f}")
    print(f"Test Loss: {result['test_loss']:.4f}\n")
Output:
1. Effect of L2 Regularization:
With regularization, the validation loss decreases more smoothly. A stronger penalty (L2=0.1) slows down the training initially, as it regularizes the weights more strictly. This can prevent overfitting but may lead to underfitting if the regularization is too strong.
2. Effect of Patience (Early Stopping):
Compared with patience=3, the model with patience=5 may lead to slight improvements if the loss decreases further after plateauing. Longer patience typically benefits models with regularization (L2=0.1) because it gives the model more time to converge despite slower initial progress.
3. Interaction Between L2 and Patience:
L2=0.1 with Patience=3 stops relatively early and might underfit slightly compared to Patience=5; Moderate L2 (L2=0.01) and higher patience (5) provide a better trade-off, as the validation loss decreases smoothly without abrupt stops.
Exercise 5.7
Level: * (Easy)
Exercise Types: Novel
Question
Describe the main idea of the Manifold Tangent Classifier and its application in regularization. Provide an example.
Solution
The Manifold Tangent Classifier (MTC) is a method that improves generalization by leveraging the low-dimensional structure of data. Many datasets lie on a lower-dimensional manifold embedded in high-dimensional space, meaning small changes along certain directions should not affect classification.
MTC ensures that the classifier [math]\displaystyle{ f(x) }[/math] is invariant to these small changes by enforcing that its gradient is orthogonal to the tangent vectors of the manifold at [math]\displaystyle{ x }[/math]. These tangent vectors represent directions where the data naturally varies, and the classifier should be insensitive to changes in these directions.
Mathematically, this is achieved through regularization:
[math]\displaystyle{ \lambda \sum_i \left( \frac{\partial{f(x)}}{\partial{x}} \cdot v_i \right)^2 }[/math] where:
- [math]\displaystyle{ \lambda }[/math] controls the strength of regularization.
- [math]\displaystyle{ v_i }[/math] are the tangent vectors of the manifold.

The expression penalizes large directional derivatives of [math]\displaystyle{ f(x) }[/math] along these vectors. By adding this regularization term to the loss function, the model becomes more invariant to small perturbations along the data manifold, leading to:
- Reduced overfitting, as the model does not capture irrelevant noise.
- Better generalization, since the decision boundary aligns with meaningful variations in the data.
- Robustness to adversarial noise, as the classifier focuses on actual structure rather than high-dimensional artifacts.
This approach is particularly useful in semi-supervised learning and unsupervised feature learning, where understanding data geometry helps in learning meaningful representations.
Consider a dataset of handwritten digits, such as the MNIST dataset, where each digit is represented as a high-dimensional image with 28×28 pixels, resulting in a 784-dimensional space. Despite this high-dimensional representation, the variations in handwritten digits do not span the entire space. Instead, they form a much lower-dimensional manifold within it. These variations include differences in slant, stroke thickness, and minor rotations, which do not alter the identity of the digit itself. However, traditional classifiers may be sensitive to these variations, potentially leading to misclassification.
The Manifold Tangent Classifier (MTC) helps mitigate this issue by enforcing invariance to small perturbations along the data manifold. It does this by identifying tangent vectors, which represent natural variations in the data. The classifier is then trained to ensure that its gradient is orthogonal to these tangent directions. This means that small changes in writing style—such as a slightly tilted or stretched digit—will not affect the classification outcome. By incorporating this regularization, the model aligns its decision boundary with meaningful variations rather than being influenced by irrelevant fluctuations in high-dimensional space.
For example, in a typical classification scenario, a standard neural network might misclassify a digit "3" as an "8" due to a slight distortion. However, with MTC, the classifier understands that such a variation is natural and maintains the correct classification. This regularization approach ensures that the model learns representations that are not just accurate but also more robust and interpretable.
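A minimal sketch of the tangent-propagation penalty above using automatic differentiation (the classifier, input, and tangent vectors below are placeholders; in the full MTC the tangent vectors would come from a learned model of the data manifold, for example a contractive autoencoder):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
f = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(), torch.nn.Linear(32, 3))

x = torch.randn(1, 10, requires_grad=True)   # one input point
v = torch.randn(2, 10)                       # two placeholder tangent directions at x

logits = f(x)
score = logits[0, 1]                         # sensitivity of one class score, for illustration

# Gradient of the score with respect to the input, kept in the graph so the
# penalty itself can be backpropagated into the classifier weights
grad_x, = torch.autograd.grad(score, x, create_graph=True)

# Tangent penalty: lambda * sum_i (df/dx . v_i)^2
lam = 0.1
tangent_penalty = lam * ((grad_x * v).sum(dim=1) ** 2).sum()

loss = F.cross_entropy(logits, torch.tensor([1])) + tangent_penalty
loss.backward()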
Exercise 5.8
Level: *** (Hard)
Exercise Types: Novel
Question
You are provided with a noisy dataset sampled from a simple harmonic oscillator with mass [math]\displaystyle{ m = 1\,\text{kg} }[/math] and spring constant [math]\displaystyle{ k = 100\,\text{N/m} }[/math]. The task in this problem is to write a regression neural network that predicts the oscillator's position [math]\displaystyle{ y }[/math] as a function of time [math]\displaystyle{ t }[/math].
The dataset can be generated using the following script:
import numpy as np
import torch

np.random.seed(1234)
t = np.random.uniform(0, 1, 30)
noise = np.random.normal(0, 0.2, t.shape)
y = np.sin(10*t) + noise
t = torch.tensor(t, dtype=torch.float32).view(-1, 1)
y = torch.tensor(y, dtype=torch.float32).view(-1, 1)
(a) Neural network
In PyTorch, write a fully connected neural network with 4 hidden layers, each with 200 neurons. Use a tanh activation function after each hidden layer. Use a learning rate of 0.001. Generate a test dataset of times between [math]\displaystyle{ t=0 }[/math] and [math]\displaystyle{ t=1 }[/math] second(s). Create a plot and comment on the result.
(b) L2 regularization
Repeat the process above, but implement L2 regularization (the weight_decay parameter) in the optimizer. Comment on the result.
(c) Physics-informed neural network
If a dataset follows a known physical law (i.e., a differential equation), this differential equation can be added as a term to the loss function. This penalizes the neural network for solutions that do not obey this law and acts as a form of regularization. The equation of motion for the simple harmonic oscillator is:
[math]\displaystyle{ \frac{d^2y}{dt^2} + \frac{k}{m}y = 0 }[/math]
Using PyTorch's automatic differentiation function(s), compute the first and second derivatives of [math]\displaystyle{ y }[/math] with respect to [math]\displaystyle{ t }[/math], and create a new loss function that is the sum of the mean squared error (MSE) and a new "physics loss" term. Introduce a hyperparameter that controls what fraction of the physics loss gets added to the total loss. Note that getting a good solution requires a careful selection of this parameter as well as the weight decay factor.
Solution
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

torch.manual_seed(1234)

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(1, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 200)
        self.fc4 = nn.Linear(200, 200)
        self.fc5 = nn.Linear(200, 1)

    def forward(self, t):
        t = torch.tanh(self.fc1(t))
        t = torch.tanh(self.fc2(t))
        t = torch.tanh(self.fc3(t))
        t = torch.tanh(self.fc4(t))
        t = self.fc5(t)
        return t

def train(t, y, WEIGHT_DECAY=0, PHYSICS=0):
    model = SimpleNN()
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=WEIGHT_DECAY)
    epochs = 2000
    for epoch in range(epochs):
        perm = torch.randperm(len(t))
        t = t[perm]
        y = y[perm]
        t.requires_grad = True
        y_pred = model(t)  # forward pass

        ### derivatives for the physics loss ###
        dy_dt = torch.autograd.grad(outputs=y_pred, inputs=t,
                                    grad_outputs=torch.ones_like(y_pred),
                                    create_graph=True)[0]
        d2y_dt2 = torch.autograd.grad(outputs=dy_dt, inputs=t,
                                      grad_outputs=torch.ones_like(dy_dt),
                                      create_graph=True)[0]
        physics_loss = torch.mean((d2y_dt2 + 100 * y_pred) ** 2)

        loss = criterion(y_pred, y)
        total_loss = loss + PHYSICS * physics_loss
        optimizer.zero_grad()
        total_loss.backward()  # backwards pass
        optimizer.step()  # update weights
        t.requires_grad = False

        if (epoch + 1) % 100 == 0:
            print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}, "
                  f"Physics Loss: {physics_loss.item():.4f}, Total Loss: {total_loss.item():.4f}")

    t_test = torch.linspace(0, 1, 100).view(-1, 1)
    with torch.no_grad():
        predictions = model(t_test)
    return t_test.numpy(), predictions.numpy()

t_test, pred = train(t, y)
_, pred_L2 = train(t, y, WEIGHT_DECAY=1e-3)
_, pred_phys = train(t, y, WEIGHT_DECAY=1e-3, PHYSICS=1e-5)

plt.figure(figsize=(8, 4))
plt.plot(t_test, pred, label="No regularization")
plt.plot(t_test, pred_L2, label="L2 regularization")
plt.plot(t_test, pred_phys, label="L2 + physics")
plt.scatter(t.numpy(), y.numpy(), color="k", label="Training data")
plt.xlabel("$t$")
plt.ylabel("$y$")
plt.legend()
plt.show()
Exercise 5.9
Level: ** (Moderate)
Exercise Types: Novel
Question
You will train a linear regression model to predict y values. Assume you have a high-dimensional dataset with noise, where the input features are [math]\displaystyle{ \mathbf{x} \in \mathbb{R}^{100 \times 50} }[/math] (100 samples and 50 features), and the target outputs are [math]\displaystyle{ \mathbf{y} \in\mathbb{R}^{100} }[/math].
1. Write a Python function to implement a linear regression model with L2 regularization using gradient descent.
2. Train the model:
a. without regularization ([math]\displaystyle{ \lambda = 0 }[/math])
b. with regularization ([math]\displaystyle{ \lambda = 0.1 }[/math])
Solution
import numpy as np

np.random.seed(42)
x = np.random.randn(100, 50)
true_w = np.random.randn(50) * 0.5
y = x @ true_w + np.random.randn(100) * 0.1

# Linear regression with L2 regularization, trained by gradient descent
# (note: "lambda" is a reserved word in Python, so the argument is named lambda_)
def linear_reg(x, y, lr=0.01, lambda_=0.0, epochs=1000):
    n, d = x.shape
    w = np.zeros(d)
    for _ in range(epochs):
        # gradient of the MSE term plus the L2 penalty term
        gradient = -2 / n * x.T @ (y - x @ w) + lambda_ * w
        w = w - lr * gradient
    return w

no_reg = linear_reg(x, y, lambda_=0.0)    # (a) without regularization
with_reg = linear_reg(x, y, lambda_=0.1)  # (b) with regularization
Exercise 5.10
Level: ** (Moderate)
Exercise Types: Novel
Question
Consider a deep learning model that suffers from high variance, indicating it might be overfitting to the training data. Explore regularization techniques that could potentially reduce overfitting. Describe the following methods and explain how they contribute to reducing overfitting:
1. Weight Decay
2. Noise Injection
3. Early Stopping
4. Bagging
Provide a simple example or formula (where applicable) to demonstrate each method's effect.
Solution
1. Weight Decay: This regularization technique involves adding a penalty term to the loss function based on the sum of the squared values of the parameters (L2 regularization). This discourages large weights and helps to prevent the model from fitting the noise in the training data.
Example Formula: [math]\displaystyle{ L_{new} = L_{original} + \lambda \sum_{i} w_i^2 }[/math], where [math]\displaystyle{ \lambda }[/math] is the regularization strength and [math]\displaystyle{ w_i }[/math] are the model weights.
Example: This form of regularization is often used with gradient descent based optimization methods, such as when training a neural network.
A very similar form of regularization is an L1 penalty, which uses [math]\displaystyle{ |w_i| }[/math] instead of [math]\displaystyle{ w_i^2 }[/math]. One notable difference is that L1 can cause some weights to be set exactly to 0, while this is highly unlikely with L2.
2. Noise Injection: By adding noise to the inputs or outputs during training, the model learns to ignore small variations, enhancing its generalization capabilities. Noise injection can be applied in different forms, such as adding noise to the weights (weight noise), outputs (output noise), or inputs (input noise).
Example: Injecting Gaussian noise [math]\displaystyle{ \epsilon \sim \mathcal{N}(0, \sigma^2) }[/math] to inputs during training to make the model robust to slight variations in input data.
3. Early Stopping: This involves stopping the training process before the model has fully converged to the minimum of the loss function on the training set. By monitoring the performance on a validation set and stopping when performance degrades (implying the start of overfitting), this method effectively limits the capacity of the model.
Example: Stop training when validation loss begins to increase, even if training loss continues to decrease.
4. Bagging: Bagging, or bootstrap aggregating, involves training multiple models on different random subsets of the training data (that were sampled with replacement) and then averaging their predictions. This technique reduces variance and avoids overfitting by smoothing out predictions.
Example: Train multiple neural networks independently on different subsets of data and average their outputs to make final predictions.
Example: Use bagging for models that inherently have a high variance, such as decision tree models (e.g., random forest).
Exercise 5.11
Level: * (Easy)
Exercise Types: Novel
Question
Suppose the loss of a machine learning problem follows the function [math]\displaystyle{ f(x) = \cos(x) }[/math]. Find the tangent of [math]\displaystyle{ f(x) }[/math] at [math]\displaystyle{ x = 0 }[/math].
Solution
[math]\displaystyle{ f'(x) = -\sin(x) }[/math], so [math]\displaystyle{ f'(0) = -\sin(0) = 0 }[/math] and [math]\displaystyle{ f(0) = \cos(0) = 1 }[/math].

The tangent line at [math]\displaystyle{ x = 0 }[/math] is therefore [math]\displaystyle{ g(x) = f(0) + f'(0)(x - 0) = 1 }[/math], i.e. the horizontal line [math]\displaystyle{ y = 1 }[/math].
Exercise 5.12
Level: ** (Moderate)
Exercise Types: Novel
Question
Consider the manifold to be the unit sphere in [math]\displaystyle{ \mathbb{R}^{3} }[/math]
(1) What is the dimension of the sphere?
(2) Find the tangent plane to any point on the sphere
Solution
(1) A sphere in [math]\displaystyle{ \mathbb{R}^{3} }[/math] can be expressed as [math]\displaystyle{ S^{2}:=\{(x_1, x_2, x_3)\in \mathbb{R}^{3}: x_1^2+x_2^2+x_3^2=1\} }[/math], so its dimension is 2.
(2) Note that [math]\displaystyle{ S^{2} }[/math] can also be expressed as the zero set of the function [math]\displaystyle{ F: \mathbb{R}^{3} \rightarrow \mathbb{R}, (x_1, x_2, x_3)\mapsto x_1^2 + x_2^2 + x_3^2 - 1 }[/math], and its Jacobian [math]\displaystyle{ DF:= (2x_1, 2x_2, 2x_3) }[/math] has maximal rank 1 on [math]\displaystyle{ S^{2} }[/math]. Therefore, for any point [math]\displaystyle{ p=(x_1, x_2, x_3)\in S^{2} }[/math], the tangent plane is given by ker[math]\displaystyle{ DF(p) = \{v\in \mathbb{R}^{3}: (2x_1, 2x_2, 2x_3) \cdot v = 0\} }[/math].
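As a concrete check, at the north pole [math]\displaystyle{ p = (0, 0, 1) }[/math] we have [math]\displaystyle{ DF(p) = (0, 0, 2) }[/math], so the tangent plane is [math]\displaystyle{ \{v \in \mathbb{R}^{3} : v_3 = 0\} }[/math], the horizontal plane spanned by [math]\displaystyle{ e_1 }[/math] and [math]\displaystyle{ e_2 }[/math].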
Exercise 5.13
Level: ** (Moderate)
Exercise Types: Modified
Reference: Prince, Simon J.D. Understanding Deep Learning. The MIT Press, 2023, p. 160, udlbook.com. Accessed 25 Jan. 2025.
Question
Show that the weight decay parameter update with decay rate [math]\displaystyle{ 2\alpha\lambda }[/math]: [math]\displaystyle{ w = (1-2\alpha \lambda)w - \alpha \nabla L(w) }[/math], where [math]\displaystyle{ \nabla L(w) }[/math] is the gradient of the original loss function [math]\displaystyle{ L(w) }[/math], is equivalent to a standard gradient update using L2 regularization on the modified loss function [math]\displaystyle{ \tilde{L}(w) = L(w) + \lambda ||w||^2 }[/math].
Solution
Given [math]\displaystyle{ \tilde{L}(w) = L(w) + \lambda ||w||^2 }[/math], [math]\displaystyle{ \nabla \tilde{L}(w) = \nabla L(w) + 2 \lambda w }[/math].
Given [math]\displaystyle{ w = (1-2\alpha \lambda)w - \alpha \nabla L(w) }[/math], [math]\displaystyle{ w = w - \alpha (2 \lambda w + \nabla L(w)) }[/math], which is the standard gradient update on [math]\displaystyle{ \tilde{L}(w) }[/math].
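A one-step numerical check of this equivalence (the quadratic loss and the constants below are arbitrary choices, just for illustration):

import numpy as np

# Toy loss L(w) = 0.5 * ||w - c||^2, so grad L(w) = w - c.
rng = np.random.default_rng(0)
w0 = rng.normal(size=3)
c = rng.normal(size=3)
alpha, lam = 0.1, 0.05

grad_L = lambda w: w - c

# (i) weight decay update with decay rate 2*alpha*lam
w_decay = (1 - 2 * alpha * lam) * w0 - alpha * grad_L(w0)

# (ii) plain gradient step on the regularized loss L(w) + lam * ||w||^2
w_l2 = w0 - alpha * (grad_L(w0) + 2 * lam * w0)

print(np.allclose(w_decay, w_l2))   # True: the two updates coincide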
Exercise 5.14
Level: * (Easy)
Exercise Types: Novel
Question
Except for the examples listed in class, provide another example of noise injection as part of the regularization strategy.
Solution
Here we look at dropout noise, which randomly sets some features to 0 during training; this strategy prevents co-adaptation of neurons.
The dropout function is defined as:
[math]\displaystyle{ x' = x \cdot M, \quad M \sim \text{Bernoulli}(p) }[/math]
where [math]\displaystyle{ M }[/math] is a binary mask with probability [math]\displaystyle{ p }[/math] of keeping each unit.
A Python Example (PyTorch):
import torch
dropout = torch.nn.Dropout(p=0.5) # 50% dropout
x = torch.tensor([0.5, 0.3, 0.8])
x_noisy = dropout(x)
print(x_noisy)
Exercise 5.15
Level: * (Easy)
Exercise Types: Modified
Reference: Calin, Ovidiu. Deep learning architectures: A mathematical approach. Springer, 2020, page 463
Question
A one-hidden layer feedforward neural net, with dimensions 784-N-10, is used to classify the MNIST data. Find the range of the number of hidden neurons, N, for which the network overfits the training data. Note the MNIST dataset consists of 28x28 images, each identified with 10 classes. Assume the dataset contains 10,000 images.
Solution
We assume the MNIST dataset has 10,000 training points [math]\displaystyle{ {(x_i, z_i)} }[/math] where [math]\displaystyle{ x_i }[/math] is a 784 dimensional vector (28x28 flattened image) and [math]\displaystyle{ z_i }[/math] is a 10 dimensional one hot vector (indicating which class the image belongs to). Since there are 10,000 images, this yields a space with dimension 10,000*10 =100,000.
The dimension of the output manifold for the one-hidden layer feedforward NN is [math]\displaystyle{ r = 784*N + N*10 + N }[/math] since we have [math]\displaystyle{ 784*N }[/math] weights, from the input to hidden layer, [math]\displaystyle{ N }[/math] biases, and [math]\displaystyle{ N*10 }[/math] weights from the hidden layer to the output layer.
Therefore, for the network to overfit, we require [math]\displaystyle{ r \gt 100,000 \Rightarrow 784N + 10N + N \gt 100,000 \Rightarrow N \gt 100,000/795 \approx 125.79 }[/math]. That is, the dimension of the output manifold of the neural network must exceed 100,000, which happens for [math]\displaystyle{ N \geq 126 }[/math] hidden neurons, assuming the training set consists of 10,000 images.
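A quick check of the counting argument (the helper function below is only for illustration):

# Count the parameters of the 784-N-10 network as in the argument above
def manifold_dim(N, d_in=784, d_out=10):
    # input-to-hidden weights + hidden-to-output weights + hidden biases
    return d_in * N + N * d_out + N

print(manifold_dim(125))  # 99375  < 100,000
print(manifold_dim(126))  # 100170 > 100,000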
Exercise 5.16
Level: ** (Moderate)
Exercise Types: Novel
Question
a). Given the loss function of a ridge regression, isolate the weights, [math]\displaystyle{ \theta }[/math] and compare the ridge regression weights against that of an OLS regression.
[math]\displaystyle{ f(\theta) = \sum_i (y_i - \hat{y}_i )^2 + \lambda ||\theta||^2_2 }[/math]
b). Show how the weights of the ridge regression are updated using the Newton-Raphson method.
c). Discuss one benefits of ridge regression.
Solution
a).
[math]\displaystyle{ \begin{align*} f(\theta) &= \sum_i (y_i - \hat{y}_i)^2 + \lambda \sum_j \theta_j^2 \\ &= (\textbf{y} - \mathbf{x}\theta)^T (\textbf{y} - \textbf{x}\theta) + \lambda \theta ^T \theta \\ \nabla_\theta f(\theta) &= -2 \textbf{x}^T(\textbf{y} - \textbf{x}\theta) + 2 \lambda \theta \\ \end{align*} }[/math]
By letting [math]\displaystyle{ \nabla_\theta f(\theta) = 0, }[/math]
[math]\displaystyle{ \begin{align*} 0 &= -2 \textbf{x}^T(\textbf{y} - \textbf{x}\theta) + 2 \lambda \theta \\ &= - \textbf{x}^T\textbf{y} + \textbf{x}^T\textbf{x}\theta + \lambda \theta \\ \textbf{x}^T\textbf{y} & = \textbf{x}^T\textbf{x}\theta + \lambda \theta \\ \textbf{x}^T\textbf{y} & = (\textbf{x}^T\textbf{x} + \lambda I)\theta \\ \theta & = (\textbf{x}^T\textbf{x} + \lambda I)^{-1}\textbf{x}^T\textbf{y} \end{align*} }[/math]
Comparing both weighting schemes:
[math]\displaystyle{
\begin{align*}
\theta_{ridge} &= (\textbf{x}^T\textbf{x} + \lambda I)^{-1}\textbf{x}^T\textbf{y} \\
\theta_{OLS} &= (\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T\textbf{y}
\end{align*}
}[/math]
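A small NumPy sketch comparing the two closed-form estimators on nearly collinear features (toy data and an arbitrary [math]\displaystyle{ \lambda }[/math], just for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)            # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2 * x1 - 1 * x2 + 0.1 * rng.normal(size=n)

lam = 1.0
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)                   # OLS closed form
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)  # ridge closed form

print("OLS:  ", theta_ols)     # can be far from the data-generating values and unstable
print("Ridge:", theta_ridge)   # stable, spreads weight across the correlated pair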
b).
Using the Newton-Raphson method for minimization, the update uses both the gradient and the Hessian of [math]\displaystyle{ f(\theta) }[/math] (with an optional step size [math]\displaystyle{ \rho }[/math]):

[math]\displaystyle{
\begin{align*}
\nabla_\theta f(\theta) &= -2 \textbf{x}^T(\textbf{y} - \textbf{x}\theta) + 2 \lambda \theta, \qquad
\nabla^2_\theta f(\theta) = 2\textbf{x}^T\textbf{x} + 2\lambda I \\
\theta^{New} &= \theta^{Old} - \rho \left[ \nabla^2_\theta f(\theta^{Old}) \right]^{-1} \nabla_\theta f(\theta^{Old}) \\
&= \theta^{Old} + \rho\, (\textbf{x}^T\textbf{x} + \lambda I)^{-1}\left( \textbf{x}^T(\textbf{y} - \textbf{x}\theta^{Old}) - \lambda \theta^{Old} \right)
\end{align*}
}[/math]

Because the ridge loss is quadratic, a single full Newton step ([math]\displaystyle{ \rho = 1 }[/math]) from any starting point lands exactly on the closed-form solution from part (a).
c).
One advantage of Ridge Regression, apart from the general shrinkage effect on the coefficients, is that it remains stable when the [math]\displaystyle{ \textbf{x} }[/math] features are highly correlated. In that case, [math]\displaystyle{ \textbf{x}^T \textbf{x} }[/math] becomes nearly singular, so the OLS solution [math]\displaystyle{ (\textbf{x}^T\textbf{x})^{-1}\textbf{x}^T\textbf{y} }[/math] is unstable or undefined. The added [math]\displaystyle{ \lambda I }[/math] in Ridge Regression makes [math]\displaystyle{ \textbf{x}^T\textbf{x} + \lambda I }[/math] invertible and better conditioned, which allows highly correlated [math]\displaystyle{ \textbf{x} }[/math] features to be included. This can be particularly useful when keeping correlated variables increases model accuracy while maintaining stability.
Exercise 5.17
Level: ** (Moderate)
Exercise Types: Novel
Question
How Bagging's regularization effect differs from Dropout or L2 regularization?
Solution
Bagging reduces variance by training multiple models on bootstrap samples of the data and aggregating (e.g., averaging) their predictions. It is best used when computing resources are sufficient, because several models must be trained.
L2 regularization reduces overfitting by constraining the weights of a single model, so it is suitable when the computational budget only allows training one model.
Dropout randomly drops neurons during the training of an individual model, reducing co-adaptation between neurons; it can be viewed as implicitly training an ensemble of weight-sharing subnetworks within one model, which enhances robustness and reduces overfitting.
Exercise 5.18
Level: * (Easy)
Exercise Types: Novel
Question
Data augmentation is commonly used to improve the generalization ability of deep learning models. Consider a scenario where a convolutional neural network (CNN) is trained on a dataset of handwritten digits.
1) Explain how applying random rotations and Gaussian noise injection as data augmentation techniques could impact the model's performance.
2) What potential drawbacks can arise from excessive data augmentation, and how can these be mitigated?
3) Mathematically, how does injecting noise at the input act as a form of regularization? Explain using the concept of the Manifold Tangent Classifier.
Solution
1) Impact of random rotations and Gaussian Noise Injection:
• Random rotations: Rotating the images randomly within a small range helps the model learn rotational invariance, making it more robust to variations in real-world data. However, excessive rotation may distort digits, making classification harder.
• Gaussian Noise Injection: Adding Gaussian noise to the input images forces the network to learn robust features rather than memorizing the dataset. It acts similarly to dropout by introducing randomness, which helps prevent overfitting.
2) Drawbacks of Excessive Data Augmentation:
• If the augmentation introduces unrealistic transformations, such as excessive rotations that make digits unrecognizable, it may degrade model performance.
• Augmenting too much can increase training time significantly without proportional improvement.
• Mitigation: Using hyperparameter tuning, validation monitoring and adversarial training can help balance augmentation intensity.
3) Mathematical Explanation of Noise Injection as Regularization:
• Injecting noise at the input level can be interpreted as adding a regularization term to the loss function.
• If [math]\displaystyle{ x }[/math] is the input and noise [math]\displaystyle{ \epsilon }[/math] is added, the perturbed input becomes [math]\displaystyle{ x' = x + \epsilon }[/math].
• Expanding the loss function around the clean input using a Taylor series shows that, in expectation, the injected noise adds a penalty on the gradient of the network output with respect to its input, discouraging solutions that are sensitive to small input perturbations and enforcing smooth decision boundaries.
• The Manifold Tangent Classifier (MTC) leverages this by learning the local manifold structure of the data, ensuring that small perturbations (such as injected noise) do not drastically change the classification outcome.
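A minimal sketch of these two augmentations with torchvision transforms (the rotation range and noise level are illustrative, and AddGaussianNoise is a hypothetical helper, not a torchvision class):

import torch
from torchvision import transforms

class AddGaussianNoise:
    """Add zero-mean Gaussian noise to an image tensor."""
    def __init__(self, std=0.1):
        self.std = std
    def __call__(self, img):
        return img + self.std * torch.randn_like(img)

augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),   # small random rotations (applied to the PIL image)
    transforms.ToTensor(),
    AddGaussianNoise(std=0.1),               # Gaussian noise injection on the tensor
])
# e.g. pass transform=augment when constructing a digits dataset such as MNIST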
Exercise 5.19
Level: ** (Moderate)
Exercise Types: Novel
Question
L2 regularization (also known as weight decay) is widely used to prevent overfitting in neural networks. The regularized loss function for a neural network is given by:
[math]\displaystyle{ L_{\text{reg}}(\theta) = L_{\text{train}}(\theta) + \lambda \|\theta\|_2^2 }[/math]

where [math]\displaystyle{ L_{\text{train}}(\theta) }[/math] is the training loss (e.g., cross-entropy loss), [math]\displaystyle{ \|\theta\|_2^2 = \sum_{i=1}^N \theta_i^2 }[/math] is the squared L2 norm of the weights, and [math]\displaystyle{ \lambda }[/math] is the regularization hyperparameter.
- Derive the gradient of the regularized loss function with respect to the weights [math]\displaystyle{ \theta_j }[/math]. Show the update rule for the weight [math]\displaystyle{ \theta_j }[/math] during gradient descent.
- Prove that the L2 regularization term leads to a shrinkage effect on the weights. Specifically, show that for each weight [math]\displaystyle{ \theta_j }[/math], the regularization term effectively penalizes large values of [math]\displaystyle{ \theta_j }[/math], making them smaller over time.
- Explain how L2 regularization helps control overfitting in terms of the bias-variance decomposition. Use the bias-variance decomposition of the generalization error to argue how L2 regularization reduces model variance.
Solution
1. Gradient of the Regularized Loss Function
The regularized loss function is:
[math]\displaystyle{ L_{\text{reg}}(\theta) = L_{\text{train}}(\theta) + \lambda \|\theta\|_2^2 }[/math]

To compute the gradient of [math]\displaystyle{ L_{\text{reg}}(\theta) }[/math] with respect to the weight [math]\displaystyle{ \theta_j }[/math], we first compute the gradient of each term:

- Gradient of the training loss term [math]\displaystyle{ L_{\text{train}}(\theta) }[/math]: [math]\displaystyle{ \nabla_{\theta_j} L_{\text{train}}(\theta) }[/math]

- Gradient of the regularization term [math]\displaystyle{ \lambda \|\theta\|_2^2 }[/math]: [math]\displaystyle{ \nabla_{\theta_j} \left( \lambda \sum_{i=1}^N \theta_i^2 \right) = 2\lambda\theta_j }[/math]

The total gradient with respect to [math]\displaystyle{ \theta_j }[/math] is:

[math]\displaystyle{ \nabla_{\theta_j} L_{\text{reg}}(\theta) = \nabla_{\theta_j} L_{\text{train}}(\theta) + 2\lambda\theta_j }[/math]

Update Rule for the Weight [math]\displaystyle{ \theta_j }[/math]

Using gradient descent, the update rule for [math]\displaystyle{ \theta_j }[/math] is:

[math]\displaystyle{ \theta_j \leftarrow \theta_j - \eta \left( \nabla_{\theta_j} L_{\text{train}}(\theta) + 2\lambda\theta_j \right) }[/math]

Thus, the weight update rule with L2 regularization is:

[math]\displaystyle{ \theta_j \leftarrow \theta_j - \eta \nabla_{\theta_j} L_{\text{train}}(\theta) - 2\eta\lambda\theta_j }[/math]

2. Shrinkage Effect of L2 Regularization

To understand how L2 regularization shrinks the weights, we look at the weight update rule:

[math]\displaystyle{ \theta_j \leftarrow \theta_j - \eta \left( \nabla_{\theta_j} L_{\text{train}}(\theta) + 2\lambda\theta_j \right) }[/math]

The term [math]\displaystyle{ -2\eta\lambda\theta_j }[/math] represents the shrinkage of the weight [math]\displaystyle{ \theta_j }[/math]: at every update it pulls [math]\displaystyle{ \theta_j }[/math] towards zero, which reduces the complexity of the model and helps prevent overfitting by avoiding large weight values. The strength of the shrinkage effect is controlled by the regularization parameter [math]\displaystyle{ \lambda }[/math], the learning rate [math]\displaystyle{ \eta }[/math], and the magnitude of the weight [math]\displaystyle{ \theta_j }[/math].

3. L2 Regularization and Bias-Variance Decomposition

The generalization error [math]\displaystyle{ \mathcal{E} }[/math] can be decomposed into three components: bias, variance, and irreducible error:

[math]\displaystyle{ \mathcal{E} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error} }[/math]
L2 regularization helps to control overfitting by influencing the variance component. Here’s how:
- Bias: L2 regularization slightly increases bias by forcing the model to simplify (via shrinking the weights). A more complex model with no regularization can fit the training data perfectly, but at the cost of increased variance.
- Variance: L2 regularization reduces variance by preventing the weights from becoming too large, which would lead to an overfitting model with high variance. By penalizing large weights, the model generalizes better and has reduced sensitivity to fluctuations in the training data.
Thus, L2 regularization reduces the overall generalization error by decreasing variance (the model is less sensitive to noise) while introducing a slight increase in bias (the model is less complex). This tradeoff helps the model generalize better to unseen data.
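A tiny illustration of the shrinkage term in isolation (the training-loss gradient is set to zero and η, λ are arbitrary): each update multiplies the weights by (1 - 2ηλ), so they decay geometrically towards zero.

import numpy as np

eta, lam = 0.1, 0.5
theta = np.array([2.0, -1.0, 0.5])
for step in range(5):
    # update with grad of the training loss assumed to be 0, leaving only the L2 term
    theta = theta - eta * (2 * lam * theta)
    print(step + 1, theta)
# each step multiplies theta by (1 - 2*eta*lam) = 0.9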
Exercise 6.1
Level: ** (Medium)
Exercise Types: Modified
Reference: Source: Schonlau, M., Applied Statistical Learning. With Case Studies in Stata, Springer. ISBN 978-3-031-33389-7 (Chapter 14, page 319).
Question
In binary logistic regression, we model the log odds of one class as a linear function of predictor variables. In multinomial logistic regression, where there are [math]\displaystyle{ K }[/math] possible classes, we model the log odds of [math]\displaystyle{ K-1 }[/math] classes relative to a reference class [math]\displaystyle{ K }[/math]:
[math]\displaystyle{ \log \left( \frac{p_k}{p_K} \right) = \alpha_k + x_1 \beta_{k1} + \dots + x_p \beta_{kp}, }[/math]
for [math]\displaystyle{ k = 1, 2, \dots, K-1 }[/math].
Suppose we use a neural network model to implement multinomial logistic regression. Design a neural network architecture for a case with four input variables ([math]\displaystyle{ p = 4 }[/math]) and five possible outcome classes ( [math]\displaystyle{ K = 5 }[/math] ).
(a) Explain how the output layer should be structured and which activation function should be used.
(b) Introduce a '''dropout''' mechanism in the model. Explain where dropout can be applied and how it affects training and generalization.
Solution
To design a neural network that performs multinomial logistic regression with [math]\displaystyle{ p = 4 }[/math] input variables and [math]\displaystyle{ K = 5 }[/math] classes, we follow these steps:
1. Neural Network Architecture
- Input Layer: 4 neurons (one for each predictor variable).
- Hidden Layer (Optional): To introduce non-linearity, we can add a hidden layer with dropout.
- Output Layer: 5 neurons (one for each class).
Each output neuron represents the log-odds of class [math]\displaystyle{ k }[/math] relative to the reference class [math]\displaystyle{ K }[/math], following the equation:
[math]\displaystyle{ \log \left( \frac{p_k}{p_K} \right) = \alpha_k + x_1 \beta_{k1} + \dots + x_p \beta_{kp}, \quad \text{for } k = 1,2,\dots,4. }[/math]
The probability of class [math]\displaystyle{ K }[/math] is obtained as:
[math]\displaystyle{ p_K = 1 - \sum_{k=1}^{K-1} p_k. }[/math]
2. Activation Function
Since this is a classification problem, the softmax activation function is used in the output layer:
[math]\displaystyle{ p_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \quad \text{where } z_k = \alpha_k + x_1 \beta_{k1} + \dots + x_p \beta_{kp}. }[/math]
This ensures that:
- Each [math]\displaystyle{ p_k }[/math] is between 0 and 1.
- The sum of all [math]\displaystyle{ p_k }[/math] values is 1.
3. Incorporating Dropout
Dropout is a regularization technique that randomly deactivates neurons during training to prevent overfitting. It is commonly applied to hidden layers.
- Where to Apply Dropout:
- If a hidden layer is included, apply dropout (e.g., with probability [math]\displaystyle{ p = 0.5 }[/math]) to prevent co-adaptation of neurons.
- Dropout **should not** be applied to the input or output layers.
- Impact on Training and Generalization:
- During training, dropout randomly removes a fraction of neurons in each iteration, forcing the network to learn more robust features.
- During inference (testing), dropout is disabled and all neurons contribute, with weights scaled accordingly.
- This reduces overfitting and improves generalization to unseen data.
4. Loss Function
To train the network, the categorical cross-entropy loss is used:
[math]\displaystyle{ L = - \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log p_{ik} }[/math]
where:
- [math]\displaystyle{ N }[/math] is the number of samples.
- [math]\displaystyle{ y_{ik} }[/math] is 1 if sample [math]\displaystyle{ i }[/math] belongs to class [math]\displaystyle{ k }[/math], otherwise 0.
- [math]\displaystyle{ p_{ik} }[/math] is the predicted probability for class [math]\displaystyle{ k }[/math].
5. Summary of Neural Network Design with Dropout
- Input layer: 4 neurons (one per input feature).
- Hidden layer (optional): Can include 8 neurons with ReLU activation and dropout ([math]\displaystyle{ p = 0.5 }[/math]).
- Output layer: 5 neurons (one per class) with softmax activation.
- Activation function: Softmax in the output layer.
- Loss function: Categorical cross-entropy.
- Regularization: Dropout applied to hidden layer to reduce overfitting.
By including dropout, the model becomes more robust to noise and prevents reliance on specific neurons, leading to better generalization performance.
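A minimal PyTorch sketch of this architecture (the hidden width of 8 and p = 0.5 are the illustrative values from the summary above; CrossEntropyLoss applies the softmax internally, so the model outputs raw logits):

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),      # 4 input features -> hidden layer
    nn.ReLU(),
    nn.Dropout(p=0.5),    # dropout applied to the hidden layer only
    nn.Linear(8, 5),      # 5 output logits, one per class
)
criterion = nn.CrossEntropyLoss()   # log-softmax + negative log-likelihood, i.e. softmax probabilities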
Exercise 6.2
Level: * (Easy)
Exercise Types: Novel
Question
Consider a neural network with a single hidden layer. The hidden layer has 4 neurons, and the ReLU activation function is applied. During training, dropout is applied with a probability of 0.5. If the input to the hidden layer is [math]\displaystyle{ \mathbf{x} = [1, 2, 3, 4] }[/math] and the weights are [math]\displaystyle{ \mathbf{W} = \begin{bmatrix} 0.5 & 0.2 & -0.3 & 0.8 \\ 0.1 & -0.5 & 0.7 & -0.2 \\ -0.4 & 0.6 & -0.1 & 0.3 \\ 0.3 & -0.4 & 0.2 & 0.1 \end{bmatrix} }[/math], compute the output of the hidden layer after applying dropout.
Assume the following dropout mask is sampled: [math]\displaystyle{ \mathbf{m} = [1, 0, 1, 0] }[/math].
Solution
1. Compute the pre-activation values: [math]\displaystyle{ \mathbf{z} = \mathbf{W} \cdot \mathbf{x} = \begin{bmatrix} 0.5 & 0.2 & -0.3 & 0.8 \\ 0.1 & -0.5 & 0.7 & -0.2 \\ -0.4 & 0.6 & -0.1 & 0.3 \\ 0.3 & -0.4 & 0.2 & 0.1 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 3.2 \\ 0.4 \\ 1.7 \\ 0.5 \end{bmatrix} }[/math]

2. Apply the ReLU activation function: [math]\displaystyle{ \text{ReLU}(\mathbf{z}) = \max(0, \mathbf{z}) = \begin{bmatrix} 3.2 \\ 0.4 \\ 1.7 \\ 0.5 \end{bmatrix} }[/math]

3. Apply the dropout mask and scale by [math]\displaystyle{ \frac{1}{1 - 0.5} = 2 }[/math]: [math]\displaystyle{ \mathbf{h} = \mathbf{m} \odot \text{ReLU}(\mathbf{z}) \cdot 2 = \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \end{bmatrix} \odot \begin{bmatrix} 3.2 \\ 0.4 \\ 1.7 \\ 0.5 \end{bmatrix} \cdot 2 = \begin{bmatrix} 6.4 \\ 0 \\ 3.4 \\ 0 \end{bmatrix} }[/math]

The final output of the hidden layer after applying dropout is: [math]\displaystyle{ \mathbf{h} = [6.4, 0, 3.4, 0] }[/math].
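A quick NumPy check of these numbers:

import numpy as np

W = np.array([[0.5, 0.2, -0.3, 0.8],
              [0.1, -0.5, 0.7, -0.2],
              [-0.4, 0.6, -0.1, 0.3],
              [0.3, -0.4, 0.2, 0.1]])
x = np.array([1, 2, 3, 4])
m = np.array([1, 0, 1, 0])

z = W @ x                      # pre-activations
h = np.maximum(z, 0) * m * 2   # ReLU, dropout mask, inverted-dropout scaling 1/(1-0.5)
print(z, h)                    # z = [3.2 0.4 1.7 0.5], h = [6.4 0. 3.4 0.]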
Exercise 6.3
Level: ** (Moderate)
Exercise Types: Novel
Question
Train a simple deep neural network with and without dropout regularization on the MNIST dataset. Experiment with different dropout rates (0.1, 0.3, 0.5). Analyze the model's performance in terms of accuracy and generalization.
Solution
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define data preprocessing and load the MNIST dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # Normalize to [-1, 1]
])
train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform, download=True)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self, dropout_rate=0.0):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 256)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = x.view(x.size(0), -1)  # Flatten the input
        x = self.relu(self.fc1(x))
        x = self.dropout(x)  # Apply dropout
        x = self.relu(self.fc2(x))
        x = self.fc3(x)  # Output layer
        return x

# Define training and evaluation functions
def train_model(model, train_loader, optimizer, criterion, num_epochs=10):
    model.train()
    train_loss = []
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        train_loss.append(epoch_loss / len(train_loader))
    return train_loss

def evaluate_model(model, test_loader, criterion):
    model.eval()
    correct = 0
    total = 0
    test_loss = 0.0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            test_loss += loss.item()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = 100 * correct / total
    return test_loss / len(test_loader), accuracy

# Experiment with different dropout rates
dropout_rates = [0.0, 0.1, 0.3, 0.5]
results = {}

for rate in dropout_rates:
    model = SimpleNN(dropout_rate=rate).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    # Train the model
    print(f"Training with dropout rate: {rate}")
    train_loss = train_model(model, train_loader, optimizer, criterion, num_epochs=10)

    # Evaluate the model
    test_loss, test_accuracy = evaluate_model(model, test_loader, criterion)
    results[rate] = (train_loss, test_loss, test_accuracy)

# Plot the training loss for different dropout rates
plt.figure(figsize=(10, 6))
for rate, (train_loss, _, _) in results.items():
    plt.plot(train_loss, label=f"Dropout {rate}")
plt.title("Training Loss for Different Dropout Rates")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.grid(True)
plt.savefig('training_loss.jpg')

# Display test accuracy for different dropout rates
print("\nResults Summary:")
for rate, (_, test_loss, test_accuracy) in results.items():
    print(f"Dropout Rate {rate}: Test Loss = {test_loss:.4f}, Test Accuracy = {test_accuracy:.2f}%")
Output:
Results Summary:
Dropout Rate 0.0: Test Loss = 0.0778, Test Accuracy = 97.80%
Dropout Rate 0.1: Test Loss = 0.0826, Test Accuracy = 97.48%
Dropout Rate 0.3: Test Loss = 0.0873, Test Accuracy = 97.49%
Dropout Rate 0.5: Test Loss = 0.0930, Test Accuracy = 97.11%
From the summary results and the plot above, this particular model does not overfit much on MNIST: the run without dropout attains the lowest test loss and highest accuracy, and test accuracy degrades slightly as the dropout rate increases, with the largest drop at 0.5. This illustrates that dropout is not free: a moderate rate (0.1 to 0.3) costs little and can help on models or datasets that do overfit, whereas an aggressive rate removes too much capacity and leads to underfitting. In practice, a moderate dropout rate is therefore the usual starting point, tuned against validation performance.
Exercise 6.4
Level: * (Easy)
Exercise Type: Novel
Question
True/False Questions on Dropout
1. Dropout helps prevent overfitting by randomly deactivating a subset of neurons during training.
2. Dropout is used both during training and during testing to ensure consistent performance.
3. Using a high dropout rate (e.g., 90%) can lead to underfitting the model.
4. Dropout introduces noise in the network, which makes it more robust and forces the network to learn redundant representations.
5. Dropout is not compatible with batch normalization because they both normalize activations.
6. Dropout improves model performance by reducing training time significantly.
Solution
1. Dropout helps prevent overfitting by randomly deactivating a subset of neurons during training.
Answer: True
Explanation: Dropout deactivates a random subset of neurons during each forward pass in training, which forces the network to learn more generalizable features rather than relying on specific neurons. Like other regularization techniques, it introduces noise and uncertainty into training, preventing the model from memorizing the training data and forcing it to focus on the most important patterns.
2. Dropout is used both during training and during testing to ensure consistent performance.
Answer: False
Explanation: Dropout is only applied during training. During inference, dropout is turned off, and all neurons are active. The weights are scaled to account for the absence of dropout, ensuring consistent output without randomly deactivating neurons.
3. Using a high dropout rate (e.g., 90%) can lead to underfitting the model.
Answer: True
Explanation: Underfitting occurs when the model is too simple to capture the underlying patterns in the data. A very high dropout rate deactivates most of the neurons, which reduces the network’s capacity to learn complex patterns. This can result in underfitting, where the model performs poorly on both the training and validation sets. Studies typically recommend a dropout rate between 20% and 50% for most problems. Extremely high rates, such as 90%, often make the network underfit and perform worse than simpler models.
4. Dropout introduces noise in the network, which makes it more robust and forces the network to learn redundant representations.
Answer: True
Explanation: By deactivating random neurons during training, dropout adds noise, making the model more robust to variations in the data. This encourages the network to learn redundant and distributed representations, as no single neuron can dominate the decision-making process.
5. Dropout is not compatible with batch normalization because they both normalize activations.
Answer: False
Explanation: Dropout and batch normalization are compatible, although they operate differently. Batch normalization normalizes activations to stabilize and speed up training, while dropout randomly deactivates neurons to prevent overfitting. Batch normalization is typically applied earlier in the network, often before nonlinearities like ReLU, while dropout is typically used later, after the activation layers. Combining both allows you to stabilize training (via batch normalization) while reducing overfitting (via dropout).
Additional notes: Dropout prevents overfitting by giving other nodes a chance to be trained. It avoids the pitfall of having one node dominate training and allows other weights to be explored. A commonly used starting point is a dropout rate of 0.5, meaning 50% of neurons are dropped during training; this can be fine-tuned based on the model's performance on the validation set. Lower rates are usually safer for smaller networks, which can underfit easily, while larger models can tolerate higher rates.
6. Dropout improves model performance by reducing training time significantly.
Answer: False
Explanation: Dropout helps in preventing overfitting by randomly deactivating a subset of neurons during training, forcing the model to learn more robust features. However, dropout does not necessarily reduce training time. In fact, it can increase training time since fewer active neurons participate in each forward and backward pass, which can slow down convergence. Additionally, since dropout is only applied during training (not inference), the computational savings do not extend to the final model when making predictions.
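As a quick illustration of statements 1, 2, and 4, here is a minimal PyTorch sketch (the layer size, dropout rate, and dummy input are arbitrary choices for this example). PyTorch uses "inverted" dropout: survivors are scaled by 1/(1-p) during training, so no rescaling is needed at inference, which achieves the same expected behaviour as scaling the weights at test time.
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)   # inverted dropout layer
x = torch.ones(1, 8)       # a dummy activation vector

drop.train()               # training mode: a random subset of entries is zeroed
print(drop(x))             # roughly half the entries are 0; survivors are scaled by 1/(1-p) = 2

drop.eval()                # evaluation mode: dropout is disabled
print(drop(x))             # output equals the input exactly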
Exercise 6.5
Level: ** (Moderate)
Exercise Types: Novel
References: Adapted from concepts introduced by Srivastava et al., 2014.
Question
Consider the application of dropout to a linear regression model with a response variable [math]\displaystyle{ y }[/math] and a single predictor [math]\displaystyle{ x }[/math]. The linear model can be expressed as:
[math]\displaystyle{ y = \beta_0 + \beta_1 x + \epsilon }[/math], where [math]\displaystyle{ \epsilon }[/math] represents normally distributed error.
(a) If dropout is applied at a rate of 0.2 during the training phase, how does this affect the estimation of [math]\displaystyle{ \beta_0 }[/math] and [math]\displaystyle{ \beta_1 }[/math]? Discuss the potential benefits and drawbacks.
(b) Extend this to a multiple regression scenario where there are three predictors [math]\displaystyle{ x_1, x_2, x_3 }[/math]. Explain how dropout could be implemented during training and its expected impact on model variance and bias.
Solution
1.Dropout Implementation and Effects in Simple Linear Regression:
- Applying dropout to the predictor [math]\displaystyle{ x }[/math] during training involves temporarily removing it from the model with a probability of 0.2. This effectively introduces missingness in the predictor data, which can make the model more robust by preventing it from relying too heavily on [math]\displaystyle{ x }[/math] for predicting [math]\displaystyle{ y }[/math].
- Potential Benefits: - Reduces the risk of overfitting by making the model less sensitive to noise in the predictor [math]\displaystyle{ x }[/math]. - Forces the model to not rely on any single input feature, thus potentially discovering more general patterns in the data.
- Potential Drawbacks: - Can increase model bias if important predictor information is consistently dropped. - Might lead to higher variance in the estimated parameters due to reduced effective sample size during each training iteration.
2. Extension to Multiple Regression
- In a model with multiple predictors, dropout can be applied to each predictor independently. Each predictor [math]\displaystyle{ x_i }[/math] is dropped with a probability of 0.2 during each training epoch.
- Impact on Model Variance and Bias: - Model Variance: Dropout generally increases model variance during training but can lead to a reduction in variance of predictions by averaging over multiple "thinned" models. It can be seen as an ensemble method: As if we have many different tiny networks, and we take the weighted average of all of them. - Model Bias: Similar to the simple linear regression case, the bias might increase due to the systematic exclusion of potentially important predictors during training epochs. - The application of dropout in this setting acts as a form of regularization, helping to mitigate overfitting especially when the number of predictors is large relative to the sample size.
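A hedged numpy sketch of the multiple-regression case (the data, coefficients, and learning rate below are made up for illustration): input dropout at rate 0.2 is applied to the three predictors during gradient-descent training of the linear model, with inverted-dropout scaling so the expected input is unchanged.
import numpy as np

rng = np.random.default_rng(0)
n, p_drop = 500, 0.2
X = rng.normal(size=(n, 3))                                   # predictors x1, x2, x3
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)

w, b, lr = np.zeros(3), 0.0, 0.05
for _ in range(2000):
    mask = rng.random((n, 3)) > p_drop        # keep each predictor value with probability 0.8
    Xd = X * mask / (1 - p_drop)              # inverted-dropout scaling keeps E[Xd] = X
    err = Xd @ w + b - y
    w -= lr * Xd.T @ err / n                  # gradient step on the mean squared error
    b -= lr * err.mean()

print(w, b)   # estimates near (2, -1, 0.5) and 1, but noisier than ordinary least squares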
Exercise 6.6
Level: * (Easy)
Exercise Types: Novel
Question
Let [math]\displaystyle{ f(x) = 4\sin(x) }[/math]. Apply gradient clipping with a limit of 3. What is the resulting gradient?
Solution
The gradient is [math]\displaystyle{ f'(x) = 4\cos(x) }[/math].
After clipping, any gradient that exceeds 3 is limited to 3, and any gradient below -3 is limited to -3, so the resulting gradient is [math]\displaystyle{ \min(3, \max(-3, 4\cos(x))) }[/math]. This is visualized in the plot below.
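A minimal sketch (assuming numpy and matplotlib) that produces the plot referred to above, comparing the raw gradient [math]\displaystyle{ 4\cos(x) }[/math] with its clipped version:
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 400)
grad = 4 * np.cos(x)             # f'(x) for f(x) = 4 sin(x)
clipped = np.clip(grad, -3, 3)   # element-wise clipping at the limit of 3

plt.plot(x, grad, label="f'(x) = 4cos(x)")
plt.plot(x, clipped, label="clipped gradient")
plt.xlabel("x")
plt.ylabel("gradient")
plt.legend()
plt.show()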
Exercise 6.7
Level: * (Easy)
Exercise Types: Modified
Sources Calin, Ovidiu. Deep learning architectures: A mathematical approach. Springer, 2020, Exercise 12.13.9
Question
How does the capacity of a network change when:
(a) An extra fully-connected layer is added to the network
(b) Some perceptrons are dropped out of the network
(c) The weights are constrained to be kept small
Solution
(a) Increases. Adding an extra fully-connected layer increases the number of parameters and allows the network to learn more complex functions. The added depth enables the network to capture higher-order interactions in the data, potentially increasing its representational power. However, deeper networks may also be harder to train due to issues like vanishing gradients.
(b) Can decrease effective capacity (but not parameter count). Dropout randomly deactivates neurons during training, which prevents co-adaptation and encourages robustness. While the total number of parameters remains unchanged, the effective capacity of the network is reduced because fewer units participate in each forward pass. However, at test time, all neurons are active with scaled weights, meaning the full network is used.
(c) Decreases. Constraining weights (e.g., via L2 regularization) limits how much each neuron can contribute to the output. This reduces the flexibility of the model, making it less prone to overfitting but also potentially limiting its ability to learn complex patterns.
Exercise 6.8
Level: ** (Moderate)
Exercise Type: Novel
Question
Dropout in Linear Regression
1. How does applying dropout during training differ from its use during testing in linear regression models?
2. Consider a linear regression model: [math]\displaystyle{ y = w_1x_1 + w_2x_2 + w_3x_3 + b }[/math] If dropout is applied to the inputs [math]\displaystyle{ x_1, x_2, x_3 }[/math] with a dropout rate of [math]\displaystyle{ p = 0.2 }[/math], describe the expected effect on the effective input during training.
3. Explain how dropout can affect the model’s capacity and generalization in regression tasks.
4. How would applying dropout to the weights of a linear regression model (instead of the inputs) impact the learned parameters and model performance?
Solution
1. Dropout during training vs. testing: During training, dropout randomly deactivates a fraction of the neurons (or inputs, in the case of linear regression) at each forward pass. This simulates training on different "thinned" versions of the network, helping to prevent co-adaptation of features. During testing, dropout is not applied; instead, all neurons are active, and the weights are scaled down by the dropout probability to approximate the effect of averaging over the thinned networks.
2. Effect on inputs with [math]\displaystyle{ p = 0.2 }[/math]: If a dropout rate of 0.2 is applied, 20% of the inputs [math]\displaystyle{ x_1, x_2, x_3 }[/math] will be set to zero randomly during training. On average, each input contributes only 80% of its value to the output during training. Therefore, the effective input becomes scaled by a factor of [math]\displaystyle{ 1 - p = 0.8 }[/math]. The modified equation during training becomes: [math]\displaystyle{ \hat{y} = 0.8w_1x_1 + 0.8w_2x_2 + 0.8w_3x_3 + b }[/math] This ensures the model does not overly rely on any single input feature.
3. Effect on capacity and generalization: Dropout reduces the effective capacity of the model during training by deactivating some inputs or neurons, forcing the model to distribute learning across all features. This prevents overfitting to the training data and improves generalization to unseen data. However, in some cases, dropout can slightly increase training time as the model requires more epochs to converge due to the reduced capacity during training.
4. Regularization: By preventing any single weight from dominating the learning process, it encourages a more balanced contribution of all features.
Weight Averaging: At test time, the model would use the full weight set, but with scaled values, approximating an ensemble of multiple different sub-models.
Training Stability: Unlike input dropout, weight dropout directly impacts how much each feature contributes to the prediction at every step, potentially leading to a noisier optimization process.
Reduced Overfitting: By forcing the model to rely on different subsets of weights at each training step, it reduces dependency on specific features and helps improve generalization.
Exercise 6.9
Level: * (Easy)
Exercise Type: Novel
Question
This is a quick exercise for demonstrating the effect of batch normalization. Suppose you are given the following mini-batch of inputs:
[math]\displaystyle{ X=[-10, -8, 2, 10] }[/math]
(a) Apply the sigmoid activation function directly to [math]\displaystyle{ X }[/math].
(b) Now, normalize the inputs [math]\displaystyle{ X }[/math] using batch normalization, assuming that the layer has learned parameters [math]\displaystyle{ \gamma = 2 }[/math] and [math]\displaystyle{ \beta = 0.5 }[/math]. Then, apply the sigmoid activation to the normalized inputs.
(c) Why can batch normalization be helpful in models that use activations such as the sigmoid function?
Solution
Note that all of the numerical answers here are rounded to 2 decimal places for conciseness.
(a) The sigmoid function is given by
[math]\displaystyle{ \sigma(x) = \frac{1}{1 + e^{-x}} }[/math]
Applying this to all of the elements of [math]\displaystyle{ X }[/math] results in the following.
[math]\displaystyle{ \sigma(X) = [4.54 \times 10^{-5}, \quad 3.35 \times 10^{-4}, \quad 8.81 \times 10^{-1}, \quad 1.00] }[/math] (the last value is [math]\displaystyle{ \approx 0.99995 }[/math], i.e., essentially saturated at 1).
(b) The mean and variance of [math]\displaystyle{ X }[/math] are given below.
[math]\displaystyle{ \mu_B = -1.5 }[/math]
[math]\displaystyle{ \sigma_B^2 = 64.75 }[/math]
These are used to calculate the normalized form of [math]\displaystyle{ X }[/math].
[math]\displaystyle{ \hat{X} = \frac{X - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} = [-1.06, -0.81, 0.43, 1.43] }[/math]
Then, the learned parameters are used for scaling and shifting.
[math]\displaystyle{ Y = \gamma \hat{X} + \beta = [-1.61, -1.12, 1.37, 3.36] }[/math]
Finally, applying the sigmoid to these normalized, scaled inputs gives
[math]\displaystyle{ \sigma(Y) = [0.17, 0.25, 0.80, 0.97] }[/math]
(c) The batch given in the problem statement had values that were either very large or very small, and therefore were in the saturation regions of the sigmoid function. Without batch normalization, the output contained values that were either extremely close to zero or extremely close to 1. In those regions, the gradient of the sigmoid function is very small, which makes it harder for the model to learn. Batch normalization can be used to keep the inputs in the centre, close to zero, where the gradient is non-negligible. In other words, the distribution of the inputs to a layer may change during training (i.e., internal covariate shift); batch normalization is meant to mitigate this.
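A small numpy check of the computation above (a sketch; [math]\displaystyle{ \epsilon }[/math] is set to a tiny constant, as is standard in batch normalization):
import numpy as np

X = np.array([-10.0, -8.0, 2.0, 10.0])
gamma, beta, eps = 2.0, 0.5, 1e-5
sigmoid = lambda z: 1 / (1 + np.exp(-z))

print(sigmoid(X))                        # (a): outputs saturated near 0 and 1

mu, var = X.mean(), X.var()              # batch statistics: -1.5 and 64.75
X_hat = (X - mu) / np.sqrt(var + eps)    # normalized inputs
Y = gamma * X_hat + beta                 # scale and shift with the learned parameters
print(X_hat, Y, sigmoid(Y))              # (b): activations moved away from the saturation regions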
Exercise 6.10
Level: ** (Moderate)
Exercise Type: Novel
Question
1. Answer the following True or False questions, and explain why:
(a). Batch normalization reduces internal covariate shift, improving model performance.
(b). Beta-smoothness describes how much the output difference [math]\displaystyle{ |f(x_1) - f(x_2)| }[/math] changes when the input difference [math]\displaystyle{ |x_1 - x_2| }[/math] is perturbed.
(c). A smaller L in L-Lipschitz continuity generally leads to better optimization.
(d). If the gradient changes significantly between nearby points, using a large learning rate is a good choice.
(e). Weight normalization is a viable alternative to batch normalization.
2. A neural network layer has 5 neurons, each receiving the same input x. The activations before dropout are:
[math]\displaystyle{ a = [2.0, 1.5, 3.0, 0.5, 2.5] }[/math]
Apply dropout with a dropout rate of 0.4 (i.e., 40% of neurons are randomly set to 0 during training). The remaining activations are scaled to maintain expected values. What are the output activations after applying dropout and scaling?
Solution
1.
(a). False. Initially, batch normalization was believed to mitigate internal covariate shift. However, later research (2019) found that adding noise to the data performed just as well as using BatchNorm. Since noise increases the variance of each iteration, it actually amplifies internal covariate shift rather than reducing it. This suggests that batch normalization does not work by fixing covariate shift but instead improves training through other mechanisms, such as smoothing the optimization landscape.
(b). False. Beta-smoothness refers to a bound on how much the gradient of a function can change, meaning it limits how fast the slope of the function varies. The concept described in the statement actually corresponds to L-Lipschitz continuity, which bounds the maximum rate at which the function values can change with respect to input perturbations.
(c). True. A smaller L means the function has a lower maximum gradient, making it smoother. Smooth functions are easier to optimize because they reduce the risk of abrupt gradient changes (sharp cliffs) that can destabilize training. This leads to more stable updates and better convergence behavior.
(d). False. Large variations in the gradient indicate a highly non-smooth loss landscape. In such cases, a large learning rate can cause the optimizer to overshoot the minimum or oscillate chaotically, much like a ball bouncing between steep slopes. Instead, a more stable learning rate should be used in these situations to ensure smooth convergence.
(e). True. While batch normalization is widely used, it is not the only normalization technique. Weight normalization, which normalizes weight vectors instead of activations, can serve as an effective alternative, especially in cases where batch statistics are unreliable (e.g., small batch sizes or online learning).
2. With a dropout rate of 0.4, 40% of neurons (2 out of 5) are dropped (set to 0). Suppose the second (1.5) and fourth (0.5) neurons are dropped. The remaining activations are:
[math]\displaystyle{ a' = [2.0, 0, 3.0, 0, 2.5] }[/math]
During training, we scale the remaining neurons by [math]\displaystyle{ (\frac{1}{1 - 0.4} = \frac{1}{0.6} = 1.67) }[/math] to maintain the expected activation. Applying this scaling to the nonzero activations:
[math]\displaystyle{ a'' = [2.0 \times 1.67, 0, 3.0 \times 1.67, 0, 2.5 \times 1.67] }[/math]
After applying dropout and scaling, the output activations are:
[math]\displaystyle{ [3.33, 0, 5.00, 0, 4.17] }[/math]
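A hedged numpy sketch of part 2, dropping the same two neurons as in the worked answer and applying the 1/(1-p) scaling:
import numpy as np

a = np.array([2.0, 1.5, 3.0, 0.5, 2.5])
p = 0.4
mask = np.array([1, 0, 1, 0, 1])   # the 2nd and 4th neurons are dropped, as in the solution

a_dropped = a * mask               # [2.0, 0, 3.0, 0, 2.5]
a_scaled = a_dropped / (1 - p)     # survivors scaled by 1/0.6 ≈ 1.67
print(a_scaled)                    # ≈ [3.33, 0, 5.00, 0, 4.17]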
Exercise 6.11
Level: ** (Moderate)
Exercise Types: Novel
Question
This Exercise has the goal of getting a better understanding of what is a Lipschitz function, an L-Smooth function and its implications.
We say that a function is Lipschitz if there exists a constant [math]\displaystyle{ L \ge 0 }[/math] such that [math]\displaystyle{ \forall x, y \in \mathbb{R} }[/math]:
[math]\displaystyle{ |f(x) - f(y)| \le L|x - y| }[/math]
We say that a function is L-Smooth if there exists a constant [math]\displaystyle{ L \ge 0 }[/math] such that [math]\displaystyle{ \forall x, y \in \mathbb{R} }[/math]:
[math]\displaystyle{ |\nabla f(x) - \nabla f(y)| \le L|x - y| }[/math]
Give an example of a function that is not Lipschitz and not L-Smooth. Explain the ramifications of these conditions.
Solution
Let us begin by finding a function that is not Lipschitz. Suppose we have [math]\displaystyle{ f(x) = \sqrt{x} }[/math]. By Lipschitz property:
[math]\displaystyle{ \begin{align*} | f(x_1) - f(x_2) | & \le L |x_1 - x_2 | \\ \frac{| \sqrt{x_1} - \sqrt{x_2} |}{|x_1 - x_2 |} & \le L \\ \end{align*} }[/math]
Now, suppose that [math]\displaystyle{ |x_1 - x_2| \to 0^+ }[/math] with both points near 0. As the gap between [math]\displaystyle{ x_1 }[/math] and [math]\displaystyle{ x_2 }[/math] narrows, while still remaining greater than 0, the function gets increasingly steep. Take for instance [math]\displaystyle{ \sqrt{0.0001} - \sqrt{0.00001} \approx 0.0068 }[/math]. Dividing this by [math]\displaystyle{ 9 \times 10^{-5} }[/math] gives roughly 75.97.
As the gap narrows, this number increases until we are left with:
[math]\displaystyle{ \infty \le L }[/math]
which is absurd. Therefore, [math]\displaystyle{ f(x) = \sqrt{x} }[/math] is not Lipschitz.
When taking the gradient, we see that [math]\displaystyle{ \nabla f(x) = \frac{1}{2\sqrt{x}} }[/math]. Once again, as [math]\displaystyle{ x \to 0 }[/math], the gradient approaches infinity. Therefore, we can also conclude that the function is not L-Smooth.
When considering a loss function and the search for a local minimum, a function whose gradient increases dramatically and quickly can lead to poor convergence and an inability to attain the local minimum (or, in this case, the infimum), since any fixed learning rate will eventually take steps that are too large. Therefore, this would not be an ideal loss function.
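A quick numerical illustration (a sketch, not part of the formal argument): the slope of [math]\displaystyle{ \sqrt{x} }[/math] between two points near 0 grows without bound, which is exactly the failure of the Lipschitz condition.
import numpy as np

for gap in [1e-2, 1e-4, 1e-6, 1e-8]:
    x1, x2 = gap, gap / 10                                 # two points approaching 0
    slope = abs(np.sqrt(x1) - np.sqrt(x2)) / abs(x1 - x2)
    print(f"gap={gap:.0e}  slope={slope:.1f}")             # the required L keeps increasing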
Exercise 6.12
Level: * (Easy)
Exercise Types: Novel
Question
Batch Normalization and Internal Covariate Shift
What is internal covariate shift, and why is it considered a problem in deep learning?
Given a mini-batch of input activations [math]\displaystyle{ x = \{x_1, x_2, ..., x_m\} }[/math], batch normalization normalizes each activation using the batch mean and variance. Write the mathematical formulation of batch normalization for a single activation [math]\displaystyle{ x_i }[/math].
How does batch normalization help in improving optimization and generalization in deep learning models?
Solution
Internal Covariate Shift: Internal covariate shift refers to the change in the distribution of network activations during training as the parameters of previous layers update. This shift forces each layer to continuously adapt to new distributions, slowing down convergence and making optimization more difficult.
Mathematical Formulation: Given a mini-batch [math]\displaystyle{ x = \{x_1, x_2, ..., x_m\} }[/math], batch normalization is computed as follows:
1. Compute the batch mean: [math]\displaystyle{ \mu = \frac{1}{m} \sum_{i=1}^{m} x_i }[/math]
2. Compute the batch variance: [math]\displaystyle{ \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)^2 }[/math]
3. Normalize each activation: [math]\displaystyle{ \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} }[/math]
4. Scale and shift using learnable parameters [math]\displaystyle{ \gamma }[/math] and [math]\displaystyle{ \beta }[/math]: [math]\displaystyle{ y_i = \gamma \hat{x}_i + \beta }[/math]
Effect on Optimization and Generalization:
Optimization: Batch normalization smooths the loss landscape by normalizing activations, making gradients more predictable and allowing for larger learning rates. This speeds up convergence and stabilizes training.
Generalization: By reducing the sensitivity of the model to small variations in input distributions, batch normalization acts as a form of regularization, reducing the risk of overfitting.
Exercise 6.13
Level: * (Easy)
Exercise Types: Novel
Question
1. Which of the following is not a commonly used form of regularization in deep neural networks?
- A) Weight Decay
- B) Dropout
- C) Label smoothing
- D) Training the net until it hits 100% accuracy on the training data
2. Label smoothing is mostly directed at preventing which of the following?
- A) Large model bias
- B) Overconfidence in the predictions
- C) Too many data augmentations
- D) Slower training speed
Solution
1. Correct Answer: D
Short Reason: Weight Decay, Dropout, and Label Smoothing are all frequently used ways to reduce overfitting. Training a network until it hits 100% accuracy on the training data is not a regularization method and usually signals severe overfitting.
2. Correct Answer: B
Short Reason: Label smoothing lowers the target probability assigned to any single class, which reduces the model's overconfidence in its own predictions and helps it generalize better.
Exercise 6.14
Level: * (Easy)
Exercise Types: Novel
Question
1. What is the role of using Dropout when training neural networks?
- A) Speed up computation (reduce the number of neurons)
- B) Improve the generalization ability of the model (introduce noise to reduce the coadaptability of neurons)
- C) Reduce the number of parameters (force neurons to share weights)
- D) Increasing the complexity of the model (randomly adjusting the activation function of neurons)
2. What mechanism does Batch Normalization use to accelerate the training of neural networks?
- A) Reduce the problems of vanishing and exploding gradients
- B) Randomly discard neurons
- C) Increase the number of network layers
- D) Reduce the number of weights
Solution
1. Correct Answer: B
Short Reason: Dropout reduces co-adaptation by randomly dropping neurons, thereby enhancing the model's generalization ability and reducing overfitting.
2. Correct Answer: A
Short Reason: Batch Normalization alleviates vanishing and exploding gradients by normalizing activations, which increases gradient stability and accelerates convergence.
Exercise 6.15
Level: ** (Moderate)
Exercise Types: Novel
Question
1. What is the working principle of dropout?
2. How does dropout affect the training process of a neural network?
3. What is the role of dropout and why does the model performance decrease if not using dropout?
Solution
1. Dropout is a common regularization method which aims at preventing overfitting in deep learning. The principle is to randomly drop a portion of the neurons in the network during training, such as setting their outputs to zero. This "drop" operation is done proportionally, typically using a hyperparameter called the dropout rate. The dropout rate indicates the probability of a neuron being dropped in each iteration. For example, if the dropout rate is 0.5, each neuron has a 50% chance of being dropped during each forward pass.
2. Dropping neurons randomly prevents the network from relying on the output of individual neurons during learning, which enhances the model's ability to generalize. Dropout is especially helpful for preventing overfitting when the training data is small or the model is complex.
3. The reasons are:
- Dropout prevents overfitting by randomly dropping neurons during training. When dropout is not used, all neurons participate in every update, so the network can co-adapt and over-rely on particular neurons, which tends to produce weights that are tuned too closely to the training data and generalize poorly.
- If the outputs are not properly rescaled (e.g., multiplying the retained activations by [math]\displaystyle{ \frac{1}{1-p} }[/math] during training, or the weights by [math]\displaystyle{ 1-p }[/math] at test time), the expected activation magnitudes differ between training and inference, which can make the model sensitive to its inputs and result in inaccurate predictions.
Exercise 6.16
Level: * (Easy)
Exercise Type: Novel
Question
Dropout and Batch Normalization are two widely used techniques in deep learning for improving generalization and optimization.
1) Dropout: Explain how dropout prevents overfitting in deep neural networks. How does it relate to training multiple "thinned" networks?
2) Batch Normalization: What is the primary motivation behind Batch Normalization, and how does it address internal covariate shift?
Consider a scenario where you train a deep neural network with both Dropout and Batch Normalization.
• How do these techniques interact during training, and why can their combined use sometimes lead to suboptimal performance?
• What adjustments can be made to effectively combine these techniques?
Solution
1) Dropout and Overfitting Prevention:
• Dropout randomly deactivates neurons during training, preventing co-adaptation among neurons.
• It forces the network to learn redundant, independent features, acting as an ensemble of multiple "thinned" networks.
• At test time, dropout is turned off, and the weights are scaled accordingly to approximate the effect of averaging multiple models.
2) Batch Normalization and Internal Covariate Shift:
• Internal covariate shift refers to the change in distribution of activations across layers during training, slowing down convergence.
• Batch Normalization normalizes activations within each mini-batch, ensuring a stable distribution of inputs for each layer.
• It introduces learnable parameters (scale and shift) to preserve model expressiveness.
3) Interaction of Dropout and Batch Normalization:
• Issue: Dropout introduces noise by randomly deactivating neurons, while Batch Normalization relies on stable statistics of mini-batches.
• Conflict: The randomness of Dropout disrupts the mean and variance estimates of Batch Normalization, leading to unstable training.
Adjustments for Compatibility:
• Using Dropout after Batch Normalization (rather than before) to avoid interference in mean/variance computation.
• Reducing Dropout rate when using Batch Normalization, as BatchNorm itself provides regularization.
• Alternatively, using Spatial Dropout in convolutional networks, which drops entire feature maps rather than individual neurons.
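A minimal PyTorch sketch of the ordering suggested above (the layer sizes are arbitrary for illustration): batch normalization is placed before the nonlinearity, dropout after the activation, with a modest dropout rate since BatchNorm already provides some regularization.
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),   # normalize pre-activations to stabilize training
    nn.ReLU(),
    nn.Dropout(p=0.2),    # lighter dropout, applied after the activation
    nn.Linear(64, 10),
)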
Exercise 6.17
Level: ** (Moderate)
Exercise Types: Novel
Question
This Exercise aims to explore the relationship between L-Smoothness and Lipschitz continuity of a function. We say that a function is Lipschitz if there exists a constant [math]\displaystyle{ L \ge 0 }[/math] such that [math]\displaystyle{ \forall x, y \in \mathbb{R} }[/math]:
[math]\displaystyle{ |f(x) - f(y)| \le L|x - y| }[/math]
We say that a function is L-Smooth if there exists a constant [math]\displaystyle{ L \ge 0 }[/math] such that [math]\displaystyle{ \forall x, y \in \mathbb{R} }[/math]:
[math]\displaystyle{ |\nabla f(x) - \nabla f(y)| \le L|x - y| }[/math]
Consider the following question:
- Can an L-Smooth function be non-Lipschitz? Provide an example and explain the reasoning behind your answer. Discuss the implications of this scenario and the effects it could have when considering optimization or convergence in machine learning.
Solution
Yes, an L-Smooth function can be non-Lipschitz.
Consider the function [math]\displaystyle{ f(x) = x^2 }[/math] on [math]\displaystyle{ \mathbb{R} }[/math].
First, let’s check that it is L-Smooth:
The gradient of [math]\displaystyle{ f(x) }[/math] is: [math]\displaystyle{ \nabla f(x) = 2x }[/math].
For any [math]\displaystyle{ x_1, x_2 \in \mathbb{R} }[/math], the difference in the gradients is: [math]\displaystyle{ |\nabla f(x_1) - \nabla f(x_2)| = 2|x_1 - x_2| }[/math].
Thus, the function is L-Smooth with constant [math]\displaystyle{ L = 2 }[/math].
Now, let’s check whether it is Lipschitz:
For Lipschitz continuity, we would need a constant [math]\displaystyle{ L }[/math] such that [math]\displaystyle{ |f(x_1) - f(x_2)| \leq L|x_1 - x_2| }[/math] for all [math]\displaystyle{ x_1, x_2 }[/math].
However, [math]\displaystyle{ \frac{|f(x_1) - f(x_2)|}{|x_1 - x_2|} = \frac{|x_1^2 - x_2^2|}{|x_1 - x_2|} = |x_1 + x_2| }[/math], which is unbounded (for example, take [math]\displaystyle{ x_2 = x_1 + 1 }[/math] and let [math]\displaystyle{ x_1 \to \infty }[/math]). Therefore, no finite constant works, and the function is not Lipschitz.
Implications: Lipschitz continuity of the function bounds how fast its values (and hence its gradient norm) can grow, while L-Smoothness bounds how fast the gradient itself can change. For optimization, L-Smoothness is the key property: with a step size no larger than [math]\displaystyle{ 1/L }[/math], gradient descent is guaranteed to decrease an L-Smooth objective. Many common losses, such as the squared error, are L-Smooth but not Lipschitz, so their gradients can be very large far from the minimum; this motivates techniques such as gradient clipping and careful initialization to keep early updates stable.
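A quick numerical sanity check (a numpy sketch): for [math]\displaystyle{ f(x) = x^2 }[/math], the gradient-difference ratio is always 2, while the function-difference ratio [math]\displaystyle{ |f(x_1) - f(x_2)| / |x_1 - x_2| }[/math] grows without bound.
import numpy as np

f = lambda x: x ** 2
grad = lambda x: 2 * x

for x1 in [1.0, 10.0, 100.0, 1000.0]:
    x2 = x1 + 1.0
    grad_ratio = abs(grad(x1) - grad(x2)) / abs(x1 - x2)   # always 2: L-Smooth with L = 2
    func_ratio = abs(f(x1) - f(x2)) / abs(x1 - x2)         # equals |x1 + x2|: unbounded
    print(x1, grad_ratio, func_ratio)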
Exercise 7.1
Level: ** (Moderate)
Exercise Types: Novel
References: Hesaraki, S. (2023). Feature Map. Medium. www.medium.com/@saba99/feature-map-35ba7e6c689e
This article describes more details about convolutional layers and feature maps.
Question
In this problem, we are interested in how convolutional neural networks learn information during training. For this, we will be plotting the "feature maps" of a CNN, which show the output of a convolutional layer after the learned filters have been applied to a sample image.
For this example, you can use the MNIST (handwritten digits) or the FashionMNIST (clothes) dataset. Each dataset has 60 000 greyscale (1 channel) training images that are each 28x28 pixels in size. The following script can be used to download the MNIST dataset, and a very similar one can be used for FashionMNIST:
import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_dataset = torchvision.datasets.MNIST(root="./data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
Write a CNN that consists of 3 convolutional layers each with a kernel size of 3 and a padding of 1 pixel in thickness. Each one will learn 6 filters. After each convolutional layer, apply a ReLU activation and a pooling layer with a kernel size of 2 and a stride of 2. After that, use a fully connected hidden layer followed by the output layer (the 2 datasets mentioned above each have 10 classes).
Train the neural network for ~3 epochs. Don't worry too much about the accuracy.
After training, select a random image from the training dataset and run it through the trained model. After each of the 3 convolutional layers, make a plot of the result for each of the 6 filters. Comment on your observations.
Solution
The convolutional neural network:
Note: to save space, the Python imports are not shown here.
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(6, 6, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(6, 6, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(6 * 3 * 3, 64)
        self.fc2 = nn.Linear(64, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        x = x.view(-1, 6 * 3 * 3)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = SimpleCNN()
optimizer = optim.Adam(model.parameters(), lr=0.005)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):  # 3 epochs
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
The feature maps:
random_idx = np.random.randint(len(train_dataset))
image, label = train_dataset[random_idx]  # getting a sample image
image = image.unsqueeze(0)

with torch.no_grad():  # getting the feature maps
    conv1_output = F.relu(model.conv1(image))                 # feature map 1
    conv1_output_pooled = model.pool(conv1_output)
    conv2_output = F.relu(model.conv2(conv1_output_pooled))   # feature map 2
    conv2_output_pooled = model.pool(conv2_output)
    conv3_output = F.relu(model.conv3(conv2_output_pooled))   # feature map 3
    conv3_output_pooled = model.pool(conv3_output)

fig, axes = plt.subplots(4, 6, figsize=(6, 5))
axes[0, 0].imshow(image.reshape(28, 28), cmap="gray")  # plotting the sample image
axes[0, 0].set_title("Sample Image", fontsize=10)
axes[0, 0].axis("off")
for j in range(1, 6):
    axes[0, j].axis("off")

outputs = [conv1_output, conv2_output, conv3_output]
layer_names = ["conv1", "conv2", "conv3"]
for row, (layer_output, name) in enumerate(zip(outputs, layer_names), start=1):
    for col in range(6):
        axes[row, col].imshow(layer_output[0, col].cpu().numpy(), cmap="viridis")
        axes[row, col].axis("off")
    axes[row, 0].set_title(name.title(), fontsize=10)

plt.tight_layout()
plt.show()
According to these images, the first layer detects edges (i.e., after the first layer, the image looks like a high-pass filter was applied to it, outlining the edges), while the second layer detects the rough overall shape. After the third convolutional layer, the image is very abstracted.
Exercise 7.2
Level: * (Easy)
Exercise Type: Novel
Question
Compare max pooling, average pooling, and global pooling on CIFAR-10 dataset.
Solution
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, AveragePooling2D, GlobalMaxPooling2D, GlobalAveragePooling2D, Flatten, Dense
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
import matplotlib.pyplot as plt

# Load CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Normalize the dataset
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# One-hot encode the labels
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# Function to create a model with different pooling strategies
def create_model(pooling_layer):
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
        pooling_layer,
        Flatten(),
        Dense(64, activation='relu'),
        Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Pooling layers to compare
pooling_strategies = {
    'MaxPooling': MaxPooling2D((2, 2)),
    'AveragePooling': AveragePooling2D((2, 2)),
    'GlobalMaxPooling': GlobalMaxPooling2D(),
    'GlobalAveragePooling': GlobalAveragePooling2D()
}

# Train and evaluate models
results = {}
for name, pooling_layer in pooling_strategies.items():
    print(f"Training with {name}")
    model = create_model(pooling_layer)
    history = model.fit(x_train, y_train, epochs=10, batch_size=64, validation_split=0.2, verbose=0)
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
    results[name] = (history, test_loss, test_acc)

# Plot training and validation accuracy
plt.figure(figsize=(12, 8))
for name, (history, _, _) in results.items():
    plt.plot(history.history['val_accuracy'], label=f"{name} (val_acc)")
plt.title('Validation Accuracy for Different Pooling Strategies')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()

# Print test accuracies
for name, (_, _, test_acc) in results.items():
    print(f"{name}: Test Accuracy = {test_acc:.4f}")
We can conclude the following information from the plot:
1. MaxPooling achieves the highest validation accuracy overall. It implies that MaxPooling is the most effective pooling strategy for this dataset and architecture, likely because it preserves dominant features while reducing dimensionality.
2. AveragePooling performs slightly worse than MaxPooling but follows a similar trend. It provides a smoother feature extraction method but may lose some fine details.
3. GlobalMaxPooling and GlobalAveragePooling perform significantly worse, showing that they discard too much spatial information, leading to weaker model performance on CIFAR-10.
Exercise 7.3
Level: * (Easy)
Exercise Type: Novel
Question
Consider training a deep neural network where the gradients of the loss with respect to parameters sometimes explode to very large values. To mitigate this, gradient clipping is applied with a threshold [math]\displaystyle{ \tau = 5 }[/math].
Let [math]\displaystyle{ g = (g_1, g_2) }[/math] be a gradient vector with norm [math]\displaystyle{ | g | = \sqrt{g_1^2 + g_2^2} }[/math]. The clipped gradient is defined as:
[math]\displaystyle{ g_{\text{clipped}} = g \cdot \min \left(1, \frac{\tau}{\| g \|} \right). }[/math] Suppose at an iteration, we have [math]\displaystyle{ g_1 = 6 }[/math] and [math]\displaystyle{ g_2 = 8 }[/math].
(a) Compute the clipped gradient [math]\displaystyle{ g_{\text{clipped}} }[/math].
(b) Explain why gradient clipping is necessary in deep networks and its effect on training dynamics.
(c) Assume we are training an LSTM with gradient clipping and a learning rate of 0.01. The norm of the unmodified gradient at time step t is [math]\displaystyle{ | g_t | = 20 }[/math], and the updated gradient after clipping is [math]\displaystyle{ | g_{\text{clipped},t} | = 5 }[/math].
Compute the weight update for a parameter w given that the gradient component in that direction is [math]\displaystyle{ g_t^w = 10 }[/math]. If gradient clipping was not applied, how would the update change, and what potential issue could occur?
Solution
(a) Computing the clipped gradient: The gradient norm is:
[math]\displaystyle{ \| g \| = \sqrt{6^2 + 8^2} = 10. }[/math] Since [math]\displaystyle{ | g | \gt \tau }[/math], we apply clipping:
[math]\displaystyle{ g_{\text{clipped}} = g \cdot \frac{5}{10} = (6,8) \cdot 0.5 = (3,4). }[/math] Thus, the clipped gradient is (3,4).
(b) Effect of Gradient Clipping: Gradient clipping prevents exploding gradients by rescaling large updates, stabilizing training in deep networks and recurrent models (e.g., LSTMs). It helps avoid divergence and allows for larger learning rates while preventing numerical instability.
(c) Computing the Weight Update in an LSTM with Clipping:
The clipped gradient norm is 5, meaning that the updated gradient is scaled by (5 / 20) = 0.25 of its original value. The clipped gradient for the weight w is: [math]\displaystyle{ g_{\text{clipped},t}^w = g_t^w \cdot 0.25 = 10 \times 0.25 = 2.5. }[/math] The weight update using learning rate [math]\displaystyle{ \eta = 0.01 }[/math]: [math]\displaystyle{ \Delta w = -\eta g_{\text{clipped},t}^w = -0.01 \times 2.5 = -0.025. }[/math] If gradient clipping was not applied:
The original update would be: [math]\displaystyle{ \Delta w = -0.01 \times 10 = -0.1. }[/math] This significantly larger update could lead to instability or divergence in training, especially in deep networks like LSTMs, where gradient magnitudes can vary significantly over time.
Exercise 7.4
Level: * (Easy)
Exercise Type: Novel
Question
In convolutional neural networks (CNNs), parameter sharing is a key property that is different from fully-connected neural networks (FCNNs).
1. Explain the concept of parameter sharing in CNNs. How does it differ from fully connected layers?
2. Consider a convolutional layer with an input of size [math]\displaystyle{ 32 \times 32 \times 3 }[/math] (height, width, channels) and a filter of size [math]\displaystyle{ 5 \times 5 \times 3 }[/math]. How many parameters does this filter have, including bias(es)?
3. Discuss one advantage and one limitation of parameter sharing in CNNs.
Solution
1. Concept of Parameter Sharing:
In a fully connected layer, each neuron has its own set of unique weights for every input feature, leading to a large number of parameters. In CNNs, the same set of filter weights is applied across different spatial locations, meaning the parameters are shared across the input. This greatly reduces the total number of parameters and improves generalization.
Furthermore, the concept of parameter sharing (or lack thereof) makes CNNs and FCNNs suited for different tasks. Because the same filter in a CNN is applied everywhere, a feature (like an edge or corner) will be detected regardless of where it appears in the image. For example, if a filter learns to detect cats in the top-left corner, it will also detect cats in the bottom-right. This makes CNNs well-suited for spatially structured data, such as images and videos, and work well in tasks such as object detection, segmentation, motion analysis, etc. In contrast, FCNNs are better suited for tasks where each feature is independent of its position relative to other features, such as tabular datasets. Examples of this include customer data in a marketing problem (e.g., demographics, spending habits, etc.), healthcare records, business analytics, etc.
2. Computing Parameters:
Weight parameter(s): [math]\displaystyle{ 5 \times 5 \times 3 = 75 }[/math]. The size of a filter includes all 3 of its dimensions.
Bias parameter(s): 1. Each filter has one bias, regardless of dimensions.
Total parameters per filter: [math]\displaystyle{ 75 + 1 = 76 }[/math]
Therefore, a single convolutional filter of size [math]\displaystyle{ 5 \times 5 \times 3 }[/math] has 76 parameters in total.
3. Advantages and limitations:
Advantage: Parameter sharing reduces the number of parameters; this is more computationally efficient than having many parameters and also makes CNNs less prone to overfitting.
Additional note: Parameter sharing is particularly useful for lower-layer filters, like edge detectors, which are applied across the entire image. Since CNNs are naturally equivariant to translation, meaning that when the image shifts, the feature map shifts correspondingly, they can detect features like edges and corners anywhere in the image, which improves generalization and helps capture spatial hierarchies. By sharing filters/parameters, CNNs use fewer parameters, making them more efficient and less likely to overfit.
Limitation: Parameter sharing assumes that the learned features are equally useful across the entire input, which may not be ideal for tasks where spatial location is important (e.g., detecting objects in fixed positions).
Additional note: Since parameter sharing assumes that the same features (like edges, textures, etc.) are equally important throughout the entire image, this works well for detecting general patterns that can appear anywhere. However, for tasks where the specific location of features matters, a feature might be crucial in one part of the image (like the top-left corner) but irrelevant in another part (like the bottom-right corner), applying the same filter everywhere cannot capture location-specific features effectively. Solutions include including deformable convolutions to allow adaptive receptive fields and self-attention mechanisms to capture long-range dependencies.
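The count in part 2 can be verified with a short PyTorch sketch (one output filter is used so that the total corresponds to a single [math]\displaystyle{ 5 \times 5 \times 3 }[/math] filter):
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)   # one 5x5x3 filter plus one bias
n_params = sum(p.numel() for p in conv.parameters())
print(n_params)   # 5*5*3 + 1 = 76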
Exercise 7.5
Level: * (Easy)
Exercise Type: Novel
Question
Given an image matrix and a filter matrix, compute both the cross-correlation and convolution outputs without padding and using a stride of 1.
Image Matrix: [math]\displaystyle{ I = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} }[/math]
Filter Matrix: [math]\displaystyle{ K = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} }[/math]
1. Compute the cross-correlation output.
2. Compute the convolution output.
3. Explain why cross-correlation is used in deep learning instead of convolution.
Solution
1. Cross-correlation:
Cross-correlation applies the filter as is, without flipping it.
[math]\displaystyle{ O(i, j) = \sum_{m} \sum_{n} I(i+m, j+n) K(m, n) }[/math]
For each position:
(1,1): [math]\displaystyle{ (1 \times 0) + (2 \times 1) + (4 \times -1) + (5 \times 0) = -2 }[/math]
(1,2): [math]\displaystyle{ (2 \times 0) + (3 \times 1) + (5 \times -1) + (6 \times 0) = -2 }[/math]
(2,1): [math]\displaystyle{ (4 \times 0) + (5 \times 1) + (7 \times -1) + (8 \times 0) = -2 }[/math]
(2,2): [math]\displaystyle{ (5 \times 0) + (6 \times 1) + (8 \times -1) + (9 \times 0) = -2 }[/math]
[math]\displaystyle{ O = \begin{bmatrix} -2 & -2 \\ -2 & -2 \end{bmatrix} }[/math]
2. Convolution:
Convolution first flips the filter horizontally and vertically before applying it.
[math]\displaystyle{ O(i, j) = \sum_{m} \sum_{n} I(i-m, j-n) K(m, n) = \sum_{m} \sum_{n} I(i+m, j+n) K(-m, -n) }[/math]
Flipped Filter:
[math]\displaystyle{
K_{flipped} =
\begin{bmatrix}
0 & -1 \\
1 & 0
\end{bmatrix}
}[/math]
Applying the flipped filter:
(1,1): [math]\displaystyle{ (1 \times 0) + (2 \times -1) + (4 \times 1) + (5 \times 0) = 2 }[/math]
(1,2): [math]\displaystyle{ (2 \times 0) + (3 \times -1) + (5 \times 1) + (6 \times 0) = 2 }[/math]
(2,1): [math]\displaystyle{ (4 \times 0) + (5 \times -1) + (7 \times 1) + (8 \times 0) = 2 }[/math]
(2,2): [math]\displaystyle{ (5 \times 0) + (6 \times -1) + (8 \times 1) + (9 \times 0) = 2 }[/math]
[math]\displaystyle{
O =
\begin{bmatrix}
2 & 2 \\
2 & 2
\end{bmatrix}
}[/math]
3. In deep learning, we use cross-correlation instead of convolution because flipping the filter (as done in convolution) is not needed. In CNNs, filters are learned automatically, so the model adjusts their weights without requiring a flip. Skipping this step makes computations simpler and faster while still detecting the same features. Since CNNs learn patterns during training, using cross-correlation instead of convolution does not change the results, making it the preferred choice in deep learning.
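The two outputs can be checked numerically; the sketch below assumes scipy is available (correlate2d applies the filter as-is, while convolve2d flips it first):
import numpy as np
from scipy.signal import correlate2d, convolve2d

I = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
K = np.array([[0, 1], [-1, 0]])

print(correlate2d(I, K, mode="valid"))   # [[-2, -2], [-2, -2]]
print(convolve2d(I, K, mode="valid"))    # [[ 2,  2], [ 2,  2]]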
Exercise 7.6
Level: * (Easy)
Exercise Types: Novel
Question
What architectural properties make Convolutional Neural Networks (CNNs) different from standard feedforward neural networks?
Solution
- Local Connectivity: In CNNs, each neuron is only connected to a localized region of the input rather than to all input units as in fully connected layers. This local connectivity makes the connection pattern sparse, which makes the model more efficient and reduces overfitting.
- Weight Sharing: A filter with fixed weights is applied across different locations of the input. This reduces the total number of parameters compared to fully connected networks, significantly lowering computational complexity and enhancing generalization.
- Translation Equivariance: Because the same filter slides over the input, the model can easily recognize features with different locations.
- Pooling Operations: Pooling reduces the dimensions of layers, making the model more robust to small changes. This also helps reduce dimensionality and computation.
- Hierarchical Feature Learning: CNNs learn a hierarchy of features, from low-level in the earlier layers to high-level features in the deeper layers. This hierarchical structure makes CNNs highly effective for complex tasks like image recognition and object detection.
Exercise 7.7
Level: * (Easy)
Exercise Types: Novel
Question
Sobel filter is a popular edge-detection filter. Sobel filter can be seen as combining:
A 1D derivative filter, [math]\displaystyle{ \begin{bmatrix} 1\ 0\ -1 \end{bmatrix} }[/math], in one direction and a 1D “blur” filter, [math]\displaystyle{ \begin{bmatrix} 1\\ 2\\ 1 \end{bmatrix} }[/math], in the other direction
1. Horizontal Sobel Filter
- It is often written as [math]\displaystyle{ \begin{bmatrix} 1 & 0 & -1\\ 2 & 0 & -2\\ 1 & 0 & -1 \end{bmatrix}. }[/math]
See how this comes from multiplying [math]\displaystyle{ \begin{bmatrix}1\\ 2\\ 1 \end{bmatrix} }[/math] by [math]\displaystyle{ \begin{bmatrix} 1 & 0 & -1 \end{bmatrix} }[/math].
2. Vertical Sobel Filter
- Construct the [math]\displaystyle{ 3\times 3 }[/math] matrix for the vertical Sobel filter by swapping the roles of “blur” and “derivative,” and write it out explicitly.
3. Application
- Given the [math]\displaystyle{ 3\times 3 }[/math] image patch [math]\displaystyle{ \begin{bmatrix} 2 & 2 & 2\\ 2 & 5 & 7\\ 1 & 6 & 9 \end{bmatrix}, }[/math] apply your vertical Sobel filter to the center pixel (i.e., do a 3×3 convolution) and calculate the filter’s response.
Hint: For the vertical Sobel, think of smoothing left-to-right, then taking the derivative top-to-bottom.
Solution
Step 1. Horizontal Sobel Filter (Review)
We can write a vertical blur filter as [math]\displaystyle{ B_v = \begin{bmatrix}1\\ 2\\ 1 \end{bmatrix} }[/math] and a horizontal derivative filter as [math]\displaystyle{ D_h = \begin{bmatrix}1 & 0 & -1\end{bmatrix} }[/math]. Their outer product gives [math]\displaystyle{ S_{\text{horizontal}} = B_v \times D_h = \begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix} }[/math].
Step 2. Vertical Sobel Filter
For vertical edge detection, we blur horizontally and differentiate vertically. Thus, let [math]\displaystyle{ B_h = \begin{bmatrix}1\ 2\ 1 \end{bmatrix} }[/math] (horizontal blur) and [math]\displaystyle{ D_v = \begin{bmatrix} 1\\ 0\\ -1 \end{bmatrix} }[/math] (vertical derivative). The vertical Sobel kernel is their outer product: [math]\displaystyle{ S_{\text{vertical}} = D_v \times B_h = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix}. }[/math]
Step 3. Applying the Vertical Sobel Filter
Consider the image patch [math]\displaystyle{ I = \begin{bmatrix} 2 & 2 & 2\\ 2 & 5 & 7\\ 1 & 6 & 9 \end{bmatrix}, }[/math]
The filter response at the center pixel is given by the sum of element‐wise products: [math]\displaystyle{ \text{Response} = \sum_{i=1}^3\sum_{j=1}^3 \bigl(S_{\text{vertical}}(i,j)\cdot I(i,j)\bigr). }[/math]
Computing row by row:
- Top row: [math]\displaystyle{ (1\times 2) + (2\times 2) + (1\times 2) = 2 + 4 + 2 = 8 }[/math]
- Middle row: [math]\displaystyle{ (0\times 2) + (0\times 5) + (0\times 7) = 0 }[/math]
- Bottom row: [math]\displaystyle{ ((-1)\times 1) + ((-2)\times 6) + ((-1)\times 9) = -1 -12 -9 = -22 }[/math]
Summing these:
[math]\displaystyle{ 8 + 0 - 22 = -14. }[/math]
Hence, the vertical Sobel filter response at the center pixel is [math]\displaystyle{ -14 }[/math], indicating a strong vertical intensity change in that region (the rows below the centre are brighter than the rows above).
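A short numpy sketch of Steps 2 and 3 (the outer product and the response at the centre pixel):
import numpy as np

D_v = np.array([[1], [0], [-1]])   # vertical derivative (column vector)
B_h = np.array([[1, 2, 1]])        # horizontal blur (row vector)
S_vertical = D_v @ B_h             # outer product: [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]

I = np.array([[2, 2, 2], [2, 5, 7], [1, 6, 9]])
response = np.sum(S_vertical * I)  # element-wise product, then sum
print(response)                    # -14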
Exercise 7.8
Level: * (Easy)
Exercise Types: Novel
Question
Answer whether the following statements about Convolutional Neural Networks (CNNs) are True or False.
1. CNNs are designed to work only with grayscale images.
2. Pooling layers in CNNs help reduce the spatial dimensions of the input while preserving important features.
3. Fully connected layers in a CNN help maintain spatial relationships between pixels.
Solution
1. False – CNNs can process images with multiple channels, such as RGB images (three channels) or even hyperspectral images with many channels.
2. True – Pooling operations, such as max pooling, downsample the feature maps, making the network more efficient while retaining essential information.
3. False – Fully connected layers discard spatial relationships since they treat the input as a single vector, unlike convolutional layers that preserve spatial structure.
Exercise 7.9
Level: * (Easy)
Exercise Types: Novel
Question
In the context of Convolutional Neural Networks (CNNs), explain the following concepts and their significance in image-based deep learning tasks:
(a) How do convolutional layers reduce the number of parameters compared to fully connected layers, and why is this advantageous?
(b) What role do pooling layers (e.g., max-pooling) play in achieving translation invariance?
(c) How does the hierarchical structure of CNNs (e.g., stacking convolutional and pooling layers) enable the learning of complex features from raw pixel data?
Solution
(a) Convolutional layers apply small filters (kernels) that slide across the input image, computing dot products locally. These filters are shared across all spatial positions (e.g., a 3×3 kernel uses the same weights for every patch of the image).
(b) Pooling downsamples feature maps by aggregating local regions (e.g., 2×2 windows). Max-pooling selects the maximum activation in each window. It reduces overfitting by compressing spatial dimensions and prioritizes the presence of features over their exact location.
(c) Early Layers: Detect low-level features (edges, corners, colors).
Example: A 3×3 filter might activate for horizontal edges.
Middle Layers: Combine edges into textures or shapes (e.g., circles, stripes).
Example: A filter might respond to "eye-like" patterns.
Deep Layers: Assemble shapes into high-level semantic features (e.g., faces, objects).
Exercise 8.1
Level: * (Easy)
Exercise Types: Novel
Question
In relation to Gated Recurrent Units (GRUs), write a Python script to:
- Simulate a sequence of 20 time steps where the input [math]\displaystyle{ x_t }[/math] is a sinusoidal wave.
- Use a simple GRU cell with randomly initialized weights.
- Visualize the evolution of the hidden state [math]\displaystyle{ h_t }[/math] over time.
Solution
import numpy as np
import matplotlib.pyplot as plt
# Sigmoid and tanh activation functions
sigmoid = lambda x: 1 / (1 + np.exp(-x))
tanh = lambda x: np.tanh(x)
# Initialize weights randomly
np.random.seed(42)
W_z, U_z, b_z = np.random.randn(3) * 0.5
W_r, U_r, b_r = np.random.randn(3) * 0.5
W_h, U_h, b_h = np.random.randn(3) * 0.5
# Generate sinusoidal input sequence
timesteps = 20
x_seq = np.sin(np.linspace(0, 4 * np.pi, timesteps))
# Initialize hidden state
h_t = 0
hidden_states = []
# GRU forward pass
for x_t in x_seq:
    z_t = sigmoid(W_z * x_t + U_z * h_t + b_z)
    r_t = sigmoid(W_r * x_t + U_r * h_t + b_r)
    h_tilde_t = tanh(W_h * x_t + U_h * (r_t * h_t) + b_h)
    h_t = (1 - z_t) * h_t + z_t * h_tilde_t
    hidden_states.append(h_t)
# Plot hidden state evolution
plt.plot(hidden_states, label="Hidden State h_t")
plt.xlabel("Time step")
plt.ylabel("Hidden State Value")
plt.title("GRU Hidden State Evolution")
plt.legend()
plt.show()
The plot shows how the hidden state [math]\displaystyle{ h_t }[/math] evolves over time as it processes the sinusoidal input.
Exercise 8.2
Level: * (Easy)
Exercise Types: Novel
Question
In Lecture 8, about BPTT algorithm, write the expression of [math]\displaystyle{ \frac{\partial L}{\partial U} }[/math]. Explain why the vanishing and exploding gradient problem occurs in BPTT and how it affects training.
Solution
[math]\displaystyle{ \frac{\partial L}{\partial U} = \sum_{t} \frac{\partial L_{t}}{\partial U} = \sum_{t} \frac{\partial L_{t}}{\partial S_{t}} \cdot \frac{\partial S_{t}}{\partial U} = \sum_{t} (\text{something computable}) \cdot x_{t} }[/math]
Here the "something computable" factor involves products of the Jacobians [math]\displaystyle{ \frac{\partial S_{j}}{\partial S_{j-1}} }[/math] accumulated back through time, since [math]\displaystyle{ S_t }[/math] depends on [math]\displaystyle{ U }[/math] both directly through [math]\displaystyle{ x_t }[/math] and indirectly through all earlier states.
The vanishing and exploding gradient problem occurs in BPTT because gradients are backpropagated through many time steps. If the recurrent weights are small, these Jacobian products shrink exponentially, making learning slow (vanishing gradients). If the weights are large, the products grow exponentially and lead to instability (exploding gradients). Gradient clipping can be applied to limit gradient magnitudes and prevent instability.
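A hedged PyTorch sketch of the gradient clipping remedy (the model, data, and threshold below are placeholders for illustration; clip_grad_norm_ rescales all gradients so that their combined norm does not exceed max_norm):
import torch
import torch.nn as nn

model = nn.RNN(input_size=8, hidden_size=16, batch_first=True)   # placeholder recurrent model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x = torch.randn(4, 20, 8)        # batch of 4 sequences, 20 time steps each
target = torch.randn(4, 20, 16)  # dummy targets

output, _ = model(x)
loss = loss_fn(output, target)
loss.backward()                                                    # BPTT computes gradients through time
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # rescale if the norm exceeds 1
optimizer.step()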
Exercise 8.3
Level: *** (Difficult)
Exercise Type: Novel
Question
Recurrent Neural Networks (RNNs) suffer from the vanishing and exploding gradient problem due to the repeated multiplication of Jacobians through time steps.
Consider a simple RNN with a hidden state update equation:
[math]\displaystyle{ h_t = \tanh(W h_{t-1} + U x_t + b) }[/math]
where [math]\displaystyle{ W }[/math] is the recurrent weight matrix, [math]\displaystyle{ U }[/math] is the input weight matrix, and [math]\displaystyle{ x_t }[/math] is the input at time step [math]\displaystyle{ t }[/math]. Explain why the gradient of the loss with respect to [math]\displaystyle{ W }[/math] can either explode or vanish over long time steps.
Assume the eigenvalues of [math]\displaystyle{ W }[/math] are given by [math]\displaystyle{ \lambda_1, \lambda_2, ..., \lambda_n }[/math]. What condition on [math]\displaystyle{ \lambda_i }[/math] would lead to the vanishing gradient problem, and what condition would cause the exploding gradient problem?
To mitigate these issues, gradient clipping is often applied during backpropagation through time (BPTT). Suppose we apply gradient clipping with a threshold [math]\displaystyle{ \tau = 1 }[/math] to a gradient [math]\displaystyle{ g = (g_1, g_2, g_3) }[/math] with norm [math]\displaystyle{ | g | = 5 }[/math]. Compute the clipped gradient vector.
Solution
Why Gradients Vanish or Explode:
The gradient of the loss with respect to [math]\displaystyle{ W }[/math] involves the product of multiple Jacobians through time:
[math]\displaystyle{ \frac{\partial L}{\partial W} \propto \prod_{t=1}^{T} W^T \nabla h_t }[/math]
If [math]\displaystyle{ W }[/math] has eigenvalues less than 1, repeated multiplication causes the gradient to shrink exponentially, leading to vanishing gradients. This prevents earlier time steps from having a significant influence on training.
If [math]\displaystyle{ W }[/math] has eigenvalues greater than 1, repeated multiplication causes the gradient to grow exponentially, leading to exploding gradients, causing unstable updates.
Eigenvalue Condition for Vanishing and Exploding Gradients:
Vanishing Gradient: If [math]\displaystyle{ |\lambda_i| \lt 1 }[/math], the gradients diminish exponentially over time. Exploding Gradient: If [math]\displaystyle{ |\lambda_i| \gt 1 }[/math], the gradients grow exponentially, causing instability. Applying Gradient Clipping:
Given [math]\displaystyle{ g = (g_1, g_2, g_3) }[/math] with norm [math]\displaystyle{ | g | = 5 }[/math] and clipping threshold [math]\displaystyle{ \tau = 1 }[/math]: [math]\displaystyle{ g_{\text{clipped}} = g \cdot \frac{1}{5} = (g_1, g_2, g_3) \times 0.2. }[/math] The clipped gradient vector is: [math]\displaystyle{ g_{\text{clipped}} = (0.2 g_1, 0.2 g_2, 0.2 g_3). }[/math]
Additional note: The matrix [math]\displaystyle{ W }[/math] can be decomposed into eigenvectors and eigenvalues. During backpropagation, the gradient is multiplied by [math]\displaystyle{ W }[/math] at each layer (or time step in RNNs). This multiplication scales the gradient along the eigenvectors of [math]\displaystyle{ W }[/math], with the scaling factor determined by the corresponding eigenvalues. If [math]\displaystyle{ W }[/math] has large eigenvalues (greater than 1), the gradient is stretched exponentially along those eigenvectors, causing the gradient to grow quickly across layers, leading to exploding gradients. Conversely, if [math]\displaystyle{ W }[/math] has small eigenvalues (less than 1), the gradient is shrunk exponentially, causing it to vanish as it propagates back through many layers or time steps, leading to vanishing gradients.
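A small numpy illustration of the eigenvalue condition (a sketch with made-up 2×2 matrices): repeatedly multiplying a vector by a matrix whose eigenvalues are below 1 shrinks it exponentially, while eigenvalues above 1 blow it up.
import numpy as np

W_small = np.diag([0.9, 0.8])   # eigenvalues < 1: vanishing
W_large = np.diag([1.1, 1.2])   # eigenvalues > 1: exploding

g_vanish = np.array([1.0, 1.0])
g_explode = np.array([1.0, 1.0])
for t in range(50):             # 50 "time steps" of backpropagation
    g_vanish = W_small @ g_vanish
    g_explode = W_large @ g_explode

print(np.linalg.norm(g_vanish))    # about 5e-3: the gradient has nearly vanished
print(np.linalg.norm(g_explode))   # about 9e3: the gradient has exploded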
Exercise 8.4
Level: * (Easy)
Exercise Types: Novel
Question
Train an LSTM-based model to classify movie reviews as positive/negative using the IMDB dataset and compare LSTM with a simple Dense network.
Solution
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Flatten
from tensorflow.keras.optimizers import Adam

# Load IMDB dataset with the top 10,000 words
vocab_size = 10000
max_length = 200
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

# Pad sequences to ensure uniform length
x_train = pad_sequences(x_train, maxlen=max_length)
x_test = pad_sequences(x_test, maxlen=max_length)

lstm_model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=128, input_length=max_length),
    LSTM(64, return_sequences=False),
    Dense(1, activation='sigmoid')
])
lstm_model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Train the LSTM model
history_lstm = lstm_model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=5, batch_size=64)

dense_model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=128, input_length=max_length),
    Flatten(),
    Dense(128, activation='relu'),
    Dense(1, activation='sigmoid')
])
dense_model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Train the Dense model
history_dense = dense_model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=5, batch_size=64)

# Plot accuracy comparison
plt.figure(figsize=(12, 6))
plt.plot(history_lstm.history['accuracy'], label='LSTM Train Acc')
plt.plot(history_lstm.history['val_accuracy'], label='LSTM Val Acc')
plt.plot(history_dense.history['accuracy'], label='Dense Train Acc')
plt.plot(history_dense.history['val_accuracy'], label='Dense Val Acc')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.title('LSTM vs. Dense Model Accuracy')
plt.show()
The plot compares the accuracy of the LSTM model and the Dense model.

Both the training accuracy and the validation accuracy of the LSTM model are higher than those of the Dense model. The LSTM performs better because it captures sequential dependencies in the text by maintaining information over long spans, whereas the Dense network flattens the embeddings and discards word order, so it struggles with long-term context, although it is faster to train.
Exercise 8.5
Level: * (Easy)
Exercise Types: Novel
Question
1. Match the following properties to Leaky Units, RNNs, or Gated RNNs.
(a). Uses explicit gating to control memory flow → ________________
(b). Memory retention is controlled by a simple decay factor → ________________
(c). Can suffer from vanishing gradient problems → ________________
(d). Can learn patterns but has difficulty with long-term dependencies → ________________
(e). Has a separate cell state to regulate information storage → ________________
(f). Computationally simplest among all three models → ________________
(g). Uses reset and update gates to modify memory retention → ________________
2.
Here are the formulas for Leaky Units and Gated RNNs (GRU) from Lecture 8.
Leaky Units:
[math]\displaystyle{ s_{t,i} = \left(1 - \frac{1}{\tau_i}\right) s_{t-1} + \frac{1}{\tau_i} \sigma(W s_{t-1} + U x_t) }[/math]
where:
- [math]\displaystyle{ \tau_i }[/math] controls the memory decay for each component [math]\displaystyle{ i }[/math] of the state vector;
- [math]\displaystyle{ s_t }[/math] is the state at time [math]\displaystyle{ t }[/math], and [math]\displaystyle{ \sigma }[/math] is a nonlinear activation function (e.g., sigmoid).
Gated Recurrent Networks (GRU):
[math]\displaystyle{ \begin{aligned} r_t &= \sigma(W_r s_{t-1} + U_r x_t) \quad \text{(reset gate)} \\ z_t &= \sigma(W_z s_{t-1} + U_z x_t) \quad \text{(update gate)} \\ \tilde{s}_t &= \tanh(W \cdot (r_t \odot s_{t-1}) + U \cdot x_t) \quad \text{(temporary state)} \\ s_t &= z_t \odot s_{t-1} + (1 - z_t) \odot \tilde{s}_t \quad \text{(final state)} \end{aligned} }[/math]
where:
- [math]\displaystyle{ r_t }[/math] is the reset gate;
- [math]\displaystyle{ z_t }[/math] is the update gate;
- [math]\displaystyle{ \tilde{s}_t }[/math] is the temporary state;
- [math]\displaystyle{ s_t }[/math] is the final state at time [math]\displaystyle{ t }[/math].
Now compare Leaky Units and Gated RNNs: what are their similarities and differences?
Solution
1.
(a). Gated RNNs
(b). Leaky Units
(c). RNNs
(d). Leaky Units and RNNs
(e). Gated RNNs
(f). Leaky Units
(g). Gated RNNs
2.
Similarities: Both Leaky Units and Gated RNNs are recurrent models that update their states using information from the previous state and current input. They are both designed to handle sequential data and capture long-term dependencies, though they do so in different ways. Additionally, both models use nonlinear activation functions like sigmoid or tanh in their computations.
Differences: Leaky Units use a fixed decay rate [math]\displaystyle{ \tau_i }[/math] to mix past and current states, making them simpler and faster to compute, but less effective at handling long-term memory. Because the decay is fixed rather than learned, they can still suffer from vanishing gradients when the time constants are poorly chosen, and while they are easier to train, they are not well suited to complex tasks. On the other hand, Gated RNNs use learnable gates (the reset and update gates in a GRU, or the forget and input gates in an LSTM) to control how much of the previous memory is kept and how much of the new input is used. These gates are learned during training, which makes Gated RNNs more flexible and better at handling long-term dependencies, but also more complex and slower to train.
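To make the contrast concrete, here is a minimal NumPy sketch of a single state update for each model, following the formulas given in the question; the dimensions and randomly initialized weights are illustrative only, not values from the lecture.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, k = 4, 3                                  # state size, input size (illustrative)
rng = np.random.default_rng(0)
W, U = rng.normal(size=(d, d)), rng.normal(size=(d, k))
s_prev, x_t = rng.normal(size=d), rng.normal(size=k)

# Leaky unit: fixed per-component time constants tau_i set by hand
tau = np.array([1.5, 2.0, 5.0, 10.0])
s_leaky = (1 - 1/tau) * s_prev + (1/tau) * sigmoid(W @ s_prev + U @ x_t)

# GRU: the mixing is computed by gates from the data, so it is learned, not fixed
W_r, U_r = rng.normal(size=(d, d)), rng.normal(size=(d, k))
W_z, U_z = rng.normal(size=(d, d)), rng.normal(size=(d, k))
r = sigmoid(W_r @ s_prev + U_r @ x_t)        # reset gate
z = sigmoid(W_z @ s_prev + U_z @ x_t)        # update gate
s_tilde = np.tanh(W @ (r * s_prev) + U @ x_t)  # temporary (candidate) state
s_gru = z * s_prev + (1 - z) * s_tilde         # final state, as in the formula above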
Exercise 8.6
Level: * (Medium)
Exercise Types: Novel
Question
Dataset: https://data.cityofnewyork.us/Environment/Air-Quality/c3uy-2p5r/about_data Build a Recurrent Neural Network (RNN) to predict the next hour's PM2.5 concentration using the New York City Air Quality Dataset. Use PyTorch to implement the model, and train it on sequences of 24 hours of historical data (including features like temperature, wind speed, and ozone levels).
Solution
import pandas as pd
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import r2_score, mean_absolute_error
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# --------------------------------------------------
# 1. Enhanced Data Preparation with Metrics
# --------------------------------------------------
def load_data(filepath):
    df = pd.read_csv(filepath)

    # Filter relevant data
    pm_ozone = df[
        (df['Name'].isin(['Fine particles (PM 2.5)', 'Ozone (O3)'])) &
        (df['Data Value'].notna())
    ].copy()

    # Convert to annual data
    pm_ozone['year'] = pm_ozone['Time Period'].str.extract(r'(\d{4})').astype(float)
    pm_ozone = pm_ozone.dropna(subset=['year'])
    pm_ozone['year'] = pm_ozone['year'].astype(int)

    # Pivot table with error handling
    try:
        pivot_df = pm_ozone.pivot_table(
            index=['Geo Place Name', 'year'],
            columns='Name',
            values='Data Value',
            aggfunc='mean'
        ).reset_index()
    except ValueError:
        raise RuntimeError("Pivot failed - check for duplicate entries")

    # Handle missing values
    pivot_df = pivot_df.groupby('Geo Place Name').apply(
        lambda x: x.ffill().bfill()
    ).reset_index(drop=True)

    return pivot_df.dropna()

# --------------------------------------------------
# 2. Data Scaling with Validation
# --------------------------------------------------
def scale_data(df):
    scaler = RobustScaler()
    features = scaler.fit_transform(df[['Fine particles (PM 2.5)', 'Ozone (O3)']])

    # Ensure no NaNs after scaling
    if np.isnan(features).any():
        raise ValueError("NaN values detected after scaling")

    return features, scaler

# --------------------------------------------------
# 3. Sequence Creation with Quality Control
# --------------------------------------------------
def create_sequences(features, years, seq_length=3):
    X, y = [], []

    # Group by neighborhood
    unique_neighborhoods = years['Geo Place Name'].unique()
    for neighborhood in unique_neighborhoods:
        mask = years['Geo Place Name'] == neighborhood
        neighborhood_features = features[mask]
        neighborhood_years = years[mask]['year'].values

        # Check for consecutive years
        year_diffs = np.diff(neighborhood_years)
        if len(neighborhood_years) < seq_length + 1 or not np.all(year_diffs == 1):
            continue

        # Create sequences
        for i in range(len(neighborhood_features) - seq_length):
            seq = neighborhood_features[i:i+seq_length]
            target = neighborhood_features[i+seq_length][0]  # PM2.5 is first feature
            X.append(seq)
            y.append(target)

    return np.array(X), np.array(y)

# --------------------------------------------------
# 4. Model with Metrics Tracking
# --------------------------------------------------
class AirQualityRNN(nn.Module):
    def __init__(self, input_size=2, hidden_size=16):
        super().__init__()
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        out, _ = self.gru(x)
        return self.fc(out[:, -1, :]).squeeze()

def train_and_validate(model, train_loader, test_loader, epochs=30):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    criterion = nn.MSELoss()
    metrics = {'train_loss': [], 'test_mae': [], 'test_r2': []}

    for epoch in range(epochs):
        # Training
        model.train()
        train_loss = 0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()
            train_loss += loss.item()

        # Validation
        model.eval()
        y_true, y_pred = [], []
        with torch.no_grad():
            for X_batch, y_batch in test_loader:
                preds = model(X_batch)
                y_true.extend(y_batch.numpy())
                y_pred.extend(preds.numpy())

        # Calculate metrics
        mae = mean_absolute_error(y_true, y_pred)
        r2 = r2_score(y_true, y_pred)
        metrics['train_loss'].append(train_loss / len(train_loader))
        metrics['test_mae'].append(mae)
        metrics['test_r2'].append(r2)

        print(f"Epoch {epoch+1}/{epochs}")
        print(f"Train Loss: {metrics['train_loss'][-1]:.4f}")
        print(f"Test MAE: {mae:.4f} | R²: {r2:.4f}\n")

    return metrics

# --------------------------------------------------
# 5. Main Execution with Metrics Reporting
# --------------------------------------------------
if __name__ == "__main__":
    # Load and prepare data
    df = load_data("Air_Quality_20250202.csv")
    features, scaler = scale_data(df)

    # Create sequences
    X, y = create_sequences(features, df[['Geo Place Name', 'year']])

    # Split data
    split = int(0.8 * len(X))
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]

    # Create datasets
    class AirDataset(Dataset):
        def __init__(self, X, y):
            self.X = torch.tensor(X, dtype=torch.float32)
            self.y = torch.tensor(y, dtype=torch.float32)

        def __len__(self):
            return len(self.X)

        def __getitem__(self, idx):
            return self.X[idx], self.y[idx]

    train_loader = DataLoader(AirDataset(X_train, y_train), batch_size=32, shuffle=True)
    test_loader = DataLoader(AirDataset(X_test, y_test), batch_size=32, shuffle=False)

    # Initialize and train model
    model = AirQualityRNN()
    metrics = train_and_validate(model, train_loader, test_loader)

    # Final evaluation
    print("\nFinal Performance:")
    print(f"Best R² Score: {max(metrics['test_r2']):.4f}")
    print(f"Best MAE: {min(metrics['test_mae']):.4f}")
Fundamental Problems
Classification
Consider data [math]\displaystyle{ \{(x_i, y_i)\}_{i=1}^n }[/math] where [math]\displaystyle{ x_i \in \mathbb{R}^d }[/math] and [math]\displaystyle{ y_i }[/math] takes values in some finite set. Find a function [math]\displaystyle{ f }[/math] such that, when we observe a new [math]\displaystyle{ x }[/math], we predict [math]\displaystyle{ y }[/math] to be [math]\displaystyle{ f(x) }[/math].

Regression
Consider data [math]\displaystyle{ \{(x_i, y_i)\}_{i=1}^n }[/math] where [math]\displaystyle{ \mathbf{x}_i \in \mathbb{R}^d }[/math] and [math]\displaystyle{ y_i }[/math] takes values in [math]\displaystyle{ \mathbb{R} }[/math]. Find a function [math]\displaystyle{ f }[/math] such that, when we observe a new [math]\displaystyle{ \mathbf{x} }[/math], we predict [math]\displaystyle{ y }[/math] to be [math]\displaystyle{ f(\mathbf{x}) }[/math].

Clustering
Consider data [math]\displaystyle{ \{x_i\}_{i=1}^n }[/math] where [math]\displaystyle{ \mathbf{x}_i \in \mathbb{R}^d }[/math]. Find a function [math]\displaystyle{ f }[/math] such that, when we observe a new [math]\displaystyle{ \mathbf{x} }[/math], we predict [math]\displaystyle{ y }[/math] to be [math]\displaystyle{ f(\mathbf{x}) }[/math], and similar inputs [math]\displaystyle{ \mathbf{x} }[/math] are assigned the same [math]\displaystyle{ y }[/math].

Perceptron
Define a cost function [math]\displaystyle{ \phi(\beta, \beta_0) }[/math] as the sum of the distances between all misclassified points and the hyperplane (the decision boundary). To minimize this cost function, we estimate [math]\displaystyle{ \beta }[/math] and [math]\displaystyle{ \beta_0 }[/math]:

[math]\displaystyle{ \min_{\beta, \beta_0} \phi(\beta, \beta_0) = \{\text{distance of all misclassified points}\}. }[/math]

(1) A hyperplane [math]\displaystyle{ L }[/math] can be defined as [math]\displaystyle{ L = \{ \mathbf{x} : f(\mathbf{x}) = \beta^T \mathbf{x} + \beta_0 = 0 \} }[/math]. For any two points [math]\displaystyle{ \mathbf{x}_1 }[/math] and [math]\displaystyle{ \mathbf{x}_2 }[/math] on [math]\displaystyle{ L }[/math], we have [math]\displaystyle{ \beta^T \mathbf{x}_1 + \beta_0 = 0 }[/math] and [math]\displaystyle{ \beta^T \mathbf{x}_2 + \beta_0 = 0 }[/math], so [math]\displaystyle{ \beta^T (\mathbf{x}_1 - \mathbf{x}_2) = 0 }[/math]. Therefore, [math]\displaystyle{ \beta }[/math] is orthogonal to the hyperplane: it is the normal vector.

(2) For any point [math]\displaystyle{ \mathbf{x}_0 }[/math] on [math]\displaystyle{ L }[/math], [math]\displaystyle{ \beta^T \mathbf{x}_0 + \beta_0 = 0 }[/math], which means [math]\displaystyle{ \beta^T \mathbf{x}_0 = -\beta_0 }[/math].

(3) Let [math]\displaystyle{ \beta^* = \frac{\beta}{\|\beta\|} }[/math] be the unit normal vector of the hyperplane [math]\displaystyle{ L }[/math]. The signed distance of a point [math]\displaystyle{ \mathbf{x} }[/math] to [math]\displaystyle{ L }[/math] is

[math]\displaystyle{ \beta^{*T} (\mathbf{x} - \mathbf{x}_0) = \frac{\beta^T \mathbf{x} - \beta^T \mathbf{x}_0}{\|\beta\|} = \frac{\beta^T \mathbf{x} + \beta_0}{\|\beta\|}, }[/math]

where [math]\displaystyle{ \mathbf{x}_0 }[/math] is any point on [math]\displaystyle{ L }[/math]. Hence, [math]\displaystyle{ \beta^T \mathbf{x} + \beta_0 }[/math] is proportional to the distance of the point [math]\displaystyle{ \mathbf{x} }[/math] from the hyperplane [math]\displaystyle{ L }[/math].

(4) The distance from a misclassified data point [math]\displaystyle{ \mathbf{x}_i }[/math] to the hyperplane [math]\displaystyle{ L }[/math] is [math]\displaystyle{ d_i = -y_i (\beta^T \mathbf{x}_i + \beta_0) }[/math], where [math]\displaystyle{ y_i }[/math] is the target value, with [math]\displaystyle{ y_i = 1 }[/math] if [math]\displaystyle{ \beta^T \mathbf{x}_i + \beta_0 \lt 0 }[/math] and [math]\displaystyle{ y_i = -1 }[/math] if [math]\displaystyle{ \beta^T \mathbf{x}_i + \beta_0 \gt 0 }[/math], i.e., the point is misclassified. In that case [math]\displaystyle{ \beta^T \mathbf{x}_i + \beta_0 }[/math] has the opposite sign of [math]\displaystyle{ y_i }[/math], so the negative sign in front makes the distance positive. The cost is then [math]\displaystyle{ \phi(\beta, \beta_0) = -\sum_{i \in M} y_i (\beta^T \mathbf{x}_i + \beta_0) }[/math], where [math]\displaystyle{ M }[/math] is the set of misclassified points.
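As an illustration, here is a minimal NumPy sketch of the perceptron update that follows from this cost function: for each misclassified point, the gradient step moves [math]\displaystyle{ \beta }[/math] toward [math]\displaystyle{ y_i \mathbf{x}_i }[/math]. The toy data and learning rate are illustrative assumptions, not values from the lecture.

import numpy as np

def perceptron(X, y, lr=0.1, epochs=100):
    # X: (n, d) data matrix, y: labels in {-1, +1}
    # Returns (beta, beta0) defining the separating hyperplane
    n, d = X.shape
    beta, beta0 = np.zeros(d), 0.0
    for _ in range(epochs):
        updated = False
        for i in range(n):
            # Misclassified if y_i and beta^T x_i + beta0 have opposite signs
            if y[i] * (X[i] @ beta + beta0) <= 0:
                beta += lr * y[i] * X[i]   # gradient step on -y_i (beta^T x_i + beta0)
                beta0 += lr * y[i]
                updated = True
        if not updated:                    # all points classified correctly
            break
    return beta, beta0

# Toy linearly separable example (illustrative)
X = np.array([[2.0, 3.0], [1.0, 1.5], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
beta, beta0 = perceptron(X, y)
print(beta, beta0)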
Backpropagation

The backpropagation procedure is carried out using the following steps (a minimal sketch follows the list):

1. Perform a forward pass to compute the activations of every layer and the loss at the output.

2. Compute the gradient of the loss with respect to the output of the network.

3. Propagate this gradient backward through the network, applying the chain rule at each layer to obtain the gradient with respect to every weight and bias.

4. Update the weights and biases with a gradient-descent step.
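Here is a minimal NumPy sketch of these four steps for a one-hidden-layer network with a sigmoid activation and squared-error loss; the architecture, data, and learning rate are illustrative assumptions, not from the lecture.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 8 samples, 3 features, scalar target
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

# One hidden layer with 4 units (illustrative sizes)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.1

for epoch in range(100):
    # 1. Forward pass
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    y_hat = a1 @ W2 + b2
    loss = 0.5 * np.mean((y_hat - y) ** 2)

    # 2. Gradient of the loss at the output
    delta2 = (y_hat - y) / len(X)

    # 3. Backward pass: chain rule through each layer
    dW2 = a1.T @ delta2
    db2 = delta2.sum(axis=0)
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)   # sigmoid derivative
    dW1 = X.T @ delta1
    db1 = delta1.sum(axis=0)

    # 4. Gradient-descent update
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1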
Epochs

It is common to cycle through all of the data points multiple times in order to reach convergence. An epoch represents one cycle in which you feed all of your data points through the neural network. It is good practice to randomize the order in which you feed the points to the neural network within each epoch; this can prevent the weights from changing in cycles. The number of epochs required for convergence depends greatly on the learning rate and the convergence criterion used.

Stein's Unbiased Risk Estimator
Model Selection
- The general task in machine learning is estimating a function.
- We want to estimate: [math]\displaystyle{ \hat{f}(x) }[/math] (estimated function).
- There is a true underlying function: [math]\displaystyle{ f(x) }[/math] (true function).
Definitions and Notations
Let [math]\displaystyle{ T = \{(x_i, y_i)\}_{i=1}^n }[/math] be the training set, where
[math]\displaystyle{ f(\cdot) \rightarrow }[/math] true function,
[math]\displaystyle{ \hat{f}(\cdot) \rightarrow }[/math] estimated function.

Also assume:
[math]\displaystyle{ y_i = f(x_i) + \epsilon_i }[/math], where [math]\displaystyle{ \epsilon_i \sim \mathcal{N}(0, \sigma^2) }[/math],
[math]\displaystyle{ \hat{y}_i = \hat{f}(x_i) }[/math],
[math]\displaystyle{ f_i \equiv f(x_i) }[/math],
[math]\displaystyle{ \hat{f}_i \equiv \hat{f}(x_i) }[/math].

For a point [math]\displaystyle{ (x_0, y_0) }[/math], we are interested in:

[math]\displaystyle{ \begin{aligned} E[(\hat{y}_0 - y_0)^2] &= E[(\hat{f}_0 - f_0 - \epsilon_0)^2] \\ &= E[(\hat{f}_0 - f_0)^2 + \epsilon_0^2 - 2\epsilon_0 (\hat{f}_0 - f_0)] \\ &= E[(\hat{f}_0 - f_0)^2] + E[\epsilon_0^2] - 2E[\epsilon_0 (\hat{f}_0 - f_0)] \\ &= E[(\hat{f}_0 - f_0)^2] + \sigma^2 - 2E[\epsilon_0 (\hat{f}_0 - f_0)] \end{aligned} }[/math]

Case 1

Assume [math]\displaystyle{ (x_0, y_0) \notin T }[/math]. In this case, since [math]\displaystyle{ \hat{f} }[/math] is estimated only from points in the training set, it is completely independent of [math]\displaystyle{ (x_0, y_0) }[/math]:

[math]\displaystyle{ E[\epsilon_0 (\hat{f}_0 - f_0)] = \text{cov}(\epsilon_0, \hat{f}_0) = 0. }[/math]

Summing over all [math]\displaystyle{ m }[/math] points that are not in [math]\displaystyle{ T }[/math]:

[math]\displaystyle{ \underbrace{\sum_{i=1}^m (\hat{y}_i - y_i)^2}_{\text{err}} = \underbrace{\sum_{i=1}^m (\hat{f}_i - f_i)^2}_{\text{Err}} + m \sigma^2 }[/math]

So the empirical error ([math]\displaystyle{ \text{err} }[/math]) is a good estimator of the true error ([math]\displaystyle{ \text{Err} }[/math]), up to the constant [math]\displaystyle{ m\sigma^2 }[/math], when the points are not in the training set.

Case 2

Assume [math]\displaystyle{ (x_0, y_0) \in T }[/math]. Then [math]\displaystyle{ 2E[\epsilon_0 (\hat{f}_0 - f_0)] \neq 0 }[/math], because the fitted value [math]\displaystyle{ \hat{f}_0 }[/math] depends on the noise [math]\displaystyle{ \epsilon_0 }[/math] in its own training target.
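To see the difference between the two cases numerically, here is a small NumPy simulation (an illustrative sketch, not part of the lecture): a polynomial least-squares fit is a linear smoother, and its in-sample residual error systematically underestimates the true error, while the error on fresh points at the same inputs does not.

import numpy as np

rng = np.random.default_rng(0)
n, sigma = 50, 1.0
x = np.linspace(0, 1, n)
f = np.sin(2 * np.pi * x)                    # true function

# Degree-5 polynomial least squares: a linear smoother y_hat = H y
X = np.vander(x, 6, increasing=True)
H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix

in_sample, out_sample, true_err = [], [], []
for _ in range(2000):
    y = f + sigma * rng.normal(size=n)       # training targets (in T)
    y_new = f + sigma * rng.normal(size=n)   # fresh targets at the same x (not in T)
    f_hat = H @ y
    in_sample.append(np.sum((f_hat - y) ** 2))
    out_sample.append(np.sum((f_hat - y_new) ** 2))
    true_err.append(np.sum((f_hat - f) ** 2))

print("mean err, in-sample :", np.mean(in_sample))    # below Err + n*sigma^2
print("mean err, fresh data:", np.mean(out_sample))   # close to Err + n*sigma^2
print("Err + n*sigma^2     :", np.mean(true_err) + n * sigma**2)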
Stein's Lemma
If [math]\displaystyle{ x \sim \mathcal{N}(\theta, \sigma^2) }[/math] and [math]\displaystyle{ g(x) }[/math] is differentiable, then:

[math]\displaystyle{ E[g(x)(x - \theta)] = \sigma^2 E\left[\frac{\partial g(x)}{\partial x}\right] }[/math]

Applying this to our problem, with [math]\displaystyle{ \epsilon_0 \sim \mathcal{N}(0, \sigma^2) }[/math] and [math]\displaystyle{ g(\epsilon_0) = \hat{f}_0 - f_0 }[/math]:

[math]\displaystyle{ \begin{aligned} E[\epsilon_0 (\hat{f}_0 - f_0)] &= \sigma^2 E\left[\frac{\partial (\hat{f}_0 - f_0)}{\partial \epsilon_0}\right] \\ &= \sigma^2 E\left[\frac{\partial \hat{f}_0}{\partial \epsilon_0} - \frac{\partial f_0}{\partial \epsilon_0}\right] \\ &= \sigma^2 E\left[\frac{\partial \hat{f}_0}{\partial \epsilon_0}\right] \\ &= \sigma^2 E\left[\frac{\partial \hat{f}_0}{\partial y_0} \cdot \frac{\partial y_0}{\partial \epsilon_0}\right] \\ &= \sigma^2 E\left[\frac{\partial \hat{f}_0}{\partial y_0}\right] \end{aligned} }[/math]

since [math]\displaystyle{ f_0 }[/math] does not depend on [math]\displaystyle{ \epsilon_0 }[/math], and [math]\displaystyle{ y_0 = f_0 + \epsilon_0 }[/math] implies [math]\displaystyle{ \partial y_0 / \partial \epsilon_0 = 1 }[/math]. Writing [math]\displaystyle{ D_0 = \frac{\partial \hat{f}_0}{\partial y_0} }[/math], we obtain:

[math]\displaystyle{ E[(\hat{y}_0 - y_0)^2] = E[(\hat{f}_0 - f_0)^2] + \sigma^2 - 2\sigma^2 E[D_0] }[/math]

Sum over all [math]\displaystyle{ n }[/math] data points:
[math]\displaystyle{ \underbrace{\sum_{i=1}^n (\hat{y}_i - y_i)^2}_{\text{err}} = \underbrace{\sum_{i=1}^n (\hat{f}_i - f_i)^2}_{\text{Err}} + n\sigma^2 - 2\sigma^2 \sum_{i=1}^n D_i }[/math]

Rearranging:

[math]\displaystyle{ \text{Err} = \text{err} - n\sigma^2 + \underbrace{2\sigma^2 \sum_{i=1}^n D_i}_{\text{complexity of the model}} }[/math]

The right-hand side is Stein's Unbiased Risk Estimator (SURE): an unbiased estimate of the true error [math]\displaystyle{ \text{Err} }[/math], where [math]\displaystyle{ D_i = \partial \hat{f}_i / \partial y_i }[/math] measures how sensitive each fitted value is to its own observation.
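As a concrete illustration (not from the lecture notes), consider a linear smoother [math]\displaystyle{ \hat{\mathbf{y}} = H \mathbf{y} }[/math], such as linear or ridge regression. Then [math]\displaystyle{ D_i = \partial \hat{f}_i / \partial y_i = H_{ii} }[/math], so the complexity term reduces to the trace of the hat matrix:

[math]\displaystyle{ \text{SURE} = \text{err} - n\sigma^2 + 2\sigma^2 \operatorname{tr}(H). }[/math]

Here [math]\displaystyle{ \operatorname{tr}(H) }[/math] plays the role of the effective number of parameters; for ordinary least squares with [math]\displaystyle{ d }[/math] predictors, [math]\displaystyle{ \operatorname{tr}(H) = d }[/math], and SURE is equivalent to Mallows' [math]\displaystyle{ C_p }[/math] up to constants that do not depend on the model.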