stat940W25

Notes on Exercises

Exercises are numbered using a two-part system, where the first number represents the lecture number and the second number represents the exercise number. For example:

1.1 refers to the first exercise in Lecture 1.
2.3 refers to the third exercise in Lecture 2.

Students are encouraged to complete these exercises as they follow the lecture content to deepen their understanding.

Exercise 1.1

Level: ** (Moderate)

Exercise Types: Novel

Each exercise you contribute should fall into one of the following categories:

Novel: Preferred – An original exercise created by you.
Modified: Valued – An exercise adapted or significantly altered from an existing source.
Copied: Permissible – An exercise reproduced exactly as it appears in the source.

References: Source: (e.g., book or other resources, if a webpage has its URL), Chapter,Page Number.

Question

Prove that the Perceptron Learning Algorithm converges in a finite number of steps if the dataset is linearly separable.

Hint:Note: exc Assume that the dataset ${(x_{i}, y_{i})}_{i = 1}^{N}$ is linearly separable, where $x_{i} \in R^{d}$ are the input vectors, and $y_{i} \in {- 1, 1}$ are their corresponding labels. Show that there exists a weight vector $w^{*}$ and a bias $b^{*}$ such that $y_{i} (w^{*} \cdot x_{i} + b^{*}) > 0$ for all $i$ , and use this assumption to bound the number of updates made by the algorithm.

Solution

Step 1: Linear Separability Assumption

If the dataset is linearly separable, there exists a weight vector $w^{*}$ and a bias $b^{*}$ such that: $y_{i} (w^{*} \cdot x_{i} + b^{*}) > 0 \forall i = 1, 2, \dots, N .$ Without loss of generality, let $‖ w^{*} ‖ = 1$ (normalize $w^{*}$ ).

Step 2: Perceptron Update Rule

The Perceptron algorithm updates the weight vector $w$ and bias $b$ as follows:

Initialize $w_{0} = 0$ and $b_{0} = 0$ .
For each misclassified point $(x_{i}, y_{i})$ , update:

$w \leftarrow w + y_{i} x_{i}, b \leftarrow b + y_{i} .$

Define the margin $γ$ of the dataset as: $γ = min_{i} \frac{y_{i} (w^{*} \cdot x_{i} + b^{*})}{‖ x_{i} ‖} .$ Since the dataset is linearly separable, $γ > 0$ .

Step 3: Bounding the Number of Updates

Let $w_{t}$ be the weight vector after $t$ -th update. Define: $M = max_{i} ‖ x_{i} ‖^{2},$ the maximum squared norm of any input vector.

Growth of $‖ w_{t} ‖^{2}$

After $t$ updates, the norm of $w_{t}$ satisfies: $‖ w_{t + 1} ‖^{2} = ‖ w_{t} + y_{i} x_{i} ‖^{2} = ‖ w_{t} ‖^{2} + 2 y_{i} (w_{t} \cdot x_{i}) + ‖ x_{i} ‖^{2} .$ Since the point is misclassified, $y_{i} (w_{t} \cdot x_{i}) < 0$ . Thus: $‖ w_{t + 1} ‖^{2} \leq ‖ w_{t} ‖^{2} + ‖ x_{i} ‖^{2} \leq ‖ w_{t} ‖^{2} + M .$ By induction, after $t$ updates: $‖ w_{t} ‖^{2} \leq t M .$

Lower Bound on $w_{t} \cdot w^{*}$

Let $w_{t}$ be the weight vector after $t$ -th update. Each update increases $w_{t} \cdot w^{*}$ by at least $γ$ : $w_{t + 1} \cdot w^{*} = (w_{t} + y_{i} x_{i}) \cdot w^{*} = w_{t} \cdot w^{*} + y_{i} (x_{i} \cdot w^{*}) .$ Since $y_{i} (x_{i} \cdot w^{*}) \geq γ$ , we have: $w_{t + 1} \cdot w^{*} \geq w_{t} \cdot w^{*} + γ .$ By induction: $w_{t} \cdot w^{*} \geq t γ .$

Combining the Results

The Cauchy-Schwarz inequality gives: $w_{t} \cdot w^{*} \leq ‖ w_{t} ‖ ‖ w^{*} ‖ = ‖ w_{t} ‖ .$ Thus: $t γ \leq ‖ w_{t} ‖ \leq \sqrt{t M} .$ Squaring both sides: $t^{2} γ^{2} \leq t M .$ Dividing through by $t$ (assuming $t > 0$ ): $t \leq \frac{M}{γ^{2}} .$

Step 4: Conclusion

The Perceptron Learning Algorithm converges after at most $\frac{M}{γ^{2}}$ updates, which is finite. This proves that the algorithm terminates when the dataset is linearly separable.

Exercise 1.2

Level: * (Easy)

Exercise Types: Modified

References: Simon J.D. Prince. Understanding Deep learning. 2024

This problem generalized Problem 4.10 in this textbook to $N$ inputs and $M$ outputs.

Question

(a) Consider a deep neural network with a single input, a single output, and $K$ hidden layers, each containing $D$ hidden units. How many parameters does this network have in total?

(b) Now, generalize the problem: if the number of inputs is $N$ and the number of outputs is $M$ , how many parameters does this network have in total?

Solution

(a) Total number of parameters when there is a single input and output:

For the first layer, the input size is $1$ and the output size is $D$ . Therefore, the number of weights is $1 D$ , and the number of biases is $D$ .

Number of parameters: $D + D = 2 D$

For hidden layers $i ⟶ i + 1, i \in 1, . . ., K - 1$ : Each hidden layer connects $D$ units to another $D$ units. Therefore, for each layer, the number of weights is $D^{2}$ , and the number of biases is $D$ .

Number of parameters for all $K - 1$ hidden layers: $(K - 1) (D^{2} + D)$

For the output layer, the number of weights is $D$ , and the number of biases is $1$ .

Number of parameters: $D + 1$

Therefore, the total number of parameters is $2 D + (K - 1) (D^{2} + D) + D + 1$ .

(b) Total number of parameters for $N$ inputs and $M$ outputs:

In this case, the number of parameters for the first layer becomes $N D + D$ , while the number of parameters for the output layer becomes $D M + M$ .

Therefore, in total, the number of parameters is $N D + D + (K - 1) (D^{2} + D) + M D + M$

Exercise 1.3

Level: * (Easy)

Exercise Types: Modified

References: Simon J.D. Prince. Understanding Deep learning. MIT Press, 2023

This problem modified from the background mathematics problem chap01 Question1.

Question

A single linear equation with three inputs associates a value $y$ with each point in a 3D space $(x_{1}, x_{2}, x_{3})$ . Is it possible to visualize this? What value is at position $(0, 0, 0)$ ?

We add an inverse problem: If $y, ω_{1}, ω_{2}, ω_{3}$ and $β$ are known, derive a system of equations to solve for the input values $x_{1}, x_{2}, x_{3}$ that produce a specific output value of $y$ . Under what conditions is this problem solvable?

Solution

A single linear equation with three inputs is of the form:

$y = β + ω_{1} x_{1} + ω_{2} x_{2} + ω_{3} x_{3}$

where $β$ is the offset, and $ω_{1}, ω_{2}, ω_{3}$ are weights for the inputs $x_{1}, x_{2}, x_{3}$ .

We can define the code as follows:

def linear_function_3D(x1, x2, x3, beta, omega1, omega2, omega3):
    y = beta + omega1 * x1 + omega2 * x2 + omega3 * x3
    return y

Given $β = 0.5, ω_{1} = - 1.0, ω_{2} = 0.4$ and $ω_{3} = - 0.3$ ,

$y = β + ω_{1} \cdot 0 + ω_{2} \cdot 0 + ω_{3} \cdot 0$

Thus, $y (0, 0, 0) = 0.5 .$

To visualize, we can fix $x_{3} = 0$ and let $x_{1}, x_{2}$ vary, and generate the $y$ -values using the equation.

Here is the code:

import numpy as np
import matplotlib.pyplot as plt

# Generate grid for x1 and x2, fix x3 = 0
x1 = np.linspace(-10, 10, 100)
x2 = np.linspace(-10, 10, 100)
x1, x2 = np.meshgrid(x1, x2)
x3 = 0  

# Define coefficients
beta = 0.5
omega1 = -1.0
omega2 = 0.4
omega3 = -0.3

# Compute y-values
y = linear_function_3D(x1, x2, x3, beta, omega1, omega2, omega3)

# Visualization
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(x1, x2, y, cmap='viridis')
ax.set_xlabel('x1')
ax.set_ylabel('x2')
ax.set_zlabel('y')
plt.title('3D Linear Function with Fixed x3=0')
plt.show()

The plot is shown below:

For the inverse problem, given $y, β, ω_{1}, ω_{2}, ω_{3},$ we can solve $x_{1}, x_{2}, x_{3}$ as follows:

$[\begin{matrix} ω_{1} & ω_{2} & ω_{3} \end{matrix}] [\begin{matrix} x_{1} \\ x_{2} \\ x_{3} \end{matrix}] = y - β$

The problem is solvable if $ω$ is not a zero vector, ensuring at least one weight contributes to the equation.

y = 10.0  
beta = 1.0
omega = [2, -1, 0.5]
rhs = y - beta

# Solve using least squares
x_vec = np.linalg.lstsq(np.array([omega]), [rhs], rcond=None)[0]
print(f"Solution for x: {x_vec}")

Exercise 1.4

Level: * (Easy)

Exercise Types: Novel

Question

Thinking about feedforward model with sigmoid activation, compute the output of a single-layer neural network with 3 inputs and 1 output.

Assuming:

- Input vector: $x = (0.1, 0.4, 0.6)$

- weights: $w = (0.2, 0.3, 0.5)$

- Bias: $b = 0.1$

- a). Sigmoid activation function: $f (z) = \frac{1}{1 + e^{- z}}$

- b). ReLU activation function: $f (z) = max (0, z)$

- c). Tanh activation function: $f (z) = \tanh (z) = \frac{e^{z} - e^{- z}}{e^{z} + e^{- z}}$

Solution

1. Compute the weighted sum: $z = w \cdot x + b = (0.2) (0.1) + (0.3) (0.4) + (0.5) (0.6) + 0.1$

Breaking this down step-by-step: $z = 0.02 + 0.12 + 0.3 + 0.1 = 0.54$

2. a). Apply the sigmoid activation function: $f (z) = \frac{1}{1 + e^{- z}}$

Substituting $z = 0.54$ : $f (z) = \frac{1}{1 + e^{- 0.54}} \approx \frac{1}{1 + 0.582} \approx 0.632$

Thus, the final output is 0.632.

b). Similarly, apply the ReLU activation function: $f (z) = max (0, z)$

Substituting $z = 0.54$ : $f (z) = max (0, 0.54) = 0.54$

c). Finally, apply the Tanh activation function: $f (z) = \frac{e^{z} - e^{- z}}{e^{z} + e^{- z}}$

Substituting $z = 0.54$ : $f (z) = \frac{e^{0.54} - e^{- 0.54}}{e^{0.54} + e^{- 0.54}} \approx \frac{1.716 - 0.583}{1.716 + 0.583} \approx \frac{1.133}{2.299} \approx 0.493$

Exercise 1.5

Level: * (Easy)

Exercise Types: Novel

Question

1.2012: ________'s ImageNet victory brings mainstream attention.

2.2016: Google's ________ uses deep learning to defeat a Go world champion.

3.2017: ________ architecture revolutionizes Natural Language Processing.

Solution

1. AlexNet

2. AlphaGo

3. Transformer

Key Milestones in Deep Learning

•2006: Deep Belief Networks – The modern era of deep learning begins.

•2012: AlexNet's ImageNet victory brings mainstream attention.

•2014-2015: Introduction of Generative Adversarial Networks (GANs).

•2016: Google's AlphaGo uses deep learning to defeat a Go world champion.

•2017: Transformer architecture revolutionizes Natural Language Processing.

•2018-2019: BERT and GPT-2 set new benchmarks in NLP.

•2020: GPT-3 demonstrates advanced language understanding and generation.

•2021: AlphaFold 2 achieves breakthroughs in protein structure prediction.

•2021-2022: Diffusion Models (e.g., DALL-E 2, Stable Diffusion) achieve state-of-the-art in image and video generation.

•2022: ChatGPT popularizes conversational AI and large language models (LLMs).

Exercise 1.6

Level: * (Easy)

Exercise Type: Novel

Question

a) What are some common examples of first-order search strategies in neural network optimization, and why are first-order methods generally preferred over second-order methods?

b) What is the difference between a deep neural network and a shallow neural network, and how many hidden layers does each typically have?

c) Prove that a perceptron cannot converge for the XOR problem.

Solution

a)

Common examples of first-order search strategies in neural network optimization include Gradient Descent (GD), Stochastic Gradient Descent (SGD), Momentum, and Adam. These methods rely on gradients (first derivatives) of the loss function to update model parameters, making them computationally efficient and scalable. First-order methods are preferred due to their efficiency, scalability to large datasets, and lower memory requirements compared to second-order methods. While second-order methods can converge faster, first-order methods like Adam balance performance and resource usage well, especially in large-scale networks.

b)

A deep neural network typically has more than 2 hidden layers, allowing it to learn complex, abstract features at each layer. A shallow neural network usually has 1 or 2 hidden layers. Therefore, networks with more than 2 hidden layers are considered deep, while those with fewer layers are considered shallow.

c)

Step 1: XOR Dataset

The XOR problem has the following data points and labels:

XOR Data
x₁	x₂	y
0	0	0
0	1	1
1	0	1
1	1	0

Step 2: Perceptron Decision Boundary

The perceptron decision boundary is defined as:

$z = w ₁ \cdot x ₁ + w ₂ \cdot x ₂ + b$

A point is classified as:

y = 1 if z > 0
y = 0 if z < 0

For the XOR dataset, we derive inequalities for each data point.

Step 3: Derive Inequalities

1. For (x₁, x₂) = (0, 0), y = 0:

  b < 0

2. For (x₁, x₂) = (0, 1), y = 1:

  w₂ + b > 0

3. For (x₁, x₂) = (1, 0), y = 1:

  w₁ + b > 0

4. For (x₁, x₂) = (1, 1), y = 0:

  w₁ + w₂ + b < 0

Step 4: Attempt to Solve

From the inequalities:

1. $b < 0$

2. $w ₂ + b > 0 \Rightarrow w ₂ > - b$

3. $w ₁ + b > 0 \Rightarrow w ₁ > - b$

4. $w ₁ + w ₂ + b < 0 \Rightarrow w ₁ + w ₂ < - b$

Now, add inequalities (2) and (3):

$w ₁ + w ₂ > - 2 b$

But compare this with inequality (4):

$w ₁ + w ₂ < - b$

This leads to a contradiction because $- 2 b < - b$ cannot be true if $b < 0$ .

Therefore, the XOR dataset is not linearly separable, and the perceptron cannot converge for the XOR problem.

Exercise 1.7

Level: * (Easy)

Exercise Type: Novel

Question

The sigmoid activation function is defined as: $σ (x) = \frac{1}{1 + e^{- x}} .$

(a) Derive the derivative of $σ (x)$ with respect to $x$ , and show that: $σ^{'} (x) = σ (x) (1 - σ (x)) .$

(b) Use this property to explain why sigmoid activation is suitable for modeling probabilities in binary classification tasks.

Solution

(a) Derivative: Starting with $σ (x) = \frac{1}{1 + e^{- x}}$ , we compute: $σ^{'} (x) = \frac{d}{d x} (\frac{1}{1 + e^{- x}}) = \frac{e^{- x}}{(1 + e^{- x})^{2}} .$

By noting that $σ (x) = \frac{1}{1 + e^{- x}}$ and $1 - σ (x) = \frac{e^{- x}}{1 + e^{- x}},$ we simplify to: $σ^{'} (x) = σ (x) (1 - σ (x)) .$

(b) Why sigmoid for probabilities: The sigmoid function maps any real $x$ into $(0, 1)$ , which aligns with the range of valid probabilities in binary classification. Moreover, its derivative $σ^{'} (x) = σ (x) (1 - σ (x))$ makes gradient-based optimization naturally scale updates based on “confidence.” When $σ (x)$ is near 0 or 1, the gradient becomes small, preventing large adjustments once the model is fairly certain in its prediction.

A closely related function is the softmax, which generalizes the same probabilistic interpretation to multi-class settings. For two classes, softmax is essentially the same as the sigmoid function, so it can also be suitable for binary classification problems.

Exercise 1.8

Level: * (Easy)

Exercise Types: Novel

Question

In classification, it is possible to minimize the number of misclassifications directly by using:

$\sum_{i = 1}^{n} 1 (sign (β^{T} x_{i} + β_{0}) \neq y_{i})$

where $1 (\cdot)$ is the indicator function, $β$ is the weight vector, and $β_{0}$ is the bias term. So, the loss function gives 1 for each incorrect response and 0 for each correct one.

(a) Why is this approach not commonly used in practice?

(b) Name and give formulas for two differentiable loss functions commonly employed in practice for binary classification tasks, explaining why they are more popular.

Solution

(a): The expression $1 (sign (β^{T} x_{i} + β_{0}) \neq y_{i})$ gives only 0 or 1. Small changes in $β$ or $β_{0}$ can suddenly change the loss function for a sample from 0 to 1 (or vice versa). Because the loss function is discrete, the gradient with respect to $β$ or $β_{0}$ does not exist. Standard optimization techniques like gradient descent rely on differentiable, continuous loss function where partial derivatives can use to update the parameters.

(b): Two alternative loss functions:

Hinge Loss: $\sum_{i = 1}^{n} max (0, 1 - y_{i} (β^{T} x_{i} + β_{0}))$

The hinge loss is often used in Support Vector Machines (SVMs) and works well when the data is linearly separable.

Logistic (Cross-Entropy) Loss: $\sum_{i = 1}^{n} \log (1 + \exp (- y_{i} (β^{T} x_{i} + β_{0})))$

The logistic loss (or cross-entropy loss) is commonly used in logistic regression and neural networks. It is differentiable so it works well for gradient-based optimization methods.

Exercise 1.9

Level: ** (Easy)

Exercise Types: Novel

Question

How are neural networks modeled? Using an example to explain it clearly.

Solution

Neural networks are modeled from biological neurons. A neural network consists of layers of interconnected neurons where each connection has associated weights. The input layer receives the data features, and each neuron corresponds to one feature from the dataset. The hidden layer consists of multiple neurons that transform the input data into intermediate representations, using a combination of weights, biases, and activation functions which allows the network to learn complex patterns, like the Sigmoid function $S (x) = \frac{1}{1 + e^{- x}}$ . The output layer generates the final prediction, such as probabilities for classification or continuous values for regression. Neural networks learn to map inputs to outputs by adjusting weights during training to minimize the error between predicted and actual outputs.

For example, in the lecture note, inputs $x_{1} = 0.5, x_{2} = 0.9, x_{3} = - 0.3$ are passed through a hidden layer with specific weights:

$H_{1} w e i g h t = (1.0, - 2.0, 2.0)$ ,

$H_{2} w e i g h t = (2.0, 1.0 - 4.0)$ ,

$H_{3} w e i g h t = (1.0, - 1.0, 0.0)$ .

The computations yield hidden neuron values of

$H_{1} = 0.5 \times 1.0 + 0.9 \times - 2.0 + - 0.3 \times 2.0 = 0.13$ ,

$H_{2} = 0.5 \times 2.0 + 0.9 \times 1.0 + - 0.3 \times - 4.0 = 0.96$ ,

$H_{3} = 0.5 \times 1.0 + 0.9 \times - 1.0 + - 0.3 \times 0.0 = 0.40$ ,

which are then processed by the output layer to produce the final predictions.

This process demonstrates how neural networks learn and transform input data step by step.

Exercise 1.10

Level: ** (Moderate)

Exercise Types: Novel

Question

Biological neurons in the human brain have the following characteristics:

1. A neuron fires an electrical signal only when its membrane potential exceeds a certain threshold. Otherwise, it remains inactive.

2. Neurons are connected to one another through dendrites (input) and axons (outputs), forming a highly interconnected network.

3. The intensity of the signal passed between neurons depends on the strength of the connection, which can change over time due to learning and adaptation.

Considering the above points, answer the following questions:

Explain how these biological properties of neurons might inspire the design and functionality of nodes in artificial neural networks.

Solution

1. Threshold Behavior: The concept of a neuron firing only when its membrane potential exceeds a threshold is mirrored in neural networks through activation functions. These functions decide whether a node "fires" by producing a significant output.

2. Connectivity: The connections between biological neurons via dendrites and axons inspire the weighted connections in artificial neural networks. Each node receives inputs, processes them, and sends weighted outputs to subsequent node, similar how to signals propagate in the brain.

3. Learning and Adaptation: Biological neurons strengthen or weaken their connections based on experience (neuroplasticity). This is similar to how artificial networks adjust weights during training using backpropagation and optimization algorithms. The dynamic modification of weights allows artificial networks to learn from data.

Exercise 1.11

Level: * (Easy)

Exercise Type: Novel

Question

If the pre-activation is 20, what are the outputs of the following activation functions: ReLU, Leaky ReLU, logistic, and hyperbolic?

Choose the correct answer:
a) 20, 20, 1, 1
b) 20, 0, 1, 1
c) 20, -20, 1, 1
d) 20, 20, -1, 1
e) 20, -20, 1, -1

Solution

The correct answer is a): 20, 20, 1, 1.

Calculation

$ReLU (20) = max (0, 20) = 20$

$LeakyReLU (20) = {\begin{cases} 20 & if 20 \geq 0 \\ α \cdot 20 & if 20 < 0 \end{cases} = 20$ where $α$ is a small constant (typically $0.01$ ).

$σ (20) = \frac{1}{1 + e^{- 20}} \approx 1$

$\tanh (20) = \frac{e^{20} - e^{- 20}}{e^{20} + e^{- 20}} = 1$

Exercise 1.12

Level: * (Easy)

Exercise Type: Novel

Question

Imagine a simple feedforward neural network with a single hidden layer. The network structure is as follows: - linear activation function - The input layer has 2 neurons. - The hidden layer has 2 neurons. - The output layer has 1 neuron. - There are no biases in the network.

If the weights from the input layer to the hidden layer are given by: $W^{(1)} = [\begin{matrix} 0.5 & - 0.6 \\ 0.1 & 0.8 \end{matrix}]$ and the weights from the hidden layer to the output layer are given by: $W^{(2)} = [\begin{matrix} 0.3 \\ - 0.2 \end{matrix}]$

Calculate the output of the network for the input vector $x = [\begin{matrix} 1 \\ 0 \end{matrix}]$ using a linear activation function for all neurons.

Hint

- The output of each layer is calculated by multiplying the input of that layer by the layer's weight matrix. - Use matrix multiplication to compute the outputs step-by-step.

Solution

- Step 1: Calculate Hidden Layer Output**

The input to the hidden layer is the initial input $x$ : $h^{(1)} = W^{(1)} \times x = [\begin{matrix} 0.5 & - 0.6 \\ 0.1 & 0.8 \end{matrix}] [\begin{matrix} 1 \\ 0 \end{matrix}] = [\begin{matrix} 0.5 \\ 0.1 \end{matrix}]$

- Step 2: Calculate Output Layer Output**

The input to the output layer is the output from the hidden layer: $y = W^{(2)} \times h^{(1)} = [\begin{matrix} 0.3 \\ - 0.2 \end{matrix}] \times [\begin{matrix} 0.5 \\ 0.1 \end{matrix}] = 0.3 \times 0.5 + (- 0.2) \times 0.1 = 0.15 - 0.02 = 0.13$

Thus, the output of the network for the input vector $x = [\begin{matrix} 1 \\ 0 \end{matrix}]$ is $0.13$ .

Exercise 1.13

Level: * (Easy)

Exercise Types: Novel

Question

Explain whether this is a classification, regression, or clustering task each time. If the task is either classification or regression, also comment on whether the focus is prediction or explanation.

1. **Stock Market Trends:**

  A financial analyst wants to predict the future stock prices of a company based on historical trends, economic indicators, and company performance metrics.

2. **Customer Segmentation:**

  A retail company wants to group its customers based on their purchasing behaviour, including transaction frequency, product categories, and total spending, to design targeted marketing campaigns.

3. **Medical Diagnosis:**

  A hospital wants to develop a model to determine whether a patient has a specific disease based on symptoms, medical history, and lab test results.

4. **Predicting Car Fuel Efficiency:**

  An automotive researcher wants to understand how engine size, weight, and aerodynamics affect a car's fuel efficiency (miles per gallon).

Solution

1. Stock Market Trends

- Task Type:** Regression
- Focus:** Prediction
- Reasoning:** Stock prices are continuous numerical values, making this a regression task. The goal is to predict future prices rather than explain past fluctuations.

2. Customer Segmentation

- Task Type:** Clustering
- Focus:** —
- Reasoning:** Customers are grouped based on their purchasing behaviour without predefined labels, making this a clustering task.

3. Medical Diagnosis

- Task Type:** Classification
- Focus:** Prediction
- Reasoning:** The disease status is a categorical outcome (Has disease: Yes/No), making this a classification problem. The goal is to predict a diagnosis for future patients.

4. Predicting Car Fuel Efficiency

- Task Type:** Regression
- Focus:** Explanation
- Reasoning:** Fuel efficiency (miles per gallon) is a continuous variable. The researcher is interested in understanding how different factors influence efficiency, so the focus is on explanation.

Summary

Task	Type	Focus	Reasoning
Stock Market Trends	Regression	Prediction	Predict future stock prices (continuous variable).
Customer Segmentation	Clustering	—	Group customers based on purchasing behaviour.
Medical Diagnosis	Classification	Prediction	Determine if a patient has a disease (Yes/No).
Predicting Car Fuel Efficiency	Regression	Explanation	Understand how factors affect fuel efficiency.

Exercise 1.14

Level: ** (Easy)

Exercise Types: Novel

Question

You are given a set of real-world scenarios. Your task is to identify the most suitable fundamental machine learning approach for each scenario and justify your choice.

- Scenarios:**

1. **Loan Default Prediction:**

  A bank wants to predict whether a loan applicant will default on their loan based on their credit history, income, and employment status.

2. **House Price Estimation:**

  A real estate company wants to estimate the price of a house based on features such as location, size, and number of bedrooms.

3. **User Grouping for Advertising:**

  A social media platform wants to group users with similar interests and online behavior for targeted advertising.

4. **Dimensionality Reduction in Medical Data:**

  A medical researcher wants to reduce the number of variables in a dataset containing hundreds of patient health indicators while retaining the most important information.

- Tasks:**

- For each scenario, classify the problem into one of the four fundamental categories: Classification, Regression, Clustering, or Dimensionality Reduction. - Explain why you selected that category for each scenario. - Suggest a possible algorithm that could be used to solve each problem.

Solution

- 1. Loan Default Prediction**
- Task Type:** Classification
- Reasoning:** The target variable (loan default) is categorical (Yes/No), making this a classification problem. The goal is to predict whether an applicant will default based on their financial history.
- Possible Algorithm:** Logistic Regression, Random Forest, or Gradient Boosting.

- 2. House Price Estimation**
- Task Type:** Regression
- Reasoning:** House prices are continuous numerical values, making this a regression task. The goal is to estimate a house's price based on features like location and size.
- Possible Algorithm:** Linear Regression, Decision Trees, or XGBoost.

- 3. User Grouping for Advertising**
- Task Type:** Clustering
- Reasoning:** The goal is to group users based on their behavior without predefined labels, making this a clustering task.
- Possible Algorithm:** K-Means, DBSCAN, or Hierarchical Clustering.

- 4. Dimensionality Reduction in Medical Data**
- Task Type:** Dimensionality Reduction
- Reasoning:** The goal is to reduce the number of variables while preserving essential information, making this a dimensionality reduction task.
- Possible Algorithm:** Principal Component Analysis (PCA), t-SNE, or Autoencoders.

Exercise 1.15

Level: ** (Easy)

Exercise Types: Novel

Question

Define what machine learning is and how it is different from classical statistics. Provide the three learning methods used in machine learning, briefly define each and give an example of where each of them can be used. Include some common algorithms for each of the learning methods.

Solution

- Machine learning Definition**

– Machine Learning is the ability to teach a computer without explicitly programming it

– Examples are used to train computers to perform tasks that would be difficult to program

The difference between classical statistics and machine learning is the size of the data that they infer information from. In classical statistics, this is usually done from a small dataset(not enough data) while in machine learning it is done from a large dataset(Too many data).

- Supervised learning**

Supervised learning is a type of machine learning where the model is trained on a labeled dataset, meaning each training example has input features and a corresponding correct output. The algorithm learns the relationship between inputs and outputs to make predictions on new, unseen data.

Examples: Predicting house prices based on location, size, and other features (Regression). Identifying whether an email is spam or not (Classification).

Common Algorithms: Linear Regression, Logistic Regression, Decision Trees, Random Forest, Support Vector Machines (SVM), Neural Networks.

- Unsupervised Learning**

Unsupervised learning involves training a model on data without labeled outputs. The algorithm attempts to discover patterns, structures, or relationships within the data.

Examples: Grouping customers with similar purchasing behaviors for targeted marketing (Clustering). Identifying important features in a high-dimensional dataset (Dimensionality Reduction).

Common Algorithms: K-Means, Hierarchical Clustering, DBSCAN (Clustering). Principal Component Analysis (PCA), t-SNE, Autoencoders (Dimensionality Reduction).

- Reinforcement Learning**

Reinforcement learning (RL) is a type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize cumulative rewards. The agent interacts with the environment, receives feedback in the form of rewards or penalties, and improves its strategy over time.

Examples: Training a robot to walk by rewarding successful movements. Teaching an AI to play chess or video games by rewarding wins and penalizing losses.

Common Algorithms: Q-Learning, Deep Q Networks (DQN), Policy Gradient Methods, Proximal Policy Optimization (PPO).

Summary

Aspect	Supervised Learning	Unsupervised Learning	Reinforcement Learning
Definition	Learning from labeled data where inputs are paired with outputs.	Learning patterns or structures from unlabeled data.	Learning by interacting with an environment to maximize cumulative rewards.
Key Characteristics	Trains on known inputs and outputs to predict outcomes for unseen data.	No predefined labels; discovers hidden structures in the data.	Agent learns through trial and error by receiving rewards or penalties for its actions.
Examples	- Predicting house prices (Regression). - Classifying emails as spam or not (Classification).	- Grouping customers by behavior (Clustering). - Reducing variables in large datasets (Dimensionality Reduction).	- Training robots to walk. - Teaching AI to play chess or video games.
Common Algorithms	- Linear Regression - Logistic Regression - Decision Trees - Random Forest - SVM - Neural Networks	- K-Means - Hierarchical Clustering - PCA - t-SNE - Autoencoders	- Q-Learning - Deep Q Networks (DQN) - Policy Gradient Methods - Proximal Policy Optimization (PPO)

Exercise 1.16

Level: * (Easy)

Exercise Types: Novel

Question

Categorize each of these machine learning scenarios into supervised learning, unsupervised learning, or reinforcement learning. Justify your reasoning for each case.

(a) A neural network is trained to classify handwritten digits using the MNIST dataset, which contains 60 000 images of handwritten digits, along with the correct answer for each image.

(b) A robot is programmed to learn how to play a video game. It does not have access to the game’s rules, but it can observe its current score after each action. Over time, it learns to play better by maximizing its score.

(c) A deep learning model is designed to segment medical images into different sections corresponding to specific organs. The training data consists of medical scans that have been annotated by experts to mark the boundaries of the organs.

(d) A machine learning model is given 100 000 astronomical images of unknown stars and galaxies. Using dimensionality reduction techniques, it groups similar-looking objects based on their features, such as size and shape.

Solution

(a) Supervised learning: The model is trained with labeled data, where each image has a corresponding digit label.

(b) Reinforcement learning: The model learns by interacting with an environment and receiving feedback in the form of rewards or penalties. It explores different actions to maximize cumulative rewards over time.

(c) Supervised learning: The model uses labeled data where professionals annotated each region of the image.

(d) Unsupervised learning: The model works with unlabeled data to find patterns and group similar objects.

Exercise 1.17

Level: * (Easy)

Exercise Types: Novel

Question

How does the introduction of ReLU as an activation function address the vanishing gradient problem observed in early deep learning models using sigmoid or tanh functions?

Solution

The vanishing gradient problem occurs when activation functions like sigmoid or tanh compress their inputs into small ranges, resulting in gradients that become very small during backpropagation. This hinders learning, particularly in deeper networks.

The ReLU (Rectified Linear Unit), defined as $f (x) = max (0, x)$ , addresses this issue effectively:

(a) Non-Saturating Gradients: For positive input values, ReLU's gradient remains constant (equal to 1), preventing gradients from vanishing.

(b) Efficient Computation: The simplicity of the ReLU function makes it computationally faster than the sigmoid or tanh functions, which involve more complex exponential calculations.

(c) Sparse Activations: ReLU outputs zero for negative inputs, leading to sparse activations, which can improve computational efficiency and reduce overfitting.

However, ReLU can experience the "dying ReLU" problem, where neurons output zero for all inputs and effectively become inactive. Variants like Leaky ReLU and Parametric ReLU address this by allowing small, non-zero gradients for negative inputs, ensuring neurons remain active.

Exercise 1.18

Level: * (Easy)

Exercise Types: Novel

Question

What is the general concept of text generation in deep learning, and how does it work?

Solution

Text generation in deep learning refers to the process of automatically creating coherent and contextually relevant text based on input data or a learned language model. The goal is to produce text that mimics human-written content, maintaining grammatical structure, logical flow, and contextual relevance.

There are five steps.

1. Training on a Language Corpus: A deep learning model, such as a Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), or Transformer, is trained on a large dataset of text. During training, the model learns patterns, relationships between words, and context within sentences and across paragraphs.

2. Tokenization and Embeddings: Input text is broken into smaller units, such as words or subwords (tokens). These tokens are converted into numerical vectors (embeddings) that capture semantic and syntactic relationships.

3. The model predicts the probability of the next word or token in a sequence based on the context provided by the preceding words. It uses conditional probability, such as: $P (w_{t} ∣ w_{1}, w_{2}, . . ., w_{t - 1})$ to determine the likelihood of the next token.

4. Once the model generates probabilities for the next token, decoding strategies are used to construct text.

5. Generated text is evaluated for coherence, fluency, and relevance. Techniques such as fine-tuning on specific domains or datasets improve the model's performance for targeted applications.

Exercise 1.19

Level: * (Easy)

Exercise Types: Novel

Question

Supervised learning and unsupervised learning are two of the main types of machine learning, and they differ mainly in how the models are trained and the type of data used. Briefly state their differences.

Solution

Supervised Learning:

Data: Requires labeled data.

Goal: The model learns a mapping from inputs to the correct output.

Example Tasks: Classification and regression.

Training Process: The model is provided with both input data and corresponding labels during training, allowing it to learn from these examples to make predictions on new, unseen data.

Common Algorithms: Linear regression, decision trees, random forests, support vector machines, and neural networks.

Unsupervised Learning:

Data: Does not require labeled data.

Goal: The model tries to find hidden patterns or structure in the data.

Example Tasks: Clustering and dimensionality reduction.

Training Process: The model analyzes the input data without being told the correct answer, and it organizes or structures the data in meaningful ways.

Common Algorithms: K-means clustering, hierarchical clustering, principal component analysis (PCA), and autoencoders.

Exercise 1.20

Level: * (Easy)

Exercise Types: Novel

Question

It was mentioned in lecture that the step function had previously been used as an activation function, but we now commonly use the sigmoid function as an activation function. Highlight the key differences between these functions.

Solution

- The step function takes a single real numbered value and outputs 0 if the number is negative and 1 if the number is 0 or positive

- The sigmoid activation function is an s shaped curve with the output spanning between 0 and 1 (not inclusive)

- The equation for the sigmoid function is $f (x) = \frac{1}{1 + e^{- x}}$

- The step activation function only produces two values as the output, 0 or 1, whereas the sigmoid activation function produces a continuous range of values between 0 and 1

- The smoothness of the sigmoid activation function makes it more suitable for gradient based learning in neural networks, allowing for more efficient back propagation

Exercise 1.21

Level: * (Easy)

Exercise Types: Novel

Question

Consider a linear regression model where we aim to estimate the weight vector $w$ by minimizing the Residual Sum of Squares (RSS), defined as:

$RSS (w) = \frac{1}{2} \sum_{n = 1}^{N} (y_{n} - w^{T} x_{n})^{2} = \frac{1}{2} ‖ X w - y ‖_{2}^{2} = \frac{1}{2} (X w - y)^{T} (X w - y) .$

Compute the gradient: Derive the gradient of $RSS (w)$ with respect to $w$ .
Find the optimal $w$ : Solve for $w$ by setting the gradient to zero.
Interpretation: What is the significance of the solution you obtained in terms of ordinary least squares (OLS)?

Provide your answers with clear derivations and explanations.

Solution

To find the optimal weight vector $w$ , we first compute the gradient of the Residual Sum of Squares (RSS):

$\nabla_{w} RSS (w) = X^{T} X w - X^{T} y .$

Setting the gradient to zero and solving for $w$ gives:

$X^{T} X w = X^{T} y .$

These are known as the normal equations, since at the optimal solution, $y - X w$ is orthogonal to the range of $X$ .

The corresponding solution $\hat{w}$ is the ordinary least squares (OLS) solution, given by:

$\hat{w} = (X^{T} X)^{- 1} X^{T} y .$

The matrix $(X^{T} X)^{- 1} X^{T}$ is known as the (left) pseudo-inverse of $X$ , which generalizes matrix inversion for non-square matrices.

To ensure the solution is unique, we examine the Hessian matrix:

$H (w) = \frac{\partial^{2}}{\partial w^{2}} RSS (w) = X^{T} X .$

If $X$ has full column rank (i.e., its columns are linearly independent), then $H$ is positive definite, as shown by:

$v^{T} (X^{T} X) v = (X v)^{T} (X v) = ‖ X v ‖^{2} > 0, for any nonzero vector v .$

Since the Hessian is positive definite in this case, the least squares objective has a unique global minimum.

Exercise 1.22

Level: * (Easy)

Exercise Types: Novel

Question

Which of the following best highlights the key difference between Machine Learning and Deep Learning?

A. Machine Learning is only suitable for small datasets, while Deep Learning can handle datasets of any size.

B. Machine Learning is restricted to regression and classification, whereas Deep Learning is used for image and text processing.

C. Deep Learning can model without any data, while Machine Learning requires large datasets.

D. Machine Learning relies on manually extracted features, while Deep Learning can automatically learn feature representations.

Solution

Answer: D;

Explanation: Machine Learning algorithms often require manual feature engineering, whereas Deep Learning can automatically extract features from data through multi-layered neural networks. This is a significant distinction between the two.

Machine Learning techniques include decision trees, svms, xgboosting, etc. These techniques typically work better on smaller datasets due to simpler structure. They often struggle to match the flexibility and scalability of deep neural networks as the data becomes more complex. Deep neural networks work well on large datasets consisting of data with a more complex structure, such as images or sentences (LLM). Often times, more data is needed for deep neural networks in order for them to learn effectively and avoid overfitting. Their complex architecture of deep neural networks allow them to learn feature representations from raw data, without the need for manual feature engineering.

Exercise 1.23

Level: * (Easy)

Exercise Types: Novel

Question

Pros and Cons of supervised learning and unsupervised learning?

Solution

Supervised learning is to learn from labelled data. The benefits of supervised learning include its clear objective and direct evaluation through performance metrics such as MSE to compare model predictions with clear labels. The consequences of supervised learning encompass the time-consuming nature to obtain large labelled dataset and the risk of overfitting. Unsupervised learning needs pattern or data structure detection. The benefits of unsupervised learning include opportunities for data preprocessing. Thus, dimension reduction techniques can be used to simplify the data structure. Nevertheless, we can't directly control or interpret the results as good as the supervised learning does.

Exercise 1.24

Level: ** (Easy)

Exercise Types: Novel

Question

Consider the dataset: ${(x_{1}, y_{1}), (x_{2}, y_{2}), (x_{3}, y_{3})} = {([1, 2], 1), ([2, 3], 1), ([4, 5], - 1)},$ and a linear decision boundary defined as: $w_{1} x_{1} + w_{2} x_{2} + b = 0,$ where the classifier predicts $y = 1$ if $f (x) > 0$ , and $y = - 1$ if $f (x) \leq 0$ .

Given the weights and bias: $w_{1} = 1, w_{2} = - 1, b = 0,$ determine whether all points in the dataset are correctly classified.

Solution

The decision function is: $f (x) = w_{1} x_{1} + w_{2} x_{2} + b .$ Substituting $w_{1} = 1$ , $w_{2} = - 1$ , and $b = 0$ , we evaluate $f (x)$ for each point in the dataset.

For $x_{1} = [1, 2]$ : $f (x_{1}) = (1) (1) + (- 1) (2) + 0 = - 1 \Rightarrow y = - 1 (incorrect, since y_{1} = 1) .$

For $x_{2} = [2, 3]$ : $f (x_{2}) = (1) (2) + (- 1) (3) + 0 = - 1 \Rightarrow y = - 1 (incorrect, since y_{2} = 1) .$

For $x_{3} = [4, 5]$ : $f (x_{3}) = (1) (4) + (- 1) (5) + 0 = - 1 \Rightarrow y = - 1 (correct, since y_{3} = - 1) .$

Exercise 2.1

Level: * (Easy)

Exercise Types: Novel

References: Calin, Ovidiu. Deep learning architectures: A mathematical approach. Springer, 2020

This problem is coincidentally similar to Exercise 5.10.1 (page 163) in this textbook, although that exercise was not used as the basis for this question.

Question

This problem is about using perceptrons to implement logic functions. Assume a dataset of the form $x_{1}, x_{2} \in {0, 1}$ , and a perceptron defined as: $y = H (β_{0} + β_{1} x_{1} + β_{2} x_{2}),$ where $H$ is the Heaviside step function, defined as: $H (z) = {\begin{cases} 1, & if z \geq 0, \\ 0, & if z < 0. \end{cases}$

(a)* Find weights $β_{1}, β_{2}$ and bias $β_{0}$ for a single perceptron that implements the AND function.

(b)* Find the weights $β_{1}, β_{2}$ and bias $β_{0}$ for a single perceptron that implements the OR function.

(c)** Given the truth table for the XOR function:

$\begin{array}{ccc} x_{1} & x_{2} & x_{1} \oplus x_{2} \\ 0 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{array}$

Show that it cannot be learned by a single perceptron. Find a small neural network of multiple perceptrons that can implement the XOR function. (Hint: a hidden layer with 2 perceptrons).

Solution

(a) A perceptron that implements the AND function:

$y = H (- 1.5 + x_{1} + x_{2}) .$

Here:

$β_{0} = - 1.5, β_{1} = 1, β_{2} = 1.$ . This works because the AND function returns 1 only when both inputs are 1. For a perceptron, the condition for activation is: $β_{0} + β_{1} x_{1} + β_{2} x_{2} \geq 0.$ This must hold for (1, 1) but fail for all other combinations. Substituting values leads to the choice of $β_{0} = - 1.5$ and $β_{1} = β_{2} = 1$ .

(b) A perceptron that implements the OR function:

$y = H (- 0.5 + x_{1} + x_{2}) .$

Here:

$β_{0} = - 0.5, β_{1} = 1, β_{2} = 1.$ This works because the OR function returns 1 if either or both inputs are 1. Using similar logic to the AND case, the decision boundary conditions lead to these parameters.

Perceptrons that implement the AND and OR functions respectively.

(c) XOR is not linearly separable, so it cannot be implemented by a single perceptron.

The XOR function returns 1 when the following are true:

Either $x_{1}$ or $x_{2}$ are 1. In other words, the expression $x_{1}$ OR $x_{2}$ returns 1.
$x_{1}$ and $x_{2}$ are not both 1. In other words, the expression $x_{1}$ NAND $x_{2}$ returns 1.

To implement this, the outputs of an OR and a NAND perceptron can be taken as inputs to an AND perceptron. (The NAND perceptron was derived by multiplying the weights and bias of the AND perceptron by -1.)

Why can't the perceptron converge in the case of linear non-separability?

In linearly separable data, there exists a weight vector $w$ and bias $b$ such that:

$y_{i} (w \cdot x_{i} + b) > 0 \forall i$

But in the case of linear non-separability, w and b satisfying this condition do not exist, so the perceptron cannot satisfy the convergence condition.

Exercise 2.2

Level: * (Easy)

Exercise Types: Novel

Question

1.How do feedforward neural networks utilize backpropagation to adjust weights and improve the accuracy of predictions during training?

2. How would the training process be affected if the learning rate in optimization algorithm were too high or too low?

Solution

1. After a forward pass where inputs are processes to generate an output, the error between the prediction and actual values is calculated. This error is then propagated backward through the network, and the gradients of the loss function with respect to the weights are computed. Using these gradients, the weights are updated with an optimization algorithm like stochastic gradient descent, gradually minimizing the error and improving the networks' performance.

2. If the learning rate is too high, the weights might overshoot the optimal values, leading to oscillations or divergence. If it's too low, the training process might become very slow and stuck in local minimum.

Calculations

Step 1: Forward Propagation

Each neuron computes:

$z = W \cdot x + b$

$a = f (z)$

where:

W = weights, b = bias
f(z) = activation function (e.g., sigmoid, ReLU)
a = neuron’s output

Step 2: Compute Loss

The error between predicted $\hat{y}$ and actual $y$ is calculated using a loss function, such as **Mean Squared Error (MSE)**:

$L = \frac{1}{n} \sum (y - \hat{y})^{2}$

For classification, **Cross-Entropy Loss** is commonly used.

Step 3: Backward Propagation

Using the **chain rule**, gradients are computed:

$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial W}$

These gradients guide weight updates to minimize loss.

Step 4: Weight Update using Gradient Descent

Weights are updated using:

$W = W - α \frac{\partial L}{\partial W}$

where $α$ is the **learning rate**.

Exercise 2.3

Level: * (Easy)

Exercise Types: Modified

References: Simon J.D. Prince. Understanding Deep learning. 2024

This problem comes from Problem 3.5 in this textbook. In addition to the proof, I explained why this property is important to learning neural networks.

Question

Prove that the following property holds for $α \in R^{+}$ :

$ReLU [α \cdot z] = α \cdot ReLU [z]$

Explain why this property is important in neural networks.

Solution

This is known as the non-negative homogeneity property of the ReLU function.

Recall the definition of the ReLU function:

$ReLU (z) = {\begin{cases} z & if z \geq 0, \\ 0 & if z < 0. \end{cases}$

We prove the property by considering the two possible cases for $z$ .

Case 1: $z \geq 0$

If $z \geq 0$ , then by the definition of the ReLU function:

$ReLU (z) = z$

Therefore:

$ReLU (α \cdot z) = α \cdot z$

and:

$α \cdot ReLU (z) = α \cdot z$

Hence, in this case:

$ReLU (α \cdot z) = α \cdot ReLU (z)$

Case 2: $z < 0$

If $z < 0$ , then $α \cdot z < 0$ .

Therefore:

$ReLU (α \cdot z) = ReLU (z) = 0$

and:

$α \cdot ReLU (z) = α \cdot 0 = 0$

Hence, in this case:

$ReLU (α \cdot z) = α \cdot ReLU (z)$

Since the property holds in both cases, this completes the proof.

Why is this property important in neural networks?

In a neural network, the input to a neuron is often a linear combination of the weights and inputs.

When training neural networks, scaling the inputs or weights can affect the activations of neurons. However, because ReLU satisfies the homogeneity property, the output of the ReLU function scales proportionally with the input. This means that scaling the inputs by a positive constant (like a learning rate or normalization factor) does not change the overall pattern of activations — it only scales them. This stability in scaling is important during optimization because it makes the network's output more predictable and ensures that scaling transformations don't break the network's functionality.

Additionally, because of the non-negative homogeneity property, the gradients also scale proportionally, the scale of the gradient changes proportionally with the input scale, which ensures that the optimization process remains stable. It helps prevent exploding gradients when the inputs are scaled by large positive values.

The homogeneity property of ReLU also helps the network to perform well on different types of data. By keeping the scaling of activations consistent, it helps maintain the connection between inputs and outputs during training, even when the data is adjusted or scaled. This makes ReLU useful when input values vary a lot, and it simplifies the network's response to changes in input distributions, which is especially valuable when transferring a trained model to new data or domains.

Exercise 2.4

Level: * (Easy)

Exercise Types: Novel

Question

Train a perceptron on the given dataset using the following initial settings, and ensure it classifies all data points correctly.

Initial weights: $w_{0} = 0, w_{1} = 0, w_{2} = 0$
Learning rate: $η = 0.1$
Training dataset:

  (x₁ = 1, x₂ = 2, y = 1)
  (x₁ = -1, x₂ = -1, y = -1)
  (x₁ = 2, x₂ = 1, y = 1)

$y = 1$ if the output $z = w_{1} \cdot x_{1} + w_{2} \cdot x_{2} + w_{0} \geq 0$ , otherwise $y = - 1$ .

Solution

Iteration 1

1. First data point (x₁ = 1, x₂ = 2) with label 1:

Weighted sum: $\hat{y} = w_{0} + w_{1} x_{1} + w_{2} x_{2} = 0 + 0 (1) + 0 (2) = 0$
Predicted label: $\hat{y} = 1$
Actual label: 1 → No misclassification

2. Second data point (x₁ = -1, x₂ = -1) with label -1:

Weighted sum: $\hat{y} = w_{0} + w_{1} x_{1} + w_{2} x_{2} = 0 + 0 (- 1) + 0 (- 1) = 0$
Predicted label: $\hat{y} = 1$
Actual label: -1 → Misclassified

3. Third data point (x₁ = 2, x₂ = 1) with label 1:

Weighted sum: $\hat{y} = w_{0} + w_{1} x_{1} + w_{2} x_{2} = 0 + 0 (2) + 0 (1) = 0$
Predicted label: $\hat{y} = 1$
Actual label: 1 → No misclassification

Update Weights (using the Perceptron rule with the cost as the distance of all misclassified points)

For the misclassified point (x₁ = -1, x₂ = -1):

Updated weights:
$w_{0} = w_{0} + η y = 0 + 0.1 (- 1) = - 0.1$
$w_{1} = w_{1} + η y x_{1} = 0 + 0.1 (- 1) (- 1) = 0.1$
$w_{2} = w_{2} + η y x_{2} = 0 + 0.1 (- 1) (- 1) = 0.1$

Updated weights after first iteration: $w_{0} = - 0.1, w_{1} = 0.1, w_{2} = 0.1$

Iteration 2

1. First data point (x₁ = 1, x₂ = 2) with label 1:

Weighted sum: $\hat{y} = - 0.1 + 0.1 (1) + 0.1 (2) = - 0.1 + 0.1 + 0.2 = 0.2$
Predicted label: $\hat{y} = 1$
Actual label: 1 → No misclassification

2. Second data point (x₁ = -1, x₂ = -1) with label -1:

Weighted sum: $\hat{y} = - 0.1 + 0.1 (- 1) + 0.1 (- 1) = - 0.1 - 0.1 - 0.1 = - 0.3$
Predicted label: $\hat{y} = - 1$
Actual label: -1 → No misclassification

3. Third data point (x₁ = 2, x₂ = 1) with label 1:

Weighted sum: $\hat{y} = - 0.1 + 0.1 (2) + 0.1 (1) = - 0.1 + 0.2 + 0.1 = 0.2$
Predicted label: $\hat{y} = 1$
Actual label: 1 → No misclassification

Since there are no misclassifications in the second iteration, the perceptron has converged!

Final Result

Weights after convergence: $w_{0} = - 0.1, w_{1} = 0.1, w_{2} = 0.1$
Total cost after convergence: $C o s t = 0$ , since no misclassified points.

Exercise 2.5

Level: * (Moderate)

Exercise Types: Novel

Question

Consider a Feed-Forward Neural Network (FFN) with one or more hidden layers. Answer the following questions:

(a) Describe how does the Feed-Forward Neural Network (FFN) work in general. Describe the component of the Network.

(b) How does the forward pass work ? Provide the relevant formulas for each step.

(c) How does the backward pass (backpropagation) ? Explain and provide the formulas for each step.

Solution

(a): A Feed-Forward Neural Network (FFN) consists of an input layer, one or more hidden layers, and one output layer. Each layer transforms the input data, with each neuron's output being fed to the next layer as input. Each neuron in a layer is a perceptron, which is a basic computational unit that contains weights, bias and an activation function. The perceptron computes a weighted sum of its inputs, adds a bias term, and passes the result through an activation function. In this structure, each layer transforms the data as it passes through, with each neuron's output being fed to the next layer. The final output is the network’s prediction, then the loss function use the prediction and the true label of the data to calculate the loss. The backward pass computes the gradients of the loss with respect to each weight and bias in the network, and then update the weights and biases to minimize the loss. This process is repeated for each sample (or mini-batch) of data until the loss converges and the weights are optimized.

(b): The forward pass involves computing the output for each layer in the network. For each layer, the algorithm performs the following steps:

1. Compute the weighted sum of inputs to the layer:

$z^{(l)} = W^{(l)} a^{(l - 1)} + b^{(l)}$

2. Use the activation function to calculate the output of the layer:

$\hat{y} = a^{(L)} = σ (z^{(L)})$

3. Repeat these steps for each layer, until getting to the output layer

(c): The backward pass (backpropagation) updates the gradients of the loss function with respect to each weight and bias, and then use the gradient descents to update the weights.

1. Calculate the errors for each layer:

Error at the output layer: The error term at the output layer has this formula:

$δ^{(L)} = \frac{\partial L}{\partial a^{(L)}} \cdot σ^{'} (z^{(L)})$

Error for the hidden layers: The error for each hidden layer is:

$δ^{(l)} = {(W^{(l + 1)})}^{T} δ^{(l + 1)} \cdot σ^{'} (z^{(l)})$

3. Gradient of the loss with respect to weights and biases. Compute the gradients for the weights and biases:

The gradient for weights is:

$\frac{\partial L}{\partial W^{(l)}} = a^{(l - 1)} \cdot (δ^{(l)})^{T}$

The gradient is for biases is:

$\frac{\partial L}{\partial b^{(l)}} = δ^{(l)}$

4. Update the weights and biases using gradient descent:

$W^{(l)} \leftarrow W^{(l)} - ρ \cdot \frac{\partial L}{\partial W^{(l)}}$

Where $ρ$ is the learning rate.

$b^{(l)} \leftarrow b^{(l)} - ρ \cdot \frac{\partial L}{\partial b^{(l)}}$

Repeat these steps for each layers from output layer to input layers to update all the weights and biases

Exercise 2.6

Level: * (Easy)

Exercise Types: Novel

Question

A single neuron takes an input vector $x = [2, - 3]$ , with weights $w = [0.4, - 0.6]$ . The target output is $y_{true} = 1$ .

1. Calculate the weighted sum $z = w \cdot x$ .

2. Compute the squared error loss: $L = 0.5 \cdot (z - y_{true})^{2}$

3. Find the gradient of the loss with respect to the weights $w$ and perform one step of gradient descent with a learning rate $η = 0.01$ .

4.Provide the updated weights and the error after the update.

5. Compare the result of the previous step with the case of a learning rate of $η = 0.1$

Solution

1. $z = w \cdot x = (0.4 \cdot 2) + (- 0.6 \cdot - 3) = 0.8 + 1.8 = 2.6$

2. $L = 0.5 \cdot (z - y_{true})^{2} = 0.5 \cdot (2.6 - 1)^{2} = 0.5 \cdot (1.6)^{2} = 0.5 \cdot 2.56 = 1.28$

3. The gradient of the loss with respect to $w_{i}$ is: $\frac{\partial L}{\partial w_{i}} = (z - y_{true}) \cdot x_{i}$

For $w_{1}$ (associated with $x_{1} = 2$ ): $\frac{\partial L}{\partial w_{1}} = (2.6 - 1) \cdot 2 = 1.6 \cdot 2 = 3.2$

For $w_{2}$ (associated with $x_{2} = - 3$ ): $\frac{\partial L}{\partial w_{2}} = (2.6 - 1) \cdot (- 3) = 1.6 \cdot - 3 = - 4.8$ The updated weights are: $w_{i} = w_{i} - η \cdot \frac{\partial L}{\partial w_{i}}$

For $w_{1}$ : $w_{1} = 0.4 - 0.01 \cdot 3.2 = 0.4 - 0.032 = 0.368$

For $w_{2}$ : $w_{2} = - 0.6 - 0.01 \cdot (- 4.8) = - 0.6 + 0.048 = - 0.552$

4. Recalculate $z$ with updated weights: $z = (0.368 \cdot 2) + (- 0.552 \cdot - 3) = 0.736 + 1.656 = 2.392$

Recalculate the error: $L = 0.5 \cdot (z - y_{true})^{2} = 0.5 \cdot (2.392 - 1)^{2} = 0.5 \cdot (1.392)^{2} = 0.5 \cdot 1.937 = 0.968$

5. Compare the result of the previous step with the case of a learning rate of $η = 0.1$ :

For $w_{1}$ : $w_{1} = 0.4 - 0.1 \cdot 3.2 = 0.4 - 0.032 = 0.08$

For $w_{2}$ : $w_{2} = - 0.6 - 0.1 \cdot (- 4.8) = - 0.6 + 0.048 = - 0.12$

Recalculate $z$ with the updated weights: $z = (0.08 \cdot 2) + (- 0.12 \cdot - 3) = 0.16 + 0.36 = 0.52$

Recalculate the error: $L = 0.5 \cdot (z - y_{true})^{2} = 0.5 \cdot (0.52 - 1)^{2} = 0.5 \cdot (- 0.48)^{2} = 0.5 \cdot 0.2304 = 0.1152$

Comparison:

- With a learning rate of $η = 0.01$ , the error after one update is $L = 0.968$ .

- With $η = 0.1$ , the error after one update is $L = 0.1152$ .

The error is much lower when using a larger learning rate $η = 0.1$ compared to a smaller learning rate $η = 0.01$ . However, large learning rates can sometimes cause overshooting of the optimal solution, so care must be taken when selecting a learning rate.

Exercise 2.7

Level: * (Easy)

Exercise Types: Copied

References: https://www.uni-weimar.de/fileadmin/user/fak/medien/professuren/Webis/teaching/ws13/ml/exercises-en-neural-networks.pdf

This problem comes from Exercise 2 : Perceptron Learning.

Question

Given two single perceptrons $a$ and $b$ , each defined by the inequality $w_{0} + w_{1} x_{1} + w_{2} x_{2} \geq 0$ :

- Perceptron $a$ : $w_{0} = 1$ , $w_{1} = 2$ , $w_{2} = 1$ - Perceptron $b$ : $w_{0} = 0$ , $w_{1} = 2$ , $w_{2} = 1$

Is perceptron $a$ more general than perceptron $b$ ?

Solution

To decide if perceptron $a$ is more general than perceptron $b$ , we compare their respective decision boundaries and positive regions:

- **Perceptron $a$ :** The positive region satisfies $1 + 2 x_{1} + x_{2} \geq 0$ . The decision boundary is $x_{2} = - 2 x_{1} - 1$ .

- **Perceptron $b$ :** The positive region satisfies $2 x_{1} + x_{2} \geq 0$ . The decision boundary is $x_{2} = - 2 x_{1}$ .

A perceptron $a$ is called “more general” than perceptron $b$ if: 1. Every point that $b$ classifies as positive is also classified as positive by $a$ . 2. There exist points that $a$ classifies as positive which $b$ does not.

Observe that for any $x_{1}$ : $- 2 x_{1} - 1 \leq - 2 x_{1}$ . Hence, the set of points ${(x_{1}, x_{2}) : x_{2} \geq - 2 x_{1}}$ (perceptron $b$ ’s positive region) is contained in the set ${(x_{1}, x_{2}) : x_{2} \geq - 2 x_{1} - 1}$ (perceptron $a$ ’s positive region). Therefore: 1. If $b$ classifies a point as positive, $a$ also does. 2. There are additional points (namely those where $- 2 x_{1} - 1 \leq x_{2} < - 2 x_{1}$ ) that $a$ classifies as positive but $b$ does not.

Hence, perceptron $a$ is indeed more general than perceptron $b$ .

Additional Subquestion

For a random sample of points, verify empirically that any point classified as positive by perceptron $b$ is also classified as positive by perceptron $a$ .

Additional Solution (Sample Code)

Below is a short Python snippet that generates random points and checks their classification according to each perceptron. (No visual output is included.)

```python import numpy as np

def perceptron_output(x1, x2, w0, w1, w2):

   return (w0 + w1*x1 + w2*x2) >= 0

Weights for a and b:

w_a = (1, 2, 1) # w0=1, w1=2, w2=1 w_b = (0, 2, 1) # w0=0, w1=2, w2=1

n_points = 1000 x1_vals = np.random.uniform(-10, 10, n_points) x2_vals = np.random.uniform(-10, 10, n_points)

count_b_positive_also_positive_in_a = 0 count_b_positive = 0

for x1, x2 in zip(x1_vals, x2_vals):

   output_b = perceptron_output(x1, x2, *w_b)
   output_a = perceptron_output(x1, x2, *w_a)
   if output_b:
       count_b_positive += 1
       if output_a:
           count_b_positive_also_positive_in_a += 1

print("Out of", count_b_positive, "points that B classified as positive,") print(count_b_positive_also_positive_in_a, "were also positive for A.") print("Hence, all (or nearly all) B-positive points lie in A's positive region.")

Exercise 2.10

Level: * (Easy)

Exercise Type: Novel

Question

In a simple linear regression

a). Derive the vectorized form of the SSE (loss function) in terms of $Y$ , $X$ and $θ$ .

b). Find the optimal value of $θ$ that minimizes the SSE (Recall that this is the weights of the linear regression).

Solution

a).

$\begin{aligned} S S E & = ‖ Y - \hat{Y} ‖^{2} \\ = (Y - \hat{Y})^{T} (Y - \hat{Y}) \\ = Y^{T} Y - Y^{T} X θ - (X θ)^{T} Y + (X θ)^{T} (X θ) \\ = Y^{T} Y - 2 Y^{T} X θ + (X θ)^{T} (X θ) \end{aligned}$

b).

$\begin{aligned} 0 & = \frac{\partial S S E}{\partial θ} \\ 0 & = \frac{\partial}{\partial θ} [Y^{T} Y - 2 Y^{T} X θ + (X θ)^{T} (X θ)] \\ 0 & = 0 - 2 Y X^{T} + 2 X^{T} X θ \\ 2 X^{T} X θ & = 2 Y X^{T} \\ θ & = (Y X^{T}) (X^{T} X)^{- 1} \end{aligned}$

Note that $X^{T} X \geq 0$ , so $\frac{\partial^{2}}{\partial θ^{2}} S S E = 2 X^{T} X \geq 0.$ Hence, the stationary point derived above yields a local minimum.

Exercise 2.11

Level: Moderate

Exercise Types: Novel

Question

Deep learning models often face challenges during training due to the vanishing gradient problem, especially when using sigmoid or tanh activation functions.

(a) Describe the vanishing gradient problem and its impact on the training of deep networks.

(b) Explain how the introduction of ReLU (Rectified Linear Unit) activation function mitigates this problem.

(c) Discuss one potential downside of using ReLU and propose an alternative activation function that addresses this limitation.

Solution

(a) Vanishing Gradient Problem: The vanishing gradient problem occurs when the gradients of the loss function become extremely small as they are propagated back through the layers of a deep network. This leads to:

(i)Slow or stagnant weight updates in early layers;

(ii)Difficulty in effectively training deep models. This issue is particularly pronounced with activation functions like sigmoid and tanh, where gradients approach zero as inputs saturate.

(b) Role of ReLU in Mitigation: ReLU, defined as $f (x) = max (0, x)$ , mitigates the vanishing gradient problem by:

(i)Producing non-zero gradients for positive inputs, maintaining effective weight updates;

(ii)Introducing sparsity, as neurons deactivate (output zero) for negative inputs, which improves model efficiency.

(c) Downside of ReLU and Alternatives: One downside of ReLU is the "dying ReLU" problem, where neurons output zero for all inputs, effectively becoming inactive. This can happen when weights are poorly initialized or during training.

Alternative: Leaky ReLU allows a small gradient for negative inputs, defined as $f (x) = x$ for $x > 0$ and $f (x) = α x$ for $x \leq 0$ , where $α$ is a small positive constant. This prevents neurons from dying, ensuring all neurons contribute to learning.

Exercise 2.12

Level: * (Easy)

Exercise Type: Novel

Question

Consider the following data:

$x = [\begin{matrix} 1 \\ 1 \\ 2 \\ 2 \\ 3 \\ 3 \end{matrix}], y = [\begin{matrix} 1 \\ 2 \\ 2 \\ 3 \\ 2 \\ 3 \end{matrix}]$

This data is fitted by a linear regression model with no bias term. What is the loss for the first four data points? Use the L2 loss defined as:

$L = \frac{1}{2} (y - \hat{y})^{2}$ .

Solution

The correct losses for the first four data points are 0, 0.5, 0, and 0.5.

Calculation

Step 1: Fit the linear regression model.
Since there is no bias term, the model is $\hat{y} = β x$ , where $β = \frac{\sum x_{i} y_{i}}{\sum x_{i}^{2}}$ .

Compute $β$ : $β = \frac{1 \cdot 1 + 1 \cdot 2 + 2 \cdot 2 + 2 \cdot 3 + 3 \cdot 2 + 3 \cdot 3}{1^{2} + 1^{2} + 2^{2} + 2^{2} + 3^{2} + 3^{2}} = \frac{28}{28} = 1.$

Step 2: Calculate $\hat{y}$ for each $x$ .
For the first four data points:
${\hat{y}}_{1} = 1 \cdot 1 = 1$
${\hat{y}}_{2} = 1 \cdot 1 = 1$
${\hat{y}}_{3} = 1 \cdot 2 = 2$
${\hat{y}}_{4} = 1 \cdot 2 = 2$

Step 3: Compute the L2 loss for each point.
- For $(x_{1}, y_{1})$ : $L_{1} = \frac{1}{2} (1 - 1)^{2} = 0$ .
- For $(x_{2}, y_{2})$ : $L_{2} = \frac{1}{2} (2 - 1)^{2} = 0.5$ .
- For $(x_{3}, y_{3})$ : $L_{3} = \frac{1}{2} (2 - 2)^{2} = 0$ .
- For $(x_{4}, y_{4})$ : $L_{4} = \frac{1}{2} (3 - 2)^{2} = 0.5$ .

Thus, the losses are: 0, 0.5, 0, 0.5.

Exercise 2.13

Level: ** (Moderate)

Exercise Types: Novel

References: A. Ghodsi, STAT 940 Deep Learning: Lecture 2, University of Waterloo, Winter 2025.

Question

In Lecture 2, we derived a formula for a simple perceptron to determine a hyperplane which splits a set of linearly independant data.

In class, we defined the error as

$e r r = \sum_{i \in M} - y_{i} (β^{T} x_{i} + β_{0})$

Where $M$ is the set of points which have been misclassified

The function for the simple perceptron was defined as

$y_{i} = β^{T} x + β_{0}$

Where a point belongs to Set 1 if it is above the line, and Set 2 if it is below the line.

In lecture 2, we found the partial derivatives as

$\frac{\partial e r r}{\partial f} = \sum_{i \in M} - y_{i} (β^{T} x_{i} + β_{i})$

and

$\frac{\partial e r r}{\partial β} = \sum_{i \in M} - y_{i}$

And defined the formula for gradient descent as

$[\begin{matrix} β \\ β_{0} \end{matrix}] = [\begin{matrix} β \\ β_{0} \end{matrix}] + η [\begin{matrix} \sum_{i \in M} y_{i} x_{i} \\ \sum_{i \in M} y_{i} \end{matrix}]$

Generate a random set of points and a line in 2D using python, and sort the set of points into 2 sets above and below the line. Then use gradient descent and the formulas above to find a hyperplane which also sorts the set with no prior knowledge of the line used for sorting.

Solution

The code below is a sample of implementing the 2D gradient descent algorithm derived in Lecture 2 to a random dataset

import numpy as np
import matplotlib.pyplot as plt
import functools

#Generate a random array of points
random_array = np.random.random((40,2)) 

#Linearlly seperate the points by a random hyperplane (in this case, a 2D line)
#y = ax + b
a = np.random.random()
b = np.random.random()*0.25

#Create the ground truth, classify points into Set 1 or Set 2
set_1_indicies = random_array[:,1] >= (a*random_array[:,0] + b) #Boolean indicies where y is greater to or equal to the line y = ax + b
set_2_indicies = random_array[:,1] < (a*random_array[:,0] + b)

#Get the (x,y) coordinates of set 1 and set 2 by selecting the indicies of the random array which are either above or below the hyperplane
set_1= random_array[set_1_indicies]
set_2= random_array[set_2_indicies]


plt.figure(1,figsize=(6,6))
plt.scatter(set_1[:,0],set_1[:,1],c='r')
plt.scatter(set_2[:,0],set_2[:,1],c='b')
plt.xlabel('x')
plt.ylabel('y')
x = np.linspace(0,1,2)
y = x*a + b
plt.plot(x,y,'g')
plt.legend(['Set 1','Set 2','Initial Hyperplane Used to Split Sets'])
plt.title("Generated Dataset")
plt.show()

#Now use a simple perceptron and the algorithm developed in class during lecture 2 to have an agent with no prior knowledge of the hyperplane determine what it is, or find a hyperplane that can seperate the two sets.

#Randomly initialize beta and beta_0

beta = np.random.random()

beta_0 = np.random.random()*0.25

def update_weights(beta,beta_0,random_array,set_1_indicies,set_2_indicies,learning_rate):
    #Find perceptron set 1 and set 2
    perceptron_set_1_indicies = random_array[:,1] >= (beta*random_array[:,0] + beta_0)
    perceptron_set_2_indicies = random_array[:,1] < (beta*random_array[:,0] + beta_0)
    
    #Find set M, the misclassified points
    #A point x,y classified by the agent as class 1 is misclassified if it is actually in set 2 or if it is classified as class 2 and is actually in set 1
    M_indicies= np.logical_or(np.logical_and(perceptron_set_1_indicies,set_2_indicies),np.logical_and(perceptron_set_2_indicies,set_1_indicies))
    
    #Get an array of the x,y coordinates of the misclassified points
    M=random_array[M_indicies]
    
    #Define variables for the sums on the right side of the gradient descent linear equation
    sum_yixi = 0
    sum_yi = 0
    
    #Compute the sums needed to do gradient descent
    for i in range(0,len(M)):
        sum_yixi = sum_yixi + M[i,0]*M[i,1]
        sum_yi = sum_yi + M[i,1]
    
    #Update beta and beta_0
    beta = beta + learning_rate*sum_yixi
    beta_0 = beta_0 + learning_rate*sum_yi
    
    return beta,beta_0


#Run for 1000 epochs

for i in range(0,1000):
    beta,beta_0 = update_weights(beta,beta_0,random_array,set_1_indicies,set_2_indicies,0.01)
    

plt.figure(1,figsize=(6,6))
plt.scatter(set_1[:,0],set_1[:,1],c='r')
plt.scatter(set_2[:,0],set_2[:,1],c='b')
plt.xlabel('x')
plt.ylabel('y')
x = np.linspace(0,1,2)
y = beta*x + beta_0
plt.plot(x,y,c='g')
plt.legend(['Set 1','Set 2','Gradient Descent Hyperplane'])
plt.title("Gradient Descent Generated Hyperplane")
plt.show()

The figure below shows a sample of the randomly generated linearly separated sets, and the lines used to separate the sets.

Randomy generated sets of data linearly separated

The figure below shows the predicted hyperplane found using gradient descent

Hyperplane found through gradient descent algorithm derived in STAT 240 lecture 2

Exercise 2.14

Level: Easy

Exercise Types: Novel

Question

Consider a simple neural network model with a single hidden layer to classify input data into two categories based on their features. Address the following points:

Describe the process of input data transformation through a single hidden layer.
Identify the role of activation functions in neural networks.
Explain the importance of the learning rate in the neural network training process.

Solution

Input Processing:
The network receives input features and feeds them through a hidden layer, where each input is subject to a weighted sum, addition of a bias, and application of an activation function. This series of operations transforms the input data into a representation that captures complex relationships within the data.

Mathematically, the output $h$ of the hidden layer for an input vector $x$ is given by: $h = f (W^{(1)} x + b^{(1)})$ , where $W^{(1)}$ represents the weights, $b^{(1)}$ the biases, and $f$ the activation function.
Role of Activation Functions:
Activation functions such as sigmoid, ReLU, or tanh introduce necessary non-linearities into the model, enabling it to learn more complex patterns and relationships in the data. These functions are applied to each neuron's output and help regulate the neural network's overall output, ensuring predictability and differentiation between different types of outputs.
Importance of Learning Rate:
The learning rate $η$ is a critical parameter that determines the extent to which the weights in the network are updated during training. An optimal learning rate ensures efficient convergence to a minimum, whereas an excessively high rate can lead to overshooting and an excessively low rate to slow convergence or getting stuck in local minima.

Exercise 2.15

Level: ** (Moderate)

Exercise Types: Novel

Question

Consider a simple feedforward neural network with one input layer, one hidden layer, and one output layer. The network structure is as follows:

Input layer: 2 neurons (features $x_{1}$ and $x_{2}$ ).
Hidden layer: 3 neurons with ReLU activation ( $(a = max (0, z))$ ).
Output layer: 1 neuron with no activation (linear output).

The weights ( $W$ ) and biases ( $b$ ) for the layers are:

Hidden layer:

$W_{hidden} = [\begin{matrix} 0.5 & - 0.2 \\ 0.8 & 0.3 \\ - 0.5 & 0.7 \end{matrix}], b_{hidden} = [\begin{matrix} 0.1 \\ - 0.4 \\ 0.2 \end{matrix}]$

Output layer:

$W_{hidden} = W_{output} = [\begin{matrix} 0.6 & - 0.1 & 0.3 \end{matrix}], b_{output} = 0.5$

For input values $x_{1} = 1.5$ and $x_{2} = - 0.5$ :

Calculate the output of the hidden layer ( $a_{hidden}$ ).
Calculate the final output of the network ( $y_{output}$ ).

Solution

Step 1: Calculate $z_{hidden} = W_{hidden} \cdot x + b_{hidden}$

The input vector is: $x = [\begin{matrix} x_{1} \\ x_{2} \end{matrix}] = [\begin{matrix} 1.5 \\ - 0.5 \end{matrix}]$

$z_{hidden} = [\begin{matrix} 0.5 & - 0.2 \\ 0.8 & 0.3 \\ - 0.5 & 0.7 \end{matrix}] \cdot [\begin{matrix} 1.5 \\ - 0.5 \end{matrix}] + [\begin{matrix} 0.1 \\ - 0.4 \\ 0.2 \end{matrix}]$

$z_{hidden} = [\begin{matrix} (0.5 \cdot 1.5) + (- 0.2 \cdot - 0.5) \\ (0.8 \cdot 1.5) + (0.3 \cdot - 0.5) \\ (- 0.5 \cdot 1.5) + (0.7 \cdot - 0.5) \end{matrix}] + [\begin{matrix} 0.1 \\ - 0.4 \\ 0.2 \end{matrix}]$

$z_{hidden} = [\begin{matrix} 0.75 + 0.1 \\ 1.2 - 0.15 \\ - 0.75 - 0.35 \end{matrix}] + [\begin{matrix} 0.1 \\ - 0.4 \\ 0.2 \end{matrix}] = [\begin{matrix} 0.85 \\ 0.65 \\ - 0.9 \end{matrix}]$

Step 2: Apply ReLU activation

The ReLU activation is defined as:

$a_{hidden} = max (0, z_{hidden})$

$a_{hidden} = [\begin{matrix} max (0, 0.85) \\ max (0, 0.65) \\ max (0, - 0.9) \end{matrix}] = [\begin{matrix} 0.85 \\ 0.65 \\ 0 \end{matrix}]$

Step 3: Calculate $z_{output} = W_{output} \cdot a_{hidden} + b_{output}$

$z_{output} = [\begin{matrix} 0.6 & - 0.1 & 0.3 \end{matrix}] \cdot [\begin{matrix} 0.85 \\ 0.65 \\ 0 \end{matrix}] + 0.5$

$z_{output} = (0.6 \cdot 0.85) + (- 0.1 \cdot 0.65) + (0.3 \cdot 0) + 0.5$

$z_{output} = 0.51 - 0.065 + 0 + 0.5 = 0.945$

Step 4: Final Output

The final output of the network is: $y_{output} = z_{output} = 0.945$

Exercise 2.16

Level: * (Easy)

Exercise Types: Novel

Question

Consider the following binary classification task, where the goal is to classify points into two categories: +1 or -1.

Training Data: $\begin{array}{ccc} x_{1} & x_{2} & y \\ 2 & 3 & 1 \\ 1 & 2 & 1 \\ 3 & 1 & - 1 \\ 4 & 5 & - 1 \end{array}$ Where $x_{1}$ and $x_{2}$ are the input features, $y$ is the target label (+1 or -1).

Task: Train a perceptron model using gradient descent, and update the weights using the perceptron update rule using the first data point(2,3,1). Initialize the weights as $β_{0} = 0$ , $β_{1} = 0$ , and $β_{2} = 0$ . Choose a learning rate of $η = 0.1$ .

Solution

Step 1: Compute the Prediction

The perceptron model predicts the label using the following equation:

$\hat{y} = sign (β_{0} + β_{1} x_{1} + β_{2} x_{2})$

Substituting the initial weights and the input values:

$\hat{y} = sign (0 + 0 \cdot 2 + 0 \cdot 3) = sign (0) = 0$

Since the predicted label $\hat{y} = 0$ is not equal to the true label $y = 1$ , we need to update the weights.

Step 2: Update the Weights

$β_{0} = β_{0} + η \cdot y = 0 + 0.1 \cdot 1 = 0.1$

$β_{1} = β_{1} + η \cdot y \cdot x_{1} = 0 + 0.1 \cdot 1 \cdot 2 = 0.2$

$β_{2} = β_{2} + η \cdot y \cdot x_{2} = 0 + 0.1 \cdot 1 \cdot 3 = 0.3$

Exercise 2.17

Level: * (Easy)

Exercise Types: Novel

Question

Consider a binary classification problem where a perceptron is used to separate two linearly separable classes in $R^{2}$ . The perceptron updates its weight vector w using the following update rule:

$w^{(t + 1)} = w^{(t)} + η y^{(t)} x^{(t)}$

where:

$w^{(t)}$ is the weight vector at iteration $t$ ,
$x^{(t)} \in R^{2}$ is the feature vector of the misclassified sample at iteration $t$ ,
$y^{(t)} \in {+ 1, - 1}$ is the corresponding label,
$η > 0$ is the learning rate.

Prove that if the data is **linearly separable**, the perceptron algorithm **converges in a finite number of steps

Solution

Define the angle between $w$ and the optimal weight vector $w^{*}$ , which perfectly separates the data. The perceptron algorithm updates the weight vector in the direction of each misclassified sample, gradually aligning it with $w^{*}$ .

By repeatedly applying the update rule and using the Cauchy-Schwarz inequality, we can show that:

$w^{(t)} \cdot w^{*} \geq t γ R$

which grows linearly with $t$ . Meanwhile, the norm of $w^{(t)}$ is bounded as:

$‖ w^{(t)} ‖^{2} \leq t R^{2}$

Combining these inequalities leads to the bound on the number of updates:

$T \leq \frac{R^{2}}{γ^{2}}$

Exercise 2.18

Level: * (Easy)

Exercise Types: Novel

Question

In a binary classification problem,assume the input data is a 2-dimensional vector x = [x1,x2].The Perceptron model is defined with the following parameters:

Weights: w = [2,-1] Bias: b = -0.5

The Perceptron output is given by: y = sign(wx + b), where sign(z) is the sign function that outputs +1 if z > 0, and -1 otherwise.

1. Compute the Perceptron output y for the input sample x = [1,2];

2. Assume the target label is t = +1. Determine if the Perceptron classifies the sample correctly. If it misclassifies, provide the update rule for the weights w and bisas b, and compute their updated values after one step.

Solution

1.Compute the predicted output y and determine if classification is correct:

$z = w*x + b = (2 * 1) + (- 1 * 2) + (- 0.5) = 2 - 2 - 0.5 = - 0.5 < 0 => y = - 1$

The target label t = +1, but the Perceptron output y = -1. Hence, the classification is incorrect.

2.Update rule for weights and bias:

Gradient:

$\nabla w = - t * x = - (+ 1) * [1, 2] = [- 1, - 2]$

$\nabla b = - t = - 1$

Update:

Assume the learning rate here is 0.1,

$w^{n e w} = w^{o l d} - ρ * \nabla w = [2, - 1] - 0.1 * [- 1, - 2] = [2.1, - 0.8]$

$b^{n e w} = b^{o l d} - ρ * \nabla b = - 0.5 - (- 1) = 0.5$

Exercise 2.19

Level: ** (Moderate)

Exercise Types: Novel

Question

Given a simple feedforward neural network with one hidden layer and a softmax output, the network has the following structure:

Input layer: $x \in R^{n}$ ,
Hidden layer: Linear transformation $h = W ₁ x + b ₁$ , followed by a ReLU activation, i.e., $h ₊ = m a x (0, h)$ ,
Output layer: Softmax output $y_{o u t p u t} = S o f t m a x (W ₂ h ₊ + b ₂)$ .

Let the true label be $y \in R^{k}$ , where $k$ is the number of output classes and $y$ is a one-hot vector.

1.Derive the gradient of the loss function (cross-entropy loss) with respect to the output logit $W_{2} h_{+} + b_{2}$ in terms of the softmax output and the true label $y$ . Write the expression for $\frac{\partial L}{\partial y_{o u t p u t}}$ . Here we assume the loss function is the entropy loss, i.e. $L = - \sum_{i} y_{i} \log (y_{o u t p u t, i})$ where $y_{i}$ is the true label and $y_{o u t p u t, i}$

2. Using the chain rule, derive the gradient of the loss function with respect to the hidden layer $h_{+}$ . How does the ReLU activation affect the gradient computation?

Solution

1. Let the output logits be represented as: $z = W ₂ h ₊ + b ₂$ where $h ₊ = m a x (0, W ₁ x + b ₁)$ is the output of the hidden layer after the ReLU activation.

The softmax output $y_{o u t p u t} = S o f t m a x (z)$ is given by: $y_{o u t p u t, i} = \frac{e^{z_{i}}}{\sum_{j} e^{z_{j}}}$

The cross-entropy loss $L = - \sum_{i} y_{i} \log (y_{o u t p u t, i})$ where $y_{i}$ is the true label and $y_{o u t p u t, i}$ is the predicted probability from the softmax.

The gradient of the loss with respect to the output logits $z_{i}$ is: $\frac{\partial L}{\partial z_{i}} = y_{o u t p u t, i} - y_{i}$

The gradient with respect to the output $y_{o u t p u t}$ is: $\frac{\partial L}{\partial y_{o u t p u t}} = y_{o u t p u t} - y$

2. Now, we compute the gradient of the loss with respect to the hidden layer $h_{+}$ .

Using the chain rule, we have: $\frac{\partial L}{\partial h_{+}} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial h_{+}}$

From part 1, we already computed the derivative of the loss with respect to the output logits: $\frac{\partial L}{\partial z} = y_{o u t p u t} - y$

The derivative of the logits with respect to the hidden layer is: $\frac{\partial z}{\partial h_{+}} = W ₂$

Therefore, the gradient with respect to the hidden layer is: $\frac{\partial L}{\partial h_{+}} = W ₂^{T} (y_{o u t p u t} - y)$

The ReLU activation introduces a nonlinearity in the gradient computation. Specifically, ReLU is defined as: $R e L U (x) = m a x (0, x)$

The gradient of the ReLU function is: $R e L U^{'} (x) = 1 if x > 0 else 0$

Therefore, the gradient with respect to the hidden layer becomes: $\frac{\partial L}{\partial h_{+}} = (W ₂^{T} (y_{o u t p u t} - y)) \cdot R e L U^{'} (h_{+})$

This means that if an element of $h_{+}$ is zero (due to the ReLU activation), the corresponding gradient is also zero, which effectively prevents updates to that particular element during backpropagation.

Exercise 2.20

Level: * (Easy)

Exercise Types: Novel

Question

Compute the distance of the following misclassified points to the hyperplane $L$ : $x_{1} = [2, 3]$ , $y_{1} = 1$ , $x_{2} = [4, - 1]$ , $y_{2} = - 1$ .

The hyperplane parameters are: $β = [1, - 2]^{T}, β_{0} = 3.$

Solution

The hyperplane equation is: $β^{T} x + β_{0} = x_{1} - 2 x_{2} + 3.$

For $x_{1} = [2, 3]$ , $y_{1} = 1$ : $β^{T} x_{1} + β_{0} = (1) (2) + (- 2) (3) + 3 = 2 - 6 + 3 = - 1.$ The distance is: $d_{1} = - y_{1} (β^{T} x_{1} + β_{0}) = - (1) (- 1) = 1.$

For $x_{2} = [4, - 1]$ , $y_{2} = - 1$ : $β^{T} x_{2} + β_{0} = (1) (4) + (- 2) (- 1) + 3 = 4 + 2 + 3 = 9.$ The distance is: $d_{2} = - y_{2} (β^{T} x_{2} + β_{0}) = - (- 1) (9) = 9.$

Exercise 2.21

Level: * (Easy)

Exercise Types: Novel

Question

Train a perceptron using gradient descent for the following settings:

Input Data: $x_{i} = [x_{i 1}, x_{i 2}]^{T}$ .
Labels: $y_{i} \in {- 1, + 1}$ .
Initial Weights: $w = [w_{1}, w_{2}]^{T} = [0, 0]^{T}$ , $b = 0$ .
Learning Rate: $η = 0.1$ .

The perceptron update rules are:
$w \leftarrow w + η \cdot y_{i} \cdot x_{i}$
$b \leftarrow b + η \cdot y_{i}$

Dataset:

$x_{1} = [2, 1]^{T}$ , $y_{1} = 1$
$x_{2} = [1, - 1]^{T}$ , $y_{2} = - 1$

Tasks:

Perform one iteration of gradient descent (update weights and bias).
Determine whether both points are correctly classified after the update.

Solution

w = [0, 0]^{T}

b = 0

η = 0.1

Step 1: Update for Point $x_{1} = [2, 1]^{T}, y_{1} = 1$
Compute perceptron output: $Output = w^{T} x_{1} + b = 0$ .
Misclassification occurs ( $Output \cdot y_{1} \leq 0$ ).
Update weights and bias:
$w \leftarrow [0, 0]^{T} + 0.1 \cdot 1 \cdot [2, 1]^{T} = [0.2, 0.1]^{T}$
$b \leftarrow 0 + 0.1 \cdot 1 = 0.1$

Step 2: Update for Point $x_{2} = [1, - 1]^{T}, y_{2} = - 1$
Compute perceptron output: $Output = w^{T} x_{2} + b = 0.2$ .
Misclassification occurs ( $Output \cdot y_{2} > 0$ ).
Update weights and bias:
$w \leftarrow [0.2, 0.1]^{T} + 0.1 \cdot (- 1) \cdot [1, - 1]^{T} = [0.1, 0.2]^{T}$ $b \leftarrow 0.1 + 0.1 \cdot (- 1) = 0$

Results:

Final weights: $w = [0.1, 0.2]^{T}$
Final bias: $b = 0$

Classification Check:

For $x_{1}$ : $w^{T} x_{1} + b = 0.4$ (correctly classified).
For $x_{2}$ : $w^{T} x_{2} + b = - 0.1$ (correctly classified).

Conclusion:

After one iteration of gradient descent, both points are correctly classified.

Exercise 2.22

Level: * (Moderate)

Exercise Types: Novel

Question

For the following neural network, do one forward pass through the network using the point (x1, x2, x3) = (1, 2, -1). The hidden layer uses the ReLu activation function and the output layer uses the sigmoid activation function. Calculate pfinal.

Solution

1) Hidden layer calculations: The equations for the hidden layer are $z 1 = x 1 * w 1 + x 2 * w 2 + x 3 * w 3 + w 4$ and $z 2 = x 1 * w 5 + x 2 * w 6 + x 3 * w 7 + w 8$ . Plugging in the values x and the weights we obtain $z 1 = (1) * (0.5) + (2) * (0.3) + (- 1) * (0.4) - 1 = - 0.3$ and $z 2 = (1) * (- 0.2) + (2) * (1) + (- 1) * (0.7) + 0.9 = 2$ . The equations to activate the hidden layers are $z a = m a x (0, z 1)$ and $z b = m a x (0, z 2)$ . Plugging in z1 and z2 we obtain $z a = m a x (0, - 0.3) = 0$ and $z b = m a x (0, 2) = 2$ .

2) Output layers calculations: The equations for the output layer is $z f i n a l = z a * w 9 + z b * w 10 + w 11$ . Plugging in za, zb and the weights we obtain $z f i n a l = 0 * 0.6 + 2 * 0.5 - 0.4 = 0.6$ . To calculate pfinal we need to use the sigmoid activation function so the equation is $p f i n a l = \frac{1}{1 + e^{- z f i n a l}}$ . Plugging in the value for zfinal we obtain $p f i n a l = \frac{1}{1 + e^{- 0.6}} = 0.646$ .

Exercise 3.1

Level: ** (Moderate)

Exercise Type: Novel

Implement the perceptron learning algorithm with momentum for the AND function. Plot the decision boundary after 10 epochs of training with a learning rate of 0.1 and a momentum of 0.9.

Solution

import numpy as np
import matplotlib.pyplot as plt

# Dataset
data = np.array([
    [0, 0, -1],
    [0, 1, -1],
    [1, 0, -1],
    [1, 1, 1]
])

# Initialize weights, learning rate, and momentum
weights = np.random.rand(3)
learning_rate = 0.1
momentum = 0.9
epochs = 10
previous_update = np.zeros(3)

# Add bias term to data
X = np.hstack((np.ones((data.shape[0], 1)), data[:, :2]))
y = data[:, 2]

# Training loop
for epoch in range(epochs):
    for i in range(len(X)):
        prediction = np.sign(np.dot(weights, X[i]))
        if prediction != y[i]:
            update = learning_rate * y[i] * X[i] + momentum * previous_update
            weights += update
            previous_update = update

# Plot final decision boundary
x_vals = np.linspace(-0.5, 1.5, 100)
y_vals = -(weights[1] * x_vals + weights[0]) / weights[2]
plt.plot(x_vals, y_vals, label='Final Decision Boundary')

# Plot dataset
for point in data:
    color = 'blue' if point[2] == 1 else 'red'
    plt.scatter(point[0], point[1], color=color)

plt.title('Perceptron with Momentum')
plt.legend()
plt.show()

Exercise 3.2

Level: ** (Moderate)

Exercise Types: Novel

Question

Write a python program showing how the back propagation algorithm work with a 2 inputs 2 hidden layer and 1 output layer neural network. Train the network on the XOR problem:

Input[0,0] -> Output 0
Input [0,1] -> Output 1
Input [1,0] -> Output 1
Input [1,1] -> Output 0

Use the sigmoid activation function. Use Mean Square Error as the loss function.

Solution

import numpy as np

# -------------------------
# 1. Define Activation Functions
# -------------------------

def sigmoid(x):
    """
    Sigmoid activation function.
    """
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    """
    Derivative of the sigmoid function.
    Here, 'x' is assumed to be sigmoid(x), 
    meaning x is already the output of the sigmoid.
    """
    return x * (1 - x)


# -------------------------
# 2. Prepare the Training Data (XOR)
# -------------------------
# Input data (4 samples, each with 2 features)
X = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])

# Target labels (4 samples, each is a single output)
y = np.array([
    [0],
    [1],
    [1],
    [0]
])

# -------------------------
# 3. Initialize Network Parameters
# -------------------------
# Weights for input -> hidden (shape: 2x2)
W1 = np.random.randn(2, 2)
# Bias for hidden layer (shape: 1x2)
b1 = np.random.randn(1, 2)

# Weights for hidden -> output (shape: 2x1)
W2 = np.random.randn(2, 1)
# Bias for output layer (shape: 1x1)
b2 = np.random.randn(1, 1)

# Hyperparameters
learning_rate = 0.1
num_epochs = 10000

# -------------------------
# 4. Training Loop
# -------------------------
for epoch in range(num_epochs):
    # 4.1. Forward Pass
    #   - Compute hidden layer output
    hidden_input = np.dot(X, W1) + b1  # shape: (4, 2)
    hidden_output = sigmoid(hidden_input)
    
    #   - Compute final output
    final_input = np.dot(hidden_output, W2) + b2  # shape: (4, 1)
    final_output = sigmoid(final_input)
    
    # 4.2. Compute Loss (Mean Squared Error)
    error = y - final_output  # shape: (4, 1)
    loss = np.mean(error**2)
    
    # 4.3. Backpropagation
    #   - Gradient of loss w.r.t. final_output
    d_final_output = error * sigmoid_derivative(final_output)  # shape: (4, 1)
    
    #   - Propagate error back to hidden layer
    error_hidden_layer = np.dot(d_final_output, W2.T)  # shape: (4, 2)
    d_hidden_output = error_hidden_layer * sigmoid_derivative(hidden_output)  # shape: (4, 2)
    
    # 4.4. Gradient Descent Updates
    #   - Update W2, b2
    W2 += learning_rate * np.dot(hidden_output.T, d_final_output)  # shape: (2, 1)
    b2 += learning_rate * np.sum(d_final_output, axis=0, keepdims=True)  # shape: (1, 1)
    
    #   - Update W1, b1
    W1 += learning_rate * np.dot(X.T, d_hidden_output)  # shape: (2, 2)
    b1 += learning_rate * np.sum(d_hidden_output, axis=0, keepdims=True)  # shape: (1, 2)
    
    # Print loss every 1000 epochs
    if epoch % 1000 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.6f}")

# -------------------------
# 5. Testing / Final Outputs
# -------------------------
print("\nTraining complete.")
print("Final loss:", loss)

# Feedforward one last time to see predictions
hidden_output = sigmoid(np.dot(X, W1) + b1)
final_output = sigmoid(np.dot(hidden_output, W2) + b2)

print("\nOutput after training:")
for i, inp in enumerate(X):
    print(f"Input: {inp} -> Predicted: {final_output[i][0]:.4f} (Target: {y[i][0]})")

Output:

Epoch 0, Loss: 0.257193
Epoch 1000, Loss: 0.247720
Epoch 2000, Loss: 0.226962
Epoch 3000, Loss: 0.191367
Epoch 4000, Loss: 0.162169
Epoch 5000, Loss: 0.034894
Epoch 6000, Loss: 0.012459
Epoch 7000, Loss: 0.007127
Epoch 8000, Loss: 0.004890
Epoch 9000, Loss: 0.003687

Training complete.
Final loss: 0.0029435579049382756

Output after training:
Input: [0 0] -> Predicted: 0.0598 (Target: 0)
Input: [0 1] -> Predicted: 0.9461 (Target: 1)
Input: [1 0] -> Predicted: 0.9506 (Target: 1)
Input: [1 1] -> Predicted: 0.0534 (Target: 0)

Exercise 3.3

Level: * (Easy)

Exercise Types: Novel

Question

Implement 4 iterations of gradient descent with and without momentum for the function $f (x) = x^{2} + 2$ with learning rate $η = 0.1$ , momentum $γ = 0.9$ , starting value of $x_{0} = 2$ , starting velocity of $v_{0} = 0$ . Comment on the differences.

Solution

Note that $f^{'} (x) = 2 x$

Without momentum:

Iteration 1: $x_{1} = x_{0} - η * f^{'} (x_{0}) = 2 - 0.1 * 2 * 2 = 1.6$

Iteration 2: $x_{2} = x_{1} - η * f^{'} (x_{1}) = 1.6 - 0.1 * 2 * 1.6 = 1.28$

Iteration 3: $x_{3} = x_{2} - η * f^{'} (x_{2}) = 1.28 - 0.1 * 2 * 1.28 = 1.024$

Iteration 4: $x_{4} = x_{3} - η * f^{'} (x_{3}) = 1.024 - 0.1 * 2 * 1.024 = 0.8192$

With momentum:

Iteration 1: $v_{1} = γ * v_{0} + η * f^{'} (x_{0}) = 0.9 * 0 + 0.1 * 2 * 2 = 0.4, x_{1} = x_{0} - v_{1} = 2 - 0.4 = 1.6$

Iteration 2: $v_{2} = γ * v_{1} + η * f^{'} (x_{1}) = 0.9 * 0.4 + 0.1 * 2 * 1.6 = 0.68, x_{2} = x_{1} - v_{2} = 1.6 - 0.68 = 0.92$

Iteration 3: $v_{3} = γ * v_{2} + η * f^{'} (x_{2}) = 0.9 * 0.68 + 0.1 * 2 * 0.92 = 0.796, x_{3} = x_{2} - v_{3} = 0.92 - 0.796 = 0.124$

Iteration 4: $v_{4} = γ * v_{3} + η * f^{'} (x_{3}) = 0.9 * 0.796 + 0.1 * 2 * 0.124 = 0.7412, x_{4} = x_{3} - v_{4} = 0.124 - 0.7412 = - 0.6172$

By observation, we know that the minimum of $f (x) = x^{2} + 2$ occurs at $x = 0$ . We can see that with momentum, the algorithm moves towards the minimum much faster than without momentum as past gradients are accumulated, leading to larger steps. However, we also can see that momentum can cause the algorithm to overshoot the minimum since we are taking larger steps.

Benefits for momentum: Momentum is a technique used in optimization to accelerate convergence. Inspired by physical momentum, it helps in navigating the optimization landscape.

By remembering the direction of previous gradients, which are accumulated into a running average (the velocity), momentum helps guide the updates more smoothly, leading to faster progress. This running average allows the optimizer to maintain a consistent direction even if individual gradients fluctuate. Additionally, momentum can help the algorithm escape from shallow local minima by carrying the updates through flat regions. This prevents the optimizer from getting stuck in small, unimportant minima and helps it continue moving toward a better local minimum.

Additional Comment: It is important to note that the use of running average is only there to help with intuition. At time t, while the velocity is a linear sum of previous gradients, the weight of the gradient decreases as time get further away. That is, the $\nabla Q (w_{0})$ term will have a coefficient of $(1 - γ)^{t}$ .

Exercise 3.4

Level: ** (Moderate)

Exercise Types: Novel

Question

Perform one iteration of forward pass and backward propagation for the following network:

Input layer: 2 neurons (x₁, x₂)
Hidden layer: 2 neurons (h₁, h₂)
Output layer: 1 neuron (ŷ)

Input-to-Hidden Layer:

 w₁₁¹ = 0.15, w₂₁¹ = 0.20, w₁₂¹ = 0.25, w₂₂¹ = 0.30
 Bias: b¹ = 0.35
 Activation function: sigmoid

Hidden-to-Output Layer:

 w₁₁² = 0.40, w₂₁² = 0.45
 Bias: b² = 0.60
 Activation function: sigmoid

Input:

x₁ = 0.05, x₂ = 0.10
Target output: y = 0.01
Learning rate: η = 0.5

Solution

Step 1: Forward Pass

1. Hidden Layer Outputs

For neuron h₁:

$z ₁ ¹ = w ₁ ₁ ¹ \cdot x ₁ + w ₂ ₁ ¹ \cdot x ₂ + b ¹ = 0.15 (0.05) + 0.20 (0.10) + 0.35 = 0.3775$
$h ₁ = σ (z ₁ ¹) = \frac{1}{1 + e^{- 0.3775}} \approx 0.5933$

For neuron h₂:

$z ₂ ¹ = w ₁ ₂ ¹ \cdot x ₁ + w ₂ ₂ ¹ \cdot x ₂ + b ¹ = 0.25 (0.05) + 0.30 (0.10) + 0.35 = 0.3925$
$h ₂ = σ (z ₂ ¹) = \frac{1}{1 + e^{- 0.3925}} \approx 0.5968$

2. Output Layer

$z ² = w ₁ ₁ ² \cdot h ₁ + w ₂ ₁ ² \cdot h ₂ + b ² = 0.40 (0.5933) + 0.45 (0.5968) + 0.60 = 1.1051$
$\hat{y} = σ (z ²) = \frac{1}{1 + e^{- 1.1051}} \approx 0.7511$

Step 2: Compute Error

$E = \frac{1}{2} (\hat{y} - y)^{2} = \frac{1}{2} (0.7511 - 0.01)^{2} \approx 0.2738$

Step 3: Backpropagation

3.1: Gradients for Output Layer

1. Gradient w.r.t. output neuron:

$δ ² = (\hat{y} - y) \cdot \hat{y} \cdot (1 - \hat{y})$
$δ ² = (0.7511 - 0.01) \cdot 0.7511 \cdot (1 - 0.7511) = 0.1381$

2. Update weights and bias for hidden-to-output layer:

$w ₁ ₁ ² = w ₁ ₁ ² - η \cdot δ ² \cdot h ₁ = 0.40 - 0.5 \cdot 0.1381 \cdot 0.5933 = 0.359$
$w ₂ ₁ ² = w ₂ ₁ ² - η \cdot δ ² \cdot h ₂ = 0.45 - 0.5 \cdot 0.1381 \cdot 0.5968 = 0.409$
$b ² = b ² - η \cdot δ ² = 0.60 - 0.5 \cdot 0.1381 = 0.53095$

3.2: Gradients for Hidden Layer

1. Gradients for hidden layer neurons:

For h₁:

$δ ₁ = δ ² \cdot w ₁ ₁ ² \cdot h ₁ \cdot (1 - h ₁)$
$δ ₁ = 0.1381 \cdot 0.40 \cdot 0.5933 \cdot (1 - 0.5933) = 0.0138$

For h₂:

$δ ₂ = δ ² \cdot w ₂ ₁ ² \cdot h ₂ \cdot (1 - h ₂)$
$δ ₂ = 0.1381 \cdot 0.45 \cdot 0.5968 \cdot (1 - 0.5968) = 0.0148$

2. Update weights and bias for input-to-hidden layer:

For w₁₁¹:

$w ₁ ₁ ¹ = w ₁ ₁ ¹ - η \cdot δ ₁ \cdot x ₁ = 0.15 - 0.5 \cdot 0.0138 \cdot 0.05 = 0.14965$

For w₂₁¹:

$w ₂ ₁ ¹ = w ₂ ₁ ¹ - η \cdot δ ₁ \cdot x ₂ = 0.20 - 0.5 \cdot 0.0138 \cdot 0.10 = 0.19931$

For b¹:

$b ¹ = b ¹ - η \cdot (δ ₁ + δ ₂) = 0.35 - 0.5 \cdot (0.0138 + 0.0148) = 0.3347$

This completes one iteration of forward and backward propagation.

Exercise 3.5

Level: * (Easy)

Exercise Types: Novel

Question

Consider the loss function $Q (w) = w^{2} + 2 w + 1$ . Compute the gradient of $Q (w)$ . Starting from $w_{0} = 2$ , perform two iterations of stochastic gradient descent using a learning rate $ρ = 0.1$ .

Solution

Compute the gradient at $w_{0}$ : $\nabla Q (w_{0}) = \frac{d}{d w_{0}} (w_{0}^{2} + 2 w_{0} + 1) = 2 w_{0} + 2 = 2 (2) + 2 = 4 + 2 = 6$

Update the weight using SGD: $w_{1} = w_{0} - ρ \cdot \nabla Q (w_{0}) = 2 - 0.1 \cdot 6 = 2 - 0.6 = 1.4$

Compute the gradient at $w_{1}$ : $\nabla Q (w_{1}) = \frac{d}{d w_{1}} (w_{1}^{2} + 2 w_{1} + 1) = 2 w_{1} + 2 = 2 (1.4) + 2 = 2.8 + 2 = 4.8$

Update the weight again: $w_{2} = w_{1} - ρ \cdot \nabla Q (w_{1}) = 1.4 - 0.1 \cdot 4.8 = 1.4 - 0.48 = 0.92$

Gradient Path Figure:

Exercise 3.6

Level: * (Easy)

Exercise Types: Novel

Question

What is the prediction of the following MLP for $x = [\begin{matrix} 2 \\ 2 \\ 1 \end{matrix}]$ ?

Both layers are using sigmoid activation. The weight matrices connecting the input and hidden layer, and the hidden layer and output are respectively: $V = [\begin{matrix} 1 & 0 & 1 \\ 1 & - 1 & 0 \end{matrix}], W = [\begin{matrix} 0 & 1 \end{matrix}] .$

Choose the correct answer:
a) $σ (0)$
b) $σ (σ (0))$
c) $σ (- 1)$
d) $σ (σ (0))$

Solution

The correct answer is b): $σ (σ (0))$ .

Calculation

Step 1: Compute the hidden layer output $h$ .
$h = σ (V x) = [\begin{matrix} σ (3) \\ σ (0) \end{matrix}]$

Step 2: Compute the output layer prediction $y$ .
$y = σ (W h) = σ (σ (0))$

Thus, the prediction of the MLP is $σ (σ (0))$ .

Exercise 3.7

Level: * (Easy)

Exercise Types: Novel

Question

Feedforward Neural Networks (FNNs) are one of the commonly used model in deep learning. In the context of training such networks, there are several important design components and techniques to consider.

(a) Give three commonly used activation functions. For each function, provide the formula and comment on its usage.

(b) Write down a widely used loss function for classification and explain why it is popular.

(c) Explain what adaptive learning methods are and how they help optimize and speed up neural network training.

Solution

(a)

Sigmoid Activation Function
$f (z) = \frac{1}{1 + e^{- z}}$

The output of the sigmoid function ranges from 0 to 1, making it suitable for estimating probabilities. For example, it can be used in the final layer of a binary classification task.
ReLU (Rectified Linear Unit) Activation Function
$f (z) = max (0, z) = {\begin{cases} z, & if z > 0, \\ 0, & if z \leq 0. \end{cases}$

ReLU outputs zero or a positive number, enabling some weights to be set to 0, promoting sparse representation. Since it only requires a comparison, and possibly setting a number to zero, it is more computationally efficient than other activation functions. An activation function similar to ReLU is the Gaussian error linear unit (GELU), which has a very similar shape except it is smooth at $z = 0$ .
Tanh (Hyperbolic Tangent)
$f (z) = \frac{e^{z} - e^{- z}}{e^{z} + e^{- z}}$
Advantages: It can have zero-centered output, which helps during optimization by leading to balanced gradient updates; It's also a smooth gradient that works well in many cases. It also squashes extreme values into the range [-1,1], reducing the bias introduced from outlying datapoints.

Disadvantages: It outputs close to -1 or 1 can have near-zero gradients, leading to slower learning (vanishing gradient issue).

(b)

Cross-Entropy Loss
$L_{CE} = - \sum_{i = 1}^{n} [y_{i} \log (y_{pred, i}) + (1 - y_{i}) \log (1 - y_{pred, i})]$

It provides a smooth and continuous gradient, and it penalizes the incorrect predictions more when the predictions are made with confidence. It is well suited for multi-class classification, especially when used with softmax function together.

(c)

Some variants of SGD adjust the learning rate during the training process based on the gradients' magnitudes, helping accelerate convergence and manage gradients more effectively.

Exercise 3.8

Level: * (Easy)

Exercise Types: Novel

Question

Consider a Feedforward Neural Network (FNN) with a single neuron, where the loss function is given by: $L (w) = (w - 3)^{2}$ . Compute the gradient of $L (w)$ . Starting from $w_{0} = 0$ perform two iterations of Stochastic Gradient Descent (SGD) using a learning rate $η = 0.5$

Solution

Gradient of $L (w)$ :
$\frac{d L}{d w} = 2 (w - 3)$
$w_{0} = 0$
$w_{1} = w_{0} - η \frac{d L}{d w} = 3$
$w_{2} = w_{1} - η \frac{d L}{d w} = 3$
So, after 2 iterations, $w_{2} = 3$ .

Additional Expanded Question

What if the loss function were $L (w) = (w - 3)^{4}$ ? Compute its gradient and perform two iterations of SGD starting from $w_{0} = 0$ using the same learning rate $η = 0.5$ . Do we still reach $w = 3$ in two steps?

Additional Solution

For $L (w) = (w - 3)^{4}$ , the gradient is: $\frac{d L}{d w} = 4 (w - 3)^{3}$ .

1. **Iteration 1** ( $k = 0$ ): $\frac{d L}{d w} |_{w = 0} = 4 (0 - 3)^{3} = 4 \times (- 27) = - 108$ . $w_{1} = 0 - 0.5 \times (- 108) = 54$ .

2. **Iteration 2** ( $k = 1$ ): $\frac{d L}{d w} |_{w = 54} = 4 (54 - 3)^{3} = 4 \times 51^{3} = 4 \times 132651 = 530604$ . $w_{2} = 54 - 0.5 \times 530604 = 54 - 265302 = - 265248$ .

Clearly, $w$ does not converge to $3$ in just two steps. Because $w$ is far from $3$ initially, the gradient is extremely large and causes a massive overshoot. In practice, you would reduce $η$ or use more sophisticated optimization methods to handle the higher-order curvature of this loss function.

Exercise 3.9

Level: ** (Moderate)

Exercise Types: Modified

References: Source: Schonlau, M., Applied Statistical Learning. With Case Studies in Stata, Springer. ISBN 978-3-031-33389-7 (Chapter 14, page 318).

Question

Consider the feedforward neural network with initial weights shown in Figure 3.9 (For simplicity, there are no biases). Use sigmoid activation functions ( $σ (x) = \frac{1}{1 + e^{- x}}$ ) for both the hidden and the output layers. Use a learning rate of $ρ = 0.5$ .

(a): Compute a forward pass through the network for the observation (y,x1,x2,x3) = (1,3,-2,5). That is, compute the predicted probability $p_{1}$ .

(b): Using the result from (a), make a backward pass to compute the revised w7 and w1. Use squared error loss: $E = 0.5 * (y - p_{1})^{2}$ , where $y \in {0, 1}$ is the true value of the response and $p_{1}$ is the predicted probability.

Solution

(a)

$z_{A} = x_{1} w_{1} + x_{2} w_{2} + x_{3} w_{3} = 3 * 0.8 + (- 2) * (- 1) + 5 * 0.1 = 4.9$

$z_{B} = x_{1} w_{4} + x_{2} w_{5} + x_{3} w_{6} = 3 * 0.3 + (- 2) * 0.5 + 5 * (- 0.2) = - 1.1$

$o u t_{A} = \frac{1}{1 + e^{- z_{A}}} = \frac{1}{1 + e^{- 4.9}} = 0.9926$

$o u t_{B} = \frac{1}{1 + e^{- z_{B}}} = \frac{1}{1 + e^{1.1}} = 0.2497$

$z_{1} = o u t_{A} w_{7} + o u t_{B} w_{8} = 0.9926 * 1.3 + 0.2497 * 0.4 = 1.3903$

$p_{1} = \frac{1}{1 + e^{- z_{1}}} = \frac{1}{1 + e^{- 1.3903}} = 0.8006$

(b)

Using $w_{i}^{n e w} = w_{i}^{o l d} - ρ \frac{\partial E}{\partial w_{i}}$

$\frac{\partial E}{\partial w_{7}} = \frac{\partial E}{\partial p_{1}} \frac{\partial p_{1}}{\partial z_{1}} \frac{\partial z_{1}}{\partial w_{7}} = (p_{1} - y) * p_{1} * (1 - p_{1}) * o u t_{A} = - 0.0317$

$w_{7}^{n e w} = 1.3 - 0.5 * (- 0.0317) = 1.3159$

$\frac{\partial E}{\partial w_{1}} = \frac{\partial E}{\partial p_{1}} \frac{\partial p_{1}}{\partial z_{1}} \frac{\partial z_{1}}{\partial o u t_{A}} \frac{\partial o u t_{A}}{\partial z_{A}} \frac{\partial z_{A}}{\partial w_{1}} = (p_{1} - y) * p_{1} * (1 - p_{1}) * w_{7} * o u t_{A} * (1 - o u t_{A}) * x_{1} = - 0.0091$

$w_{1}^{n e w} = 0.8 - 0.5 * (- 0.0091) = 0.8046$

Exercise 3.10

Level: * (Easy)

Exercise Types: Novel

Question

What is vanishing gradient and how is it caused? What is the solution that fixes vanishing gradient?

Answer

The vanishing gradient problem occurs when the gradients used to update weights in a neural network become extremely small as they propagate backward through the layers. This happens because activation functions compress their inputs into a narrow output range. Consequently, their derivatives are very small, particularly for inputs far from zero, causing gradients to shrink exponentially in deeper layers. As a result, in modern computing, the gradient will be the result of multiplying multiple epsilon sized gradients where the resulting update is 0. To fix this, we use the relu activation function where the gradient is either 0 or 1. This ensures that the update value will not vanish due to small gradients. This is because ReLU does not squash its outputs into a narrow range. For positive inputs, the gradient of ReLU is constant and equal to 1, ensuring that gradients remain significant during backpropagation. Unlike other functions that lead to exponentially small gradients, ReLU’s piecewise linear nature avoids the compounding effect of small derivatives across layers. By maintaining larger gradient values, ReLU ensures that weight updates are not hindered, making it an effective solution to the vanishing gradient problem, especially in deep networks.

Exercise 3.11

Level: ** (Moderate)

Exercise Types: Novel

Question

Given a two-layer neural network with the following specifications:

Inputs: Represented as $x = [x_{1}, x_{2}]^{T}$ , a $2 \times 1$ column vector.

Weights: First layer weight matrix: $W_{1}$ , a $2 \times 2$ matrix; Second layer weight vector: $W_{2}$ , a $1 \times 2$ row vector.

Activations: The activation function for all layers is the sigmoid function, defined as: $σ (z) = \frac{1}{1 + e^{- z}}$ .

Loss Function: The cross-entropy loss, given by $L = - y \log (\hat{y}) - (1 - y) \log (1 - \hat{y})$ where $y$ is the true label (0 or 1) and $\hat{y}$ is the predicted output.

(a). Write out the forward propagation steps.

(b). Calculate the derivative of the loss with respect to weights in $W_{1}$ and $W_{2}$ using the chain rule.

Solution

(a). Forward Propagation Steps:

For the first layer pre-activation: $z_{1} = W_{1} \cdot x$ where $z_{1}$ is a $2 \times 1$ column vector.

For the first layer activation: $a_{1} = σ (z_{1})$ where the sigmoid function is applied element-wise, resulting in $a_{1}$ , a $2 \times 1$ column vector.

For the second layer pre-activation: $z_{2} = W_{2} \cdot a_{1}$ where $z_{2}$ is a scalar value.

For the second layer activation (output): $\hat{y} = σ (z_{2})$ where $\hat{y}$ represents the predicted probability.

For the loss calculation: $L = - y \log (\hat{y}) - (1 - y) \log (1 - \hat{y})$ .

(b). Derivative of the Loss

Gradients for the Second Layer $W_{2}$ : using the chain rule: $\frac{\partial L}{\partial W_{2}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_{2}} \cdot \frac{\partial z_{2}}{\partial W_{2}} .$

For the loss gradient w.r.t. output: $\frac{\partial L}{\partial \hat{y}} = - \frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} .$

For the gradient of sigmoid output w.r.t. pre-activation: $\frac{\partial \hat{y}}{\partial z_{2}} = \hat{y} (1 - \hat{y}) .$

For the gradient of pre-activation w.r.t. weights: $\frac{\partial z_{2}}{\partial W_{2}} = a_{1} .$

For the combine terms: $\frac{\partial L}{\partial W_{2}} = (\hat{y} - y) \cdot a_{1} .$

Gradients for the First Layer $W_{1}$ : using the chain rule: $\frac{\partial L}{\partial W_{1}} = \frac{\partial L}{\partial a_{1}} \cdot \frac{\partial a_{1}}{\partial z_{1}} \cdot \frac{\partial z_{1}}{\partial W_{1}} .$

For the gradient of loss w.r.t. first layer activation: $\frac{\partial L}{\partial a_{1}} = \frac{\partial L}{\partial z_{2}} \cdot \frac{\partial z_{2}}{\partial a_{1}}$ ; from the second layer: $\frac{\partial L}{\partial z_{2}} = (\hat{y} - y), \frac{\partial z_{2}}{\partial a_{1}} = W_{2}^{T} .$ So: $\frac{\partial L}{\partial a_{1}} = (\hat{y} - y) \cdot W_{2}^{T} .$

For the gradient of activation w.r.t. pre-activation: $\frac{\partial a_{1}}{\partial z_{1}} = a_{1} ⊙ (1 - a_{1}),$ where $⊙$ denotes element-wise multiplication.

For the gradient of pre-activation w.r.t. first layer weights: $\frac{\partial z_{1}}{\partial W_{1}} = x^{T} .$

For the combine terms: $\frac{\partial L}{\partial W_{1}} = [(\hat{y} - y) \cdot W_{2}^{T}] ⊙ [a_{1} ⊙ (1 - a_{1})] \cdot x^{T} .$

Exercise 3.12

Level: ** (Moderate)

Exercise Types: Novel

Question

Consider the **Bent Identity loss function**, a smooth approximation of the absolute loss, defined as:

$l (y, \hat{y}) = \sqrt{1 + (\hat{y} - y)^{2}} - 1$

The following identities may be useful:

$\frac{d}{d x} \sqrt{1 + x^{2}} = \frac{x}{\sqrt{1 + x^{2}}}, \frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \frac{\partial u}{\partial x} .$

where $y$ is the true response, and $\hat{y} = w^{T} x + b$ is the predicted response for a feature vector $x$ given model parameters $w$ and $b$ .

Part (a): Compute the Gradient

Find the gradient of the loss function with respect to $w$ and $b$ .

Part (b): Implement Gradient Descent

Using your result from Part (a), write the update rules for **gradient descent** and implement the iterative optimization process.

Solution

Step 1: Computing the Gradient

We differentiate the loss function:

$l (y, \hat{y}) = \sqrt{1 + (\hat{y} - y)^{2}} - 1.$

Differentiating with respect to $w$ and $b$ , we obtain:

$\nabla_{w} l = \frac{(\hat{y} - y) x}{\sqrt{1 + (\hat{y} - y)^{2}}}$

$\nabla_{b} l = \frac{\hat{y} - y}{\sqrt{1 + (\hat{y} - y)^{2}}}$

Step 2: Gradient Descent Update Rules

We update the parameters using gradient descent:

$w_{t + 1} = w_{t} - η \nabla_{w} l$

$b_{t + 1} = b_{t} - η \nabla_{b} l$

where $η$ is the learning rate.

Step 3: Algorithm Implementation

Initialize w, b randomly
Set learning_rate η
For t = 1 to max_iterations:
    Compute predicted value: y_hat = w^T * x + b
    Compute gradients:
        grad_w = ((y_hat - y) / sqrt(1 + (y_hat - y)^2)) * x
        grad_b = (y_hat - y) / sqrt(1 + (y_hat - y)^2)
    Update parameters:
        w = w - η * grad_w
        b = b - η * grad_b
    Check for convergence

Exercise 3.13

Level: ** (Difficult)

Exercise Types: Novel

Question

Consider training a deep neural network with momentum-based SGD on a quadratic approximation of the loss near a local minimum, $L (θ) = \frac{1}{2} θ^{⊤} H, θ$ , where $H$ is the Hessian. Explain how the momentum term modifies the effective condition number of $H$ , and why this can speed convergence in directions with small curvature while controlling overshoot in directions with large curvature. Provide a brief analysis of the discrete iteration dynamics that illustrates this effect.

Solution

Let $θ_{t}$ be the parameters at iteration $t$ , and $v_{t}$ be the velocity term. The momentum-based SGD updates for a quadratic loss can be written as:

$v_{t + 1} = β v_{t} + η H θ_{t}, θ_{t + 1} = θ_{t} - v_{t + 1},$

where $β$ is the momentum coefficient and $η$ is the learning rate.

In the eigenbasis of $H$ , let $λ_{i}$ be an eigenvalue and $u_{i}$ the corresponding eigenvector.

Projecting the iteration onto the direction $u_{i}$ yields a scalar recurrence of the form:

$θ_{t + 1}^{(i)} = θ_{t}^{(i)} - [β v_{t}^{(i)} + η λ_{i} θ_{t}^{(i)}] .$

Because $v_{t}^{(i)}$ itself depends on past gradients, the combined effect of $β$ and $η, λ_{i}$ modifies the “effective” eigenvalue seen in that direction. Specifically:

Small $λ_{i}$ (Flat Directions)

When $λ_{i}$ is small, repeated gradient directions are reinforced by $β$ , accelerating convergence compared to vanilla SGD. Effectively, momentum increases the update step in directions that change slowly.

Large $λ_{i}$ (Steep Directions)

If $λ_{i}$ is large, the term $β, v_{t}^{(i)}$ moderates the sudden jumps, helping to avoid overshooting. The velocity “remembers” past updates, dampening abrupt swings caused by steep curvature.

Overall, momentum alters the eigenvalues of $H$ into a more favorable spectrum, reducing the effective condition number. In practice, this translates into faster convergence along flat directions and controlled progress in steep directions, both of which are crucial in the highly non-convex landscapes typical of deep neural networks.

Exercise 3.14

Level: * (Easy)

Exercise Types: Novel

Question

Perform one step of backpropagation for the following network:

Input layer: 2 neurons
Hidden layer: 2 neurons
Output layer: 1 neuron
Activation function: sigmoid
Weight matrix between the input and hidden layer: $W_{1} = [\begin{matrix} 0.4 & - 0.2 \\ 0.6 & 0.3 \end{matrix}]$
Weight matrix between the hidden layer and output layer: $W_{2} = [\begin{matrix} 0.5 \\ - 0.3 \end{matrix}]$

Input: $X = [\begin{matrix} 0.5 \\ 0.1 \end{matrix}]$
Target output: y = 0.8
Learning rate: η = 0.1

Solution

Step 1: Forward Pass

Hidden Layer Outputs

$a_{1} = 0.4 * 0.5 + (- 0.20) * 0.10 = 0.18$

$z_{1} = σ (a_{1}) = \frac{1}{1 + e^{- 0.18}} \approx 0.545$

$a_{2} = 0.5 * 0.6 + 0.3 * 0.1 = 0.33$

$z_{2} = σ (a_{2}) = \frac{1}{1 + e^{- 0.33}} \approx 0.582$

Output Layer

$a_{3} = 0.5 * (0.545) + (- 0.3) * 0.582 = 0.0979$

$\hat{y} = σ (a_{3}) = \frac{1}{1 + e^{- 0.0979}} \approx 0.5245$

Step 2: Compute Error

$L = (\hat{y} - y)^{2} = (0.5245 - 0.8)^{2} \approx 0.0759$

Step 3: Backpropagation

Gradients for Output Layer

$δ_{3} = 2 * (\hat{y} - y) = - 0.2755$

Update weights for hidden-to-output layer:

$W_{2 [1]}^{'} = W_{2 [1]} - η \cdot δ_{3} \cdot z_{1} = 0.5 - 0.1 \cdot (- 0.2755) \cdot 0.545 = 0.515$ $W_{2 [2]}^{'} = W_{2 [2]} - η \cdot δ_{3} \cdot z_{2} = - 0.3 - 0.1 \cdot (- 0.2755) \cdot 0.582 = - 0.284$

Gradients for Hidden Layer

$δ_{1} = σ^{'} (a_{1}) \cdot δ_{3} \cdot W_{2 [1]} = 0.545 \cdot (1 - 0.545) \cdot (- 0.2755) \cdot 0.5 = - 0.0342$

$δ_{2} = σ^{'} (a_{2}) \cdot δ_{3} \cdot W_{2 [2]} = 0.582 \cdot (1 - 0.582) \cdot (- 0.2755) \cdot (- 0.3) = 0.0201$

Update weights and bias for input-to-hidden layer:

$W_{1 [11]}^{'} = W_{1 [11]} - η \cdot δ_{1} \cdot X_{[} 1] = 0.4 - 0.1 \cdot (- 0.0342) \cdot 0.5 = 0.4017$

$W_{1 [12]}^{'} = W_{1 [12]} - η \cdot δ_{1} \cdot X_{[} 2] = - 0.2 - 0.1 \cdot (- 0.0342) \cdot 0.1 = - 0.1997$

$W_{1 [21]}^{'} = W_{1 [21]} - η \cdot δ_{2} \cdot X_{[} 1] = 0.6 - 0.1 \cdot 0.0201 \cdot 0.5 = 0.5990$

$W_{1 [22]}^{'} = W_{1 [22]} - η \cdot δ_{2} \cdot X_{[} 2] = 0.3 - 0.1 \cdot 0.0201 \cdot 0.1 = 0.2998$

Exercise 3.15

Level: ** (Moderate)

Exercise Types: Modified (Based on ME 780 Assignment 2, University of Waterloo, Fall 2024 - in the assignment, backpropogation was programmed for a 2D vector field with a deeper network using MATLAB not Python. This was modified to use Python with a simpler network as an easier exercise to understand backpropogation. A sigmoid activation function is used rather than tanh)

References:

A. Ghodsi, STAT 940 Deep Learning: Lecture 3, University of Waterloo, Winter 2025.

W. Melek, ME 780 Computational Intelligence Chapter 6 Neural Network Parameter Learning Algorithms Course Notes, University of Waterloo, Fall 2024

Question

Define a function $f (x) = x^{2}$ on the domain (0,1).

Develop a 3 layer feedforward neural network with 10 neurons in the second layer to predict the output of this function. Use a sigmoid activation function for the hidden layer, and a linear activation function for the output layer.

Manually code backpropogation to learn the weights for this network using Python.

Use stochastic gradient descent, as discussed in Lecture 3 of STAT 940.

Solution

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0,1,50).reshape(50,1)
y = x**2

#number of neurons in hidden layer

#Define a matrix with the weights and biases for the hidden layer
w_1 = np.random.rand(10,1) #10 is the number of neurons in the hidden layer, 1 is the number of neurons in the input layer
b_1 = np.random.rand(10,1) 

#define a matrix with weights for the output layer
w_2 = np.random.rand(1,10) #1 is the number of neurons in the output layer, 10 is the number of neurons in the hidden layer
b_2 = np.random.rand(1,1)

#Sigmoid function in Python credit https://stackoverflow.com/questions/60746851/sigmoid-function-in-numpy
def sigmoid(z):
    return 1/(1 + np.exp(-z))

def nn_output(w_1,b_1,w_2,b_2,x):
    
    hidden_layer_output = sigmoid(np.matmul(w_1,np.transpose(x)) + b_1)
    
    return (np.matmul(w_2,hidden_layer_output) + b_2), hidden_layer_output

#Plot the desired function and initial neural network output before training
plt.figure(1,figsize=(7,6))
plt.plot(x,y)
plt.plot(x,nn_output(w_1,b_1,w_2,b_2,x)[0].flatten())
plt.title('Neural Network Output Before Training')
plt.legend(['Desired Function','Neural Network Output'])
plt.show()

learning_rate = 0.1

x2 = np.linspace(0,1,50)

#Do 1000 epochs
for i in range(1000):
    
    #shuffle x2 for each epoch
    np.random.shuffle(x2)    
    
    #For each epoch, do stochastic gradient descent, looping through each element in x
    for j in range(x2.shape[0]):

        #Evaluate the neural network output at x to get the error of the output
        y_nn = nn_output(w_1,b_1,w_2,b_2,x2[j].reshape(1,1))
        
        #Update the weights and bias in the output layer
        
        #Note - these equations for the output layer only work for a linear activation function, otherwise you'd have to use the chain rule
        #delta is the error in the output
        delta_output = (x2[j]**2 - y_nn[0])
        #Bias update is previous bias + learning rate * delta
        b_2 = b_2 + learning_rate*delta_output
        #Weight update is previous weight + learning rate * delta * input (input is the hidden layer neuron output)
        w_2 = w_2 + np.matmul(learning_rate*delta_output,np.transpose(y_nn[1]))
        
        #Update the weights and bias in the hidden layer
        #Note that the derivative of the sigmoid function is f'(x) = y(1-y)
        delta_hidden = (y_nn[1])*np.matmul(np.transpose(w_2),delta_output)
        #Bias update is same like above
        b_1 = b_1 + learning_rate*delta_hidden
        #Weight update is same like above
        w_1 = w_1 + learning_rate*np.matmul(delta_hidden,x2[j].reshape(1,1))
        
#Plot the desired function and initial neural network output after training
plt.figure(1,figsize=(7,6))
plt.plot(x,y)
plt.plot(x,nn_output(w_1,b_1,w_2,b_2,x)[0].flatten())
plt.title('Neural Network Output After Training')
plt.legend(['Desired Function','Neural Network Output'])
plt.show()

Neural Network Output after Using Backpropogation for Training

Exercise 3.16

Level: * (Moderate)

Exercise Types: Copied

Reference: Calin, Ovidiu. Deep learning architectures: A mathematical approach. Springer, 2020

This question is from exercise 6.6.9 on page 198.

Question

Consider a one-hidden layer neural network with sigmoid neurons in the hidden layer. Given that the input is normally distributed, $X \sim N (0, 1)$ , and the output is $Y = \sum_{i = 1}^{N} α_{i} σ (w_{i} X + b_{i})$ . Show that $V a r (Y)$ is approximately $\sum_{i} σ^{'} (b_{i})^{2} α_{i}^{2} w_{i}^{2}$ .

Solution

The variance of Y is given by $V a r (Y) = V a r (\sum_{i = 1}^{N} α_{i} σ (w_{i} X + b_{i}))$

Since the $σ (w_{i} X + b_{i})$ terms are not independent, it can be hard to decompose. However, we can use a linear approximation as follows for small values $w_{i} X$ around $b_{i}$ :

$σ (w_{i} X + b_{i}) \approx σ (b_{i}) + σ^{'} (b_{i}) w_{i} X$

Therefore, we have $V a r (Y) \approx V a r (\sum_{i = 1}^{N} α_{i} (σ (b_{i}) + σ^{'} (b_{i}) w_{i} X)) = V a r (\sum_{i = 1}^{N} α_{i} σ^{'} (b_{i}) w_{i} X) = {(\sum_{i = 1}^{N} α_{i} σ^{'} (b_{i}) w_{i})}^{2}$ since $X \sim N (0, 1)$

This is approximately equal to $\sum_{i} σ^{'} (b_{i})^{2} α_{i}^{2} w_{i}^{2}$ if we ignore the cross terms. Note that the squared terms dominate the cross terms since the squared terms are always positive, and we assume weights are small.

Exercise 3.17

Level: * (Moderate)

Exercise Types: Modified

Question

Consider the loss function $Q (w) = w^{2} + 3 w + 5$ .

Compute the gradient of $Q (w)$ . Starting from $w_{0} = - 2$ , perform three iterations of stochastic gradient descent using a learning rate $ρ = 0.15$ . Explain how the choice of $ρ$ influences the stability and speed of convergence for this loss function.

Solution

Compute the gradient of $Q (w)$ : $\nabla Q (w) = \frac{d}{d w} (w^{2} + 3 w + 5) = 2 w + 3$

At $w_{0} = - 2$ : $\nabla Q (w_{0}) = 2 (- 2) + 3 = - 4 + 3 = - 1$ Update $w$ : $w_{1} = w_{0} - ρ \cdot \nabla Q (w_{0}) = - 2 - 0.15 \cdot (- 1) = - 2 + 0.15 = - 1.85$

At $w_{1} = - 1.85$ : $\nabla Q (w_{1}) = 2 (- 1.85) + 3 = - 3.7 + 3 = - 0.7$ Update $w$ : $w_{2} = w_{1} - ρ \cdot \nabla Q (w_{1}) = - 1.85 - 0.15 \cdot (- 0.7) = - 1.85 + 0.105 = - 1.745$

At $w_{2} = - 1.745$ : $\nabla Q (w_{2}) = 2 (- 1.745) + 3 = - 3.49 + 3 = - 0.49$ Update $w$ : $w_{3} = w_{2} - ρ \cdot \nabla Q (w_{2}) = - 1.745 - 0.15 \cdot (- 0.49) = - 1.745 + 0.0735 = - 1.6715$

Effect of $ρ$ : With $ρ = 0.15$ , the convergence is stable, but it might require more iterations for steeper gradients. A smaller $ρ$ (e.g., $ρ = 0.05$ ) would slow down the updates, increasing the number of iterations required to reach the minimum. A larger $ρ$ (e.g., $ρ = 0.5$ ) could lead to overshooting or divergence, especially near sharp curvatures in the loss function.

Exercise 3.18

Level: * (Easy)

Exercise Types: Novel

Question

In a feedforward neural network, explain why introducing more hidden layers can potentially improve the network's capacity to model complex functions. However, why might adding too many hidden layers degrade the model's performance or make training difficult?

Solution

Adding more hidden layers increases the expressive power of the network, enabling it to approximate more complex functions. This stems from the Universal Approximation Theorem, which states that a sufficiently large feedforward neural network with non-linear activation functions can approximate any continuous function on a compact subset of ℝⁿ. Each additional layer allows the network to learn and represent features at different levels of abstraction, with earlier layers capturing simpler patterns and deeper layers identifying more complex relationships.

However, adding too many hidden layers introduces several challenges:

Vanishing/Exploding Gradients: During backpropagation, the gradients of the loss function with respect to weights in earlier layers can diminish (vanishing) or grow uncontrollably (exploding). This makes it difficult to update weights effectively and slows down or destabilizes training.
Overfitting: Excessively deep networks with many parameters are prone to overfitting, especially if the training data is insufficient or noisy. The network may memorize the training data instead of generalizing well to unseen data.
Computational Cost: Deeper networks require more computation, leading to longer training times and higher resource demands, which might be inefficient for certain applications.
Optimization Challenges: Deep networks create highly non-convex loss landscapes, increasing the risk of getting stuck in poor local minima or saddle points, making convergence to a good solution more challenging.
Diminishing Returns: Beyond a certain depth, additional layers may no longer contribute significantly to the network's ability to learn, resulting in wasted computational resources.

Exercise 3.19

Level: * (Easy)

Exercise Types: Modified

References: Calin, Ovidiu. Deep learning architectures: A mathematical approach. Springer, 2020, Exercise 4.17.2, Page 130.

Question

Consider the quadratic function $Q (x) = \frac{1}{2} x^{T} A x - b x$ , with $A$ nonsingular square matrix of order $n$ .

(1) Find the gradient.

(2) Write down the update equation using standard gradient descent and momentum

Solution

(1) $\nabla Q (x) = A x - b$

(2) Update equation given by gradent descent

$x_{t + 1} = x_{t} - ρ \nabla Q (x_{t}) = x_{t} - ρ (A x_{t} - b) = (I - ρ A) x_{t} + ρ b$

Update equation given by momentum $v_{t + 1} = β v_{t} + (1 - β) \nabla Q (x) = β v_{t} + (1 - β) (A x_{t} - b)$

$x_{t + 1} = x_{t} - ρ v_{t + 1} = (I - ρ (1 - β) A) x_{t} - ρ β v_{t} + ρ (1 - β) b$

Exercise 3.20

Level: * (Easy)

Exercise Types: Novel

Question

Consider a simple feedforward neural network with one hidden layer. The network takes an input vector $x = [x_{1}, x_{2}, . . ., x_{n}]$ , passes it through a hidden layer with activation function $σ$ , and produces an output $y_{pred}$ via a linear output layer.

The architecture of the network is as follows:

Input Layer: $x = [x_{1}, x_{2}, . . ., x_{n}]$
Hidden Layer: $h = σ (W_{1} x + b_{1})$ , where $W_{1} \in R^{m \times n}$ and $b_{1} \in R^{m}$ are the weights and bias of the hidden layer,
Output Layer: $y_{pred} = W_{2} h + b_{2}$ , where $W_{2} \in R^{1 \times m}$ and $b_{2} \in R$ are the weights and bias of the output layer.

Given the mean squared error (MSE) loss function: $L = \frac{1}{2} (y_{true} - y_{pred})^{2}$ where $y_{true}$ is the true target value.

Derive the backpropagation equations for updating the weights $W_{1}, W_{2}$ and biases $b_{1}, b_{2}$ using gradient descent.

Solution

Step 1: Gradients with respect to $W_{2}$ and $b_{2}$

The predicted output $y_{pred}$ is given by: $y_{pred} = W_{2} h + b_{2}$

The derivative of the loss function with respect to $y_{pred}$ is: $\frac{\partial L}{\partial y_{pred}} = - (y_{true} - y_{pred})$

Now, compute the gradient of the loss with respect to $W_{2}$ and $b_{2}$ :

For $W_{2}$ : $\frac{\partial L}{\partial W_{2}} = \frac{\partial L}{\partial y_{pred}} \cdot \frac{\partial y_{pred}}{\partial W_{2}} = - (y_{true} - y_{pred}) \cdot h$

For $b_{2}$ : $\frac{\partial L}{\partial b_{2}} = \frac{\partial L}{\partial y_{pred}} \cdot \frac{\partial y_{pred}}{\partial b_{2}} = - (y_{true} - y_{pred})$

Step 2: Gradients with respect to $W_{1}$ and $b_{1}$

Next, we compute the gradients with respect to the hidden layer parameters.

The hidden layer activation is: $h = σ (W_{1} x + b_{1})$

Using the chain rule, we first compute $\frac{\partial L}{\partial h}$ : $\frac{\partial L}{\partial h} = \frac{\partial L}{\partial y_{pred}} \cdot \frac{\partial y_{pred}}{\partial h} = - (y_{true} - y_{pred}) \cdot W_{2}$

Then, we compute the gradient with respect to $W_{1}$ and $b_{1}$ :

For $W_{1}$ : $\frac{\partial L}{\partial W_{1}} = \frac{\partial L}{\partial h} \cdot \frac{\partial h}{\partial W_{1}} = - (y_{true} - y_{pred}) \cdot W_{2} \cdot σ^{'} (W_{1} x + b_{1}) \cdot x^{T}$

For $b_{1}$ : $\frac{\partial L}{\partial b_{1}} = \frac{\partial L}{\partial h} \cdot \frac{\partial h}{\partial b_{1}} = - (y_{true} - y_{pred}) \cdot W_{2} \cdot σ^{'} (W_{1} x + b_{1})$

Step 3: Update Equations

Using the gradients derived above, the updates for the weights and biases using gradient descent are:

For $W_{2}$ : $W_{2}^{(t + 1)} = W_{2}^{(t)} - η \frac{\partial L}{\partial W_{2}}$

For $b_{2}$ : $b_{2}^{(t + 1)} = b_{2}^{(t)} - η \frac{\partial L}{\partial b_{2}}$

For $W_{1}$ : $W_{1}^{(t + 1)} = W_{1}^{(t)} - η \frac{\partial L}{\partial W_{1}}$

For $b_{1}$ : $b_{1}^{(t + 1)} = b_{1}^{(t)} - η \frac{\partial L}{\partial b_{1}}$

Where $η$ is the learning rate.

Exercise 3.21

Level: ** (Moderate)

Exercise Types: Novel

Question

There are many choices of activation functions in Feedforward Neural Networks, such as the sigmoid functions and hyperbolic tangent. This question considers other possibilities.

(a) Suppose a and b are constants. Is $a \tanh (b x)$ a good candidate? (b) Is $s i n (x)$ a good activation function?

Solution

(a) It could be a good candidate, but the main effect would be similar to $\tanh (x)$ . The addition of constants can scale the intermediate values within the networks and therefore can affect the convergence rate, akin to the effect of batch normalization.

(b) It may not be a good choice of activation function. Although it introduces nonlinearities in the training, the periodicity also introduces many local minimum, exacerbating the issue of escaping local minima.

Exercise 3.22

Level: * (Easy)

Exercise Types: Novel

Question

Assume you are training a nerual network model with gradient decent. There is a dataset with 1000 samples.

1. If you choose a mini-batch size of 100, how many times will the weights be updated during training?

2. If the mini-batch size is 50, how will the number of updates change? Why?

Solution

1. The dataset has 1000 samples, and each mini-batch has 100 samples. Thus, the number of weight updates will be:

$\frac{1000}{100} = 10$ updates.

2. If the mini-batch size is 50, the number of weight updates will be:

$\frac{1000}{50} = 20$ updates.

Therefore, the number of updates increases, and asmaller mini-batch results in more frequent updates, but each update involves fewer sample.

Exercise 3.23

Level: ** (Moderate)

Exercise Types: Novel

Question

Given a simple linear regression model y = w x + b and a single training example (x=2,y=4), show how to perform one Stochastic Gradient Descent update step for w and b. Suppose: $\begin{aligned} w = 1, b = 0, \\ L = \frac{1}{2} (y_{pred} - y)^{2}, η = 0.1 . \end{aligned}$

Solution

$\begin{aligned} y_{pred} & = w \cdot x + b = 1 \cdot 2 + 0 = 2, \\ error & = y_{pred} - y = 2 - 4 = - 2, \\ \frac{\partial L}{\partial w} & = (y_{pred} - y) \cdot x = (- 2) \cdot 2 = - 4, \\ \frac{\partial L}{\partial b} & = (y_{pred} - y) = - 2, \\ w_{new} & = w - η \cdot \frac{\partial L}{\partial w} = 1 - 0.1 \cdot (- 4) = 1.4, \\ b_{new} & = b - η \cdot \frac{\partial L}{\partial b} = 0 - 0.1 \cdot (- 2) = 0.2 . \end{aligned}$

Exercise 3.24

Level: ** (Moderate)

Exercise Types: Modified

References: Deep Learning - Foundations and Concepts, by Christopher M. Bishop and Hugh Bishop, Exercise 7.1, Page 212.

Question

Prove that the Stochastic Gradient Descent (SGD) algorithm reduces the value of the objective function Q(w), where Q(w) is differentiable, and its gradient satisfies $\nabla Q (w)$ . The goal of the proof is to show that for a sufficiently small learning rate $η$ , the weight update rule: $w_{t + 1} = w_{t} - η \nabla Q (w_{t})$ ensures that the objective function Q(w) decreases monotonically.

Solution

1. Change in the Objective Function:

The change in the objective function value can be expressed as:

$Q (w_{t + 1}) - Q (w_{t})$

Using the first-order Taylor expansion of Q(w), we can approximate it as:

$Q (w_{t + 1}) \approx Q (w_{t}) + \nabla Q (w_{t})^{T} (w_{t + 1} - w_{t})$

2. Substitude the update rule: Substituting the weight update rule $w_{t + 1} = w_{t} - η \nabla Q (w_{t})$ , we get:

$Q (w_{t + 1}) \approx Q (w_{t}) - η \nabla Q (w_{t})^{T} \nabla Q (w_{t})$

3. Gradient Property:

The quadratic term involving the gradient can be written as:

$\nabla Q (w_{t})^{T} \nabla Q (w_{t}) = ‖ \nabla Q (w_{t}) ‖^{2}$

Therefore, $Q (w_{t + 1}) - Q (w_{t}) \approx - η ‖ \nabla Q (w_{t}) ‖^{2}$

4. Conclusion:

Since the learning rate $η > 0 a n d ‖ \nabla Q (w_{t}) ‖^{2} \geq 0$ ,

we have: $Q (w_{t + 1}) - Q (w_{t}) \leq 0$ .

As long as the learning rate $η$ is sufficiently small, the objective function Q(w) will decrease monotonically, and SGD will converge to a local minimum of Q(w).

Exercise 3.25

Level: ** (Moderate)

Exercise Types: Modified (Reference: Probabilistic Machine Learning: An Introduction, 13.2.3 Activation Functions)

Question

Consider a single-layer neural network with the ReLU activation function defined as: $ReLU (x) = max (0, x) .$

Given a weight matrix $W$ , an input vector $x$ , and a bias $b$ , the output of the layer is: $h = ReLU (W x + b) .$

Let: $W = [\begin{matrix} 1 & - 2 \\ 3 & 0 \end{matrix}],$ $x = [\begin{matrix} 2 \\ - 1 \end{matrix}],$ $b = [\begin{matrix} 1 \\ - 3 \end{matrix}] .$

1. Compute the output $h$ of the layer.
2. Derive the gradient of the ReLU activation for the given $x$ and explain how it behaves for positive and negative values of the input.

Solution

1. The output is computed as: $h = ReLU (W x + b),$ where: $W x = [\begin{matrix} 1 & - 2 \\ 3 & 0 \end{matrix}] [\begin{matrix} 2 \\ - 1 \end{matrix}] = [\begin{matrix} 1 (2) + (- 2) (- 1) \\ 3 (2) + 0 (- 1) \end{matrix}] = [\begin{matrix} 2 + 2 \\ 6 + 0 \end{matrix}] = [\begin{matrix} 4 \\ 6 \end{matrix}] .$

Adding the bias: $W x + b = [\begin{matrix} 4 \\ 6 \end{matrix}] + [\begin{matrix} 1 \\ - 3 \end{matrix}] = [\begin{matrix} 5 \\ 3 \end{matrix}] .$

Applying ReLU: $h = ReLU ([\begin{matrix} 5 \\ 3 \end{matrix}]) = [\begin{matrix} max (0, 5) \\ max (0, 3) \end{matrix}] = [\begin{matrix} 5 \\ 3 \end{matrix}] .$

2. Gradient of ReLU:

The derivative of ReLU is defined as: ${ReLU}^{'} (x) = {\begin{cases} 1 & if x > 0, \\ 0 & if x \leq 0. \end{cases}$

For example take $x = [\begin{matrix} 2 \\ - 1 \end{matrix}]$ : Then $2 > 0$ , the gradient is $1$ . $- 1 \leq 0$ , the gradient is $0$ .

Hence, the gradient of ReLU activation behaves as a binary switch, passing gradients only for positive values of the input and blocking gradients for non-positive values.

Exercise 3.26

Level: ** (Moderate)

Exercise Types: Novel

Question

Given a single-layer neural network with a sigmoid activation function used for binary classification, the network is trained using stochastic gradient descent (SGD) with the cross-entropy loss function:

$L = - [y \cdot \log (\hat{y}) + (1 - y) \cdot \log (1 - \hat{y})],$

where $y$ is the binary label (0 or 1) and $\hat{y}$ is the predicted probability.

Assuming the input vector $x = [0.5, 1.2, - 0.3]$ , label $y = 1$ , and learning rate $η = 0.01$ , derive and apply the SGD update rules for a single training instance. Include calculations for the weights and bias from initial random values.

Solution

1. **Forward Pass Calculation:** Calculate the output before activation, $z$ , by $z = w \cdot x + b$ where initial weights $w = [0.2, - 0.1, 0.1]$ and bias $b = 0.01$ . Then apply the sigmoid function to obtain the predicted probability $\hat{y} = \frac{1}{1 + e^{- z}}$ .

2. **Loss and Gradient Calculation:** Calculate the loss using the cross-entropy formula. Derive the gradients with respect to $\hat{y}$ as $\frac{\partial L}{\partial \hat{y}} = - [y \cdot \frac{1}{\hat{y}} - (1 - y) \cdot \frac{1}{1 - \hat{y}}]$ and chain it to get $\frac{\partial L}{\partial z} = \hat{y} - y$ .

3. **Update Weights and Bias:** Use the gradient and learning rate to update each weight $w_{i}$ and the bias $b$ : $w_{i} = w_{i} - η \cdot \frac{\partial L}{\partial z} \cdot x_{i}$ , $b = b - η \cdot \frac{\partial L}{\partial z}$ .

For example, the weight update for $w_{1}$ would be: $w_{1} = 0.2 - 0.01 \cdot (\hat{y} - 1) \cdot 0.5$ , and similarly for other weights and bias.

Exercise 3.27

Level: * (Easy)

Exercise Types: Novel

Question

Consider the following two optimization update rules:

Standard Gradient Descent (GD): $w_{t + 1} = w_{t} - ρ \nabla Q (w_{t})$

Momentum-Based Update: $v_{t + 1} = β v_{t} + (1 - β) \nabla Q (w_{t})$ , $w_{t + 1} = w_{t} - ρ v_{t + 1}$

Explain the key differences between standard gradient descent and the momentum-based update in terms of: 1) How the gradient information is used 2) The behaviour of the optimization process, particularly in flat regions and regions with high curvature 3) The convergence speed to the minimum

Exercise 4.1

Level: * (Easy)

Exercise Types: Novel

Question

Explain why SURE is rarely used in large-scale deep learning in practice.

Solution

Note that to compute the model complexity part of the SURE estimator, we have to compute the divergence term:

$2 σ^{2} \sum_{i = 1}^{n} D_{i}$

where

$D_{i} = \frac{\partial {\hat{f}}_{i} (y)}{\partial y_{i}}$

For modern deep learning frameworks, both the number of parameters (weights) and the number of data dimensions are huge (in million or billions). This makes the computation of the divergence term extremely expensive. Note that using stochastic gradient descent the estimation of the weights is the result of an iterative process. If we include an inner loop to compute all the divergence, the computation complexity is growing exponentially.

Additionally, the high-dimensional nature of deep learning models means that computing the divergence requires handling large matrices or tensors, making the task even more resource-intensive. Even though parallel computation techniques like GPU acceleration can reduce the time for individual gradient calculations, the divergence computation still adds significant overhead when applied across all data points and iterations. Moreover, the computational complexity becomes even more challenging when dealing with large-scale datasets and real-time training, where the time needed to calculate divergence can slow down the training process substantially. This makes methods like SURE impractical for real-time or large-scale applications compared to simpler and more efficient alternatives like cross-validation, which does not require computing high-dimensional divergence and is easy to implement, making it a more suitable approach in practice.

However, SURE gives very good insight about the behaviour of the true error, and the divergence term is considered a regularization term. For simpler models such as linear regression, the divergence term can be easy to compute, and perturbing a trained data point would not change the estimated function by much. However in more complex models such as deep learning frameworks, perturbing a trained data point can result in large changes in our estimated function, meaning that the divergence term will be larger. Often times, we add a regularization term to our loss function that mimics the behaviour of the divergence term, such that the loss function increases the more complex our model is. Techniques such as adding weight decay or penalties on gradient magnitudes are often used to improve generalization.

Additional Subquestion

Could we use SURE in a simplified or approximate way for deep learning models, perhaps by estimating only a subset of partial derivatives or using a smaller batch of data? What would be the potential trade-offs?

Solution for Additional Question

One possible approach to making SURE more tractable is to approximate the divergence term using a small subset of data points or partial derivatives. For instance: - **Subsampled Divergence**: Compute $D_{i}$ for only a small mini-batch of the dataset and use this to extrapolate the full divergence. - **Block-Diagonal or Low-Rank Approximation**: Instead of computing the full Jacobian, approximate it by a block-diagonal or low-rank structure, significantly reducing the computational cost.

However, these approximations introduce additional variance or bias into the divergence estimate. If the mini-batch is not representative, the divergence (and thus the SURE estimate) might be inaccurate. Although such tricks can offer partial relief, they still tend to be more computationally demanding than alternatives like cross-validation. As a result, most large-scale deep learning pipelines avoid SURE in favor of simpler, widely supported methods.

Exercise 4.2

Level: * (Easy)

Exercise Types: Novel

Question

Use SURE to analyze how the bias-variance tradeoff is reflected in the risk of an estimator. Consider two scenarios:

A high-bias, low-variance estimator (e.g., a constant estimate $\hat{f} (y) = c$ for all $y$ ).
A high-variance, low-bias estimator (e.g., $\hat{f} (y) = y$ ).
For the estimator defined by the equation:

$\hat{f} (y) = c y + d$ ,

where c and d are constants:

a. Derive the divergence $D (\hat{f}) .$ ,

b. Use the derived divergence to compute the SURE formula for the risk.

Show how the SURE formula quantifies the risk in both cases.

Solution

High-bias, low-variance estimator:

Since $\hat{f} (y) = c$ , the divergence $D (\hat{f}) = 0$ (no dependence on $y$ ).
The SURE risk simplifies to $Risk = (c - y)^{2} + 2 σ^{2} \cdot 0 = (c - y)^{2}$
The expected risk is influenced entirely by the choice of $c$ relative to $f$ (bias).

High-variance, low-bias estimator:

For $\hat{f} (y) = y$ , the divergence $D (\hat{f}) = 1$ (derivative of $y$ w.r.t. itself is 1).
The SURE risk becomes $Risk = | y - y |^{2} + 2 σ^{2} \cdot 1 = 2 σ^{2}$ .
The risk reflects only the variance, as bias is negligible.

Divergence and SURE formula: a. The divergence is $D (\hat{f}) = c$ because the derivative of $c y + d$ with respect to $y$ is $c$ .

b. The SURE formula for the risk becomes:

$Risk = E [(c y + d - y)^{2}] + 2 σ^{2} \cdot c = E [(c - 1)^{2} y^{2} + 2 (c - 1) d y + d^{2}] + 2 σ^{2} \cdot c$ .

Exercise 4.3

Level: ** (Moderate)

Exercise Types: Novel

Question

How does SURE explain why cross-validation and regularization are effective for estimating true error?
Hint: Consider the cases when a data point is not in the training set and when it is included in the training set.

Solution

Let’s first recall the SURE (Stein's Unbiased Risk Estimate) formula:

$E [(\hat{y_{0}} - y_{0})^{2}] = E [(\hat{f_{0}} - f_{0})^{2}] + E [ϵ_{0}^{2}] - 2 E [ϵ_{0} (\hat{f_{0}} - f_{0})]$

Case 1: Data Point Not in the Training Set

When the data point is not in the training set, the covariance term $E [ϵ_{0} (f_{0} - f_{0})]$ becomes zero, since the model does not have access to that particular point during training. This simplifies the formula to:

$E [(\hat{y_{0}} - y_{0})^{2}] = E [(\hat{f_{0}} - f_{0})^{2}] + E [ϵ_{0}^{2}]$

Now, when summing over all $m$ points, we obtain:

$\sum_{i = 1}^{m} (\hat{y_{i}} - y_{i})^{2} = \sum_{i = 1}^{m} (\hat{f_{i}} - f_{i})^{2} + m σ^{2}$

Where $σ^{2}$ is noise. The total error ( $E r r$ ) can be written as:

$E r r = e r r - m σ^{2}$

Here, $m σ^{2}$ is a constant, which means the true error differs from the empirical error by a constant value only. Therefore, the empirical error ( $e r r$ ) provides a good estimate of the true error ( $E r r$ ) when the point is not in the training set.

This explains why cross-validation is effective. Cross-validation essentially evaluates the model on data points that were not part of the training set, and since the empirical error is only offset by a constant, it provides a reliable estimate of the true error.

Case 2: Data Point in the Training Set

When the data point is part of the training set, the covariance term $E [ϵ_{0} (f_{0} - f_{0})]$ is no longer zero. We have:

$\sum_{i = 1}^{m} (\hat{y_{i}} - y_{i})^{2} = \sum_{i = 1}^{n} (\hat{f_{i}} - f_{i})^{2} + n σ^{2} - 2 σ^{2} \sum_{i = 1}^{n} D_{i}$

Where $D_{i}$ represents the bias introduced by the model. This equation can be further simplified to:

$E r r = e r r - n σ^{2} + 2 σ \sum_{i = 1}^{n} D_{i}$

In this case, the additional term $\sum_{i = 1}^{n} D_{i}$ reflects the model’s complexity. The complexity term increases with the capacity of the model and is often difficult to calculate directly. To handle this, instead of directly calculating the complexity term, we can use a function that increases with respect to model capacity, and treat it as the regularization term. The regularization term penalizes model complexity to avoid overfitting, thus helping to prevent the model from fitting noise in the training set.

This explains the need for regularization techniques. Regularization helps to control model complexity and ensures that the model generalizes better to unseen data, improving the estimation of the true error.

Thus, both cross-validation and regularization are grounded in the same principle of improving the estimation of true error by adjusting for complexity and noise.

(Note: the formulas are all from Lecture 4 content, STAT 940)

Exercise 4.4

Level: ** (Moderate)

Exercise Types: Novel

Question

Assume there is a point set $(x_{i}, y_{i})$ satisfying $y_{i} = 2 x_{i} + 3 s i n (x_{i}) + n_{i}$ , where $n_{i} \sim N (0, 4)$ , for i = 1, 2, ...

Now we fit the relationship between y and x, using polynomial linear models with order from 1 to 10. Show the MSE of models of different complexity.

Solution

library(ggplot2)
x <- seq(0, 10, length.out = 100)
y <- 2 * x + 3 * sin(x) + rnorm(100, mean = 0, sd = 2)
data <- data.frame(x = x, y = y)

degrees <- 1:10
mse_values <- numeric(length(degrees))

for (i in seq_along(degrees)) {
  degree <- degrees[i]
  
  model <- lm(y ~ poly(x, degree), data = data)

  predicted <- predict(model, data)
  
  mse_values[i] <- mean((data$y - predicted)^2)
}

mse_results <- data.frame(Degree = degrees, MSE = mse_values)
print(mse_results)

ggplot(mse_results, aes(x = Degree, y = MSE)) +
  geom_line(color = "blue") +
  geom_point(size = 3, color = "red") +
  labs(title = "MSE vs. Model Complexity",
       x = "Polynomial Degree",
       y = "Mean Squared Error (MSE)") +
  theme_minimal()

ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "black", alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 1), color = "red", se = FALSE) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3), color = "blue", se = FALSE) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 10), color = "green", se = FALSE) +
  labs(title = "Polynomial Regression Fits",
       x = "x",
       y = "y") +
  theme_minimal()

Output:

Additional Comment: This graph clearly shows the importance in choosing a good space for your model candidates. When allowing your model to have too much complexity, we see that there is not a lot of reduction in the MSE after 4 polynomial degrees. But there is a big difference from 3 to 4, so based on the graph, it would be best to pick the polynomial of degree 4 as the model as it performed good on the testing dataset while being more robust than models of higher polynomial order.

Exercise 4.5

Level: ** (Easy)

Exercise Types: Novel

Question

Use the SURE equation to explain the difference of using data in the training dataset vs outside of the training dataset in estimating the true error using emperical error (of same size of dataset).

Answer

The Key is in the term $2 σ^{2} \sum_{i = 1}^{n} D_{i}$ . There are overall 2 main terms that contributes to this value ---the number of samples and the variance of the error. If the variance is sufficiently small and the sample is not too large, the difference between using training datapoints and using testing datapoints would not be as significant. So if the complexity of the model is not very high (low $D_{i}$ ) with low sample size, using the training dataset to estimate true error will not be extremely different from using a separate dataset.

Exercise 4.6

Level: * (Easy)

Exercise Types: Novel

Question

For a momentum factor $β$ , explain the impact of increasing or decreasing $β$ (e.g., $β = 0.9$ vs. $β = 0.5$ ) on the learning process.

Solution

The momentum factor $β$ determines how much of the velocity contributes to the current step during gradient descent.

When $β$ is high (e.g., $β = 0.9$ ), momentum retains most of the previous velocity, resulting in smoother and more consistent updates. In situations where gradients do not change direction significantly, high $β$ accelerates convergence as momentum accumulates in the correct direction. However, on highly non-convex surfaces or steep gradients, high $β$ may cause oscillations or overshooting as the momentum term dominates gradient changes.

When $β$ is low (e.g., $β = 0.5$ ), the updates rely more heavily on the current gradient rather than accumulated momentum, which can introduce more noise into the updates. In flat regions or long valleys, low $β$ leads to slower progress as previous updates fade quickly. However, a lower $β$ adapts better to situations where the loss landscape changes direction frequently, reducing the risk of overshooting.

Exercise 4.7

Level: ** (Moderate)

Exercise Types: Novel

Question

Suppose we have a linear regression model $y = X β + ε$ with $ε \sim N (0, σ^{2} I)$ . We use a ridge estimator with penalty $λ$ :

${\hat{β}}_{λ} = (X^{⊤} X + λ I)^{- 1} X^{⊤} y, {\hat{y}}_{λ} = X {\hat{β}}_{λ} .$ Write down the form of Stein’s Unbiased Risk Estimator (SURE) for this ridge estimator in terms of ${\hat{y}}_{λ}$ and the hat matrix. Briefly explain why the term involving the trace of the hat matrix $H_{λ}$ adjusts for overfitting, and how that adjustment depends on $λ$ . Describe how you would use SURE to select an optimal value of $λ$ .

Solution

1. SURE for the Ridge Estimator Define the hat matrix for ridge regression as:

$H_{λ} = X (X^{⊤} X + λ I)^{- 1} X^{⊤}, {\hat{y}}_{λ} = H_{λ} y .$ Stein’s Unbiased Risk Estimator (SURE) for this setting is:

$SURE (λ) = ‖ y - {\hat{y}}_{λ} ‖^{2} + 2 σ^{2} trace (H_{λ}) .$ Here, $| y - \hat{y} λ |^{2}$ represents the (in-sample) residual sum of squares, and $trace (H λ)$ measures the effective degrees of freedom used by the ridge estimator.

2. Role of the Hat Matrix Trace

The matrix $H_{λ}$ maps the observed data $y$ to the fitted values $\hat{y} λ$ . Its trace, $trace (H λ)$ , indicates how sensitive the fitted values are to the observed data. Larger $trace (H_{λ})$ means the model is using more degrees of freedom and risks overfitting, causing the naive residual sum of squares $| y - {\hat{y}}_{λ} |^{2}$ to underestimate true prediction error. As $λ$ increases, the estimator shrinks coefficients more aggressively and $trace (H_{λ})$ typically decreases, reflecting a simpler (less flexible) model. 3. Using SURE to Select $λ$ To choose an optimal $λ$ from the perspective of unbiased risk estimation, one can:

Compute $SURE (λ)$ over a grid of possible $λ$ values. Select the $λ$ that minimizes $SURE (λ)$ . Because SURE includes the penalty $2, σ^{2}, trace (H_{λ})$ , it corrects for the bias introduced by fitting the same data used in the residual calculation. This makes SURE a more reliable guide to out-of-sample error than just looking at $| y - {\hat{y}}_{λ} |^{2}$ .

Exercise 4.8

Level: * (Easy)

Exercise Type: Novel

Question

You are tasked with understanding the Mean Squared Error (MSE) and its components: bias and variance, using the provided diagram as a reference.

Setup:
1. You have a true underlying function: $f (x) = 2 x + 3$

2. A model is used to estimate $f (x)$ , given as: $\hat{f} (x) = a x + b$ , where $a$ and $b$ are estimated from data.

3. Your model has the following properties:
- Expected estimate: $E [\hat{f} (x)] = 1.8 x + 2.5$
- Variance: $Var [\hat{f} (x)] = 0.1$ (constant for all $x$ ).

(a) Calculate the bias at $x = 2$ .

(b) Given that $Var [\hat{f} (x)] = 0.1$ , compute the Mean Squared Error (MSE) at $x = 2$ .

Solution

1. Bias Calculation:
The bias is defined as: $Bias = E [\hat{f} (x)] - f (x) .$

Substitute the values for $f (x)$ and $E [\hat{f} (x)]$ at $x = 2$ :

- $f (2) = 2 (2) + 3 = 7$
- $E [\hat{f} (2)] = 1.8 (2) + 2.5 = 3.6 + 2.5 = 6.1 .$

Thus, $Bias = E [\hat{f} (2)] - f (2) = 6.1 - 7 = - 0.9 .$

2. Variance Contribution:
The variance is directly given as: $Var [\hat{f} (x)] = 0.1 .$

3. MSE Decomposition:
The Mean Squared Error (MSE) is defined as: $MSE = {Bias}^{2} + Var .$

Substitute the values:
- ${Bias}^{2} = (- 0.9)^{2} = 0.81,$
- $MSE = 0.81 + 0.1 = 0.91 .$

Final Answer:
- Bias: $- 0.9$ .
- Variance: $0.1$ .
- MSE: $0.91$ .

Exercise 4.9

Level: ** (Easy)

Exercise Types: Novel

Question

Suppose you are developing a predictive model for house prices using a dataset with features like square footage, number of bedrooms, and location. You decide to compare three models with increasing complexity:

1. A simple linear regression model.

2. A polynomial regression model with degree 3.

3. A neural network with multiple layers.

Using Stein's Unbiased Risk Estimator (SURE), explain how you would determine which model is expected to generalize best to unseen data. Any practical challenges you might encounter?

Solution

Approach

1. For each model, compute the empirical error on the training data. For instance, compute the Mean Squared Error: $e r r = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2} .$

2. Estimate $σ^{2}$ , analyze the variance of residuals from a simple baseline model like linear regression, which provides a good approximation of the noise in the data.

3. For each model, calculate the complexity term $2 σ^{2} \sum_{i = 1}^{n} D_{i}$ , where $D_{i} = \frac{\partial \hat{f_{i}}}{\partial y_{i}}$

4. Compute the Stein's Unbiased Risk Estimator for each model by $E r r = e r r - n σ^{2} + 2 σ^{2} \sum_{i = 1}^{n} D_{i}$

5. Compare the SURE values for the three models. The model with the lowest SURE is expected to generalize best to unseen data.

Practical Challenges

1. Using a simple model like linear regression might provide a biased estimate if the data is highly nonlinear. This will suggest an inaccurate variance $σ^{2}$ .

2. While SURE penalizes complexity, it may favor simpler models if the noise variance is high. However, overly simplistic models might underfit.

3. If the best model is neural networks, then it will be less interpretable in this case and calculating $D_{i}$ becomes computationally expensive.

4. For neural networks, estimating $D_{i}$ is challenging due to the model's intricate structure, lack of interpretability and prone to numerical instability.

Exercise 4.10

Level: ** (Moderate)

Exercise Types: Novel

Question

Derive Stein’s Unbiased Risk Estimator (SURE) for the case the data point is in the training set and in the testing set.

Solution

Exercise 4.11

Level: * (Easy)

Exercise Types: Novel

Question

How does the choice of momentum factor $β$ influence the ability of gradient descent to escape saddle points or local minima during optimization?

Solution

The momentum factor $β$ plays a significant role in determining how gradient descent navigates complex loss landscapes.

When $β$ is high (e.g., $β$ = 0.9), the optimizer builds up velocity over time, which helps push it through flat regions like saddle points more effectively. This accumulated momentum can prevent the algorithm from getting stuck in shallow local minima by carrying the updates forward despite weak gradients. However, in very sharp local minima, high $β$ may cause overshooting, making it harder to settle at the true minima. When $β$ is low (e.g., $β$ = 0.5), the optimizer relies more on the current gradient, which reduces the accumulated velocity. While this makes the updates more adaptive to the local landscape, it also increases the likelihood of getting trapped in saddle points or small local minima, as the velocity is insufficient to escape flat or weak gradient regions.

Exercise 4.12

Level: * (Moderate)

Exercise Types: Novel

Question

Explain the key idea behind Stein's Unbiased Risk Estimator (SURE) and how it can be applied to choose an optimal parameter or model in high-dimensional estimation problems. Why is it preferred over traditional risk estimation methods in some scenarios?

Solution

Stein's Unbiased Risk Estimator (SURE) provides an unbiased estimate of the risk (expected squared error) of an estimator using only the observed data, without requiring knowledge of the true parameter. It is particularly useful in high-dimensional problems, where it balances bias and variance effectively, aiding in tasks like model selection and hyperparameter tuning for methods such as LASSO or ridge regression. SURE is computationally efficient, often providing closed-form expressions, and is preferred over traditional methods like cross-validation in scenarios with Gaussian noise and differentiable estimators. However, its applicability depends on specific assumptions (e.g., noise distribution and estimator smoothness), and it may not always perform well in small sample sizes or under non-standard conditions.

Exercise 4.13

Level: * (Easy)

Exercise Types: Novel

Question

In order for Stein's unbiased risk estimator to be useful, we need to figure out how to compute $D_{i}$ , which can be challenging. Then what is the advantage of SURE despite the challenge?

Solution

1. SURE is useful because it provides an unbiased estimate of the risk of an estimator without needing the true parameter. Typically, to evaluate the MSE of an estimator, you would need to know the true parameter value but SURE allows you to assess the performance of an estimator without this information.

2. Even if computing the sum of the partial derivatives directly is complex, SURE can be applied to compare different estimators. The goal is often to choose an estimator that minimizes the risk, so SURE helps identify more efficient estimators, especially in high-dimensional settings.

3. In many practical applications, it's not necessary to compute the exact sum of partial derivatives. Instead, we can approximate or use simplified models for the estimator and the data.

Exercise 4.14

Level: * (Easy)

Exercise Types: Novel

Question

Prove that MSE = Bias $^{2}$ +Var

Solution

Let $f$ denote the true function, $\hat{f}$ denote the estimated function , $x$ denote the data

$\begin{aligned} MSE & = E [(f (x) - \hat{f} (x))^{2}] \\ = E [(f (x) - E [\hat{f} (x)] + E [\hat{f} (x)] - \hat{f} (x))^{2}] \\ = (f (x) - E [\hat{f} (x)])^{2} + (E [\hat{f} (x)] - \hat{f} (x))^{2} \\ = {Bias}^{2} + Var \end{aligned}$

where the second last equality comes from $E [(f (x) - E [\hat{f} (x)]) (E [\hat{f} (X)] - \hat{f} (x))] = E [f (x) E [\hat{f} (x)] - f (x) \hat{f} (x) - E [\hat{f} (X)]^{2} + \hat{f} (x) E [\hat{f} (X)])] = f (x) E [\hat{f} (X)] - f (x) E [\hat{f} (X)] - E [\hat{f} (X)]^{2} + E [\hat{f} (X)]^{2} = 0$

Alternatively,

$\begin{aligned} E [(f (x) - E [\hat{f} (x)]) (E [\hat{f} (x)] - \hat{f} (x))] & = (f (x) - E [\hat{f} (x)]) E [E [\hat{f} (x)] - \hat{f} (x)] \\ = (f (x) - E [\hat{f} (x)]) (E [E [\hat{f} (x)]] - E [\hat{f} (x)]) \\ = (f (x) - E [\hat{f} (x)]) (E [\hat{f} (x)] - E [\hat{f} (x)]) \\ = 0, \end{aligned}$

since the bias $(f (x) - E [\hat{f} (x)])$ is a constant.

Exercise 4.15

Level: ** (Moderate)

Exercise Types: Novel

Question

When applying numerical methods, it is essential to constantly remind ourselves of the assumptions behind the model. What are the assumptions of SURE required for SURE to provide an unbiased estimate of the true risk? How does it compare to cross-validation as a model selection technique?

Solution

Assumptions:

1. The noise follows a normal distribution with constant variance.

2. The response must be linearly related to the predictors.

3. The effective degrees of freedom must be correctly computed.

Comparisons with Cross-validation:

1. While the assumptions of SURE could be unrealistic to hold, cross validation may suffer from higher variance due to the random splitting.

2. SURE is computationally cheaper than cross validation.

3. Cross validation does not impose normality on the noise, giving a more general account.

Exercise 4.16

Level: * (Easy)

Exercise Types: Modified (Reference: Lecture 04)

Question

Suppose $y \sim N (θ, σ^{2} I)$ , where $θ \in R^{n}$ is the true parameter, $y$ is the observed data, and $σ^{2}$ is the variance of the noise. Consider the following estimator for $θ$ : $\hat{θ} = a y,$ where $a \in R$ is a scalar constant.

Derive the Stein's Unbiased Risk Estimator (SURE) for the estimator $\hat{θ}$ .

Solution

The true risk is given by: $R = E [‖ \hat{θ} - θ ‖^{2}] .$ Substituting $\hat{θ} = a y$ and $y \sim N (θ, σ^{2} I)$ , we expand the risk: $R = E [‖ a y - θ ‖^{2}] = E [‖ a (y - θ) + (a - 1) θ ‖^{2}] .$

Using $E [y - θ] = 0$ and $E [‖ y - θ ‖^{2}] = n σ^{2}$ , we get: $R = a^{2} E [‖ y - θ ‖^{2}] + (a - 1)^{2} ‖ θ ‖^{2} = a^{2} n σ^{2} + (a - 1)^{2} ‖ θ ‖^{2} .$

To derive SURE, we estimate the true risk using: $SURE = ‖ \hat{θ} - y ‖^{2} + 2 σ^{2} div (f),$ where $f (y) = a y$ and $div (f) = a n$ because: $\frac{\partial f_{i} (y)}{\partial y_{i}} = a for all i .$

First term: $‖ \hat{θ} - y ‖^{2} = ‖ a y - y ‖^{2} = (a - 1)^{2} ‖ y ‖^{2} .$

Therefore: $SURE = (a - 1)^{2} ‖ y ‖^{2} + 2 σ^{2} n a .$

Exercise 4.17

Level: * (Easy)

Exercise Types: Novel

Question

Stein's Unbiased Risk Estimator (SURE) is often discussed in the context of its effectiveness in model selection. Explain how SURE can be used to select the best model from a set of candidates, specifically focusing on its capacity to balance model complexity against prediction error.

Solution

SURE is particularly valuable in model selection because it provides an unbiased estimate of the risk (expected prediction error) for different models, even without knowing the true underlying function. It does this by combining the empirical error (residuals) with a penalty proportional to the model's complexity.

For instance, if we have multiple regression models of varying complexity, SURE helps identify the model that optimally balances low empirical errors with manageable complexity. This is crucial because overly complex models might fit the training data very well (low empirical error) but perform poorly on new data due to overfitting. Conversely, overly simple models might not capture the necessary patterns in the data, leading to high empirical errors.

The SURE formula includes a term that adjusts the observed empirical error by adding a complexity penalty. This penalty is usually proportional to the trace of the hat matrix (a measure of leverage or influence of each observation in linear models), which scales with the flexibility or number of parameters in the model. By evaluating SURE for each model, we can choose the model that minimizes this unbiased estimate of the risk, thereby selecting the model with the best generalization performance expected on new, unseen data.

Exercise 5.1

Level: * (Easy)

Exercise Type: Novel

Question

1) Which of the following options can be used to regularize an MLP (choose all that apply)?

a) Add noise to data
b) Use full batch gradient descent
c) Add dropout
d) Use early stopping

2) Using Stochastic Gradient Descent with a small minibatch when training an MLP helps with generalization:

a) True
b) False

3) Explain why adding noise to the input data helps improve generalization in a neural network.

Solution

1) The correct answers are a), c), and d).
- a) Adding noise to data is a regularization technique that can improve generalization by forcing the model to learn robust patterns. Types of noise used in MLPs include: input noise, weight noise, activation noise and dropouts.
- c) Dropout randomly deactivates neurons during training, reducing overfitting.
- d) Early stopping prevents overfitting by halting training once the validation error stops improving.

b) is incorrect because using full batch gradient descent does not help regularize the model.

2) The correct answer is a) True.
Using Stochastic Gradient Descent with a small minibatch size introduces noise into the optimization process, which can act as a form of regularization and improve generalization.

3) Explanation: Adding noise to the input data acts as a form of data augmentation. It forces the model to learn features that are robust to variations in the input, reducing reliance on specific details of the training data. This helps the network generalize better to unseen data, as it becomes more adaptable to small changes and less prone to overfitting.

Exercise 5.2

Level: * (Easy)

Exercise Type: Novel

Question

Derive the gradient of the loss function with respect to the weights $w$ for the following regularized quadratic loss:

$L (w) = \frac{1}{2 n} \sum_{i = 1}^{n} {(y_{i} - w^{T} x_{i})}^{2} + \frac{λ}{2} ‖ w ‖^{2}$

where $λ > 0$ is the regularization parameter.

Solution

The gradient of the loss function $L (w)$ with respect to $w$ can be computed as follows:

$\nabla_{w} L (w) = \nabla_{w} (\frac{1}{2 n} \sum_{i = 1}^{n} {(y_{i} - w^{T} x_{i})}^{2}) + \nabla_{w} (\frac{λ}{2} ‖ w ‖^{2})$

For the first term: $\nabla_{w} (\frac{1}{2 n} \sum_{i = 1}^{n} {(y_{i} - w^{T} x_{i})}^{2}) = - \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - w^{T} x_{i}) x_{i}$

For the second term: $\nabla_{w} (\frac{λ}{2} ‖ w ‖^{2}) = λ w$

Combining both terms: $\nabla_{w} L (w) = - \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - w^{T} x_{i}) x_{i} + λ w$

Exercise 5.3

Level: ** (Moderate)

Exercise Types: Novel

Question

What are the mathematical formulas for Ridge Regression and Lasso Regression, including their respective penalty terms? Additionally, write a Python script that visualizes the effect of these regularization techniques on a selected loss function (e.g., Mean Squared Error) for a simple linear regression model. Once you have the graph, interpret the graph by explaining how the shapes show the differences between Lasso and ridge regression.

Solution

Ridge Regression and Lasso Regression are both forms of linear regression with regularization to prevent overfitting.

Ridge regression minimizes the following objective function:

$\hat{β} = \arg min_{β} \sum_{i = 1}^{n} (y_{i} - X_{i} β)^{2} + λ \sum_{j = 1}^{p} β_{j}^{2}$

where:

$y_{i}$ are the target values,
$X_{i}$ represents the input features,
$β$ are the regression coefficients,
$λ$ is the regularization parameter controlling the strength of the penalty.

The penalty term $λ \sum_{j = 1}^{p} β_{j}^{2}$ ensures that coefficients are shrunk towards zero, reducing overfitting but not setting any coefficients exactly to zero.

Lasso regression introduces an L1 penalty term:

$\hat{β} = \arg min_{β} \sum_{i = 1}^{n} (y_{i} - X_{i} β)^{2} + λ \sum_{j = 1}^{p} | β_{j} |$

where the L1 penalty term $λ \sum_{j = 1}^{p} | β_{j} |$ encourages sparsity in the coefficients. Some coefficients will be set exactly to zero, making Lasso useful for feature selection.

visualization

The following Python script visualizes the impact of Ridge and Lasso regularization by plotting the contours of the Mean Squared Error (MSE) loss function along with the constraint regions imposed by Ridge (L2 ball) and Lasso (L1 diamond):

import numpy as np
import matplotlib.pyplot as plt

# Define the loss function
def mse_loss(beta1, beta2):
    return beta1**2 + beta2**2  # Simplified MSE (for visualization)

# Generate grid for visualization
beta1_vals = np.linspace(-2, 2, 100)
beta2_vals = np.linspace(-2, 2, 100)
B1, B2 = np.meshgrid(beta1_vals, beta2_vals)
Loss = mse_loss(B1, B2)

# Plot contours of the loss function
plt.figure(figsize=(10, 6))
contour = plt.contour(B1, B2, Loss, levels=20, cmap='viridis')
plt.colorbar(contour)

# Plot Ridge constraint (L2 ball)
ridge_circle = plt.Circle((0, 0), radius=1.2, color='red', fill=False, linestyle='dashed', label="Ridge Constraint (L2)")

# Plot Lasso constraint (L1 diamond)
lasso_diamond = np.array([[1, 0], [0, 1], [-1, 0], [0, -1], [1, 0]]) * 1.2
plt.plot(lasso_diamond[:, 0], lasso_diamond[:, 1], 'b--', label="Lasso Constraint (L1)")

# Add constraints to the plot
ax = plt.gca()
ax.add_patch(ridge_circle)
plt.xlim(-2, 2)
plt.ylim(-2, 2)
plt.axhline(0, color='black', linewidth=0.5)
plt.axvline(0, color='black', linewidth=0.5)

# Labels and legend
plt.xlabel(r'$\beta_1$')
plt.ylabel(r'$\beta_2$')
plt.title("Effect of Ridge and Lasso Regularization on Loss Function")
plt.legend()
plt.grid(True)
plt.show()

Output:

1. Ridge Regression (L2) – Red Circle

   Ridge regression enforces a circular constraint on the coefficients (β1,β2β1,β2).
   The solution is found where the contours of the loss function (MSE) first touch this L2 constraint region.
   Since the constraint is smooth (no sharp corners), the coefficients are shrunk gradually towards zero but rarely exactly zero.
   This means Ridge keeps all features in the model but reduces their impact by shrinking their values.

2. Lasso Regression (L1) – Blue Diamond

   Lasso regression imposes a diamond-shaped constraint on the coefficients.
   The sharp corners of the L1 constraint (at the axes) make it more likely that the optimal solution lies exactly on one of these axes.
   This means Lasso sets some coefficients to exactly zero, effectively performing feature selection by eliminating less important variables.
   The reason for this behavior is that the contours of the loss function often first touch the constraint at the corners, leading to sparsity in the solution.

Conclusion

Ridge Regression (L2) shrinks coefficients continuously but keeps all variables. Lasso Regression (L1) forces some coefficients to exactly zero, performing automatic feature selection.

Exercise 5.4

Level: ** (Moderate)

Exercise Types: Novel

Question

Label smoothing modifies the standard one-hot ground truth labels in multi-class classification by assigning a small amount of probability mass to incorrect classes. Suppose we replace each one-hot label $y$ with a “smoothed” label $\tilde{y}$ that assigns $1 - α$ to the true class and $α / (C - 1)$ to each of the other $C - 1$ classes, where $C$ is the total number of classes and $α$ is a small constant.

Write the modified cross-entropy loss using $\tilde{y}$ and a predicted probability vector $p$ . Explain how this modification helps reduce model overconfidence and potentially improves calibration. Give an example scenario (e.g., large-scale image classification) where label smoothing has been shown to be particularly beneficial.

Solution

Modified Cross-Entropy Loss Standard cross-entropy for one-hot labels $y$ and predicted probabilities $p$ is $- \sum_{c = 1}^{C} y_{c} \log p_{c} .$ With label smoothing, each target label becomes ${\tilde{y}}_{c}$ . Hence, the loss is:

$ℓ_{smooth} = - \sum_{c = 1}^{C} {\tilde{y}}_{c} \log p_{c}, where {\tilde{y}}_{c} = {\begin{cases} 1 - α, & for the correct class, \\ \frac{α}{C - 1}, & for other classes . \end{cases}$ Reduction of Overconfidence and Improved Calibration

In a one-hot scheme, the model is strongly penalized if it does not put near-total probability mass on the correct class. Label smoothing distributes a fraction $α$ of the probability across incorrect classes, preventing the model from becoming overly confident. By avoiding extreme outputs (probabilities close to 0 or 1), the model tends to produce more calibrated predictions, often generalizing better to unseen data. Example Scenario In large-scale image classification (e.g., ImageNet), label smoothing has demonstrated notable gains:

Models converge more smoothly, especially when data is abundant but still noisy in certain classes. Overconfidence is reduced, leading to improved validation accuracy and calibration metrics.

Exercise 5.5

Level: * (Easy)

Exercise Types: Modified

References: https://www.cs.toronto.edu/~lczhang/321/files/midterm_b.pdf

Question

Which of the following about weight decay is true?

(A) Including weight decay generally reduces the training cost.

(B) Weight decay directly penalizes large activations.

(C) Weight decay can help revive a “dead” or “saturated” neuron.

(D) Weight decay helps get out of saddle points.

Solution

C) Weight decay increases the training cost because there is an extra (positive) term in the training cost.Weight decay penalizes large weights and not activations. The choice of adding weight decay is independent of the choice of the optimizer. Weight decay can revive a “dead” neuron, but it does not help us get out of saddle points. Weight decay primarily regularizes weights and does not directly address saddle points. Techniques like momentum or adaptive gradients are more relevant for escaping saddle points.

Additional note: Dead/saturated neurons have activations that are always in the plateau area of the activation function (e.g. ReLU or sigmoid or tanh), caused by large (positive or negative) weights/biases. Weight decay reduces those parameters, so that the activations will not be consistently large.

Exercise 5.6

Level: ** (Moderate)

Exercise Types: Novel

Question

Train a neural network with both L2 regularization and early stopping on UCI dataset. Vary the regularization strength and patience, and observe how they interact to affect model generalization.

Solution

import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load UCI dataset (using the diabetes dataset as an example)
data = load_diabetes()
X, y = data.data, data.target

# Standardize features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Convert data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32).to(device)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1).to(device)
X_val_tensor = torch.tensor(X_val, dtype=torch.float32).to(device)
y_val_tensor = torch.tensor(y_val, dtype=torch.float32).view(-1, 1).to(device)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32).to(device)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1).to(device)

# Prepare DataLoader
train_loader = DataLoader(TensorDataset(X_train_tensor, y_train_tensor), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val_tensor, y_val_tensor), batch_size=32)
test_loader = DataLoader(TensorDataset(X_test_tensor, y_test_tensor), batch_size=32)

# Define neural network model
class SimpleNN(nn.Module):
    def __init__(self, input_dim):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 32)
        self.fc3 = nn.Linear(32, 1)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Training function with early stopping
def train_model(model, train_loader, val_loader, criterion, optimizer, num_epochs, patience):
    train_loss_history = []
    val_loss_history = []

    best_val_loss = float('inf')
    best_model_state = None
    patience_counter = 0

    for epoch in range(num_epochs):
        # Training phase
        model.train()
        train_loss = 0.0
        for X_batch, y_batch in train_loader:
            optimizer.zero_grad()
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        train_loss /= len(train_loader)
        train_loss_history.append(train_loss)

        # Validation phase
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                y_pred = model(X_batch)
                loss = criterion(y_pred, y_batch)
                val_loss += loss.item()
        val_loss /= len(val_loader)
        val_loss_history.append(val_loss)

        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_model_state = model.state_dict()
            patience_counter = 0
        else:
            patience_counter += 1
            if patience_counter >= patience:
                print(f"Early stopping triggered at epoch {epoch+1}.")
                break

        print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")

    # Load the best model state
    model.load_state_dict(best_model_state)
    return train_loss_history, val_loss_history

# Define L2 regularization strengths and patience values
l2_strengths = [0.0, 0.01, 0.1]
patience_values = [3, 5]
results = {}

# Train models with different regularization and early stopping settings
for l2_strength in l2_strengths:
    for patience in patience_values:
        print(f"\nTraining with L2 regularization = {l2_strength}, Patience = {patience}")
        model = SimpleNN(input_dim=X.shape[1]).to(device)
        optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=l2_strength)
        criterion = nn.MSELoss()

        train_loss, val_loss = train_model(
            model, train_loader, val_loader, criterion, optimizer, num_epochs=100, patience=patience
        )

        # Evaluate the model on the test set
        model.eval()
        test_loss = 0.0
        with torch.no_grad():
            for X_batch, y_batch in test_loader:
                y_pred = model(X_batch)
                loss = criterion(y_pred, y_batch)
                test_loss += loss.item()
        test_loss /= len(test_loader)

        results[(l2_strength, patience)] = {
            "train_loss": train_loss,
            "val_loss": val_loss,
            "test_loss": test_loss
        }

# Plot results
plt.figure(figsize=(12, 8))
for (l2_strength, patience), result in results.items():
    plt.plot(result["val_loss"], label=f"L2={l2_strength}, Patience={patience}")

plt.title("Validation Loss with Different L2 Regularization and Patience")
plt.xlabel("Epochs")
plt.ylabel("Validation Loss")
plt.legend()
plt.grid(True)
plt.savefig('validation_loss.jpg')

# Display summary results
print("\nSummary of Results:")
for (l2_strength, patience), result in results.items():
    print(f"L2 Regularization: {l2_strength}, Patience: {patience}")
    print(f"Final Train Loss: {result['train_loss'][-1]:.4f}")
    print(f"Final Validation Loss: {result['val_loss'][-1]:.4f}")
    print(f"Test Loss: {result['test_loss']:.4f}\n")

Output:

1. Effect of L2 Regularization:

With regularization, the validation loss decreases more smoothly. A stronger penalty (L2=0.1) slows down the training initially, as it regularizes the weights more strictly. This can prevent overfitting but may lead to underfitting if the regularization is too strong.

2. Effect of Patience (Early Stopping):

Compared with patience=3, the model with patience=5 may lead to slight improvements if the loss decreases further after plateauing. Longer patience typically benefits models with regularization (L2=0.1) because it gives the model more time to converge despite slower initial progress.

3. Interaction Between L2 and Patience:

L2=0.1 with Patience=3 stops relatively early and might underfit slightly compared to Patience=5; Moderate L2 (L2=0.01) and higher patience (5) provide a better trade-off, as the validation loss decreases smoothly without abrupt stops.

Exercise 5.7

Level: * (Easy)

Exercise Types: Novel

Question

Describe the main idea of the Manifold Tangent Classifier and its application in regularization.

Solution

The Manifold Tangent Classifier is a method that uses the underlying structure of data to improve generalization. It does this by ensuring that the gradient of the classifier f(x) with respect to the input x is orthogonal to the tangent vectors of the data's manifold at x. These tangent vectors represent directions where small changes in the input do not affect the data significantly. Mathematically, this is achieved by adding a regularization term to the loss function. The regularization penalizes the classifier if the directional derivative of f(x) along any tangent vector is large. The regularizer is expressed as $λ \sum_{i} (\frac{\partial f (x)}{\partial x} \times v_{i})^{2}$ , where $λ$ is a parameter that controls the strength of the penalty, and $v_{i}$ are the tangent vectors. This term encourages the classifier to be invariant to small changes along the manifold, reducing overfitting and improving robustness to noise. The total loss function combines this regularizer with the original loss, making the model more focused on meaningful variations in the data.

Exercise 5.8

Level: *** (Hard)

Exercise Types: Novel

Question

You are provided with a noisy dataset sampled from a simple harmonic oscillator, with mass $m = 1 k g$ and spring constant $k = 100 N / m$ . The task in this problem is to write a regression neural network that predicts the oscillator's position $y$ as a function of time $t$ .

The dataset can be generated using the following script:

import numpy as np
import torch

np.random.seed(1234)

t = np.random.uniform(0, 1, 30)
noise = np.random.normal(0, 0.2, t.shape)
y = np.sin(10*t) + noise

t = torch.tensor(t, dtype=torch.float32).view(-1, 1)
y = torch.tensor(y, dtype=torch.float32).view(-1, 1)

(a) Neural network

In PyTorch, write a fully connected neural network with 4 hidden layers, each with 200 neurons. Use a tanh activation function after each hidden layer. Use a learning rate of 0.001. Generate a test dataset of times between $t = 0$ and $t = 1$ second(s). Create a plot and comment on the result.

(b) L2 regularization

Repeat the process above, but implement L2 regularization (the weight_decay parameter) in the optimizer. Comment on the result.

(c) Physics-informed neural network

If a dataset follows a known physical law (i.e., a differential equation), this differential equation can be added as a term to the loss function. This penalizes the neural network for solutions that do not obey this law and acts as a form of regularization. The equation of motion for the simple harmonic oscillator is:

$\frac{d^{2} y}{d t^{2}} + \frac{k}{m} y = 0$

Implement this into the neural network above.

Solution

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

torch.manual_seed(1234)

class SimpleNN(nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(1, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 200)
        self.fc4 = nn.Linear(200, 200)
        self.fc5 = nn.Linear(200, 1)

    def forward(self, t):
        t = torch.tanh(self.fc1(t))
        t = torch.tanh(self.fc2(t))
        t = torch.tanh(self.fc3(t))
        t = torch.tanh(self.fc4(t))
        t = self.fc5(t)
        return t

def train(t, y, WEIGHT_DECAY=0, PHYSICS=0):
    model = SimpleNN()
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=WEIGHT_DECAY)

    epochs = 2000
    for epoch in range(epochs):
        perm = torch.randperm(len(t))
        t = t[perm]
        y = y[perm]

        t.requires_grad = True
        y_pred = model(t) # forward pass

        ### derivatives for the physics loss ###
        dy_dt = torch.autograd.grad(outputs=y_pred, inputs=t, grad_outputs=torch.ones_like(y_pred), create_graph=True)[0]
        d2y_dt2 = torch.autograd.grad(outputs=dy_dt, inputs=t, grad_outputs=torch.ones_like(dy_dt), create_graph=True)[0]
        physics_loss = torch.mean((d2y_dt2 + 100 * y_pred) ** 2)

        loss = criterion(y_pred, y)
        total_loss = loss + PHYSICS * physics_loss

        optimizer.zero_grad()
        total_loss.backward() # backwards pass
        optimizer.step()  # update weights

        t.requires_grad = False

        if (epoch + 1) % 100 == 0:
            print(f"Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}, "
                  f"Physics Loss: {physics_loss.item():.4f}, Total Loss: {total_loss.item():.4f}")

    t_test = torch.linspace(0, 1, 100).view(-1, 1)
    with torch.no_grad():
        predictions = model(t_test)

    return t_test.numpy(), predictions.numpy()

t_test, pred = train(t, y)
_, pred_L2 = train(t, y, WEIGHT_DECAY=1e-3)
_, pred_phys = train(t, y, WEIGHT_DECAY=1e-3, PHYSICS=1e-5)

plt.figure(figsize=(8, 4))
plt.plot(t_test, pred, label="No regularization")
plt.plot(t_test, pred_L2, label="L2 regularization")
plt.plot(t_test, pred_phys, label="L2 + physics")
plt.scatter(t.numpy(), y.numpy(), color="k", label="Training data")
plt.xlabel("$t$")
plt.ylabel("$y$")
plt.legend()
plt.show()

Exercise 5.9

Level: ** (Moderate)

Exercise Types: Novel

Question

You will train a linear regression model to predict y values. Assume you have a high-dimensional dataset with noise, where the input features are $x \in R^{100 \times 50}$ (100 samples and 50 features), and the target outputs are $y \in R^{100}$ .

1. Write a Python function to implement a linear regression model with L2 regularization using gradient descent.

2. Train the model in:

   a. Without regularization lambda = 0

   b. With regularization lambda = 0.1

Solution

import numpy as np

np.random.seed(42)
x = np.random.randn(100, 50)
true_w = np.random.randn(50) * 0.5
y = x @ true_w + np.random.randn(100) * 0.1

# linear regression with L2 regularization
def linear_reg(x, y, lr = 0.01, lambda = 0.0, epochs = 1000):
    n, d = x.shape
    w = np.zeros(d)
    for _ in range(epochs):
        gradient = -2/n * x.T @ (y - x @ w) + lambda_ * w
        w = w - lr * gradient

no_reg = linear_reg(x, y, lambda = 0.0)
with_reg = linear_reg(x, y, lambda = 0.1)

Exercise 5.10

Level: ** (Moderate)

Exercise Types: Novel

Question

Consider a deep learning model that suffers from high variance, indicating it might be overfitting to the training data. Explore regularization techniques that could potentially reduce overfitting. Describe the following methods and explain how they contribute to reducing overfitting: 1. Weight Decay 2. Noise Injection 3. Early Stopping 4. Bagging

Provide a simple example or formula where applicable to demonstrate each method's effect.

Solution

1. **Weight Decay:** This regularization technique involves adding a penalty term to the loss function based on the sum of the squared values of the parameters (L2 regularization). This discourages large weights and helps to prevent the model from fitting the noise in the training data. - Example Formula: $L_{n e w} = L_{o r i g i n a l} + λ \sum_{i} w_{i}^{2}$ , where $λ$ is the regularization strength and $w_{i}$ are the model weights.

2. **Noise Injection:** By adding noise to the inputs or outputs during training, the model learns to ignore small variations, enhancing its generalization capabilities. - Example: Injecting Gaussian noise $ϵ \sim N (0, σ^{2})$ to inputs during training to make the model robust to slight variations in input data.

3. **Early Stopping:** This involves stopping the training process before the model has fully converged to the minimum of the loss function on the training set. By monitoring the performance on a validation set and stopping when performance degrades (implying the start of overfitting), this method effectively limits the capacity of the model. - Example: Stop training when validation loss begins to increase, even if training loss continues to decrease.

4. **Bagging:** Bagging, or bootstrap aggregating, involves training multiple models on different subsets of the training data and then averaging their predictions. This technique reduces variance and avoids overfitting by smoothing out predictions. - Example: Train multiple neural networks independently on different subsets of data and average their outputs to make final predictions.

Exercise 6.1

Level: ** (Medium)

Exercise Types: Modified

Refrence: Source: Schonlau, M., Applied Statistical Learning. With Case Studies in Stata, Springer. ISBN 978-3-031-33389-7 (Chapter 14, page 319).

Question

In binary logistic regression, we model the log odds of one class as a linear function of predictor variables. In multinomial logistic regression, where there are $K$ possible classes, we model the log odds of $K - 1$ classes relative to a reference class $K$ :

$\log (\frac{p_{k}}{p_{K}}) = α_{k} + x_{1} β_{k 1} + \dots + x_{p} β_{k p},$

for $k = 1, 2, \dots, K - 1$ .

Suppose we use a neural network model to implement multinomial logistic regression. Design a neural network architecture for a case with four input variables ( $p = 4$ ) and five possible outcome classes ( $K = 5$ ).

(a) Explain how the output layer should be structured and which activation function should be used.

(b) Introduce a **dropout** mechanism in the model. Explain where dropout can be applied and how it affects training and generalization.

Solution

To design a neural network that performs multinomial logistic regression with $p = 4$ input variables and $K = 5$ classes, we follow these steps:

1. Neural Network Architecture

- Input Layer:** 4 neurons (one for each predictor variable).

- Hidden Layer (Optional):** To introduce non-linearity, we can add a hidden layer with dropout.

- Output Layer:** 5 neurons (one for each class).

Each output neuron represents the log-odds of class $k$ relative to the reference class $K$ , following the equation:

$\log (\frac{p_{k}}{p_{K}}) = α_{k} + x_{1} β_{k 1} + \dots + x_{p} β_{k p}, for k = 1, 2, \dots, 4.$

The probability of class $K$ is obtained as:

$p_{K} = 1 - \sum_{k = 1}^{K - 1} p_{k} .$

2. Activation Function

Since this is a **classification problem**, the **softmax activation function** is used in the output layer:

$p_{k} = \frac{e^{z_{k}}}{\sum_{j = 1}^{K} e^{z_{j}}}, where z_{k} = α_{k} + x_{1} β_{k 1} + \dots + x_{p} β_{k p} .$

This ensures that: - Each $p_{k}$ is between 0 and 1. - The sum of all $p_{k}$ values is 1.

3. Incorporating Dropout Dropout is a regularization technique that randomly deactivates neurons during training to prevent overfitting. It is commonly applied to hidden layers.

- Where to Apply Dropout:

 - If a hidden layer is included, apply dropout (e.g., with probability [math]\displaystyle{  p = 0.5  }[/math]) to prevent co-adaptation of neurons.  
 - Dropout **should not** be applied to the input or output layers.

- Impact on Training and Generalization:

 - During training, dropout randomly removes a fraction of neurons in each iteration, forcing the network to learn more robust features.  
 - During inference (testing), dropout is disabled, and all neurons contribute, with weights scaled accordingly.  
 - This reduces overfitting and improves generalization to unseen data.

4. Loss Function To train the network, the **categorical cross-entropy loss** is used:

$L = - \sum_{i = 1}^{N} \sum_{k = 1}^{K} y_{i k} \log p_{i k}$

where: - $N$ is the number of samples.

- $y_{i k}$ is 1 if sample $i$ belongs to class $k$ , otherwise 0.

- $p_{i k}$ is the predicted probability for class $k$ .

5. Summary of Neural Network Design with Dropout

- Input layer: 4 neurons (one per input feature).

- Hidden layer (optional): Can include 8 neurons with ReLU activation and dropout ( $p = 0.5$ ).

- Output layer: 5 neurons (one per class) with softmax activation.

- Activation function: Softmax in the output layer.

- Loss function: Categorical cross-entropy.

- Regularization: Dropout applied to hidden layer to reduce overfitting.

By including dropout, the model becomes more robust to noise and prevents reliance on specific neurons, leading to better generalization performance.

Exercise 6.2

Level: * (Easy)

Exercise Types: Novel

Question

Consider a neural network with a single hidden layer. The hidden layer has 4 neurons, and the ReLU activation function is applied. During training, dropout is applied with a probability of 0.5. If the input to the hidden layer is $x = [1, 2, 3, 4]$ and the weights are $W = [\begin{matrix} 0.5 & 0.2 & - 0.3 & 0.8 \\ 0.1 & - 0.5 & 0.7 & - 0.2 \\ - 0.4 & 0.6 & - 0.1 & 0.3 \\ 0.3 & - 0.4 & 0.2 & 0.1 \end{matrix}]$ , compute the output of the hidden layer after applying dropout.

Assume the following dropout mask is sampled: $m = [1, 0, 1, 0]$ .

Solution

1. Compute the pre-activation values: $z = W \cdot x = [\begin{matrix} 0.5 & 0.2 & - 0.3 & 0.8 \\ 0.1 & - 0.5 & 0.7 & - 0.2 \\ - 0.4 & 0.6 & - 0.1 & 0.3 \\ 0.3 & - 0.4 & 0.2 & 0.1 \end{matrix}] \cdot [\begin{matrix} 1 \\ 2 \\ 3 \\ 4 \end{matrix}] = [\begin{matrix} 3.5 \\ 0.3 \\ 1.4 \\ 0.9 \end{matrix}]$

2. Apply the ReLU activation function: $ReLU (z) = max (0, z) = [\begin{matrix} 3.5 \\ 0.3 \\ 1.4 \\ 0.9 \end{matrix}]$

3. Apply the dropout mask and scale by $\frac{1}{1 - 0.5} = 2$ : $h = m ⊙ ReLU (z) \cdot 2 = [\begin{matrix} 1 \\ 0 \\ 1 \\ 0 \end{matrix}] ⊙ [\begin{matrix} 3.5 \\ 0.3 \\ 1.4 \\ 0.9 \end{matrix}] \cdot 2 = [\begin{matrix} 7.0 \\ 0 \\ 2.8 \\ 0 \end{matrix}]$

The final output of the hidden layer after applying dropout is: $h = [7.0, 0, 2.8, 0]$ .

Exercise 6.3

Level: ** (Moderate)

Exercise Types: Novel

Question

Train a simple deep neural network with and without dropout regularization on the MNIST dataset. Experiment with different dropout rates (0.1, 0.3, 0.5). Analyze the model's performance in terms of accuracy and generalization.

Solution

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define data preprocessing and load the MNIST dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))  # Normalize to [-1, 1]
])

train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform, download=True)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self, dropout_rate=0.0):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(28 * 28, 256)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.relu = nn.ReLU()
    
    def forward(self, x):
        x = x.view(x.size(0), -1)  # Flatten the input
        x = self.relu(self.fc1(x))
        x = self.dropout(x)  # Apply dropout
        x = self.relu(self.fc2(x))
        x = self.fc3(x)  # Output layer
        return x

# Define training and evaluation functions
def train_model(model, train_loader, optimizer, criterion, num_epochs=10):
    model.train()
    train_loss = []
    for epoch in range(num_epochs):
        epoch_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            
            optimizer.zero_grad()
            outputs = model(images)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
        train_loss.append(epoch_loss / len(train_loader))
    return train_loss

def evaluate_model(model, test_loader, criterion):
    model.eval()
    correct = 0
    total = 0
    test_loss = 0.0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            loss = criterion(outputs, labels)
            test_loss += loss.item()
            
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    accuracy = 100 * correct / total
    return test_loss / len(test_loader), accuracy

# Experiment with different dropout rates
dropout_rates = [0.0, 0.1, 0.3, 0.5]
results = {}

for rate in dropout_rates:
    model = SimpleNN(dropout_rate=rate).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    
    # Train the model
    print(f"Training with dropout rate: {rate}")
    train_loss = train_model(model, train_loader, optimizer, criterion, num_epochs=10)
    
    # Evaluate the model
    test_loss, test_accuracy = evaluate_model(model, test_loader, criterion)
    results[rate] = (train_loss, test_loss, test_accuracy)

# Plot the training loss for different dropout rates
plt.figure(figsize=(10, 6))
for rate, (train_loss, _, _) in results.items():
    plt.plot(train_loss, label=f"Dropout {rate}")

plt.title("Training Loss for Different Dropout Rates")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.grid(True)
plt.savefig('training_loss.jpg')


# Display test accuracy for different dropout rates
print("\nResults Summary:")
for rate, (_, test_loss, test_accuracy) in results.items():
    print(f"Dropout Rate {rate}: Test Loss = {test_loss:.4f}, Test Accuracy = {test_accuracy:.2f}%")

Output:

Results Summary:

Dropout Rate 0.0: Test Loss = 0.0778, Test Accuracy = 97.80%

Dropout Rate 0.1: Test Loss = 0.0826, Test Accuracy = 97.48%

Dropout Rate 0.3: Test Loss = 0.0873, Test Accuracy = 97.49%

Dropout Rate 0.5: Test Loss = 0.0930, Test Accuracy = 97.11%

From the summary results and the plot above, it is clear that the model tends to overfit without dropout. With a moderate dropout rate, the model achieves a good balance between generalization and regularization, leading to improved test accuracy. However, with a high dropout rate, excessive regularization can hinder the model's ability to learn effectively, resulting in reduced test accuracy.

Exercise 6.4

Level: * (Easy)

Exercise Type: Novel

Question

True/False Questions on Dropout

1. Dropout helps prevent overfitting by randomly deactivating a subset of neurons during training.

2. Dropout is used both during training and during testing to ensure consistent performance.

3. Using a high dropout rate (e.g., 90%) can lead to underfitting the model.

4. Dropout introduces noise in the network, which makes it more robust and forces the network to learn redundant representations.

5. Dropout is not compatible with batch normalization because they both normalize activations.

Solution

1. Dropout helps prevent overfitting by randomly deactivating a subset of neurons during training.
Answer: True
Explanation: Dropout prevents overfitting by deactivating a random subset of neurons during each forward pass in training. This forces the network to learn more generalizable features rather than relying on specific neurons or overfitting to the training data.

2. Dropout is used both during training and during testing to ensure consistent performance.
Answer: False
Explanation: Dropout is only applied during training. During inference, dropout is turned off, and all neurons are active. The weights are scaled to account for the absence of dropout, ensuring consistent output without randomly deactivating neurons.

3. Using a high dropout rate (e.g., 90%) can lead to underfitting the model.
Answer: True
Explanation: A very high dropout rate deactivates most of the neurons, which reduces the network’s capacity to learn complex patterns. This can result in underfitting, where the model performs poorly on both the training and validation sets.

4. Dropout introduces noise in the network, which makes it more robust and forces the network to learn redundant representations.
Answer: True
Explanation: By deactivating random neurons during training, dropout adds noise, making the model more robust to variations in the data. This encourages the network to learn redundant and distributed representations, as no single neuron can dominate the decision-making process.

5. Dropout is not compatible with batch normalization because they both normalize activations.
Answer: False
Explanation: Dropout and batch normalization are compatible, although they operate differently. Batch normalization normalizes activations, while dropout randomly deactivates neurons. The goal of Batch Normalization is to stabilize and speed up training. The goal of dropout is to prevent overfitting.

Exercise 6.5

Level: ** (Moderate)

Exercise Types: Novel

References: Adapted from concepts introduced by Srivastava et al., 2014.

Question

Consider the application of dropout to a linear regression model with a response variable $y$ and a single predictor $x$ . The linear model can be expressed as:

$y = β_{0} + β_{1} x + ϵ$ , where $ϵ$ represents normally distributed error.

(a) If dropout is applied at a rate of 0.2 during the training phase, how does this affect the estimation of $β_{0}$ and $β_{1}$ ? Discuss the potential benefits and drawbacks.

(b) Extend this to a multiple regression scenario where there are three predictors $x_{1}, x_{2}, x_{3}$ . Explain how dropout could be implemented during training and its expected impact on model variance and bias.

Solution

1.Dropout Implementation and Effects in Simple Linear Regression:

- Applying dropout to the predictor $x$ during training involves temporarily removing it from the model with a probability of 0.2. This effectively introduces missingness in the predictor data, which can make the model more robust by preventing it from relying too heavily on $x$ for predicting $y$ .

- Potential Benefits: - Reduces the risk of overfitting by making the model less sensitive to noise in the predictor $x$ . - Forces the model to not rely on any single input feature, thus potentially discovering more general patterns in the data.

- Potential Drawbacks: - Can increase model bias if important predictor information is consistently dropped. - Might lead to higher variance in the estimated parameters due to reduced effective sample size during each training iteration.

2. Extension to Multiple Regression

- In a model with multiple predictors, dropout can be applied to each predictor independently. Each predictor $x_{i}$ is dropped with a probability of 0.2 during each training epoch.

- Impact on Model Variance and Bias: - Model Variance: Dropout generally increases model variance during training but can lead to a reduction in variance of predictions by averaging over multiple "thinned" models. - Model Bias: Similar to the simple linear regression case, the bias might increase due to the systematic exclusion of potentially important predictors during training epochs. - The application of dropout in this setting acts as a form of regularization, helping to mitigate overfitting especially when the number of predictors is large relative to the sample size.

Fundamental Problems

Classification

Consider data ${(x_{i}, y_{i})}_{i = 1}^{n}$ where $x \in R^{d}$ and $y_{i}$ takes values in some finite set.

Find a function $f$ , such that when we observe a new $x$ we predict $y$ to be $f (x)$ .

Regression

Consider data ${(x_{i}, y_{i})}_{i = 1}^{n}$ where $x \in R^{d}$ and $y_{i}$ takes values in $R$ .

Find a function $f$ , such that when we observe a new $x$ we predict $y$ to be $f (x)$ .

Clustering

Consider data ${x_{i}}_{i = 1}^{n}$ where $x \in R^{d}$ .

Find a function $f$ , when we observe a new $x$ we predict $y$ to be $f (x)$ , such that for similar $x$ , $y$ is the same.

Perceptron

Define a cost function, $ϕ (β, β_{0})$ , as a summation of the distance between all misclassified points and the hyperplane, or the decision boundary.

To minimize this cost function, we need to estimate $β$ , $β_{0}$ : $min_{β, β_{0}} ϕ (β, β_{0}) = {distance of all misclassified points}$ .

(1) A hyperplane $L$ can be defined as $L = {x : f (x) = β^{T} x + β_{0} = 0}$ , For any two arbitrary points $x_{1}$ and $x_{2}$ on $L$ , we have

$β^{T} x_{1} + β_{0} = 0$ , $β^{T} x_{2} + β_{0} = 0$ ,

such that $β^{T} (x_{1} - x_{2}) = 0$ .

Therefore, $β$ is orthogonal to the hyperplane and it is the normal vector.

(2) For any point $x_{0}$ in $L$ ,

$β^{T} x_{0} + β_{0} = 0$ , which means $β^{T} x_{0} = - β_{0}$ .

(3) We set $β^{*} = \frac{β}{‖ β ‖}$ as the unit normal vector of the hyperplane $L$ . For simplicity, we call $β^{*}$ the norm vector. The distance of point $x$ to $L$ is given by

$β^{T} (x - x_{0}) = β^{T} x - β^{T} x_{0} = \frac{β^{T} x}{‖ β ‖} + \frac{β_{0}}{‖ β ‖} = \frac{β^{T} x + β_{0}}{‖ β ‖}$

Where $x_{0}$ is any point on $L$ . Hence, $β^{T} x + β_{0}$ is proportional to the distance of the point $x$ to the hyperplane $L$ .

(4) The distance from a misclassified data point $x_{i}$ to the hyperplane $L$ is

$d_{i} = - y_{i} (β^{T} x_{i} + β_{0})$

where $y_{i}$ is a target value, such that $y_{i} = 1$ if $β^{T} x_{i} + β_{0} < 0$ , $y_{i} = - 1$ if $β^{T} x_{i} + β_{0} > 0$ .

Since we need to find the distance from the hyperplane to the misclassified data points, we need to add a negative sign in front. When the data point is misclassified, $β^{T} x_{i} + β_{0}$ will produce an opposite sign of $y_{i}$ . Since we need a positive sign for distance, we add a negative sign.

Backpropagation

Backpropagation procedure is done using the following steps:

First arbitrarily choose some random weights (preferably close to zero) for your network.

Apply

x

to the FFNN's input layer, and calculate the outputs of all input neurons.

Propagate the outputs of each hidden layer forward, one hidden layer at a time, and calculate the outputs of all hidden neurons.

Once

x

reaches the output layer, calculate the output(s) of all output neuron(s) given the outputs of the previous hidden layer.

At the output layer, compute

δ_{k} = - 2 (y_{k} - {\hat{y}}_{k})

for each output neuron(s).

Compute each

δ_{i}

, starting from

i = k - 1

all the way to the first hidden layer, where

δ_{i} = σ^{'} (a_{i}) \sum_{j} δ_{j} \cdot u_{j i}

.

Compute

\frac{\partial ‖ y - \hat{y} ‖^{2}}{\partial u_{i l}} = δ_{i} z_{i}

for all weights

u_{j l}

.

Then update

u_{i l}^{new} \leftarrow u_{i l}^{old} - ρ \cdot \frac{\partial ‖ y - \hat{y} ‖^{2}}{\partial u_{i l}}

for all weights

u_{i l}

.

Continue for next data points and iterate on the training set until weights converge.

Epochs

It is common to cycle through all of the data points multiple times in order to reach convergence. An epoch represents one cycle in which you feed all of your data points through the neural network. It is good practice to randomize the order you feed the points to the neural network within each epoch; this can prevent your weights from changing in cycles. The number of epochs required for convergence depends greatly on the learning rate and convergence requirements used.

Stein's Unbiased Risk Estimator

Model Selection

The general task in machine learning is estimating a function.
- We want to estimate: $\hat{f} (x)$ (estimated function).
- Where there is a true underlying function: $f (x)$ (true function).

Definitions and Notations

Assume $T = {(x_{i}, y_{i})}_{i = 1}^{n}$ be the training set.

$f (.) \to$ True function $\hat{f} (.) \to$ Estimated function

Also assume: $y_{i} = f (x_{i}) + ϵ_{i}$ ,

where $ϵ_{i} \sim N (0, σ^{2})$

${\hat{y}}_{i} = \hat{f} (x_{i})$
$f_{i} \equiv f (x_{i})$
${\hat{f}}_{i} \equiv \hat{f} (x_{i})$

For point $(x_{0}, y_{0})$ , we are interested in:

$E [({\hat{y}}_{0} - y_{0})^{2}] = E [({\hat{f}}_{0} - f_{0} - ϵ_{0})^{2}]$

$= E [{({\hat{f}}_{0} - f_{0} - ϵ_{0})}^{2}]$

$= E [({\hat{f}}_{0} - f_{0})^{2} + ϵ_{0}^{2} - 2 ϵ_{0} ({\hat{f}}_{0} - f_{0})]$

$= E [({\hat{f}}_{0} - f_{0})^{2}] + E [ϵ_{0}^{2}] - 2 E [ϵ_{0} ({\hat{f}}_{0} - f_{0})]$

$= E [({\hat{f}}_{0} - f_{0})^{2}] + σ^{2} - 2 E [ϵ_{0} ({\hat{f}}_{0} - f_{0})]$

Case 1 Assume: $(x_{0}, y_{0}) \notin T$

In this case, since $\hat{f}$ is estimated only based on points in the training set, therefore it is completely independent from $(x_{0}, y_{0})$ .

$\Rightarrow E [(y_{0} - f) (\hat{f} - f)] = cov (y_{0}, {\hat{f}}_{0}) = 0$

If summing up all $m$ points that are not in $T$ :

$\underset{err}{\underset{⏟}{\sum_{i = 1}^{m} ({\hat{y}}_{i} - y_{i})^{2}}} = \underset{Err}{\underset{⏟}{\sum_{i = 1}^{m} ({\hat{f}}_{i} - f_{i})^{2}}} + m σ^{2}$

Empirical error ( $err$ ) is a good estimator of true error ( $Err$ ) if the point $(x_{0}, y_{0})$ is not in the training set.

Case 2 Assume: $(x_{0}, y_{0}) \in T$

Then: $2 E [ϵ_{0} ({\hat{f}}_{0} - f_{0})] \neq 0$

Stein's Lemma

If: $x \sim N (θ, σ^{2})$ and $g (x)$ is differentiable,

then: $E [g (x) (x - θ)] = σ^{2} E [\frac{\partial g (x)}{\partial x}]$

Our problem: $E [ϵ_{0} ({\hat{f}}_{0} - f_{0})] = σ^{2} E [\frac{\partial ({\hat{f}}_{0} - f_{0})}{\partial ϵ_{0}}]$

$= σ^{2} E [\frac{\partial {\hat{f}}_{0}}{\partial ϵ_{0}} - \frac{\partial f_{0}}{\partial ϵ_{0}}]$

$= σ^{2} E [\frac{\partial {\hat{f}}_{0}}{\partial ϵ_{0}}]$

$= σ^{2} E [\frac{\partial {\hat{f}}_{0}}{\partial y_{0}} \cdot \frac{\partial y_{0}}{\partial ϵ_{0}}]$

$= σ^{2} E [\frac{\partial {\hat{f}}_{0}}{\partial y_{0}}]$

$E [({\hat{y}}_{0} - y_{0})^{2}] = E [({\hat{f}}_{0} - f_{0})^{2}] + σ^{2} - 2 σ^{2} E [D_{0}]$

Sum over all $n$ data points:

$\underset{err}{\underset{⏟}{\sum_{i = 1}^{n} ({\hat{y}}_{i} - y_{i})^{2}}} = \underset{Err}{\underset{⏟}{\sum_{i = 1}^{n} ({\hat{f}}_{i} - f_{i})^{2}}} + n σ^{2} - 2 σ^{2} \sum_{i = 1}^{n} D_{i}$

$Err = err - n σ^{2} + \underset{Complexity of model}{\underset{⏟}{2 σ^{2} \sum_{i = 1}^{n} D_{i}}}$

$Err$ is Stein's Unbiased Risk Estimator (SURE).

Regularization in Deep Learning

Introduction

Regularization is a fundamental concept in machine learning, particularly in deep learning, where models with a high number of parameters are prone to overfitting. Overfitting occurs when a model learns the noise in the training data rather than the underlying distribution, leading to poor generalization on unseen data. Regularization techniques aim to constrain the model’s capacity, thus preventing overfitting and improving generalization. This chapter will explore various regularization methods in detail, complete with mathematical formulations, intuitive explanations, and practical implementations.

Classical Regularization: Parameter Norm Penalties

L2 Regularization (Weight Decay)

L2 Parameter Regularization (Weight Decay)

Overview

L2 parameter regularization, commonly known as weight decay, is a technique used to prevent overfitting in machine learning models by penalizing large weights. This penalty helps in constraining the model's complexity.

The regularization term is given by:

$R (w) = \frac{λ}{2} ‖ w ‖_{2}^{2}$

where:

$λ$ is the regularization strength (a hyperparameter),
$w$ represents the model weights,
$‖ w ‖_{2}$ denotes the L2 norm of the weight vector.

Gradient of the Total Objective Function

The gradient of the total objective function, which includes both the loss and the regularization term, is given by:

$\nabla_{w} L_{total} (w; X, y) = λ w + \nabla_{w} L (w; X, y)$

The weight update rule with L2 regularization using gradient descent is:

$w := w - η (λ w + \nabla_{w} L (w; X, y))$

where $η$ is the learning rate.

Quadratic Approximation to the Objective Function

Consider a quadratic approximation to the objective function:

$L (w) \approx L (w^{*}) + \frac{1}{2} (w - w^{*})^{⊤} H (w - w^{*})$

where:

$w^{*}$ is the optimum weight vector,
$H$ is the Hessian matrix of second derivatives.

The modified gradient equation becomes:

$λ w + H (w - w^{*}) = 0$

Solving for $w$ , we get:

$w = (H + λ I)^{- 1} H w^{*}$

where $I$ is the identity matrix.

Eigenvalue Decomposition

Assume $H = Q Λ Q^{⊤}$ where $Q$ is the orthogonal matrix of eigenvectors and $Λ$ is the diagonal matrix of eigenvalues.

Then the weight vector can be expressed as:

$w = Q (Λ + λ I)^{- 1} Λ Q^{⊤} w^{*}$

The effect of weight decay is to rescale the coefficients of the eigenvectors. The $i$ -th component is rescaled by a factor of $\frac{λ_{i}}{λ_{i} + λ}$ , where $λ_{i}$ is the $i$ -th eigenvalue.

If $λ_{i} > λ$ , the effect of regularization is relatively small.
Components with $λ_{i} < λ$ will be shrunk to have nearly zero magnitude.

Effective Number of Parameters

Directions along which the parameters contribute significantly to reducing the objective function are preserved. A small eigenvalue of the Hessian indicates that movement in this direction will not significantly increase the gradient.

The effective number of parameters can be defined as:

$Effective Number of Parameters = \sum_{i} \frac{λ_{i}}{λ_{i} + λ}$

As $λ$ increases, the effective number of parameters decreases, which reduces the model's complexity.

(Placeholder for Image) (Include an image illustrating the effect of weight decay on the eigenvalues and the effective number of parameters)

Dataset Augmentation

Overview

Dataset augmentation is a technique used to improve the generalization ability of machine learning models by artificially increasing the size of the training dataset. This is particularly useful when the amount of available data is limited. The idea is to create new, synthetic data by applying various transformations to the original dataset.

Key Idea: The best way to make a machine learning model generalize better is to train it on more data. When the amount of available data is limited, creating synthetic data (e.g., by applying transformations like rotation, translation, and noise addition) and adding it to the training set can be effective.

Practical Example: Operations like translating training images a few pixels in each direction can greatly improve generalization. Another approach is to train neural networks with random noise applied to their inputs, which also serves as a form of dataset augmentation. This technique can be applied not only to the input layer but also to hidden layers, effectively performing dataset augmentation at multiple levels of abstraction.

---

Noise Injection

Overview

Noise injection is a regularization strategy that can be applied in two main ways:

1. Adding Noise to the Input: This method can be interpreted as a form of dataset augmentation and also has a direct connection to traditional regularization methods. 2. Adding Noise to the Weights: This method is primarily used in the context of recurrent neural networks and can be viewed as a stochastic implementation of Bayesian inference over the weights.

Mathematical Proof for Injecting Noise at the Input

Consider a regression setting where we have an input-output pair $ (x, y) $ and the goal is to minimize the expected loss function:

$J = E_{x, y} [(f (x) - y)^{2}]$

Now, suppose we inject noise into the input $ x $, where the noise $ \epsilon $ is drawn from a distribution with mean zero (e.g., Gaussian noise $ \epsilon \sim \mathcal{N}(0, \sigma^2) $). The modified objective function with noise-injected inputs becomes:

$J_{noise} = E_{x, y, ϵ} [(f (x + ϵ) - y)^{2}]$

To understand the effect of noise injection, we can expand the function $ f(x + \epsilon) $ around $ x $ using a Taylor series:

$f (x + ϵ) = f (x) + ϵ^{⊤} \nabla_{x} f (x) + \frac{1}{2} ϵ^{⊤} \nabla_{x}^{2} f (x) ϵ + O (‖ ϵ ‖^{3})$

Since the expectation of the noise $ \epsilon $ is zero:

$E [ϵ] = 0$

and assuming that the noise is isotropic with covariance matrix $ \sigma^2 I $, the expectation of the second-order term becomes:

$E [ϵ ϵ^{⊤}] = σ^{2} I$

Substituting the Taylor expansion into the objective function:

$J_{noise} = E_{x, y} [(f (x) - y)^{2}] + \frac{σ^{2}}{2} E_{x, y} [‖ \nabla_{x} f (x) ‖^{2}] + O (σ^{4})$

This shows that the objective function with noise injection is equivalent to the original objective function plus a regularization term that penalizes large gradients of the function $ f(x) $. Specifically, the added term $\frac{σ^{2}}{2} E_{x, y} [‖ \nabla_{x} f (x) ‖^{2}]$ reduces the sensitivity of the network's output to small variations in its input.

Key Result:

For small noise variance $ \sigma^2 $, the minimization of the loss function with noise-injected input is equivalent to minimizing the original loss function with an additional regularization term that penalizes large gradients:

$J_{noise} \approx J + \frac{σ^{2}}{2} E_{x, y} [‖ \nabla_{x} f (x) ‖^{2}]$

This regularization term effectively reduces the sensitivity of the output with respect to small changes in the input $ x $, which is beneficial in avoiding overfitting.

Connection to Weight Decay:

In linear models, where $ f(x) = w^\top x $, the gradient $ \nabla_x f(x) $ is simply the weight vector $ w $. Therefore, the regularization term becomes:

$\frac{σ^{2}}{2} ‖ w ‖^{2}$

which is equivalent to L2 regularization or weight decay.

Manifold Tangent Classifier

Overview

The Manifold Tangent Classifier (MTC) is a classification technique that leverages the idea that data often lies on a lower-dimensional manifold within the high-dimensional input space. The key assumption is that examples on the same manifold share the same category, and the classifier should be invariant to local factors of variation that correspond to movements on the manifold.

Key Idea: The classifier should be invariant to variations along the manifold while being sensitive to changes that move the data off the manifold.

Tangent Propagation Algorithm

One approach to achieve invariance to manifold variations is to use the Tangent-Prop algorithm (Simard et al., 1992). The main idea is to add a penalty to the loss function that encourages the neural network's output to be locally invariant to known factors of variation. This is achieved by requiring the gradient of the output with respect to the input to be orthogonal to the known manifold tangent vectors $ v_i $ at each point $ x $.

The regularization term can be expressed as:

$Regularizer = λ \sum_{i} {(\frac{\partial f (x)}{\partial x} \cdot v_{i})}^{2}$

where:

$ \frac{\partial f(x)}{\partial x} $ is the gradient of the neural network output with respect to the input,
$ v_i $ are the known tangent vectors of the manifold,
$ \lambda $ is the regularization strength.

This regularization ensures that the directional derivative of $ f(x) $ in the directions $ v_i $ is small, promoting invariance along the manifold.

Manifold Tangent Classifier (MTC)

A more recent approach, introduced by Rifai et al. (2011), eliminates the need to know the tangent vectors a priori. The Manifold Tangent Classifier automatically learns these tangent vectors during training, making it more flexible and applicable to a wider range of problems.

---

Early Stopping as a Form of Regularization

Overview

Early stopping is one of the most commonly used forms of regularization in deep learning. Instead of running the optimization algorithm until it reaches a local minimum of the training error, early stopping involves monitoring the validation error during training and halting the process when the validation error stops improving.

Key Idea: During training, whenever the error on the validation set improves, a copy of the model parameters is stored. The training is stopped when the validation error has not improved for a predetermined amount of time, and the best model parameters (those that resulted in the lowest validation error) are returned.

Mathematical Formulation

Assume that $ w $ represents the model weights (ignoring bias parameters). We take a quadratic approximation to the objective function $ J(w) $ around the empirically optimal value of the weights $ w^* $:

$J (w) \approx J (w^{*}) + \frac{1}{2} (w - w^{*})^{⊤} H (w - w^{*})$

where:

$ H $ is the Hessian matrix of second derivatives.

The gradient of the objective function is:

$\nabla_{w} J (w) = H (w - w^{*})$

During training, the parameter vector is updated according to:

$w^{(t + 1)} = w^{(t)} - η \nabla_{w} J (w^{(t)})$

Substituting the expression for the gradient:

$w^{(t + 1)} - w^{*} = (I - η H) (w^{(t)} - w^{*})$

where $ \eta $ is the learning rate. If we assume that the initial weights are zero (i.e., $ w^{(0)} = 0 $), we can express the weight update after $ t $ iterations as:

$w^{(t)} - w^{*} = (I - η H)^{t} (w^{(0)} - w^{*})$

If we perform an eigenvalue decomposition of $ H $, we get:

$H = Q Λ Q^{⊤}$

where $ Q $ is the orthogonal matrix of eigenvectors, and $ \Lambda $ is the diagonal matrix of eigenvalues. The weight update can then be rewritten as:

$w^{(t)} - w^{*} = Q (I - η Λ)^{t} Q^{⊤} (w^{(0)} - w^{*})$

Assuming $ w^{(0)} = 0 $ and that $ |1 - \eta \lambda_i| < 1 $ for all eigenvalues $ \lambda_i $, after $ t $ training updates, we have:

$Q^{⊤} w^{(t)} \approx [I - (1 - η Λ)^{t}] Q^{⊤} w^{*}$

Taking the logarithm and using the series expansion for $ \log(1 + x) $, it can be shown that the number of training iterations $ t $ plays a role inversely proportional to the L2 regularization parameter $ \lambda $, and the inverse of $ t $ plays the role of the weight decay coefficient.

Key Insight:

This result shows that early stopping can be interpreted as a form of implicit regularization, where the number of training iterations controls the effective complexity of the model.

Early Stopping as a Form of Regularization

Overview

Early stopping is one of the most commonly used forms of regularization in deep learning. Instead of running the optimization algorithm until it reaches a local minimum of the training error, early stopping involves monitoring the validation error during training and halting the process when the validation error stops improving.

Key Idea: During training, whenever the error on the validation set improves, a copy of the model parameters is stored. The training is stopped when the validation error has not improved for a predetermined amount of time, and the best model parameters (those that resulted in the lowest validation error) are returned.

Mathematical Formulation

Assume that $ w $ represents the model weights (ignoring bias parameters). We take a quadratic approximation to the objective function $ J(w) $ around the empirically optimal value of the weights $ w^* $:

$J (w) \approx J (w^{*}) + \frac{1}{2} (w - w^{*})^{⊤} H (w - w^{*})$

where:

\( H \) is the Hessian matrix of second derivatives.

The gradient of the objective function is:

$\nabla_{w} J (w) = H (w - w^{*})$

During training, the parameter vector is updated according to:

$w^{(t + 1)} = w^{(t)} - η \nabla_{w} J (w^{(t)})$

Substituting the expression for the gradient:

$w^{(t + 1)} - w^{*} = (I - η H) (w^{(t)} - w^{*})$

where $ \eta $ is the learning rate. If we assume that the initial weights are zero (i.e., $ w^{(0)} = 0 $), we can express the weight update after $ t $ iterations as:

$w^{(t)} - w^{*} = (I - η H)^{t} (w^{(0)} - w^{*})$

If we perform an eigenvalue decomposition of $ H $, we get:

$H = Q Λ Q^{⊤}$

where $ Q $ is the orthogonal matrix of eigenvectors, and $ \Lambda $ is the diagonal matrix of eigenvalues. The weight update can then be rewritten as:

$w^{(t)} - w^{*} = Q (I - η Λ)^{t} Q^{⊤} (w^{(0)} - w^{*})$

Assuming $ w^{(0)} = 0 $ and that $ |1 - \eta \lambda_i| < 1 $ for all eigenvalues $ \lambda_i $, after $ t $ training updates, we have:

$Q^{⊤} w^{(t)} \approx [I - (1 - η Λ)^{t}] Q^{⊤} w^{*}$

Taking the logarithm and using the series expansion for $ \log(1 + x) $, it can be shown that the number of training iterations $ t $ plays a role inversely proportional to the L2 regularization parameter $ \lambda $, and the inverse of $ t $ plays the role of the weight decay coefficient.

Key Insight:

This result shows that early stopping can be interpreted as a form of implicit regularization, where the number of training iterations controls the effective complexity of the model.

Label Smoothing

Label smoothing is a technique to prevent a model from becoming over-confident on a specific class by not forcing the model to fit the data exactly. This approach provides more flexibility and generalization abilities.

Suppose the predicted output is $y = [0, 1, 0]$ , then after applying label smoothing, it becomes $y = [0.033, 0.933, 0.033]$ .

The formula for label smoothing is:

$y_{smooth} = (1 - a) \cdot y + \frac{a}{k}$

where:

$a$ is the smoothing factor,
$y$ is the original label,
$k$ is the number of classes.

The reason for using label smoothing:

Prevents overfitting: By preventing the model from becoming too confident in its predictions, label smoothing reduces the likelihood of overfitting.
Improves generalization: Label smoothing can help the model generalize better to unseen data, as it discourages overconfidence in training.

Bagging/Ensemble

Bagging (short for bootstrap aggregating) is a machine learning ensemble technique for reducing generalization error by combining several models (Breiman, 1994)

Explanation: Aggregating is to ombine the predictions of each model, often using voting (for classification) or averaging (for regression).

It works by training several different models separately, and then have all of the models vote on the output for test examples. For example, random forest, which is a popular bagging algorithm that builds multiple decision trees on different bootstrap samples of the data and aggregates their predictions.

The reason why bagging works:

Variance reduction: By training multiple models on different data subsets, bagging reduces the variance of the predictions. This means that it helps models generalize better to new, unseen data.
Handling overfitting: Bagging is particularly useful for high-variance models (e.g., decision trees) as it helps reduce overfitting.

While bagging is highly useful for traditional machine learning models, it is less commonly used in deep learning because modern neural networks are already highly expressive. Instead, other ensemble methods, such as model ensembling or dropout, are used.

Code Sample:

from sklearn.ensemble import RandomForestClassifier

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

- Load a dataset (Iris dataset for example) data = load_iris()

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

- Train a random forest classifier (bagging technique)

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

rf_model.fit(X_train, y_train)

- Make predictions

y_pred = rf_model.predict(X_test)

- Evaluate the model

accuracy = rf_model.score(X_test, y_test)

Dropout

Overview

Dropout is one of the techniques for preventing overfitting in deepneural network which contains a large number of parameters.

The key idea is to randomly drop units from the neural networkduring training.

During training, dropout samples from number of different “thinned” network.
At test time, we approximate the effect of averaging the predictions of all these thinned networks

Model

Consider a neural network with $L$ hidden layers:

Let $z^{(l)}$ denote the vector inputs into layer $l$ .
Let $y^{(l)}$ denote the vector of outputs from layer $l$ .
Let $W^{(l)}$ and $b^{(l)}$ represent the weights and biases at layer $l$ .

With dropout, the feed-forward operation becomes:

$r^{(l)} \sim Bernoulli (p)$

$y^{(l)} = r^{(l)} ⊙ y^{(l)}$

where $⊙$ denotes element-wise multiplication.

The feed-forward equation for layer $l + 1$ becomes:

$z^{(l + 1)} = W^{(l + 1)} y^{(l)} + b^{(l + 1)}$

$y^{(l + 1)} = f (z^{(l + 1)})$

where $f$ is the activation function.

For any layer $l$ , $r^{(l)}$ is a vector of independent Bernoulli random variables, each of which has a probability $p$ of being 1. The vector $y^{(l)}$ is the input after some hidden units are dropped. The rest of the model remains the same as a regular feed-forward neural network.

Training

Dropout neural network can be trained using stochastic gradient descent. The only difference here is that we only back propagate on eachthinned network. The gradient for each parameter are averaged over the training cases in each mini-batch.

Test Time

Use a single neural net without dropout If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time. $p \cdot w$

Additional Regularization

L1 norm

L1 norm regularization, also known as Lasso, is a technique that adds a penalty equal to the absolute value of the magnitude of coefficients to the loss function. This encourages sparsity in the learned weights, meaning it forces some of the weights to become exactly zero, effectively selecting important features and reducing model complexity.

Mixup

Mixup is a data augmentation technique that creates new training examples by taking convex combinations of pairs of input data and their labels. By blending images and labels together, the model learns smoother decision boundaries and becomes more robust to adversarial examples and noise.

Cutout

Cutout is a form of data augmentation where random square regions are masked out (set to zero) in input images during training. This forces the model to focus on a broader range of features across the image rather than relying on any single part, leading to better generalization and robustness.

Gradient Clipping

Gradient clipping is a technique used to prevent the gradients from becoming too large during training, which can cause the model to diverge. This is done by capping the gradients at a predefined threshold, ensuring that updates remain stable, especially in models like recurrent neural networks (RNNs) where exploding gradients are a common issue.

DropConnect

DropConnect is a variation of Dropout, but instead of dropping neurons, it randomly drops connections (weights) between neurons during training. This prevents the co-adaptation of neurons while allowing individual neurons to contribute to learning.

Data Augmentation (beyond Mixup and Cutout)

Random Flips and Rotations: Randomly flipping (either vertical or horizontal) or rotating images during training to make the model invariant to certain transformations.
Color Jittering: Modifying the brightness, contrast, saturation, and hue of images to make the model more robust to variations in color.
Random Cropping and Scaling: Randomly cropping and scaling images to force the model to learn from different perspectives and contexts within the data.

For PyTorch augmentation, one can refer https://pytorch.org/vision/stable/transforms.html

Generalization Paradox

Models with many parameters tend to overfit
However, deep neural network, despite using many parameters, works well with unseen data (look up the Double Descent Curve), the reason remains unknown

This phenomenon is illustrated by the Double Descent Curve, where after reaching a peak in test error (due to overfitting), the error decreases again with further model complexity. The precise reasons remain uncertain, but hypotheses include implicit regularization from optimization methods like SGD, hierarchical feature learning, and the redundancy offered by overparameterization.

Batch Normalization

Overview

Batch normalization is a technique used to improve the training process of deep neural networks by normalizing the inputs of each layer. Despite the initial intuition for the method being somewhat incorrect, it has proven to be highly effective in practice. Batch normalization speeds up convergence, allows for larger learning rates, and makes the model less sensitive to initialization, resulting in more stable and efficient training.

Batch normalization motivated by internal covariate shift (2015 lofee & Szegedy)

Internal Covariance Shift

Batch normalization was originally proposed as a solution to the internal covariance shift problem, where the distribution of inputs to each layer changes during training. This shift complicates training because the model must constantly adapt to new input distributions.

The transformation of layers can be described as:

$l = F_{2} (F_{1} (u, θ_{1}), θ_{2})$

For a mini-batch of activations $X = {x_{1}, x_{2}, \dots, x_{m}}$ from a specific layer, batch normalization proceeds as follows:

1. Compute the mean: $μ_{B} = \frac{1}{m} \sum_{i = 1}^{m} x_{i}$

2. Compute the variance: $σ_{B}^{2} = \frac{1}{m} \sum_{i = 1}^{m} (x_{i} - μ_{B})^{2}$

3. Normalize the activations: ${\hat{x}}_{i} = \frac{x_{i} - μ_{B}}{\sqrt{σ_{B}^{2} + ϵ}}$

4. Scale and shift the normalized activations using learned parameters $γ$ (scale) and $β$ (shift): $y_{i} = γ {\hat{x}}_{i} + β$

where:

$x_{i}$ represents the activations in the mini-batch,
$μ_{B}$ is the mean of the mini-batch,
$σ_{B}^{2}$ is the variance of the mini-batch,
$ϵ$ is a small constant added for numerical stability,
${\hat{x}}_{i}$ is the normalized activation,
$γ$ and $β$ are learned parameters for scaling and shifting.

The batch normalization improves validation accuracy by removing the dropout and enables higher learning rate

Batch Normalization

As mentioned earlier, the original intuition behind batch normalization was found to be incorrect after further research. A paper by Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (NeurIPS 2019) contradicted the original 2015 paper on BatchNorm by highlighting the following points:

Batch normalization does not fix covariate shift.
If we fix covariate shift, it doesn't help.
lf we intentionally increase lCS, it doesn't harm.
Batch Norm is not the only possible normalization. There are alternatives.

Instead, they argue that Batch normalization works better due to other factors, particularly related to its effect on the optimization process:

1. Reparameterization of the loss function:

Improved Lipschitzness: Batch normalization improves the Lipschitz continuity of the loss function, meaning that the loss changes at a smaller rate, and the magnitudes of the gradients are smaller. This makes the gradients of the loss more "Lipschitz."
Better β-smoothness: The loss exhibits significantly better smoothness, which aids in optimization by preventing large, erratic changes in the gradient.

2. Variation of the loss function: BatchNorm reduces the variability of the value of the loss. Consider the variation of the loss function: $L (x + η \nabla L (x))$ where $η \in [0.05, 0.4]$ . A smaller variability of the loss indicates that the steps taken during training are less likely to cause the loss to increase uncontrollably.

3. Gradient predictiveness: BatchNorm enhances the predictiveness of the gradients, meaning the changes in the loss gradient are more stable and predictable. This can be expressed as: $| | \nabla L (x) - \nabla L (x + η \nabla L (x)) | |$ , where $η \in [0.05, 0.4]$ . A good gradient predictiveness implies that the gradient evaluated at a given point remains relevant over longer distances, which allows for larger step sizes during training.

Alternatives to Batch Norm

Weight Normalization: Weight normalization is a technique where the weights, instead of the activations, are normalized. This method reparameterizes the weight vectors to accelerate the training of deep neural networks.

Tim Salimans and Diederik P. Kingma, "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks," 2016.

ELU (Exponential Linear Unit) and SELU (Scaled Exponential Linear Unit): ELU and SELU are two proposed activation functions that have a decaying slope instead of a sharp saturation. They can be used as alternatives to BatchNorm by providing smoother, non-linear activation without requiring explicit normalization.

Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," In International Conference on Learning Representations (ICLR), 2016. Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter, "Self-Normalizing Neural Networks," ICLR, 2017.

Convolutional Neural Network (CNN)

Introduction

Convolutional networks are simply neural networks that use convolution instead of general matrix multiplication in at least one of their layers. CNN is mainly used for image processing.

Convolution

In ML, convolution means dot product

$h = σ (⟨ x, w ⟩ + b)$

Same x, different w -- multi-layer perception (MLP)
Different x, same w -- CNN (weight sharing)

From class, the following operation is called convolution

$s (t) = \int x (a) w (t - a) d s$

The convolution operation is typically denoted with an asterisk:

$s (t) = (x * w) (t)$

Discrete Convolution

If we now assume that x and w are defined only on integer t, we can define the discrete convolution:

$s [t] = (x * w) (t) = \sum_{a = - \infty}^{\infty} x [a] w [t - a]$

$w [t - a]$ represents the sequence $w [t]$ shifted by $a$ units.

In practice

We often use convolutions over more than one axis at a time.

$s [i, j] = (I * K) [i, j] = \sum_{m} \sum_{n} I [m, n] K [i - m, j - n]$

Input: usually a multidimensional array of data.

Kernel: usually a multidimensional array of parameters that should be learned.

We assume that these functions are zero everywhere but the finite set of points for which we store the values.

We can implement the infinite summation as a summation over a finite number of array elements.

Convolution and Cross-Correlation

Convolution is commutative:

$s [i, j] = (I * K) [i, j] = \sum_{m} \sum_{n} I [i - m, j - n] K [m, n]$

Cross-correlation:

$s [i, j] = (I * K) [i, j] = \sum_{m} \sum_{n} I [i + m, j + n] K [m, n]$

Many machine learning libraries implement cross-correlation but call it convolution. In the context of backpropagation in neural networks, cross-correlation simplifies the computation of gradients with respect to the input and kernel.

Visualization of Cross-Correlation and Convolution with Matlab (https://www.youtube.com/v/Ma0YONjMZLI)

Image to Convolved Feature

Kernel\filter size: weight $\times$ height, e.g. 3 $\times$ 3 in below example
Stride: how many pixels to move the filter each time
Padding: add zeros (or any other value) around the boundary of the input

Example

The following image illustrates a 2D convolution operation between an input image and a filter to produce a convolved feature map.

Image (Input): The grid on the left represents a 5 $\times$ 5 matrix of pixel values. The orange-highlighted 3 $\times$ 3 region is part of the image currently being convolved with the filter.

Convolution Operation: The filter values are applied to the selected region in an element-wise multiplication followed by a summation. The operation is as follows:

$(1 \times 1) + (1 \times 0) + (1 \times 1) + (0 \times 0) + (1 \times 1) + (1 \times 0) + (0 \times 1) + (0 \times 0) + (1 \times 1) = 4$

Convolved Feature Map: The result value $4$ , is placed in the corresponding position (top-left) of the convolved feature map on the right.

This process is repeated as the filter slides across the entire image. The final feature map is shown below.

Sparse Interactions

In feed forward neural network every output unit interacts with every input unit.

When we have $m$ inputs and $n$ outputs, then matrix multiplication requires $(m \times n)$ parameters. and the algorithms used in practice have $O (m \times n)$ runtime (per example)

Convolutional networks, typically have sparse connectivity (sparse weights). This is accomplished by making the kernel smaller than the input.

Limit the number of connections each output may have to $k$ , then requires only $(k \times n)$ parameters and $O (k \times n)$ runtime

Parameter Sharing

In a traditional neural net, each element of the weight matrix is multiplied by one element of the input. i.e. It is used once when computing the output of a layer.
In CNNs, each member of the kernel is used at every position of the input
Instead of learning a separate set of parameters for every location, we learn only one set

Equivariance

A function $f (x)$ is equivaraint to a function $g$ is the following holds:

$f (g (x)) = g (f (x))$

In simple terms: Applying the function $f$ after $g$ is equivalent to applying $g$ after $f$

Equivariance in CNNs

CNNs are naturally equivariant to translation (Covlution = Shift)
If an input image is shifted, the output feature map shifts correspondingly, preserving spatial structure
Importance: This property ensures that CNNs can detect features like edges or corners, no matter where they appear in the image

A convolutional layer has equivariance to translation. For example,

$g (x) [i] = x [i - 1]$

If we apply this transformation to x, then apply convolution, the result will be the same as if we applied convolution to x, then applied the transformation to the output.

For images, convolution creates a 2-D map of where certain features appear in the input. Note that convolution is not equivariant to some other transformations, such as changes in the scale (rescaling) or rotation of an image.

Importance of Data Augmentation

Data augmentation is commonly used to make CNNs robust against variations in scale, rotation, or other transformations. This involves artificially modifying the training data by applying transformations like flipping, scaling, and rotating to expose the network to different variations of the same object.

Convolutional Networks

The first stage (Convolution): The layer performs several convolutions in parallel to produce a set of preactivations. The convolution stage is designed to detect local features in the input image, such as edges and patterns. It does this by applying filters/kernels across the input image.

The second stage (Detector): Each preactivation is run through a nonlinear activation function (e.g. rectified linear). This stage introduces nonlinearity into the model, enabling it to learn complex patterns. Without this nonlinearity, the model would be limited to learning only linear relationships.

The third stage (Pooling) Pooling reduces the spatial dimensions of the feature maps (height and width), helping to make the model more invariant to small translations and distortions in the input image. It also reduces computational load and helps prevent overfitting.

Pooling

Down-sample input size to reduce computation and memory

Popular Pooling functions

The maximum of a rectangular neighborhood (Max pooling operation)
The average of a rectangular neighborhood
The L2 norm of a rectangular neighborhood
A weighted average based on the distance from the central pixel

Pooling with Downsampling

Max-pooling with a pool width of 3 and a stride between pools of 2. This reduces the representation size by a factor of 2, which reduces the the computational and statistical burden on the next layer.

Pooling and Translations

Pooling helps to make the representation become invariant to small translations of the input. Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is. For example: In a face, we need not know the exact location of the eyes.

Input of Varying Size

Example: we want to classify images of variable size.

The input to the classification layer must have a fixed size. In the final pooling output (for example) four sets of summary statistics, one for each quadrant of an image, regardless of the image size. Feature after pooling is $2 \times 2$ .

It is also possible to dynamically pool features together, for example, by running a clustering algorithm on the locations of interesting features (Boureau et al., 2011).

i.e. a different set of pooling regions for each image.

Learn a single pooling structure that is then applied to all images (Jia et al., 2012)

Convolution and Pooling as an Infinitely Strong Prior

Weak Prior: a prior distribution that has high entropy, which means there is a high level of uncertainty or spread in the distribution. An example of this would be a Gaussian distribution with high variance
Strong Prior: has very low entropy, which implies a high level of certainty or concentration in the distribution. An example of this would be a Gaussian distribution with low variance
Infinitely Strong Prior: places zero probability on some parameters and says a convolutional net is similar to a fully connected net with an infinitely strong prior over its weights

The weights for one hidden unit must be identical to the weights of its neighbor, but shifted in space. The weights must be zero, except for in the small, spatially contiguous receptive field assigned to that hidden unit.

Use of convolution as infinitely strong prior probability distribution over the parameters of a layer. This prior says that the function the layer should learn contains only local interactions and is equivariantto translation.

The use of pooling is infinitely strong prior that each unit should be invariant to small translations. Convolution and pooling can cause underfitting.

Practical Issues

The input is usually not just a grid of real values. It is a grid of vector-valued observations. For example, a color image has a red, green, and blue intensity at each pixel.

When working with images, we usually think of the input and output of the convolution as 3-D tensors. One index into the different channels and two indices into the coordinates of each channel.

Software implementations usually work in batch mode, so they will actually use 4-D tensors, with the fourth axis indexing different examples in the batch.

Training

Suppose we want to train a convolutional network that incorporates convolution of kernel stack $K$ applied to multi-channel image $V$ with stride $s$ : $c (K; V; s)$

Suppose we want to minimize some loss function $J (V; K)$ . During forward propagation, we will need to use $c$ itself to output $Z$ .

$Z$ is propagated through the rest of the network and used to compute $J$ .

During backpropagation, we receive a tensor $G$ such that:

$G_{i, j, k} = \frac{\partial}{\partial Z_{i, j, k}} J (V, K)$

To train the network, we compute the derivatives with respect to the weights in the kernel:

$g (G, V, s)_{i, j, k, l} = \frac{\partial}{\partial Z_{i, j, k}} J (V, K) = \sum_{m, n} G_{i, m, n} V_{j, m s + k, n s + l}$

If this layer is not the bottom layer of the network, we compute the gradient with respect to $V$ to backpropagate the error further:

$h (K, G, s)_{i, j, k} = \frac{\partial}{\partial V_{i, j, k}} J (V, K) = \sum_{l, m | s l + m = j} \sum_{n, p | s n + p = k} \sum_{q} K_{q, i, m, p} G_{i, l, n}$

Random or Unsupervised Features

The most computationally expensive part of training a convolutional network is learning the features.

Supervised training with gradient descent requires full forward and backward propagation through the entire network for every gradient update.
One approach to reduce this cost is to use features that are not learned in a supervised manner, such as random or unsupervised features.

Random Initilizations

Simply initialize the convolution kernels randomly. In high-dimension space, random vectors are almost orthogonal to each other (correlated). Features captured by different kernels are independent.

Unsupervised Learning

Learn the convolution kernels using an unsupervised criterion.

Key insight

Random filters can perform surprisingly well in convolutional networks.

Layers composed of convolution followed by pooling naturally become frequency-selective and translation-invariant, even with random weights.

Inexpensive architecture selection

Evaluate multiple convolutional architectures by training only the last layer.
Choose the best-performing architecture and then fully train it using a more intensive method.

Residual Networks (ResNet)

Overview

Deeper models are harder to train due to vanishing/exploding gradients and can be worse than shallower networks if not properly trained. There are advanced networks to deal with the degradation problem.

ResNet, short for Residual Networks, was introduced by Kaiming He et al. from Microsoft Research in 2015. It brought a significant breakthrough in deep learning by enabling the training of much deeper networks, addressing the vanishing gradient problem.

ResNet introduces the concept of skip connections (or residual connections) that allow the gradient to be directly backpropagated to earlier layers
Skip connections help in overcoming the degradation problem, where the accuracy saturates and then degrades rapidly as the network depth increases

Variants

Several variants of ResNet have been developed, including ResNet-50, ResNet-101, and ResNet-152, differing in the number of layers.

Application

ResNet has been widely adopted for various computer vision tasks, including image classification, object detection, and facial recognition.

DenseNet

Overview

DenseNet, short for Densely Connected Networks, was introduced by Gao Huang et al. in 2017.
It is known for its efficient connectivity between layers, which enhances feature propagation and reduces the number of parameters

Key Feature

The key feature is Dense Connectivity.

In DenseNet, each layer receives feature maps from all preceding layers and passes its own feature maps to all subsequent layers
This dense connectivity improves the flow of information and gradients throughout the network, mitigating the vanishing gradient problem

Echo State Network

In RNNs, the ability to capture long-term dependencies is crucial. Set the recurrent and input weights such that the recurrent hidden units do a good job of capturing the history of past inputs, and only learn the output weights. The goal is to access the information from the past implicitly.

The hidden state at time $t$ can be expressed as:

$s_{t} = σ (W s_{t - 1} + U x_{t})$

where:

$W$ represents the recurrent weight matrix, which connects previous hidden states to the current state,
$U$ is the input weight matrix, responsible for incorporating the current input $x_{t}$ ,
$σ$ is an activation function, like a non-linear function such as a sigmoid or tanh.

It is important to control how small changes in the hidden state propagate through time to ensure the network does not become unstable.

If a change $Δ s$ in the state at time $t$ is aligned with an eigenvector $v$ of the Jacobian $J$ with eigenvalue $λ > 1$ , then the small change $Δ s$ becomes $λ Δ s$ after one-time step, and $λ^{t} Δ s$ after $t$ time steps.

If the largest eigenvalue $λ < 1$ , the map from $t$ to $t + 1$ is contractive.

The network forgets information about the long-term past.

Set the weights to make the Jacobians slightly contractive. This allows the network to gradually forget irrelevant information while still remembering key long-term dependencies.

Long Delays

RNNs often fail to capture these dependencies due to the vanishing gradient problem. Long delays use recurrent connections. It knows something from the past, help vanishing gradient - even if gradient get vanished during the path, it still has the direct information from the past.

Leaky Units

In some cases, we do need to forget the path while in some we do not since we do not want to remember redundant information.

Recall that:

$s_{t} = σ (W s_{t - 1} + U x_{t})$

Then consider the following refined form of the equation (convex combination of the current state and previous through a new parameter):

$s_{t, i} = (1 - \frac{1}{τ_{i}}) s_{t - 1} + \frac{1}{τ_{i}} σ (W s_{t - 1} + U x_{t})$

where

$1 \leq τ_{i} \leq \infty$
$τ_{i} = 1$ corresponds to an ordinary RNN
$τ_{i} > 1$ allows gradients to propagate more easily
$τ_{i} ≫ 1$ means the state changes very slowly, integrating past values associated with the input sequence over a long duration

Infinity means the current state is the previous state while one means completely forgetting the previous steps and only depends on the current observations.

Gated RNNs

Defnition

It might be useful for the neural network to forget the old state in some cases like if we only care about if the current letter is a or b.

Example: $a a b b b b a a a a b a b$

It might be useful to keep the memory of the past.

Example:

Instead of manually deciding when to clear the state, we want the neural network to learn to decide when to do it.

Long-Short-Term-Memory (LSTM)

The Long-Short-Term-Memory (LSTM) algorithm was proposed in 1997 (Hochreiter and Schmidhuber, 1997). It is a type of recurrent neural network designed for approaching the vanishing gradient problem.

Several variants of the LSTM are found in the literature:

Hochreiter and Schmidhuber, 1997
Graves, 2012
Graves et al., 2013
Sutskever et al., 2014

The principle is always to have a linear self-loop through which gradients can flow for a long duration.

Gated Recurrent Units (GRU)

Here is the plain text version of the new image content:

Recent work on gated RNNs, Gated Recurrent Units (GRU), was proposed in 2014.

Cho et al., 2014
Chung et al., 2014, 2015
Jozefowicz et al., 2015
Chrupala et al., 2015

Standard RNN computes the hidden layer at the next time step directly:

$s_{t} = σ (W s_{t - 1} + U x_{t})$

There are two gates: the update gate and the reset gate. Update gate for the case that we want to keep the information around while reset gate is the case when forgetting. A temporary state locks down some of the values of the current state.

GRU first computes an update gate (another layer) based on the current input vector and hidden state:

$z_{t} = σ (U^{(z)} x_{t} + W^{(z)} s_{t - 1})$

It also computes the reset gate similarly but with different weights:

$r_{t} = σ (U^{(r)} x_{t} + W^{(r)} s_{t - 1})$

New memory content is calculated as:

$\tilde{s_{t}} = \tanh (U x_{t} + r_{t} ⊙ W s_{t - 1})$

which has current observations and forgetting something from the past

If the reset gate is close to 0, this causes the network to ignore the previous hidden state, effectively allowing the model to drop irrelevant information.

The final memory at time step $t$ is a combination of the current and previous time steps:

$s_{t} = z_{t} ⊙ s_{t - 1} + (1 - z_{t}) ⊙ \tilde{s_{t}}$

If the reset gate is close to 0, it will ignore the previous hidden state, allowing the model to discard irrelevant information.

The update gate $z_{t}$ controls how much of the past state should matter in the current time step. If $z_{t}$ is close to 1, then we can effectively copy information from the past state across multiple time steps.

Units that need to capture short-term dependencies often have highly active reset gates.

Cliping Gradients

A simple solution for clipping the gradient. (Mikolov, 2012; Pascanu et al., 2013):

Clip the parameter gradient from a mini-batch element-wise (Mikolov, 2012) just before the parameter update.
Clip the norm $g$ of the gradient $g$ (Pascanu et al., 2013a) just before the parameter update.

The formula for clipping the gradient is:

$g^{'} = min (1, \frac{c}{| g |}) g$

where c is a constant.

Attention

The attention mechanism was introduced to improve the performance of the encoder-decoder model for machine translation.

Common Representation

A single 'concept' is universally represented, transcending specific languages or forms.

Encoder: Processes the word 'elephant' from its original source.
Output: A universal representation vector (the abstract 'concept' of an elephant).
Decoders: Translate this concept into various domains or applications.

The 'concept' is an abstract entity that exists independently of any particular language or representation.

For example, if we want to translate from English to Spanish

Encoder (English Input): The system processes the word "elephant."
Output (Universal Representation): The system generates an abstract concept or vector representing an "elephant," independent of any specific language.
Decoder (Spanish Output): The system decodes this concept and outputs the equivalent Spanish word: "elefante."

Sequence-to-Sequence Model

In the sequence-to-sequence model, every word $x_{i}$ produces a hidden vector $h_{i}$ in the encoder part of the autoencoder. The hidden vector of every word, $h_{i}$ , is fed to the next hidden vector, $h_{i + 1}$ , by a projection matrix $W$ .

In this model, for the whole sequence, there is only one context vector $c$ , which is equal to the last hidden vector of the encoder, i.e., $c = h_{n}$ .

Challenges:

1. Long-range dependencies: As the model processes long sequences, it can struggle to remember and utilize information from earlier steps, especially in cases where long-term context is crucial.

2. Sequential processing: Since these models process data step by step in sequence, they can't take full advantage of parallel processing, which limits the speed and efficiency of training.

These are the core challenges that newer architectures, such as transformers, aim to address.

Attention Definition

The basic idea behind the attention mechanism is directing the focus on important factors when processing data. Attention is a fancy name for weighted average.

Sequence-to-Sequence Model with Attention

Sequence-to-sequence models:

Multiple RNN units serve as the encoder. They encode information into the context vectors. Multiple RNN units decode the concept in the context vector to different domain information. The limitations of this approach are long-range dependencies and prevention of parallelization.

$p (y_{i} | y_{1}, \dots, y_{i - 1}) = g (y_{i - 1}, l_{i}, c)$

Sequence-to-sequence with attention:

Pass multiple context vectors to the decoder.

$p (y_{i} | y_{1}, \dots, y_{i - 1}) = g (y_{i - 1}, l_{i}, c_{i})$

Sequence-to-Sequence Model with Attention

There are some calculations:

1. Similarity score: $s_{i j} = s i m i l a r i t y (l_{i - 1}, h_{j})$

2. Attention weight: $a_{i j} = \frac{e^{s_{i j}}}{\sum_{k = 1}^{T} e^{s_{i k}}}$

The attention weight $a_{i j}$ is obtained by applying a softmax function to the similarity scores. This normalizes the scores across all encoder hidden states, turning them into a probability distribution.

3. Context vector: $c_{i} = \sum_{j = 1}^{T} a_{i j} h_{j}$

The effectiveness of the correlation between inputs around position $j$ and the output at position $i$ is crucial.

This score is determined based on:

The RNN hidden state $l_{i - 1}$ just before emitting $y_{i}$ .
The $j^{t h}$ hidden state $h_{j}$ of the input sentence.

Transformer Architecture

The basic concept behind transformers is attention as summarized in the paper by Vaswani et. al 'Attention is all you need'. This paper claims that all you need is attention and with the structure of attentions, basically you can handle the sequential data. It was based on GPT and many other models that we use in LLM and imaging processing. Transformer is an example of an encoder-decoder architecture. Unlike RNNs, Transformers can process sequences in parallel, making them faster to train on large datasets. Transformers have applications beyond NLP, such as Vision Transformers (ViT) in computer vision and protein structure prediction in biology (AlphaFold). Transformers' ability to capture long-range dependencies efficiently has made them the standard for many modern AI models.

Encoder

The encoder consists of two main components: Self Attention and Feed Forward Neural Network. The architecture of the Encoder is given below:

Self Attention

Understanding the individual words in a sentence is not enough to understand the whole sentence and one needs to understand how the words relate to each other. The attention mechanism forms composite representations. We aim to have embeddings of words and embeddings of compositions at the same time in different levels. Unlike word2vec introduced in 2013 by Mikolov et al., which captures the vector representations of words in an embedding space such that words that are similar are represented closer to each other, self-attention aims to capture the similarity between words based on context and in relation to each other. Multiple layers help form complex concept representations. For example, the context of the word "bank" differs based on if the surrounding words involve "money" or "river".

Illustration of word2vec where "King" - "Man" + "Woman" produces a vector close to the word "Queen"

Self-attention is analogous to the fundamental retrieval strategy in databases where given a query, and key-value pairs, we use the query to identify the key and retrieve the corresponding value. The generalized definition for calculating the attention of a target word with respect to the input word: use the Query of the target and the Key of the input and then calculate a matching score. These matching scores act as the weights of the Value vectors.

Then we look at the definition if matrix form. Given an input vector $x \in R^{d}$ , the weights for the query, key, and value transformations are defined as:

Query weight matrix: $W_{q} \in R^{d \times p}$

Key weight matrix: $W_{k} \in R^{d \times p}$

Value weight matrix: $W_{v} \in R^{d \times r}$

The transformations for the query ( $q$ ), key ( $k$ ), and value ( $v$ ) vectors are given by:

Query vector: $q = W_{q}^{T} x$ where $q \in R^{p \times 1}$ , since $W_{q}^{T} \in R^{p \times d}$ , and $x \in R^{d \times 1}$

Key vector: $k = W_{k}^{T} x$ where $k \in R^{p \times 1}$ , since $W_{k}^{T} \in R^{p \times d}$ , and $x \in R^{d \times 1}$

Value vector: $v = W_{v}^{T} x$ where $v \in R^{r \times 1}$ , since $W_{v}^{T} \in R^{r \times d}$ , and $x \in R^{d \times 1}$

The transformations allow the attention mechanism to compute similarity scores between the query and key vectors and to use the value vectors to produce the final weighted output as $z_{1} = α_{1} v_{1} + α_{2} v_{2} + \dots + α_{n} v_{n}$ , where $v_{i}$ are similar to the input in CNNs, and $α_{i}$ are similar to the kernels in CNNs.

However, unlike the kernels in CNNs, the $α_{i}$ 's are data-dependent and given by: $α_{i} = softmax (\frac{q^{T} k_{i}}{\sqrt{p}})$ .

Extending this to the entire dataset, the equations are:

$X = [x_{1}, x_{2}, \dots, x_{n}] \in R^{d \times n}$

$Q = [q_{1}, q_{2}, \dots, q_{n}] \in R^{p \times n}$

$K = [k_{1}, k_{2}, \dots, k_{n}] \in R^{p \times n}$

$V = [v_{1}, v_{2}, \dots, v_{n}] \in R^{r \times n}$

The transformations are defined as:

$Q_{p \times n} = W_{q}^{T_{p \times d}} X_{d \times n}$

$K_{p \times n} = W_{k}^{T_{p \times d}} X_{d \times n}$

$V_{r \times n} = W_{v}^{T_{r \times d}} X_{d \times n}$

Therefore the output, $Z_{r \times n} = V_{r \times n} \cdot softmax {(\frac{Q^{T} K}{\sqrt{p}})}_{n \times n}$

Additionally, we have:

$Q^{T} K = X^{T} W_{q} W_{k}^{T} X (an asymmetric kernel extracting the global similarity between words)$

Let us take a closer look at how one word is processed in the encoder.

The input vector x is passed through a linear transformation layer which transforms the input into query q, key k, and value v vectors, given by the equations above. This is passed through a stacked multi-head self-attention layer which extracts global information across pairwise sequence of words to produce the output vector Z also given by the equation above. The equation for Z can be compared to how similarity is computed between the key and value pairs in databases. This output is added to the residual input x to preserve the meaning of individual words in addition to the pairwise representations. This is normalized to produce the output $(Z + x) \in R^{h \times r}$ . This serves as the input to the feed forward neural network.

Feed Forward Neural Network

The structure of the feed forward network (FFN) is Linear Layer 1, followed by ReLU activation and then another linear layer 2. While the attention mechanism captures global information between words and hence aggregates across columns, the feed forward neural network aggregates across rows and takes a more broader look at each word independently. Depsite the individual processing, all positions share the same set of weights and biases in the FFN. In a classroom environment, the attention mechanism is similar to the teacher observing the interactions among students in a group, whereas the feed forward neural network resembles the teacher evaluating each student independently for their understanding of the assignment. The output of the feed forward layer r also has a residual connection from the previous layer which is normalized and passed as the input of the decoder as $(r + (z + x))$ . The encoder also has a positional encoding component which captures information about the position of words in a sequence.

While the above figure zooms in at how a word vector x is processed by the encoder, the major advantage of the transformer architecture is its ability to handle multiple words in a sequence in parallel and so in practice, the above zoomed in version of the encoder is usually stacked to handle a group of words in parallel.

==== Global v.s. Local Information

For Attention Mechanism:

Global Understanding: Captures relationships among different positions in the sequence.
Context Aggregation: Spreads relevant information across the sequence, enabling each position to see a broader context.

For Feed-Forward Networks (FFN):

Local Processing: While attention looks across the entire sequence, FFN zooms back in to process each position independently.
Individual Refinement: Enhances the representation of each position based on its own value, refining the local information gathered so far.

Decoder

The decoder consist of three main components: Masked Self Attention, Cross Attention and Linear Layer. The architecture of the Decoder is given below:

Masked Self Attention

In masked self attention, we add a mask matrix $M \in R^{n \times n}$ to the normalized argument within the softmax of $Z$ given by $Z_{r \times n} = V_{r \times n} \cdot softmax {(\frac{Q^{T} K + M}{\sqrt{p}})}_{n \times n}$ The reason for adding a mask matrix M is to ensure that the output is informed only by the past words and not by the words further along in the sequence. The mask matrix therefore is given by:

$M (i, j) = {\begin{cases} 0 & if j \leq i \\ - \infty & if j > i \end{cases}$

This is because $Z$ is an upper triangular matrix with values for previous words since $s o f t m a x (x + 0)$ is the same as $s o f t m a x (x)$ , but $s o f t m a x (x - \infty)$ will be equal to 0.

Cross Attention

The intuition behind cross attention is similar to sequence-to-sequence models where the context vector is passed from the encoder to the decoder capturing relevant information from the input sequence. The residual connection plus the output of the masked self attention is passed as the query to the cross attention block, whereas the key and value pairs are the same as the output of the encoder. The relationship and relevance between words in different sentences are captured.

Linear Layer

The linear layer is applied to the output of the feed forward neural network of the decoder and its primary role is to adjust the dimensionality of the network to match the size of the vocabulary. It involves learning a set of parameters - a matrix of dimension equal to $p \times l e n (v o c a b)$ . This results in a $p \times 1$ vector which is then passed through a probabilistic softmax layer to predict the next word in the sequence.

Softmax Activation

The function transforms the linear layer's output into probabilities as mentioned above. The output represents the likelihood of a respective word being the next word in the sequence.

Positional Encoding

So far, the encoder and decoder has no sense of the order of the words in the sequence. The sentences "I am a teacher", "Teacher I am", "Am I a teacher?" are processed the same way, even though the meaning may not be the same. To circumvent this issue, and due to the lack of the convolution or recurrence operations, a positional encoding scheme is embedded within the encoder and decoder. In this encoding, the position of the words in a sequence is encoded by a vector. The even positions are represented by a sinusoidal wave with 1 representing the peaks and 0 representing the troughs. Similarly, the odd positions are represented by a cosine wave. Thus, each position is encoded by a unique binary vector and each unique binary vector represents a specific position.

Bidirectional Encoder Representations from Transformers (BERT)

BERT introduced by Google is built by repeating the encoder of the transformer multiple times. The Bidirectional in BERT refers to its ability to attend to the future and past words in the sequence unlike Transformers which only looks at the past to make predictions about the future. The fundamental principles behind BERT are Masked Language Modeling, and Next Sentence Prediction. The architecture of BERT is given below.

Masked Language Modeling

BERT masks words in a corpus as shown in the figure below and makes the model learn to predict these words. It thereby pays attention to both words before and after the masked word and fills in the blank given the entire context. It is the same as a Transformer Encoder where 12 encoders (in contrast to the 6 in the original paper of Transformers) are stacked on top of each other. The output of the encoder is passed to s linear dense layer and softmax to predict the most probable word from the vocabulary of the corpus.

Next Sentence Prediction

It was also used to identify if given two sentences A and B, B logically follows the sentence A. This is possible due to the ability to fine-tune the weights of the pretrained BERT model for downstream prediction tasks. The pre-trained model can also be used to represent the input of a sentiment analysis task (for example) to a neural network in a better manner. The [CLS] token is prepended to the input sequence and is used to capture the context of the whole sentence such that during fine-tuning, the prediction may be conditioned on the representation of the [CLS] token.

There are various flavors of BERT based on slight modifications to the original architecture and training processes. The advantage of fine tuning pretrained models is the opportunity to finetune them on a domain specific corpus leading to additional variants of BERT as BioBERT retrained on a biomedical corpus, BERTweet trained on around 850 million tweets and so on.

Transformers and GPT Models

Overview of Transformers

Transformers consist of two main components:
- Encoder
- Decoder
BERT utilizes a stack of encoders, while GPT uses a stack of decoders.
GPT models are neural network-based language prediction models built on Transformer architecture.

Decoder Structure

The decoder has three parts:
1. Masked Multi-Head Attention: Attends only to the left.
2. Cross Attention: Attends to encoded inputs (removed in GPT).
3. Feed Forward Neural Network.

Differences Between BERT and GPT

Both BERT and GPT use masked multi-head attention, but unlike BERT that masks a word in the middle of the sentence, and tries to predict the mask, GPT masks all the future words and tries to predict them.
Training methods differ:
- BERT masks tokens in sentences.
- GPT predicts the next token in a sequence.

Training Process

In GPT, the model predicts the next token based on the previous tokens, treating the input as a sequence.
The first GPT model (2018) had 117 million parameters and was trained on 7,000 books.
It was seen that larger models tend to generalize better and have better "emergent abilities", although the reason is still unclear.

Evolution of GPT Models

GPT-2 (2019) had 1.5 billion parameters.
GPT-3 (2020) had 175 billion parameters.
Larger models tend to generalize better. This trend suggests that increasing model size allows for deeper contextual understanding.

Chain of Thought Reasoning

GPT-4 introduced a method called “chain of thought,” allowing the model to predict intermediate steps in reasoning tasks. It involves multi-step thinking: The model generates a series of logical steps to arrive at the final answer, rather than answering in a single sentence.
Example: Problem: What is the result of 5 + 5 + 5? It is clear to see the answer is 15. With the chain of thought reasoning - Step 1: The first 5 plus the second 5 gives 10 (i.e. $5 + 5 = 10$ ). Step 2: add the last 5 to the result from Step 1 (i.e. $10 + 5 = 15$ ).

Alignment and Instruction Following

ChatGPT is designed to follow instructions and align with user preferences, addressing issues like harmful or politically incorrect outputs.
InstructGPT is a variant of GPT that focuses on following instructions using human feedback via Reinforcement Learning from Human Feedback (RLHF).

Text-Text Transfer Transformer (T5)

T5 combines encoder and decoder structures and treats various NLP tasks as text-to-text problems, allowing for diverse applications like translation and summarization.
NLP tasks such as sentiment analysis which were primarily treated as classification problems are cast as text-to-text problems.
Next span prediction is used where a set of sequential tokens are removed and replaced with sentinel tokens.
T5 architecture operates on an encoder-decoder model and encoder input padding can be performed on both left and right.

Training Techniques in T5

T5 masks spans of tokens rather than individual tokens, focusing on global context rather than local dependencies. 15% of tokens are randomly removed and replaced with sentinel tokens.

Benchmarking and Performance

Models are often evaluated against benchmarks like GLUE and SuperGLUE, which consist of various NLP tasks.

Training and Fine-Tuning LLMs

LLMs are trained on massive datasets of unlabeled text using unsupervised learning.
The training process involves predicting the next token in a sequence.
Fine-tuning is done on labeled data to align the model with specific tasks or instructions. This process is called Supervised Fine-Tuning (SFT).

Aligning LLMs with Human Feedback

Reinforcement Learning with Human Feedback (RLHF) is used to ensure LLMs produce safe and ethical outputs.
RLHF involves generating multiple responses for a given prompt and having humans rank them.
A reward model is trained on this ranked data to predict the quality of a response.
The original LLM is then fine-tuned using this reward model.
RLHF was introduced in 2017 and has been widely used in LLMs like ChatGPT.

Direct Preference Optimization (DPO)

DPO is a newer approach that aims to replace RLHF.
It directly collects user preferences for different responses, eliminating the need for a reward model.
DPO is easier to implement and more stable than RLHF.

Project Ideas for Deep Learning

Enhancing Speculative Decoding for Faster Large Language Modeling
- This project aims to improve the speed of LLMs by using techniques like rejection sampling and importance sampling.
- The goal is to find alternative sampling methods to rejection sampling, which is computationally expensive.
Reducing the Computational Complexity of Transformers
- Transformers are computationally expensive, especially for long sequences.
- This project explores methods like ORCID, which uses data-dependent convolution to reduce complexity.
- The goal is to develop more efficient transformers that can handle longer sequences.
Diffusion Decoding for Peptide De Novo Sequencing
- Peptide sequencing is a crucial problem in bioinformatics, involving determining the amino acid sequence of a peptide.
- This project proposes using diffusion models for peptide sequencing, replacing the traditional GPT-based approach.
- Diffusion models can potentially improve accuracy by predicting all tokens simultaneously, rather than sequentially.
Using ORCID for DNA Analysis
- This project explores using ORCID, a model that can handle larger sequences, for DNA analysis.
- The goal is to leverage ORCID’s ability to handle long sequences to improve DNA analysis tasks.
Symbolic Regression with Diffusion Models
- Symbolic regression involves finding a mathematical formula that best fits a given dataset.
- This project proposes using diffusion models for symbolic regression, potentially improving the accuracy and efficiency of the process.

Transformers and Variational Autoencoders

Large Language Models (LLMs) and Transformers

Key Components of Transformers

Attention Mechanism Formula: The core of the attention mechanism is computed using the following formula:

V softmax (\frac{Q^{T} K}{\sqrt{p}}) (n x n)

Dimensions:
- $X$ as a $d \times n$ matrix, $n$ as sequence length, and $d$ as data dimensionality.
- $Q$ , $K$ and $V$ are the query, key, and value matrices, respectively, and they are derived from $W_{Q}$ , $W_{K}$ and $W_{V}$ .
- Given a sequence $X = {[\begin{matrix} x_{1} & \dots & x_{n} \end{matrix}]}_{d \times n}$ we define:
  - $Q = W_{Q}^{T} X (p x n), W_{Q}^{T} \in R^{p \times d}, X \in R^{d \times n}$
  - $K = W_{K}^{T} X (p x n), W_{K}^{T} \in R^{p \times d}$
  - $V = W_{V}^{T} X (m x n), W_{V}^{T} \in R^{m \times d}$

Recap of Transformers

Building Blocks of LLMs: Transformers.
Computational Complexity:
- Issue: Transformers have quadratic complexity concerning sequence length.
- Reason: Complexity arises due to calculations in the attention mechanism.

Approaches to Reduce Complexity

Issue: Long sequences result in an $n \times n$ matrix, making computation demanding.
Solution: Techniques like the Performer model approximate attention, reducing complexity to linear time.

Performer Model

Overview

Objective: Address quadratic complexity in transformers by approximating attention using kernel methods.
Kernel Approximation: Utilizes random features to approximate kernels, e.g., Gaussian kernels. The kernel function $K (x, y)$ is defined as:

K_{Gauss} (x, y) = φ (x)^{T} φ (y)

For the Gaussian kernel approximation, the transformation $φ_{Gauss} (x)$ is given by:

φ_{Gauss} (x) = \frac{1}{\sqrt{r}} (\begin{matrix} \sin (ω_{1}^{T} x), & \sin (ω_{2}^{T} x), & \dots, & \sin (ω_{r}^{T} x), \cos (ω_{1}^{T} x), & \cos (ω_{2}^{T} x), & \dots, & \cos (ω_{r}^{T} x) \end{matrix})

In general for kernel $K$ :

ϕ (x) = \frac{h (x)}{\sqrt{r}} (f_{1} (w_{1}^{T} x) \dots f_{1} (w_{r}^{T} x) \dots f_{l} (w_{1}^{T} x) \dots f_{l} (w_{r}^{T} x)) (l x r)

Advantage

Scalability: Performers are highly scalable for very long sequences.

Memory Efficiency: Performers drastically reduce memory usage by linearizing attention.

Robust Performance: Despite the approximation, Performers retain high accuracy and performance, comparable to standard transformers.

Application

Performers are beneficial in applications requiring efficient processing of long sequences, such as natural language processing, DNA sequence analysis, and other domains where traditional transformers are computationally prohibitive.

Softmax and Attention Mechanism in Transformers

Softmax Formula

In the context of transformers, the softmax function is used to normalize the attention scores:

SM [s]_{i} = \frac{e^{s_{i}}}{\sum_{j} e^{s_{j}}}

To calculate the softmax numerically, we can consider:

$e^{Q^{T} K} = A$

$d = {[\begin{matrix} \end{matrix}]}_{n \times n} {[\begin{matrix} 1 \\ 1 \\ ⋮ \\ 1 \end{matrix}]}_{n \times 1}$

$D = diag (d) = {[\begin{matrix} d_{1} & 0 & \dots & 0 \\ 0 & d_{2} & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & d_{n} \end{matrix}]}_{n \times n}$

$SM [s]_{i} = \frac{e^{s_{i}}}{\sum_{j} e^{s_{j}}} = A D^{- 1}$

We want to be able to do the matrix multiplication in the following way, which has linear time complexity for long sequences in transformers:

[\begin{matrix} V & Φ^{T} & Φ \end{matrix}] with dimensions: [\begin{matrix} m \times n & n \times r^{'} & r^{'} \times n \end{matrix}] = m \times n

Steps in Kernel Approximation

Kernel Formulation: Approximate $ϕ (x)^{T} ϕ (y)$ as a function of random vectors.
Random Feature Transformation: Performers use sinusoidal functions (sine and cosine) as random features for approximating similarity between sequences.
The softmax can also be approximated using random features, with a specific $φ (x)$ tailored for softmax.

e^{- \frac{1}{2} ‖ x - y ‖^{2}} = e^{- \frac{(x - y)^{T} (x - y)}{2}} = e^{- \frac{x^{T} x + y^{T} y - 2 x^{T} y}{2}} = e^{- \frac{x^{T} x}{2}} \cdot e^{- \frac{y^{T} y}{2}} \cdot e^{x^{T} y}

e^{x^{T} y} = e^{- ‖ x - y ‖^{2}} \cdot e^{\frac{x^{T} x}{2}} \cdot e^{\frac{y^{T} y}{2}} = φ_{Gauss} (x)^{T} φ_{Gauss} (y) \cdot e^{\frac{x^{T} x}{2}} \cdot e^{\frac{y^{T} y}{2}} = φ_{Gauss} (x)^{T} e^{\frac{x^{T} x}{2}} \cdot φ_{Gauss} (y) e^{\frac{y^{T} y}{2}}

So, the kernel approximation for the softmax is:

φ_{SM} (x) = \frac{e^{\frac{x^{T} x}{2}}}{\sqrt{r}} (\begin{matrix} \sin (ω_{1}^{T} x), & \sin (ω_{2}^{T} x), & \dots, & \sin (ω_{r}^{T} x), \cos (ω_{1}^{T} x), & \cos (ω_{2}^{T} x), & \dots, & \cos (ω_{r}^{T} x) \end{matrix})

Variational Auto-encoders (VAEs)

Overview: VAEs are a type of generative model that extends traditional auto-encoders by introducing stochastic elements. While traditional auto-encoders are deterministic and focus on reconstructing inputs, VAEs aim to learn a distribution over the latent space. This approach extends traditional autoencoders by introducing a probabilistic framework that enables them to capture complex data distributions.

Background: Auto-encoders

Basic Structure: Consists of an encoder, a bottleneck layer, and a decoder.
Objective: Maps input $x$ into a lower-dimensional representation ( $z = u^{T} x$ ) by compressing it and reconstructs it. It minimizes the differences in $x$ and the reconstructed version of it ( $\hat{x} = u z$ ): $min | x - \hat{x} |$
PCA Connection: A simple, linear auto-encoder resembles PCA in dimensionality reduction.
Limitation: While VAEs are powerful, they may struggle to capture highly complex data distributions compared to more recent generative models like GANs.

Moving to Variational Auto-encoders

Generative Model Aspect: Variational Auto-encoders enforce a specific distribution on $z$ , allowing them to generate new samples.
Gaussian Distribution Constraint: By enforcing a Gaussian distribution on $z$ , the model can generate new samples in the learned data distribution. So Variational Auto-encoder introduces a prior distribution (typically Gaussian) on $z$ and maximizes the Evidence Lower Bound (ELBO) instead of reconstruction alone.

Approximation of the Posterior Distribution

p (z | x) = \frac{p (x | z), p (z)}{p (x)}

p (x) = \int_{z} p (x | z), p (z), d z

$q_{θ} (z)$ is the approximate posterior distribution of the latent variable $z$ , parameterized by $θ$ .
Purpose of $q_{θ} (z)$ : In a VAE, we aim to learn a latent variable $z$ that represents the underlying structure of the data $x$ . Ideally, we want the true posterior $p (z | x)$ , which is often intractable to compute directly. To address this, we approximate $p (z | x)$ with a simpler distribution $q_{θ} (z | x)$ , where $θ$ represents the parameters (typically learned through a neural network).
Key Points about $q_{θ} (z)$ :
- Approximate Posterior: $q_{θ} (z | x)$ is an approximation of the true posterior $p (z | x)$ .
- Parameterized by $θ$ : The parameters $θ$ define the structure of this distribution, often through a neural network in a VAE.
- Optimization: During training, we optimize $θ$ to make $q_{θ} (z | x)$ as close as possible to $p (z | x)$ by minimizing the KL divergence between $q_{θ} (z | x)$ and $p (z | x)$ : $min_{θ} KL (q_{θ} (z) | | p (z | x))$

Information Theory

In VAEs, the model learns a latent space where each data point is mapped not to a single point, but to a probability distribution. This design choice allows the VAE to: Capture the information content of the input data in a compressed, structured form. Represent the uncertainty in the data by encoding it as a distribution rather than as a single deterministic vector. The VAE learns this by balancing reconstruction accuracy with the information-theoretic regularization of the latent space, which enforces a smooth and organized representation.

Information content:
- $I$ , represents the information content or self-information of an event with probability $p (x)$ . It quantifies how much “surprise” or “information” is associated with a specific outcome $x$ occurring.
- The formula for information content $I$ of an event $x$ is: $I = - \log p (x)$
- $I$ is higher for less probable events (low $p (x)$ ), indicating more “surprise” or “information” content when that event occurs.
- In information theory, this concept helps quantify the amount of information gained from observing an event, with rare events providing more information than common ones.
- For example, given a sentence "Tomorrow, it either rains, or not." $p (x) = 1$ and so $I = - l o g (1) = 0$ since there is no insight to glean in that sentence.
Entropy:
- $H$ represents entropy. Entropy measures the average amount of information (or “uncertainty”) contained in a random variable. It is often used to quantify the uncertainty or unpredictability of a probability distribution.
- The formula for entropy $H$ of a discrete random variable $x$ with a probability distribution $p (x)$ is: $H = - \sum p (x) \log p (x)$
- $H$ is maximized when the distribution is most uncertain (e.g., in a uniform distribution).
- In the context of VAEs, entropy is used to understand the distribution of latent variables and how much “information” or “uncertainty” they contain.
KL divergence:
- KL divergence, stands for the Kullback-Leibler divergence between two probability distributions $q$ and $p$ . It is a measure of how one probability distribution $q$ diverges from a second, reference probability distribution $p$ .
- The KL divergence from $q (x)$ to $p (x)$ is defined as: $KL (q | | p) = \sum q (x) \log q (x) - \sum q (x) \log p (x) = \sum_{x} q (x) \log \frac{q (x)}{p (x)}$
- KL divergence is not symmetric; $KL (q ‖ p) \neq KL (p ‖ q)$ . This means that it matters which distribution we consider as the reference.
- KL divergence is always non-negative, $KL (q ‖ p) \geq 0$ , and equals zero if and only if $q = p$ (almost everywhere). This is known as Gibbs’ inequality.
- In variational inference and variational auto-encoders (VAEs), KL divergence is used to measure the difference between the approximate posterior $q_{θ} (z | x)$ and the true posterior $p (z | x)$ . Minimizing $KL (q_{θ} (z | x) | p (z | x))$ helps make the approximation $q_{θ} (z | x)$ closer to the true posterior.
- KL divergence quantifies the “information loss” if we approximate $p (x)$ with $q (x)$ . In essence, it tells us how much extra information (or “surprise”) is needed to describe samples from $q$ as if they were from $p$ .
- For the continuous space:

KL (q (z) ‖ p (z | x)) = - \int q (z) \log \frac{p (z | x)}{q (z)} d z

p (z | x) = \frac{p (x | z) p (z)}{p (x)}

p (x | z) p (z) = p (x, z)

KL (q (z) ‖ p (z | x)) = - \int q (z) \log \frac{p (x, z)}{q (z) \cdot p (x)} d z

= - \int q (z) [\log \frac{p (x, z)}{q (z)} + \log \frac{1}{p (x)}] d z

= - \int q (z) \log \frac{p (x, z)}{q (z)} d z + \int q (z) \log p (x) d z = - \int q (z) \log \frac{p (x, z)}{q (z)} d z + \log p (x) \int q (z) d z

\int q (z) d z = 1

\log p (x) = KL (q (z) ‖ p (z | x)) + \int q (z) \log \frac{p (x, z)}{q (z)} d z = KL (q (z) ‖ p (z | x)) + ELBO

ELBO = \int q (z) \log \frac{p (x, z)}{q (z)} d z

= \int q (z) \log \frac{p (x | z) p (z)}{q (z)} d z

= \int q (z) [\log p (x | z) + \log \frac{p (z)}{q (z)}] d z

= \int q (z) \log p (x | z) d z + \int q (z) \log \frac{p (z)}{q (z)} d z

= E_{q (z)} [\log p (x | z)] - KL (q (z) ‖ p (z))

ELBO = E_{q_{θ} (z | x)} [\log p (x | z)] - KL (q_{θ} (z | x) | p (z))

KL Divergence and Evidence Lower Bound (ELBO)

KL Divergence: Measures the similarity between two distributions; minimized to bring $q$ close to $p$ .
Evidence Lower Bound (ELBO):
- Variational auto-encoders maximize ELBO which is similar to minimizing KL divergence since $l o g (p (x))$ is a constant.
- Maximizing ELBO ensures $q$ closely approximates $p$ , meaning maximizing ELBO ensures that the learned latent distribution is close to the target distribution. This approach allows VAEs to generate new samples by sampling from the latent space.

ELBO Decomposition

ELBO Equation: $E L B O = q (z) l o g p (x | z) - K L [q (z) | | p (z)]$
First Term: Reconstruction likelihood (maximize this).
Second Term: Ensures latent variable $z$ approximates the desired distribution.

Practical Implementation: Reconstruction Loss and Gaussian Constraint

Objective in VAE Training:
- Minimize reconstruction loss (similar to traditional auto-encoders).
- Ensure $z$ follows a Gaussian distribution.
Regularization Term in ELBO: In VAEs, the Evidence Lower Bound (ELBO) includes a KL divergence term to ensure that the learned latent distribution remains close to a prior distribution (e.g., a Gaussian $p (z) = N (0, I)$ ).

Re-parameterization Trick

Challenge: The stochastic nature of $z$ complicates backpropagation.
Solution: To make the stochastic layer differentiable, VAEs use the re-parameterization trick, where $z$ is expressed as a deterministic function with a noise component. So we use re-parameterization to enable gradient-based optimization and we minimize:

min (| x - \hat{x} |^{2} + KL (q (z) ‖ N (0, 1)))

Reverse Processes

We need to know the reverse diffusion process $p_{θ} (x_{t - 1} | x_{t}) = N (x_{t - 1}; μ_{θ} (x_{t}, t), Σ_{θ} (x_{t}, t))$
Mean of q(x_{t-1}|x_t,x_0):

$\begin{aligned} q (x_{t - 1} | x_{t}, x_{0}) & = q (x_{t} | x_{t - 1}, x_{0}) \frac{q (x_{t - 1} | x_{0})}{q (x_{t} | x_{0})} \\ \propto \exp (- \frac{1}{2} (\frac{(x_{t} - \sqrt{α_{t}} x_{t - 1})^{2}}{β_{t}} + \frac{(x_{t - 1} - \sqrt{{\bar{α}}_{t - 1}} x_{0})^{2}}{1 - {\bar{α}}_{t - 1}} - \frac{(x_{t} - \sqrt{{\bar{α}}_{t}} x_{0})^{2}}{1 - {\bar{α}}_{t}})) \\ = \exp (- \frac{1}{2} ((\frac{α_{t}}{β_{t}} + \frac{1}{1 - \bar{α_{t - 1}}}) x_{t - 1}^{2} - (\frac{2 \sqrt{α_{t}}}{β_{t}} x_{t} + \frac{2 \sqrt{\bar{α_{t - 1}}}}{1 - \bar{α_{t - 1}}} x_{0}) x_{t - 1} + C (x_{t}, x_{0}))) \end{aligned}$

, where $C (x_{t}, x_{0})$ is not a function of $x_{t - 1}$

Define $α_{t} = 1 - β_{t}, \bar{α_{t}} = \prod_{i = 1}^{T} α_{i}$ ,

$\tilde{μ_{t}} (x_{t}, x_{0}) = \frac{\sqrt{α_{t}} (1 - \bar{α_{t - 1}})}{1 - \bar{α_{t}}} x_{t} + \frac{\sqrt{\bar{α_{t - 1}}} β_{t}}{1 - \bar{α_{t}}} x_{0}$

Since $x_{0} = \frac{1}{\sqrt{\bar{α_{t}}}} (x_{t} - \sqrt{1 - \bar{α_{t}}} ϵ_{t})$ , we have

$\tilde{μ_{t}} (x_{t}, x_{0}) = \frac{1}{\sqrt{\bar{α_{t}}}} (x_{t} - \frac{1 - α_{t}}{\sqrt{1 - \bar{α_{t}}}} ϵ_{t})$

KL for two Gaussian

Between forward process and backward process:

$D_{K L} (q (x_{t - 1} | x_{t}, x_{0}) | p_{θ} (x_{t - 1} | x_{t}))$

For the distributions:

$q (x) = N (x; μ_{q}, Σ_{q})$ and $p (x) = N (x; μ_{p}, Σ_{p})$ ,

The KL divergence formula is:

$D_{K L} (q | p) = \frac{1}{2} (tr (Σ_{p}^{- 1} Σ_{q}) + (μ_{p} - μ_{q})^{T} Σ_{p}^{- 1} (μ_{p} - μ_{q}) - k + \ln (\frac{det Σ_{p}}{det Σ_{q}}))$

For $q$ :

${\tilde{μ}}_{t} (x_{t}, x_{0}) = \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - {\bar{α}}_{t}}} ϵ)$

For $p$ :

$μ_{θ} (x_{t}, t) = \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{β_{t}}{\sqrt{1 - \bar{α} t}} ϵ θ (x_{t}, t))$

Loss function

The loss function is defined as:

$L_{t} = E x_{0}, ϵ [\frac{1}{2 | Σ θ (x_{t}, t) |_{2}^{2}} | \tilde{μ} t (x_{t}, x_{0}) - μ θ (x_{t}, t) |^{2}]$

Substituting $\tilde{μ} t$ and $μ θ$ :

$L_{t} = E x_{0}, ϵ [\frac{1}{2 | Σ θ |_{2}^{2}} {| \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{1 - α_{t}}{\sqrt{1 - {\bar{α}}_{t}}} ϵ_{t}) - \frac{1}{\sqrt{α_{t}}} (x_{t} - \frac{1 - α_{t}}{\sqrt{1 - \bar{α} t}} ϵ θ (x_{t}, t)) |}^{2}]$

Simplifying:

$L_{t} = E_{x_{0}, ϵ} [\frac{(1 - α_{t})^{2}}{2 α_{t} (1 - \bar{α} t) | Σ θ | 2^{2}} | ϵ_{t} - ϵ θ (x_{t}, t) |^{2}]$

Expressing $x_{t}$ in terms of $x_{0}$ and $ϵ$ :

$L_{t} = E_{x_{0}, ϵ} [\frac{(1 - α_{t})^{2}}{2 α_{t} (1 - \bar{α} t) | Σ θ | 2^{2}} | ϵ_{t} - ϵ θ (\sqrt{{\bar{α}}_{t}} x_{0} + \sqrt{1 - {\bar{α}}_{t}} ϵ, t) |^{2}]$

Graph Neural Network (GNN)

Graph Neural Networks are a class of deep learning methods designed to perform inference on data described by graphs.

Application

Graphs are Structured Data: Many real-world datasets can be represented as graphs. Examples: social networks, protein-interaction networks, the World Wide Web, and molecules, where entities and their relationships are mapped as nodes and edges. In social networks, each user can be represented as a node and the relationships or interactions between them — such as friendships and follows — are represented as edges connecting the nodes. GNN can be used to recommend friends and content by learning from the network structure. Also, GNN can help to track the spread of information or behaviors across the network.

Images as Graphs: Images can be considered as graphs where each pixel acts as a node. In this representation, each pixel node connects to its adjacent pixels through edges and form a grid-like graph structure that captures the spatial relationships in the image.

Text as Graphs: Text data can also be represented as graphs. Here, each token or word is a node, and each node is connected by edges to the preceding and following tokens, which can capture the sequential dependencies in the text.

CNN as GNN: Convolutional Neural Networks (CNNs) can be viewed as a specialized form of Graph Neural Networks (GNNs) applied to grid-like structures. Each pixel in an image represents a node connected to its neighbors, with convolutions aggregating features across the whole grid. The main goal is to check if convolution can be defined on random graphs.

Tasks

There are 3 different levels of tasks.

Graph-Level Tasks: These involve predicting a property of the entire graph. For example, in molecular graphs, we may predict whether a particular molecule will bind to a receptor based on the graph structure. Examples: Images, Differentiate between molecules.
Node-Level Tasks: These focus on predicting the identity or role of each individual node within the graph. This could involve classifying nodes based on their attributes or their position within the network. Example: Node belonging to a network.
Edge-Level Tasks: These involve analyzing or predicting relationships between pairs of nodes, such as determining whether two nodes in a social network graph are friends or not. Example: Recommender Systems.

Graph

Definition

A graph, denoted as $G$ , can be defined by a set of vertices (or nodes) $V$ and edges (connections between the vertices) $V$ , along with an adjacency matrix $A$ . This matrix represents the relationships between vertices in a binary format.

In graph notation, we write this as $G = (V, E, A)$ , where:

$V$ : Vertices or nodes in the graph.
$E$ : Edges, which indicate connections between pairs of vertices.
$A$ : Adjacency matrix, a binary matrix that indicates the presence (1) or absence (0) of an edge between vertex pairs.

The adjacency matrix is a binary matrix indicating whether pairs of vertices are adjacent and is structured so that:

The entry $A_{i j} = 1$ if there is an edge between vertex $i$ and vertex $j$ . The entry $A_{i j} = 0$ if there is no edge between vertex $i$ and vertex $j$ .

In this way, the adjacency matrix provides a clear view of the connections within the graph.

Laplacian of a Graph

Laplacian Matrix Definition

$L = D - A$

$D$ : Degree matrix, which is a diagonal matrix where each diagonal entry represents the degree of each vertex.

$A$ : Adjacency matrix, as previously defined.

Degree Matrix Notation

The degree matrix $D$ is a square matrix in which each diagonal element $d_{i}$ represents the degree of the corresponding vertex $i$ . $D = {[\begin{matrix} d_{1} & 0 & \dots & 0 \\ 0 & d_{2} & \dots & 0 \\ ⋮ & ⋮ & ⋱ & ⋮ \\ 0 & 0 & \dots & d_{n} \end{matrix}]}_{n \times n}$

Degree of a Vertex is defined as $d_{i} = \sum_{j} A_{i j}$ . The degree $d_{i}$ is the sum of the elements in the $i$ -th row of $A$ , representing the number of edges connected to vertex $i$ .

Example

For an adjacency matrix $A$ as follows:

$A = [\begin{matrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{matrix}]$

The sum of each row (representing each vertex's degree) results in:

$Sum = [\begin{matrix} \sum A_{1 j} \\ \sum A_{2 j} \\ \sum A_{3 j} \\ \sum A_{4 j} \end{matrix}] = [\begin{matrix} 2 \\ 2 \\ 3 \\ 1 \end{matrix}]$

This yields the degree matrix $D$ as:

$D = [\begin{matrix} 2 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 1 \end{matrix}]$

The Laplacian matrix $L$ is then calculated as:

$L = D - A$

Graph Cut Optimization Problem

The graph cut optimization problem aims to partition a graph into segments with minimal interconnections between them.

Objective: The goal is to find a cut that divides the graph into distinct segments while minimizing the connections between them.
Minimization Target: The target function to minimize is $\sum A_{i j} (y_{i} - y_{j})^{2}$ , which captures the cost associated with the "cut" between segments.
Equation: The minimization can be expressed as $y^{T} L y$ , where $L$ is the Laplacian matrix, and $y$ is a vector representing the segment assignment of each node.
Constraint Applied: The constraint $y^{T} y = 1$ is applied to ensure non-trivial solutions.
Vector $y$ : This vector represents the assignment of nodes to segments, with dimensions $n \times 1$ , where $n$ is the number of nodes.
Solution Method: Optimal partitioning is achieved by performing eigen decomposition on the Laplacian matrix $L$ .

Eigen Decomposition of Laplacian

The Laplacian matrix $L$ can be decomposed as $L = U Λ U^{T}$ , where:

$U$ represents the orthonormal eigenvectors. These eigenvectors are both orthogonal to each other and normalized. $Λ$ is a diagonal matrix containing the eigenvalues. Each diagonal entry in $Λ$ corresponds to an eigenvalue that pairs with an eigenvector in $U$ .

Starting with the standard Laplacian, defined as $L = D - A$ :

$D$ is the degree matrix. $A$ is the adjacency matrix. The normalized Laplacian is given by: $\tilde{L} = D^{- \frac{1}{2}} (D - A) D^{- \frac{1}{2}}$

This expression can be simplified to: $\tilde{L} = I - D^{- \frac{1}{2}} A D^{- \frac{1}{2}}$ , where $I$ is the identity matrix.

Laplacian Eigenvectors and Fourier Analysis Analogy

Mathematical Definition: Decomposition of a signal into sinusoidal components.
Frequency Domain: A signal is transformed to represent it as a sum of its frequency components.
Basis Functions: Sine and cosine functions serve as the basis for this transformation.
Orthogonality: These basis functions are orthogonal, ensuring unique frequency representation.
Graph Signals: Functions defined over the nodes of a graph.
Spectral Domain: Eigenvectors of $L$ transform graph signals into the spectral domain.
Eigenvector Basis: Analogous to sines and cosines, eigenvectors form a basis for graph signals.
Orthogonality: Eigenvectors are orthogonal, providing a unique spectral representation for graph signals.
Low Eigenvalues: Correspond to "low-frequency" eigenvectors. These eigenvectors change slowly over the graph. Represent large-scale, smooth structures in the graph.
High Eigenvalues: Correspond to "high-frequency" eigenvectors. These eigenvectors change rapidly between connected nodes. Capture fine details or irregularities in the graph.
Ordering: Eigenvalues (and eigenvectors) are ordered from lowest to highest.

Fourier Transform

The eigen-decomposition of the graph Laplacian $L$ can be written as: $L = U^{T} Λ U$ , where:

$U = [u_{1}, . . ., u_{n}]$ is the matrix of orthonormal eigenvectors.
$Λ$ is a diagonal matrix where each entry $Λ_{i i} = λ_{i}$ represents an eigenvalue (also known as the spectrum of the Laplacian).
The equation $U^{T} U = I$ indicates that the eigenvectors are orthonormal.

The eigenvectors from $u_{1}$ to $u_{n}$ are also called Fourier functions, as they serve as a basis for transforming signals on the graph.

The Fourier transform projects a signal $x$ onto the Fourier functions, resulting in coefficients that form the Fourier series.

In the context of graphs, the graph Fourier transform projects an input graph signal onto an orthonormal basis formed by the eigenvectors of the normalized graph Laplacian.

Suppose $x \in R^{n}$ is a feature vector containing the values of all nodes in a graph, where $x_{i}$ represents the value at the $i$ -th node.

The graph Fourier transform of a signal $x$ is given by: $f (x) = U^{T} x = \hat{x}$
The inverse graph Fourier transform is defined as: $f^{- 1} (\hat{x}) = U \hat{x} = U U^{T} x = x$

ConvGNNs

The graph convolution of an input signal $x$ with a filter $g \in R^{n}$ is defined as follows:

Convolution Theorem: $f (x * g) = (f (x) \cdot f (g))$

This can be expanded as:

$x * g = f^{- 1} (f (x) \cdot f (g))$
$= U (U^{T} x \cdot U^{T} g)$
$= U g_{θ} U^{T} x$

Where $g_{θ} = diag (U^{T} g)$ .

Neural Network Layers

Feedforward Neural Networks (FFNN):

Computation proceeds in layers, with each layer represented as $h^{'} = σ (W h)$ . The output of each layer $h^{'}$ serves as the input for the next layer. The initial input $h_{0}$ is the feature vector $x$ .

Convolutional Neural Networks (CNN):

Layer computation involves convolution, with each layer defined as $h^{'} = σ (w * h)$ . $w$ represents the learned filters or kernels that slide over $h$ .

Vanilla Spectral GNN

The graph convolutional layer of the Spectral CNN is defined as:

$x * g = f^{- 1} (f (x) \cdot f (g))$ $= U (U^{T} x \cdot U^{T} g)$ $= U g_{θ} U^{T} x$

With:

$H^{'} = σ (U Θ U^{T} H)$
$X = H^{(0)}$
$g_{θ} = Θ$

Here, $Θ$ is a diagonal matrix with learnable parameters, where $U^{T} g = [\begin{matrix} u_{1}^{T} g ⋮ u_{n}^{T} g \end{matrix}]$ and $Θ = [\begin{matrix} u_{1}^{T} g & \dots & u_{n}^{T} g \end{matrix}]$ .

Limitation: Eigen-decomposition requires $O (n^{2})$ computational complexity.

ChebNet

Approximates the filter $g_{θ}$ by Chebyshev [*] polynomials of the diagonal matrix of eigenvalues $Λ$ .

Chebyshev Polynomials of the first kind

$T_{0} (x) = 1$
$T_{1} (x) = x$
For $i \geq 2$ , the polynomials are defined as: $T_{i} (x) = 2 x T_{i - 1} (x) - T_{i - 2} (x)$

ChebNet Graph Convolution

$g_{θ} = \sum_{i} θ_{i} T_{i} (\tilde{A})$ .
$g_{θ}$ is a Graph convolutional filter with parameter $θ$ .
$\tilde{A}$ is are scaled eigenvalues of the Laplacian matrix: $\tilde{A} = \frac{2 A}{λ_{m a x}} - I_{n}$ , which makes the normalized eigenvalues fall into [-1,1].
$x * g = U g_{θ} U^{T} x$
$x * g = \sum_{i} θ_{i} U T_{i} (\tilde{A}) U^{T} x$
$x * g_{θ} = U (\sum_{i} θ_{i} T_{i} (\tilde{A})) U^{T} x$
This is equal to: $x * g_{θ} = \sum_{i = 1}^{k} θ_{i} T_{i} (\tilde{L}) x$ , where $\tilde{L} = \frac{2 L}{λ_{m a x}} - I_{n}$

stat940W25

Notes on Exercises

Exercise 1.1

Question

Solution

Step 1: Linear Separability Assumption

Step 2: Perceptron Update Rule

Step 3: Bounding the Number of Updates

Growth of ‖wt‖2

Lower Bound on wt⋅w∗

Combining the Results

Step 4: Conclusion

Exercise 1.2

Question

Solution

Exercise 1.3

Question

Solution

Exercise 1.4

Question

Solution

Exercise 1.5

Question

Solution

Exercise 1.6

Question

Solution

a)

b)

c)

Step 1: XOR Dataset

Step 2: Perceptron Decision Boundary

Step 3: Derive Inequalities

Step 4: Attempt to Solve

Exercise 1.7

Question

Solution

Exercise 1.8

Question

Solution

Exercise 1.9

Question

Solution

Exercise 1.10

Question

Solution

Exercise 1.11

Question

Solution

Calculation

Exercise 1.12

Question

Hint

Solution

Exercise 1.13

Question

Solution

**1. Stock Market Trends**

**2. Customer Segmentation**

**3. Medical Diagnosis**

**4. Predicting Car Fuel Efficiency**

Summary

Exercise 1.14

Question

Solution

Exercise 1.15

Question

Solution

Exercise 1.16

Question

Solution

Exercise 1.17

Question

Solution

Exercise 1.18

Question

Solution

Exercise 1.19

Question

Solution

Growth of $‖ w_{t} ‖^{2}$

Lower Bound on $w_{t} \cdot w^{*}$

1. Stock Market Trends

2. Customer Segmentation

3. Medical Diagnosis

4. Predicting Car Fuel Efficiency