stat940W25
Lecture 1: Perceptron
Exercise:
== Question ==
Prove that the Perceptron Learning Algorithm converges in a finite number of steps if the dataset is linearly separable.
Hint: Assume that the dataset \( \{(\mathbf{x}_i, y_i)\}_{i=1}^N \) is linearly separable, where \( \mathbf{x}_i \in \mathbb{R}^d \) are the input vectors, and \( y_i \in \{-1, 1\} \) are their corresponding labels. Show that there exists a weight vector \( \mathbf{w}^* \) and a bias \( b^* \) such that \( y_i (\mathbf{w}^* \cdot \mathbf{x}_i + b^*) > 0 \) for all \( i \), and use this assumption to bound the number of updates made by the algorithm.
== Proof ==
=== Step 1: Linear Separability Assumption ===
If the dataset is linearly separable, there exists a weight vector \( \mathbf{w}^* \) and a bias \( b^* \) such that: [math]\displaystyle{ y_i (\mathbf{w}^* \cdot \mathbf{x}_i + b^*) \gt 0 \quad \forall i = 1, 2, \dots, N. }[/math] Without loss of generality, let [math]\displaystyle{ \| \mathbf{w}^* \| = 1 }[/math] (normalize \( \mathbf{w}^* \)).
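As a quick illustration (not part of the original exercise), the following sketch checks the separability condition on an assumed toy dataset in [math]\displaystyle{ \mathbb{R}^2 }[/math]; the points, labels, and candidate separator [math]\displaystyle{ (\mathbf{w}^*, b^*) }[/math] below are all made up for illustration.

```python
import numpy as np

# Assumed toy dataset: two clusters in R^2 with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# An assumed separator (w*, b*); normalize w* so that ||w*|| = 1.
w_star = np.array([1.0, 1.0])
w_star = w_star / np.linalg.norm(w_star)
b_star = 0.0

# Separability condition: y_i (w* . x_i + b*) > 0 for every i.
margins = y * (X @ w_star + b_star)
print(margins)                 # all entries are positive for this toy set
print(np.all(margins > 0))     # True
```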
=== Step 2: Perceptron Update Rule ===
The Perceptron algorithm updates the weight vector \( \mathbf{w} \) and bias \( b \) as follows:
- Initialize [math]\displaystyle{ \mathbf{w}_0 = 0 }[/math] and [math]\displaystyle{ b_0 = 0 }[/math].
- For each misclassified point \( (\mathbf{x}_i, y_i) \), update:
[math]\displaystyle{ \mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i, \quad b \leftarrow b + y_i. }[/math]
To simplify the bookkeeping in Step 3, first absorb the bias into the weight vector by appending a constant coordinate 1 to each input: write [math]\displaystyle{ \mathbf{x}_i }[/math] for [math]\displaystyle{ (\mathbf{x}_i, 1) }[/math], [math]\displaystyle{ \mathbf{w} }[/math] for [math]\displaystyle{ (\mathbf{w}, b) }[/math], and [math]\displaystyle{ \mathbf{w}^* }[/math] for the unit-norm rescaling of [math]\displaystyle{ (\mathbf{w}^*, b^*) }[/math]. The update rule then reads [math]\displaystyle{ \mathbf{w} \leftarrow \mathbf{w} + y_i \mathbf{x}_i }[/math]. Define the margin [math]\displaystyle{ \gamma }[/math] of the dataset as: [math]\displaystyle{ \gamma = \min_{i} y_i (\mathbf{w}^* \cdot \mathbf{x}_i). }[/math] Since the dataset is finite and linearly separable, [math]\displaystyle{ \gamma \gt 0 }[/math].
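The update rule is easy to state in code. Below is a minimal sketch (an illustration, not the official course implementation) of the perceptron loop on the augmented inputs described above, with the bias stored as the last weight coordinate; the toy dataset and the epoch cap are assumptions.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron learning on augmented inputs (bias folded into the weights).

    X: (N, d) inputs; y: (N,) labels in {-1, +1}.
    Returns the augmented weight vector (w, b) and the number of updates made.
    """
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])  # append a constant 1
    w = np.zeros(X_aug.shape[1])                      # w_0 = 0 (and b_0 = 0)
    updates = 0
    for _ in range(max_epochs):
        made_update = False
        for x_i, y_i in zip(X_aug, y):
            if y_i * (w @ x_i) <= 0:   # misclassified (ties count as mistakes)
                w += y_i * x_i         # w <- w + y_i x_i (includes b <- b + y_i)
                updates += 1
                made_update = True
        if not made_update:            # a full pass with no mistakes: done
            break
    return w, updates

# Example on the toy dataset from the Step 1 sketch (assumed data):
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, t = perceptron(X, y)
print(w, t)   # a separating (w, b) and the number of updates it took
```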
=== Step 3: Bounding the Number of Updates ===
Let [math]\displaystyle{ \mathbf{w}_t }[/math] be the weight vector after the [math]\displaystyle{ t }[/math]-th update, and define [math]\displaystyle{ M = \max_i \| \mathbf{x}_i \|^2, }[/math] the maximum squared norm of any (augmented) input vector.
==== 3.1: Growth of [math]\displaystyle{ \| \mathbf{w}_t \|^2 }[/math] ====
Suppose the [math]\displaystyle{ (t+1) }[/math]-th update is triggered by a misclassified point [math]\displaystyle{ (\mathbf{x}_i, y_i) }[/math]. Then: [math]\displaystyle{ \| \mathbf{w}_{t+1} \|^2 = \| \mathbf{w}_t + y_i \mathbf{x}_i \|^2 = \| \mathbf{w}_t \|^2 + 2 y_i (\mathbf{w}_t \cdot \mathbf{x}_i) + \| \mathbf{x}_i \|^2. }[/math] Since the point is misclassified, [math]\displaystyle{ y_i (\mathbf{w}_t \cdot \mathbf{x}_i) \leq 0 }[/math]. Thus: [math]\displaystyle{ \| \mathbf{w}_{t+1} \|^2 \leq \| \mathbf{w}_t \|^2 + \| \mathbf{x}_i \|^2 \leq \| \mathbf{w}_t \|^2 + M. }[/math] Since [math]\displaystyle{ \mathbf{w}_0 = \mathbf{0} }[/math], induction gives, after [math]\displaystyle{ t }[/math] updates: [math]\displaystyle{ \| \mathbf{w}_t \|^2 \leq tM. }[/math]
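For instance, with the augmented toy points used in the sketches above, every [math]\displaystyle{ \| \mathbf{x}_i \|^2 = 6 }[/math], so [math]\displaystyle{ M = 6 }[/math]; the first update from [math]\displaystyle{ \mathbf{w}_0 = \mathbf{0} }[/math] then gives [math]\displaystyle{ \| \mathbf{w}_1 \|^2 = \| \mathbf{x}_i \|^2 = 6 = 1 \cdot M }[/math], so the bound holds with equality after one update.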
==== 3.2: Lower Bound on [math]\displaystyle{ \mathbf{w}_t \cdot \mathbf{w}^* }[/math] ====
Each update also increases [math]\displaystyle{ \mathbf{w}_t \cdot \mathbf{w}^* }[/math] by at least [math]\displaystyle{ \gamma }[/math]: [math]\displaystyle{ \mathbf{w}_{t+1} \cdot \mathbf{w}^* = (\mathbf{w}_t + y_i \mathbf{x}_i) \cdot \mathbf{w}^* = \mathbf{w}_t \cdot \mathbf{w}^* + y_i (\mathbf{x}_i \cdot \mathbf{w}^*). }[/math] By the definition of the margin, [math]\displaystyle{ y_i (\mathbf{x}_i \cdot \mathbf{w}^*) \geq \gamma }[/math], so: [math]\displaystyle{ \mathbf{w}_{t+1} \cdot \mathbf{w}^* \geq \mathbf{w}_t \cdot \mathbf{w}^* + \gamma. }[/math] Since [math]\displaystyle{ \mathbf{w}_0 \cdot \mathbf{w}^* = 0 }[/math], induction gives: [math]\displaystyle{ \mathbf{w}_t \cdot \mathbf{w}^* \geq t \gamma. }[/math]
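Continuing the same toy example with the assumed (augmented, normalized) separator [math]\displaystyle{ \mathbf{w}^* = (1, 1, 0)/\sqrt{2} }[/math], every point satisfies [math]\displaystyle{ y_i (\mathbf{x}_i \cdot \mathbf{w}^*) = 3/\sqrt{2} }[/math], so [math]\displaystyle{ \gamma = 3/\sqrt{2} }[/math], and after the first update [math]\displaystyle{ \mathbf{w}_1 \cdot \mathbf{w}^* = 3/\sqrt{2} = 1 \cdot \gamma }[/math], matching the lower bound.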
==== 3.3: Combining the Results ====
The Cauchy-Schwarz inequality, together with [math]\displaystyle{ \| \mathbf{w}^* \| = 1 }[/math], gives: [math]\displaystyle{ \mathbf{w}_t \cdot \mathbf{w}^* \leq \| \mathbf{w}_t \| \| \mathbf{w}^* \| = \| \mathbf{w}_t \|. }[/math] Combining this with the bounds from 3.1 and 3.2: [math]\displaystyle{ t \gamma \leq \| \mathbf{w}_t \| \leq \sqrt{tM}. }[/math] Squaring both sides of [math]\displaystyle{ t \gamma \leq \sqrt{tM} }[/math]: [math]\displaystyle{ t^2 \gamma^2 \leq tM. }[/math] Dividing through by [math]\displaystyle{ t }[/math] (assuming [math]\displaystyle{ t \gt 0 }[/math]): [math]\displaystyle{ t \leq \frac{M}{\gamma^2}. }[/math]
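To sanity-check the whole argument numerically, here is a small self-contained sketch (under the same assumed toy data and separator as above) that runs the augmented perceptron and asserts, after every update, the three inequalities [math]\displaystyle{ \| \mathbf{w}_t \|^2 \leq tM }[/math], [math]\displaystyle{ \mathbf{w}_t \cdot \mathbf{w}^* \geq t\gamma }[/math], and [math]\displaystyle{ t \leq M/\gamma^2 }[/math].

```python
import numpy as np

# Assumed toy dataset (same as in the earlier sketches), augmented with a constant 1.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Assumed reference separator w* in augmented form, rescaled to unit norm.
w_star = np.array([1.0, 1.0, 0.0])
w_star /= np.linalg.norm(w_star)

gamma = np.min(y * (X_aug @ w_star))     # margin of the dataset w.r.t. w*
M = np.max(np.sum(X_aug ** 2, axis=1))   # M = max_i ||x_i||^2 (augmented)
assert gamma > 0                         # linear separability

w = np.zeros(X_aug.shape[1])             # w_0 = 0
t = 0                                    # number of updates so far
eps = 1e-9                               # tolerance for floating-point ties
converged = False
while not converged:
    converged = True
    for x_i, y_i in zip(X_aug, y):
        if y_i * (w @ x_i) <= 0:         # misclassified point triggers an update
            w += y_i * x_i
            t += 1
            converged = False
            # Check the inequalities from 3.1, 3.2, and 3.3 after every update.
            assert w @ w <= t * M + eps            # ||w_t||^2 <= t M
            assert w @ w_star >= t * gamma - eps   # w_t . w*  >= t gamma
            assert t <= M / gamma**2 + eps         # t <= M / gamma^2

print(f"converged after {t} update(s); bound M/gamma^2 = {M / gamma**2:.3f}")
```

On this toy set the loop stops after a single update, consistent with the bound [math]\displaystyle{ M/\gamma^2 = 6/(9/2) = 4/3 }[/math].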
=== Step 4: Conclusion ===
The Perceptron Learning Algorithm converges after at most [math]\displaystyle{ \frac{M}{\gamma^2} }[/math] updates, which is finite. This proves that the algorithm terminates when the dataset is linearly separable.