Part 1: K-Nearest Neighbors (KNN) Classification

Let’s classify a new data point based on its similarity to existing training data using Euclidean distance.

The Setup: We have three labeled training points, P1 and P2 (Class A) and P3 (Class B), and we want to classify a new test point at (4, 5).

Step 1: Calculate Euclidean Distances

The formula is $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.

Step 2: Sort and Select the K Nearest Neighbors

Sorting the distances in ascending order:

  1. P1 (Distance: 2.24) Class A

  2. P2 (Distance: 3.00) Class A

  3. P3 (Distance: 3.61) Class B

Step 3: Majority Voting

Among the 3 nearest neighbors, Class A appears twice and Class B appears once.

Result: The test point (4, 5) is classified as Class A.

For regression problems, the same steps apply, except that instead of taking a majority vote, we average the target values of the K nearest neighbors.
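A minimal sketch of this procedure in plain Python is shown below. The training coordinates here are hypothetical placeholders (the original coordinates of P1–P3 are not reproduced), chosen only so the code runs end to end; the distance, sorting, and voting logic follows the three steps above.

```python
import math
from collections import Counter

def euclidean_distance(p, q):
    # d = sqrt((x2 - x1)^2 + (y2 - y1)^2), generalized to any dimension
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train, test_point, k=3):
    # train: list of (point, label) pairs
    # Step 1: compute distances; Step 2: sort and keep the k nearest
    neighbors = sorted(train, key=lambda item: euclidean_distance(item[0], test_point))[:k]
    # Step 3: majority vote over the labels of the k nearest neighbors
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, test_point, k=3):
    # Same steps, but average the target values instead of voting
    neighbors = sorted(train, key=lambda item: euclidean_distance(item[0], test_point))[:k]
    return sum(target for _, target in neighbors) / k

# Hypothetical training set (coordinates are illustrative, not from the worked example)
train = [((3, 3), "A"), ((4, 2), "A"), ((6, 7), "B"), ((8, 8), "B")]
print(knn_classify(train, (4, 5), k=3))  # majority vote among the 3 nearest -> "A"
```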


Part 2: Linear Regression

For the following regression examples, we will use a tiny dataset to predict a target Y from a single feature X.

Dataset: the three points $(x, y) = (1, 2),\ (2, 3),\ (3, 5)$.

We are trying to fit the line equation $\hat{y} = mx + b$ (where $m$ is the weight/slope and $b$ is the bias/intercept).

Method A: The Normal Equation (Analytical)

The Normal Equation finds the exact optimal weights in one mathematical operation. The formula is:

$$W = (X^T X)^{-1} X^T Y$$

Step 1: Create the Augmented Matrix X and Vector Y

We add a column of 1s to X to account for the bias term (b).

$$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \qquad Y = \begin{bmatrix} 2 \\ 3 \\ 5 \end{bmatrix}$$

Step 2: Calculate $X^T X$

$$X^T = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix}, \qquad X^T X = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}$$

Step 3: Calculate the Inverse $(X^T X)^{-1}$

The determinant is $(3 \times 14) - (6 \times 6) = 42 - 36 = 6$.

$$(X^T X)^{-1} = \frac{1}{6} \begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix} = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \end{bmatrix}$$

Step 4: Calculate $X^T Y$

$$X^T Y = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \\ 5 \end{bmatrix} = \begin{bmatrix} 10 \\ 23 \end{bmatrix}$$

Step 5: Multiply to get W

$$W = \begin{bmatrix} 7/3 & -1 \\ -1 & 1/2 \end{bmatrix} \begin{bmatrix} 10 \\ 23 \end{bmatrix} = \begin{bmatrix} (70/3) - 23 \\ -10 + (23/2) \end{bmatrix} = \begin{bmatrix} 1/3 \\ 3/2 \end{bmatrix}$$

Result: The optimal bias is $b \approx 0.33$ (i.e., $1/3$) and the optimal weight is $m = 1.5$.
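These hand calculations are easy to verify with a few lines of NumPy. The snippet below is just a check of the formula above on this dataset, written to mirror the steps literally:

```python
import numpy as np

# Augmented design matrix (column of 1s for the bias b) and target vector
X = np.array([[1, 1],
              [1, 2],
              [1, 3]], dtype=float)
Y = np.array([2, 3, 5], dtype=float)

# Normal Equation: W = (X^T X)^{-1} X^T Y
W = np.linalg.inv(X.T @ X) @ X.T @ Y
print(W)  # [0.3333... 1.5]  ->  b = 1/3, m = 3/2
```

In practice, `np.linalg.solve(X.T @ X, X.T @ Y)` or `np.linalg.lstsq` is preferred over forming an explicit inverse, but for this tiny example the result is identical.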


Method B: Gradient Descent (Iterative)


Instead of matrix inversion, Gradient Descent updates weights iteratively.

Below is exactly one update step for each of the three types of gradient descent using our dataset. Since the initial weights are $m = 0$ and $b = 0$, $\hat{y}$ is initially 0 for all points.
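All three variants apply the same update rule; only the set of points entering the sums changes. Assuming the usual mean squared error loss $L = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ and a learning rate $\alpha$ (left symbolic here, since no particular value is assumed), the updates are:

$$\frac{\partial L}{\partial m} = -\frac{2}{n}\sum_{i} x_i\,(y_i - \hat{y}_i), \qquad \frac{\partial L}{\partial b} = -\frac{2}{n}\sum_{i} (y_i - \hat{y}_i)$$

$$m \leftarrow m - \alpha\,\frac{\partial L}{\partial m}, \qquad b \leftarrow b - \alpha\,\frac{\partial L}{\partial b}$$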

1. Batch Gradient Descent (BGD)

BGD calculates the error across the entire dataset (n=3) before making a single update.

Update:
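Plugging in the full dataset ($n = 3$) with $m = b = 0$, so every $\hat{y}_i = 0$, the assumed-MSE gradients and the resulting update are:

$$\frac{\partial L}{\partial m} = -\frac{2}{3}\left(1\cdot 2 + 2\cdot 3 + 3\cdot 5\right) = -\frac{46}{3} \approx -15.33, \qquad \frac{\partial L}{\partial b} = -\frac{2}{3}\left(2 + 3 + 5\right) = -\frac{20}{3} \approx -6.67$$

$$m \leftarrow 0 + \frac{46}{3}\alpha, \qquad b \leftarrow 0 + \frac{20}{3}\alpha$$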

2. Stochastic Gradient Descent (SGD)

SGD updates the weights after evaluating a single, randomly chosen data point. Let's assume the first point chosen is (x=1,y=2). Here, n=1.

Update:
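For the single point $(x = 1, y = 2)$ with $n = 1$, under the same assumed loss:

$$\frac{\partial L}{\partial m} = -2 \cdot 1 \cdot (2 - 0) = -4, \qquad \frac{\partial L}{\partial b} = -2 \cdot (2 - 0) = -4$$

$$m \leftarrow 0 + 4\alpha, \qquad b \leftarrow 0 + 4\alpha$$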

3. Mini-Batch Gradient Descent (MBGD)

MBGD uses a small subset of the data. Let's use a batch size of 2, picking the first two points: (1,2) and (2,3). Here, n=2.

Update:
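For the mini-batch $\{(1, 2), (2, 3)\}$ with $n = 2$, under the same assumed loss:

$$\frac{\partial L}{\partial m} = -\frac{2}{2}\left(1\cdot 2 + 2\cdot 3\right) = -8, \qquad \frac{\partial L}{\partial b} = -\frac{2}{2}\left(2 + 3\right) = -5$$

$$m \leftarrow 0 + 8\alpha, \qquad b \leftarrow 0 + 5\alpha$$

As a quick check, the sketch below reproduces all three single-step updates in NumPy; the learning rate of 0.1 is an arbitrary assumption made only so the code prints concrete numbers.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 5.0])

def one_step(x_batch, y_batch, m=0.0, b=0.0, lr=0.1):
    """One gradient-descent update for y_hat = m*x + b under MSE loss."""
    n = len(x_batch)
    y_hat = m * x_batch + b
    grad_m = -(2.0 / n) * np.sum(x_batch * (y_batch - y_hat))
    grad_b = -(2.0 / n) * np.sum(y_batch - y_hat)
    return m - lr * grad_m, b - lr * grad_b

print(one_step(x, y))          # BGD:  all 3 points
print(one_step(x[:1], y[:1]))  # SGD:  the single point (1, 2)
print(one_step(x[:2], y[:2]))  # MBGD: the mini-batch (1, 2), (2, 3)
```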