1. Performance Evaluation

Before you can improve a model, you need to know how to measure its current performance. The lecture highlights that as model complexity increases, training error naturally decreases, but test error eventually starts to increase again, producing a U-shaped (convex) test-error curve whose rising portion marks overfitting.

Evaluating Regression Models

For regression (predicting continuous values), the slides define two primary metrics:
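The metric definitions themselves aren't reproduced in these notes; mean squared error (MSE) and mean absolute error (MAE) are the usual pair, so here is a minimal NumPy sketch assuming those two are the ones meant:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean absolute error: average of the absolute residuals."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # illustrative values
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y_true, y_pred))  # 0.375
print(mae(y_true, y_pred))  # 0.5
```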

Evaluating Classification Models

For classification, performance is broken down using a Confusion Matrix, which tracks four counts: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

From this matrix, we derive several critical formulas:

Pasted image 20260331014652.png
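The formulas in the pasted image aren't reproduced here, but the standard metrics derived from a confusion matrix can be sketched directly from the four counts:

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics derived from confusion-matrix counts."""
    accuracy  = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of predicted positives, how many are correct
    recall    = tp / (tp + fn)   # of actual positives, how many were found
    f1        = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts only
print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
# (0.85, 0.888..., 0.8, 0.842...)
```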


2. Hyperparameter Tuning

As discussed in previous lectures, hyperparameters are external configurations set manually by the user (like the learning rate η, K in KNN, or the number of trees in a Random Forest).

The standard search process is:

  1. Divide data into train, validation, and test sets.

  2. Optimize internal parameters on the training set.

  3. Search for the best hyperparameters using the validation set.

  4. Alternate between steps 2 and 3 until the hyperparameters are finalized, then do a final assessment on the test set.

Search Methods:
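The specific search strategies covered in the slides aren't listed in these notes; grid search is the most common one, so here is a minimal sketch of the train/validation/test workflow above, searching over K for KNN with scikit-learn (the dataset and candidate values are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Step 1: divide the data into train, validation, and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Steps 2-3: fit on the training set, score each candidate K on the validation set.
best_k, best_score = None, -1.0
for k in [1, 3, 5, 7, 9]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score

# Step 4: final assessment of the chosen K on the untouched test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final_model.score(X_test, y_test))
```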


3. Optimizer Types

Optimizers are the specific algorithms used to update the weights of your model during training. They aim to accelerate convergence, prevent getting stuck in local minima, and simplify learning rate adjustments.

1. Standard Gradient Descent: Updates weights by taking a step in the negative direction of the gradient.

$$w_{k+1} = w_k - \eta \, \nabla f_{w_k}(x_i)$$
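As a one-line sketch in Python (the gradient value is assumed to come from elsewhere):

```python
def sgd_step(w, g, lr):
    """Standard gradient descent: move the weight a step of size lr against the gradient g."""
    return w - lr * g
```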

2. Momentum Optimizer: Standard gradient descent can get stuck or oscillate. Momentum fixes this by adding a fraction of the previous update to the current update. It acts like a ball rolling down a hill, gaining inertia to push through flat spots or small bumps.

Like anything rolling down a hill: the longer it keeps moving in the same direction, the more speed it picks up in that direction.

$$\Delta w_{ji}(n) = \eta \, \delta_i x_j + \alpha \, \Delta w_{ji}(n-1)$$
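A minimal sketch of the same idea, carrying the previous update (the "velocity") between steps; α is the momentum coefficient:

```python
def momentum_step(w, g, prev_update, lr, alpha=0.9):
    """Blend a fraction of the previous update into the current one to build inertia."""
    update = -lr * g + alpha * prev_update
    return w + update, update   # return the new weight and the update to reuse next step
```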

3. AdaGrad (Adaptive Gradient): Instead of using one global learning rate (η) for all parameters, AdaGrad adapts the learning rate for each individual parameter. It does this by dividing the learning rate by the square root of r_t, the accumulated sum of all past squared gradients.

$$r_t = r_{t-1} + g_t^2, \qquad \Delta w_t = \frac{\eta}{\epsilon + \sqrt{r_t}} \, g_t$$

Problem: Because r_t strictly increases, the effective learning rate eventually shrinks toward zero, which stops training entirely.
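A sketch of the per-step behaviour; note that r only ever grows, which is what eventually kills the step size:

```python
import math

def adagrad_step(w, g, r, lr, eps=1e-8):
    """Accumulate squared gradients in r and scale the step down by sqrt(r)."""
    r = r + g ** 2                          # r never decreases
    w = w - lr / (eps + math.sqrt(r)) * g   # per-parameter effective learning rate shrinks over time
    return w, r
```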

4. RMSProp (Root Mean Square Propagation): RMSProp fixes AdaGrad's "shrinking to zero" problem by introducing an attenuation coefficient (decay factor β). Instead of accumulating all past squared gradients, it keeps an exponentially decaying moving average of them, allowing the model to continue learning. It is highly effective for Recurrent Neural Networks (RNNs).

$$r_t = \beta \, r_{t-1} + (1-\beta)\, g_t^2$$
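The only change from AdaGrad is the decaying average, sketched below; old gradients fade out instead of piling up:

```python
import math

def rmsprop_step(w, g, r, lr, beta=0.9, eps=1e-8):
    """Keep an exponentially decaying average of squared gradients instead of a running sum."""
    r = beta * r + (1 - beta) * g ** 2      # old history decays, so r can shrink again
    w = w - lr / (eps + math.sqrt(r)) * g
    return w, r
```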

5. Adam (Adaptive Moment Estimation): Adam is essentially the combination of Momentum and RMSProp, and it is the default choice for most modern deep learning.
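A minimal sketch of the standard Adam update (a momentum-style average m plus an RMSProp-style average r, with the usual bias correction); the default constants here are the commonly used ones, not necessarily the ones from the slides:

```python
import math

def adam_step(w, g, m, r, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Combine a moving average of gradients (m) with a moving average of squared gradients (r)."""
    m = beta1 * m + (1 - beta1) * g          # momentum-style first moment
    r = beta2 * r + (1 - beta2) * g ** 2     # RMSProp-style second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction (t starts at 1)
    r_hat = r / (1 - beta2 ** t)
    w = w - lr * m_hat / (eps + math.sqrt(r_hat))
    return w, m, r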


A summary of all the equations is below, so you can skip the ones above.

Here is a complete comparison of the five optimization algorithms, followed by a step-by-step numerical walkthrough for two epochs.

1. Optimizer Comparison & Equations

Each optimizer attempts to solve the core problem of standard gradient descent: finding the minimum error efficiently without getting stuck in flat regions or overshooting the target. Let $g_t$ be the gradient at step $t$, $\eta$ the learning rate, and $w_t$ the weight at step $t$.

1. Stochastic Gradient Descent (SGD)

$$w_{t+1} = w_t - \eta\, g_t$$

2. Momentum

$$\Delta w_t = -\eta\, g_t + \alpha\, \Delta w_{t-1}, \qquad w_{t+1} = w_t + \Delta w_t$$

3. AdaGrad (Adaptive Gradient)

$$r_t = r_{t-1} + g_t^2, \qquad w_{t+1} = w_t - \frac{\eta}{\epsilon + \sqrt{r_t}}\, g_t$$

4. RMSProp (Root Mean Square Propagation)

$$r_t = \beta\, r_{t-1} + (1-\beta)\, g_t^2, \qquad w_{t+1} = w_t - \frac{\eta}{\epsilon + \sqrt{r_t}}\, g_t$$

5. Adam (Adaptive Moment Estimation)

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad r_t = \beta_2 r_{t-1} + (1-\beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{r}_t = \frac{r_t}{1-\beta_2^t}, \qquad w_{t+1} = w_t - \frac{\eta}{\epsilon + \sqrt{\hat{r}_t}}\, \hat{m}_t$$


2. Numerical Example (2 Epochs)

Let's trace these optimizers mathematically using a static gradient to see how they behave differently. For simplicity in the hand calculations, we will ignore the tiny ϵ (usually 1e-8) as its only purpose is preventing division by zero.

Initial Conditions:

Epoch 1 (t=1)

Epoch 2 (t=2)
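The actual numbers from the slides aren't captured in these notes, so the sketch below recomputes a comparable two-step trace in code; the initial conditions (w = 0, a constant gradient g = 1, η = 0.1, and the usual β values) are assumptions, not the slides' values:

```python
import math

eta, g = 0.1, 1.0          # assumed learning rate and static gradient
alpha, beta = 0.9, 0.9     # momentum and RMSProp decay (assumed)
beta1, beta2 = 0.9, 0.999  # Adam decays (common defaults)

w = {"SGD": 0.0, "Momentum": 0.0, "AdaGrad": 0.0, "RMSProp": 0.0, "Adam": 0.0}
prev_update = 0.0          # momentum state
r_ada = r_rms = 0.0        # accumulated / averaged squared gradients
m_adam = r_adam = 0.0      # Adam moments

for t in (1, 2):           # two epochs; epsilon ignored, as in the hand calculation
    w["SGD"] -= eta * g

    prev_update = -eta * g + alpha * prev_update
    w["Momentum"] += prev_update

    r_ada += g ** 2
    w["AdaGrad"] -= eta / math.sqrt(r_ada) * g

    r_rms = beta * r_rms + (1 - beta) * g ** 2
    w["RMSProp"] -= eta / math.sqrt(r_rms) * g

    m_adam = beta1 * m_adam + (1 - beta1) * g
    r_adam = beta2 * r_adam + (1 - beta2) * g ** 2
    m_hat, r_hat = m_adam / (1 - beta1 ** t), r_adam / (1 - beta2 ** t)
    w["Adam"] -= eta * m_hat / math.sqrt(r_hat)

    print(f"Epoch {t}:", {k: round(v, 4) for k, v in w.items()})
```

With a constant gradient this makes the qualitative differences visible: SGD steps uniformly, Momentum accelerates, AdaGrad's steps shrink, and RMSProp and Adam keep a roughly steady effective step.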

more on optimizers here