Q1

There appears to be a typo in the diagram: the bottom arrow from X2 to H3 is labeled $W_{X_2 H_1} = 0.2$. Based on the network's structure, I assume this represents $W_{X_2 H_3} = 0.2$.

Pasted image 20260407200608.png

To find the recommended class, we perform a forward pass through the network using the given inputs (X1 = 2, X2 = 4), a bias of 0, and the Sigmoid activation function: $\sigma(z) = \frac{1}{1 + e^{-z}}$.

Step A: Calculate Hidden Layer Activations (H1,H2,H3)

First, we find the weighted sum (Z) for each hidden node, then apply the Sigmoid function.

Step B: Calculate Output Layer Activations (O1,O2)

Now, we use the hidden layer activations to compute the final outputs.
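The complete weight set is only visible in the diagram, but the computation itself is mechanical. Below is a minimal NumPy sketch of Steps A and B; every weight except the stated $W_{X_2 H_3} = 0.2$ is a placeholder to be replaced with the values from the diagram.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs from the problem
x = np.array([2.0, 4.0])            # [X1, X2]

# Placeholder weights -- only W_{X2,H3} = 0.2 is stated in the text;
# replace the rest with the values from the diagram.
W_xh = np.array([[0.1, 0.3, 0.5],   # X1 -> H1, H2, H3 (placeholders)
                 [0.4, 0.6, 0.2]])  # X2 -> H1, H2, H3 (last one given)
W_ho = np.array([[0.7, 0.2],        # H1 -> O1, O2 (placeholders)
                 [0.5, 0.4],        # H2 -> O1, O2 (placeholders)
                 [0.3, 0.6]])       # H3 -> O1, O2 (placeholders)

# Step A: hidden activations  H = sigmoid(x . W_xh)  (bias = 0)
H = sigmoid(x @ W_xh)

# Step B: output activations  O = sigmoid(H . W_ho)
O = sigmoid(H @ W_ho)

print("H =", H)
print("O =", O)
print("Recommended class:", int(np.argmax(O)))  # index of the larger output
```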

Conclusion:

The recommended class is Class 0.

Why: The calculated activation for the O1 node (0.794), which corresponds to Class 0, is greater than the activation for the O2 node (0.664), which corresponds to Class 1.


2. Perform the first iteration of weight updates using backpropagation. Specify all updated weights.

To perform a mathematical weight update via backpropagation, two critical pieces of information are missing from the problem prompt:

  1. The True Target Labels (Y): We need to know what the network should have predicted for the input X=[2,4] to calculate the loss/error margin.

  2. The Learning Rate (η): We need the hyperparameter that determines the step size of the weight updates.

How to proceed:

If you can provide the target labels and the learning rate expected by your professor or assignment, I can complete the backpropagation calculations for you.

As a reference, once you have those values, the update for a weight connecting the hidden layer to the output layer (assuming Mean Squared Error loss) would follow this formula:

$$W_{\text{new}} = W_{\text{old}} - \eta \,(\text{Output} - \text{Target})\cdot \text{Output}\,(1 - \text{Output})\cdot \text{Hidden\_Activation}$$

Deep_Learning_Eq.jpeg

Setup & Values from the Forward Pass

(Note: We will use the standard Mean Squared Error loss derivative for the Sigmoid activation function: $\delta = (\text{Output} - \text{Target})\cdot \text{Output}\,(1 - \text{Output})$)


Step 1: Calculate the Output Layer Errors (δO)


Step 2: Update the Output Layer Weights

The weight update formula is: $W_{\text{new}} = W_{\text{old}} - \eta \,\delta_O\, H$


Step 3: Calculate the Hidden Layer Errors (δH)

To calculate the error for the hidden nodes, we backpropagate the output errors using the original weights before they were updated.

Formula: $\delta_H = \Big(\sum_O \delta_O \, W_{\text{old}}\Big)\, H (1 - H)$


Step 4: Update the Hidden Layer Weights

The weight update formula is: $W_{\text{new}} = W_{\text{old}} - \eta \,\delta_H\, X$
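Once the targets and learning rate are supplied, the full first iteration can be scripted. The sketch below assumes a one-hot target of [1, 0] (Class 0) and η = 0.5 purely for illustration, and reuses the placeholder weights from the forward-pass sketch; note that the hidden-layer errors are computed from the original hidden-to-output weights before those weights are overwritten.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed values -- the true targets and learning rate are not given in
# the prompt, so these are illustrative placeholders.
eta = 0.5                         # learning rate (assumption)
target = np.array([1.0, 0.0])     # one-hot target for Class 0 (assumption)

x = np.array([2.0, 4.0])          # inputs
W_xh = np.array([[0.1, 0.3, 0.5], # placeholder weights, as above
                 [0.4, 0.6, 0.2]])
W_ho = np.array([[0.7, 0.2],
                 [0.5, 0.4],
                 [0.3, 0.6]])

# Forward pass (Steps A and B from question 1)
H = sigmoid(x @ W_xh)
O = sigmoid(H @ W_ho)

# Step 1: output-layer errors  delta_O = (O - T) * O * (1 - O)
delta_O = (O - target) * O * (1.0 - O)

# Step 3: hidden-layer errors, backpropagated through the ORIGINAL weights
delta_H = (W_ho @ delta_O) * H * (1.0 - H)

# Steps 2 and 4: gradient-descent updates  W_new = W_old - eta * delta * input
W_ho -= eta * np.outer(H, delta_O)
W_xh -= eta * np.outer(x, delta_H)

print("Updated hidden->output weights:\n", W_ho)
print("Updated input->hidden weights:\n", W_xh)
```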


Q2: Derive the weight equation for the minimum sum of squared errors.

Pasted image 20260407205838.png
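The worked derivation is in the attached image; for reference, here is a standard derivation of the least-squares weight equation, assuming a linear model $\hat{y} = Xw$ with design matrix $X$ and targets $y$. The sum of squared errors is

$$E(w) = (y - Xw)^\top (y - Xw)$$

Setting its gradient with respect to $w$ to zero:

$$\frac{\partial E}{\partial w} = -2X^\top(y - Xw) = 0 \;\Rightarrow\; X^\top X\, w = X^\top y \;\Rightarrow\; w = (X^\top X)^{-1}X^\top y$$

which is the normal-equation form of the minimum-SSE weights (assuming $X^\top X$ is invertible).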


Q3

Pasted image 20260407210031.png

a- Steps for classification using a decision tree classifier

To build a decision tree (specifically using the ID3 algorithm which relies on Information Gain), you follow these standard steps:

  1. Calculate the Entropy of the target dataset: Determine the uncertainty of the entire dataset based on the class labels.

  2. Calculate the Entropy for each feature: For every feature, calculate the entropy of its individual branches (values) and then compute the weighted average entropy for that feature.

  3. Calculate Information Gain: Subtract the weighted entropy of each feature from the target dataset's total entropy.

  4. Select the Root Node: Choose the feature with the highest Information Gain to split the dataset.

  5. Repeat for Child Nodes: Recursively apply steps 1-4 to the subsets of data created by the split. Stop when a subset is pure (contains only one class) or when no more features are left to split.


Initial Calculation: Parent Entropy

Before calculating the gain for specific features, we must find the total entropy of the dataset, E(S).

$$E(S) = -P(C_1)\log_2 P(C_1) - P(C_2)\log_2 P(C_2)$$

$$E(S) = -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1.0$$

b- Calculate the gain entropy of Feature 1

First, we analyze the splits if we use Feature 1. Each branch ($F_1 = 0$ and $F_1 = 1$) contains 3 samples with a 1:2 class split, so each branch entropy is:

$$E(F_1{=}0) = E(F_1{=}1) = -\tfrac{1}{3}\log_2\tfrac{1}{3} - \tfrac{2}{3}\log_2\tfrac{2}{3} \approx 0.918$$

Now, calculate the weighted entropy for Feature 1, $E(S|F_1)$:

$$E(S|F_1) = \tfrac{3}{6}E(F_1{=}0) + \tfrac{3}{6}E(F_1{=}1) = 0.5 \cdot 0.918 + 0.5 \cdot 0.918 = 0.918$$

Finally, calculate Information Gain:

$$\text{Gain}(S, F_1) = E(S) - E(S|F_1) = 1.0 - 0.918 = 0.082$$

c- Calculate the gain entropy of Feature 2

Next, we analyze the splits if we use Feature 2. The $F_2 = 0$ branch contains 2 samples split 1:1, and the $F_2 = 1$ branch contains 4 samples split 2:2, so both branches have maximum entropy:

$$E(F_2{=}0) = E(F_2{=}1) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1.0$$

Now, calculate the weighted entropy for Feature 2, $E(S|F_2)$:

$$E(S|F_2) = \tfrac{2}{6}E(F_2{=}0) + \tfrac{4}{6}E(F_2{=}1) = \tfrac{1}{3} \cdot 1.0 + \tfrac{2}{3} \cdot 1.0 = 1.0$$

Finally, calculate Information Gain:

$$\text{Gain}(S, F_2) = E(S) - E(S|F_2) = 1.0 - 1.0 = 0.0$$
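The full data table is in the image, but the class counts implied by the entropies above (1:2 on each Feature 1 branch, 1:1 and 2:2 on the Feature 2 branches) pin down a consistent dataset. A minimal Python sketch that reproduces the numbers:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    """Gain(S, F) = E(S) - weighted average entropy of the splits on F."""
    n = len(labels)
    splits = {}
    for row, label in zip(rows, labels):
        splits.setdefault(row[feature_index], []).append(label)
    weighted = sum(len(subset) / n * entropy(subset) for subset in splits.values())
    return entropy(labels) - weighted

# A 6-sample dataset consistent with the counts in the figure (assumption):
# columns are (Feature 1, Feature 2), labels are the classes.
rows   = [(0, 0), (0, 1), (0, 1), (1, 1), (1, 1), (1, 0)]
labels = ["C1", "C2", "C2", "C1", "C1", "C2"]

print(entropy(labels))                    # 1.0
print(information_gain(rows, labels, 0))  # ~0.082 for Feature 1
print(information_gain(rows, labels, 1))  # 0.0    for Feature 2
```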

d- Build the decision tree classification graph

Since Feature 1 has the higher Information Gain (0.082>0.0), it becomes the root node.

Because neither subset is perfectly pure after the first split, we apply Feature 2 to the resulting branches to complete the classification:

Here is the resulting decision tree structure:

```plaintext
          [Feature 1]
           /       \
        (0)         (1)
        /             \
  [Feature 2]     [Feature 2]
    /     \         /     \
  (0)     (1)     (0)     (1)
  /         \     /         \
[C1]       [C2] [C2]       [C1]
```

Q4

Pasted image 20260407211011.png

1. What is the output provided by a convolution layer?

We have a 5×5 input image and a 2×2 kernel with a stride of 1.

The output size will be: $(5 - 2)/1 + 1 = 4$, giving a $4 \times 4$ feature map (assuming no padding, $P = 0$).

The kernel is essentially an identity matrix:

$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

This means that for every 2×2 window the kernel slides over, we multiply the elements element-wise and sum them. Because of the 0s in the kernel, this operation simplifies to adding the top-left and bottom-right elements of the current 2×2 window.

Following this sliding window process across the entire matrix gives us the following output feature map:

$$\begin{bmatrix} 6 & -7 & 6 & 0 \\ -6 & 11 & -12 & 9 \\ 18 & -5 & -8 & 2 \\ 8 & 13 & 10 & -5 \end{bmatrix}$$

2. The output of the convolution layer goes through The ReLU layer, what will the output of this layer be?

The Rectified Linear Unit (ReLU) activation function sets all negative values to zero while keeping positive values unchanged: f(x)=max(0,x).

Applying this to our 4×4 feature map from step 1:

$$\begin{bmatrix} 6 & 0 & 6 & 0 \\ 0 & 11 & 0 & 9 \\ 18 & 0 & 0 & 2 \\ 8 & 13 & 10 & 0 \end{bmatrix}$$

3. The output of the ReLU layer goes through a maxpooling layer with size 3×3, what will the output of this layer be?

Max pooling extracts the maximum value from the window. The problem does not explicitly state the stride for the pooling layer. When dealing with a 4×4 matrix and a 3×3 pool size, a standard default assumption in academic contexts (when stride isn't given) is a stride of 1, allowing the windows to overlap to capture edge features.

(Note: If the stride were 3, only one 3×3 window would fit without padding, resulting in a single value of 18. Assuming a stride of 1 yields a 2×2 output.)

Using a 3×3 window with a stride of 1 on the ReLU output gives four overlapping windows, whose maxima are 18 (top-left), 11 (top-right), 18 (bottom-left), and 13 (bottom-right).

Final Output Matrix:

$$\begin{bmatrix} 18 & 11 \\ 18 & 13 \end{bmatrix}$$
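As a sanity check, the ReLU and max-pooling stages can be reproduced directly from the convolution output above (the original 5×5 input is only in the image). A minimal NumPy sketch:

```python
import numpy as np

# Convolution-layer output from step 1 (reconstructed above)
conv_out = np.array([[ 6,  -7,   6,  0],
                     [-6,  11, -12,  9],
                     [18,  -5,  -8,  2],
                     [ 8,  13,  10, -5]])

# Step 2: ReLU sets negative values to zero
relu_out = np.maximum(conv_out, 0)

# Step 3: 3x3 max pooling with stride 1 over the 4x4 map -> 2x2 output
pool, stride = 3, 1
out = np.array([[relu_out[i:i + pool, j:j + pool].max()
                 for j in range(0, relu_out.shape[1] - pool + 1, stride)]
                for i in range(0, relu_out.shape[0] - pool + 1, stride)])

print(relu_out)
print(out)   # [[18 11]
             #  [18 13]]
```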

Q5

1. Difference between Bagging and Boosting

Both are ensemble learning techniques used to improve model performance, but they operate differently:

| Feature | Bagging (Bootstrap Aggregating) | Boosting |
|---|---|---|
| Execution | Parallel: models are trained independently at the same time. | Sequential: models are trained one after another. |
| Focus | Aims to reduce variance (prevents overfitting). | Aims to reduce bias (improves underfitting) and variance. |
| Data Sampling | Random subsets of data are drawn with replacement for each model. | All data is used, but weights of previously misclassified data points are increased for the next model. |
| Base Learners | Usually complex, high-variance models (e.g., deep Decision Trees). | Usually simple, high-bias weak learners (e.g., shallow Decision Trees/stumps). |
| Aggregation | Equal majority vote (classification) or simple average (regression). | Weighted vote/average based on each model's accuracy. |
| Examples | Random Forest | AdaBoost, Gradient Boosting, XGBoost |
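Both families are available in scikit-learn; a minimal sketch comparing them on a synthetic dataset (the dataset and hyperparameters here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: complex trees trained in parallel on bootstrap samples
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: shallow stumps trained sequentially, reweighting mistakes
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```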

2. Assumptions of Linear Regression

For a linear regression model to be valid and reliable, it relies on four main assumptions (often remembered by the acronym LINE), plus one regarding features:

  1. Linearity: the relationship between the features and the target is linear.

  2. Independence: the residuals (errors) are independent of one another (no autocorrelation).

  3. Normality: the residuals are normally distributed.

  4. Equal variance (homoscedasticity): the residuals have constant variance across all levels of the features.

  5. No multicollinearity: the features are not highly correlated with each other.
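A quick, illustrative way to check several of these assumptions on a fitted model, using statsmodels and scipy (synthetic data for demonstration):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()
resid = model.resid

# Normality of residuals (Shapiro-Wilk test)
print("normality p-value:", stats.shapiro(resid).pvalue)

# Independence of residuals (Durbin-Watson; ~2 means no autocorrelation)
print("durbin-watson:", durbin_watson(resid))

# Multicollinearity: pairwise correlation between features
print("feature correlation:\n", np.corrcoef(X, rowvar=False))
```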

3. Importance of Data Normalization

Data normalization (scaling numerical features to a standard range, like 0 to 1, or standardizing them to have a mean of 0 and a variance of 1) is critical for several reasons:

  1. Equal feature contribution: it prevents features with large numeric ranges from dominating features with small ranges.

  2. Faster convergence: gradient-based optimizers (e.g., in neural networks and logistic regression) converge faster on scaled inputs.

  3. Distance-based algorithms: methods such as KNN, K-Means, and SVM rely on distance computations, which are distorted by unscaled features.

  4. Fair regularization: L1/L2 penalty terms assume comparably scaled coefficients in order to penalize features evenly.
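A minimal scikit-learn sketch of both approaches, using a small illustrative feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max normalization: each column rescaled to [0, 1]
print(MinMaxScaler().fit_transform(X))

# Standardization: each column rescaled to mean 0, variance 1
print(StandardScaler().fit_transform(X))
```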

4. Convolutional Layer Calculations

Here are the calculations based on your provided parameters:

a. Calculate the width and height of the output feature maps:

We use the standard dimension formula: $W_{\text{out}} = \dfrac{W_{\text{in}} - K + 2P}{S} + 1$

$$W_{\text{out}} = \frac{224 - 7 + 2(2)}{3} + 1 = \frac{224 - 7 + 4}{3} + 1 = \frac{221}{3} + 1 = 73.66\ldots + 1 \;\Rightarrow\; \lfloor 73.66\ldots \rfloor + 1 = 73 + 1 = 74$$

Because the input is square (224×224), the height will be the same.

b. Determine the total number of units (neurons) in the output:

The total number of units is the volume of the output feature map (Width × Height × Number of Filters): $74 \times 74 \times N_{\text{filters}} = 5476\,N_{\text{filters}}$, where $N_{\text{filters}}$ is the filter count specified in the image.

c. Compute the total number of learnable parameters:

Each filter has weights corresponding to its volume ($7 \times 7 \times C_{\text{in}}$, where $C_{\text{in}}$ is the input depth from the image), plus exactly one bias term. Multiplying by the number of filters gives $(7 \times 7 \times C_{\text{in}} + 1) \times N_{\text{filters}}$ learnable parameters.
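These two quantities depend on the input depth and filter count given in the image; the helper below reproduces the calculations, with placeholder values for those two numbers:

```python
def conv_layer_stats(w_in, k, pad, stride, c_in, n_filters):
    """Output width/height, unit count, and learnable parameters of a conv layer."""
    w_out = (w_in - k + 2 * pad) // stride + 1    # floor division, then +1
    units = w_out * w_out * n_filters             # output volume
    params = (k * k * c_in + 1) * n_filters       # weights + one bias per filter
    return w_out, units, params

# Given: 224x224 input, 7x7 kernel, padding 2, stride 3.
# c_in=3 and n_filters=64 are placeholders -- substitute the values from the image.
w_out, units, params = conv_layer_stats(224, 7, 2, 3, c_in=3, n_filters=64)
print(w_out)   # 74
print(units)   # 74 * 74 * 64
print(params)  # (7*7*3 + 1) * 64
```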