Lecture 2 Examples

1. Categorical Naïve Bayes (Play Tennis Dataset)
2. Gaussian Naïve Bayes (Numerical Data)
3. Principal Component Analysis (PCA)
More On Naive Bayes

1. Categorical Naïve Bayes (Play Tennis Dataset)

The Scenario:

We have historical data on whether a tennis game was played (OK_Play = 9 days, NO_Play = 5 days, Total = 14 days).

(Figure: Play Tennis dataset, frequency counts of each feature value for OK_Play and NO_Play)

We need to classify a new, unseen day (Z) with the following conditions: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong (the combination the likelihood fractions below correspond to).

Step 1: Calculate the Probabilities for OK_Play

We multiply the likelihoods of each feature value occurring on days when the game was played, then multiply by the prior probability of playing at all.

(Figure: per-feature likelihood tables under OK_Play and NO_Play)

Equation:

$$P(\text{OK\_Play} \mid Z) \propto \left(\frac{2}{9}\times\frac{3}{9}\times\frac{3}{9}\times\frac{3}{9}\right)\times\frac{9}{14} \approx 0.0053$$

Step 2: Calculate the Probabilities for NO_Play

We do the same for when the game was not played.

Equation:

$$P(\text{NO\_Play} \mid Z) \propto \left(\frac{3}{5}\times\frac{1}{5}\times\frac{4}{5}\times\frac{3}{5}\right)\times\frac{5}{14} \approx 0.0206$$

Result: Since 0.0206 is greater than 0.0053, the model classifies this new day as NO_Play.
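As a sanity check, the same arithmetic can be scripted. A minimal Python sketch, assuming the new day is the classic (Sunny, Cool, High, Strong) combination behind the fractions above; exact fractions are used to avoid rounding until the final print:

```python
from fractions import Fraction as F

# Likelihoods P(feature value | class) read off the frequency tables,
# plus the class priors P(OK_Play) = 9/14 and P(NO_Play) = 5/14.
likelihoods = {
    "OK_Play": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],  # Sunny, Cool, High, Strong
    "NO_Play": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}
priors = {"OK_Play": F(9, 14), "NO_Play": F(5, 14)}

scores = {}
for cls, ls in likelihoods.items():
    score = priors[cls]
    for l in ls:
        score *= l  # naive independence: multiply per-feature likelihoods
    scores[cls] = score
    print(f"{cls}: {float(score):.4f}")  # 0.0053 and 0.0206

print("Prediction:", max(scores, key=scores.get))  # NO_Play
```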


2. Gaussian Naïve Bayes (Numerical Data)

The Scenario:

(Figure: training data, PSA level and Age for Healthy and Cancer patients)

Recall: the marginalization rule ("total probability") lets you calculate the probability of one variable by summing the joint distribution over all possible values of another variable:

$$P(a_i)=\sum_{j=1}^{N_{b\text{-classes}}} P(a_i, b_j)$$
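For instance (with made-up numbers, just to illustrate the rule): if $P(\text{rain}, \text{weekday}) = 0.2$ and $P(\text{rain}, \text{weekend}) = 0.1$, then summing over the day type gives $P(\text{rain}) = 0.2 + 0.1 = 0.3$.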

Instead of counts, we are dealing with continuous numbers (PSA level and Age) to classify a patient as Healthy or Cancer. Because we assume the features follow a normal distribution, we plug our values into the Gaussian Probability Density Function:

$$f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Step 1: Calculate Likelihoods for Healthy

Using the mean (μ) and standard deviation (σ) from the Healthy training data:

Posterior Numerator (Healthy):

$$P(\text{Healthy}) \times P(\text{PSA} \mid \text{Healthy}) \times P(\text{Age} \mid \text{Healthy}) = 0.5 \times 0.13 \times 0.059 \approx 0.004$$

Step 2: Calculate Likelihoods for Cancer

Using the mean and standard deviation from the Cancer training data:

Posterior Numerator (Cancer):

$$P(\text{Cancer}) \times P(\text{PSA} \mid \text{Cancer}) \times P(\text{Age} \mid \text{Cancer}) = 0.5 \times 0.47 \times 0.055 \approx 0.013$$

Result: Since 0.013 is greater than 0.004, the model classifies the patient as having Cancer.
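The same pipeline in a minimal Python sketch. The class means and standard deviations below are hypothetical placeholders (the lecture's training table isn't reproduced in these notes, only the resulting densities 0.13, 0.059, 0.47, 0.055), so the intermediate likelihoods will differ slightly, but the structure is the same:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density: exp(-(x-mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Hypothetical (mean, std) per class and feature; the real values
# would come from the lecture's Healthy/Cancer training data.
stats = {
    "Healthy": {"PSA": (1.2, 0.8), "Age": (55, 10)},
    "Cancer":  {"PSA": (2.8, 0.9), "Age": (68, 8)},
}
priors = {"Healthy": 0.5, "Cancer": 0.5}
patient = {"PSA": 2.6, "Age": 70}

scores = {}
for cls in stats:
    score = priors[cls]
    for feat, x in patient.items():
        mu, sigma = stats[cls][feat]
        score *= gaussian_pdf(x, mu, sigma)  # likelihood of this feature value
    scores[cls] = score

print("Prediction:", max(scores, key=scores.get))  # Cancer
```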


3. Principal Component Analysis (PCA)

The Scenario:

We want to reduce the dimensionality of a 2D dataset while preserving the most important variance.

Data Points: (2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8)

Step 1: Calculate the Mean Vector (μ)

Find the average of all X values and all Y values.

$$\mu = \begin{bmatrix} (2+3+4+5+6+7)/6 \\ (1+5+3+6+7+8)/6 \end{bmatrix} = \begin{bmatrix} 4.5 \\ 5 \end{bmatrix}$$

Step 2: Subtract the Mean

Center the data around the origin by subtracting the mean from every point.

Step 3: Calculate the Covariance Matrix

Compute $(x_i-\mu)(x_i-\mu)^T$ for each point, sum them up, and divide by the number of points ($n=6$) to find how the variables vary together.

$$\text{Covariance Matrix} = \begin{bmatrix} 2.92 & 3.67 \\ 3.67 & 5.67 \end{bmatrix}$$

Step 4: Find Eigenvalues and Eigenvectors

Solve the characteristic equation $|M-\lambda I|=0$ (where $M$ is the covariance matrix) to find the eigenvalues ($\lambda$).

$$\lambda^2 - 8.59\lambda + 3.09 = 0$$

Next, we solve $MX=\lambda X$ using the dominant eigenvalue ($\lambda_1 = 8.22$) to get the principal eigenvector.

$$\text{Principal Component} = \begin{bmatrix} 2.55 \\ 3.67 \end{bmatrix}$$

Step 5: Project the Data

Finally, we project the mean-centered data points onto this new 1D principal-component line by taking the dot product of each point with the (unit-normalized) eigenvector.
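A minimal numpy sketch of Steps 1 through 5. Note bias=True so the covariance divides by n = 6, matching the hand calculation (numpy's default divides by n - 1); the eigenvector's sign and scale are arbitrary, so it may come out as a normalized version of [2.55, 3.67]:

```python
import numpy as np

# Original 2D data points from the example
X = np.array([(2, 1), (3, 5), (4, 3), (5, 6), (6, 7), (7, 8)], dtype=float)

mu = X.mean(axis=0)                       # Step 1: mean vector -> [4.5, 5.0]
Xc = X - mu                               # Step 2: center the data
C = np.cov(Xc, rowvar=False, bias=True)   # Step 3: covariance -> [[2.92, 3.67], [3.67, 5.67]]

eigvals, eigvecs = np.linalg.eigh(C)      # Step 4: eigen-decomposition (symmetric matrix)
pc = eigvecs[:, np.argmax(eigvals)]       # unit eigenvector of the dominant eigenvalue (~8.22)

projected = Xc @ pc                       # Step 5: 1D projection via dot product
print(np.round(eigvals, 2), np.round(projected, 2))
```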


More On Naive Bayes

Why Naive Bayes

Its main strength is requiring less training data, while its primary weakness is the "naive" assumption that all features are independent, which rarely holds true in real-world scenarios.

Pros of Naive Bayes:

Cons of Naive Bayes:

Common Use Cases:


With $A: y = C_i$ (the class) and $B: x = x_j$ (the observation), Bayes' theorem gives

$$P(y=C_i \mid x=x_j) = \frac{P(x=x_j \mid y=C_i)\, P(y=C_i)}{P(x=x_j)}$$

and the classifier picks the class with the largest posterior:

$$\hat{y}_j = \arg\max_{C_i} P(y=C_i \mid x=x_j)$$

With $d$ features this becomes

$$P(y=C_i \mid x_{1j}, x_{2j}, x_{3j}, \dots, x_{dj}) = \frac{P(x_{1j}, x_{2j}, x_{3j}, \dots, x_{dj} \mid y=C_i)\, P(y=C_i)}{P(x_{1j}, x_{2j}, x_{3j}, \dots, x_{dj})}$$

combined with the marginalization rule $P(a_i)=\sum_{j=1}^{N_{b\text{-classes}}} P(a_i, b_j)$ and the independence assumption $P(A \mid B) = P(A)$.

Therefore:

$$P(y=C_i \mid x_{1j}, x_{2j}, x_{3j}, \dots, x_{dj}) = \frac{P(x_{1j} \mid y=C_i)\, P(x_{2j} \mid y=C_i)\, P(x_{3j} \mid y=C_i) \cdots P(x_{dj} \mid y=C_i)\, P(y=C_i)}{P(x_{1j})\, P(x_{2j})\, P(x_{3j}) \cdots P(x_{dj})}$$

The "Naivety"

Calculating the exact joint probability of many interacting variables (needed for $P(B)$ and $P(B|A)$) requires exponential computational complexity ($O(2^n)$: with $n$ binary features, the full joint table has $2^n$ entries). The Naïve Bayes classifier drastically simplifies this by making a strong assumption: all features are conditionally independent of each other, given the class label, so only $n$ one-dimensional distributions per class need to be estimated.

When making a decision between two classes ($w_1$ and $w_2$) given an observation $x$, the rule is to choose $w_1$ if $P(w_1 \mid x) > P(w_2 \mid x)$.

Normalization

The Marginalization Rule (often called the Law of Total Probability) is the mathematical engine behind the denominator in Bayes' Theorem. In the context of machine learning, it is what allows us to convert raw, abstract scores into clean, readable percentages that sum up to 100% (or 1.0).

(Figure: Bayes' theorem with marginal probability, AI generated)

Here is Bayes' formula:

$$P(\text{Class} \mid \text{Features}) = \frac{P(\text{Features} \mid \text{Class}) \times P(\text{Class})}{P(\text{Features})}$$

The denominator, $P(\text{Features})$, is the marginal probability. It represents the total probability of seeing this exact combination of features across all possible classes in your dataset.

How to Calculate the Marginal Probability with Multiple Features

When you have multiple features (like Temperature, Humidity, and Wind, or Age and PSA), calculating the exact probability of that specific combination occurring in the wild is extremely difficult.

Instead, the Marginalization Rule lets us calculate it by summing up the "numerators" of all our possible classes.

The formula for the marginal probability becomes:

$$P(\text{Features}) = \sum_{k=1}^{n} P(\text{Features} \mid \text{Class}_k) \times P(\text{Class}_k)$$

The Normalization Trick (Step-by-Step)

Because the marginal probability (the denominator) is the exact same for every class you are evaluating, you do not actually need to compute it right away. You can just calculate the numerators, sum them up, and use that sum to normalize your results.

Let's use the numerical Healthy vs. Cancer example from the lecture slides to see this in action!

1. Calculate the Numerator for Class A (Healthy)

We multiply the Prior by the Likelihood of the feature (PSA = 2.6), reusing $P(\text{PSA}=2.6 \mid \text{Healthy}) = 0.13$ from the Gaussian example above:

$$P(\text{Healthy}) \times P(\text{PSA}=2.6 \mid \text{Healthy}) = 0.5 \times 0.13 = 0.065$$

2. Calculate the Numerator for Class B (Cancer)

Same idea, with $P(\text{PSA}=2.6 \mid \text{Cancer}) = 0.47$:

$$P(\text{Cancer}) \times P(\text{PSA}=2.6 \mid \text{Cancer}) = 0.5 \times 0.47 = 0.235$$

3. Apply the Marginalization Rule (Find the Denominator)

To find the total marginal probability of someone having a PSA of 2.6, we simply add the numerators together:

$$P(\text{PSA}=2.6) = 0.065 + 0.235 = 0.30$$

4. Normalize to get the Final Probabilities

Now, divide each class's numerator by the marginal probability to get the true, normalized posterior probabilities:

$$P(\text{Healthy} \mid \text{PSA}=2.6) = \frac{0.065}{0.30} \approx 0.22, \qquad P(\text{Cancer} \mid \text{PSA}=2.6) = \frac{0.235}{0.30} \approx 0.78$$

By dividing by the marginalized sum, you force the probabilities to scale proportionally so that they sum to 1.0 (here, $0.22 + 0.78 = 1.0$). This tells your model not just which class is more likely, but exactly how confident it should be in that prediction!
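A minimal Python sketch of the trick, using the single-feature numerators just computed:

```python
# Unnormalized numerators: prior * likelihood for each class (PSA = 2.6 only)
numerators = {"Healthy": 0.5 * 0.13, "Cancer": 0.5 * 0.47}

# Marginalization rule: the denominator is just the sum of the numerators
marginal = sum(numerators.values())  # P(PSA = 2.6) = 0.30

# Normalize so the posteriors sum to 1.0
posteriors = {cls: num / marginal for cls, num in numerators.items()}
print(posteriors)  # {'Healthy': ~0.22, 'Cancer': ~0.78}
```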


To use that exact marginalization rule for multiple features, we have to rely on the "Naïve" part of the Naïve Bayes algorithm: Conditional Independence.

If you have a dataset with multiple features (like PSA and Age, or Outlook, Temperature, Humidity, and Wind), trying to find the true, real-world probability of that exact combination happening all at once—P(Features)—is nearly impossible. You would need a massive dataset to find enough people who are exactly 70 years old and have a PSA of exactly 2.6 just to count them.

To get around this, Naïve Bayes assumes that every feature is completely independent of the others, as long as you already know the class.

Here is how that theoretical breakdown works mathematically.

1. Breaking Down the Likelihood

Because we assume the features are independent, the joint probability of all features given a specific class—P(Features|Classk)—simply becomes the product of their individual probabilities:

$$P(\text{Features} \mid \text{Class}_k) = P(\text{Feature}_1 \mid \text{Class}_k) \times P(\text{Feature}_2 \mid \text{Class}_k) \times \cdots \times P(\text{Feature}_m \mid \text{Class}_k)$$

2. Expanding the Marginalization Rule

Now, we take that expanded, multiplied likelihood and plug it back into the marginalization formula above. To find the total marginal probability of seeing this specific combination of features across your entire dataset, you calculate this product for every single class and add them all together:

$$P(\text{Features}) = \sum_{k=1}^{n} \Big( \big[ P(\text{Feature}_1 \mid \text{Class}_k) \times \cdots \times P(\text{Feature}_m \mid \text{Class}_k) \big] \times P(\text{Class}_k) \Big)$$

3. A Theoretical Example

Let's apply this directly to the two-feature numerical example from the lecture (PSA = 2.6 and Age = 70) to see the full equation in action.

We have two classes (n=2): Healthy and Cancer.

Part A: Calculate the inner term for Class 1 (Healthy)

$$\text{Term}_1 = P(\text{PSA}=2.6 \mid \text{Healthy}) \times P(\text{Age}=70 \mid \text{Healthy}) \times P(\text{Healthy})$$

Part B: Calculate the inner term for Class 2 (Cancer)

$$\text{Term}_2 = P(\text{PSA}=2.6 \mid \text{Cancer}) \times P(\text{Age}=70 \mid \text{Cancer}) \times P(\text{Cancer})$$

Part C: Sum them up for the Marginal Probability

$$P(\text{PSA}=2.6 \text{ and } \text{Age}=70) = \text{Term}_1 + \text{Term}_2$$

Once you have that final summed number, you have successfully calculated your denominator! You then divide Term1 by this denominator to get the final normalized percentage for Healthy, and divide Term2 by this denominator to get the final normalized percentage for Cancer.
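A minimal Python sketch of Parts A through C, reusing the per-feature Gaussian likelihoods from the Healthy vs. Cancer example above:

```python
# Per-feature likelihoods P(feature | class) from the Gaussian PDFs above
likelihoods = {
    "Healthy": {"PSA": 0.13, "Age": 0.059},
    "Cancer":  {"PSA": 0.47, "Age": 0.055},
}
priors = {"Healthy": 0.5, "Cancer": 0.5}

# Parts A and B: the inner term for each class (naive independence -> product)
terms = {
    cls: priors[cls] * likelihoods[cls]["PSA"] * likelihoods[cls]["Age"]
    for cls in priors
}

# Part C: marginal probability of this exact feature combination
marginal = sum(terms.values())  # P(PSA = 2.6 and Age = 70)

# Final normalized posteriors
for cls, term in terms.items():
    print(cls, round(term / marginal, 3))  # Healthy ~0.229, Cancer ~0.771
```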

In plain terms, if you want the bottom line of the solution: when asked to do normalization, sum the numerators of all the probabilities you computed (the likelihoods times the priors for each class, whether there is one feature or more than one), then divide each numerator by that sum.