Outline

Part 1: The Problem of Induction

Inductive learning is the process of moving from specific observations to general rules.

The goal is to predict outputs for unseen data based on a mathematical training set defined as $D = {(x_{1}, y_{1}), . . ., (x_{n}, y_{n})}$ .
The core assumption is that the future will behave exactly like the past.
However, there is a massive logical gap: unlike pure deduction, induction is never logically guaranteed. Just because you observe 1,000 white swans, it does not prove the 1,001st swan won't be black.

The philosopher David Hume argued that induction cannot be rationally justified.

When we try to justify it, we say "it worked in the past," which relies on induction to justify induction—a circular argument.
In machine learning, this means data alone is never enough; we must introduce a "Prior" (initial assumptions) to narrow down the infinite possible rules that could fit the training data.

Bayesian inference provides the mathematical framework to update our beliefs as we see new data.

The core formula is Bayes' Theorem: $$P(h|d)=\frac{P(d|h)P(h)}{P(d)}$$
Prior $P (h)$ : Your initial belief in a hypothesis before seeing data.
Likelihood $P (d | h)$ : The probability of seeing this specific data if your hypothesis were true.
Evidence $P (d)$ : The total probability of seeing the data under all possible hypotheses.
Posterior $P (h | d)$ : Your updated belief after observing the data

P (h | d) \propto P (d | h) P (h)

We have a bag with two coins:
- $h_{f a i r}$ : 50% chance of Heads.
- $h_{b i a s e d}$ : 90% chance of Heads.
Prior: We pick one at random. $P (h_{f a i r}) = 0.5$ , $P (h_{b i a s e d}) = 0.5$ .
Data (d): We flip the coin once and get Heads. Likelihoods: $P (H e a d s | h_{f a i r}) = 0.5$ $P (H e a d s | h_{b i a s e d}) = 0.9$

Calculate Total Probability (Evidence) P(d): $$P(d) = P(d|hf)P(hf) + P(d|hb)P(hb)$$$$P(d) = (0.5×0.5)+(0.9×0.5) = 0.25+0.45 = 0.70$$
Calculate Posterior for $h_{b i a s e d}$ : $$P(hb|d) = \frac{Likelihood × Prior Total} {Probability (Evidence)} = \frac{0.9 × 0.5}{ 0.70} = \frac{0.45} {0.70} ≈ 0.643$$
After seeing one Head, our belief that the coin is biased increased from 50% to 64.3%
Repeat the same 3 steps after getting a second head using 0.643 as the new prior, the prior will increase further more to 76.4%

When repeating the same steps, taking our 76.4% belief as the New Prior and observe 3 more flips,

New Data (d): Two Heads and one Tail (H,H,T)
Likelihoods:
- Biased: 0.9 × 0.9 × 0.1 = 0.081
- Fair: 0.5 × 0.5 × 0.5 = 0.125
New Posterior: $$P(hbias|d) = \frac{0.081 × 0.764} {(0.081 × 0.764) + (0.125 × 0.236)} ≈ 0.677$$
The single ”Tail” significantly weakened our hypothesis that the coin is biased.

The "Hypothesis Space" (H) represents the pool of possible models the algorithm can choose from.

Underfitting: Happens when H is too weak or simple. For example, trying to fit a straight line to complex curved data.
Overfitting: Happens when H is too complex. For example, using a 20th-degree polynomial for simple linear data. The model memorizes the noise instead of learning the actual signal.

A small dataset with a highly complex hypothesis leads to overfitting.
A huge dataset with a simple hypothesis leads to underfitting.
Successful learning requires matching the complexity of the Hypothesis Space to the volume and nature of the dataset.

Because of Hume's Problem, algorithms must use an "Inductive Bias" to prefer certain hypotheses over others.

A common bias is Occam's Razor, which states we should prefer the simplest hypothesis that successfully explains the data. In Bayesian terms, simpler hypotheses are assigned a higher Prior probability $P (h)$ .

Concept	Role in Learning
Induction	Generalizing from samples to populations.
Hume’s Problem	Pure induction is logically impossible.
Bayes’ Theorem	Provides a mathematical way to update beliefs.
Hypothesis Space	The ”search area” for the learning algorithm.
Priors	The ”initial guess” that solves Hume’s problem.