Meaning of "pushing summations inward"

The concept of "pushing summations inward" is the mathematical heart of why Variable Elimination is so much more efficient than the brute-force approach. At its core, it is an algorithmic optimization built entirely on the distributive property of multiplication over addition—the same fundamental rule that states xy+xz=x(y+z).

Instead of distributing multiplication over a massive set of additions (which creates a combinatorial explosion), Variable Elimination factors out the terms that don't depend on the variable you are currently summing over.
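Written out for a sum over a binary variable A, the rule looks like this (a generic illustration in the factor notation used below, not a formula from the slides). When a factor $\phi(B)$ does not depend on A:

$$\sum_{A \in \{0,1\}} \phi(B)\,\psi(A,B) = \phi(B)\,\psi(0,B) + \phi(B)\,\psi(1,B) = \phi(B) \sum_{A \in \{0,1\}} \psi(A,B)$$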

Here is exactly what that means, mapping the algebra directly to the numerical example from Slides 3 through 9.

1. The Brute Force Approach: "Un-pushed" Summations

In the lecture's example of the chain network A → B → C, the goal is to calculate the marginal probability P(C).

To do this using the full joint distribution (Method 2, Slides 6 & 7), you define the marginal probability as the sum over all possible states of A and B:

$$P(C) = \sum_{A}\sum_{B} P(A,B,C)$$

If we expand P(A,B,C) into its initial factors, the equation looks like this:

$$P(C) = \sum_{A}\sum_{B} \big[\phi_1(A)\,\phi_2(A,B)\,\phi_3(B,C)\big]$$

The Algorithmic Flaw: Because the summations are entirely on the outside, the algorithm is forced to compute the inner product $\phi_1(A)\,\phi_2(A,B)\,\phi_3(B,C)$ for every single combination of A, B, and C before doing any addition. This generates an 8-row table ($2^3 = 8$) with 16 total multiplications.
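To see the flaw in code, here is a brute-force sketch in Python. The CPT numbers are placeholders of my own (the lecture's actual values are not reproduced here); what matters is the operation count: 8 joint entries, 2 multiplications each.

```python
import itertools

# Placeholder factors for the chain A -> B -> C (illustrative values only):
# phi1(A) = P(A), phi2(A,B) = P(B|A), phi3(B,C) = P(C|B).
phi1 = {0: 0.6, 1: 0.4}
phi2 = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}
phi3 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.5, (1, 1): 0.5}

# Brute force: enumerate all 2^3 = 8 assignments, multiply first, add last.
p_c = {0: 0.0, 1: 0.0}
for a, b, c in itertools.product((0, 1), repeat=3):
    p_c[c] += phi1[a] * phi2[(a, b)] * phi3[(b, c)]  # 2 mults x 8 rows = 16

print(p_c)
```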

2. Pushing the Summations Inward

To optimize this, we look at the variables involved in each factor.

Notice the innermost sum over A ($\sum_A$). We ask: do all the factors in the product depend on A?

The factor $\phi_3(B,C)$ does not contain the variable A. Because $\phi_3(B,C)$ acts as a constant relative to A, we can factor it out and move it to the left of the $\sum_A$ operator.

We have successfully "pushed" the sum over A inward:

$$P(C) = \sum_{B} \phi_3(B,C) \Big[\sum_{A} \phi_1(A)\,\phi_2(A,B)\Big]$$

3. Mapping to the Numerical Example (Method 1)

That newly restructured equation dictates the exact, step-by-step Variable Elimination algorithm shown in Slides 4 and 5. By isolating subproblems, we prevent the table size from exploding.

Step 1: The Inner Bracket (Eliminating A - Slide 4)

The innermost operation is now isolated: $\sum_{A} \phi_1(A)\,\phi_2(A,B)$.

  1. Product: We multiply the factors that actually care about A: $\psi(A,B) = \phi_1(A)\,\phi_2(A,B)$. This only requires calculating combinations of A and B (max table size $2^2 = 4$).

  2. Sum-out: We marginalize A out of that product to create a new, intermediate factor: $\tau_1(B) = \sum_{A} \psi(A,B)$.

Step 2: The Outer Bracket (Eliminating B - Slide 5)

We substitute our new factor $\tau_1(B)$ back into the main equation. The equation is now simpler:

$$P(C) = \sum_{B} \phi_3(B,C)\,\tau_1(B)$$

  1. Product: We multiply the remaining factors that involve B: $\psi(B,C) = \tau_1(B)\,\phi_3(B,C)$. Again, this table only tracks B and C (max table size $2^2 = 4$).

  2. Sum-out: We marginalize B out to get our final target: $\tau_2(C) = \sum_{B} \psi(B,C)$. This yields $P(C) = \{0.65, 0.35\}$ (traced in the sketch below).
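Here is the same marginal computed the Variable Elimination way, reusing the placeholder factors from the brute-force sketch above; the multiplication count drops from 16 to 8:

```python
# Step 1: eliminate A. Product psi(A,B) = phi1(A) * phi2(A,B): 4 multiplications.
psi_ab = {(a, b): phi1[a] * phi2[(a, b)] for a in (0, 1) for b in (0, 1)}
# Sum-out A: tau1(B) = sum_A psi(A,B).
tau1 = {b: psi_ab[(0, b)] + psi_ab[(1, b)] for b in (0, 1)}

# Step 2: eliminate B. Product psi(B,C) = tau1(B) * phi3(B,C): 4 multiplications.
psi_bc = {(b, c): tau1[b] * phi3[(b, c)] for b in (0, 1) for c in (0, 1)}
# Sum-out B: tau2(C) = sum_B psi(B,C) -- the marginal P(C).
tau2 = {c: psi_bc[(0, c)] + psi_bc[(1, c)] for c in (0, 1)}

print(tau2)  # matches the brute-force p_c exactly
```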

The Efficiency Payoff

By pushing the summations inward, we isolate the mathematical operations into decoupled blocks.

As shown in Slide 8, instead of generating a monolithic $2^3 = 8$ row table, we generated two smaller operations that each only required a maximum table size of $2^2 = 4$. If we had 50 variables in a chain, the brute-force equation would require calculating a $2^{50}$-row table before adding anything together. By pushing summations inward, Variable Elimination guarantees we only ever compute localized products between directly connected variables.
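To make the 50-variable claim concrete, here is a minimal sketch (my own helper, assuming binary variables) that marginalizes a chain of any length using only 2×2 operations:

```python
def chain_marginal(prior, cpts):
    """prior: P(X1) as [p0, p1]; cpts: list of 2x2 tables t[x_prev][x_next].
    Eliminates X1, X2, ... in order; each step touches only a 2x2 product."""
    message = prior
    for t in cpts:
        message = [sum(message[x] * t[x][x_next] for x in (0, 1))
                   for x_next in (0, 1)]
    return message

# A 50-variable chain: 49 identical placeholder CPTs, each row summing to 1.
cpt = [[0.9, 0.1], [0.3, 0.7]]
print(chain_marginal([0.5, 0.5], [cpt] * 49))  # P(X50), no 2^50 table needed
```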


Why does the order in which we eliminate variables matter so much?

The order in which you eliminate variables is the single most important factor in Variable Elimination because it directly dictates the computational cost (both time and memory) of the algorithm.

To put it simply: a good elimination order makes a problem trivially easy to solve, while a bad order can make the exact same problem mathematically impossible for a computer to process.

Here is a breakdown of why this happens:

1. It Dictates the Size of Intermediate Tables

Every time you eliminate a variable, you have to multiply together all the factors that involve that variable. The resulting intermediate table contains one entry for every joint assignment of the variables appearing in that product, so its size grows exponentially with the number of distinct variables the product pulls in. A good order keeps each product small; a bad order drags many variables into one giant table.

2. The "Fill-in Edge" Effect and Induced Width

When you eliminate a variable from the graph, you conceptually "marry" all of its neighbors to one another, creating what are called fill-in edges. All of those neighbors are merged into a single clique.

The complexity of the Variable Elimination algorithm is completely determined by the maximum factor size generated during the entire process. In graph theory, the size of the largest clique created during elimination is related to the graph's induced width. The time and memory required scale exponentially with this width.
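To make that concrete, here is a small sketch (my own helper name, assuming an undirected graph stored as a dict of neighbor sets) that simulates an elimination order and reports the largest factor scope it creates, i.e. the induced width plus one:

```python
def max_factor_scope(adj, order):
    """Simulate elimination on an undirected graph {vertex: set_of_neighbors}
    and return the largest factor scope created (induced width + 1).
    A sketch with my own naming, not code from the lecture."""
    adj = {v: set(ns) for v, ns in adj.items()}  # defensive copy
    worst = 0
    for v in order:
        nbrs = adj.pop(v)
        worst = max(worst, len(nbrs) + 1)  # factor over v and its neighbors
        for u in nbrs:                     # marry the neighbors (fill-in edges)
            adj[u].discard(v)
            adj[u] |= nbrs - {u}
    return worst
```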

3. Avoiding the Exponential Wall

Imagine a network shaped like a star, where one central node A connects to 20 outer nodes. If you eliminate A first, every factor touching A must be multiplied together, merging all 20 neighbors into a single factor whose table has $2^{20}$ (over a million) entries. If you instead eliminate the 20 leaves first, no intermediate table ever involves more than two variables.

This is why, in the lecture's student performance example, eliminating the unobserved leaf node S first was highly efficient; it didn't force any other variables to merge. Finding the absolute optimal elimination order is actually an NP-hard problem, but using heuristic strategies (like eliminating nodes with the fewest neighbors first) prevents the algorithm from crashing your computer's memory. The sketch below runs both star orders through the elimination simulator above.
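Running the simulator on the 20-leaf star makes the asymmetry stark: eliminating the hub first creates a factor over 21 variables, while eliminating the leaves first never goes beyond 2.

```python
leaves = [f"X{i}" for i in range(20)]
star = {"A": set(leaves), **{leaf: {"A"} for leaf in leaves}}

print(max_factor_scope(star, ["A"] + leaves))  # 21 -> a 2^21-entry product table
print(max_factor_scope(star, leaves + ["A"]))  # 2  -> nothing bigger than 2x2
```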


Purpose of Moralization

The primary purpose of moralization is to convert a directed Bayesian Network into an undirected graph so that the Variable Elimination (VE) algorithm can process it correctly.

Here is why this step is mathematically necessary:

The Problem: Variable Elimination manipulates factors, and each node's CPT is a factor over the node and all of its parents. If two parent nodes share a child, they appear together in that child's CPT, but they might not be directly connected to each other in the original directed graph, so simply dropping the arrows would hide the fact that they belong to the same factor.

How Moralization Solves This: Moralization fixes this by "marrying the parents". You draw a new undirected edge between any two parents that share a child, and then you drop the directional arrows from every edge.

For example, in the lecture's student performance network, Intelligence (I) and Difficulty (D) both point to Grade (G), so they appear together in the factor P(G|D,I). Moralization requires adding an edge between I and D to ensure that their dependency is captured correctly when they are merged into the Grade factor during calculation.
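A minimal sketch of the moralization step, assuming the DAG is stored as a child-to-parents dict (the encoding and function name are mine, not the lecture's):

```python
from itertools import combinations

def moralize(parents):
    """Return the undirected edge set: arrows dropped, co-parents married."""
    edges = set()
    for child, ps in parents.items():
        for p in ps:
            edges.add(frozenset((p, child)))   # drop the arrow direction
        for p1, p2 in combinations(ps, 2):
            edges.add(frozenset((p1, p2)))     # marry parents sharing a child
    return edges

# The student network: moralization adds exactly one new edge, D-I.
student = {"G": ["D", "I"], "S": ["I"], "L": ["G"]}
print(sorted(tuple(sorted(e)) for e in moralize(student)))
```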


"Student Performance" example

1. Network Setup and Moralization (Slides 12–14)

The example uses a 5-node Bayesian Network representing a student's academic path. The variables are Difficulty (D), Intelligence (I), SAT/Score (S), Grade (G), and Letter (L). All variables are binary, meaning they can be either 0 (false/poor) or 1 (true/good).

(Figure: the five-node student network; D and I point to G, I points to S, and G points to L.)

To prepare for Variable Elimination, the directed graph is "moralized" into an undirected graph. Because D and I share the child node G, they are "married" by adding an undirected edge between them. This ensures that the conditional probability table P(G|D,I) forms a complete clique containing all three related variables.

2. The Query and Strategy (Slide 15)

The goal is to calculate the marginal probability of the student getting a good recommendation letter, or P(L=1). To do this, we must sum out (marginalize) all other variables: S, I, D, and G.

The chosen elimination order is S,I,D,G. Applying the principle of pushing summations inward, the mathematical equation looks like this:

$$P(L) = \sum_{G} P(L \mid G) \sum_{D} P(D) \sum_{I} P(I)\,P(G \mid D,I) \sum_{S} P(S \mid I)$$
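Before the hand-worked steps, here is a generic sketch of the product/sum-out machinery those steps perform. The factor encoding, helper names, and every CPT number below are my own placeholders (the lecture's actual probabilities are not reproduced); only the factor scopes match the network:

```python
from itertools import product

def factor_product(f, g):
    """Pointwise product of factors encoded as (scope_tuple, {assignment: value})."""
    fv, ft = f
    gv, gt = g
    scope = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for asg in product((0, 1), repeat=len(scope)):
        a = dict(zip(scope, asg))
        table[asg] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return scope, table

def sum_out(f, var):
    """Marginalize one variable out of a factor."""
    fv, ft = f
    scope = tuple(v for v in fv if v != var)
    table = {}
    for asg, val in ft.items():
        key = tuple(x for v, x in zip(fv, asg) if v != var)
        table[key] = table.get(key, 0.0) + val
    return scope, table

def variable_elimination(factors, order):
    """For each variable in order: multiply the factors that mention it, sum it out."""
    for var in order:
        involved = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        prod = involved[0]
        for f in involved[1:]:
            prod = factor_product(prod, f)
        factors.append(sum_out(prod, var))
    result = factors[0]
    for f in factors[1:]:
        result = factor_product(result, f)
    return result

# Placeholder CPTs (made-up numbers). Scopes match the lecture's network:
# P(D), P(I), P(S|I), P(G|D,I), P(L|G); all five variables are binary.
P_D = (("D",), {(0,): 0.6, (1,): 0.4})
P_I = (("I",), {(0,): 0.7, (1,): 0.3})
P_S = (("S", "I"), {(0, 0): 0.95, (1, 0): 0.05, (0, 1): 0.2, (1, 1): 0.8})
P_G = (("G", "D", "I"), {(1, 0, 0): 0.6, (0, 0, 0): 0.4,
                         (1, 1, 0): 0.3, (0, 1, 0): 0.7,
                         (1, 0, 1): 0.9, (0, 0, 1): 0.1,
                         (1, 1, 1): 0.7, (0, 1, 1): 0.3})
P_L = (("L", "G"), {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6})

print(variable_elimination([P_D, P_I, P_S, P_G, P_L], ["S", "I", "D", "G"]))
```

Running it with the order ["S", "I", "D", "G"] reproduces exactly the sequence of eliminations traced in Steps 1 through 4 below.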

3. Step-by-Step Execution (Slides 16–19)

Step 1: Eliminating S (The Barren Node)

Only one factor mentions S: $P(S \mid I)$. Summing it out produces $\tau_1(I) = \sum_S P(S \mid I) = 1$ for every value of I, because the rows of a CPT always sum to 1. The factor vanishes entirely, which is why eliminating an unobserved leaf ("barren") node is essentially free.

Step 2: Eliminating I

Eliminating I means multiplying the factors that mention it, $P(I)$ and $P(G \mid D,I)$, and summing I out to produce $\tau_2(G,D)$ (the algebra is spelled out under Step 3). In the lecture, we only explicitly calculated the values for G=1 during this step, as a mathematical shortcut, not because G=0 doesn't exist.

Here is the full breakdown of why that shortcut works, and how it aligns with the lecture's own summary.

1. The Shortcut: Why We Skipped G=0

When we eliminate I, the resulting factor $\tau_2(G,D)$ technically contains four values, one for each combination of G and D.

However, the professor knew that the very next step was to eliminate D to create $\tau_3(G)$. Because we have not introduced any strict evidence (like knowing for a fact the student failed), the factor $\tau_3(G)$ is mathematically equivalent to the actual marginal probability P(G).

One of the foundational rules of probability is that all possible states of a single variable must sum to exactly 1. Therefore:

$$P(G=0) + P(G=1) = 1$$

By calculating only the G=1 half of the $\tau_2(G,D)$ table, the professor had exactly enough information to find $\tau_3(G=1)$ (which was 0.507). Instead of grinding through the math for G=0, they simply subtracted 0.507 from 1 to get 0.493. It was a trick to cut the manual calculation time in half.

In short, your instinct was completely correct: a full algorithmic execution would calculate G=0 during the elimination of I. The professor just used human intuition to skip it.
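In equation form, the shortcut is just the sum-to-one rule applied to $\tau_3$, using the lecture's own numbers:

$$\tau_3(G=0) = 1 - \tau_3(G=1) = 1 - 0.507 = 0.493$$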

Step 3: Eliminating D

To understand why the intermediate factor $\tau_3(G)$ is exactly equal to the real-world marginal probability P(G), we have to look at what Variable Elimination is actually doing under the hood.

Variable Elimination is not an approximation; it is algebraically identical to calculating the full joint distribution and marginalizing variables out, just done in a smarter order.

Here is the step-by-step mathematical proof of why $\tau_3(G) = P(G)$, based on the network structure in the lecture.

The Mathematical Proof

If we want to find the true marginal probability P(G) for the Grade node, we must sum over its parents (Intelligence and Difficulty). Because D and I are independent parent nodes, the true probability is defined as:

$$P(G) = \sum_{D}\sum_{I} P(D,I,G) = \sum_{D}\sum_{I} P(D)\,P(I)\,P(G \mid D,I)$$

Now, let's look at how the Variable Elimination algorithm built $\tau_3(G)$ in the lecture example:

1. The creation of $\tau_2(G,D)$: When eliminating I, the algorithm multiplied the prior of I with the Conditional Probability Table (CPT) for G, and summed out I:

$$\tau_2(G,D) = \sum_{I} P(I)\,P(G \mid D,I)$$

2. The creation of $\tau_3(G)$: Next, when eliminating D, the algorithm took the prior of D, multiplied it by our new $\tau_2(G,D)$ factor, and summed out D:

$$\tau_3(G) = \sum_{D} P(D)\,\tau_2(G,D)$$

3. Combining the equations

If we substitute the mathematical definition of $\tau_2(G,D)$ directly into the equation for $\tau_3(G)$, we get this:

$$\tau_3(G) = \sum_{D} P(D)\Big[\sum_{I} P(I)\,P(G \mid D,I)\Big]$$

Because of the distributive property, we can pull the summations to the outside and group the probabilities:

$$\tau_3(G) = \sum_{D}\sum_{I} P(D)\,P(I)\,P(G \mid D,I)$$

This final equation perfectly matches the definition of P(G). By systematically pushing the summations inward, the algorithm naturally computed the exact marginal probability of the Grade node.
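If you are using the generic sketch from Section 2 above, the equivalence can also be checked numerically with its placeholder factors (illustrative numbers only):

```python
# tau3 built the VE way (eliminate I, then D) vs. the direct double sum:
tau2 = sum_out(factor_product(P_I, P_G), "I")
tau3 = sum_out(factor_product(P_D, tau2), "D")
direct = sum_out(sum_out(factor_product(factor_product(P_D, P_I), P_G), "I"), "D")
print(tau3)    # factor over ('G',)
print(direct)  # the same distribution: tau3(G) == P(G)
```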

The "No Evidence" Condition

This equivalency only happened because no evidence was observed in any of the ancestor nodes (D, I, or S) prior to calculating G.

If we had observed evidence (for example, if we knew the student was highly intelligent, so I=1), the math would change: the factors would be restricted to the observed value, and $\tau_3(G)$ would then equal the unnormalized joint $P(G, I=1)$ rather than the marginal $P(G)$, so its entries would no longer sum to 1.

Because there was no evidence acting as a constraint in the lecture's scenario, $\tau_3(G)$ maintained the properties of a pure probability distribution, allowing the professor to confidently use the shortcut $\tau_3(G=0) = 1 - 0.507 = 0.493$.

Step 4: Final Elimination of G

The only remaining factors are $P(L \mid G)$ and $\tau_3(G)$. Multiplying them and summing out G gives the answer directly:

$$P(L) = \sum_{G} P(L \mid G)\,\tau_3(G)$$

The Conclusion

The calculation reveals that there is approximately a 50.6% chance of the student receiving a good recommendation letter.

As Slide 20 summarizes, this specific elimination order was highly efficient. By eliminating the cheap leaf node S first and isolating operations, the algorithm never had to build the full 32-row joint distribution table ($2^5 = 32$). The largest intermediate table it ever constructed had only 8 rows ($2^3 = 8$), which occurred during the elimination of I, when a factor over G, D, and I was briefly evaluated.