Sequence Topologies and Parameter Counting

When structuring recurrent architectures, the underlying sequence topology dictates how you shape your data tensors. Equally important is knowing how to calculate the exact number of trainable parameters, so you can estimate a model's theoretical footprint before training.

1. Sequence Topologies

Standard feed-forward networks (One-to-One) are restricted to fixed-size inputs and outputs, lacking the recurrence needed for sequential data. By introducing a loop mechanism, Recurrent Neural Networks (RNNs) can be unrolled across discrete time steps to handle dynamic sequences in several topological configurations (two of these are sketched in the code below):

  1. One-to-Many: a single input produces a sequence (e.g., generating a melody from a seed note).

  2. Many-to-One: a sequence produces a single output (e.g., sentiment classification of a sentence).

  3. Many-to-Many: a sequence maps to another sequence, either of equal length (e.g., tagging every word) or of different lengths (e.g., machine translation).
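
The choice of topology determines which tensors you feed in and which activations you read out. Below is a minimal sketch, assuming PyTorch's nn.RNN (my own illustration, not from the lecture), showing how the same unrolled layer serves the Many-to-Many and Many-to-One cases depending on which outputs you keep:

```python
import torch
import torch.nn as nn

batch, steps, features, hidden = 2, 5, 1, 3
rnn = nn.RNN(input_size=features, hidden_size=hidden, batch_first=True)

x = torch.randn(batch, steps, features)   # (batch, time, features)
outputs, h_last = rnn(x)

# Many-to-Many: keep the activation at every time step.
print(outputs.shape)  # torch.Size([2, 5, 3]) -> one hidden vector per step
# Many-to-One: keep only the final hidden state (e.g., for classification).
print(h_last.shape)   # torch.Size([1, 2, 3]) -> (num_layers, batch, hidden)
```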
2. Parameter Calculation

Unlike standard deep networks, where weights only connect successive layers, RNNs introduce weight matrices that govern the temporal states. At any time step $t$, the output of an activation unit ($a^{<t>}$) depends on both the current input ($X^{<t>}$) and the previous activation output ($a^{<t-1>}$).

The hidden state update formula is defined as:

$$a^{<t>} = \tanh(X^{<t>} W_{ax} + a^{<t-1>} W_{aa} + b_1)$$

From this operation, we can mathematically derive the total number of trainable parameters for any hidden layer. Let $N$ be the number of units in the current hidden layer, and $I$ be the number of inputs coming into that layer.

Hidden Layer Calculation:

  1. Input Weights ($W_{ax}$): Every input feature connects to every unit. Size = $I \times N$.

  2. Recurrent Weights ($W_{aa}$): Every unit connects to every unit from the previous time step. Size = $N \times N$.

  3. Biases ($b_1$): One bias per unit. Size = $N$.

Combining these, the parameter count for a single RNN hidden layer is:

$$P_{hidden} = N \times (I + N + 1)$$
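
As a quick sanity check, here is a minimal NumPy sketch (my own illustration; $I = 2$ and $N = 3$ are chosen arbitrarily) that performs one state update with exactly these matrix shapes and confirms the count:

```python
import numpy as np

I, N = 2, 3                       # inputs into the layer, units in the layer
W_ax = np.random.randn(I, N)      # input weights, size I x N
W_aa = np.random.randn(N, N)      # recurrent weights, size N x N
b_1 = np.zeros(N)                 # one bias per unit, size N

x_t = np.random.randn(1, I)       # current input X^{<t>} as a row vector
a_prev = np.zeros((1, N))         # previous activation a^{<t-1>}
a_t = np.tanh(x_t @ W_ax + a_prev @ W_aa + b_1)  # the update formula above

# Total trainable parameters: I*N + N*N + N == N * (I + N + 1)
assert W_ax.size + W_aa.size + b_1.size == N * (I + N + 1)  # 3*(2+3+1) = 18
```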

Output Layer Calculation:

To generate the final prediction $\hat{Y}$, the final hidden state connects to the output nodes. If $O$ is the output size, the calculation requires $W_{ya}$ (weights) and $b_2$ (biases):

$$P_{output} = O \times (N + 1)$$

Example Derivation (Two Hidden Layers)

Following the mathematical walkthrough from the lecture involving a network with 1 Input Feature, a 3-unit Hidden Layer 1, a 4-unit Hidden Layer 2, and a 1-unit Output:

  1. Hidden Layer 1 ($I = 1$, $N = 3$): $3 \times (1 + 3 + 1) = 15$ parameters.

  2. Hidden Layer 2 ($I = 3$, $N = 4$): the 3 activations of Layer 1 are its inputs, so $4 \times (3 + 4 + 1) = 32$ parameters.

  3. Output Layer ($N = 4$, $O = 1$): $1 \times (4 + 1) = 5$ parameters.

  Total: $15 + 32 + 5 = 52$ trainable parameters.
Notice how the $N \times N$ recurrent connection causes the parameter count to scale quadratically with the size of the hidden dimension.
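
To make the walkthrough easy to re-check, here is a small Python helper (my own illustration, not from the lecture) that applies the two formulas above to the same architecture:

```python
# Apply P_hidden = N * (I + N + 1) and P_output = O * (N + 1) to the
# 1 -> 3 -> 4 -> 1 network from the walkthrough above.
def rnn_hidden_params(inputs: int, units: int) -> int:
    """Parameters of one RNN hidden layer: N * (I + N + 1)."""
    return units * (inputs + units + 1)

def rnn_output_params(units: int, outputs: int) -> int:
    """Parameters of the output layer: O * (N + 1)."""
    return outputs * (units + 1)

layer1 = rnn_hidden_params(inputs=1, units=3)   # 3 * (1 + 3 + 1) = 15
layer2 = rnn_hidden_params(inputs=3, units=4)   # 4 * (3 + 4 + 1) = 32
output = rnn_output_params(units=4, outputs=1)  # 1 * (4 + 1)     = 5

print(layer1, layer2, output, layer1 + layer2 + output)  # 15 32 5 52
```

Note that framework-reported counts can differ slightly: Keras's SimpleRNN matches $N \times (I + N + 1)$ exactly, while PyTorch's nn.RNN stores two separate bias vectors per layer (bias_ih and bias_hh), adding $N$ extra parameters per hidden layer.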


"Every unit connects to every unit from the previous time step." Why so? Shouldn't it only be the same unit? Why all units in the layer

It is very intuitive to assume that a single memory unit should only be responsible for its own continuous timeline.

However, if an RNN were built that way, it would severely limit the model's ability to understand complex data. To see why, we can look at the mathematics of the recurrent weight matrix ($W_{aa}$) and how networks build "distributed representations," particularly in sequence tasks like natural language processing.

The Mathematical Difference

When calculating the hidden state at time $t$, the equation is:

$$a^{<t>} = \tanh(X^{<t>} W_{ax} + a^{<t-1>} W_{aa} + b_1)$$

If each unit connected only to itself, $W_{aa}$ would be restricted to a diagonal matrix. Full connectivity makes it a dense $N \times N$ matrix, in which entry $(i, j)$ lets unit $j$ read unit $i$'s state from the previous time step.

Why is a Dense Matrix Necessary? (Feature Interaction)

In a neural network, each hidden unit learns to detect and track a specific, abstract "feature" within the data. The true power of deep learning comes from allowing these isolated features to interact with one another.

Imagine you are training an RNN language model to process the sentence: "The cats, which the owner fed,..."

The network needs to predict the next word (e.g., "are"). At any given time step, different units in the hidden layer are tracking different contextual clues:

  1. Unit 1 might track the grammatical number of the subject ("cats" is plural).

  2. Unit 2 might track clause structure (the dependent clause "which the owner fed" has just ended).

  3. Unit 3 might track the semantic category of the subject (it is an animal).

If the units connected only to themselves, they would act as completely isolated memory tracks. Unit 2 would know the dependent clause had just ended, but it wouldn't know what the main subject was.

By fully connecting the layer, the network pools all of these individual features into a single, unified Context Vector. At the next time step, Unit 1 can update its own state based not just on its own past but also on what Unit 2 and Unit 3 detected in the previous step. This cross-communication allows the network to logically deduce: "The dependent clause just ended (Unit 2's past state), and the main subject was a plural animal (Unit 1's and Unit 3's past states), therefore the next expected feature should be a plural verb."

In short, connecting every unit to every unit allows the RNN to maintain a holistic, interacting "state of the world" rather than a collection of blind, parallel tracks.
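
A toy NumPy sketch (my own illustration) makes the difference concrete: with a diagonal $W_{aa}$, each unit's state depends only on its own past, while a dense $W_{aa}$ lets every unit read every other unit's previous state:

```python
import numpy as np

N = 3
a_prev = np.array([1.0, 0.0, 0.0])  # only Unit 1 holds information so far

W_diag = np.diag([0.5, 0.5, 0.5])   # isolated tracks: units talk only to themselves
W_dense = np.full((N, N), 0.5)      # fully connected recurrence

print(np.tanh(a_prev @ W_diag))    # [0.46 0.   0.  ] -> Units 2 and 3 stay blind
print(np.tanh(a_prev @ W_dense))   # [0.46 0.46 0.46] -> Unit 1's feature reaches all
```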


RNN vs. GRU vs. LSTM

To understand how Recurrent Neural Networks (RNNs) evolve into more advanced models, it is helpful to look at how they manage the flow of information over time. Standard RNNs struggle with maintaining long-term context, which led to the introduction of specialized "gating" mechanisms to explicitly control what information is kept, updated, or discarded.

1. The Gated Recurrent Unit (GRU)

The GRU introduces a modification to the standard RNN by adding an Update Gate ($\Gamma_u$). This gate acts as a filter that determines how much of the past knowledge needs to be passed along to the future.
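
Written out (a sketch of the standard simplified GRU equations, consistent with the lecture's $\Gamma_u$ notation; the full GRU adds a relevance gate $\Gamma_r$, omitted here):

$$\Gamma_u = \sigma(W_u[C^{<t-1>}, X^{<t>}] + b_u)$$

$$\tilde{C}^{<t>} = \tanh(W_c[C^{<t-1>}, X^{<t>}] + b_c)$$

$$C^{<t>} = \Gamma_u * \tilde{C}^{<t>} + (1 - \Gamma_u) * C^{<t-1>}$$

When $\Gamma_u \approx 1$, the memory is overwritten with the new candidate $\tilde{C}^{<t>}$; when $\Gamma_u \approx 0$, the old memory is carried forward almost unchanged, which is what lets the GRU preserve long-term context.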

2. Long Short-Term Memory (LSTM)

The LSTM architecture builds upon the gating concept of the GRU but introduces more granular control over the memory state through two major structural changes.

A. Splitting the Gates

Instead of using a single Update Gate to govern both what is added and what is removed, the LSTM splits this responsibility into two independent gates:

  1. Update Gate ($\Gamma_u$): controls how much of the new candidate memory $\tilde{C}^{<t>}$ is written into the cell.

  2. Forget Gate ($\Gamma_f$): controls how much of the previous memory $C^{<t-1>}$ is retained or discarded.
B. Differentiating Memory from Output

In a GRU, the internal memory cell value is often exposed directly to the subsequent classification layers (like a Softmax layer). However, the lecture notes a critical warning regarding LSTMs: simply combining the independent Update and Forget gates ($\Gamma_u * \tilde{C}^{<t>} + \Gamma_f * C^{<t-1>}$) yields a value that is not bounded by $[-1, 1]$. Because the two gates are independent, both can be close to 1 at once; with $\tilde{C}^{<t>}$ and $C^{<t-1>}$ near 1, the sum approaches 2.

To solve this, the LSTM explicitly differentiates between the internal Cell Memory ($C^{<t>}$) and the visible Cell Output ($a^{<t>}$).
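
Sketched in the same notation (these are the standard LSTM update equations; $\Gamma_o$, the Output Gate, is the standard third gate, not named explicitly above):

$$C^{<t>} = \Gamma_u * \tilde{C}^{<t>} + \Gamma_f * C^{<t-1>}$$

$$a^{<t>} = \Gamma_o * \tanh(C^{<t>})$$

The extra $\tanh$ squashes the unbounded memory back into $[-1, 1]$ before it is exposed, and since $\Gamma_o \in (0, 1)$, the visible output $a^{<t>}$ stays bounded while the internal cell $C^{<t>}$ is free to accumulate long-term information.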