This note covers the content of the Week 7 slides; for more on topics mentioned in class, like YOLO and ResNet, see here.

At its core, transfer learning is a machine learning technique where a model developed for one task is reused as the starting point for a model on a second, related task.

Think of it like learning to play the piano. Once you understand reading sheet music, rhythm, and basic finger coordination, learning to play the organ or the harpsichord becomes significantly easier. You don't have to relearn what a musical note is; you just adapt your existing knowledge to the new instrument.

In deep learning, this means taking a model that someone else has already spent immense amounts of time, data, and computing power training (often on massive datasets like ImageNet, which contains millions of images), and tweaking it to solve your specific problem.

Here is a breakdown of how it works and why it is so powerful, particularly with Convolutional Neural Networks (CNNs).

How Transfer Learning Works in CNNs

To understand why transfer learning is so effective with CNNs, you have to look at how CNNs learn to "see." They process images hierarchically:

  1. Early Layers (The Foundation): The first few layers of a CNN learn very generic, universal features. They detect simple edges, color blobs, curves, and textures. These features are useful for almost any image task, whether you are looking at a dog, a car, or an X-ray.

  2. Middle Layers (The Shapes): These layers combine the edges and textures to find shapes and patterns, like circles, squares, or specific structural arrangements.

  3. Late Layers (The Specifics): The final convolutional layers learn highly complex, task-specific features. If the model was trained to recognize dogs, these layers are looking for snouts, floppy ears, or tails.

  4. The "Head" (The Classifier): At the very end of the network, the features are flattened and passed into dense, fully connected layers that spit out the final prediction (e.g., "This is a Golden Retriever").

When we use transfer learning, we typically strip away that final "Head" (the classifier) because it is entirely specialized to the original task. We keep the convolutional layers (often called the convolutional base) because they hold all that valuable, foundational knowledge about how to extract features from an image.

The Two Main Strategies

Once you have your pre-trained convolutional base, you add a brand new, untrained "Head" designed for your specific categories. From there, you generally choose between two training strategies:

1. Feature Extraction

In this approach, you freeze the entire convolutional base. "Freezing" means you tell the network not to update the weights of these layers during training. You pass your new images through the frozen base to extract the features, and you only train your newly added classifier Head to make sense of those features.
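As a concrete illustration, here is a minimal feature-extraction sketch in PyTorch. The ResNet-18 backbone and the five-class head are assumptions made for this example, not choices from the slides:

```python
# Minimal feature-extraction sketch (assumed setup: ResNet-18, 5 classes).
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5  # hypothetical number of target categories

# Load a convolutional base pre-trained on ImageNet
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze the entire base: its weights will not be updated during training
for param in model.parameters():
    param.requires_grad = False

# Replace the old "Head" with a fresh, trainable classifier for our classes
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Only the new Head's parameters are handed to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
```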

2. Fine-Tuning

Fine-tuning goes a step further. You still train your new classifier Head, but you also unfreeze some of the top layers of the convolutional base. You then train both the Head and those top convolutional layers together at a very low learning rate. This allows the model to slightly adjust its complex feature detectors to better suit your specific data.
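A matching fine-tuning sketch, again with hypothetical layer choices and learning rates (torchvision's ResNet calls its top residual stage `layer4`):

```python
# Fine-tuning sketch: unfreeze the top of the base and train it together
# with the Head at very low learning rates. Rates and layers are assumptions.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 5)  # new Head (5 classes assumed)

# Freeze everything first, then unfreeze only the top residual stage
for param in model.parameters():
    param.requires_grad = False
for param in model.layer4.parameters():
    param.requires_grad = True
for param in model.fc.parameters():
    param.requires_grad = True

# Separate, very low learning rate for the unfrozen convolutional layers
optimizer = torch.optim.Adam([
    {"params": model.layer4.parameters(), "lr": 1e-5},  # gentle adjustments
    {"params": model.fc.parameters(), "lr": 1e-4},      # fresh Head
])
```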

Why is Transfer Learning a Game Changer?

Because the pre-trained base already encodes the enormous investment of data, time, and computing power described above, transfer learning lets you reach strong results on your own task with far less data and far shorter training time than building a network from scratch.

1. Image Segmentation

Image segmentation using neural networks has revolutionized computer vision because it automatically learns features from data, unlike traditional methods that require manual feature extraction. The lecture outlines three primary types of segmentation: semantic segmentation (labeling every pixel by class), instance segmentation (separating individual objects of the same class), and panoptic segmentation (combining both).

2. Classification and Localization

While basic classification uses a CNN and a Softmax layer to determine what an object is (e.g., Person, Fruit, Car, or Background), localization involves drawing a bounding box around the detected object.

For localization and classification, the network generates an output vector, Y, which includes the following parameters:

  1. Pc: the probability that an object is present in the image at all.

  2. bx, by: the coordinates of the bounding box center.

  3. bh, bw: the height and width of the bounding box.

  4. c1, c2, c3: the class probabilities (e.g., Person, Fruit, or Car).

If the network determines there is no object (Pc=0), the rest of the values in the vector are treated as "Don't Care" parameters.
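To make the layout concrete, here is a hypothetical pair of target vectors following the eight-element structure above; all the numbers are invented for illustration:

```python
# Illustrative target vectors: [Pc, bx, by, bh, bw, c1, c2, c3].
import torch

# Object present: Pc=1, box centered at (0.5, 0.4), size 0.3 x 0.6,
# one-hot class entries picking out class 2 ("Fruit")
y_object = torch.tensor([1.0, 0.5, 0.4, 0.3, 0.6, 0.0, 1.0, 0.0])

# No object: Pc=0 and the remaining seven entries are "Don't Care"
# (conventionally written as zeros or question marks)
y_background = torch.tensor([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```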

Loss Calculations

To train the network, loss must be calculated. The slides highlight a squared error approach:

$$S = (Y_1 - \hat{Y}_1)^2 + (Y_2 - \hat{Y}_2)^2 + (Y_3 - \hat{Y}_3)^2 + \dots + (Y_8 - \hat{Y}_8)^2$$

Other loss functions can also be used, for example cross-entropy for the classification terms.
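Here is a small sketch of how the squared error and the "Don't Care" rule could be combined in code; this is an illustrative implementation, not the one given in the slides:

```python
# Squared-error loss with the "Don't Care" rule from above:
# when the target Pc is 0, only the first term contributes.
import torch

def localization_loss(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    if y[0] == 1:                    # object present: all eight terms count
        return torch.sum((y - y_hat) ** 2)
    return (y[0] - y_hat[0]) ** 2    # no object: only the Pc term counts
```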

3. Segmentation Architectures

The lecture introduces encoder-decoder architectures used heavily in segmentation tasks, most notably the Fully Convolutional Network (FCN) and the U-Net, which are covered in detail in the autoencoder discussion below.

4. Landmark Detection

Landmark detection is the process of identifying highly specific coordinate points on a target, such as the corners of the eyes on a face or the joints of a body.
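In practice this is usually framed as a regression problem: the network's Head simply outputs two numbers, (x, y), per landmark. A minimal sketch, where the 512-dimensional feature vector and the 68 keypoints are both hypothetical choices:

```python
# Landmark head as pure regression: two coordinates per landmark.
import torch
import torch.nn as nn

NUM_LANDMARKS = 68                     # e.g., 68 facial keypoints (assumed)
head = nn.Linear(512, NUM_LANDMARKS * 2)

features = torch.randn(1, 512)         # stand-in for CNN-extracted features
coords = head(features).view(-1, NUM_LANDMARKS, 2)
print(coords.shape)                    # torch.Size([1, 68, 2])
```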

5. Sliding Window Technique

Finally, the lecture touches on the sliding window technique, which is an older method for object detection. It involves moving a designated "window" across an image in steps to check for the presence of an object. The precision and scale of this technique are managed by using different window sizes and selecting a specific stride (the distance the window moves at each step).
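A minimal sketch of the technique in NumPy; `classify` is a hypothetical stand-in for whatever trained classifier scores each crop:

```python
# Slide a square window across a grayscale image in fixed strides.
import numpy as np

def sliding_window(image: np.ndarray, window: int, stride: int):
    h, w = image.shape
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            yield top, left, image[top:top + window, left:left + window]

# Usage: score every crop, repeating with different window sizes for scale
# for top, left, crop in sliding_window(img, window=64, stride=16):
#     score = classify(crop)   # hypothetical classifier
```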


Autoencoders

An autoencoder is a type of artificial neural network designed to learn highly efficient, compressed representations of data. Instead of retaining all the complex, raw details of an input (like the millions of individual pixels in an image), the network learns to describe that data using as few key mathematical features as possible.

How Autoencoders Work

The architecture of an autoencoder consists of three main components that form an "hourglass" shape:

[Figure: autoencoder diagram showing data compression down to an embedding vector]

  1. The Encoder: This part of the network takes the high-dimensional input data (like an image, or complex tabular data) and compresses it step-by-step into a smaller and smaller set of numbers.

  2. The Bottleneck (Latent Space): This is the smallest layer in the middle of the network. It holds the "latent representation"—a low-dimensional, highly condensed summary of the original data's essential features. The number of neurons in this bottleneck determines the "latent dimension" (e.g., if there are 2 neurons, the data is compressed into a 2D space coordinate).

  3. The Decoder: This network works in reverse. It takes the compressed, low-dimensional coordinates from the bottleneck and attempts to rebuild the original high-dimensional input data from scratch.
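A minimal sketch of this hourglass structure in PyTorch, assuming flattened 28x28 inputs (784 pixels) and a 2-neuron bottleneck, as in the latent-dimension example above:

```python
# Encoder -> bottleneck -> decoder; all sizes assumed for illustration.
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, latent_dim: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(       # compress step by step
            nn.Linear(784, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),     # the bottleneck
        )
        self.decoder = nn.Sequential(       # rebuild from the bottleneck
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 784), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))
```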

The Training Process:

The network learns by minimizing the reconstruction error—the difference between the original input and the output the decoder generates. This is typically measured using Mean Squared Error (MSE), which compares the images pixel by pixel. As training progresses, the encoder gets better at identifying which features are critical to save, and the decoder gets better at translating those limited features back into a full image.
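One such training step, using the `Autoencoder` class sketched above with a random stand-in batch:

```python
# Minimize the reconstruction error: the input is its own target.
import torch

model = Autoencoder(latent_dim=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()          # pixel-by-pixel comparison

x = torch.rand(32, 784)               # stand-in batch of flattened images
reconstruction = model(x)
loss = loss_fn(reconstruction, x)     # compare output against the input
optimizer.zero_grad()
loss.backward()
optimizer.step()
```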

Purpose and Applications

The primary purpose of an autoencoder is data dimensionality reduction and finding meaningful underlying patterns in raw data.

Advantages

Because the input itself serves as the training target, an autoencoder requires no labeled data, and the learned bottleneck provides a compact, reusable representation of the data's essential features.

Weaknesses and Limitations

While foundational, basic autoencoders have significant limitations, mostly related to how unstructured and messy their latent space can become: the space is typically neither continuous nor smooth, so nearby points can decode to very different outputs, and points sampled between training examples often decode to meaningless results.

To fix these issues, modern machine learning relies on regularized versions, most notably the Variational Autoencoder (VAE), which forces the latent space to be continuous, smooth, and tightly organized.


While standard autoencoders are incredibly effective at compressing images into a tiny, generalized latent space (like identifying that a picture contains a "4"), that heavy compression creates a major problem for image segmentation.

In segmentation tasks (like semantic, instance, or panoptic segmentation), the goal is not just to identify what is in the image, but exactly where it is, down to the precise pixel boundary. Standard autoencoders lose all that spatial information when they compress the data.

To solve this, the autoencoder architecture is modified into what is known as a Fully Convolutional Network (FCN), with the most famous architecture being the U-Net (which features an encoder-decoder structure).

Here is exactly how the architecture is repurposed to draw precise segmentation masks around objects:

1. The Encoder (Learning the "What")

Instead of flattening the image into a 1D array of numbers like a standard autoencoder, a segmentation network keeps the data in 2D feature maps.

The encoder acts as a contracting path. It uses standard convolutional layers and max-pooling to process the image.

2. The Decoder (Learning the "Where")

To get back to the original image size to draw the pixel-by-pixel mask, the decoder acts as an expanding path.

Instead of pooling, it uses up-convolutions (or transposed convolutions) to artificially enlarge the spatial dimensions of the feature maps step-by-step.
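A quick sketch of a transposed convolution enlarging a feature map; the channel counts are arbitrary, and with kernel size 2 and stride 2 the spatial dimensions double:

```python
# Up-convolution (transposed convolution) doubling spatial size.
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                        kernel_size=2, stride=2)
x = torch.randn(1, 64, 14, 14)        # a small 14x14 feature map
print(up(x).shape)                    # torch.Size([1, 32, 28, 28])
```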

However, if you only use an encoder and a decoder, the final output will look like a blurry, ill-defined blob. The decoder knows it needs to draw a car, but because the encoder threw away all the sharp boundary coordinates during compression, the decoder has to guess the exact edges.

3. The Secret Weapon: Skip Connections

To fix the blurry blob problem, architectures like U-Net introduce a brilliant mathematical shortcut: Skip Connections (often implemented as a "copy and crop" function).

During the forward pass, before the encoder compresses a feature map, it saves a copy of it. Then, when the decoder reaches that exact same spatial resolution on the way back up, the network takes that saved high-resolution map from the encoder and concatenates it directly onto the decoder's features.
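A minimal sketch of the merge itself; the shapes are arbitrary, chosen so the saved encoder copy and the decoder features line up at the same resolution:

```python
# U-Net-style skip connection: concatenate along the channel dimension.
import torch

decoder_features = torch.randn(1, 32, 28, 28)   # upsampled decoder output
saved_encoder_map = torch.randn(1, 32, 28, 28)  # high-res copy from encoder

merged = torch.cat([decoder_features, saved_encoder_map], dim=1)
print(merged.shape)                             # torch.Size([1, 64, 28, 28])
```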

This combines the best of both worlds: the deep, semantic "what" knowledge carried up through the decoder path, and the sharp, high-resolution "where" detail saved from the encoder.

By merging these, the network can output a mathematically precise, pixel-perfect segmentation map.

Sources: Gemini and Autoencoders