---
title: Image Classification
sidebar_label: Image Classification
description: "How to train neural networks to categorize images into predefined classes using CNNs."
tags: [deep-learning, cnn, image-classification, computer-vision, transfer-learning]
---

**Image Classification** is the task of assigning a label or a category to an entire input image. It is the most fundamental task in Computer Vision and serves as the building block for more complex tasks like Object Detection and Image Segmentation.

## 1. The Workflow: From Pixels to Labels

An image classification model follows a linear pipeline where spatial information is gradually transformed into a semantic category.

1. **Input Layer:** Raw pixel data (e.g., $224 \times 224 \times 3$ for an RGB image).
2. **Feature Extraction:** Multiple [Convolution](../cnn/convolution) and [Pooling](../cnn/pooling) layers identify edges, shapes, and complex patterns.
3. **Flattening:** The 2D feature maps are converted into a 1D vector.
4. **Classification:** [Fully Connected Layers](https://www.youtube.com/watch?v=rxSmwM7z0_4) act as a traditional MLP to interpret the features.
5. **Output Layer:** Uses a **Softmax** function to provide probabilities for each class.
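
The pipeline above can be written as a minimal Keras sketch. Everything here is illustrative: the two convolution/pooling stages, the layer sizes, and the assumption of 10 output classes are placeholders, not a recommended architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal pixels-to-labels pipeline (illustrative sizes, 10 hypothetical classes)
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),             # 1. Raw RGB pixels
    layers.Conv2D(32, (3, 3), activation='relu'),  # 2. Feature extraction
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # 3. 2D feature maps -> 1D vector
    layers.Dense(128, activation='relu'),          # 4. Fully connected layer
    layers.Dense(10, activation='softmax')         # 5. Class probabilities
])

model.summary()
```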

## 2. Binary vs. Multi-Class Classification

| Type | Output Neurons | Activation | Loss Function |
| :--- | :--- | :--- | :--- |
| **Binary** (Cat or Not) | 1 | Sigmoid | Binary Cross-Entropy |
| **Multi-Class** (Cat, Dog, Bird) | $N$ (Number of classes) | Softmax | Categorical Cross-Entropy |
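
As a small sketch of how this table maps onto Keras code (the class count of 3 is an illustrative assumption):

```python
from tensorflow.keras import layers

def build_head(num_classes: int):
    """Return (output_layer, loss_name) following the table above."""
    if num_classes == 2:
        # Binary: a single sigmoid neuron with binary cross-entropy
        return layers.Dense(1, activation='sigmoid'), 'binary_crossentropy'
    # Multi-class: one softmax neuron per class with categorical cross-entropy
    return layers.Dense(num_classes, activation='softmax'), 'categorical_crossentropy'

output_layer, loss = build_head(num_classes=3)  # e.g., Cat, Dog, Bird
print(output_layer.units, loss)                 # 3 categorical_crossentropy
```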

## 3. Transfer Learning: Standing on the Shoulders of Giants

Training a CNN from scratch requires a very large labeled dataset and massive computing power. Instead, most developers use **Transfer Learning**.

This involves taking a model pre-trained on a massive dataset (like **ImageNet**, which has 1.4 million images across 1,000 classes) and repurposing it for a specific task.

* **Freezing:** We keep the pre-trained "feature extractor" weights fixed (they already know how to "see" edges, textures, and shapes) and train only a new classification head for our specific labels.
* **Fine-Tuning:** Optionally, we then unfreeze some of the top layers of the base model and continue training them at a very low learning rate so the learned features adapt to the new domain.

## 4. Implementation with Keras (Transfer Learning)

This example shows how to use the **MobileNetV2** architecture to classify custom images.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# 1. Load a pre-trained model without the top (classification) layer
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights='imagenet'
)

# 2. Freeze the base model
base_model.trainable = False

# 3. Add custom classification head
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation='sigmoid')  # Binary: e.g., 'Mask' or 'No Mask'
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

```
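
If you later want to fine-tune (as described above), a common follow-up is to unfreeze the top of the base model and recompile with a much lower learning rate. A hedged sketch; the number of layers kept frozen and the `train_ds` / `val_ds` datasets are illustrative assumptions:

```python
# Optional fine-tuning step, after the new head has been trained:
base_model.trainable = True
for layer in base_model.layers[:-20]:  # keep all but the last ~20 layers frozen (illustrative)
    layer.trainable = False

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # low LR to avoid destroying pre-trained features
    loss='binary_crossentropy',
    metrics=['accuracy'],
)
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```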

## 5. Challenges in Classification

1. **Intra-class Variation:** A "Chair" can look very different depending on its design.
2. **Scale Variation:** An object may occupy the entire frame or just a tiny corner.
3. **Viewpoint Variation:** A model must recognize a car from the front, side, and top.
4. **Occlusion:** Only part of the object might be visible (e.g., a dog behind a fence).

## 6. Popular Architectures for Classification

* **ResNet (Residual Networks):** Introduced "Skip Connections" to allow training of very deep networks (100+ layers).
* **VGG-16:** A deep but simple architecture built from stacked $3 \times 3$ convolutions and max-pooling layers.
* **Inception (GoogLeNet):** Applies kernels of different sizes in parallel within the same module to capture features at multiple scales.
* **EfficientNet:** Scales depth, width, and input resolution together (compound scaling) to balance accuracy and computational cost.
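
All of these families ship pre-trained in `tf.keras.applications`. A quick sketch of loading them (the ImageNet weights are downloaded on first use):

```python
import tensorflow as tf

resnet    = tf.keras.applications.ResNet50(weights='imagenet')
vgg       = tf.keras.applications.VGG16(weights='imagenet')
inception = tf.keras.applications.InceptionV3(weights='imagenet')
effnet    = tf.keras.applications.EfficientNetB0(weights='imagenet')

# Compare model sizes
for name, m in [('ResNet50', resnet), ('VGG16', vgg), ('InceptionV3', inception), ('EfficientNetB0', effnet)]:
    print(name, m.count_params())
```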

## References

* **ImageNet:** [The Benchmark Dataset](https://www.image-net.org/)
* **TensorFlow Tutorials:** [Image Classification for Beginners](https://www.tensorflow.org/tutorials/images/classification)
* **PyTorch Tutorials:** [Transfer Learning for Computer Vision](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html)

---

**Classifying an entire image is great, but what if you need to know *where* the object is or if there are multiple objects?**
---
title: Image Segmentation
sidebar_label: Image Segmentation
description: "Going beyond bounding boxes: How to classify every single pixel in an image."
tags: [deep-learning, cnn, computer-vision, segmentation, u-net, mask-rcnn]
---

While [Image Classification](./image-classification) tells us **what** is in an image, and **Object Detection** tells us **where** it is, **Image Segmentation** provides a pixel-perfect understanding of the scene.

It is the process of partitioning a digital image into multiple segments (sets of pixels) to simplify or change the representation of an image into something that is more meaningful and easier to analyze.

## 1. Types of Segmentation

Not all segmentation tasks are the same. We generally categorize them into three levels of complexity:

### A. Semantic Segmentation
Every pixel is assigned a class label (e.g., "Road," "Sky," "Car"). However, it does **not** differentiate between multiple instances of the same class. Two cars parked next to each other will appear as a single connected "blob."

### B. Instance Segmentation
This goes a step further by detecting and delineating each distinct object of interest. If there are five people in a photo, instance segmentation will give each person a unique color/ID.

### C. Panoptic Segmentation
The "holy grail" of segmentation. It combines semantic and instance segmentation to provide a total understanding of the scene—identifying individual objects (cars, people) and background textures (sky, grass).

## 2. The Architecture: Encoder-Decoder (U-Net)

Traditional CNNs lose spatial resolution through pooling. To get back to an image output of the same size as the input, we use an **Encoder-Decoder** architecture.

1. **Encoder (The "What"):** A standard CNN that downsamples the image to extract high-level features.
2. **Bottleneck:** The compressed representation of the image.
3. **Decoder (The "Where"):** Uses **Transposed Convolutions** (Upsampling) to recover the spatial dimensions.
4. **Skip Connections:** These are the "secret sauce" of the **U-Net** architecture. They pass high-resolution information from the encoder directly to the decoder to help refine the boundaries of the mask.
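
To make the encoder-decoder idea concrete, here is a deliberately tiny PyTorch sketch with a single downsampling level and one skip connection. It is an illustration of the pattern, not the U-Net architecture from the original paper:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal encoder-decoder with one skip connection (illustrative only)."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())         # Encoder: the "what"
        self.down = nn.MaxPool2d(2)                                                  # Downsample
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())  # Compressed representation
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)                # Decoder: the "where"
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())         # After skip concatenation
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)                        # Per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                    # High-resolution features
        b = self.bottleneck(self.down(e))  # Bottleneck
        u = self.up(b)                     # Recover spatial size
        u = torch.cat([u, e], dim=1)       # Skip connection: pass encoder detail to the decoder
        return self.head(self.dec(u))

mask_logits = TinyUNet()(torch.randn(1, 3, 64, 64))
print(mask_logits.shape)  # torch.Size([1, 2, 64, 64])
```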

## 3. Loss Functions for Segmentation

Because we are classifying every pixel, standard accuracy can be misleading (especially if 90% of the image is just background). We use specialized metrics:

* **Intersection over Union (IoU) / Jaccard Index:** Measures the overlap between the predicted mask and the ground truth.
* **Dice Coefficient:** Similar to IoU, it measures the similarity between two sets of data and is more robust to class imbalance.

$$
IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}
$$
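
A minimal sketch of computing both metrics for binary masks (assuming PyTorch tensors containing 0s and 1s):

```python
import torch

def iou_and_dice(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """Compute IoU (Jaccard) and Dice for binary masks of the same shape."""
    pred, target = pred.float(), target.float()
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum() - intersection
    iou = intersection / (union + eps)
    dice = 2 * intersection / (pred.sum() + target.sum() + eps)
    return iou.item(), dice.item()

pred   = torch.tensor([[1, 1, 0], [0, 1, 0]])
target = torch.tensor([[1, 0, 0], [0, 1, 1]])
print(iou_and_dice(pred, target))  # IoU = 2/4 = 0.5, Dice = 2*2/(3+3) ≈ 0.667
```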

## 4. Real-World Applications

* **Medical Imaging:** Identifying tumors or mapping organs in MRI and CT scans.
* **Self-Driving Cars:** Identifying the exact boundaries of lanes, sidewalks, and drivable space.
* **Satellite Imagery:** Mapping land use, deforestation, or urban development.
* **Portrait Mode:** Separating the person (subject) from the background to apply a "bokeh" blur effect.

## 5. Popular Models

| Model | Type | Best For |
| :--- | :--- | :--- |
| **U-Net** | Semantic | Medical imaging and biomedical research. |
| **Mask R-CNN** | Instance | Detecting objects and generating masks (e.g., counting individual cells). |
| **DeepLabV3+** | Semantic | State-of-the-art results using Atrous (Dilated) Convolutions. |
| **SegNet** | Semantic | Efficient scene understanding for autonomous driving. |

## 6. Implementation Sketch (PyTorch)

Using a pre-trained segmentation model from `torchvision`:

```python
import torch
from torchvision import models

# Load a pre-trained DeepLabV3 model
model = models.segmentation.deeplabv3_resnet101(pretrained=True).eval()

# Input: (Batch, Channels, Height, Width)
dummy_input = torch.randn(1, 3, 224, 224)

# Output: Returns a dictionary containing 'out' - the pixel-wise class predictions
with torch.no_grad():
    output = model(dummy_input)['out']

print(f"Output shape: {output.shape}")
# Shape will be [1, 21, 224, 224] (for 21 Pascal VOC classes)

```
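
Continuing from the snippet above, the per-pixel class map is typically obtained with an `argmax` over the class dimension:

```python
# Each pixel gets the class with the highest score (class 0 is background in Pascal VOC)
predicted_mask = output.argmax(dim=1)  # shape: [1, 224, 224]
print(predicted_mask.shape)
```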

## References

* **ArXiv:** [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597)
* **Facebook Research:** [Mask R-CNN Paper](https://arxiv.org/abs/1703.06870)

---

**Segmentation provides a high level of detail, but it's computationally expensive. How do we make these models faster for real-time applications?**
---
title: The Convolution Operation
sidebar_label: Convolution
description: "Understanding kernels, filters, and how feature maps are created in Convolutional Neural Networks."
tags: [deep-learning, cnn, computer-vision, convolution, kernels]
---

The **convolution operation** is the heart of Computer Vision. Unlike standard neural networks, which treat every pixel as an independent feature, convolution preserves the **spatial relationship** between pixels, enabling the network to recognize shapes, edges, and textures.

## 1. What is a Convolution?

At its simplest, a convolution is a mathematical operation where a small matrix (called a **Kernel** or **Filter**) slides across an input image and performs element-wise multiplication with the part of the input it is currently hovering over.

The results are summed up to create a single value in a new matrix called a **Feature Map** (or Activation Map).
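
To make this concrete, here is a small hand-rolled sketch of the sliding multiply-and-sum (strictly speaking, deep learning frameworks compute cross-correlation), checked against PyTorch's built-in `conv2d`. The toy image and kernel values are arbitrary:

```python
import torch
import torch.nn.functional as F

image = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)  # toy 5x5 "image"
kernel = torch.tensor([[[[ 1., 0., -1.],                            # simple vertical-edge-style kernel
                         [ 1., 0., -1.],
                         [ 1., 0., -1.]]]])

# Manual "valid" convolution: slide the 3x3 kernel over the 5x5 image
out = torch.zeros(3, 3)
for i in range(3):
    for j in range(3):
        patch = image[0, 0, i:i+3, j:j+3]
        out[i, j] = (patch * kernel[0, 0]).sum()  # element-wise multiply, then sum

print(torch.allclose(out, F.conv2d(image, kernel)[0, 0]))  # True
```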

## 2. The Anatomy of a Kernel

A kernel is a grid of weights. Different weights allow the kernel to detect different types of features:

* **Vertical Edge Detector:** A kernel with high values on the left and low values on the right.
* **Horizontal Edge Detector:** A kernel with high values on the top and low values on the bottom.
* **Sharpening Kernel:** A kernel that emphasizes the central pixel relative to its neighbors.

## 3. Key Hyperparameters

When performing a convolution, there are three main settings that determine the size and behavior of the output:

### A. Stride
Stride is the number of pixels the kernel moves at a time.
* **Stride 1:** Moves one pixel at a time (larger output).
* **Stride 2:** Jumps two pixels at a time (smaller, downsampled output).

### B. Padding
Because the kernel cannot "hang off" the edge of an image, border pixels are processed fewer times than central pixels. To fix this, we add a border of zeros around the image.
* **Valid Padding:** No padding (output is smaller than input).
* **Same Padding:** Zeros are added so the output is the same size as the input.

### C. Depth (Channels)
If you are processing a color image, your input has 3 channels (Red, Green, Blue). Your kernel will also have a depth of 3 to match.

## 4. The Math of Output Size

To calculate the dimensions of the resulting Feature Map, we use the following formula:

$$
O = \frac{W - K + 2P}{S} + 1
$$

* **$W$**: Input width/height
* **$K$**: Kernel size
* **$P$**: Padding
* **$S$**: Stride
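
For example, a $32 \times 32$ input with a $3 \times 3$ kernel, padding of $1$, and stride of $1$ gives:

$$
O = \frac{32 - 3 + 2 \cdot 1}{1} + 1 = 32
$$

which is the "same-size" output produced by the code example in Section 6 below.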

## 5. Why Convolution?

1. **Sparse Connectivity:** Instead of every input pixel connecting to every output neuron, neurons only look at a small "receptive field." This massively reduces the number of parameters.
2. **Parameter Sharing:** The same kernel (weights) is used across the entire image. If a filter learns to detect a "circle," it can find that circle in the top-left corner or the bottom-right corner using the same weights.

## 6. Implementation with PyTorch

```python
import torch
import torch.nn as nn

# Create a sample input: (Batch, Channels, Height, Width)
input_image = torch.randn(1, 3, 32, 32)

# Define a Convolutional Layer
# 3 input channels (RGB), 16 output filters, 3x3 kernel size
conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

# Apply convolution
output = conv_layer(input_image)

print(f"Input shape: {input_image.shape}")
print(f"Output shape: {output.shape}")
# Output: [1, 16, 32, 32] because of 'Same' padding

```

## References

* **Stanford CS231n:** [Convolutional Neural Networks for Visual Recognition](https://cs231n.github.io/convolutional-networks/)
* **Setosa.io:** [Image Kernels Visualizer](http://setosa.io/ev/image-kernels/)

---

**Convolution extracts the features, but the resulting maps are often too large and computationally heavy. How do we shrink them down without losing the important information?**
---
title: Padding in CNNs
sidebar_label: Padding
description: "How padding prevents data loss at the edges and controls the output size of convolutional layers."
tags: [deep-learning, cnn, computer-vision, padding, zero-padding]
---

When we slide a kernel over an image in a [Convolutional Layer](./convolution), two problems occur:
1. **Shrinking Output:** The image gets smaller with every layer.
2. **Loss of Border Info:** Pixels at the corners are only "touched" by the kernel once, whereas central pixels are processed many times.

**Padding** solves both by adding a border of extra pixels (usually zeros) around the input image.

## 1. The Border Problem

Imagine a $3 \times 3$ kernel sliding over a $5 \times 5$ image. The center pixel is involved in 9 different multiplications, but the corner pixel is only involved in 1. This means the network effectively "ignores" information at the edges of your images.

## 2. Types of Padding

There are two primary ways to handle padding in deep learning frameworks:

### A. Valid Padding (No Padding)
In "Valid" padding, we add zero extra pixels. The kernel stays strictly within the boundaries of the original image.
* **Result:** The output is always smaller than the input.
* **Formula:** $O = (W - K + 1)$

### B. Same Padding (Zero Padding)
In "Same" padding, we add enough pixels (usually zeros) around the edges so that the output size is **exactly the same** as the input size (assuming a stride of 1).
* **Result:** Spatial dimensions are preserved.
* **Common use:** Deep architectures where we want to stack dozens of layers without the image disappearing.

## 3. Mathematical Formula with Padding

When we include padding ($P$), the formula for the output dimension becomes:

$$
O = \frac{W - K + 2P}{S} + 1
$$

* **$W$**: Input dimension
* **$K$**: Kernel size
* **$P$**: Padding amount (number of pixels added to one side)
* **$S$**: Stride

:::note
For "Same" padding with a stride of 1, the required padding is usually $P = \frac{K-1}{2}$. This is why kernel sizes are almost always odd numbers ($3 \times 3, 5 \times 5$).
:::
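
As a worked example, a $5 \times 5$ input with a $3 \times 3$ kernel and stride $1$ gives:

$$
O_{\text{valid}} = \frac{5 - 3 + 0}{1} + 1 = 3, \qquad O_{\text{same}} = \frac{5 - 3 + 2 \cdot 1}{1} + 1 = 5
$$

so "Same" padding with $P = 1$ preserves the spatial size.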

## 4. Other Padding Techniques

While **Zero Padding** is the standard, other methods exist for specific cases:
* **Reflection Padding:** Mirrors the pixels from inside the image. This is often used in style transfer or image generation to prevent "border artifacts."
* **Constant Padding:** Fills the border with a specific constant value (e.g., gray or white).

## 5. Implementation

### TensorFlow / Keras
Keras simplifies this by using strings:

```python
from tensorflow.keras.layers import Conv2D

# Output size will be smaller than input
valid_conv = Conv2D(32, (3, 3), padding='valid')

# Output size will be identical to input
same_conv = Conv2D(32, (3, 3), padding='same')

```

### PyTorch

In PyTorch, you specify the exact number of pixels:

```python
import torch.nn as nn

# For a 3x3 kernel, padding=1 gives 'same' output
# (3-1)/2 = 1
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

```
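
A quick sanity check of the two behaviors on dummy data (a sketch; the input size is arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

valid_conv = nn.Conv2d(3, 16, kernel_size=3, padding=0)  # 'valid': output shrinks
same_conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # 'same': output size preserved

print(valid_conv(x).shape)  # torch.Size([1, 16, 30, 30])
print(same_conv(x).shape)   # torch.Size([1, 16, 32, 32])
```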

## References

* **CS231n:** [Spatial Arrangement of Layers](https://cs231n.github.io/convolutional-networks/#spatial)
* **PyTorch Docs:** [Conv2d Layer Specifications](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html)

---

**Padding keeps the image size consistent, but what if we want to move across the image faster or purposely reduce the size?**