diff --git a/docs/machine-learning/deep-learning/cnn-applications/image-classification.mdx b/docs/machine-learning/deep-learning/cnn-applications/image-classification.mdx index e69de29..101d9d0 100644 --- a/docs/machine-learning/deep-learning/cnn-applications/image-classification.mdx +++ b/docs/machine-learning/deep-learning/cnn-applications/image-classification.mdx @@ -0,0 +1,85 @@ +--- +title: Image Classification +sidebar_label: Image Classification +description: "How to train neural networks to categorize images into predefined classes using CNNs." +tags: [deep-learning, cnn, image-classification, computer-vision, transfer-learning] +--- + +**Image Classification** is the task of assigning a label or a category to an entire input image. It is the most fundamental task in Computer Vision and serves as the building block for more complex tasks like Object Detection and Image Segmentation. + +## 1. The Workflow: From Pixels to Labels + +An image classification model follows a linear pipeline where spatial information is gradually transformed into a semantic category. + +1. **Input Layer:** Raw pixel data (e.g., $224 \times 224 \times 3$ for an RGB image). +2. **Feature Extraction:** Multiple [Convolution](../cnn/convolution) and [Pooling](../cnn/pooling) layers identify edges, shapes, and complex patterns. +3. **Flattening:** The 2D feature maps are converted into a 1D vector. +4. **Classification:** [Fully Connected Layers](https://www.youtube.com/watch?v=rxSmwM7z0_4) act as a traditional MLP to interpret the features. +5. **Output Layer:** Uses a **Softmax** function to provide probabilities for each class. + +## 2. Binary vs. Multi-Class Classification + +| Type | Output Neurons | Activation | Loss Function | +| :--- | :--- | :--- | :--- | +| **Binary** (Cat or Not) | 1 | Sigmoid | Binary Cross-Entropy | +| **Multi-Class** (Cat, Dog, Bird) | $N$ (Number of classes) | Softmax | Categorical Cross-Entropy | + +## 3. Transfer Learning: Standing on the Shoulders of Giants + +Training a CNN from scratch requires thousands of images and massive computing power. Instead, most developers use **Transfer Learning**. + +This involves taking a model pre-trained on a massive dataset (like **ImageNet**, which has 1.4 million images across 1,000 classes) and repurposing it for a specific task. + +* **Freezing:** We keep the "Feature Extractor" weights fixed because they already know how to "see" shapes. +* **Fine-Tuning:** We only replace and train the final classification head for our specific labels. + +## 4. Implementation with Keras (Transfer Learning) + +This example shows how to use the **MobileNetV2** architecture to classify custom images. + +```python +import tensorflow as tf +from tensorflow.keras import layers, models + +# 1. Load a pre-trained model without the top (classification) layer +base_model = tf.keras.applications.MobileNetV2( + input_shape=(160, 160, 3), include_top=False, weights='imagenet' +) + +# 2. Freeze the base model +base_model.trainable = False + +# 3. Add custom classification head +model = models.Sequential([ + base_model, + layers.GlobalAveragePooling2D(), + layers.Dense(1, activation='sigmoid') # Binary: e.g., 'Mask' or 'No Mask' +]) + +model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy']) + +``` + +## 5. Challenges in Classification + +1. **Intra-class Variation:** A "Chair" can look very different depending on its design. +2. **Scale Variation:** An object may occupy the entire frame or just a tiny corner. +3. 
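+
+The binary head above pairs a single sigmoid neuron with binary cross-entropy, as in the table in section 2. A minimal sketch of the multi-class variant follows; the class count of 5 is an arbitrary placeholder, not part of the original example.
+
+```python
+import tensorflow as tf
+from tensorflow.keras import layers, models
+
+NUM_CLASSES = 5  # hypothetical number of categories, e.g. five animal species
+
+base_model = tf.keras.applications.MobileNetV2(
+    input_shape=(160, 160, 3), include_top=False, weights='imagenet'
+)
+base_model.trainable = False  # freeze the feature extractor
+
+multi_class_model = models.Sequential([
+    base_model,
+    layers.GlobalAveragePooling2D(),
+    layers.Dense(NUM_CLASSES, activation='softmax')  # one probability per class
+])
+
+# Use 'sparse_categorical_crossentropy' for integer labels,
+# or 'categorical_crossentropy' for one-hot encoded labels.
+multi_class_model.compile(optimizer='adam',
+                          loss='sparse_categorical_crossentropy',
+                          metrics=['accuracy'])
+```
+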
**Viewpoint Variation:** A model must recognize a car from the front, side, and top. +4. **Occlusion:** Only part of the object might be visible (e.g., a dog behind a fence). + +## 6. Popular Architectures for Classification + +* **ResNet (Residual Networks):** Introduced "Skip Connections" to allow training of very deep networks (100+ layers). +* **VGG-16:** A very deep but simple architecture using only convolutions. +* **Inception (GoogLeNet):** Uses different kernel sizes in the same layer to capture features at different scales. +* **EfficientNet:** Optimized for the best balance between accuracy and computational cost. + +## References + +* **ImageNet:** [The Benchmark Dataset](https://www.image-net.org/) +* **TensorFlow Tutorials:** [Image Classification for Beginners](https://www.tensorflow.org/tutorials/images/classification) +* **PyTorch Tutorials:** [Transfer Learning for Computer Vision](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html) + +--- + +**Classifying an entire image is great, but what if you need to know *where* the object is or if there are multiple objects?** \ No newline at end of file diff --git a/docs/machine-learning/deep-learning/cnn-applications/image-segmentation.mdx b/docs/machine-learning/deep-learning/cnn-applications/image-segmentation.mdx index e69de29..94a48c4 100644 --- a/docs/machine-learning/deep-learning/cnn-applications/image-segmentation.mdx +++ b/docs/machine-learning/deep-learning/cnn-applications/image-segmentation.mdx @@ -0,0 +1,91 @@ +--- +title: Image Segmentation +sidebar_label: Image Segmentation +description: "Going beyond bounding boxes: How to classify every single pixel in an image." +tags: [deep-learning, cnn, computer-vision, segmentation, u-net, mask-rcnn] +--- + +While [Image Classification](./image-classification) tells us **what** is in an image, and **Object Detection** tells us **where** it is, **Image Segmentation** provides a pixel-perfect understanding of the scene. + +It is the process of partitioning a digital image into multiple segments (sets of pixels) to simplify or change the representation of an image into something that is more meaningful and easier to analyze. + +## 1. Types of Segmentation + +Not all segmentation tasks are the same. We generally categorize them into three levels of complexity: + +### A. Semantic Segmentation +Every pixel is assigned a class label (e.g., "Road," "Sky," "Car"). However, it does **not** differentiate between multiple instances of the same class. Two cars parked next to each other will appear as a single connected "blob." + +### B. Instance Segmentation +This goes a step further by detecting and delineating each distinct object of interest. If there are five people in a photo, instance segmentation will give each person a unique color/ID. + +### C. Panoptic Segmentation +The "holy grail" of segmentation. It combines semantic and instance segmentation to provide a total understanding of the scene—identifying individual objects (cars, people) and background textures (sky, grass). + +## 2. The Architecture: Encoder-Decoder (U-Net) + +Traditional CNNs lose spatial resolution through pooling. To get back to an image output of the same size as the input, we use an **Encoder-Decoder** architecture. + +1. **Encoder (The "What"):** A standard CNN that downsamples the image to extract high-level features. +2. **Bottleneck:** The compressed representation of the image. +3. 
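+
+A minimal PyTorch sketch of the encoder path described above (the layer widths and the 224×224 input are illustrative assumptions, not a specific published architecture): each stage halves the spatial resolution while deepening the channels, producing the compressed representation that the decoder later upsamples.
+
+```python
+import torch
+import torch.nn as nn
+
+# Two illustrative encoder stages: Conv -> ReLU -> MaxPool halves H and W each time
+encoder = nn.Sequential(
+    nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
+    nn.MaxPool2d(2),   # 224 -> 112
+    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
+    nn.MaxPool2d(2),   # 112 -> 56
+)
+
+features = encoder(torch.randn(1, 3, 224, 224))
+print(features.shape)  # torch.Size([1, 128, 56, 56]) -- the compressed feature volume
+```
+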
**Decoder (The "Where"):** Uses **Transposed Convolutions** (Upsampling) to recover the spatial dimensions. +4. **Skip Connections:** These are the "secret sauce" of the **U-Net** architecture. They pass high-resolution information from the encoder directly to the decoder to help refine the boundaries of the mask. + +## 3. Loss Functions for Segmentation + +Because we are classifying every pixel, standard accuracy can be misleading (especially if 90% of the image is just background). We use specialized metrics: + +* **Intersection over Union (IoU) / Jaccard Index:** Measures the overlap between the predicted mask and the ground truth. +* **Dice Coefficient:** Similar to IoU, it measures the similarity between two sets of data and is more robust to class imbalance. + +$$ +IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}} +$$ + +## 4. Real-World Applications + +* **Medical Imaging:** Identifying tumors or mapping organs in MRI and CT scans. +* **Self-Driving Cars:** Identifying the exact boundaries of lanes, sidewalks, and drivable space. +* **Satellite Imagery:** Mapping land use, deforestation, or urban development. +* **Portrait Mode:** Separating the person (subject) from the background to apply a "bokeh" blur effect. + +## 5. Popular Models + +| Model | Type | Best For | +| :--- | :--- | :--- | +| **U-Net** | Semantic | Medical imaging and biomedical research. | +| **Mask R-CNN** | Instance | Detecting objects and generating masks (e.g., counting individual cells). | +| **DeepLabV3+** | Semantic | State-of-the-art results using Atrous (Dilated) Convolutions. | +| **SegNet** | Semantic | Efficient scene understanding for autonomous driving. | + +## 6. Implementation Sketch (PyTorch) + +Using a pre-trained segmentation model from `torchvision`: + +```python +import torch +from torchvision import models + +# Load a pre-trained DeepLabV3 model +model = models.segmentation.deeplabv3_resnet101(pretrained=True).eval() + +# Input: (Batch, Channels, Height, Width) +dummy_input = torch.randn(1, 3, 224, 224) + +# Output: Returns a dictionary containing 'out' - the pixel-wise class predictions +with torch.no_grad(): + output = model(dummy_input)['out'] + +print(f"Output shape: {output.shape}") +# Shape will be [1, 21, 224, 224] (for 21 Pascal VOC classes) + +``` + +## References + +* **ArXiv:** [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597) +* **Facebook Research:** [Mask R-CNN Paper](https://arxiv.org/abs/1703.06870) + +--- + +**Segmentation provides a high level of detail, but it's computationally expensive. How do we make these models faster for real-time applications?** \ No newline at end of file diff --git a/docs/machine-learning/deep-learning/cnn/convolution.mdx b/docs/machine-learning/deep-learning/cnn/convolution.mdx index e69de29..fad1675 100644 --- a/docs/machine-learning/deep-learning/cnn/convolution.mdx +++ b/docs/machine-learning/deep-learning/cnn/convolution.mdx @@ -0,0 +1,88 @@ +--- +title: The Convolution Operation +sidebar_label: Convolution +description: "Understanding kernels, filters, and how feature maps are created in Convolutional Neural Networks." +tags: [deep-learning, cnn, computer-vision, convolution, kernels] +--- + +The **Convolution** is the heart of Computer Vision. Unlike standard neural networks that treat every pixel as an independent feature, Convolution allows the network to preserve the **spatial relationship** between pixels, enabling it to recognize shapes, edges, and textures. + +## 1. 
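+
+To make the overlap metrics above concrete, here is a small sketch that computes IoU and the Dice coefficient for a pair of binary masks. The helper function and the toy 4×4 masks are made up for illustration.
+
+```python
+import torch
+
+def iou_and_dice(pred_mask, true_mask, eps=1e-7):
+    """Compute IoU and Dice for binary masks (tensors of 0s and 1s)."""
+    pred, true = pred_mask.float(), true_mask.float()
+    intersection = (pred * true).sum()
+    union = pred.sum() + true.sum() - intersection
+    iou = intersection / (union + eps)
+    dice = 2 * intersection / (pred.sum() + true.sum() + eps)
+    return iou.item(), dice.item()
+
+# Toy example: the prediction covers 2 of the 3 ground-truth pixels
+pred = torch.tensor([[0, 1, 1, 0]] + [[0, 0, 0, 0]] * 3)
+true = torch.tensor([[0, 1, 1, 1]] + [[0, 0, 0, 0]] * 3)
+
+print(iou_and_dice(pred, true))  # IoU ≈ 0.667 (2/3), Dice ≈ 0.8 (4/5)
+```
+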
What is a Convolution? + +At its simplest, a convolution is a mathematical operation where a small matrix (called a **Kernel** or **Filter**) slides across an input image and performs element-wise multiplication with the part of the input it is currently hovering over. + +The results are summed up to create a single value in a new matrix called a **Feature Map** (or Activation Map). + +## 2. The Anatomy of a Kernel + +A kernel is a grid of weights. Different weights allow the kernel to detect different types of features: + +* **Vertical Edge Detector:** A kernel with high values on the left and low values on the right. +* **Horizontal Edge Detector:** A kernel with high values on the top and low values on the bottom. +* **Sharpening Kernel:** A kernel that emphasizes the central pixel relative to its neighbors. + +## 3. Key Hyperparameters + +When performing a convolution, there are three main settings that determine the size and behavior of the output: + +### A. Stride +Stride is the number of pixels the kernel moves at a time. +* **Stride 1:** Moves one pixel at a time (larger output). +* **Stride 2:** Jumps two pixels at a time (smaller, downsampled output). + +### B. Padding +Since the kernel cannot "hang off" the edge of an image, the pixels on the borders are processed less than the pixels in the center. To fix this, we add a border of zeros around the image. +* **Valid Padding:** No padding (output is smaller than input). +* **Same Padding:** Zeros are added so the output is the same size as the input. + +### C. Depth (Channels) +If you are processing a color image, your input has 3 channels (Red, Green, Blue). Your kernel will also have a depth of 3 to match. + +## 4. The Math of Output Size + +To calculate the dimensions of the resulting Feature Map, we use the following formula: + +$$ +O = \frac{W - K + 2P}{S} + 1 +$$ + +* **$W$**: Input width/height +* **$K$**: Kernel size +* **$P$**: Padding +* **$S$**: Stride + +## 5. Why Convolution? + +1. **Sparse Connectivity:** Instead of every input pixel connecting to every output neuron, neurons only look at a small "receptive field." This massively reduces the number of parameters. +2. **Parameter Sharing:** The same kernel (weights) is used across the entire image. If a filter learns to detect a "circle," it can find that circle in the top-left corner or the bottom-right corner using the same weights. + +## 6. Implementation with PyTorch + +```python +import torch +import torch.nn as nn + +# Create a sample input: (Batch, Channels, Height, Width) +input_image = torch.randn(1, 3, 32, 32) + +# Define a Convolutional Layer +# 3 input channels (RGB), 16 output filters, 3x3 kernel size +conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1) + +# Apply convolution +output = conv_layer(input_image) + +print(f"Input shape: {input_image.shape}") +print(f"Output shape: {output.shape}") +# Output: [1, 16, 32, 32] because of 'Same' padding + +``` + +## References + +* **Stanford CS231n:** [Convolutional Neural Networks for Visual Recognition](https://cs231n.github.io/convolutional-networks/) +* **Setosa.io:** [Image Kernels Visualizer](http://setosa.io/ev/image-kernels/) + +--- + +**Convolution extracts the features, but the resulting maps are often too large and computationally heavy. 
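+
+As a small illustration of the "slide, multiply, and sum" operation and of the vertical edge detector mentioned in section 2 (the kernel weights and the toy 5×5 image are made up for the example), PyTorch's functional convolution can be applied with a hand-written kernel:
+
+```python
+import torch
+import torch.nn.functional as F
+
+# Sobel-like vertical edge detector: high weights on the left, negative on the right
+kernel = torch.tensor([[1., 0., -1.],
+                       [2., 0., -2.],
+                       [1., 0., -1.]]).reshape(1, 1, 3, 3)
+
+# Toy image: bright left half, dark right half -> a vertical edge in the middle
+image = torch.tensor([[1., 1., 1., 0., 0.]] * 5).reshape(1, 1, 5, 5)
+
+feature_map = F.conv2d(image, kernel)  # no padding ('valid'), stride 1
+print(feature_map[0, 0])
+# Windows that straddle the bright/dark boundary respond strongly (4.0);
+# windows over the uniform region respond with 0.
+```
+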
How do we shrink them down without losing the important information?** \ No newline at end of file diff --git a/docs/machine-learning/deep-learning/cnn/padding.mdx b/docs/machine-learning/deep-learning/cnn/padding.mdx index e69de29..515cf8b 100644 --- a/docs/machine-learning/deep-learning/cnn/padding.mdx +++ b/docs/machine-learning/deep-learning/cnn/padding.mdx @@ -0,0 +1,91 @@ +--- +title: Padding in CNNs +sidebar_label: Padding +description: "How padding prevents data loss at the edges and controls the output size of convolutional layers." +tags: [deep-learning, cnn, computer-vision, padding, zero-padding] +--- + +When we slide a kernel over an image in a [Convolutional Layer](./convolution), two problems occur: +1. **Shrinking Output:** The image gets smaller with every layer. +2. **Loss of Border Info:** Pixels at the corners are only "touched" by the kernel once, whereas central pixels are processed many times. + +**Padding** solves both by adding a border of extra pixels (usually zeros) around the input image. + +## 1. The Border Problem + +Imagine a $3 \times 3$ kernel sliding over a $5 \times 5$ image. The center pixel is involved in 9 different multiplications, but the corner pixel is only involved in 1. This means the network effectively "ignores" information at the edges of your images. + +## 2. Types of Padding + +There are two primary ways to handle padding in deep learning frameworks: + +### A. Valid Padding (No Padding) +In "Valid" padding, we add zero extra pixels. The kernel stays strictly within the boundaries of the original image. +* **Result:** The output is always smaller than the input. +* **Formula:** $O = (W - K + 1)$ + +### B. Same Padding (Zero Padding) +In "Same" padding, we add enough pixels (usually zeros) around the edges so that the output size is **exactly the same** as the input size (assuming a stride of 1). +* **Result:** Spatial dimensions are preserved. +* **Common use:** Deep architectures where we want to stack dozens of layers without the image disappearing. + +## 3. Mathematical Formula with Padding + +When we include padding ($P$), the formula for the output dimension becomes: + +$$ +O = \frac{W - K + 2P}{S} + 1 +$$ + +* **$W$**: Input dimension +* **$K$**: Kernel size +* **$P$**: Padding amount (number of pixels added to one side) +* **$S$**: Stride + +:::note +For "Same" padding with a stride of 1, the required padding is usually $P = \frac{K-1}{2}$. This is why kernel sizes are almost always odd numbers ($3 \times 3, 5 \times 5$). +::: + +## 4. Other Padding Techniques + +While **Zero Padding** is the standard, other methods exist for specific cases: +* **Reflection Padding:** Mirrors the pixels from inside the image. This is often used in style transfer or image generation to prevent "border artifacts." +* **Constant Padding:** Fills the border with a specific constant value (e.g., gray or white). + +## 5. 
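+
+A quick sketch of how reflection padding differs from standard zero padding in PyTorch; the tiny 3×3 input is arbitrary and chosen only so the printed tensors stay readable.
+
+```python
+import torch
+import torch.nn as nn
+
+x = torch.arange(9, dtype=torch.float32).reshape(1, 1, 3, 3)
+
+zero_pad = nn.ZeroPad2d(1)           # surrounds the input with a border of zeros
+reflect_pad = nn.ReflectionPad2d(1)  # mirrors the neighbouring interior pixels
+
+print(zero_pad(x)[0, 0])     # 5x5 result with a ring of 0s
+print(reflect_pad(x)[0, 0])  # 5x5 result whose border repeats nearby values,
+                             # avoiding an artificial dark frame
+```
+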
Implementation + +### TensorFlow / Keras +Keras simplifies this by using strings: + +```python +from tensorflow.keras.layers import Conv2D + +# Output size will be smaller than input +valid_conv = Conv2D(32, (3, 3), padding='valid') + +# Output size will be identical to input +same_conv = Conv2D(32, (3, 3), padding='same') + +``` + +### PyTorch + +In PyTorch, you specify the exact number of pixels: + +```python +import torch.nn as nn + +# For a 3x3 kernel, padding=1 gives 'same' output +# (3-1)/2 = 1 +conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1) + +``` + +## References + +* **CS231n:** [Spatial Arrangement of Layers](https://cs231n.github.io/convolutional-networks/#spatial) +* **PyTorch Docs:** [Conv2d Layer Specifications](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html) + +--- + +**Padding keeps the image size consistent, but what if we want to move across the image faster or purposely reduce the size?** \ No newline at end of file diff --git a/docs/machine-learning/deep-learning/cnn/pooling.mdx b/docs/machine-learning/deep-learning/cnn/pooling.mdx index e69de29..bb08edf 100644 --- a/docs/machine-learning/deep-learning/cnn/pooling.mdx +++ b/docs/machine-learning/deep-learning/cnn/pooling.mdx @@ -0,0 +1,83 @@ +--- +title: "Pooling Layers: Downsampling" +sidebar_label: Pooling +description: "Understanding Max Pooling, Average Pooling, and how they provide spatial invariance." +tags: [deep-learning, cnn, computer-vision, pooling, max-pooling] +--- + +After a [Convolution Operation](./convolution), the resulting feature maps can still be quite large. **Pooling** (also known as subsampling or downsampling) is used to reduce the spatial dimensions (Width x Height) of the data, which reduces the number of parameters and computation in the network. + +## 1. Why do we need Pooling? + +1. **Dimensionality Reduction:** It shrinks the data, making the model faster and less memory-intensive. +2. **Spatial Invariance:** It makes the network robust to small translations or distortions. If a feature (like an ear) moves by a few pixels, the pooled output remains largely the same. +3. **Prevents Overfitting:** By abstracting the features, it prevents the model from "memorizing" the exact pixel locations of features. + +## 2. Types of Pooling + +### A. Max Pooling +This is the most common type. It slides a window across the feature map and picks the **maximum value** within that window. +* **Logic:** "Did the feature appear anywhere in this region? If yes, keep the highest signal." + +### B. Average Pooling +It calculates the **average value** of all pixels within the window. +* **Logic:** "What is the general presence of this feature in the region?" +* **Use Case:** Often used in the final layers of some architectures (like Inception) to smooth out the transition to the output layer. + +![Comparison of Max Pooling vs Average Pooling on a feature map](/img/tutorials/ml/max-pooling-vs-average-pooling.jpg) + +## 3. How Pooling Works (Parameters) + +Like convolution, pooling uses a **Kernel Size** and a **Stride**. + +* **Standard Setup:** A 2x2 window with a stride of 2. +* **Effect:** This setup reduces the width and height of the image by exactly **half**, effectively discarding 75% of the activations while keeping the most "important" ones. + +## 4. Key Differences: Convolution vs. 
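+
+To make the max-versus-average distinction concrete, here is a minimal sketch on a made-up 4×4 feature map (the values are arbitrary); a 2×2 window with stride 2 reduces it to 2×2.
+
+```python
+import torch
+import torch.nn.functional as F
+
+# Toy 4x4 feature map with one strong activation (9.0) in the top-left region
+fmap = torch.tensor([[1., 2., 0., 1.],
+                     [3., 9., 1., 0.],
+                     [0., 1., 2., 2.],
+                     [1., 0., 4., 3.]]).reshape(1, 1, 4, 4)
+
+print(F.max_pool2d(fmap, kernel_size=2, stride=2)[0, 0])
+# tensor([[9., 1.],
+#         [1., 4.]])   -> keeps the strongest signal in each window
+
+print(F.avg_pool2d(fmap, kernel_size=2, stride=2)[0, 0])
+# tensor([[3.7500, 0.5000],
+#         [0.5000, 2.7500]])   -> reports the average presence instead
+```
+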
Pooling + +| Feature | Convolution | Pooling | +| :--- | :--- | :--- | +| **Learnable Parameters** | Yes (Weights and Biases) | No (Fixed mathematical rule) | +| **Purpose** | Feature Extraction | Dimensionality Reduction | +| **Effect on Channels** | Can increase/decrease | Keeps number of channels the same | + +## 5. Implementation with TensorFlow/Keras + +```python +from tensorflow.keras.layers import MaxPooling2D, AveragePooling2D + +# Max Pooling with a 2x2 window and stride of 2 +max_pool = MaxPooling2D(pool_size=(2, 2), strides=2) + +# Average Pooling +avg_pool = AveragePooling2D(pool_size=(2, 2)) + +``` + +## 6. Implementation with PyTorch + +```python +import torch.nn as nn + +# Max Pooling +# kernel_size=2, stride=2 +pool = nn.MaxPool2d(2, 2) + +# Apply to a sample input (Batch, Channels, Height, Width) +input_tensor = torch.randn(1, 16, 24, 24) +output = pool(input_tensor) + +print(f"Input shape: {input_tensor.shape}") +print(f"Output shape: {output.shape}") +# Output: [1, 16, 12, 12] + +``` + +## References + +* **DeepLearning.AI:** [Pooling Layers Tutorial](https://www.youtube.com/watch?v=PuFNG721zM8) +* **PyTorch Docs:** [MaxPool2d Documentation](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool2d.html) + +--- + +**We’ve extracted features with Convolution and shrunk them with Pooling. Now, how do we turn these 2D grids into a final "Yes/No" or "Cat/Dog" prediction?** \ No newline at end of file diff --git a/docs/machine-learning/deep-learning/cnn/strides.mdx b/docs/machine-learning/deep-learning/cnn/strides.mdx index e69de29..23828f7 100644 --- a/docs/machine-learning/deep-learning/cnn/strides.mdx +++ b/docs/machine-learning/deep-learning/cnn/strides.mdx @@ -0,0 +1,86 @@ +--- +title: Strides in CNNs +sidebar_label: Strides +description: "Understanding how the step size of a filter influences spatial dimensions and computational efficiency." +tags: [deep-learning, cnn, computer-vision, strides, downsampling] +--- + +In a Convolutional Neural Network, the **Stride** is the number of pixels by which the filter (kernel) shifts over the input matrix. While [Padding](./padding) is used to maintain size, **Stride** is one of the primary ways we control the spatial dimensions of our feature maps. + +## 1. What is a Stride? + +When the stride is set to **1**, the filter moves one pixel at a time. This results in highly overlapping receptive fields and a larger output. + +When the stride is set to **2** (or more), the filter jumps two pixels at a time. This skips over pixels, resulting in a smaller output and less overlap. + +## 2. The Impact of Striding + +### A. Dimensionality Reduction +Increasing the stride is an alternative to [Pooling](/tutorial/machine-learning/deep-learning/cnn/pooling). By jumping over pixels, the network effectively "downsamples" the image. For example, a stride of 2 will roughly halve the width and height of the output. + +### B. Receptive Field +A larger stride allows the network to cover more area with fewer parameters, but it comes at a cost: **Information Loss**. Because the filter skips pixels, some fine-grained spatial details might be missed. + +### C. Computational Efficiency +Larger strides mean fewer total operations (multiplications and additions), which can significantly speed up the training and inference time of a model. + +## 3. 
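+
+A short sketch (the input shape and channel counts are arbitrary) that verifies the halving effect described above by running stride-1 and stride-2 convolutions side by side:
+
+```python
+import torch
+import torch.nn as nn
+
+x = torch.randn(1, 3, 32, 32)
+
+conv_s1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
+conv_s2 = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)
+
+print(conv_s1(x).shape)  # torch.Size([1, 16, 32, 32]) -- stride 1 preserves H and W
+print(conv_s2(x).shape)  # torch.Size([1, 16, 16, 16]) -- stride 2 halves them
+```
+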
Mathematical Formula + +To determine the output size when using strides ($S$), we use the general convolution formula: + +$$ +O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1 +$$ + +* **$W$**: Input width/height +* **$K$**: Kernel size +* **$P$**: Padding +* **$S$**: Stride + +:::note +If the result of the division is not a whole number, most frameworks will "floor" the value (round down), meaning the last few pixels of the image might be ignored if the filter can't fit. +::: + +## 4. Comparing Stride and Pooling + +Both techniques are used to reduce the size of the data, but they differ in how they do it: + +| Feature | Large Stride Convolution | Pooling Layer | +| :--- | :--- | :--- | +| **Learning** | The network *learns* which pixels to weight during the jump. | Uses a *fixed* rule (Max or Average). | +| **Parameters** | Contains weights and biases. | No parameters. | +| **Trend** | Modern architectures (like ResNet) often prefer strided convolutions. | Classic architectures (like VGG) rely heavily on Pooling. | + +## 5. Implementation + +### TensorFlow / Keras + +```python +from tensorflow.keras.layers import Conv2D + +# A standard convolution (Stride 1) +conv_std = Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1)) + +# A downsampling convolution (Stride 2) +conv_down = Conv2D(filters=32, kernel_size=(3, 3), strides=(2, 2)) + +``` + +### PyTorch + +```python +import torch.nn as nn + +# Strides are defined as an integer or a tuple (height, width) +# This will halve the input dimensions +conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=1) + +``` + +## References + +* **CS231n:** [Convolutional Neural Networks - Strides](https://cs231n.github.io/convolutional-networks/#stride) + +--- + +**We’ve covered how the filter moves, how it handles edges, and how it extracts features. Now, how do we combine all these pieces into a complete network?** \ No newline at end of file diff --git a/static/img/tutorials/ml/max-pooling-vs-average-pooling.jpg b/static/img/tutorials/ml/max-pooling-vs-average-pooling.jpg new file mode 100644 index 0000000..d500abf Binary files /dev/null and b/static/img/tutorials/ml/max-pooling-vs-average-pooling.jpg differ