---
title: Image Classification
sidebar_label: Image Classification
description: "How to train neural networks to categorize images into predefined classes using CNNs."
tags: [deep-learning, cnn, image-classification, computer-vision, transfer-learning]
---

**Image Classification** is the task of assigning a label or a category to an entire input image. It is the most fundamental task in Computer Vision and serves as the building block for more complex tasks like Object Detection and Image Segmentation.

## 1. The Workflow: From Pixels to Labels

An image classification model follows a linear pipeline where spatial information is gradually transformed into a semantic category.

1. **Input Layer:** Raw pixel data (e.g., $224 \times 224 \times 3$ for an RGB image).
2. **Feature Extraction:** Multiple [Convolution](../cnn/convolution) and [Pooling](../cnn/pooling) layers identify edges, shapes, and complex patterns.
3. **Flattening:** The 2D feature maps are converted into a 1D vector.
4. **Classification:** [Fully Connected Layers](https://www.youtube.com/watch?v=rxSmwM7z0_4) act as a traditional MLP to interpret the features.
5. **Output Layer:** Uses a **Softmax** function to provide probabilities for each class.
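
The pipeline above can be written as a minimal Keras sketch. Everything here is illustrative: the two convolution/pooling stages, the layer sizes, and the assumption of 10 output classes are placeholders, not a recommended architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Minimal pixels-to-labels pipeline (illustrative sizes, 10 hypothetical classes)
model = models.Sequential([
    layers.Input(shape=(224, 224, 3)),             # 1. Raw RGB pixels
    layers.Conv2D(32, (3, 3), activation='relu'),  # 2. Feature extraction
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                              # 3. 2D feature maps -> 1D vector
    layers.Dense(128, activation='relu'),          # 4. Fully connected layer
    layers.Dense(10, activation='softmax')         # 5. Class probabilities
])

model.summary()
```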

## 2. Binary vs. Multi-Class Classification

| Type | Output Neurons | Activation | Loss Function |
| :--- | :--- | :--- | :--- |
| **Binary** (Cat or Not) | 1 | Sigmoid | Binary Cross-Entropy |
| **Multi-Class** (Cat, Dog, Bird) | $N$ (Number of classes) | Softmax | Categorical Cross-Entropy |
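
As a small sketch of how this table maps onto Keras code (the class count of 3 is an illustrative assumption):

```python
from tensorflow.keras import layers

def build_head(num_classes: int):
    """Return (output_layer, loss_name) following the table above."""
    if num_classes == 2:
        # Binary: a single sigmoid neuron with binary cross-entropy
        return layers.Dense(1, activation='sigmoid'), 'binary_crossentropy'
    # Multi-class: one softmax neuron per class with categorical cross-entropy
    return layers.Dense(num_classes, activation='softmax'), 'categorical_crossentropy'

output_layer, loss = build_head(num_classes=3)  # e.g., Cat, Dog, Bird
print(output_layer.units, loss)                 # 3 categorical_crossentropy
```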

## 3. Transfer Learning: Standing on the Shoulders of Giants

Training a CNN from scratch requires a very large labeled dataset and massive computing power. Instead, most developers use **Transfer Learning**.

This involves taking a model pre-trained on a massive dataset (like **ImageNet**, which has 1.4 million images across 1,000 classes) and repurposing it for a specific task.

* **Freezing:** We keep the pre-trained "feature extractor" weights fixed (they already know how to "see" edges, textures, and shapes) and train only a new classification head for our specific labels.
* **Fine-Tuning:** Optionally, we then unfreeze some of the top layers of the base model and continue training them at a very low learning rate so the learned features adapt to the new domain.

## 4. Implementation with Keras (Transfer Learning)

This example shows how to use the **MobileNetV2** architecture to classify custom images.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# 1. Load a pre-trained model without the top (classification) layer
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights='imagenet'
)

# 2. Freeze the base model
base_model.trainable = False

# 3. Add custom classification head
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation='sigmoid')  # Binary: e.g., 'Mask' or 'No Mask'
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

```
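
If you later want to fine-tune (as described above), a common follow-up is to unfreeze the top of the base model and recompile with a much lower learning rate. A hedged sketch; the number of layers kept frozen and the `train_ds` / `val_ds` datasets are illustrative assumptions:

```python
# Optional fine-tuning step, after the new head has been trained:
base_model.trainable = True
for layer in base_model.layers[:-20]:  # keep all but the last ~20 layers frozen (illustrative)
    layer.trainable = False

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),  # low LR to avoid destroying pre-trained features
    loss='binary_crossentropy',
    metrics=['accuracy'],
)
# model.fit(train_ds, validation_data=val_ds, epochs=5)
```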

## 5. Challenges in Classification

1. **Intra-class Variation:** A "Chair" can look very different depending on its design.
2. **Scale Variation:** An object may occupy the entire frame or just a tiny corner.
3. **Viewpoint Variation:** A model must recognize a car from the front, side, and top.
4. **Occlusion:** Only part of the object might be visible (e.g., a dog behind a fence).

## 6. Popular Architectures for Classification

* **ResNet (Residual Networks):** Introduced "Skip Connections" to allow training of very deep networks (100+ layers).
* **VGG-16:** A deep but simple architecture built from stacked $3 \times 3$ convolutions and max-pooling layers.
* **Inception (GoogLeNet):** Applies kernels of different sizes in parallel within the same module to capture features at multiple scales.
* **EfficientNet:** Scales depth, width, and input resolution together (compound scaling) to balance accuracy and computational cost.
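
All of these families ship pre-trained in `tf.keras.applications`. A quick sketch of loading them (the ImageNet weights are downloaded on first use):

```python
import tensorflow as tf

resnet    = tf.keras.applications.ResNet50(weights='imagenet')
vgg       = tf.keras.applications.VGG16(weights='imagenet')
inception = tf.keras.applications.InceptionV3(weights='imagenet')
effnet    = tf.keras.applications.EfficientNetB0(weights='imagenet')

# Compare model sizes
for name, m in [('ResNet50', resnet), ('VGG16', vgg), ('InceptionV3', inception), ('EfficientNetB0', effnet)]:
    print(name, m.count_params())
```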

## References

* **ImageNet:** [The Benchmark Dataset](https://www.image-net.org/)
* **TensorFlow Tutorials:** [Image Classification for Beginners](https://www.tensorflow.org/tutorials/images/classification)
* **PyTorch Tutorials:** [Transfer Learning for Computer Vision](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html)

---

**Classifying an entire image is great, but what if you need to know *where* the object is or if there are multiple objects?**
---
title: Image Segmentation
sidebar_label: Image Segmentation
description: "Going beyond bounding boxes: How to classify every single pixel in an image."
tags: [deep-learning, cnn, computer-vision, segmentation, u-net, mask-rcnn]
---

While [Image Classification](./image-classification) tells us **what** is in an image, and **Object Detection** tells us **where** it is, **Image Segmentation** provides a pixel-perfect understanding of the scene.

It is the process of partitioning a digital image into multiple segments (sets of pixels) to simplify or change the representation of an image into something that is more meaningful and easier to analyze.

## 1. Types of Segmentation

Not all segmentation tasks are the same. We generally categorize them into three levels of complexity:

### A. Semantic Segmentation
Every pixel is assigned a class label (e.g., "Road," "Sky," "Car"). However, it does **not** differentiate between multiple instances of the same class. Two cars parked next to each other will appear as a single connected "blob."

### B. Instance Segmentation
This goes a step further by detecting and delineating each distinct object of interest. If there are five people in a photo, instance segmentation will give each person a unique color/ID.

### C. Panoptic Segmentation
The "holy grail" of segmentation. It combines semantic and instance segmentation to provide a total understanding of the scene—identifying individual objects (cars, people) and background textures (sky, grass).

## 2. The Architecture: Encoder-Decoder (U-Net)

Traditional CNNs lose spatial resolution through pooling. To get back to an image output of the same size as the input, we use an **Encoder-Decoder** architecture.

1. **Encoder (The "What"):** A standard CNN that downsamples the image to extract high-level features.
2. **Bottleneck:** The compressed representation of the image.
3. **Decoder (The "Where"):** Uses **Transposed Convolutions** (Upsampling) to recover the spatial dimensions.
4. **Skip Connections:** These are the "secret sauce" of the **U-Net** architecture. They pass high-resolution information from the encoder directly to the decoder to help refine the boundaries of the mask.
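
To make the encoder-decoder idea concrete, here is a deliberately tiny PyTorch sketch with a single downsampling level and one skip connection. It is an illustration of the pattern, not the U-Net architecture from the original paper:

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal encoder-decoder with one skip connection (illustrative only)."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())         # Encoder: the "what"
        self.down = nn.MaxPool2d(2)                                                  # Downsample
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())  # Compressed representation
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)                # Decoder: the "where"
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())         # After skip concatenation
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)                        # Per-pixel class scores

    def forward(self, x):
        e = self.enc(x)                    # High-resolution features
        b = self.bottleneck(self.down(e))  # Bottleneck
        u = self.up(b)                     # Recover spatial size
        u = torch.cat([u, e], dim=1)       # Skip connection: pass encoder detail to the decoder
        return self.head(self.dec(u))

mask_logits = TinyUNet()(torch.randn(1, 3, 64, 64))
print(mask_logits.shape)  # torch.Size([1, 2, 64, 64])
```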

## 3. Loss Functions for Segmentation

Because we are classifying every pixel, standard accuracy can be misleading (especially if 90% of the image is just background). We use specialized metrics:

* **Intersection over Union (IoU) / Jaccard Index:** Measures the overlap between the predicted mask and the ground truth.
* **Dice Coefficient:** Similar to IoU, it measures the similarity between two sets of data and is more robust to class imbalance.

$$
IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}
$$
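
A minimal sketch of computing both metrics for binary masks (assuming PyTorch tensors containing 0s and 1s):

```python
import torch

def iou_and_dice(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """Compute IoU (Jaccard) and Dice for binary masks of the same shape."""
    pred, target = pred.float(), target.float()
    intersection = (pred * target).sum()
    union = pred.sum() + target.sum() - intersection
    iou = intersection / (union + eps)
    dice = 2 * intersection / (pred.sum() + target.sum() + eps)
    return iou.item(), dice.item()

pred   = torch.tensor([[1, 1, 0], [0, 1, 0]])
target = torch.tensor([[1, 0, 0], [0, 1, 1]])
print(iou_and_dice(pred, target))  # IoU = 2/4 = 0.5, Dice = 2*2/(3+3) ≈ 0.667
```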

## 4. Real-World Applications

* **Medical Imaging:** Identifying tumors or mapping organs in MRI and CT scans.
* **Self-Driving Cars:** Identifying the exact boundaries of lanes, sidewalks, and drivable space.
* **Satellite Imagery:** Mapping land use, deforestation, or urban development.
* **Portrait Mode:** Separating the person (subject) from the background to apply a "bokeh" blur effect.

## 5. Popular Models

| Model | Type | Best For |
| :--- | :--- | :--- |
| **U-Net** | Semantic | Medical imaging and biomedical research. |
| **Mask R-CNN** | Instance | Detecting objects and generating masks (e.g., counting individual cells). |
| **DeepLabV3+** | Semantic | State-of-the-art results using Atrous (Dilated) Convolutions. |
| **SegNet** | Semantic | Efficient scene understanding for autonomous driving. |

## 6. Implementation Sketch (PyTorch)

Using a pre-trained segmentation model from `torchvision`:

```python
import torch
from torchvision import models

# Load a pre-trained DeepLabV3 model
model = models.segmentation.deeplabv3_resnet101(pretrained=True).eval()

# Input: (Batch, Channels, Height, Width)
dummy_input = torch.randn(1, 3, 224, 224)

# Output: Returns a dictionary containing 'out' - the pixel-wise class predictions
with torch.no_grad():
    output = model(dummy_input)['out']

print(f"Output shape: {output.shape}")
# Shape will be [1, 21, 224, 224] (for 21 Pascal VOC classes)

```
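
Continuing from the snippet above, the per-pixel class map is typically obtained with an `argmax` over the class dimension:

```python
# Each pixel gets the class with the highest score (class 0 is background in Pascal VOC)
predicted_mask = output.argmax(dim=1)  # shape: [1, 224, 224]
print(predicted_mask.shape)
```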

## References

* **ArXiv:** [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597)
* **Facebook Research:** [Mask R-CNN Paper](https://arxiv.org/abs/1703.06870)

---

**Segmentation provides a high level of detail, but it's computationally expensive. How do we make these models faster for real-time applications?**
---
title: The Convolution Operation
sidebar_label: Convolution
description: "Understanding kernels, filters, and how feature maps are created in Convolutional Neural Networks."
tags: [deep-learning, cnn, computer-vision, convolution, kernels]
---

The **convolution operation** is the heart of Computer Vision. Unlike standard neural networks, which treat every pixel as an independent feature, convolution preserves the **spatial relationship** between pixels, enabling the network to recognize shapes, edges, and textures.

## 1. What is a Convolution?

At its simplest, a convolution is a mathematical operation where a small matrix (called a **Kernel** or **Filter**) slides across an input image and performs element-wise multiplication with the part of the input it is currently hovering over.

The results are summed up to create a single value in a new matrix called a **Feature Map** (or Activation Map).
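
To make this concrete, here is a small hand-rolled sketch of the sliding multiply-and-sum (strictly speaking, deep learning frameworks compute cross-correlation), checked against PyTorch's built-in `conv2d`. The toy image and kernel values are arbitrary:

```python
import torch
import torch.nn.functional as F

image = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)  # toy 5x5 "image"
kernel = torch.tensor([[[[ 1., 0., -1.],                            # simple vertical-edge-style kernel
                         [ 1., 0., -1.],
                         [ 1., 0., -1.]]]])

# Manual "valid" convolution: slide the 3x3 kernel over the 5x5 image
out = torch.zeros(3, 3)
for i in range(3):
    for j in range(3):
        patch = image[0, 0, i:i+3, j:j+3]
        out[i, j] = (patch * kernel[0, 0]).sum()  # element-wise multiply, then sum

print(torch.allclose(out, F.conv2d(image, kernel)[0, 0]))  # True
```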

## 2. The Anatomy of a Kernel

A kernel is a grid of weights. Different weights allow the kernel to detect different types of features:

* **Vertical Edge Detector:** A kernel with high values on the left and low values on the right.
* **Horizontal Edge Detector:** A kernel with high values on the top and low values on the bottom.
* **Sharpening Kernel:** A kernel that emphasizes the central pixel relative to its neighbors.

## 3. Key Hyperparameters

When performing a convolution, there are three main settings that determine the size and behavior of the output:

### A. Stride
Stride is the number of pixels the kernel moves at a time.
* **Stride 1:** Moves one pixel at a time (larger output).
* **Stride 2:** Jumps two pixels at a time (smaller, downsampled output).

### B. Padding
Because the kernel cannot "hang off" the edge of an image, border pixels are processed fewer times than central pixels. To fix this, we add a border of zeros around the image.
* **Valid Padding:** No padding (output is smaller than input).
* **Same Padding:** Zeros are added so the output is the same size as the input.

### C. Depth (Channels)
If you are processing a color image, your input has 3 channels (Red, Green, Blue). Your kernel will also have a depth of 3 to match.

## 4. The Math of Output Size

To calculate the dimensions of the resulting Feature Map, we use the following formula:

$$
O = \frac{W - K + 2P}{S} + 1
$$

* **$W$**: Input width/height
* **$K$**: Kernel size
* **$P$**: Padding
* **$S$**: Stride
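
For example, a $32 \times 32$ input with a $3 \times 3$ kernel, padding of $1$, and stride of $1$ gives:

$$
O = \frac{32 - 3 + 2 \cdot 1}{1} + 1 = 32
$$

which is the "same-size" output produced by the code example in Section 6 below.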

## 5. Why Convolution?

1. **Sparse Connectivity:** Instead of every input pixel connecting to every output neuron, neurons only look at a small "receptive field." This massively reduces the number of parameters.
2. **Parameter Sharing:** The same kernel (weights) is used across the entire image. If a filter learns to detect a "circle," it can find that circle in the top-left corner or the bottom-right corner using the same weights.

## 6. Implementation with PyTorch

```python
import torch
import torch.nn as nn

# Create a sample input: (Batch, Channels, Height, Width)
input_image = torch.randn(1, 3, 32, 32)

# Define a Convolutional Layer
# 3 input channels (RGB), 16 output filters, 3x3 kernel size
conv_layer = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

# Apply convolution
output = conv_layer(input_image)

print(f"Input shape: {input_image.shape}")
print(f"Output shape: {output.shape}")
# Output: [1, 16, 32, 32] because of 'Same' padding

```

## References

* **Stanford CS231n:** [Convolutional Neural Networks for Visual Recognition](https://cs231n.github.io/convolutional-networks/)
* **Setosa.io:** [Image Kernels Visualizer](http://setosa.io/ev/image-kernels/)

---

**Convolution extracts the features, but the resulting maps are often too large and computationally heavy. How do we shrink them down without losing the important information?**
---
title: Padding in CNNs
sidebar_label: Padding
description: "How padding prevents data loss at the edges and controls the output size of convolutional layers."
tags: [deep-learning, cnn, computer-vision, padding, zero-padding]
---

When we slide a kernel over an image in a [Convolutional Layer](./convolution), two problems occur:
1. **Shrinking Output:** The image gets smaller with every layer.
2. **Loss of Border Info:** Pixels at the corners are only "touched" by the kernel once, whereas central pixels are processed many times.

**Padding** solves both by adding a border of extra pixels (usually zeros) around the input image.

## 1. The Border Problem

Imagine a $3 \times 3$ kernel sliding over a $5 \times 5$ image. The center pixel is involved in 9 different multiplications, but the corner pixel is only involved in 1. This means the network effectively "ignores" information at the edges of your images.

## 2. Types of Padding

There are two primary ways to handle padding in deep learning frameworks:

### A. Valid Padding (No Padding)
In "Valid" padding, we add zero extra pixels. The kernel stays strictly within the boundaries of the original image.
* **Result:** The output is always smaller than the input.
* **Formula:** $O = (W - K + 1)$

### B. Same Padding (Zero Padding)
In "Same" padding, we add enough pixels (usually zeros) around the edges so that the output size is **exactly the same** as the input size (assuming a stride of 1).
* **Result:** Spatial dimensions are preserved.
* **Common use:** Deep architectures where we want to stack dozens of layers without the image disappearing.

## 3. Mathematical Formula with Padding

When we include padding ($P$), the formula for the output dimension becomes:

$$
O = \frac{W - K + 2P}{S} + 1
$$

* **$W$**: Input dimension
* **$K$**: Kernel size
* **$P$**: Padding amount (number of pixels added to one side)
* **$S$**: Stride

:::note
For "Same" padding with a stride of 1, the required padding is usually $P = \frac{K-1}{2}$. This is why kernel sizes are almost always odd numbers ($3 \times 3, 5 \times 5$).
:::
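
As a worked example, a $5 \times 5$ input with a $3 \times 3$ kernel and stride $1$ gives:

$$
O_{\text{valid}} = \frac{5 - 3 + 0}{1} + 1 = 3, \qquad O_{\text{same}} = \frac{5 - 3 + 2 \cdot 1}{1} + 1 = 5
$$

so "Same" padding with $P = 1$ preserves the spatial size.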

## 4. Other Padding Techniques

While **Zero Padding** is the standard, other methods exist for specific cases:
* **Reflection Padding:** Mirrors the pixels from inside the image. This is often used in style transfer or image generation to prevent "border artifacts."
* **Constant Padding:** Fills the border with a specific constant value (e.g., gray or white).

## 5. Implementation

### TensorFlow / Keras
Keras simplifies this by using strings:

```python
from tensorflow.keras.layers import Conv2D

# Output size will be smaller than input
valid_conv = Conv2D(32, (3, 3), padding='valid')

# Output size will be identical to input
same_conv = Conv2D(32, (3, 3), padding='same')

```

### PyTorch

In PyTorch, you specify the exact number of pixels:

```python
import torch.nn as nn

# For a 3x3 kernel, padding=1 gives 'same' output
# (3-1)/2 = 1
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

```
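
A quick sanity check of the two behaviors on dummy data (a sketch; the input size is arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

valid_conv = nn.Conv2d(3, 16, kernel_size=3, padding=0)  # 'valid': output shrinks
same_conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # 'same': output size preserved

print(valid_conv(x).shape)  # torch.Size([1, 16, 30, 30])
print(same_conv(x).shape)   # torch.Size([1, 16, 32, 32])
```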

## References

* **CS231n:** [Spatial Arrangement of Layers](https://cs231n.github.io/convolutional-networks/#spatial)
* **PyTorch Docs:** [Conv2d Layer Specifications](https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html)

---

**Padding keeps the image size consistent, but what if we want to move across the image faster or purposely reduce the size?**