diff --git a/docs/machine-learning/deep-learning/cnn-applications/recommendation-systems.mdx b/docs/machine-learning/deep-learning/cnn-applications/recommendation-systems.mdx
index e69de29..8e86119 100644
--- a/docs/machine-learning/deep-learning/cnn-applications/recommendation-systems.mdx
+++ b/docs/machine-learning/deep-learning/cnn-applications/recommendation-systems.mdx
@@ -0,0 +1,102 @@
+---
+title: Deep Learning in Recommendation Systems
+sidebar_label: Recommendation Systems
+description: "How CNNs and deep neural networks power modern discovery engines like Netflix, YouTube, and Pinterest."
+tags: [deep-learning, cnn, recommendation-systems, embeddings, computer-vision]
+---
+
+Traditional recommendation systems relied on **Collaborative Filtering** (finding users with similar interaction histories) or **Content-Based Filtering** (matching item tags and metadata). Modern systems, however, use **Deep Learning** to understand the actual content of the items (images, text, and video) and make highly personalized "visual" or "semantic" recommendations.
+
+## 1. The Role of CNNs in Recommendations
+
+CNNs have revolutionized recommendation engines in industries where the "visual" aspect is the primary driver of user interest (e.g., fashion, home decor, or social media).
+
+### A. Visual Search and Similarity
+In apps like Pinterest or Instagram, CNNs extract feature vectors (embeddings) from images. If a user likes a photo of a "mid-century modern chair," the system finds other images whose feature vectors are mathematically close in the embedding space.
+
+### B. Extracting Latent Features
+Traditional systems might only know a product is "Blue" and "Large." A CNN can detect latent features that aren't in the metadata, such as "minimalist aesthetic," "high-waisted cut," or "warm lighting."
+
+## 2. Hybrid Architectures
+
+Modern recommenders rarely use just one model. They often combine multiple neural networks in a "Wide & Deep" architecture, sketched in code after the diagram below:
+
+1. **The Deep Component (CNN/RNN):** Processes unstructured data like product images or video thumbnails to learn high-level abstractions.
+2. **The Wide Component (Linear):** Handles structured categorical data like user ID, location, or past purchase history.
+3. **The Ranking Head:** Combines these signals to predict the probability that a user will click or buy.
+
+```mermaid
+graph TD
+    User_Data[User Profile & History] --> Wide[Wide Linear Model]
+    Product_Img[Product Image] --> CNN[CNN Feature Extractor]
+    CNN --> Embed[Visual Embedding]
+    Embed --> Deep[Deep Neural Network]
+    Wide --> Fusion[Feature Fusion Layer]
+    Deep --> Fusion
+    Fusion --> Output[Click Probability]
+
+    style CNN fill:#e1f5fe,stroke:#01579b,color:#333
+    style Wide fill:#fff3e0,stroke:#ef6c00,color:#333
+    style Output fill:#e8f5e9,stroke:#2e7d32,color:#333
+```
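+
+To make this fusion concrete, here is a minimal Keras sketch of the pattern, assuming a precomputed 2048-dimensional visual embedding (e.g., from the ResNet-50 extractor shown in section 6) and a small structured feature vector. The input names, dimensions, and layer sizes are illustrative assumptions, not a production design.
+
+```python
+from tensorflow.keras import layers, Model
+
+# Assumed inputs: structured user/context features (wide) and a
+# precomputed CNN image embedding (deep). Sizes are illustrative.
+user_features = layers.Input(shape=(32,), name="user_features")
+visual_embedding = layers.Input(shape=(2048,), name="visual_embedding")
+
+# Wide component: a simple linear model over structured features
+wide = layers.Dense(1, activation=None, name="wide_linear")(user_features)
+
+# Deep component: a small MLP on top of the visual embedding
+deep = layers.Dense(256, activation="relu")(visual_embedding)
+deep = layers.Dense(64, activation="relu")(deep)
+deep = layers.Dense(1, activation=None, name="deep_logit")(deep)
+
+# Ranking head: fuse both signals and predict click probability
+logit = layers.Add(name="fusion")([wide, deep])
+click_prob = layers.Activation("sigmoid", name="click_probability")(logit)
+
+model = Model(inputs=[user_features, visual_embedding], outputs=click_prob)
+model.compile(optimizer="adam", loss="binary_crossentropy")
+model.summary()
+```
+
+Because the two components are trained jointly, the wide part can memorize simple feature interactions while the deep part generalizes from the visual signal.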
+
+## 3. Collaborative Deep Learning (CDL)
+
+**Collaborative Deep Learning** couples a deep content model with collaborative filtering on the user-item ratings matrix.
+
+* The CNN learns a representation of the item (e.g., a movie poster or a song's spectrogram).
+* The system then uses these "deep features" to fill in the gaps in the user-item matrix where interaction data is sparse or missing (including the **Cold Start** problem).
+
+## 4. Solving the "Cold Start" Problem
+
+The **Cold Start** problem occurs when a new item is added to the platform and has no ratings yet.
+
+* **Without CNNs:** The item won't be recommended because no one has interacted with it.
+* **With CNNs:** The model "sees" the item, recognizes it is visually similar to other popular items, and can start recommending it immediately based on content alone.
+
+## 5. Use Case: Pinterest's "Visual Pin" Recommender
+
+Pinterest uses a large-scale recommender called **PinSage**, a Graph Convolutional Network (GCN) that combines:
+
+1. **Visual features** (CNN embeddings of what the pin looks like).
+2. **Graph features** (what other pins it is frequently "saved" with).
+
+This allows the system to recommend a "rustic dining table" even if the user just started browsing "wooden cabins."
+
+## 6. Implementation Sketch (Feature Extraction)
+
+To build a visual recommender, we often use a pre-trained CNN just to get the "embeddings" (the output of the last pooling layer before classification).
+
+```python
+import numpy as np
+import tensorflow as tf
+from tensorflow.keras.applications import ResNet50
+from tensorflow.keras.applications.resnet50 import preprocess_input
+from tensorflow.keras.preprocessing import image
+from sklearn.metrics.pairwise import cosine_similarity
+
+# 1. Load ResNet50 without the classification head
+model = ResNet50(weights='imagenet', include_top=False, pooling='avg')
+
+# 2. Extract features from two different product images
+def get_embedding(img_path):
+    img = image.load_img(img_path, target_size=(224, 224))
+    x = image.img_to_array(img)
+    x = np.expand_dims(x, axis=0)
+    x = preprocess_input(x)  # ResNet50 expects ImageNet-style preprocessing
+    return model.predict(x)
+
+feat1 = get_embedding('product_A.jpg')
+feat2 = get_embedding('product_B.jpg')
+
+# 3. Calculate cosine similarity (higher = more visually similar)
+similarity = cosine_similarity(feat1, feat2)
+print(f"Product Similarity: {similarity[0][0]}")
+```
+
+## References
+
+* **Google Research:** [Wide & Deep Learning for Recommender Systems](https://arxiv.org/abs/1606.07792)
+
+---
+
+**Visual recommendations are powerful, but they are only part of the story. To understand how a user's interests change over time, we need models that can remember the sequence of their actions.**
\ No newline at end of file
diff --git a/docs/machine-learning/deep-learning/cnn-applications/video-recognition.mdx b/docs/machine-learning/deep-learning/cnn-applications/video-recognition.mdx
index e69de29..aa424bd 100644
--- a/docs/machine-learning/deep-learning/cnn-applications/video-recognition.mdx
+++ b/docs/machine-learning/deep-learning/cnn-applications/video-recognition.mdx
@@ -0,0 +1,84 @@
+---
+title: Video Recognition and Action Analysis
+sidebar_label: Video Recognition
+description: "Exploring 3D CNNs, Optical Flow, and Temporal Modeling for analyzing moving images."
+tags: [deep-learning, cnn, video-analysis, temporal-features, 3d-cnn]
+---
+
+Video recognition takes Computer Vision to the next level by adding the **temporal dimension**. Unlike [Image Classification](./image-classification), which analyzes a single frame, video recognition must understand how objects move and interact over time to identify actions, events, or anomalies.
+
+## 1. What Makes Video Different?
+
+A video is essentially a sequence of images (frames) stacked over time. To recognize a "jump," the model can't just look at one frame; it must see the transition from the ground to the air and back.
+
+This introduces the concept of **Spatio-Temporal Features**:
+* **Spatial Features:** What objects are in the frame? (Detected by standard CNNs).
+* **Temporal Features:** How are these objects moving across frames? (Detected by specialized architectures).
+
+## 2. Core Architectures for Video
+
+Because video data is computationally "heavy," researchers have developed several distinct ways to process it:
+
+### A. 3D Convolutional Neural Networks (3D-CNNs)
+Instead of a 2D kernel ($3 \times 3$), we use a 3D kernel ($3 \times 3 \times 3$). The third dimension slides across the time axis (consecutive frames); a minimal sketch appears at the end of this section.
+* **Popular Models:** **C3D** or **I3D** (Inflated 3D ConvNet).
+* **Strength:** Naturally captures motion and appearance simultaneously.
+
+### B. Two-Stream Networks
+This architecture splits the task into two paths:
+1. **Spatial Stream:** Takes a single RGB frame to identify objects.
+2. **Temporal Stream:** Takes **Optical Flow** (the pattern of apparent motion of objects between frames) to identify movement.
+
+The two streams are fused at the end to make a final prediction.
+
+### C. CNN + RNN (LRCN)
+A CNN extracts features from individual frames, and these features are then fed into a **Long Short-Term Memory (LSTM)** network. The LSTM "remembers" previous frames to build a context of the action. A code sketch of this pattern also appears at the end of this section.
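+
+To make subsection A concrete, here is a minimal PyTorch sketch (an illustrative toy, not C3D or I3D themselves) showing how a single $3 \times 3 \times 3$ kernel convolves over time as well as height and width.
+
+```python
+import torch
+import torch.nn as nn
+
+# One 3D convolution: the kernel spans 3 frames x 3 pixels x 3 pixels,
+# so every output value mixes appearance (H, W) and motion (T) information.
+conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=(3, 3, 3), padding=1)
+
+# Dummy clip shaped (Batch, Channels, Time/Frames, Height, Width)
+clip = torch.randn(1, 3, 16, 112, 112)
+
+features = conv3d(clip)
+print(features.shape)  # torch.Size([1, 16, 16, 112, 112]); padding=1 preserves the time axis
+```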
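+
+And for the CNN + LSTM approach in subsection C, a minimal LRCN-style sketch (again illustrative, not the original LRCN implementation): a 2D ResNet-18 encodes each frame, and an LSTM aggregates the per-frame features before classification. The class name, layer sizes, and number of classes are assumptions.
+
+```python
+import torch
+import torch.nn as nn
+from torchvision.models import resnet18
+
+class SimpleLRCN(nn.Module):
+    """Per-frame CNN features fed into an LSTM (illustrative sketch)."""
+
+    def __init__(self, num_classes=10, hidden_size=256):
+        super().__init__()
+        backbone = resnet18(weights=None)   # 2D CNN frame encoder (untrained here)
+        backbone.fc = nn.Identity()         # keep the 512-d pooled features
+        self.backbone = backbone
+        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
+        self.classifier = nn.Linear(hidden_size, num_classes)
+
+    def forward(self, clips):               # clips: (B, T, C, H, W)
+        b, t, c, h, w = clips.shape
+        frames = clips.view(b * t, c, h, w)             # fold time into the batch dim
+        feats = self.backbone(frames).view(b, t, -1)    # (B, T, 512) per-frame features
+        _, (h_n, _) = self.lstm(feats)                  # the LSTM carries context across frames
+        return self.classifier(h_n[-1])                 # classify from the final hidden state
+
+model = SimpleLRCN(num_classes=10)
+logits = model(torch.randn(2, 16, 3, 112, 112))  # 2 clips, 16 frames each
+print(logits.shape)                              # torch.Size([2, 10])
+```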
+
+## 3. Key Concepts: Optical Flow
+
+**Optical Flow** is the distribution of apparent velocities of movement of brightness patterns in an image. In video recognition, it helps the model ignore the static background and focus entirely on the "motion signature" of the subject.
+
+## 4. Common Tasks in Video Analysis
+
+| Task | Goal | Example |
+| :--- | :--- | :--- |
+| **Action Recognition** | Classify the activity in a video clip. | "Running," "Cooking," "Swimming." |
+| **Temporal Action Localization** | Find the start and end time of an action. | Finding the exact second a goal was scored in a match. |
+| **Video Summarization** | Create a short version of a long video. | Generating a "highlight reel" from a full game. |
+| **Anomaly Detection** | Identify unusual behavior. | Detecting a fall in elderly care or a fight in security footage. |
+
+## 5. Challenges in Video Recognition
+
+1. **High Computational Cost:** Processing 30 frames per second requires significantly more memory and GPU power than processing a single image.
+2. **Long-Term Dependencies:** Some actions (like "making a sandwich") take a long time and require the model to remember events from minutes ago.
+3. **Viewpoint and Occlusion:** Movement looks different depending on the camera angle, and subjects may be partially hidden.
+
+## 6. Implementation Sketch (PyTorch Video)
+
+PyTorch offers video models both through the dedicated `PyTorchVideo` library and through `torchvision.models.video`; the sketch below loads a pre-trained 3D ResNet from `torchvision`.
+
+```python
+import torch
+from torchvision.models.video import r3d_18
+
+# Load a pre-trained 3D ResNet model
+# (newer torchvision versions use the `weights=` argument instead of `pretrained=True`)
+# It expects input shape: (Batch, Channels, Time/Frames, Height, Width)
+model = r3d_18(pretrained=True).eval()
+
+# Create a dummy video clip: 1 clip, 3 channels (RGB), 16 frames, 112x112 resolution
+video_clip = torch.randn(1, 3, 16, 112, 112)
+
+with torch.no_grad():
+    prediction = model(video_clip)
+
+print(f"Prediction shape: {prediction.shape}")  # [1, 400] for Kinetics-400 dataset classes
+```
+
+## References
+
+* **PyTorch Video:** [Official Documentation](https://pytorchvideo.org/)
+* **ArXiv:** [Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset (I3D Paper)](https://arxiv.org/abs/1705.07750)
+
+---
+
+**Video recognition relies heavily on understanding sequences over time. To dive deeper into how models "remember" the past, we need to look at sequence-specific architectures.**
\ No newline at end of file