NanoLM: A Research-Oriented Mini Language Model

Python 3.13 · PyTorch · MIT License

NanoLM is a minimalist, highly educational implementation of a Generative Pre-trained Transformer (GPT) designed for research and prototyping. This project focuses on understanding the core mechanics of LLMs, from tokenization and data packing to efficient training and text generation.

🚀 Key Features

  • Architectural Clarity: A clean, single-file implementation of the Transformer architecture (nlm/model.py) based on the GPT-2 design.
  • Modern Optimizations: Support for Flash Attention, Mixed Precision (BF16/FP16), and Fused AdamW for maximum throughput (see the sketch after this list).
  • Custom Tokenization: A WordPiece tokenizer trained specifically on the TinyStories dataset for efficient subword encoding.
  • Research Ready: Modular structure designed for experimentation with different attention mechanisms and sequence models (e.g., SSMs).
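
The snippet below is a minimal sketch, not the repository's code, of how these three optimizations are typically enabled in PyTorch; the Linear layer, tensor shapes, and learning rate are stand-ins rather than anything taken from nlm/model.py or train.py.

import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# Flash Attention: scaled_dot_product_attention dispatches to a fused
# (flash) kernel when hardware and dtypes allow it.
q = k = v = torch.randn(1, 8, 256, 8, device=device)   # (B, heads, seq, head_dim)
attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Mixed precision and fused AdamW, shown on a stand-in module rather than
# the real NanoLM model.
model = nn.Linear(64, 8000).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              fused=(device == "cuda"))
x = torch.randn(4, 64, device=device)
targets = torch.randint(0, 8000, (4,), device=device)

with torch.autocast(device_type=device, dtype=torch.bfloat16,
                    enabled=(device == "cuda")):
    loss = F.cross_entropy(model(x), targets)
loss.backward()
optimizer.step()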

📁 Project Structure

nlm/
├── nlm/                # Core library
│   ├── __init__.py
│   └── model.py        # Transformer architecture & NanoConfig
├── scripts/            # Training and utility scripts
│   ├── train_tokenizer.py
│   ├── tokenize_data.py
│   └── verify_model.py
├── notebooks/          # Interactive research & demos
│   └── inference.ipynb
├── tests/              # Unit & integration tests
├── train.py            # Main training entry point
├── generate.py         # Inference entry point
├── requirements.txt    # Project dependencies
└── README.md           # Documentation

🛠️ Installation

  1. Clone the repository:

    git clone https://github.com/your-username/nlm.git
    cd nlm
  2. Install dependencies:

    pip install -r requirements.txt

📖 Usage

1. Prepare Data

First, train the tokenizer and preprocess the TinyStories dataset:

python scripts/train_tokenizer.py
python scripts/tokenize_data.py
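
As a rough sketch of what tokenizer training involves, the following uses the Hugging Face tokenizers library to fit an 8,000-token WordPiece vocabulary. The corpus file stories.txt, the special tokens, and the output path are placeholders and are not necessarily what scripts/train_tokenizer.py actually uses.

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Build a WordPiece tokenizer and train it on raw TinyStories text.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=8000,
    special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
)
tokenizer.train(files=["stories.txt"], trainer=trainer)  # placeholder corpus file
tokenizer.save("tokenizer.json")

# Encoding a sample sentence yields WordPiece subword ids.
ids = tokenizer.encode("Once upon a time there was a little robot.").ids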

2. Training

Run the training loop with your desired configuration:

python train.py

Training metrics (loss, perplexity, MFU, and VRAM usage) are reported via tqdm progress bars and local logging.
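
A stripped-down version of such a loop is sketched below, with a toy stand-in model and random token batches in place of the real NanoLM model and the tokenized TinyStories shards; perplexity is simply exp of the cross-entropy loss.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F
from tqdm import tqdm

# Stand-ins so the loop runs end to end; train.py would instead build the
# real model from NanoConfig and stream batches from the tokenized dataset.
vocab_size, block_size = 8000, 256
model = nn.Sequential(nn.Embedding(vocab_size, 64), nn.Linear(64, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def get_batch(batch_size=8):
    x = torch.randint(0, vocab_size, (batch_size, block_size))
    return x, x.clone()  # real targets would be the inputs shifted by one token

for step in (pbar := tqdm(range(100))):
    x, y = get_batch()
    logits = model(x)                                     # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    # Perplexity is exp(cross-entropy loss).
    pbar.set_postfix(loss=f"{loss.item():.3f}", ppl=f"{math.exp(loss.item()):.1f}")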

3. Generation

Generate text using a trained checkpoint:

python generate.py
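
The exact sampling code in generate.py is not reproduced here, but autoregressive decoding from a GPT-style model generally follows the pattern below (a sketch in the spirit of nanoGPT's generate; `model` is assumed to return logits of shape (B, T, vocab_size), and the temperature, top-k, and context-window defaults are illustrative).

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, idx, max_new_tokens=100, temperature=0.8, top_k=50,
           block_size=256):
    """Autoregressive sampling sketch; `idx` is a (B, T) tensor of token ids."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                 # crop to the context window
        logits = model(idx_cond)[:, -1, :] / temperature
        if top_k is not None:                           # keep only the top-k logits
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float("inf")
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)          # append the sampled token
    return idx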

🔬 Model Configuration (Nano-S)

The default configuration is optimized for rapid prototyping (an illustrative sketch of these settings follows the list):

  • Embedding Dim: 64
  • Layers: 8
  • Heads: 8
  • Context Size: 256 tokens
  • Vocab Size: 8,000
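
For reference, those hyperparameters could be captured in a config object along these lines; the class and field names here are illustrative assumptions and may not match the actual NanoConfig in nlm/model.py.

from dataclasses import dataclass

@dataclass
class NanoConfigSketch:          # illustrative; not the real NanoConfig
    n_embd: int = 64             # embedding dimension
    n_layer: int = 8             # transformer blocks
    n_head: int = 8              # attention heads (64 / 8 = 8 dims per head)
    block_size: int = 256        # context length in tokens
    vocab_size: int = 8000       # WordPiece vocabulary size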

🛤️ Roadmap

  • Hybrid SSM-Attention: Integrating Mamba/SSM layers with standard Attention for efficient long-context modeling.
  • Weight Interpolation: Capability to load and interpolate weights from larger GPT-2 models.
  • Distributed Training: Support for DistributedDataParallel (DDP).

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

🙏 Acknowledgments

  • Andrej Karpathy's nanoGPT for the architectural inspiration.
  • The Hugging Face team for the transformers and datasets libraries.
