NanoLM is a minimalist, highly educational implementation of a Generative Pre-trained Transformer (GPT) designed for research and prototyping. This project focuses on understanding the core mechanics of LLMs, from tokenization and data packing to efficient training and text generation.
- Architectural Clarity: A clean, single-file implementation of the Transformer architecture (`nlm/model.py`) based on the GPT-2 design.
- Modern Optimizations: Includes support for Flash Attention, Mixed Precision (BF16/FP16), and Fused AdamW for maximum throughput (see the sketch after this list).
- Custom Tokenization: Optimized WordPiece tokenizer specifically trained on the TinyStories dataset for efficient subword encoding.
- Research Ready: Modular structure designed for experimentation with different attention mechanisms and sequence models (e.g., SSMs).
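The sketch below is not excerpted from `nlm/model.py`; it is a self-contained illustration (layer names and sizes are illustrative, matching the default configuration further down) of how those optimizations map onto stock PyTorch: `F.scaled_dot_product_attention` dispatches to Flash Attention where supported, `torch.autocast` handles BF16 mixed precision, and AdamW's `fused` flag enables the fused kernel on CUDA.

```python
# Illustrative only; not the actual nlm/model.py implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """A single causal attention block sized like the default config (64-dim, 8 heads)."""

    def __init__(self, n_embd: int = 64, n_head: int = 8):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each to (B, n_head, T, head_dim).
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        # Dispatches to the Flash Attention kernel when the backend supports it.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

device = "cuda" if torch.cuda.is_available() else "cpu"
attn = CausalSelfAttention().to(device)
opt = torch.optim.AdamW(attn.parameters(), lr=3e-4, fused=(device == "cuda"))

x = torch.randn(4, 256, 64, device=device)            # (batch, context, n_embd)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = attn(x).float().pow(2).mean()               # dummy loss, just to drive one step
loss.backward()
opt.step()
```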
nlm/
├── nlm/ # Core library
│ ├── __init__.py
│ └── model.py # Transformer architecture & NanoConfig
├── scripts/ # Training and utility scripts
│ ├── train_tokenizer.py
│ ├── tokenize_data.py
│ └── verify_model.py
├── notebooks/ # Interactive research & demos
│ └── inference.ipynb
├── tests/ # Unit & integration tests
├── train.py # Main training entry point
├── generate.py # Inference entry point
├── requirements.txt # Project dependencies
└── README.md # Documentation
- Clone the repository: `git clone https://github.com/your-username/nlm.git`, then `cd nlm`
- Install dependencies: `pip install -r requirements.txt`
First, train the tokenizer and preprocess the TinyStories dataset:

- `python scripts/train_tokenizer.py`
- `python scripts/tokenize_data.py`
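The scripts themselves are not reproduced here. As a rough sketch of what `scripts/train_tokenizer.py` does conceptually (the dataset id, special tokens, and output path below are assumptions; the actual script may differ), training a WordPiece tokenizer on TinyStories with the Hugging Face `tokenizers` and `datasets` libraries looks roughly like this:

```python
# Conceptual sketch only; scripts/train_tokenizer.py may differ in its details.
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Dataset id and special tokens are assumptions for illustration.
dataset = load_dataset("roneneldan/TinyStories", split="train")

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(vocab_size=8_000, special_tokens=["[UNK]", "[PAD]", "[EOS]"])

def text_batches(batch_size: int = 1_000):
    """Yield batches of raw story text to stream into the trainer."""
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(text_batches(), trainer=trainer)
tokenizer.save("tokenizer.json")  # hypothetical output path
```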
Run the training loop with your desired configuration: `python train.py`. Metric monitoring (Loss, Perplexity, MFU, VRAM) is included via tqdm and local logging.
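As a point of reference for reading those logs (this is not code from `train.py`): perplexity is the exponentiated cross-entropy loss, and MFU compares achieved FLOP/s against the accelerator's peak, here using the common ~6 FLOPs-per-parameter-per-token estimate for a combined forward and backward pass. The peak-FLOPs figure below is an assumption for an A100 in BF16.

```python
import math

def perplexity(ce_loss: float) -> float:
    """Perplexity is exp(cross-entropy loss in nats)."""
    return math.exp(ce_loss)

def mfu(n_params: int, tokens_per_sec: float, peak_flops: float = 312e12) -> float:
    """Model FLOPs Utilization: achieved FLOP/s divided by hardware peak.

    Uses the ~6 * N FLOPs-per-token estimate for forward + backward;
    peak_flops defaults to an A100 BF16 peak (an assumption; adjust per device).
    """
    achieved = 6 * n_params * tokens_per_sec
    return achieved / peak_flops

print(f"ppl ~ {perplexity(2.1):.2f}, MFU ~ {mfu(1_000_000, 2_000_000):.1%}")
```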
Generate text using a trained checkpoint: `python generate.py`.

The default configuration is optimized for rapid prototyping (a rough parameter-count estimate follows the list):
- Embedding Dim: 64
- Layers: 8
- Heads: 8
- Context Size: 256 tokens
- Vocab Size: 8,000
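Assuming GPT-2-style blocks with a 4x MLP expansion and a tied token embedding / LM head (assumptions; check `nlm/model.py` for the exact layout), these defaults put the model at roughly a million parameters:

```python
# Back-of-the-envelope parameter count for the defaults above; bias and LayerNorm
# terms are omitted, and a tied token embedding / LM head is assumed.
n_embd, n_layer, n_ctx, vocab = 64, 8, 256, 8_000

embeddings = vocab * n_embd + n_ctx * n_embd   # token + learned position embeddings
per_layer = 4 * n_embd**2 + 8 * n_embd**2      # attention (qkv + proj) + 4x MLP (up + down)
total = embeddings + n_layer * per_layer

print(f"~{total / 1e6:.2f}M parameters")       # ~0.92M
```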
- Hybrid SSM-Attention: Integrating Mamba/SSM layers with standard Attention for efficient long-context modeling.
- Weight Interpolation: Capability to load and interpolate weights from larger GPT-2 models.
- Distributed Training: Support for `DistributedDataParallel` (DDP); a minimal setup sketch follows this list.
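DDP is listed here as future work; a minimal sketch of the intended setup (assuming launch via `torchrun --nproc_per_node=N train.py`; this is not yet part of `train.py`) would look like:

```python
# Sketch of the planned DDP integration; assumes launch via torchrun, which sets the
# LOCAL_RANK / RANK / WORLD_SIZE environment variables. Not yet part of train.py.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(64, 64).cuda(local_rank)  # stand-in for the real NanoLM model
model = DDP(model, device_ids=[local_rank])
# ... training loop: gradients are all-reduced across ranks automatically on backward()

dist.destroy_process_group()
```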
This project is licensed under the MIT License - see the LICENSE file for details.
- Andrej Karpathy's `nanoGPT` for the architectural inspiration.
- The Hugging Face team for the `transformers` and `datasets` libraries.