NanoLM is a minimalist, highly educational implementation of a Generative Pre-trained Transformer (GPT) designed for research and prototyping. This project focuses on understanding the core mechanics of LLMs, from tokenization and data packing to efficient training and text generation.
- Architectural Clarity: A clean, single-file implementation of the Transformer architecture (`nlm/model.py`) based on the GPT-2 design.
- Modern Optimizations: Includes support for Flash Attention, Mixed Precision (BF16/FP16), and Fused AdamW for maximum throughput (see the sketch after this list).
- Custom Tokenization: Optimized WordPiece tokenizer specifically trained on the TinyStories dataset for efficient subword encoding.
- Research Ready: Modular structure designed for experimentation with different attention mechanisms and sequence models (e.g., SSMs).
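The sketch below is not excerpted from `nlm/model.py`; it is a self-contained illustration (layer names and sizes are illustrative, matching the default configuration further down) of how those optimizations map onto stock PyTorch: `F.scaled_dot_product_attention` dispatches to Flash Attention where supported, `torch.autocast` handles BF16 mixed precision, and AdamW's `fused` flag enables the fused kernel on CUDA.

```python
# Illustrative only; not the actual nlm/model.py implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """A single causal attention block sized like the default config (64-dim, 8 heads)."""

    def __init__(self, n_embd: int = 64, n_head: int = 8):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each to (B, n_head, T, head_dim).
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        # Dispatches to the Flash Attention kernel when the backend supports it.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

device = "cuda" if torch.cuda.is_available() else "cpu"
attn = CausalSelfAttention().to(device)
opt = torch.optim.AdamW(attn.parameters(), lr=3e-4, fused=(device == "cuda"))

x = torch.randn(4, 256, 64, device=device)            # (batch, context, n_embd)
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = attn(x).float().pow(2).mean()               # dummy loss, just to drive one step
loss.backward()
opt.step()
```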
nlm/
├── nlm/ # Core library
│ ├── __init__.py
│ └── model.py # Transformer architecture & NanoConfig
├── scripts/ # Training and utility scripts
│ ├── train_tokenizer.py
│ ├── tokenize_data.py
│ └── verify_model.py
├── notebooks/ # Interactive research & demos
│ └── inference.ipynb
├── tests/ # Unit & integration tests
├── train.py # Main training entry point
├── generate.py # Inference entry point
├── requirements.txt # Project dependencies
└── README.md # Documentation
- Clone the repository: `git clone https://github.com/your-username/nlm.git`, then `cd nlm`
- Install dependencies: `pip install -r requirements.txt`
First, train the tokenizer and preprocess the TinyStories dataset:

- `python scripts/train_tokenizer.py`
- `python scripts/tokenize_data.py`
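The scripts themselves are not reproduced here. As a rough sketch of what `scripts/train_tokenizer.py` does conceptually (the dataset id, special tokens, and output path below are assumptions; the actual script may differ), training a WordPiece tokenizer on TinyStories with the Hugging Face `tokenizers` and `datasets` libraries looks roughly like this:

```python
# Conceptual sketch only; scripts/train_tokenizer.py may differ in its details.
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Dataset id and special tokens are assumptions for illustration.
dataset = load_dataset("roneneldan/TinyStories", split="train")

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(vocab_size=8_000, special_tokens=["[UNK]", "[PAD]", "[EOS]"])

def text_batches(batch_size: int = 1_000):
    """Yield batches of raw story text to stream into the trainer."""
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(text_batches(), trainer=trainer)
tokenizer.save("tokenizer.json")  # hypothetical output path
```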
Run the training loop with your desired configuration: `python train.py`. Metric monitoring (Loss, Perplexity, MFU, VRAM) is included via tqdm and local logging.
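As a point of reference for reading those logs (this is not code from `train.py`): perplexity is the exponentiated cross-entropy loss, and MFU compares achieved FLOP/s against the accelerator's peak, here using the common ~6 FLOPs-per-parameter-per-token estimate for a combined forward and backward pass. The peak-FLOPs figure below is an assumption for an A100 in BF16.

```python
import math

def perplexity(ce_loss: float) -> float:
    """Perplexity is exp(cross-entropy loss in nats)."""
    return math.exp(ce_loss)

def mfu(n_params: int, tokens_per_sec: float, peak_flops: float = 312e12) -> float:
    """Model FLOPs Utilization: achieved FLOP/s divided by hardware peak.

    Uses the ~6 * N FLOPs-per-token estimate for forward + backward;
    peak_flops defaults to an A100 BF16 peak (an assumption; adjust per device).
    """
    achieved = 6 * n_params * tokens_per_sec
    return achieved / peak_flops

print(f"ppl ~ {perplexity(2.1):.2f}, MFU ~ {mfu(1_000_000, 2_000_000):.1%}")
```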
Generate text using a trained checkpoint: `python generate.py`.

The default configuration is optimized for rapid prototyping (a rough parameter-count estimate follows the list):
- Embedding Dim: 64
- Layers: 8
- Heads: 8
- Context Size: 256 tokens
- Vocab Size: 8,000
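Assuming GPT-2-style blocks with a 4x MLP expansion and a tied token embedding / LM head (assumptions; check `nlm/model.py` for the exact layout), these defaults put the model at roughly a million parameters:

```python
# Back-of-the-envelope parameter count for the defaults above; bias and LayerNorm
# terms are omitted, and a tied token embedding / LM head is assumed.
n_embd, n_layer, n_ctx, vocab = 64, 8, 256, 8_000

embeddings = vocab * n_embd + n_ctx * n_embd   # token + learned position embeddings
per_layer = 4 * n_embd**2 + 8 * n_embd**2      # attention (qkv + proj) + 4x MLP (up + down)
total = embeddings + n_layer * per_layer

print(f"~{total / 1e6:.2f}M parameters")       # ~0.92M
```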
- Hybrid SSM-Attention: Integrating Mamba/SSM layers with standard Attention for efficient long-context modeling.
- Weight Interpolation: Capability to load and interpolate weights from larger GPT-2 models.
- Distributed Training: Support for `DistributedDataParallel` (DDP); a minimal setup sketch follows this list.
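DDP is listed here as future work; a minimal sketch of the intended setup (assuming launch via `torchrun --nproc_per_node=N train.py`; this is not yet part of `train.py`) would look like:

```python
# Sketch of the planned DDP integration; assumes launch via torchrun, which sets the
# LOCAL_RANK / RANK / WORLD_SIZE environment variables. Not yet part of train.py.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(64, 64).cuda(local_rank)  # stand-in for the real NanoLM model
model = DDP(model, device_ids=[local_rank])
# ... training loop: gradients are all-reduced across ranks automatically on backward()

dist.destroy_process_group()
```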
This project is licensed under the MIT License - see the LICENSE file for details.
- Andrej Karpathy's `nanoGPT` for the architectural inspiration.
- The Hugging Face team for the `transformers` and `datasets` libraries.