A tool for converting large technical reference PDFs (2100+ pages) into well-structured, AI-friendly Markdown files with extracted images.
This tool makes very large technical reference PDFs usable by AI coding agents by:
- Breaking down large PDFs into reasonably sized markdown files
- Extracting and organizing images (block diagrams, schematics, etc.)
- Using LLM assistance to clean up PDF artifacts and structure the content
- Creating a navigable index for easy access
This is a one-time conversion tool that favors quality over performance. The output is designed to be directly accessible by AI agents without requiring RAG (Retrieval-Augmented Generation).
- Extracts PDF Table of Contents to intelligently chunk the document
- Preserves all images and diagrams from the PDF
- Uses Google's Gemini API to format and clean up extracted text
- Converts tables, code blocks, and technical content to proper Markdown
- Creates a master index file for easy navigation
- Configurable via environment variables and command-line arguments
- Customizable LLM prompt template
- Parallel processing with configurable worker count for faster conversions
- Resume capability - automatically skips already-processed sections
- Smart diagram detection - filters out false positives (tables, headers) from real diagrams
- Dual model support - uses vision model for pages with diagrams, text-only model otherwise
- Exponential backoff for API rate limit handling
- Clone or download this repository
- Create and activate a virtual environment:
# Create virtual environment
python3 -m venv venv
# Activate it
# On macOS/Linux:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate
- Install Python dependencies:
pip install -r requirements.txt
- Set up your API key:
cp .env.example .env
# Edit .env and add your Google API key
Get your API key from Google AI Studio.
Create a .env file based on .env.example:
# Google API Configuration
GOOGLE_API_KEY=your_api_key_here
# Vision model - used for pages with diagrams/images (multimodal)
LLM_MODEL_NAME=gemini-2.5-flash
# Text-only model - used for pages without images (faster/cheaper)
LLM_MODEL_TEXT_ONLY=gemini-2.5-flash
# Default paths (optional, can override via command line)
PDF_PATH=cc1312r7.pdf
OUTPUT_DIR=output_markdown
The tool automatically selects the appropriate model based on whether a page contains diagrams or images. You can use different models for each purpose (e.g., a more capable model for vision tasks).
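A minimal sketch of that selection logic, assuming the environment variable names from the `.env` example above (`pick_model` and `has_images` are illustrative names, not the tool's actual API):

```python
import os

# Model names fall back to the defaults shown in the .env example above.
VISION_MODEL = os.getenv("LLM_MODEL_NAME", "gemini-2.5-flash")
TEXT_MODEL = os.getenv("LLM_MODEL_TEXT_ONLY", "gemini-2.5-flash")

def pick_model(has_images: bool) -> str:
    """Use the multimodal model only when the section contains diagrams/images."""
    return VISION_MODEL if has_images else TEXT_MODEL
```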
The LLM prompt used for formatting is stored in prompt.txt. You can customize this file to adjust how the LLM processes your content. The template uses Python string formatting with {title} and {raw_content} placeholders.
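For illustration, filling the template might look like this (the template text below is a made-up stand-in for the real prompt.txt contents):

```python
# Stand-in template; the real text lives in prompt.txt.
template = (
    "You are formatting a section titled '{title}'.\n"
    "Clean up the following extracted text:\n\n{raw_content}"
)

def build_prompt(title: str, raw_content: str) -> str:
    # prompt.txt uses Python str.format placeholders {title} and {raw_content}
    return template.format(title=title, raw_content=raw_content)
```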
Note: Make sure your virtual environment is activated before running:
source venv/bin/activate # macOS/Linux
# or venv\Scripts\activate on Windows
Convert a PDF using default settings:
python pdf2md.py
Specify input and output paths:
python pdf2md.py -i /path/to/manual.pdf -o output_directory
Use a custom prompt template:
python pdf2md.py -i manual.pdf -o output_md -p custom_prompt.txt
# Use 8 parallel workers for faster processing
python pdf2md.py -i manual.pdf -o output_md -w 8
usage: pdf2md.py [-h] [-i PDF_PATH] [-o OUTPUT_DIR] [-p PROMPT_FILE] [-w MAX_WORKERS]
Convert large technical PDF files to structured Markdown with images.
optional arguments:
-h, --help show this help message and exit
-i PDF_PATH, --pdf PDF_PATH
Path to input PDF file (default: from .env or 'cc1312r7.pdf')
-o OUTPUT_DIR, --output OUTPUT_DIR
Output directory for markdown files (default: from .env or 'output_markdown')
-p PROMPT_FILE, --prompt PROMPT_FILE
Path to LLM prompt template file (default: 'prompt.txt')
-w MAX_WORKERS, --workers MAX_WORKERS
Maximum number of parallel LLM API calls (default: 4)
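The options above map onto a straightforward argparse setup; a sketch with defaults taken from the help text (the function name is illustrative):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Convert large technical PDF files to structured Markdown with images."
    )
    parser.add_argument("-i", "--pdf", dest="pdf_path", default="cc1312r7.pdf",
                        help="Path to input PDF file")
    parser.add_argument("-o", "--output", dest="output_dir", default="output_markdown",
                        help="Output directory for markdown files")
    parser.add_argument("-p", "--prompt", dest="prompt_file", default="prompt.txt",
                        help="Path to LLM prompt template file")
    parser.add_argument("-w", "--workers", dest="max_workers", type=int, default=4,
                        help="Maximum number of parallel LLM API calls")
    return parser
```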
The tool creates the following structure:
output_markdown/
├── index.md # Master index with links to all sections
├── images/ # All extracted images
│ ├── p23_xref456.png
│ ├── p45_xref789.jpg
│ └── ...
├── section_1.md # Individual section files
├── section_2.md
└── ...
- index.md: Start here - contains links to all sections organized by the PDF's Table of Contents
- images/: All extracted images, named by page number and reference ID
- Section files: One markdown file per ToC section, with proper headings, tables, code blocks, and image references
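A hypothetical sketch of how index.md could be assembled from ToC entries (the entry format and file naming here are assumptions based on the layout above, not the tool's actual code):

```python
def build_index(toc_entries) -> str:
    """toc_entries: list of (level, title) tuples in document order."""
    lines = ["# Index", ""]
    for i, (level, title) in enumerate(toc_entries, start=1):
        # Indent nested ToC levels under their parent entries
        indent = "  " * (level - 1)
        lines.append(f"{indent}- [{title}](section_{i}.md)")
    return "\n".join(lines)
```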
- Extract Table of Contents: Uses PyMuPDF to read the PDF's ToC structure
- Process Each Section (Phase 1 - Sequential): For each ToC entry:
- Extracts text and images from the relevant page range
- Detects diagrams using smart heuristics (page coverage, element sizes)
- Saves embedded images and renders full-page diagrams when appropriate
- Skips already-processed sections (resume capability)
- Format with LLM (Phase 2 - Parallel): Using configurable worker threads:
- Selects vision model for pages with images, text-only model otherwise
- Cleans up PDF artifacts (page breaks, headers, footers)
- Formats as proper Markdown (headings, lists, tables, code blocks)
- Writes output files immediately as each section completes
- Implements exponential backoff for rate limit handling
- Save Output: Creates individual markdown files for each section plus a master index
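The resume check in step 2 can be as simple as a file-existence test; a sketch, assuming the section file naming shown in the output layout:

```python
from pathlib import Path

def needs_processing(output_dir: str, section_index: int) -> bool:
    """Return True if this section's markdown output has not been written yet."""
    return not (Path(output_dir) / f"section_{section_index}.md").exists()
```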
- Python 3.7+
- PyMuPDF (fitz) - PDF processing
- google-generativeai - Gemini API access
- python-dotenv - Environment variable management
- Pillow (PIL) - Image processing for multimodal API
The tool uses Google's Gemini API, which charges based on tokens processed:
- gemini-2.5-flash: Fast and cost-effective (default)
- gemini-2.5-pro: More capable but slower and more expensive
Check current pricing at Google AI Studio.
For a 2100-page technical manual, expect to process several million tokens. The dual-model approach helps reduce costs by using text-only processing for pages without diagrams.
- Make sure you created a .env file with your API key
- Check that the key is valid at Google AI Studio
- The tool requires a PDF with a proper ToC/bookmarks structure
- Check if your PDF has a ToC by opening it in a PDF reader
- Consider manually splitting PDFs without ToC structure
- The script implements automatic exponential backoff (1s, 2s, 4s, 8s, 16s delays)
- Rate limit errors (429) are automatically retried up to 5 times
- Reduce --workers if you consistently hit rate limits
- Use gemini-2.5-flash for higher rate limits
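A sketch of the retry loop described above (the exception type and `base_delay` parameter are illustrative; the real tool inspects the API error for HTTP 429):

```python
import time

class RateLimitError(Exception):
    """Stand-in for the API's 429 rate-limit error."""

def with_backoff(call, retries: int = 5, base_delay: float = 1.0):
    """Retry `call` on rate limits, sleeping 1, 2, 4, 8, 16 s between attempts."""
    for attempt in range(retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == retries:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt)
```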
- Some PDFs have embedded images that are difficult to extract
- Check the images/ folder to verify what was extracted
- Vector graphics are rendered as full-page images when detected as real diagrams
- The tool automatically detects existing output files and skips re-processing them
- Simply re-run the same command to resume from where you left off
- Delete specific .md files if you want to regenerate them
Currently the tool only supports Google's Gemini API. Future versions will add support for:
- OpenAI (GPT-4o, GPT-4o-mini) - Vision and text models
- Anthropic Claude (Claude Sonnet, Claude Haiku) - Vision and text models
- Local LLMs (Ollama, llama.cpp) - For offline/private processing
Implementation Strategy:
- Provider Abstraction Layer - Create a base LLMProvider class with common interface methods (generate_content(), supports_vision(), get_model_info())
- Provider Implementations - Each provider gets its own class:
  - GeminiProvider (current implementation, refactored)
  - OpenAIProvider (using the openai package)
  - ClaudeProvider (using the anthropic package)
  - OllamaProvider (using the ollama package or REST API)
- Configuration - Select the provider via environment variable or CLI flag:
  # Via environment
  LLM_PROVIDER=openai
  OPENAI_API_KEY=sk-...
  # Via command line
  python pdf2md.py -i manual.pdf --provider claude
- Graceful Fallback - Vision models are used when available, with automatic fallback to text-only for providers/models without vision support
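One possible shape for the proposed abstraction layer (the method names come from the plan above; the bodies are placeholders, not a real implementation):

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Common interface each provider implementation would satisfy."""

    @abstractmethod
    def generate_content(self, prompt: str, images=None) -> str:
        """Send the prompt (and optional images) to the backing model."""

    @abstractmethod
    def supports_vision(self) -> bool:
        """Whether this provider/model accepts image inputs."""

    def get_model_info(self) -> str:
        return self.__class__.__name__

class GeminiProvider(LLMProvider):
    """Placeholder for the current Gemini implementation, refactored."""

    def generate_content(self, prompt: str, images=None) -> str:
        raise NotImplementedError("wire up google-generativeai here")

    def supports_vision(self) -> bool:
        return True
```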
This tool is provided as-is for technical document conversion purposes.
- This is a one-time conversion tool designed for large technical reference documents
- Quality is prioritized over performance, but parallel processing speeds up conversions
- The output is optimized for AI agent consumption, not human reading
- Large PDFs benefit from the parallel workers feature (-w flag)
- The resume capability makes it safe to interrupt and restart long conversions