Warning: This repository is under active development and may not be stable.
HashPrep is a Python library for intelligent dataset profiling and debugging that acts as a comprehensive pre-training quality assurance tool for machine learning projects. Think of it as "Pandas Profiling + PyLint for datasets", designed specifically for machine learning workflows.
It catches critical dataset issues before they derail your ML pipeline, explains the problems, and suggests context-aware fixes.
If you want, HashPrep can even apply those fixes for you automatically.
Key features include:
- Intelligent Profiling: Detect missing values, skewed distributions, outliers, and data type inconsistencies.
- ML-Specific Checks: Identify data leakage, dataset drift, class imbalance, and high-cardinality features.
- Automated Preparation: Get suggestions for encoding, imputation, scaling, and transformations.
- Rich Reporting: Generate statistical summaries and exportable reports (HTML/PDF/Markdown/JSON) with embedded visualizations.
- Production-Ready Pipelines: Output reproducible cleaning and preprocessing code (fixes.py) that integrates seamlessly with ML workflows.
- Modern Themes: Choose between "Minimal" (professional) and "Neubrutalism" (bold) report styles.
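To make a few of the checks listed above concrete, here is a minimal plain-Python sketch of what a missing-value, class-imbalance, and high-cardinality check can measure. This is not HashPrep's implementation or API: the function names, the toy rows, and the ratio definitions are assumptions chosen purely for illustration.

```python
from collections import Counter

# Hypothetical helpers for illustration only; none of these names come from HashPrep.


def missing_ratio(rows, column):
    """Fraction of rows where `column` is None or an empty string."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)


def imbalance_ratio(rows, target):
    """Count of the most common class divided by the count of the rarest class."""
    counts = Counter(row[target] for row in rows)
    return max(counts.values()) / min(counts.values())


def cardinality_ratio(rows, column):
    """Distinct non-missing values divided by the number of rows."""
    values = {row[column] for row in rows if row.get(column) not in (None, "")}
    return len(values) / len(rows)


# Toy dataset, invented for this example.
rows = [
    {"user_id": "u1", "age": 34, "label": "ok"},
    {"user_id": "u2", "age": None, "label": "ok"},
    {"user_id": "u3", "age": 51, "label": "ok"},
    {"user_id": "u4", "age": 29, "label": "fraud"},
]

print(missing_ratio(rows, "age"))          # 0.25
print(imbalance_ratio(rows, "label"))      # 3.0
print(cardinality_ratio(rows, "user_id"))  # 1.0
```

A cardinality ratio close to 1.0, as for `user_id` here, usually points to an identifier-like column that carries little signal for a model. The thresholds at which a tool should flag such values are a design choice and are not specified in this README.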
Get a quick summary of critical issues in your terminal.
    hashprep scan dataset.csv
Generate a comprehensive HTML report with visualizations.
    hashprep report dataset.csv --format html --theme minimal
Options:
- --theme: minimal (default) or neubrutalism
- --format: html, pdf, md, or json
- --no-visualizations: Disable plot generation for faster performance.
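If you drive the report command from a Python script, a small helper can validate the option values before the CLI is called. The sketch below only assembles the argument list from the flags documented above. The helper name `build_report_command` is hypothetical, and it assumes that `--no-visualizations` can be combined with `--format` and `--theme`, which this README does not state explicitly.

```python
import shlex

# Option values as listed in the README.
VALID_FORMATS = {"html", "pdf", "md", "json"}
VALID_THEMES = {"minimal", "neubrutalism"}


def build_report_command(dataset, fmt="html", theme="minimal", visualizations=True):
    """Build the argument list for `hashprep report` (hypothetical helper)."""
    if fmt not in VALID_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if theme not in VALID_THEMES:
        raise ValueError(f"unsupported theme: {theme!r}")
    cmd = ["hashprep", "report", dataset, "--format", fmt, "--theme", theme]
    if not visualizations:
        # Assumption: this flag can be combined with --format and --theme.
        cmd.append("--no-visualizations")
    return cmd


cmd = build_report_command("dataset.csv", fmt="md", visualizations=False)
print(shlex.join(cmd))
# hashprep report dataset.csv --format md --theme minimal --no-visualizations
```

To actually run the command, pass the list to `subprocess.run(cmd, check=True)`, assuming `hashprep` is installed and on your `PATH`.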
Automatically generate a Python script (dataset_fixes.py) to apply suggested fixes.
    hashprep report dataset.csv --with-code
This project is licensed under the MIT License.
We welcome contributions from the community to make HashPrep better!
Before you get started, please:
- Review our CONTRIBUTING.md for detailed guidelines and setup instructions
- Write clean, well-documented code
- Follow best practices for the stack or component you’re working on
- Open a pull request (PR) with a clear description of your changes and motivation