Machine Learning Pipeline with Streamlit Interface

Overview

This project, CardioPredict AI, provides a machine learning pipeline for binary classification aimed at early detection and monitoring of cardiovascular diseases (CVDs). It covers the workflow end to end, from data ingestion to model deployment, with an emphasis on reproducibility, modularity, and transparency. The system merges five public datasets (1,871 rows after deduplication), trains a Random Forest classifier optimized for recall (94.75%), and provides SHAP-based interpretability. A Streamlit application offers real-time CVD risk prediction, and the Random Forest is compared against Logistic Regression, SVM, XGBoost, KNN, and a Voting Classifier. The app is hosted at https://cardiovascularprediction-jc78xmh2kjldht53ac2juf.streamlit.app/

Features

Automated Data Pipeline: Loads and merges five datasets (Statlog, Heart_Disease_Prediction, etc.), ensuring consistency.
Configurable Machine Learning Workflow: Externalizes parameters via config.yaml for data processing, feature selection, and training.
Feature Selection Module: Uses Random Forest importance to select top 12 features (e.g., thalach, oldpeak).
Model Training and Persistence: Trains a RandomForestClassifier (n_estimators=600, class_weight='balanced') and saves it as random_forest_model.pkl.
Comprehensive Model Evaluation: Assesses accuracy, precision, recall (94.75%), F1-score, and ROC-AUC, recorded in metrics.json.
Interactive Streamlit Application: Offers a user-friendly interface with prediction input, SHAP visualization, and basic error handling.
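
As a rough illustration of the configurable workflow listed above, the sketch below parses a YAML document and pulls out the training parameters. The key names (feature_selection, model, evaluation, top_k, threshold) are hypothetical and may not match the real config.yaml; only the values are taken from this README.

```python
import yaml

# Hypothetical config.yaml contents. Key names are illustrative assumptions;
# the values are the ones quoted elsewhere in this README.
CONFIG_TEXT = """
feature_selection:
  top_k: 12
model:
  n_estimators: 600
  max_depth: 18
  min_samples_split: 8
  max_features: log2
  class_weight: balanced
evaluation:
  threshold: 0.44
"""

# In the project this would be yaml.safe_load(open("config/config.yaml")).
config = yaml.safe_load(CONFIG_TEXT)

model_params = config["model"]                  # dict of estimator keyword arguments
threshold = config["evaluation"]["threshold"]   # decision threshold as a float
top_k = config["feature_selection"]["top_k"]    # number of features to keep
```

With this layout, model_params could be passed directly as RandomForestClassifier(**model_params), so changing a hyperparameter only requires editing the YAML file.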

Project Structure

.
├── README.md
├── requirements.txt
├── .gitignore
├── .devcontainer/
├── .streamlit/
├── Cardiovascular.ipynb
├── app.py
├── packages.txt
├── random_forest_model.pkl
├── scaler.pkl
├── data/
│   └── raw/
│       ├── Heart_Disease_Prediction (1).csv
│       ├── Cardiovascular_Disease_Dataset.csv
│       ├── hear_LAPPt.csv
│       └── heart nandal.csv
├── reports/
│   └── metrics.json
└── config/
    └── config.yaml

README.md: This documentation.
requirements.txt: Python dependency list.
.gitignore: Specifies intentionally untracked files to ignore.
.devcontainer/: Configuration for development containers.
.streamlit/: Streamlit configuration files.
Cardiovascular.ipynb: Jupyter Notebook containing the main machine learning pipeline (data loading, preprocessing, model training, evaluation).
app.py: Streamlit application for interactive predictions.
packages.txt: System-level (apt) packages installed by Streamlit Community Cloud at deploy time (if any).
random_forest_model.pkl: The trained Random Forest model.
scaler.pkl: The trained MinMaxScaler for feature scaling.
data/raw/: Directory for the raw input datasets.
reports/: Directory for evaluation metrics (e.g., metrics.json).
config/: Directory for configuration files (e.g., config.yaml).

Installation

Prerequisites

Python 3.8 or higher
pip (Python package installer)

Steps

Clone the Repository:
git clone https://github.com/CodeRishiX/Cardiovascularprediction.git
cd Cardiovascularprediction
Create a Virtual Environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install Dependencies:
pip install -r requirements.txt

(Ensure requirements.txt contains: pandas, numpy, scikit-learn, seaborn, matplotlib, joblib, shap, xgboost, streamlit)

Table 1: Project Dependencies

Package Name	Version
pandas	1.3.5
scikit-learn	1.0.2
numpy	1.21.6
streamlit	1.10.0
pyyaml	6.0
joblib	1.1.0
shap	0.41.0
xgboost	1.5.0

Dataset

The project uses five datasets. Four are CSV files stored in data/raw/ (Heart_Disease_Prediction (1).csv, Cardiovascular_Disease_Dataset.csv, hear_LAPPt.csv, and heart nandal.csv); the fifth, Statlog, is fetched from UCI. After merging and removing duplicates, the combined data has 1,871 rows. Preprocessing (as performed in Cardiovascular.ipynb) includes:

Missing Value Handling: Imputes numerical columns (age, trestbps, etc.) with means and categorical columns (cp, restecg) with modes, configurable in config.yaml.
Feature Scaling/Normalization: Applies MinMaxScaler to numerical features using scaler.pkl.
Categorical Encoding: Uses one-hot encoding for cp, restecg, slope.
The processed data is typically used within the notebook, and a processed version could be saved (e.g., data/processed/processed_data.csv) if needed for direct loading.
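
The three preprocessing steps above can be sketched as follows. This is a minimal illustration on a made-up four-row frame, not the notebook's actual code; the toy values and the choice of which columns count as numerical or categorical are assumptions for the example.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for the merged data; values are invented for illustration.
df = pd.DataFrame({
    "age": [63.0, None, 41.0, 56.0],
    "trestbps": [145.0, 130.0, None, 120.0],
    "cp": [1.0, 4.0, None, 4.0],
    "restecg": [2.0, 0.0, 0.0, None],
})

numeric_cols = ["age", "trestbps"]
categorical_cols = ["cp", "restecg"]

# Impute numerical columns with the column mean.
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

# Impute categorical columns with the most frequent value (mode).
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# Scale numerical columns into the [0, 1] range.
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# One-hot encode categorical columns; float codes yield names such as cp_4.0.
processed = pd.get_dummies(df, columns=categorical_cols)
```

Because the categorical codes are floats, pandas produces column names like cp_4.0 and restecg_2.0, which matches the naming of the selected features listed later in this document.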

Methodology

Data Loading and Preprocessing

Data Loading: Handled within Cardiovascular.ipynb, which loads the five datasets and standardizes column names (e.g., BP to trestbps).
Preprocessing: Performed within Cardiovascular.ipynb, merging data, removing duplicates, imputing missing values, scaling numerical features, and one-hot encoding categorical variables, all parameterized via config.yaml.
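
A minimal sketch of the load, rename, merge, and deduplicate sequence is shown below. Two in-memory frames stand in for the CSV files that the notebook would read with pd.read_csv. The COLUMN_MAP dictionary and the standardize helper are hypothetical; only the BP to trestbps mapping comes from the text above.

```python
import pandas as pd

# Hypothetical rename map; only "BP" -> "trestbps" is taken from this README.
COLUMN_MAP = {"BP": "trestbps", "Age": "age"}


def standardize(frame: pd.DataFrame) -> pd.DataFrame:
    """Rename source-specific column names to the shared schema."""
    return frame.rename(columns=COLUMN_MAP)


# Tiny stand-ins for two of the raw datasets (values are invented).
source_a = pd.DataFrame({"Age": [63, 41], "BP": [145, 130]})
source_b = pd.DataFrame({"age": [41, 56], "trestbps": [130, 120]})

# Stack the standardized frames, then drop rows that appear more than once.
merged = (
    pd.concat([standardize(source_a), standardize(source_b)], ignore_index=True)
    .drop_duplicates()
    .reset_index(drop=True)
)
```

Here the row (41, 130) appears in both sources, so the merged frame keeps three of the four input rows.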

Feature Engineering and Selection

Feature Selection: Within Cardiovascular.ipynb, a temporary RandomForestClassifier (n_estimators=300, max_depth=30) is used to select the top 12 features: ['thalach', 'oldpeak', 'ca', 'cp_4.0', 'cp_2.0', 'age', 'trestbps', 'chol', 'restecg_2.0', 'slope_1.0', 'slope_3.0', 'slope_2.0'].

Table 3: Illustrative Feature Importance

Feature Name	Importance Score (Illustrative)
thalach	0.25
oldpeak	0.20
ca	0.15
cp_4.0	0.10
cp_2.0	0.08
...	...
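
The importance-based selection described in this section might look like the sketch below. It runs on synthetic data from make_classification with placeholder column names (f0 to f19), so the selected features will not match the project's list; random_state is an assumption added for repeatability.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the preprocessed feature matrix.
X_arr, y = make_classification(
    n_samples=300, n_features=20, n_informative=6, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(20)])

# Temporary forest used only to rank features (hyperparameters from the README).
selector = RandomForestClassifier(n_estimators=300, max_depth=30, random_state=0)
selector.fit(X, y)

# Rank features by impurity-based importance and keep the top 12.
importances = pd.Series(selector.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(12).index.tolist()
X_selected = X[top_features]
```

The importances from a fitted scikit-learn forest sum to 1, so each score can be read as a share of the total, as in the illustrative table above.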

Model Training

Model Choice: Within Cardiovascular.ipynb, a RandomForestClassifier is trained with n_estimators=600, max_depth=18, min_samples_split=8, max_features='log2', and class_weight='balanced'.
Hyperparameters: Configured in config.yaml for reproducibility.
Model Persistence: The trained model is saved as random_forest_model.pkl and the scaler as scaler.pkl.
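
As a sketch of the training and persistence steps above, the example below fits a forest with the listed hyperparameters on synthetic data and round-trips it through joblib. The train/test split ratio, random_state, and the temporary output directory are assumptions for the example, not values confirmed by the notebook.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 12 selected features.
X, y = make_classification(n_samples=400, n_features=12, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameters as listed in the README; random_state is an added assumption.
model = RandomForestClassifier(
    n_estimators=600,
    max_depth=18,
    min_samples_split=8,
    max_features="log2",
    class_weight="balanced",
    random_state=42,
)
model.fit(X_train, y_train)

# Persist to a temporary directory here; the project writes to its root folder.
model_path = Path(tempfile.mkdtemp()) / "random_forest_model.pkl"
joblib.dump(model, model_path)

# Reload to confirm the saved artifact reproduces the same predictions.
reloaded = joblib.load(model_path)
```

The scaler can be saved the same way with joblib.dump(scaler, "scaler.pkl"), so the app applies exactly the transformation fitted during training.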

Model Evaluation

Evaluation: Performed within Cardiovascular.ipynb, computing accuracy, precision, recall, F1-score, and ROC-AUC. These metrics can be saved to reports/metrics.json.
Configuration: Threshold (0.44) for recall optimization is in config.yaml.
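
To illustrate why a threshold below 0.5 favors recall, the sketch below applies both the default 0.50 cut-off and the configured 0.44 to the same made-up probabilities. The labels, probabilities, and the classify helper are invented for the example.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented labels and predicted probabilities of the positive (disease) class.
y_true = np.array([1, 1, 1, 0, 0, 1])
proba = np.array([0.90, 0.47, 0.30, 0.45, 0.10, 0.60])


def classify(probabilities: np.ndarray, threshold: float) -> np.ndarray:
    """Label a sample positive when its probability reaches the threshold."""
    return (probabilities >= threshold).astype(int)


default_pred = classify(proba, 0.50)
tuned_pred = classify(proba, 0.44)

recall_default = recall_score(y_true, default_pred)   # 0.5
recall_tuned = recall_score(y_true, tuned_pred)       # 0.75
precision_tuned = precision_score(y_true, tuned_pred) # 0.75
```

Lowering the threshold catches one more true positive (the 0.47 case) at the cost of one false positive (the 0.45 case), which is the usual recall-for-precision trade.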

Table 2: Model Performance Metrics

Metric	Value (Illustrative)
Accuracy	90.21%
Precision	89.50%
Recall	94.75%
F1-Score	91.31%
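
The metrics in this section can be computed with scikit-learn and written to a JSON file roughly as follows. The labels and probabilities are invented, the dictionary keys are hypothetical rather than the actual layout of metrics.json, and a temporary directory stands in for reports/.

```python
import json
import tempfile
from pathlib import Path

from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented ground truth and positive-class probabilities.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1, 0.8, 0.3]

# Apply the 0.44 decision threshold mentioned in the README.
y_pred = [int(p >= 0.44) for p in y_prob]

# Hypothetical key names; ROC-AUC uses probabilities, the rest use hard labels.
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "roc_auc": roc_auc_score(y_true, y_prob),
}

# Write the report; the project would target reports/metrics.json instead.
report_path = Path(tempfile.mkdtemp()) / "metrics.json"
report_path.write_text(json.dumps(metrics, indent=2))
saved = json.loads(report_path.read_text())
```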

Usage

Running the Machine Learning Pipeline (via Jupyter Notebook)

Ensure dataset files are in data/raw/.
Open and run the Cardiovascular.ipynb notebook in a Jupyter environment (e.g., Jupyter Lab, VS Code with Jupyter extension).
The notebook will handle data loading, preprocessing, model training, and evaluation.
Check outputs:
- Trained model: random_forest_model.pkl
- Trained scaler: scaler.pkl
- Metrics (if saved by the notebook): reports/metrics.json

Running the Streamlit Application

Ensure random_forest_model.pkl and scaler.pkl exist in the project root.
Launch the app from the project root:
streamlit run app.py
Interact with the app at http://localhost:8501:
- Input feature values (e.g., age, thalach).
- View predictions and SHAP values.
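
The prediction step behind the app can be sketched without the UI layer. The predict_risk helper below is hypothetical and is not taken from app.py. It assumes every input goes through the scaler, whereas the real app may scale only the numerical columns. A tiny model and scaler fitted on invented data stand in for the two .pkl files.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler


def predict_risk(model, scaler, values, threshold=0.44):
    """Scale one row of inputs and return (probability, label)."""
    row = scaler.transform(np.asarray(values, dtype=float).reshape(1, -1))
    probability = float(model.predict_proba(row)[0, 1])
    return probability, int(probability >= threshold)


# Tiny stand-ins; app.py would instead joblib.load the saved model and scaler.
X = np.array([[40, 180], [45, 170], [60, 120], [65, 110]], dtype=float)  # age, thalach
y = np.array([0, 0, 1, 1])
scaler = MinMaxScaler().fit(X)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    scaler.transform(X), y
)

probability, label = predict_risk(model, scaler, [63, 115])
```

In a Streamlit script, the values list would typically be collected from input widgets such as st.number_input, and the returned probability displayed with st.write.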

Results and Discussion

The RandomForestClassifier achieved 90.21% accuracy, 94.75% recall, and a 91.31% F1-score, with thalach and oldpeak as the top contributors according to SHAP. A recall of 94.75% corresponds to a false-negative rate of 5.25%, well under the 20% false-negative ceiling the project treats as its medical benchmark. A 7.35% train-test gap suggests mild overfitting. Inspecting the confusion matrix would show how the remaining errors split between false positives and false negatives.
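
Recall and the false-negative rate are complementary (they sum to 1), which is how a recall figure translates into a share of missed cases. The sketch below derives both from a confusion matrix built on invented labels; it does not reproduce the project's actual results.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: four positive and four negative patients.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)               # share of sick patients detected: 0.75
false_negative_rate = fn / (tp + fn)  # share of sick patients missed: 0.25
```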

Conclusions

CardioPredict AI delivers a reproducible CVD prediction pipeline with a robust RandomForestClassifier and an interactive Streamlit app. Its modularity and transparency support further research, validated by multi-metric evaluation and SHAP interpretability.

Future Work

Explore XGBoost or neural networks for improved performance.
Augment data with CVD-specific features.
Deploy on a cloud platform (e.g., AWS).
Add model monitoring for drift detection.

Contributing

Report bugs or suggest features via GitHub Issues.
Submit pull requests with new branches, adhering to code style.

License

MIT License (see LICENSE file).

Contact

Open an issue on https://github.com/CodeRishiX/Cardiovascularprediction.
