The GRACE project aims to process a given dataset into a graph structure where nodes represent dataset features and edges represent possible feature interactions. We use this graph to constrain an XGBoost model, with two primary objectives:
- Improve ML Performance: By providing the model with domain-informed or empirically discovered feature interactions, we can guide it towards better performance.
- Enhance Explainability & Reduce Complexity: By simplifying the graph structure to a minimal set of nodes and edges, we create a more interpretable and less complex model.
The workflow is as follows:
- Initial Graph Creation: An initial knowledge graph is created. This can be done manually, through an automated agent (`create_kg.py`), or by loading a pre-existing graph. The initial graph is based on feature importance (SHAP-IQ) and known biological/domain mechanisms.
- Graph Optimization: The core of the project is in `graph_reduction.py`. We use a multi-objective optimization process with Optuna to iteratively refine the graph. The optimization seeks a Pareto front of graphs that are optimal in both predictive performance (e.g., AUC or accuracy) and simplicity (number of nodes and edges).
- Constrained Model Training: The optimized graph structure is used to generate `interaction_constraints` for an `XGBoost` classifier. This forces the model to consider only interactions between features connected by an edge in the graph.
- Evaluation: The final constrained model is evaluated on a test set to measure its performance.
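The constraint-generation step can be sketched in plain Python. The helper name `edges_to_constraints` is hypothetical (not the project's actual function): the idea is simply that each graph edge becomes a group of feature indices that XGBoost's `interaction_constraints` parameter allows to interact.

```python
def edges_to_constraints(feature_names, edges):
    """Turn graph edges into XGBoost-style interaction constraints:
    each edge maps to a pair of feature indices allowed to interact."""
    index = {name: i for i, name in enumerate(feature_names)}
    return [[index[u], index[v]] for u, v in edges]

features = ["age", "bmi", "heart_rate", "glucose"]
edges = [("age", "bmi"), ("heart_rate", "glucose")]
constraints = edges_to_constraints(features, edges)
# constraints == [[0, 1], [2, 3]]; pass this as
# XGBClassifier(interaction_constraints=constraints)
```

With no edge between, say, `age` and `glucose`, no tree in the ensemble may split on both features along the same path.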
- Python 3.10+
- A virtual environment (e.g., `venv` or `conda`) is highly recommended.
Clone the repository to your local machine:

```bash
git clone <repository-url>
cd GRACE
```

Create and activate a virtual environment. For example, with `venv`:

```bash
python -m venv venv
source venv/bin/activate
```

Install the required dependencies:

```bash
pip install -r requirements.txt
```

The project requires API keys for an LLM provider (like OpenAI) for the agent-based graph creation.
- Create a `.env` file in the root of the project directory:

  ```
  OPENAI_API_KEY="your-api-key-here"
  ```

- Edit the `params.py` file to configure the project:
  - Set `DATASET_NAME` to either `"mimic"` or `"adni"`.
  - Set `LLM_PROVIDER` to your desired provider (e.g., `"openai"`).
  - Set
Execute the main script from the root directory:

```bash
python main.py
```

The script will load the data, run the graph optimization process, train the final model, and save the results and visualizations in the `images/` and `models/` directories.
For advanced users and domain experts, we provide an interactive web interface for manual graph editing:
```bash
python run_interactive_kg.py
```

This launches a Streamlit app where you can:
- 🎯 Visualize optimized knowledge graphs interactively
- ✏️ Edit graphs by adding/removing nodes and edges
- 🔒 Lock critical edges to preserve domain knowledge
- 🔄 Re-optimize graphs with your constraints
- 📊 Monitor performance metrics in real-time
- 💾 Export modified graphs for further analysis
Perfect for clinicians and researchers who want to inject domain expertise into the automated optimization process. See INTERACTIVE_KG_README.md for detailed instructions.
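The edge-locking behavior can be illustrated with a small plain-Python sketch (`prune_edges` is a hypothetical helper, not the app's actual code): during re-optimization, any unlocked edge may be dropped, but locked edges always survive.

```python
def prune_edges(edges, locked, keep):
    """Keep an edge if it is locked (user-preserved domain knowledge),
    or if the optimizer's keep() predicate decides to retain it."""
    return [e for e in edges if e in locked or keep(e)]

edges = [("age", "bmi"), ("bmi", "glucose"), ("age", "glucose")]
locked = {("bmi", "glucose")}  # clinician-locked edge
pruned = prune_edges(edges, locked, keep=lambda e: "glucose" not in e)
# The locked edge survives even though the predicate would drop it:
# pruned == [("age", "bmi"), ("bmi", "glucose")]
```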
GRACE/
├── datasets/ # Raw CSV datasets
├── kg/ # Knowledge Graphs (GraphML) and agent outputs
├── models/ # Saved trained model files
├── images/ # Saved plots and visualizations
├── main.py # Main script to run the full pipeline
├── graph_reduction.py # Core logic for graph optimization using Optuna
├── create_kg.py # Script for agent-based initial KG creation
├── visualizations.py # Functions for plotting results
├── utils.py # Utility functions for graph manipulation
├── params.py # All user-configurable parameters
└── README.md # This file