Team BeyondInfinity
NASA Space Apps Challenge 2025
ExoDetect AI is a comprehensive machine learning pipeline for automated exoplanet detection and validation across NASA's Kepler, K2, and TESS missions. Our system combines state-of-the-art ML models with professional astronomical vetting tools to classify transit signals as confirmed planets, candidates, or false positives.
Note: This is a hackathon project with known bugs and limitations. Contributions are welcome from anyone interested in improving exoplanet detection tools!
Exoplanet Detection - Create an AI/ML model trained on NASA's open-source exoplanet datasets with a web interface for user interaction.
Live app: https://beyondinfinity-lbws.streamlit.app/
- Multi-Mission Support: Models trained on Kepler, K2, and TESS datasets
- Ensemble ML Approach: XGBoost, LightGBM, MLP, Random Forest
- Professional Validation: Integration with LEO-vetter for automated signal vetting
- Interactive Web App: Streamlit-based interface for predictions and analysis
- Complete Pipeline: From TIC number input to classification with confidence scores
- Light Curve Analysis: Periodogram generation, phase folding, and detrending
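The phase-folding step listed above reduces to a short transformation; a minimal sketch with synthetic observation times (not the app's actual implementation):

```python
import numpy as np

def phase_fold(time, period, epoch):
    """Fold observation times on a trial period, centering the transit at phase 0."""
    phase = ((time - epoch) / period) % 1.0
    # Shift phases into [-0.5, 0.5) so the transit sits in the middle of the plot
    phase[phase >= 0.5] -= 1.0
    return phase

# Synthetic example: 10 days of observations folded on a 1.43-day period
time = np.linspace(0.0, 10.0, 500)
phase = phase_fold(time, period=1.43, epoch=0.0)
```

Plotting flux against these folded phases stacks every transit on top of each other, which is what makes shallow periodic dips visible.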
- XGBoost: 91% accuracy, 0.981 macro AUC
- LightGBM: 93% accuracy, 90% precision
- MLP Neural Network: 90% accuracy, 0.972 macro AUC
- Random Forest: 90% OOB score
- 3-class problem: CONFIRMED, CANDIDATE, FALSE POSITIVE
- XGBoost: 92.5% accuracy, 0.923 F1-score
- LightGBM: 92.8% accuracy, 0.927 F1-score
- 3-class problem: CONFIRMED, CANDIDATE, FALSE POSITIVE
- XGBoost: 76% accuracy (challenging 3-class problem)
- LightGBM: 75% accuracy
- TabPFN: Experimental approach for small datasets
Research & Planning
- Explored NASA Exoplanet Archive datasets (Kepler, K2, TESS)
- Studied the transit method and common false positive types
- Discovered LEO-vetter tool for professional signal validation
- Identified class imbalance as primary challenge (1:50+ ratios)
Data Processing & EDA
- Downloaded cumulative Kepler catalog (9,564 KOIs)
- Downloaded K2 EPIC catalog (4,585 candidates)
- Accessed TESS TOI catalog via astroquery (4,960 objects)
- Analyzed feature distributions and missing value patterns
- Implemented robust preprocessing pipeline
Model Development
- Trained initial models with severe class imbalance
- Experimented with undersampling, oversampling (SMOTE)
- Optimized hyperparameters for each mission/model combination
- Discovered TabPFN effectiveness on TESS data
- Achieved breakthrough with coarse-grained models
LEO-Vetter Integration
- Resolved ρ (stellar density) calculation bug
- Integrated lightkurve for TESS light curve fetching
- Connected astroquery for TIC catalog queries
- Implemented complete TIC → classification pipeline
- Generated diagnostic plots (periodograms, phase-folded curves)
Web Application Development
- Built Streamlit interface with 5 main pages
- Implemented 13-feature prediction system
- Added batch processing capabilities
- Created model comparison dashboard
- Debugged file path issues for deployment
Final Polish & Documentation
- Wrote comprehensive README
- Created demo script for judges
- Tested end-to-end workflows
- Prepared presentation materials
Raw Data (CSV)
↓
Preprocessing
- Missing value imputation (median strategy)
- Feature scaling (StandardScaler/RobustScaler)
- Class balancing (undersampling/SMOTE)
↓
Model Training
- XGBoost (gradient boosting)
- LightGBM (fast gradient boosting)
- MLP (neural network)
- Random Forest (ensemble)
- TabPFN (transformer-based, TESS)
↓
Validation
- Stratified K-fold cross-validation
- Balanced accuracy, precision, recall, F1
- ROC-AUC for multi-class
↓
Deployment (joblib serialization)
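The training stages above can be sketched with scikit-learn (a simplified stand-in: synthetic data, and a RandomForest in place of the mission-specific XGBoost/LightGBM models):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic imbalanced 3-class problem standing in for a mission catalog
X, y = make_classification(n_samples=600, n_features=13, n_informative=8,
                           n_classes=3, weights=[0.7, 0.2, 0.1], random_state=42)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # median imputation
    ("scale", RobustScaler()),                     # scaling robust to outliers
    ("clf", RandomForestClassifier(n_estimators=100, class_weight="balanced",
                                   random_state=42)),
])

# Stratified K-fold with balanced accuracy, as in the validation stage
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="balanced_accuracy")
print(scores.mean())
```

Wrapping imputation and scaling in the same `Pipeline` as the classifier keeps the preprocessing inside each cross-validation fold, avoiding leakage from the held-out split.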
TIC Number Input
↓
Light Curve Fetching (lightkurve + MAST)
↓
Preprocessing
- Remove NaNs and bad quality flags
- Detrend with transit masking
↓
Stellar Parameter Retrieval (TIC catalog)
- Radius, mass, temperature, surface gravity
- Calculate stellar density (ρ = M/R³)
- Limb darkening coefficients
↓
LEO-Vetter Analysis
- Odd-even transit comparison
- Secondary eclipse search
- Centroid motion analysis
- V-shaped transit detection
- Ghost diagnostic
↓
Classification: PC (Planet Candidate), FP (False Positive), FA (False Alarm)
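The stellar density used in the vetting stage follows directly from the TIC mass and radius; a small sketch in solar units (ρ☉ ≈ 1.41 g/cm³ is the approximate solar mean density used for conversion):

```python
RHO_SUN_CGS = 1.41  # approximate solar mean density, g/cm^3

def stellar_density(mass_msun, radius_rsun):
    """Mean stellar density rho = M / R^3 in solar units, returned in g/cm^3."""
    return RHO_SUN_CGS * mass_msun / radius_rsun ** 3

# A Sun-like star (M = 1 M_sun, R = 1 R_sun) recovers the solar density
print(stellar_density(1.0, 1.0))  # 1.41
```

Because density scales as 1/R³, small errors in the catalog radius dominate the uncertainty, which is one reason a density bug in the vetting integration was easy to hit and worth the debugging effort.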
exodetect-ai/
├── README.md
├── requirements.txt
├── st_app.py                    # Main Streamlit application
├── train_pipeline.py            # ML training script
│
├── models/
│   ├── xgboost_model.pkl        # Kepler XGBoost
│   ├── lightgbm_model.pkl       # Kepler LightGBM
│   ├── mlp_model.pkl            # Kepler MLP
│   ├── random_forest_model.pkl  # Kepler Random Forest
│   ├── lgb_coarse_model.pkl     # Kepler coarse-grained
│   ├── xgb_coarse_model.pkl     # Kepler coarse-grained
│   └── kmodel/
│       ├── xgboost_model.pkl    # K2 XGBoost
│       └── lightgbm_model.pkl   # K2 LightGBM
│
├── data/
│   ├── cumulative.csv           # Kepler dataset
│   ├── k2_epic.csv              # K2 dataset
│   └── tess_toi.csv             # TESS dataset
│
├── notebooks/
│   ├── kepler_eda.ipynb
│   ├── k2_training.ipynb
│   └── tess_tabpfn.ipynb
│
└── LEO-vetter/                  # Submodule for validation
- Python 3.8+
- pip package manager
# Clone repository
git clone https://github.com/BeyondInfinity/exodetect-ai.git
cd exodetect-ai
# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install core dependencies
pip install -r requirements.txt
# Install LEO-vetter (for TIC analysis)
pip install git+https://github.com/mkunimoto/LEO-vetter.git
pip install git+https://github.com/stevepur/transit-diffImage.git

# Kepler cumulative catalog
wget "https://exoplanetarchive.ipac.caltech.edu/cgi-bin/nstedAPI/nph-nstedAPI?table=cumulative"
# K2 EPIC catalog
wget "https://exoplanetarchive.ipac.caltech.edu/cgi-bin/nstedAPI/nph-nstedAPI?table=k2candidates"
# TESS TOI catalog (via astroquery in code)

streamlit run st_app.py

Navigate to http://localhost:8501
Input (13 features):
features = {
    'tce_period': 3.52,       # days
    'tce_duration': 2.5,      # hours
    'tce_depth': 1500.0,      # ppm
    'tce_snr': 12.5,
    'tce_rp_rs': 0.012,       # radius ratio
    'tce_impact': 0.5,
    'tce_model_chisq': 1.2,
    'tce_dof': 100,
    'tce_mes': 10.0,
    'stellar_logg': 4.4,
    'stellar_teff': 5777.0,   # K
    'stellar_rad': 1.0,       # R☉
    'stellar_mass': 1.0       # M☉
}

Output:
Prediction: CONFIRMED
Confidence: 89.3%
Class Probabilities:
- CANDIDATE: 8.7%
- CONFIRMED: 89.3%
- FALSE POSITIVE: 2.0%
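The confidence figure above is just the maximum of the classifier's per-class probabilities. A toy sketch of that mapping (a stand-in model trained on random 13-feature rows, not the actual Kepler model):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

CLASSES = ["CANDIDATE", "CONFIRMED", "FALSE POSITIVE"]

# Stand-in model: random features and labels, only to exercise the interface
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 13))
y = rng.integers(0, 3, size=300)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def classify(features_row):
    """Return the predicted label and the per-class probability breakdown."""
    proba = model.predict_proba(np.asarray(features_row).reshape(1, -1))[0]
    label = CLASSES[int(np.argmax(proba))]
    return label, dict(zip(CLASSES, proba))

label, probs = classify(rng.normal(size=13))
```

In the real app the same call runs against the serialized mission models, so the probabilities reflect the trained class boundaries rather than noise.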
# In Streamlit app
TIC: 231663901
Period: 1.430363 days
Epoch: 1338.885 BJD
Duration: 0.069 days
# Results:
LEO-Vetter: Planet Candidate (PC)
Depth: 1500 ppm
Duration: 1.66 hours
SNR: 12.5

python train_pipeline.py \
--data cumulative.csv \
--model xgboost \
--downsample 1000 \
--output my_model

- scikit-learn - Model training, preprocessing, metrics
- XGBoost - Gradient boosting (optimized for Kepler)
- LightGBM - Fast gradient boosting (best for K2)
- TensorFlow/Keras - Multi-layer perceptron
- TabPFN - Transformer for small TESS dataset
- imbalanced-learn - SMOTE, undersampling
- lightkurve - TESS/Kepler light curve analysis
- astroquery - MAST/TIC catalog queries
- astropy - FITS file handling, time series
- LEO-vetter - Professional signal validation
- Streamlit - Interactive web interface
- pandas - Data manipulation
- matplotlib/seaborn - Visualizations
- joblib - Model persistence
Exoplanets are detected when they pass in front of their host star, causing a periodic dip in brightness. Key parameters:
- Period: Time between transits (orbital period)
- Depth: Fractional brightness decrease (≈ (R_p/R_*)²)
- Duration: Length of transit event
- Shape: Ingress/egress profile indicates impact parameter
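The depth relation above is easy to sanity-check in code (a sketch; real depths are also shaped by limb darkening and flux dilution from nearby stars):

```python
def transit_depth_ppm(rp_over_rs):
    """Approximate transit depth (R_p/R_*)^2, expressed in parts per million."""
    return (rp_over_rs ** 2) * 1e6

# A Jupiter-like planet (R_p/R_* ~ 0.1) produces a ~1% dip, i.e. ~10,000 ppm
print(transit_depth_ppm(0.1))
```

An Earth-size planet around a Sun-like star (R_p/R_* ≈ 0.009) gives only ~84 ppm, which is why depth alone cannot separate small planets from noise and the SNR/MES features matter.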
- Eclipsing Binaries: Two stars orbiting each other
- Blended Systems: Background eclipsing binary
- Stellar Variability: Spots, flares, pulsations
- Instrumental Artifacts: Cosmic rays, detector noise
- Centroid Shifts: Light from nearby source
- Odd-Even Test: Compare odd/even numbered transits
- Secondary Eclipse: Search for occultation signal
- Centroid Motion: Star position during transit
- Ghost Diagnostic: Nearby contaminating sources
- Shape Analysis: V-shaped vs U-shaped transits
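Of these, the odd-even test is the simplest to sketch: if odd- and even-numbered transits differ significantly in depth, the signal is likely an eclipsing binary detected at half its true period. An illustrative version with synthetic depths (not LEO-vetter's implementation):

```python
import numpy as np

def odd_even_mismatch(depths_ppm):
    """Difference between mean odd and even transit depths, in units of the
    combined standard error; large values suggest an eclipsing binary."""
    d = np.asarray(depths_ppm, dtype=float)
    odd, even = d[0::2], d[1::2]
    se = np.sqrt(odd.std(ddof=1) ** 2 / odd.size +
                 even.std(ddof=1) ** 2 / even.size)
    return abs(odd.mean() - even.mean()) / se

rng = np.random.default_rng(1)
planet = rng.normal(1500, 50, size=20)           # consistent depths
binary = np.where(np.arange(20) % 2 == 0,        # alternating deep/shallow
                  rng.normal(1500, 50, size=20),
                  rng.normal(2500, 50, size=20))
```

Here the alternating-depth signal scores far higher than the consistent one; a vetting pipeline would flag anything above a few sigma for closer inspection.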
| Mission | Samples | Models | Best Accuracy | Best F1 | ROC-AUC |
|---|---|---|---|---|---|
| Kepler | 9,564 | 3 | 93% | 0.90 | 0.981 |
| K2 | 4,585 | 3 | 92.8% | 0.93 | N/A |
| TESS | 4,960 | 3 | 76% | 0.74 | N/A |
- Successfully handled 1:50+ class imbalance
- Integrated professional validation tools used by NASA
- Created end-to-end pipeline from raw TIC to classification
- Built intuitive interface accessible to researchers and public
- Achieved production-ready performance on Kepler data
- Class Imbalance: Solved with strategic undersampling + coarse models
- Missing Values: Robust imputation strategy (median + 10% fallback)
- LEO-Vetter Integration: Fixed stellar density calculation bug
- TESS Difficulty: Leveraged TabPFN for small, noisy dataset
- Deployment: Resolved model path issues, created flexible architecture
- Add TESS models to production app
- Implement batch CSV processing with progress bars
- Export LEO-vetter reports as PDF
- Add feature importance visualizations
- Create API endpoints for external tools
- Train on TOI+ (community-vetted TESS candidates)
- Implement active learning for labeling efficiency
- Add time series visualization (interactive light curves)
- Support for custom/uploaded light curves
- Multi-mission ensemble voting
- Real-time TESS alert processing
- Integration with JWST follow-up planning
- Atmosphere characterization predictions
- Habitability zone calculations
- Citizen science interface for labeling
Team Members:
- Rafaa Ali - Co-developer and collaborator.
Roles & Contributions:
During this intense 24-hour hackathon, we learned an immense amount about exoplanet detection, machine learning pipelines, and astronomical data processing. This project represents our first deep dive into:
- Handling severely imbalanced astronomical datasets
- Integrating professional scientific validation tools
- Building production ML pipelines from scratch
- Working with NASA's mission data archives
Current Bugs:
- LEO-vetter integration occasionally fails with certain TIC numbers
- Batch processing needs better error handling for malformed CSVs
- Model loading can timeout on slower connections
- Some edge cases in feature preprocessing cause prediction errors
- UI responsiveness issues with large batch uploads
Limitations:
- TESS models not yet integrated into production app
- No real-time validation of input feature ranges
- Limited error messages for invalid inputs
- Batch processing lacks progress tracking
- No model retraining interface
We welcome contributions! If you're interested in improving this tool, please see the Contributing section below.
This is an open hackathon project and we encourage contributions from the community! Whether you're an astronomer, data scientist, or developer, there are many ways to help:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Make your changes
- Test thoroughly
- Commit (`git commit -m 'Add amazing feature'`)
- Push (`git push origin feature/amazing-feature`)
- Open a Pull Request
- Bug Fixes: See issues labeled `bug` and `help-wanted`
- Documentation: Improve installation guides, add tutorials
- Testing: Write unit tests, integration tests
- Features: Implement items from Future Work section
- UI/UX: Improve Streamlit interface design
- Performance: Optimize model loading and inference
- Data: Add support for more missions (JWST, etc.)
# Clone your fork
git clone https://github.com/YOUR_USERNAME/exodetect-ai.git
cd exodetect-ai
# Create virtual environment
python -m venv .venv
source .venv/bin/activate
# Install dev dependencies
pip install -r requirements-dev.txt
# Run tests
pytest tests/
# Run app locally
streamlit run st_app.py

- Follow PEP 8 for Python code
- Add docstrings to all functions
- Include type hints where applicable
- Write descriptive commit messages
- Add tests for new features
This hackathon was an incredible learning experience. Key takeaways:
- Class Imbalance is Hard: Real astronomical data is heavily imbalanced (1:50+ ratios). We learned multiple strategies (undersampling, SMOTE, class weights) and when to apply each.
- Domain Knowledge Matters: Understanding the physics of transits, types of false positives, and detection methods was crucial for feature engineering and model interpretation.
- Integration is Challenging: Connecting our models with LEO-vetter required debugging stellar density calculations and handling missing TIC catalog data gracefully.
- Performance ≈ Simplicity: Our best models often came from careful preprocessing rather than complex architectures.
- Time Constraints Force Prioritization: With 24 hours, we learned to focus on MVP features and defer nice-to-haves.
- Open Source is Powerful: Standing on the shoulders of giants (lightkurve, LEO-vetter, scikit-learn) let us accomplish far more than starting from scratch.
- LEO-Vetter GitHub - Kunimoto et al. (2022)
- Lightkurve Documentation
- XGBoost Documentation
- Streamlit Documentation
- Borucki et al. (2010) - Kepler Planet-Detection Mission
- Ricker et al. (2015) - TESS Mission Overview
- Kunimoto et al. (2022) - Automated Vetting of Planet Candidates
- Thompson et al. (2018) - Kepler Data Characteristics Handbook
This project is licensed under the MIT License - see LICENSE file for details.
Datasets are provided by NASA and are in the public domain.
- NASA Exoplanet Science Institute for maintaining public archives
- Kepler/K2/TESS Science Teams for mission data
- Michelle Kunimoto for LEO-vetter tool
- Lightkurve Collaboration for light curve analysis tools
- NASA Space Apps Challenge organizers
- Open-source ML community (scikit-learn, XGBoost, etc.)
Team BeyondInfinity
- GitHub: github.com/moe-phantom
- Email: maaabkiron@gmail.com
- LinkedIn: MOHAMED ALWTHIQ, RAFAA ALI ABDALLA
Dedicated to the discovery of new worlds and the advancement of human knowledge.
Made with ❤️ for NASA Space Apps Challenge 2025