SCoPe: ZTF Source Classification Project

scope-ml uses machine learning to classify light curves from the Zwicky Transient Facility (ZTF) and the Vera C. Rubin Observatory (LSST). The documentation is hosted at https://zwickytransientfacility.github.io/scope-ml/. For local preview, install mkdocs-material and run mkdocs serve.

Feature generation includes period-finding (Conditional Entropy, Analysis of Variance, Lomb-Scargle, FPW) and Fourier decomposition via the periodfind library. Fourier features are computed using a batched weighted linear least-squares solver with BIC model selection, replacing the previous per-source scipy.optimize.curve_fit loop.
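As an illustration of that approach (a minimal single-source sketch, not the batched scope-ml implementation), a weighted linear least-squares Fourier fit with BIC order selection looks roughly like this:

# Minimal sketch: Fourier decomposition of one light curve by weighted linear
# least squares, with the Fourier order chosen by BIC. scope-ml's batched
# solver fits many sources at once; this is illustrative only.
import numpy as np

def fit_fourier_bic(t, y, yerr, period, max_order=5):
    phase = 2.0 * np.pi * t / period
    w = 1.0 / yerr**2
    best = None
    for order in range(1, max_order + 1):
        # Design matrix: constant term plus sin/cos pairs up to `order`.
        cols = [np.ones_like(t)]
        for k in range(1, order + 1):
            cols += [np.sin(k * phase), np.cos(k * phase)]
        X = np.column_stack(cols)
        # Weighted normal equations: (X^T W X) beta = X^T W y
        XtW = X.T * w
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        chi2 = np.sum(w * (y - X @ beta) ** 2)
        # BIC up to an additive constant, assuming Gaussian errors.
        bic = chi2 + X.shape[1] * np.log(len(t))
        if best is None or bic < best[0]:
            best = (bic, order, beta)
    return best  # (BIC, selected order, coefficients)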

Rubin DP1 Feature Generation (Local Parquet Files)

The pipeline supports running against Rubin Data Preview 1 (DP1) data stored as local parquet files, bypassing the TAP API entirely. This is the recommended approach for large-scale feature generation.

Prerequisites

You need three gzip-compressed parquet files downloaded from the Rubin Science Platform and placed in a single directory:

/path/to/dp1_data/
  Object.parquet.gzip
  ForcedSource.parquet.gzip
  Visit.parquet.gzip
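
Once downloaded, a quick sanity check with pandas (assuming the gzip compression is internal to the parquet files, so pyarrow can read them directly):

# Illustrative check that the three DP1 tables load as expected.
import pandas as pd

data_path = "/path/to/dp1_data/"
objects = pd.read_parquet(data_path + "Object.parquet.gzip")
forced = pd.read_parquet(data_path + "ForcedSource.parquet.gzip")
visits = pd.read_parquet(data_path + "Visit.parquet.gzip")
print(len(objects), "objects |", len(forced), "forced-source rows |", len(visits), "visits")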

Configuration

Tell scope-ml to use local files instead of the TAP API by setting data_path in config.yaml:

rubin:
  data_path: /path/to/dp1_data/

Or set the environment variable:

export RUBIN_DATA_PATH=/path/to/dp1_data/

When data_path (or RUBIN_DATA_PATH) is set, all Rubin commands automatically use the local parquet backend (RubinLocalClient) instead of the TAP client. No token is needed for local mode.
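A rough sketch of how that resolution could be expressed (whether the environment variable or config.yaml takes precedence is an assumption here; check the scope-ml source for the actual behavior):

# Hedged sketch: resolve the Rubin data path from the environment or config.
import os
import yaml

def resolve_rubin_data_path(config_path="config.yaml"):
    with open(config_path) as f:
        config = yaml.safe_load(f) or {}
    # A non-empty result selects the local parquet backend (RubinLocalClient);
    # no Rubin Science Platform token is needed in that case.
    return os.environ.get("RUBIN_DATA_PATH") or (config.get("rubin") or {}).get("data_path")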

Single-run feature generation

For a small number of sources (e.g. a cone search or a short object list):

# Cone search
generate-features-rubin --ra 62.0 --dec -37.0 --radius 60 --doCPU

# From a CSV file with an objectId column
generate-features-rubin --objectid-file my_objects.csv --doCPU

Output is written to generated_features_rubin/gen_features_rubin.parquet by default.
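To take a quick look at the output (assuming the default path above):

import pandas as pd

features = pd.read_parquet("generated_features_rubin/gen_features_rubin.parquet")
print(features.shape)
print(features.columns.tolist())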

Large-scale processing with SLURM

For processing the full DP1 catalog, use the chunked SLURM workflow:

1. Prepare chunks by scanning the local parquet files, filtering to objects with enough detections, and splitting them into chunk CSVs:

prepare-rubin-chunks \
  --data-path /path/to/dp1_data/ \
  --chunk-size 5000 \
  --min-n-lc-points 50 \
  --output-dir rubin_chunks

This writes rubin_chunks/chunk_000.csv, rubin_chunks/chunk_001.csv, etc., plus a master list rubin_chunks/all_eligible_objectids.csv.
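
For orientation, this step roughly amounts to the following (the objectId column name and DP1 file paths are taken from the examples above; prepare-rubin-chunks handles the real details):

# Sketch: keep objects with >= 50 forced-source detections, split into 5000-row CSVs.
import os
import pandas as pd

forced = pd.read_parquet("/path/to/dp1_data/ForcedSource.parquet.gzip")
counts = forced.groupby("objectId").size()
eligible = counts[counts >= 50].index

os.makedirs("rubin_chunks", exist_ok=True)
pd.DataFrame({"objectId": eligible}).to_csv("rubin_chunks/all_eligible_objectids.csv", index=False)

chunk_size = 5000
for i in range(0, len(eligible), chunk_size):
    pd.DataFrame({"objectId": eligible[i : i + chunk_size]}).to_csv(
        f"rubin_chunks/chunk_{i // chunk_size:03d}.csv", index=False
    )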

2. Generate the SLURM array script:

generate-features-rubin-slurm \
  --chunk-dir rubin_chunks \
  --output-dir rubin_slurm \
  --venv /path/to/your/.venv \
  --cpus-per-task 8 \
  --top-n-periods 50

This writes rubin_slurm/run_rubin_features.sh. Edit the script to adjust partition, account, memory, and module loads for your cluster.

3. Submit the array job:

sbatch rubin_slurm/run_rubin_features.sh

Each array task processes one chunk and writes generated_features_rubin/gen_features_rubin_<TASK_ID>.parquet.
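
For reference, the per-task bookkeeping inside the generated script boils down to mapping the array index to a chunk file and an output name (the naming here mirrors the examples above and may differ in detail from the generated script):

# Each SLURM array task picks its chunk and output from SLURM_ARRAY_TASK_ID.
import os

task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
chunk_csv = f"rubin_chunks/chunk_{task_id:03d}.csv"
output_parquet = f"generated_features_rubin/gen_features_rubin_{task_id}.parquet"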

4. Combine results after all jobs finish:

combine-rubin-features \
  --input-dir generated_features_rubin \
  --output generated_features_rubin/dp1_features_combined.parquet
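
The merge itself is conceptually just a concatenation of the per-task outputs; a minimal pandas sketch of what it amounts to (combine-rubin-features handles this for you):

# Concatenate all per-task parquet outputs into one table.
import glob
import pandas as pd

parts = sorted(glob.glob("generated_features_rubin/gen_features_rubin_*.parquet"))
combined = pd.concat((pd.read_parquet(p) for p in parts), ignore_index=True)
combined.to_parquet("generated_features_rubin/dp1_features_combined.parquet")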

Single-node chunked runner (no SLURM)

If you don't have a SLURM cluster, you can run chunks sequentially (or resume after interruption) with:

python tools/run_rubin_chunked.py \
  --objectid-file rubin_chunks/all_eligible_objectids.csv \
  --doCPU \
  --chunk-size 5000 \
  --top-n-periods 50

Completed chunks are saved to generated_features_rubin/chunks/ and skipped on restart, so the job is resumable.
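
The resume behavior boils down to skipping any chunk whose output file already exists; a sketch of that pattern (the exact file naming is an assumption; see tools/run_rubin_chunked.py for the real convention):

# Sketch of a resumable chunk loop: skip chunks whose output already exists.
import os

def run_all_chunks(chunks, run_chunk, chunk_dir="generated_features_rubin/chunks"):
    # `chunks` is a list of objectId batches; `run_chunk(ids, out_path)` stands
    # in for the per-chunk feature-generation call.
    os.makedirs(chunk_dir, exist_ok=True)
    for i, ids in enumerate(chunks):
        out = os.path.join(chunk_dir, f"chunk_{i:03d}.parquet")
        if os.path.exists(out):
            continue  # completed in a previous run
        run_chunk(ids, out)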

CLI reference

| Command | Description |
| --- | --- |
| get-rubin-ids | Discover object IDs via cone search or read from CSV |
| generate-features-rubin | Generate features for a set of Rubin sources |
| prepare-rubin-chunks | Scan local parquet files and split eligible objects into chunk CSVs |
| generate-features-rubin-slurm | Generate a SLURM array job script from chunk files |
| combine-rubin-features | Merge per-chunk parquet outputs into a single file |

Funding

We gratefully acknowledge previous and current support from the U.S. National Science Foundation (NSF) Harnessing the Data Revolution (HDR) Institute for Accelerated AI Algorithms for Data-Driven Discovery (A3D3) under Cooperative Agreement No. PHY-2117997.
