AnacletoLAB/RNA_m5C_predict

RNA m⁵C Predictor

Deep‑learning inference for cytosine 5‑methylation in Human RNA sequences

Pipeline overview. End‑to‑end pipeline for writer‑resolved m5C prediction.

(a) Construction of a high‑confidence m5C catalog. The 45,500 bisulfite calls are lifted over and mapped to GENCODE v19, and cytosines on methylated transcripts are split into methylated (turquoise) and unmethylated (orange). Nine STREME runs plus manual curation yield four writer motifs (Type I–IV, boxed). PFM‑based rescoring removes BS‑seq artifacts, random negatives are added from the same transcripts, and redundancy filtering produces a 26,300×2 non‑redundant corpus.

(b) Model training and inference. A five‑fold CV grid search (Bi‑GRU, CNN, Transformer) selects Bi‑GRU as the best model. Its false‑positive calls on around 10 million held‑out cytosines are harvested as hard negatives, redundancy‑filtered and merged into an augmented training set; the model is then re‑trained and deployed transcriptome‑wide.

(c) Resources. The resulting writer‑specific prediction database, together with refined motifs and coherent secondary‑structure profiles, is released for community use. This repository also provides a standalone Python tool for m5C prediction from FASTA files, as described below.

This repository accompanies: AI methods and biologically informed data curation enable accurate RNA m⁵C prediction
Saitto et al., 2025 (preprint on bioRxiv) DOI: 10.1101/2025.09.22.677824


What’s inside

📂 dataset/              # training/validation data
📂 experiments/          # training + analysis code
📂 experiments/models    # code of the Bi-GRU, CNN and Transformer models to predict m5C RNA modifications
📂 human_transcriptome_predictions/     # dataframe with predicted m5Cs across the human transcriptome
📂 model_weights/        # final Bi‑GRU checkpoint (heavy hard‑negative mining)
predict_m5c.py           # ← run this to predict new samples
test.fasta               # tiny example FASTA for a smoke test

How predict_m5c.py works

  1. Loads the frozen Bi‑GRU from model_weights/…pt.
  2. Parses an input FASTA (DNA or RNA; U is transparently mapped to T).
  3. Slides a 51‑nt window over every cytosine (25 nt flanks). Windows containing ambiguous bases (N, etc.) are skipped.
  4. One‑hot encodes each window to (51, 4) and batches them.
  5. Predicts five classes (unmodified, I, II, III, IV) and writes a probability table.
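
Steps 2 to 4 can be sketched in plain Python as follows. This is a minimal illustration, not the code in predict_m5c.py: the function names, the A/C/G/T channel order, the zero‑based positions, and the choice to skip cytosines closer than 25 nt to either end of the sequence are all assumptions made for the example.

```python
# Illustrative sketch only; names and conventions here are hypothetical,
# not taken from predict_m5c.py.
FLANK = 25            # 25 nt on each side of the cytosine -> 51-nt window
ALPHABET = "ACGT"     # assumed channel order for the one-hot encoding


def extract_windows(seq, flank=FLANK):
    """Yield (zero-based position, window) for each cytosine with a full,
    unambiguous window. Cytosines near the sequence ends are skipped here;
    the real script may handle edges differently."""
    seq = seq.upper().replace("U", "T")   # map RNA input onto the DNA alphabet
    for pos in range(flank, len(seq) - flank):
        if seq[pos] != "C":
            continue
        window = seq[pos - flank: pos + flank + 1]
        if set(window) <= set(ALPHABET):  # drop windows containing N or other codes
            yield pos, window


def one_hot(window):
    """Encode a window as a len(window) x 4 nested list of 0/1 values."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in window]


example = "A" * 25 + "C" + "G" * 25       # one cytosine with complete flanks
windows = list(extract_windows(example))
encoded = one_hot(windows[0][1])
```

In practice the encoded windows would be stacked into a tensor of shape (batch, 51, 4) before being passed to the model.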

Requirements (inference)

Package | Version used
Python | 3.11.5
PyTorch | 2.1.0
pandas | 2.2.3

Optional for Excel output: openpyxl ≥ 3.1.0

The script auto‑detects a GPU. Pass --cpu to force CPU‑only inference (works even with a CPU‑only PyTorch wheel).


Installation

python -m pip install --upgrade pip
pip install torch==2.1.0 pandas==2.2.3
# Excel support (optional)
pip install "openpyxl>=3.1.0"

Usage

# place your FASTA file inside the repository directory

# quick start (GPU if available)
python predict_m5c.py --fasta_file my_sequences.fasta

# force CPU
python predict_m5c.py --fasta_file my_sequences.fasta --cpu

# change batch size (faster on large GPUs)
python predict_m5c.py --fasta_file my_sequences.fasta --batch_size 256

# choose output format by extension
python predict_m5c.py --fasta_file my_sequences.fasta --output_file results.tsv   # TSV
python predict_m5c.py --fasta_file my_sequences.fasta --output_file results.xlsx  # Excel (needs openpyxl)

Output columns

sequence_id | position | Type (unmodified/I/II/III/IV) | p. unmodified | p. I | p. II | p. III | p. IV
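
A TSV result can be post‑processed with the standard library alone. The sketch below keeps only cytosines assigned to a writer class above a chosen probability. It assumes the header strings in the file match the column names listed above exactly (for example "p. I"), which should be checked against a real output file; the rows are made up for illustration, and `predicted_sites` is a hypothetical helper.

```python
import csv
import io

# Made-up rows for illustration; header strings are assumed to match the
# column list above and may differ in the real output.
EXAMPLE_TSV = (
    "sequence_id\tposition\tType\tp. unmodified\tp. I\tp. II\tp. III\tp. IV\n"
    "seq1\t25\tI\t0.05\t0.90\t0.03\t0.01\t0.01\n"
    "seq1\t40\tunmodified\t0.97\t0.01\t0.01\t0.005\t0.005\n"
    "seq2\t31\tII\t0.10\t0.05\t0.80\t0.03\t0.02\n"
)


def predicted_sites(handle, min_prob=0.5):
    """Return (sequence_id, position, Type) for rows whose predicted Type is
    a writer class and whose probability for that class is >= min_prob."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        label = row["Type"]
        if label == "unmodified":
            continue
        if float(row[f"p. {label}"]) >= min_prob:
            hits.append((row["sequence_id"], int(row["position"]), label))
    return hits


hits = predicted_sites(io.StringIO(EXAMPLE_TSV), min_prob=0.5)
strict_hits = predicted_sites(io.StringIO(EXAMPLE_TSV), min_prob=0.85)
```

For a real run, replace the in‑memory example with `open("results.tsv", newline="")`.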

Data availability

Location | File | Size | Purpose
GitHub (human_transcriptome_predictions/) | m5C_predictions.tsv.gz | 4 MB | Quick download; small enough for the repo.
Zenodo | m5C_predictions.tsv.gz (same checksum as GitHub) | 4 MB | Archival copy with DOI for citation; long‑term preservation.
(optional) Zenodo | m5C_predictions.xlsx | 16 MB | Excel version for bench biologists (exported from the TSV).

Data available at: https://doi.org/10.5281/zenodo.16629378

Dataset description

This dataset provides transcriptome‑wide predictions of RNA 5‑methyl‑cytosine (m⁵C) sites for the human reference transcriptome (GENCODE v45, GRCh38).

Each row corresponds to a cytosine residue predicted to be methylated and contains:

  • Transcript‑level identifiers: transcript_id, gene_id, gene_name, transcript_type, tags.
  • position: zero‑based coordinate of the cytosine within the transcript sequence.
  • Type: predicted methyltransferase class – I (NSUN2), II (NSUN6), III (NSUN5), IV (NSUN1).
  • probability: posterior probability assigned by the model (rounded to 4 decimals).
  • in_train_or_test_sets: TRUE if the 51‑nt window centred on this cytosine was present in the training or validation sets; FALSE otherwise.
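
As a usage sketch, the columns described above can be filtered to high‑probability sites that the model did not see during training or validation. The helper name `novel_sites`, the 0.9 cutoff, and the example rows (including the identifiers) are illustrative assumptions; the column names come from the description above, and the file is assumed to be a tab‑separated table with a header row.

```python
import csv
import io

# Made-up rows for illustration; identifiers are placeholders, not real transcripts.
EXAMPLE_ROWS = (
    "transcript_id\tgene_id\tgene_name\ttranscript_type\ttags\tposition\tType\t"
    "probability\tin_train_or_test_sets\n"
    "T1\tG1\tGENE1\tprotein_coding\tbasic\t99\tI\t0.9512\tFALSE\n"
    "T1\tG1\tGENE1\tprotein_coding\tbasic\t150\tII\t0.9900\tTRUE\n"
    "T2\tG2\tGENE2\tlncRNA\tbasic\t10\tIII\t0.6000\tFALSE\n"
)


def novel_sites(handle, min_probability=0.9):
    """Return rows absent from the training/validation sets whose probability
    passes the cutoff, adding a one-based coordinate for convenience."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["in_train_or_test_sets"].strip().upper() == "TRUE":
            continue
        if float(row["probability"]) >= min_probability:
            # 'position' is zero-based within the transcript sequence
            row["position_1based"] = int(row["position"]) + 1
            kept.append(row)
    return kept


sites = novel_sites(io.StringIO(EXAMPLE_ROWS))

# For the released file, a gzip text handle could be used instead, e.g.:
#   import gzip
#   with gzip.open("human_transcriptome_predictions/m5C_predictions.tsv.gz", "rt") as fh:
#       sites = novel_sites(fh)
```

Excluding rows flagged in in_train_or_test_sets is a conservative choice when the predictions are used to evaluate the model on independent data.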

Citation

Saitto, E., Casiraghi, E., Paccanaro, A. & Valentini, G.
AI methods and biologically informed data curation enable accurate RNA m5C prediction.
bioRxiv (September 2025).
DOI: 10.1101/2025.09.22.677824

