RNA m⁵C Predictor

Deep‑learning inference for cytosine 5‑methylation in human RNA sequences

End‑to‑end pipeline for writer‑resolved m5C prediction. (a) Construction of a high‑confidence m5C catalog. The 45,500 bisulfite calls are lifted over and mapped to GENCODE v19, and cytosines on methylated transcripts are split into methylated (turquoise) and unmethylated (orange). Nine STREME runs plus manual curation yield four writer motifs (Type I–IV, boxed). PFM‑based rescoring removes BS‑seq artifacts, random negatives are added from the same transcripts, and redundancy filtering produces a 26,300×2 non‑redundant corpus. (b) Model training and inference. A five‑fold CV grid search (Bi‑GRU, CNN, Transformer) selects Bi‑GRU as the best model. Its false‑positive calls on around 10 million held‑out cytosines are harvested as hard negatives, redundancy‑filtered and merged into an augmented training set; the model is then re‑trained and deployed transcriptome‑wide. (c) Resources. The resulting writer‑specific prediction database, together with refined motifs and coherent secondary‑structure profiles, is released for community use. This repository also provides a standalone Python tool for m5C prediction from FASTA files, as described below.

This repository accompanies: AI methods and biologically informed data curation enable accurate RNA m⁵C prediction
Saitto et al., 2025 (preprint on bioRxiv) DOI: 10.1101/2025.09.22.677824

What’s inside

📂 dataset/              # training/validation data
📂 experiments/          # training + analysis code
📂 experiments/models    # code of the Bi-GRU, CNN and Transformer models to predict m5C RNA modifications
📂 human_transcriptome_predictions/     # dataframe with predicted m5Cs across the human transcriptome
📂 model_weights/        # final Bi‑GRU checkpoint (heavy hard‑negative mining)
predict_m5c.py           # ← run this to predict new samples
test.fasta               # tiny example FASTA for a smoke test

How `predict_m5c.py` works

Loads the frozen Bi‑GRU from model_weights/…pt.
Parses an input FASTA (DNA or RNA; U is transparently mapped to T).
Slides a 51‑nt window over every cytosine (25 nt flanks). Windows containing ambiguous bases (N, etc.) are skipped.
One‑hot encodes each window to (51, 4) and batches them.
Predicts five classes (unmodified, I, II, III, IV) and writes a probability table.
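
The parsing, windowing and encoding steps above can be sketched in plain Python. This is a minimal illustration, not the code in `predict_m5c.py`: the function names, the `ACGT` channel order, the zero‑based positions and the choice to skip cytosines closer than 25 nt to either end of a sequence are all assumptions.

```python
from typing import Iterator

WINDOW = 51            # window length stated in the README
FLANK = WINDOW // 2    # 25 nt on each side of the central cytosine
ALPHABET = "ACGT"      # channel order is an assumption


def parse_fasta(text: str) -> Iterator[tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs; sequences are upper-cased and U becomes T."""
    seq_id, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            parts = line[1:].split()
            seq_id, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line.upper().replace("U", "T"))
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def cytosine_windows(seq: str) -> Iterator[tuple[int, str]]:
    """Yield (zero-based position, 51-nt window) for every usable cytosine.

    Assumption: cytosines without a full 25-nt flank on both sides are skipped.
    """
    for pos in range(FLANK, len(seq) - FLANK):
        if seq[pos] != "C":
            continue
        window = seq[pos - FLANK : pos + FLANK + 1]
        # Skip windows containing N or any other ambiguity code.
        if set(window) <= set(ALPHABET):
            yield pos, window


def one_hot(window: str) -> list[list[int]]:
    """Encode a window as a (51, 4) nested list with one 1 per row."""
    return [[int(base == letter) for letter in ALPHABET] for base in window]
```

Each encoded window can then be stacked into a batch of shape (batch, 51, 4) before it is passed to the model.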

Requirements (inference)

| Package | Version used |
|---|---|
| Python | 3.11.5 |
| PyTorch | 2.1.0 |
| pandas | 2.2.3 |

Optional for Excel output: openpyxl ≥ 3.1.0

The script auto‑detects a GPU. Pass --cpu to force CPU‑only inference (works even with a CPU‑only PyTorch wheel).

Installation

python -m pip install --upgrade pip
pip install torch==2.1.0 pandas==2.2.3
# Excel support (optional)
pip install "openpyxl>=3.1.0"

Usage

# place your FASTA file inside the repository

# quick start (GPU if available)
python predict_m5c.py --fasta_file my_sequences.fasta

# force CPU
python predict_m5c.py --fasta_file my_sequences.fasta --cpu

# change batch size (faster on large GPUs)
python predict_m5c.py --fasta_file my_sequences.fasta --batch_size 256

# choose output format by extension
python predict_m5c.py --fasta_file my_sequences.fasta --output_file results.tsv   # TSV
python predict_m5c.py --fasta_file my_sequences.fasta --output_file results.xlsx  # Excel (needs openpyxl)
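
For readers who want to wrap the tool in their own scripts, the command‑line interface shown above could be modelled roughly as follows. The four flags come from the usage examples; the default values, the helper names and the fallback to TSV for unrecognised extensions are assumptions, not behaviour confirmed by `predict_m5c.py`.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Mirror the flags used in the README examples (defaults are assumptions)."""
    parser = argparse.ArgumentParser(description="Predict m5C sites from a FASTA file")
    parser.add_argument("--fasta_file", required=True, help="input FASTA (DNA or RNA)")
    parser.add_argument("--cpu", action="store_true", help="force CPU-only inference")
    parser.add_argument("--batch_size", type=int, default=64)        # hypothetical default
    parser.add_argument("--output_file", default="predictions.tsv")  # hypothetical default
    return parser


def output_format(path: str) -> str:
    """Pick the output format from the file extension.

    Assumption: anything other than .xlsx is written as TSV.
    """
    return "excel" if Path(path).suffix.lower() == ".xlsx" else "tsv"


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.fasta_file, args.batch_size, output_format(args.output_file))
```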

Output columns

sequence_id | position | Type (unmodified/I/II/III/IV) | p. unmodified | p. I | p. II | p. III | p. IV
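
As a rough illustration of how one output row relates to the five class probabilities, the sketch below builds a row with the columns listed above. It assumes that `Type` is simply the class with the highest probability, which the README does not state explicitly; `make_row` is a hypothetical helper.

```python
CLASSES = ["unmodified", "I", "II", "III", "IV"]


def make_row(sequence_id: str, position: int, probs: list[float]) -> dict:
    """Build one output row from the five class probabilities.

    Assumption: the reported Type is the arg-max class.
    """
    if len(probs) != len(CLASSES):
        raise ValueError(f"expected {len(CLASSES)} probabilities, got {len(probs)}")
    best = max(range(len(CLASSES)), key=probs.__getitem__)
    row = {"sequence_id": sequence_id, "position": position, "Type": CLASSES[best]}
    for name, p in zip(CLASSES, probs):
        row[f"p. {name}"] = p
    return row


row = make_row("tx1", 25, [0.10, 0.60, 0.10, 0.10, 0.10])
```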

Data availability

| Location | File | Size | Purpose |
|---|---|---|---|
| GitHub (`human_transcriptome_predictions/`) | `m5C_predictions.tsv.gz` | 4 MB | Quick download; small enough for the repo. |
| Zenodo | `m5C_predictions.tsv.gz` (same checksum as GitHub) | 4 MB | Archival copy with DOI for citation; long‑term preservation. |
| (optional) Zenodo | `m5C_predictions.xlsx` | 16 MB | Excel version for bench biologists (exported from the TSV). |

Data available at: https://doi.org/10.5281/zenodo.16629378

Dataset description

This dataset provides transcriptome‑wide predictions of RNA 5‑methyl‑cytosine (m⁵C) sites for the human reference transcriptome (GENCODE v45, GRCh38).

Each row corresponds to a cytosine residue predicted to be methylated and contains:

Transcript‑level identifiers: transcript_id, gene_id, gene_name, transcript_type, tags.
position: zero‑based coordinate of the cytosine within the transcript sequence.
Type: predicted methyltransferase class – I (NSUN2), II (NSUN6), III (NSUN5), IV (NSUN1).
probability: posterior probability assigned by the model (rounded to 4 decimals).
in_train_or_test_sets: TRUE if the 51‑nt window centred on this cytosine was present in the training or validation sets; FALSE otherwise.
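
A minimal sketch of how the table might be loaded and filtered with the standard library is shown below. It assumes the file is a gzip‑compressed, tab‑separated table whose header uses exactly the column names listed above; the 0.9 threshold and the `load_predictions` helper are illustrative choices, not recommendations from the authors.

```python
import csv
import gzip


def load_predictions(path: str, min_probability: float = 0.9, exclude_seen: bool = True) -> list[dict]:
    """Read the predictions table and keep confident sites.

    Assumptions: tab-separated columns named as in the dataset description;
    in_train_or_test_sets is stored as the text TRUE/FALSE.
    """
    kept = []
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if float(row["probability"]) < min_probability:
                continue
            # Optionally drop sites whose window was seen during training or validation.
            if exclude_seen and row["in_train_or_test_sets"].strip().upper() == "TRUE":
                continue
            # Positions are zero-based; add 1 for tools that expect 1-based coordinates.
            row["position_1based"] = int(row["position"]) + 1
            kept.append(row)
    return kept
```

For example, `load_predictions("m5C_predictions.tsv.gz", min_probability=0.95)` would return only sites with probability of at least 0.95 that were not part of the training or validation sets.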

Citation

Saitto, E., Casiraghi, E., Paccanaro, A. & Valentini, G.
AI methods and biologically informed data curation enable accurate RNA m5C prediction.
bioRxiv (September 2025).
DOI: 10.1101/2025.09.22.677824
