Deep‑learning inference for cytosine 5‑methylation in Human RNA sequences
End‑to‑end pipeline for writer‑resolved m5C prediction.
(a) Construction of a high‑confidence m5C catalog. The 45,500 bisulfite calls are lifted over and mapped to GENCODE v19, and cytosines on methylated transcripts are split into methylated (turquoise) and unmethylated (orange). Nine STREME runs plus manual curation yield four writer motifs (Type I–IV, boxed). PFM‑based rescoring removes BS‑seq artifacts, random negatives are added from the same transcripts, and redundancy filtering produces a 26,300×2 non‑redundant corpus. (b) Model training and inference. A five‑fold CV grid search (Bi‑GRU, CNN, Transformer) selects Bi‑GRU as the best model. Its false‑positive calls on around 10 million held‑out cytosines are harvested as hard negatives, redundancy‑filtered and merged into an augmented training set; the model is then re‑trained and deployed transcriptome‑wide. (c) Resources. The resulting writer‑specific prediction database, together with refined motifs and coherent secondary‑structure profiles, is released for community use. In this repository we further provide a standalone Python tool for m5C prediction from FASTA files, as described below.
This repository accompanies: AI methods and biologically informed data curation enable accurate RNA m⁵C prediction
Saitto et al., 2025 (preprint on bioRxiv)
DOI: 10.1101/2025.09.22.677824
📂 dataset/ # training/validation data
📂 experiments/ # training + analysis code
📂 experiments/models # code of the Bi-GRU, CNN and Transformer models to predict m5C RNA modifications
📂 human_transcriptome_predictions/ # dataframe with predicted m5Cs across the human transcriptome
📂 model_weights/ # final Bi‑GRU checkpoint (heavy hard‑negative mining)
predict_m5c.py # ← run this to predict new samples
test.fasta # tiny example FASTA for a smoke test
- Loads the frozen Bi‑GRU from `model_weights/…pt`.
- Parses an input FASTA (DNA or RNA; `U` is transparently mapped to `T`).
- Slides a 51‑nt window over every cytosine (25 nt flanks). Windows containing ambiguous bases (`N`, etc.) are skipped.
- One‑hot encodes each window to `(51, 4)` and batches them.
- Predicts five classes (unmodified, I, II, III, IV) and writes a probability table.
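The windowing and encoding steps above can be illustrated with a minimal, dependency‑free sketch. This is not the code in `predict_m5c.py`: the function names, the `ACGT` channel order and the choice to skip cytosines that sit closer than 25 nt to a sequence end are all assumptions made for illustration.

```python
# Illustrative sketch only; names and edge handling are assumptions,
# not taken from predict_m5c.py.
FLANK = 25          # 25 nt on each side of the cytosine -> 51-nt window
ALPHABET = "ACGT"   # assumed one-hot channel order


def cytosine_windows(seq, flank=FLANK):
    """Yield (zero-based position, window) for each C with full, unambiguous flanks."""
    seq = seq.upper().replace("U", "T")  # map RNA input onto the DNA alphabet
    for pos, base in enumerate(seq):
        if base != "C":
            continue
        start, end = pos - flank, pos + flank + 1
        if start < 0 or end > len(seq):
            continue  # assumption: cytosines too close to an end are skipped
        window = seq[start:end]
        if set(window) <= set(ALPHABET):  # drop windows with N or other codes
            yield pos, window


def one_hot(window):
    """Encode a window as nested lists of shape (len(window), 4)."""
    return [[1 if base == letter else 0 for letter in ALPHABET] for base in window]


if __name__ == "__main__":
    demo = "A" * 25 + "C" + "G" * 25
    for pos, window in cytosine_windows(demo):
        encoded = one_hot(window)
        print(pos, len(encoded), len(encoded[0]))
```

In a real run the encoded windows would be stacked into batches and passed to the model; that part depends on the checkpoint and is left out here.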
| Package | Version used |
|---|---|
| Python | 3.11.5 |
| PyTorch | 2.1.0 |
| pandas | 2.2.3 |
Optional for Excel output: openpyxl ≥ 3.1.0
The script auto‑detects a GPU. Pass `--cpu` to force CPU‑only inference (works even with a CPU‑only PyTorch wheel).
python -m pip install --upgrade pip
pip install torch==2.1.0 pandas==2.2.3
# Excel support (optional)
pip install "openpyxl>=3.1.0"
# place ".fasta" file inside repository
# quick start (GPU if available)
python predict_m5c.py --fasta_file my_sequences.fasta
# force CPU
python predict_m5c.py --fasta_file my_sequences.fasta --cpu
# change batch size (faster on large GPUs)
python predict_m5c.py --fasta_file my_sequences.fasta --batch_size 256
# choose output format by extension
python predict_m5c.py --fasta_file my_sequences.fasta --output_file results.tsv # TSV
python predict_m5c.py --fasta_file my_sequences.fasta --output_file results.xlsx # Excel (needs openpyxl)
Output columns: sequence_id | position | Type (unmodified/I/II/III/IV) | p. unmodified | p. I | p. II | p. III | p. IV
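As a rough illustration of post‑processing, the sketch below reads a TSV result and keeps only cytosines called as one of the four writer classes. The column names are assumed to match the output columns listed above exactly, and the `min_prob` cut‑off and the `modified_sites` helper are hypothetical; check the header of your own results file before adapting it.

```python
import csv
import io

# Assumed column names, copied from the documented output columns; verify them
# against the first line of your own results file.
CLASSES = ["unmodified", "I", "II", "III", "IV"]


def modified_sites(handle, min_prob=0.5):
    """Return (sequence_id, position, Type) for writer-class calls with p >= min_prob."""
    hits = []
    for row in csv.DictReader(handle, delimiter="\t"):
        called = row["Type"]
        if called == "unmodified":
            continue
        if float(row[f"p. {called}"]) >= min_prob:
            hits.append((row["sequence_id"], int(row["position"]), called))
    return hits


# Made-up table standing in for results.tsv; the values are illustrative only.
example = io.StringIO(
    "sequence_id\tposition\tType\t" + "\t".join(f"p. {c}" for c in CLASSES) + "\n"
    "seq1\t30\tI\t0.10\t0.80\t0.05\t0.03\t0.02\n"
    "seq1\t47\tunmodified\t0.95\t0.02\t0.01\t0.01\t0.01\n"
    "seq2\t112\tII\t0.30\t0.10\t0.40\t0.10\t0.10\n"
)
print(modified_sites(example))
```

For a real file, replace the in‑memory example with `open("results.tsv", newline="")`.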
| Location | File | Size | Purpose |
|---|---|---|---|
| GitHub (human_transcriptome_predictions/) | m5C_predictions.tsv.gz | 4 MB | Quick download; small enough for the repo. |
| Zenodo | m5C_predictions.tsv.gz (same checksum as GitHub) | 4 MB | Archival copy with DOI for citation; long‑term preservation. |
| (optional) Zenodo | m5C_predictions.xlsx | 16 MB | Excel version for bench biologists (exported from the TSV). |
Data available at: https://doi.org/10.5281/zenodo.16629378
This dataset provides transcriptome‑wide predictions of RNA 5‑methyl‑cytosine (m⁵C) sites for the human reference transcriptome (GENCODE v45, GRCh38).
Each row corresponds to a cytosine residue predicted to be methylated and contains:
- Transcript‑level identifiers: `transcript_id`, `gene_id`, `gene_name`, `transcript_type`, `tags`.
- `position`: zero‑based coordinate of the cytosine within the transcript sequence.
- `Type`: predicted methyltransferase class – I (NSUN2), II (NSUN6), III (NSUN5), IV (NSUN1).
- `probability`: posterior probability assigned by the model (rounded to 4 decimals).
- `in_train_or_test_sets`: `TRUE` if the 51‑nt window centred on this cytosine was present in the training or validation sets; `FALSE` otherwise.
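The columns described above lend themselves to simple filtering, for example to list high‑probability sites of one writer class that the model did not see during training. The sketch below uses only the standard library; the helper names and the 0.9 threshold are hypothetical, and it assumes that `Type` stores the Roman numeral and that the flag is stored as the literal text `TRUE`/`FALSE`.

```python
import csv
import gzip


def load_predictions(path):
    """Read the gzipped TSV into a list of dicts, one per predicted cytosine."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def unseen_sites(rows, writer_type, min_prob=0.9):
    """Keep rows of one writer class that were outside the training/validation windows."""
    return [
        row for row in rows
        if row["Type"] == writer_type                       # assumed: "I".."IV"
        and row["in_train_or_test_sets"].upper() == "FALSE"  # assumed text flag
        and float(row["probability"]) >= min_prob
    ]


# Made-up rows with placeholder identifiers, for illustration only.
demo_rows = [
    {"transcript_id": "TX_A", "position": "120", "Type": "I",
     "probability": "0.9712", "in_train_or_test_sets": "FALSE"},
    {"transcript_id": "TX_A", "position": "305", "Type": "I",
     "probability": "0.9950", "in_train_or_test_sets": "TRUE"},
    {"transcript_id": "TX_B", "position": "42", "Type": "II",
     "probability": "0.9300", "in_train_or_test_sets": "FALSE"},
]
for row in unseen_sites(demo_rows, "I"):
    print(row["transcript_id"], row["position"], row["probability"])
```

With the real file, `rows = load_predictions("human_transcriptome_predictions/m5C_predictions.tsv.gz")` would replace the demo rows.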
Saitto, E., Casiraghi, E., Paccanaro, A. & Valentini, G.
AI methods and biologically informed data curation enable accurate RNA m5C prediction.
bioRxiv (September 2025).
DOI: 10.1101/2025.09.22.677824