Source code and dataset for the paper 'Saamayik: A Benchmark and Dataset for English-Sanskrit Translation'

ayushbits/Saamayik

Sāmayik: English-Sanskrit Parallel Dataset

Sāmayik: A Benchmark and Dataset for English-Sanskrit Translation
Ayush Maheshwari, Ashim Gupta, Amrith Krishna, Atul Kumar Singh, Ganesh Ramakrishnan, G. Anil Kumar and Jitin Singla
LREC-COLING 2024

Overview

Sāmayik is an English-Sanskrit parallel dataset that captures contemporary usage of Sanskrit, particularly in prose. It comprises around 53,000 parallel sentence pairs gathered from diverse sources, including spoken content on contemporary world affairs, interpretations of literary works, and pedagogical content.

🤗 Hugging Face Dataset

The complete dataset is available on Hugging Face with easy-to-use integration:

Dataset Link: https://huggingface.co/datasets/acomquest/Saamayik

from datasets import load_dataset

# Load the dataset
dataset = load_dataset("acomquest/Saamayik")

# Access splits
train = dataset['train']        # 43,493 sentences
validation = dataset['validation']  # 2,416 sentences  
test = dataset['test']          # 2,417 sentences
test_ood = dataset['test_ood']  # 4,047 sentences (Mann Ki Baat)

Data

  1. data/final_data/ - Main dataset with train, test, dev splits (48,326 sentences total)
    • Includes: Bible, NIOS, Spoken Tutorials, Gitasopanam
    • Does NOT include MKB (Mann Ki Baat)
  2. data/mkb/ - Mann Ki Baat dataset (4,047 sentences) - provided as out-of-domain evaluation set
  3. data/<corpus> - Individual corpus files: spoken-tutorials, gitasopanam, bible, nios

Data Splits

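| Split      | Total  | Main Dataset | MKB (OOD) | Description                       |
|------------|--------|--------------|-----------|-----------------------------------|
| train      | 43,493 | 43,493       | 0         | Training data                     |
| validation | 2,416  | 2,416        | 0         | Validation data                   |
| test       | 2,417  | 2,417        | 0         | In-domain test                    |
| test_ood   | 4,047  | 0            | 4,047     | Out-of-domain test (Mann Ki Baat) |
| Total      | 52,373 | 48,326       | 4,047     |                                   |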
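
The split sizes in the table above can be checked programmatically. The following is a minimal sketch, not part of the repository: `EXPECTED_SIZES` and `check_split_sizes` are hypothetical names introduced here for illustration, and the helper only assumes it receives a mapping from split name to any object that supports `len()`.

```python
from typing import Mapping, Sized

# Split sizes as listed in the Data Splits table (hypothetical constant name).
EXPECTED_SIZES = {
    "train": 43_493,
    "validation": 2_416,
    "test": 2_417,
    "test_ood": 4_047,
}


def check_split_sizes(splits: Mapping[str, Sized]) -> dict[str, tuple[int, int]]:
    """Return {split: (actual, expected)} for each split whose size differs.

    A split that is missing from `splits` is reported with an actual size of 0.
    """
    mismatches = {}
    for name, expected in EXPECTED_SIZES.items():
        actual = len(splits[name]) if name in splits else 0
        if actual != expected:
            mismatches[name] = (actual, expected)
    return mismatches


# Totals implied by the table: main dataset and main dataset plus MKB.
main_total = sum(EXPECTED_SIZES[s] for s in ("train", "validation", "test"))
grand_total = main_total + EXPECTED_SIZES["test_ood"]
print(main_total, grand_total)  # 48326 52373
```

Because the object returned by `load_dataset` is dictionary-like, passing it directly, as in `check_split_sizes(load_dataset("acomquest/Saamayik"))`, should return an empty dict when the downloaded splits match the table.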

Source Distribution

Main Dataset (48,326 sentences):

  1. Bible (7,838 pairs) - Sanskrit Bible translation from 1851
  2. NIOS (11,356 pairs) - National Institute of Open Schooling educational content
  3. Spoken Tutorials (23,835 pairs) - Technical and instructional content
  4. Gitasopanam (5,885 pairs) - Spiritual and philosophical texts

Out-of-Domain Dataset (4,047 sentences):

  1. Mann Ki Baat (MKB) - Monthly radio podcast (2014-2022) on contemporary topics

Note on Duplicates

This dataset intentionally retains natural duplicates to preserve the original distribution:

  • English side: 2,735 duplicate instances (5.66% of the 48,326 main-dataset sentences)
  • Sanskrit side: 722 duplicate instances (1.49% of the main dataset)
  • Exact duplicate pairs: 210 instances (0.43% of the main dataset)

Most duplicates are section headers, educational prompts, and boilerplate text from Spoken Tutorials and NIOS.
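
If deduplicated data is needed, statistics like these can be recomputed from a list of sentence pairs. This is a minimal sketch under stated assumptions, not the script used to produce the numbers above: `duplicate_stats` is a hypothetical helper, pairs are assumed to be `(english, sanskrit)` tuples, and an instance is counted as a duplicate when the same string or pair has already appeared, so n identical rows contribute n - 1 duplicates. The Sanskrit side of the toy data uses placeholder strings.

```python
from collections import Counter


def duplicate_stats(pairs):
    """Count repeated English sentences, Sanskrit sentences, and exact pairs.

    Returns {side: (duplicate_count, percent_of_total)}.
    """
    def extra(items):
        # Every occurrence after the first one counts as a duplicate.
        return sum(count - 1 for count in Counter(items).values())

    total = len(pairs)
    counts = {
        "english": extra(en for en, _ in pairs),
        "sanskrit": extra(sa for _, sa in pairs),
        "pairs": extra(pairs),
    }
    return {
        side: (n, round(100 * n / total, 2) if total else 0.0)
        for side, n in counts.items()
    }


# Toy example with placeholder Sanskrit strings (not real dataset rows).
toy_pairs = [
    ("Introduction", "sa-A"),
    ("Introduction", "sa-A"),
    ("Introduction", "sa-B"),
    ("Summary", "sa-C"),
]
stats = duplicate_stats(toy_pairs)
print(stats)
```

For the toy input, the English side has two duplicate instances, while the Sanskrit side and the exact pairs have one each.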

Evaluation and training scripts

  1. Each model folder contains training, evaluation, and data generation scripts.
  2. The fine-tuning and evaluation scripts for IndicTrans2 are used directly from the original IndicTrans2 repository.

Citation

If you use Sāmayik in your research, please cite our paper.

@inproceedings{maheshwari-etal-2024-samayik-benchmark,
    title = "Samayik: A Benchmark and Dataset for {E}nglish-{S}anskrit Translation",
    author = "Maheshwari, Ayush and Gupta, Ashim and Krishna, Amrith and Singh, Atul Kumar and Ramakrishnan, Ganesh and Gourishetty, Anil Kumar and Singla, Jitin",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = May,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1245",
    pages = "14298--14304",
}
