This project aims to identify the language of each word in a given sentence. The dataset is based on IDRBT-FIRE social media data, where each sentence contains a maximum of two languages: English and one Indian language (Telugu, Malayalam, Tamil, Marathi, Hindi, Kannada, Bengali, or Gujarati).
Word-Level Classification
- Eight binary Naïve Bayes classifiers, one per Indian language, are trained on word lists to distinguish English from that language.
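A minimal sketch of one such binary classifier, written in plain Python so that it runs without extra dependencies. The character-bigram features, add-one smoothing, the `en`/`hi` tag names, and the toy word lists are all assumptions made for illustration; the project's actual features and training data may differ.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return character n-grams of a word, padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class BinaryNaiveBayes:
    """Multinomial Naive Bayes over character bigrams (illustrative only)."""

    def fit(self, words_by_label):
        # words_by_label maps a label to its word list, e.g. {"en": [...], "hi": [...]}
        total_words = sum(len(ws) for ws in words_by_label.values())
        self.counts = {label: Counter() for label in words_by_label}
        self.priors = {}
        for label, words in words_by_label.items():
            self.priors[label] = math.log(len(words) / total_words)
            for word in words:
                self.counts[label].update(char_ngrams(word))
        self.vocab = set().union(*self.counts.values())
        self.totals = {label: sum(c.values()) for label, c in self.counts.items()}
        return self

    def predict(self, word):
        vocab_size = len(self.vocab)

        def log_score(label):
            score = self.priors[label]
            for gram in char_ngrams(word):
                # Add-one smoothing keeps unseen bigrams from zeroing the score.
                score += math.log(
                    (self.counts[label][gram] + 1) / (self.totals[label] + vocab_size)
                )
            return score

        return max(self.counts, key=log_score)


# Toy word lists standing in for eng2.txt and hindiW.txt (hypothetical contents).
clf = BinaryNaiveBayes().fit({
    "en": ["the", "this", "that", "there", "with", "what"],
    "hi": ["kya", "hai", "nahi", "kyun", "mera", "acha"],
})
print(clf.predict("then"))  # en
print(clf.predict("kyon"))  # hi
```

Under this sketch, training all eight classifiers would amount to repeating `fit` once per Indian word list, each time paired with the English list.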
Sentence-Level Classification
- Sentences are labeled with the languages present (English + one Indian language).
- A sentence-level classifier (MLP model with one hidden layer of 120 units) is trained to predict the language pair.
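The labeling step in the first bullet could be sketched as follows. The tag strings (`en`, `hi`, `te`), the `en+hi` label format, and the `ignore` parameter for non-language tags are hypothetical; the annotation files may use a different scheme. The MLP itself is not shown here.

```python
def sentence_pair_label(word_tags, english_tag="en", ignore=()):
    """Derive a sentence-level label such as "en+hi" from per-word tags."""
    present = {tag for tag in word_tags if tag not in ignore}
    indian = sorted(present - {english_tag})
    if len(indian) > 1:
        # The dataset description allows at most one Indian language per sentence.
        raise ValueError(f"expected at most one Indian language, got {indian}")
    parts = ([english_tag] if english_tag in present else []) + indian
    return "+".join(parts)


print(sentence_pair_label(["en", "hi", "hi", "en"]))  # en+hi
```

Labels produced this way would serve as the training targets for the sentence-level classifier.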
Word Label Prediction
- The trained binary classifiers predict language labels for individual words.
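One way this step might be wired up is sketched below. It assumes the Indian language predicted at sentence level selects which binary classifier labels the words; the description above does not state this routing explicitly. The `label_words` helper and the lexicon-based stand-in classifier are hypothetical.

```python
from typing import Callable, Dict, List

# A word classifier is modeled here as any callable from a word to a language tag.
WordClassifier = Callable[[str], str]


def label_words(
    words: List[str],
    indian_language: str,
    classifiers: Dict[str, WordClassifier],
) -> List[str]:
    """Label each word with the binary classifier for the given Indian language."""
    if indian_language not in classifiers:
        raise ValueError(f"no classifier trained for {indian_language!r}")
    classify = classifiers[indian_language]
    return [classify(word) for word in words]


# Stand-in for a trained English-vs-Hindi classifier (toy lexicon lookup).
ENGLISH_WORDS = {"the", "movie", "is"}
classifiers = {"hi": lambda word: "en" if word.lower() in ENGLISH_WORDS else "hi"}

print(label_words(["movie", "acha", "hai"], "hi", classifiers))  # ['en', 'hi', 'hi']
```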
CRF-Based Sequence Modeling
- The predicted word-level labels are mapped to actual sentence-level labels.
- A Conditional Random Field (CRF) model is trained with a window size of 3 to refine the final language identification.
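The windowing could be sketched as below, reading "window size of 3" as the previous, current, and next token; that reading, the feature names, and the use of the predicted word-level labels as features are assumptions. The output is one feature dict per token, a format that CRF toolkits such as sklearn-crfsuite accept; the CRF training call itself is not shown.

```python
def window_features(words, predicted, i):
    """Build features for token i from a 3-token window (previous, current, next)."""
    feats = {"bias": 1.0}
    for offset in (-1, 0, 1):
        j = i + offset
        if 0 <= j < len(words):
            feats[f"{offset:+d}:word"] = words[j].lower()
            feats[f"{offset:+d}:pred"] = predicted[j]
        else:
            # Mark positions that fall outside the sentence boundary.
            feats[f"{offset:+d}:pad"] = True
    return feats


def sentence_features(words, predicted):
    """Return one feature dict per token of a sentence."""
    if len(words) != len(predicted):
        raise ValueError("words and predicted labels must have the same length")
    return [window_features(words, predicted, i) for i in range(len(words))]


feats = sentence_features(["movie", "acha", "hai"], ["en", "hi", "hi"])
print(feats[1])
```

Each sentence then becomes a list of such dicts, paired with the list of gold labels from the annotation files.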
- InputTraining.txt / InputTesting.txt – Input sentences for training/testing.
- AnnotationTraining.txt / AnnotationTesting.txt – Labeled language annotations for training/testing.
- Word lists for different Indian languages:
  - bengaliW.txt
  - gujaratiW.txt
  - hindiW.txt
  - kannadaW.txt
  - malayalamW.txt
  - maratiW.txt
  - tamil.txt
  - telugu.txt
  - eng2.txt (English word list)