Language Identification with CRF

This project identifies the language of each word in a sentence. The dataset is based on IDRBT-FIRE social media data, in which each sentence contains at most two languages: English and one Indian language (Telugu, Malayalam, Tamil, Marathi, Hindi, Kannada, Bengali, or Gujarati).

Approach

Word-Level Classification
- Eight binary classifiers (Naïve Bayes) are trained to distinguish English from each Indian language using word lists.
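As a rough illustration of one such binary classifier, the sketch below trains a multinomial Naïve Bayes model on two tiny word lists. The character-bigram features, the class name `BinaryNaiveBayes`, the tags `en`/`te`, and the toy words are all assumptions made for this example; the README does not say which features the notebook uses.

```python
import math
from collections import Counter


def char_ngrams(word, n=2):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class BinaryNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, words_by_label):
        # words_by_label maps a language tag to its word list,
        # e.g. {"en": [...], "te": [...]} (tags are hypothetical).
        self.labels = list(words_by_label)
        total_words = sum(len(words) for words in words_by_label.values())
        self.log_prior = {}
        self.counts = {}
        vocab = set()
        for label, words in words_by_label.items():
            self.log_prior[label] = math.log(len(words) / total_words)
            self.counts[label] = Counter(
                gram for word in words for gram in char_ngrams(word)
            )
            vocab.update(self.counts[label])
        self.vocab_size = len(vocab)
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, word):
        def score(label):
            # Log prior plus smoothed log likelihood of each n-gram.
            denom = self.totals[label] + self.vocab_size
            return self.log_prior[label] + sum(
                math.log((self.counts[label][gram] + 1) / denom)
                for gram in char_ngrams(word)
            )
        return max(self.labels, key=score)


# Toy word lists for illustration only; the real lists live in eng2.txt etc.
toy_lists = {
    "en": ["the", "this", "that", "with", "house", "water"],
    "te": ["nenu", "meeru", "ledu", "undi", "kaadu", "cheppu"],
}
english_vs_telugu = BinaryNaiveBayes().fit(toy_lists)
```

Under this setup, eight such models would be fitted, each pairing the English list with one Indian-language list.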
Sentence-Level Classification
- Sentences are labeled with the languages present (English + one Indian language).
- A sentence-level classifier (MLP model with one hidden layer of 120 units) is trained to predict the language pair.
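One way to obtain the sentence-level training target is to read it off the word-level annotations. The sketch below is a minimal version of that step; the tag names in `INDIAN_LANGS` and the function `sentence_pair_label` are hypothetical, and the actual tag set in the annotation files may differ.

```python
from collections import Counter

# Hypothetical tag names for the eight Indian languages.
INDIAN_LANGS = {"te", "ml", "ta", "mr", "hi", "kn", "bn", "gu"}


def sentence_pair_label(word_tags):
    """Return the Indian language paired with English in a sentence.

    English is implied in every pair, so only the Indian-language tag is
    returned. Tags outside INDIAN_LANGS (English, punctuation, and so on)
    are ignored. Returns None if no Indian-language word is present.
    """
    counts = Counter(tag for tag in word_tags if tag in INDIAN_LANGS)
    if not counts:
        return None
    # The data is described as having one Indian language per sentence;
    # taking the most frequent tag guards against stray annotations.
    return counts.most_common(1)[0][0]
```

The returned tag could then serve as the class label when fitting the sentence-level model.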
Word Label Prediction
- The trained binary classifiers predict language labels for individual words.
CRF-Based Sequence Modeling
- The predicted word-level labels are mapped to actual sentence-level labels.
- A Conditional Random Field (CRF) model is trained with a window size of 3 to refine the final word-level language labels.
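The sketch below shows one plausible reading of "window size of 3": each token's features come from the previous, current, and next tokens, and include the label predicted by the word-level classifier. The feature names and the helpers `token_features` and `sentence_features` are assumptions for illustration, not the notebook's actual feature set.

```python
def token_features(words, predicted, i):
    """Build features for position i from a 3-token window (i-1, i, i+1).

    words     -- the tokens of one sentence
    predicted -- the word-level classifier's label for each token
    """
    feats = {"bias": 1.0}
    for offset in (-1, 0, 1):
        j = i + offset
        if 0 <= j < len(words):
            feats[f"{offset:+d}:word.lower"] = words[j].lower()
            feats[f"{offset:+d}:nb_label"] = predicted[j]
        elif j < 0:
            feats["BOS"] = True  # window runs past the sentence start
        else:
            feats["EOS"] = True  # window runs past the sentence end
    return feats


def sentence_features(words, predicted):
    """Return one feature dict per token in the sentence."""
    return [token_features(words, predicted, i) for i in range(len(words))]
```

A list of such per-sentence feature lists, paired with the gold label sequences, is the input shape that CRF toolkits such as sklearn-crfsuite accept for training.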

Dataset

InputTraining.txt / InputTesting.txt – Input sentences for training/testing.
AnnotationTraining.txt / AnnotationTesting.txt – Labeled language annotations for training/testing.
Word lists, one per language:
- bengaliW.txt
- gujaratiW.txt
- hindiW.txt
- kannadaW.txt
- malayalamW.txt
- maratiW.txt
- tamil.txt
- telugu.txt
- eng2.txt (English word list)
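A minimal sketch of pairing input sentences with their annotations follows. It assumes that each file holds one sentence per line, that tokens and labels are whitespace-separated, and that the two files are aligned line by line; the README does not document the file format, so check the files before relying on this. The function name `read_aligned` is hypothetical.

```python
def read_aligned(input_lines, annotation_lines):
    """Pair each word with its label, sentence by sentence.

    Both arguments are iterables of strings (for example, open file
    objects). Raises ValueError if a sentence and its annotation line
    have different token counts.
    """
    sentences = []
    pairs = zip(input_lines, annotation_lines)
    for line_no, (sentence, tags) in enumerate(pairs, start=1):
        words, labels = sentence.split(), tags.split()
        if len(words) != len(labels):
            raise ValueError(
                f"line {line_no}: {len(words)} words but {len(labels)} labels"
            )
        sentences.append(list(zip(words, labels)))
    return sentences


# Usage with the training files (assuming the format described above):
# with open("InputTraining.txt", encoding="utf-8") as f_in, \
#         open("AnnotationTraining.txt", encoding="utf-8") as f_ann:
#     training = read_aligned(f_in, f_ann)
```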

CaptainO5/Language-Identification
