imhnor/nlp-basics-en-ur-token-ngrams-smoothing-pentreebank


Urdu NLP — N-Gram Models, Smoothing, and Tokenization

This notebook implements several Natural Language Processing (NLP) techniques for Urdu text in Python, covering both statistical language modeling and tokenization.

Features

  1. N-Gram Model with Add-One (Laplace) Smoothing

    • Implementation of an N-gram language model using Maximum Likelihood Estimation (MLE) and Add-One smoothing to handle zero-probability issues.
  2. Smooth_NGRAM

    • Enhanced N-gram model with configurable smoothing techniques for better probability estimation.
  3. Unicode Byte Pair Encoding (BPE)

    • Subword tokenization method adapted for Urdu script, suitable for morphologically rich languages.
  4. Penn Treebank Tokenization (PTB)

    • Tokenization method based on the Penn Treebank style with modifications for Urdu and Unicode compatibility.
  5. Urdu-Specific Preprocessing — Joiner & Non-Joiner Handling

    • Proper handling of Urdu joiner characters (Zero Width Joiner, Zero Width Non-Joiner) during normalization and tokenization.
  6. Evaluation & Examples

    • Perplexity calculation, probability lookup, and next-word prediction examples.
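As a minimal sketch of items 1 and 6 above, the following shows a bigram model with add-one (Laplace) smoothing and a perplexity calculation. The function names and the toy corpus are illustrative, not the notebook's actual code:

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count unigrams and bigrams over pre-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def laplace_prob(w_prev, w, unigrams, bigrams, vocab_size):
    """P(w | w_prev) with add-one smoothing: (c(prev, w) + 1) / (c(prev) + V).
    Unseen bigrams get a small nonzero probability instead of zero."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

def perplexity(tokens, unigrams, bigrams, vocab_size):
    """Per-token perplexity of one sentence under the smoothed bigram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_p = sum(
        math.log(laplace_prob(prev, w, unigrams, bigrams, vocab_size))
        for prev, w in zip(padded, padded[1:])
    )
    return math.exp(-log_p / (len(padded) - 1))

# Tiny Urdu corpus: "this is a book" / "this is a pen" (simplified)
corpus = [["یہ", "کتاب", "ہے"], ["یہ", "قلم", "ہے"]]
uni, bi = train_bigram_counts(corpus)
V = len(uni)  # vocabulary size, including <s> and </s>
print(laplace_prob("یہ", "کتاب", uni, bi, V))  # (1+1)/(2+6) = 0.25
```

Because add-one smoothing never assigns zero probability, perplexity stays finite even for sentences containing unseen bigrams.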
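The BPE variant in item 3 can be sketched as a character-level merge loop; because it operates on Unicode code points rather than ASCII bytes, the same logic applies to Urdu script. This is a generic illustration under assumed names (`bpe_merges`), not the notebook's implementation:

```python
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn BPE merges from a {word: frequency} dict.

    Words are treated as tuples of Unicode characters; each iteration
    merges the most frequent adjacent symbol pair across the corpus.
    """
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                # Replace every occurrence of the best pair with one symbol
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab
```

A morphologically rich language benefits here because frequent stems and affixes surface as merged subword units instead of being split into individual characters.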

Goal

To demonstrate practical NLP solutions for Urdu text using statistical language models.
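The joiner handling in feature 5 can be sketched as a small normalization step. Urdu text often contains Zero Width Non-Joiner (U+200C), which is linguistically meaningful inside compounds, and Zero Width Joiner (U+200D), which is usually safe to strip. The function name and the default policy below are assumptions for illustration:

```python
ZWJ = "\u200d"   # Zero Width Joiner
ZWNJ = "\u200c"  # Zero Width Non-Joiner

def normalize_joiners(text, keep_zwnj=True):
    """Normalize Urdu joiner characters before tokenization.

    ZWJ is stripped unconditionally; ZWNJ is either preserved (it marks
    word-internal boundaries in compounds such as خوش‌حال) or replaced
    with a space so the compound splits into separate tokens.
    """
    text = text.replace(ZWJ, "")
    if not keep_zwnj:
        text = text.replace(ZWNJ, " ")
    return text
```

Whether to keep or split on ZWNJ depends on the downstream model: a word-level N-gram model may prefer splitting, while a subword tokenizer can keep it as part of the character stream.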

About

Multiple NLP task solutions: Byte Pair Encoding, Penn Treebank tokenization, and an N-gram model with add-one smoothing.
