This notebook provides multiple Natural Language Processing (NLP) solutions for Urdu text processing in Python, focusing on both statistical language modeling and tokenization techniques.
**N-Gram Model with Add-One (Laplace) Smoothing**
- Implementation of an N-gram language model using Maximum Likelihood Estimation (MLE) counts with Add-One smoothing, so that unseen n-grams never receive zero probability.
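As an illustration, a minimal sketch of a bigram model with Add-One (Laplace) smoothing; function names such as `train_bigrams` and `bigram_prob` are illustrative, not the notebook's actual API:

```python
from collections import defaultdict

def train_bigrams(sentences):
    """Count unigrams and bigrams over sentences padded with <s>/</s>."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for w in padded:
            unigrams[w] += 1
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Add-One smoothed P(word | prev): every count is incremented by 1,
    and the denominator grows by the vocabulary size V."""
    V = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
```

With a one-sentence toy corpus, a seen bigram and an unseen bigram both receive non-zero probability, which is exactly the point of the smoothing.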
**Smooth_NGRAM**
- Enhanced N-gram model with configurable smoothing techniques for better probability estimation.
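One common way to make the smoothing configurable is Add-k smoothing, of which Add-One is the special case k = 1; this is a hedged sketch of that idea, not necessarily the exact technique `Smooth_NGRAM` implements:

```python
def addk_prob(bigram_count, prev_count, vocab_size, k=1.0):
    """Add-k smoothed conditional probability.

    k=1.0 reproduces Laplace (Add-One) smoothing; a smaller k
    (e.g. 0.1) shifts less probability mass onto unseen events.
    """
    return (bigram_count + k) / (prev_count + k * vocab_size)
```

Exposing `k` as a parameter lets the same estimator be tuned per corpus instead of hard-coding Add-One.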
**Unicode Byte Pair Encoding (BPE)**
- Subword tokenization method adapted for Urdu script, suitable for morphologically rich languages.
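The core BPE training loop can be sketched as follows: start from individual Unicode characters and repeatedly merge the most frequent adjacent pair. This is a minimal illustration of the general algorithm, not the notebook's implementation:

```python
from collections import Counter

def get_pairs(word):
    """Count adjacent symbol pairs in one word (a tuple of symbols)."""
    return Counter(zip(word, word[1:]))

def bpe_train(corpus_words, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words.

    Each word starts as a tuple of Unicode characters, so the method
    works unchanged for Urdu script."""
    vocab = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair, c in get_pairs(word).items():
                pairs[pair] += c * freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab
```

Because merges are learned from character statistics rather than a fixed word list, frequent Urdu morphemes naturally become single subword units, which is why BPE suits morphologically rich languages.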
**Penn Treebank Tokenization (PTB)**
- Tokenization method based on the Penn Treebank style with modifications for Urdu and Unicode compatibility.
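A core PTB-style behavior is splitting punctuation off from adjacent words. A simple Unicode-aware sketch of that idea, with Urdu punctuation (e.g. '۔', '،', '؟') added to the punctuation class, is shown below; the character set and regex are illustrative assumptions, not the notebook's exact rules:

```python
import re

# Urdu full stop '۔', comma '،', semicolon '؛', question mark '؟',
# plus common ASCII punctuation.
PUNCT = "۔،؛؟!\"'()[]{}.,:;"
_token_re = re.compile(f"[{re.escape(PUNCT)}]|[^\\s{re.escape(PUNCT)}]+")

def ptb_tokenize(text):
    """Split text into words and standalone punctuation tokens."""
    return _token_re.findall(text)
```

This keeps word-internal characters together while emitting each punctuation mark as its own token, matching the PTB convention of separating sentence-final and clause-internal punctuation.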
**Urdu-Specific Preprocessing: Joiner & Non-Joiner Handling**
- Proper handling of Urdu joiner characters (Zero Width Joiner, Zero Width Non-Joiner) during normalization and tokenization.
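A minimal sketch of joiner normalization, assuming the common policy of stripping ZWJ while preserving ZWNJ by default (ZWNJ blocks letter joining and can be linguistically meaningful in Urdu); the function name and default are illustrative choices:

```python
ZWNJ = "\u200c"  # Zero Width Non-Joiner
ZWJ = "\u200d"   # Zero Width Joiner

def normalize_joiners(text, keep_zwnj=True):
    """Remove ZWJ; optionally remove ZWNJ as well.

    ZWJ is usually safe to drop, whereas dropping ZWNJ can merge
    word parts that were deliberately kept unjoined."""
    text = text.replace(ZWJ, "")
    if not keep_zwnj:
        text = text.replace(ZWNJ, "")
    return text
```

Running this before tokenization prevents invisible joiner characters from splitting one surface word into multiple tokens.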
**Evaluation & Examples**
- Perplexity calculation, probability lookup, and next-word prediction examples.
- Overall objective: to demonstrate practical NLP solutions for Urdu text using statistical language models.
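The perplexity computation mentioned above follows the standard definition, the exponentiated average negative log-probability of the test tokens; a minimal sketch (assuming natural-log probabilities as input):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities:
    exp(-(1/N) * sum(log P(w_i)))."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)
```

As a sanity check, a model assigning uniform probability 1/4 to each of four tokens has perplexity exactly 4.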