This notebook provides multiple Natural Language Processing (NLP) solutions for Urdu text processing in Python, focusing on both statistical language modeling and tokenization techniques.
**N-Gram Model with Add-One (Laplace) Smoothing**
- Implementation of an N-gram language model using Maximum Likelihood Estimation (MLE) counts with Add-One smoothing, so that unseen n-grams never receive zero probability.
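As an illustration, a minimal sketch of a bigram model with Add-One (Laplace) smoothing; function names such as `train_bigrams` and `bigram_prob` are illustrative, not the notebook's actual API:

```python
from collections import defaultdict

def train_bigrams(sentences):
    """Count unigrams and bigrams over sentences padded with <s>/</s>."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for w in padded:
            unigrams[w] += 1
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Add-One smoothed P(word | prev): every count is incremented by 1,
    and the denominator grows by the vocabulary size V."""
    V = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)
```

With a one-sentence toy corpus, a seen bigram and an unseen bigram both receive non-zero probability, which is exactly the point of the smoothing.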
**Smooth_NGRAM**
- Enhanced N-gram model with configurable smoothing techniques for better probability estimation.
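One common way to make the smoothing configurable is Add-k smoothing, of which Add-One is the special case k = 1; this is a hedged sketch of that idea, not necessarily the exact technique `Smooth_NGRAM` implements:

```python
def addk_prob(bigram_count, prev_count, vocab_size, k=1.0):
    """Add-k smoothed conditional probability.

    k=1.0 reproduces Laplace (Add-One) smoothing; a smaller k
    (e.g. 0.1) shifts less probability mass onto unseen events.
    """
    return (bigram_count + k) / (prev_count + k * vocab_size)
```

Exposing `k` as a parameter lets the same estimator be tuned per corpus instead of hard-coding Add-One.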
**Unicode Byte Pair Encoding (BPE)**
- Subword tokenization method adapted for Urdu script, suitable for morphologically rich languages.
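The core BPE training loop can be sketched as follows: start from individual Unicode characters and repeatedly merge the most frequent adjacent pair. This is a minimal illustration of the general algorithm, not the notebook's implementation:

```python
from collections import Counter

def get_pairs(word):
    """Count adjacent symbol pairs in one word (a tuple of symbols)."""
    return Counter(zip(word, word[1:]))

def bpe_train(corpus_words, num_merges):
    """Learn `num_merges` BPE merge rules from a list of words.

    Each word starts as a tuple of Unicode characters, so the method
    works unchanged for Urdu script."""
    vocab = Counter(tuple(w) for w in corpus_words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair, c in get_pairs(word).items():
                pairs[pair] += c * freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab
```

Because merges are learned from character statistics rather than a fixed word list, frequent Urdu morphemes naturally become single subword units, which is why BPE suits morphologically rich languages.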
**Penn Treebank Tokenization (PTB)**
- Tokenization method based on the Penn Treebank style with modifications for Urdu and Unicode compatibility.
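A core PTB-style behavior is splitting punctuation off from adjacent words. A simple Unicode-aware sketch of that idea, with Urdu punctuation (e.g. '۔', '،', '؟') added to the punctuation class, is shown below; the character set and regex are illustrative assumptions, not the notebook's exact rules:

```python
import re

# Urdu full stop '۔', comma '،', semicolon '؛', question mark '؟',
# plus common ASCII punctuation.
PUNCT = "۔،؛؟!\"'()[]{}.,:;"
_token_re = re.compile(f"[{re.escape(PUNCT)}]|[^\\s{re.escape(PUNCT)}]+")

def ptb_tokenize(text):
    """Split text into words and standalone punctuation tokens."""
    return _token_re.findall(text)
```

This keeps word-internal characters together while emitting each punctuation mark as its own token, matching the PTB convention of separating sentence-final and clause-internal punctuation.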
**Urdu-Specific Preprocessing: Joiner & Non-Joiner Handling**
- Proper handling of Urdu joiner characters (Zero Width Joiner, Zero Width Non-Joiner) during normalization and tokenization.
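A minimal sketch of joiner normalization, assuming the common policy of stripping ZWJ while preserving ZWNJ by default (ZWNJ blocks letter joining and can be linguistically meaningful in Urdu); the function name and default are illustrative choices:

```python
ZWNJ = "\u200c"  # Zero Width Non-Joiner
ZWJ = "\u200d"   # Zero Width Joiner

def normalize_joiners(text, keep_zwnj=True):
    """Remove ZWJ; optionally remove ZWNJ as well.

    ZWJ is usually safe to drop, whereas dropping ZWNJ can merge
    word parts that were deliberately kept unjoined."""
    text = text.replace(ZWJ, "")
    if not keep_zwnj:
        text = text.replace(ZWNJ, "")
    return text
```

Running this before tokenization prevents invisible joiner characters from splitting one surface word into multiple tokens.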
**Evaluation & Examples**
- Perplexity calculation, probability lookup, and next-word prediction examples.
- Overall objective: to demonstrate practical NLP solutions for Urdu text using statistical language models.
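The perplexity computation mentioned above follows the standard definition, the exponentiated average negative log-probability of the test tokens; a minimal sketch (assuming natural-log probabilities as input):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities:
    exp(-(1/N) * sum(log P(w_i)))."""
    n = len(log_probs)
    return math.exp(-sum(log_probs) / n)
```

As a sanity check, a model assigning uniform probability 1/4 to each of four tokens has perplexity exactly 4.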