Biomedical Abbreviation and Long-form Detection using Token Classification

This project compares several deep learning models for detecting biomedical abbreviations (AC) and their long forms (LF) in the PLOD-CW-25 dataset. It covers exploratory data analysis (EDA), token classification modeling, and an evaluation of the trade-off between model accuracy and efficiency.

📊 Exploratory Data Analysis (EDA)

Dataset Overview, Completeness & Sentence Counts
- Verified token-level annotation consistency
- Measured document and sentence-level distribution across train, validation, and test splits
Token and Tag Distributions
- Analyzed frequency of BIO tags (B-AC, B-LF, I-LF, O)
- Investigated token casing and sentence length patterns
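
As an illustration of the tag and casing checks above, here is a minimal sketch. The two sentences are made-up examples, and the `tokens`/`tags` field names are assumptions, not necessarily the column names used in the notebook.

```python
from collections import Counter

# Hypothetical toy sentences in a token/tag layout (not real dataset rows).
sentences = [
    {"tokens": ["Magnetic", "resonance", "imaging", "(", "MRI", ")", "was", "used"],
     "tags":   ["B-LF", "I-LF", "I-LF", "O", "B-AC", "O", "O", "O"]},
    {"tokens": ["MRI", "scans", "were", "normal"],
     "tags":   ["B-AC", "O", "O", "O"]},
]

def tag_distribution(rows):
    """Return {tag: (count, share of all tokens)}."""
    counts = Counter(tag for row in rows for tag in row["tags"])
    total = sum(counts.values())
    return {tag: (n, n / total) for tag, n in counts.items()}

def all_caps_share(rows, tag="B-AC"):
    """Fraction of tokens carrying `tag` that are fully upper-case."""
    toks = [t for row in rows
            for t, g in zip(row["tokens"], row["tags"]) if g == tag]
    return sum(t.isupper() for t in toks) / len(toks) if toks else 0.0

dist = tag_distribution(sentences)
for tag, (n, share) in sorted(dist.items()):
    print(f"{tag:5s} {n:3d} {share:.2%}")
print("upper-case share of B-AC tokens:", all_caps_share(sentences))
```

The same counts, computed per split, give a quick view of how dominant the O tag is.
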
Sub-Domain Exploration & Abbreviation Analysis
- Grouped entries by biomedical sub-domains
- Identified trends in abbreviation usage density
Abbreviation Characteristics & Ambiguity
- Checked reuse and ambiguity of abbreviations (e.g., multiple meanings)
- Evaluated impact of character length and term frequency on detection complexity
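
The ambiguity check can be sketched as follows. This is only one possible approach: `extract_spans` and `ambiguous_abbreviations` are hypothetical helper names, and the abbreviation/long-form pairs are hand-written examples rather than pairs mined from the dataset.

```python
from collections import defaultdict

def extract_spans(tokens, tags):
    """Collect (label, text) spans from BIO tags such as B-AC, B-LF, I-LF."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append((label, " ".join(current)))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # an O tag (or a stray I- tag) closes any open span
            if current:
                spans.append((label, " ".join(current)))
            current, label = [], None
    if current:
        spans.append((label, " ".join(current)))
    return spans

def ambiguous_abbreviations(pairs):
    """Map each abbreviation to its distinct long forms; keep those with more than one."""
    senses = defaultdict(set)
    for abbr, long_form in pairs:
        senses[abbr].add(long_form.lower())
    return {a: sorted(s) for a, s in senses.items() if len(s) > 1}

spans = extract_spans(
    ["Magnetic", "resonance", "imaging", "(", "MRI", ")"],
    ["B-LF", "I-LF", "I-LF", "O", "B-AC", "O"],
)

# Hand-written (abbreviation, long form) pairs for illustration only.
pairs = [
    ("PCR", "polymerase chain reaction"),
    ("PCR", "pathological complete response"),
    ("MRI", "magnetic resonance imaging"),
    ("MRI", "Magnetic resonance imaging"),
]
ambiguous = ambiguous_abbreviations(pairs)
print(spans)
print(ambiguous)
```

Lower-casing the long forms keeps casing variants of the same expansion from being counted as separate meanings.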

🧠 Experiments & Models

🔹 Traditional Models

Model: CRF + BiLSTM
Embeddings: Word2Vec and Word+Char
Result: Macro F1 ≈ 0.73
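
To show what the CRF layer adds on top of the BiLSTM, here is a small pure-Python sketch of Viterbi decoding. The emission scores are invented, and the transition scores are hand-set to forbid invalid BIO sequences; a trained CRF learns its transition matrix, so this is an illustration of the decoding step only, not the notebook's implementation.

```python
TAGS = ["O", "B-AC", "B-LF", "I-LF"]
NEG = -1e4  # large penalty standing in for a forbidden transition

def bio_transitions(tags):
    """Hand-set scores: I-X may only follow B-X or I-X; everything else is free."""
    trans = {}
    for prev in tags:
        for cur in tags:
            if cur.startswith("I-"):
                ok = prev in (f"B-{cur[2:]}", f"I-{cur[2:]}")
            else:
                ok = True
            trans[prev, cur] = 0.0 if ok else NEG
    return trans

def viterbi(emissions, trans, tags):
    """emissions: one {tag: score} dict per token. Returns the best-scoring tag path."""
    # A sequence may not start with an I- tag.
    score = {t: emissions[0][t] + (NEG if t.startswith("I-") else 0.0) for t in tags}
    back = []
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in tags:
            best_prev = max(tags, key=lambda p: score[p] + trans[p, cur])
            new_score[cur] = score[best_prev] + trans[best_prev, cur] + em[cur]
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    path = [max(tags, key=lambda t: score[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Invented per-token scores, as a BiLSTM might emit for three tokens.
emissions = [
    {"O": 1.2, "B-AC": 0.0, "B-LF": 1.0, "I-LF": 0.5},
    {"O": 0.1, "B-AC": 0.0, "B-LF": 0.3, "I-LF": 2.0},
    {"O": 0.2, "B-AC": 1.5, "B-LF": 0.0, "I-LF": 0.4},
]
trans = bio_transitions(TAGS)
greedy = [max(TAGS, key=lambda t: em[t]) for em in emissions]
decoded = viterbi(emissions, trans, TAGS)
print("greedy :", greedy)   # per-token argmax, can produce O followed by I-LF
print("viterbi:", decoded)  # best path that respects the BIO constraints
```

Per-token argmax picks an I-LF directly after O here, while the constrained decoder recovers a valid long-form span.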

🔹 Sequence Models

Model: RNN and BiLSTM
Embeddings: FastText
Result: BiLSTM F1 ≈ 0.67; RNN F1 ≈ 0.52

🔹 Transformer Model

Model: Fine-tuned RoBERTa
Optimizers: Adam, LION, LAMB
Tokenizer: RobertaTokenizerFast (BPE)
Result (RoBERTa + LION): Micro F1 = 0.8622, Macro F1 = 0.855
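
For readers unfamiliar with the Lion optimizer, the sketch below shows its update rule on plain Python lists. It follows the published rule (sign of an interpolated momentum, then a separate momentum update); the `lion_step` function name and the toy numbers are assumptions for illustration, and the notebook presumably relies on a library implementation instead.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update over flat lists of floats; returns (new_params, new_momentum)."""
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Interpolate momentum and gradient, then keep only the sign.
        update = sign(beta1 * m + (1 - beta1) * g)
        # Every parameter moves by exactly lr (plus decoupled weight decay).
        new_params.append(p - lr * (update + weight_decay * p))
        # The stored momentum uses a second, slower interpolation factor.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum

# Toy values chosen so the arithmetic is easy to follow.
params, momentum = lion_step(
    params=[1.0, -2.0], grads=[0.5, -0.25], momentum=[0.0, 0.0], lr=0.1
)
print(params, momentum)
```

Because the update is a sign, its magnitude is uniform across parameters, which is why Lion is usually run with a smaller learning rate than Adam.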

✅ Evaluation Summary

A. Can the models fulfil their purpose?

Yes, to varying degrees. Every model detected biomedical abbreviations and long forms, with F1 ranging from about 0.52 (RNN) to 0.855 macro F1 (RoBERTa + LION). RoBERTa + LION outperformed the others, generalizing better and converging faster.

B. What is a good F1/accuracy threshold?

We treat an F1 score above 0.75 as strong for biomedical NER. Our best model, RoBERTa + LION, reached a macro F1 of 0.855, which clears this threshold.
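
Since both micro and macro F1 are reported, a short worked example of how they differ may help. This sketch assumes token-level scoring over the entity tags only (O excluded); the project may instead use entity-level scoring, and the tag sequences below are invented.

```python
def f1_scores(y_true, y_pred, labels):
    """Return (per-class F1, macro F1, micro F1) at token level for the given labels."""
    per_class, tp_sum, fp_sum, fn_sum = {}, 0, 0, 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        per_class[label] = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Macro: unweighted mean of per-class F1, so rare classes count as much as common ones.
    macro = sum(per_class.values()) / len(labels)
    # Micro: pool the counts first, so frequent classes dominate.
    pooled = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / pooled if pooled else 0.0
    return per_class, macro, micro

# Invented gold and predicted tags for one short sequence.
y_true = ["B-LF", "I-LF", "I-LF", "O", "B-AC", "O", "B-AC", "O"]
y_pred = ["B-LF", "I-LF", "O",    "O", "B-AC", "O", "O",    "O"]

per_class, macro, micro = f1_scores(y_true, y_pred, ["B-AC", "B-LF", "I-LF"])
print(per_class)
print(f"macro F1 = {macro:.4f}, micro F1 = {micro:.4f}")
```

In this toy case the macro score (7/9) sits above the micro score (0.75) because the one perfectly predicted class lifts the unweighted mean.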

C. How could low-performing models be improved?

- Use a BiLSTM instead of a plain RNN (F1 ≈ 0.67 vs. ≈ 0.52 in our runs)
- Replace static embeddings with contextual ones (e.g., BioBERT)
- Fine-tune the transformer on the task; models used with pretrained weights only underperform
- Avoid LAMB for small-batch training

D. Tokenization Strategy Impact

- Word-level models used the native PLOD tokens
- RoBERTa used BPE subword tokenization, which improved performance
- Realigning subword pieces to the word-level labels was essential for BIO tagging
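
The realignment step can be sketched as below. It follows the common convention of labeling only the first subword of each word and masking the rest with -100, the index that PyTorch's cross-entropy loss ignores by default. The `word_ids` list is written by hand here; with a Hugging Face fast tokenizer it would come from `encoding.word_ids()` after tokenizing with `is_split_into_words=True`. The label ids and the subword split are assumptions for illustration.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default

# Hypothetical label-to-id mapping.
LABEL2ID = {"O": 0, "B-AC": 1, "B-LF": 2, "I-LF": 3}

def align_labels(word_ids, word_labels):
    """Expand word-level label ids to subword level.

    word_ids: one entry per subword, giving the index of its source word,
              or None for special tokens such as <s> and </s>.
    """
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)           # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])   # first subword keeps the label
        else:
            aligned.append(IGNORE_INDEX)           # continuation subword is masked
        previous = word_id
    return aligned

words = ["Magnetic", "resonance", "imaging", "(", "MRI", ")"]
tags = ["B-LF", "I-LF", "I-LF", "O", "B-AC", "O"]
word_labels = [LABEL2ID[t] for t in tags]

# Hand-written split: "resonance" and "MRI" are assumed to break into two pieces each.
word_ids = [None, 0, 1, 1, 2, 3, 4, 4, 5, None]

aligned = align_labels(word_ids, word_labels)
print(aligned)
```

At evaluation time the same mask is applied in reverse: predictions at masked positions are dropped so that each word receives exactly one tag.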

E. Accuracy vs. Efficiency Trade-off

- RoBERTa + LION is highly accurate but resource-heavy
- For deployment, distillation or pruning can reduce model size at a cost of roughly 2–3% in performance
- Critical applications (clinical, research) may favor the full model; lightweight tools can trade some accuracy for efficiency
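
As a rough idea of what distillation involves, the sketch below computes the standard soft-target loss for a single token: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The logits are invented, and no distilled model was trained in this project; this only illustrates the objective such a student would minimize.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return kl * temperature ** 2

# Invented logits over the four tags (O, B-AC, B-LF, I-LF) for one token.
teacher = [0.5, 4.0, 0.2, 0.1]
close_student = [0.4, 3.5, 0.3, 0.2]
far_student = [3.0, 0.5, 0.2, 0.1]

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(f"close student: {loss_close:.4f}, far student: {loss_far:.4f}")
```

In practice this term is combined with the usual cross-entropy on the gold labels, with a weighting chosen by validation.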

