From 2faf3809b55b905417943b6b70dd30aa767a9ce0 Mon Sep 17 00:00:00 2001 From: Jan Spoerer Date: Sun, 8 Dec 2024 17:37:06 +0100 Subject: [PATCH 1/5] Added a short explanation of the difference between zeroshot and guided topic modeling to both of the respective documentations so that users immediately know that there are two very similar methods for providing pre-defined topics --- docs/getting_started/guided/guided.md | 6 ++++++ docs/getting_started/zeroshot/zeroshot.md | 4 ++++ 2 files changed, 10 insertions(+) diff --git a/docs/getting_started/guided/guided.md b/docs/getting_started/guided/guided.md index 9233ac41..aa94316c 100644 --- a/docs/getting_started/guided/guided.md +++ b/docs/getting_started/guided/guided.md @@ -1,3 +1,9 @@ +!!! Note + Difference between Zero-shot and Guided BERTopic: + Guided BERTopic is similar - yet not equivalent - to [Zeros-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html). + Use Guided BERTopic to boost certain keyword's importance. Use [Zeros-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html) to try to categorize documents into predefined topics ("zero-shot topics") before the clustering the remaining, unclassified documents, using the default unsupervised BERTopic topic exploration algorithm. + + Guided Topic Modeling or Seeded Topic Modeling is a collection of techniques that guides the topic modeling approach by setting several seed topics to which the model will converge to. These techniques allow the user to set a predefined number of topic representations that are sure to be in documents. For example, take an IT business that has a ticket system for the software their clients use. Those tickets may typically contain information about a specific bug regarding login issues that the IT business is aware of. 
To model that bug, we can create a seed topic representation containing the words `bug`, `login`, `password`, diff --git a/docs/getting_started/zeroshot/zeroshot.md b/docs/getting_started/zeroshot/zeroshot.md index 951f6f0c..d1ffc884 100644 --- a/docs/getting_started/zeroshot/zeroshot.md +++ b/docs/getting_started/zeroshot/zeroshot.md @@ -1,3 +1,7 @@ +!!! Note + Difference between Zero-shot and Guided BERTopic: + Zeros-shot Topic Modeling is similar - yet not equivalent - to [Guided BERTopic](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html). Use [Guided BERTopic](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html) to boost certain keyword's importance. Use [Zeros-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html) to try to categorize documents into predefined topics ("zero-shot topics") before the clustering the remaining, unclassified documents, using the default unsupervised BERTopic topic exploration algorithm. + Zero-shot Topic Modeling is a technique that allows you to find topics in large amounts of documents that were predefined. When faced with many documents, you often have an idea of which topics will definitely be in there. Whether that is a result of simply knowing your data or if a domain expert is involved in defining those topics. This method allows you to not only find those specific topics but also create new topics for documents that would not fit with your predefined topics. 
From 1462f3668ea56d591a578d0a1513177e0f205c4f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jan=20Sp=C3=B6rer?= Date: Mon, 31 Mar 2025 14:30:39 +0200 Subject: [PATCH 2/5] Update docs/getting_started/guided/guided.md Co-authored-by: Maarten Grootendorst --- docs/getting_started/guided/guided.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/getting_started/guided/guided.md b/docs/getting_started/guided/guided.md index aa94316c..eb0e3119 100644 --- a/docs/getting_started/guided/guided.md +++ b/docs/getting_started/guided/guided.md @@ -1,7 +1,7 @@ !!! Note Difference between Zero-shot and Guided BERTopic: Guided BERTopic is similar - yet not equivalent - to [Zeros-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html). - Use Guided BERTopic to boost certain keyword's importance. Use [Zeros-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html) to try to categorize documents into predefined topics ("zero-shot topics") before the clustering the remaining, unclassified documents, using the default unsupervised BERTopic topic exploration algorithm. + Use Guided BERTopic to boost the importance of certain keywords. Use [Zeros-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html) to try to categorize documents into predefined topics ("zero-shot topics") before clustering the remaining unclassified documents using the main algorithm of BERTopic. Guided Topic Modeling or Seeded Topic Modeling is a collection of techniques that guides the topic modeling approach by setting several seed topics to which the model will converge to. These techniques allow the user to set a predefined number of topic representations that are sure to be in documents. For example, take an IT business that has a ticket system for the software their clients use. 
Those tickets may typically contain information about a specific bug regarding login issues that the IT business is aware of. From 9b050cc8f8eeb612d615a0e7d33c4d484fd83e06 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jan=20Sp=C3=B6rer?= Date: Tue, 1 Apr 2025 20:31:00 +0200 Subject: [PATCH 3/5] Removed trailing whitespace (linter was failing) --- docs/getting_started/guided/guided.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/getting_started/guided/guided.md b/docs/getting_started/guided/guided.md index 615b75a0..f1833c85 100644 --- a/docs/getting_started/guided/guided.md +++ b/docs/getting_started/guided/guided.md @@ -1,6 +1,6 @@ !!! Note - Difference between Zero-shot and Guided BERTopic: - Guided BERTopic is similar - yet not equivalent - to [Zeros-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html). + Difference between Zero-shot and Guided BERTopic: + Guided BERTopic is similar - yet not equivalent - to [Zeros-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html). Use Guided BERTopic to boost the importance of certain keywords. Use [Zeros-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html) to try to categorize documents into predefined topics ("zero-shot topics") before clustering the remaining unclassified documents using the main algorithm of BERTopic. Guided Topic Modeling or Seeded Topic Modeling is a collection of techniques that guides the topic modeling approach by setting several seed topics to which the model will converge to. These techniques allow the user to set a predefined number of topic representations that are sure to be in documents. For example, take an IT business that has a ticket system for the software their clients use. Those tickets may typically contain information about a specific bug regarding login issues that the IT business is aware of. 
From 99530b552756f575ac6c04b3f93677028ec2e65c Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jan=20Sp=C3=B6rer?= Date: Tue, 1 Apr 2025 20:33:21 +0200 Subject: [PATCH 4/5] Removed trailing whitespace to satisfy the linter --- docs/getting_started/zeroshot/zeroshot.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/getting_started/zeroshot/zeroshot.md b/docs/getting_started/zeroshot/zeroshot.md index d5839cba..102287d0 100644 --- a/docs/getting_started/zeroshot/zeroshot.md +++ b/docs/getting_started/zeroshot/zeroshot.md @@ -1,5 +1,5 @@ !!! Note - Difference between Zero-shot and Guided BERTopic: + Difference between Zero-shot and Guided BERTopic: Zeros-shot Topic Modeling is similar - yet not equivalent - to [Guided BERTopic](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html). Use [Guided BERTopic](https://maartengr.github.io/BERTopic/getting_started/guided/guided.html) to boost certain keyword's importance. Use [Zeros-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html) to try to categorize documents into predefined topics ("zero-shot topics") before the clustering the remaining, unclassified documents, using the default unsupervised BERTopic topic exploration algorithm. Zero-shot Topic Modeling is a technique that allows you to find topics in large amounts of documents that were predefined. When faced with many documents, you often have an idea of which topics will definitely be in there. Whether that is a result of simply knowing your data or if a domain expert is involved in defining those topics. 
From 16afe546eb26ac1adef0da5283ae903c7074b8dc Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jan=20Sp=C3=B6rer?= Date: Wed, 2 Apr 2025 08:31:53 +0200 Subject: [PATCH 5/5] Minimal typo correction (removing trailing unnecessary "to"); mainly to trigger the GitHub jobs again as one failed --- docs/getting_started/guided/guided.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/getting_started/guided/guided.md b/docs/getting_started/guided/guided.md index f1833c85..25af49c5 100644 --- a/docs/getting_started/guided/guided.md +++ b/docs/getting_started/guided/guided.md @@ -3,7 +3,7 @@ Guided BERTopic is similar - yet not equivalent - to [Zeros-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html). Use Guided BERTopic to boost the importance of certain keywords. Use [Zeros-shot Topic Modeling](https://maartengr.github.io/BERTopic/getting_started/zeroshot/zeroshot.html) to try to categorize documents into predefined topics ("zero-shot topics") before clustering the remaining unclassified documents using the main algorithm of BERTopic. -Guided Topic Modeling or Seeded Topic Modeling is a collection of techniques that guides the topic modeling approach by setting several seed topics to which the model will converge to. These techniques allow the user to set a predefined number of topic representations that are sure to be in documents. For example, take an IT business that has a ticket system for the software their clients use. Those tickets may typically contain information about a specific bug regarding login issues that the IT business is aware of. +Guided Topic Modeling or Seeded Topic Modeling is a collection of techniques that guides the topic modeling approach by setting several seed topics to which the model will converge. These techniques allow the user to set a predefined number of topic representations that are sure to be in documents. 
For example, take an IT business that has a ticket system for the software their clients use. Those tickets may typically contain information about a specific bug regarding login issues that the IT business is aware of. To model that bug, we can create a seed topic representation containing the words `bug`, `login`, `password`, and `username`. By defining those words, a Guided Topic Modeling approach will try to converge at least one topic to those words.
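The distinction drawn in the note above can be illustrated with a small, library-free sketch. This is not BERTopic's implementation: the bag-of-words vectors stand in for a real embedding model, and the helper names (`bow`, `cosine`, `zeroshot_triage`, `boost_seed_words`), the `min_similarity` threshold, and the boost `factor` are all hypothetical choices made for illustration. It only shows the two ideas side by side: zero-shot first tries to assign documents to predefined topics and leaves the rest for ordinary clustering, whereas the guided approach nudges the weight of seed words.

```python
from collections import Counter
from math import sqrt


def bow(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def zeroshot_triage(docs, topics, min_similarity=0.3):
    """Zero-shot idea: keep a document's best predefined topic only if it is
    similar enough; everything else is returned for regular clustering."""
    assigned, remaining = {}, []
    for doc in docs:
        best = max(topics, key=lambda t: cosine(bow(doc), bow(t)))
        if cosine(bow(doc), bow(best)) >= min_similarity:
            assigned[doc] = best
        else:
            remaining.append(doc)
    return assigned, remaining


def boost_seed_words(term_scores, seed_words, factor=1.2):
    """Guided idea: scale up the scores of seed words, leave others unchanged."""
    return {t: s * factor if t in seed_words else s for t, s in term_scores.items()}


docs = [
    "login bug password reset fails",
    "invoice payment overdue",
    "weather is nice today",
]
topics = ["login password bug", "invoice payment"]

# Documents matching a predefined topic are assigned; the rest await clustering.
assigned, remaining = zeroshot_triage(docs, topics)

# Seed words from the ticket example receive a higher weight than other terms.
seed_words = {"bug", "login", "password", "username"}
boosted = boost_seed_words({"login": 1.0, "error": 1.0}, seed_words)
```

In this toy run the first two documents are matched to a predefined topic and the third is left over for clustering, while `login` ends up weighted above `error` because it is a seed word.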