Commit 6697e59

Link to documentation within text tutorials (#1190)
* link to docs in classifier tutorials
* add links to semdedup tutorials
* download and extract links
* Apply suggestions from code review

Signed-off-by: Sarah Yurick <[email protected]>
1 parent 421b4b1 commit 6697e59

15 files changed, with 100 additions and 38 deletions

tutorials/text/deduplication/semantic/semantic_e2e.ipynb

Lines changed: 4 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -7,18 +7,18 @@
77
"source": [
88
"# End-to-end Semantic Deduplication on Text Data\n",
99
"\n",
10-
"GPU accelerated implementation of [SemDeDup: Data-efficient learning at web-scale through semantic deduplication](https://arxiv.org/abs/2303.09540) \n",
10+
"GPU accelerated implementation of [SemDeDup: Data-efficient learning at web-scale through semantic deduplication](https://arxiv.org/abs/2303.09540). For more information about semantic deduplication in NeMo Curator, refer to the [Semantic Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/semdedup.html) documentation page.\n",
1111
"\n",
1212
"The tutorial here shows how to run Semantic Deduplication on text data by executing a single workflow which does the following:\n",
1313
"\n",
1414
"1. Read original dataset\n",
15-
"2. Run embedding generation \n",
15+
"2. Run embedding generation\n",
1616
"3. Use K-Means to cluster the embeddings\n",
1717
"4. Compute pairwise similarity inside each of the clusters\n",
1818
"5. Identify duplicates based on `eps` provided (and `ranking_strategy`)\n",
1919
"6. Remove duplicates from the original dataset\n",
2020
"\n",
21-
"We also allow users to run these steps independently, which can be seen in the step-by-step tutorial in the same directory as this tutorial.\n"
21+
"We also allow users to run these steps independently, which can be seen in the step-by-step tutorial in the same directory as this tutorial."
2222
]
2323
},
2424
{
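Steps 3 to 5 of the list in this hunk (cluster the embeddings, compute pairwise similarity inside each cluster, flag duplicates from `eps`) can be illustrated with a small NumPy sketch. It is not NeMo Curator's implementation: the function name `find_semantic_duplicates` is hypothetical, cluster assignments are passed in rather than computed with K-Means, row order stands in for `ranking_strategy`, and the rule "flag a row when its cosine similarity to an earlier row in the same cluster exceeds `1 - eps`" is an assumption based on the SemDeDup paper.

```python
import numpy as np


def find_semantic_duplicates(embeddings, cluster_ids, eps):
    """Return sorted row indices flagged as semantic duplicates (illustrative only)."""
    embeddings = np.asarray(embeddings, dtype=float)
    cluster_ids = np.asarray(cluster_ids)

    # Normalize rows so that a dot product equals cosine similarity.
    # Assumes no embedding is the zero vector.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    duplicates = []
    for cluster in np.unique(cluster_ids):
        members = np.flatnonzero(cluster_ids == cluster)
        sims = unit[members] @ unit[members].T

        # Keep only entries sims[i, j] with i < j, so each row is compared
        # against the rows that come before it in the cluster.
        upper = np.triu(sims, k=1)
        max_sim_to_earlier = upper.max(axis=0)

        # Assumed rule: a row is a duplicate if it is within eps of an earlier row.
        flagged = members[max_sim_to_earlier > 1.0 - eps]
        duplicates.extend(flagged.tolist())
    return sorted(duplicates)
```

Under this rule, a larger `eps` lowers the similarity threshold, so more rows are flagged as duplicates.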
@@ -97,6 +97,7 @@
9797
"source": [
9898
"## Running as a Single Stage (End-to-End)\n",
9999
"\n",
100+
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.deduplication.semantic.html#stages.text.deduplication.semantic.TextSemanticDeduplicationWorkflow) for more information about the `TextSemanticDeduplicationWorkflow` class.\n",
100101
"\n",
101102
"### Performance Notes\n",
102103
"Set `id_generator=True` if you want to remove duplicates from large datasets (i.e. when `perform_removal=True`).\n",

tutorials/text/deduplication/semantic/semantic_step_by_step.ipynb

Lines changed: 19 additions & 11 deletions
@@ -7,13 +7,13 @@
77
"source": [
88
"# Step by Step Semantic Deduplication on Text Data\n",
99
"\n",
10-
"GPU accelerated implementation of [SemDeDup: Data-efficient learning at web-scale through semantic deduplication](https://arxiv.org/abs/2303.09540) \n",
10+
"GPU accelerated implementation of [SemDeDup: Data-efficient learning at web-scale through semantic deduplication](https://arxiv.org/abs/2303.09540). For more information about semantic deduplication in NeMo Curator, refer to the [Semantic Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/semdedup.html) documentation page.\n",
1111
"\n",
12-
"The tutorial here shows how to run Semantic Deduplication on text data by executing three workflows sequentially. \n",
12+
"The tutorial here shows how to run Semantic Deduplication on text data by executing three workflows sequentially.\n",
1313
"\n",
1414
"We also use an ID Generator to show how it works when running it separately.\n",
1515
"\n",
16-
"1. Create ID generator.\n",
16+
"1. Create ID generator\n",
1717
"2. Running embedding generation\n",
1818
"3. Running K-Means + pairwise (without duplicate identification)\n",
1919
"4. Run duplicate identification\n",
@@ -136,8 +136,9 @@
136136
"source": [
137137
"## Create ID Generator\n",
138138
"\n",
139-
"1. This creates a Ray Actor in the background.\n",
140-
"2. When we read our dataset now, this actor in the background is used to assign monotonically increasing integer IDs to each row. "
139+
"This creates a Ray Actor in the background. When we read our dataset now, this actor in the background is used to assign monotonically increasing integer IDs to each row.\n",
140+
"\n",
141+
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.deduplication.id_generator.html#stages.deduplication.id_generator.create_id_generator_actor) for more information about the `create_id_generator_actor` function."
141142
]
142143
},
143144
{
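The idea of assigning monotonically increasing integer IDs to each row can be sketched in plain Python. This is only an illustration under stated assumptions: the class `SimpleIdGenerator`, its `register_batch` method, and the use of a batch key to return the same range for the same batch are hypothetical and do not describe how Curator's Ray actor is implemented.

```python
class SimpleIdGenerator:
    """Hands out contiguous, monotonically increasing integer ID ranges."""

    def __init__(self):
        self._next_id = 0
        # Remember the range given to each batch so that registering the
        # same batch key again returns the same IDs.
        self._ranges = {}

    def register_batch(self, batch_key, num_rows):
        """Return the inclusive (min_id, max_id) range for a batch of rows."""
        if batch_key in self._ranges:
            return self._ranges[batch_key]
        min_id = self._next_id
        max_id = min_id + num_rows - 1
        self._next_id = max_id + 1
        self._ranges[batch_key] = (min_id, max_id)
        return min_id, max_id


generator = SimpleIdGenerator()
first = generator.register_batch("part-0", 3)   # rows get IDs 0, 1, 2
second = generator.register_batch("part-1", 2)  # rows get IDs 3, 4
```

Keeping one counter in a single place is what makes the IDs unique across batches that are read in parallel.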
@@ -206,7 +207,9 @@
206207
"source": [
207208
"## Run Embedding Generation\n",
208209
"\n",
209-
"1. We output the embeddings as Parquet files so that we can read more smartly during our K-Means step. This is the recommended file format before you run K-Means.\n"
210+
"We output the embeddings as Parquet files so that we can read more smartly during our K-Means step. This is the recommended file format before you run K-Means.\n",
211+
"\n",
212+
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.embedders.base.html#stages.text.embedders.base.EmbeddingCreatorStage) for more information about the `EmbeddingCreatorStage` class."
210213
]
211214
},
212215
{
@@ -400,7 +403,9 @@
400403
"source": [
401404
"## Run Semantic Deduplication workflow (without specifying `eps`)\n",
402405
"\n",
403-
"1. We intentionally don't specify `eps` so that we can show how to run `IdentifyDuplicates` as a separate stage."
406+
"We intentionally don't specify `eps` so that we can show how to run `IdentifyDuplicates` as a separate stage.\n",
407+
"\n",
408+
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.deduplication.semantic.workflow.html#stages.deduplication.semantic.workflow.SemanticDeduplicationWorkflow) for more information about the `SemanticDeduplicationWorkflow` class."
404409
]
405410
},
406411
{
@@ -748,7 +753,9 @@
748753
"source": [
749754
"## Identify Duplicates\n",
750755
"\n",
751-
"We will create a simple pipeline that now identifies duplicates and writes them out."
756+
"We will create a simple pipeline that now identifies duplicates and writes them out.\n",
757+
"\n",
758+
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.deduplication.semantic.identify_duplicates.html#stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage) for more information about the `IdentifyDuplicatesStage` class."
752759
]
753760
},
754761
{
@@ -888,7 +895,7 @@
888895
"source": [
889896
"## Removing Duplicates\n",
890897
"\n",
891-
"We offer a simple `TextDuplicatesRemovalWorkflow` that can remove duplicates from a given input dataset and list of duplicates to remove. \n",
898+
"We offer a simple `TextDuplicatesRemovalWorkflow` that can remove duplicates from a given input dataset and list of duplicates to remove.\n",
892899
"\n",
893900
"### Notes\n",
894901
"1. When running the removal workflow, we must specify the same input configuration as we did when we \"generated IDs\".\n",
@@ -899,11 +906,12 @@
899906
"### Performance\n",
900907
"If you notice OOMs during this stage, you can try using `RayDataActor`.\n",
901908
"\n",
902-
"\n",
903909
"### How `TextDuplicatesRemovalWorkflow` works\n",
904910
"1. It starts the ID Generator using `create_id_generator(filepath=...)`\n",
905911
"1. It runs a pipeline that does [`ParquetReader`, `TextDuplicatesRemovalStage`, `ParquetWriter`] (assuming input/output filetypes are Parquet)\n",
906-
"1. It kills the ID Generator using `kill_id_generator_actor`\n"
912+
"1. It kills the ID Generator using `kill_id_generator_actor`\n",
913+
"\n",
914+
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.deduplication.semantic.html#stages.text.deduplication.semantic.TextSemanticDeduplicationWorkflow) for more information about the `TextSemanticDeduplicationWorkflow` class."
907915
]
908916
},
909917
{
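Removing duplicates from an input dataset given a list of duplicate IDs amounts to an anti-join on the ID column. The sketch below shows that idea with plain Python dictionaries; the function `remove_duplicates` and the `id` field name are hypothetical and are not part of `TextDuplicatesRemovalWorkflow`.

```python
def remove_duplicates(rows, duplicate_ids, id_field="id"):
    """Return the rows whose ID is not in duplicate_ids (illustrative only)."""
    to_remove = set(duplicate_ids)  # set lookup keeps the filter linear in len(rows)
    return [row for row in rows if row[id_field] not in to_remove]


rows = [
    {"id": 0, "text": "hello world"},
    {"id": 1, "text": "hello, world"},
    {"id": 2, "text": "something else"},
]
kept = remove_duplicates(rows, duplicate_ids=[1])
```

The filter is only correct if each row carries the same ID it had when duplicates were identified, which is consistent with the note above about reusing the same input configuration that was used when IDs were generated.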

tutorials/text/distributed-data-classification/README.md

Lines changed: 4 additions & 1 deletion
@@ -1,12 +1,15 @@
11
# Distributed Data Classification
2+
23
The following is a set of Jupyter notebook tutorials which demonstrate how to use various text classification models supported by NeMo Curator.
34
The goal of using these classifiers is to help with data annotation, which is useful in data blending for foundation model training.
45

56
Each of these classifiers is available on Hugging Face and can be run independently with the [Transformers](https://github.com/huggingface/transformers) library.
67
By running them with NeMo Curator, the classifiers are accelerated using a heterogeneous pipeline setup where tokenization is run across CPUs and model inference is run across GPUs.
78
Each of the Jupyter notebooks in this directory demonstrates how to run the classifiers on text data and is easily scalable to large amounts of data.
89

9-
Before running any of these notebooks, please see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.
10+
Before running any of these notebooks, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.
11+
12+
For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page.
1013

1114
## List of Classifiers
1215

tutorials/text/distributed-data-classification/aegis-classification.ipynb

Lines changed: 6 additions & 2 deletions
@@ -14,7 +14,9 @@
1414
" - Volta™ or higher (compute capability 7.0+)\n",
1515
" - CUDA 12.x\n",
1616
"\n",
17-
"Before running this notebook, please see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies."
17+
"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
18+
"\n",
19+
"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page."
1820
]
1921
},
2022
{
@@ -167,7 +169,9 @@
167169
"\n",
168170
"if self.filter_by is not None and len(self.filter_by) > 0:\n",
169171
" self.stages.append(Filter(filter_fn=self.filter_by_category, filter_field=...))\n",
170-
"```"
172+
"```\n",
173+
"\n",
174+
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.classifiers.aegis.html#stages.text.classifiers.aegis.AegisClassifier) for more information about the `AegisClassifier` class."
171175
]
172176
},
173177
{
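The `filter_by` check in this hunk only adds a filter stage when a non-empty list of categories is given. A plain-Python sketch of that behavior follows; the function `keep_categories`, the `label` field name, and the example category names are hypothetical and do not reflect the classifier's real signature or output column.

```python
def keep_categories(rows, filter_by=None, label_field="label"):
    """Keep rows whose predicted label is in filter_by (illustrative only)."""
    # Mirrors the condition in the snippet: no filtering when filter_by
    # is None or empty.
    if filter_by is None or len(filter_by) == 0:
        return list(rows)
    allowed = set(filter_by)
    return [row for row in rows if row[label_field] in allowed]


predictions = [
    {"text": "doc one", "label": "category_a"},
    {"text": "doc two", "label": "category_b"},
    {"text": "doc three", "label": "category_a"},
]
only_a = keep_categories(predictions, filter_by=["category_a"])
```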

tutorials/text/distributed-data-classification/content-type-classification.ipynb

Lines changed: 6 additions & 2 deletions
@@ -12,7 +12,9 @@
1212
" - Volta™ or higher (compute capability 7.0+)\n",
1313
" - CUDA 12.x\n",
1414
"\n",
15-
"Before running this notebook, please see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies."
15+
"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
16+
"\n",
17+
"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page."
1618
]
1719
},
1820
{
@@ -187,7 +189,9 @@
187189
"\n",
188190
"if self.filter_by is not None and len(self.filter_by) > 0:\n",
189191
" self.stages.append(Filter(filter_fn=self.filter_by_category, filter_field=...))\n",
190-
"```"
192+
"```\n",
193+
"\n",
194+
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.classifiers.content_type.html#stages.text.classifiers.content_type.ContentTypeClassifier) for more information about the `ContentTypeClassifier` class."
191195
]
192196
},
193197
{

tutorials/text/distributed-data-classification/domain-classification.ipynb

Lines changed: 6 additions & 2 deletions
@@ -12,7 +12,9 @@
1212
" - Volta™ or higher (compute capability 7.0+)\n",
1313
" - CUDA 12.x\n",
1414
"\n",
15-
"Before running this notebook, please see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies."
15+
"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
16+
"\n",
17+
"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page."
1618
]
1719
},
1820
{
@@ -159,7 +161,9 @@
159161
"\n",
160162
"if self.filter_by is not None and len(self.filter_by) > 0:\n",
161163
" self.stages.append(Filter(filter_fn=self.filter_by_category, filter_field=...))\n",
162-
"```"
164+
"```\n",
165+
"\n",
166+
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.classifiers.domain.html#stages.text.classifiers.domain.DomainClassifier) for more information about the `DomainClassifier` class."
163167
]
164168
},
165169
{

tutorials/text/distributed-data-classification/fineweb-edu-classification.ipynb

Lines changed: 6 additions & 2 deletions
@@ -12,7 +12,9 @@
1212
" - Volta™ or higher (compute capability 7.0+)\n",
1313
" - CUDA 12.x\n",
1414
"\n",
15-
"Before running this notebook, please see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies."
15+
"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
16+
"\n",
17+
"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page."
1618
]
1719
},
1820
{
@@ -156,7 +158,9 @@
156158
" self.stages.append(Filter(filter_fn=self.filter_by_category, filter_field=...))\n",
157159
"```\n",
158160
"\n",
159-
"Since the FineWeb-Edu classifier outputs a floating point value ranging from 0 to 5, Curator labels samples with a value of 2.5 or higher as `high_quality` and samples with a value lower than 2.5 as `low_quality`."
161+
"Since the FineWeb-Edu classifier outputs a floating point value ranging from 0 to 5, Curator labels samples with a value of 2.5 or higher as `high_quality` and samples with a value lower than 2.5 as `low_quality`.\n",
162+
"\n",
163+
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.classifiers.fineweb_edu.html#stages.text.classifiers.fineweb_edu.FineWebEduClassifier) for more information about the `FineWebEduClassifier` class."
160164
]
161165
},
162166
{
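The labeling rule described in this hunk (scores from 0 to 5, with 2.5 or higher labeled `high_quality` and anything lower labeled `low_quality`) can be written as a one-line threshold. The function name `label_quality` and its `threshold` parameter are hypothetical; only the 2.5 cutoff and the two label strings come from the notebook text.

```python
def label_quality(score, threshold=2.5):
    """Map a 0-5 classifier score to a quality label (illustrative only)."""
    # Scores at or above the threshold are high quality; the boundary
    # value 2.5 itself counts as high quality.
    return "high_quality" if score >= threshold else "low_quality"


labels = [label_quality(score) for score in (0.7, 2.5, 4.1)]
```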

tutorials/text/distributed-data-classification/fineweb-mixtral-edu-classification.ipynb

Lines changed: 6 additions & 2 deletions
@@ -12,7 +12,9 @@
1212
" - Volta™ or higher (compute capability 7.0+)\n",
1313
" - CUDA 12.x\n",
1414
"\n",
15-
"Before running this notebook, please see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies."
15+
"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
16+
"\n",
17+
"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page."
1618
]
1719
},
1820
{
@@ -156,7 +158,9 @@
156158
" self.stages.append(Filter(filter_fn=self.filter_by_category, filter_field=...))\n",
157159
"```\n",
158160
"\n",
159-
"Since the NemoCurator FineWeb Mixtral Edu classifier outputs a floating point value ranging from 0 to 5, Curator labels samples with a value of 2.5 or higher as `high_quality` and samples with a value lower than 2.5 as `low_quality`."
161+
"Since the NemoCurator FineWeb Mixtral Edu classifier outputs a floating point value ranging from 0 to 5, Curator labels samples with a value of 2.5 or higher as `high_quality` and samples with a value lower than 2.5 as `low_quality`.\n",
162+
"\n",
163+
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.classifiers.fineweb_edu.html#stages.text.classifiers.fineweb_edu.FineWebMixtralEduClassifier) for more information about the `FineWebMixtralEduClassifier` class."
160164
]
161165
},
162166
{

tutorials/text/distributed-data-classification/fineweb-nemotron-edu-classification.ipynb

Lines changed: 6 additions & 2 deletions
@@ -12,7 +12,9 @@
1212
" - Volta™ or higher (compute capability 7.0+)\n",
1313
" - CUDA 12.x\n",
1414
"\n",
15-
"Before running this notebook, please see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies."
15+
"Before running this notebook, see this [Installation Guide](https://docs.nvidia.com/nemo/curator/latest/admin/installation.html#admin-installation) page for instructions on how to install NeMo Curator. Be sure to use an installation method which includes GPU dependencies.\n",
16+
"\n",
17+
"For more information about the classifiers, refer to our [Distributed Data Classification](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/quality-assessment/distributed-classifier.html) documentation page."
1618
]
1719
},
1820
{
@@ -156,7 +158,9 @@
156158
" self.stages.append(Filter(filter_fn=self.filter_by_category, filter_field=...))\n",
157159
"```\n",
158160
"\n",
159-
"Since the NemoCurator FineWeb Nemotron-4 Edu classifier outputs a floating point value ranging from 0 to 5, Curator labels samples with a value of 2.5 or higher as `high_quality` and samples with a value lower than 2.5 as `low_quality`."
161+
"Since the NemoCurator FineWeb Nemotron-4 Edu classifier outputs a floating point value ranging from 0 to 5, Curator labels samples with a value of 2.5 or higher as `high_quality` and samples with a value lower than 2.5 as `low_quality`.\n",
162+
"\n",
163+
"See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.classifiers.fineweb_edu.html#stages.text.classifiers.fineweb_edu.FineWebNemotronEduClassifier) for more information about the `FineWebNemotronEduClassifier` class."
160164
]
161165
},
162166
{
