|
7 | 7 | "source": [ |
8 | 8 | "# Step by Step Semantic Deduplication on Text Data\n", |
9 | 9 | "\n", |
10 | | - "GPU accelerated implementation of [SemDeDup: Data-efficient learning at web-scale through semantic deduplication](https://arxiv.org/abs/2303.09540) \n", |
| 10 | + "GPU accelerated implementation of [SemDeDup: Data-efficient learning at web-scale through semantic deduplication](https://arxiv.org/abs/2303.09540). For more information about semantic deduplication in NeMo Curator, refer to the [Semantic Deduplication](https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/semdedup.html) documentation page.\n", |
11 | 11 | "\n", |
12 | | - "The tutorial here shows how to run Semantic Duplication on text data by executing three workflows sequentially. \n", |
| 12 | + "The tutorial here shows how to run Semantic Deduplication on text data by executing three workflows sequentially.\n", |
13 | 13 | "\n", |
14 | 14 | "We also use an ID Generator to show how it works when running it separately.\n", |
15 | 15 | "\n", |
16 | | - "1. Create ID generator.\n", |
| 16 | + "1. Create ID generator\n", |
17 | 17 | "2. Running embedding generation\n", |
18 | 18 | "3. Running K-Means + pairwise (without duplicate identification)\n", |
19 | 19 | "4. Run duplicate identification\n", |
|
136 | 136 | "source": [ |
137 | 137 | "## Create ID Generator\n", |
138 | 138 | "\n", |
139 | | - "1. This creates a Ray Actor in the background.\n", |
140 | | - "2. When we read our dataset now, this actor in the background is used to assign monotonically increasing integer IDs to each row. " |
| 139 | + "This creates a Ray Actor in the background. When we now read our dataset, this actor assigns monotonically increasing integer IDs to each row.\n", |
| 140 | + "\n", |
| 141 | + "See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.deduplication.id_generator.html#stages.deduplication.id_generator.create_id_generator_actor) for more information about the `create_id_generator_actor` function." |
141 | 142 | ] |
142 | 143 | }, |
143 | 144 | { |
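To illustrate what "monotonically increasing integer IDs" means in the cell above, here is a minimal sketch in plain Python. It does not use Ray or NeMo Curator: the `SimpleIdGenerator` class, its `register_batch` method, and the file names are hypothetical, invented for this example. Reusing the same range when a batch is seen again is also an assumption, made here to show why a later step would need to read the data with the same input configuration.

```python
class SimpleIdGenerator:
    """Hypothetical stand-in for an ID-assigning actor (not the NeMo Curator API)."""

    def __init__(self, start: int = 0) -> None:
        self._next_id = start
        # Maps a batch key (for example a file name) to its [first, stop) ID range.
        self._ranges: dict[str, tuple[int, int]] = {}

    def register_batch(self, batch_key: str, num_rows: int) -> range:
        """Return the ID range for a batch, reusing it if the batch was seen before."""
        if batch_key not in self._ranges:
            first = self._next_id
            self._next_id += num_rows
            self._ranges[batch_key] = (first, self._next_id)
        first, stop = self._ranges[batch_key]
        return range(first, stop)


gen = SimpleIdGenerator()
ids_a = list(gen.register_batch("part-0.parquet", 3))
ids_b = list(gen.register_batch("part-1.parquet", 2))
ids_a_again = list(gen.register_batch("part-0.parquet", 3))
print(ids_a, ids_b, ids_a_again)
```

Each new batch receives the next unused block of integers, and asking again for a known batch returns the block it was given the first time.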
|
206 | 207 | "source": [ |
207 | 208 | "## Run Embedding Generation\n", |
208 | 209 | "\n", |
209 | | - "1. We output the embeddings as Parquet files so that we can read more smartly during our K-Means step. This is the recommended file format before you run K-Means.\n" |
| 210 | + "We output the embeddings as Parquet files so that we can read more smartly during our K-Means step. This is the recommended file format before you run K-Means.\n", |
| 211 | + "\n", |
| 212 | + "See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.embedders.base.html#stages.text.embedders.base.EmbeddingCreatorStage) for more information about the `EmbeddingCreatorStage` class." |
210 | 213 | ] |
211 | 214 | }, |
212 | 215 | { |
|
400 | 403 | "source": [ |
401 | 404 | "## Run Semantic Deduplication workflow (without specifying `eps`)\n", |
402 | 405 | "\n", |
403 | | - "1. We intentionally don't specify `eps` so that we can show how to run `IdentifyDuplicates` as a separate stage." |
| 406 | + "We intentionally don't specify `eps` so that we can show how to run `IdentifyDuplicates` as a separate stage.\n", |
| 407 | + "\n", |
| 408 | + "See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.deduplication.semantic.workflow.html#stages.deduplication.semantic.workflow.SemanticDeduplicationWorkflow) for more information about the `SemanticDeduplicationWorkflow` class." |
404 | 409 | ] |
405 | 410 | }, |
406 | 411 | { |
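The pairwise step that follows K-Means can be sketched as below. This is a pure-Python illustration of the idea from the SemDeDup paper (compare each item only with other items in the same cluster), not the GPU implementation in NeMo Curator. The function names are hypothetical, the item order inside the cluster is taken as given, and embeddings are assumed to be nonzero vectors.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_similarity_to_earlier(embeddings: list[list[float]]) -> list[float]:
    """For item i, return its highest cosine similarity to any item j < i.

    The first item has nothing before it, so its score is 0.0.
    """
    scores = []
    for i, emb in enumerate(embeddings):
        earlier = [cosine(emb, embeddings[j]) for j in range(i)]
        scores.append(max(earlier, default=0.0))
    return scores


# One toy cluster: the third embedding repeats the first.
cluster = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
scores = max_similarity_to_earlier(cluster)
print(scores)
```

No threshold is applied at this point. The scores are only computed and kept, which is what allows duplicate identification to run later as a separate stage.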
|
748 | 753 | "source": [ |
749 | 754 | "## Identify Duplicates\n", |
750 | 755 | "\n", |
751 | | - "We will create a simple pipeline that now identifies duplicates and writes them out." |
| 756 | + "We will create a simple pipeline that now identifies duplicates and writes them out.\n", |
| 757 | + "\n", |
| 758 | + "See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.deduplication.semantic.identify_duplicates.html#stages.deduplication.semantic.identify_duplicates.IdentifyDuplicatesStage) for more information about the `IdentifyDuplicatesStage` class." |
752 | 759 | ] |
753 | 760 | }, |
754 | 761 | { |
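Conceptually, duplicate identification applies the `eps` threshold to similarity scores that were computed earlier. The sketch below follows the SemDeDup convention of flagging an item when its maximum similarity exceeds `1 - eps`. The `identify_duplicates` function and the sample values are hypothetical, and the exact comparison used by `IdentifyDuplicatesStage` is not confirmed here.

```python
def identify_duplicates(ids: list[int], max_sims: list[float], eps: float) -> list[int]:
    """Return the IDs whose maximum similarity exceeds 1 - eps."""
    threshold = 1.0 - eps
    return [doc_id for doc_id, sim in zip(ids, max_sims) if sim > threshold]


# Hypothetical per-document scores from an earlier pairwise step.
ids = [10, 11, 12, 13]
max_sims = [0.0, 0.95, 0.5, 0.995]

strict = identify_duplicates(ids, max_sims, eps=0.01)  # threshold 0.99
loose = identify_duplicates(ids, max_sims, eps=0.1)    # threshold 0.9
print(strict, loose)
```

A larger `eps` lowers the threshold and therefore flags more documents, so the same stored scores can be re-thresholded without recomputing embeddings or clusters.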
|
888 | 895 | "source": [ |
889 | 896 | "## Removing Duplicates\n", |
890 | 897 | "\n", |
891 | | - "We offer a simple `TextDuplicatesRemovalWorkflow` that can remove duplicates from a given input dataset and list of duplicates to remove. \n", |
| 898 | + "We offer a simple `TextDuplicatesRemovalWorkflow` that can remove duplicates from a given input dataset and list of duplicates to remove.\n", |
892 | 899 | "\n", |
893 | 900 | "### Notes\n", |
894 | 901 | "1. When running the removal workflow, we must specify the same input configuration as we did when we \"generated IDs\".\n", |
|
899 | 906 | "### Performance\n", |
900 | 907 | "If you notice OOMs during this stage, you can try using `RayDataActor`.\n", |
901 | 908 | "\n", |
902 | | - "\n", |
903 | 909 | "### How `TextDuplicatesRemovalWorkflow` works\n", |
904 | 910 | "1. It starts the ID Generator using `create_id_generator(filepath=...)`\n", |
905 | 911 | "1. It runs a pipeline that does [`ParquetReader`, `TextDuplicatesRemovalStage`, `ParquetWriter`] (assuming input/output filetypes are Parquet)\n", |
906 | | - "1. It kills the ID Generator using `kill_id_generator_actor`\n" |
| 912 | + "1. It kills the ID Generator using `kill_id_generator_actor`\n", |
| 913 | + "\n", |
| 914 | + "See the [API Reference](https://docs.nvidia.com/nemo/curator/latest/apidocs/stages/stages.text.deduplication.semantic.html#stages.text.deduplication.semantic.TextSemanticDeduplicationWorkflow) for more information about the `TextSemanticDeduplicationWorkflow` class." |
907 | 915 | ] |
908 | 916 | }, |
909 | 917 | { |
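The removal step amounts to filtering rows by ID. The following plain-Python sketch shows that logic only. It is not `TextDuplicatesRemovalStage`, and the `remove_duplicates` function, the `id` field name, and the sample rows are assumptions made for illustration.

```python
def remove_duplicates(rows: list[dict], duplicate_ids: list[int], id_field: str = "id") -> list[dict]:
    """Keep only the rows whose ID is not in the list of duplicates."""
    to_drop = set(duplicate_ids)  # set lookup keeps the filter linear in the number of rows
    return [row for row in rows if row[id_field] not in to_drop]


rows = [
    {"id": 0, "text": "the cat sat"},
    {"id": 1, "text": "a cat was sitting"},
    {"id": 2, "text": "stock prices fell"},
]
kept = remove_duplicates(rows, duplicate_ids=[1])
print([row["id"] for row in kept])
```

The filter is only correct if the rows carry the same IDs that were assigned when the duplicates were identified, which is why the notes above require the same input configuration as the ID generation step.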
|