Commit b4c67b5

Updating index.rst to match requested layout (#414)
Signed-off-by: aschilling <[email protected]>
1 parent 7272ca0 commit b4c67b5

4 files changed, 115 additions and 34 deletions

docs/user-guide/image-curation.rst

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
1+
==============
2+
Image Curation
3+
==============
4+
5+
:ref:`Get Started <data-curator-image-getting-started>`
6+
Install NeMo Curator's image curation modules.
7+
8+
:ref:`Image-Text Pair Datasets <data-curator-image-datasets>`
9+
Image-text pair datasets are commonly used as the basis for training multimodal generative models. NeMo Curator interfaces with the standardized WebDataset format for curating such datasets.
10+
11+
:ref:`Image Embedding Creation <data-curator-image-embedding>`
12+
Image embeddings are the backbone of many data curation operations in NeMo Curator. This section describes how to efficiently create embeddings for massive datasets.
13+
14+
:ref:`Classifiers <data-curator-image-classifiers>`
15+
NeMo Curator provides several ways to use common classifiers like aesthetic scoring and not-safe-for-work (NSFW) scoring.
16+
17+
:ref:`Semantic Deduplication <data-curator-semdedup>`
18+
Semantic deduplication with image datasets has been shown to drastically improve model performance. NeMo Curator has a semantic deduplication module that can work with any modality.
19+
20+
.. toctree::
21+
:maxdepth: 4
22+
:titlesonly:
23+
24+
image/gettingstarted.rst
25+
image/datasets.rst
26+
image/classifiers/index.rst
27+
semdedup.rst

docs/user-guide/index.rst

Lines changed: 4 additions & 34 deletions
@@ -37,23 +37,6 @@ Text Curation
3737
:ref:`Personally Identifiable Information Identification and Removal <data-curator-pii>`
3838
The purpose of the personally identifiable information (PII) redaction tool is to help scrub sensitive data out of training datasets.
3939

40-
.. toctree::
41-
:maxdepth: 4
42-
:titlesonly:
43-
44-
45-
download.rst
46-
documentdataset.rst
47-
cpuvsgpu.rst
48-
qualityfiltering.rst
49-
languageidentificationunicodeformatting.rst
50-
gpudeduplication.rst
51-
semdedup.rst
52-
syntheticdata.rst
53-
taskdecontamination.rst
54-
personalidentifiableinformationidentificationandremoval.rst
55-
distributeddataclassification.rst
56-
5740
-------------------
5841
Image Curation
5942
-------------------
@@ -73,16 +56,6 @@ Image Curation
7356
:ref:`Semantic Deduplication <data-curator-semdedup>`
7457
Semantic deduplication with image datasets has been shown to drastically improve model performance. NeMo Curator has a semantic deduplication module that can work with any modality.
7558

76-
.. toctree::
77-
:maxdepth: 4
78-
:titlesonly:
79-
80-
image/gettingstarted.rst
81-
image/datasets.rst
82-
image/classifiers/index.rst
83-
semdedup.rst
84-
85-
8659
-------------------
8760
Reference
8861
-------------------
@@ -106,12 +79,9 @@ Reference
10679
API Documentation for all the modules in NeMo Curator
10780

10881
.. toctree::
109-
:maxdepth: 4
82+
:maxdepth: 1
11083
:titlesonly:
11184

112-
113-
kubernetescurator.rst
114-
sparkother.rst
115-
bestpractices.rst
116-
nextsteps.rst
117-
api/index.rst
85+
text-curation.rst
86+
image-curation.rst
87+
reference.rst

docs/user-guide/reference.rst

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
1+
=========
2+
Reference
3+
=========
4+
5+
:ref:`NeMo Curator on Kubernetes <data-curator-kubernetes>`
6+
Demonstration of how to run NeMo Curator on a Dask cluster deployed on top of Kubernetes
7+
8+
:ref:`NeMo Curator and Apache Spark <data-curator-sparkother>`
9+
Demonstration of how to read and write datasets when using Apache Spark and NeMo Curator
10+
11+
:ref:`Best Practices <data-curator-best-practices>`
12+
A collection of suggestions on how to best use NeMo Curator to curate your dataset
13+
14+
:ref:`Next Steps <data-curator-next-steps>`
15+
Now that you've curated your data, let's discuss where to go next in the NeMo Framework to put it to good use.
16+
17+
`Tutorials <https://github.com/NVIDIA/NeMo-Curator/tree/main/tutorials>`__
18+
To get started, you can explore the NeMo Curator GitHub repository and follow the available tutorials and notebooks. These resources cover various aspects of data curation, including training from scratch and Parameter-Efficient Fine-Tuning (PEFT).
19+
20+
:ref:`API Docs <data-curator-api>`
21+
API Documentation for all the modules in NeMo Curator
22+
23+
.. toctree::
24+
:maxdepth: 4
25+
:titlesonly:
26+
27+
28+
kubernetescurator.rst
29+
sparkother.rst
30+
bestpractices.rst
31+
nextsteps.rst
32+
api/index.rst

docs/user-guide/text-curation.rst

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
1+
=============
2+
Text Curation
3+
=============
4+
:ref:`Downloading and Extracting Text <data-curator-download>`
5+
Downloading a massive public dataset is usually the first step in data curation, and it can be cumbersome due to the dataset’s size and hosting method. This section describes how to download and extract large corpora efficiently.
6+
7+
:ref:`Working with DocumentDataset <data-curator-documentdataset>`
8+
DocumentDataset is the standard format for datasets in NeMo Curator. This section describes how to get datasets in and out of this format, as well as how DocumentDataset interacts with the modules.
9+
10+
:ref:`CPU and GPU Modules with Dask <data-curator-cpuvsgpu>`
11+
NeMo Curator provides both CPU-based and GPU-based modules, and supports methods for creating compatible Dask clusters and managing dataset transfer between CPU and GPU.
12+
13+
:ref:`Document Filtering <data-curator-qualityfiltering>`
14+
This section describes how to use the 30+ heuristic and classifier filters available within NeMo Curator and how to implement custom filters to apply to the documents within the corpora.
15+
16+
:ref:`Language Identification and Unicode Fixing <data-curator-languageidentification>`
17+
Large, unlabeled text corpora often contain a variety of languages. NeMo Curator provides utilities to identify languages and fix improperly decoded Unicode characters.
18+
19+
:ref:`GPU Accelerated Exact and Fuzzy Deduplication <data-curator-gpu-deduplication>`
20+
Both exact and fuzzy deduplication functionalities are supported in NeMo Curator and accelerated using RAPIDS cuDF.
21+
22+
:ref:`GPU Accelerated Semantic Deduplication <data-curator-semdedup>`
23+
NeMo Curator provides scalable and GPU-accelerated semantic deduplication functionality using RAPIDS cuML, cuDF, crossfit, and PyTorch.
24+
25+
:ref:`Distributed Data Classification <data-curator-distributeddataclassifer>`
26+
NeMo Curator provides a scalable and GPU-accelerated module to help users run inference with pre-trained models on large volumes of text documents.
27+
28+
:ref:`Synthetic Data Generation <data-curator-syntheticdata>`
29+
Synthetic data generation tools and example pipelines are available within NeMo Curator.
30+
31+
:ref:`Downstream Task Decontamination <data-curator-downstream>`
32+
After training, large language models are usually evaluated by their performance on downstream tasks consisting of unseen test data. When dealing with large datasets, there is a potential for leakage of this test data into the model’s training dataset. NeMo Curator allows you to remove sections of documents in your dataset that are present in downstream tasks.
33+
34+
:ref:`Personally Identifiable Information Identification and Removal <data-curator-pii>`
35+
The purpose of the personally identifiable information (PII) redaction tool is to help scrub sensitive data out of training datasets.
36+
37+
.. toctree::
38+
:maxdepth: 4
39+
:titlesonly:
40+
41+
42+
download.rst
43+
documentdataset.rst
44+
cpuvsgpu.rst
45+
qualityfiltering.rst
46+
languageidentificationunicodeformatting.rst
47+
gpudeduplication.rst
48+
semdedup.rst
49+
syntheticdata.rst
50+
taskdecontamination.rst
51+
personalidentifiableinformationidentificationandremoval.rst
52+
distributeddataclassification.rst
