docs/user-guide/distributeddataclassification.rst
Lines changed: 27 additions & 2 deletions
@@ -15,12 +15,14 @@ NeMo Curator provides a module to help users run inference with pre-trained mode
 This is achieved by chunking the datasets across multiple computing nodes, each equipped with multiple GPUs, to accelerate the classification task in a distributed manner.
 Since the classification of a single text document is independent of other documents within the dataset, we can distribute the workload across multiple nodes and GPUs to perform parallel processing.
 
-Domain, quality, content safety, and educational content models are tasks we include as examples within our module.
+Domain (English and multilingual), quality, content safety, and educational content models are tasks we include as examples within our module.
 
 Here, we summarize why each is useful for training an LLM:
 
 - The **Domain Classifier** is useful because it helps the LLM understand the context and specific domain of the input text. Because different domains have different linguistic characteristics and terminologies, an LLM's ability to generate contextually relevant responses can be improved by tailoring training data to a specific domain. Overall, this helps provide more accurate and specialized information.
 
+- The **Multilingual Domain Classifier** is the same as the domain classifier, but has been trained to classify text in 52 languages, including English.
+
 - The **Quality Classifier** is useful for filtering out noisy or low quality data. This allows the model to focus on learning from high quality and informative examples, which contributes to the LLM's robustness and enhances its ability to generate reliable and meaningful outputs. Additionally, quality classification helps mitigate biases and inaccuracies that may arise from poorly curated training data.
 
 - The **AEGIS Safety Models** are essential for filtering harmful or risky content, which is critical for training models that should avoid learning from unsafe data. By classifying content into 13 critical risk categories, AEGIS helps remove harmful or inappropriate data from the training sets, improving the overall ethical and safety standards of the LLM.
@@ -45,7 +47,7 @@ Check out ``nemo_curator.classifiers.base.py`` for reference.
 Domain Classifier
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-The Domain Classifier is used to categorize text documents into specific domains or subject areas. This is particularly useful for organizing large datasets and tailoring the training data for domain-specific LLMs.
+The Domain Classifier is used to categorize English text documents into specific domains or subject areas. This is particularly useful for organizing large datasets and tailoring the training data for domain-specific LLMs.
 
 Let's see how ``DomainClassifier`` works in a small excerpt taken from ``examples/classifiers/domain_example.py``:
 
@@ -64,6 +66,29 @@ Let's see how ``DomainClassifier`` works in a small excerpt taken from ``example
 In this example, the domain classifier is obtained directly from `Hugging Face <https://huggingface.co/nvidia/domain-classifier>`_.
 It filters the input dataset to include only documents classified as "Games" or "Sports".
 
+Multilingual Domain Classifier
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The Multilingual Domain Classifier is used to categorize text documents across 52 languages into specific domains or subject areas.
+
+Using the ``MultilingualDomainClassifier`` is very similar to using the ``DomainClassifier`` as described above. Here is an example:
+
+.. code-block:: python
+
+    from nemo_curator.classifiers import MultilingualDomainClassifier
For more information about the multilingual domain classifier, including its supported languages, please see the `nvidia/multilingual-domain-classifier <https://huggingface.co/nvidia/multilingual-domain-classifier>`_ on Hugging Face.
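
Two ideas from the documentation text in this change can be sketched in plain Python: documents can be classified independently of one another, and the labeled result is then filtered to a set of domains such as "Games" and "Sports". This is only an illustration under stated assumptions. It does not use NeMo Curator, and ``toy_predict_domain``, ``classify_partition``, and ``classify_and_filter`` are hypothetical names; a keyword lookup stands in for the real model and a thread pool stands in for multiple GPUs.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a trained model: a keyword lookup,
# NOT the NeMo Curator domain classifier.
KEYWORDS = {
    "Sports": ("football", "tennis", "olympic"),
    "Games": ("chess", "video game", "console"),
}


def toy_predict_domain(text: str) -> str:
    """Return the first domain whose keyword appears in the text, else 'Other'."""
    lowered = text.lower()
    for domain, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return domain
    return "Other"


def classify_partition(partition: list[str]) -> list[tuple[str, str]]:
    """Label every document in one partition; no other partition is consulted."""
    return [(doc, toy_predict_domain(doc)) for doc in partition]


def classify_and_filter(partitions: list[list[str]], keep: list[str]) -> list[str]:
    """Classify partitions concurrently, then keep documents in the wanted domains."""
    with ThreadPoolExecutor() as pool:
        # map() preserves partition order, so the output order is deterministic.
        labeled = list(pool.map(classify_partition, partitions))
    wanted = set(keep)
    return [doc for part in labeled for doc, label in part if label in wanted]


partitions = [
    ["The football match ended 2-1.", "Quarterly earnings rose."],
    ["A new console was announced.", "Recipe for soup."],
]
print(classify_and_filter(partitions, ["Games", "Sports"]))
```

Because each partition is labeled without reference to any other, the same structure would carry over to workers on separate GPUs or nodes; only the executor and the prediction function would change.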