docs/source/nlp/quantization.rst
Lines changed: 10 additions & 32 deletions
@@ -19,7 +19,7 @@ The quantization process consists of the following steps:
2. Calibrating the model to obtain appropriate algorithm-specific scaling factors
3. Producing an output directory or .qnemo tarball with model config (json), quantized weights (safetensors) and tokenizer config (yaml).
-Loading models requires using a ModelOpt spec defined in the `nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_modelopt_spec.py>`_ module. Typically, the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the NVIDIA TensorRT-LLM library. The engine build step is also available in the NeMo project in the ``nemo.deploy`` and ``nemo.export`` modules.
+Loading models requires using a ModelOpt spec defined in the `nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_modelopt_spec.py>`_ module. Typically, the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the NVIDIA TensorRT-LLM library. The engine build step is also available in the `Export-Deploy repository <https://github.com/NVIDIA-NeMo/Export-Deploy>`_.
The quantization algorithm can also be set to ``"null"`` to perform only the weights export step, using default precision, for TensorRT-LLM deployment. This is useful for obtaining baseline performance and accuracy results for comparison.
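Under the hood, this PTQ flow follows ModelOpt's calibrate-then-quantize pattern. Below is a minimal sketch of that pattern on a toy model; the ``mtq.quantize`` call and ``FP8_DEFAULT_CFG`` config come from the TensorRT Model Optimizer library, while the toy model and calibration data are illustrative stand-ins (assumptions) for the restored Megatron model and calibration dataset used by NeMo.

.. code-block:: python

    import torch
    import modelopt.torch.quantization as mtq

    # Toy stand-ins for the restored Megatron model and the calibration dataset.
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 16)
    )
    calib_data = [torch.randn(8, 16) for _ in range(4)]

    def forward_loop(m):
        # Lightweight calibration pass: run a small dataset through the model
        # so ModelOpt can collect the statistics used for scaling factors.
        for batch in calib_data:
            m(batch)

    # FP8 post-training quantization; other algorithm configs follow the same pattern.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)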
@@ -103,19 +103,11 @@ The output directory stores the following files:
├── tokenizer.model
└── tokenizer_config.yaml
-The TensorRT-LLM engine can be conveniently built and run using the ``TensorRTLLM`` class available in the ``nemo.export`` submodule:
+.. note::
+    The export and deployment functionality has been moved to the separate `Export-Deploy repository <https://github.com/NVIDIA-NeMo/Export-Deploy>`_.
-    trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
-Alternatively, it can also be built directly using the ``trtllm-build`` command; see the `TensorRT-LLM documentation <https://nvidia.github.io/TensorRT-LLM/latest/legacy/architecture/checkpoint.html#build-checkpoint-into-tensorrt-engine>`_:
+The TensorRT-LLM engine can be built directly using the ``trtllm-build`` command; see the `TensorRT-LLM documentation <https://nvidia.github.io/TensorRT-LLM/latest/legacy/architecture/checkpoint.html#build-checkpoint-into-tensorrt-engine>`_:
.. code-block:: bash
@@ -129,7 +121,7 @@ Alternatively, it can also be built directly using ``trtllm-build`` command, see
Known issues
^^^^^^^^^^^^
-* Currently, with the ``nemo.export`` module, building TensorRT-LLM engines for quantized "qnemo" models is limited to single-node deployments.
+* Building TensorRT-LLM engines for quantized "qnemo" models is limited to single-node deployments.
Quantization-Aware Training (QAT)
@@ -183,25 +175,11 @@ Note that you may tweak the QAT trainer steps and learning rate if needed to ach
NeMo checkpoints trained in FP8 with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
-If you have an FP8-quantized checkpoint, produced during pre-training or fine-tuning with Transformer Engine, you can convert it to an FP8 TensorRT-LLM engine directly using ``nemo.export``.
-The API is the same as with regular ``.nemo`` and ``.qnemo`` checkpoints:
+If you have an FP8-quantized checkpoint, produced during pre-training or fine-tuning with Transformer Engine, you can convert it to an FP8 TensorRT-LLM engine directly using the Export-Deploy repository.
-By default, quantization settings are auto-detected from the NeMo checkpoint.
+.. note::
+    Export and deployment functionality is available in the Export-Deploy repository.
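For reference, a rough sketch of how such an export looked with the pre-move ``nemo.export`` API follows; the class and argument names (``TensorRTLLM``, ``model_dir``, ``nemo_checkpoint_path``, ``model_type``) are assumptions based on earlier NeMo releases, and the current entry points live in the Export-Deploy repository mentioned above.

.. code-block:: python

    # Sketch only: the module path and argument names may differ in current releases.
    from nemo.export.tensorrt_llm import TensorRTLLM

    exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")  # directory for the built engine
    exporter.export(
        nemo_checkpoint_path="/checkpoints/model_fp8.nemo",  # FP8 checkpoint from Transformer Engine training
        model_type="llama",                                  # assumed architecture identifier
    )
    print(exporter.forward(["Hi, how are you?"]))            # quick smoke-test generation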
-providing `min_shape`, `opt_shape` and `max_shape`
-as arguments (in this example all are set to the input tensor shape for simplicity) which enables inputs with dynamic shapes after compilation.
-For more information about TensorRT and dynamic shapes, please review the [Torch-TensorRT documentation](https://pytorch.org/TensorRT/user_guide/dynamic_shapes.html)
-The file `cosmos_trt_run.py` provides a stand-alone script to tokenize tensors with a TensorRT-accelerated
-Cosmos tokenizer.
+For TensorRT acceleration examples and documentation, please refer to:
+https://github.com/NVIDIA-NeMo/Export-Deploy
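For context, the removed text above describes Torch-TensorRT compilation with `min_shape`/`opt_shape`/`max_shape`. A minimal, self-contained sketch of that dynamic-shape pattern on a toy module follows; the module and shapes are illustrative assumptions, not the actual Cosmos tokenizer.

```python
import torch
import torch_tensorrt

# Toy stand-in for a tokenizer encoder; the real flow compiles the Cosmos model.
model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).eval().cuda()

# min/opt/max shapes let the compiled engine accept a range of input sizes
# (dynamic shapes) instead of a single fixed shape.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[
        torch_tensorrt.Input(
            min_shape=(1, 3, 256, 256),
            opt_shape=(1, 3, 512, 512),
            max_shape=(1, 3, 1024, 1024),
            dtype=torch.float32,
        )
    ],
    enabled_precisions={torch.float32},
)

out = trt_model(torch.randn(1, 3, 512, 512, device="cuda"))
```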
# Examples
1. Multimodal autoregressive model dataset preparation using the [discrete cosmos tokenizer](../../../../nemo/collections/multimodal_autoregressive/data/README.md)