Commit c39c5dd

remove ExportDeploy & references
1 parent d3354d2 commit c39c5dd

23 files changed: 21 additions, 788 deletions


docs/source/nlp/quantization.rst

Lines changed: 10 additions & 32 deletions
@@ -19,7 +19,7 @@ The quantization process consists of the following steps:
 2. Calibrating the model to obtain appropriate algorithm-specific scaling factors
 3. Producing an output directory or .qnemo tarball with model config (json), quantized weights (safetensors) and tokenizer config (yaml).
 
-Loading models requires using an ModelOpt spec defined in `nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_modelopt_spec.py>`_ module. Typically the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the Nvidia TensorRT-LLM library. The engine build step is also available in NeMo project in ``nemo.deploy`` and ``nemo.export`` modules.
+Loading models requires using an ModelOpt spec defined in `nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_modelopt_spec.py>`_ module. Typically the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the Nvidia TensorRT-LLM library. The engine build step is also available in the `Export-Deploy repository <https://github.com/NVIDIA-NeMo/Export-Deploy>`_.
 
 Quantization algorithm can also be conveniently set to ``"null"`` to perform only the weights export step using default precision for TensorRT-LLM deployment. This is useful to obtain baseline performance and accuracy results for comparison.

@@ -103,19 +103,11 @@ The output directory stores the following files:
     ├── tokenizer.model
     └── tokenizer_config.yaml
 
-The TensorRT-LLM engine can be conveniently built and run using ``TensorRTLLM`` class available in ``nemo.export`` submodule:
+.. note::
+    The export and deployment functionality has been moved to a separate repository.
+    Install with: ``pip install git+https://github.com/NVIDIA-NeMo/Export-Deploy.git``
 
-.. code-block:: python
-
-    from nemo.export.tensorrt_llm import TensorRTLLM
-    trt_llm_exporter = TensorRTLLM(model_dir="/path/to/trt_llm_engine_folder")
-    trt_llm_exporter.export(
-        nemo_checkpoint_path="llama3-70b-base-fp8-qnemo",
-        model_type="llama",
-    )
-    trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
-
-Alternatively, it can also be built directly using ``trtllm-build`` command, see `TensorRT-LLM documentation <https://nvidia.github.io/TensorRT-LLM/latest/legacy/architecture/checkpoint.html#build-checkpoint-into-tensorrt-engine>`_:
+The TensorRT-LLM engine can be built directly using ``trtllm-build`` command, see `TensorRT-LLM documentation <https://nvidia.github.io/TensorRT-LLM/latest/legacy/architecture/checkpoint.html#build-checkpoint-into-tensorrt-engine>`_:
 
 .. code-block:: bash

@@ -129,7 +121,7 @@ Alternatively, it can also be built directly using ``trtllm-build`` command, see
 
 Known issues
 ^^^^^^^^^^^^
-* Currently with ``nemo.export`` module building TensorRT-LLM engines for quantized "qnemo" models is limited to single-node deployments.
+* Building TensorRT-LLM engines for quantized "qnemo" models is limited to single-node deployments.
 
 
 Quantization-Aware Training (QAT)

@@ -183,25 +175,11 @@ Note that you may tweak the QAT trainer steps and learning rate if needed to ach
 NeMo checkpoints trained in FP8 with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
 ----------------------------------------------------------------------------------------------------------------
 
-If you have an FP8-quantized checkpoint, produced during pre-training or fine-tuning with Transformer Engine, you can convert it to a FP8 TensorRT-LLM engine directly using ``nemo.export``.
-The API is the same as with regular ``.nemo`` and ``.qnemo`` checkpoints:
-
-.. code-block:: python
-
-    from nemo.export.tensorrt_llm import TensorRTLLM
-    trt_llm_exporter = TensorRTLLM(model_dir="/path/to/trt_llm_engine_folder")
-    trt_llm_exporter.export(
-        nemo_checkpoint_path="/path/to/llama2-7b-base-fp8.nemo",
-        model_type="llama",
-    )
-    trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
-
-The export settings for quantization can be adjusted via ``trt_llm_exporter.export`` arguments:
-
-* ``fp8_quantized: Optional[bool] = None``: manually enables/disables FP8 quantization
-* ``fp8_kvcache: Optional[bool] = None``: manually enables/disables FP8 quantization for KV-cache
+If you have an FP8-quantized checkpoint, produced during pre-training or fine-tuning with Transformer Engine, you can convert it to a FP8 TensorRT-LLM engine directly using the Export-Deploy repository.
 
-By default quantization settings are auto-detected from the NeMo checkpoint.
+.. note::
+    Export and deployment functionality is available in the Export-Deploy repository.
+    See: https://github.com/NVIDIA-NeMo/Export-Deploy
 
 
 References
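
The remaining docs step above is to build the engine from the quantized qnemo output with ``trtllm-build``. As a minimal, hedged sketch of scripting that step, here is a Python wrapper around the CLI; the checkpoint directory name is reused from the removed example, and the flags shown are an illustrative subset rather than the full ``trtllm-build`` reference (see the TensorRT-LLM documentation linked above).

```python
# Hedged sketch: invoke the trtllm-build CLI on a quantized qnemo output directory.
# Assumes TensorRT-LLM is installed so trtllm-build is on PATH; the directory name
# and flag values are illustrative, taken from or modeled on the docs above.
import subprocess

checkpoint_dir = "llama3-70b-base-fp8-qnemo"   # output of the quantization step
engine_dir = "/path/to/trt_llm_engine_folder"  # where the built engine is written

subprocess.run(
    [
        "trtllm-build",
        "--checkpoint_dir", checkpoint_dir,
        "--output_dir", engine_dir,
        "--max_batch_size", "8",  # illustrative value; tune per deployment
    ],
    check=True,
)
```

The checkpoint directory is the one described earlier (model config in JSON, quantized safetensors weights, tokenizer files), and the single-node limitation listed under Known issues applies to engines built this way.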

examples/llm/finetune/automodel_vllm.py

Lines changed: 0 additions & 158 deletions
This file was deleted.

nemo/collections/common/video_tokenizers/README.md

Lines changed: 3 additions & 48 deletions
@@ -29,58 +29,13 @@ for the complete list of supported tokenizers.
 
 ### Acceleration with TensorRT
 
-**Note:** TensorRT acceleration requires the Export-Deploy repository:
+**Note:** TensorRT acceleration functionality has been moved to the Export-Deploy repository:
 ```bash
 pip install git+https://github.com/NVIDIA-NeMo/Export-Deploy.git
 ```
 
-To use these tokenizers with TensorRT and acheive up to 3X speedup during tokenization,
-users can define a lightweight wrapper model and then pass this wrapper model to `trt_compile`
-```python
-import torch
-from nemo.collections.common.video_tokenizers.cosmos_tokenizer import CausalVideoTokenizer
-from nemo.export.tensorrt_lazy_compiler import trt_compile
-
-class VaeWrapper(torch.nn.Module):
-    def __init__(self, vae):
-        super().__init__()
-        self.vae = vae
-
-    def forward(self, input_tensor):
-        output_tensor = self.vae.autoencode(input_tensor)
-        return output_tensor
-
-model = CausalVideoTokenizer.from_pretrained(
-    "Cosmos-Tokenizer-DV4x8x8",
-    use_pytorch=True,
-    dtype="float"
-)
-model_wrapper = VaeWrapper(model)
-
-input_tensor = torch.randn(1, 3, 9, 512, 512).to('cuda').to(torch.float)
-opt_shape = min_shape = max_shape = input_tensor.shape
-
-path_to_engine_outputs="./trt_outputs"
-trt_compile(
-    model_wrapper,
-    path_to_engine_outputs,
-    args={
-        "precision": "bf16",
-        "input_profiles": [
-            {"input_tensor": [min_shape, opt_shape, max_shape]},
-        ],
-    },
-)
-
-output = model_wrapper(input_tensor)
-```
-Note that the `trt_compile` function requires
-providing `min_shape`, `opt_shape` and `max_shape`
-as arguments (in this example all are set to the input tensor shape for simplicity) which enables inputs with dynamic shapes after compilation.
-For more information about TensorRT and dynamic shapes please review the [Torch-Tensorrt documentation](https://pytorch.org/TensorRT/user_guide/dynamic_shapes.html)
-
-The file `cosmos_trt_run.py` provides a stand-alone script to tokenize tensors with a TensorRT-accelerated
-Cosmos tokenizer.
+For TensorRT acceleration examples and documentation, please refer to:
+https://github.com/NVIDIA-NeMo/Export-Deploy
 
 # Examples
 1. Multimodal autoregressive model dataset preparation using the [discrete cosmos tokenizer](../../../../nemo/collections/multimodal_autoregressive/data/README.md)
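
The removed README example wrapped the Cosmos tokenizer's ``autoencode`` call in a small ``torch.nn.Module`` and handed it to the relocated ``trt_compile`` helper together with min/opt/max input shapes. As a hedged alternative that does not depend on the moved helper, the same wrapper can be compiled with Torch-TensorRT directly (the library the removed text links to for dynamic shapes); the wrapper, model name, and shapes below are reused from the removed example, while the ``torch_tensorrt`` calls are an assumed substitute path, not the Export-Deploy recipe.

```python
# Hedged sketch: compile the wrapper from the removed README example with
# Torch-TensorRT instead of the relocated nemo.export trt_compile helper.
# This is an assumed substitute workflow, not the Export-Deploy recipe.
import torch
import torch_tensorrt

from nemo.collections.common.video_tokenizers.cosmos_tokenizer import CausalVideoTokenizer


class VaeWrapper(torch.nn.Module):
    """Expose the tokenizer's autoencode() as a plain forward pass for compilation."""

    def __init__(self, vae):
        super().__init__()
        self.vae = vae

    def forward(self, input_tensor):
        return self.vae.autoencode(input_tensor)


model = CausalVideoTokenizer.from_pretrained(
    "Cosmos-Tokenizer-DV4x8x8", use_pytorch=True, dtype="float"
)
model_wrapper = VaeWrapper(model)

# Same toy shape as the removed example; using one shape for min/opt/max keeps it
# simple, while distinct shapes would enable dynamic input sizes after compilation.
shape = (1, 3, 9, 512, 512)
trt_wrapper = torch_tensorrt.compile(
    model_wrapper,
    inputs=[torch_tensorrt.Input(min_shape=shape, opt_shape=shape, max_shape=shape, dtype=torch.float)],
    enabled_precisions={torch.half},  # the removed example used bf16; half is a stand-in here
)

output = trt_wrapper(torch.randn(*shape).to("cuda").to(torch.float))
```

Whether the tokenizer modules trace cleanly through Torch-TensorRT is not verified here; for the maintained TensorRT path, follow the Export-Deploy repository linked above.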

nemo/collections/common/video_tokenizers/cosmos_trt_run.py

Lines changed: 0 additions & 102 deletions
This file was deleted.
