docs/source/nlp/quantization.rst
Lines changed: 10 additions & 32 deletions
@@ -19,7 +19,7 @@ The quantization process consists of the following steps:
2. Calibrating the model to obtain appropriate algorithm-specific scaling factors
3. Producing an output directory or .qnemo tarball with model config (json), quantized weights (safetensors) and tokenizer config (yaml).
-Loading models requires using a ModelOpt spec defined in the `nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_modelopt_spec.py>`_ module. Typically, the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the NVIDIA TensorRT-LLM library. The engine build step is also available in the NeMo project in the ``nemo.deploy`` and ``nemo.export`` modules.
+Loading models requires using a ModelOpt spec defined in the `nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_modelopt_spec.py>`_ module. Typically, the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the NVIDIA TensorRT-LLM library. The engine build step is also available in the `Export-Deploy repository <https://github.com/NVIDIA-NeMo/Export-Deploy>`_.
The quantization algorithm can also be set to ``"null"`` to perform only the weights export step, using default precision, for TensorRT-LLM deployment. This is useful for obtaining baseline performance and accuracy results for comparison.
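Under the hood, this PTQ flow follows ModelOpt's calibrate-then-quantize pattern. Below is a minimal sketch of that pattern on a toy model; the ``mtq.quantize`` call and ``FP8_DEFAULT_CFG`` config come from the TensorRT Model Optimizer library, while the toy model and calibration data are illustrative stand-ins (assumptions) for the restored Megatron model and calibration dataset used by NeMo.

.. code-block:: python

    import torch
    import modelopt.torch.quantization as mtq

    # Toy stand-ins for the restored Megatron model and the calibration dataset.
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 16)
    )
    calib_data = [torch.randn(8, 16) for _ in range(4)]

    def forward_loop(m):
        # Lightweight calibration pass: run a small dataset through the model
        # so ModelOpt can collect the statistics used for scaling factors.
        for batch in calib_data:
            m(batch)

    # FP8 post-training quantization; other algorithm configs follow the same pattern.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)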
@@ -103,19 +103,11 @@ The output directory stores the following files:
├── tokenizer.model
└── tokenizer_config.yaml
-The TensorRT-LLM engine can be conveniently built and run using the ``TensorRTLLM`` class available in the ``nemo.export`` submodule:
+.. note::
+    The export and deployment functionality has been moved to the separate `Export-Deploy repository <https://github.com/NVIDIA-NeMo/Export-Deploy>`_.
-    trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
-Alternatively, it can also be built directly using the ``trtllm-build`` command; see the `TensorRT-LLM documentation <https://nvidia.github.io/TensorRT-LLM/latest/legacy/architecture/checkpoint.html#build-checkpoint-into-tensorrt-engine>`_:
+The TensorRT-LLM engine can be built directly using the ``trtllm-build`` command; see the `TensorRT-LLM documentation <https://nvidia.github.io/TensorRT-LLM/latest/legacy/architecture/checkpoint.html#build-checkpoint-into-tensorrt-engine>`_:
.. code-block:: bash
@@ -129,7 +121,7 @@ Alternatively, it can also be built directly using ``trtllm-build`` command, see
Known issues
^^^^^^^^^^^^
-* Currently, with the ``nemo.export`` module, building TensorRT-LLM engines for quantized "qnemo" models is limited to single-node deployments.
+* Building TensorRT-LLM engines for quantized "qnemo" models is limited to single-node deployments.
Quantization-Aware Training (QAT)
@@ -183,25 +175,11 @@ Note that you may tweak the QAT trainer steps and learning rate if needed to ach
NeMo checkpoints trained in FP8 with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
-If you have an FP8-quantized checkpoint, produced during pre-training or fine-tuning with Transformer Engine, you can convert it to an FP8 TensorRT-LLM engine directly using ``nemo.export``.
-The API is the same as with regular ``.nemo`` and ``.qnemo`` checkpoints:
+If you have an FP8-quantized checkpoint, produced during pre-training or fine-tuning with Transformer Engine, you can convert it to an FP8 TensorRT-LLM engine directly using the Export-Deploy repository.
-By default, quantization settings are auto-detected from the NeMo checkpoint.
+.. note::
+    Export and deployment functionality is available in the Export-Deploy repository.
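For reference, a rough sketch of how such an export looked with the pre-move ``nemo.export`` API follows; the class and argument names (``TensorRTLLM``, ``model_dir``, ``nemo_checkpoint_path``, ``model_type``) are assumptions based on earlier NeMo releases, and the current entry points live in the Export-Deploy repository mentioned above.

.. code-block:: python

    # Sketch only: the module path and argument names may differ in current releases.
    from nemo.export.tensorrt_llm import TensorRTLLM

    exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")  # directory for the built engine
    exporter.export(
        nemo_checkpoint_path="/checkpoints/model_fp8.nemo",  # FP8 checkpoint from Transformer Engine training
        model_type="llama",                                  # assumed architecture identifier
    )
    print(exporter.forward(["Hi, how are you?"]))            # quick smoke-test generation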
-providing `min_shape`, `opt_shape` and `max_shape`
-as arguments (in this example all are set to the input tensor shape for simplicity) which enables inputs with dynamic shapes after compilation.
-For more information about TensorRT and dynamic shapes, please review the [Torch-TensorRT documentation](https://pytorch.org/TensorRT/user_guide/dynamic_shapes.html)
-The file `cosmos_trt_run.py` provides a stand-alone script to tokenize tensors with a TensorRT-accelerated
-Cosmos tokenizer.
+For TensorRT acceleration examples and documentation, please refer to:
+https://github.com/NVIDIA-NeMo/Export-Deploy
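For context, the removed text above describes Torch-TensorRT compilation with `min_shape`/`opt_shape`/`max_shape`. A minimal, self-contained sketch of that dynamic-shape pattern on a toy module follows; the module and shapes are illustrative assumptions, not the actual Cosmos tokenizer.

```python
import torch
import torch_tensorrt

# Toy stand-in for a tokenizer encoder; the real flow compiles the Cosmos model.
model = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).eval().cuda()

# min/opt/max shapes let the compiled engine accept a range of input sizes
# (dynamic shapes) instead of a single fixed shape.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[
        torch_tensorrt.Input(
            min_shape=(1, 3, 256, 256),
            opt_shape=(1, 3, 512, 512),
            max_shape=(1, 3, 1024, 1024),
            dtype=torch.float32,
        )
    ],
    enabled_precisions={torch.float32},
)

out = trt_model(torch.randn(1, 3, 512, 512, device="cuda"))
```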
# Examples
1. Multimodal autoregressive model dataset preparation using the [discrete cosmos tokenizer](../../../../nemo/collections/multimodal_autoregressive/data/README.md)