
Commit 5912121

Authored and committed by pablo-garay, jenchen13, tango4j, and chtruong814
chore: remove ExportDeploy (NVIDIA-NeMo#15033)
* add EP in PTQ (NVIDIA-NeMo#15015)
* remove ExportDeploy
* remove ExportDeploy tests
* remove references
* lint fix
* Fixing lines for multispeaker pipeline (NVIDIA-NeMo#15030):
  * Fixing lines for multispeaker pipeline
  * Removing unused imports
  * Apply isort and black reformatting
  * Making changes for HF Space deployment
  * Updated multispeaker transcription utils
* remove ExportDeploy & references
* lint fix
* get load_ckpt back
* Apply isort and black reformatting
* revert back
* remove ExportDeploy

Signed-off-by: jenchen13 <[email protected]>
Signed-off-by: Pablo Garay <[email protected]>
Signed-off-by: taejinp <[email protected]>
Signed-off-by: tango4j <[email protected]>
Signed-off-by: chtruong814 <[email protected]>
Signed-off-by: pablo-garay <[email protected]>
Signed-off-by: genquan9 <[email protected]>
Co-authored-by: Jenny Chen <[email protected]>
Co-authored-by: Taejin Park <[email protected]>
Co-authored-by: tango4j <[email protected]>
Co-authored-by: chtruong814 <[email protected]>
Co-authored-by: pablo-garay <[email protected]>
1 parent c0b4fa5 commit 5912121

File tree

97 files changed: 30 additions, 12,146 deletions


.github/workflows/cicd-main-nemo2.yml

Lines changed: 0 additions & 4 deletions
@@ -162,8 +162,6 @@ jobs:
           runner: self-hosted-azure
         - script: L2_NEMO_2_LoRA_MERGE
           runner: self-hosted-azure
-        - script: L2_NEMO_2_LoRA_Export
-          runner: self-hosted-azure-gpus-1
         - script: L2_NEMO_2_LoRA_Inference
           runner: self-hosted-azure-gpus-1
         - script: L2_NeMo_2_NeMo_Mcore_Mixtral_bitexact
@@ -177,8 +175,6 @@ jobs:
           runner: self-hosted-azure
         - script: L2_NeMo_2_PTQ_Llama2_FP8_nemo
           runner: self-hosted-azure
-        - script: L2_NeMo_2_PTQ_Unified_Export
-          runner: self-hosted-azure
         - script: L2_NeMo_2_Distill_Llama3_TP1PP2
           runner: self-hosted-azure
         - script: L2_NeMo_2_Prune_Llama_TP1PP2

.github/workflows/code-linting.yml

Lines changed: 2 additions & 1 deletion
@@ -42,7 +42,8 @@ jobs:
       "!nemo/collections/audio/**/*.py",
       "!nemo/collections/multimodal/speech_llm/**/*.py",
       "!nemo/collections/speechlm/**/*.py",
-      "!nemo/collections/speechlm2/**/*.py"
+      "!nemo/collections/speechlm2/**/*.py",
+      "!nemo/export/**/*.py"
     ] | join(",")')
   fi
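The linting change above appends ``"!nemo/export/**/*.py"`` to an exclusion array that jq later collapses with ``join(",")``. As a minimal sketch (abridged to the two patterns touched by the diff; variable names here are illustrative, not from the workflow), Python's ``str.join`` shows the same behavior, and why the trailing comma added to the ``speechlm2`` entry is needed for the new array element to parse:

```python
# Sketch of the exclusion-list join from code-linting.yml.
# jq's `join(",")` concatenates array elements into one comma-separated
# string, exactly like str.join on a Python list.
patterns = [
    "!nemo/collections/speechlm2/**/*.py",  # now followed by a comma in the YAML array
    "!nemo/export/**/*.py",                 # newly excluded from linting
]
joined = ",".join(patterns)
print(joined)
```

The joined string is what the workflow ultimately passes on as the lint-exclusion filter.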

docker/Dockerfile.ci.export_deploy

Lines changed: 0 additions & 98 deletions
This file was deleted.

docker/common/install_dep.sh

Lines changed: 1 addition & 2 deletions
@@ -279,8 +279,7 @@ vllm() {
     $INSTALL_DIR/venv/bin/pip install --no-cache-dir setuptools coverage
     $INSTALL_DIR/venv/bin/pip wheel --no-cache-dir --no-build-isolation \
         --wheel-dir $WHEELS_DIR/ \
-        -r $CURR/requirements/requirements_vllm.txt \
-        -r $CURR/requirements/requirements_deploy.txt
+        -r $CURR/requirements/requirements_vllm.txt
     fi
 }

docs/source/nlp/quantization.rst

Lines changed: 10 additions & 32 deletions
@@ -19,7 +19,7 @@ The quantization process consists of the following steps:
 2. Calibrating the model to obtain appropriate algorithm-specific scaling factors
 3. Producing an output directory or .qnemo tarball with model config (json), quantized weights (safetensors) and tokenizer config (yaml).

-Loading models requires using an ModelOpt spec defined in `nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_modelopt_spec.py>`_ module. Typically the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the Nvidia TensorRT-LLM library. The engine build step is also available in NeMo project in ``nemo.deploy`` and ``nemo.export`` modules.
+Loading models requires using an ModelOpt spec defined in `nemo.collections.nlp.models.language_modeling.megatron.gpt_layer_modelopt_spec <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/nlp/models/language_modeling/megatron/gpt_layer_modelopt_spec.py>`_ module. Typically the calibration step is lightweight and uses a small dataset to obtain appropriate statistics for scaling tensors. The output directory produced (or a .qnemo tarball) is ready to be used to build a serving engine with the Nvidia TensorRT-LLM library. The engine build step is also available in the `Export-Deploy repository <https://github.com/NVIDIA-NeMo/Export-Deploy>`_.

 Quantization algorithm can also be conveniently set to ``"null"`` to perform only the weights export step using default precision for TensorRT-LLM deployment. This is useful to obtain baseline performance and accuracy results for comparison.

@@ -103,19 +103,11 @@ The output directory stores the following files:
 ├── tokenizer.model
 └── tokenizer_config.yaml

-The TensorRT-LLM engine can be conveniently built and run using ``TensorRTLLM`` class available in ``nemo.export`` submodule:
-
-.. code-block:: python
-
-    from nemo.export.tensorrt_llm import TensorRTLLM
-    trt_llm_exporter = TensorRTLLM(model_dir="/path/to/trt_llm_engine_folder")
-    trt_llm_exporter.export(
-        nemo_checkpoint_path="llama3-70b-base-fp8-qnemo",
-        model_type="llama",
-    )
-    trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
-
-Alternatively, it can also be built directly using ``trtllm-build`` command, see `TensorRT-LLM documentation <https://nvidia.github.io/TensorRT-LLM/latest/legacy/architecture/checkpoint.html#build-checkpoint-into-tensorrt-engine>`_:
+.. note::
+    The export and deployment functionality has been moved to a separate repository.
+    Install with: ``pip install git+https://github.com/NVIDIA-NeMo/Export-Deploy.git``
+
+The TensorRT-LLM engine can be built directly using ``trtllm-build`` command, see `TensorRT-LLM documentation <https://nvidia.github.io/TensorRT-LLM/latest/legacy/architecture/checkpoint.html#build-checkpoint-into-tensorrt-engine>`_:

 .. code-block:: bash

@@ -129,7 +121,7 @@ Alternatively, it can also be built directly using ``trtllm-build`` command, see

 Known issues
 ^^^^^^^^^^^^
-* Currently with ``nemo.export`` module building TensorRT-LLM engines for quantized "qnemo" models is limited to single-node deployments.
+* Building TensorRT-LLM engines for quantized "qnemo" models is limited to single-node deployments.


 Quantization-Aware Training (QAT)
@@ -183,25 +175,11 @@ Note that you may tweak the QAT trainer steps and learning rate if needed to ach
 NeMo checkpoints trained in FP8 with `NVIDIA Transformer Engine <https://github.com/NVIDIA/TransformerEngine>`_
 ----------------------------------------------------------------------------------------------------------------

-If you have an FP8-quantized checkpoint, produced during pre-training or fine-tuning with Transformer Engine, you can convert it to a FP8 TensorRT-LLM engine directly using ``nemo.export``.
-The API is the same as with regular ``.nemo`` and ``.qnemo`` checkpoints:
-
-.. code-block:: python
-
-    from nemo.export.tensorrt_llm import TensorRTLLM
-    trt_llm_exporter = TensorRTLLM(model_dir="/path/to/trt_llm_engine_folder")
-    trt_llm_exporter.export(
-        nemo_checkpoint_path="/path/to/llama2-7b-base-fp8.nemo",
-        model_type="llama",
-    )
-    trt_llm_exporter.forward(["Hi, how are you?", "I am good, thanks, how about you?"])
-
-The export settings for quantization can be adjusted via ``trt_llm_exporter.export`` arguments:
-
-* ``fp8_quantized: Optional[bool] = None``: manually enables/disables FP8 quantization
-* ``fp8_kvcache: Optional[bool] = None``: manually enables/disables FP8 quantization for KV-cache
+If you have an FP8-quantized checkpoint, produced during pre-training or fine-tuning with Transformer Engine, you can convert it to a FP8 TensorRT-LLM engine directly using the Export-Deploy repository.

-By default quantization settings are auto-detected from the NeMo checkpoint.
+.. note::
+    Export and deployment functionality is available in the Export-Deploy repository.
+    See: https://github.com/NVIDIA-NeMo/Export-Deploy


 References

examples/llm/finetune/automodel_vllm.py

Lines changed: 0 additions & 143 deletions
This file was deleted.
