13 changes: 10 additions & 3 deletions docs/.nav.yml
@@ -9,7 +9,7 @@ nav:
- getting_started/compatibility_matrix.md
- getting_started/validated_models.md
- Configuration Guides:
- configuration/env_vars.md
- configuration/env_variables.md
- configuration/long_context.md
- Calibration:
- configuration/calibration/calibration.md
@@ -20,11 +20,18 @@
- configuration/quantization/inc.md
- configuration/quantization/auto_awq.md
- configuration/quantization/gptqmodel.md
- configuration/performance_tuning.md
- configuration/pipeline_parallelism.md
- Warm-up:
- configuration/warm-up/warm-up.md
- configuration/warm-up/sampler_warm-up.md
- configuration/warm-up/defragmenter_warm-up.md
- configuration/warm-up/managing_warm-up.md
- Features:
- features/supported_features.md
- features/*
- features/quantization
- features/bucketing_mechanism.md
- features/floating_point_8.md
- features/unified_attn.md
- Developer Guides:
- dev_guide/plugin_system.md
- dev_guide/ci-failures.md
File renamed without changes.
3 changes: 0 additions & 3 deletions docs/configuration/optimization.md

This file was deleted.

47 changes: 47 additions & 0 deletions docs/configuration/performance_tuning.md
@@ -0,0 +1,47 @@
# Performance Tuning

Understanding how configuration settings affect system behavior is essential for effective performance management. This document explains how you can tune and optimize performance.

## Warm-Up

During the development phase, when evaluating a model for inference on vLLM, you may skip the server warm-up phase by setting the `VLLM_SKIP_WARMUP=true` environment variable. This helps achieve faster testing turnaround times. However, disabling warm-up is acceptable only for development purposes; we strongly recommend keeping it enabled in production environments, with an optimal number of [buckets](../../features/bucketing_mechanism.md).
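
For example, a minimal development-only invocation might look like the following sketch; the model name is illustrative:

```bash
# Development only: skip the warm-up phase for faster iteration (do not use in production)
VLLM_SKIP_WARMUP=true vllm serve meta-llama/Llama-3.1-8B-instruct
```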

Warm-up time depends on many factors, such as input and output sequence length, batch size, number of buckets, and data type. It can even take a couple of hours, depending on the configuration. For more information, see the [Warm-up](../../features/warmup.md) document.

## Memory Allocation

HPU graphs and the KV cache share the same usable memory pool, determined by `gpu_memory_utilization`. Memory allocation between the two must be balanced to prevent performance degradation. You can find memory consumption information for your model in the logs. They provide device memory usage during model weight loading, profiling runs (using dummy data and without the KV cache), and the final usable memory available before the warm-up phase begins. You can use this information to determine an appropriate bucketing scheme for warm-ups. The following example shows the initial part of the generated server log for the [Meta-Llama-3.1-8B](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B) model:

```text hl_lines="3 4 5 7"
INFO 09-24 17:31:39 habana_model_runner.py:590] Pre-loading model weights on hpu:0 took 15.05 GiB of device memory (15.05 GiB/94.62 GiB used) and 1.067 GiB of host memory (8.199 GiB/108.2 GiB used)
INFO 09-24 17:31:39 habana_model_runner.py:636] Wrapping in HPU Graph took 0 B of device memory (15.05 GiB/94.62 GiB used) and -3.469 MiB of host memory (8.187 GiB/108.2 GiB used)
INFO 09-24 17:31:39 habana_model_runner.py:640] Loading model weights took in total 15.05 GiB of device memory (15.05 GiB/94.62 GiB used) and 1.056 GiB of host memory (8.188 GiB/108.2 GiB used)
INFO 09-24 17:31:40 habana_worker.py:153] Model profiling run took 355 MiB of device memory (15.4 GiB/94.62 GiB used) and 131.4 MiB of host memory (8.316 GiB/108.2 GiB used)
INFO 09-24 17:31:40 habana_worker.py:177] Free device memory: 79.22 GiB, 71.3 GiB usable (gpu_memory_utilization=0.9), 7.13 GiB reserved for HPUGraphs (VLLM_GRAPH_RESERVED_MEM=0.1), 64.17 GiB reserved for KV cache
INFO 09-24 17:31:40 habana_executor.py:85] # HPU blocks: 4107, # CPU blocks: 256
INFO 09-24 17:31:41 habana_worker.py:208] Initializing cache engine took 64.17 GiB of device memory (79.57 GiB/94.62 GiB used) and 1.015 GiB of host memory (9.329 GiB/108.2 GiB used)
```

You can control the ratio between HPU graphs and KV cache using the `VLLM_GRAPH_RESERVED_MEM` environment variable. Increasing the KV cache size enables larger batch processing, improving overall throughput. Enabling [HPU graphs](../warm-up/warm-up.md#hpu-graph-capture) helps reduce host [overhead](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html#reducing-host-overhead-with-hpu-graphs) and can lower latency.
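
As a rough sketch based on the log above, the split works out as follows; the value assigned below is illustrative only, not a recommendation:

```bash
# From the log above: 79.22 GiB free, gpu_memory_utilization=0.9 -> 71.3 GiB usable
# VLLM_GRAPH_RESERVED_MEM=0.1 reserves 0.1 * 71.3 GiB ~= 7.13 GiB for HPU graphs,
# leaving ~64.17 GiB for the KV cache.
export VLLM_GRAPH_RESERVED_MEM=0.2   # illustrative: give HPU graphs a larger share, shrinking the KV cache
```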

The following example shows the warm-up phase logs:

```text
INFO 09-24 17:32:13 habana_model_runner.py:1477] Graph/Prompt captured:24 (100.0%) used_mem:67.72 MiB buckets:[(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024)]
INFO 09-24 17:32:13 habana_model_runner.py:1477] Graph/Decode captured:1 (100.0%) used_mem:64 KiB buckets:[(4, 128)]
INFO 09-24 17:32:13 habana_model_runner.py:1620] Warmup finished in 32 secs, allocated 92.77 MiB of device memory
INFO 09-24 17:32:13 habana_executor.py:91] init_cache_engine took 64.26 GiB of device memory (79.66 GiB/94.62 GiB used) and 1.104 GiB of host memory (9.419 GiB/108.2 GiB used)
```

After analyzing these logs, you should have a good understanding of how much free device memory remains for overhead calculations and how much more could still be used by increasing `gpu_memory_utilization`. You can balance the memory allocation for warm-up bucketing, HPU graphs, and the KV cache to suit your workload requirements.

The `VLLM_GRAPH_PROMPT_RATIO` environment variable controls the ratio of usable graph memory between prefill and decode graphs. Assigning more memory to a stage usually results in faster execution for that stage.
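
A minimal sketch of tuning both knobs together; the values are illustrative:

```bash
export VLLM_GRAPH_RESERVED_MEM=0.1   # share of usable memory reserved for HPU graphs
export VLLM_GRAPH_PROMPT_RATIO=0.3   # portion of graph memory assigned to prefill (prompt) graphs
```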

## Bucketing Mechanism

The [bucketing mechanism](../../features/bucketing_mechanism.md) can help optimize performance across different workloads. The vLLM server is pre-configured for heavy decoding scenarios with high request concurrency, using the default maximum batch size strategy (`VLLM_GRAPH_DECODE_STRATEGY`). During low-load periods, this configuration may not be ideal and can be adjusted for smaller batch sizes. For example, modifying bucket ranges via `VLLM_DECODE_BS_BUCKET_{param}` can improve efficiency. For a list of environment variables controlling bucketing behavior, see the [Environment Variables](../env_variables.md) document.
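
For example, a low-concurrency deployment might cap the decode batch-size buckets as in the following sketch; the variable names match the examples in this documentation, and the values are illustrative:

```bash
export VLLM_DECODE_BS_BUCKET_MIN=1
export VLLM_DECODE_BS_BUCKET_STEP=16
export VLLM_DECODE_BS_BUCKET_MAX=32
```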

## Floating Point 8-bit

Using the Floating Point 8-bit (FP8) data type for large language models reduces memory bandwidth requirements by half compared to BF16. In addition, the FP8 computation is twice as fast as BF16, enabling performance gains even for compute-bound workloads, such as offline inference with large batch sizes.
For more information, see the [Floating Point 8-bit](../../features/floating_point_8.md) document.
2 changes: 1 addition & 1 deletion docs/configuration/quantization/quantization.md
@@ -4,7 +4,7 @@ title: Introduction

# Quantization and Inference

Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices. The Intel® Gaudi® Backend supports following quantization backends:
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices. The Intel® Gaudi® Backend supports the following quantization backends:

- [Intel® Neural Compressor](inc.md)
- [Auto_Awq](auto_awq.md)
54 changes: 54 additions & 0 deletions docs/configuration/warm-up/defragmenter_warm-up.md
@@ -0,0 +1,54 @@
# Defragmenter Warm-Up

The defragmenter reclaims and compacts sparse KV-cache block usage at runtime by swapping sparsely used high-index blocks into lower free indices. Its warm-up phase pre-compiles the small swap graphs so that later online defragmentation can execute with near-zero graph compile latency.

Defragmentation may be triggered mid-serving when the highest allocated block index drifts far above the actual number of in-use blocks (fragmentation). The operation itself is a sequence of swap kernels applied over key and value caches. With warm-up, all representative padded sizes are precompiled ahead of time via a deterministic, minimal swap. This ensures that online defragmentation becomes a predictable, low-latency maintenance task. Skipping only the defragmenter warm-up does not compromise correctness; it only increases the risk of sporadic latency when fragmentation first exceeds the threshold that mandates compaction.

The potential consequences of omitting warm-up include:

- The first fragmentation event that requires a previously unseen padded swap size triggers graph capture and compilation on the critical path.
- Compilation latency can manifest as a sudden tail-latency spike for a user request.
- Multiple first-seen swap sizes across different processes may each trigger separate compilations.

You can disable either the warm-up step itself or the entire defragmentation feature. To skip all warm-up phases, including the defragmenter, set `VLLM_SKIP_WARMUP=true`. Alternatively, running without unified attention effectively disables the defragmenter, since it is tied to unified attention; in this case, the warm-up becomes a no-op. Note that there is no separate environment flag in this version to force-enable or disable defragmentation independently of unified attention. Additionally, if supported by your execution mode, you can avoid graph compilation for defragmenter swaps by setting `VLLM_DEFRAG_WITH_GRAPHS=false`. This causes swaps to fall back to regular execution, while the warm-up still exercises them without triggering graph capture.

Related environment variables:

- `VLLM_DEFRAG_THRESHOLD`: Sets the fragmentation trigger heuristic. The default value is 32; lower values make compaction more aggressive.
- `VLLM_DEFRAG_WITH_GRAPHS`: Determines whether defragmenter swaps are captured and executed as HPU graphs (`true`) or fall back to regular execution (`false`). By default, this follows `bridge_mode == eager`.
- `VLLM_DEBUG=defrag`: Enables verbose defragmentation debug logging.
- `VLLM_SKIP_WARMUP`: Disables all warm-up stages including defragmentation.

!!! note
Disabling the defragmenter warm-up does not turn off defragmentation itself, unless unified attention or the feature is entirely disabled. It simply skips ahead-of-time graph preparation, which may shift the compilation cost to the first live fragmentation event.
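
A combined sketch of the variables listed above, with illustrative values:

```bash
# Illustrative: compact more aggressively, keep swaps on the regular (non-graph) path,
# and enable verbose defragmentation logging.
export VLLM_DEFRAG_THRESHOLD=16
export VLLM_DEFRAG_WITH_GRAPHS=false
export VLLM_DEBUG=defrag
```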

## Performing Defragmenter Warm-Up

During the main warm-up (`warmup_model`), the system calls the internal `warmup_defragmenter` method after initializing the KV caches and defragmenter. The process consists of the following warm-up steps:

1. Confirming that the defragmenter warm-up feature is enabled, as it only runs when unified attention is enabled, and that the `cache_utils` swap utilities are ready.
2. Establishing the list of padding thresholds: `[8, 16, 32, 64, 128, 256, 512]`.
3. Choosing a minimal valid swap pair `[(1, 0)]` with two distinct block IDs. Only two real blocks are required. Internally, each swap call is padded up to the current threshold length so that a compiled graph for that exact padded size is produced.
4. Iterating through each threshold and invoking a swap. This captures or compiles, depending on the execution mode, the swap graph for that padded size.
5. Performing one extra swap with the first threshold when the number of thresholds is odd, so that the sequence of swaps returns the KV cache to its original state (net zero logical change).
6. Logging completion.

Future defragmentation swap requests always round or pad to one of these known thresholds. All operational swap sizes hit a pre-compiled path and avoid on-demand compilation latency.

## Logs

The following example presents a typical sequence of logs that appear when there are at least two KV-cache blocks available:

```text
INFO 09-22 16:26:24 [hpu_model_runner.py:3428] Warming up defragmenter with thresholds: [8, 16, 32, 64, 128, 256, 512]
INFO 09-22 16:26:27 [hpu_model_runner.py:3452] Defragmenter warmup completed successfully
```

If there are not enough blocks, for example in an extremely small test configuration or after an allocation failure, the warm-up is skipped gracefully and you may see logs similar to the following example:

```text
INFO 09-22 16:26:24 [hpu_model_runner.py:3428] Warming up defragmenter with thresholds: [8, 16, 32, 64, 128, 256, 512]
WARNING hh:mm:ss hpu_model_runner.py:#### Skipping defragmenter warmup, insufficient blocks (1)
```

To emit fine-grained debug messages during live defragmentation (not only during the minimal warm-up swaps), add `VLLM_DEBUG=defrag` to the environment. This lets you see the number of blocks swapped and post-compaction statistics.
82 changes: 82 additions & 0 deletions docs/configuration/warm-up/managing_warm-up.md
@@ -0,0 +1,82 @@
# Managing and Reducing Warm-up Time

This document provides guidance on reducing warm-up time during vLLM model deployment on Intel® Gaudi® accelerators. It outlines the use of HPU graph caching, bucketing strategies,
and experimental features to improve model performance.

## Reducing Warm-up Time with HPU Graph Caching

Intel Gaudi software supports caching of compiled HPU graphs using the `PT_HPU_RECIPE_CACHE_CONFIG` environment variable. This can significantly reduce startup time by reusing previously compiled graphs.

Setting the variable requires using the following format:

```bash
export PT_HPU_RECIPE_CACHE_CONFIG=<RECIPE_CACHE_PATH>,<RECIPE_CACHE_DELETE>,<RECIPE_CACHE_SIZE_MB>
```

Where:

- `RECIPE_CACHE_PATH`: The directory for storing the compiled graph recipes.
- `RECIPE_CACHE_DELETE`: A boolean that controls cache behavior: when set to `true`, existing contents are cleared before storing new graph-compiled recipes; when set to `false`, the graph-compiled recipes stored in `RECIPE_CACHE_PATH` are reused, which speeds up the warm-up.
- `RECIPE_CACHE_SIZE_MB`: Sets the maximum size of the cache directory in MB. If the cache size limit is reached, the PyTorch bridge automatically deletes the oldest recipes, based on file creation time. We recommend adjusting the cache directory size according to the model and use case requirements.

The graph compilation process consists of two stages: GC graph compilation and HPU graph compilation. When `PT_HPU_RECIPE_CACHE_CONFIG` is enabled, the GC stage is skipped by reusing cached graphs, significantly reducing overall compilation time. The HPU graph compilation step, however, is still executed. The graph has to be regenerated in the following cases:

- PyTorch container or Intel® Gaudi® software version changes.
- Platform changes, for example Intel® Gaudi® 2 to Intel® Gaudi® 3.
- Model tensor parallelism or data type changes, for example, BF16 to FP8 or FP8 to BF16.

### Storage Recommendations

For scale-up scenarios where caching is shared across processes, we recommend using the local disk. Remote filesystems, such as NFS, should be avoided because they do not support file locking.

In Kubernetes environments, the cache can be stored on a PVC or NFS, but it should be copied to local disk before use.
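
A minimal sketch of this pattern, assuming hypothetical mount and cache paths:

```bash
# Copy the recipe cache from a PVC/NFS mount (hypothetical path) to local disk,
# then point the cache variable at the local copy in reuse mode.
cp -r /mnt/recipe-cache-pvc/llama3_8b_recipe_cache /tmp/llama3_8b_recipe_cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192
```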

For a usage example, refer to [Intel Gaudi Tutorials](https://github.com/HabanaAI/Gaudi-tutorials/blob/special/k8s/vllm-8b-cache.yaml).

### Deployment with vLLM

To cache the compiled HPU graphs and reduce the startup time, use one of the following methods.

#### Serving Command

Add the cache parameter to the serving command as shown in the following example for Llama 3.1 8B:

```bash
# Store in cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
# Replay from cache
export PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',False,8192
VLLM_PROMPT_BS_BUCKET_MAX=256 \
VLLM_DECODE_BS_BUCKET_MIN=128 \
VLLM_DECODE_BS_BUCKET_STEP=128 \
VLLM_DECODE_BS_BUCKET_MAX=128 \
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 \
VLLM_DECODE_BLOCK_BUCKET_MAX=1024 \
PT_HPU_WEIGHT_SHARING=0 PT_HPU_MAX_COMPOUND_OP_SIZE=30 PT_HPU_LAZY_MODE=1 PT_HPU_ENABLE_LAZY_COLLECTIVES=true vllm serve meta-llama/Llama-3.1-8B-instruct -tp 1 --weights-load-device cpu --max-model-len 8192
```

This results in the following:

| Precision | Without cache | With cache | Time reduction |
| --------- | ------------- | ---------- | -------------- |
| BF16      | 66 sec        | 23 sec     | ~65%           |
| FP8       | 504 sec       | 34 sec     | ~93%           |

#### Docker

No changes are required in the Dockerfile, as the recipe cache is specific to the model and use case. Use the `-e` flag to set the environment variable:

```
-e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192
```
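
For example, a container launch might look like the following sketch; the image name, mount paths, and runtime flags are placeholders to adapt to your environment:

```bash
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e PT_HPU_RECIPE_CACHE_CONFIG='/tmp/llama3_8b_recipe_cache/',True,8192 \
  -v /path/to/local/cache:/tmp/llama3_8b_recipe_cache \
  <your-vllm-gaudi-image>
```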

## Bucket Management

vLLM warm-up time is determined by the number of HPU graphs that must be compiled to support dynamic shapes, which in turn are influenced by the `batch_size` and `sequence_length`. The following parameters define the upper limits for graph compilation. Setting them according to `max_model_len` ensures that additional graphs are not compiled at runtime.

- Sequence length max (`VLLM_PROMPT_SEQ_BUCKET_MAX`): `max_model_len`
- Block size max (`VLLM_DECODE_BLOCK_BUCKET_MAX`): `max(128, (max_num_seqs*2048)/block_size)`
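
As a worked example, assuming `max_model_len=8192`, `max_num_seqs=128`, and `block_size=128` (illustrative values):

```bash
export VLLM_PROMPT_SEQ_BUCKET_MAX=8192    # match max_model_len
export VLLM_DECODE_BLOCK_BUCKET_MAX=2048  # max(128, 128*2048/128) = 2048
```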

## Exponential Bucketing

The `VLLM_EXPONENTIAL_BUCKETING=True` flag, enabled by default starting with the vLLM `1.21.0-post1` release, switches the bucketing strategy from linear to exponential. This can reduce the number of buckets and warm-up time by up to 80%, while generally maintaining comparable inference performance. In some configurations, however, it may lead to a slight performance drop due to increased padding. This setting is particularly effective for BF16 and FP8 models. To use linear bucketing instead, set `VLLM_EXPONENTIAL_BUCKETING=False`.