HunyuanVideo-1.5 is a video generation model that delivers top-tier quality with only 8.3B parameters, significantly lowering the barrier to usage. It runs smoothly on consumer-grade GPUs, making it accessible for every developer and creator. This repository provides the implementation and tools needed to generate creative videos.
👏 Join our WeChat and Discord | 💻 Visit our official website to try the model!
- 🚀 Nov 24, 2025: We now support cache inference, achieving approximately 2x speedup! Pull the latest code to try it. 🔥🔥🔥🆕
- 👋 Nov 20, 2025: We release the inference code and model weights of HunyuanVideo-1.5.
demo.mp4
If you develop or use HunyuanVideo-1.5 in your projects, we welcome you to let us know.
- ComfyUI: A powerful and modular diffusion model GUI with a graph/nodes interface. ComfyUI supports HunyuanVideo-1.5 with various engineering optimizations for fast inference. We provide a ComfyUI Usage Guide for HunyuanVideo-1.5.
- comfyui_hunyuanvideo_1.5_plugin: A community-implemented ComfyUI plugin for HunyuanVideo-1.5, offering both simplified and complete node sets for quick usage or deep workflow customization, with built-in automatic model download support.
- LightX2V: A lightweight and efficient video generation framework that integrates HunyuanVideo-1.5, supporting multiple engineering acceleration techniques for fast inference.
- Wan2GP (v9.62): WanGP is a very low VRAM app (as low as 6 GB of VRAM for HunyuanVideo 1.5) that supports a LoRA accelerator for 8-step generation and offers tools to facilitate video generation.
- HunyuanVideo-1.5 (T2V/I2V)
- Inference Code and checkpoints
- ComfyUI Support
- LightX2V Support
- Diffusers Support
- Release all model weights (Sparse attention, distill model, and SR models)
- 🔥🔥🔥 News
- 🎥 Demo
- 🧩 Community Contributions
- 📑 Open-source Plan
- 📖 Introduction
- ✨ Key Features
- 📜 System Requirements
- 🛠️ Dependencies and Installation
- 🧱 Download Pretrained Models
- 📝 Prompt Guide
- 🔑 Usage
- 🧱 Models Cards
- 🎬 More Examples
- 📊 Evaluation
- 📚 Citation
- 🙏 Acknowledgements
- 🌟 Github Star History
We present HunyuanVideo-1.5, a lightweight yet powerful video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture with selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source models. By releasing the code and weights of HunyuanVideo-1.5, we provide the community with a high-performance foundation that significantly lowers the cost of video creation and research, making advanced video generation more accessible to all.
- Lightweight High-Performance Architecture: We propose an efficient architecture that integrates an 8.3B-parameter Diffusion Transformer (DiT) with a 3D causal VAE, achieving compression ratios of 16× in the spatial dimensions and 4× along the temporal axis. Additionally, the innovative SSTA (Selective and Sliding Tile Attention) mechanism prunes redundant spatiotemporal KV blocks, significantly reducing computational overhead for long video sequences and accelerating inference, achieving an end-to-end speedup of 1.87× for 10-second 720p video synthesis compared to FlashAttention-3.
- Video Super-Resolution Enhancement: We develop an efficient few-step super-resolution network that upscales outputs to 1080p. It enhances sharpness while correcting distortions, thereby refining details and overall visual texture.
- End-to-End Training Optimization: This work employs a multi-stage, progressive training strategy covering the entire pipeline from pre-training to post-training. Combined with the Muon optimizer to accelerate convergence, this approach holistically refines motion coherence, aesthetic quality, and human preference alignment, achieving professional-grade content generation.
- GPU: NVIDIA GPU with CUDA support
- Minimum GPU Memory: 14 GB (with model offloading enabled)
  Note: The memory requirements above are measured with model offloading enabled. If your GPU has sufficient memory, you may disable offloading for improved inference speed.
- Operating System: Linux
- Python: Python 3.10 or higher
- CUDA: Compatible CUDA version for your PyTorch installation
git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.git
cd HunyuanVideo-1.5
pip install -r requirements.txt
pip install -i https://mirrors.tencent.com/pypi/simple/ --upgrade tencentcloud-sdk-python
- Flash Attention: Install Flash Attention for faster inference and reduced GPU memory consumption. Detailed installation instructions are available at Flash Attention.
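For reference, Flash Attention is typically installed from PyPI as follows (this mirrors the upstream project's documented route and requires a CUDA toolchain compatible with your PyTorch build):

```bash
# Typical Flash Attention install; see the Flash Attention repository
# for prebuilt wheels and version guidance.
pip install flash-attn --no-build-isolation
```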
- Flex-Block-Attention: flex-block-attn is only required for sparse attention (for faster inference) and can be installed with the following commands:
git clone https://github.com/Tencent-Hunyuan/flex-block-attn.git
cd flex-block-attn
git submodule update --init --recursive
python3 setup.py install
- SageAttention: To enable SageAttention for faster inference, install it with the following commands:
  Note: Enabling SageAttention will automatically disable Flex-Block-Attention.
git clone https://github.com/cooper1637/SageAttention.git
cd SageAttention
export EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32 # Optional
python3 setup.py install
💡 Distillation models and sparse attention models are still coming soon. Please stay tuned for the latest updates on the Hugging Face Model Card.
Download the pretrained models before generating videos. Detailed instructions are available at checkpoints-download.md.
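As a convenience, weights hosted on Hugging Face can usually be fetched with the huggingface_hub CLI; the repository id below is a placeholder, so substitute the official ids listed in checkpoints-download.md or the model cards:

```bash
# <huggingface_repo_id> is a placeholder; use the official id from checkpoints-download.md.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <huggingface_repo_id> --local-dir ckpts
```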
Prompt enhancement plays a crucial role in enabling our model to generate high-quality videos: writing longer, more detailed prompts significantly improves the generated video. We encourage you to craft comprehensive and descriptive prompts to achieve the best possible video quality, and we recommend that community partners consult our official guide on how to write effective prompts.
Reference: HunyuanVideo-1.5 Prompt Handbook
For users seeking to optimize prompts for other large models, it is recommended to consult the definition of t2v_rewrite_system_prompt in the file hyvideo/utils/rewrite/t2v_prompt.py to guide text-to-video rewriting. Similarly, for image-to-video rewriting, refer to the definition of i2v_rewrite_system_prompt in hyvideo/utils/rewrite/i2v_prompt.py.
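To inspect these built-in rewrite instructions directly (assuming they are module-level strings, as the file paths above suggest), something like the following should print them:

```bash
# Print the T2V rewrite system prompt; use i2v_prompt / i2v_rewrite_system_prompt for I2V.
python3 -c "from hyvideo.utils.rewrite.t2v_prompt import t2v_rewrite_system_prompt; print(t2v_rewrite_system_prompt)"
```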
For prompt rewriting, we recommend using Gemini or models deployed via vLLM. This codebase currently only supports models compatible with the vLLM API. If you wish to use Gemini, you will need to implement your own interface calls.
For models with a vLLM API, note that T2V (text-to-video) and I2V (image-to-video) have different recommended models and environment variables:
- T2V: use Qwen3-235B-A22B-Thinking-2507 and configure T2V_REWRITE_BASE_URL and T2V_REWRITE_MODEL_NAME
- I2V: use Qwen3-VL-235B-A22B-Instruct and configure I2V_REWRITE_BASE_URL and I2V_REWRITE_MODEL_NAME
You may set the above model names to any other vLLM-compatible models you have deployed (including HuggingFace models).
Rewriting is enabled by default (--rewrite defaults to true); to disable it explicitly, use --rewrite false or --rewrite 0. If no vLLM endpoint is configured, the pipeline runs without remote rewriting.
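Before launching generation, you can sanity-check that your vLLM endpoint is reachable with a plain OpenAI-style chat request (this assumes the configured base URL points at vLLM's OpenAI-compatible /v1 prefix; it is not the repository's internal rewrite call):

```bash
# Quick reachability check against an OpenAI-compatible vLLM server.
curl -s "$T2V_REWRITE_BASE_URL/chat/completions" \
  -H "Content-Type: application/json" \
  -d "{\"model\": \"$T2V_REWRITE_MODEL_NAME\", \"messages\": [{\"role\": \"user\", \"content\": \"Say OK if you can read this.\"}]}"
```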
Example: Generate a video (works for both T2V and I2V; set IMAGE_PATH=none for T2V or provide an image path for I2V)
export T2V_REWRITE_BASE_URL="<your_vllm_server_base_url>"
export T2V_REWRITE_MODEL_NAME="<your_model_name>"
export I2V_REWRITE_BASE_URL="<your_vllm_server_base_url>"
export I2V_REWRITE_MODEL_NAME="<your_model_name>"
PROMPT='A girl holding a paper with words "Hello, world!"'
IMAGE_PATH=./data/reference_image.png # Optional, 'none' or <image path>
SEED=1
ASPECT_RATIO=16:9
RESOLUTION=480p
OUTPUT_PATH=./outputs/output.mp4
# Configuration
N_INFERENCE_GPU=8 # Parallel inference GPU count
CFG_DISTILLED=true # Inference with CFG distilled model, 2x speedup
SPARSE_ATTN=false # Inference with sparse attention (only 720p models are equipped with sparse attention). Please ensure flex-block-attn is installed
SAGE_ATTN=true # Inference with SageAttention
REWRITE=true # Enable prompt rewriting. Please ensure rewrite vLLM server is deployed and configured.
OVERLAP_GROUP_OFFLOADING=true # Only valid when group offloading is enabled, significantly increases CPU memory usage but speeds up inference
ENABLE_CACHE=true # Enable feature cache during inference. Significantly speeds up inference.
MODEL_PATH=ckpts # Path to pretrained model
torchrun --nproc_per_node=$N_INFERENCE_GPU generate.py \
--prompt "$PROMPT" \
--image_path $IMAGE_PATH \
--resolution $RESOLUTION \
--aspect_ratio $ASPECT_RATIO \
--seed $SEED \
--cfg_distilled $CFG_DISTILLED \
--sparse_attn $SPARSE_ATTN \
--use_sageattn $SAGE_ATTN \
--enable_cache $ENABLE_CACHE \
--rewrite $REWRITE \
--output_path $OUTPUT_PATH \
--overlap_group_offloading $OVERLAP_GROUP_OFFLOADING \
--save_pre_sr_video \
--model_path $MODEL_PATH

Tips: If your GPU memory is > 14 GB but you encounter OOM (Out of Memory) errors during generation, you can try setting the following environment variable before running:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128

Tips: If you have limited CPU memory and encounter OOM during inference, you can try disabling overlapped group offloading by adding the following argument:
--overlap_group_offloading false
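Combining the two tips above, a compact low-memory, single-GPU run might look like the following sketch (the flag values are illustrative, not tuned recommendations):

```bash
# Low-memory sketch: expandable allocator segments, CPU offloading enabled,
# and overlapped group offloading disabled to reduce CPU RAM pressure.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:128
torchrun --nproc_per_node=1 generate.py \
  --prompt "A girl holding a paper with words \"Hello, world!\"" \
  --resolution 480p \
  --offloading true \
  --overlap_group_offloading false \
  --model_path ckpts
```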
| Argument | Type | Required | Default | Description |
|---|---|---|---|---|
| --prompt | str | Yes | - | Text prompt for video generation |
| --negative_prompt | str | No | '' | Negative prompt for video generation |
| --resolution | str | Yes | - | Video resolution: 480p or 720p |
| --model_path | str | Yes | - | Path to pretrained model directory |
| --aspect_ratio | str | No | 16:9 | Aspect ratio of the output video |
| --num_inference_steps | int | No | 50 | Number of inference steps |
| --video_length | int | No | 121 | Number of frames to generate |
| --seed | int | No | 123 | Random seed for reproducibility |
| --image_path | str | No | None | Path to reference image (enables I2V mode). Use none or None to explicitly use text-to-video mode |
| --output_path | str | No | None | Output file path (if not provided, saves to ./outputs/output_{transformer_version}_{timestamp}.mp4) |
| --sr | bool | No | true | Enable super resolution (use --sr false or --sr 0 to disable) |
| --save_pre_sr_video | bool | No | false | Save the original video before super resolution (use --save_pre_sr_video or --save_pre_sr_video true to enable; only effective when super resolution is enabled) |
| --rewrite | bool | No | true | Enable prompt rewriting (use --rewrite false or --rewrite 0 to disable; disabling may result in lower-quality video generation) |
| --cfg_distilled | bool | No | false | Enable the CFG distilled model for faster inference (~2x speedup; use --cfg_distilled or --cfg_distilled true to enable) |
| --sparse_attn | bool | No | false | Enable sparse attention for faster inference (~1.5-2x speedup, requires H-series GPUs, auto-enables CFG distilled; use --sparse_attn or --sparse_attn true to enable) |
| --offloading | bool | No | true | Enable CPU offloading (use --offloading false or --offloading 0 to disable for faster inference if GPU memory allows) |
| --group_offloading | bool | No | None | Enable group offloading (automatically enabled if offloading is enabled; use --group_offloading or --group_offloading true/1 to enable, --group_offloading false/0 to disable) |
| --overlap_group_offloading | bool | No | true | Enable overlapped group offloading; significantly increases CPU memory usage but speeds up inference (use --overlap_group_offloading or --overlap_group_offloading true/1 to enable, --overlap_group_offloading false/0 to disable) |
| --dtype | str | No | bf16 | Data type for the transformer: bf16 (faster, lower memory) or fp32 (better quality, slower, higher memory) |
| --use_sageattn | bool | No | false | Enable SageAttention (use --use_sageattn or --use_sageattn true/1 to enable, --use_sageattn false/0 to disable) |
| --sage_blocks_range | str | No | 0-53 | SageAttention blocks range (e.g., 0-5 or 0,1,2,3,4,5) |
| --enable_torch_compile | bool | No | false | Enable torch.compile for the transformer (use --enable_torch_compile or --enable_torch_compile true/1 to enable, --enable_torch_compile false/0 to disable) |
| --enable_cache | bool | No | false | Enable the feature cache for the transformer (use --enable_cache or --enable_cache true/1 to enable, --enable_cache false/0 to disable) |
| --cache_start_step | int | No | 11 | Start step to skip when using cache |
| --cache_end_step | int | No | 45 | End step to skip when using cache |
| --total_steps | int | No | 50 | Total inference steps |
| --cache_step_interval | int | No | 4 | Step interval to skip when using cache |
Note: Use --nproc_per_node to specify the number of GPUs. For example, --nproc_per_node=8 uses 8 GPUs.
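For instance, a minimal single-GPU run with the feature cache enabled could look like the sketch below (it uses only the arguments documented above; the cache step values simply mirror the defaults and may need tuning):

```bash
# Single-GPU example with the feature cache enabled for faster inference.
torchrun --nproc_per_node=1 generate.py \
  --prompt "A corgi surfing a small wave at sunset" \
  --resolution 480p \
  --enable_cache true \
  --cache_start_step 11 \
  --cache_end_step 45 \
  --cache_step_interval 4 \
  --model_path ckpts
```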
The following table provides the optimal inference configurations (CFG scale, embedded CFG scale, flow shift, and inference steps) for each model to achieve the best generation quality:
| Model | CFG Scale | Embedded CFG Scale | Flow Shift | Inference Steps |
|---|---|---|---|---|
| 480p T2V | 6 | None | 5 | 50 |
| 480p I2V | 6 | None | 5 | 50 |
| 720p T2V | 6 | None | 9 | 50 |
| 720p I2V | 6 | None | 7 | 50 |
| 480p T2V CFG Distilled | 1 | None | 5 | 50 |
| 480p I2V CFG Distilled | 1 | None | 5 | 50 |
| 720p T2V CFG Distilled | 1 | None | 9 | 50 |
| 720p I2V CFG Distilled | 1 | None | 7 | 50 |
| 720p T2V CFG Distilled Sparse | 1 | None | 9 | 50 |
| 720p I2V CFG Distilled Sparse | 1 | None | 7 | 50 |
| 480→720 SR Step Distilled | 1 | None | 2 | 6 |
| 720→1080 SR Step Distilled | 1 | None | 2 | 8 |
Please note that the CFG distilled models we provide must use 50 inference steps to generate correct results.
| ModelName | Download |
|---|---|
| HunyuanVideo-1.5-480P-T2V | 480P-T2V |
| HunyuanVideo-1.5-480P-I2V | 480P-I2V |
| HunyuanVideo-1.5-480P-T2V-cfg-distill | 480P-T2V-cfg-distill |
| HunyuanVideo-1.5-480P-I2V-cfg-distill | 480P-I2V-cfg-distill |
| HunyuanVideo-1.5-720P-T2V | 720P-T2V |
| HunyuanVideo-1.5-720P-I2V | 720P-I2V |
| HunyuanVideo-1.5-720P-T2V-cfg-distill | Coming soon |
| HunyuanVideo-1.5-720P-I2V-cfg-distill | 720P-I2V-cfg-distill |
| HunyuanVideo-1.5-720P-T2V-sparse-cfg-distill | Coming soon |
| HunyuanVideo-1.5-720P-I2V-sparse-cfg-distill | 720P-I2V-sparse-cfg-distill |
| HunyuanVideo-1.5-720P-sr-step-distill | 720P-sr |
| HunyuanVideo-1.5-1080P-sr-step-distill | 1080P-sr |
We assess text-to-video generation using a comprehensive rating methodology that considers five key dimensions: text-video consistency, visual quality, structural stability, motion effects, and the aesthetic quality of individual frames. For image-to-video generation, the evaluation encompasses image-video consistency, instruction responsiveness, visual quality, structural stability, and motion effects.
The GSB (Good/Same/Bad) approach is widely used to evaluate the relative performance of two models based on overall video perception quality. We carefully construct 300 diverse text prompts and 300 image samples to cover balanced application scenarios for both text-to-video and image-to-video tasks. For each prompt or image input, an equal number of video samples are generated by each model in a single run to ensure comparability. To maintain fairness, inference is performed only once per input without any cherry-picking of results. All competing models are evaluated using their default configurations. The evaluation is conducted by over 100 professional assessors.
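As a rough illustration only (the report does not specify the exact aggregation), GSB results are often summarized as the Good-minus-Bad margin over all pairwise comparisons:

```bash
# ratings.txt is a hypothetical file with one G/S/B verdict per comparison;
# prints the Good-minus-Bad margin as a percentage of all comparisons.
awk '{count[$1]++} END {printf "GSB margin: %+.1f%%\n", 100 * (count["G"] - count["B"]) / NR}' ratings.txt
```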
We report inference speed with basic engineering-level acceleration techniques enabled on 8 H800 GPUs to demonstrate the practical performance achievable in real-world deployment scenarios. Please note that in this experiment we do not pursue the most extreme acceleration at the cost of generation quality, but rather aim for notable speed improvements while maintaining nearly identical output quality.
We report the total inference time for 50 diffusion steps for HunyuanVideo 1.5 below:
@misc{hunyuanvideo2025,
title={HunyuanVideo 1.5 Technical Report},
author={Tencent Hunyuan Foundation Model Team},
year={2025},
publisher = {GitHub},
howpublished = {\url{https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5}},
}

We would like to thank the contributors to Transformers, Diffusers, HuggingFace, and Qwen-VL for their open research and exploration.