Description
Problem Description
When testing DeepSeek R1 + TP4 + MXFP4 with the vLLM docker image docker.io/rocm/vllm-dev:dsfp4_1111 on an MI355 machine, I hit the error "AssertionError: Aiter MLA only supports 16 or 128 number of heads. Provided 32 number of heads".
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.worker.load_model()
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 214, in load_model
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2562, in load_model
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.model = model_loader.load_model(
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 45, in load_model
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     model = initialize_model(vllm_config=vllm_config,
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1179, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.model = DeepseekV2Model(vllm_config=vllm_config,
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 200, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1107, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                                                     ^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 627, in make_layers
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1109, in <lambda>
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     lambda prefix: DeepseekV2DecoderLayer(vllm_config, prefix),
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 927, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.self_attn = attn_cls(
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                      ^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 876, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.mla_attn = MultiHeadLatentAttention(
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                     ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mla.py", line 156, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.mla_attn = Attention(
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                     ^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 208, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/rocm_aiter_mla.py", line 192, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     assert (num_heads == 16 or num_heads == 128), (
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597] AssertionError: Aiter MLA only supports 16 or 128 number of heads.
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597] Provided 32 number of heads.
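For context, the 32 in the assertion appears to be the per-rank head count after tensor parallelism: DeepSeek R1 has 128 attention heads, and TP4 splits them as 128 / 4 = 32, which is not in the {16, 128} set the Aiter MLA backend asserts on. A minimal sketch of that arithmetic (the 128-head count is from the DeepSeek R1 config; the supported set is quoted from the assertion above):

```python
# Sketch: per-rank head count under tensor parallelism vs. the
# Aiter MLA assertion quoted in the traceback above.
TOTAL_HEADS = 128      # num_attention_heads in DeepSeek R1
SUPPORTED = {16, 128}  # head counts the assertion accepts

for tp in (1, 2, 4, 8):
    heads_per_rank = TOTAL_HEADS // tp
    verdict = "ok" if heads_per_rank in SUPPORTED else "AssertionError"
    print(f"TP={tp}: {heads_per_rank} heads/rank -> {verdict}")
```

By this arithmetic only TP=1 (128 heads/rank) and TP=8 (16 heads/rank) would satisfy the assertion, while TP=4 produces the 32 reported in the log.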
Operating System
Ubuntu 22.04.5 LTS
CPU
AMD EPYC 9575F 64-Core Processor
GPU
8x MI355
ROCm Version
rocm-7.1.0
ROCm Component
No response
Steps to Reproduce
(1) docker run command:
docker_base_img_name=docker.io/rocm/vllm-dev:dsfp4_1111
docker_ctnr_name=vllm_dev_dsfp4_1111_test
docker run -it \
  --name=${docker_ctnr_name} \
  --user=root \
  -e HF_HOME=/data/huggingface \
  --volume=/data:/data \
  --volume=$HOME:/workdir \
  -v /mnt:/mnt \
  --cap-add=SYS_PTRACE \
  --group-add=video \
  --ipc=host \
  --shm-size=16G \
  --device=/dev/kfd \
  --device=/dev/dri \
  --security-opt seccomp=unconfined \
  -d -w /app/ \
  ${docker_base_img_name} bash
(2) launch server:
LOG_FILE="MI355_DSR1_FP4_TP4_vllm_355_public1111docker.log"
max_num_seqs=32
max_num_batched_tokens=163840
max_seq_len_to_capture=1024
tensor_parallel_size=4
max_model_len=70000
MODEL=amd/DeepSeek-R1-0528-MXFP4-Preview
unset FLATMM_HIP_CLANG_PATH
VLLM_USE_V1=1 \
VLLM_DISABLE_COMPILE_CACHE=1 \
AMDGCN_USE_BUFFER_OPS=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_TRITON_FP4_GEMM_USE_ASM=0 \
VLLM_ROCM_USE_AITER_FP4_ASM_GEMM=0 \
VLLM_ROCM_USE_AITER_MHA=1 \
VLLM_ROCM_USE_AITER_MLA=0 \
VLLM_ROCM_USE_CK_MXFP4_MOE=1 \
VLLM_TORCH_PROFILER_DIR=. \
VLLM_TORCH_PROFILER_WITH_STACK=0 \
VLLM_ROCM_USE_AITER_TRITON_MLA=0 \
VLLM_ROCM_USE_AITER_TRITON_FUSED_SHARED_EXPERTS=1 \
VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP4_QUANT=1 \
VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE=1 \
VLLM_ROCM_USE_AITER_TRITON_MXFP4_BMM=1 \
VLLM_ROCM_USE_AITER_TRITON_FUSED_MUL_ADD=1 \
VLLM_ROCM_USE_AITER_TRITON_FP8_BMM=0 \
HF_HOME=/data/huggingface vllm serve ${MODEL} \
  --host localhost \
  --port 8989 \
  --swap-space 64 \
  --disable-log-requests \
  --dtype auto \
  --tensor-parallel-size ${tensor_parallel_size} \
  --max-num-seqs ${max_num_seqs} \
  --distributed-executor-backend mp \
  --trust-remote-code \
  --block-size 1 \
  --compilation-config='{"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fusion":true},"cudagraph_mode":"FULL","custom_ops":["+rms_norm","+silu_and_mul","+quant_fp8"],"splitting_ops":[]}' \
  --gpu-memory-utilization 0.95 \
  --max-model-len ${max_model_len} \
  --kv-cache-dtype fp8 \
  --max-seq-len-to-capture ${max_seq_len_to_capture} \
  --max-num-batched-tokens ${max_num_batched_tokens} \
  --async-scheduling 2>&1 | tee "$LOG_FILE"
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response