
[Issue]: "AssertionError: Aiter MLA only supports 16 or 128 number of heads. Provided 32 number of heads" in DeepSeek R1 + TP4 + MXFP4 +MI355 test #1468

@leishaoSC

Description

Problem Description

When testing DeepSeek R1 + TP4 + MXFP4 with the vLLM Docker image rocm/vllm-dev:dsfp4_1111 on an MI355 machine, I hit the following error: "AssertionError: Aiter MLA only supports 16 or 128 number of heads. Provided 32 number of heads".

(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597] WorkerProc failed to start.
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597] Traceback (most recent call last):
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.worker.load_model()
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 214, in load_model
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2562, in load_model
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.model = model_loader.load_model(
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 45, in load_model
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     model = initialize_model(vllm_config=vllm_config,
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 63, in initialize_model
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1179, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.model = DeepseekV2Model(vllm_config=vllm_config,
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 200, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1107, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                                                     ^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 627, in make_layers
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1109, in <lambda>
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     lambda prefix: DeepseekV2DecoderLayer(vllm_config, prefix),
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 927, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.self_attn = attn_cls(
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                      ^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 876, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.mla_attn = MultiHeadLatentAttention(
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                     ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/mla.py", line 156, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.mla_attn = Attention(
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                     ^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/attention/layer.py", line 208, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     self.impl = impl_cls(num_heads, head_size, scale, num_kv_heads,
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/rocm_aiter_mla.py", line 192, in __init__
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]     assert (num_heads == 16 or num_heads == 128), (
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597] AssertionError: Aiter MLA only supports 16 or 128 number of heads.
(Worker_TP2 pid=477) ERROR 11-22 04:37:16 [multiproc_executor.py:597] Provided 32 number of heads.
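For context on where the 32 comes from: DeepSeek R1 configures 128 attention heads, and tensor parallelism shards them evenly across ranks, so TP4 leaves 128 / 4 = 32 heads per rank, which the aiter MLA assertion rejects (it only accepts 16 or 128 heads, i.e. TP8 or TP1 for this model). A minimal sketch of that arithmetic (heads_per_rank is an illustrative helper, not vLLM code):

```python
# Illustrative only: DeepSeek R1 has 128 attention heads in its config,
# and tensor parallelism divides them evenly across TP ranks.
TOTAL_HEADS = 128

def heads_per_rank(tp_size: int) -> int:
    assert TOTAL_HEADS % tp_size == 0, "heads must divide evenly across ranks"
    return TOTAL_HEADS // tp_size

for tp in (1, 2, 4, 8):
    n = heads_per_rank(tp)
    # Mirrors the check in rocm_aiter_mla.py: only 16 or 128 heads pass.
    status = "ok" if n in (16, 128) else "AssertionError"
    print(f"TP{tp}: {n} heads per rank -> {status}")
```

Under this reading, the assertion would fire for TP2 and TP4 but not TP1 or TP8; the open question in this report is why the aiter MLA backend is constructed at all given that the launch command sets VLLM_ROCM_USE_AITER_MLA=0.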

Operating System

Ubuntu 22.04.5 LTS

CPU

AMD EPYC 9575F 64-Core Processor

GPU

8x MI355

ROCm Version

rocm-7.1.0

ROCm Component

No response

Steps to Reproduce

(1) docker run command:
docker_base_img_name=docker.io/rocm/vllm-dev:dsfp4_1111
docker_ctnr_name=vllm_dev_dsfp4_1111_test

docker run -it \
  --name=${docker_ctnr_name} \
  --user=root \
  -e HF_HOME=/data/huggingface \
  --volume=/data:/data \
  --volume=$HOME:/workdir \
  -v /mnt:/mnt \
  --cap-add=SYS_PTRACE \
  --group-add=video \
  --ipc=host \
  --shm-size=16G \
  --device=/dev/kfd \
  --device=/dev/dri \
  --security-opt seccomp=unconfined \
  -d -w /app/ \
  ${docker_base_img_name} bash

(2) launch server:
LOG_FILE="MI355_DSR1_FP4_TP4_vllm_355_public1111docker.log"
max_num_seqs=32
max_num_batched_tokens=163840
max_seq_len_to_capture=1024
tensor_parallel_size=4
max_model_len=70000

MODEL=amd/DeepSeek-R1-0528-MXFP4-Preview
unset FLATMM_HIP_CLANG_PATH

VLLM_USE_V1=1 \
VLLM_DISABLE_COMPILE_CACHE=1 \
AMDGCN_USE_BUFFER_OPS=1 \
VLLM_ROCM_USE_AITER=1 \
VLLM_TRITON_FP4_GEMM_USE_ASM=0 \
VLLM_ROCM_USE_AITER_FP4_ASM_GEMM=0 \
VLLM_ROCM_USE_AITER_MHA=1 \
VLLM_ROCM_USE_AITER_MLA=0 \
VLLM_ROCM_USE_CK_MXFP4_MOE=1 \
VLLM_TORCH_PROFILER_DIR=. \
VLLM_TORCH_PROFILER_WITH_STACK=0 \
VLLM_ROCM_USE_AITER_TRITON_MLA=0 \
VLLM_ROCM_USE_AITER_TRITON_FUSED_SHARED_EXPERTS=1 \
VLLM_ROCM_USE_AITER_TRITON_FUSED_RMSNORM_FP4_QUANT=1 \
VLLM_ROCM_USE_AITER_TRITON_FUSED_ROPE_ZEROS_KV_CACHE=1 \
VLLM_ROCM_USE_AITER_TRITON_MXFP4_BMM=1 \
VLLM_ROCM_USE_AITER_TRITON_FUSED_MUL_ADD=1 \
VLLM_ROCM_USE_AITER_TRITON_FP8_BMM=0 \
HF_HOME=/data/huggingface vllm serve ${MODEL} \
  --host localhost \
  --port 8989 \
  --swap-space 64 \
  --disable-log-requests \
  --dtype auto \
  --tensor-parallel-size ${tensor_parallel_size} \
  --max-num-seqs ${max_num_seqs} \
  --distributed-executor-backend mp \
  --trust-remote-code \
  --block-size 1 \
  --compilation-config='{"pass_config":{"enable_attn_fusion":true,"enable_noop":true,"enable_fusion":true},"cudagraph_mode":"FULL","custom_ops":["+rms_norm","+silu_and_mul","+quant_fp8"],"splitting_ops":[]}' \
  --gpu-memory-utilization 0.95 \
  --max-model-len ${max_model_len} \
  --kv-cache-dtype fp8 \
  --max-seq-len-to-capture ${max_seq_len_to_capture} \
  --max-num-batched-tokens ${max_num_batched_tokens} \
  --async-scheduling 2>&1 | tee "$LOG_FILE"

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
