[Issue]: AITER all-reduce kernel segmentation fault on MI300X #1542

@graceleeis

Problem Description

I am trying to run DeepSeek-V3.2 on an MI300X machine via SGLang, using the NSA + AITER (tilelang / flashMLA) pipeline on ROCm. That pipeline depends on AITER’s ROCm kernels, including its custom all-reduce, layernorm, and the FlashMLA module (flash_mla). On this machine those pieces are either missing or broken.

Here is what I am seeing.

AITER custom all-reduce segfaults

As soon as workers start running the NSA + AITER path, the process segfaults in aiter/aiter/jit/core.py.

This happens even when SGLang is started with --disable-custom-all-reduce, so it looks like AITER’s own ROCm all-reduce kernel is still being used somewhere in the pipeline.
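To help narrow down which code path is still selecting the AITER all-reduce, it may be useful to attach the relevant environment of the server process to this report. The sketch below only filters `os.environ` by name. The helper name `collect_env` and the substring list are illustrative assumptions, not part of SGLang or AITER, and it assumes (without confirmation) that the relevant toggles are exposed as environment variables.

```python
import os


def collect_env(substrings=("AITER", "SGLANG", "ROCM", "HIP", "RCCL"), environ=None):
    """Return environment variables whose names contain any of the given substrings.

    The substring list is an assumption chosen for this report; adjust as needed.
    """
    environ = os.environ if environ is None else environ
    return {
        name: value
        for name, value in sorted(environ.items())
        if any(s in name.upper() for s in substrings)
    }


if __name__ == "__main__":
    # Print matching variables in NAME=value form, ready to paste into the issue.
    for name, value in collect_env().items():
        print(f"{name}={value}")
```

Running this inside the container before launching the server would show whether any AITER-related variable is set independently of `--disable-custom-all-reduce`.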

The Python side never sees a PyTorch RuntimeError. It is a hard segmentation fault.
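Because the crash never surfaces as a Python exception, one way to at least see which Python frame was active when the native code faulted is the standard-library `faulthandler` module. The following is a minimal sketch, not something SGLang or AITER provides: the helper name `enable_crash_tracebacks` and the log path are hypothetical, and it would need to run early in each worker process.

```python
import faulthandler
import sys


def enable_crash_tracebacks(log_path=None):
    """Dump the Python stacks of all threads when a fatal signal such as SIGSEGV arrives.

    Returns the stream that receives the dump. The caller must keep it open for
    the lifetime of the process, because faulthandler writes to its file descriptor.
    """
    stream = open(log_path, "w") if log_path else sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return stream


if __name__ == "__main__":
    # Hypothetical log location; one file per worker avoids interleaved output.
    crash_log = enable_crash_tracebacks("/tmp/aiter_crash.log")
```

Setting the environment variable `PYTHONFAULTHANDLER=1` before launching the server has a similar effect without code changes, writing the dump to stderr.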

Operating System

Ubuntu 22.04.5 LTS

CPU

Intel(R) Xeon(R) Platinum 8480C

GPU

8x AMD MI300X

ROCm Version

6.10.5

ROCm Component

No response

Steps to Reproduce

Docker image: lmsysorg/sglang:dsv32-rocm
Command to run: python -m sglang.launch_server --model-path /models/DeepSeek-V3.2/ --tp 8 --disable-cuda-graph --disable-custom-all-reduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response
