[XPU] [DEEP_EP 2/4] DeepEP normal intranode / internode support xpu #76362

jianingyu-ustc · 2025-11-11T11:25:26Z

PR Category

Custom Device

PR Types

New features

Description

[XPU] [DEEP_EP 2/4] DeepEP normal intranode / internode support xpu

CLAassistant · 2025-11-11T11:25:33Z

All committers have signed the CLA.

CLAassistant · 2025-11-11T11:25:34Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ ZibinGuo
❌ jianingyu-ustc

paddle-bot · 2025-11-11T11:25:36Z

Your PR has been submitted. Thanks for your contribution to the open-source project!
Please watch the subsequent CI results. See the Paddle CI Manual for details.

ZibinGuo · 2025-11-21T05:49:49Z

paddle/fluid/distributed/collective/deep_ep_xpu/config.hpp


    // Message sizes
-    EP_HOST_ASSERT(num_scales * sizeof(float) <= hidden);
+    EP_HOST_ASSERT(num_scales * sizeof(float) <= static_cast<size_t>(hidden));


Would comparing directly here cause a problem?

Probably not... both sides are converted to size_t anyway.
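To make the conversion point in this thread concrete, here is a minimal sketch. It assumes `num_scales` and `hidden` are plain `int` (the diff does not show their declarations), and the helper names `fits_explicit` and `fits_checked` are hypothetical, not part of DeepEP or Paddle. It shows that the explicit cast computes what the implicit comparison would already compute, and that either form accepts a negative `hidden` because the value wraps around.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// On mainstream platforms the usual arithmetic conversions turn a mixed
// size_t/int comparison into a size_t comparison, so the implicit form
// and the static_cast form behave identically.
static_assert(
    std::is_same<std::common_type<std::size_t, int>::type, std::size_t>::value,
    "size_t and int are compared as size_t");

// Hypothetical helper with the same shape as the patched assertion.
// The casts only make the conversion visible (and silence -Wsign-compare).
inline bool fits_explicit(int num_scales, int hidden) {
  return static_cast<std::size_t>(num_scales) * sizeof(float) <=
         static_cast<std::size_t>(hidden);
}

// Hypothetical stricter variant: reject negative inputs before converting,
// since a negative int becomes a very large size_t.
inline bool fits_checked(int num_scales, int hidden) {
  if (num_scales < 0 || hidden < 0) return false;
  return fits_explicit(num_scales, hidden);
}
```

With these helpers, `fits_explicit(4, -1)` returns true while `fits_checked(4, -1)` returns false, which is the only behavioral difference a sign check would add.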

ForFishes

LGTM

qingqing01

LGTM for library size, since this PR is unrelated to GPU.

ZibinGuo

LGTM

ZibinGuo · 2025-11-21T07:38:25Z

The deep_ep functionality is implemented; unit tests are still missing.

ZibinGuo · 2025-11-21T08:18:24Z

/re-run all-failed

jianingyu-ustc · 2025-11-22T02:08:11Z

/re-run all-failed

dynamicheart · 2025-11-24T01:56:24Z

paddle/fluid/distributed/collective/process_group_bkcl.cc

+  calc_event_ = std::make_shared<XPUEventManager>();
+  auto* calc_ctx = static_cast<phi::XPUContext*>(
+      phi::DeviceContextPool::Instance().Get(place));
+  calc_ctx->CreateStream();


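calc_ctx seems to create four streams by default already. What is the purpose of creating a new stream for calc_ctx here?

For historical reasons, XPU has always used the default stream; a new stream is created only when CreateStream() is called manually. On GPU, by contrast, every CUDAStream that is constructed creates a new stream. deep_ep needs separate communication and compute streams, so this change creates one here and aligns the behavior with GPU.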

Paddle/paddle/phi/backends/xpu/xpu_context.cc

Lines 250 to 257 in f9062d5

    void CreateStream() {
      if (context_->xpu_stream) {
        VLOG(3) << "xpu stream is already created for current context";
        return;
      }
      PADDLE_ENFORCE_XPU_SUCCESS(xpu_stream_create(&context_->xpu_stream));
      stream_owned_ = true;
    }

If CDNN_CLUSTER_PARALLEL has already created the stream, calling this interface has no effect.
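The guard discussed in this thread can be illustrated with a small stand-alone sketch. `FakeStream`, `FakeContext`, and `AdoptExternalStream` are hypothetical stand-ins written for this example; they are not Paddle or XPU runtime types, and the real `XPUContext` may differ in detail. The sketch only models the behavior described above: the first call creates and owns a stream, and any later call, or a call made after a stream was set by another path, does nothing.

```cpp
#include <cassert>

// Hypothetical stand-in for a device stream handle.
struct FakeStream {
  int id;
};

// Hypothetical context that mirrors the "create only if absent" guard.
class FakeContext {
 public:
  FakeContext() = default;
  FakeContext(const FakeContext&) = delete;
  FakeContext& operator=(const FakeContext&) = delete;

  ~FakeContext() {
    // Only destroy a stream this context created itself.
    if (stream_owned_) delete stream_;
  }

  // Models a stream installed by some other mechanism before CreateStream().
  void AdoptExternalStream(FakeStream* s) { stream_ = s; }

  void CreateStream() {
    if (stream_ != nullptr) return;  // already present: the call is a no-op
    stream_ = new FakeStream{1};
    stream_owned_ = true;
  }

  const FakeStream* stream() const { return stream_; }
  bool stream_owned() const { return stream_owned_; }

 private:
  FakeStream* stream_ = nullptr;
  bool stream_owned_ = false;
};
```

In this model, calling `CreateStream()` twice leaves the first stream in place, and calling it on a context that already holds an externally provided stream leaves `stream_owned()` false, so the context never frees a stream it did not create.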

dynamicheart · 2025-11-24T01:57:47Z

paddle/phi/core/distributed/comm_context_manager.cc

  if (CommContextManager::device_id != -1) {
    std::unique_ptr<phi::XPUContext> dev_ctx(new phi::XPUContext(
        phi::XPUPlace(CommContextManager::device_id), true));
+    dev_ctx->CreateStream();


This changes how the communication library handles streams; please ask @lj970926 to review it.

I went over this with lj970926 and XiaociZhang when making the change.

dynamicheart

LGTM

lj970926

LGTM for comm stream

jianingyu-ustc requested review from ForFishes and sneaxiy as code owners November 11, 2025 11:25

paddle-bot bot added the contributor External developers label Nov 11, 2025

jianingyu-ustc force-pushed the develop branch from bb944ce to 1a46747 Compare November 18, 2025 11:19

jianingyu-ustc changed the title ~~[XPU] [DEEP_EP 2/4] DeepEP support xpu~~ [XPU] [DEEP_EP 2/4] DeepEP normal intranode / internode support xpu Nov 19, 2025

jianingyu-ustc force-pushed the develop branch 4 times, most recently from 6ff38bb to 21fb80c Compare November 20, 2025 15:21

DeepEP normal intranode / internode support xpu

c8eedd4

jianingyu-ustc force-pushed the develop branch from 21fb80c to c8eedd4 Compare November 20, 2025 15:28

jianingyu-ustc marked this pull request as draft November 21, 2025 02:49

jianingyu-ustc marked this pull request as ready for review November 21, 2025 02:50

jianingyu-ustc changed the title ~~[XPU] [DEEP_EP 2/4] DeepEP normal intranode / internode support xpu~~ [XPU] [DEEP_EP 2/4] DeepEP normal intranode / internode support xpu Nov 21, 2025

jianingyu-ustc marked this pull request as draft November 21, 2025 02:59

jianingyu-ustc marked this pull request as ready for review November 21, 2025 03:00

ZibinGuo reviewed Nov 21, 2025

ForFishes approved these changes Nov 21, 2025

qingqing01 approved these changes Nov 21, 2025

ZibinGuo approved these changes Nov 21, 2025

zyfncg approved these changes Nov 21, 2025

dynamicheart reviewed Nov 24, 2025

dynamicheart approved these changes Nov 24, 2025

dynamicheart merged commit 15301e0 into PaddlePaddle:develop Nov 24, 2025
121 of 135 checks passed

lj970926 approved these changes Nov 24, 2025

8 participants
