
Conversation

@hushenwei2000 (Contributor) commented Nov 6, 2025

PR types

CI/CE

PR changes

Others

Description

Add Qwen3MoE CI/CE configs.
Distributed config: TP2SPSD2EP4PP2-packing (see the sketch after the TODO list below)

Covered configs:

  • SFT / DPO
  • With LoRA / without LoRA
  • DeepEP / AllToAll

TODO:

  • Pretrain
  • DPO + LoRA: currently uses TP4SPSD2EP4PP2-packing (otherwise a "parameters not trainable" error is raised).
  • DPO + AllToAll: currently has a precision problem; setting ep_communication_type: "deepep" is recommended.
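
For reference, a minimal sketch of the parallelism block the name TP2SPSD2EP4PP2-packing implies. The mapping (TP2 = tensor parallel 2, SP = sequence parallel, SD2 = sharding degree 2, EP4 = expert parallel 4, PP2 = pipeline parallel 2) is inferred from the name; only tensor_parallel_degree, sharding_parallel_config, and packing appear verbatim in the snippets reviewed below, and the remaining key names are assumptions:

# Hypothetical block inferred from "TP2SPSD2EP4PP2-packing"; not the merged config.
tensor_parallel_degree: 2                # TP2 (key quoted later in this review)
sequence_parallel: true                  # SP (assumed key name)
sharding_parallel_degree: 2              # SD2 (assumed key name)
expert_parallel_degree: 4                # EP4 (assumed key name)
pipeline_parallel_degree: 2              # PP2 (assumed key name)
sharding_parallel_config: "split_param"  # quoted later in this review
packing: true                            # the "-packing" suffix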


paddle-bot bot commented Nov 6, 2025

Thanks for your contribution!

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (develop@0ee5333). Learn more about missing BASE report.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #2876   +/-   ##
==========================================
  Coverage           ?   31.00%           
==========================================
  Files              ?      355           
  Lines              ?    59111           
  Branches           ?        0           
==========================================
  Hits               ?    18327           
  Misses             ?    40784           
  Partials           ?        0           

☔ View full report in Codecov by Sentry.

train_dataset_path: data-sft/train_gsm8k.json
train_dataset_prob: "1.0"
eval_dataset_path: data-sft/test_gsm8k.json
eval_dataset_prob: "1.0"
Collaborator:

train_dataset_type: erniekit
eval_dataset_type: erniekit
train_dataset_path: ./data/sft/train.json
train_dataset_prob: "1.0"
eval_dataset_path: ./data/sft/dev.json
eval_dataset_prob: "1.0"
max_seq_len: 8192
packing: true
mix_strategy: concat
Keep this the same as the other models.

do_eval: false
per_device_eval_batch_size: 1
per_device_train_batch_size: 1
num_train_epochs: 5
Collaborator:

num_train_epochs: 1
max_steps: -1

# use_filtered_label_loss: true
optim: adamw_custom
tensorwise_offload_optimizer: true
recompute: true
Collaborator:

offload_optim: false
use_fused_head_and_loss_fn: true
# use_filtered_label_loss: true
optim: adamw_custom
tensorwise_offload_optimizer: true

Just keep tensorwise_offload_optimizer: true and drop the rest.
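
Per the comment, the quoted block would shrink to a single line (a literal reading of the suggestion; whether optim: adamw_custom is also kept is not stated):

tensorwise_offload_optimizer: true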

fp16_opt_level: O2
unified_checkpoint: true

sharding_parallel_config: "split_param"
Collaborator:

The four SFT yamls can be reduced to two; only full_tp_pp_ep.yaml and lora_tp_pp_ep.yaml are needed, defaulting to alltoall: ep_communication_type: "alltoall"  # choices: [deepep, alltoall]; deepep is only for Hopper GPUs.
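
Spelled out, the consolidated SFT yamls would carry the switch like this (filenames from the comment above; top-level placement is an assumption based on the other quoted snippets):

# sft/full_tp_pp_ep.yaml and sft/lora_tp_pp_ep.yaml (assumed paths)
ep_communication_type: "alltoall"  # choices: [deepep, alltoall]; deepep runs only on Hopper GPUs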

learning_rate: 1.0e-6

# performance
tensor_parallel_degree: 2
Collaborator:

Add a LoRA variant of this as well.
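
A sketch of what the requested LoRA variant could add on top of the same parallelism settings; the keys lora and lora_rank are assumptions modeled on common PaddleNLP-style fine-tuning configs, not taken from this PR:

# sft/lora_tp_pp_ep.yaml (filename from the earlier comment)
lora: true                 # assumed key name
lora_rank: 8               # assumed key name and value
tensor_parallel_degree: 2  # unchanged from the full-parameter yaml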

Collaborator:

dpo/full_tp_pp_ep.yaml and dpo/lora_tp_pp_ep.yaml
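
Read together with the TODO list in the description, the two DPO yamls would differ from their SFT counterparts roughly as follows (filenames from the comment, values from the TODO items; a sketch, not the merged files):

# dpo/full_tp_pp_ep.yaml — DPO + AllToAll currently has a precision problem,
# so the recommended setting is:
ep_communication_type: "deepep"

# dpo/lora_tp_pp_ep.yaml — uses TP4 to avoid the "parameters not trainable" error:
tensor_parallel_degree: 4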

@hushenwei2000 changed the title from "Add Qwen3MoE CI Config" to "[CI/CE] Add Qwen3MoE CI Config" on Nov 10, 2025