
Add blog for Elastic EP #321

Open
UNIDY2002 wants to merge 6 commits into lm-sys:main from UNIDY2002:eep

Conversation

@UNIDY2002

No description provided.

@UNIDY2002 (Author)

cc @ShangmingCai

- **Service Returns Responsive within Seconds**: To test extreme resilience, we evaluated DeepSeek V3.2 on 4 nodes (32 GPUs total, setting ep_size=dp_size=32) with 256 redundant experts, allowing us to tolerate up to 2 full node failures. When measuring the service interruption time caused by sudden rank failures, Elastic EP reduces downtime by over 90%, from 2–3 minutes to less than 10 seconds.

| Number of failed ranks | Interruption time with Elastic EP (sec) | Throughput with remaining ranks (tokens/sec) | Mean TPOT with remaining ranks (ms) |
| 16 | 6.2 |
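As a quick sanity check on the "over 90%" figure quoted above (pure arithmetic on the numbers in the excerpt; the 2–3 minute and 10 second bounds are the blog's claims, not measurements made here):

```python
# Sanity-check the claimed downtime reduction using the figures quoted
# in the excerpt above: baseline restart takes 2-3 minutes, Elastic EP
# recovers in under 10 seconds.
baseline_low_s = 120   # 2 minutes (baseline lower bound)
baseline_high_s = 180  # 3 minutes (baseline upper bound)
elastic_ep_s = 10      # Elastic EP upper bound

# Reduction against each end of the baseline range.
reduction_vs_high = 1 - elastic_ep_s / baseline_high_s  # ~94%
reduction_vs_low = 1 - elastic_ep_s / baseline_low_s    # ~92%

print(f"downtime reduced by {reduction_vs_low:.0%} to {reduction_vs_high:.0%}")
```

Both ends of the range exceed 90%, so the headline claim is consistent with the quoted numbers.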
Contributor

Better to add an explanation of why Mean TPOT decreases. I assume that's because the total batch size is smaller as the number of DP ranks decreases?

Author

Done.

Contributor

Do we decrease the request rate here? If not, then each EP rank may get more tokens per batch. IIUC, a lower Mean TPOT usually indicates higher per-request token throughput. If the total number of requests is the same, shouldn't each request get fewer tokens per second, since the compute resources are reduced?

Maybe you should provide the benchmark settings so readers can better understand the workload and reproduce the results.
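The reviewer's intuition can be made concrete with a toy latency model (all constants below are hypothetical, not from the blog's benchmark): if decode step latency grows with the per-rank batch size, then losing DP ranks at a fixed request count should *raise* Mean TPOT, so an observed decrease suggests the workload got lighter.

```python
# Toy model: decode step latency grows linearly with per-rank batch size,
# and a request's TPOT equals the step latency of the rank serving it.
# BASE_STEP_MS and MS_PER_REQ are made-up constants for illustration only.
BASE_STEP_MS = 20.0  # fixed per-step overhead (hypothetical)
MS_PER_REQ = 0.5     # marginal step cost per in-batch request (hypothetical)

def mean_tpot_ms(total_requests: int, dp_ranks: int) -> float:
    """Mean time-per-output-token under the linear step-latency model."""
    per_rank_batch = total_requests / dp_ranks
    return BASE_STEP_MS + MS_PER_REQ * per_rank_batch

# Same request count, half the DP ranks survive a failure:
before = mean_tpot_ms(total_requests=512, dp_ranks=32)  # 28.0 ms
after = mean_tpot_ms(total_requests=512, dp_ranks=16)   # 36.0 ms

print(before, after)  # TPOT rises when ranks drop at a fixed request count
```

Under this model, a lower Mean TPOT after failures is only consistent with a reduced effective batch (fewer admitted requests), which is what the reviewer is asking the benchmark description to pin down.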

Author

@UNIDY2002 Mar 19, 2026

The 4-node decode setup used to evaluate recovery time was not tuned for the best throughput/latency performance, so those results were not convincing. However, I cannot acquire enough GPU resources for a more thorough evaluation at this stage 😢 So I reverted the throughput/latency performance data from this table.

Nevertheless, I have revised the writing to describe the reproduction steps more clearly.

