
Add blog for Elastic EP #321

Open
UNIDY2002 wants to merge 6 commits into lm-sys:main from UNIDY2002:eep

Conversation

@UNIDY2002

No description provided.

@UNIDY2002 (Author)

cc @ShangmingCai

- **Service Returns Responsive within Seconds**: To test extreme resilience, we evaluated DeepSeek V3.2 on 4 nodes (32 GPUs total, setting ep_size=dp_size=32) with 256 redundant experts, allowing us to tolerate up to 2 full node failures. When measuring the service interruption time caused by sudden rank failures, Elastic EP reduces downtime by over 90%, from 2–3 minutes to less than 10 seconds.

| Number of failed ranks | Interruption time with Elastic EP (sec) | Throughput with remaining ranks (tokens/sec) | Mean TPOT with remaining ranks (ms) |
| 16 | 6.2 |
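As a quick sanity check on the "over 90%" figure quoted above (pure arithmetic on the numbers in the excerpt; the 2–3 minute and 10 second bounds are the blog's claims, not measurements made here):

```python
# Sanity-check the claimed downtime reduction using the figures quoted
# in the excerpt above: baseline restart takes 2-3 minutes, Elastic EP
# recovers in under 10 seconds.
baseline_low_s = 120   # 2 minutes (baseline lower bound)
baseline_high_s = 180  # 3 minutes (baseline upper bound)
elastic_ep_s = 10      # Elastic EP upper bound

# Reduction against each end of the baseline range.
reduction_vs_high = 1 - elastic_ep_s / baseline_high_s  # ~94%
reduction_vs_low = 1 - elastic_ep_s / baseline_low_s    # ~92%

print(f"downtime reduced by {reduction_vs_low:.0%} to {reduction_vs_high:.0%}")
```

Both ends of the range exceed 90%, so the headline claim is consistent with the quoted numbers.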
Contributor

Better to add an explanation of why Mean TPOT decreases. I assume that's because the total batch size is smaller as the number of DP ranks decreases?

Author

Done.

Contributor

Do we decrease the request rate here? If not, then each EP rank may get more tokens per batch. IIUC, a lower Mean TPOT usually indicates higher per-request token throughput. If the total number of requests is the same, shouldn't each request get fewer tokens per second, since the compute resources are reduced?

Maybe you should provide the benchmark settings so readers can better understand the workload and reproduce the results.
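The reviewer's intuition can be made concrete with a toy latency model (all constants below are hypothetical, not from the blog's benchmark): if decode step latency grows with the per-rank batch size, then losing DP ranks at a fixed request count should *raise* Mean TPOT, so an observed decrease suggests the workload got lighter.

```python
# Toy model: decode step latency grows linearly with per-rank batch size,
# and a request's TPOT equals the step latency of the rank serving it.
# BASE_STEP_MS and MS_PER_REQ are made-up constants for illustration only.
BASE_STEP_MS = 20.0  # fixed per-step overhead (hypothetical)
MS_PER_REQ = 0.5     # marginal step cost per in-batch request (hypothetical)

def mean_tpot_ms(total_requests: int, dp_ranks: int) -> float:
    """Mean time-per-output-token under the linear step-latency model."""
    per_rank_batch = total_requests / dp_ranks
    return BASE_STEP_MS + MS_PER_REQ * per_rank_batch

# Same request count, half the DP ranks survive a failure:
before = mean_tpot_ms(total_requests=512, dp_ranks=32)  # 28.0 ms
after = mean_tpot_ms(total_requests=512, dp_ranks=16)   # 36.0 ms

print(before, after)  # TPOT rises when ranks drop at a fixed request count
```

Under this model, a lower Mean TPOT after failures is only consistent with a reduced effective batch (fewer admitted requests), which is what the reviewer is asking the benchmark description to pin down.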

Author

@UNIDY2002 Mar 19, 2026

The 4-node decode setup used to evaluate recovery time was not tuned for the best throughput/latency performance, so those results were not convincing. However, I cannot acquire enough GPU resources for a more thorough evaluation at this stage 😢 So I reverted the throughput/latency performance data from this table.

Nevertheless, I have revised the writing to describe the reproduction steps more clearly.

