
Add async checkpoint feature#1703

Open
VincentCheungKokomo wants to merge 1 commit intoInternLM:mainfrom
VincentCheungKokomo:feature/async-checkpoint

Conversation

@VincentCheungKokomo

Add async DCP checkpoint support

This change adds async checkpoint saving for XTuner v1 training. The trainer
now supports an async_checkpoint option, launches merged async DCP saves for
model and optimizer state, and defers checkpoint metadata finalization until
the background staging/upload futures complete.

The async path writes model and optimizer state into a merged weights/
checkpoint format, while resume keeps compatibility with both the new merged
format and the existing model/optimizer DCP format. Checkpoint metadata is only
registered after async save completion, so failed async saves are not exposed as
resumable checkpoints.

The training engine now creates a dedicated process group for async checkpoint
work, supports merged async save/load helpers, and cleans up the async process
group at trainer shutdown.

Tests and benchmark configs are added to cover async checkpoint intervals and
provide reproducible verification runs for 8B and 30B models.
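The deferred-finalization flow described above can be sketched, independent of torch, with a plain executor. This is a minimal illustration of the pattern, not the PR's implementation; names such as `_write_state_dict`, `_save_and_finalize`, and the `registered` list are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor


class AsyncCheckpointer:
    """Toy model of 'register checkpoint metadata only after the async save succeeds'."""

    def __init__(self) -> None:
        self._executor = ThreadPoolExecutor(max_workers=1)
        self.registered: list[str] = []  # checkpoints exposed for resume

    def _write_state_dict(self, path: str) -> str:
        # Stand-in for the merged DCP save of model + optimizer state.
        return path

    def _save_and_finalize(self, path: str) -> str:
        saved = self._write_state_dict(path)
        # Metadata is registered only after the save succeeds, so a
        # failed background save never shows up as a resumable checkpoint.
        self.registered.append(saved)
        return saved

    def async_save(self, path: str) -> Future:
        # Returns a future the caller can wait on before finalizing.
        return self._executor.submit(self._save_and_finalize, path)


ckpt = AsyncCheckpointer()
ckpt.async_save("step_100").result()
ckpt.async_save("step_200").result()
print(ckpt.registered)  # ['step_100', 'step_200']
```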

from xtuner.v1.utils.grad_norm import cal_grad_norm


if BlockingAsyncStager is not None:
Collaborator

In [2]: fw = FileSystemWriter("./")

In [3]: from torch.distributed.checkpoint.staging import AsyncStager, BlockingAsyncStager

In [4]: isinstance(fw, AsyncStager)
Out[4]: True

is _CachingStagingWriter necessary?

options=_set_options,
)

def load_dcp_merged(
Collaborator

The state dict format should be consistent with async_save and save. If merged_state_dict performs better, just replace the current implementation.

Comment on lines +540 to +543
self._async_checkpoint = async_checkpoint
self._pending_staging_futures: list[Future] | None = None
self._pending_upload_futures: list[Future] | None = None
self._pending_checkpoint_finalize: _CheckpointFinalize | None = None
Collaborator

Following dcp.async_save, the async interface should return an awaitable future. We can assume there is at most one in-flight async save future in the trainer at any time, and the trainer will always wait for the previous async save to finish before issuing a new one.

ckpt_saved = self._maybe_save(is_snapshot=False)
if not ckpt_saved:
_ = self._maybe_save(is_snapshot=True)
checkpoint_time = time.time() - time_before_checkpoint
Collaborator

Just log the checkpoint time in train_engine.
