
[distill][phase1-2] Decouple DMD2 from Wan + YAML training args + checkpoint/resume#1122

Open
alexzms wants to merge 45 commits into hao-ai-lab:main from FoundationResearch:distill-phase1+2

Conversation

@alexzms
Collaborator

@alexzms alexzms commented Feb 22, 2026

1) Motivation

Phase 0 #1120 introduced a new distillation scaffold (Trainer ↔ Method ↔ Adapter + ModelBundle), but it still had two big limitations:

  • Algorithm/model coupling still leaked through (e.g. Wan-specific method naming and pipeline-backed behavior).
  • Entrypoints and configs were still “legacy-shaped”, which makes it hard to scale to many models/methods/roles without re-creating a new *_distillation_vN.py per model family.

This PR lands Phase 1 + Phase 2, pushing the refactor to the point where we can run few-step distillation via a YAML-only entrypoint and keep the method/algorithm reusable, while the adapter absorbs model/pipeline quirks.


2) Phase 1: Decouple DistillMethod

What Phase 1 changes

  • Move distillation logic toward a FastGen-style hierarchy where:
    • DistillTrainer is infra-only (loop/accum/step/logging).
    • DistillMethod owns the algorithm and update policy (multi-optimizer schedules, stepping cadence).
    • DistillAdapter owns model/pipeline-specific forward context and batch normalization.
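
The split above can be sketched as three small Python classes (a simplified illustration, not the actual fastvideo code; names follow the hierarchy described in this PR but method signatures are assumptions):

```python
from abc import ABC, abstractmethod
from typing import Any


class DistillAdapter(ABC):
    """Owns model/pipeline specifics: forward context and batch normalization."""

    @abstractmethod
    def normalize_batch(self, raw_batch: dict[str, Any]) -> dict[str, Any]:
        ...


class DistillMethod(ABC):
    """Owns the algorithm and update policy (losses, multi-optimizer cadence)."""

    @abstractmethod
    def training_step(self, batch: dict[str, Any], step: int) -> dict[str, float]:
        ...


class DistillTrainer:
    """Infra-only: the loop, stepping, and logging; no algorithm knowledge."""

    def __init__(self, method: DistillMethod, adapter: DistillAdapter,
                 dataloader: Any) -> None:
        self.method = method
        self.adapter = adapter
        self.dataloader = dataloader

    def train(self, max_steps: int) -> list[dict[str, float]]:
        history = []
        for step, raw_batch in enumerate(self.dataloader):
            if step >= max_steps:
                break
            # Adapter absorbs model quirks before the method ever sees the batch.
            batch = self.adapter.normalize_batch(raw_batch)
            history.append(self.method.training_step(batch, step))
        return history
```

The point of the sketch: the trainer never branches on model family, and the method never touches pipeline details.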

Key outcomes

  • Generic algorithm method: introduce a reusable DMD2 implementation under a method taxonomy:
    • fastvideo/distillation/methods/distribution_matching/dmd2.py
  • Model-family adapter: Wan specifics live in the adapter (forward context, pipeline normalization):
    • fastvideo/distillation/adapters/wan.py
  • Validation boundary: validation is no longer a “pipeline side-effect”; it becomes an explicit component rather than being baked into legacy pipelines.
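
One thing a reusable DMD2Method has to own is the multi-optimizer stepping cadence. The helper below is a hypothetical sketch of such a cadence (the ratio default and role names are illustrative, not taken from dmd2.py): the critic/fake-score model updates every step, while the student generator updates only on a fixed interval.

```python
def plan_updates(num_steps: int, gen_update_ratio: int = 5) -> list[str]:
    """Sketch of a DMD2-style stepping cadence: the critic updates every
    step; the student generator updates only every `gen_update_ratio`-th
    step (the default of 5 is a hypothetical choice for illustration)."""
    plan = []
    for step in range(num_steps):
        roles = ["critic"]
        if step % gen_update_ratio == 0:
            roles.append("student")
        plan.append("+".join(roles))
    return plan
```

Because the method owns this policy, the trainer can stay a dumb loop and the same cadence works for any model family the adapter supports.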

3) Phase 2: YAML-only entrypoint + builder runtime + fully decouple

Phase 2 makes the new path standalone (no legacy distillation pipeline dependency):

New entrypoint (YAML-only)

  • fastvideo/training/distillation.py
    • Only accepts new YAML configs (no legacy config fallback / merging).
    • CLI stays minimal: runtime controls like --config, --resume-from-checkpoint, --override-output-dir, --dry-run.
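
A minimal argparse surface matching the flags listed above might look like this (a sketch of the shape, not the actual entrypoint code):

```python
import argparse


def parse_cli(argv=None) -> argparse.Namespace:
    # Runtime-only controls; everything describing the run lives in the YAML.
    p = argparse.ArgumentParser(description="YAML-only distillation entrypoint")
    p.add_argument("--config", required=True, help="path to the YAML run config")
    p.add_argument("--resume-from-checkpoint", default=None,
                   help="checkpoint directory to resume from")
    p.add_argument("--override-output-dir", default=None)
    p.add_argument("--dry-run", action="store_true",
                   help="build the runtime, then exit before training")
    return p.parse_args(argv)
```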

Example YAML:

distill:
  model: wan
  method: dmd2

models:
  student:
    family: wan
    path: Wan-AI/Wan2.1-T2V-1.3B-Diffusers
    trainable: true
  teacher:
    family: wan
    path: Wan-AI/Wan2.1-T2V-14B-Diffusers
    trainable: false
  critic:
    family: wan
    path: Wan-AI/Wan2.1-T2V-1.3B-Diffusers
    trainable: true

training:
  # Distributed
  num_gpus: 8
  sp_size: 1
  tp_size: 1

  # Data (parquet dataset folder)
  data_path: data/Wan-Syn_77x448x832_600k
  dataloader_num_workers: 4
  ...

YAML config loading + typed spec

  • fastvideo/distillation/yaml_config.py
  • fastvideo/distillation/specs.py
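
The typed-spec idea can be sketched with plain dataclasses (field names are assumptions mirroring the YAML above, not the actual contents of specs.py); `cfg` below stands for the dict a `yaml.safe_load` of the run config would produce:

```python
from dataclasses import dataclass


@dataclass
class RoleSpec:
    family: str      # model family, e.g. "wan"
    path: str        # HF repo id or local path
    trainable: bool


@dataclass
class DistillSpec:
    model: str                  # adapter/model-family key
    method: str                 # algorithm key, e.g. "dmd2"
    roles: dict[str, RoleSpec]  # "student" / "teacher" / "critic" / ...


def spec_from_dict(cfg: dict) -> DistillSpec:
    # Validate-and-type the raw YAML dict into a spec object.
    roles = {name: RoleSpec(**entry) for name, entry in cfg["models"].items()}
    return DistillSpec(model=cfg["distill"]["model"],
                       method=cfg["distill"]["method"],
                       roles=roles)
```

Typing the config up front means the builder and trainer downstream never touch raw dicts.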

Builder that instantiates runtime from roles + method

  • fastvideo/distillation/builder.py
    • Builds roles (student/teacher/critic/…) → adapter → model bundle → method → trainer.
    • Keeps “how to assemble a run” separate from both trainer and method.
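
The assembly order can be illustrated with registries keyed by the YAML strings (placeholder constructors; the real builder.py wires actual adapter/method classes and a model bundle in between):

```python
# Hypothetical registries mapping YAML keys to constructors. Neither the
# trainer nor the method ever imports a model family directly; the builder
# is the only place that knows how to assemble a run.
ADAPTER_REGISTRY = {"wan": lambda roles: {"kind": "WanAdapter", "roles": roles}}
METHOD_REGISTRY = {"dmd2": lambda adapter: {"kind": "DMD2Method", "adapter": adapter}}


def build_runtime(model_key: str, method_key: str, roles: dict) -> dict:
    adapter = ADAPTER_REGISTRY[model_key](roles)   # model-family specifics
    method = METHOD_REGISTRY[method_key](adapter)  # reusable algorithm
    return {"adapter": adapter, "method": method}
```

Adding a new model family or method then means registering one constructor, not writing a new entrypoint.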

Validation without legacy pipelines

  • fastvideo/distillation/validators/base.py
  • fastvideo/distillation/validators/wan.py
    • Phase 2 runs validation via the new validator path (no legacy _log_validation).
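
"Validation as an explicit component" can be sketched as a small base class plus a cadence helper (signatures are assumptions, not the actual validators/base.py API):

```python
from abc import ABC, abstractmethod
from typing import Any, Optional


class DistillValidator(ABC):
    """Explicit validation component: the trainer invokes it on an interval
    instead of validation happening as a legacy-pipeline side-effect."""

    @abstractmethod
    def validate(self, student: Any, step: int) -> dict[str, Any]:
        """Sample with the current student and return metrics/artifacts."""


def maybe_validate(validator: DistillValidator, student: Any,
                   step: int, every: int) -> Optional[dict]:
    # Hypothetical helper: only run validation on the cadence boundary.
    if every > 0 and step % every == 0:
        return validator.validate(student, step)
    return None
```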

Checkpoint save/resume (new system)

  • fastvideo/distillation/checkpoint.py
    • Save/resume trainable roles, optimizer/scheduler, dataloader state, and RNG states (including adapter-exposed generators).
    • Integrated into fastvideo/distillation/trainer.py.
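
The save-plus-retention behavior can be sketched like this (a JSON stub for illustration only; the real manager also persists optimizer/scheduler, dataloader, and RNG state, and all field names here are hypothetical):

```python
import json
import os


def save_checkpoint(out_dir: str, step: int, state: dict, keep_last: int = 3) -> str:
    """Persist run state for `step`, then prune old checkpoints so at most
    `keep_last` remain (a simple retention policy)."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"checkpoint-{step:08d}.json")
    with open(path, "w") as f:
        json.dump({"step": step, "state": state}, f)
    # Zero-padded names sort in step order, so pruning is a sorted slice.
    ckpts = sorted(n for n in os.listdir(out_dir) if n.startswith("checkpoint-"))
    for old in ckpts[:-keep_last]:
        os.remove(os.path.join(out_dir, old))
    return path


def load_checkpoint(path: str) -> dict:
    with open(path) as f:
        return json.load(f)
```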

“outside/” config tree (non-invasive)

  • fastvideo/distillation/outside/fastvideo/configs/distillation/...
    • Used to iterate on a better distillation config scheme without touching fastvideo/configs/* yet.
    • This PR’s entrypoint reads explicit paths; no “outside path auto-completion” magic.

4) What’s done vs. not done

✅ Done in this PR (Phase 1 + 2)

  • Land a generic DMD2 method in our new distillation framework.
  • Land a model-family adapter (WanAdapter) that owns forward context (keeps methods clean).
  • Add a YAML-only distillation entrypoint: fastvideo/training/distillation.py.
  • Add a role-based builder that instantiates adapter/bundle/method/trainer from YAML.
  • Add independent validation (no legacy distillation pipeline dependency).
  • Add checkpoint save/resume to the new path (with retention policy).
  • Provide a runnable few-step distillation config + example script:
    • YAML: fastvideo/distillation/outside/fastvideo/configs/distillation/distill_wan2.1_t2v_1.3B_dmd2_8steps.yaml
    • Runner: examples/distillation/phase2/temp.sh (will be cleaned up in Phase 3)

❌ Not done yet (planned follow-ups)

  • Additional distillation methods beyond DMD2 (e.g. Self-Forcing + ODE init path).
  • Additional adapters/models beyond Wan.
  • Long-term cleanup: remove legacy distillation pipelines (left intact for now; new code path is additive).
  • More rigorous automated tests for end-to-end training/validation (GPU CI, SSIM regression, etc.).
  • Since finetuning is a special case of distillation (only a student and a dataset are provided), the new distillation framework can absorb the finetuning code, resulting in an even cleaner training pipeline.
  • Better YAML config design to support future distillation methods and even finetuning.
  • Decouple the build functions in builder.py.

Tests / Evidence


Feedback and suggestions are highly welcome!

@gemini-code-assist
Contributor

Summary of Changes

Hello @alexzms, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the distillation framework by decoupling the core algorithm from model-specific implementations and introducing a robust, YAML-driven configuration system. The changes aim to improve scalability, reusability, and maintainability of distillation training pipelines, enabling more flexible experimentation and easier integration of new models and methods. It also introduces a comprehensive checkpointing system and independent validation, laying a solid foundation for future distillation efforts.

Highlights

  • Decoupled Distillation Framework: Introduced a new distillation scaffold with a clear separation of concerns: DistillTrainer for infrastructure, DistillMethod for algorithms, and DistillAdapter for model/pipeline specifics. This refactors the distillation logic towards a FastGen-style hierarchy.
  • YAML-only Entrypoint: Implemented a new standalone entrypoint for distillation training that exclusively uses YAML configurations, moving away from legacy command-line argument structures. This simplifies configuration management and scalability.
  • Generic DMD2 Method and Wan Adapter: Landed a reusable DMD2 (Distribution Matching Distillation) implementation as a generic method and a WanAdapter to handle Wan-specific model and pipeline quirks, ensuring algorithm reusability across different models.
  • New Checkpoint and Resume System: Integrated a new checkpointing system that supports saving and resuming trainable roles, optimizers, schedulers, dataloader states, and RNG states, enhancing training robustness and flexibility.
  • Independent Validation: Established an independent validation mechanism that is no longer a side-effect of legacy pipelines, making validation an explicit and separate component of the distillation process.


Changelog
  • examples/distillation/phase0/distill_wan2.1_t2v_1.3B_dmd2_8steps.sh
    • Added a new shell script for Phase 0 DMD2 distillation.
  • examples/distillation/phase0/temp.sh
    • Added a temporary shell script for Phase 0 Wan DMD2 distillation.
  • examples/distillation/phase1/distill_wan2.1_t2v_1.3B_dmd2_8steps.sh
    • Added a new shell script for Phase 1 DMD2 distillation using the new method/adapter entrypoint.
  • examples/distillation/phase1/run.md
    • Added a markdown file for Phase 1 run links.
  • examples/distillation/phase1/temp.sh
    • Added a temporary shell script for Phase 1 Wan DMD2 distillation.
  • examples/distillation/phase2/README.md
    • Added a README for Phase 2 YAML-only distillation examples.
  • examples/distillation/phase2/distill_wan2.1_t2v_1.3B_dmd2_8steps.yaml
    • Added a YAML configuration file for Phase 2 Wan DMD2 distillation.
  • examples/distillation/phase2/run_wan2.1_t2v_1.3B_dmd2_8steps.sh
    • Added a shell script to run Phase 2 distillation from a YAML config.
  • examples/distillation/phase2/temp.sh
    • Added a temporary shell script for Phase 2 Wan DMD2 distillation.
  • fastvideo/distillation/__init__.py
    • Added an __init__.py file to define the distillation module's public API.
  • fastvideo/distillation/adapters/__init__.py
    • Added an __init__.py file for distillation adapters.
  • fastvideo/distillation/adapters/base.py
    • Added a base abstract class for distillation adapters.
  • fastvideo/distillation/adapters/wan.py
    • Added the WanAdapter implementation for Wan-specific distillation logic.
  • fastvideo/distillation/builder.py
    • Added a builder module to construct the distillation runtime from configuration.
  • fastvideo/distillation/bundle.py
    • Added data structures for ModelBundle and RoleHandle to manage models and their roles.
  • fastvideo/distillation/checkpoint.py
    • Added a checkpoint manager for saving and resuming distillation training states.
  • fastvideo/distillation/methods/__init__.py
    • Added an __init__.py file for distillation methods.
  • fastvideo/distillation/methods/consistency_model/__init__.py
    • Added an empty __init__.py for consistency model methods.
  • fastvideo/distillation/methods/distribution_matching/__init__.py
    • Added an __init__.py for distribution matching methods.
  • fastvideo/distillation/methods/distribution_matching/dmd2.py
    • Added the DMD2Method implementation, encapsulating the DMD2 algorithm.
  • fastvideo/distillation/methods/fine_tuning/__init__.py
    • Added an empty __init__.py for fine-tuning methods.
  • fastvideo/distillation/methods/knowledge_distillation/__init__.py
    • Added an empty __init__.py for knowledge distillation methods.
  • fastvideo/distillation/outside/README.md
    • Added a README explaining the purpose of the outside/ directory for config iteration.
  • fastvideo/distillation/outside/fastvideo/configs/distillation/distill_wan2.1_t2v_1.3B_dmd2_8steps.yaml
    • Added a YAML configuration file for Phase 2 Wan DMD2 distillation.
  • fastvideo/distillation/specs.py
    • Added data classes for DistillSpec and RoleSpec to define distillation configurations.
  • fastvideo/distillation/trainer.py
    • Added a generic DistillTrainer class for managing the training loop.
  • fastvideo/distillation/validators/__init__.py
    • Added an __init__.py for distillation validators.
  • fastvideo/distillation/validators/base.py
    • Added a base abstract class for distillation validators.
  • fastvideo/distillation/validators/wan.py
    • Added the WanValidator implementation for Wan-specific validation.
  • fastvideo/distillation/yaml_config.py
    • Added a module for loading distillation run configurations from YAML.
  • fastvideo/tests/distillation/test_optimizer_scheduler_alignment.py
    • Added a test for optimizer and scheduler alignment in distillation methods.
  • fastvideo/training/distillation.py
    • Added the main entrypoint for Phase 2 distillation training.
Activity
  • The pull request introduces a significant refactoring of the distillation framework, moving from a tightly coupled system to a more modular and extensible architecture.
  • Phase 1 focused on decoupling the DistillMethod from model specifics, introducing DistillTrainer, DistillMethod, and DistillAdapter components.
  • Phase 2 completed the refactor by enabling a YAML-only entrypoint, a role-based builder for runtime instantiation, and a new checkpointing system.
  • Manual GPU runs were performed to validate the expected behavior, showing progressively clearer and higher-quality validation outputs as training advanced.
  • The training behavior in Phase 1 and 2 closely matched that of Phase 0, indicating a successful refactor without regression in core functionality.

@alexzms alexzms added the go Trigger Buildkite CI label Feb 22, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a significant and well-executed refactoring that decouples the distillation method from the model specifics, introduces a flexible YAML-based configuration system, and adds robust checkpointing and resume capabilities. The new structure with Trainer, Method, Adapter, and Builder is clean and promotes reusability.

I've added a couple of minor suggestions to improve code style and maintainability. Overall, this is a great contribution that significantly improves the distillation framework.

Comment on lines +230 to +231
checkpointing_type=training_args.
enable_gradient_checkpointing_type,
Contributor

Severity: medium

This line break in the middle of an attribute access is a bit unusual and harms readability. It's better to keep training_args.enable_gradient_checkpointing_type on a single line.

Suggested change
checkpointing_type=training_args.
enable_gradient_checkpointing_type,
checkpointing_type=training_args.enable_gradient_checkpointing_type,

@alexzms
Collaborator Author

alexzms commented Feb 24, 2026

Up to this version, only the validator is not fully decoupled; decoupling it will require rewriting WanPipeline to accept an SDE/ODE mode. Planned for Phase 3.

