Dopd opsd routing by ucalyptus · Pull Request #6237 · huggingface/trl

ucalyptus · 2026-07-01T23:44:54Z

Question for myself: does https://github.com/huggingface/trl/pull/5990/changes need to be merged before I can proceed?

Ports the OPSD trainer (huggingface#5990) as a base and adds a distillation_mode="dopd" option implementing the per-token routing from DOPD: Dual On-policy Distillation (arXiv:2606.30626). Each token is routed by advantage gap and teacher/student confidence into one of four regimes: light top-k reverse-KL, full-vocab JSD, a light stop-gradient student-consistency nudge, or a weak stop-gradient self-regularization fallback for ambiguous tokens. Adds pure-function CPU tests (no model/GPU) verifying routing correctness, mask exhaustiveness, and gradient behavior of the stop-gradient regimes.
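To make the routing idea concrete, here is a minimal pure-Python sketch of assigning each token to exactly one of the four regimes and checking that the resulting masks are exhaustive and mutually exclusive. It is not the code in this PR. The names `Regime`, `route_token`, `build_masks`, `masks_are_exhaustive`, the thresholds `GAP_THRESHOLD` and `CONF_THRESHOLD`, and the decision rule itself are all assumptions invented for illustration; the actual criteria are defined by the DOPD paper and the trainer implementation.

```python
from enum import Enum


class Regime(Enum):
    """The four per-token regimes named in the PR description."""
    TOPK_REVERSE_KL = "topk_reverse_kl"          # light top-k reverse-KL
    FULL_JSD = "full_jsd"                        # full-vocab JSD
    STUDENT_CONSISTENCY = "student_consistency"  # stop-gradient nudge
    SELF_REG_FALLBACK = "self_reg_fallback"      # weak stop-gradient fallback


# Hypothetical thresholds, chosen only for this illustration.
GAP_THRESHOLD = 0.1
CONF_THRESHOLD = 0.5


def route_token(adv_gap, teacher_conf, student_conf):
    """Assign one token to a regime.

    ASSUMPTION: this decision rule is made up for the sketch. It only
    demonstrates that every input lands in exactly one regime.
    """
    if adv_gap > GAP_THRESHOLD and teacher_conf >= CONF_THRESHOLD:
        # Teacher looks better and is confident.
        if student_conf >= CONF_THRESHOLD:
            return Regime.TOPK_REVERSE_KL
        return Regime.FULL_JSD
    if adv_gap < -GAP_THRESHOLD and student_conf >= CONF_THRESHOLD:
        # Student looks better and is confident.
        return Regime.STUDENT_CONSISTENCY
    # Everything else is treated as ambiguous.
    return Regime.SELF_REG_FALLBACK


def build_masks(adv_gaps, teacher_confs, student_confs):
    """Return one boolean mask (list of bools) per regime."""
    routes = [
        route_token(g, t, s)
        for g, t, s in zip(adv_gaps, teacher_confs, student_confs)
    ]
    return {regime: [r is regime for r in routes] for regime in Regime}


def masks_are_exhaustive(masks):
    """True if every token position is set in exactly one mask."""
    return all(sum(flags) == 1 for flags in zip(*masks.values()))


masks = build_masks(
    adv_gaps=[0.3, 0.3, -0.3, 0.0],
    teacher_confs=[0.9, 0.9, 0.2, 0.9],
    student_confs=[0.8, 0.1, 0.9, 0.9],
)
print(masks_are_exhaustive(masks))
```

Because `route_token` returns from a single chain of branches ending in an unconditional fallback, exhaustiveness holds by construction in this sketch. A tensor-based version would need the same property checked explicitly, which is presumably what the mask-exhaustiveness tests mentioned above cover.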

ucalyptus and others added 3 commits July 1, 2026 19:37

Merge branch 'huggingface:main' into dopd-opsd-routing

a9fd250

Merge branch 'main' into dopd-opsd-routing

965e619

ucalyptus wants to merge 3 commits into huggingface:main from ucalyptus:dopd-opsd-routing
