Dopd opsd routing #6237

Draft
ucalyptus wants to merge 3 commits into huggingface:main from ucalyptus:dopd-opsd-routing
@ucalyptus

Contributor

Question for myself: does https://github.com/huggingface/trl/pull/5990/changes need to be merged before I can proceed?

ucalyptus and others added 3 commits July 1, 2026 19:37
Ports the OPSD trainer (huggingface#5990) as a base and adds a
distillation_mode="dopd" option implementing the per-token routing from
DOPD: Dual On-policy Distillation (arXiv:2606.30626). Each token is
routed by advantage gap and teacher/student confidence into one of four
regimes: light top-k reverse-KL, full-vocab JSD, a light stop-gradient
student-consistency nudge, or a weak stop-gradient self-regularization
fallback for ambiguous tokens.

Adds pure-function CPU tests (no model/GPU) verifying routing
correctness, mask exhaustiveness, and gradient behavior of the
stop-gradient regimes.
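
For readers who have not seen the diff, the four-way routing and the mask-exhaustiveness check described above might look roughly like the sketch below. This is an illustration only, not the code in this PR: the names `Regime`, `route_token`, `route_tokens`, and `masks_are_exhaustive`, the thresholds `gap_eps` and `conf_min`, and the mapping from conditions to regimes are all assumptions made for the example. It uses plain Python lists rather than tensors so it runs without a model or GPU.

```python
from enum import Enum


class Regime(Enum):
    # The four regimes named in the commit message.
    TOPK_REVERSE_KL = "topk_reverse_kl"
    FULL_VOCAB_JSD = "full_vocab_jsd"
    STUDENT_CONSISTENCY = "student_consistency"
    SELF_REG_FALLBACK = "self_reg_fallback"


def route_token(adv_gap, teacher_conf, student_conf, gap_eps=0.1, conf_min=0.5):
    """Pick exactly one regime for a single token.

    Hypothetical rule for illustration; the thresholds and the
    condition-to-regime mapping are assumptions, not the PR's logic.
    """
    if abs(adv_gap) < gap_eps:
        # Advantage gap too small to trust either side: ambiguous token.
        return Regime.SELF_REG_FALLBACK
    if adv_gap > 0:
        # Assumed: a positive gap means the teacher is ahead.
        if teacher_conf >= conf_min:
            return Regime.TOPK_REVERSE_KL
        return Regime.FULL_VOCAB_JSD
    # Assumed: a negative gap means the student is ahead.
    if student_conf >= conf_min:
        return Regime.STUDENT_CONSISTENCY
    return Regime.SELF_REG_FALLBACK


def route_tokens(adv_gaps, teacher_confs, student_confs, **kwargs):
    """Return one boolean mask per regime over a token sequence."""
    regimes = [
        route_token(g, t, s, **kwargs)
        for g, t, s in zip(adv_gaps, teacher_confs, student_confs)
    ]
    return {r: [x is r for x in regimes] for r in Regime}


def masks_are_exhaustive(masks):
    """True if every token position is set in exactly one mask."""
    return all(sum(column) == 1 for column in zip(*masks.values()))
```

Because `route_token` returns a single enum value per token, the masks are mutually exclusive and exhaustive by construction, which is the property a CPU-only test can assert without loading a model.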