Align KTO doc with DPO and fix Logged metrics wording#6258
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 83869672d5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| * `num_tokens`: The total number of tokens processed so far. | ||
| * `loss`: The average KTO loss over the current logging interval. | ||
| * `entropy`: The average entropy of the model's predicted token distribution over non-masked tokens. |
There was a problem hiding this comment.
Remove KTO metrics that are never logged
For KTO, _compute_loss only appends kl, rewards/*, logps/*, and logits/* to self._metrics, and log() just averages those; unlike the SFT/Reward trainers, it never updates a num_tokens counter or computes entropy. Users following this section will look for num_tokens and entropy in KTO runs, but those fields are not emitted.
Useful? React with 👍 / 👎.
| The [`experimental.kto.KTOTrainer`] fully supports fine-tuning models with _tool calling_ capabilities. In this case, each dataset example should include: | ||
|
|
||
| * The conversation messages (prompt and completion), including any tool calls (`tool_calls`) and tool responses (`tool` role messages) | ||
| * The list of available tools in the `tools` column, typically provided as JSON schemas |
There was a problem hiding this comment.
Stop documenting unsupported KTO tool schemas
KTO's preprocessing does not consume the documented tools column: tokenize_fn only forwards example.get("chat_template_kwargs", {}) into processing_class.apply_chat_template, and _set_signature_columns_if_needed does not retain tools for raw datasets. Examples that follow this documented format therefore render prompts without the available tool schemas, so tool-calling fine-tunes that rely on the standard tools column train on the wrong prompt.
Useful? React with 👍 / 👎.
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Docs only.
KTO doc (
kto_trainer.md) restructured to mirrordpo_trainer.md: added "Looking deeper into the KTO method" (loss + Loss Types table), "Customization" (constraints, model init, PEFT, Liger), "Tool Calling", and "Training Vision Language Models"; rewrote the "Logged metrics" section, which was wrong (listed non-existentrewards/chosen_sum,count/chosen, …); aligned shared wording with DPO. Kept KTO-only content (experimental warning, Example script, Usage tips).Logged metrics wording across trainer docs:
loss,learning_rate,entropy,num_tokens, … — not just rewards), matchingtpo_trainer.md.lossdescription fixed: was "cross-entropy loss" (copied from SFT), now "DPO loss" — DPO's loss is the preference loss, not token cross-entropy.Notes: KTO's metrics list includes
entropy/num_tokens(from two in-flight PRs) and the "Tool Calling" section assumes tool support from a separate PR: land those first. All referenced dataset/model IDs are real.Note
Low Risk
Changes are limited to Markdown documentation with no runtime or training code modified.
Overview
Documentation-only updates across TRL trainer guides.
KTO (
kto_trainer.md) is expanded to follow the same structure as DPO: overview/badge tweaks, dataset format examples, a Looking deeper into the KTO method section (loss math +loss_typetable), a corrected Logged metrics list (replacing obsolete*_sum/count/*names), Customization (Liger/PEFT/constraints), plus Tool calling and VLM sections. Contributor credit and section ordering are adjusted; usage tips stay at the end.Logged metrics intros in 11 trainer docs (CPO, DPO, GRPO, Nash-MD, Online DPO, ORPO, Reward, RLOO, SFT, TPO, XPO) now say metrics instead of reward metrics, since many logged fields are not rewards.
DPO clarifies that logged
lossis the average DPO preference loss, not token cross-entropy (which had been copied from SFT).Reviewed by Cursor Bugbot for commit 0bf3551. Bugbot is set up for automated code reviews on this repo. Configure here.