implement message level rollout with linear trajectories by AmineDiro · Pull Request #6250 · huggingface/trl

AmineDiro · 2026-07-02T15:43:55Z

AsyncGRPO: message-mode rollouts

Adds an opt-in way to build training rows from a multi-turn conversation.

Message mode keeps the conversation as messages and re-tokenizes the whole thing each turn, then checks whether the fresh tokens still start with the tokens held so far. If yes → append the new part (same as token mode). If no → a rewrite happened → close the row and open a new one that matches what the model actually read.

config

AsyncGRPOConfig(
    rollout_protocol="message",   # "token" (default) | "message"
    fork_threshold_tokens=1024,   # message mode only
)

How rows are built

Each turn is a TurnRecord(prompt_ids, output_ids, output_log_probs). At the end, _chain_to_sequences walks the turns and per turn classifies the drift vs. the tokens held so far:

CLEAN — new prompt starts with held tokens → append (prompt/tool = context, generated = trained).
REALIGN — only the last answer's tail wobbled and the new turn < fork_threshold_tokens → overwrite that tail as context, same row.
FORK — a real rewrite → start a new row.

Advantage: one per conversation, stamped on every row it produced, no split. Under the token-mean loss a fork is invisible (each generated token trained once, same advantage, same denominator).

Next work: Tree trajectories

The design already covers the two hard parts:

Scoring is branch-agnostic. Every TrainingSequence carries a rollout_id and _score_group groups by it, not by list position. N rows per conversation already works today; a tree that yields several rows per conversation needs no scoring change.
The reconciler is tree-agnostic. _common_prefix_len / _SampleBuilder / _chain_to_sequences reconcile a single linear chain of turns. A root→leaf path in a tree is such a chain.

So the only change a tree adds is on the rollout side: replace the flat turns: list[TurnRecord] (+ one _chain_to_sequences call) with a tree of turns, then run the reconciler once per root→leaf path, with all rows sharing the conversation's rollout_id. Recording would carry the turn's message context so each turn can be placed under its parent node (this is what lets shared prefixes branch and stay trained once). Nothing downstream (collator, loss, scoring) changes.

Note

Medium Risk
Touches multi-turn rollout → training-sample mapping and advantage stamping for GRPO; default rollout_protocol="token" limits blast radius, but message mode can change which tokens are trained when chat templates drift.

Overview
Adds an opt-in message rollout path for AsyncGRPO alongside the default token buffer mode, controlled by rollout_protocol and fork_threshold_tokens on AsyncGRPOConfig.

In message mode, MessageRolloutLoop re-tokenizes the full conversation each turn and runs _chain_to_sequences to turn turn records into one or more TrainingSequence rows: CLEAN appends on one row, REALIGN treats small last-answer wobble as context, FORK starts a new row when history rewrites. Token mode is unchanged in behavior but now emits a single TrainingSequence per conversation.

Rollout groups carry completions_sequences instead of flat logprob/mask lists; _score_group expands each conversation into multiple RolloutSamples while stamping the same conversation-level advantage on every forked row (metrics get a per-row copy). The trainer picks MessageRolloutLoop vs AsyncRolloutLoop and passes loop_cls into the spawned worker. Tests cover the reconciler, message loop, and scoring.

^{Reviewed by Cursor Bugbot for commit 7a354d6. Bugbot is set up for automated code reviews on this repo. Configure here.}

qgallouedec

cool, thanks! discussed internally. I'm sharing the figure I made

qgallouedec · 2026-07-02T15:57:14Z

+        rollout_protocol (`str`, *optional*, defaults to `"token"`):
+            How a multi-turn conversation is turned into training rows. `"token"` grows a token buffer, appending each
+            turn's generated tokens and tokenized tool results (fast; cannot represent a conversation rewrite).
+            `"message"` re-tokenizes the whole conversation every turn and reconciles the result against the tokens held
+            so far: a clean append stays one row, a rewrite (dropped reasoning, summarized history) forks a new row.
+        fork_threshold_tokens (`int`, *optional*, defaults to `1024`):
+            Message mode only. When a turn's re-tokenized prompt drifts inside the last generated answer, a drift with a
+            generated turn shorter than this many tokens is treated as a re-tokenization wobble (realigned as context)
+            rather than a rewrite (a new row). Ignored when `rollout_protocol="token"`.


I'm going to advocate for simplicity here: because it might become impossible to maintain if we continue to support every possible configuration. What do you think about supporting only the message protocol? This trainer is still experimental; we don't need to be backward compatible, so it's a good time to make bold and radical decisions.

I was thinking the same thing. Only thing that bothers me is the performance penalty of supporting this messages mode :/ Need to measure this to be sure 🫡 .

But I agree with the idea 1000%

yes I understand the concern. With tokens, you can only get G sequences for 1 prompt. For messages, you end up with, worst case scenario G*max_num_turns

bot-ci-comment · 2026-07-02T16:09:41Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2026-07-02T16:15:08Z

to follow the repo structure, we should have only one test_async_grpo_trainer.py

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 9dc7d79. Configure here.}

cursor · 2026-07-03T15:25:55Z

+            tool_failure_count += n_failures
+            completion.extend(tool_messages)
+            messages.extend(tool_messages)  # tool result goes back as a MESSAGE, re-tokenized next turn
+            iteration_num += 1


Empty tool calls loop forever

High Severity

In MessageRolloutLoop, the exit check only treats missing tool_calls as terminal (tool_calls is None). An assistant message with tool_calls set to an empty list is treated as a tool turn: no messages are appended, iteration_num advances, and the loop re-tokenizes and generates again with identical input. With no iteration cap, this can spin indefinitely in the rollout worker.

^{Reviewed by Cursor Bugbot for commit 9dc7d79. Configure here.}

valid defensive concern, real hang path, but a genuine edge case.
Worth a one-line if not tool_calls: in both loops?

cursor · 2026-07-03T15:25:55Z

+            builders.append(builder)
+        else:
+            builders[-1].append_turn(turn, kind)
+    return [b.to_training_sequence(rollout_id) for b in builders if b.has_trained_token()]


Filtered rows skew GRPO advantages

Medium Severity

_chain_to_sequences drops reconciled builders with no completion_mask ones, so message-mode rollouts can yield zero TrainingSequences for a conversation (e.g. an empty generation turn). _score_group still computes that conversation’s reward and folds it into group mean/std for advantages, but emits no RolloutSamples, unlike token mode which always enqueues one row per generation.

Additional Locations (1)

trl/experimental/async_grpo/async_rollout_worker.py#L810-L843

^{Reviewed by Cursor Bugbot for commit 9dc7d79. Configure here.}

implement message level rollout with linear trajectories

dc49156

AmineDiro requested review from adithya-s-k, albertvillanova, burtenshaw and qgallouedec July 2, 2026 15:43

cleanup tests

172bea1

qgallouedec reviewed Jul 2, 2026

View reviewed changes

Merge main into linear-trajectory

cb5fff5

qgallouedec reviewed Jul 2, 2026

View reviewed changes

merge tests

9dc7d79

cursor Bot reviewed Jul 3, 2026

View reviewed changes

Merge branch 'main' into linear-trajectory

7a354d6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

implement message level rollout with linear trajectories#6250

implement message level rollout with linear trajectories#6250
AmineDiro wants to merge 5 commits into
mainfrom
linear-trajectory

AmineDiro commented Jul 2, 2026 •

edited by cursor Bot

Loading

Uh oh!

qgallouedec left a comment

Uh oh!

qgallouedec Jul 2, 2026

Uh oh!

AmineDiro Jul 2, 2026

Uh oh!

qgallouedec Jul 2, 2026 •

edited

Loading

Uh oh!

bot-ci-comment Bot commented Jul 2, 2026

Uh oh!

qgallouedec Jul 2, 2026

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Jul 3, 2026

Uh oh!

qgallouedec Jul 3, 2026

Uh oh!

cursor Bot Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

AmineDiro commented Jul 2, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AsyncGRPO: message-mode rollouts

config

How rows are built

Next work: Tree trajectories

Uh oh!

qgallouedec left a comment

Choose a reason for hiding this comment

Uh oh!

qgallouedec Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

AmineDiro Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

qgallouedec Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bot-ci-comment Bot commented Jul 2, 2026

Uh oh!

qgallouedec Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jul 3, 2026

Choose a reason for hiding this comment

Empty tool calls loop forever

Uh oh!

qgallouedec Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jul 3, 2026

Choose a reason for hiding this comment

Filtered rows skew GRPO advantages

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AmineDiro commented Jul 2, 2026 •

edited by cursor Bot

Loading

qgallouedec Jul 2, 2026 •

edited

Loading