Environment-owned reward #6238
tools = tools or []
self._standalone_tools = tools  # tools that are not bound to the environment
self.environment_factory = environment_factory
self._env_owns_reward = False
Small nit: "owns reward" suggests an env that handles all of the reward computation, but we can still pass in separate reward functions?
# The environment owns the reward: score it now, while this rollout's environment still holds its
# final state (it is reset only when drawn again for the next rollout).
if self._env_owns_reward:
    group.env_rewards.append(environment.get_reward())
Looking at this, we might also need to support async rewards (like LLM-as-judge), so we probably need an async version of `get_reward`, e.g. `get_reward_async`.
We are also in the main generation loop here, so if a user passes a sync reward function that takes time, it will stall the in-flight requests loop.
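One possible shape for this, as a minimal sketch rather than the PR's implementation: prefer an async hook when the environment defines one, and push a blocking sync `get_reward` onto a worker thread so the generation loop keeps servicing in-flight requests. `get_reward_async` is the hypothetical name from the comment above; `score_rollout` and both toy environments are illustrative assumptions.

```python
import asyncio
import inspect
import time


class SyncEnv:
    """Toy environment with a slow, blocking reward (e.g. a heavy checker)."""

    def get_reward(self) -> float:
        time.sleep(0.05)  # stands in for slow synchronous scoring
        return 1.0


class JudgeEnv:
    """Toy environment with an async reward (e.g. an LLM-as-judge call)."""

    async def get_reward_async(self) -> float:  # hypothetical hook name
        await asyncio.sleep(0.05)  # stands in for a network round trip
        return 0.5


async def score_rollout(env) -> float:
    """Score one finished rollout without blocking the event loop."""
    async_hook = getattr(env, "get_reward_async", None)
    if async_hook is not None and inspect.iscoroutinefunction(async_hook):
        return float(await async_hook())
    # Sync reward: run it in a worker thread so the loop stays responsive.
    return float(await asyncio.to_thread(env.get_reward))


async def main() -> list[float]:
    # Both rewards are computed concurrently; neither blocks the other.
    return await asyncio.gather(score_rollout(SyncEnv()), score_rollout(JudgeEnv()))


rewards = asyncio.run(main())
```

The worker-thread fallback helps when the blocking call waits on I/O or otherwise releases the GIL; CPU-bound pure-Python scoring would still compete with the loop for the interpreter.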
AmineDiro
left a comment
That's a very nice PR that will clean up the API a lot!
sergiopaniego
left a comment
this is great, thanks!
some parts that need to be revisited:
- OpenEnv guide
- Example scripts/notebooks using OpenEnv
GRPO supports **agent training**: the model calls tools during generation and learns from the outcome.

- A **tool** is a plain Python function (sync or async) exposed to the model. Use `tools` for stateless calls (a calculator, a web search).
- An **environment** is the more general form: a stateful object built fresh per rollout, whose public methods are exposed as tools, plus a `reset` lifecycle hook and an optional `get_reward` that lets it own the reward. Use `environment_factory` when you need per-rollout state, a reset hook, or environment-owned reward.
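To make the distinction concrete, here is a minimal sketch. `multiply`, `CounterEnv`, and `exposed_tool_names` are hypothetical names; the helper only imitates the rule described above (public methods become tools, `reset` and `get_reward` are reserved) and is not the trainer's actual discovery code.

```python
import inspect


def multiply(a: int, b: int) -> int:
    """Stateless tool: multiply two integers."""
    return a * b


class CounterEnv:
    """Stateful environment: a fresh instance is built for each rollout."""

    def reset(self) -> None:
        # Lifecycle hook: clear per-rollout state.
        self.count = 0

    def increment(self, step: int) -> str:
        """Public method, exposed to the model as a tool."""
        self.count += step
        return f"count is now {self.count}"

    def get_reward(self) -> float:
        # Reserved: scores the finished rollout from internal state.
        return 1.0 if self.count == 3 else 0.0


RESERVED = {"reset", "get_reward"}


def exposed_tool_names(env) -> list[str]:
    """Illustrative only: public, non-reserved methods of an environment."""
    return sorted(
        name
        for name, _ in inspect.getmembers(env, predicate=inspect.ismethod)
        if not name.startswith("_") and name not in RESERVED
    )


env = CounterEnv()
env.reset()
env.increment(1)
env.increment(2)
tool_names = exposed_tool_names(env)
reward = env.get_reward()
```

In this sketch `multiply` would be passed through `tools`, while `CounterEnv` would be passed as `environment_factory`.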
Not sure whether the `get_reward` idea here could collide with how rewards are retrieved from OpenReward/Harbor @adithya-s-k
Implements the RFC in #5912: lets the environment own the reward. If an `environment_factory` environment defines a reserved `get_reward()` method (no args → `float`), it is called once per completed rollout and added as a reward source. So `reward_funcs` becomes optional.

Important
This adds no new capability. It's an ergonomics change. The same reward was always expressible through `reward_funcs`; this just lets the environment own it directly, which is the natural formulation for stateful environments.

This matters for stateful / multi-turn environments, where the reward is a function of the environment's internal state (was the word guessed? did the game end in a win?). Today that state has to be leaked back out to a trainer-owned `reward_func`; now the environment scores itself.

Before
The reward lives in the trainer and reaches into the environment to recompute what it already knows:
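A minimal sketch of that pattern, assuming a toy `WordleEnv` and a reward function that is handed the rollout environments through an `environments` argument. Both the class and that argument name are hypothetical, chosen for illustration only.

```python
class WordleEnv:
    """Toy stateful environment (hypothetical)."""

    def reset(self) -> None:
        self.secret = "crane"
        self.solved = False

    def guess(self, word: str) -> str:
        """Tool exposed to the model: submit a five-letter guess."""
        self.solved = self.solved or word == self.secret
        return "correct" if word == self.secret else "try again"


def wordle_reward(environments, **kwargs) -> list[float]:
    # Trainer-owned reward: reaches into each environment to read
    # state the environment already tracks.
    return [1.0 if env.solved else 0.0 for env in environments]


# Simulate one solved and one unsolved rollout.
solved_env, unsolved_env = WordleEnv(), WordleEnv()
solved_env.reset()
unsolved_env.reset()
solved_env.guess("crane")
unsolved_env.guess("slate")
rewards = wordle_reward([solved_env, unsolved_env])
```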
After
The environment scores the episode it just ran:
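A minimal sketch of the environment-owned form, using a toy `WordleEnv` (hypothetical, for illustration). The only contract taken from this PR is the reserved `get_reward()` method with no arguments and a `float` return; everything else is an assumption.

```python
class WordleEnv:
    """Toy stateful environment that scores its own episode (hypothetical)."""

    def reset(self) -> None:
        self.secret = "crane"
        self.solved = False

    def guess(self, word: str) -> str:
        """Tool exposed to the model: submit a five-letter guess."""
        self.solved = self.solved or word == self.secret
        return "correct" if word == self.secret else "try again"

    def get_reward(self) -> float:
        # Reserved method: called once after the rollout completes,
        # never exposed to the model as a tool.
        return 1.0 if self.solved else 0.0


# Simulate one rollout: reset, let the "model" guess, then score.
env = WordleEnv()
env.reset()
reward_before_guess = env.get_reward()
env.guess("crane")
reward_after_guess = env.get_reward()
```

No separate reward function is needed here, since the state never leaves the environment.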
Details
- `get_reward` is reserved (like `reset`): it is not exposed to the model as a tool.
- Logged as its own reward source (`rewards/WordleEnv/mean`), always weight 1: the environment owns its scale.
- With `reward_funcs`: all sources are summed; `reward_weights` applies to `reward_funcs` only.
- Errors at init when there is no reward source (no `reward_funcs` and no `get_reward`).
- Affected trainers: `GRPOTrainer`, `DPPOTrainer`, `GRPOWithReplayBufferTrainer`, and `AsyncGRPOTrainer`. Docs + tests included.

AI writing disclosure
Note
Medium Risk
Changes core RL reward plumbing and makes `reward_funcs` optional, so misconfigured trainers now error at init but existing scripts without rewards could break; metric key rename (`reward/` → `rewards/`) may affect dashboards.

Overview
Adds environment-owned rewards for agent training: when an `environment_factory` class defines optional `get_reward() -> float`, the trainer calls it once per completed rollout and treats it as an extra reward source (weight 1, logged as `rewards/{EnvClassName}/mean|std`).

`reward_funcs` is now optional if the environment supplies `get_reward`; trainer-owned and env-owned rewards sum together, with `reward_weights` applying only to `reward_funcs`. `get_reward` is reserved like `reset` (not exposed to the model as a tool), and environment method discovery excludes both `reset` and `get_reward` when building tool lists.

`GRPOTrainer` registers an internal reward wrapper at init and fails fast with `ValueError` when neither `reward_funcs` nor env `get_reward` is provided. `AsyncGRPOTrainer` / `AsyncRolloutWorker` capture per-rollout `env_rewards` at generation time and append them during scoring. Docstrings for DPPO (and related trainers) align with the same contract; GRPO/RLOO docs rename per-function metrics from `reward/...` to `rewards/...` and expand the agent-training guide (tools vs environments, combined rewards).

Tests cover env-only reward, coexistence with `reward_funcs`, and the no-reward-source error.

Reviewed by Cursor Bugbot for commit 218ff6f. Bugbot is set up for automated code reviews on this repo.
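The reward composition described in the overview can be sketched as follows. `combine_rewards` is a hypothetical helper, not the trainer's code: it weights trainer-owned rewards by `reward_weights`, adds the environment reward at weight 1, and raises `ValueError` when no reward source exists. Per the summary above, the real check runs at trainer init, whereas this sketch checks per call for simplicity.

```python
from typing import Optional, Sequence


def combine_rewards(
    func_rewards: Sequence[float],
    reward_weights: Optional[Sequence[float]] = None,
    env_reward: Optional[float] = None,
) -> float:
    """Total reward for one rollout (illustrative, not the trainer's code)."""
    if not func_rewards and env_reward is None:
        raise ValueError("No reward source: pass reward_funcs or define get_reward().")
    if reward_weights is None:
        reward_weights = [1.0] * len(func_rewards)
    if len(reward_weights) != len(func_rewards):
        raise ValueError("reward_weights must match the number of reward_funcs.")
    # reward_weights applies to the trainer-owned rewards only.
    total = sum(w * r for w, r in zip(reward_weights, func_rewards))
    if env_reward is not None:
        total += env_reward  # environment reward always has weight 1
    return total


# Two trainer-owned rewards (weights 2 and 4) plus an environment reward.
total = combine_rewards([1.0, 0.5], reward_weights=[2.0, 4.0], env_reward=1.0)
# Environment-only configuration: no reward_funcs at all.
env_only = combine_rewards([], env_reward=0.25)
```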