[WIP] Benchmark rl_design branch logic #1692

Open
hhaAndroid wants to merge 47 commits into InternLM:rl_design from hhaAndroid:benchmark_logic

@hhaAndroid
Collaborator

No description provided.

jayhenry and others added 30 commits February 5, 2026 10:57
* first version

* modify based on refactor_rollout_demo

* add more comments

* move generate_group from Env to Agent

* rename to RolloutState and Environment to be same with doc
…and adjust interface (InternLM#1488)

* [Rollout] Part 1.1: add return_routed_experts in sample_params and add update_status_from_finish_reason

* [Rollout] Part 2: refactor RolloutController interface and use RolloutState

* [Rollout] Part 2.1: adapt RolloutWorker to RolloutController

* [Rollout] Part 2.2: add rollout ut

* [Rollout] fix comments: 1. support error_msg in RolloutState; 2. adjust interface to pause_generation and offload;

* [Rollout] fix comments: delete useless return
* [Judger] refactor Judger to only expose judge to user

* [Judger] support multi-judger and fix ut
…#1490)

* [ReplayBuffer] add ReplayBuffer with various StorageBackend: FIFO, Staleness, or Database (to be implemented in the future)

* [ReplayBuffer] optimize implementation of ReplayBuffer

* fix comments: add NaiveStorage and use FIFO/staleness as the policy for getting items
* [Producer] Add Sampler, SamplerWithBuffer, SyncProduceStrategy, AsyncProduceStrategy

* add tqdm in ProduceStrategy and fix comments on sampler
* [AgentLoopManager] support gsm8k agent_loop_manager and provide the usage of AgentLoopManager in ut

* fix lint error and mv all modules to base
…1493)

* Optimize Sampler && ProduceStrategy: 1. rename SamplerWithReplayBuffer to Sampler and use Sampler by default; 2. add is_valid_sample_fn and should_continue_fn arguments to produce_batch

* Add config for agentloop, agentloopmanager, produceStrategy

* fix ProduceStrategyConfig and AgentLoopConfig
* refine design file

* rl colocate trainer

* add exp tracker and fix train loop bug

* skip routed_experts when RolloutState.dump

* fix Sampler, ProduceStrategy, ... init methods

* add debug_rollout

* [Baseline] Run successfully with correct reward curve

* 1) Introduce abstract Judger class. 2) Adjust AgentLoopManagerConfig to include judger and related configs

* Add RL evaluation framework:
1) Modify RolloutState reward type to support dict;
2) Introduce Evaluator class for metric computation;
3) Integrate evaluator into RLColocateTrainer for initial and periodic evaluations;
4) Add length method to DatasetSampler for evaluator usage

* refine code

* add xtuner meta and work_dir

* fix agent loop manager unit test

* adjust evaluator config

* add _log_mini_batch_metrics

---------

Co-authored-by: YanhuiDua <dyh10280@163.com>
* Add RL Colocate Trainer configuration and cli entrypoint
* add ut CI

* add new rl_qwen3_gsk8k_grpo.py

* update test_rollout.py

* fix failed ut and skip evaluator, rl_trainer, vl_rollout ut
…nternLM#1521)

* Introduce RouterJudger and split judger configs into Native/Router variants

* fix dapo default config type
* 1. use StorageItem and QueryItem to replace StorageIndices and add DSLRule for QueryItem; 2. split ReplayBuffer to StorageBackend and ReplayPolicy

* use QueryDict to replace QueryItem and add PandasStorage

* fix copilot comments
…nLM#1520)

* Build XtunerMeta and TrainController by cfg.build

* Build RolloutController by cfg.build

* simplify rl colocate trainer init

* fix some lint errors

* fix some bugs
* fix cache state

* add cache file for zhaopenhao

---------

Co-authored-by: Your Name <you@example.com>
* fix lint error

* fix producer and agentloop ci

* adjust reward type to dict
* add gsm8k_with_tool agent_loop as example

* fix claude comments

* fix haian comments

* add data_preprocess for gsm8k_with_tool
* add verl tool agent loop and unit test

* fix hang: verl tool agent uses a fixed loop, so we also need to use a fixed loop in trainer

* add log in agent loop manager and verl agent loop

* add sandbox to xtuner._testing

* add comments for asyncio_run

* skip verl agent loop unit test by default

* [Fix] XtunerAsyncLLMServerManager uses dynamic values from sampling_params instead of hardcoded defaults.

* Update session_uid handling in XtunerAsyncLLMServerManager and VerlToolAgentLoop to use session_id from sampling_params for improved session tracking

* refine default asyncio loop management

* move verl tool agent to recipe

* fix verl sandbox config

* Add verl Gsm8kTool example

* rename config and simplify verl config

* adapt to new rollout state api
* refactor folder

* rm useless ut and mv ut from ray to rl

* fix import error for main

* restore ut

* fix lint

* rm async config

* fix claude comments

* fix import error
…nternLM#1552)

* Introduce JudgerConfig with judger_type and Unify judger configs

* fix tool config and add comments for JudgerConfig

* add checks for num_ray_actors, num_cpus_per_actor and cpu_memory_per_actor

* fix trailing whitespace in test_judger.py

* [Fix] Remove redundant warning message and duplicated field defaults in judger configs

* [Fix] Fix ruff-format lint: collapse multi-line logger.warning to single line

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
* 1. add uid in RolloutState and replace token_staleness with response_steps in RolloutState; 2. add get_eos_ttoken in utils

* 1. rename OverProduceStrategyConfig to AsyncProduceStrategyConfig;

2. assign uid for each RolloutState in Sampler;

3. Support oversample in AsyncProduceStrategy

* support partial rollout in agentloop and rollout worker

* remove min_staleness and max_staleness in AsyncReplayBuffer

* support tail_batch

* add async config and ut

* rm useless pause / restart in rltrainer

* mv partial_rollout_handler to singleturn agent loop

* fix bugs for new code

* add real rollout controller and agent loop ut

* fix claude comments

* [Nit] Add type hints, translate Chinese comments, fix PEP 8 blank lines

- Add type hint to PartialRolloutHandler.__init__ return type
- Translate Chinese docstring and comments to English in agent_loop/utils.py
- Add missing PEP 8 blank lines between test classes in test_async_rollout.py

* [Fix] Wrap long docstring lines in RolloutConfig to pass docformatter lint

* replace response_step with response_rollout_steps

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
YanhuiDua and others added 17 commits March 16, 2026 19:35
* support log rollout info

* fix claude comments

* replace timing_n with group_gen_count and completed_samples with leftover_completed

* add docstring for ProducerTimings and ProduceBatchResult

* replace to Attributes to Args
* add qwen25_7B dapo_math config

* add qwen25_7B sync filter config
* Add checkpointing functionality to RLColocateTrainer and related classes

- Implemented save and resume methods in AgentLoopManager and Sampler for managing dataloader state.
- Enhanced RLColocateTrainer to support checkpoint configuration, including saving and resuming training state.
- Updated configuration to include checkpoint parameters and integrated checkpoint handling in training workflow.
- Added logging for checkpoint operations to improve traceability.
* Refactor rollout controller: rm redundant definition and rm useless functions

* mv SessionRouter to utils and add RolloutHealthChecker

* Improve rollout worker state safety and add rollout utils tests

* Refactor rollout utils tests to unittest style

* Refine rollout utils recover test for two-worker deactivate/restart flow

* Use real rollout controller setup in recover integration test

* fix ut

* fix claude comments

* fix typo
)

* support replay buffer save and resume

* fix lint

* fix ut

* debug ut
…nLM#1613)

* add deterministic and random seed support in RLColocateTrainer

* fix test_producer.py mock for get_rollout_metadata
* support vl

* update

* fix lint

* fix dataloader

* fix lint

* update comment

* refine

* update
* support r3 for moe model

* fix claude comments

* change routed_experts from tensor to list

* fix lint
* support multienv

* update cfg

* update design

* update

* update
…tLoop (InternLM#1663)

* Introduce CPUActorLauncher infrastructure

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Refactor judger and agent loop core abstractions

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Migrate configs and tests to the new judger interface

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix rollout controller access and CPU actor launcher defaults

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* mv judger_sample from agent_loop

* fix cpu pg

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nternLM#1675)

* split judger dispatch/build flow and cover MultiJudgerConfig

* rename multijudger to composedjudger
…Anthropic, and Responses APIs (InternLM#1679)

* add get_ready_status for rollout controller

* add gateway adapters and local rollout backend for OpenAI, Anthropic, and Responses APIs

* support streaming and fix toolcall parse

* align gateway with actor-based rollout serving

* add tool-call-parser and reasoning parser for rollout controller

* add qwen3p5 tool call parser

* Add per-api-key gateway capture folders and add debug doc and scripts

* add keyed chat trace collection

* add get/pop/clear for tracestore

---------

Co-authored-by: huanghaian <huanghaian@pjlab.org.cn>

4 participants