[WIP] Benchmark rl_design branch logic by hhaAndroid · Pull Request #1692 · InternLM/xtuner

hhaAndroid · 2026-04-21T07:43:08Z

No description provided.

…and adjust interface (InternLM#1488)
* [Rollout] Part 1.1: add return_routed_experts in sample_params and add update_status_from_finish_reason
* [Rollout] Part 2: refactor RolloutController interface and use RolloutState
* [Rollout] Part 2.1: adapt RolloutWorker to RolloutController
* [Rollout] Part 2.2: add rollout ut
* [Rollout] fix comments: 1. support error_msg in RolloutState; 2. adjust interface to pause_generation and offload
* [Rollout] fix comments: delete useless return

…#1490)
* [ReplayBuffer] add ReplayBuffer with various StorageBackend: FIFO, Staleness, or Database (to be implemented in the future)
* [ReplayBuffer] optimize implementation of ReplayBuffer
* fix comments: add NaiveStorage and take fifo/staleness as policy for getting item

…1493)
* Optimize Sampler && ProduceStrategy: 1. rename SamplerWithReplayBuffer to Sampler and use Sampler by default; 2. add is_valid_sample_fn and should_continue_fn arguments to produce_batch
* Add config for agentloop, agentloopmanager, produceStrategy
* fix ProduceStrategyConfig and AgentLoopConfig

* refine design file
* rl colocate trainer
* add exp tracker and fix train loop bug
* skip routed_experts when RolloutState.dump
* fix Sampler, ProduceStrategy, ... init methods
* add debug_rollout
* [Baseline] Run successfully with correct reward curve
* 1) Introduce abstract Judger class. 2) Adjust AgentLoopManagerConfig to include judger and related configs
* Add RL evaluation framework: 1) Modify RolloutState reward type to support dict; 2) Introduce Evaluator class for metric computation; 3) Integrate evaluator into RLColocateTrainer for initial and periodic evaluations; 4) Add length method to DatasetSampler for evaluator usage
* refine code
* add xtuner meta and work_dir
* fix agent loop manager unit test
* adjust evaluator config
* add _log_mini_batch_metrics

Co-authored-by: YanhuiDua <dyh10280@163.com>

* 1. use StorageItem and QueryItem to replace StorageIndices and add DSLRule for QueryItem; 2. split ReplayBuffer into StorageBackend and ReplayPolicy
* use QueryDict to replay QueryItem and add PandasStorage
* fix copilot comments

* add verl tool agent loop and unit test
* fix hang: the verl tool agent uses a fixed loop, so the trainer also needs to use a fixed loop
* add log in agent loop manager and verl agent loop
* add sandbox to xtuner._testing
* add comments for asyncio_run
* skip verl agent loop unit test by default
* [Fix] XtunerAsyncLLMServerManager uses dynamic values from sampling_params instead of hardcoded defaults.
* Update session_uid handling in XtunerAsyncLLMServerManager and VerlToolAgentLoop to use session_id from sampling_params for improved session tracking
* refine default asyncio loop management
* move verl tool agent to recipe
* fix verl sandbox config
* Add verl Gsm8kTool example
* rename config and simplify verl config
* adapt to new rollout state api

…nternLM#1552)
* Introduce JudgerConfig with judger_type and unify judger configs
* fix tool config and add comments for JudgerConfig
* add checks for num_ray_actors, num_cpus_per_actor and cpu_memory_per_actor
* fix trailing whitespace in test_judger.py
* [Fix] Remove redundant warning message and duplicated field defaults in judger configs
* [Fix] Fix ruff-format lint: collapse multi-line logger.warning to single line

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>

* 1. add uid in RolloutState and replace token_staleness with response_steps in RolloutState; 2. add get_eos_ttoken in utils
* 1. rename OverProduceStrategyConfig to AsyncProduceStrategyConfig; 2. assign uid for each RolloutState in Sampler; 3. Support oversample in AsyncProduceStrategy
* support partial rollout in agentloop and rollout worker
* remove min_staleness and max_staleness in AsyncReplayBuffer
* support tail_batch
* add async config and ut
* rm useless pause / restart in rltrainer
* mv partial_rollout_handler to singleturn agent loop
* fix bugs for new code
* add real rollout controller and agent loop ut
* fix claude comments
* [Nit] Add type hints, translate Chinese comments, fix PEP 8 blank lines
  - Add type hint to PartialRolloutHandler.__init__ return type
  - Translate Chinese docstring and comments to English in agent_loop/utils.py
  - Add missing PEP 8 blank lines between test classes in test_async_rollout.py
* [Fix] Wrap long docstring lines in RolloutConfig to pass docformatter lint
* replace response_step with response_rollout_steps

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>

* Add checkpointing functionality to RLColocateTrainer and related classes
  - Implemented save and resume methods in AgentLoopManager and Sampler for managing dataloader state.
  - Enhanced RLColocateTrainer to support checkpoint configuration, including saving and resuming training state.
  - Updated configuration to include checkpoint parameters and integrated checkpoint handling in training workflow.
  - Added logging for checkpoint operations to improve traceability.

* Refactor rollout controller: rm redundant definition and rm useless functions
* mv SessionRouter to utils and add RolloutHealthChecker
* Improve rollout worker state safety and add rollout utils tests
* Refactor rollout utils tests to unittest style
* Refine rollout utils recover test for two-worker deactivate/restart flow
* Use real rollout controller setup in recover integration test
* fix ut
* fix claude comments
* fix typo

…tLoop (InternLM#1663)
* Introduce CPUActorLauncher infrastructure
* Refactor judger and agent loop core abstractions
* Migrate configs and tests to the new judger interface
* Fix rollout controller access and CPU actor launcher defaults
* mv judger_sample from agent_loop
* fix cpu pg

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…Anthropic, and Responses APIs (InternLM#1679)
* add get_ready_status for rollout controller
* add gateway adapters and local rollout backend for OpenAI, Anthropic, and Responses APIs
* support streaming and fix toolcall parse
* align gateway with actor-based rollout serving
* add tool-call-parser and reasoning parser for rollout controller
* add qwen3p5 tool call parser
* Add per-api-key gateway capture folders and add debug doc and scripts
* add keyed chat trace collection
* add get/pop/clear for tracestore

Co-authored-by: huanghaian <huanghaian@pjlab.org.cn>

jayhenry and others added 30 commits February 5, 2026 10:57

design draft for rl components API (InternLM#1477)

08df53a

* first version
* modify based on refactor_rollout_demo
* add more comments
* move generate_group from Env to Agent
* rename to RolloutState and Environment to be same with doc

Update design (InternLM#1478)

017ec6e

update

add replaybuffer (InternLM#1479)

3ca6319

update (InternLM#1480)

789a7fe

[Rollout] Part1: Add definition for RolloutState (InternLM#1483)

b508219

add text tokenize fn of rl (InternLM#1485)

5961352

Add AgentLoop (InternLM#1487)

38c04bd

add agentloop

[Judger] refactor Judger to only expose judge to user (InternLM#1489)

4080f7f

* [Judger] refactor Judger to only expose judge to user
* [Judger] support multi-judger and fix ut

[Producer] Add Sampler and ProduceStrategy (InternLM#1491)

bad220a

* [Producer] Add Sampler, SamplerWithBuffer, SyncProduceStrategy, AsyncProduceStrategy
* add tqdm in ProduceStrategy and fix comments on sampler

Support gsm8k AgentLoopManager and delete useless module (InternLM#1492)

0530cb3

* [AgentLoopManager] support gsm8k agent_loop_manager and provide the usage of AgentLoopManager in ut
* fix lint error and mv all module to base

Add RL Colocate Trainer configuration and cli entrypoint (InternLM#1513)

9964f38

* Add RL Colocate Trainer configuration and cli entrypoint

CI: run rl case in rl_design branch (InternLM#1522)

d6209de

run rl case

CI: run unittest in rl_design branch (InternLM#1523)

2f9f8a3

* add ut CI
* add new rl_qwen3_gsk8k_grpo.py
* update test_rollout.py
* fix failed ut and skip evaluator, rl_trainer, vl_rollout ut

Introduce RouterJudger and split judger configs into Native/Router (InternLM#1521)

cfc07e6

* Introduce RouterJudger and split judger configs into Native/Router variants
* fix dapo default config type

Simplify RL Colocate Trainer initialization by using cfg.build (InternLM#1520)

36184b7

* Build XtunerMeta and TrainController by cfg.build
* Build RolloutController by cfg.build
* simplify rl colocate trainer init
* fix some lint errors
* fix some bugs

Fix cache (InternLM#1527)

bd6a8ff

* fix cache state
* add cache file for zhaopenhao

Co-authored-by: Your Name <you@example.com>

Fix lint error (InternLM#1526)

fb6fe44

* fix lint error
* fix producer and agentloop ci
* adjust reward type to dict

CI: fix timeout (InternLM#1542)

b6b6e7c

fix timeout

add gsm8k_with_tool agent_loop as example (InternLM#1543)

4d1e29d

* add gsm8k_with_tool agent_loop as example
* fix claude comments
* fix haian comments
* add data_preprocess for gsm8k_with_tool

[CI] Run only RL unit tests. (InternLM#1551)

7a6c9f5

Refactor folder (InternLM#1544)

0e05577

* refactor folder
* rm useless ut and mv ut from ray to rl
* fix import error for main
* restore ut
* fix lint
* rm async config
* fix claude comments
* fix import error

Dump train and eval trajectory (InternLM#1557)

8bab796

YanhuiDua and others added 17 commits March 16, 2026 19:35

support log rollout info (InternLM#1579)

7ba27ba

* support log rollout info
* fix claude comments
* replace timing_n to group_gen_count and completed_samples to leftover_completed
* add docstring for ProducerTimings and ProduceBatchResult
* replace to Attributes to Args

add qwen25_7B dapo_math config (InternLM#1586)

638f06a

add qwen25_7B filter async config (InternLM#1588)

c32312e

* add qwen25_7B dapo_math config
* add qwen25_7B sync filter config

fix qwen25_7B config typo (InternLM#1594)

49951c4

* fix qwen25_7B config

support get local device rank (InternLM#1601)

111ad60

support replay buffer save and resume, save_hf in trainer (InternLM#1592)

65f1d77

* support replay buffer save and resume
* fix lint
* fix ut
* debug ut

Add deterministic and random seed support in RLColocateTrainer (InternLM#1613)

bf4f8a0

* add deterministic and random seed support in RLColocateTrainer
* fix test_producer.py mock for get_rollout_metadata

Support RL VLM of rl_design branch (InternLM#1598)

a32e7c0

* support vl
* update
* fix lint
* fix dataloader
* fix lint
* update comment
* refine
* update

support r3 for moe model (InternLM#1605)

958ad94

* support r3 for moe model
* fix claude comments
* change routed_experts from tensor to list
* fix lint

support multi tasks (InternLM#1662)

7aa405e

* support multienv
* update cfg
* update design
* update
* update

refactor judger dispatch/build flow and introduce MultiJudgerConfig (InternLM#1675)

71d5306

* split judger dispatch/build flow and cover MultiJudgerConfig
* rename multijudger to composedjudger

add plan

d04d5ff

update

b3cad2e

YanhuiDua force-pushed the rl_design branch from 558af26 to 7b9e7aa on April 27, 2026 09:06

hhaAndroid wants to merge 47 commits into InternLM:rl_design from hhaAndroid:benchmark_logic

4 participants
