Question: Support for Cosmos 3 Reasoner Post-training #38

@Ethan-Lee-Sunghoon

Hi, thank you for open-sourcing this great project!

I have a few questions about post-training/SFT support for the Cosmos 3 Reasoner.

For the previous Cosmos Reason2, there were guidelines for performing LoRA SFT with trl and cosmos-rl. For Cosmos 3 Reasoner, I noticed that SFT is now supported through cosmos-framework.

While reviewing the training documentation (https://github.com/NVIDIA/cosmos-framework/blob/main/docs/training.md), I had some questions about the starting weights used in the examples:

  1. In the "Reasoner Alignment SFT with LLaVA-OneVision (vfm-vlm)" example, the backbone is Qwen/Qwen3-VL-8B-Instruct. Why does this example not start directly from the Cosmos 3 Reasoner weights? Also, what does "vfm-vlm" at the end of the example's title stand for?
  2. In contrast, the "Reasoner Alignment SFT with VideoPhy-2 (Cosmos3-Nano)" example seems to start from the Cosmos 3 Nano weights. Could you explain the key differences between these two examples and why they use different starting weights?
    Additionally, in this setup Qwen's vision encoder is frozen and only the LM of Cosmos3 is used. Could you share the reasoning behind this design choice?
  3. Lastly, as with previous Cosmos models, are there plans to release recipes that use cosmos-framework in the cosmos-cookbook?

Thank you so much for your time and support!
