119 changes: 119 additions & 0 deletions README.md
@@ -25,6 +25,7 @@
- [Quickstart](#quickstart)
- [Generator with Diffusers](#generator-with-diffusers)
- [Generator with vLLM-Omni](#generator-with-vllm-omni)
- [Generator with SGLang](#generator-with-sglang)
- [Reasoner with Transformers](#reasoner-with-transformers)
- [Reasoner with vLLM](#reasoner-with-vllm)
- [Troubleshooting](#troubleshooting)
@@ -413,6 +414,124 @@ References:

</details>

#### Generator with SGLang

<details>
<summary>Expand SGLang generator setup, endpoints, and request reference</summary>

Use SGLang Diffusion for native Cosmos 3 visual generation behind OpenAI-compatible image and video APIs. Cosmos 3 also includes video-with-sound and action/policy models; this SGLang section focuses on the currently supported text-to-image, text-to-video, and image-to-video generator serving paths.

Supported checkpoints:

| Model | Status | Notes |
| --- | --- | --- |
| `nvidia/Cosmos3-Nano` | Supported | Text-to-image, text-to-video, image-to-video |

Collaborator

Probably good to specify we support other modalities such as sound and action.

Author

Updated the wording to mention that Cosmos 3 includes video-with-sound and action/policy models, while keeping this SGLang section scoped to the currently supported T2I/T2V/I2V generator serving paths.

| `nvidia/Cosmos3-Super` | Supported | Use multiple GPUs for the 64B checkpoint |
| `nvidia/Cosmos3-Super-Text2Image` | Supported | Text-to-image specialized checkpoint |
| `nvidia/Cosmos3-Super-Image2Video` | Supported | Image-to-video specialized checkpoint |
| `nvidia/Cosmos3-Nano-Policy-DROID` | Not supported yet | Action/policy checkpoint |

Install SGLang from the main branch with diffusion extras:

```shell
git clone --branch main https://github.com/sgl-project/sglang.git
cd sglang
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e "python[diffusion]"


Can we make a tag/stable release of the SGLang repo and pin it here?
This command will always download top-of-tree SGLang, which is not what we want as part of the README.

@mickqian mickqian Jun 2, 2026

Author

Good point. I added an optional checkout step plus a version note. The default keeps tracking upstream SGLang to pick up ongoing Cosmos 3 fixes and performance improvements, while production or reproducible deployments should pin a release tag or known-good commit before install.

Collaborator

Probably best to support uv or venv

Author

Added a venv setup before the editable SGLang install.

atharvajoshi10 marked this conversation as resolved.
pip install "cosmos-guardrail==0.3.1"
```

> **Version note:** Cosmos 3 support in SGLang Diffusion currently requires the SGLang main branch. Switch to a stable SGLang release once Cosmos 3 support is included there.
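
For reproducible deployments, one option is to record the SGLang commit you validated and check it out again before reinstalling. The helpers below are a minimal sketch, not part of SGLang: the function names and the `sglang.commit` file name are hypothetical, and only standard `git` commands are used.

```shell
# Hypothetical helpers for pinning a known-good SGLang commit.

# Save the commit currently checked out in the clone at $1 to the file $2.
record_sglang_commit() {
  git -C "$1" rev-parse HEAD > "$2"
}

# Check out the commit recorded in the file $2 inside the clone at $1.
# Run this before `pip install -e "python[diffusion]"`.
checkout_sglang_commit() {
  git -C "$1" checkout --quiet "$(cat "$2")"
}

# Example (paths are illustrative):
# record_sglang_commit ./sglang ./sglang.commit
# checkout_sglang_commit ./sglang ./sglang.commit
```

Checking out a recorded hash leaves the clone in a detached-HEAD state, which is fine for an editable install.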

Start a Nano server:

```shell
sglang serve --model-path nvidia/Cosmos3-Nano
```

For a video-specialized checkpoint, use `Cosmos3-Super-Image2Video` with multiple GPUs:

```shell
sglang serve \
  --model-path nvidia/Cosmos3-Super-Image2Video \
  --num-gpus 4


If I'm not mistaken

sglang serve \
  --model-path nvidia/Cosmos3-Super-Image2Video \
  --num-gpus 4

is equivalent to CFG parallel + Ulysses degree 2, i.e.

sglang serve \
  --model-path nvidia/Cosmos3-Super-Image2Video \
  --num-gpus 4 --enable-cfg-parallel --ulysses-degree 2

which is indeed the preferred way to serve multi-GPU inference, but only if the model fits on a single GPU (>80GB). This is only the best setup for performance; it doesn't reduce memory requirements.

A safer option would be to use FSDP as the example for the Cosmos3-Super checkpoint, as this setup actually does reduce the memory requirement by sharding the weights across GPUs, i.e.:

sglang serve \
  --model-path nvidia/Cosmos3-Super-Image2Video \
  --num-gpus 4 --use-fsdp-inference

Author

If we are looking for memory-friendly setups, yes, we could do better; either FSDP or offloading would do.


I think it's good that the default command is for better perf. But can we add a note saying something like: if an OOM error is hit, the user can try adding --use-fsdp-inference to save memory? @mickqian

Author

Updated in c1dc833. I kept the default command as the performance-mode setup and added a memory-mode fallback using --use-fsdp-inference for users who hit OOM.

Author

Updated again in fe557f1: switched the OOM fallback to SGLang Diffusion's higher-level --performance-mode memory preset, while keeping the default command as the performance-mode setup.


looks good. thank you mick

```

This is the performance-mode setup. If it runs out of memory, switch to SGLang Diffusion's memory preset:

```shell
sglang serve \
  --model-path nvidia/Cosmos3-Super-Image2Video \
  --num-gpus 4 \
  --performance-mode memory
```

Vision endpoints:

| Mode | Endpoint | Notes |
| --- | --- | --- |
| Text to image | `POST /v1/images/generations` | Returns base64 by default for Cosmos 3 |
| Text to video | `POST /v1/videos` | Creates an async job; poll `GET /v1/videos/{id}` and download `/content` |
| Image to video | `POST /v1/videos` | Upload the conditioning image with `input_reference` |
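
The image-to-video row has no request example in this section. The sketch below assumes `input_reference` is sent as a multipart file field on the same `POST /v1/videos` endpoint; the function name and the example file name are hypothetical. Other fields from the text-to-video example (such as `size` or `num_frames`) would presumably be added the same way.

```shell
# Submit an image-to-video job and print its job ID.
# Usage: submit_i2v_job IMAGE_PATH PROMPT
# Assumption: the conditioning image is uploaded as the multipart
# file field `input_reference`.
submit_i2v_job() {
  curl -sS -X POST http://localhost:30000/v1/videos \
    -F "input_reference=@$1" \
    --form-string "prompt=$2" \
    | jq -r .id
}

# Example (hypothetical file name):
# job_id=$(submit_i2v_job first_frame.png "The robot lifts the box onto a shelf.")
```

Polling and downloading then follow the same steps as the text-to-video example.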

Text-to-video example:

```shell
# Submit an async video generation job and capture its ID.
job_id=$(curl -sS -X POST http://localhost:30000/v1/videos \
  --form-string "prompt=A small warehouse robot moves a blue box across a clean floor." \
  --form-string "negative_prompt=blurry, distorted, low quality" \
  --form-string "size=1280x720" \
  --form-string "num_frames=81" \
  --form-string "fps=24" \
  --form-string "num_inference_steps=35" \
  --form-string "guidance_scale=4.0" \
  --form-string "flow_shift=10.0" \
  --form-string "seed=42" \
  --form-string 'extra_params={"guardrails":true,"use_resolution_template":false,"use_duration_template":false}' \
  | jq -r .id)

# Poll until the job completes. Cosmos 3 video generation can take several minutes.
status=""
until [ "$status" = "completed" ]; do
  status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" | jq -r .status)
  [ "$status" = "failed" ] && exit 1
  sleep 5
done

# Download the completed MP4.
curl -sS -L "http://localhost:30000/v1/videos/${job_id}/content" \
  -o cosmos3_t2v_output.mp4
Comment on lines +483 to +506

Collaborator

Can we add comments here to improve readability?

Author

Added comments for the submit, poll, and download steps in the video example.

```
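
The polling loop above runs until the job reports `completed` or `failed`, so it can wait indefinitely if the job never reaches either state. A bounded variant might look like the sketch below; the function name, default attempt count, and return codes are illustrative choices, not part of SGLang.

```shell
# Poll a video job until it completes, fails, or the attempt budget runs out.
# Usage: wait_for_video JOB_ID [MAX_ATTEMPTS] [INTERVAL_SECONDS]
# Returns 0 on completed, 1 on failed, 2 on timeout.
wait_for_video() {
  local job_id="$1"
  local max_attempts="${2:-120}"
  local interval="${3:-5}"
  local attempt=0
  local status=""
  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$(curl -sS "http://localhost:30000/v1/videos/${job_id}" | jq -r .status)
    case "$status" in
      completed) return 0 ;;
      failed) echo "job ${job_id} failed" >&2; return 1 ;;
    esac
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "timed out waiting for job ${job_id}" >&2
  return 2
}

# Example:
# wait_for_video "$job_id" 120 5 && echo "ready to download"
```

With the defaults, the helper gives up after roughly ten minutes (120 attempts, 5 seconds apart); adjust both to match expected generation time.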

Text-to-image example:

```shell
curl -sS -X POST http://localhost:30000/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "A warehouse robot folds a blue cloth on a clean workbench.",
    "size": "1280x720",
    "n": 1,
    "num_inference_steps": 35,
    "guidance_scale": 6.0,
    "flow_shift": 10.0,
    "seed": 0,
    "extra_args": {
      "use_resolution_template": false,
      "guardrails": true
    }
  }'
```
```
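
Because the image endpoint returns base64 by default, the response has to be decoded before it is usable as an image file. The sketch below assumes the response follows the OpenAI images shape, with the payload in `data[0].b64_json`; the function name and output file name are hypothetical.

```shell
# Read an images response on stdin and write the first image to the given path.
# Usage: curl ... | save_first_image OUTPUT_PATH
# Assumption: the base64 payload is at .data[0].b64_json (OpenAI images shape).
save_first_image() {
  local b64
  b64=$(jq -er '.data[0].b64_json') || {
    echo "no b64_json field in response" >&2
    return 1
  }
  printf '%s' "$b64" | base64 --decode > "$1"
}

# Example: pipe the text-to-image request above into the helper.
# curl -sS -X POST http://localhost:30000/v1/images/generations ... \
#   | save_first_image cosmos3_t2i_output.png
```

The `-e` flag makes `jq` exit non-zero when the field is missing, so an error response is not silently written out as a corrupt image.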

SGLang accepts Cosmos 3 request options including `max_sequence_length`, `flow_shift`, `extra_params.guardrails`, `extra_params.use_resolution_template`, and `extra_params.use_duration_template`. Guardrails are enabled by default when `cosmos-guardrail` is installed; set `SGLANG_DISABLE_COSMOS3_GUARDRAILS=1` before starting the server to skip loading the guardrail models.

For complete serving instructions and request examples, see the [Cosmos3 SGLang cookbook](https://docs.sglang.io/cookbook/diffusion/Cosmos/Cosmos3).

</details>

#### Reasoner with Transformers
Coming soon!
