Bug Description
Cosmos3-Nano
I can build and run NVIDIA/cosmos-framework on an ARM64 GB300 system, but Cosmos3-Nano inference completes with invalid outputs.
Text-to-image writes a valid JPEG, but the image is gray/noisy texture instead of matching the prompt. The reasoner sample also completes and writes output files, but the generated text is invalid/gibberish.
Across the successful-but-invalid runs, I observed this warning:
Failed to initialize the CUTLASS kernel. Last CUDA error is: no error
This looks like an ARM64/GB300 attention backend compatibility issue rather than a command-line usage issue.
Cosmos3-Super
As a separate follow-up, I ran the same repo text-to-image sample unchanged with Cosmos3-Super on the same local ARM64 GB300 host. That run did not reach generation: it initialized the tokenizers and model, allocated roughly 126630 MiB (about 124 GiB) of GPU memory, then made no further log progress before I stopped it to free the GPU. The Cosmos3-Super run produced no output image and no CUTLASS warning.
Common Setup
Repo:
git clone https://github.com/NVIDIA/cosmos-framework.git
cd cosmos-framework
git checkout 82f8229
docker build --network=host -t cosmos-framework:arm64 .
Cosmos3-Nano Reproduction
Run the official Cosmos 3 text-to-image sample:
docker run --rm --gpus all --ipc=host --network=host \
  -v "$PWD:/workspace" \
  -v /workspace/.venv \
  -v "$HOME/.cache:/root/.cache" \
  cosmos-framework:arm64 \
  python -m cosmos_framework.scripts.inference \
    --parallelism-preset=latency \
    --no-guardrails \
    -i inputs/omni/t2i.json \
    -o outputs/cosmos3_t2i_test \
    --checkpoint-path Cosmos3-Nano \
    --seed=0 \
    --benchmark
I also tried:
The run still completed but produced the same gray/noisy output.
I also tried forcing the FlashAttention path, but that failed with:
ValueError: Could not find a compatible Attention backend for this use case / device.
Cosmos3-Super Reproduction
I also ran the same official text-to-image sample unchanged with Cosmos3-Super:
docker run --rm --gpus all --ipc=host --network=host \
  -v "$PWD:/workspace" \
  -v /workspace/.venv \
  -v "$HOME/.cache:/root/.cache" \
  cosmos-framework:arm64 \
  python -m cosmos_framework.scripts.inference \
    --parallelism-preset=latency \
    --no-guardrails \
    -i inputs/omni/t2i.json \
    -o outputs/cosmos3_super_t2i_test \
    --checkpoint-path Cosmos3-Super \
    --seed=0 \
    --benchmark
That Cosmos3-Super run reached:
Time spent on OmniMoTModel: set_up_model: 8.19 s
Then it stalled before generation. It allocated approximately 126630 MiB of GPU memory, stayed CPU-active at roughly one core with low GPU utilization, did not update logs further, and did not write vision.jpg or benchmark.json.
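The stall symptoms described above (no expected artifacts, logs no longer updating) can be checked mechanically. The following is a minimal sketch, not part of cosmos-framework: the helper names are hypothetical, and it assumes the artifacts and `*.log` files land somewhere under the directory passed with `-o`.

```python
import time
from pathlib import Path

# Artifact names taken from this report; their location is an assumption.
EXPECTED_ARTIFACTS = ("vision.jpg", "benchmark.json")


def missing_artifacts(out_dir, expected=EXPECTED_ARTIFACTS):
    """Return the expected file names that are absent anywhere under out_dir."""
    found = {p.name for p in Path(out_dir).rglob("*") if p.is_file()}
    return [name for name in expected if name not in found]


def seconds_since_last_log_write(out_dir, now=None):
    """Return seconds since the newest *.log file was modified, or None if no logs exist."""
    logs = list(Path(out_dir).rglob("*.log"))
    if not logs:
        return None
    newest = max(p.stat().st_mtime for p in logs)
    current = time.time() if now is None else now
    return current - newest
```

Calling `missing_artifacts("outputs/cosmos3_super_t2i_test")` while the run is active, together with a steadily growing value from `seconds_since_last_log_write`, would match the behavior reported here.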
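Reproducibility:
Cosmos3-Nano: always produces invalid/noisy output on this stack.
Cosmos3-Super: observed pre-generation stall on this stack.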
Expected vs. Actual Behavior
| Model | Expected | Actual |
| --- | --- | --- |
| Cosmos3-Nano | Text-to-image output should match the prompt; reasoner should emit valid text. | T2I writes a valid JPEG, but it is gray/noisy texture. The reasoner writes invalid/gibberish text. Logs show CUTLASS kernel initialization warnings. |
| Cosmos3-Super | Same unchanged T2I sample should reach generation and write vision.jpg. | Stalls after model setup and does not write an output image or benchmark file. |
Outputs
Cosmos3-Nano Error / Warning
Failed to initialize the CUTLASS kernel. Last CUDA error is: no error
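To confirm how often this warning appears across runs, the log files can be scanned for it. This is a small sketch with hypothetical helper names; it assumes the logs are plain-text `*.log` files under a directory you pass in, and it matches only the warning text quoted above.

```python
from pathlib import Path

# Prefix of the warning quoted in this report.
CUTLASS_WARNING = "Failed to initialize the CUTLASS kernel"


def count_warning(log_text, needle=CUTLASS_WARNING):
    """Return the number of log lines that contain the warning text."""
    return sum(1 for line in log_text.splitlines() if needle in line)


def scan_logs(log_dir):
    """Map each *.log file under log_dir to its warning count."""
    return {
        str(path): count_warning(path.read_text(errors="replace"))
        for path in sorted(Path(log_dir).rglob("*.log"))
    }
```

A nonzero count for the Cosmos3-Nano logs and a zero count for the Cosmos3-Super logs would be consistent with what is described in this report.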
Cosmos3-Nano Observed Output
The T2I output file is a valid 960x960 RGB JPEG, but visually appears as gray/noisy texture rather than the requested scene.
Cosmos3-Super Follow-Up
The Cosmos3-Super run did not produce an image. Last real log line:
[06-14 17:17:35|job=|INFO|cosmos_framework/utils/timer.py:138:_log] Time spent on OmniMoTModel: set_up_model: 8.19 s
Only log files were written:
console.log
debug.log
host_run.log
No vision.jpg or benchmark.json was written, and no CUTLASS/NATTEN warning was observed before manual cleanup.
System Information
| Field | Value |
| --- | --- |
| Environment | Docker, image built from repo Dockerfile |
| Hardware | Single NVIDIA GB300 |
| Architecture | aarch64 / ARM64 |
| GPU Driver | 610.43.02 |
| Container PyTorch | 2.10.0+cu130 |
| CUDA | CUDA 13 stack from container |
| Package Version / Commit | 82f8229 |
| Model | Cosmos3-Nano; follow-up also tested Cosmos3-Super |
| Observed NATTEN | 0.21.6.dev6 |
Additional Context
The same commands launch successfully and produce output files, so this is not a startup/download failure. The suspicious part is the attention backend path on ARM64 GB300/Blackwell. The source appears to contain Blackwell-specific NATTEN/CUTLASS handling, but the available ARM64 wheel may not match the support level needed by this path.
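To make a wheel mismatch easier to compare across machines, the installed versions can be collected from inside the container. The sketch below uses only the standard library; the function names are hypothetical, and it assumes the distributions are installed under the names `torch` and `natten`.

```python
import platform
from importlib import metadata


def package_versions(names=("torch", "natten")):
    """Return {distribution name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def environment_report():
    """Collect the CPU architecture, Python version, and package versions."""
    return {
        "machine": platform.machine(),  # reported as aarch64 on this host
        "python": platform.python_version(),
        "packages": package_versions(),
    }
```

Running `environment_report()` inside the `cosmos-framework:arm64` image should reproduce the PyTorch and NATTEN versions listed under System Information, which gives maintainers a quick way to compare against a known-good stack.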
For Cosmos3-Super, the unchanged text-to-image sample did not reach the point where I could evaluate output quality. It appears to be a separate pre-generation stall on the same ARM64 GB300 stack.
Please let me know if there is a recommended ARM64/GB300 dependency stack or NATTEN wheel version for Cosmos3-Nano and Cosmos3-Super inference.
debug_super.log
console_super.log
debug_nano.log
console_nano.log