
Kandinsky5 lite dit clean #1088

Open

jaisurya27 wants to merge 32 commits into hao-ai-lab:main from jaisurya27:kandinsky5-lite-dit-clean

Conversation

@jaisurya27

Adds native FastVideo implementation of Kandinsky5Transformer3DModel (no Diffusers wrapper).
Adds Kandinsky5 Lite DiT config + parameter mapping and registry wiring.
Adds a local parity test against the Diffusers checkpoint.

@gemini-code-assist
Contributor

Summary of Changes

Hello @jaisurya27, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances FastVideo's capabilities by integrating the Kandinsky5 Lite DiT model through a native implementation of its 3D Transformer. This change moves away from external Diffusers wrappers, providing a more streamlined and optimized architecture within the FastVideo ecosystem. The update includes all necessary configuration and registry entries, and its correctness is validated by a comprehensive parity test against the original Diffusers version.

Highlights

  • Native Kandinsky5Transformer3DModel Implementation: A native FastVideo implementation of the Kandinsky5Transformer3DModel has been added, eliminating the need for Diffusers wrappers and integrating directly into the FastVideo framework.
  • Kandinsky5 Lite DiT Configuration and Registry: New configuration classes (Kandinsky5ArchConfig, Kandinsky5VideoConfig) were introduced, along with parameter mapping and wiring into the model registry, enabling proper setup and discovery of the Kandinsky5 model.
  • Local Parity Test: A dedicated local parity test has been included to ensure that the native FastVideo Kandinsky5 implementation produces numerically identical results to the original Diffusers checkpoint.


Changelog
  • fastvideo/configs/models/dits/__init__.py
    • Imported Kandinsky5VideoConfig.
    • Added Kandinsky5VideoConfig to the __all__ export list.
  • fastvideo/configs/models/dits/kandinsky5.py
    • Defined Kandinsky5ArchConfig with specific parameters for the Kandinsky5 Transformer, including FSDP sharding conditions and Diffusers config fields.
    • Defined Kandinsky5VideoConfig to encapsulate the architecture configuration.
  • fastvideo/models/dits/kandinsky5.py
    • Implemented the Kandinsky5Transformer3DModel class, providing a native FastVideo version of the Kandinsky5 Transformer.
    • Included helper functions for frequency generation, local patching/merging, and fractal flattening/unflattening.
    • Defined various sub-modules: Kandinsky5TimeEmbeddings, Kandinsky5TextEmbeddings, Kandinsky5VisualEmbeddings, Kandinsky5RoPE1D, Kandinsky5RoPE3D, Kandinsky5Modulation, Kandinsky5Attention, Kandinsky5FeedForward, Kandinsky5OutLayer, Kandinsky5TransformerEncoderBlock, and Kandinsky5TransformerDecoderBlock.
    • Added a materialize_non_persistent_buffers method to handle buffer initialization.
  • fastvideo/models/loader/fsdp_load.py
    • Added a conditional call to model.materialize_non_persistent_buffers after loading an FSDP model, if the method exists.
  • fastvideo/models/registry.py
    • Registered Kandinsky5Transformer3DModel in the _VIDEO_DIT_MODELS dictionary.
  • tests/local_tests/transformers/test_kandinsky5_lite_transformer_parity.py
    • Added a new test file to verify the numerical parity between the native FastVideo Kandinsky5Transformer3DModel and the Diffusers Kandinsky5Transformer3DModel.
    • Included setup for loading both models, generating random inputs, and asserting close results.
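The parity check described in the changelog comes down to comparing the two models' outputs elementwise within a tolerance. A minimal sketch of such a comparison helper (the function name and tolerances are illustrative, not the PR's actual test code):

```python
import torch

def check_parity(ref: torch.Tensor, test: torch.Tensor,
                 rtol: float = 1e-3, atol: float = 1e-3) -> float:
    """Assert two model outputs are numerically close; return the max abs diff."""
    torch.testing.assert_close(test, ref, rtol=rtol, atol=atol)
    return (test - ref).abs().max().item()

# Toy stand-in for the two model outputs: identical tensors trivially pass.
torch.manual_seed(0)
ref = torch.randn(2, 4, 8)
max_diff = check_parity(ref, ref.clone())
```

In the real test, `ref` and `test` would be the Diffusers and native FastVideo model outputs on the same random inputs.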

@gemini-code-assist left a comment

Code Review

This pull request introduces a native FastVideo implementation for the Kandinsky 5 Lite DiT model, including its configuration, registry wiring, and a parity test against the Diffusers implementation. The overall structure is good, but I've identified a critical issue in the model's class attribute initialization that would cause an AttributeError. Additionally, there are a few other points for improvement regarding hardcoded data types, unused function parameters, and magic numbers that would enhance the code's robustness and maintainability.

Comment on lines +475 to +482
    _fsdp_shard_conditions = Kandinsky5VideoConfig()._fsdp_shard_conditions
    _compile_conditions = Kandinsky5VideoConfig()._compile_conditions
    param_names_mapping = Kandinsky5VideoConfig().param_names_mapping
    reverse_param_names_mapping = Kandinsky5VideoConfig().reverse_param_names_mapping
    lora_param_names_mapping = Kandinsky5VideoConfig().lora_param_names_mapping
    _supported_attention_backends = Kandinsky5VideoConfig()._supported_attention_backends

critical

The class attributes _fsdp_shard_conditions, _compile_conditions, etc., are being initialized by accessing attributes on a default Kandinsky5VideoConfig instance. However, these attributes are defined within the arch_config of the Kandinsky5VideoConfig. This will raise an AttributeError at runtime. The correct way to access them would be through the arch_config attribute.

    arch_config_defaults = Kandinsky5VideoConfig().arch_config
    _fsdp_shard_conditions = arch_config_defaults._fsdp_shard_conditions
    _compile_conditions = arch_config_defaults._compile_conditions
    param_names_mapping = arch_config_defaults.param_names_mapping
    reverse_param_names_mapping = arch_config_defaults.reverse_param_names_mapping
    lora_param_names_mapping = arch_config_defaults.lora_param_names_mapping
    _supported_attention_backends = arch_config_defaults._supported_attention_backends

    def _apply_rotary(x: torch.Tensor, rope: torch.Tensor) -> torch.Tensor:
        x_ = x.reshape(*x.shape[:-1], -1, 1, 2).to(torch.float32)
        x_out = (rope * x_).sum(dim=-1)
        return x_out.reshape(*x.shape).to(torch.bfloat16)

high

The output of _apply_rotary is hardcoded to torch.bfloat16. This can cause type mismatches and precision issues if the model is running with a different precision (e.g., float16 or float32). It should be cast to the original input tensor's dtype to ensure correctness across different precisions.

Suggested change

    - return x_out.reshape(*x.shape).to(torch.bfloat16)
    + return x_out.reshape(*x.shape).to(x.dtype)

            shape: tuple[int, int, int, int],
            block_mask: bool = False):
        if block_mask:
            pixel_size = 8

medium

The magic number 8 for pixel_size is used here and in fractal_unflatten. It would be better to define it as a constant at the module level for clarity and maintainability, e.g., FRACTAL_PIXEL_SIZE = 8.
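A minimal sketch of the suggested refactor (the constant name and the helper function are illustrative, not code from the PR):

```python
# Module-level constant replacing the magic number 8 shared by
# fractal_flatten and fractal_unflatten (name is illustrative).
FRACTAL_PIXEL_SIZE = 8

def blocks_along(dim: int) -> int:
    # Number of pixel-size blocks along one spatial dimension
    # (assumes dim is divisible by FRACTAL_PIXEL_SIZE).
    return dim // FRACTAL_PIXEL_SIZE

n = blocks_along(64)  # 8 blocks for a 64-pixel dimension
```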

Comment on lines +347 to +348
    def forward(self, visual_embed: torch.Tensor, text_embed: torch.Tensor,
                time_embed: torch.Tensor) -> torch.Tensor:

medium

The text_embed parameter is unused in the forward method of Kandinsky5OutLayer. It should be removed to improve code clarity. The call to this method in Kandinsky5Transformer3DModel.forward at line 615 should also be updated accordingly.

    def forward(self, visual_embed: torch.Tensor, time_embed: torch.Tensor) -> torch.Tensor:

    timestep: torch.Tensor,
    encoder_hidden_states_image: torch.Tensor | list[torch.Tensor] | None = None,
    guidance=None,

medium

The guidance parameter is unused in this forward method and should be removed to improve code clarity.

    visual_embed = fractal_unflatten(visual_embed,
                                     visual_shape,
                                     block_mask=to_fractal)
    x = self.out_layer(visual_embed, text_embed, time_embed)

medium

Following the removal of the unused text_embed parameter from Kandinsky5OutLayer.forward, this call should be updated to no longer pass it.

        x = self.out_layer(visual_embed, time_embed)

Collaborator

Please use as many layers from fastvideo/layers as possible, such as ReplicatedLinear, RoPE, and the fused LayerNorm.

Author

Hi,
I have implemented the FastVideo-native version of kandinsky5-lite and made sure to include the following:

  • Migrated core projections to FastVideo's native ReplicatedLinear (time/text/visual embeddings, modulation, attention qkv/out, out layer).
  • Switched attention compute to FastVideo LocalAttention, with a fallback to torch SDPA when no forward context is present (for the standalone parity test path).
  • Kept FastVideo fused norms in the encoder/decoder blocks via LayerNormScaleShift (eps=1e-5), with float residual math to match the official behavior.
  • Kandinsky5FeedForward uses FastVideo MLP(..., bias=False, act_type="gelu").
  • Implemented the native RoPE and fractal flatten/unflatten flow, plus non-persistent buffer materialization.
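The LocalAttention-with-SDPA-fallback pattern described above might look roughly like this (the context check and parameter names are simplified assumptions, not FastVideo's exact API):

```python
import torch
import torch.nn.functional as F

def attention_with_fallback(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                            local_attn=None, forward_ctx=None) -> torch.Tensor:
    # Prefer the FastVideo attention path when a forward context exists;
    # otherwise fall back to plain SDPA so the standalone parity test
    # can run outside the FastVideo runtime. (Sketch only.)
    if forward_ctx is not None and local_attn is not None:
        return local_attn(q, k, v)
    return F.scaled_dot_product_attention(q, k, v)

# Standalone path: no forward context, so the SDPA fallback is taken.
q = k = v = torch.randn(1, 2, 16, 8)  # (batch, heads, seq, head_dim)
out = attention_with_fallback(q, k, v)
```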

@Eigensystem added the go (Trigger Buildkite CI) label on Feb 21, 2026
@Eigensystem
Collaborator

Please resolve the merge conflicts and pre-commit errors.

Eigensystem and others added 20 commits March 5, 2026 20:45
Co-authored-by: Ishan Vaish <vaish.ishan@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Peiyuan Zhang <a1286225768@slurm-h200-204-227.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Peiyuan Zhang <a1286225768@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: RandNMR73 <notomatthew31@gmail.com>
Co-authored-by: JerryZhou54 <zhouw.jerry2017@outlook.com>
Co-authored-by: Davids048 <jundasu@ucsd.edu>
Co-authored-by: Peiyuan Zhang <a1286225768@slurm-h200-204-239.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: SolitaryThinker <wlsaidhi@gmail.com>
Co-authored-by: Will Lin <wlsaidhi@gmail.com>
Co-authored-by: Will Lin <wlsaidhi@gmail.com>


10 participants