Conversation
Summary of Changes

Hello @jaisurya27, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances FastVideo's capabilities by integrating the Kandinsky5 Lite DiT model through a native implementation of its 3D Transformer. This change moves away from external Diffusers wrappers, providing a more streamlined and optimized architecture within the FastVideo ecosystem. The update includes all necessary configuration and registry entries, and its correctness is validated by a comprehensive parity test against the original Diffusers version.

Highlights
Activity
Code Review
This pull request introduces a native FastVideo implementation for the Kandinsky 5 Lite DiT model, including its configuration, registry wiring, and a parity test against the Diffusers implementation. The overall structure is good, but I've identified a critical issue in the model's class attribute initialization that would cause an AttributeError. Additionally, there are a few other points for improvement regarding hardcoded data types, unused function parameters, and magic numbers that would enhance the code's robustness and maintainability.
```python
_fsdp_shard_conditions = Kandinsky5VideoConfig()._fsdp_shard_conditions
_compile_conditions = Kandinsky5VideoConfig()._compile_conditions
param_names_mapping = Kandinsky5VideoConfig().param_names_mapping
reverse_param_names_mapping = Kandinsky5VideoConfig(
).reverse_param_names_mapping
lora_param_names_mapping = Kandinsky5VideoConfig().lora_param_names_mapping
_supported_attention_backends = Kandinsky5VideoConfig(
)._supported_attention_backends
```
The class attributes _fsdp_shard_conditions, _compile_conditions, etc., are being initialized by accessing attributes on a default Kandinsky5VideoConfig instance. However, these attributes are defined within the arch_config of the Kandinsky5VideoConfig. This will raise an AttributeError at runtime. The correct way to access them would be through the arch_config attribute.
Suggested fix:

```python
arch_config_defaults = Kandinsky5VideoConfig().arch_config
_fsdp_shard_conditions = arch_config_defaults._fsdp_shard_conditions
_compile_conditions = arch_config_defaults._compile_conditions
param_names_mapping = arch_config_defaults.param_names_mapping
reverse_param_names_mapping = arch_config_defaults.reverse_param_names_mapping
lora_param_names_mapping = arch_config_defaults.lora_param_names_mapping
_supported_attention_backends = arch_config_defaults._supported_attention_backends
```

```python
def _apply_rotary(x: torch.Tensor, rope: torch.Tensor) -> torch.Tensor:
    x_ = x.reshape(*x.shape[:-1], -1, 1, 2).to(torch.float32)
    x_out = (rope * x_).sum(dim=-1)
    return x_out.reshape(*x.shape).to(torch.bfloat16)
```
The output of _apply_rotary is hardcoded to torch.bfloat16. This can cause type mismatches and precision issues if the model is running with a different precision (e.g., float16 or float32). It should be cast to the original input tensor's dtype to ensure correctness across different precisions.
```diff
- return x_out.reshape(*x.shape).to(torch.bfloat16)
+ return x_out.reshape(*x.shape).to(x.dtype)
```
```python
        shape: tuple[int, int, int, int],
        block_mask: bool = False):
    if block_mask:
        pixel_size = 8
```
```python
def forward(self, visual_embed: torch.Tensor, text_embed: torch.Tensor,
            time_embed: torch.Tensor) -> torch.Tensor:
```
The text_embed parameter is unused in the forward method of Kandinsky5OutLayer. It should be removed to improve code clarity. The call to this method in Kandinsky5Transformer3DModel.forward at line 615 should also be updated accordingly.
Suggested signature:

```python
def forward(self, visual_embed: torch.Tensor,
            time_embed: torch.Tensor) -> torch.Tensor:
```

```python
        timestep: torch.Tensor,
        encoder_hidden_states_image: torch.Tensor | list[torch.Tensor]
        | None = None,
        guidance=None,
```
```python
visual_embed = fractal_unflatten(visual_embed,
                                 visual_shape,
                                 block_mask=to_fractal)
x = self.out_layer(visual_embed, text_embed, time_embed)
```
Please use as many layers from fastvideo/layers as possible, e.g. ReplicatedLinear, RoPE, and the fused LayerNorm.
Hi,
I have implemented the FastVideo-native version of kandinsky5-lite and made sure to include the following:
- Migrated the core projections to FastVideo's native ReplicatedLinear (time/text/visual embeddings, modulation, attention qkv/out, and the out layer)
- Switched the attention compute to FastVideo LocalAttention, with a fallback to torch SDPA when no forward context is present (for the standalone parity test path)
- Kept FastVideo fused norms in the encoder/decoder blocks via LayerNormScaleShift (eps=1e-5), with float residual math to match the official behavior
- Kandinsky5FeedForward uses FastVideo MLP(..., bias=False, act_type="gelu")
- Implemented the native RoPE + fractal flatten/unflatten flow and non-persistent buffer materialization
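The SDPA fallback mentioned in the second point can be sketched with plain PyTorch. This is an illustrative standalone helper, not the actual FastVideo code; the `fastvideo_attn` parameter stands in for a configured LocalAttention layer:

```python
import torch
import torch.nn.functional as F

def attention_with_fallback(q, k, v, fastvideo_attn=None):
    # Use the FastVideo attention layer when the forward context is set up;
    # otherwise fall back to torch's scaled_dot_product_attention so a
    # standalone parity test can run outside the FastVideo runtime.
    if fastvideo_attn is not None:
        return fastvideo_attn(q, k, v)
    return F.scaled_dot_product_attention(q, k, v)

# Fallback path: no FastVideo runtime, plain SDPA is used.
q = k = v = torch.randn(1, 4, 16, 32)  # (batch, heads, seq, head_dim)
out = attention_with_fallback(q, k, v)
assert out.shape == q.shape
```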
Please resolve the merge conflicts and pre-commit errors.
Co-authored-by: Ishan Vaish <vaish.ishan@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…on-cuda platforms (hao-ai-lab#1107)
Co-authored-by: Peiyuan Zhang <a1286225768@slurm-h200-204-227.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: Peiyuan Zhang <a1286225768@slurm-login-0.slurm-login.tenant-slurm.svc.cluster.local>
Co-authored-by: RandNMR73 <notomatthew31@gmail.com>
Co-authored-by: JerryZhou54 <zhouw.jerry2017@outlook.com>
Co-authored-by: Davids048 <jundasu@ucsd.edu>
Co-authored-by: Peiyuan Zhang <a1286225768@slurm-h200-204-239.slurm-compute.tenant-slurm.svc.cluster.local>
Co-authored-by: SolitaryThinker <wlsaidhi@gmail.com>
Co-authored-by: Will Lin <wlsaidhi@gmail.com>
- Adds a native FastVideo implementation of Kandinsky5Transformer3DModel (no Diffusers wrapper).
- Adds the Kandinsky5 Lite DiT config + parameter mapping and registry wiring.
- Adds a local parity test against the Diffusers checkpoint.