[do not merge] diffusion: support mutli image input #14236

yhyang201 · 2025-12-01T17:12:32Z

Motivation

Add tests tomorrow

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.
Work with maintainers to merge your PR. See the PR Merge Process

Co-authored-by: Zyann <[email protected]>

gemini-code-assist · 2025-12-01T17:13:07Z

Summary of Changes

Hello @yhyang201, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces foundational changes to enable the processing of multiple input images within the diffusion model pipelines. It modifies how image sizes are calculated, images are resized, and how the VAE encoding stage handles multiple image inputs, laying the groundwork for more complex multi-image generation and editing capabilities, such as image-to-image and image-to-video tasks. A new specialized pipeline for Qwen-Image-Edit is also added to leverage these multi-image features.

Highlights

Multi-Image Input Support: Core pipeline configurations (base.py, flux.py, qwen_image.py) are updated to handle lists of images for size calculation and resizing, enabling multi-image input for diffusion models.
New Pipeline for Qwen-Image-Edit: A QwenImageEditPlusPipelineConfig and corresponding sampling parameters are introduced, specifically tailored for advanced multi-image editing tasks within the Qwen-Image-Edit framework.
API and Internal Data Structure Updates: The OpenAI-compatible image API now supports uploading multiple images, and the internal Req object and ImageVAEEncodingStage are modified to correctly process and store multiple image inputs and their latents.
Image Preprocessing Integration: A new preprocess_image method is added to pipeline configurations and integrated into the image encoding stage, allowing for custom preprocessing logic for input images.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces support for multiple image inputs in the diffusion pipeline, which is a significant feature enhancement. The changes are mostly well-implemented, refactoring several methods to handle lists of images instead of single images. However, I've identified several issues that need attention. There are multiple incorrect type hints across different files, instances of code duplication that affect maintainability, and a critical bug in QwenImageEditPlusPipelineConfig.prepare_image_processor_kwargs that could lead to a NameError. Additionally, the new qwen-image-edit-2509 model is registered with incorrect sampling parameters. Addressing these points will improve the correctness and robustness of the new functionality.

gemini-code-assist · 2025-12-01T17:15:43Z

python/sglang/multimodal_gen/configs/pipeline_configs/qwen_image.py

+    def prepare_image_processor_kwargs(self, batch) -> dict:
+        prompt = batch.prompt
+        prompt_list = [prompt] if isinstance(prompt, str) else prompt
+        image_list = batch.condition_image
+
+        prompt_template_encode = (
+            "<|im_start|>system\nDescribe the key features of the input image "
+            "(color, shape, size, texture, objects, background), then explain how "
+            "the user's text instruction should alter or modify the image. Generate "
+            "a new image that meets the user's requirements while maintaining "
+            "consistency with the original input where appropriate.<|im_end|>\n"
+            "<|im_start|>user\n{}<|im_end|>\n"
+            "<|im_start|>assistant\n"
+        )
+        img_prompt_template = "Picture {}: <|vision_start|><|image_pad|><|vision_end|>"
+        if isinstance(image_list, list):
+            base_img_prompt = ""
+            for i, img in enumerate(image_list):
+                base_img_prompt += img_prompt_template.format(i + 1)
+        txt = [prompt_template_encode.format(base_img_prompt + p) for p in prompt_list]
+        return dict(text=txt, padding=True)


There is a potential NameError in this method. If batch.condition_image (assigned to image_list) is not a list (e.g., a single image object), the if isinstance(image_list, list): block will be skipped. This means base_img_prompt will not be initialized, causing a NameError on the line txt = [prompt_template_encode.format(base_img_prompt + p) for p in prompt_list]. To fix this, you should initialize base_img_prompt before the if block and also handle the case where image_list is a single image.

def prepare_image_processor_kwargs(self, batch) -> dict: prompt = batch.prompt prompt_list = [prompt] if isinstance(prompt, str) else prompt image_list = batch.condition_image prompt_template_encode = ( "<|im_start|>system\nDescribe the key features of the input image " "(color, shape, size, texture, objects, background), then explain how " "the user's text instruction should alter or modify the image. Generate " "a new image that meets the user's requirements while maintaining " "consistency with the original input where appropriate.<|im_end|>\n" "<|im_start|>user\n{}\n<|im_end|>\n" "<|im_start|>assistant\n" ) img_prompt_template = "Picture {}: <|vision_start|><|image_pad|><|vision_end|>" base_img_prompt = "" if image_list: if not isinstance(image_list, list): image_list = [image_list] for i, img in enumerate(image_list): base_img_prompt += img_prompt_template.format(i + 1) txt = [prompt_template_encode.format(base_img_prompt + p) for p in prompt_list] return dict(text=txt, padding=True)

python/sglang/multimodal_gen/registry.py

gemini-code-assist · 2025-12-01T17:15:44Z

python/sglang/multimodal_gen/configs/pipeline_configs/base.py

-    def calculate_condition_image_size(self, image, width, height) -> tuple[int, int]:
-        vae_scale_factor = self.vae_config.arch_config.spatial_compression_ratio
-        height, width = get_default_height_width(image, vae_scale_factor, height, width)
+    def calculate_condition_image_size(self, images) -> tuple[int, int]:


The return type hint for calculate_condition_image_size is incorrect. The function returns two lists of integers (width and height), but the type hint is tuple[int, int]. It should be tuple[list[int], list[int]] to match the actual return value.

Suggested change

def calculate_condition_image_size(self, images) -> tuple[int, int]:

def calculate_condition_image_size(self, images) -> tuple[list[int], list[int]]:

gemini-code-assist · 2025-12-01T17:15:44Z

python/sglang/multimodal_gen/configs/pipeline_configs/flux.py

-    def calculate_condition_image_size(
-        self, image, width, height
-    ) -> Optional[tuple[int, int]]:
+    def calculate_condition_image_size(self, images) -> Optional[tuple[int, int]]:


The return type hint for calculate_condition_image_size is incorrect. The function now returns two lists of integers, but the hint is Optional[tuple[int, int]]. It should be updated to tuple[list[int], list[int]]. Since the function no longer returns None, the Optional is also unnecessary.

Suggested change

def calculate_condition_image_size(self, images) -> Optional[tuple[int, int]]:

def calculate_condition_image_size(self, images) -> tuple[list[int], list[int]]:

gemini-code-assist · 2025-12-01T17:15:44Z

python/sglang/multimodal_gen/configs/pipeline_configs/flux.py

+    def resize_condition_image(self, images, target_width, target_height):
+        new_images = []
+        for image, width, height in zip(images, target_width, target_height):
+            new_images.append(
+                image.resize((width, height), PIL.Image.Resampling.LANCZOS)
+            )
+        return new_images


This resize_condition_image method is an exact copy of the implementation in the base class PipelineConfig. This override is redundant and can be removed to avoid code duplication and improve maintainability.

python/sglang/multimodal_gen/configs/pipeline_configs/qwen_image.py

gemini-code-assist · 2025-12-01T17:15:45Z

python/sglang/multimodal_gen/configs/pipeline_configs/qwen_image.py

+    def calculate_condition_image_size(self, images) -> tuple[int, int]:
+        calculated_widths = []
+        calculated_heights = []
+
+        for img in images:
+            image_width, image_height = img.size
+            calculated_width, calculated_height, _ = calculate_dimensions(
+                VAE_IMAGE_SIZE, image_width / image_height
+            )
+            calculated_widths.append(calculated_width)
+            calculated_heights.append(calculated_height)
+
+        return calculated_widths, calculated_heights


This method has two issues:

The return type hint is tuple[int, int], but it returns two lists of integers. It should be tuple[list[int], list[int]].

This method is a near-identical copy of the implementation in its parent class QwenImageEditPipelineConfig. This creates unnecessary code duplication.

Since the logic is the same, this method can be removed from QwenImageEditPlusPipelineConfig to inherit the implementation from the parent class, which would fix both issues and improve maintainability.

support mutli image input

9a61f60

Co-authored-by: Zyann <[email protected]>

yhyang201 requested a review from mickqian as a code owner December 1, 2025 17:12

github-actions bot added the diffusion SGLang Diffusion label Dec 1, 2025

gemini-code-assist bot reviewed Dec 1, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[do not merge] diffusion: support mutli image input #14236

[do not merge] diffusion: support mutli image input #14236

Uh oh!

yhyang201 commented Dec 1, 2025

Uh oh!

gemini-code-assist bot commented Dec 1, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Dec 1, 2025

Uh oh!

Uh oh!

gemini-code-assist bot Dec 1, 2025

Uh oh!

gemini-code-assist bot Dec 1, 2025

Uh oh!

gemini-code-assist bot Dec 1, 2025

Uh oh!

Uh oh!

gemini-code-assist bot Dec 1, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	def calculate_condition_image_size(self, images) -> tuple[int, int]:
	def calculate_condition_image_size(self, images) -> tuple[list[int], list[int]]:

[do not merge] diffusion: support mutli image input #14236

Are you sure you want to change the base?

[do not merge] diffusion: support mutli image input #14236

Uh oh!

Conversation

yhyang201 commented Dec 1, 2025

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Uh oh!

gemini-code-assist bot commented Dec 1, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Dec 1, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gemini-code-assist bot Dec 1, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 1, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Dec 1, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

gemini-code-assist bot Dec 1, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant