-
Notifications
You must be signed in to change notification settings - Fork 3.6k
[do not merge] diffusion: support mutli image input #14236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Co-authored-by: Zyann <[email protected]>
Summary of ChangesHello @yhyang201, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces foundational changes to enable the processing of multiple input images within the diffusion model pipelines. It modifies how image sizes are calculated, images are resized, and how the VAE encoding stage handles multiple image inputs, laying the groundwork for more complex multi-image generation and editing capabilities, such as image-to-image and image-to-video tasks. A new specialized pipeline for Qwen-Image-Edit is also added to leverage these multi-image features. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces support for multiple image inputs in the diffusion pipeline, which is a significant feature enhancement. The changes are mostly well-implemented, refactoring several methods to handle lists of images instead of single images. However, I've identified several issues that need attention. There are multiple incorrect type hints across different files, instances of code duplication that affect maintainability, and a critical bug in QwenImageEditPlusPipelineConfig.prepare_image_processor_kwargs that could lead to a NameError. Additionally, the new qwen-image-edit-2509 model is registered with incorrect sampling parameters. Addressing these points will improve the correctness and robustness of the new functionality.
| def prepare_image_processor_kwargs(self, batch) -> dict: | ||
| prompt = batch.prompt | ||
| prompt_list = [prompt] if isinstance(prompt, str) else prompt | ||
| image_list = batch.condition_image | ||
|
|
||
| prompt_template_encode = ( | ||
| "<|im_start|>system\nDescribe the key features of the input image " | ||
| "(color, shape, size, texture, objects, background), then explain how " | ||
| "the user's text instruction should alter or modify the image. Generate " | ||
| "a new image that meets the user's requirements while maintaining " | ||
| "consistency with the original input where appropriate.<|im_end|>\n" | ||
| "<|im_start|>user\n{}<|im_end|>\n" | ||
| "<|im_start|>assistant\n" | ||
| ) | ||
| img_prompt_template = "Picture {}: <|vision_start|><|image_pad|><|vision_end|>" | ||
| if isinstance(image_list, list): | ||
| base_img_prompt = "" | ||
| for i, img in enumerate(image_list): | ||
| base_img_prompt += img_prompt_template.format(i + 1) | ||
| txt = [prompt_template_encode.format(base_img_prompt + p) for p in prompt_list] | ||
| return dict(text=txt, padding=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a potential NameError in this method. If batch.condition_image (assigned to image_list) is not a list (e.g., a single image object), the if isinstance(image_list, list): block will be skipped. This means base_img_prompt will not be initialized, causing a NameError on the line txt = [prompt_template_encode.format(base_img_prompt + p) for p in prompt_list]. To fix this, you should initialize base_img_prompt before the if block and also handle the case where image_list is a single image.
def prepare_image_processor_kwargs(self, batch) -> dict:
prompt = batch.prompt
prompt_list = [prompt] if isinstance(prompt, str) else prompt
image_list = batch.condition_image
prompt_template_encode = (
"<|im_start|>system\nDescribe the key features of the input image "
"(color, shape, size, texture, objects, background), then explain how "
"the user's text instruction should alter or modify the image. Generate "
"a new image that meets the user's requirements while maintaining "
"consistency with the original input where appropriate.<|im_end|>\n"
"<|im_start|>user\n{}\n<|im_end|>\n"
"<|im_start|>assistant\n"
)
img_prompt_template = "Picture {}: <|vision_start|><|image_pad|><|vision_end|>"
base_img_prompt = ""
if image_list:
if not isinstance(image_list, list):
image_list = [image_list]
for i, img in enumerate(image_list):
base_img_prompt += img_prompt_template.format(i + 1)
txt = [prompt_template_encode.format(base_img_prompt + p) for p in prompt_list]
return dict(text=txt, padding=True)| def calculate_condition_image_size(self, image, width, height) -> tuple[int, int]: | ||
| vae_scale_factor = self.vae_config.arch_config.spatial_compression_ratio | ||
| height, width = get_default_height_width(image, vae_scale_factor, height, width) | ||
| def calculate_condition_image_size(self, images) -> tuple[int, int]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type hint for calculate_condition_image_size is incorrect. The function returns two lists of integers (width and height), but the type hint is tuple[int, int]. It should be tuple[list[int], list[int]] to match the actual return value.
| def calculate_condition_image_size(self, images) -> tuple[int, int]: | |
| def calculate_condition_image_size(self, images) -> tuple[list[int], list[int]]: |
| def calculate_condition_image_size( | ||
| self, image, width, height | ||
| ) -> Optional[tuple[int, int]]: | ||
| def calculate_condition_image_size(self, images) -> Optional[tuple[int, int]]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type hint for calculate_condition_image_size is incorrect. The function now returns two lists of integers, but the hint is Optional[tuple[int, int]]. It should be updated to tuple[list[int], list[int]]. Since the function no longer returns None, the Optional is also unnecessary.
| def calculate_condition_image_size(self, images) -> Optional[tuple[int, int]]: | |
| def calculate_condition_image_size(self, images) -> tuple[list[int], list[int]]: |
| def resize_condition_image(self, images, target_width, target_height): | ||
| new_images = [] | ||
| for image, width, height in zip(images, target_width, target_height): | ||
| new_images.append( | ||
| image.resize((width, height), PIL.Image.Resampling.LANCZOS) | ||
| ) | ||
| return new_images |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def calculate_condition_image_size(self, images) -> tuple[int, int]: | ||
| calculated_widths = [] | ||
| calculated_heights = [] | ||
|
|
||
| for img in images: | ||
| image_width, image_height = img.size | ||
| calculated_width, calculated_height, _ = calculate_dimensions( | ||
| VAE_IMAGE_SIZE, image_width / image_height | ||
| ) | ||
| calculated_widths.append(calculated_width) | ||
| calculated_heights.append(calculated_height) | ||
|
|
||
| return calculated_widths, calculated_heights |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method has two issues:
- The return type hint is
tuple[int, int], but it returns two lists of integers. It should betuple[list[int], list[int]]. - This method is a near-identical copy of the implementation in its parent class
QwenImageEditPipelineConfig. This creates unnecessary code duplication.
Since the logic is the same, this method can be removed from QwenImageEditPlusPipelineConfig to inherit the implementation from the parent class, which would fix both issues and improve maintainability.
Motivation
Add tests tomorrow
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist