-
Notifications
You must be signed in to change notification settings - Fork 168
feat: image mask support for multi-modal late interaction models #591
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds image mask support for multi-modal late interaction models, specifically implementing the get_image_mask method that generates binary masks to identify which tokens in a processed image sequence correspond to actual image content versus text/special tokens.
Key changes:
- Introduces
get_image_maskmethod to the base class that raisesNotImplementedErrorby default - Implements the method in ColPali to return boolean masks where True indicates image tokens (first 1024 tokens) and False indicates text tokens (last 6 tokens)
- Adds comprehensive test coverage for the new functionality
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py |
Adds abstract get_image_mask method with documentation and NotImplementedError default implementation |
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py |
Implements get_image_mask as a delegation wrapper to the underlying model |
fastembed/late_interaction_multimodal/colpali.py |
Implements ColPali-specific get_image_mask that returns boolean masks based on token IDs, and refactors IMAGE_TOKEN_ID constant usage |
tests/test_late_interaction_multimodal.py |
Adds three new test functions covering basic functionality, single image handling, and base class behavior |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py
Outdated
Show resolved
Hide resolved
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. 📝 WalkthroughWalkthroughAdds image-mask support across the multimodal late-interaction path. ColPali gains a public constant Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)
🧰 Additional context used🧬 Code graph analysis (2)fastembed/late_interaction_multimodal/colpali.py (2)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2)
🪛 Ruff (0.14.8)fastembed/late_interaction_multimodal/colpali.py305-305: Unused method argument: (ARG002) ⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
🔇 Additional comments (4)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
fastembed/late_interaction_multimodal/colpali.py (1)
302-341: Simplify the mask generation logic.The current implementation uses batch processing, but the actual mask generation doesn't depend on processing the images themselves - it simply returns the same static mask (
EMPTY_TEXT_PLACEHOLDER == IMAGE_TOKEN_ID) for every image. Thebatch_sizeparameter and batch loop add unnecessary complexity without providing any functional benefit.Apply this diff to simplify the implementation:
def get_image_mask( self, images: ImageInput | Iterable[ImageInput], batch_size: int = 16, **kwargs: Any, ) -> list[NumpyArray]: """ Generate image token masks for ColPali embeddings. For ColPali, image embeddings use 1030 tokens: - Tokens 0-1023: Image tokens (token ID 257152) - Tokens 1024-1029: Text tokens from prompt "Describe the image.\\n" Args: images: Single image or iterable of images batch_size: Batch size for processing **kwargs: Additional processing arguments Returns: List of binary masks (dtype=bool) where True = image token (ID 257152), False = other tokens. """ from pathlib import Path # Ensure images is iterable is_single = isinstance(images, (str, bytes, Path)) or hasattr(images, "read") - images_to_process: Iterable[ImageInput] = [images] if is_single else images # type: ignore[list-item] + if is_single: + images_to_process: Iterable[ImageInput] = [images] # type: ignore[list-item] + else: + images_to_process = images - # Process images in batches to get input_ids + # Generate mask for each image + # All ColPali images use the same mask pattern + static_mask: NumpyArray = self.EMPTY_TEXT_PLACEHOLDER == self.IMAGE_TOKEN_ID masks: list[NumpyArray] = [] - images_list = list(images_to_process) - for batch_start in range(0, len(images_list), batch_size): - batch = images_list[batch_start : batch_start + batch_size] - - # For ColPali images, input_ids follow EMPTY_TEXT_PLACEHOLDER pattern - # Generate mask: True for image tokens (ID 257152), False for others - for _ in batch: - mask: NumpyArray = self.EMPTY_TEXT_PLACEHOLDER == self.IMAGE_TOKEN_ID - masks.append(mask) + for _ in images_to_process: + masks.append(static_mask.copy()) return masksNote: The
batch_sizeand**kwargsparameters are retained for API consistency with other methods in the class, even though they're not used in this implementation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
fastembed/late_interaction_multimodal/colpali.py(2 hunks)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py(1 hunks)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py(1 hunks)tests/test_late_interaction_multimodal.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/test_late_interaction_multimodal.py (3)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
get_image_mask(187-220)fastembed/late_interaction_multimodal/colpali.py (1)
get_image_mask(302-341)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
get_image_mask(88-124)
fastembed/late_interaction_multimodal/colpali.py (2)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
get_image_mask(187-220)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
get_image_mask(88-124)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2)
fastembed/late_interaction_multimodal/colpali.py (1)
get_image_mask(302-341)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
get_image_mask(187-220)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2)
fastembed/late_interaction_multimodal/colpali.py (1)
get_image_mask(302-341)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
get_image_mask(88-124)
🪛 GitHub Actions: type-checkers
fastembed/late_interaction_multimodal/colpali.py
[error] 327-327: mypy: Incompatible types in assignment. Found expression type 'Path | Image | Iterable[str | Path | Image]' to be assigned to variable of type 'Iterable[str | Path | Image]'.
🪛 Ruff (0.14.8)
tests/test_late_interaction_multimodal.py
178-178: Probable insecure usage of temporary file or directory: "/tmp"
(S108)
fastembed/late_interaction_multimodal/colpali.py
306-306: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Agent
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.13.x on macos-latest test
🔇 Additional comments (5)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
187-220: LGTM!The delegation pattern is implemented correctly, and the docstring provides clear guidance on usage and expected behavior.
tests/test_late_interaction_multimodal.py (3)
125-152: LGTM!The test comprehensively validates mask generation for both single and multiple images, checking shape, dtype, and token positions as expected for ColPali's 1030-token format.
154-166: LGTM!This test validates the API's flexibility in accepting both single images and iterables, ensuring consistent behavior across input formats.
168-183: LGTM!The test correctly validates that the base class raises
NotImplementedErrorwith an appropriate message. The static analysis warning about/tmpusage is a false positive in this context, as this is a minimal test case where the cache directory is never actually used.fastembed/late_interaction_multimodal/colpali.py (1)
39-39: LGTM!Good refactoring to replace the magic number with a named constant, improving code readability and maintainability.
Also applies to: 42-42
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
fastembed/late_interaction_multimodal/colpali.py (1)
323-327: Consider movingPathimport to module level.The
Pathimport on line 323 is evaluated on every method call. While the overhead is negligible, moving it to the module-level imports would be more conventional.from typing import Any, Iterable, Sequence, Type +from pathlib import Path import numpy as npThen remove line 323 from the method body.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fastembed/late_interaction_multimodal/colpali.py(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/late_interaction_multimodal/colpali.py (2)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
get_image_mask(88-124)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
get_image_mask(187-220)
🪛 Ruff (0.14.8)
fastembed/late_interaction_multimodal/colpali.py
306-306: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
🔇 Additional comments (3)
fastembed/late_interaction_multimodal/colpali.py (3)
39-42: Good refactor: Magic number extracted to named constant.The introduction of
IMAGE_TOKEN_IDwith a descriptive comment improves code clarity and ensures consistency betweenEMPTY_TEXT_PLACEHOLDERand the newget_image_maskmethod.
302-322: Clear method signature and documentation.The docstring accurately describes the 1030-token structure (1024 image tokens + 6 text tokens). The
batch_sizeparameter maintains API consistency with other methods likeembed_image, even though the current implementation doesn't perform heavy per-batch processing. The unusedkwargsfollows the base class interface pattern for future extensibility.
330-341: Implementation is correct for ColPali's fixed token structure.The mask generation correctly leverages the fact that all ColPali image embeddings use the same 1030-token pattern (
EMPTY_TEXT_PLACEHOLDER). The boolean comparisonEMPTY_TEXT_PLACEHOLDER == IMAGE_TOKEN_IDefficiently produces the expected mask where the first 1024 positions areTrue(image tokens) and the last 6 areFalse(text tokens).
All Submissions:
New Feature Submissions:
pre-commitwithpip3 install pre-commitand set up hooks withpre-commit install?