Skip to content

Conversation

@kacperlukawski
Copy link
Member

All Submissions:

  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  • Does your submission pass the existing tests?
  • Have you added tests for your feature?
  • Have you installed pre-commit with pip3 install pre-commit and set up hooks with pre-commit install?

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds image mask support for multi-modal late interaction models, specifically implementing the get_image_mask method that generates binary masks to identify which tokens in a processed image sequence correspond to actual image content versus text/special tokens.

Key changes:

  • Introduces get_image_mask method to the base class that raises NotImplementedError by default
  • Implements the method in ColPali to return boolean masks where True indicates image tokens (first 1024 tokens) and False indicates text tokens (last 6 tokens)
  • Adds comprehensive test coverage for the new functionality

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py Adds abstract get_image_mask method with documentation and NotImplementedError default implementation
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py Implements get_image_mask as a delegation wrapper to the underlying model
fastembed/late_interaction_multimodal/colpali.py Implements ColPali-specific get_image_mask that returns boolean masks based on token IDs, and refactors IMAGE_TOKEN_ID constant usage
tests/test_late_interaction_multimodal.py Adds three new test functions covering basic functionality, single image handling, and base class behavior

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@coderabbitai
Copy link

coderabbitai bot commented Dec 15, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds image-mask support across the multimodal late-interaction path. ColPali gains a public constant IMAGE_TOKEN_ID (replacing hardcoded 257152) and an implemented get_image_mask(...) that accepts a single image or iterable and returns boolean numpy masks marking image-token positions derived from the placeholder pattern. The LateInteractionMultimodalEmbedding class now delegates get_image_mask(...) to its model, and LateInteractionMultimodalEmbeddingBase exposes the get_image_mask(...) interface that raises NotImplementedError by default. New tests validate mask shape, dtype, and single/multiple image behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review areas that may need extra attention:
    • ColPali.get_image_mask: batching logic, handling of single vs iterable inputs, and correctness of mask derivation against EMPTY_TEXT_PLACEHOLDER using IMAGE_TOKEN_ID.
    • Consistency and clarity of public API signatures and docstrings in LateInteractionMultimodalEmbedding and LateInteractionMultimodalEmbeddingBase.
    • Tests in tests/test_late_interaction_multimodal.py: CI gating, expected mask length (1030), dtype, and asserted true/false ranges.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'feat: image mask support for multi-modal late interaction models' directly and clearly describes the main feature addition across all modified files.
Description check ✅ Passed The PR description confirms adherence to contributing guidelines, test coverage, and completion of new feature submission requirements, all related to the image mask feature.
Docstring Coverage ✅ Passed Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/image-mask-support

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 06f1829 and c13576a.

📒 Files selected for processing (3)
  • fastembed/late_interaction_multimodal/colpali.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
🧰 Additional context used
🧬 Code graph analysis (2)
fastembed/late_interaction_multimodal/colpali.py (2)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • get_image_mask (187-218)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • get_image_mask (88-122)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2)
fastembed/late_interaction_multimodal/colpali.py (1)
  • get_image_mask (302-333)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • get_image_mask (187-218)
🪛 Ruff (0.14.8)
fastembed/late_interaction_multimodal/colpali.py

305-305: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
🔇 Additional comments (4)
fastembed/late_interaction_multimodal/colpali.py (3)

39-39: LGTM! Good refactoring to use a named constant.

Replacing the magic number with IMAGE_TOKEN_ID improves code maintainability and makes the special token's purpose explicit.


42-42: LGTM! Consistent use of the IMAGE_TOKEN_ID constant.

The EMPTY_TEXT_PLACEHOLDER now correctly references the IMAGE_TOKEN_ID constant, maintaining the same behavior while improving maintainability.


302-333: LGTM! Efficient mask generation using fixed tokenization pattern.

The implementation correctly generates image token masks based on ColPali's fixed tokenization pattern (1024 image tokens followed by 6 text tokens). The method efficiently avoids loading/processing images since all ColPali images use the same mask.

Note: The **kwargs parameter is intentionally unused (reserved for future use and API consistency with the base class), so the static analysis warning can be ignored.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

88-122: LGTM! Well-designed base class interface.

The get_image_mask method provides a clean, well-documented interface for subclasses. The comprehensive docstring, correct example code, and descriptive NotImplementedError message make it easy for implementers to understand the expected behavior.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
fastembed/late_interaction_multimodal/colpali.py (1)

302-341: Simplify the mask generation logic.

The current implementation uses batch processing, but the actual mask generation doesn't depend on processing the images themselves - it simply returns the same static mask (EMPTY_TEXT_PLACEHOLDER == IMAGE_TOKEN_ID) for every image. The batch_size parameter and batch loop add unnecessary complexity without providing any functional benefit.

Apply this diff to simplify the implementation:

 def get_image_mask(
         self,
         images: ImageInput | Iterable[ImageInput],
         batch_size: int = 16,
         **kwargs: Any,
     ) -> list[NumpyArray]:
         """
         Generate image token masks for ColPali embeddings.
 
         For ColPali, image embeddings use 1030 tokens:
         - Tokens 0-1023: Image tokens (token ID 257152)
         - Tokens 1024-1029: Text tokens from prompt "Describe the image.\\n"
 
         Args:
             images: Single image or iterable of images
             batch_size: Batch size for processing
             **kwargs: Additional processing arguments
 
         Returns:
             List of binary masks (dtype=bool) where True = image token (ID 257152), False = other tokens.
         """
         from pathlib import Path
 
         # Ensure images is iterable
         is_single = isinstance(images, (str, bytes, Path)) or hasattr(images, "read")
-        images_to_process: Iterable[ImageInput] = [images] if is_single else images  # type: ignore[list-item]
+        if is_single:
+            images_to_process: Iterable[ImageInput] = [images]  # type: ignore[list-item]
+        else:
+            images_to_process = images
 
-        # Process images in batches to get input_ids
+        # Generate mask for each image
+        # All ColPali images use the same mask pattern
+        static_mask: NumpyArray = self.EMPTY_TEXT_PLACEHOLDER == self.IMAGE_TOKEN_ID
         masks: list[NumpyArray] = []
-        images_list = list(images_to_process)
-        for batch_start in range(0, len(images_list), batch_size):
-            batch = images_list[batch_start : batch_start + batch_size]
-
-            # For ColPali images, input_ids follow EMPTY_TEXT_PLACEHOLDER pattern
-            # Generate mask: True for image tokens (ID 257152), False for others
-            for _ in batch:
-                mask: NumpyArray = self.EMPTY_TEXT_PLACEHOLDER == self.IMAGE_TOKEN_ID
-                masks.append(mask)
+        for _ in images_to_process:
+            masks.append(static_mask.copy())
 
         return masks

Note: The batch_size and **kwargs parameters are retained for API consistency with other methods in the class, even though they're not used in this implementation.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 685fd9b and 986c6ba.

📒 Files selected for processing (4)
  • fastembed/late_interaction_multimodal/colpali.py (2 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1 hunks)
  • fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1 hunks)
  • tests/test_late_interaction_multimodal.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
tests/test_late_interaction_multimodal.py (3)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • get_image_mask (187-220)
fastembed/late_interaction_multimodal/colpali.py (1)
  • get_image_mask (302-341)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • get_image_mask (88-124)
fastembed/late_interaction_multimodal/colpali.py (2)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • get_image_mask (187-220)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • get_image_mask (88-124)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2)
fastembed/late_interaction_multimodal/colpali.py (1)
  • get_image_mask (302-341)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • get_image_mask (187-220)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2)
fastembed/late_interaction_multimodal/colpali.py (1)
  • get_image_mask (302-341)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • get_image_mask (88-124)
🪛 GitHub Actions: type-checkers
fastembed/late_interaction_multimodal/colpali.py

[error] 327-327: mypy: Incompatible types in assignment. Found expression type 'Path | Image | Iterable[str | Path | Image]' to be assigned to variable of type 'Iterable[str | Path | Image]'.

🪛 Ruff (0.14.8)
tests/test_late_interaction_multimodal.py

178-178: Probable insecure usage of temporary file or directory: "/tmp"

(S108)

fastembed/late_interaction_multimodal/colpali.py

306-306: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Agent
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on macos-latest test
🔇 Additional comments (5)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

187-220: LGTM!

The delegation pattern is implemented correctly, and the docstring provides clear guidance on usage and expected behavior.

tests/test_late_interaction_multimodal.py (3)

125-152: LGTM!

The test comprehensively validates mask generation for both single and multiple images, checking shape, dtype, and token positions as expected for ColPali's 1030-token format.


154-166: LGTM!

This test validates the API's flexibility in accepting both single images and iterables, ensuring consistent behavior across input formats.


168-183: LGTM!

The test correctly validates that the base class raises NotImplementedError with an appropriate message. The static analysis warning about /tmp usage is a false positive in this context, as this is a minimal test case where the cache directory is never actually used.

fastembed/late_interaction_multimodal/colpali.py (1)

39-39: LGTM!

Good refactoring to replace the magic number with a named constant, improving code readability and maintainability.

Also applies to: 42-42

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
fastembed/late_interaction_multimodal/colpali.py (1)

323-327: Consider moving Path import to module level.

The Path import on line 323 is evaluated on every method call. While the overhead is negligible, moving it to the module-level imports would be more conventional.

 from typing import Any, Iterable, Sequence, Type
+from pathlib import Path
 
 import numpy as np

Then remove line 323 from the method body.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 986c6ba and 06f1829.

📒 Files selected for processing (1)
  • fastembed/late_interaction_multimodal/colpali.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
fastembed/late_interaction_multimodal/colpali.py (2)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
  • get_image_mask (88-124)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
  • get_image_mask (187-220)
🪛 Ruff (0.14.8)
fastembed/late_interaction_multimodal/colpali.py

306-306: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: Python 3.13.x on macos-latest test
  • GitHub Check: Python 3.10.x on ubuntu-latest test
  • GitHub Check: Python 3.12.x on ubuntu-latest test
  • GitHub Check: Python 3.13.x on windows-latest test
  • GitHub Check: Python 3.13.x on ubuntu-latest test
  • GitHub Check: Python 3.11.x on ubuntu-latest test
🔇 Additional comments (3)
fastembed/late_interaction_multimodal/colpali.py (3)

39-42: Good refactor: Magic number extracted to named constant.

The introduction of IMAGE_TOKEN_ID with a descriptive comment improves code clarity and ensures consistency between EMPTY_TEXT_PLACEHOLDER and the new get_image_mask method.


302-322: Clear method signature and documentation.

The docstring accurately describes the 1030-token structure (1024 image tokens + 6 text tokens). The batch_size parameter maintains API consistency with other methods like embed_image, even though the current implementation doesn't perform heavy per-batch processing. The unused kwargs follows the base class interface pattern for future extensibility.


330-341: Implementation is correct for ColPali's fixed token structure.

The mask generation correctly leverages the fact that all ColPali image embeddings use the same 1030-token pattern (EMPTY_TEXT_PLACEHOLDER). The boolean comparison EMPTY_TEXT_PLACEHOLDER == IMAGE_TOKEN_ID efficiently produces the expected mask where the first 1024 positions are True (image tokens) and the last 6 are False (text tokens).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants