feat: image mask support for multi-modal late interaction models #591

kacperlukawski · 2025-12-15T11:24:29Z

All Submissions:

Have you followed the guidelines in our Contributing document?
Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

Does your submission pass the existing tests?
Have you added tests for your feature?
Have you installed pre-commit with pip3 install pre-commit and set up hooks with pre-commit install?

…Base

Copilot

Pull request overview

This PR adds image mask support for multi-modal late interaction models, specifically implementing the get_image_mask method that generates binary masks to identify which tokens in a processed image sequence correspond to actual image content versus text/special tokens.

Key changes:

Introduces get_image_mask method to the base class that raises NotImplementedError by default
Implements the method in ColPali to return boolean masks where True indicates image tokens (first 1024 tokens) and False indicates text tokens (last 6 tokens)
Adds comprehensive test coverage for the new functionality

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File	Description
`fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py`	Adds abstract `get_image_mask` method with documentation and NotImplementedError default implementation
`fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py`	Implements `get_image_mask` as a delegation wrapper to the underlying model
`fastembed/late_interaction_multimodal/colpali.py`	Implements ColPali-specific `get_image_mask` that returns boolean masks based on token IDs, and refactors IMAGE_TOKEN_ID constant usage
`tests/test_late_interaction_multimodal.py`	Adds three new test functions covering basic functionality, single image handling, and base class behavior

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

fastembed/late_interaction_multimodal/colpali.py

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py

coderabbitai · 2025-12-15T11:28:52Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds image-mask support across the multimodal late-interaction path. ColPali gains a public constant IMAGE_TOKEN_ID (replacing hardcoded 257152) and an implemented get_image_mask(...) that accepts a single image or iterable and returns boolean numpy masks marking image-token positions derived from the placeholder pattern. The LateInteractionMultimodalEmbedding class now delegates get_image_mask(...) to its model, and LateInteractionMultimodalEmbeddingBase exposes the get_image_mask(...) interface that raises NotImplementedError by default. New tests validate mask shape, dtype, and single/multiple image behavior.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Review areas that may need extra attention:
- ColPali.get_image_mask: batching logic, handling of single vs iterable inputs, and correctness of mask derivation against EMPTY_TEXT_PLACEHOLDER using IMAGE_TOKEN_ID.
- Consistency and clarity of public API signatures and docstrings in LateInteractionMultimodalEmbedding and LateInteractionMultimodalEmbeddingBase.
- Tests in tests/test_late_interaction_multimodal.py: CI gating, expected mask length (1030), dtype, and asserted true/false ranges.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The PR title 'feat: image mask support for multi-modal late interaction models' directly and clearly describes the main feature addition across all modified files.
Description check	✅ Passed	The PR description confirms adherence to contributing guidelines, test coverage, and completion of new feature submission requirements, all related to the image mask feature.
Docstring Coverage	✅ Passed	Docstring coverage is 81.82% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feature/image-mask-support

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 06f1829 and c13576a.

📒 Files selected for processing (3)

fastembed/late_interaction_multimodal/colpali.py (2 hunks)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1 hunks)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py

🧰 Additional context used

🧬 Code graph analysis (2)

fastembed/late_interaction_multimodal/colpali.py (2)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

get_image_mask (187-218)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

get_image_mask (88-122)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2)

fastembed/late_interaction_multimodal/colpali.py (1)

get_image_mask (302-333)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

get_image_mask (187-218)

🪛 Ruff (0.14.8)

fastembed/late_interaction_multimodal/colpali.py

305-305: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)

GitHub Check: Python 3.10.x on ubuntu-latest test
GitHub Check: Python 3.12.x on ubuntu-latest test
GitHub Check: Python 3.13.x on ubuntu-latest test
GitHub Check: Python 3.13.x on macos-latest test
GitHub Check: Python 3.13.x on windows-latest test

🔇 Additional comments (4)

fastembed/late_interaction_multimodal/colpali.py (3)

39-39: LGTM! Good refactoring to use a named constant.

Replacing the magic number with IMAGE_TOKEN_ID improves code maintainability and makes the special token's purpose explicit.

42-42: LGTM! Consistent use of the IMAGE_TOKEN_ID constant.

The EMPTY_TEXT_PLACEHOLDER now correctly references the IMAGE_TOKEN_ID constant, maintaining the same behavior while improving maintainability.

302-333: LGTM! Efficient mask generation using fixed tokenization pattern.

The implementation correctly generates image token masks based on ColPali's fixed tokenization pattern (1024 image tokens followed by 6 text tokens). The method efficiently avoids loading/processing images since all ColPali images use the same mask.

Note: The **kwargs parameter is intentionally unused (reserved for future use and API consistency with the base class), so the static analysis warning can be ignored.

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

88-122: LGTM! Well-designed base class interface.

The get_image_mask method provides a clean, well-documented interface for subclasses. The comprehensive docstring, correct example code, and descriptive NotImplementedError message make it easy for implementers to understand the expected behavior.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

fastembed/late_interaction_multimodal/colpali.py (1)

302-341: Simplify the mask generation logic.

The current implementation uses batch processing, but the actual mask generation doesn't depend on processing the images themselves - it simply returns the same static mask (EMPTY_TEXT_PLACEHOLDER == IMAGE_TOKEN_ID) for every image. The batch_size parameter and batch loop add unnecessary complexity without providing any functional benefit.

Apply this diff to simplify the implementation:

 def get_image_mask(
         self,
         images: ImageInput | Iterable[ImageInput],
         batch_size: int = 16,
         **kwargs: Any,
     ) -> list[NumpyArray]:
         """
         Generate image token masks for ColPali embeddings.
 
         For ColPali, image embeddings use 1030 tokens:
         - Tokens 0-1023: Image tokens (token ID 257152)
         - Tokens 1024-1029: Text tokens from prompt "Describe the image.\\n"
 
         Args:
             images: Single image or iterable of images
             batch_size: Batch size for processing
             **kwargs: Additional processing arguments
 
         Returns:
             List of binary masks (dtype=bool) where True = image token (ID 257152), False = other tokens.
         """
         from pathlib import Path
 
         # Ensure images is iterable
         is_single = isinstance(images, (str, bytes, Path)) or hasattr(images, "read")
-        images_to_process: Iterable[ImageInput] = [images] if is_single else images  # type: ignore[list-item]
+        if is_single:
+            images_to_process: Iterable[ImageInput] = [images]  # type: ignore[list-item]
+        else:
+            images_to_process = images
 
-        # Process images in batches to get input_ids
+        # Generate mask for each image
+        # All ColPali images use the same mask pattern
+        static_mask: NumpyArray = self.EMPTY_TEXT_PLACEHOLDER == self.IMAGE_TOKEN_ID
         masks: list[NumpyArray] = []
-        images_list = list(images_to_process)
-        for batch_start in range(0, len(images_list), batch_size):
-            batch = images_list[batch_start : batch_start + batch_size]
-
-            # For ColPali images, input_ids follow EMPTY_TEXT_PLACEHOLDER pattern
-            # Generate mask: True for image tokens (ID 257152), False for others
-            for _ in batch:
-                mask: NumpyArray = self.EMPTY_TEXT_PLACEHOLDER == self.IMAGE_TOKEN_ID
-                masks.append(mask)
+        for _ in images_to_process:
+            masks.append(static_mask.copy())
 
         return masks

Note: The batch_size and **kwargs parameters are retained for API consistency with other methods in the class, even though they're not used in this implementation.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 685fd9b and 986c6ba.

📒 Files selected for processing (4)

fastembed/late_interaction_multimodal/colpali.py (2 hunks)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1 hunks)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1 hunks)
tests/test_late_interaction_multimodal.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (4)

tests/test_late_interaction_multimodal.py (3)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

get_image_mask (187-220)

fastembed/late_interaction_multimodal/colpali.py (1)

get_image_mask (302-341)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

get_image_mask (88-124)

fastembed/late_interaction_multimodal/colpali.py (2)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

get_image_mask (187-220)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

get_image_mask (88-124)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (2)

fastembed/late_interaction_multimodal/colpali.py (1)

get_image_mask (302-341)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

get_image_mask (187-220)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2)

fastembed/late_interaction_multimodal/colpali.py (1)

get_image_mask (302-341)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

get_image_mask (88-124)

🪛 GitHub Actions: type-checkers

fastembed/late_interaction_multimodal/colpali.py

🪛 Ruff (0.14.8)

tests/test_late_interaction_multimodal.py

178-178: Probable insecure usage of temporary file or directory: "/tmp"

(S108)

fastembed/late_interaction_multimodal/colpali.py

306-306: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: Agent
GitHub Check: Python 3.13.x on windows-latest test
GitHub Check: Python 3.13.x on macos-latest test

🔇 Additional comments (5)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

187-220: LGTM!

The delegation pattern is implemented correctly, and the docstring provides clear guidance on usage and expected behavior.

tests/test_late_interaction_multimodal.py (3)

125-152: LGTM!

The test comprehensively validates mask generation for both single and multiple images, checking shape, dtype, and token positions as expected for ColPali's 1030-token format.

154-166: LGTM!

This test validates the API's flexibility in accepting both single images and iterables, ensuring consistent behavior across input formats.

168-183: LGTM!

The test correctly validates that the base class raises NotImplementedError with an appropriate message. The static analysis warning about /tmp usage is a false positive in this context, as this is a minimal test case where the cache directory is never actually used.

fastembed/late_interaction_multimodal/colpali.py (1)

39-39: LGTM!

Good refactoring to replace the magic number with a named constant, improving code readability and maintainability.

Also applies to: 42-42

fastembed/late_interaction_multimodal/colpali.py

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

fastembed/late_interaction_multimodal/colpali.py (1)
323-327: Consider moving Path import to module level.

The Path import on line 323 is evaluated on every method call. While the overhead is negligible, moving it to the module-level imports would be more conventional.
 from typing import Any, Iterable, Sequence, Type
+from pathlib import Path
 
 import numpy as np
Then remove line 323 from the method body.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 986c6ba and 06f1829.

📒 Files selected for processing (1)

fastembed/late_interaction_multimodal/colpali.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

fastembed/late_interaction_multimodal/colpali.py (2)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)

get_image_mask (88-124)

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

get_image_mask (187-220)

🪛 Ruff (0.14.8)

fastembed/late_interaction_multimodal/colpali.py

306-306: Unused method argument: kwargs

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: Python 3.13.x on macos-latest test
GitHub Check: Python 3.10.x on ubuntu-latest test
GitHub Check: Python 3.12.x on ubuntu-latest test
GitHub Check: Python 3.13.x on windows-latest test
GitHub Check: Python 3.13.x on ubuntu-latest test
GitHub Check: Python 3.11.x on ubuntu-latest test

🔇 Additional comments (3)

fastembed/late_interaction_multimodal/colpali.py (3)

39-42: Good refactor: Magic number extracted to named constant.

The introduction of IMAGE_TOKEN_ID with a descriptive comment improves code clarity and ensures consistency between EMPTY_TEXT_PLACEHOLDER and the new get_image_mask method.

302-322: Clear method signature and documentation.

The docstring accurately describes the 1030-token structure (1024 image tokens + 6 text tokens). The batch_size parameter maintains API consistency with other methods like embed_image, even though the current implementation doesn't perform heavy per-batch processing. The unused kwargs follows the base class interface pattern for future extensibility.

330-341: Implementation is correct for ColPali's fixed token structure.

The mask generation correctly leverages the fact that all ColPali image embeddings use the same 1030-token pattern (EMPTY_TEXT_PLACEHOLDER). The boolean comparison EMPTY_TEXT_PLACEHOLDER == IMAGE_TOKEN_ID efficiently produces the expected mask where the first 1024 positions are True (image tokens) and the last 6 are False (text tokens).

kacperlukawski added 3 commits December 15, 2025 12:17

feat: implement get_image_mask for LateInteractionMultimodalEmbedding…

5abd25a

…Base

fix: do not load the model for mask generation in colpali.py

5e0af47

remove Union usage

986c6ba

kacperlukawski requested review from Copilot and joein December 15, 2025 11:24

Copilot started reviewing on behalf of kacperlukawski December 15, 2025 11:24 View session

Copilot AI reviewed Dec 15, 2025

View reviewed changes

fastembed/late_interaction_multimodal/colpali.py Outdated Show resolved Hide resolved

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py Outdated Show resolved Hide resolved

coderabbitai bot reviewed Dec 15, 2025

View reviewed changes

fastembed/late_interaction_multimodal/colpali.py Outdated Show resolved Hide resolved

fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py Show resolved Hide resolved

surpass mypy warnings

06f1829

coderabbitai bot reviewed Dec 15, 2025

View reviewed changes

chore: remove batch_size

c13576a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: image mask support for multi-modal late interaction models #591

feat: image mask support for multi-modal late interaction models #591

Uh oh!

kacperlukawski commented Dec 15, 2025

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot commented Dec 15, 2025 •

edited

Loading

Other AI code review bot(s) detected

Walkthrough

Estimated code review effort

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: image mask support for multi-modal late interaction models #591

Are you sure you want to change the base?

feat: image mask support for multi-modal late interaction models #591

Uh oh!

Conversation

kacperlukawski commented Dec 15, 2025

All Submissions:

New Feature Submissions:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot commented Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai bot commented Dec 15, 2025 •

edited

Loading