Model: ModernVBERT/colmodernvbert #588
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
bf215e1 to 2f0e85a
Pull request overview
This PR adds support for the ColModernVBERT model (Qdrant/colmodernvbert), a late-interaction multimodal embedding model that uses bidirectional attention for retrieval tasks. The implementation extends the existing late-interaction multimodal embedding infrastructure with support for image splitting and dynamic patch handling.
Key Changes:
- Introduced `ColModernVBERT` class with support for image splitting and patch-based processing
- Added new image transformation operators (`ImageSplitter`, `ResizeForVisionEncoder`, `SquareResize`, `ResizeLongestEdge`) to handle Idefics3-style image preprocessing
- Enhanced the ONNX multimodal model to handle nested image structures with attention masks and metadata
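The grid-based image splitting these operators implement can be sketched in a few lines. This is a hedged illustration of Idefics3-style splitting, not the PR's actual `ImageSplitter`; the function name and the ceil-division grid layout are assumptions.

```python
import numpy as np

def split_into_patches(img: np.ndarray, max_size: int) -> list[np.ndarray]:
    """Split an HWC image into a grid of patches at most max_size on a side,
    clamping edge patches to the image bounds."""
    h, w = img.shape[:2]
    rows = -(-h // max_size)  # ceil division
    cols = -(-w // max_size)
    patches = []
    for r in range(rows):
        for c in range(cols):
            y0, x0 = r * max_size, c * max_size
            # Edge patches are clamped, so they may be smaller than max_size
            y1, x1 = min(y0 + max_size, h), min(x0 + max_size, w)
            patches.append(img[y0:y1, x0:x1])
    return patches

patches = split_into_patches(np.zeros((700, 900, 3)), 512)  # yields a 2x2 grid
```

Boundary patches come out smaller than interior ones, which downstream code then pads back to a uniform batch shape.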
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `fastembed/late_interaction_multimodal/colmodernvbert.py` | New model implementation with text/image embedding support, patch computation logic, and worker classes |
| `fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py` | Added ColModernVBERT to the embeddings registry |
| `fastembed/late_interaction_multimodal/onnx_multimodal_model.py` | Updated image embedding to support nested structures with patch-level attention masks |
| `fastembed/image/transform/operators.py` | Added transforms for longest edge resizing, vision encoder resizing, image splitting, and square resizing |
| `fastembed/image/transform/functional.py` | Added utility functions for resizing longest edge, cropping ndarrays, and resizing ndarrays |
| `fastembed/common/preprocessor_utils.py` | Added conditional check to avoid re-enabling padding if already configured |
| `fastembed/common/onnx_model.py` | Added optional metadata field to OnnxOutputContext |
| `tests/test_late_interaction_multimodal.py` | Added canonical values and test cases for the new model |
```
Post-process the ONNX model output to convert it into a usable format.

Args:
    output (OnnxOutputContext): The raw output from the ONNX model.

Returns:
    Iterable[NumpyArray]: Post-processed output as NumPy arrays.
```
Copilot AI (Dec 8, 2025)
The docstring describes post-processing output, but this method actually preprocesses text input. The docstring should describe what the method does: adding empty image placeholders to text input for the ONNX model. The Args should describe onnx_input (not output), and the Returns should describe the updated onnx_input dict.
```diff
-    Post-process the ONNX model output to convert it into a usable format.
+    Preprocesses the text input for the ONNX model by adding empty image placeholders.

     Args:
-        output (OnnxOutputContext): The raw output from the ONNX model.
+        onnx_input (dict[str, NumpyArray]): The text input dictionary for the ONNX model.

     Returns:
-        Iterable[NumpyArray]: Post-processed output as NumPy arrays.
+        dict[str, NumpyArray]: The updated input dictionary with empty image placeholders added.
```
📝 Walkthrough

Adds ColModernVBERT, a late-interaction multimodal ONNX embedding model, and registers it. Extends ONNX image handling to support patch-aware 4D/5D inputs and to return an OnnxOutputContext containing model_output, attention_mask, and metadata. Adds numpy/PIL image utilities (resize_longest_edge, crop_ndarray, resize_ndarray) and new image transforms (ResizeLongestEdge, ResizeForVisionEncoder, ImageSplitter, SquareResize); updates Normalize/Rescale to accept nested patch lists. Makes tokenizer padding conditional. Adds tests and a model-caching fixture for the new model. Also adds an optional metadata field to OnnxOutputContext.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
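The "optional metadata field on OnnxOutputContext" change the walkthrough mentions can be illustrated with a plain dataclass. This is a sketch mirroring the description above; the exact field types in fastembed's `OnnxOutputContext` may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class OnnxOutputContext:
    model_output: Any
    attention_mask: Optional[Any] = None
    # New optional field; the default keeps existing call sites working unchanged.
    metadata: Optional[dict] = None

ctx = OnnxOutputContext(model_output=[0.1, 0.2])  # old-style construction still valid
ctx_meta = OnnxOutputContext(model_output=[], metadata={"patch_counts": [4, 1]})
```

Because the field defaults to `None`, the extension is backward compatible: only patch-aware code paths need to populate or read it.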
Actionable comments posted: 4
🧹 Nitpick comments (5)
fastembed/image/transform/functional.py (1)
**194-223: Verify float input range assumption.**

The function assumes float arrays are normalized to [0, 1] range (line 209: `img_hwc * 255`). Values outside this range will clip or overflow. Also, float64 inputs are converted to float32 on output (line 212).

If this is intentional for the pipeline (where rescale happens later), consider adding a docstring note about the expected input range.
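The reviewer's point can be demonstrated with a small stand-in for the conversion step; `to_uint8` is a hypothetical helper, not the PR's function, and the explicit `np.clip` is one way to make the [0, 1] assumption safe.

```python
import numpy as np

def to_uint8(img_hwc: np.ndarray) -> np.ndarray:
    """Convert an image array to uint8, assuming floats are in [0, 1]."""
    if np.issubdtype(img_hwc.dtype, np.floating):
        # Out-of-range floats are clipped here; without the clip they
        # would wrap around when cast to uint8.
        return np.clip(img_hwc * 255, 0, 255).astype(np.uint8)
    return img_hwc.astype(np.uint8)

ok = to_uint8(np.array([[0.0, 0.5, 1.0]]))
bad = to_uint8(np.array([[1.5, -0.2]]))  # clipped to the valid range
```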
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)

**16-19: Consider annotating with `ClassVar` for type safety.**

The `EMBEDDINGS_REGISTRY` is a mutable class attribute. Annotating it with `ClassVar` makes the intent explicit and prevents accidental instance-level assignment.

```diff
-from typing import Any, Iterable, Optional, Sequence, Type, Union
+from typing import Any, ClassVar, Iterable, Optional, Sequence, Type, Union

-    EMBEDDINGS_REGISTRY: list[Type[LateInteractionMultimodalEmbeddingBase]] = [
+    EMBEDDINGS_REGISTRY: ClassVar[list[Type[LateInteractionMultimodalEmbeddingBase]]] = [
         ColPali,
         ColModernVBERT,
     ]
```

fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)

**173-177: Unused context manager.**

The `ExitStack` is created but never used to manage any resources. If this was intended for managing image file handles, consider either implementing that pattern or removing the unused context manager.

fastembed/late_interaction_multimodal/colmodernvbert.py (2)

**136-153: Consider adding error handling for missing config files.**

If any of the config files (`processor_config.json`, `preprocessor_config.json`, `config.json`) are missing, the error won't clearly indicate which model configuration is incomplete. Consider wrapping in try-except with a clearer message.

```diff
 # Load image processing configuration
 processor_config_path = self._model_dir / "processor_config.json"
-with open(processor_config_path) as f:
-    processor_config = json.load(f)
-self.image_seq_len = processor_config.get("image_seq_len", 64)
+try:
+    with open(processor_config_path) as f:
+        processor_config = json.load(f)
+    self.image_seq_len = processor_config.get("image_seq_len", 64)
+except FileNotFoundError:
+    raise FileNotFoundError(
+        f"processor_config.json not found at {processor_config_path}. "
+        "Ensure the model was downloaded correctly."
+    )
```

**167-172: Add guard for `self.image_size` being `None`.**

`self.image_size` is `Optional[int]` and initialized to `None`. While `load_onnx_model` should be called before this method, adding an assertion would provide a clearer error message and satisfy type checking.

```diff
 def _preprocess_onnx_text_input(
     self, onnx_input: dict[str, NumpyArray], **kwargs: Any
 ) -> dict[str, NumpyArray]:
-    """
-    Post-process the ONNX model output to convert it into a usable format.
-
-    Args:
-        output (OnnxOutputContext): The raw output from the ONNX model.
-
-    Returns:
-        Iterable[NumpyArray]: Post-processed output as NumPy arrays.
-    """
+    """Preprocess text input by adding empty image placeholders."""
+    assert self.image_size is not None, "image_size not initialized. Call load_onnx_model first."
     batch_size, seq_length = onnx_input["input_ids"].shape
     empty_image_placeholder: NumpyArray = np.zeros(
-        (batch_size, seq_length, 3, self.image_size, self.image_size), dtype=np.float32  # type: ignore[type-var,arg-type,assignment]
+        (batch_size, seq_length, 3, self.image_size, self.image_size), dtype=np.float32
     )
     onnx_input["pixel_values"] = empty_image_placeholder
     return onnx_input
```

Note: the docstring also appears to be copy-pasted from a post-process method; updated it to match the actual purpose.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- fastembed/common/onnx_model.py (1 hunks)
- fastembed/common/preprocessor_utils.py (1 hunks)
- fastembed/image/transform/functional.py (1 hunks)
- fastembed/image/transform/operators.py (9 hunks)
- fastembed/late_interaction_multimodal/colmodernvbert.py (1 hunks)
- fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2 hunks)
- fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1 hunks)
- tests/test_late_interaction_multimodal.py (4 hunks)
🧰 Additional context used
🪛 Ruff (0.14.7)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
16-19: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
fastembed/late_interaction_multimodal/colmodernvbert.py
156-156: Unused method argument: kwargs
(ARG002)
189-189: Unused method argument: kwargs
(ARG002)
194-194: Unused method argument: kwargs
(ARG002)
fastembed/image/transform/operators.py
390-392: Avoid specifying long messages outside the exception class
(TRY003)
429-429: Avoid specifying long messages outside the exception class
(TRY003)
432-432: Unused static method argument: config
(ARG004)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Agent
🔇 Additional comments (18)
fastembed/common/preprocessor_utils.py (1)
**53-56: LGTM! Conditional padding avoids overwriting existing configuration.**

The guard prevents re-enabling padding when it's already configured, which is useful when tokenizer configs already include padding settings.
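The guard can be sketched against a stand-in tokenizer object. In the HF `tokenizers` library, a `Tokenizer`'s `padding` attribute is `None` until `enable_padding` is called; the `_FakeTokenizer` below is a hypothetical minimal stand-in used only for illustration.

```python
class _FakeTokenizer:
    """Minimal stand-in mimicking the padding / enable_padding surface."""

    def __init__(self):
        self.padding = None

    def enable_padding(self, pad_id=0, pad_token="[PAD]"):
        self.padding = {"pad_id": pad_id, "pad_token": pad_token}

def ensure_padding(tokenizer, pad_id=0):
    # Only enable padding when the loaded config did not already set it,
    # so an existing padding configuration is never overwritten.
    if tokenizer.padding is None:
        tokenizer.enable_padding(pad_id=pad_id)

tok = _FakeTokenizer()
ensure_padding(tok, pad_id=7)
ensure_padding(tok, pad_id=0)  # no-op: padding already configured
```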
fastembed/image/transform/operators.py (9)
**44-55: LGTM! Safe handling of both flat and nested image structures.**

The `images and isinstance(images[0], list)` pattern correctly short-circuits on empty input, preventing IndexError.

**75-86: Consistent with Normalize pattern.**

Same safe short-circuit evaluation for nested structure detection.

**109-121: Clean implementation delegating to functional utility.**

**124-167: Logic correctly aligns dimensions to multiples of max_size.**

The aspect ratio calculation and rounding ensure proper vision encoder compatibility. PIL's `(width, height)` convention is correctly used in the resize call.

**170-242: Patch splitting logic handles boundary patches correctly.**

Edge patches are properly clamped via `min(start_x + optimal_width, width)`, ensuring no out-of-bounds access. Note that boundary patches may have smaller dimensions than interior patches, which is expected behavior for grid-based splitting.

**245-267: Output format correctly matches ImageSplitter for pipeline compatibility.**

Wrapping a single resized image in an inner list ensures downstream transforms (Normalize, Rescale) work uniformly.

**386-404: Proper handling of Idefics3ImageProcessor configuration.**

The `resample` parameter conversion from int to `Image.Resampling` handles configs that serialize enums as integers.

**426-429: Intentional no-op for Idefics3ImageProcessor center crop.**

The unused `config` parameter (ARG004 hint) is required by the method signature pattern. Safely ignore this static analysis warning.

**435-455: Clean conditional pipeline configuration for image splitting.**

The `@classmethod` decorator is appropriate here since it follows the pattern of other `_get_*` methods. The chained `.get()` calls provide safe defaults.

fastembed/image/transform/functional.py (2)
**152-175: Even dimension enforcement is useful for vision encoder compatibility.**

The aspect ratio calculation and dimension adjustment logic is correct. Ensuring even dimensions (lines 170-173) is a common requirement for certain encoder architectures.
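The dimension logic can be sketched as a pure function. `resize_longest_edge_dims` is an illustrative name, not fastembed's actual helper; it assumes even dimensions are enforced by rounding down.

```python
def resize_longest_edge_dims(h: int, w: int, longest: int) -> tuple[int, int]:
    """Scale (h, w) so the longest edge equals `longest`, preserving aspect
    ratio, then round both dimensions down to even numbers."""
    scale = longest / max(h, w)
    nh, nw = round(h * scale), round(w * scale)
    # Some vision encoders require even input dimensions.
    return nh - nh % 2, nw - nw % 2

dims = resize_longest_edge_dims(1000, 500, 512)  # longest edge becomes 512
```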
**178-191: Simple and correct crop implementation.**

Relies on caller for bounds validation, which is appropriate since `ImageSplitter` already clamps coordinates before calling this function.

tests/test_late_interaction_multimodal.py (4)
**24-34: Test data added for colmodernvbert image embeddings.**

Per the PR objectives, the code snippet showing how canonical test values were computed was not added. Consider adding this to the PR description or as a comment for reproducibility.

**49-59: Query embedding canonical values added.**

Consistent with image embedding test data structure.

**115-117: Embedding size verification for colmodernvbert.**

Confirms the model reports 128-dimensional embeddings, matching the existing colpali model.

**130-133: Instance-based embedding size check added.**

Tests that a lazy-loaded model instance reports the correct embedding size.
fastembed/common/onnx_model.py (1)
**19-24: Clean backward-compatible extension of OnnxOutputContext.**

The optional `metadata` field allows carrying additional context (e.g., patch counts for image splitting) without breaking existing code that doesn't use it.

fastembed/late_interaction_multimodal/colmodernvbert.py (1)
**35-45: Clean implementation following existing patterns.**

The `ColModernVBERT` class properly inherits from both `LateInteractionMultimodalEmbeddingBase` and `OnnxMultimodalModel`, follows the established patterns from `ColPali`, and implements the required abstract methods. The visual prompt prefix and overall structure look correct.
```python
patch_counts = [len(patches) for patches in processed]
max_patches = max(patch_counts)

# Get dimensions from first patch
C, H, W = processed[0][0].shape

# Create padded array
batch_size = len(processed)
encoded = np.zeros((batch_size, max_patches, C, H, W), dtype=processed[0][0].dtype)
```
Potential IndexError if an image produces no patches.

If `processed[0]` is an empty list, accessing `processed[0][0]` will raise an IndexError. Consider adding a guard or handling the edge case where an image might yield zero patches.
```diff
 patch_counts = [len(patches) for patches in processed]
 max_patches = max(patch_counts)
+
+if max_patches == 0:
+    raise ValueError("All images produced zero patches")

 # Get dimensions from first patch
-C, H, W = processed[0][0].shape
+# Find first non-empty patch list to get dimensions
+first_patch = next(
+    (patches[0] for patches in processed if len(patches) > 0), None
+)
+if first_patch is None:
+    raise ValueError("No valid patches found in processed images")
+C, H, W = first_patch.shape
```
+ C, H, W = first_patch.shapeCommittable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In fastembed/late_interaction_multimodal/onnx_multimodal_model.py around lines
186-194, the code assumes processed[0] contains at least one patch and uses
processed[0][0].shape which will IndexError if any image produced zero patches;
change it to first find a non-empty entry (e.g., iterate processed to locate the
first patches list with length>0) and use that patch's shape for C,H,W, and
handle the case where no images have patches (max_patches == 0) by either
returning an appropriately shaped empty encoded array or raising a clear
exception; ensure subsequent logic that pads/fills encoded handles the
zero-patch case.
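Putting the suggested guards together, the batching step can be sketched as a standalone function. `pad_patch_batch` is a hypothetical name and the shapes are illustrative, not the PR's actual code.

```python
import numpy as np

def pad_patch_batch(processed: list[list[np.ndarray]]) -> tuple[np.ndarray, list[int]]:
    """Pad per-image patch lists into a (batch, max_patches, C, H, W) array,
    guarding against images that produced zero patches."""
    patch_counts = [len(p) for p in processed]
    # Find the first non-empty patch list to get dimensions.
    first = next((p[0] for p in processed if p), None)
    if first is None:
        raise ValueError("No valid patches found in processed images")
    c, h, w = first.shape
    out = np.zeros((len(processed), max(patch_counts), c, h, w), dtype=first.dtype)
    for i, patches in enumerate(processed):
        for j, patch in enumerate(patches):
            out[i, j] = patch  # trailing slots stay zero-padded
    return out, patch_counts

batch, counts = pad_patch_batch(
    [[np.ones((3, 4, 4))], [np.ones((3, 4, 4)), np.ones((3, 4, 4))]]
)
```

The returned `patch_counts` is the per-image mask information a caller would carry in metadata.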
Actionable comments posted: 4
♻️ Duplicate comments (2)
fastembed/late_interaction_multimodal/colmodernvbert.py (2)
**37-41: Clarify model identifier in docstring.**

The docstring references `ModernVBERT/colmodernvbert` but the registered model identifier is `Qdrant/colmodernvbert`. Consider clarifying that this is the ONNX-converted version of the original model.

**158-166: Incorrect docstring: describes post-processing but method preprocesses.**

This docstring describes post-processing output, but the method actually preprocesses text input by adding empty image placeholders.
🧹 Nitpick comments (1)
fastembed/late_interaction_multimodal/colmodernvbert.py (1)
**136-153: Add error handling for config file loading.**

The config file loading lacks error handling for missing or malformed JSON files. Consider wrapping in try-except to provide clearer error messages when config files are missing or corrupted.

```diff
 # Load image processing configuration
 processor_config_path = self._model_dir / "processor_config.json"
-with open(processor_config_path) as f:
-    processor_config = json.load(f)
-self.image_seq_len = processor_config.get("image_seq_len", 64)
+try:
+    with open(processor_config_path) as f:
+        processor_config = json.load(f)
+    self.image_seq_len = processor_config.get("image_seq_len", 64)
+except FileNotFoundError:
+    raise FileNotFoundError(
+        f"Required config file not found: {processor_config_path}"
+    )
+except json.JSONDecodeError as e:
+    raise ValueError(f"Invalid JSON in {processor_config_path}: {e}")
```

Apply similar handling to the other config file reads.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- fastembed/late_interaction_multimodal/colmodernvbert.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
fastembed/late_interaction_multimodal/colmodernvbert.py
156-156: Unused method argument: kwargs
(ARG002)
189-189: Unused method argument: kwargs
(ARG002)
194-194: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.9.x on macos-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
- GitHub Check: Python 3.9.x on windows-latest test
- GitHub Check: Python 3.9.x on ubuntu-latest test
🔇 Additional comments (7)
fastembed/late_interaction_multimodal/colmodernvbert.py (7)
**1-32: LGTM!**

Imports and model description are well-structured and follow the existing patterns in the codebase.

**174-191: LGTM!**

These methods are straightforward. The unused `kwargs` parameters are likely intentional for interface compatibility with base class method signatures.

**258-280: LGTM!**

The grid dimension computation logic is sound with appropriate fallback handling for edge cases.

**330-346: LGTM!**

The post-processing logic correctly reshapes the output tensor using the model's embedding dimension.

**348-418: LGTM!**

Both embedding methods follow the established patterns and correctly delegate to internal implementation methods.

**420-426: LGTM!**

Worker class references are correct.

**429-446: LGTM!**

Worker classes correctly implement `init_embedding` with appropriate defaults for parallel processing contexts.
```python
self.mask_token_id = None
self.pad_token_id = None
self.image_seq_len: Optional[int] = None
self.max_image_size: Optional[int] = None
self.image_size: Optional[int] = None
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# First, check the class definition and its base classes
head -50 fastembed/late_interaction_multimodal/colmodernvbert.py | grep -A 5 "^class "
```

Repository: qdrant/fastembed
Length of output: 367

🏁 Script executed:

```shell
#!/bin/bash
# Search for mask_token_id and pad_token_id usage across the codebase
rg -n "mask_token_id|pad_token_id" --type py
```

Length of output: 1452

🏁 Script executed:

```shell
#!/bin/bash
# Check if the file has any references to these attributes
cat -n fastembed/late_interaction_multimodal/colmodernvbert.py | grep -E "(mask_token_id|pad_token_id)"
```

Length of output: 332

🏁 Script executed:

```shell
#!/bin/bash
# Find the LateInteractionMultimodalEmbeddingBase class definition
fd -e py | xargs grep -l "class LateInteractionMultimodalEmbeddingBase"
```

Length of output: 143

🏁 Script executed:

```shell
#!/bin/bash
# Check the base class for any requirements on mask_token_id and pad_token_id
rg -A 30 "class LateInteractionMultimodalEmbeddingBase" --type py
```

Length of output: 3890
Remove unused instance attributes `mask_token_id` and `pad_token_id`.

These attributes are initialized to `None` at lines 106-107 but are never assigned values or used anywhere in the class. The base class `LateInteractionMultimodalEmbeddingBase` does not require them, and the class logic uses local variables instead (e.g., `pad_token_id` is obtained directly from `self.tokenizer.padding["pad_id"]` at line 235). These are dead code and should be removed.
🤖 Prompt for AI Agents
In fastembed/late_interaction_multimodal/colmodernvbert.py around lines 106 to
110, remove the dead instance attributes mask_token_id and pad_token_id which
are initialized to None but never used; delete those two attribute
initializations and ensure no other code depends on self.mask_token_id or
self.pad_token_id (use local tokenizer/padding lookups already present), then
run tests/linters to confirm no references remain.
```python
batch_size, seq_length = onnx_input["input_ids"].shape
empty_image_placeholder: NumpyArray = np.zeros(
    (batch_size, seq_length, 3, self.image_size, self.image_size), dtype=np.float32  # type: ignore[type-var,arg-type,assignment]
)
onnx_input["pixel_values"] = empty_image_placeholder
return onnx_input
```
Guard against `None` value for `self.image_size`.

`self.image_size` is `Optional[int]` initialized to `None`, and is only set during `load_onnx_model()`. If this method is called before the model is loaded (e.g., with `lazy_load=True`), using `self.image_size` will produce an invalid array shape.
```diff
+if self.image_size is None:
+    raise RuntimeError(
+        "Model not loaded. Call load_onnx_model() first or initialize with lazy_load=False"
+    )
 batch_size, seq_length = onnx_input["input_ids"].shape
-empty_image_placeholder: NumpyArray = np.zeros(
-    (batch_size, seq_length, 3, self.image_size, self.image_size), dtype=np.float32  # type: ignore[type-var,arg-type,assignment]
-)
+empty_image_placeholder: NumpyArray = np.zeros(
+    (batch_size, seq_length, 3, self.image_size, self.image_size), dtype=np.float32
+)
```

🤖 Prompt for AI Agents
In fastembed/late_interaction_multimodal/colmodernvbert.py around lines 167-172,
guard against self.image_size being None before using it to build the
empty_image_placeholder; if self.image_size is None either call
self.load_onnx_model() (if safe here) or raise a clear ValueError indicating the
model must be loaded (or lazy_load disabled) before creating pixel_values, and
ensure you cast/check the value to an int before using it to construct the numpy
shape so an invalid shape is never created.
```python
# Get padding config from tokenizer
padding_direction = self.tokenizer.padding["direction"]  # type: ignore[index,union-attr]
pad_token_id = self.tokenizer.padding["pad_id"]  # type: ignore[index,union-attr]

# Initialize with pad token
padded_input_ids = np.full((batch_size, max_len), pad_token_id, dtype=np.int64)
```
Add guard for uninitialized tokenizer.

`self.tokenizer` could be `None` if the model isn't fully loaded. The `type: ignore` comments suppress the error but don't prevent a runtime TypeError.
```diff
+if self.tokenizer is None or self.tokenizer.padding is None:
+    raise RuntimeError("Tokenizer not initialized. Ensure model is loaded.")
+
 # Get padding config from tokenizer
-padding_direction = self.tokenizer.padding["direction"]  # type: ignore[index,union-attr]
-pad_token_id = self.tokenizer.padding["pad_id"]  # type: ignore[index,union-attr]
+padding_direction = self.tokenizer.padding["direction"]
+pad_token_id = self.tokenizer.padding["pad_id"]
```
+ pad_token_id = self.tokenizer.padding["pad_id"]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Get padding config from tokenizer | |
| padding_direction = self.tokenizer.padding["direction"] # type: ignore[index,union-attr] | |
| pad_token_id = self.tokenizer.padding["pad_id"] # type: ignore[index,union-attr] | |
| # Initialize with pad token | |
| padded_input_ids = np.full((batch_size, max_len), pad_token_id, dtype=np.int64) | |
| if self.tokenizer is None or self.tokenizer.padding is None: | |
| raise RuntimeError("Tokenizer not initialized. Ensure model is loaded.") | |
| # Get padding config from tokenizer | |
| padding_direction = self.tokenizer.padding["direction"] | |
| pad_token_id = self.tokenizer.padding["pad_id"] | |
| # Initialize with pad token | |
| padded_input_ids = np.full((batch_size, max_len), pad_token_id, dtype=np.int64) |
🤖 Prompt for AI Agents
fastembed/late_interaction_multimodal/colmodernvbert.py lines 233-238: the code
assumes self.tokenizer is set and immediately accesses its padding attributes,
which can raise a TypeError if tokenizer is None; add a guard that checks if
self.tokenizer is None and either (a) raise a clear RuntimeError/ValueError
stating the model/tokenizer is not initialized and instructing how to load it,
or (b) attempt to lazy-initialize or retrieve a fallback tokenizer before
accessing padding; after the guard, safely read padding_direction and
pad_token_id and remove reliance on the type: ignore suppression so the runtime
can fail loudly or recover with a proper fallback.
```python
def _create_single_image_prompt_string(self) -> str:
    return (
        "<fake_token_around_image>"
        + "<global-img>"
        + "<image>" * self.image_seq_len  # type: ignore[operator]
        + "<fake_token_around_image>"
    )

def _create_split_image_prompt_string(self, rows: int, cols: int) -> str:
    text_split_images = ""

    # Add tokens for each patch in the grid
    for n_h in range(rows):
        for n_w in range(cols):
            text_split_images += (
                "<fake_token_around_image>"
                + f"<row_{n_h + 1}_col_{n_w + 1}>"
                + "<image>" * self.image_seq_len  # type: ignore[operator]
            )
        text_split_images += "\n"

    # Add global image at the end
    text_split_images += (
        "\n<fake_token_around_image>"
        + "<global-img>"
        + "<image>" * self.image_seq_len  # type: ignore[operator]
        + "<fake_token_around_image>"
    )

    return text_split_images
```
Guard against `None` value for `self.image_seq_len`.

Similar to `self.image_size`, `self.image_seq_len` could be `None` if the model isn't loaded, causing string multiplication to fail with TypeError.

Consider adding a runtime check at the start of methods that depend on model configuration:

```python
if self.image_seq_len is None:
    raise RuntimeError("Model not loaded. Call load_onnx_model() first.")
```
In fastembed/late_interaction_multimodal/colmodernvbert.py around lines 282 to
311, both _create_single_image_prompt_string and
_create_split_image_prompt_string assume self.image_seq_len is not None and will
raise a TypeError when it's None; add a runtime guard at the start of each
method that checks if self.image_seq_len is None and raises a RuntimeError with
a clear message like "Model not loaded. Call load_onnx_model() first." so
callers get a meaningful error instead of a TypeError from string
multiplication.
Actionable comments posted: 0
♻️ Duplicate comments (4)
fastembed/late_interaction_multimodal/colmodernvbert.py (4)
**159-165: Fix incorrect docstring.**

The docstring describes post-processing output, but this method actually preprocesses text input by adding empty image placeholders. The Args describe `output` when the parameter is `onnx_input`, and Returns should describe the updated input dictionary.

```diff
-    Post-process the ONNX model output to convert it into a usable format.
+    Preprocesses the text input for the ONNX model by adding empty image placeholders.

     Args:
-        output (OnnxOutputContext): The raw output from the ONNX model.
+        onnx_input (dict[str, NumpyArray]): The text input dictionary for the ONNX model.

     Returns:
-        Iterable[NumpyArray]: Post-processed output as NumPy arrays.
+        dict[str, NumpyArray]: The updated input dictionary with empty image placeholders added.
```

**233-238: Add guard for uninitialized tokenizer.**

`self.tokenizer` is `Optional[Tokenizer]` and could be `None` if the model isn't fully loaded. Accessing `self.tokenizer.padding` at lines 234-235 will raise an `AttributeError` if called with `lazy_load=True` before the model is loaded. The `type: ignore` comments suppress type errors but don't prevent the runtime exception.

```diff
+if self.tokenizer is None or self.tokenizer.padding is None:
+    raise RuntimeError("Tokenizer not initialized. Ensure model is loaded.")
+
 # Get padding config from tokenizer
 padding_direction = self.tokenizer.padding["direction"]
 pad_token_id = self.tokenizer.padding["pad_id"]
```

**167-170: Guard against `None` value for `self.image_size`.**

`self.image_size` is `Optional[int]` initialized to `None` (line 110) and only set during `load_onnx_model()`. If this method is called with `lazy_load=True` before the model is loaded, using `self.image_size` at line 169 will create an invalid array shape or raise a TypeError.

```diff
+if self.image_size is None:
+    raise RuntimeError(
+        "Model not loaded. Call load_onnx_model() first or initialize with lazy_load=False"
+    )
 batch_size, seq_length = onnx_input["input_ids"].shape
```

**282-311: Guard against `None` value for `self.image_seq_len`.**

Both `_create_single_image_prompt_string` and `_create_split_image_prompt_string` use `self.image_seq_len` for string multiplication (lines 286, 299, 307). `self.image_seq_len` is `Optional[int]` initialized to `None` (line 108) and only set during `load_onnx_model()`. If called with `lazy_load=True` before the model is loaded, these will raise `TypeError: can't multiply sequence by NoneType`.

Add a runtime check at the start of both methods:

```python
if self.image_seq_len is None:
    raise RuntimeError("Model not loaded. Call load_onnx_model() first.")
```
def _create_single_image_prompt_string(self) -> str: + if self.image_seq_len is None: + raise RuntimeError("Model not loaded. Call load_onnx_model() first.") return (def _create_split_image_prompt_string(self, rows: int, cols: int) -> str: + if self.image_seq_len is None: + raise RuntimeError("Model not loaded. Call load_onnx_model() first.") text_split_images = ""
🧹 Nitpick comments (2)
fastembed/late_interaction_multimodal/colmodernvbert.py (2)
37-40: Clarify the relationship between the original and ONNX models in the docstring.

The docstring references "ModernVBERT/colmodernvbert", but the actual model identifier used is "Qdrant/colmodernvbert" (line 23). Consider clarifying that Qdrant hosts the ONNX-converted version of the original ModernVBERT model to help users understand which model files are being loaded.

```diff
- The ModernVBERT/colmodernvbert model implementation. This model uses
- bidirectional attention, which proves to work better for retrieval.
-
- See: https://huggingface.co/ModernVBERT/colmodernvbert
+ The ColModernVBERT model implementation using ONNX runtime. This model uses
+ bidirectional attention, which proves to work better for retrieval.
+
+ This is the ONNX-converted version hosted at https://huggingface.co/Qdrant/colmodernvbert
+ Original model: https://huggingface.co/ModernVBERT/colmodernvbert
```
106-107: Remove unused instance attributes.

Based on learnings, `mask_token_id` and `pad_token_id` are initialized here but never assigned values or used anywhere in the class. The padding logic at lines 234-235 retrieves `pad_token_id` directly from `self.tokenizer.padding["pad_id"]` instead. These lines are dead code and can be safely removed.

```diff
- self.mask_token_id = None
- self.pad_token_id = None
  self.image_seq_len: Optional[int] = None
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
fastembed/late_interaction_multimodal/colmodernvbert.py (1 hunks)
🧰 Additional context used
🪛 Ruff (0.14.8)
fastembed/late_interaction_multimodal/colmodernvbert.py
156-156: Unused method argument: kwargs
(ARG002)
189-189: Unused method argument: kwargs
(ARG002)
194-194: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Python 3.9.x on windows-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.9.x on macos-latest test
- GitHub Check: Python 3.11.x on ubuntu-latest test
🔇 Additional comments (3)
fastembed/late_interaction_multimodal/colmodernvbert.py (3)
330-346: LGTM!

The post-processing logic correctly reshapes the output and includes a proper assertion for the model dimension.
348-418: LGTM!

Both `embed_text` and `embed_image` methods properly delegate to inherited base methods with all configuration parameters correctly passed through.
420-446: LGTM!

Worker class methods and class definitions follow the standard pattern. Both text and image workers correctly initialize `ColModernVBERT` instances with single-threaded configuration.
Actionable comments posted: 0
♻️ Duplicate comments (2)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (2)
217-225: Guard against empty patch lists in `_process_nested_patches`.

If `processed[0]` contains zero patches (or the first few images yield no patches), `processed[0][0]` will raise an `IndexError`, even though later images may have valid patches. Similarly, if all images produce zero patches, `max(patch_counts)` is 0 and there is no patch to infer `C, H, W` from.

Consider locating the first non-empty patch list and handling the all-empty case explicitly:

```diff
  patch_counts = [len(patches) for patches in processed]
  max_patches = max(patch_counts)
+
+ if max_patches == 0:
+     raise ValueError("All images produced zero patches")

  # Get dimensions from first patch
- C, H, W = processed[0][0].shape
+ first_patch = next((patches[0] for patches in processed if patches), None)
+ if first_patch is None:
+     raise ValueError("No valid patches found in processed images")
+ C, H, W = first_patch.shape
```

This avoids hard failures on edge cases where some or all images contribute no patches.
187-190: Use `encoded.shape[0]` as the batch size source in `_process_flat_images`.

`_process_flat_images` currently takes `num_images` from the caller and uses it to size `attention_mask` and `patch_counts`. If a processor ever returns a different number of encoded images than the original `images` list (e.g., filtering, de-duplication), this will cause shape or length mismatches.

You can make the function self-consistent by deriving the batch size from `encoded` and dropping the `num_images` argument:

```diff
  else:
      encoded, attention_mask, metadata = self._process_flat_images(
          processed,  # type: ignore[arg-type]
-         len(images),
      )
```

And in `_process_flat_images`:

```diff
- def _process_flat_images(
-     self, processed: list[NumpyArray], num_images: int
- ) -> tuple[NumpyArray, NumpyArray, dict[str, Any]]:
+ def _process_flat_images(
+     self, processed: list[NumpyArray]
+ ) -> tuple[NumpyArray, NumpyArray, dict[str, Any]]:
@@
      encoded = np.array(processed)
+     batch_size = encoded.shape[0]
@@
      if len(encoded.shape) == 5:
          # 5D tensor: attention_mask shape is (batch, num_patches)
-         attention_mask = np.ones((num_images, encoded.shape[1]), dtype=np.int64)
-         metadata = {"patch_counts": [encoded.shape[1]] * num_images}
+         attention_mask = np.ones((batch_size, encoded.shape[1]), dtype=np.int64)
+         metadata = {"patch_counts": [encoded.shape[1]] * batch_size}
      else:
          # 4D tensor: attention_mask shape is (batch, 1)
-         attention_mask = np.ones((num_images, 1), dtype=np.int64)
-         metadata = {"patch_counts": [1] * num_images}
+         attention_mask = np.ones((batch_size, 1), dtype=np.int64)
+         metadata = {"patch_counts": [1] * batch_size}
```

This keeps attention masks and metadata consistent with the actual encoded tensor regardless of how the processor behaves.

Also applies to: 258-275
🧹 Nitpick comments (2)
tests/test_late_interaction_multimodal.py (1)
24-34: Canonical values for ColModernVBERT look consistent; please keep the derivation reproducible.

The added `CANONICAL_IMAGE_VALUES` and `CANONICAL_QUERY_VALUES` entries for `"Qdrant/colmodernvbert"` integrate cleanly with the existing tests (shapes match the token×dim slicing logic used below). To keep these stable over time and aid future debugging, it would be helpful to either:
- add a short note or link to the script/notebook used to generate these canonical values, or
- include the missing “how canonical values were computed” snippet in the test docs/checklist.
This isn’t blocking, but it will make future updates to the model/tests much easier to validate.
Also applies to: 49-56
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
179-191: Propagate the preprocessed `attention_mask` into `OnnxOutputContext` for the image path.

After `_preprocess_onnx_image_input`, you always pass the original `attention_mask` into `OnnxOutputContext`. If a subclass overrides `_preprocess_onnx_image_input` and adjusts or replaces the mask, the context will become inconsistent with what was actually fed to the model.

You can mirror the text path by reading from `onnx_input` first:

```diff
  onnx_input = {"pixel_values": encoded, "attention_mask": attention_mask}
  onnx_input = self._preprocess_onnx_image_input(onnx_input, **kwargs)
  model_output = self.model.run(None, onnx_input)  # type: ignore[union-attr]

  return OnnxOutputContext(
      model_output=model_output[0],
-     attention_mask=attention_mask,  # type: ignore[arg-type]
+     attention_mask=onnx_input.get("attention_mask", attention_mask),  # type: ignore[arg-type]
      metadata=metadata,
  )
```

This keeps the context aligned with any preprocessing logic while remaining backward compatible.

Also applies to: 196-200
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1 hunks)

tests/test_late_interaction_multimodal.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_late_interaction_multimodal.py (1)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (3)
LateInteractionMultimodalEmbedding (15-189)
get_embedding_size (98-122)
embedding_size (91-95)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Python 3.9.x on macos-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.9.x on windows-latest test
- GitHub Check: Python 3.13.x on windows-latest test
🔇 Additional comments (2)
tests/test_late_interaction_multimodal.py (1)
112-113: Embedding size tests for ColModernVBERT match registry expectations.

The new assertions for `"Qdrant/colmodernvbert"` in both `test_get_embedding_size` and `test_embedding_size` correctly expect a dim of 128 and are aligned with the registry-based lookup logic. No issues here.

Also applies to: 127-129
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)
277-299: Patch-dimension detection via ONNX input shape looks sound.

`_needs_patch_dimension` cleanly inspects the `pixel_values` input shape to distinguish 4D vs 5D models and defaults to `False` when the model is not yet loaded or the input is missing. This matches the intended behavior for ColModernVBERT (5D) vs earlier 4D models and should be robust against dynamic dimension names.
Actionable comments posted: 1
♻️ Duplicate comments (6)
fastembed/late_interaction_multimodal/colmodernvbert.py (6)
37-41: Docstring references original model, but code uses the Qdrant ONNX version.

The docstring mentions `ModernVBERT/colmodernvbert` and links to that HuggingFace page, but the actual model identifier in `supported_colmodernvbert_models` is `Qdrant/colmodernvbert`. Consider clarifying this distinction.

107-108: Remove unused instance attributes `mask_token_id` and `pad_token_id`.

These attributes are initialized but never assigned meaningful values or used anywhere in the class. This was previously flagged.

159-166: Incorrect docstring: describes post-processing, but the method preprocesses input.

The docstring says "Post-process the ONNX model output", but this method actually preprocesses text input by adding empty image placeholders.

168-172: Guard against a `None` value for `self.image_size`.

With `lazy_load=True`, `self.image_size` remains `None` until `load_onnx_model()` is called, causing a runtime error when creating the placeholder array. This was previously flagged.

236-238: Add a guard for an uninitialized tokenizer.

`self.tokenizer` could be `None` if the model isn't loaded. This was previously flagged.

285-314: Guard against a `None` value for `self.image_seq_len`.

`self.image_seq_len` could be `None` if the model isn't loaded, causing a `TypeError` on string multiplication. This was previously flagged.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
fastembed/late_interaction_multimodal/colmodernvbert.py

tests/test_late_interaction_multimodal.py
🧰 Additional context used
🪛 Ruff (0.14.10)
fastembed/late_interaction_multimodal/colmodernvbert.py
46-46: Possible hardcoded password assigned to: "QUERY_AUGMENTATION_TOKEN"
(S105)
157-157: Unused method argument: kwargs
(ARG002)
190-190: Unused method argument: kwargs
(ARG002)
197-197: Unused method argument: kwargs
(ARG002)
🔇 Additional comments (11)
fastembed/late_interaction_multimodal/colmodernvbert.py (7)
21-32: Model description looks well-structured.

The model descriptor includes all necessary fields, and the additional file `processor_config.json` is correctly specified for the image processing configuration.
244-258: Padding logic correctly handles both left and right directions.

The implementation properly places tokens based on `padding_direction` and constructs appropriate attention masks.
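For illustration, the direction-aware padding described in this comment can be sketched as follows. This is a simplified stand-in (the function name and signature are hypothetical), not the actual ColModernVBERT code, which works with the tokenizer's padding config and NumPy arrays:

```python
def pad_token_ids(
    sequences: list[list[int]], pad_token_id: int, padding_direction: str = "right"
) -> tuple[list[list[int]], list[list[int]]]:
    """Hypothetical sketch: pad token id sequences to a common length,
    honoring the padding direction, and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids: list[list[int]] = []
    attention_mask: list[list[int]] = []
    for seq in sequences:
        pad = [pad_token_id] * (max_len - len(seq))
        ones = [1] * len(seq)
        zeros = [0] * (max_len - len(seq))
        if padding_direction == "left":
            input_ids.append(pad + list(seq))
            attention_mask.append(zeros + ones)
        else:  # "right"
            input_ids.append(list(seq) + pad)
            attention_mask.append(ones + zeros)
    return input_ids, attention_mask
```

Left padding places real tokens at the end of each row, which matters for models whose pooling or prompt logic assumes the sequence is right-aligned.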
261-283: Grid computation logic handles edge cases appropriately.

The method correctly handles single patches (`patch_count <= 1`), attempts square grid detection first, and falls back to factor enumeration for non-square grids. The final fallback to `(0, 0)` ensures graceful degradation.
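The grid-recovery behavior described above can be sketched roughly like this. It is an illustrative approximation, not the actual `_compute_patch_grid` implementation (for instance, the real method also has to account for the appended global view when mapping patch counts to a grid):

```python
import math


def compute_patch_grid(patch_count: int) -> tuple[int, int]:
    """Illustrative sketch: derive a (rows, cols) grid from a patch count.

    Single patches map to (0, 0), signalling an unsplit image; perfect
    squares map to a square grid; other counts fall back to the most
    square-like factor pair."""
    if patch_count <= 1:
        return (0, 0)  # unsplit image
    side = math.isqrt(patch_count)
    if side * side == patch_count:
        return (side, side)
    # Fall back to factor enumeration, preferring near-square grids
    for rows in range(side, 0, -1):
        if patch_count % rows == 0:
            return (rows, patch_count // rows)
    return (0, 0)  # graceful degradation
```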
316-331: Prompt construction and tokenization logic is correct.

The method correctly selects the appropriate prompt format based on grid dimensions and properly tokenizes the expanded prompt.

333-349: Post-processing correctly reshapes output to expected dimensions.

The assertion guards against misconfigured models, and the reshape operation produces the expected `(batch, n, dim)` output format.

351-421: `embed_text` and `embed_image` methods correctly delegate to base class utilities.

Both methods properly pass through all configuration options and use `yield from` for efficient lazy iteration.

432-449: Worker classes are correctly implemented.

Both `ColModernVBERTTextEmbeddingWorker` and `ColModernVBERTImageEmbeddingWorker` properly initialize the `ColModernVBERT` model with `threads=1` for parallel processing scenarios.

tests/test_late_interaction_multimodal.py (4)
24-34: Canonical image embedding values added for the new model.

The test data follows the established pattern with abridged and rounded values. Note that the PR checklist indicates the code snippet showing how these canonical values were computed was not added; consider adding this for reproducibility.

49-59: Canonical query embedding values added for the new model.

The test data is consistent with the existing format.

115-117: Embedding size test added for the new model.

The assertion correctly validates the 128-dimensional embedding size matching the model description.

130-133: Lazy-load embedding size test added for the new model.

The test correctly validates that the embedding size is accessible even with `lazy_load=True`.
```python
preprocessor_config_path = self._model_dir / "preprocessor_config.json"
with open(preprocessor_config_path) as f:
    preprocessor_config = json.load(f)
self.max_image_size = preprocessor_config.get("max_image_size", {}).get(
    "longest_edge", 512
)
```
Potential AttributeError if max_image_size is not a dict.
If preprocessor_config.get("max_image_size", {}) returns a non-dict value (e.g., an integer), the subsequent .get("longest_edge", 512) call will fail with an AttributeError.
🔎 Suggested defensive handling
```diff
  preprocessor_config_path = self._model_dir / "preprocessor_config.json"
  with open(preprocessor_config_path) as f:
      preprocessor_config = json.load(f)
- self.max_image_size = preprocessor_config.get("max_image_size", {}).get(
-     "longest_edge", 512
- )
+ max_image_size_config = preprocessor_config.get("max_image_size", {})
+ if isinstance(max_image_size_config, dict):
+     self.max_image_size = max_image_size_config.get("longest_edge", 512)
+ else:
+     self.max_image_size = max_image_size_config if isinstance(max_image_size_config, int) else 512
```

🤖 Prompt for AI Agents
In fastembed/late_interaction_multimodal/colmodernvbert.py around lines 142 to
147, the code assumes preprocessor_config.get("max_image_size", {}) returns a
dict and calls .get("longest_edge", 512), which will raise AttributeError if
it's an int or other non-dict; change this to read the value into a variable,
check isinstance(value, dict) and if so use value.get("longest_edge", 512), elif
isinstance(value, int) use the int directly, otherwise fall back to 512, and
assign that result to self.max_image_size so the code is robust to different
config shapes.
94366ea to
5a6d884
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (7)
fastembed/late_interaction_multimodal/colmodernvbert.py (6)
35-46: Class docstring references the original model, but the deployed model differs.

The docstring references `ModernVBERT/colmodernvbert`, but the actual model identifier is `Qdrant/colmodernvbert` (line 23). Consider clarifying that this is the ONNX-converted version.

107-108: Remove unused instance attributes `mask_token_id` and `pad_token_id`.

These attributes are initialized to `None` but never assigned values or used anywhere in the class. The tokenizer padding is accessed directly via `self.tokenizer.padding["pad_id"]` at line 238. These are dead code.

142-147: Potential `AttributeError` if `max_image_size` is not a dict.

If `preprocessor_config.get("max_image_size", {})` returns a non-dict value (e.g., an integer), the subsequent `.get("longest_edge", 512)` call will fail.

🔎 Proposed defensive handling

```diff
  preprocessor_config_path = self._model_dir / "preprocessor_config.json"
  with open(preprocessor_config_path) as f:
      preprocessor_config = json.load(f)
- self.max_image_size = preprocessor_config.get("max_image_size", {}).get(
-     "longest_edge", 512
- )
+ max_image_size_config = preprocessor_config.get("max_image_size", {})
+ if isinstance(max_image_size_config, dict):
+     self.max_image_size = max_image_size_config.get("longest_edge", 512)
+ else:
+     self.max_image_size = max_image_size_config if isinstance(max_image_size_config, int) else 512
```

156-173: Docstring describes post-processing, but the method preprocesses input; also missing a `None` guard for `self.image_size`.

Two issues:

- The docstring incorrectly describes post-processing output when this method preprocesses text input by adding image placeholders.
- `self.image_size` is `Optional[int]` and only set during `load_onnx_model()`. With `lazy_load=True`, using it before model load will produce an invalid array shape.

🔎 Proposed fix

```diff
  def _preprocess_onnx_text_input(
      self, onnx_input: dict[str, NumpyArray], **kwargs: Any
  ) -> dict[str, NumpyArray]:
      """
-     Post-process the ONNX model output to convert it into a usable format.
+     Preprocess text input by adding empty image placeholders for the ONNX model.

      Args:
-         output (OnnxOutputContext): The raw output from the ONNX model.
+         onnx_input (dict[str, NumpyArray]): The text input dictionary for the ONNX model.

      Returns:
-         Iterable[NumpyArray]: Post-processed output as NumPy arrays.
+         dict[str, NumpyArray]: Updated input dictionary with empty image placeholders.
      """
+     if self.image_size is None:
+         raise RuntimeError(
+             "Model not loaded. Call load_onnx_model() first or initialize with lazy_load=False"
+         )
      batch_size, seq_length = onnx_input["input_ids"].shape
```

236-238: Add a guard for an uninitialized tokenizer.

`self.tokenizer` could be `None` if the model isn't fully loaded. The `type: ignore` comments suppress the error but don't prevent a runtime failure.

🔎 Proposed fix

```diff
+ if self.tokenizer is None or self.tokenizer.padding is None:
+     raise RuntimeError("Tokenizer not initialized. Ensure model is loaded.")
+
  # Get padding config from tokenizer
- padding_direction = self.tokenizer.padding["direction"]  # type: ignore[index,union-attr]
- pad_token_id = self.tokenizer.padding["pad_id"]  # type: ignore[index,union-attr]
+ padding_direction = self.tokenizer.padding["direction"]
+ pad_token_id = self.tokenizer.padding["pad_id"]
```

285-314: Guard against a `None` value for `self.image_seq_len`.

`self.image_seq_len` could be `None` if the model isn't loaded, causing the string multiplication (`"<image>" * self.image_seq_len`) to fail with a `TypeError`.

🔎 Proposed fix

Add a runtime check at the start of the methods that use `image_seq_len`:

```python
if self.image_seq_len is None:
    raise RuntimeError("Model not loaded. Call load_onnx_model() first.")
```

fastembed/late_interaction_multimodal/onnx_multimodal_model.py (1)

217-237: Potential `IndexError` if an image produces zero patches.

The past review comment about this issue was not fully addressed. If `processed[0]` is an empty list, accessing `processed[0][0].shape` at line 221 will raise an `IndexError`. Additionally, if `processed` itself is empty or all images produce zero patches, `max(patch_counts)` will be 0, and the dimension extraction will fail.

🔎 Proposed defensive handling

```diff
  patch_counts = [len(patches) for patches in processed]
  max_patches = max(patch_counts)
+
+ if max_patches == 0:
+     raise ValueError("All images produced zero patches")

  # Get dimensions from first patch
- C, H, W = processed[0][0].shape
+ first_patch = next(
+     (patches[0] for patches in processed if len(patches) > 0), None
+ )
+ if first_patch is None:
+     raise ValueError("No valid patches found in processed images")
+ C, H, W = first_patch.shape
```
🧹 Nitpick comments (3)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
16-19: Registry extended to include ColModernVBERT.The EMBEDDINGS_REGISTRY now includes both ColPali and ColModernVBERT, enabling model discovery and instantiation via the unified interface.
Per the static analysis hint (RUF012), consider annotating mutable class attributes with
typing.ClassVarfor clarity:EMBEDDINGS_REGISTRY: ClassVar[list[Type[LateInteractionMultimodalEmbeddingBase]]] = [...]This is a minor improvement and not blocking.
fastembed/image/transform/functional.py (1)
192-221: Consider validating the normalization assumption for float inputs.

For float32/float64 dtypes, the code assumes values are normalized to the [0, 1] range (line 206). If values fall outside this range, they will be clipped during uint8 conversion, potentially causing silent data quality issues.
Consider adding validation:
🔎 Add range validation for float inputs
```diff
  # Handle different dtypes
  if img_hwc.dtype == np.float32 or img_hwc.dtype == np.float64:
-     # Assume normalized, scale to 0-255 for PIL
+     # Validate normalized range and scale to 0-255 for PIL
+     if img_hwc.min() < 0 or img_hwc.max() > 1:
+         raise ValueError(
+             f"Float images must be normalized to [0, 1] range, got [{img_hwc.min()}, {img_hwc.max()}]"
+         )
      img_hwc_scaled = (img_hwc * 255).astype(np.uint8)
```
386-404: Optional: Extract error message to exception class.Static analysis suggests moving the long error message to a custom exception class for better maintainability.
🔎 Refactor suggestion
Define a custom exception:
Define a custom exception:

```python
class InvalidImageProcessorConfigError(ValueError):
    def __init__(self, processor_type: str, missing_key: str):
        super().__init__(
            f"Size dictionary must contain '{missing_key}' key for {processor_type}"
        )
```

Then use it:

```diff
  if "longest_edge" not in size:
-     raise ValueError(
-         "Size dictionary must contain 'longest_edge' key for Idefics3ImageProcessor"
-     )
+     raise InvalidImageProcessorConfigError("Idefics3ImageProcessor", "longest_edge")
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
fastembed/common/onnx_model.py
fastembed/common/preprocessor_utils.py
fastembed/image/transform/functional.py
fastembed/image/transform/operators.py
fastembed/late_interaction_multimodal/colmodernvbert.py
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
fastembed/late_interaction_multimodal/onnx_multimodal_model.py
tests/test_late_interaction_multimodal.py
🚧 Files skipped from review as they are similar to previous changes (2)
- fastembed/common/preprocessor_utils.py
- fastembed/common/onnx_model.py
🧰 Additional context used
🧬 Code graph analysis (5)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
fastembed/late_interaction_multimodal/colmodernvbert.py (1)
ColModernVBERT(35-429)
tests/test_late_interaction_multimodal.py (1)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (2)
get_embedding_size (98-122)
embedding_size (91-95)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (3)
fastembed/late_interaction_multimodal/colmodernvbert.py (1)
_preprocess_onnx_image_input (196-259)

fastembed/late_interaction_multimodal/colpali.py (1)

_preprocess_onnx_image_input (210-227)

fastembed/common/onnx_model.py (1)

OnnxOutputContext (20-24)
fastembed/image/transform/operators.py (2)
fastembed/image/transform/functional.py (5)
crop_ndarray (176-189)
normalize (63-92)
resize (95-111)
resize_longest_edge (150-173)
resize_ndarray (192-221)

fastembed/common/utils.py (1)
normalize(18-23)
fastembed/late_interaction_multimodal/colmodernvbert.py (6)
fastembed/common/model_description.py (2)
DenseModelDescription(35-40)ModelSource(7-20)fastembed/common/onnx_model.py (2)
OnnxOutputContext(20-24)_select_exposed_session_options(128-137)fastembed/common/utils.py (1)
define_cache_dir(48-59)fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding_base.py (1)
LateInteractionMultimodalEmbeddingBase(10-86)fastembed/late_interaction_multimodal/onnx_multimodal_model.py (4)
OnnxMultimodalModel(20-356)load_onnx_model(82-83)_load_onnx_model(59-80)tokenize(85-86)fastembed/common/model_management.py (2)
_get_model_description(67-84)download_model(372-471)
🪛 GitHub Actions: Tests
fastembed/image/transform/functional.py
[error] 153-153: NameError: name 'Union' is not defined. Possibly missing 'from typing import Union' in the module.
🪛 GitHub Actions: type-checkers
fastembed/image/transform/functional.py
[error] 153-153: Name 'Union' is not defined. Did you forget to import it from 'typing'?
[error] 195-195: Name 'Union' is not defined. Did you forget to import it from 'typing'?
fastembed/image/transform/operators.py
[error] 45-45: Name 'Union' is not defined. Did you forget to import it from 'typing'?
[error] 46-46: Name 'Union' is not defined. Did you forget to import it from 'typing'?
[error] 76-76: Name 'Union' is not defined. Did you forget to import it from 'typing'?
[error] 77-77: Name 'Union' is not defined. Did you forget to import it from 'typing'?
🪛 Ruff (0.14.10)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py
16-19: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
fastembed/image/transform/functional.py
153-153: Undefined name Union
(F821)
195-195: Undefined name Union
(F821)
fastembed/image/transform/operators.py
45-45: Undefined name Union
(F821)
46-46: Undefined name Union
(F821)
76-76: Undefined name Union
(F821)
77-77: Undefined name Union
(F821)
390-392: Avoid specifying long messages outside the exception class
(TRY003)
429-429: Avoid specifying long messages outside the exception class
(TRY003)
432-432: Unused static method argument: config
(ARG004)
fastembed/late_interaction_multimodal/colmodernvbert.py
46-46: Possible hardcoded password assigned to: "QUERY_AUGMENTATION_TOKEN"
(S105)
157-157: Unused method argument: kwargs
(ARG002)
190-190: Unused method argument: kwargs
(ARG002)
197-197: Unused method argument: kwargs
(ARG002)
🔇 Additional comments (19)
fastembed/late_interaction_multimodal/onnx_multimodal_model.py (3)
179-200: Refactored image embedding flow with proper dispatch logic looks good.

The dispatch between nested patches and flat images based on `isinstance(processed[0], list)` is clear and appropriate for distinguishing ColModernVBERT's patch-based processing from standard processors.
239-275: Flat image processing handles both 4D and 5D tensor shapes correctly.

The logic to conditionally add a patch dimension based on the ONNX model signature is sound. The attention mask and metadata shapes are derived consistently from the encoded tensor dimensions.
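The rank-based dispatch underlying both this comment and the `_needs_patch_dimension` check below can be sketched as a simple shape inspection. This is an illustrative stand-in, assuming only that ONNX input metadata reports dynamic axes as strings, so just the rank (number of dimensions) is meaningful:

```python
from typing import Optional, Sequence, Union


def needs_patch_dimension(
    pixel_values_shape: Optional[Sequence[Union[int, str]]],
) -> bool:
    """Sketch: a 5D `pixel_values` input (batch, patches, C, H, W) indicates
    a patch-aware model, while 4D (batch, C, H, W) indicates a standard
    vision encoder. Dynamic ONNX axes appear as strings (e.g. "batch"),
    so only the rank is inspected, never the individual values."""
    if pixel_values_shape is None:
        return False  # model not loaded or input missing
    return len(pixel_values_shape) == 5
```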
277-299: ONNX signature inspection for patch dimension detection is well-implemented.

Checking `len(input_meta.shape) == 5` for the `pixel_values` input is a reliable way to determine model expectations. The fallback to `False` ensures backward compatibility.

tests/test_late_interaction_multimodal.py (4)
24-34: Canonical image values for ColModernVBERT added correctly.

The test data follows the same structure and precision as the existing ColPali values.

49-59: Canonical query values for ColModernVBERT added correctly.

Consistent with the existing test pattern for query embeddings.

115-116: Embedding size test for ColModernVBERT.

The expected embedding size of 128 matches the model description defined in colmodernvbert.py.

130-132: Embedding size property test for ColModernVBERT with lazy_load.

Good addition to ensure the embedding_size property works correctly with lazy loading for the new model.
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (1)
7-7: Import for ColModernVBERT added correctly.

The import follows the existing pattern for ColPali.
fastembed/late_interaction_multimodal/colmodernvbert.py (6)
21-32: Model description for ColModernVBERT is well-defined.

The model registry entry includes all required fields and correctly specifies the HuggingFace source.

261-283: Patch grid computation logic is correct.

The algorithm to compute rows/cols from the patch count handles both square and non-square grids, with an appropriate fallback to unsplit (0, 0) when computation fails.

316-331: Input ID building logic for image prompts is well-structured.

The method correctly dispatches between single and split image prompt strings and tokenizes the expanded prompt.

333-349: Image output post-processing correctly reshapes to expected dimensions.

The reshape using `model_description.dim` ensures consistent embedding output format.

351-421: `embed_text` and `embed_image` methods delegate correctly to the base implementation.

Both methods properly pass all configuration parameters, including providers, device settings, and session options.

423-449: Worker classes correctly instantiate ColModernVBERT with a single thread.

The worker implementations follow the established pattern for parallel processing with `threads=1`.

fastembed/image/transform/operators.py (5)
44-55: LGTM: Clean backward-compatible nested list support.

The modifications to `Normalize` and `Rescale` elegantly handle both flat and nested list structures for patch-based processing while maintaining backward compatibility.

Also applies to: 75-86
124-167: Verify that significant aspect ratio changes are acceptable.

The implementation rounds both width and height up to the nearest multiple of `max_size` independently (lines 149-156), which can significantly alter the aspect ratio. For example, a 1000×700 image with `max_size=512` becomes 1024×1024 (the aspect ratio changes from ~1.43 to 1.0).

While the docstring mentions "approximately" preserving the aspect ratio, please confirm this behavior aligns with the ColModernVBERT model's requirements.
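The rounding in the example above can be re-derived with a tiny sketch (an independent illustration of the arithmetic, not the library code):

```python
import math


def round_up_to_multiple(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple: ceil-divide, then rescale."""
    return math.ceil(value / multiple) * multiple


# Example from the comment: a 1000x700 image with max_size=512.
# Both edges round up to 2 * 512 = 1024, so the aspect ratio collapses to 1.0.
new_w = round_up_to_multiple(1000, 512)
new_h = round_up_to_multiple(700, 512)
```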
170-242: LGTM: Robust image splitting logic.

The `ImageSplitter` implementation correctly handles grid-based patch generation with optimal patch sizing and appropriately includes a global view only when splitting is necessary.
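The splitting behavior described here can be sketched as crop-box arithmetic. This is an illustrative approximation of the grid logic, not the actual `ImageSplitter` code (which operates on (C, H, W) arrays and performs the crops and global-view resize itself):

```python
import math


def patch_boxes(
    height: int, width: int, max_size: int
) -> tuple[list[tuple[int, int, int, int]], bool]:
    """Sketch: compute (top, left, bottom, right) crop boxes for a grid split.

    Returns ([], False) when the image fits within max_size (it would be
    returned unchanged); otherwise returns the grid boxes and True to
    indicate that a global view (the original resized to max_size x
    max_size) would be appended."""
    if height <= max_size and width <= max_size:
        return [], False
    rows = math.ceil(height / max_size)
    cols = math.ceil(width / max_size)
    patch_h = math.ceil(height / rows)
    patch_w = math.ceil(width / cols)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            top, left = r * patch_h, c * patch_w
            boxes.append(
                (top, left, min(top + patch_h, height), min(left + patch_w, width))
            )
    return boxes, True
```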
245-267: LGTM: Consistent return structure with ImageSplitter.

The `SquareResize` class correctly returns `list[list[NumpyArray]]` to maintain API consistency with `ImageSplitter`, enabling seamless switching between splitting modes.

435-455: LGTM: Well-structured conditional image splitting logic.

The `_get_image_splitting` method cleanly orchestrates the choice between patch-based splitting and square resizing based on configuration. The docstring appropriately documents the ordering dependency with `PILtoNDarray`.
Actionable comments posted: 0
🧹 Nitpick comments (1)
fastembed/image/transform/operators.py (1)
170-242: Implicit dependency onResizeForVisionEncoderfor uniform patches.The patch extraction logic produces uniform
max_size × max_sizepatches only when input dimensions are exact multiples ofmax_size—which is guaranteed byResizeForVisionEncoderin the pipeline. IfImageSplitteris used standalone on arbitrary inputs, patches may have varying dimensions.Consider adding a note in the docstring about this prerequisite, or adding a runtime assertion/warning.
🔎 Optional: Add docstring clarification
```diff
 class ImageSplitter(Transform):
     """
     Split images into grid of patches plus a global view.

     If image dimensions exceed max_size:
     - Divide into ceil(H/max_size) x ceil(W/max_size) patches
     - Each patch is cropped from the image
     - Add a global view (original resized to max_size x max_size)

     If image is smaller than max_size:
     - Return single image unchanged

     Works on numpy arrays in (C, H, W) format.
+
+    Note: For uniform max_size x max_size patches, input dimensions should
+    be multiples of max_size. Use ResizeForVisionEncoder before this transform.
     """
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `fastembed/image/transform/functional.py`
- `fastembed/image/transform/operators.py`
🚧 Files skipped from review as they are similar to previous changes (1)
- fastembed/image/transform/functional.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T10:48:30.978Z
Learnt from: joein
Repo: qdrant/fastembed PR: 574
File: fastembed/sparse/sparse_embedding_base.py:2-2
Timestamp: 2025-11-12T10:48:30.978Z
Learning: In fastembed codebase, when using numpy NDArray types in dataclass fields, keep Union syntax instead of PEP 604 pipe operator (|) because dataclasses evaluate annotations at runtime and NDArray types don't support the __or__ operator. Add a comment explaining the constraint.
Applied to files:
fastembed/image/transform/operators.py
🧬 Code graph analysis (1)
fastembed/image/transform/operators.py (2)
fastembed/image/transform/functional.py (9)
- `center_crop` (15-60)
- `convert_to_rgb` (7-12)
- `crop_ndarray` (176-189)
- `normalize` (63-92)
- `pil2ndarray` (118-121)
- `rescale` (114-115)
- `resize` (95-111)
- `resize_longest_edge` (150-173)
- `resize_ndarray` (192-221)

fastembed/common/utils.py (1)

- `normalize` (18-23)
🪛 Ruff (0.14.10)
fastembed/image/transform/operators.py
390-392: Avoid specifying long messages outside the exception class
(TRY003)
429-429: Avoid specifying long messages outside the exception class
(TRY003)
432-432: Unused static method argument: config
(ARG004)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Python 3.13.x on windows-latest test
🔇 Additional comments (11)
fastembed/image/transform/operators.py (11)
1-17: LGTM!

The new imports (`math`, `crop_ndarray`, `resize_longest_edge`, `resize_ndarray`) are correctly added to support the new transform classes.
44-55: LGTM!

The polymorphic handling of flat vs nested image structures is implemented correctly. The `images and isinstance(images[0], list)` check safely handles empty lists via short-circuit evaluation.
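The short-circuit behavior mentioned above is easy to demonstrate in isolation. This snippet only mirrors the quoted check; it is not the library's code.

```python
def is_nested(images: list) -> bool:
    # On an empty list, `images and ...` short-circuits to the empty list
    # (falsy) before images[0] is ever evaluated, so no IndexError is raised.
    return bool(images and isinstance(images[0], list))


print(is_nested([]))          # False (no IndexError)
print(is_nested([[1], [2]]))  # True  (nested: list of lists)
print(is_nested([1, 2]))      # False (flat)
```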
75-86: LGTM!

Consistent with the `Normalize` implementation, correctly handling both flat and nested structures.
109-121: LGTM!

Clean wrapper around the `resize_longest_edge` functional helper, following the established `Transform` pattern.
139-167: LGTM!

The resize logic correctly ensures both dimensions become multiples of `max_size`. The aspect ratio is "approximately" preserved as documented, since independent ceiling operations on each dimension can introduce slight distortion; this is acceptable for vision encoder compatibility.
245-267: LGTM!

The singleton-list output format (`list[list[NumpyArray]]`) correctly matches `ImageSplitter`'s output shape, enabling uniform downstream handling regardless of which path is taken.
306-306: LGTM!

Documentation correctly updated to reflect the new `longest_edge` size key support.
317-317: LGTM!

Correct pipeline ordering: `_get_image_splitting` is called after `_get_pil2ndarray` (requires numpy arrays) and before `_get_rescale`/`_get_normalize`.
386-404: LGTM!

The `Idefics3ImageProcessor` handling follows the established pattern. The resample int-to-enum conversion correctly handles cases where configuration provides the numeric value instead of the enum.
426-427: LGTM!

No-op for center crop is correct; `Idefics3ImageProcessor` uses image splitting instead.
435-455: LGTM!

The conditional transform wiring is correct:

- Splitting enabled: `ResizeForVisionEncoder` → `ImageSplitter` (ensures proper grid extraction)
- Splitting disabled: `SquareResize` (simple square output)

The nested dict access with chained `.get()` calls provides safe fallbacks.
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/test_late_interaction_multimodal.py (1)
92-95: Unused loop variable `name`.

As flagged by the linter, the loop variable `name` is not used within the loop body.

♻️ Suggested fix

```diff
 if is_ci:
-    for name, model in cache.items():
+    for _, model in cache.items():
         delete_model_cache(model.model._model_dir)
     cache.clear()
```

Alternatively, use `.values()`:

```diff
 if is_ci:
-    for name, model in cache.items():
-        delete_model_cache(model.model._model_dir)
+    for model in cache.values():
+        delete_model_cache(model.model._model_dir)
     cache.clear()
```
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `fastembed/image/onnx_image_model.py`
- `fastembed/late_interaction_multimodal/colmodernvbert.py`
- `fastembed/late_interaction_multimodal/onnx_multimodal_model.py`
- `tests/test_late_interaction_multimodal.py`
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_late_interaction_multimodal.py (2)
fastembed/late_interaction_multimodal/late_interaction_multimodal_embedding.py (6)
- `LateInteractionMultimodalEmbedding` (15-189)
- `embed_image` (147-168)
- `embed_text` (124-145)
- `get_embedding_size` (98-122)
- `embedding_size` (91-95)
- `token_count` (170-189)

tests/utils.py (1)

- `delete_model_cache` (11-39)
🪛 Ruff (0.14.10)
tests/test_late_interaction_multimodal.py
93-93: Loop control variable `name` not used within loop body

Rename unused `name` to `_name`
(B007)
fastembed/late_interaction_multimodal/colmodernvbert.py
46-46: Possible hardcoded password assigned to: "QUERY_AUGMENTATION_TOKEN"
(S105)
157-157: Unused method argument: kwargs
(ARG002)
191-191: Unused method argument: kwargs
(ARG002)
202-202: Unused method argument: kwargs
(ARG002)
215-215: Unused method argument: kwargs
(ARG002)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: Python 3.11.x on ubuntu-latest test
- GitHub Check: Python 3.12.x on ubuntu-latest test
- GitHub Check: Python 3.13.x on windows-latest test
- GitHub Check: Python 3.13.x on macos-latest test
- GitHub Check: Python 3.13.x on ubuntu-latest test
- GitHub Check: Python 3.10.x on ubuntu-latest test
🔇 Additional comments (17)
fastembed/image/onnx_image_model.py (1)
78-92: LGTM! Good resource management with ExitStack.

The `ExitStack` properly manages image file handles, ensuring they're closed after processing. The image processing occurs within the context block before the handles are released, which is correct.

tests/test_late_interaction_multimodal.py (1)
151-163: Good test coverage for token counting.

The test properly validates that:

- Individual token counts sum correctly
- Batching doesn't affect total count
- Extension tokens are counted when `include_extension=True`

fastembed/late_interaction_multimodal/onnx_multimodal_model.py (4)
172-202: LGTM! Clean dispatch logic for nested vs flat image processing.

The method correctly:

- Uses `ExitStack` for resource management
- Dispatches based on processor output structure
- Returns a complete `OnnxOutputContext` with attention mask and metadata
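The `ExitStack` pattern referred to here can be illustrated with a minimal standalone sketch (not the library's code): every handle entered on the stack is guaranteed to be closed when the `with` block exits, even if processing raises part-way through.

```python
from contextlib import ExitStack


def read_all(paths: list[str]) -> list[str]:
    # All files are opened on the stack; the stack closes every one of them
    # on exit, whether the block completes normally or raises.
    with ExitStack() as stack:
        handles = [stack.enter_context(open(p)) for p in paths]
        # processing happens here, while all handles are still open
        return [h.read() for h in handles]
```

This is why the review notes that image processing must occur within the context block: after the block exits, the handles are released.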
204-239: Well-structured patch processing with proper attention masking.

The method correctly handles variable patch counts per image by:
- Finding max patches for batch padding
- Zero-padding shorter sequences
- Using attention mask to distinguish real patches from padding
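The three steps above can be sketched in plain Python. This is an illustrative simplification (patches as nested lists rather than numpy arrays), not the library's implementation.

```python
def pad_patches(batch: list[list[list[float]]]) -> tuple[list, list]:
    """Pad every image's patch list to the batch maximum with zero patches,
    and build an attention mask marking real (1) vs padded (0) patches."""
    max_patches = max(len(patches) for patches in batch)  # batch-wide maximum
    patch_dim = len(batch[0][0])
    padded, mask = [], []
    for patches in batch:
        n_real = len(patches)
        n_pad = max_patches - n_real
        # zero-pad shorter sequences up to max_patches
        padded.append(patches + [[0.0] * patch_dim for _ in range(n_pad)])
        mask.append([1] * n_real + [0] * n_pad)
    return padded, mask


# Two images with 3 and 1 patches of dimension 2:
_, mask = pad_patches([[[1, 2], [3, 4], [5, 6]], [[7, 8]]])
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```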
241-277: LGTM! Proper handling of 4D/5D tensor requirements.

The method adapts to model input requirements by checking the ONNX signature and appropriately reshaping tensors for backward compatibility.
279-301: LGTM! Good defensive programming.

The method safely handles uninitialized models and defaults to `False` for backward compatibility.

fastembed/late_interaction_multimodal/colmodernvbert.py (11)
35-46: LGTM! Clean class structure.

The class constants are appropriately defined. The static analysis warning about `QUERY_AUGMENTATION_TOKEN` being a "hardcoded password" is a false positive; it's clearly a prompt formatting token, not a credential.
48-114: LGTM! Comprehensive initialization with good device handling.

The constructor properly handles:
- Lazy loading for parallel processing scenarios
- Device ID fallback (explicit → first from list → None)
- Model download and configuration setup
191-195: LGTM! Query augmentation follows colpali-engine pattern.

The 10 augmentation tokens match the upstream `process_queries` logic as noted in the comment. The unused `**kwargs` parameter maintains API consistency with the base class.
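The colpali-style augmentation amounts to appending a fixed number of augmentation tokens to the query. The sketch below is hypothetical: `"<MASK>"` is a placeholder, not the actual value of `QUERY_AUGMENTATION_TOKEN`, and the real implementation lives in the model class.

```python
def augment_query(query: str, augmentation_token: str = "<MASK>", n: int = 10) -> str:
    # Append n copies of the augmentation token; these extra positions give
    # the late-interaction model room to expand the query representation.
    return query + augmentation_token * n


q = augment_query("what is shown in the figure?")
print(q.count("<MASK>"))  # 10
```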
197-212: LGTM! Clean token counting implementation.

The method properly:
- Ensures the model/tokenizer is loaded before counting
- Supports both single strings and iterables
- Uses appropriate tokenization based on the `include_extension` flag
214-277: LGTM! Well-implemented dynamic input construction.

The method correctly:
- Counts real patches from attention mask
- Computes grid dimensions from patch count
- Builds input_ids with appropriate padding direction from tokenizer config
279-302: LGTM! Correct grid dimension computation.

The method properly handles:
- Single/no patches (returns 0, 0)
- Square and near-square grids
- Non-square grid fallback search
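The cases above can be sketched as a small standalone function. This is a hypothetical reconstruction of the behavior the review describes (square, near-square, and factor-search fallback), not the actual method from `colmodernvbert.py`.

```python
import math


def grid_dims(n_patches: int) -> tuple[int, int]:
    """Recover (rows, cols) such that rows * cols == n_patches."""
    if n_patches <= 1:
        return (0, 0)  # single/no patches: no split grid
    root = math.isqrt(n_patches)
    if root * root == n_patches:
        return (root, root)  # perfect square grid
    if root * (root + 1) == n_patches:
        return (root, root + 1)  # near-square grid
    # fallback: search for the closest factor pair
    for rows in range(root, 0, -1):
        if n_patches % rows == 0:
            return (rows, n_patches // rows)
    return (1, n_patches)


print(grid_dims(9))   # (3, 3)
print(grid_dims(12))  # (3, 4)
```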
304-351: LGTM! Clean prompt construction following Idefics3 format.

The methods properly construct prompt strings with appropriate special tokens for both single and split image scenarios.
352-368: LGTM! Correct output reshaping for late-interaction format.

The method properly reshapes model output to the expected `(batch, n, dim)` format for late-interaction retrieval.
370-440: LGTM! Clean delegation to base class methods.

Both embedding methods properly forward all configuration options to the underlying implementation.
451-467: LGTM! Standard worker implementations.

The workers correctly initialize the model with `threads=1`, which is appropriate for parallel processing where each worker handles a subset of the workload.
136-155: Add error handling for JSON config file loading.

Lines 137-154 open three JSON config files without error handling. While these files are guaranteed to exist (config.json and preprocessor_config.json are part of HuggingFace's standard model files, and processor_config.json is explicitly listed in additional_files), adding try/except blocks would improve debuggability if a file is corrupted or contains invalid JSON.
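The hardening suggested above could look like the following. This is a sketch of the pattern, not a proposed patch against the PR's actual code; the helper name is hypothetical.

```python
import json
from pathlib import Path


def load_config(path: Path) -> dict:
    # Wrap the bare json.load in error handling so a corrupted or missing
    # file produces an actionable message instead of a raw traceback.
    try:
        with open(path) as f:
            return json.load(f)
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON in {path}: {e}") from e
    except OSError as e:
        raise ValueError(f"Could not read config file {path}: {e}") from e
```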
All Submissions:
New Feature Submissions:
Have you installed `pre-commit` with `pip3 install pre-commit` and set up hooks with `pre-commit install`?

New models submission: