
Conversation


@ngxson ngxson commented Dec 22, 2025

Mirror from upstream PR: ggml-org#18256

Summary by CodeRabbit

  • New Features
    • Added Gemma3n multimodal vision support with a MobileNetV5-based vision encoder and GEMMA3N projector.
    • Expanded tensor/name mapping and model export surface to expose Gemma3n vision tensors and aliases for easier model selection.
    • Improved preprocessing, embedding sizing/padding, and conversion flows to support Gemma3n vision models and inference paths.


…ert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder
…struct and definitions to mobilenetv5.cpp

2. Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size

coderabbitai bot commented Dec 22, 2025

📝 Walkthrough

Adds end-to-end Gemma3n vision (MobileNetV5) support: new Python converter variant and model alias, GGUF constants and tensor mappings for Gemma3n vision, and a full C++ MobileNetV5 graph, loader, and inference-path integration in the mtmd toolchain.

Changes

  • Python conversion & model aliases (convert_hf_to_gguf.py): Add Gemma3nVisionModel with CNN-block double-indexed mapping (block_tensor_mapping / custom_map), overrides for find_hparam/find_vparam, vision-aware set_gguf_parameters, tensor_force_quant, modify_tensors, temporary vocab-size handling, and register aliases (Gemma3nForCausalLM / Gemma3nForConditionalGeneration).
  • GGUF constants & tensor mappings (gguf-py/gguf/constants.py, gguf-py/gguf/tensor_mapping.py): Add VISION_PROJECTOR_TYPE.GEMMA3N, many new MODEL_TENSOR entries and public name mappings for Gemma3n vision tensors (e.g., V_MM_EMBEDDING, V_ENC_CONV_STEM, V_ENC_MSFA_*), and extend MODEL_ARCH/MODEL_TENSORS and TensorNameMap.mappings_cfg for MobileNetV5/Gemma3n.
  • C++ mobilenetv5 graph & builders (tools/mtmd/models/mobilenetv5.cpp, tools/mtmd/models/models.h): Add clip_graph_mobilenetv5 with helpers rms_norm_2d, pad_same_2d, build_edge_residual, build_inverted_residual, build_mobilenet_attn, and build() assembling stem, stages, MSFA, and Gemma3n-style embedding projection.
  • mtmd integration & headers (tools/mtmd/clip-impl.h, tools/mtmd/clip-model.h, tools/mtmd/clip.cpp): Add MobileNetV5 tensor-name macros and PROJECTOR_TYPE_GEMMA3N; extend clip_model with mobilenetv5_block and mobilenet state tensors (stem, stages, MSFA), and integrate GEMMA3N branches across model loading, graph selection, preprocessing, token sizing, and batch encoding.
  • Build config & runtime per-layer input (tools/mtmd/CMakeLists.txt, src/models/gemma3n-iswa.cpp, tools/mtmd/mtmd.cpp): Add models/mobilenetv5.cpp to the mtmd build; implement the vision per-layer input path in get_per_layer_inputs (dequantize/pad embedding column, reshape), adjust project_per_layer_inputs ordering, and align GEMMA3N image token / decoding handling with GEMMA3 logic.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant HF as HuggingFace Model
    participant Converter as Python Converter
    participant GGUF as GGUF (constants & mappings)
    participant Loader as mtmd Loader (C++)
    participant Graph as clip_graph_mobilenetv5
    participant Runtime as Inference Runtime

    HF->>Converter: export tensors & hparams
    Converter->>GGUF: apply custom_map / block_tensor_mapping, set GGUF params
    Converter-->>Loader: write GGUF with Gemma3n vision tensors
    Loader->>Graph: select clip_graph_mobilenetv5 (PROJECTOR_TYPE_GEMMA3N)
    Loader->>Graph: load mobilenetv5 tensors (stem, blocks, MSFA)
    Graph->>Runtime: build graph (stem → stages → MSFA → embed proj)
    Runtime->>Runtime: preprocess image → encode tokens → run graph → produce vision tokens

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

"I hopped through tensors, mapped each seam,
I stitched a vision path from code and dream.
MobileNet sprouts a Gemma3n bright,
Pixels weave tokens in soft daylight.
The rabbit smiles — new pathways take flight."

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 32.35%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title accurately describes the main addition: Gemma3n multimodal support with MobileNetV5 vision encoder. It is specific, concise, and directly reflects the primary change in the changeset.
  • Description check: ✅ Passed. The description is minimal but adequate for a mirror PR. It correctly references the upstream source (PR ggml-org#18256) and concisely states the purpose. However, it lacks detail about the changes, objectives, and testing considerations that would be expected in a comprehensive PR description following typical contributing guidelines.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@ngxson ngxson changed the title Add Gemma3n multimodal support with MobileNetV5 vision encoder [Mirror] Add Gemma3n multimodal support with MobileNetV5 vision encoder Dec 22, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (6)
tools/mtmd/models/models.h (1)

69-101: LGTM - MobileNetV5 graph builder declaration follows established patterns.

The clip_graph_mobilenetv5 struct correctly follows the existing pattern of other graph implementations in this file. The helper method declarations align with their implementations in mobilenetv5.cpp (per the code snippets).

The mobilenetv5_block type reference should resolve correctly via the include chain (clip-graph.h → clip-model.h).

Optional: Consider making helper methods private

The helper methods (rms_norm_2d, pad_same_2d, build_edge_residual, build_inverted_residual, build_mobilenet_attn) are implementation details that could be declared as private. However, since other graph implementations in this file also use public methods, this is optional and maintaining consistency with the existing codebase pattern is reasonable.
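
If that route is taken, a minimal sketch of what the declaration could look like follows; the parameter lists are simplified assumptions based only on the helper names above, not the actual signatures in models.h.

```cpp
// Sketch only: keep build() public, hide the helpers.
// Signatures are illustrative assumptions, not the real ones from models.h.
struct ggml_tensor;
struct ggml_cgraph;

struct clip_graph_mobilenetv5_sketch {
    ggml_cgraph * build();

private:
    ggml_tensor * rms_norm_2d(ggml_tensor * x, ggml_tensor * w);
    ggml_tensor * pad_same_2d(ggml_tensor * x, int kw, int kh, int sw, int sh);
    ggml_tensor * build_edge_residual    (ggml_tensor * x /* + block weights */);
    ggml_tensor * build_inverted_residual(ggml_tensor * x /* + block weights */);
    ggml_tensor * build_mobilenet_attn   (ggml_tensor * x /* + block weights */);
};
```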

tools/mtmd/clip-model.h (1)

330-345: Appropriate additions for Gemma3n MobileNetV5 encoder.

The additions to clip_model are well-structured:

  • MobileNetV5 components use std::vector for flexible block management
  • MSFA (Multi-Scale Fusion Adapter) components are properly prefixed and organized
  • Naming conventions are consistent with existing fields

Note: There's an extra blank line at line 346, which may be intentional for readability but could be removed for consistency.

tools/mtmd/clip.cpp (2)

1619-1622: Use tensor name macros instead of hard-coded strings.

For consistency with the rest of the codebase and maintainability, consider defining macros for these tensor names in clip-impl.h:

-                    model.mm_0_w = get_tensor("mm.embedding.weight", false);  // Input embedding
-                    model.mm_1_w = get_tensor("mm.hard_emb_norm.weight", false);  // Hard embedding norm
+                    model.mm_0_w = get_tensor(TN_MM_EMBEDDING, false);  // Input embedding
+                    model.mm_1_w = get_tensor(TN_MM_HARD_EMB_NORM, false);  // Hard embedding norm

This would require adding corresponding macros to clip-impl.h.
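
For illustration, the added macros could look roughly like this, using the names from the suggested diff above; the exact placement in clip-impl.h is an assumption.

```cpp
// Hypothetical additions to clip-impl.h; macro names mirror the diff above,
// the string values are the tensor names currently hard-coded in clip.cpp.
#define TN_MM_EMBEDDING     "mm.embedding.weight"      // Gemma3n input embedding
#define TN_MM_HARD_EMB_NORM "mm.hard_emb_norm.weight"  // Gemma3n hard embedding norm
```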


1528-1623: Consider adding bounds checking for stage index access.

The dynamic block discovery loop iterates over 4 stages (0-3), but the mobilenet_stage_ends vector is populated based on actual blocks found. When accessing model.mobilenet_stage_ends[2] and [3] later in mobilenetv5.cpp::is_fusion_point(), ensure bounds checks are in place to avoid undefined behavior if fewer than 4 stages contain blocks.

The current check at line 284-288 (if (model.mobilenet_stage_ends.size() >= 4)) in mobilenetv5.cpp handles this, but consider adding a validation after loading:

if (model.mobilenet_stage_ends.size() < 4) {
    LOG_WRN("%s: GEMMA3N expected 4 stages but found %zu\n", 
            __func__, model.mobilenet_stage_ends.size());
}
tools/mtmd/models/mobilenetv5.cpp (2)

329-336: Remove or use the commented-out variable scale_h.

The variable scale_h is declared but commented out. Either use it for validation or remove it to avoid confusion:

                 int scale_w = high_res_w / feat_w;
-                // int scale_h = high_res_h / feat_h;

If height scaling should be validated separately:

int scale_h = high_res_h / feat_h;
if (scale_w != scale_h) {
    LOG_WRN("%s: non-uniform scaling in MSFA (scale_w=%d, scale_h=%d)\n", 
            __func__, scale_w, scale_h);
}

381-381: Consider making target output resolution configurable.

The target output resolution is hard-coded as 16:

const int target_out_res = 16;

Consider making this a model hyperparameter or deriving it from the model configuration to improve flexibility for future MobileNetV5 variants.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f48807 and 86618c7.

📒 Files selected for processing (11)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • src/models/gemma3n-iswa.cpp
  • tools/mtmd/CMakeLists.txt
  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/mobilenetv5.cpp
  • tools/mtmd/models/models.h
  • tools/mtmd/mtmd.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/mtmd/models/models.h
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/mobilenetv5.cpp
  • src/models/gemma3n-iswa.cpp
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/clip-impl.h
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/mtmd/models/models.h
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/mobilenetv5.cpp
  • gguf-py/gguf/tensor_mapping.py
  • src/models/gemma3n-iswa.cpp
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/clip-impl.h
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Always activate the Python virtual environment in .venv and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in .flake8
Ensure Python code passes pyright type checking as configured in pyrightconfig.json

Files:

  • gguf-py/gguf/tensor_mapping.py
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
src/**/*.cpp

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Prioritize performance optimization in core library implementations in src/, as this is a performance-critical inference library

Files:

  • src/models/gemma3n-iswa.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/clip-impl.h
  • gguf-py/gguf/constants.py
🧬 Code graph analysis (6)
tools/mtmd/models/models.h (1)
tools/mtmd/models/mobilenetv5.cpp (12)
  • build (252-463)
  • build (252-252)
  • rms_norm_2d (5-20)
  • rms_norm_2d (5-5)
  • pad_same_2d (23-53)
  • pad_same_2d (23-23)
  • build_edge_residual (57-88)
  • build_edge_residual (57-57)
  • build_inverted_residual (90-151)
  • build_inverted_residual (90-90)
  • build_mobilenet_attn (154-250)
  • build_mobilenet_attn (154-154)
tools/mtmd/clip.cpp (3)
common/common.cpp (4)
  • model (1159-1161)
  • model (1159-1159)
  • string_format (399-412)
  • string_format (399-399)
src/llama-model.cpp (2)
  • get_tensor (7044-7054)
  • get_tensor (7044-7044)
tools/server/server-context.cpp (2)
  • params (607-853)
  • params (607-607)
tools/mtmd/models/mobilenetv5.cpp (2)
ggml/src/ggml.c (13)
  • ggml_permute (3700-3752)
  • ggml_rms_norm (3066-3071)
  • ggml_pad_ext (4983-5016)
  • ggml_conv_2d_direct (4702-4736)
  • ggml_gelu (2677-2681)
  • ggml_conv_2d_dw (4637-4658)
  • ggml_reshape_4d (3583-3601)
  • ggml_reshape_3d (3564-3581)
  • ggml_scale (3290-3295)
  • ggml_soft_max (3966-3970)
  • ggml_upscale (4928-4935)
  • ggml_concat (2517-2544)
  • ggml_pool_2d (4852-4878)
tools/mtmd/clip.cpp (9)
  • build_inp_raw (469-474)
  • build_inp_raw (469-469)
  • model (217-219)
  • model (935-1261)
  • model (935-935)
  • model (2038-2051)
  • model (2038-2038)
  • s (2446-2448)
  • s (2446-2446)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
  • MODEL_TENSOR (465-736)
tools/mtmd/mtmd.cpp (1)
tools/mtmd/clip.cpp (4)
  • ctx (2490-2593)
  • ctx (2490-2490)
  • clip_get_projector_type (3737-3739)
  • clip_get_projector_type (3737-3737)
convert_hf_to_gguf.py (2)
gguf-py/gguf/gguf_writer.py (2)
  • add_clip_projector_type (1076-1077)
  • add_vision_attention_layernorm_eps (1097-1098)
gguf-py/gguf/constants.py (2)
  • VisionProjectorType (3408-3429)
  • GGMLQuantizationType (3283-3315)
🪛 GitHub Actions: flake8 Lint
convert_hf_to_gguf.py

[error] 5974-5974: flake8: E302 expected 2 blank lines, found 1. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8


[error] 6010-6010: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8


[error] 6011-6011: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8


[error] 6045-6045: flake8: W504 line break after binary operator. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8

🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

5969-5969: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: android-ndk-build (arm64-snapdragon, --preset arm64-android-snapdragon-release)
  • GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
  • GitHub Check: android-ndk-build (arm64-cpu, -D ANDROID_ABI=arm64-v8a -D ANDROID_PLATFORM=android-31 -D CMAKE_TO...
  • GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
  • GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
  • GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
  • GitHub Check: ubuntu-latest-cmake-cuda
  • GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
  • GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
  • GitHub Check: android-build
  • GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
  • GitHub Check: ubuntu-latest-llguidance
  • GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
  • GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
  • GitHub Check: macOS-latest-cmake-arm64
  • GitHub Check: ubuntu-latest-cmake-rpc
  • GitHub Check: pyright type-check
🔇 Additional comments (20)
src/models/gemma3n-iswa.cpp (1)

262-306: Vision input path implementation looks correct, but verify commented-out scaling.

The implementation correctly broadcasts the padding token embedding (token ID 0) across all vision tokens, matching the PyTorch behavior described in comments. The approach of creating zeros via ggml_scale(..., 0.0f) is valid.

A few observations:

  1. Line 305: The sqrtf((float) n_embd_altup) scaling is commented out for vision inputs, while it's applied for text inputs (line 259). Please confirm this difference is intentional per the model specification.

  2. The inp->embd tensor is allocated and marked as input (lines 270-271) but isn't used in the subsequent computation - the zeros are created from per_layer_model_proj projection instead. This appears intentional as the embeddings will be fed separately, but worth confirming the graph input wiring is correct.
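
As a side note on the zeros-via-ggml_scale approach mentioned above, the trick can be shown in isolation. The following is a standalone sketch against the public ggml API (the tensor shape is made up for the example); it is not the actual gemma3n-iswa.cpp code and only builds the operation without executing it.

```cpp
// Minimal sketch: multiplying an existing tensor by 0.0f yields a
// same-shaped zero tensor inside the graph, which can stand in for the
// per-layer vision inputs described above.
#include "ggml.h"

int main() {
    ggml_init_params params = { /*mem_size  =*/ 16 * 1024 * 1024,
                                /*mem_buffer=*/ nullptr,
                                /*no_alloc  =*/ false };
    ggml_context * ctx = ggml_init(params);

    // stand-in for a projected per-layer embedding [n_embd_altup, n_tokens]
    ggml_tensor * proj  = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 256, 16);

    // same shape, all zeros once the graph is evaluated
    ggml_tensor * zeros = ggml_scale(ctx, proj, 0.0f);

    GGML_ASSERT(zeros->ne[0] == proj->ne[0] && zeros->ne[1] == proj->ne[1]);

    ggml_free(ctx);
    return 0;
}
```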

tools/mtmd/mtmd.cpp (2)

269-272: LGTM - GEMMA3N correctly inherits GEMMA3's image token handling.

The addition of PROJECTOR_TYPE_GEMMA3N alongside PROJECTOR_TYPE_GEMMA3 correctly sets up the same <start_of_image> and <end_of_image> tokens for the Gemma3n vision path.


861-866: LGTM - Non-causal decode handling extended to GEMMA3N.

The logic correctly includes PROJECTOR_TYPE_GEMMA3N in the non-causal decoding path, maintaining parity with GEMMA3.

Minor observation: Consider extracting the repeated clip_get_projector_type(ctx->ctx_v) call to a local variable for readability, though this is optional given the function is lightweight.
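
If that extraction is done, it could be as small as the snippet below (variable names are illustrative, not taken from the file):

```cpp
// Sketch of the suggested local-variable extraction.
const auto proj_type   = clip_get_projector_type(ctx->ctx_v);
const bool gemma3_like = proj_type == PROJECTOR_TYPE_GEMMA3 ||
                         proj_type == PROJECTOR_TYPE_GEMMA3N;
```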

tools/mtmd/CMakeLists.txt (1)

30-30: LGTM - MobileNetV5 source file added to build.

The new models/mobilenetv5.cpp is correctly included in the mtmd library sources. This enables the MobileNetV5-based graph construction for Gemma3n vision support.

gguf-py/gguf/tensor_mapping.py (1)

122-142: LGTM - New Gemma3n vision tensor mappings added.

The new tensor mappings for MobileNetV5-based Gemma3n vision support are correctly structured and follow the existing pattern. The mappings align with the MODEL_TENSOR enums defined in constants.py.

Note: The comments label these as "gemma3n", which is accurate for V_MM_EMBEDDING, V_MM_HARD_EMB_NORM, and V_MM_POST_PROJ_NORM. For V_MM_INP_PROJ and V_MM_SOFT_EMB_NORM, the constants.py comments indicate "gemma3" but this appears to be reusing existing tensor types with new HuggingFace tensor name mappings for gemma3n, which is a valid pattern.

tools/mtmd/clip-model.h (1)

175-212: Well-structured MobileNetV5 block definition.

The mobilenetv5_block struct is comprehensive and well-organized, covering all necessary components for the Gemma3n vision encoder:

  • Stage 0 (Edge Residual) and Stage 1+ (Universal Inverted Residual) convolutions with batch normalization
  • Multi-Query Attention (MQA) components with optional downsampling
  • Layer scale and block normalization

The struct follows the existing naming conventions and patterns in the file.

gguf-py/gguf/constants.py (3)

670-672: LGTM: New gemma3n tensor types properly defined.

The three new tensor types (V_MM_EMBEDDING, V_MM_HARD_EMB_NORM, V_MM_POST_PROJ_NORM) are:

  • Properly prefixed with V_MM_ following the existing naming convention
  • Clearly documented as gemma3n-specific
  • Correctly placed within the MODEL_TENSOR enum

1065-1067: Correct tensor mappings and MMPROJ integration.

The tensor name mappings and MMPROJ architecture additions are properly implemented:

  • String names follow the mm.* convention used for multimodal tensors
  • Tensors are correctly added to MODEL_TENSORS[MODEL_ARCH.MMPROJ] list
  • Consistent with existing patterns in the file

Also applies to: 1166-1168


1947-1981: Complete GEMMA3N architecture tensor list.

The MODEL_ARCH.GEMMA3N tensor list is comprehensive and well-organized:

  • Includes all standard Gemma3 tensors (token embedding, attention, FFN)
  • Properly extends with gemma3n-specific components:
    • Per-layer tensors (PER_LAYER_TOKEN_EMBD, PER_LAYER_MODEL_PROJ, etc.)
    • Altup tensors for alternative upsampling/routing
    • Laurel tensors for layer-wise processing
  • Comments clearly indicate the purpose of specialized tensor groups

This ensures proper serialization and deserialization of Gemma3n models.

tools/mtmd/clip.cpp (4)

3128-3133: Potential issue with n_patches calculation for GEMMA3N.

The calculation n_patches = image_size / patch_size computes a single dimension (e.g., 16 for 256/16), but n_patches typically represents the total number of patches (i.e., (image_size / patch_size)^2). The comment says "MSFA adapter always outputs fixed 16x16 resolution", which suggests the result should be 256, not 16.

If the output is indeed 16x16, the calculation should be:

-            n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+            int n_per_side = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+            n_patches = n_per_side * n_per_side;

If this is intentional (e.g., the MSFA outputs a single row of 16 tokens), please add a clarifying comment.


791-794: LGTM: GEMMA3N graph builder integration.

The new case for PROJECTOR_TYPE_GEMMA3N correctly uses the dedicated clip_graph_mobilenetv5 builder, consistent with how other projector types are handled.


1148-1155: LGTM: GEMMA3N hparams configuration.

The default n_merge = 1 with an optional override via KEY_PROJ_SCALE_FACTOR follows the established pattern for other projector types.


2859-2867: LGTM: GEMMA3N preprocessing.

The preprocessing correctly resizes the image to a square using bilinear interpolation without padding (false parameter), then normalizes using the configured mean/std values, matching the expected MobileNetV5 input format.

tools/mtmd/models/mobilenetv5.cpp (4)

5-20: LGTM: RMS Norm 2D helper implementation.

The rms_norm_2d helper correctly permutes the tensor to normalize over channels for each spatial position, applies the standard RMSNorm operation, and optionally applies the learned weight before permuting back. The use of ggml_cont after permute ensures the tensor is contiguous for subsequent operations.


22-53: LGTM: SAME padding implementation.

The pad_same_2d helper correctly implements TensorFlow/PyTorch-style asymmetric SAME padding. The ceiling division for output size and the asymmetric split of padding (bottom/right gets the extra pixel) matches the expected behavior.
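
To make the arithmetic concrete, here is a small self-contained sketch of the per-dimension SAME-padding computation described above, independent of ggml; the function name is made up for the example.

```cpp
// Asymmetric SAME padding: output size uses ceiling division and the
// "extra" pixel goes to the bottom/right, matching TF/PyTorch semantics.
#include <cstdio>

static void same_pad_1d(int in, int kernel, int stride, int dilation,
                        int & pad_lo, int & pad_hi) {
    int out   = (in + stride - 1) / stride;                          // ceil(in / stride)
    int total = (out - 1) * stride + (kernel - 1) * dilation + 1 - in;
    if (total < 0) total = 0;
    pad_lo = total / 2;        // top/left
    pad_hi = total - pad_lo;   // bottom/right gets the extra pixel
}

int main() {
    int pt, pb;
    same_pad_1d(/*in=*/8, /*kernel=*/3, /*stride=*/2, /*dilation=*/1, pt, pb);
    std::printf("pad_top=%d pad_bottom=%d\n", pt, pb);  // pad_top=0 pad_bottom=1
    return 0;
}
```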


153-250: LGTM: Attention block implementation.

The build_mobilenet_attn function correctly implements multi-query attention with:

  • Optional input normalization
  • Downsampled K/V paths using depthwise convolutions
  • Proper Q/K/V reshaping and permutation for attention
  • Scaled dot-product attention with softmax
  • Output projection with optional layer scaling and residual connection

403-463: LGTM: Gemma3n multimodal projection.

The embedding/projection logic correctly:

  1. Permutes and flattens spatial dimensions to sequence format
  2. Applies feature scaling by sqrt(hidden_size)
  3. Applies soft embedding normalization with optional learned weight
  4. Projects to text hidden size via linear layer
  5. Applies post-projection RMSNorm

This matches the expected Gemma3n vision embedder architecture.
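
A rough ggml-level sketch of that five-step sequence is below, assuming a [W, H, C, 1] feature map; the tensor and parameter names are assumptions for illustration and this is not the actual mobilenetv5.cpp implementation.

```cpp
// Sketch of the Gemma3n-style embedding projection: spatial -> sequence,
// sqrt(hidden) scaling, soft-embedding RMSNorm, linear projection,
// post-projection RMSNorm. Names and shapes are illustrative only.
#include "ggml.h"
#include <cmath>

static ggml_tensor * gemma3n_project_sketch(
        ggml_context * ctx,
        ggml_tensor  * feat,           // [W, H, C, 1] vision features
        ggml_tensor  * soft_emb_norm,  // [C] RMSNorm weight (may be nullptr)
        ggml_tensor  * inp_proj,       // [C, n_embd_text] projection weight
        ggml_tensor  * post_proj_norm, // [n_embd_text] RMSNorm weight (may be nullptr)
        float          eps) {
    const int64_t W = feat->ne[0], H = feat->ne[1], C = feat->ne[2];

    // 1) spatial -> sequence: [W,H,C,1] -> [C,H,W,1] -> [C, H*W]
    ggml_tensor * cur = ggml_cont(ctx, ggml_permute(ctx, feat, 2, 1, 0, 3));
    cur = ggml_reshape_2d(ctx, cur, C, H * W);

    // 2) feature scaling by sqrt(hidden_size)
    cur = ggml_scale(ctx, cur, std::sqrt((float) C));

    // 3) soft embedding norm (RMSNorm over channels)
    cur = ggml_rms_norm(ctx, cur, eps);
    if (soft_emb_norm) cur = ggml_mul(ctx, cur, soft_emb_norm);

    // 4) linear projection to the text hidden size
    cur = ggml_mul_mat(ctx, inp_proj, cur);

    // 5) post-projection RMSNorm
    cur = ggml_rms_norm(ctx, cur, eps);
    if (post_proj_norm) cur = ggml_mul(ctx, cur, post_proj_norm);
    return cur;
}
```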

tools/mtmd/clip-impl.h (2)

156-195: LGTM: MobileNetV5 tensor name macros.

The tensor name macros are well-organized by component (stem, edge residual, inverted residual, attention, MSFA) and follow the established naming conventions. The %d.%d format for stage/block indexing aligns with the dynamic loading logic in clip.cpp.


214-214: LGTM: GEMMA3N projector type registration.

The new PROJECTOR_TYPE_GEMMA3N enum value and its string mapping "gemma3n" are correctly placed and follow the existing pattern.

Also applies to: 245-245

convert_hf_to_gguf.py (1)

522-527: Robust handling of empty tensor_map.mapping for block_count=0 looks good

Using a guarded branch for max_name_len avoids ValueError when gguf.get_tensor_name_map(..., block_count=0) produces an empty mapping (e.g., MobileNetV5-based encoders) and only changes log formatting width. No further changes needed here.

Comment on lines +6008 to +6012
# Image sequence length (256 tokens = 16x16 for Gemma3n)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
self.hparams_vision["patch_size"] = image_size // image_seq_length


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Read the convert_hf_to_gguf.py file around lines 6008-6012
find . -name "convert_hf_to_gguf.py" -type f | head -1 | xargs -I {} sh -c 'wc -l {} && sed -n "6000,6020p" {} | cat -n'

Repository: ngxson/llama.cpp

Length of output: 1253


Patch size computation for Gemma3n MobileNetV5 is off by a factor of √image_seq_length

The code currently computes patch_size = 768 // 256 = 3, which results in n_per_side = 256 and tokens = 65,536—wildly inconsistent with the 16×16 grid (256 tokens) described in the comment.

The issue is treating image_seq_length (total token count) as if it were patch size. Gemma 3n uses MobileNet v5 with a default image resolution of 768×768 pixels, and image_seq_length defaults to 256.

The correct approach derives patch size from the per-side token count:

n_per_side = int(image_seq_length ** 0.5)  # 16 for 256 tokens
patch_size = image_size // n_per_side      # 768 // 16 = 48

This yields n_per_side = 16 and tokens = 256, matching the expected grid layout and HF processor config.

🧰 Tools
🪛 GitHub Actions: flake8 Lint

[error] 6010-6010: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8


[error] 6011-6011: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8

🤖 Prompt for AI Agents
In convert_hf_to_gguf.py around lines 6008 to 6012, the code treats
image_seq_length as a per-side patch count which yields an incorrect tiny
patch_size; instead compute n_per_side = int(image_seq_length ** 0.5) to get the
number of patches per side (e.g. 16 for 256 tokens) and then set patch_size =
image_size // n_per_side (e.g. 768 // 16 = 48), and assign that patch_size back
into self.hparams_vision so the tokens and grid (n_per_side²) match the HF
processor config.

RESAMPLER = auto()
GLM_EDGE = auto()
MERGER = auto()
GEMMA3N = auto()

⚠️ Potential issue | 🟡 Minor

Add GEMMA3N mapping to VISION_PROJECTOR_TYPE_NAMES.

The GEMMA3N entry was added to VISION_PROJECTOR_TYPE enum but is missing from the VISION_PROJECTOR_TYPE_NAMES dictionary at lines 850-858. This mapping is used to convert the enum value to its string representation.

🔎 Proposed fix

Add the mapping to the VISION_PROJECTOR_TYPE_NAMES dictionary:

 VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
     VISION_PROJECTOR_TYPE.MLP:       "mlp",
     VISION_PROJECTOR_TYPE.LDP:       "ldp",
     VISION_PROJECTOR_TYPE.LDPV2:     "ldpv2",
     VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
     VISION_PROJECTOR_TYPE.GLM_EDGE:  "adapter",
     VISION_PROJECTOR_TYPE.MERGER:    "qwen2vl_merger",
     VISION_PROJECTOR_TYPE.GEMMA3:    "gemma3",
+    VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",
+    VISION_PROJECTOR_TYPE.QWEN3VL:   "qwen3vl_merger",
+    VISION_PROJECTOR_TYPE.COGVLM:    "cogvlm",
 }

Note: QWEN3VL and COGVLM are also missing from this dictionary.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
GEMMA3N = auto()
VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
VISION_PROJECTOR_TYPE.MLP: "mlp",
VISION_PROJECTOR_TYPE.LDP: "ldp",
VISION_PROJECTOR_TYPE.LDPV2: "ldpv2",
VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
VISION_PROJECTOR_TYPE.GLM_EDGE: "adapter",
VISION_PROJECTOR_TYPE.MERGER: "qwen2vl_merger",
VISION_PROJECTOR_TYPE.GEMMA3: "gemma3",
VISION_PROJECTOR_TYPE.GEMMA3N: "gemma3n",
VISION_PROJECTOR_TYPE.QWEN3VL: "qwen3vl_merger",
VISION_PROJECTOR_TYPE.COGVLM: "cogvlm",
}
🤖 Prompt for AI Agents
In gguf-py/gguf/constants.py around line 459 (GEMMA3N added to
VISION_PROJECTOR_TYPE) and the VISION_PROJECTOR_TYPE_NAMES dictionary at lines
~850-858, the GEMMA3N enum value is not mapped to its string representation; add
an entry mapping VISION_PROJECTOR_TYPE.GEMMA3N to "gemma3n" in that dictionary.
Also add missing mappings for QWEN3VL and COGVLM (e.g.,
VISION_PROJECTOR_TYPE.QWEN3VL -> "qwen3vl_merger" and VISION_PROJECTOR_TYPE.COGVLM ->
"cogvlm") so all enum members have corresponding string names. Ensure
formatting/commas match the surrounding dictionary entries.

@ngxson

This comment was marked as resolved.


coderabbitai bot commented Dec 22, 2025


🧩 Analysis chain

@ngxson, I'll save these preferences for mirrored PRs and apply them to this review. Let me fetch the upstream PR details and conduct a thorough review.


🏁 Script executed:

#!/bin/bash
# Fetch the upstream PR details
gh pr view 18256 --repo ggml-org/llama.cpp --json title,body,author

Length of output: 3458


🏁 Script executed:

#!/bin/bash
# Search for suspicious AI-generated code comment patterns
echo "=== Searching for suspicious AI-generated comments ==="
rg -i "CRITICAL:" --type py --type cpp --type c --type h || echo "No CRITICAL comments found"
rg -i "you are correct" --type py --type cpp --type c --type h || echo "No 'you are correct' comments found"
rg -i "TODO.*AI" --type py --type cpp --type c --type h || echo "No AI TODOs found"
rg -i "FIXME.*AI" --type py --type cpp --type c --type h || echo "No AI FIXMEs found"
rg -i "placeholder|stub|temporary" --type py --type cpp --type c --type h -C1 | head -50 || echo "No placeholder/stub comments found"

echo ""
echo "=== Checking for incomplete/commented code ==="
rg "//.*TODO|//.*FIXME|//.*XXX|//.*HACK" --type cpp --type c --type h -C1 | head -30 || echo "No flagged comments found"

Length of output: 10158


🏁 Script executed:

#!/bin/bash
# Review key implementation files for obvious issues

echo "=== Checking convert_hf_to_gguf.py Gemma3nVisionModel ==="
# Look at the Gemma3nVisionModel implementation
ast-grep --pattern 'class Gemma3nVisionModel' | head -100

echo ""
echo "=== Checking for potential issues in mobilenetv5.cpp ==="
# Check for common mistakes: division by zero, uninitialized variables, missing bounds checks
rg -n "/ (0|patch_size|stride)" tools/mtmd/models/mobilenetv5.cpp || echo "No obvious division issues"
rg -n "GGML_ASSERT|assert" tools/mtmd/models/mobilenetv5.cpp | head -20

echo ""
echo "=== Check tensor dimension handling ==="
# Look for potential dimension mismatches
rg -n "ne\[0\]|ne\[1\]|ne\[2\]|ne\[3\]" tools/mtmd/models/mobilenetv5.cpp | head -30

echo ""
echo "=== Check for magic numbers ==="
rg -n "256|512|1024|2048" tools/mtmd/models/mobilenetv5.cpp | head -20

Length of output: 8976


🏁 Script executed:

#!/bin/bash
# Check for more specific issues

echo "=== Check patch_size calculation in convert_hf_to_gguf.py ==="
rg -n "patch_size = image_size //" convert_hf_to_gguf.py -A2 -B2

echo ""
echo "=== Check sqrt usage in patch_size calculation ==="
rg -n "sqrt|math.sqrt" convert_hf_to_gguf.py | grep -i patch || echo "No sqrt in patch calculation"

echo ""
echo "=== Check division operations in mobilenetv5.cpp ==="
rg -n "/ \w+;" tools/mtmd/models/mobilenetv5.cpp | head -20

echo ""
echo "=== Check for potential null pointer dereferences ==="
rg -n "block\.\w+->ne\[" tools/mtmd/models/mobilenetv5.cpp | head -20

echo ""
echo "=== Check gemma3n-iswa.cpp changes ==="
cat src/models/gemma3n-iswa.cpp | grep -A30 "inp->embd" | head -40

Length of output: 3571


🏁 Script executed:

#!/bin/bash
# Check the specific patch_size calculation logic

echo "=== Check image_seq_length and patch_size calculation logic ==="
rg -n "image_seq_length|patch_size" convert_hf_to_gguf.py -A1 -B1 | grep -A3 -B3 "6008\|6009\|6010\|6011"

echo ""
echo "=== Check if sqrt is needed for patch calculation ==="
echo "256 tokens = 16x16 grid, so sqrt(256) = 16"
echo "If image_size = 768 and we want 16x16 patches, patch_size should be 768/16 = 48"
echo "But code does: patch_size = 768 // 256 = 3"
echo ""

echo "=== Verify the calculation in context ==="
rg -n "image_seq_length.*256" convert_hf_to_gguf.py -B5 -A10

echo ""
echo "=== Check clip.cpp for n_patches calculation ==="
rg -n "n_patches.*GEMMA3N" tools/mtmd/clip.cpp -B5 -A5

Length of output: 2673


🏁 Script executed:

#!/bin/bash
# Check n_patches calculation in clip.cpp

echo "=== Find n_patches calculation for GEMMA3N ==="
rg -n "GEMMA3N" tools/mtmd/clip.cpp -B2 -A2 | grep -A5 -B5 "n_patches"

echo ""
echo "=== Check clip_n_output_tokens function ==="
rg -n "clip_n_output_tokens|n_output_tokens" tools/mtmd/clip.cpp -A10 | head -40

echo ""
echo "=== Compare with Gemma3 (non-N) implementation ==="
rg -n "case PROJECTOR_TYPE_GEMMA3:" tools/mtmd/clip.cpp -A5 | head -20

Length of output: 3033


🏁 Script executed:

#!/bin/bash
# Check the full n_patches calculation for GEMMA3N

echo "=== Check complete GEMMA3N n_patches logic ==="
rg -n "case PROJECTOR_TYPE_GEMMA3N:" tools/mtmd/clip.cpp -A10

echo ""
echo "=== Check if patch_size is used correctly ==="
rg -n "patch_size" tools/mtmd/clip.cpp | grep -i gemma3n -A2 -B2

echo ""
echo "=== Look for the formula: image_size / patch_size ==="
rg -n "image_size.*patch_size|patch_size.*image_size" tools/mtmd/clip.cpp

Length of output: 5510


🏁 Script executed:

#!/bin/bash
# Deep dive into potential issues

echo "=== Check if mobilenet block pointers can be null ==="
rg -n "block\.(dw_start_w|dw_mid_w|layer_scale_w)" tools/mtmd/models/mobilenetv5.cpp -B2 -A2

echo ""
echo "=== Check build_inverted_residual usage ==="
rg -n "build_inverted_residual|build_edge_residual|build_mobilenet_attn" tools/mtmd/models/mobilenetv5.cpp | head -30

echo ""
echo "=== Verify model loading sets these pointers ==="
rg -n "dw_start_w|dw_mid_w|layer_scale" tools/mtmd/clip.cpp -A2 | head -50

Length of output: 4258


🏁 Script executed:

#!/bin/bash
# Final checks for issues

echo "=== Check for hardcoded values that should be configurable ==="
rg -n "2048|8.*head|4\s*\*.*hidden" convert_hf_to_gguf.py | grep -A2 -B2 "5990\|5994"

echo ""
echo "=== Check the Gemma3NModel vocab handling for potential issues ==="
rg -n "class Gemma3NModel" convert_hf_to_gguf.py -A100 | grep -A10 -B5 "vocab_size"

echo ""
echo "=== Check if there are any unused imports or variables ==="
rg -n "import.*mobilenet|from.*mobilenet" convert_hf_to_gguf.py

echo ""
echo "=== Check for potential integer overflow in calculations ==="
rg -n "int64_t|int32_t" tools/mtmd/models/mobilenetv5.cpp | head -20

Length of output: 4263


@ngxson, I've completed a thorough review of this mirrored PR from simrnsingh. Here are my findings:

✅ Good News

No suspicious AI-generated code comments were found (no "CRITICAL:", "You are correct here...", etc.). The contributor properly disclosed Claude Code usage for exploration and boilerplates.

🐛 Issues Found

1. Critical: Incorrect patch_size calculation (convert_hf_to_gguf.py, line 6011)

The calculation patch_size = image_size // image_seq_length is semantically wrong:

# Current code:
image_seq_length = 256  # 16x16 grid
image_size = 768
patch_size = image_size // image_seq_length  # = 768 // 256 = 3

Problem: A patch_size of 3 pixels would mean dividing a 768px image into a 256×256 grid of tiny patches (65,536 tokens), not the 16×16 grid the comment states.

Should be:

n_per_side = int(image_seq_length ** 0.5)  # sqrt(256) = 16
patch_size = image_size // n_per_side  # = 768 // 16 = 48

This matches how Gemma3 (non-N) calculates it at line 5927. The C++ code at clip.cpp:3132 expects n_patches = image_size / patch_size, so the current buggy calculation accidentally produces the correct output (256 tokens) but with wrong semantics.

2. Hardcoded magic values (convert_hf_to_gguf.py, lines 5990-5995)

def find_vparam(self, keys: list[str], optional: bool = False) -> Any:
    if "intermediate_size" in keys:
        hidden_size = self.hparams_vision.get("hidden_size", 2048)  # Hardcoded default
        return hidden_size * 4  # Hardcoded expansion factor
    
    if "num_attention_heads" in keys or "num_heads" in keys:
        return 8  # Hardcoded MQA heads

These should ideally be read from the vision config if available, rather than hardcoded. While they may be correct for the current Gemma3n model, they're not future-proof.

3. Potential unsafe tensor dimension access (mobilenetv5.cpp, lines 138, 243)

The code accesses block.layer_scale_w->ne[0]; the dereference sits inside a block that is guarded by a null check on layer_scale_w:

if (block.layer_scale_w) {
    ggml_tensor * scale_w_reshaped = ggml_reshape_4d(ctx0, block.layer_scale_w,
        1, 1, block.layer_scale_w->ne[0], 1);  // Safe: inside if-block

This particular access is therefore safe due to the if-check, but similar optional-tensor dereference patterns elsewhere should be verified.

4. Hardcoded assumption (mobilenetv5.cpp, line 407)

int C = cur->ne[2]; // Should be 2048

This comment reveals a hardcoded assumption about channel count that might not hold for all model variants.
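
One way to soften that assumption (sketch only; n_channels_cfg is a hypothetical config value, not an existing field) would be to take the count from the tensor and warn on mismatch:

```cpp
// Sketch: derive the channel count from the tensor instead of assuming 2048;
// n_channels_cfg is a hypothetical hparam used only for a sanity warning.
const int64_t C = cur->ne[2];
if (n_channels_cfg > 0 && C != (int64_t) n_channels_cfg) {
    LOG_WRN("%s: unexpected MSFA channel count %lld (config says %d)\n",
            __func__, (long long) C, n_channels_cfg);
}
```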

5. Minor: Commented-out code (mobilenetv5.cpp, line 331)

int scale_w = high_res_w / feat_w;
// int scale_h = high_res_h / feat_h;  // Unused

The contributor left commented-out code that should probably be removed.

📝 Recommendations

The patch_size calculation bug should be fixed before merging. The other issues are less critical but should be addressed for code quality.


✏️ Learnings added
Learnt from: ngxson
URL: https://github.com/ngxson/llama.cpp/pull/64

Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

Learnt from: ngxson
URL: https://github.com/ngxson/llama.cpp/pull/64

Timestamp: 2025-12-22T23:32:42.587Z
Learning: When reviewing mirrored PRs for ngxson/llama.cpp, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.

…cit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest

- convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf
- mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_*
- clip.cpp: Remove unused embedding and hard_emb_norm tensor loading

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
gguf-py/gguf/constants.py (1)

857-865: [Duplicate] Missing VISION_PROJECTOR_TYPE_NAMES mappings for GEMMA3N, QWEN3VL, and COGVLM.

This issue was already flagged in previous reviews. The VISION_PROJECTOR_TYPE_NAMES dictionary is missing mappings for GEMMA3N (line 459), QWEN3VL (line 461), and COGVLM (line 462) enum entries. This will cause a KeyError if code attempts to convert these enum values to their string representations using this dictionary.

Based on previous review, the fix should add:

VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",
VISION_PROJECTOR_TYPE.QWEN3VL:   "qwen3vl_merger",
VISION_PROJECTOR_TYPE.COGVLM:    "cogvlm",
convert_hf_to_gguf.py (1)

5966-6091: Fix Gemma3n patch_size computation and note minor cleanups

There are a few points here:

  1. Patch size computation for Gemma3n MobileNetV5 is still wrong
    In Gemma3nVisionModel.set_gguf_parameters:

    image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
    image_size = self.hparams_vision["image_size"]
    self.hparams_vision["patch_size"] = image_size // image_seq_length

    With the default Gemma3n setup (768×768, image_seq_length = 256), this yields patch_size = 3, which implies a 256×256 grid and 65,536 patches, while the comment explicitly states “256 tokens = 16×16”. Patch size should be derived from tokens per side, not total tokens.

    Recommended fix (same issue as previously flagged in earlier review; applying it here for the new MobileNetV5 path as well):

    Proposed fix for patch_size in Gemma3nVisionModel
    -        # Image sequence length (256 tokens = 16x16 for Gemma3n)
    -        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
    -        image_size = self.hparams_vision["image_size"]
    -        self.hparams_vision["patch_size"] = image_size // image_seq_length
    +        # Image sequence length (e.g. 256 tokens = 16x16 grid for Gemma3n)
    +        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
    +        image_size = self.hparams_vision["image_size"]
    +        # Derive patch size from patches-per-side, not total token count
    +        n_per_side = int(image_seq_length ** 0.5)
    +        if n_per_side * n_per_side != image_seq_length:
    +            raise ValueError(
    +                f"image_seq_length={image_seq_length} is not a perfect square; "
    +                "cannot infer square patch grid for Gemma3n vision encoder"
    +            )
    +        self.hparams_vision["patch_size"] = image_size // n_per_side
    

This matches the intended 16×16 grid for 256 tokens and keeps `patch_size` consistent with how other vision encoders in this file derive it.  


2. **Vocab / embedding handling for Gemma3n text model is a solid improvement**  
- `Gemma3NModel.set_vocab` temporarily removes `vocab_size_per_layer_input` so `_create_vocab_sentencepiece()` uses the full `vocab_size` (including the vision/audio special tokens) and then restores it.  
- The new `modify_tensors` branch pads `embed_tokens.weight` and per-layer embeddings up to `vocab_size`, instead of truncating to `vocab_size_per_layer_input`, which is required for multimodal Gemma3n.

This avoids dropping the 262144–262399 special IDs and aligns the text embeddings with the tokenizer. Behavior looks correct and non‑regressive for pure‑text use.

3. **Optional typing/ruff cleanup for class attributes (RUF012)**  
In `Gemma3nVisionModel`:

```python
n_block_keys = []
block_tensor_mapping = { ... }
```

These are effectively class‑level constants. To satisfy RUF012 and make the intent explicit to type checkers, consider:

Optional ClassVar annotation tweak
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast, ClassVar
...
-    n_block_keys = []
+    n_block_keys: ClassVar[list[str]] = []
...
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
      ...
    }

This is purely a typing / tooling nicety; behavior is unchanged.

Overall, once the patch_size formula is corrected, the Gemma3n MobileNetV5 path and the Gemma3n text vocab/embedding logic look structurally sound for the mirrored upstream changes.

Also applies to: 6115-6191

🧹 Nitpick comments (4)
gguf-py/gguf/tensor_mapping.py (1)

123-158: Gemma3n vision tensor mappings look consistent

The new V_MM_* and V_ENC_* entries align with the Gemma3n/MobileNetV5 tensor paths used in convert_hf_to_gguf.py and constants, so TensorNameMap will correctly resolve model.embed_vision.* and model.vision_tower.timm_model.* for Gemma3n.

Note that V_MM_INP_PROJ / V_MM_SOFT_EMB_NORM now have both generic multi_modal_projector.* and Gemma3n-specific model.embed_vision.* synonyms; that’s fine, but if more variants start using these tensors it may be worth documenting this dual use to avoid confusion later.

tools/mtmd/clip.cpp (1)

1528-1620: Verify mobilenetv5_block default initialization and stage boundary assumptions

The GEMMA3N tensor-loading branch dynamically discovers MobileNetV5 blocks per stage and accumulates them in model.mobilenet_blocks, with stage ends recorded in model.mobilenet_stage_ends. This is a good direction, but a couple of edge conditions are worth double‑checking:

  • mobilenetv5_block block; relies on the struct’s members being safely default‑initialized (e.g., ggml_tensor * foo = nullptr; or an explicit ctor). If any members lack default member initializers, they will contain indeterminate values in paths where that sub‑block type is absent (e.g., pure attention blocks vs. pure UIR blocks). Please confirm mobilenetv5_block is defined with default member initializers or add = {} here to value‑initialize it.

  • The for (int blk_idx = 0; ; ++blk_idx) loop for each stage stops at the first blk_idx that yields no tensors. This assumes that all blocks in a stage are densely indexed from 0..N-1 with no gaps. If future variants ever introduce gaps, discovery would silently truncate later blocks. That’s probably fine for current Gemma3n, but worth keeping in mind if more MobileNetV5 variants are added.

  • mobilenet_stage_ends.push_back(model.mobilenet_blocks.size() - 1) stores inclusive global indices. Ensure mobilenetv5.cpp interprets these indices the same way (inclusive vs exclusive) when iterating.

If you confirm mobilenetv5_block is defined with safe defaults and that stage indices are inclusive by design, this loader logic looks solid.
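
For reference, a self-contained sketch of the discovery pattern with explicit value-initialization is below; load_block_tensors and the trimmed block struct are hypothetical stand-ins, not the real clip.cpp / clip-model.h definitions.

```cpp
// Sketch of dense per-stage block discovery with value-initialization,
// inclusive stage-end indices, and the no-gaps assumption made explicit.
#include <vector>

struct ggml_tensor;

// Placeholder for the real mobilenetv5_block from clip-model.h.
struct mobilenetv5_block {
    ggml_tensor * dw_start_w    = nullptr;  // default member initializers keep
    ggml_tensor * layer_scale_w = nullptr;  // absent sub-blocks safely null
    // ... many more tensor pointers in the real struct
};

// Hypothetical stand-in for the per-block get_tensor calls in clip.cpp.
bool load_block_tensors(int stage, int blk, mobilenetv5_block & out);

void discover_blocks(std::vector<mobilenetv5_block> & blocks,
                     std::vector<int> & stage_ends) {
    for (int stage = 0; stage < 4; ++stage) {
        bool found_any = false;
        for (int blk = 0; ; ++blk) {
            mobilenetv5_block block = {};            // value-initialize every member
            if (!load_block_tensors(stage, blk, block)) {
                break;                               // assumes dense indexing, no gaps
            }
            blocks.push_back(block);
            found_any = true;
        }
        if (found_any) {
            stage_ends.push_back((int) blocks.size() - 1); // inclusive global index
        }
    }
}
```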

tools/mtmd/models/mobilenetv5.cpp (2)

372-372: Consider making target output resolution configurable.

The hardcoded target_out_res = 16 assumes a fixed output resolution for the MSFA downsampling stage. If the model architecture varies or if different Gemma3n variants use different resolutions, this value should be read from the model config rather than hardcoded.

💡 Suggested approach

Add a field to the model config for MSFA output resolution and read it during model loading, falling back to 16 if not present:

// Example usage (adjust based on actual model structure):
const int target_out_res = model.msfa_output_res ? model.msfa_output_res : 16;

Alternatively, if 16 is the fixed resolution for all Gemma3n models, consider adding a comment explaining this architectural constraint.

Based on learnings, this is a mirrored PR—please verify with the upstream contributor whether this value should remain fixed or be made configurable.


420-420: Consider extracting RMS norm epsilon as a named constant.

The hardcoded epsilon value 1e-6f appears twice (lines 420, 442) for Gemma3n RMS normalization. Extracting it as a named constant improves maintainability and makes the architectural choice explicit.

💡 Suggested refactor

Near the top of the file or in the class definition:

static constexpr float GEMMA3N_RMS_NORM_EPS = 1e-6f;

Then use it consistently:

-    const float eps = 1e-6f; // Gemma3n uses 1e-6
-    cur = ggml_rms_norm(ctx0, cur, eps);
+    cur = ggml_rms_norm(ctx0, cur, GEMMA3N_RMS_NORM_EPS);

Also applies to: 442-442

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 86618c7 and e2835e9.

📒 Files selected for processing (5)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/mobilenetv5.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{c,cpp,h,hpp}

📄 CodeRabbit inference engine (AGENTS.md)

Always format C++ code before committing using git clang-format with the project's .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment void * ptr, reference alignment int & ref)

Files:

  • tools/mtmd/models/mobilenetv5.cpp
  • tools/mtmd/clip.cpp
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Always use the Python environment in .venv and run Python tools from that environment
Apply Python linting rules configured in .flake8 (max-line-length=125, excludes examples/tools) and type checking with pyright

Files:

  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • convert_hf_to_gguf.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated

Applied to files:

  • tools/mtmd/models/mobilenetv5.cpp
  • gguf-py/gguf/constants.py
  • tools/mtmd/clip.cpp
  • convert_hf_to_gguf.py
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated

Applied to files:

  • tools/mtmd/models/mobilenetv5.cpp
  • gguf-py/gguf/constants.py
  • tools/mtmd/clip.cpp
  • convert_hf_to_gguf.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • gguf-py/gguf/constants.py
  • tools/mtmd/clip.cpp
🧬 Code graph analysis (4)
tools/mtmd/models/mobilenetv5.cpp (2)
ggml/src/ggml.c (17)
  • ggml_permute (3700-3752)
  • ggml_cont (3461-3465)
  • ggml_rms_norm (3066-3071)
  • ggml_mul (2170-2175)
  • ggml_pad_ext (4983-5016)
  • ggml_conv_2d_direct (4702-4736)
  • ggml_gelu (2677-2681)
  • ggml_add (1969-1974)
  • ggml_reshape_3d (3564-3581)
  • ggml_reshape_4d (3583-3601)
  • ggml_mul_mat (3174-3189)
  • ggml_scale (3290-3295)
  • ggml_soft_max (3966-3970)
  • ggml_upscale (4928-4935)
  • ggml_concat (2517-2544)
  • ggml_pool_2d (4852-4878)
  • ggml_build_forward_expand (6793-6795)
tools/mtmd/clip.cpp (9)
  • build_inp_raw (469-474)
  • build_inp_raw (469-469)
  • model (217-219)
  • model (935-1261)
  • model (935-935)
  • model (2035-2048)
  • model (2035-2035)
  • s (2443-2445)
  • s (2443-2443)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
  • MODEL_TENSOR (465-743)
tools/mtmd/clip.cpp (2)
common/common.cpp (4)
  • model (1159-1161)
  • model (1159-1159)
  • string_format (399-412)
  • string_format (399-399)
src/llama-model.cpp (2)
  • get_tensor (7044-7054)
  • get_tensor (7044-7044)
convert_hf_to_gguf.py (1)
gguf-py/gguf/constants.py (1)
  • VisionProjectorType (3429-3450)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

5969-5969: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


5972-5995: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


6067-6067: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (6)
tools/mtmd/clip.cpp (1)

783-795: GEMMA3N graph dispatch is consistent with new MobilenetV5 graph

Routing PROJECTOR_TYPE_GEMMA3N to clip_graph_mobilenetv5 mirrors how other projector types choose their specialized graphs and keeps mobilenet-specific logic isolated from the generic ViT path. No issues here.

convert_hf_to_gguf.py (1)

522-527: prepare_tensors: safe handling of empty tensor_map looks good

Guarding self.tensor_map.mapping before calling max() fixes the crash when block_count == 0 (e.g., MobileNetV5) while preserving the logging behavior with a reasonable fallback width.

No issues from a correctness or performance perspective.

tools/mtmd/models/mobilenetv5.cpp (4)

5-20: LGTM: RMS normalization helper is correctly implemented.

The permutation logic correctly transforms [W,H,C,B] → [C,H,W,B] for channel-wise normalization, applies RMS norm, optionally multiplies by weight, and restores the original layout. Null check for weight is properly guarded.


23-53: LGTM: Asymmetric SAME padding correctly implemented.

The calculation matches PyTorch/TensorFlow SAME padding semantics with proper ceil division for output size and asymmetric padding split. The ggml_pad_ext call uses correct dimension ordering for [W,H,C,B] layout.


57-88: LGTM: Block builders handle null pointers safely.

All three block builder functions (build_edge_residual, build_inverted_residual, build_mobilenet_attn) consistently check for null pointers before dereferencing optional weights and biases. Residual connection logic correctly validates dimension matching before addition.

Also applies to: 91-149, 152-246


248-453: LGTM: Main build function integrates all components correctly.

The build() function properly sequences stem convolution, stage/block iteration, fusion point collection, MSFA logic, and Gemma3n embedding projection. Logging statements use proper macros (LOG_WRN, LOG_ERR) as addressed in previous reviews. All optional weight/bias tensors are guarded by null checks before use.

Comment on lines +1149 to +1155
case PROJECTOR_TYPE_GEMMA3N:
{
// Gemma3n uses MobileNetV5 which produces 256 tokens (16x16)
// Similar configuration to Gemma3
hparams.n_merge = 1; // MobileNetV5 handles resizing internally
get_u32(KEY_PROJ_SCALE_FACTOR, hparams.n_merge, false);
} break;

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the specific code sections mentioned in the review
cd tools/mtmd
wc -l clip.cpp

Repository: ngxson/llama.cpp

Length of output: 73


🏁 Script executed:

# Check the context around line 1149-1155
head -1160 clip.cpp | tail -20

Repository: ngxson/llama.cpp

Length of output: 127


🏁 Script executed:

# Also check the "Also applies to" lines: 2856-2864, 3125-3131, 3521-3522, 3647-3648
sed -n '2856,2864p' clip.cpp

Repository: ngxson/llama.cpp

Length of output: 111


🏁 Script executed:

sed -n '3125,3131p' clip.cpp

Repository: ngxson/llama.cpp

Length of output: 111


🏁 Script executed:

sed -n '3521,3522p' clip.cpp

Repository: ngxson/llama.cpp

Length of output: 111


🏁 Script executed:

sed -n '3647,3648p' clip.cpp

Repository: ngxson/llama.cpp

Length of output: 111


🏁 Script executed:

# Now let's look for the converter code mentioned
find . -name "convert_hf_to_gguf.py" -o -name "*convert*gemma*" 2>/dev/null | head -20

Repository: ngxson/llama.cpp

Length of output: 83


🏁 Script executed:

# Search for where patch_size is set for GEMMA3N
rg "patch_size.*gemma3n|gemma3n.*patch_size" -i

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

# Search for Gemma3nVisionModel in converter files
rg "Gemma3nVisionModel|set_gguf_parameters"

Repository: ngxson/llama.cpp

Length of output: 13181


🏁 Script executed:

find . -name "clip.cpp" -type f

Repository: ngxson/llama.cpp

Length of output: 81


🏁 Script executed:

# Search for Gemma3nVisionModel set_gguf_parameters implementation
rg -A 20 "class Gemma3nVisionModel"

Repository: ngxson/llama.cpp

Length of output: 2748


🏁 Script executed:

# Search for the specific patch_size calculation for GEMMA3N
rg -B 5 -A 5 "patch_size.*image_size.*image_seq_length|Gemma3n.*patch_size"

Repository: ngxson/llama.cpp

Length of output: 740


🏁 Script executed:

cat -n ./tools/mtmd/clip.cpp | sed -n '1145,1160p'

Repository: ngxson/llama.cpp

Length of output: 1026


🏁 Script executed:

cat -n ./tools/mtmd/clip.cpp | sed -n '2850,2870p'

Repository: ngxson/llama.cpp

Length of output: 1304


🏁 Script executed:

cat -n ./tools/mtmd/clip.cpp | sed -n '3120,3140p'

Repository: ngxson/llama.cpp

Length of output: 1225


🏁 Script executed:

cat -n ./tools/mtmd/clip.cpp | sed -n '3515,3530p'

Repository: ngxson/llama.cpp

Length of output: 771


🏁 Script executed:

cat -n ./tools/mtmd/clip.cpp | sed -n '3640,3655p'

Repository: ngxson/llama.cpp

Length of output: 873


🏁 Script executed:

# Now let's find where patch_size is being used for token counting in GEMMA3N
rg "clip_n_output_tokens|PROJECTOR_TYPE_GEMMA3N" tools/mtmd/clip.cpp -B 3 -A 10

Repository: ngxson/llama.cpp

Length of output: 7814


🏁 Script executed:

# Search for the n_patches calculation in clip_n_output_tokens
rg "n_patches.*patch_size|patch_size.*n_patches" tools/mtmd/clip.cpp -B 5 -A 5

Repository: ngxson/llama.cpp

Length of output: 3486


🌐 Web query:

Hugging Face Gemma3n image processor implementation MobileNetV5 patch_size image_seq_length

💡 Result:

Short answer:

  • Gemma3n's processor (Gemma3nProcessor) wraps a SiglipImageProcessorFast for images and exposes image_seq_length with a default of 256. [1][2]
  • The Gemma 3n vision encoder is MobileNetV5 and uses a default input resolution of 768×768; MobileNetV5 is a CNN backbone (not a ViT-style patch embed), so there is no model-level "patch_size" parameter the way patch-based transformers have. [2][3]

Sources:

  • Gemma3n processor docs (shows image_seq_length default = 256 and use of SiglipImageProcessorFast). [1]
  • Gemma3n / Gemma3 docs (states MobileNetV5 vision encoder, default resolution 768×768). [2][3]

🌐 Web query:

Gemma3nImageProcessor Gemma3nProcessor image_size patch_size configuration

💡 Result:

Short answer:

  • Gemma3 Image processor (transformers.Gemma3ImageProcessor / Gemma3ImageProcessorFast) defaults: size {"height":224,"width":224} (do_resize True), rescale_factor 1/255 etc. [1][2].
  • Gemma3nProcessor wraps feature_extractor + image_processor + tokenizer; image_seq_length defaults to 256. The image_processor used for Gemma3n is SiglipImageProcessor/SigLiPFromPatches. [3][4].
  • The Gemma3 vision encoder (SigLiP/Vision config) expects full-resolution images of 896×896 and uses patch_size = 14 (so 896/14 = 64 patches per side → 4096 patches); vision hidden dim ≈1152 and later pooled to 256 multimodal tokens. Patch_size=14 and image_size=896 are used in official Gemma config/backbone implementations. [5][6][7].

Sources:

  1. Hugging Face — Gemma3ImageProcessor docs (defaults: size 224, rescale_factor 1/255). [1]
  2. Hugging Face — Gemma3ImageProcessorFast / preprocess description. [2]
  3. Hugging Face — Gemma3nProcessor (image_seq_length default 256; uses SiglipImageProcessor). [3]
  4. Gemma readthedocs / model API showing SigLiPFromPatches → ViT patch_size=(14,14), image_height/width=896. [5]
  5. KerasHub / Gemma3Backbone (vision encoder params: image_size=896, patch_size=14, hidden_dim=1152, output tokens pooled to 256). [6]
  6. vLLM / Gemma3MultiModalProjector implementation (uses image_size and patch_size to compute patches_per_image and pooling to mm tokens). [7]

🏁 Script executed:

# Let's verify the current converter code more carefully to understand the exact implementation
rg -B 10 -A 10 "image_size.*image_seq_length|patch_size.*image_size.*image_seq" convert_hf_to_gguf.py | head -50

Repository: ngxson/llama.cpp

Length of output: 1042


🏁 Script executed:

# Check if there's any documentation about what patch_size is supposed to represent
rg "patch_size" tools/mtmd/clip.cpp | head -30

Repository: ngxson/llama.cpp

Length of output: 2186


🏁 Script executed:

# Let's verify the math: for image_size=768 and image_seq_length=256
python3 << 'EOF'
import math

# Current formula (converter)
image_size = 768
image_seq_length = 256
patch_size_current = image_size // image_seq_length
print(f"Current formula: patch_size = {image_size} // {image_seq_length} = {patch_size_current}")

# Reviewer's suggested fix
patch_size_sqrt = image_size // int(math.sqrt(image_seq_length))
print(f"Reviewer's suggested: patch_size = {image_size} // sqrt({image_seq_length}) = {image_size} // {int(math.sqrt(image_seq_length))} = {patch_size_sqrt}")

# Current C++ calculation with current patch_size
n_patches_current = image_size / patch_size_current
print(f"\nCurrent C++ result: n_patches = {image_size} / {patch_size_current} = {n_patches_current}")

# C++ calculation if patch_size were corrected
n_patches_if_fixed = image_size / patch_size_sqrt
print(f"C++ result if patch_size fixed: n_patches = {image_size} / {patch_size_sqrt} = {n_patches_if_fixed}")

print(f"\nWould the code break? {n_patches_current != n_patches_if_fixed}")
EOF

Repository: ngxson/llama.cpp

Length of output: 146


🏁 Script executed:

# Quick math verification since python timed out
echo "Current: 768 / (768 / 256) = 768 / 3 = $(( 768 / 3 ))"
echo "If fixed (sqrt): 768 / (768 / 16) = 768 / 48 = $(( 768 / 48 ))"

Repository: ngxson/llama.cpp

Length of output: 152


🏁 Script executed:

# Check if there's any reference to image_seq_length in the C++ code
rg "image_seq_length" tools/mtmd/clip.cpp

Repository: ngxson/llama.cpp

Length of output: 42


🏁 Script executed:

# Check how Gemma3 (not Gemma3N) handles patch_size to compare
rg -B 5 -A 10 "PROJECTOR_TYPE_GEMMA3:" tools/mtmd/clip.cpp | head -40

Repository: ngxson/llama.cpp

Length of output: 1955


🏁 Script executed:

# Check the full context around clip_n_output_tokens to understand the override logic
sed -n '3100,3145p' tools/mtmd/clip.cpp

Repository: ngxson/llama.cpp

Length of output: 2072


🏁 Script executed:

# Verify if image_seq_length should be stored as a hparam instead
rg "image_seq_length" convert_hf_to_gguf.py

Repository: ngxson/llama.cpp

Length of output: 346


🏁 Script executed:

# Check the comment in the converter about 16x16
rg -B 2 -A 2 "16x16|256 tokens" convert_hf_to_gguf.py

Repository: ngxson/llama.cpp

Length of output: 281


🌐 Web query:

SigLiPImageProcessor resize aspect ratio center crop bilinear interpolation

💡 Result:

Short answer:

  • SiglipImageProcessor follows the Transformers BaseImageProcessor resizing rules: if you pass size as {"shortest_edge": N} it resizes the shortest edge to N and preserves aspect ratio; if you pass explicit height/width it will use that output size (or treat a single int as square unless default_to_square=False). [1][2]

  • Center-crop is supported (do_center_crop / crop_size); the fast processor pads smaller inputs with zeros before center-cropping to ensure the requested crop size. [1][3]

  • Resampling: default resample is BICUBIC for Siglip, but you can set resample=Resampling.BILINEAR (or other PILImageResampling/F.InterpolationMode) to use bilinear interpolation. [1][2]

Sources: Hugging Face SigLIP docs (SiglipImageProcessor) and BaseImageProcessor docs. [1][2][3]


Decouple GEMMA3N token count from misused patch_size parameter

The GEMMA3N token-count logic has a brittle dependency on how patch_size is computed in the converter. Currently, convert_hf_to_gguf.py sets patch_size = image_size // image_seq_length (768 // 256 = 3), and the C++ code in clip_n_output_tokens() computes n_patches = image_size / patch_size, which accidentally yields the correct 256 tokens. However, if patch_size is ever corrected to its proper semantic meaning (e.g., image_size // sqrt(image_seq_length) = 48), this calculation will silently return 16 tokens and break the sanity check.

Two concrete fixes:

  1. Store image_seq_length (or a hardcoded 256) as a dedicated GGUF hparam instead of deriving token count from a misused patch_size. MobileNetV5 is a CNN backbone, not patch-based, so patch_size has no semantic meaning for it. By decoupling the token count from patch_size, future converter improvements won't silently break runtime logic.

  2. Verify preprocessing against HF's SigLiP reference. Confirm that "resize to square + bilinear + no padding" fully matches HF's Gemma3nProcessor behavior. SigLiP can support aspect-ratio-preserving resize + center crop; if that is the intended behavior, the preprocessing in clip_image_preprocess() at lines 2856–2864 should be aligned.

The code currently works, but it bakes in fragile assumptions about converter internals that will break once the patch_size formula is corrected.
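For concreteness, a worked example of the arithmetic above, using the Gemma3n defaults quoted in this review (768×768 input, image_seq_length = 256):

image_size, image_seq_length = 768, 256

patch_size_misused = image_size // image_seq_length              # 768 // 256 = 3
patch_size_proper  = image_size // int(image_seq_length ** 0.5)  # 768 // 16 = 48

print(image_size // patch_size_misused)  # 256 -> accidentally matches image_seq_length
print(image_size // patch_size_proper)   # 16  -> would break the 256-token sanity check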

Comment on lines +394 to +400
// 4. Gemma 3n Multimodal Projection (Embedder)
// Input: 'cur' is [Width, Height, Channels, Batch]
int W = cur->ne[0];
int H = cur->ne[1];
int C = cur->ne[2]; // Should be 2048
int B = cur->ne[3];


⚠️ Potential issue | 🟠 Major

Brittle assumption: channel count should not be hardcoded.

Line 398's comment "// Should be 2048" indicates a brittle assumption about the channel dimension. If MobileNetV5 variants or future models use different channel counts, this could fail silently or cause dimension mismatches in the embedding projection.

🔎 Recommended defensive check

Add validation to catch mismatches early:

 int W = cur->ne[0];
 int H = cur->ne[1];
-int C = cur->ne[2]; // Should be 2048
+int C = cur->ne[2];
 int B = cur->ne[3];
+
+// Validate expected channel dimension for Gemma3n projection
+const int expected_vision_hidden = 2048; // Or read from model config
+if (C != expected_vision_hidden) {
+    LOG_ERR("%s: unexpected vision hidden size %d (expected %d)\n", __func__, C, expected_vision_hidden);
+    return nullptr;
+}

Based on learnings, as this is a mirrored PR, please relay this concern to the upstream contributor.

🤖 Prompt for AI Agents
In tools/mtmd/models/mobilenetv5.cpp around lines 394 to 400, the code comments
and assumption that C == 2048 are brittle; add a defensive runtime validation
that checks the channel count (cur->ne[2]) against the expected embedding input
size, and if it differs emit a clear error or return early with a descriptive
message (include the actual C and expected value). If there are multiple
supported channel sizes, handle them explicitly (e.g., branch or map to
projection weights) rather than hardcoding; otherwise fail fast. Also include a
brief log or comment to notify upstream that the hardcoded "Should be 2048"
assumption was removed and request confirmation of expected channel sizes for
different MobileNetV5 variants.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (2)
gguf-py/gguf/constants.py (1)

857-865: GEMMA3N mapping still missing from VISION_PROJECTOR_TYPE_NAMES dictionary.

This is the same issue flagged in the previous review. The GEMMA3N enum value added at line 459 still has no corresponding entry in this dictionary. This mapping is required for the enum-to-string conversion to work correctly.

convert_hf_to_gguf.py (1)

6014-6039: Fix Gemma3n MobileNetV5 patch_size semantics and avoid brittle hard‑coded vision hparams

Two related issues here:

  1. patch_size is still computed from total token count, not per‑side patches (critical)

    • Current code: patch_size = image_size // image_seq_length (e.g., 768 // 256 = 3), which implies n_per_side = 256 and a 65k‑token grid.
    • Semantically, image_seq_length is total patches (e.g., 256 = 16×16). Patch size must be derived from sqrt(image_seq_length) so that both the converter and C++ vision path agree on a 16×16 grid and the correct patch_size (48 for 768×768).
  2. Hard‑coded MobileNetV5 defaults in find_vparam are brittle

    • hidden_size defaulting to 2048 and num_heads forced to 8 will silently be wrong if future Gemma3n variants change these values in their config. It’s safer to read from self.hparams_vision when available and only fall back to defaults if the config is missing them.
Patch: derive `patch_size` from √image_seq_length
-        # Image sequence length (256 tokens = 16x16 for Gemma3n)
-        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
-        image_size = self.hparams_vision["image_size"]
-        self.hparams_vision["patch_size"] = image_size // image_seq_length
+        # Image sequence length is total tokens (e.g. 256 = 16×16 grid)
+        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
+        image_size = self.hparams_vision["image_size"]
+
+        n_per_side = int(image_seq_length ** 0.5)
+        if n_per_side * n_per_side != image_seq_length:
+            raise ValueError(f"image_seq_length={image_seq_length} is not a perfect square")
+
+        # e.g. 768 // 16 = 48 for a 16×16 patch grid
+        self.hparams_vision["patch_size"] = image_size // n_per_side
Patch: prefer config‑driven head / FFN sizes in `find_vparam`
     def find_vparam(self, keys: list[str], optional: bool = False) -> Any:
         """Override to provide hardcoded MobileNetV5 parameters that aren't in config"""
         # Handle empty keys list (n_block_keys) - return 0 for CNN architecture
         if not keys:
             return 0
 
-        if "intermediate_size" in keys:
-            # Typical expansion is 4x the embedding dimension
-            hidden_size = self.hparams_vision.get("hidden_size", 2048)
-            return hidden_size * 4
-
-        if "num_attention_heads" in keys or "num_heads" in keys:
-            # Multi-Query Attention with 8 heads
-            return 8
+        if "intermediate_size" in keys:
+            assert self.hparams_vision is not None
+            if "intermediate_size" in self.hparams_vision:
+                return self.hparams_vision["intermediate_size"]
+            # Fallback: typical MobileNetV5 expansion is 4× hidden_size
+            hidden_size = self.hparams_vision.get("hidden_size")
+            if hidden_size is not None:
+                return hidden_size * 4
+
+        if any(k in ("num_attention_heads", "num_heads") for k in keys):
+            assert self.hparams_vision is not None
+            for k in ("num_attention_heads", "num_heads"):
+                if k in self.hparams_vision:
+                    return self.hparams_vision[k]
+            # Final fallback if config is missing heads
+            return 8
 
         # For other parameters, use parent implementation
         return super().find_vparam(keys, optional)

Given this is a mirrored PR, you’ll probably want to carry this fix locally and/or ping upstream about the patch_size formula and config‑driven defaults.

🧹 Nitpick comments (3)
gguf-py/gguf/constants.py (1)

1071-1071: Consider clarifying the comment for V_MM_SOFT_EMB_NORM.

The comment here shows # gemma3n, but the enum definition at line 669 shows # gemma3. If this tensor is used by both gemma3 and gemma3n architectures, consider using a comment like # gemma3, gemma3n to clarify the shared usage and avoid confusion.

convert_hf_to_gguf.py (2)

5969-5995: Annotate mutable class attributes with ClassVar to satisfy Ruff RUF012

n_block_keys = [] and block_tensor_mapping = {…} are mutable class attributes; Ruff expects them to be annotated as typing.ClassVar[...].

Proposed type annotations for class attributes

Add ClassVar to the typing imports:

from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast, ClassVar

Then update the class attributes:

-    n_block_keys = []
+    n_block_keys: ClassVar[list[str]] = []
@@
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
         "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight":             "v.blk.{bid}.{sid}.conv_exp.weight",
         ...
     }

6171-6226: Embedding padding and ALTUP stacking logic look correct; consider a small guard

This block:

  • Only affects language_model.* tensors; others are skipped, which keeps mmproj / vision clean.
  • Pads embed_tokens.weight and embed_tokens_per_layer up to hparams["vocab_size"], filling new rows with zeros for vision/audio tokens (which get real features from the vision/audio towers anyway).
  • Leaves altup_unembed_projections and altup_projections unpadded and stacks three shard tensors into single [3, …, …] matrices, matching how GGUF expects them.

You might consider adding a simple sanity check on the padding path to catch config mismatches earlier (optional):

Optional: assert current vs target vocab sizes when padding
-            vocab_size = self.hparams.get("vocab_size", 262400)
-            current_size = data_torch.shape[0]  # First dimension is vocab_size
+            vocab_size = self.hparams.get("vocab_size", 262400)
+            current_size = data_torch.shape[0]  # first dim is vocab size
+
+            if current_size > vocab_size:
+                raise ValueError(
+                    f"embed tensor rows ({current_size}) exceed vocab_size ({vocab_size})"
+                )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e2835e9 and 632e29f.

📒 Files selected for processing (3)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • tools/mtmd/clip-impl.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • tools/mtmd/clip-impl.h
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Always use the Python environment in .venv and run Python tools from that environment
Apply Python linting rules configured in .flake8 (max-line-length=125, excludes examples/tools) and type checking with pyright

Files:

  • gguf-py/gguf/constants.py
  • convert_hf_to_gguf.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated

Applied to files:

  • gguf-py/gguf/constants.py
  • convert_hf_to_gguf.py
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated

Applied to files:

  • gguf-py/gguf/constants.py
  • convert_hf_to_gguf.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • gguf-py/gguf/constants.py
🧬 Code graph analysis (1)
convert_hf_to_gguf.py (3)
convert_lora_to_gguf.py (2)
  • modify_tensors (448-473)
  • to (185-186)
gguf-py/gguf/gguf_writer.py (2)
  • add_clip_projector_type (1076-1077)
  • add_vision_attention_layernorm_eps (1097-1098)
gguf-py/gguf/constants.py (1)
  • VisionProjectorType (3429-3450)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

5969-5969: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


5972-5995: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


6067-6067: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (3)
convert_hf_to_gguf.py (3)

522-527: Defensive handling of empty tensor_map in prepare_tensors looks good

The fallback max_name_len path for models with block_count == 0 (e.g., MobileNetV5) avoids max() on an empty mapping and only affects logging alignment. No functional risks spotted.


6070-6090: Verify MobileNetV5 tensor remapping and reshaping assumptions

The overall routing logic looks consistent:

  • Non‑Gemma3n tensors are skipped early.
  • Block tensors under model.vision_tower.timm_model.blocks.* are mapped via custom_map into the v.blk.{bid}.{sid}.* namespace.
  • Other MobileNetV5 pieces (stem / msfa / embeddings) fall back to the standard tensor map.
  • Biases and layer_scale.gamma are reshaped from [C] into [1, C, 1, 1], which matches a conv‑style broadcast.

If you haven’t already, it would be worth running a quick shape sanity check against a real Gemma3n vision checkpoint (ensure conv_stem.conv.bias and layer_scale.gamma are 1‑D [C] and that the resulting GGUF tensors have the shapes expected by mobilenetv5.cpp / clip.cpp for those fields).
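A minimal sketch of that sanity check and the [C] -> [1, C, 1, 1] reshape (the tensor name comes from this review; the channel count is made up):

import numpy as np

bias = np.random.rand(64)              # hypothetical conv_stem.conv.bias, shape [C]
assert bias.ndim == 1, "expected a 1-D [C] tensor"

bias4d = bias.reshape(1, -1, 1, 1)     # [1, C, 1, 1] for a channel-wise conv broadcast
print(bias4d.shape)                    # (1, 64, 1, 1)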


6115-6134: Gemma3NModel.set_vocab correctly forces use of full vocab_size

Temporarily removing vocab_size_per_layer_input so the vocab builders see the full vocab_size (including vision/audio specials) and then restoring it is a clean way to reconcile:

  • vocab generation: needs 262400 entries
  • per‑layer embeddings: still driven by vocab_size_per_layer_input later

No functional issues seen.
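A hedged sketch of the pop/build/restore pattern (the key names are from this review; the dict and sizes are placeholders, not the converter's actual state):

hparams = {"vocab_size": 262400, "vocab_size_per_layer_input": 262144}

saved = hparams.pop("vocab_size_per_layer_input", None)
try:
    # ... vocab builders run here and only see the full vocab_size ...
    assert "vocab_size_per_layer_input" not in hparams
finally:
    if saved is not None:
        hparams["vocab_size_per_layer_input"] = saved  # restore for later per-layer sizing

print(hparams["vocab_size_per_layer_input"])           # 262144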

Comment on lines +6027 to +6032
# MobileNetV5 does not use normalisation at all
self.preprocessor_config["image_mean"] = [0.0 , 0.0 , 0.0 ]
self.preprocessor_config["image_std"] = [1.0 , 1.0 , 1.0 ]
self.hparams_vision["image_size"] = self.preprocessor_config.get(
"size", {"height": 768, "width": 768}
)["height"]

⚠️ Potential issue | 🟡 Minor

Clean up list literals to satisfy flake8 E202 (whitespace before ']')

The image_mean / image_std assignments currently have a trailing space before the closing ], which flake8 reports as E202 whitespace before ']'.

Patch to fix flake8 E202
-        self.preprocessor_config["image_mean"] = [0.0 , 0.0 , 0.0 ]
-        self.preprocessor_config["image_std"] = [1.0 , 1.0 , 1.0 ]
+        self.preprocessor_config["image_mean"] = [0.0, 0.0, 0.0]
+        self.preprocessor_config["image_std"] = [1.0, 1.0, 1.0]
🤖 Prompt for AI Agents
In convert_hf_to_gguf.py around lines 6027 to 6032, the list literals for
self.preprocessor_config["image_mean"] and self.preprocessor_config["image_std"]
contain a space before the closing bracket (triggering flake8 E202); remove the
trailing spaces so the lists read [0.0, 0.0, 0.0] and [1.0, 1.0, 1.0]
respectively (keeping commas and spacing consistent), then run flake8 to confirm
the E202 warning is resolved.

…roadcasting of vision inp_per_layer

- Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
gguf-py/gguf/constants.py (1)

459-459: GEMMA3N still missing from VISION_PROJECTOR_TYPE_NAMES dictionary.

This issue was previously flagged but remains unresolved. The GEMMA3N entry was added to the VISION_PROJECTOR_TYPE enum at line 459, but the corresponding mapping in VISION_PROJECTOR_TYPE_NAMES dictionary (lines 856-864) is still missing. This will cause runtime errors when converting the enum value to its string representation.

Please add the mapping:

VISION_PROJECTOR_TYPE.GEMMA3N:   "gemma3n",

Note: QWEN3VL and COGVLM mappings are also still missing as noted in the previous review.

🧹 Nitpick comments (1)
src/models/gemma3n-iswa.cpp (1)

263-274: Consider adding dimension validation for the vision embedding path.

The vision embedding path assumes embd_size = model.tok_embd_per_layer->ne[0] equals n_embd_altup * n_layer. If this assumption is violated, the reshape at line 272 will produce incorrect results or fail silently.

💡 Optional: Add assertion to validate dimensions
 // Vision embedding path: use padding token (ID=0) embedding
 const int64_t embd_size = model.tok_embd_per_layer->ne[0];  // n_embd_altup * n_layer
+GGML_ASSERT(embd_size == n_embd_altup * n_layer && "tok_embd_per_layer dimension mismatch");

 // Extract and dequantize padding token embedding (column 0)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d37c22b and 47b7dd1.

📒 Files selected for processing (3)
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • src/models/gemma3n-iswa.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{c,cpp,h,hpp}

📄 CodeRabbit inference engine (AGENTS.md)

Always format C++ code before committing using git clang-format with the project's .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment void * ptr, reference alignment int & ref)

Files:

  • src/models/gemma3n-iswa.cpp
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Always use the Python environment in .venv and run Python tools from that environment
Apply Python linting rules configured in .flake8 (max-line-length=125, excludes examples/tools) and type checking with pyright

Files:

  • gguf-py/gguf/tensor_mapping.py
  • gguf-py/gguf/constants.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated

Applied to files:

  • src/models/gemma3n-iswa.cpp
  • gguf-py/gguf/constants.py
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated

Applied to files:

  • gguf-py/gguf/constants.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • gguf-py/gguf/constants.py
🧬 Code graph analysis (2)
src/models/gemma3n-iswa.cpp (1)
ggml/src/ggml.c (7)
  • ggml_new_tensor_1d (1747-1752)
  • ggml_set_input (7435-7437)
  • ggml_get_rows (3776-3797)
  • ggml_reshape_3d (3564-3581)
  • ggml_scale (3290-3295)
  • ggml_cpy (3426-3431)
  • ggml_add (1969-1974)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
  • MODEL_TENSOR (465-742)
🔇 Additional comments (7)
gguf-py/gguf/tensor_mapping.py (1)

123-155: Verify V_MM_INP_PROJ comment and tensor reuse.

The mapping for V_MM_INP_PROJ at lines 129-131 is commented as "# gemma3n", but in constants.py (line 668), V_MM_INP_PROJ is defined with comment "# gemma3". Please verify:

  1. Is V_MM_INP_PROJ being reused between gemma3 and gemma3n, or should gemma3n have a distinct tensor?
  2. If it's reused, consider clarifying the comment or noting the shared usage.

Otherwise, the new gemma3n vision tensor mappings look correct and follow the established pattern.

gguf-py/gguf/constants.py (4)

670-678: LGTM: MODEL_TENSOR enum additions.

The new gemma3n vision tensor enum members are correctly defined with auto() and follow the established naming conventions and patterns.


1070-1079: LGTM: TENSOR_NAMES mappings for gemma3n.

The new TENSOR_NAMES mappings correctly associate the MODEL_TENSOR enum members with their GGUF tensor name strings, following established naming conventions.


1178-1186: LGTM: MODEL_TENSORS additions for MMPROJ arch.

The new gemma3n vision tensors are correctly added to the MODEL_TENSORS[MODEL_ARCH.MMPROJ] list, ensuring they will be recognized during tensor loading and validation.


3428-3428: LGTM: VisionProjectorType.GEMMA3N constant.

The GEMMA3N constant is correctly defined with the appropriate lowercase string value "gemma3n", consistent with other projector type definitions.

src/models/gemma3n-iswa.cpp (2)

253-261: LGTM: Token input path refactoring.

The refactoring correctly moves the input object creation and lifecycle management into the token branch scope. The res->add_input(std::move(inp)) call after processing is appropriately placed.


292-292: Broadcasting order change is intentional and properly implemented.

The addition order change—placing per_layer_proj (shape [n_embd_altup, n_layer, n_tokens]) first and inp_per_layer second—is correct per the PR objectives and supports broadcasting for vision inputs (shape [n_embd_altup, n_layer, 1]). The implementation follows standard conventions (larger tensor first) and the code is properly formatted per project guidelines.
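An illustrative NumPy check of that broadcast (NumPy lists dimensions in the reverse order of ggml, but the shape relationship is the same; the sizes are arbitrary placeholders):

import numpy as np

n_embd_altup, n_layer, n_tokens = 256, 30, 5
per_layer_proj = np.zeros((n_tokens, n_layer, n_embd_altup))  # [n_embd_altup, n_layer, n_tokens] in ggml order
inp_per_layer  = np.zeros((1, n_layer, n_embd_altup))         # vision path: a single broadcastable column

out = per_layer_proj + inp_per_layer                          # broadcasts across the token dimension
print(out.shape)                                              # (5, 30, 256)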
