[Mirror] Add Gemma3n multimodal support with MobileNetV5 vision encoder #64
base: master
Conversation
…ert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.
2. Use available tensor mapping logic 3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder
…struct and definitions to mobilenetv5.cpp 2. Remove unused `clip_is_gemma3n` func declarations and definitions 3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std 4. Calculate n_patches using image_size / patch_size
📝 Walkthrough

Adds end-to-end Gemma3n vision (MobileNetV5) support: new Python converter variant and model alias, GGUF constants and tensor mappings for Gemma3n vision, and a full C++ MobileNetV5 graph, loader, and inference-path integration in the mtmd toolchain.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant HF as HuggingFace Model
participant Converter as Python Converter
participant GGUF as GGUF (constants & mappings)
participant Loader as mtmd Loader (C++)
participant Graph as clip_graph_mobilenetv5
participant Runtime as Inference Runtime
HF->>Converter: export tensors & hparams
Converter->>GGUF: apply custom_map / block_tensor_mapping, set GGUF params
Converter-->>Loader: write GGUF with Gemma3n vision tensors
Loader->>Graph: select clip_graph_mobilenetv5 (PROJECTOR_TYPE_GEMMA3N)
Loader->>Graph: load mobilenetv5 tensors (stem, blocks, MSFA)
Graph->>Runtime: build graph (stem → stages → MSFA → embed proj)
Runtime->>Runtime: preprocess image → encode tokens → run graph → produce vision tokens
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
🧹 Nitpick comments (6)
tools/mtmd/models/models.h (1)
69-101: LGTM - MobileNetV5 graph builder declaration follows established patterns.

The `clip_graph_mobilenetv5` struct correctly follows the existing pattern of other graph implementations in this file. The helper method declarations align with their implementations in `mobilenetv5.cpp` (per the code snippets). The `mobilenetv5_block` type reference should resolve correctly via the include chain (`clip-graph.h` → `clip-model.h`).

Optional: Consider making helper methods private

The helper methods (`rms_norm_2d`, `pad_same_2d`, `build_edge_residual`, `build_inverted_residual`, `build_mobilenet_attn`) are implementation details that could be declared as `private`. However, since other graph implementations in this file also use public methods, this is optional and maintaining consistency with the existing codebase pattern is reasonable.

tools/mtmd/clip-model.h (1)
330-345: Appropriate additions for Gemma3n MobileNetV5 encoder.

The additions to `clip_model` are well-structured:

- MobileNetV5 components use `std::vector` for flexible block management
- MSFA (Multi-Scale Fusion Adapter) components are properly prefixed and organized
- Naming conventions are consistent with existing fields
Note: There's an extra blank line at line 346, which may be intentional for readability but could be removed for consistency.
tools/mtmd/clip.cpp (2)
1619-1622: Use tensor name macros instead of hard-coded strings.

For consistency with the rest of the codebase and maintainability, consider defining macros for these tensor names in `clip-impl.h`:

```diff
- model.mm_0_w = get_tensor("mm.embedding.weight", false);      // Input embedding
- model.mm_1_w = get_tensor("mm.hard_emb_norm.weight", false);  // Hard embedding norm
+ model.mm_0_w = get_tensor(TN_MM_EMBEDDING, false);      // Input embedding
+ model.mm_1_w = get_tensor(TN_MM_HARD_EMB_NORM, false);  // Hard embedding norm
```

This would require adding corresponding macros to `clip-impl.h`.
1528-1623: Consider adding bounds checking for stage index access.

The dynamic block discovery loop iterates over 4 stages (0-3), but the `mobilenet_stage_ends` vector is populated based on actual blocks found. When accessing `model.mobilenet_stage_ends[2]` and `[3]` later in `mobilenetv5.cpp::is_fusion_point()`, ensure bounds checks are in place to avoid undefined behavior if fewer than 4 stages contain blocks.

The current check at lines 284-288 (`if (model.mobilenet_stage_ends.size() >= 4)`) in `mobilenetv5.cpp` handles this, but consider adding a validation after loading:

```cpp
if (model.mobilenet_stage_ends.size() < 4) {
    LOG_WRN("%s: GEMMA3N expected 4 stages but found %zu\n", __func__, model.mobilenet_stage_ends.size());
}
```

tools/mtmd/models/mobilenetv5.cpp (2)
329-336: Remove or use the commented-out variable `scale_h`.

The variable `scale_h` is declared but commented out. Either use it for validation or remove it to avoid confusion:

```diff
  int scale_w = high_res_w / feat_w;
- // int scale_h = high_res_h / feat_h;
```

If height scaling should be validated separately:

```cpp
int scale_h = high_res_h / feat_h;
if (scale_w != scale_h) {
    LOG_WRN("%s: non-uniform scaling in MSFA (scale_w=%d, scale_h=%d)\n", __func__, scale_w, scale_h);
}
```
381-381: Consider making target output resolution configurable.

The target output resolution is hard-coded as 16:

```cpp
const int target_out_res = 16;
```

Consider making this a model hyperparameter or deriving it from the model configuration to improve flexibility for future MobileNetV5 variants.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
convert_hf_to_gguf.py, gguf-py/gguf/constants.py, gguf-py/gguf/tensor_mapping.py, src/models/gemma3n-iswa.cpp, tools/mtmd/CMakeLists.txt, tools/mtmd/clip-impl.h, tools/mtmd/clip-model.h, tools/mtmd/clip.cpp, tools/mtmd/models/mobilenetv5.cpp, tools/mtmd/models/models.h, tools/mtmd/mtmd.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.{cpp,h,hpp}: Always format C++ code using `git clang-format` before committing, following the .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary
Files:
tools/mtmd/models/models.h, tools/mtmd/clip-model.h, tools/mtmd/clip.cpp, tools/mtmd/models/mobilenetv5.cpp, src/models/gemma3n-iswa.cpp, tools/mtmd/mtmd.cpp, tools/mtmd/clip-impl.h
**/*.{cpp,h,hpp,py}
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible
Files:
tools/mtmd/models/models.h, tools/mtmd/clip-model.h, tools/mtmd/clip.cpp, tools/mtmd/models/mobilenetv5.cpp, gguf-py/gguf/tensor_mapping.py, src/models/gemma3n-iswa.cpp, tools/mtmd/mtmd.cpp, tools/mtmd/clip-impl.h, convert_hf_to_gguf.py, gguf-py/gguf/constants.py
**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
**/*.py: Always activate the Python virtual environment in `.venv` and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in `.flake8`
Ensure Python code passes pyright type checking as configured in `pyrightconfig.json`
Files:
gguf-py/gguf/tensor_mapping.py, convert_hf_to_gguf.py, gguf-py/gguf/constants.py
src/**/*.cpp
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Prioritize performance optimization in core library implementations in `src/`, as this is a performance-critical inference library
Files:
src/models/gemma3n-iswa.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
tools/mtmd/clip.cpp, tools/mtmd/mtmd.cpp, tools/mtmd/clip-impl.h, gguf-py/gguf/constants.py
🧬 Code graph analysis (6)
tools/mtmd/models/models.h (1)
tools/mtmd/models/mobilenetv5.cpp (12)
build (252-463), build (252-252), rms_norm_2d (5-20), rms_norm_2d (5-5), pad_same_2d (23-53), pad_same_2d (23-23), build_edge_residual (57-88), build_edge_residual (57-57), build_inverted_residual (90-151), build_inverted_residual (90-90), build_mobilenet_attn (154-250), build_mobilenet_attn (154-154)
tools/mtmd/clip.cpp (3)
common/common.cpp (4)
model (1159-1161), model (1159-1159), string_format (399-412), string_format (399-399)
src/llama-model.cpp (2)
get_tensor (7044-7054), get_tensor (7044-7044)
tools/server/server-context.cpp (2)
params (607-853), params (607-607)
tools/mtmd/models/mobilenetv5.cpp (2)
ggml/src/ggml.c (13)
ggml_permute (3700-3752), ggml_rms_norm (3066-3071), ggml_pad_ext (4983-5016), ggml_conv_2d_direct (4702-4736), ggml_gelu (2677-2681), ggml_conv_2d_dw (4637-4658), ggml_reshape_4d (3583-3601), ggml_reshape_3d (3564-3581), ggml_scale (3290-3295), ggml_soft_max (3966-3970), ggml_upscale (4928-4935), ggml_concat (2517-2544), ggml_pool_2d (4852-4878)
tools/mtmd/clip.cpp (9)
build_inp_raw (469-474), build_inp_raw (469-469), model (217-219), model (935-1261), model (935-935), model (2038-2051), model (2038-2038), s (2446-2448), s (2446-2446)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
MODEL_TENSOR (465-736)
tools/mtmd/mtmd.cpp (1)
tools/mtmd/clip.cpp (4)
ctx (2490-2593), ctx (2490-2490), clip_get_projector_type (3737-3739), clip_get_projector_type (3737-3737)
convert_hf_to_gguf.py (2)
gguf-py/gguf/gguf_writer.py (2)
add_clip_projector_type (1076-1077), add_vision_attention_layernorm_eps (1097-1098)
gguf-py/gguf/constants.py (2)
VisionProjectorType (3408-3429), GGMLQuantizationType (3283-3315)
🪛 GitHub Actions: flake8 Lint
convert_hf_to_gguf.py
[error] 5974-5974: flake8: E302 expected 2 blank lines, found 1. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
[error] 6010-6010: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
[error] 6011-6011: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
[error] 6045-6045: flake8: W504 line break after binary operator. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
5969-5969: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: android-ndk-build (arm64-snapdragon, --preset arm64-android-snapdragon-release)
- GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
- GitHub Check: android-ndk-build (arm64-cpu, -D ANDROID_ABI=arm64-v8a -D ANDROID_PLATFORM=android-31 -D CMAKE_TO...
- GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
- GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
- GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
- GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
- GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
- GitHub Check: ubuntu-latest-cmake-cuda
- GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
- GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
- GitHub Check: android-build
- GitHub Check: ubuntu-latest-cmake-sanitizer (ADDRESS, Debug)
- GitHub Check: ubuntu-latest-cmake-sanitizer (UNDEFINED, Debug)
- GitHub Check: ubuntu-latest-llguidance
- GitHub Check: ubuntu-cpu-cmake (arm64, ubuntu-22.04-arm)
- GitHub Check: ubuntu-latest-cmake-sanitizer (THREAD, Debug)
- GitHub Check: macOS-latest-cmake-arm64
- GitHub Check: ubuntu-latest-cmake-rpc
- GitHub Check: pyright type-check
🔇 Additional comments (20)
src/models/gemma3n-iswa.cpp (1)
262-306: Vision input path implementation looks correct, but verify commented-out scaling.

The implementation correctly broadcasts the padding token embedding (token ID 0) across all vision tokens, matching the PyTorch behavior described in comments. The approach of creating zeros via `ggml_scale(..., 0.0f)` is valid.

A few observations:

- Line 305: the `sqrtf((float) n_embd_altup)` scaling is commented out for vision inputs, while it's applied for text inputs (line 259). Please confirm this difference is intentional per the model specification.
- The `inp->embd` tensor is allocated and marked as input (lines 270-271) but isn't used in the subsequent computation - the zeros are created from the `per_layer_model_proj` projection instead. This appears intentional as the embeddings will be fed separately, but worth confirming the graph input wiring is correct.

tools/mtmd/mtmd.cpp (2)
269-272: LGTM - GEMMA3N correctly inherits GEMMA3's image token handling.

The addition of `PROJECTOR_TYPE_GEMMA3N` alongside `PROJECTOR_TYPE_GEMMA3` correctly sets up the same `<start_of_image>` and `<end_of_image>` tokens for the Gemma3n vision path.
861-866: LGTM - Non-causal decode handling extended to GEMMA3N.

The logic correctly includes `PROJECTOR_TYPE_GEMMA3N` in the non-causal decoding path, maintaining parity with GEMMA3.

Minor observation: Consider extracting the repeated `clip_get_projector_type(ctx->ctx_v)` call to a local variable for readability, though this is optional given the function is lightweight.

tools/mtmd/CMakeLists.txt (1)
30-30: LGTM - MobileNetV5 source file added to build.

The new `models/mobilenetv5.cpp` is correctly included in the mtmd library sources. This enables the MobileNetV5-based graph construction for Gemma3n vision support.

gguf-py/gguf/tensor_mapping.py (1)
122-142: LGTM - New Gemma3n vision tensor mappings added.

The new tensor mappings for MobileNetV5-based Gemma3n vision support are correctly structured and follow the existing pattern. The mappings align with the `MODEL_TENSOR` enums defined in `constants.py`.

Note: The comments label these as "gemma3n", which is accurate for `V_MM_EMBEDDING`, `V_MM_HARD_EMB_NORM`, and `V_MM_POST_PROJ_NORM`. For `V_MM_INP_PROJ` and `V_MM_SOFT_EMB_NORM`, the constants.py comments indicate "gemma3", but this appears to be reusing existing tensor types with new HuggingFace tensor name mappings for gemma3n, which is a valid pattern.

tools/mtmd/clip-model.h (1)
175-212: Well-structured MobileNetV5 block definition.

The `mobilenetv5_block` struct is comprehensive and well-organized, covering all necessary components for the Gemma3n vision encoder:
- Stage 0 (Edge Residual) and Stage 1+ (Universal Inverted Residual) convolutions with batch normalization
- Multi-Query Attention (MQA) components with optional downsampling
- Layer scale and block normalization
The struct follows the existing naming conventions and patterns in the file.

gguf-py/gguf/constants.py (3)
gguf-py/gguf/constants.py (3)
670-672: LGTM: New gemma3n tensor types properly defined.The three new tensor types (
V_MM_EMBEDDING,V_MM_HARD_EMB_NORM,V_MM_POST_PROJ_NORM) are:
- Properly prefixed with
V_MM_following the existing naming convention- Clearly documented as gemma3n-specific
- Correctly placed within the MODEL_TENSOR enum
1065-1067: Correct tensor mappings and MMPROJ integration.

The tensor name mappings and MMPROJ architecture additions are properly implemented:

- String names follow the `mm.*` convention used for multimodal tensors
- Tensors are correctly added to the `MODEL_TENSORS[MODEL_ARCH.MMPROJ]` list
- Consistent with existing patterns in the file
Also applies to: 1166-1168
1947-1981: Complete GEMMA3N architecture tensor list.

The `MODEL_ARCH.GEMMA3N` tensor list is comprehensive and well-organized:

- Includes all standard Gemma3 tensors (token embedding, attention, FFN)
- Properly extends with gemma3n-specific components:
- Per-layer tensors (`PER_LAYER_TOKEN_EMBD`, `PER_LAYER_MODEL_PROJ`, etc.)
- Altup tensors for alternative upsampling/routing
- Laurel tensors for layer-wise processing
- Comments clearly indicate the purpose of specialized tensor groups
This ensures proper serialization and deserialization of Gemma3n models.
tools/mtmd/clip.cpp (4)
3128-3133: Potential issue with n_patches calculation for GEMMA3N.

The calculation `n_patches = image_size / patch_size` computes a single dimension (e.g., 16 for 256/16), but `n_patches` typically represents the total number of patches (i.e., `(image_size / patch_size)^2`). The comment says "MSFA adapter always outputs fixed 16x16 resolution", which suggests the result should be 256, not 16.

If the output is indeed 16x16, the calculation should be:

```diff
- n_patches = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+ int n_per_side = ctx->model.hparams.image_size / ctx->model.hparams.patch_size;
+ n_patches = n_per_side * n_per_side;
```

If this is intentional (e.g., the MSFA outputs a single row of 16 tokens), please add a clarifying comment.
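For reference, a quick Python sketch of the difference between the per-side count and the total token count, assuming the 768×768 input and 16×16 output grid discussed in this PR (illustrative only, not project code):

```python
# Illustrative check: per-side patches vs. total tokens,
# assuming image_size = 768 and a 16x16 output grid (patch_size = 48).
image_size = 768
patch_size = 48

n_per_side = image_size // patch_size  # 16
n_patches = n_per_side * n_per_side    # 256, the total token count

print(n_per_side, n_patches)           # 16 256
```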
791-794: LGTM: GEMMA3N graph builder integration.

The new case for `PROJECTOR_TYPE_GEMMA3N` correctly uses the dedicated `clip_graph_mobilenetv5` builder, consistent with how other projector types are handled.
1148-1155: LGTM: GEMMA3N hparams configuration.

The default `n_merge = 1` with an optional override via `KEY_PROJ_SCALE_FACTOR` follows the established pattern for other projector types.
2859-2867: LGTM: GEMMA3N preprocessing.

The preprocessing correctly resizes the image to a square using bilinear interpolation without padding (`false` parameter), then normalizes using the configured mean/std values, matching the expected MobileNetV5 input format.

tools/mtmd/models/mobilenetv5.cpp (4)
5-20: LGTM: RMS Norm 2D helper implementation.

The `rms_norm_2d` helper correctly permutes the tensor to normalize over channels for each spatial position, applies the standard RMSNorm operation, and optionally applies the learned weight before permuting back. The use of `ggml_cont` after permute ensures the tensor is contiguous for subsequent operations.
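As a reference for what "normalize over channels per spatial position" means here, a small NumPy sketch (an approximation of the behavior, not the ggml code; the layout and epsilon are assumptions):

```python
import numpy as np

def rms_norm_2d(x: np.ndarray, weight: np.ndarray | None = None, eps: float = 1e-6) -> np.ndarray:
    # x: (B, C, H, W); the C-dimensional vector at each (b, y, x) is RMS-normalized
    rms = np.sqrt(np.mean(x * x, axis=1, keepdims=True) + eps)
    y = x / rms
    if weight is not None:
        y = y * weight.reshape(1, -1, 1, 1)  # optional learned per-channel weight
    return y

x = np.random.randn(1, 8, 4, 4).astype(np.float32)
print(rms_norm_2d(x, np.ones(8, dtype=np.float32)).shape)  # (1, 8, 4, 4)
```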
22-53: LGTM: SAME padding implementation.

The `pad_same_2d` helper correctly implements TensorFlow/PyTorch-style asymmetric SAME padding. The ceiling division for output size and the asymmetric split of padding (bottom/right gets the extra pixel) matches the expected behavior.
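The padding arithmetic being described can be sketched in a few lines of plain Python (a reference sketch of the general SAME-padding rule, not the ggml code):

```python
import math

def same_pad_1d(in_size: int, kernel: int, stride: int, dilation: int = 1) -> tuple[int, int]:
    # Output size is ceil(in / stride); any odd total padding puts the extra pixel after (bottom/right).
    out_size = math.ceil(in_size / stride)
    eff_kernel = (kernel - 1) * dilation + 1
    total = max(0, (out_size - 1) * stride + eff_kernel - in_size)
    before = total // 2          # top/left
    after = total - before       # bottom/right gets the extra pixel
    return before, after

print(same_pad_1d(7, 3, 2))   # (1, 1)
print(same_pad_1d(8, 3, 2))   # (0, 1) -> asymmetric split
```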
153-250: LGTM: Attention block implementation.

The `build_mobilenet_attn` function correctly implements multi-query attention (a toy sketch follows the list below) with:
- Optional input normalization
- Downsampled K/V paths using depthwise convolutions
- Proper Q/K/V reshaping and permutation for attention
- Scaled dot-product attention with softmax
- Output projection with optional layer scaling and residual connection
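A toy NumPy sketch of the multi-query attention pattern: several query heads attend over a single shared K/V head (shapes and head counts are illustrative assumptions, not the actual MobileNetV5 dimensions):

```python
import numpy as np

def mqa(x, wq, wk, wv, n_heads):
    T, _ = x.shape
    d_head = wq.shape[1] // n_heads
    q = (x @ wq).reshape(T, n_heads, d_head)      # per-head queries
    k = x @ wk                                    # single shared key head
    v = x @ wv                                    # single shared value head
    out = np.empty_like(q)
    for h in range(n_heads):
        att = q[:, h, :] @ k.T / np.sqrt(d_head)  # scaled dot product
        att = np.exp(att - att.max(axis=-1, keepdims=True))
        att /= att.sum(axis=-1, keepdims=True)    # softmax
        out[:, h, :] = att @ v
    return out.reshape(T, n_heads * d_head)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
y = mqa(x, rng.standard_normal((32, 64)), rng.standard_normal((32, 16)), rng.standard_normal((32, 16)), n_heads=4)
print(y.shape)  # (16, 64)
```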
403-463: LGTM: Gemma3n multimodal projection.

The embedding/projection logic correctly:
- Permutes and flattens spatial dimensions to sequence format
- Applies feature scaling by sqrt(hidden_size)
- Applies soft embedding normalization with optional learned weight
- Projects to text hidden size via linear layer
- Applies post-projection RMSNorm
This matches the expected Gemma3n vision embedder architecture.
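For intuition, a NumPy sketch of this projection pipeline; the 2048-channel input and 2560-dim text embedding are assumed example sizes, not values read from the model:

```python
import numpy as np

def rms_norm(x, w=None, eps=1e-6):
    y = x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return y * w if w is not None else y

C, n_tokens, n_embd_text = 2048, 256, 2560             # assumed sizes
feat = np.random.randn(16, 16, C).astype(np.float32)   # H x W x C feature map

tokens = feat.reshape(n_tokens, C)                      # flatten spatial dims to a sequence
tokens = tokens * np.sqrt(C)                            # feature scaling by sqrt(hidden_size)
tokens = rms_norm(tokens, np.ones(C, np.float32))       # soft embedding norm (learned weight)
proj_w = np.random.randn(C, n_embd_text).astype(np.float32)
embd = tokens @ proj_w                                  # linear projection to text hidden size
embd = rms_norm(embd)                                   # post-projection RMSNorm
print(embd.shape)  # (256, 2560)
```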
tools/mtmd/clip-impl.h (2)
156-195: LGTM: MobileNetV5 tensor name macros.

The tensor name macros are well-organized by component (stem, edge residual, inverted residual, attention, MSFA) and follow the established naming conventions. The `%d.%d` format for stage/block indexing aligns with the dynamic loading logic in `clip.cpp`.
214-214: LGTM: GEMMA3N projector type registration.

The new `PROJECTOR_TYPE_GEMMA3N` enum value and its string mapping `"gemma3n"` are correctly placed and follow the existing pattern.

Also applies to: 245-245
convert_hf_to_gguf.py (1)
522-527: Robust handling of empty `tensor_map.mapping` for block_count=0 looks good.

Using a guarded branch for `max_name_len` avoids a `ValueError` when `gguf.get_tensor_name_map(..., block_count=0)` produces an empty mapping (e.g., MobileNetV5-based encoders) and only changes log formatting width. No further changes needed here.
```python
# Image sequence length (256 tokens = 16x16 for Gemma3n)
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
self.hparams_vision["patch_size"] = image_size // image_seq_length
```
🧩 Analysis chain
🏁 Script executed:
# Read the convert_hf_to_gguf.py file around lines 6008-6012
find . -name "convert_hf_to_gguf.py" -type f | head -1 | xargs -I {} sh -c 'wc -l {} && sed -n "6000,6020p" {} | cat -n'Repository: ngxson/llama.cpp
Length of output: 1253
Patch size computation for Gemma3n MobileNetV5 is off by a factor of √image_seq_length
The code currently computes patch_size = 768 // 256 = 3, which results in n_per_side = 256 and tokens = 65,536—wildly inconsistent with the 16×16 grid (256 tokens) described in the comment.
The issue is treating image_seq_length (total token count) as if it were patch size. Gemma 3n uses MobileNet v5 with a default image resolution of 768×768 pixels, and image_seq_length defaults to 256.
The correct approach derives patch size from the per-side token count:

```python
n_per_side = int(image_seq_length ** 0.5)  # 16 for 256 tokens
patch_size = image_size // n_per_side      # 768 // 16 = 48
```

This yields n_per_side = 16 and tokens = 256, matching the expected grid layout and HF processor config.
🧰 Tools
🪛 GitHub Actions: flake8 Lint
[error] 6010-6010: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
[error] 6011-6011: flake8: E202 whitespace before ']'. Command: /opt/hostedtoolcache/Python/3.11.14/x64/bin/flake8
🤖 Prompt for AI Agents
In convert_hf_to_gguf.py around lines 6008 to 6012, the code treats
image_seq_length as a per-side patch count which yields an incorrect tiny
patch_size; instead compute n_per_side = int(image_seq_length ** 0.5) to get the
number of patches per side (e.g. 16 for 256 tokens) and then set patch_size =
image_size // n_per_side (e.g. 768 // 16 = 48), and assign that patch_size back
into self.hparams_vision so the tokens and grid (n_per_side²) match the HF
processor config.
```python
RESAMPLER = auto()
GLM_EDGE = auto()
MERGER = auto()
GEMMA3N = auto()
```
Add GEMMA3N mapping to VISION_PROJECTOR_TYPE_NAMES.
The GEMMA3N entry was added to VISION_PROJECTOR_TYPE enum but is missing from the VISION_PROJECTOR_TYPE_NAMES dictionary at lines 850-858. This mapping is used to convert the enum value to its string representation.
🔎 Proposed fix
Add the mapping to the VISION_PROJECTOR_TYPE_NAMES dictionary:
VISION_PROJECTOR_TYPE_NAMES: dict[VISION_PROJECTOR_TYPE, str] = {
VISION_PROJECTOR_TYPE.MLP: "mlp",
VISION_PROJECTOR_TYPE.LDP: "ldp",
VISION_PROJECTOR_TYPE.LDPV2: "ldpv2",
VISION_PROJECTOR_TYPE.RESAMPLER: "resampler",
VISION_PROJECTOR_TYPE.GLM_EDGE: "adapter",
VISION_PROJECTOR_TYPE.MERGER: "qwen2vl_merger",
VISION_PROJECTOR_TYPE.GEMMA3: "gemma3",
+ VISION_PROJECTOR_TYPE.GEMMA3N: "gemma3n",
+ VISION_PROJECTOR_TYPE.QWEN3VL: "qwen3vl_merger",
+ VISION_PROJECTOR_TYPE.COGVLM: "cogvlm",
}

Note: QWEN3VL and COGVLM are also missing from this dictionary.
🤖 Prompt for AI Agents
In gguf-py/gguf/constants.py around line 459 (GEMMA3N added to
VISION_PROJECTOR_TYPE) and the VISION_PROJECTOR_TYPE_NAMES dictionary at lines
~850-858, the GEMMA3N enum value is not mapped to its string representation; add
an entry mapping VISION_PROJECTOR_TYPE.GEMMA3N to "GEMMA3N" in that dictionary.
Also add missing mappings for QWEN3VL and COGVLM (e.g.,
VISION_PROJECTOR_TYPE.QWEN3VL -> "QWEN3VL" and VISION_PROJECTOR_TYPE.COGVLM ->
"COGVLM") so all enum members have corresponding string names. Ensure
formatting/commas match the surrounding dictionary entries.
🧩 Analysis chain

@ngxson, I'll save these preferences for mirrored PRs and apply them to this review. Let me fetch the upstream PR details and conduct a thorough review.

@ngxson, I've completed a thorough review of this mirrored PR from simrnsingh. Here are my findings:

✅ Good News

No suspicious AI-generated code comments were found (no "CRITICAL:", "You are correct here...", etc.). The contributor properly disclosed Claude Code usage for exploration and boilerplates.

🐛 Issues Found

1. Critical: Incorrect patch_size calculation (convert_hf_to_gguf.py, line 6011)

```python
# Current code:
image_seq_length = 256  # 16x16 grid
image_size = 768
patch_size = image_size // image_seq_length  # = 768 // 256 = 3
```

Problem: a patch_size of 3 does not match the intended 16x16 grid. Should be:

```python
n_per_side = int(image_seq_length ** 0.5)  # sqrt(256) = 16
patch_size = image_size // n_per_side      # = 768 // 16 = 48
```

This matches how Gemma3 (non-N) calculates it at line 5927, and the C++ code at clip.cpp:3132 derives the token count from this value.

2. Hardcoded magic values (convert_hf_to_gguf.py, lines 5990-5995)

```python
def find_vparam(self, keys: list[str], optional: bool = False) -> Any:
    if "intermediate_size" in keys:
        hidden_size = self.hparams_vision.get("hidden_size", 2048)  # Hardcoded default
        return hidden_size * 4  # Hardcoded expansion factor
    if "num_attention_heads" in keys or "num_heads" in keys:
        return 8  # Hardcoded MQA heads
```

These should ideally be read from the vision config if available, rather than hardcoded. While they may be correct for the current Gemma3n model, they're not future-proof.

3. Potential unsafe tensor dimension access (mobilenetv5.cpp, lines 138, 243)

The code accesses:

```cpp
if (block.layer_scale_w) {
    ggml_tensor * scale_w_reshaped = ggml_reshape_4d(ctx0, block.layer_scale_w,
                                                     1, 1, block.layer_scale_w->ne[0], 1);  // Safe: inside if-block
```

This is actually safe due to the if-check, but similar patterns exist that should be verified.

4. Hardcoded assumption (mobilenetv5.cpp, line 407)

```cpp
int C = cur->ne[2];  // Should be 2048
```

This comment reveals a hardcoded assumption about channel count that might not hold for all model variants.

5. Minor: Commented-out code (mobilenetv5.cpp, line 331)

```cpp
int scale_w = high_res_w / feat_w;
// int scale_h = high_res_h / feat_h;  // Unused
```

The contributor left commented-out code that should probably be removed.

📝 Recommendations

The patch_size calculation bug should be fixed before merging. The other issues are less critical but should be addressed for code quality.

✏️ Learnings added

🧠 Learnings used
…cit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest
- convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf
- mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_*
- clip.cpp: Remove unused embedding and hard_emb_norm tensor loading
Actionable comments posted: 2
♻️ Duplicate comments (2)
gguf-py/gguf/constants.py (1)
857-865: [Duplicate] Missing VISION_PROJECTOR_TYPE_NAMES mappings for GEMMA3N, QWEN3VL, and COGVLM.

This issue was already flagged in previous reviews. The `VISION_PROJECTOR_TYPE_NAMES` dictionary is missing mappings for the `GEMMA3N` (line 459), `QWEN3VL` (line 461), and `COGVLM` (line 462) enum entries. This will cause a `KeyError` if code attempts to convert these enum values to their string representations using this dictionary.

Based on the previous review, the fix should add:

```python
VISION_PROJECTOR_TYPE.GEMMA3N: "gemma3n",
VISION_PROJECTOR_TYPE.QWEN3VL: "qwen3vl_merger",
VISION_PROJECTOR_TYPE.COGVLM: "cogvlm",
```

convert_hf_to_gguf.py (1)
5966-6091: Fix Gemma3n patch_size computation and note minor cleanups.

There are a few points here:

1. Patch size computation for Gemma3n MobileNetV5 is still wrong.

In `Gemma3nVisionModel.set_gguf_parameters`:

```python
image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
image_size = self.hparams_vision["image_size"]
self.hparams_vision["patch_size"] = image_size // image_seq_length
```

With the default Gemma3n setup (768×768, `image_seq_length = 256`), this yields `patch_size = 3`, which implies a 256×256 grid and 65,536 patches, while the comment explicitly states "256 tokens = 16×16". Patch size should be derived from tokens per side, not total tokens.

Recommended fix (same issue as previously flagged in an earlier review; applying it here for the new MobileNetV5 path as well):

Proposed fix for patch_size in Gemma3nVisionModel

```diff
- # Image sequence length (256 tokens = 16x16 for Gemma3n)
- image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
- image_size = self.hparams_vision["image_size"]
- self.hparams_vision["patch_size"] = image_size // image_seq_length
+ # Image sequence length (e.g. 256 tokens = 16x16 grid for Gemma3n)
+ image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
+ image_size = self.hparams_vision["image_size"]
+ # Derive patch size from patches-per-side, not total token count
+ n_per_side = int(image_seq_length ** 0.5)
+ if n_per_side * n_per_side != image_seq_length:
+     raise ValueError(
+         f"image_seq_length={image_seq_length} is not a perfect square; "
+         "cannot infer square patch grid for Gemma3n vision encoder")
+ self.hparams_vision["patch_size"] = image_size // n_per_side
```

This matches the intended 16×16 grid for 256 tokens and keeps `patch_size` consistent with how other vision encoders in this file derive it.

2. Vocab / embedding handling for Gemma3n text model is a solid improvement.

- `Gemma3NModel.set_vocab` temporarily removes `vocab_size_per_layer_input` so `_create_vocab_sentencepiece()` uses the full `vocab_size` (including the vision/audio special tokens) and then restores it.
- The new `modify_tensors` branch pads `embed_tokens.weight` and per-layer embeddings up to `vocab_size`, instead of truncating to `vocab_size_per_layer_input`, which is required for multimodal Gemma3n. This avoids dropping the 262144–262399 special IDs and aligns the text embeddings with the tokenizer. Behavior looks correct and non-regressive for pure-text use.

3. Optional typing/ruff cleanup for class attributes (RUF012).

In `Gemma3nVisionModel`:

```python
n_block_keys = []
block_tensor_mapping = { ... }
```

These are effectively class-level constants. To satisfy RUF012 and make the intent explicit to type checkers, consider:

Optional ClassVar annotation tweak

```diff
-from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast
+from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast, ClassVar
 ...
-    n_block_keys = []
+    n_block_keys: ClassVar[list[str]] = []
 ...
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
         ...
     }
```

This is purely a typing / tooling nicety; behavior is unchanged.

Overall, once the `patch_size` formula is corrected, the Gemma3n MobileNetV5 path and the Gemma3n text vocab/embedding logic look structurally sound for the mirrored upstream changes.

Also applies to: 6115-6191
🧹 Nitpick comments (4)
gguf-py/gguf/tensor_mapping.py (1)
123-158: Gemma3n vision tensor mappings look consistent.

The new V_MM_* and V_ENC_* entries align with the Gemma3n/MobileNetV5 tensor paths used in `convert_hf_to_gguf.py` and constants, so TensorNameMap will correctly resolve `model.embed_vision.*` and `model.vision_tower.timm_model.*` for Gemma3n.

Note that `V_MM_INP_PROJ` / `V_MM_SOFT_EMB_NORM` now have both generic `multi_modal_projector.*` and Gemma3n-specific `model.embed_vision.*` synonyms; that's fine, but if more variants start using these tensors it may be worth documenting this dual use to avoid confusion later.

tools/mtmd/clip.cpp (1)
1528-1620: Verify `mobilenetv5_block` default initialization and stage boundary assumptions.

The GEMMA3N tensor-loading branch dynamically discovers MobileNetV5 blocks per stage and accumulates them in `model.mobilenet_blocks`, with stage ends recorded in `model.mobilenet_stage_ends`. This is a good direction, but a couple of edge conditions are worth double-checking:

- `mobilenetv5_block block;` relies on the struct's members being safely default-initialized (e.g., `ggml_tensor * foo = nullptr;` or an explicit ctor). If any members lack default member initializers, they will contain indeterminate values in paths where that sub-block type is absent (e.g., pure attention blocks vs. pure UIR blocks). Please confirm `mobilenetv5_block` is defined with default member initializers or add `= {}` here to value-initialize it.
- The `for (int blk_idx = 0; ; ++blk_idx)` loop for each stage stops at the first `blk_idx` that yields no tensors. This assumes that all blocks in a stage are densely indexed from 0..N-1 with no gaps. If future variants ever introduce gaps, discovery would silently truncate later blocks. That's probably fine for current Gemma3n, but worth keeping in mind if more MobileNetV5 variants are added.
- `mobilenet_stage_ends.push_back(model.mobilenet_blocks.size() - 1)` stores inclusive global indices. Ensure mobilenetv5.cpp interprets these indices the same way (inclusive vs exclusive) when iterating.

If you confirm `mobilenetv5_block` is defined with safe defaults and that stage indices are inclusive by design, this loader logic looks solid.

tools/mtmd/models/mobilenetv5.cpp (2)
372-372: Consider making target output resolution configurable.

The hardcoded `target_out_res = 16` assumes a fixed output resolution for the MSFA downsampling stage. If the model architecture varies or if different Gemma3n variants use different resolutions, this value should be read from the model config rather than hardcoded.

💡 Suggested approach

Add a field to the model config for MSFA output resolution and read it during model loading, falling back to 16 if not present:

```cpp
// Example usage (adjust based on actual model structure):
const int target_out_res = model.msfa_output_res ? model.msfa_output_res : 16;
```

Alternatively, if 16 is the fixed resolution for all Gemma3n models, consider adding a comment explaining this architectural constraint.

Based on learnings, this is a mirrored PR—please verify with the upstream contributor whether this value should remain fixed or be made configurable.
420-420: Consider extracting RMS norm epsilon as a named constant.

The hardcoded epsilon value `1e-6f` appears twice (lines 420, 442) for Gemma3n RMS normalization. Extracting it as a named constant improves maintainability and makes the architectural choice explicit.

💡 Suggested refactor

Near the top of the file or in the class definition:

```cpp
static constexpr float GEMMA3N_RMS_NORM_EPS = 1e-6f;
```

Then use it consistently:

```diff
- const float eps = 1e-6f; // Gemma3n uses 1e-6
- cur = ggml_rms_norm(ctx0, cur, eps);
+ cur = ggml_rms_norm(ctx0, cur, GEMMA3N_RMS_NORM_EPS);
```

Also applies to: 442-442
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
convert_hf_to_gguf.py, gguf-py/gguf/constants.py, gguf-py/gguf/tensor_mapping.py, tools/mtmd/clip.cpp, tools/mtmd/models/mobilenetv5.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{c,cpp,h,hpp}
📄 CodeRabbit inference engine (AGENTS.md)
Always format C++ code before committing using `git clang-format` with the project's `.clang-format` configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment `void * ptr`, reference alignment `int & ref`)
Files:
tools/mtmd/models/mobilenetv5.cpp, tools/mtmd/clip.cpp
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Always use the Python environment in `.venv` and run Python tools from that environment
Apply Python linting rules configured in `.flake8` (max-line-length=125, excludes examples/tools) and type checking with pyright
Files:
gguf-py/gguf/constants.py, gguf-py/gguf/tensor_mapping.py, convert_hf_to_gguf.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated
Applied to files:
tools/mtmd/models/mobilenetv5.cpp, gguf-py/gguf/constants.py, tools/mtmd/clip.cpp, convert_hf_to_gguf.py
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated
Applied to files:
tools/mtmd/models/mobilenetv5.cpp, gguf-py/gguf/constants.py, tools/mtmd/clip.cpp, convert_hf_to_gguf.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
gguf-py/gguf/constants.py, tools/mtmd/clip.cpp
🧬 Code graph analysis (4)
tools/mtmd/models/mobilenetv5.cpp (2)
ggml/src/ggml.c (17)
ggml_permute (3700-3752), ggml_cont (3461-3465), ggml_rms_norm (3066-3071), ggml_mul (2170-2175), ggml_pad_ext (4983-5016), ggml_conv_2d_direct (4702-4736), ggml_gelu (2677-2681), ggml_add (1969-1974), ggml_reshape_3d (3564-3581), ggml_reshape_4d (3583-3601), ggml_mul_mat (3174-3189), ggml_scale (3290-3295), ggml_soft_max (3966-3970), ggml_upscale (4928-4935), ggml_concat (2517-2544), ggml_pool_2d (4852-4878), ggml_build_forward_expand (6793-6795)
tools/mtmd/clip.cpp (9)
build_inp_raw (469-474), build_inp_raw (469-469), model (217-219), model (935-1261), model (935-935), model (2035-2048), model (2035-2035), s (2443-2445), s (2443-2443)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
MODEL_TENSOR (465-743)
tools/mtmd/clip.cpp (2)
common/common.cpp (4)
model (1159-1161), model (1159-1159), string_format (399-412), string_format (399-399)
src/llama-model.cpp (2)
get_tensor (7044-7054), get_tensor (7044-7044)
convert_hf_to_gguf.py (1)
gguf-py/gguf/constants.py (1)
VisionProjectorType (3429-3450)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
5969-5969: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
5972-5995: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
6067-6067: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (6)
tools/mtmd/clip.cpp (1)
783-795: GEMMA3N graph dispatch is consistent with the new MobileNetV5 graph.

Routing `PROJECTOR_TYPE_GEMMA3N` to `clip_graph_mobilenetv5` mirrors how other projector types choose their specialized graphs and keeps mobilenet-specific logic isolated from the generic ViT path. No issues here.

convert_hf_to_gguf.py (1)
522-527: prepare_tensors: safe handling of empty tensor_map looks good.

Guarding `self.tensor_map.mapping` before calling `max()` fixes the crash when `block_count == 0` (e.g., MobileNetV5) while preserving the logging behavior with a reasonable fallback width. No issues from a correctness or performance perspective.
tools/mtmd/models/mobilenetv5.cpp (4)
5-20: LGTM: RMS normalization helper is correctly implemented.

The permutation logic correctly transforms [W,H,C,B] → [C,H,W,B] for channel-wise normalization, applies RMS norm, optionally multiplies by weight, and restores the original layout. The null check for weight is properly guarded.

23-53: LGTM: Asymmetric SAME padding correctly implemented.

The calculation matches PyTorch/TensorFlow SAME padding semantics with proper ceil division for output size and asymmetric padding split. The ggml_pad_ext call uses correct dimension ordering for the [W,H,C,B] layout.

57-88: LGTM: Block builders handle null pointers safely.

All three block builder functions (`build_edge_residual`, `build_inverted_residual`, `build_mobilenet_attn`) consistently check for null pointers before dereferencing optional weights and biases. Residual connection logic correctly validates dimension matching before addition.

Also applies to: 91-149, 152-246

248-453: LGTM: Main build function integrates all components correctly.

The build() function properly sequences stem convolution, stage/block iteration, fusion point collection, MSFA logic, and Gemma3n embedding projection. Logging statements use proper macros (LOG_WRN, LOG_ERR) as addressed in previous reviews. All optional weight/bias tensors are guarded by null checks before use.
```cpp
case PROJECTOR_TYPE_GEMMA3N:
    {
        // Gemma3n uses MobileNetV5 which produces 256 tokens (16x16)
        // Similar configuration to Gemma3
        hparams.n_merge = 1; // MobileNetV5 handles resizing internally
        get_u32(KEY_PROJ_SCALE_FACTOR, hparams.n_merge, false);
    } break;
```
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the specific code sections mentioned in the review
cd tools/mtmd
wc -l clip.cpp
Repository: ngxson/llama.cpp
Length of output: 73
🏁 Script executed:
# Check the context around line 1149-1155
head -1160 clip.cpp | tail -20
Repository: ngxson/llama.cpp
Length of output: 127
🏁 Script executed:
# Also check the "Also applies to" lines: 2856-2864, 3125-3131, 3521-3522, 3647-3648
sed -n '2856,2864p' clip.cpp
Repository: ngxson/llama.cpp
Length of output: 111
🏁 Script executed:
sed -n '3125,3131p' clip.cpp
Repository: ngxson/llama.cpp
Length of output: 111
🏁 Script executed:
sed -n '3521,3522p' clip.cpp
Repository: ngxson/llama.cpp
Length of output: 111
🏁 Script executed:
sed -n '3647,3648p' clip.cpp
Repository: ngxson/llama.cpp
Length of output: 111
🏁 Script executed:
# Now let's look for the converter code mentioned
find . -name "convert_hf_to_gguf.py" -o -name "*convert*gemma*" 2>/dev/null | head -20Repository: ngxson/llama.cpp
Length of output: 83
🏁 Script executed:
# Search for where patch_size is set for GEMMA3N
rg "patch_size.*gemma3n|gemma3n.*patch_size" -iRepository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
# Search for Gemma3nVisionModel in converter files
rg "Gemma3nVisionModel|set_gguf_parameters"Repository: ngxson/llama.cpp
Length of output: 13181
🏁 Script executed:
find . -name "clip.cpp" -type fRepository: ngxson/llama.cpp
Length of output: 81
🏁 Script executed:
# Search for Gemma3nVisionModel set_gguf_parameters implementation
rg -A 20 "class Gemma3nVisionModel"Repository: ngxson/llama.cpp
Length of output: 2748
🏁 Script executed:
# Search for the specific patch_size calculation for GEMMA3N
rg -B 5 -A 5 "patch_size.*image_size.*image_seq_length|Gemma3n.*patch_size"Repository: ngxson/llama.cpp
Length of output: 740
🏁 Script executed:
cat -n ./tools/mtmd/clip.cpp | sed -n '1145,1160p'
Repository: ngxson/llama.cpp
Length of output: 1026
🏁 Script executed:
cat -n ./tools/mtmd/clip.cpp | sed -n '2850,2870p'
Repository: ngxson/llama.cpp
Length of output: 1304
🏁 Script executed:
cat -n ./tools/mtmd/clip.cpp | sed -n '3120,3140p'
Repository: ngxson/llama.cpp
Length of output: 1225
🏁 Script executed:
cat -n ./tools/mtmd/clip.cpp | sed -n '3515,3530p'
Repository: ngxson/llama.cpp
Length of output: 771
🏁 Script executed:
cat -n ./tools/mtmd/clip.cpp | sed -n '3640,3655p'
Repository: ngxson/llama.cpp
Length of output: 873
🏁 Script executed:
# Now let's find where patch_size is being used for token counting in GEMMA3N
rg "clip_n_output_tokens|PROJECTOR_TYPE_GEMMA3N" tools/mtmd/clip.cpp -B 3 -A 10Repository: ngxson/llama.cpp
Length of output: 7814
🏁 Script executed:
# Search for the n_patches calculation in clip_n_output_tokens
rg "n_patches.*patch_size|patch_size.*n_patches" tools/mtmd/clip.cpp -B 5 -A 5Repository: ngxson/llama.cpp
Length of output: 3486
🌐 Web query:
Hugging Face Gemma3n image processor implementation MobileNetV5 patch_size image_seq_length
💡 Result:
Short answer:
- Gemma3n's processor (Gemma3nProcessor / Gemma3nProcessor) wraps a SiglipImageProcessorFast for images and exposes image_seq_length with default = 256. [1][2]
- The Gemma 3n vision encoder is MobileNetV5 and uses a default input resolution of 768×768; MobileNetV5 is a CNN backbone (not a ViT-style patch embed), so there is no model-level "patch_size" parameter the way patch-based transformers have. [2][3]
Sources:
- Gemma3n processor docs (shows image_seq_length default = 256 and use of SiglipImageProcessorFast). [1]
- Gemma3n / Gemma3 docs (states MobileNetV5 vision encoder, default resolution 768×768). [2][3]
🌐 Web query:
Gemma3nImageProcessor Gemma3nProcessor image_size patch_size configuration
💡 Result:
Short answer:
- Gemma3 Image processor (transformers.Gemma3ImageProcessor / Gemma3ImageProcessorFast) defaults: size {"height":224,"width":224} (do_resize True), rescale_factor 1/255 etc. [1][2].
- Gemma3nProcessor wraps feature_extractor + image_processor + tokenizer; image_seq_length defaults to 256. The image_processor used for Gemma3n is SiglipImageProcessor/SigLiPFromPatches. [3][4].
- The Gemma3 vision encoder (SigLiP/Vision config) expects full-resolution images of 896×896 and uses patch_size = 14 (so 896/14 = 64 patches per side → 4096 patches); vision hidden dim ≈1152 and later pooled to 256 multimodal tokens. Patch_size=14 and image_size=896 are used in official Gemma config/backbone implementations. [5][6][7].
Sources:
- Hugging Face — Gemma3ImageProcessor docs (defaults: size 224, rescale_factor 1/255). [1]
- Hugging Face — Gemma3ImageProcessorFast / preprocess description. [2]
- Hugging Face — Gemma3nProcessor (image_seq_length default 256; uses SiglipImageProcessor). [3]
- Gemma readthedocs / model API showing SigLiPFromPatches → ViT patch_size=(14,14), image_height/width=896. [5]
- KerasHub / Gemma3Backbone (vision encoder params: image_size=896, patch_size=14, hidden_dim=1152, output tokens pooled to 256). [6]
- vLLM / Gemma3MultiModalProjector implementation (uses image_size and patch_size to compute patches_per_image and pooling to mm tokens). [7]
🏁 Script executed:
# Let's verify the current converter code more carefully to understand the exact implementation
rg -B 10 -A 10 "image_size.*image_seq_length|patch_size.*image_size.*image_seq" convert_hf_to_gguf.py | head -50
Repository: ngxson/llama.cpp
Length of output: 1042
🏁 Script executed:
# Check if there's any documentation about what patch_size is supposed to represent
rg "patch_size" tools/mtmd/clip.cpp | head -30Repository: ngxson/llama.cpp
Length of output: 2186
🏁 Script executed:
# Let's verify the math: for image_size=768 and image_seq_length=256
python3 << 'EOF'
import math
# Current formula (converter)
image_size = 768
image_seq_length = 256
patch_size_current = image_size // image_seq_length
print(f"Current formula: patch_size = {image_size} // {image_seq_length} = {patch_size_current}")
# Reviewer's suggested fix
patch_size_sqrt = image_size // int(math.sqrt(image_seq_length))
print(f"Reviewer's suggested: patch_size = {image_size} // sqrt({image_seq_length}) = {image_size} // {int(math.sqrt(image_seq_length))} = {patch_size_sqrt}")
# Current C++ calculation with current patch_size
n_patches_current = image_size / patch_size_current
print(f"\nCurrent C++ result: n_patches = {image_size} / {patch_size_current} = {n_patches_current}")
# C++ calculation if patch_size were corrected
n_patches_if_fixed = image_size / patch_size_sqrt
print(f"C++ result if patch_size fixed: n_patches = {image_size} / {patch_size_sqrt} = {n_patches_if_fixed}")
print(f"\nWould the code break? {n_patches_current != n_patches_if_fixed}")
EOF
Repository: ngxson/llama.cpp
Length of output: 146
🏁 Script executed:
# Quick math verification since python timed out
echo "Current: 768 / (768 / 256) = 768 / 3 = $(( 768 / 3 ))"
echo "If fixed (sqrt): 768 / (768 / 16) = 768 / 48 = $(( 768 / 48 ))"Repository: ngxson/llama.cpp
Length of output: 152
🏁 Script executed:
# Check if there's any reference to image_seq_length in the C++ code
rg "image_seq_length" tools/mtmd/clip.cppRepository: ngxson/llama.cpp
Length of output: 42
🏁 Script executed:
# Check how Gemma3 (not Gemma3N) handles patch_size to compare
rg -B 5 -A 10 "PROJECTOR_TYPE_GEMMA3:" tools/mtmd/clip.cpp | head -40Repository: ngxson/llama.cpp
Length of output: 1955
🏁 Script executed:
# Check the full context around clip_n_output_tokens to understand the override logic
sed -n '3100,3145p' tools/mtmd/clip.cpp
Repository: ngxson/llama.cpp
Length of output: 2072
🏁 Script executed:
# Verify if image_seq_length should be stored as a hparam instead
rg "image_seq_length" convert_hf_to_gguf.pyRepository: ngxson/llama.cpp
Length of output: 346
🏁 Script executed:
# Check the comment in the converter about 16x16
rg -B 2 -A 2 "16x16|256 tokens" convert_hf_to_gguf.pyRepository: ngxson/llama.cpp
Length of output: 281
🌐 Web query:
SigLiPImageProcessor resize aspect ratio center crop bilinear interpolation
💡 Result:
Short answer:
-
SiglipImageProcessor follows the Transformers BaseImageProcessor resizing rules: if you pass size as {"shortest_edge": N} it resizes the shortest edge to N and preserves aspect ratio; if you pass explicit height/width it will use that output size (or treat a single int as square unless default_to_square=False). [1][2]
-
Center-crop is supported (do_center_crop / crop_size); the fast processor pads smaller inputs with zeros before center-cropping to ensure the requested crop size. [1][3]
-
Resampling: default resample is BICUBIC for Siglip, but you can set resample=Resampling.BILINEAR (or other PILImageResampling/F.InterpolationMode) to use bilinear interpolation. [1][2]
Sources: Hugging Face SigLIP docs (SiglipImageProcessor) and BaseImageProcessor docs. [1][2][3]
Decouple GEMMA3N token count from misused patch_size parameter
The GEMMA3N token-count logic has a brittle dependency on how patch_size is computed in the converter. Currently, convert_hf_to_gguf.py sets patch_size = image_size // image_seq_length (768 // 256 = 3), and the C++ code in clip_n_output_tokens() computes n_patches = image_size / patch_size, which accidentally yields the correct 256 tokens. However, if patch_size is ever corrected to its proper semantic meaning (e.g., image_size // sqrt(image_seq_length) = 48), this calculation will silently return 16 tokens and break the sanity check.
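A quick numeric illustration of this coupling (plain Python, using the 768 / 256 defaults mentioned above):

```python
# The converter's current formula and the "corrected" formula lead to
# different runtime divisions, even though only one of them is semantically right.
image_size, image_seq_length = 768, 256

patch_size_current = image_size // image_seq_length           # 3 (misused semantics)
patch_size_fixed = image_size // int(image_seq_length ** 0.5)  # 48 (true patch size)

print(image_size // patch_size_current)  # 256 -> accidentally the right token count
print(image_size // patch_size_fixed)    # 16  -> would break the current runtime check
```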
Two concrete fixes:
- Store `image_seq_length` (or a hardcoded 256) as a dedicated GGUF hparam instead of deriving token count from a misused `patch_size`. MobileNetV5 is a CNN backbone, not patch-based, so `patch_size` has no semantic meaning for it. By decoupling the token count from `patch_size`, future converter improvements won't silently break runtime logic.
- Verify preprocessing against HF's SigLiP reference. Confirm that "resize to square + bilinear + no padding" fully matches HF's Gemma3nProcessor behavior. SigLiP can support aspect-ratio-preserving resize + center crop; if that is the intended behavior, the preprocessing in `clip_image_preprocess()` at lines 2856–2864 should be aligned.
```cpp
// 4. Gemma 3n Multimodal Projection (Embedder)
// Input: 'cur' is [Width, Height, Channels, Batch]
int W = cur->ne[0];
int H = cur->ne[1];
int C = cur->ne[2]; // Should be 2048
int B = cur->ne[3];
```
Brittle assumption: channel count should not be hardcoded.
Line 398's comment "// Should be 2048" indicates a brittle assumption about the channel dimension. If MobileNetV5 variants or future models use different channel counts, this could fail silently or cause dimension mismatches in the embedding projection.
🔎 Recommended defensive check
Add validation to catch mismatches early:
```diff
 int W = cur->ne[0];
 int H = cur->ne[1];
-int C = cur->ne[2]; // Should be 2048
+int C = cur->ne[2];
 int B = cur->ne[3];
+
+// Validate expected channel dimension for Gemma3n projection
+const int expected_vision_hidden = 2048; // Or read from model config
+if (C != expected_vision_hidden) {
+    LOG_ERR("%s: unexpected vision hidden size %d (expected %d)\n", __func__, C, expected_vision_hidden);
+    return nullptr;
+}
```

Based on learnings, as this is a mirrored PR, please relay this concern to the upstream contributor.
🤖 Prompt for AI Agents
In tools/mtmd/models/mobilenetv5.cpp around lines 394 to 400, the code comments
and assumption that C == 2048 are brittle; add a defensive runtime validation
that checks the channel count (cur->ne[2]) against the expected embedding input
size, and if it differs emit a clear error or return early with a descriptive
message (include the actual C and expected value). If there are multiple
supported channel sizes, handle them explicitly (e.g., branch or map to
projection weights) rather than hardcoding; otherwise fail fast. Also include a
brief log or comment to notify upstream that the hardcoded "Should be 2048"
assumption was removed and request confirmation of expected channel sizes for
different MobileNetV5 variants.
…th already existing terminology
Actionable comments posted: 1
♻️ Duplicate comments (2)
gguf-py/gguf/constants.py (1)
857-865: GEMMA3N mapping still missing from VISION_PROJECTOR_TYPE_NAMES dictionary.

This is the same issue flagged in the previous review. The GEMMA3N enum value added at line 459 still has no corresponding entry in this dictionary. This mapping is required for the enum-to-string conversion to work correctly.
convert_hf_to_gguf.py (1)
6014-6039: Fix Gemma3n MobileNetV5 `patch_size` semantics and avoid brittle hard-coded vision hparams

Two related issues here:

1. `patch_size` is still computed from total token count, not per-side patches (critical)
   - Current code: `patch_size = image_size // image_seq_length` (e.g., `768 // 256 = 3`), which implies `n_per_side = 256` and a 65k-token grid.
   - Semantically, `image_seq_length` is total patches (e.g., 256 = 16×16). Patch size must be derived from `sqrt(image_seq_length)` so that both the converter and the C++ vision path agree on a 16×16 grid and the correct `patch_size` (48 for 768×768).

2. Hard-coded MobileNetV5 defaults in `find_vparam` are brittle
   - `hidden_size` defaulting to 2048 and `num_heads` forced to 8 will silently be wrong if future Gemma3n variants change these values in their config. It's safer to read from `self.hparams_vision` when available and only fall back to defaults if the config is missing them.

Patch: derive `patch_size` from √image_seq_length
```diff
-        # Image sequence length (256 tokens = 16x16 for Gemma3n)
-        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
-        image_size = self.hparams_vision["image_size"]
-        self.hparams_vision["patch_size"] = image_size // image_seq_length
+        # Image sequence length is total tokens (e.g. 256 = 16×16 grid)
+        image_seq_length = self.preprocessor_config.get("image_seq_length", 256)
+        image_size = self.hparams_vision["image_size"]
+
+        n_per_side = int(image_seq_length ** 0.5)
+        if n_per_side * n_per_side != image_seq_length:
+            raise ValueError(f"image_seq_length={image_seq_length} is not a perfect square")
+
+        # e.g. 768 // 16 = 48 for a 16×16 patch grid
+        self.hparams_vision["patch_size"] = image_size // n_per_side
```

Patch: prefer config-driven head / FFN sizes in `find_vparam`
```diff
     def find_vparam(self, keys: list[str], optional: bool = False) -> Any:
         """Override to provide hardcoded MobileNetV5 parameters that aren't in config"""
         # Handle empty keys list (n_block_keys) - return 0 for CNN architecture
         if not keys:
             return 0
-        if "intermediate_size" in keys:
-            # Typical expansion is 4x the embedding dimension
-            hidden_size = self.hparams_vision.get("hidden_size", 2048)
-            return hidden_size * 4
-
-        if "num_attention_heads" in keys or "num_heads" in keys:
-            # Multi-Query Attention with 8 heads
-            return 8
+        if "intermediate_size" in keys:
+            assert self.hparams_vision is not None
+            if "intermediate_size" in self.hparams_vision:
+                return self.hparams_vision["intermediate_size"]
+            # Fallback: typical MobileNetV5 expansion is 4× hidden_size
+            hidden_size = self.hparams_vision.get("hidden_size")
+            if hidden_size is not None:
+                return hidden_size * 4
+
+        if any(k in ("num_attention_heads", "num_heads") for k in keys):
+            assert self.hparams_vision is not None
+            for k in ("num_attention_heads", "num_heads"):
+                if k in self.hparams_vision:
+                    return self.hparams_vision[k]
+            # Final fallback if config is missing heads
+            return 8
         # For other parameters, use parent implementation
         return super().find_vparam(keys, optional)
```

Given this is a mirrored PR, you'll probably want to carry this fix locally and/or ping upstream about the `patch_size` formula and config-driven defaults.
🧹 Nitpick comments (3)
gguf-py/gguf/constants.py (1)
1071-1071: Consider clarifying comment for V_MM_SOFT_EMB_NORM.

The comment here shows `# gemma3n`, but the enum definition at line 669 shows `# gemma3`. If this tensor is used by both gemma3 and gemma3n architectures, consider using a comment like `# gemma3, gemma3n` to clarify the shared usage and avoid confusion.

convert_hf_to_gguf.py (2)
5969-5995: Annotate mutable class attributes with `ClassVar` to satisfy Ruff RUF012

`n_block_keys = []` and `block_tensor_mapping = {…}` are mutable class attributes; Ruff expects them to be annotated as `typing.ClassVar[...]`.

Proposed type annotations for class attributes

Add `ClassVar` to the typing imports:

```python
from typing import TYPE_CHECKING, Any, Callable, ContextManager, Iterable, Iterator, Literal, Sequence, TypeVar, cast, ClassVar
```

Then update the class attributes:
```diff
-    n_block_keys = []
+    n_block_keys: ClassVar[list[str]] = []
@@
-    block_tensor_mapping = {
+    block_tensor_mapping: ClassVar[dict[str, str]] = {
         "model.vision_tower.timm_model.blocks.{bid}.{sid}.conv_exp.weight": "v.blk.{bid}.{sid}.conv_exp.weight",
         ...
     }
```
6171-6226: Embedding padding and ALTUP stacking logic look correct; consider a small guard

This block:

- Only affects `language_model.*` tensors; others are skipped, which keeps mmproj / vision clean.
- Pads `embed_tokens.weight` and `embed_tokens_per_layer` up to `hparams["vocab_size"]`, filling new rows with zeros for vision/audio tokens (which get real features from the vision/audio towers anyway).
- Leaves `altup_unembed_projections` and `altup_projections` unpadded and stacks three shard tensors into single `[3, …, …]` matrices, matching how GGUF expects them.

You might consider adding a simple sanity check on the padding path to catch config mismatches earlier (optional):
Optional: assert current vs target vocab sizes when padding
```diff
-            vocab_size = self.hparams.get("vocab_size", 262400)
-            current_size = data_torch.shape[0]  # First dimension is vocab_size
+            vocab_size = self.hparams.get("vocab_size", 262400)
+            current_size = data_torch.shape[0]  # first dim is vocab size
+
+            if current_size > vocab_size:
+                raise ValueError(
+                    f"embed tensor rows ({current_size}) exceed vocab_size ({vocab_size})"
+                )
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- convert_hf_to_gguf.py
- gguf-py/gguf/constants.py
- tools/mtmd/clip-impl.h
🚧 Files skipped from review as they are similar to previous changes (1)
- tools/mtmd/clip-impl.h
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Always use the Python environment in `.venv` and run Python tools from that environment
Apply Python linting rules configured in `.flake8` (max-line-length=125, excludes examples/tools) and type checking with pyright
Files:
- gguf-py/gguf/constants.py
- convert_hf_to_gguf.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated
Applied to files:
- gguf-py/gguf/constants.py
- convert_hf_to_gguf.py
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated
Applied to files:
- gguf-py/gguf/constants.py
- convert_hf_to_gguf.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
gguf-py/gguf/constants.py
🧬 Code graph analysis (1)
convert_hf_to_gguf.py (3)
convert_lora_to_gguf.py (2)
- modify_tensors (448-473)
- to (185-186)

gguf-py/gguf/gguf_writer.py (2)

- add_clip_projector_type (1076-1077)
- add_vision_attention_layernorm_eps (1097-1098)

gguf-py/gguf/constants.py (1)

- VisionProjectorType (3429-3450)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py
5969-5969: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
5972-5995: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
6067-6067: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (3)
convert_hf_to_gguf.py (3)
522-527: Defensive handling of empty `tensor_map` in `prepare_tensors` looks good

The fallback `max_name_len` path for models with `block_count == 0` (e.g., MobileNetV5) avoids `max()` on an empty mapping and only affects logging alignment. No functional risks spotted.
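For context, a minimal sketch of this kind of guard, assuming a plain dict-backed mapping; the function name and fallback width are hypothetical, not the actual prepare_tensors implementation:

```python
# Hypothetical distillation of the guard being reviewed: the mapping can be
# empty when block_count == 0 (CNN backbones like MobileNetV5), and max() on
# an empty sequence raises ValueError, so a fallback width is used instead.
def compute_max_name_len(mapping: dict[str, str], fallback: int = 40) -> int:
    if not mapping:
        return fallback  # only affects log column alignment
    return max(len(name) for name in mapping.values())

print(compute_max_name_len({}))                       # 40
print(compute_max_name_len({"a": "v.blk.0.attn_q"}))  # 14
```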
6070-6090: Verify MobileNetV5 tensor remapping and reshaping assumptions

The overall routing logic looks consistent:

- Non-Gemma3n tensors are skipped early.
- Block tensors under `model.vision_tower.timm_model.blocks.*` are mapped via `custom_map` into the `v.blk.{bid}.{sid}.*` namespace.
- Other MobileNetV5 pieces (stem / msfa / embeddings) fall back to the standard tensor map.
- Biases and `layer_scale.gamma` are reshaped from `[C]` into `[1, C, 1, 1]`, which matches a conv-style broadcast.

If you haven't already, it would be worth running a quick shape sanity check against a real Gemma3n vision checkpoint (ensure `conv_stem.conv.bias` and `layer_scale.gamma` are 1-D `[C]` and that the resulting GGUF tensors have the shapes expected by `mobilenetv5.cpp` / `clip.cpp` for those fields).
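As a quick illustration of why that reshape is broadcast-safe, here is a small torch sketch (dummy values; torch is assumed only because the converter already works with torch tensors):

```python
import torch

# Illustrative shape check for the bias / layer_scale.gamma reshape described
# above: a 1-D [C] tensor becomes [1, C, 1, 1] so it broadcasts over a
# conv-style [N, C, H, W] activation.
C = 2048
gamma = torch.ones(C)                   # 1-D [C], as exported by HF
gamma4d = gamma.reshape(1, C, 1, 1)     # conv-broadcastable form

activation = torch.randn(1, C, 16, 16)  # [N, C, H, W]
scaled = activation * gamma4d           # broadcasts over N, H, W
assert scaled.shape == activation.shape
```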
6115-6134: `Gemma3NModel.set_vocab` correctly forces use of full `vocab_size`

Temporarily removing `vocab_size_per_layer_input` so the vocab builders see the full `vocab_size` (including vision/audio specials) and then restoring it is a clean way to reconcile:

- vocab generation: needs 262400 entries
- per-layer embeddings: still driven by `vocab_size_per_layer_input` later

No functional issues seen.
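A minimal sketch of the pop/restore pattern described above, with a hypothetical helper name rather than the real Gemma3NModel.set_vocab signature:

```python
# Hypothetical sketch of the pop/restore pattern being praised above; the real
# implementation works on self.hparams and calls the parent vocab builder.
def set_vocab_with_full_size(hparams: dict, build_vocab) -> None:
    saved = hparams.pop("vocab_size_per_layer_input", None)
    try:
        build_vocab(hparams)  # sees the full vocab_size (e.g. 262400)
    finally:
        if saved is not None:
            hparams["vocab_size_per_layer_input"] = saved  # restore for later per-layer use
```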
```python
        # MobileNetV5 does not use normalisation at all
        self.preprocessor_config["image_mean"] = [0.0 , 0.0 , 0.0 ]
        self.preprocessor_config["image_std"] = [1.0 , 1.0 , 1.0 ]
        self.hparams_vision["image_size"] = self.preprocessor_config.get(
            "size", {"height": 768, "width": 768}
        )["height"]
```
Clean up list literals to satisfy flake8 E202 (whitespace before ']')
The image_mean / image_std assignments currently have a trailing space before the closing ], which flake8 reports as E202 whitespace before ']'.
Patch to fix flake8 E202
```diff
-        self.preprocessor_config["image_mean"] = [0.0 , 0.0 , 0.0 ]
-        self.preprocessor_config["image_std"] = [1.0 , 1.0 , 1.0 ]
+        self.preprocessor_config["image_mean"] = [0.0, 0.0, 0.0]
+        self.preprocessor_config["image_std"] = [1.0, 1.0, 1.0]
```

🤖 Prompt for AI Agents
In convert_hf_to_gguf.py around lines 6027 to 6032, the list literals for
self.preprocessor_config["image_mean"] and self.preprocessor_config["image_std"]
contain a space before the closing bracket (triggering flake8 E202); remove the
trailing spaces so the lists read [0.0, 0.0, 0.0] and [1.0, 1.0, 1.0]
respectively (keeping commas and spacing consistent), then run flake8 to confirm
the E202 warning is resolved.
…roadcasting of vision inp_per_layer - Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable
Actionable comments posted: 0
♻️ Duplicate comments (1)
gguf-py/gguf/constants.py (1)
459-459: GEMMA3N still missing from VISION_PROJECTOR_TYPE_NAMES dictionary.

This issue was previously flagged but remains unresolved. The `GEMMA3N` entry was added to the `VISION_PROJECTOR_TYPE` enum at line 459, but the corresponding mapping in the `VISION_PROJECTOR_TYPE_NAMES` dictionary (lines 856-864) is still missing. This will cause runtime errors when converting the enum value to its string representation.

Please add the mapping:

```python
VISION_PROJECTOR_TYPE.GEMMA3N: "gemma3n",
```

Note: `QWEN3VL` and `COGVLM` mappings are also still missing as noted in the previous review.
🧹 Nitpick comments (1)
src/models/gemma3n-iswa.cpp (1)
263-274: Consider adding dimension validation for the vision embedding path.

The vision embedding path assumes `embd_size = model.tok_embd_per_layer->ne[0]` equals `n_embd_altup * n_layer`. If this assumption is violated, the reshape at line 272 will produce incorrect results or fail silently.

💡 Optional: Add assertion to validate dimensions

```diff
 // Vision embedding path: use padding token (ID=0) embedding
 const int64_t embd_size = model.tok_embd_per_layer->ne[0]; // n_embd_altup * n_layer
+GGML_ASSERT(embd_size == n_embd_altup * n_layer && "tok_embd_per_layer dimension mismatch");
 // Extract and dequantize padding token embedding (column 0)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- gguf-py/gguf/constants.py
- gguf-py/gguf/tensor_mapping.py
- src/models/gemma3n-iswa.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{c,cpp,h,hpp}
📄 CodeRabbit inference engine (AGENTS.md)
Always format C++ code before committing using `git clang-format` with the project's `.clang-format` configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment `void * ptr`, reference alignment `int & ref`)
Files:
src/models/gemma3n-iswa.cpp
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Always use the Python environment in `.venv` and run Python tools from that environment
Apply Python linting rules configured in `.flake8` (max-line-length=125, excludes examples/tools) and type checking with pyright
Files:
- gguf-py/gguf/tensor_mapping.py
- gguf-py/gguf/constants.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated
Applied to files:
- src/models/gemma3n-iswa.cpp
- gguf-py/gguf/constants.py
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated
Applied to files:
gguf-py/gguf/constants.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.
Applied to files:
gguf-py/gguf/constants.py
🧬 Code graph analysis (2)
src/models/gemma3n-iswa.cpp (1)
ggml/src/ggml.c (7)
- ggml_new_tensor_1d (1747-1752)
- ggml_set_input (7435-7437)
- ggml_get_rows (3776-3797)
- ggml_reshape_3d (3564-3581)
- ggml_scale (3290-3295)
- ggml_cpy (3426-3431)
- ggml_add (1969-1974)
gguf-py/gguf/tensor_mapping.py (1)
gguf-py/gguf/constants.py (1)
- MODEL_TENSOR (465-742)
🔇 Additional comments (7)
gguf-py/gguf/tensor_mapping.py (1)
123-155: Verify V_MM_INP_PROJ comment and tensor reuse.

The mapping for `V_MM_INP_PROJ` at lines 129-131 is commented as "# gemma3n", but in `constants.py` (line 668), `V_MM_INP_PROJ` is defined with the comment "# gemma3". Please verify:

- Is `V_MM_INP_PROJ` being reused between gemma3 and gemma3n, or should gemma3n have a distinct tensor?
- If it's reused, consider clarifying the comment or noting the shared usage.
Otherwise, the new gemma3n vision tensor mappings look correct and follow the established pattern.
gguf-py/gguf/constants.py (4)
670-678: LGTM: MODEL_TENSOR enum additions.

The new gemma3n vision tensor enum members are correctly defined with `auto()` and follow the established naming conventions and patterns.
1070-1079: LGTM: TENSOR_NAMES mappings for gemma3n.

The new TENSOR_NAMES mappings correctly associate the MODEL_TENSOR enum members with their GGUF tensor name strings, following established naming conventions.
1178-1186: LGTM: MODEL_TENSORS additions for MMPROJ arch.

The new gemma3n vision tensors are correctly added to the `MODEL_TENSORS[MODEL_ARCH.MMPROJ]` list, ensuring they will be recognized during tensor loading and validation.
3428-3428: LGTM: VisionProjectorType.GEMMA3N constant.

The `GEMMA3N` constant is correctly defined with the appropriate lowercase string value "gemma3n", consistent with other projector type definitions.

src/models/gemma3n-iswa.cpp (2)
253-261: LGTM: Token input path refactoring.

The refactoring correctly moves the input object creation and lifecycle management into the token branch scope. The `res->add_input(std::move(inp))` call after processing is appropriately placed.
292-292: Broadcasting order change is intentional and properly implemented.

The addition order change, placing `per_layer_proj` (shape `[n_embd_altup, n_layer, n_tokens]`) first and `inp_per_layer` second, is correct per the PR objectives and supports broadcasting for vision inputs (shape `[n_embd_altup, n_layer, 1]`). The implementation follows standard conventions (larger tensor first) and the code is properly formatted per project guidelines.
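As a rough analogue of this broadcast (illustrative only: ggml lists dimensions in reverse order, so ggml's `[n_embd_altup, n_layer, n_tokens]` corresponds to a torch shape of `(n_tokens, n_layer, n_embd_altup)`; the sizes below are made up):

```python
import torch

# Torch analogue of adding a per-token projection and a single broadcast
# vision row; the vision embedding contributes one row that is repeated
# across all tokens via broadcasting.
n_embd_altup, n_layer, n_tokens = 256, 30, 5
per_layer_proj = torch.randn(n_tokens, n_layer, n_embd_altup)  # per-token projection
inp_per_layer = torch.randn(1, n_layer, n_embd_altup)          # single vision embedding row

out = per_layer_proj + inp_per_layer  # broadcasts over the token dimension
assert out.shape == (n_tokens, n_layer, n_embd_altup)
```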
Mirror from upstream PR: ggml-org#18256
Summary by CodeRabbit