Conversation


@ngxson ngxson commented Dec 23, 2025

Mirror from upstream PR: ggml-org#18315

Note: @coderabbitai use my 'Mirror PR' preset for reviewing this.

Summary by CodeRabbit

  • New Features

    • UTUVL vision model support added to conversion and runtime, expanding multimodal vision-language compatibility.
    • New UTU_VL pre-tokenizer option and utu-vl model entry in the conversion tool registry.
    • Vision config metadata: window size and window-attention layer settings exposed.
  • Bug Fixes

    • More graceful fallback when output tensors are missing during model loading.
  • Enhancements

    • Expanded Unicode category handling and additional tensor mapping paths for vision models.



coderabbitai bot commented Dec 23, 2025

📝 Walkthrough

Adds UTUVL vision-model support across conversion tooling, tensor mappings, GGUF metadata, tokenizer, and mtmd inference: new model registration/class, GGUF keys for wa_layers/window_size, tokenizer pre-type, tensor mapping entries, and a UTUVL-specific clip graph and preprocessing pipeline.

Changes

Cohort / File(s) Summary
Conversion & Model Registration
convert_hf_to_gguf.py, convert_hf_to_gguf_update.py
New UtuVLVisionModel class and registration for UtuVLForConditionalGeneration/UTUVLForCausalLM; UTUVL-specific set_gguf_parameters and modify_tensors; utu-vl model entry added.
GGUF Metadata & Writer
gguf-py/gguf/constants.py, gguf-py/gguf/gguf_writer.py
Added ClipVision keys WA_LAYERS, WINDOW_SIZE, enum UTUVL; writer methods add_vision_wa_layers() and add_vision_window_size().
Tensor Mapping
gguf-py/gguf/tensor_mapping.py
Extended TensorNameMap with additional Siglip2/UTUVL source keys for many vision tensors (projections, norms, FFN, layers).
Tokenizer / Vocab
src/llama-vocab.h, src/llama-vocab.cpp
Added LLAMA_VOCAB_PRE_TYPE_UTU_VL and BPE pre-tokenizer rules for "utu-vl" (clean_spaces=false, ignore_merges=true).
Unicode / Regex Parser
src/unicode.cpp
Dynamic \p{...} parsing replaced fixed-length handling; expanded LETTER categories (Lu, Ll, Lt, Lm, Lo) and parsing advances to closing brace.
Llama Model Loading
src/llama-model.cpp
Output weight tensor load made optional; fallback to token embedding load when output.weight missing.
MTMD Build & Integration
tools/mtmd/CMakeLists.txt, tools/mtmd/mtmd.cpp
Added models/utuvl.cpp to build; UTUVL included with QWEN variants for image delimiter handling.
Clip Projector & Hparams
tools/mtmd/clip-impl.h, tools/mtmd/clip-model.h
Added PROJECTOR_TYPE_UTUVL, macro key for clip.vision.wa_layers, and std::unordered_set<int32_t> wa_layers in clip_hparams.
UTUVL Graph & Model Implementation
tools/mtmd/models/models.h, tools/mtmd/models/utuvl.cpp
New clip_graph_utuvl struct and clip_graph_utuvl::build() implementing patch embedding, optional windowed attention, transformer layers, and VLPatchMerger pipeline.
Clip Processing & Preproc
tools/mtmd/clip.cpp
UTUVL-specific graph selection, hparam loading (spatial_merge_size, attn_window_size, wa_layers), preprocessing/resizing, token counting, and window-attention logic wired through encoding and embedding extraction.
Template Utility
vendor/minja/minja.hpp
Added string rsplit() utility and .rsplit() method handling in template engine.

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant HF as "HuggingFace Model"
    participant Conv as "convert_hf_to_gguf"
    participant Vocab as "LLAMA Tokenizer"
    participant TMap as "TensorMapping"
    participant GGUF as "GGUF Writer"

    User->>Conv: request convert (utu-vl)
    Conv->>HF: load model weights & hparams
    Conv->>Vocab: map pre-tokenizer "utu-vl"
    Vocab-->>Conv: LLAMA_VOCAB_PRE_TYPE_UTU_VL
    Conv->>TMap: resolve tensor source names
    TMap-->>Conv: mapped names (incl. Siglip2 variants)
    Conv->>Conv: instantiate UtuVLVisionModel
    Conv->>Conv: set_gguf_parameters (wa_layers, window_size, ...)
    Conv->>Conv: modify_tensors (remap or skip vision tensors)
    Conv->>GGUF: write metadata & tensors (WA_LAYERS, WINDOW_SIZE)
    GGUF-->>User: produced GGUF file
sequenceDiagram
    actor App
    participant Clip as "mtmd/clip"
    participant Pre as "Preprocessor"
    participant Graph as "clip_graph_utuvl"
    participant Encode as "BatchEncoder"

    App->>Clip: load UTUVL model & hparams
    Clip->>Clip: init wa_layers, attn_window_size, spatial_merge_size
    App->>Pre: provide image
    Pre->>Pre: dynamic resize, patch/merge alignment
    Pre-->>Encode: preprocessed batch
    Encode->>Clip: compute token counts (with spatial merge)
    App->>Graph: build forward graph
    Graph->>Graph: patch embed → transformer layers (optional windowed attn)
    Graph->>Graph: VLPatchMerger (RMS + FFN)
    Graph-->>Clip: forward graph ready
    Clip-->>App: produce vision embeddings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐰 A rabbit's note for UTUVL's new day

Patches hop and windows gently play,
Tensors weave a meadow in the light,
Tokens bound and embedders take flight,
Hooray—new vision springs into sight!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 2.56%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
  • Description check ❓ Inconclusive: The PR description is minimal but adequate, referencing the upstream PR ggml-org#18315 and requesting a specific review preset. However, it lacks detailed context about what changes are included, why they're needed, or what functionality is being added. Consider expanding the description to summarize key changes (UTUVL vision model support, new tensor mappings, vocabulary additions) and explain the purpose of this mirror PR for better context.
✅ Passed checks (1 passed)
  • Title check ✅ Passed: The title '[Mirror] Support Youtu-VL Model' clearly and concisely summarizes the main objective of the PR, which is to add support for the Youtu-VL model to the codebase.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
convert_hf_to_gguf.py (1)

7247-7344: OUTPUT tensor bypass of quantization is intentional but undocumented; embedding naming assumption is reasonable for DeepseekV2.

The tied-embeddings implementation in DeepseekV2Model is functionally correct but has two points worth clarifying:

  1. Quantization bypass for OUTPUT is real: The code calls self.gguf_writer.add_tensor(output_name, self._token_embd.numpy()) directly, which bypasses the quantization pipeline in ModelBase.prepare_tensors(). This means OUTPUT will be stored at the original tensor precision (likely F16) rather than following the model's ftype setting (e.g., MOSTLY_Q8_0). This appears intentional for tied-embedding models to preserve logit precision, but should be documented if that's the design goal.

  2. Embedding naming assumption is reasonable: The check for "model.embed_tokens.weight" is the correct naming for DeepseekV2 architecture variants. If future variants use different naming (e.g., without the model. prefix), the OUTPUT tensor would not be written, but this is unlikely for the DeepseekV2/UTUVL family.

If high-precision OUTPUT tensors are intended for quality, consider adding a comment explaining this trade-off. If OUTPUT should follow quantization settings instead, capture the quantized tensor from the parent prepare_tensors() call rather than the raw _token_embd.
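
As a rough illustration of the first option, here is a minimal sketch of such a clarifying comment (the surrounding names follow the review text above and are not verified against the actual source):

if self._token_embd is not None:
    # NOTE: OUTPUT is written from the raw token embedding on purpose: tied-embedding
    # models keep the output head at its original precision (e.g. F16) instead of the
    # requested ftype, to preserve logit precision
    self.gguf_writer.add_tensor(output_name, self._token_embd.numpy())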

🧹 Nitpick comments (8)
src/llama-vocab.cpp (1)

1869-1873: Minor: Trailing whitespace on line 1873.

The tokenizer configuration looks correct with clean_spaces = false and ignore_merges = true, consistent with similar pre-tokenizers like llama3 and tekken.

🔎 Proposed fix
             } else if (
                     tokenizer_pre == "utu-vl") {
                 pre_type = LLAMA_VOCAB_PRE_TYPE_UTU_VL;
                 clean_spaces = false;
-                ignore_merges = true; 
+                ignore_merges = true;
tools/mtmd/models/utuvl.cpp (2)

160-160: Inconsistent epsilon value - uses hardcoded 1e-6 instead of member eps.

The RMS norm for the merger uses a hardcoded 1e-6 epsilon, while other norms in this file use the member variable eps. If this is intentional (perhaps the merger requires a different epsilon), consider adding a comment to clarify.

🔎 Proposed fix (if same eps is intended)
-    embeddings = build_norm(embeddings, model.mm_input_norm_w, nullptr, NORM_TYPE_RMS, 1e-6, -1);
+    embeddings = build_norm(embeddings, model.mm_input_norm_w, nullptr, NORM_TYPE_RMS, eps, -1);

173-185: Remove commented-out dead code.

This block was replaced by the build_ffn call above. Commented-out code should be removed to improve maintainability.

🔎 Proposed fix
     embeddings = build_ffn(embeddings,
                     model.mm_0_w, model.mm_0_b,
                     nullptr, nullptr,
                     model.mm_1_w, model.mm_1_b,
                     FFN_GELU,
                     -1);
-    // // 3. First linear layer
-    // embeddings = ggml_mul_mat(ctx0, model.mm_0_w, embeddings);
-    // embeddings = ggml_add(ctx0, embeddings, model.mm_0_b);
-    // cb(embeddings, "merger_fc1", -1);
-
-    // // 4. GELU activation
-    // embeddings = ggml_gelu(ctx0, embeddings);
-    // cb(embeddings, "merger_gelu", -1);
-
-    // // 5. Second linear layer
-    // embeddings = ggml_mul_mat(ctx0, model.mm_1_w, embeddings);
-    // embeddings = ggml_add(ctx0, embeddings, model.mm_1_b);
 
     // build the graph
tools/mtmd/clip.cpp (3)

1165-1175: Consider aligning warning messages with QWEN25VL for consistency.

The warning for UTUVL is more terse compared to QWEN25VL which provides additional context about the --image-min-tokens flag and a reference link. Also, QWEN25VL loads n_wa_pattern from config while UTUVL hardcodes window attention behavior in the batch encoder.

If window attention pattern should be configurable, consider loading it from the model config similar to QWEN25VL.

🔎 Suggested improvement for warning message
                         const int warn_min_pixels = 1024 * hparams.n_merge * hparams.n_merge * hparams.patch_size * hparams.patch_size;
                         if (hparams.image_min_pixels < warn_min_pixels) {
                             LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
+                            LOG_WRN("%s: if you encounter problems with accuracy, try adding --image-min-tokens 1024\n", __func__);
                         }

2710-2760: Potential edge cases in scaling loop.

A few observations about the preprocessing logic:

  1. Loop termination: The while (scale > 0.0f) loop could theoretically reach very small or zero scale values for extremely large images. Consider adding a minimum scale threshold to prevent potential issues.

  2. Fallback value: When params.image_max_pixels is 0, max_num_patches becomes 0, which would always trigger the scale reduction.

  3. Comment accuracy: The comment mentions "binary search" but the implementation is a linear search with 0.02 decrements.

🔎 Suggested defensive bounds
-                // Binary search for optimal scale to fit within max_num_patches
+                // Linear search for optimal scale to fit within max_num_patches
                 float scale = 1.0f;
                 int target_height = original_size.height;
                 int target_width = original_size.width;
                 
                 auto get_scaled_image_size = [align_size](float scale, int size) -> int {
                     float scaled_size = size * scale;
                     // Round up to nearest multiple of align_size
                     int aligned = static_cast<int>(std::ceil(scaled_size / align_size)) * align_size;
                     // Ensure at least one patch
                     return std::max(align_size, aligned);
                 };
                 
-                // Binary search with 0.02 step size
-                while (scale > 0.0f) {
+                // Linear search with 0.02 step size
+                const float min_scale = 0.02f;
+                while (scale >= min_scale) {
                     target_height = get_scaled_image_size(scale, original_size.height);
                     target_width = get_scaled_image_size(scale, original_size.width);
                     
                     int num_patches_h = target_height / patch_size;
                     int num_patches_w = target_width / patch_size;
                     int num_patches = num_patches_h * num_patches_w;
                     
                     if (num_patches > max_num_patches) {
                         scale -= 0.02f;
                     } else {
                         break;
                     }
                 }

3426-3476: Dead code and hardcoded window attention.

  1. Dead code: Since use_window_attn is hardcoded to true (line 3428), the else branch at lines 3472-3475 will never execute.

  2. Consider making window attention configurable: Unlike QWEN25VL which reads n_wa_pattern from model config to determine window attention behavior, UTUVL hardcodes it. If this is intentional (UTUVL always requires window attention), consider removing the dead else branch or adding a comment explaining this design choice.

🔎 Remove dead code
         case PROJECTOR_TYPE_UTUVL:
             {
-                const bool use_window_attn = true;
                 const int merge_ratio = 2;
                 const int pw  = image_size_width  / patch_size / merge_ratio;  // patches after merger
                 const int ph  = image_size_height / patch_size / merge_ratio;
                 const int ipw = image_size_width  / patch_size;  // internal patches in ViT
                 const int iph = image_size_height / patch_size;
                 std::vector<int> idx    (ph * pw);
                 std::vector<int> inv_idx(ph * pw);
-                if (use_window_attn) {
+                // UTUVL always uses window attention
+                {
                     const int attn_window_size = patch_size * 2 * 8;
                     // ... rest of window attention code ...
-                } else {
-                    for (int i = 0; i < ph * pw; i++) {
-                        idx[i] = i;
-                    }
                 }
convert_hf_to_gguf.py (2)

7138-7237: DeepseekV2 MoE GGUF metadata wiring for UTUVL is reasonable but a bit implicit.

  • Registering "UTUVLForCausalLM" on DeepseekV2Model matches how other multimodal families register both text and mmproj variants under the same HF arch name; no issue there.
  • In set_gguf_parameters, the new MoE section:
    • Falls back from moe_intermediate_size to intermediate_size (or 0) for expert_feed_forward_length, which keeps GGUF consistent for both MoE and dense-only DeepseekV2‑style configs.
    • Writes expert_count, expert_shared_count (default 0), expert_weights_scale (default 1.0), and expert_weights_norm when norm_topk_prob is truthy. This mirrors how other MoE models in this file are handled and should be safe.
Two minor nits you might want to consider (not blockers):
  • If a model is not MoE but happens to have an intermediate_size, you’ll still emit expert_feed_forward_length with a non‑zero value; if that’s unintended, consider guarding all MoE metadata behind a single if hparams.get("n_routed_experts") is not None: block.
  • add_expert_weights_norm is called with hparams["norm_topk_prob"] which appears to be boolean in Deepseek configs; if any future variant uses this field differently, this call might need to be revisited.

Given this mirrors upstream behavior, I’d leave it as‑is unless you see non‑MoE DeepseekV2 models hitting this path.


10501-10541: UtuVLVisionModel implementation looks sound; small polish opportunities only.

Behavior:

  • Registration under both "UtuVLForConditionalGeneration" and "UTUVLForCausalLM" on an MmprojModel mirrors how other VL families (e.g. Qwen2VL) expose both text and vision variants; this is consistent.

  • set_gguf_parameters:

    • Calls super().set_gguf_parameters() so all generic vision/audio metadata is preserved.
    • Sets projector type to UTUVL.
    • Uses self.hparams["layer_norm_eps"] (inherited from MmprojModel, which rebinds self.hparams to the vision config) and a tolerant activation check:
      • Treats several GELU aliases (gelu_pytorch_tanh, etc.) as vision_use_gelu=True.
      • Supports silu via vision_use_silu=True.
      • Raises ValueError for anything else, which is an appropriate hard‑fail given unknown activations would break runtime.
    • Writes vision_spatial_merge_size with a default of 2 when missing, which is at least explicit and will not crash.
  • modify_tensors:

    • Cleanly drops obvious language‑model tensors (lm_head., model.layers., model.embed_tokens., model.norm.).
    • For everything else, delegates to self.map_tensor_name(name) and, on failure, logs a warning and silently drops the tensor. That’s in line with other vision subclasses that prefer “best‑effort + warning” over hard failure.

Optional polish (not required):

  • Ruff’s TRY003/TRY300 hints here are stylistic only:
    • The ValueError(f"Unsupported activation function for UTUVL: {hidden_act}") could be shortened slightly, but it’s already clear and localized.
    • The try/except ValueError in modify_tensors is fine as written; splitting the return [] into an else block would be marginally clearer but not necessary.

From a functional standpoint, this class fits well with the existing multimodal patterns in this file.
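
For reference, a minimal self-contained sketch of the "best-effort + warning" mapping pattern described above (map_tensor_name is passed in as a stand-in parameter here, not the real method binding):

import logging

logger = logging.getLogger(__name__)
SKIP_PREFIXES = ("lm_head.", "model.layers.", "model.embed_tokens.", "model.norm.")

def map_vision_tensor(map_tensor_name, name, tensor):
    if name.startswith(SKIP_PREFIXES):
        return []  # language-model tensors are handled by the text-model conversion
    try:
        return [(map_tensor_name(name), tensor)]
    except ValueError:
        logger.warning("unmapped vision tensor, skipping: %s", name)
        return []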

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed75977 and 8733bf3.

📒 Files selected for processing (12)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • src/llama-vocab.cpp
  • src/llama-vocab.h
  • src/unicode.cpp
  • tools/mtmd/CMakeLists.txt
  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/models.h
  • tools/mtmd/models/utuvl.cpp
  • tools/mtmd/mtmd.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • src/llama-vocab.h
  • tools/mtmd/models/models.h
  • src/llama-vocab.cpp
  • tools/mtmd/models/utuvl.cpp
  • tools/mtmd/mtmd.cpp
  • src/unicode.cpp
  • tools/mtmd/clip.cpp
  • tools/mtmd/clip-impl.h
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • src/llama-vocab.h
  • gguf-py/gguf/constants.py
  • tools/mtmd/models/models.h
  • src/llama-vocab.cpp
  • tools/mtmd/models/utuvl.cpp
  • tools/mtmd/mtmd.cpp
  • src/unicode.cpp
  • tools/mtmd/clip.cpp
  • gguf-py/gguf/tensor_mapping.py
  • tools/mtmd/clip-impl.h
  • convert_hf_to_gguf.py
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Always activate the Python virtual environment in .venv and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in .flake8
Ensure Python code passes pyright type checking as configured in pyrightconfig.json

Files:

  • gguf-py/gguf/constants.py
  • gguf-py/gguf/tensor_mapping.py
  • convert_hf_to_gguf.py
src/**/*.cpp

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Prioritize performance optimization in core library implementations in src/, as this is a performance-critical inference library

Files:

  • src/llama-vocab.cpp
  • src/unicode.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/CMakeLists.txt
  • tools/mtmd/mtmd.cpp
  • tools/mtmd/clip.cpp
  • tools/mtmd/clip-impl.h
🧬 Code graph analysis (3)
tools/mtmd/models/models.h (1)
tools/mtmd/models/utuvl.cpp (2)
  • build (3-190)
  • build (3-3)
tools/mtmd/models/utuvl.cpp (1)
ggml/src/ggml.c (9)
  • ggml_reshape_4d (3583-3601)
  • ggml_permute (3700-3752)
  • ggml_cont_4d (3492-3508)
  • ggml_mul_mat (3174-3189)
  • ggml_add (1969-1974)
  • ggml_reshape_2d (3546-3562)
  • ggml_set_name (1841-1848)
  • ggml_cast (3433-3445)
  • ggml_build_forward_expand (6793-6795)
convert_hf_to_gguf.py (3)
gguf-py/gguf/gguf_writer.py (10)
  • add_expert_feed_forward_length (714-715)
  • add_expert_count (800-801)
  • add_expert_shared_count (806-807)
  • add_expert_weights_scale (815-816)
  • add_expert_weights_norm (818-819)
  • add_clip_projector_type (1083-1084)
  • add_vision_attention_layernorm_eps (1104-1105)
  • add_vision_use_gelu (1122-1123)
  • add_vision_use_silu (1125-1126)
  • add_vision_spatial_merge_size (1119-1120)
convert_lora_to_gguf.py (1)
  • modify_tensors (448-473)
gguf-py/gguf/constants.py (1)
  • VisionProjectorType (3414-3435)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

10521-10521: Avoid specifying long messages outside the exception class

(TRY003)


10536-10536: Consider moving this statement to an else block

(TRY300)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: build-linux-cross / debian-13-loongarch64-vulkan-cross
  • GitHub Check: build-linux-cross / debian-13-loongarch64-cpu-cross
  • GitHub Check: build-cmake-pkg / linux
  • GitHub Check: windows-latest-cmake (llvm-arm64, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/ar...
  • GitHub Check: build-linux-cross / ubuntu-24-riscv64-cpu-spacemit-ime-cross
  • GitHub Check: windows-latest-cmake (cpu-x64 (static), x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmak...
  • GitHub Check: windows-latest-cmake (llvm-arm64-opencl-adreno, arm64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN...
  • GitHub Check: windows-latest-cmake (openblas-x64, x64, -G "Ninja Multi-Config" -D CMAKE_TOOLCHAIN_FILE=cmake/x6...
  • GitHub Check: windows-latest-cmake (vulkan-x64, x64, -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=OFF -DLLAMA_BUILD...
  • GitHub Check: windows-msys2 (CLANG64, clang-x86_64, Release)
  • GitHub Check: ubuntu-24-cmake-vulkan-deb
  • GitHub Check: ubuntu-24-cmake-webgpu
  • GitHub Check: ubuntu-24-wasm-webgpu
  • GitHub Check: macOS-latest-cmake-visionos
  • GitHub Check: windows-msys2 (UCRT64, ucrt-x86_64, Release)
  • GitHub Check: ios-xcode-build
  • GitHub Check: macOS-latest-cmake-ios
  • GitHub Check: macOS-latest-cmake-arm64-webgpu
  • GitHub Check: ubuntu-24-cmake-vulkan
  • GitHub Check: macOS-latest-cmake-arm64
🔇 Additional comments (19)
src/unicode.cpp (2)

967-971: LGTM! Unicode letter subcategories correctly added.

The addition of Lu, Ll, Lt, Lm, and Lo subcategories mapping to the LETTER flag is correct and follows the existing pattern. The inline comments clearly document each subcategory type.


1082-1103: LGTM! Dynamic \p{...} pattern resolution is well-implemented.

The replacement of fixed-length pattern matching with dynamic resolution correctly:

  • Searches for the closing brace dynamically (Line 1087)
  • Validates the pattern length with a reasonable limit (Line 1088)
  • Conditionally inserts brackets to avoid nesting when not inside a character class (Lines 1091-1098)
  • Advances the loop index correctly past the matched pattern (Line 1099)

The boundary checks are sound, and the logic handles varying-length Unicode property patterns correctly.
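
As a language-agnostic illustration only (a Python sketch, not the C++ in unicode.cpp), the dynamic handling boils down to:

def collect_unicode_props(regex: str, max_prop_len: int = 32) -> list[str]:
    # find each \p{...}, bound the property-name length, and advance past the closing brace
    props = []
    i = 0
    while i < len(regex):
        if regex.startswith("\\p{", i):
            close = regex.find("}", i + 3)
            if close != -1 and close - (i + 3) <= max_prop_len:
                props.append(regex[i + 3:close])
                i = close + 1
                continue
        i += 1
    return props

assert collect_unicode_props(r"[\p{Lu}\p{Han}]+") == ["Lu", "Han"]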

gguf-py/gguf/constants.py (1)

3414-3435: UTUVL projector string is consistent with existing naming

VisionProjectorType.UTUVL = "utuvl" aligns with the new C++ projector type name and other projector strings; no issues here.

gguf-py/gguf/tensor_mapping.py (1)

1218-1458: UTUVL/Siglip2 tensor mappings look consistent; validate against real checkpoints

The added "merger.*" and "siglip2.vision_model.*" source keys for V_MMPROJ, V_ENC_* and related tensors are consistent with existing patterns for other vision models and correctly target the corresponding MODEL_TENSOR entries. Behavior remains purely additive.

It would still be good to run at least one UTUVL HF → GGUF conversion and load in mtmd to confirm all expected tensors are picked up and no additional Siglip2 variants are missing.

tools/mtmd/CMakeLists.txt (1)

5-31: UTUVL model source correctly wired into mtmd target

Adding models/utuvl.cpp to the mtmd library matches the pattern of other model sources and is sufficient to include the new graph implementation in the build.

tools/mtmd/clip-impl.h (1)

163-221: Enum and string mapping for UTUVL are correctly aligned

PROJECTOR_TYPE_UTUVL is added to projector_type and mapped to "utuvl" in PROJECTOR_TYPE_NAMES, matching the Python-side constant and gguf metadata string. No issues seen.

src/llama-vocab.h (1)

9-55: New UTU_VL pre-tokenizer enum value is fine

Appending LLAMA_VOCAB_PRE_TYPE_UTU_VL = 43 to llama_vocab_pre_type is consistent with existing entries and keeps ordinal values stable. Assuming the corresponding handling is implemented in llama-vocab.cpp, this looks good.

tools/mtmd/mtmd.cpp (1)

268-290: UTUVL reuses Qwen-style vision delimiters correctly

Including PROJECTOR_TYPE_UTUVL in the Qwen2VL/Qwen25VL/Qwen3VL branch makes UTUVL use <|vision_start|> / <|vision_end|> markers, which is consistent with how other Qwen-based VL projectors are handled.

Please double‑check that the UTUVL text tokenizer indeed defines these exact special tokens.

tools/mtmd/models/models.h (1)

25-28: UTUVL clip graph declaration matches existing pattern

clip_graph_utuvl follows the same inheritance and constructor pattern as other clip_graph_* types and declares a virtual build() override, which fits the documented implementation in utuvl.cpp.

src/llama-vocab.cpp (1)

317-322: LGTM - UTU_VL regex patterns are well-structured.

The regex expressions correctly handle Korean Hangul, CJK punctuation, Bopomofo, and Chinese/Japanese characters, followed by a comprehensive pattern for mixed-case words with contractions. This follows the established pattern for other multilingual tokenizers in the codebase.

tools/mtmd/models/utuvl.cpp (3)

3-14: LGTM - Clean initialization with appropriate assertions.

The setup correctly initializes window attention parameters. The batch_size = 1 constraint is properly enforced via GGML_ASSERT statements later in the code (lines 79, 146).


18-49: LGTM - Window partitioning logic for patch embeddings.

The sequence of reshape/permute/cont operations correctly reorganizes patches into a windowed format for efficient attention computation. The math appears consistent: patches are organized into 2×2 windows (m=2) and the final shape matches the expected (3*patch_size*patch_size, Hm*Wm*m*m, 1) before the patch embedding projection.
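
To make the shape bookkeeping concrete, a small numpy sketch of the 2×2 regrouping (illustrative only; ggml lists dimensions in the reverse order):

import numpy as np

patch_size, m = 14, 2
Hp, Wp = 8, 6                      # patch grid, must be multiples of m
Hm, Wm = Hp // m, Wp // m
C = 3 * patch_size * patch_size    # flattened pixel values per patch

x = np.arange(Hp * Wp * C, dtype=np.float32).reshape(Hp, Wp, C)
# group patches into m x m windows: (Hm, m, Wm, m, C) -> (Hm, Wm, m, m, C)
windows = x.reshape(Hm, m, Wm, m, C).transpose(0, 2, 1, 3, 4)
out = windows.reshape(Hm * Wm * m * m, C)
assert out.shape == (Hp * Wp, 3 * patch_size * patch_size)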


86-138: LGTM - Standard transformer layer implementation with windowed attention.

The layer loop correctly implements:

  • Pre-norm architecture with residual connections
  • Q/K/V projections with RoPE positional encoding
  • Windowed attention with full attention every 8th layer and on the last layer ((il + 1) % 8 == 0 || il == n_layer - 1)
  • FFN with configurable activation
tools/mtmd/clip.cpp (5)

848-851: LGTM!

The UTUVL case follows the established pattern for other projector types in the graph builder switch statement.


1511-1518: LGTM!

Tensor loading correctly loads UTUVL-specific tensors including the RMS normalization weight for the merger.


2993-2994: LGTM!

Token counting functions correctly include UTUVL alongside related projector types (QWEN2VL, QWEN25VL, QWEN3VL, GLM4V) that share the same merge-based output token calculation.

Also applies to: 3009-3010, 3070-3076


3477-3496: LGTM!

Position calculation for UTUVL correctly remaps positions based on window grouping, following the same pattern as QWEN25VL but with parameterized merge_ratio which is cleaner.


3668-3669: LGTM!

The embedding dimension correctly uses mm_1_b->ne[0], consistent with tensor loading and similar projector types.

convert_hf_to_gguf.py (1)

1176-1177: UTUVL tokenizer hash entry looks consistent; just ensure runtime support exists.

The new chkhsh == "9d7013..."res = "utu-vl" branch follows the same pattern as other models and should work as long as the C++ side and convert_hf_to_gguf_update.py have been updated to recognize "utu-vl" as a valid tokenizer.ggml.pre value. Nothing functionally wrong here from this script’s perspective; just double‑check the upstream/runtime side is in sync so this identifier isn’t treated as “unknown pre‑tokenizer”.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
tools/mtmd/models/utuvl.cpp (1)

41-41: Minor: Inconsistent spacing in dimension calculation.

Line 41 has inconsistent spacing: 3*patch_size* patch_size should be 3 * patch_size * patch_size for consistency with the rest of the codebase.

Proposed fix
-        3*patch_size* patch_size,  Hm * Wm * m * m, 1);
+        3 * patch_size * patch_size, Hm * Wm * m * m, 1);
tools/mtmd/clip.cpp (1)

1165-1175: Consider adding guidance to the warning message.

The warning at line 1173-1174 only states the minimum requirement but doesn't provide guidance on how to fix it, unlike the Qwen-VL warning (lines 1160-1162) which suggests --image-min-tokens 1024.

Proposed enhancement
                     if (hparams.image_min_pixels < warn_min_pixels) {
                         LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
+                        LOG_WRN("%s: if you encounter problems with accuracy, try adding --image-min-tokens 1024\n", __func__);
                     }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8733bf3 and 1600974.

📒 Files selected for processing (2)
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/utuvl.cpp
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/mtmd/clip.cpp
  • tools/mtmd/models/utuvl.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/mtmd/clip.cpp
  • tools/mtmd/models/utuvl.cpp
🧠 Learnings (2)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.587Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
🔇 Additional comments (8)
tools/mtmd/models/utuvl.cpp (3)

64-83: LGTM!

The window attention setup correctly handles index remapping and mask creation, with appropriate F16 casting for flash attention compatibility.


86-138: LGTM!

The transformer layer implementation correctly handles:

  • Window/full attention alternation pattern (every 8th layer + final layer uses full attention)
  • Multi-head attention with separate Q/K/V projections
  • RoPE positional encoding via ggml_rope_multi
  • Proper residual connections and FFN processing

160-160: Verify the hardcoded epsilon value for merger RMS norm.

Line 160 uses a hardcoded 1e-6 for the RMS normalization epsilon, while other build_norm calls in this function use the eps member variable. This may be intentional if the VLPatchMerger requires a different epsilon, but it's worth confirming this matches the original model specification.

tools/mtmd/clip.cpp (5)

848-851: LGTM!

The graph builder selection for UTUVL follows the established pattern for other projector types.


1511-1518: LGTM!

The tensor loading correctly loads the merger normalization weight and MLP projection weights, matching the graph construction in utuvl.cpp.


2993-2994: LGTM!

The token counting logic for UTUVL correctly shares the same calculation as Qwen and GLM4V models, consistent with the 2x spatial merge configuration.

Also applies to: 3009-3010, 3070-3071
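
For reference, a simplified sketch of that shared calculation (assumption: this ignores the alignment and min/max pixel clamping done during preprocessing):

def n_output_tokens(img_w: int, img_h: int, patch_size: int = 14, merge: int = 2) -> int:
    # ViT patches first, then the spatial merger combines merge x merge patches per token
    return (img_w // patch_size // merge) * (img_h // patch_size // merge)

assert n_output_tokens(448, 448) == 16 * 16  # 256 output tokens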


3597-3598: LGTM!

The embedding size resolution correctly uses mm_1_b->ne[0], consistent with the merger MLP output tensor loaded for UTUVL.


3345-3360: and


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tools/mtmd/clip.cpp (1)

2751-2765: Misleading comment: this is linear search, not binary search

The comment on line 2751 states "Binary search with 0.02 step size" but the implementation performs a linear decrement (scale -= 0.02f) until the patch count fits within limits. While the logic works correctly, the comment should be updated to accurately describe the approach (e.g., "Linear search: decrement scale by 0.02 until patches fit").

Based on learnings, this is a mirrored PR from upstream, so ngxson should be informed to relay this feedback to the contributor.

🧹 Nitpick comments (4)
tools/mtmd/models/utuvl.cpp (2)

9-13: Patch reshaping pipeline is algebraically sound but could use explicit invariants and fewer magic numbers

The reshape/permute/cont chain from build_inp_raw() down to the final ggml_cont_3d(..., 3*patch_size*patch_size, Hm * Wm * m * m, 1) is consistent in terms of element count and yields the expected [3 * patch_size^2, n_patches, 1] layout as long as n_patches_x and n_patches_y are multiples of m (2 here). Right now that’s only implicit; a pair of assertions like GGML_ASSERT(Wp % m == 0); and GGML_ASSERT(Hp % m == 0); would make the preconditions obvious and fail fast if a future UTUVL variant uses different shapes. Likewise, const int spatial_merge_unit = 4; later in the file is logically m * m; deriving it from m instead of hard‑coding 4 would keep the encoder and merger logic in sync if the window size ever changes.

Also applies to: 20-45


5-8: Encoder/window-attention/merger wiring matches the QwenVL pattern but encodes several implicit contracts

The UTUVL graph largely mirrors clip_graph_qwen2vl:

  • use_window_attn = !hparams.wa_layers.empty() and full_attn = use_window_attn ? hparams.wa_layers.count(il) > 0 : true give clear per-layer control, assuming wa_layers uses 0‑based layer indices.
  • The inv_window_idx windowing (Lines 67–86) and subsequent window_idx un-windowing (Lines 143–154) both assume a fixed 4‑token spatial unit; if the converter ever changes its grouping scheme, both indices must stay perfectly inverse over those 4‑token blocks or the restored sequence will be scrambled.
  • The merger section (Lines 156–175) applies RMS norm with mm_input_norm_w and then a 2‑layer FFN over a [n_embd * 4, n_pos / 4, 1] view, consistent with the SigLIP‑style VLPatchMerger described in the PR.

Given how tightly this all depends on converter‑side metadata (WA_LAYERS contents, window indices, and patch grid layout), it would be helpful to document these assumptions in comments and to verify with a small UTUVL sample that round‑tripping through windowing/unwindowing truly preserves token order.

Also applies to: 63-86, 88-141, 143-176
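
A small round-trip check along the lines suggested above could look like this (illustrative only; idx stands in for the window ordering computed in clip.cpp):

import random

def round_trips(idx: list[int]) -> bool:
    inv_idx = [0] * len(idx)
    for i, j in enumerate(idx):
        inv_idx[j] = i
    tokens = list(range(len(idx)))
    windowed = [tokens[j] for j in idx]        # apply windowing
    restored = [windowed[j] for j in inv_idx]  # undo windowing
    return restored == tokens

idx = list(range(64))
random.shuffle(idx)  # any permutation must round-trip through its inverse
assert round_trips(idx)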

convert_hf_to_gguf.py (2)

3812-3815: Qwen2VL window‑attention layer export is correct but assert is a bit brittle

Using fullatt_block_indexes and wiring it directly into add_vision_wa_layers matches how UTUVL is handled and gives llama.cpp the exact full‑attention block layout it expects. One minor robustness nit: relying on assert fullatt_block_indexes is not None means running Python with -O would silently skip the check; if you ever touch this outside a mirror, consider a normal if ...: raise ValueError(...) for clearer failures.
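
A minimal sketch of that alternative (field and writer names are taken from the review above, not re-verified against the script):

fullatt_block_indexes = self.hparams.get("fullatt_block_indexes")
if fullatt_block_indexes is None:
    raise ValueError("fullatt_block_indexes is required to export clip.vision.wa_layers")
self.gguf_writer.add_vision_wa_layers(fullatt_block_indexes)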


10489-10535: UTUVL mmproj integration is functionally sound; minor style nits only

Functionally, UtuVLVisionModel looks correct:

  • Reuses MmprojModel so self.hparams becomes the vision config, then normalizes image_size and emits projector metadata: projector type UTUVL, attention eps, spatial merge size, optional window_size, and window‑attention layers via add_vision_wa_layers.
  • Activation handling groups all GELU‑style variants (gelu_pytorch_tanh, gelu_fast, etc.) under vision_use_gelu, with SILU mapped to vision_use_silu, which matches how other vision projectors in this file are encoded.
  • modify_tensors cleanly skips the text side and maps only tensors that have a corresponding entry in tensor_map, logging and ignoring anything unmapped instead of raising.

Two small, non‑blocking nits (you may want to leave them as‑is to stay aligned with upstream):

  • The ValueError(f"Unsupported activation function for UTUVL: {hidden_act}") triggers Ruff TRY003; if you ever refactor, you could move this into a custom exception type or shorten the message, but it’s harmless.
  • Ruff TRY300 would prefer a try/except/else shape in modify_tensors instead of returning directly from the try block, again purely stylistic.

Given this is a mirror PR, I’d keep the implementation as‑is and surface any style tweaks upstream if they decide to change their patterns.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1600974 and 867709c.

📒 Files selected for processing (10)
  • convert_hf_to_gguf.py
  • convert_hf_to_gguf_update.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/gguf_writer.py
  • src/llama-model.cpp
  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/qwen2vl.cpp
  • tools/mtmd/models/utuvl.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.{cpp,h,hpp}: Always format C++ code using git clang-format before committing, following .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer/reference alignment in middle)
Minimize external dependencies; avoid adding new external dependencies unless absolutely necessary

Files:

  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/qwen2vl.cpp
  • tools/mtmd/clip-impl.h
  • tools/mtmd/models/utuvl.cpp
  • src/llama-model.cpp
**/*.{cpp,h,hpp,py}

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Ensure cross-platform compatibility by testing code changes on Linux, macOS, and Windows when possible

Files:

  • tools/mtmd/clip-model.h
  • gguf-py/gguf/constants.py
  • tools/mtmd/clip.cpp
  • tools/mtmd/models/qwen2vl.cpp
  • tools/mtmd/clip-impl.h
  • tools/mtmd/models/utuvl.cpp
  • convert_hf_to_gguf_update.py
  • src/llama-model.cpp
  • convert_hf_to_gguf.py
  • gguf-py/gguf/gguf_writer.py
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: Always activate the Python virtual environment in .venv and use tools from that environment for Python development
Ensure Python code meets flake8 linting standards with max-line-length=125 as configured in .flake8
Ensure Python code passes pyright type checking as configured in pyrightconfig.json

Files:

  • gguf-py/gguf/constants.py
  • convert_hf_to_gguf_update.py
  • convert_hf_to_gguf.py
  • gguf-py/gguf/gguf_writer.py
src/**/*.cpp

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

Prioritize performance optimization in core library implementations in src/, as this is a performance-critical inference library

Files:

  • src/llama-model.cpp
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
  • tools/mtmd/clip-impl.h
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to src/llama.cpp : Maintain core library implementation in `src/llama.cpp` with careful attention to API contracts and backward compatibility

Applied to files:

  • src/llama-model.cpp
📚 Learning: 2025-11-29T22:55:53.865Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-29T22:55:53.865Z
Learning: Applies to include/llama.h : Exercise careful consideration when making changes to the public API in `include/llama.h`, as API stability is critical

Applied to files:

  • src/llama-model.cpp
🧬 Code graph analysis (2)
tools/mtmd/models/utuvl.cpp (3)
tools/mtmd/models/qwen2vl.cpp (2)
  • build (3-182)
  • build (3-3)
tools/mtmd/clip.cpp (18)
  • model (217-219)
  • model (935-1283)
  • model (935-935)
  • model (1971-1984)
  • model (1971-1971)
  • build_inp_raw (469-474)
  • build_inp_raw (469-469)
  • positions (3299-3299)
  • positions (3467-3467)
  • positions (3479-3479)
  • build_norm (476-499)
  • build_norm (476-482)
  • build_attn (583-646)
  • build_attn (583-591)
  • build_ffn (501-581)
  • build_ffn (501-510)
  • cb (255-264)
  • cb (255-255)
ggml/src/ggml.c (16)
  • ggml_reshape_4d (3583-3601)
  • ggml_permute (3700-3752)
  • ggml_cont_4d (3492-3508)
  • ggml_cont_3d (3483-3490)
  • ggml_mul_mat (3174-3189)
  • ggml_add (1969-1974)
  • ggml_reshape_2d (3546-3562)
  • ggml_new_tensor_1d (1747-1752)
  • ggml_set_name (1841-1848)
  • ggml_set_input (7435-7437)
  • ggml_new_tensor_2d (1754-1761)
  • ggml_cast (3433-3445)
  • ggml_get_rows (3776-3797)
  • ggml_reshape_3d (3564-3581)
  • ggml_rope_multi (4121-4140)
  • ggml_build_forward_expand (6793-6795)
src/llama-model.cpp (1)
src/llama-model-loader.cpp (2)
  • create_tensor (817-838)
  • create_tensor (817-817)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

10509-10509: Avoid specifying long messages outside the exception class

(TRY003)


10531-10531: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (19)
convert_hf_to_gguf_update.py (1)

148-148: LGTM — local repo path is properly supported.

The new utu-vl entry correctly follows the existing pattern. The local directory path ./Youtu-VL is a valid approach since download_model() already handles local paths via the os.path.isdir(repo) branch (lines 201-213), copying files instead of downloading.

Note for contributors: running this script requires the ./Youtu-VL directory to be present locally with the necessary tokenizer files (config.json, tokenizer.json, tokenizer_config.json).

src/llama-model.cpp (2)

4702-4706: DeepSeek: Optional output.weight with tied-embedding fallback is consistent and safe

Making output.weight non-required and falling back to token_embd with TENSOR_DUPLICATED matches the existing pattern used for other arches (e.g., LLaMA, GPT2, BLOOM) and correctly handles GGUFs that rely on tied word embeddings instead of a separate output head. Legacy DeepSeek GGUFs that include output.weight will continue to load unchanged, while new tied-embedding variants avoid a hard “missing tensor” failure.

I don’t see functional or performance issues here; just recommend exercising both a legacy DeepSeek model (with explicit output.weight) and a new tied-embedding GGUF to confirm the fallback behaves as intended. Based on learnings, this keeps the mirror PR aligned with upstream while being robust to model variations.


4769-4773: DeepSeek2: Mirrored tied-embedding fallback matches DeepSeek and other arches

The DeepSeek2 block now mirrors the DeepSeek logic: output.weight is optional (TENSOR_NOT_REQUIRED), and when absent, output is created as a duplicated token_embd tensor. This is consistent with how tied embeddings are handled elsewhere in this file and should be compatible with DeepSeek V2 GGUFs that omit a separate output head.

Behavior for existing DeepSeek2 GGUFs that already define output.weight is preserved, since the fallback only triggers when that tensor is missing. As with the previous hunk, it would be good to load at least one existing DeepSeek V2 GGUF and one new Youtu‑VL/DeepSeek2 variant to validate that both the direct and fallback paths work correctly. Based on learnings, this looks like a faithful mirror of upstream tied-output handling.

tools/mtmd/models/qwen2vl.cpp (2)

7-8: Window-attention enable/disable and full-attn gating look consistent with wa_layers semantics

use_window_attn = !hparams.wa_layers.empty() combined with full_attn = use_window_attn ? hparams.wa_layers.count(il) > 0 : true makes behavior explicit and avoids the previous pattern-based heuristic. This should keep old models (no WA_LAYERS key) on full attention while allowing per-layer overrides when WA_LAYERS is present, assuming wa_layers uses 0‑based layer indices consistent with il. You may want to double-check that the converter always emits 0‑based indices here so UTUVL/Qwen2.5VL don’t accidentally shift the intended full-attn layers.

Also applies to: 79-82
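
For illustration, the per-layer gating reduces to something like this Python sketch (0-based indices assumed, matching il):

def full_attn_flags(n_layer: int, wa_layers: set[int]) -> list[bool]:
    use_window_attn = bool(wa_layers)
    return [(il in wa_layers) if use_window_attn else True for il in range(n_layer)]

# no WA_LAYERS metadata: every layer stays on full attention (old models)
assert full_attn_flags(4, set()) == [True, True, True, True]
# WA_LAYERS = {3}: only layer 3 gets full attention, the rest are windowed
assert full_attn_flags(4, {3}) == [False, False, False, True]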


156-176: Projection/window reorder block assumes a specific relation between projection_dim and n_embd

The reshape at Line 173 requires hparams.projection_dim * (n_patches_x * n_patches_y / 4) to equal the element count of embeddings coming out of the FFN (n_embd * n_patches_x * n_patches_y). That implies hparams.projection_dim == 4 * n_embd when n_pos == n_patches. If UTUVL/Qwen2.5VL models ever use a different projection dimension, this reshape will assert at runtime. Please confirm that the converter always sets projection_dim accordingly, or, if not, adjust the reshape to derive its first dimension from the actual tensor shape instead of projection_dim.

tools/mtmd/clip-model.h (1)

59-64: wa_layers wiring in clip_hparams is consistent and idiomatic

Adding std::unordered_set<int32_t> wa_layers alongside vision_feature_layer is a straightforward way to represent sparse full-attention layers. This aligns well with the wa_layers.count(il) > 0 checks in the graphs and keeps the default “no window attention” behavior when the set is empty.

tools/mtmd/clip-impl.h (1)

51-52: WA_LAYERS key and PROJECTOR_TYPE_UTUVL mapping are coherent across C++ and Python

KEY_WIN_ATTN_LAYERS uses the same "clip.vision.wa_layers" string as the Python constants, and PROJECTOR_TYPE_UTUVL is correctly registered in PROJECTOR_TYPE_NAMES as "utuvl", matching VisionProjectorType.UTUVL. This should make the new Youtu‑VL projector discoverable and correctly parsed from GGUF without extra glue.

Also applies to: 163-221

gguf-py/gguf/constants.py (1)

283-299: New vision metadata keys and UTUVL projector string are aligned with the C++ side

Keys.ClipVision.WA_LAYERS = "clip.vision.wa_layers" and WINDOW_SIZE = "clip.vision.window_size" line up with the C++ macros and the new hparams.wa_layers / attn_window_size fields, and VisionProjectorType.UTUVL = "utuvl" matches the C++ PROJECTOR_TYPE_UTUVL name. The only behavioral change to be aware of is that legacy clip.vision.n_wa_pattern metadata will no longer be recognized; that’s expected for this refactor, but worth keeping in mind when converting older QwenVL models.

Also applies to: 3415-3437

gguf-py/gguf/gguf_writer.py (1)

1131-1138: New WA_LAYERS/WINDOW_SIZE writer helpers are correctly implemented

add_vision_wa_layers() and add_vision_window_size() correctly emit an INT32 array and UINT32 scalar for the new clip.vision.wa_layers and clip.vision.window_size keys, and the empty‑sequence early return for arrays is consistent with the runtime behavior. Both methods are already being used in convert_hf_to_gguf.py.
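
For illustration, here is a hedged sketch of what the two helpers amount to on top of the writer's generic helpers (the add_array/add_uint32 method names are assumptions):

WA_LAYERS_KEY   = "clip.vision.wa_layers"    # Keys.ClipVision.WA_LAYERS
WINDOW_SIZE_KEY = "clip.vision.window_size"  # Keys.ClipVision.WINDOW_SIZE

def add_vision_wa_layers(writer, layers) -> None:
    if not layers:  # empty sequences are skipped, matching the runtime default of no window attention
        return
    writer.add_array(WA_LAYERS_KEY, [int(x) for x in layers])

def add_vision_window_size(writer, size: int) -> None:
    writer.add_uint32(WINDOW_SIZE_KEY, size)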

Likely an incorrect or invalid review comment.

tools/mtmd/clip.cpp (6)

848-851: LGTM: UTUVL graph builder integration

The UTUVL case addition follows the same pattern as other projector types and correctly instantiates the clip_graph_utuvl builder.


1154-1161: LGTM: Window attention layer loading for QWEN25VL

The window attention layer loading logic is correct and properly populates the wa_layers set.


1172-1188: LGTM: UTUVL hyperparameter initialization

The UTUVL hyperparameter loading follows the established pattern from QWEN2VL/QWEN25VL models. The configuration properly requires attn_window_size and wa_layers, sets appropriate token limits, and includes a helpful warning about minimum pixel requirements for grounding tasks.


1256-1262: LGTM: Window attention layers logging

The logging addition provides useful debug information about which layers use window attention.


1530-1537: LGTM: UTUVL projection tensor loading

The tensor loading correctly maps to the UTUVL merger architecture (ln_q normalization and MLP layers). The inline comments helpfully document the tensor mappings.


3012-3034: LGTM: UTUVL integration in token counting and batch encoding

The UTUVL additions to clip_n_output_tokens_x, clip_n_output_tokens_y, clip_n_output_tokens, batch encoding logic, and clip_n_mmproj_embd are consistent with the QWEN2VL/QWEN25VL/GLM4V architecture. The window attention handling at lines 3364-3445 correctly uses the default attn_window_size of 112 for QWEN25VL (where it's optional) while UTUVL mandates it during loading.

Also applies to: 3089-3095, 3364-3445, 3616-3617

convert_hf_to_gguf.py (4)

1233-1235: Tokenizer hash entry for Youtu‑VL looks consistent

The new chkhsh → res = "utu-vl" mapping is in line with the existing pattern in get_vocab_base_pre and should safely identify the Youtu‑VL pre‑tokenizer variant. No further changes needed here.


7131-7137: DeepseekV2Model registration of UTUVLForCausalLM is coherent

Adding "UTUVLForCausalLM" to the DeepseekV2Model registry plugs UTUVL’s text side into the existing DeepSeek‑V2 path while the new UtuVLVisionModel handles the mmproj side via the MMPROJ registry. The dual registration under different ModelTypes should not conflict and aligns with how other multimodal families are wired.


7213-7233: MoE metadata export for DeepseekV2/UTUVL is reasonable

The added block that sets expert_feed_forward_length, expert_count, expert_shared_count, and expert weight scale/norm from hparams mirrors patterns used in other MoE implementations (e.g. Qwen2Moe/Dots1). The fallbacks (intermediate_size and defaults of 0/1.0) give sane behavior when some MoE fields are absent, including non‑MoE models reusing this class. Just be aware that you will now always emit an expert_feed_forward_length (possibly 0) and expert_shared_count (0) for such models; if upstream ever cares about omitting these entirely, that would need a follow‑up there rather than in this mirror.


7247-7257: DeepseekV2 tensor‑filtering and MoE guard logic look solid

  • Skipping vision_tower / multi_modal_projector / siglip2. / merger. tensors is consistent with deferring UTUVL/KimiVL vision handling to the mmproj path instead of the text model.
  • The tie_word_embeddings check that drops lm_head.weight when tied prevents redundant output‑head storage and matches patterns used elsewhere in this file.
  • Gating the experts merge block on n_routed_experts being present avoids crashes for non‑MoE configs while keeping behavior unchanged for actual MoE setups.

All of these changes improve robustness without altering expected conversions for supported Deepseek/UTUVL variants.

Also applies to: 7271-7299


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
tools/mtmd/clip.cpp (1)

2745-2759: Misleading comment: this is a linear search, not binary search.

The comment says "Binary search with 0.02 step size" but the implementation is actually a linear search that decrements the scale by 0.02 until the patch count fits within limits.

Proposed fix
-                // Binary search with 0.02 step size
+                // Linear search: decrement scale by 0.02 until patches fit
                 while (scale > 0.0f) {
🧹 Nitpick comments (4)
vendor/minja/minja.hpp (1)

1467-1479: Implementation looks correct.

The rsplit function correctly implements right-to-left string splitting. Edge cases are handled properly:

  • Empty string: returns array with single empty string
  • Separator not found: returns array with the original string
  • Separators at boundaries: correctly produces empty strings
  • The if (pos == 0) break on Line 1474 prevents underflow when decrementing pos
Optional performance optimization

Inserting at the beginning of a vector is O(n) per operation, making this O(n×k) where k is the number of separators. For better performance, consider using push_back and then reversing:

 static std::vector<std::string> rsplit(const std::string & s, const std::string & sep) {
   std::vector<std::string> result;
   size_t end = s.length();
   size_t pos = s.rfind(sep);
   while (pos != std::string::npos) {
-    result.insert(result.begin(), s.substr(pos + sep.length(), end - pos - sep.length()));
+    result.push_back(s.substr(pos + sep.length(), end - pos - sep.length()));
     end = pos;
     if (pos == 0) break;
     pos = s.rfind(sep, pos - 1);
   }
-  result.insert(result.begin(), s.substr(0, end));
+  result.push_back(s.substr(0, end));
+  std::reverse(result.begin(), result.end());
   return result;
 }

However, the current implementation is acceptable for typical use cases with reasonably-sized strings.
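As a usage illustration, here is a self-contained check of the edge cases listed above; it reproduces the helper's logic locally (mirroring the snippet above) rather than calling the minja.hpp function directly:

#include <cassert>
#include <string>
#include <vector>

// Local copy of the rsplit logic shown above so the asserts can run standalone.
static std::vector<std::string> rsplit(const std::string & s, const std::string & sep) {
    std::vector<std::string> result;
    size_t end = s.length();
    size_t pos = s.rfind(sep);
    while (pos != std::string::npos) {
        result.insert(result.begin(), s.substr(pos + sep.length(), end - pos - sep.length()));
        end = pos;
        if (pos == 0) break;
        pos = s.rfind(sep, pos - 1);
    }
    result.insert(result.begin(), s.substr(0, end));
    return result;
}

int main() {
    using V = std::vector<std::string>;
    assert(rsplit("a.b.c", ".") == (V{"a", "b", "c"}));   // ordinary case
    assert(rsplit("",      ".") == (V{""}));              // empty string
    assert(rsplit("abc",   ".") == (V{"abc"}));           // separator not found
    assert(rsplit(".a",    ".") == (V{"", "a"}));         // separator at the start
    assert(rsplit("a.",    ".") == (V{"a", ""}));         // separator at the end
    return 0;
}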

convert_hf_to_gguf.py (2)

7218-7237: Expert metadata wiring is reasonable but could be made conditional on actual MoE usage

The new block that unconditionally writes expert metadata:

  • Picks expert_feed_forward_length from moe_intermediate_size when present, otherwise from intermediate_size (or 0).
  • Sets expert_count, expert_shared_count (defaulting to 0), expert_weights_scale (default 1.0), and expert_weights_norm when norm_topk_prob is truthy.

This matches how other MoE models in this file use GGUF’s expert metadata, and should work for Deepseek‑style MoE (including UTUVL). One small improvement would be to only emit these expert keys when the model is actually MoE (e.g., when n_routed_experts or moe_intermediate_size is present) to avoid writing “dummy” expert lengths like 0 on dense models:

moe_intermediate_size = hparams.get("moe_intermediate_size")
n_routed_experts = hparams.get("n_routed_experts")
n_shared_experts = hparams.get("n_shared_experts")

if moe_intermediate_size is not None or n_routed_experts is not None or n_shared_experts is not None:
    self.gguf_writer.add_expert_feed_forward_length(
        moe_intermediate_size if moe_intermediate_size is not None
        else hparams["intermediate_size"]
    )
    # ... then expert_count / shared_count / scale / norm as you already do

Not strictly required for correctness, but it reduces ambiguity for non‑MoE arches that happen to share this code path.


10494-10540: UTUVL mmproj implementation is minimal but coherent; only minor nits

The UtuVLVisionModel class follows the existing Mmproj patterns:

  • Uses hparams_vision as the main config source and defaults image_size if missing, matching other vision projectors.

  • set_gguf_parameters():

    • Calls the base MmprojModel implementation, then sets clip_projector_type to UTUVL.
    • Derives attention layernorm eps, activation (GELU variants vs SiLU), spatial merge size, window size, and window‑attention layer indices (wa_layers) from self.hparams, which is already rebound to the vision config in MmprojModel.__init__.
    • The activation handling correctly collapses the usual gelu_* variants into the “use GELU” flag and will error out on unknown activations instead of silently mis‑encoding them.
  • modify_tensors():

    • Drops obvious text‑side tensors via skip_prefixes, leaving vision/projector tensors to go through map_tensor_name.
    • Falls back to logging and skipping on ValueError from map_tensor_name, which is a reasonable failure mode for any stragglers.

Two tiny, optional cleanups if you care about style/static‑analysis noise:

  • The ValueError message for unsupported activations could be shortened (or factored into a helper) to satisfy TRY003, but it’s already clear and not excessive.
  • The try/except in modify_tensors already encloses only the map_tensor_name call; there’s no real benefit to restructuring it just for TRY300.

Functionally this looks fine to ship as‑is.

tools/mtmd/clip.cpp (1)

1165-1181: Consider adding the fix suggestion to the warning message.

The Qwen-VL warning (lines 1160-1162) includes helpful guidance about --image-min-tokens and links to related issues. The UTUVL warning is missing these details.

Suggested improvement
                         if (hparams.image_min_pixels < warn_min_pixels) {
                             LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
+                            LOG_WRN("%s: if you encounter problems with accuracy, try adding --image-min-tokens 1024\n", __func__);
                         }
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 867709c and 3e816b2.

📒 Files selected for processing (7)
  • convert_hf_to_gguf.py
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/gguf_writer.py
  • tools/mtmd/clip-impl.h
  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • vendor/minja/minja.hpp
🚧 Files skipped from review as they are similar to previous changes (3)
  • tools/mtmd/clip-impl.h
  • gguf-py/gguf/constants.py
  • gguf-py/gguf/gguf_writer.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{c,cpp,h,hpp}

📄 CodeRabbit inference engine (AGENTS.md)

Always format C++ code before committing using git clang-format with the project's .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment void * ptr, reference alignment int & ref)

Files:

  • tools/mtmd/clip-model.h
  • vendor/minja/minja.hpp
  • tools/mtmd/clip.cpp
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Always use the Python environment in .venv and run Python tools from that environment
Apply Python linting rules configured in .flake8 (max-line-length=125, excludes examples/tools) and type checking with pyright

Files:

  • convert_hf_to_gguf.py
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated

Applied to files:

  • tools/mtmd/clip-model.h
  • tools/mtmd/clip.cpp
  • convert_hf_to_gguf.py
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated

Applied to files:

  • tools/mtmd/clip.cpp
  • convert_hf_to_gguf.py
🧬 Code graph analysis (1)
convert_hf_to_gguf.py (2)
gguf-py/gguf/gguf_writer.py (12)
  • add_expert_feed_forward_length (714-715)
  • add_expert_count (800-801)
  • add_expert_shared_count (806-807)
  • add_expert_weights_scale (815-816)
  • add_expert_weights_norm (818-819)
  • add_clip_projector_type (1083-1084)
  • add_vision_attention_layernorm_eps (1104-1105)
  • add_vision_use_gelu (1122-1123)
  • add_vision_use_silu (1125-1126)
  • add_vision_spatial_merge_size (1119-1120)
  • add_vision_window_size (1140-1141)
  • add_vision_wa_layers (1134-1135)
gguf-py/gguf/constants.py (1)
  • VisionProjectorType (3416-3437)
🪛 Ruff (0.14.10)
convert_hf_to_gguf.py

10514-10514: Avoid specifying long messages outside the exception class

(TRY003)


10536-10536: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (11)
vendor/minja/minja.hpp (1)

1590-1598: LGTM!

The rsplit method integration follows the same pattern as the existing split method and correctly:

  • Validates that exactly one separator argument is provided
  • Calls the rsplit utility function
  • Converts the result to a Value array
tools/mtmd/clip-model.h (1)

64-64: LGTM! The wa_layers field addition is well-placed, appropriately typed for storing window attention layer indices, and properly used throughout the codebase (populated in clip.cpp, checked in utuvl.cpp, and logged as needed).

convert_hf_to_gguf.py (3)

1233-1235: New tokenizer pre‑hash entry for Youtu‑VL looks consistent

The added chkhsh branch returning "utu-vl" matches the existing pattern in get_vocab_base_pre (fixed hash → short model identifier). Nothing else in this function depends on that value, so this is safe and localized.


7139-7140: UTUVL text/vision registration pattern is consistent with other multimodal models

Adding "UTUVLForCausalLM" to DeepseekV2Model’s @ModelBase.register(...) list, and also registering it for UtuVLVisionModel (MMPROJ) later, mirrors how Qwen2VL/Qwen2VLVisionModel and similar pairs are wired: same HF architecture string, different ModelType buckets. This looks intentional and should not cause collisions.


7253-7263: DeepseekV2 tensor‑handling tweaks for vision bits, tied head, and experts look sound

  • Skipping tensors starting with "siglip2." or "merger." keeps UTUVL/vision‑side pieces out of the text GGUF, which is consistent with how other multimodal text models here drop their vision parts.
  • Skipping lm_head.weight when tie_word_embeddings is true avoids redundant storage and matches patterns already used in other model classes.
  • Guarding the MoE merge logic under self.hparams.get("n_routed_experts") is not None prevents accidentally treating non‑MoE models that might use mlp.experts‑like names as routed experts.

All of these are defensive and align with surrounding conventions; I don’t see functional issues.

Also applies to: 7275-7276, 7318-7325

tools/mtmd/clip.cpp (6)

848-851: LGTM - Graph builder case follows established pattern.

The UTUVL case is correctly added to dispatch to clip_graph_utuvl, consistent with other projector types.


1249-1256: LGTM - Logging additions are functional.

The wa_layers logging correctly handles UTUVL's window attention layer configuration.


1524-1531: LGTM - Tensor loading follows QWEN pattern correctly.

The tensor indices (0 and 2) correctly skip the activation layer (index 1), matching the merger MLP structure.


3006-3007: LGTM - Token counting correctly groups UTUVL with similar models.

UTUVL uses the same 2x merge ratio as the Qwen VL models, so sharing the token count logic is appropriate.

Also applies to: 3022-3023, 3083-3084


3358-3362: LGTM - Window attention detection correctly differentiates UTUVL from QWEN25VL.

The logic appropriately uses wa_layers for UTUVL (set via KEY_WIN_ATTN_LAYERS) vs n_wa_pattern for QWEN25VL, reflecting their different metadata structures.
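
To make the contrast concrete, a hedged sketch of the two selection schemes; the membership semantics of wa_layers and the n_wa_pattern rule below are assumptions for illustration only, the authoritative logic lives in clip.cpp:

#include <cstdint>
#include <cstdio>
#include <unordered_set>

enum projector_type { PROJECTOR_TYPE_UTUVL, PROJECTOR_TYPE_QWEN25VL };

// Assumed semantics: UTUVL consults an explicit set of layer indices, while
// QWEN25VL treats every n_wa_pattern-th layer as full attention.
static bool uses_window_attn(projector_type proj, int32_t il,
                             const std::unordered_set<int32_t> & wa_layers,
                             int32_t n_wa_pattern) {
    if (proj == PROJECTOR_TYPE_UTUVL) {
        return wa_layers.count(il) > 0;      // listed layers use window attention (assumed)
    }
    return (il + 1) % n_wa_pattern != 0;     // all but every n-th layer (assumed)
}

int main() {
    std::unordered_set<int32_t> wa_layers = {0, 1, 2, 4, 5, 6};   // hypothetical indices
    for (int32_t il = 0; il < 8; ++il) {
        std::printf("layer %d: utuvl=%d qwen25vl=%d\n", il,
                    uses_window_attn(PROJECTOR_TYPE_UTUVL,    il, wa_layers, 8),
                    uses_window_attn(PROJECTOR_TYPE_QWEN25VL, il, wa_layers, 8));
    }
    return 0;
}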


3610-3611: LGTM - Embedding dimension correctly derived from mm_1_b.

Consistent with the tensor loading and matches the pattern used by QWEN2VL/QWEN25VL.

auto charset = chars.empty() ? " \t\n\r" : chars;
auto start = left ? s.find_first_not_of(charset) : 0;
if (start == std::string::npos) return "";


⚠️ Potential issue | 🟡 Minor

Inconsistent indentation within the strip function.

This line now has 4-space indentation while the surrounding lines (1447-1448, 1450-1451) use 2-space indentation. According to the coding guidelines, the project uses 4-space indentation and should be formatted with git clang-format. Either this entire function should be reformatted consistently, or this line should match the existing 2-space indentation used throughout the function.

🔎 Suggested fix
   auto start = left ? s.find_first_not_of(charset) : 0;
-    if (start == std::string::npos) return "";
+  if (start == std::string::npos) return "";
   auto end = right ? s.find_last_not_of(charset) : s.size() - 1;

Alternatively, reformat the entire function with git clang-format to use consistent 4-space indentation as per the coding guidelines.

🤖 Prompt for AI Agents
In vendor/minja/minja.hpp around line 1449, the single line "if (start ==
std::string::npos) return "";" has inconsistent 4-space indentation while
surrounding lines use 2-space, causing style drift; fix by reformatting the
entire strip function (or at least that line) to match the project style: run
git clang-format (or apply the project's clang-format settings) so the function
uses consistent 4-space indentation throughout (or change the line to the
surrounding 2-space indentation if you must match existing local lines), then
stage the formatted file and update the commit.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tools/mtmd/clip.cpp (1)

2746-2759: Misleading comment: this is a linear search, not binary search.

The comment says "Binary search with 0.02 step size" but the implementation actually performs a linear search that decrements the scale by 0.02 until the patch count fits within limits.

🔎 Proposed fix
-                // Binary search with 0.02 step size
+                // Linear search: decrement scale by 0.02 until patches fit
                 while (scale > 0.0f) {

Based on learnings, this is a mirrored PR so I'm reporting this to ngxson for relay to the contributor.

🧹 Nitpick comments (1)
tools/mtmd/clip.cpp (1)

1179-1179: Minor: Fix grammatical error in warning message.

The warning message says "at minimum 1 image tokens" but should use singular "token" for consistency.

🔎 Proposed fix
-                            LOG_WRN("%s: Youtu-VL models require at minimum 1 image tokens to function correctly on grounding tasks\n", __func__);
+                            LOG_WRN("%s: Youtu-VL models require at minimum 1 image token to function correctly on grounding tasks\n", __func__);
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3e816b2 and 3ec91fb.

📒 Files selected for processing (1)
  • tools/mtmd/clip.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{c,cpp,h,hpp}

📄 CodeRabbit inference engine (AGENTS.md)

Always format C++ code before committing using git clang-format with the project's .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment void * ptr, reference alignment int & ref)

Files:

  • tools/mtmd/clip.cpp
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated

Applied to files:

  • tools/mtmd/clip.cpp
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated

Applied to files:

  • tools/mtmd/clip.cpp
🧬 Code graph analysis (1)
tools/mtmd/clip.cpp (3)
tools/mtmd/mtmd-cli.cpp (3)
  • ctx (287-287)
  • params (132-147)
  • params (132-132)
src/llama-model.cpp (4)
  • get_tensor (7114-7124)
  • get_tensor (7114-7114)
  • size (6826-6828)
  • size (6826-6826)
tools/server/server-context.cpp (2)
  • params (614-860)
  • params (614-614)
🔇 Additional comments (6)
tools/mtmd/clip.cpp (6)

848-851: LGTM! Graph builder integration looks correct.

The UTUVL case follows the established pattern for other projector types.


1249-1256: LGTM! Logging implementation is consistent.

The wa_layers logging follows the established pattern for model hyperparameters.


1524-1531: LGTM! Tensor loading follows the standard pattern.

The tensor mappings are clearly commented and consistent with the UTUVL architecture.


3006-3007: LGTM! Token counting logic is consistent.

The UTUVL token counting follows the same pattern as other models with spatial merging (QWEN2VL, QWEN25VL, etc.).

Also applies to: 3022-3023, 3083-3084


3610-3611: LGTM! Embedding dimension retrieval is correct.

The UTUVL case correctly returns the embedding dimension from mm_1_b, consistent with similar model architectures.


3373-3373: Verify that the attn_window_size default of 112 is correct for UTUVL models.

The code at line 3373 defaults attn_window_size to 112 when not set. However, this value is loaded from the GGUF file via get_u32(KEY_ATTN_WINDOW_SIZE, ...) with a "true" parameter, indicating it is required for UTUVL. If the GGUF file always provides this value, the default may be unused in practice. Confirm that 112 is the correct window size for UTUVL and whether the shared default with QWEN25VL is appropriate.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
tools/mtmd/clip.cpp (2)

1177-1180: Warning message is incomplete and potentially confusing.

The warning says "require at minimum 1 image tokens" but the actual check is against warn_min_pixels, which equals patch_size² × merge_size² = 1024 for typical settings (patch=16, merge=2); a quick numeric check of that figure follows the suggestion below. The message should either reflect the computed pixel requirement or, like the Qwen-VL warning above, provide guidance such as suggesting --image-min-tokens.

Suggested improvement
                         const int warn_min_pixels = 1 * hparams.n_merge * hparams.n_merge * hparams.patch_size * hparams.patch_size;
                         if (hparams.image_min_pixels < warn_min_pixels) {
-                            LOG_WRN("%s: Youtu-VL models require at minimum 1 image tokens to function correctly on grounding tasks\n", __func__);
+                            LOG_WRN("%s: Youtu-VL models may require minimum image tokens for grounding tasks\n", __func__);
+                            LOG_WRN("%s: if you encounter problems with accuracy, try increasing --image-min-tokens\n", __func__);
                         }
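
A quick numeric check of the 1024 figure cited above, under the typical settings the comment names (patch=16, merge=2); the values are illustrative, not read from a model:

#include <cstdio>

int main() {
    const int patch_size = 16, n_merge = 2;                  // typical settings per the comment
    const int warn_min_pixels = 1 * n_merge * n_merge * patch_size * patch_size;
    std::printf("warn_min_pixels = %d\n", warn_min_pixels);  // 1 * 2*2 * 16*16 = 1024
    return 0;
}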

2746-2759: Linear search may exit without a valid scale in edge cases.

The loop decrements scale by 0.02 until patches fit or scale reaches 0. If the minimum possible image size (align_size × align_size) still exceeds max_num_patches, the loop exits with potentially invalid dimensions. While this is unlikely with typical configurations (minimum 4 patches vs default limit 256), consider adding a safeguard or logging.

Additionally, the linear search could require ~50 iterations for large images. A binary search would be more efficient, though the linear cost may be acceptable for a preprocessing step; a sketch of such a binary search follows the validation snippet below.

Optional: Add validation after loop
                 // Linear search with 0.02 step size
                 while (scale > 0.0f) {
                     target_height = get_scaled_image_size(scale, original_size.height);
                     target_width = get_scaled_image_size(scale, original_size.width);
                     
                     int num_patches_h = target_height / patch_size;
                     int num_patches_w = target_width / patch_size;
                     int num_patches = num_patches_h * num_patches_w;
                     
                     if (num_patches > max_num_patches) {
                         scale -= 0.02f;
                     } else {
                         break;
                     }
                 }
+                
+                // Validate final dimensions
+                int final_patches = (target_height / patch_size) * (target_width / patch_size);
+                if (final_patches > max_num_patches) {
+                    LOG_WRN("%s: UTUVL preprocessing could not fit image within %d patches (got %d)\n", 
+                            __func__, max_num_patches, final_patches);
+                }
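
And the binary-search alternative mentioned earlier, as a hedged standalone sketch; get_scaled_image_size, its alignment behaviour, and the parameter values are assumptions modeled on the review context, not the actual clip.cpp code:

#include <algorithm>
#include <cmath>
#include <cstdio>

// Assumed helper: scale a dimension and round it up to a multiple of align_size.
static int get_scaled_image_size(float scale, int size, int align_size = 16) {
    int scaled = std::max(1, (int) std::lroundf(scale * (float) size));
    return ((scaled + align_size - 1) / align_size) * align_size;
}

// Find (approximately) the largest scale in (0, initial_scale] whose patch count
// fits within max_num_patches. The patch count is monotone in the scale, so a
// fixed number of bisection steps converges quickly.
static float find_fitting_scale(int width, int height, int patch_size,
                                int max_num_patches, float initial_scale) {
    float lo = 0.0f, hi = initial_scale;
    for (int i = 0; i < 30; ++i) {
        float mid = 0.5f * (lo + hi);
        int   nw  = get_scaled_image_size(mid, width)  / patch_size;
        int   nh  = get_scaled_image_size(mid, height) / patch_size;
        if (nw * nh > max_num_patches) {
            hi = mid;   // too many patches -> shrink the scale
        } else {
            lo = mid;   // fits -> try a larger scale
        }
    }
    return lo;          // may be ~0 if even the smallest aligned size does not fit
}

int main() {
    float scale = find_fitting_scale(4096, 3072, 16, 256, 1.0f);
    std::printf("fitting scale: %.3f\n", scale);
    return 0;
}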
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3ec91fb and 251852a.

📒 Files selected for processing (1)
  • tools/mtmd/clip.cpp
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{c,cpp,h,hpp}

📄 CodeRabbit inference engine (AGENTS.md)

Always format C++ code before committing using git clang-format with the project's .clang-format configuration (4-space indentation, 120 column limit, braces on same line for functions, pointer alignment void * ptr, reference alignment int & ref)

Files:

  • tools/mtmd/clip.cpp
🧠 Learnings (4)
📓 Common learnings
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 0
File: :0-0
Timestamp: 2025-12-22T23:32:42.603Z
Learning: For mirrored PRs (those with [Mirror] in title or mirroring upstream), ngxson wants AI-assisted code review with these preferences: check the upstream PR URL for description, be nit-picky about obvious mistakes, don't tag contributors directly (inform ngxson instead), and alert about suspicious AI-generated code comments like "CRITICAL: ..." or "You are correct here...".

When reviewing mirrored PRs, the PR is from an upstream contributor, not from ngxson himself. Any issues should be reported to ngxson so he can relay them to the contributor.
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 14
File: examples/lookahead/lookahead.cpp:97-98
Timestamp: 2025-03-13T22:56:47.417Z
Learning: ngxson prefers to prioritize simplification in some cases and defer adding error handling to a later time, as indicated by their response to a suggestion about adding error checks for llama_decode_ext calls.
📚 Learning: 2025-05-26T09:45:20.653Z
Learnt from: ngxson
Repo: ngxson/llama.cpp PR: 25
File: tools/mtmd/mtmd.cpp:275-293
Timestamp: 2025-05-26T09:45:20.653Z
Learning: In tools/mtmd/clip.cpp, PROJECTOR_TYPE_QWEN25O is a placeholder that gets replaced by either PROJECTOR_TYPE_QWEN25VL (for vision) or PROJECTOR_TYPE_QWEN2A (for audio) before the respective init_vision() or init_audio() functions are called, ensuring proper token handling.

Applied to files:

  • tools/mtmd/clip.cpp
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {include/llama.h,ggml/**/*.h,mtmd/**/*.h} : Public API modifications in `include/llama.h`, `ggml.h`, or `mtmd.h` require AI usage disclosure if significant code is generated

Applied to files:

  • tools/mtmd/clip.cpp
📚 Learning: 2025-12-24T11:28:22.582Z
Learnt from: CR
Repo: ngxson/llama.cpp PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-12-24T11:28:22.582Z
Learning: Applies to {ggml/**,src/**/backend*.{c,cpp,h,hpp},tools/server/**} : Backend-related changes (CPU, CUDA, Metal, Vulkan, etc.) and modifications to `tools/server` require AI usage disclosure if significant code is generated

Applied to files:

  • tools/mtmd/clip.cpp
🔇 Additional comments (8)
tools/mtmd/clip.cpp (8)

848-851: LGTM!

The UTUVL graph builder integration follows the established pattern for other projector types.


1249-1256: LGTM!

The logging for window attention configuration is helpful for debugging.


1524-1531: LGTM!

Tensor loading follows the established pattern for similar projector types (Qwen2VL, Qwen25VL). The index jump (0, 2) for LLAVA_PROJ appears intentional based on the model architecture.


3006-3007: LGTM!

Token dimension calculation follows the same pattern as similar projectors using 2x2 patch merging.


3022-3023: LGTM!

Consistent with x-dimension token calculation.


3083-3084: LGTM!

Dynamic token count calculation is consistent with the 2x merge ratio configuration.


3362-3373: Window attention logic correctly handles UTUVL configuration.

The conditional properly routes UTUVL to use wa_layers for determining window attention, while QWEN25VL uses n_wa_pattern. Since UTUVL loads attn_window_size as required (line 1169), the fallback to 112 shouldn't be hit for UTUVL.

The magic number 112 could be extracted to a named constant for clarity, but this is a minor nit given it primarily affects QWEN25VL.


3610-3611: LGTM!

Embedding dimension extraction follows the same pattern as similar projector types.

@f291400 f291400 closed this by deleting the head repository Dec 25, 2025