Conversation


@f291400 f291400 commented Dec 23, 2025

Support for the large youtu-vl model, which will be open-sourced soon.

LOG_WRN("%s: more info: https://github.com/ggml-org/llama.cpp/issues/16842\n\n", __func__);
}
} break;
case PROJECTOR_TYPE_UTUVL:
Collaborator

Instead of duplicating the code block, add it to the same case above:

                case PROJECTOR_TYPE_QWEN2VL:
                case PROJECTOR_TYPE_QWEN25VL:
                case PROJECTOR_TYPE_QWEN3VL:
                case PROJECTOR_TYPE_UTUVL:

Author
@f291400 f291400 Dec 23, 2025

OK


set_input_i32("positions", positions);
} break;
case PROJECTOR_TYPE_UTUVL:
Collaborator
@ngxson ngxson Dec 23, 2025

do not duplicate code if there is no difference compared to qwen
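
For illustration, a minimal sketch of the fall-through being asked for here, assuming the UTUVL branch sets the same "positions" input as the Qwen branches (identifiers are taken from the snippets quoted in this thread, not from the actual PR diff):

    case PROJECTOR_TYPE_QWEN2VL:
    case PROJECTOR_TYPE_QWEN25VL:
    case PROJECTOR_TYPE_QWEN3VL:
    case PROJECTOR_TYPE_UTUVL:
        {
            // ... compute positions exactly as the existing Qwen branch already does ...
            set_input_i32("positions", positions);
        } break;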

Author
@f291400 f291400 Dec 23, 2025

OK

Collaborator

then make it a hparams.attn_window_size

Author

I have fixed it.

Collaborator

delete this file, reuse qwen2vl.cpp

Author

Utu differs from Qwen2.5 in several aspects; it's difficult to merge them together.

Collaborator

can you list these differences?

Author
@f291400 f291400 Dec 23, 2025

  1. Change conv3d to linear.
  2. Change const bool full_attn = use_window_attn ? (il + 1) % n_wa_pattern == 0 : true;
     to const bool full_attn = (il + 1) % 8 == 0 || il == n_layer - 1;
     which adds il == n_layer - 1, where n_layer = 27 (see the sketch after this list).
  3. Delete ff_gate_w in build_ffn.
  4. Swap the order of merge and window attention.
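
To make the difference in point 2 concrete, here is a small standalone sketch (not PR code) that prints the layers where the two rules disagree, assuming n_layer = 27 and an every-8-layers full-attention pattern as stated above:

    #include <cstdio>

    int main() {
        const int n_layer      = 27; // Youtu-VL vision tower depth, per the list above
        const int n_wa_pattern = 8;  // every 8th layer uses full attention
        for (int il = 0; il < n_layer; il++) {
            const bool qwen_full = (il + 1) % n_wa_pattern == 0;                      // Qwen2.5-VL rule
            const bool utu_full  = (il + 1) % n_wa_pattern == 0 || il == n_layer - 1; // Youtu-VL rule
            if (qwen_full != utu_full) {
                printf("layer %d: window attention under the Qwen rule, full attention under the Youtu-VL rule\n", il);
            }
        }
        return 0;
    }

With these values the only disagreement is the last layer (il = 26), which is exactly the added il == n_layer - 1 case.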

Collaborator

OK, thanks. Please also leave this list in a code comment.

// loop over layers
for (int il = 0; il < n_layer; il++) {
    const auto & layer = model.layers[il];
    const bool full_attn = (il + 1) % 8 == 0 || il == n_layer - 1;
Collaborator

What's the number 8 here? Can it be a hparam?
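
A hedged sketch of one way to do that, reusing the n_wa_pattern value that the converter already writes for Qwen2.5-VL (clip.vision.n_wa_pattern, quoted later in this thread); whether the clip hparams struct already exposes it for this code path is an assumption:

    // loop over layers
    const int n_wa_pattern = hparams.n_wa_pattern; // e.g. 8 for Youtu-VL, read from GGUF metadata instead of hard-coded
    for (int il = 0; il < n_layer; il++) {
        const auto & layer = model.layers[il];
        // the last layer always uses full attention, per the Youtu-VL rule discussed above
        const bool full_attn = (il + 1) % n_wa_pattern == 0 || il == n_layer - 1;
        // ...
    }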

Author

I have fixed it.

int num_patches = num_patches_h * num_patches_w;

if (num_patches > max_num_patches) {
    scale -= 0.02f;
Collaborator

what is the number 0.02?

Any reason why we cannot just reuse the code from the Qwen model?

Author

Our strategy is different from Qwen's; this approach is more suitable for this scenario.


if (use_window_attn) {
const int attn_window_size = 112;
const int attn_window_size = ctx->model.proj_type == PROJECTOR_TYPE_QWEN25VL ? 112 : patch_size * 2 * 8;
Collaborator

Extract these numbers into a new hparam instead:

hparams.attn_window_size
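
A minimal sketch of that refactor, assuming the new field is named attn_window_size as suggested and is loaded into the vision hparams from GGUF metadata (the ctx->model.hparams access path is an assumption based on the ctx->model.proj_type access in the quoted line):

    if (use_window_attn) {
        // read from the model instead of hard-coding 112 / patch_size * 2 * 8 per projector type
        const int attn_window_size = ctx->model.hparams.attn_window_size;
        // ... build the window attention indices as before, now driven by the hparam ...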

Author

I have fixed it.

Collaborator
@ngxson ngxson commented Dec 23, 2025

Please also fix any failed checks regarding code formatting.

Comment on lines 1176 to 1177
if chkhsh == "9d70134b369a70e5735009b6de918f7581b5211f7c074d1f89f753aea8248af1":
    res = "utu-vl"
Collaborator

Do not manually add these; they are generated by convert_hf_to_gguf_update.py. Edit that script and run it to get the correct entry here!

Comment on lines 7217 to 7236
if hparams.get("moe_intermediate_size") is not None:
    self.gguf_writer.add_expert_feed_forward_length(hparams["moe_intermediate_size"])
else:
    self.gguf_writer.add_expert_feed_forward_length(hparams.get("intermediate_size", 0))

if hparams.get("n_routed_experts") is not None:
    self.gguf_writer.add_expert_count(hparams["n_routed_experts"])

if hparams.get("n_shared_experts") is not None:
    self.gguf_writer.add_expert_shared_count(hparams["n_shared_experts"])
else:
    self.gguf_writer.add_expert_shared_count(0)

if hparams.get("routed_scaling_factor") is not None:
    self.gguf_writer.add_expert_weights_scale(hparams["routed_scaling_factor"])
else:
    self.gguf_writer.add_expert_weights_scale(1.0)

if hparams.get("norm_topk_prob") is not None and hparams["norm_topk_prob"]:
    self.gguf_writer.add_expert_weights_norm(hparams["norm_topk_prob"])
Collaborator

Don't fetch the same value in both the condition and the assignment; use a walrus assignment in the condition, as seen elsewhere.

Author

I have fixed it.

Comment on lines 7258 to 7265
# skip lm_head.weight if tie_word_embeddings is True
if self.hparams.get("tie_word_embeddings", False):
    # Save token_embd for potential duplication as output if tie_word_embeddings is True
    if name == "model.embed_tokens.weight":
        self._token_embd = data_torch
    if name == "lm_head.weight" or name == "model.lm_head.weight":
        logger.info("Skipping tied output layer 'lm_head.weight' - will duplicate from token_embd.weight")
        return []
Collaborator

Don't do this on conversion; do it on model load, like every other model.

Author

I have fixed it.

def add_vision_n_wa_pattern(self, value: int) -> None:
    self.add_uint32(Keys.ClipVision.N_WA_PATTERN, value)

def add_vision_wa_layers(self, layers: Sequence[int]) -> None:
    self.add_array(Keys.ClipVision.WA_LAYERS, layers)
Collaborator

Revert this change; add a dedicated metadata key instead.

USE_GELU = "clip.use_gelu"
USE_SILU = "clip.use_silu"
N_WA_PATTERN = "clip.vision.n_wa_pattern" # used by qwen2.5vl
WA_LAYERS = "clip.vision.wa_layers" # used by qwen2.5vl and utuvl
Collaborator

Revert this change; add a dedicated metadata key instead.

Comment on lines +3812 to +3815
# save window attention layers (full attention block indexes)
fullatt_block_indexes = hparams.get("fullatt_block_indexes")
assert fullatt_block_indexes is not None, "fullatt_block_indexes is required for qwen2_5_vl"
n_wa_pattern = fullatt_block_indexes[0] + 1
# validate n_wa_pattern
for i in range(1, len(fullatt_block_indexes)):
    if fullatt_block_indexes[i] - fullatt_block_indexes[i - 1] != n_wa_pattern:
        raise ValueError(f"Invalid fullatt_block_indexes: {fullatt_block_indexes}")
self.gguf_writer.add_vision_n_wa_pattern(n_wa_pattern)
self.gguf_writer.add_vision_wa_layers(fullatt_block_indexes)
Collaborator

Revert any changes to the Qwen code.

hparams.set_warmup_n_tokens(46*46); // avoid OOM on warmup
const int warn_min_pixels = 1024 * hparams.n_merge * hparams.n_merge * hparams.patch_size * hparams.patch_size;
if (hparams.image_min_pixels < warn_min_pixels) {
    LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
Collaborator

Is this warning true, or just a blind copy-paste?
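
For context, the arithmetic behind the quoted threshold, assuming n_merge = 2 and patch_size = 16 (typical Qwen-style ViT values, not confirmed for Youtu-VL):

    #include <cstdio>

    int main() {
        // hypothetical values, for illustration only
        const int n_merge    = 2;
        const int patch_size = 16;
        // 1024 merged tokens * (n_merge^2) patches per token * (patch_size^2) pixels per patch
        const int warn_min_pixels = 1024 * n_merge * n_merge * patch_size * patch_size;
        printf("warning threshold: %d pixels (about a 1024 x 1024 image)\n", warn_min_pixels); // prints 1048576
        return 0;
    }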

return std::max(align_size, aligned);
};

// Binary search with 0.02 step size
Collaborator

Is this a binary or a linear search? Where is the binary part?
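
For reference, a standalone sketch (not PR code) of what the quoted loop appears to do: shrink the resize scale in fixed 0.02 steps until the patch grid fits the budget, which is a linear scan over candidate scales rather than a binary search. Variable names mirror the snippets quoted above; the sizing logic is simplified.

    #include <algorithm>
    #include <cmath>
    #include <cstdio>

    int main() {
        const int patch_size      = 16;   // assumed patch size
        const int max_num_patches = 1024; // assumed patch budget
        const int img_w = 1920, img_h = 1080;

        float scale = 1.0f;
        int num_patches_w = 0, num_patches_h = 0;
        while (scale > 0.0f) {
            num_patches_w = std::max(1, (int) std::floor(img_w * scale / patch_size));
            num_patches_h = std::max(1, (int) std::floor(img_h * scale / patch_size));
            if (num_patches_w * num_patches_h <= max_num_patches) {
                break; // first scale (scanning downward) that fits
            }
            scale -= 0.02f; // fixed step size, i.e. a linear search
        }
        printf("scale = %.2f, grid = %d x %d, patches = %d\n",
               scale, num_patches_w, num_patches_h, num_patches_w * num_patches_h);
        return 0;
    }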

Labels: examples, python, python script changes