Support Youtu-VL Model #18315
Conversation
                LOG_WRN("%s: more info: https://github.com/ggml-org/llama.cpp/issues/16842\n\n", __func__);
            }
        } break;
    case PROJECTOR_TYPE_UTUVL:
Instead of duplicating the code block, add it to the same case above:

    case PROJECTOR_TYPE_QWEN2VL:
    case PROJECTOR_TYPE_QWEN25VL:
    case PROJECTOR_TYPE_QWEN3VL:
    case PROJECTOR_TYPE_UTUVL:
OK
tools/mtmd/clip.cpp (Outdated)

            set_input_i32("positions", positions);
        } break;
    case PROJECTOR_TYPE_UTUVL:
do not duplicate code if there is no difference compared to qwen
OK
then make it a hparams.attn_window_size
I have fixed it.
delete this file, reuse qwen2vl.cpp
Utu differs from Qwen2.5 in several aspects; it's difficult to merge them together.
can you list these differences?
- change conv3d to linear
- const bool full_attn = use_window_attn ? (il + 1) % n_wa_pattern == 0 : true; changed to const bool full_attn = (il + 1) % 8 == 0 || il == n_layer - 1; (added il == n_layer - 1, where n_layer = 27)
- delete ff_gate_w in build_ffn
- exchange merge and window attention
ok thanks. please also leave this list in the code comment
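A possible shape for that code comment (a sketch based on the list above; exact wording and placement are up to the author):

    // Youtu-VL vision tower vs. Qwen2.5-VL (see the PR discussion):
    //   - patch embedding uses a linear projection instead of a conv3d
    //   - full attention runs on every 8th layer and on the last layer:
    //       full_attn = (il + 1) % 8 == 0 || il == n_layer - 1;   // n_layer = 27
    //     (Qwen2.5-VL: full_attn = use_window_attn ? (il + 1) % n_wa_pattern == 0 : true)
    //   - the FFN has no gate weight (no ff_gate_w in build_ffn)
    //   - the patch-merge and window-attention steps are applied in the opposite order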
tools/mtmd/models/utuvl.cpp (Outdated)
    // loop over layers
    for (int il = 0; il < n_layer; il++) {
        const auto & layer = model.layers[il];
        const bool full_attn = (il + 1) % 8 == 0 || il == n_layer - 1;
What's the number 8 here? Can it be an hparam?
I have fixed it.
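For reference, a minimal sketch of how the literal could be driven by converter-written hparams (the field name is an assumption; the PR may reuse n_wa_pattern or store the full-attention layer indices instead):

    // sketch: derive the full-attention pattern from hparams instead of a hard-coded 8
    const bool full_attn = (il + 1) % hparams.n_wa_pattern == 0 || il == n_layer - 1;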
    int num_patches = num_patches_h * num_patches_w;

    if (num_patches > max_num_patches) {
        scale -= 0.02f;
What is the number 0.02?
Any reason why we cannot just reuse the code from the qwen model?
Our strategy is different from Qwen's; this approach is more suitable for this scenario.
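Reading the hunk above, the strategy appears to be: shrink the resize scale in fixed 0.02 decrements until the patch grid fits under max_num_patches. A rough sketch of that loop (variable names and rounding are assumptions, not the PR's actual code):

    // sketch: step the scale down by 0.02 until the patch count fits
    float scale = 1.0f;
    while (scale > 0.0f) {
        const int num_patches_w = (int) ceilf(img_w * scale / patch_size);
        const int num_patches_h = (int) ceilf(img_h * scale / patch_size);
        if (num_patches_w * num_patches_h <= max_num_patches) {
            break;
        }
        scale -= 0.02f; // fixed step size, i.e. a linear scan over candidate scales
    }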
tools/mtmd/clip.cpp (Outdated)

    if (use_window_attn) {
-       const int attn_window_size = 112;
+       const int attn_window_size = ctx->model.proj_type == PROJECTOR_TYPE_QWEN25VL ? 112 : patch_size * 2 * 8;
Extract these numbers into a new hparams field instead:
hparams.attn_window_size
I have fixed it.
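With that field in place, the per-model ternary goes away and the window size is read from metadata written at conversion time (112 for Qwen2.5-VL, patch_size * 2 * 8 for Youtu-VL). A minimal sketch, assuming the field is populated when the model is loaded:

    if (use_window_attn) {
        // window size comes from converter-written hparams instead of per-model literals
        const int attn_window_size = hparams.attn_window_size;
        // ... existing window-index computation continues unchanged
    }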
Please also fix any failed checks regarding code formatting.
convert_hf_to_gguf.py (Outdated)
    if chkhsh == "9d70134b369a70e5735009b6de918f7581b5211f7c074d1f89f753aea8248af1":
        res = "utu-vl"
Do not manually add these; they are generated by convert_hf_to_gguf_update.py. Edit that script and run it to get the correct entry here!
convert_hf_to_gguf.py (Outdated)
    if hparams.get("moe_intermediate_size") is not None:
        self.gguf_writer.add_expert_feed_forward_length(hparams["moe_intermediate_size"])
    else:
        self.gguf_writer.add_expert_feed_forward_length(hparams.get("intermediate_size", 0))

    if hparams.get("n_routed_experts") is not None:
        self.gguf_writer.add_expert_count(hparams["n_routed_experts"])

    if hparams.get("n_shared_experts") is not None:
        self.gguf_writer.add_expert_shared_count(hparams["n_shared_experts"])
    else:
        self.gguf_writer.add_expert_shared_count(0)

    if hparams.get("routed_scaling_factor") is not None:
        self.gguf_writer.add_expert_weights_scale(hparams["routed_scaling_factor"])
    else:
        self.gguf_writer.add_expert_weights_scale(1.0)

    if hparams.get("norm_topk_prob") is not None and hparams["norm_topk_prob"]:
        self.gguf_writer.add_expert_weights_norm(hparams["norm_topk_prob"])
Don't get the same value in the condition and the assignment; use a walrus assignment in the condition, as seen elsewhere.
I have fixed it.
    # skip lm_head.weight if tie_word_embeddings is True
    if self.hparams.get("tie_word_embeddings", False):
        # Save token_embd for potential duplication as output if tie_word_embeddings is True
        if name == "model.embed_tokens.weight":
            self._token_embd = data_torch
        if name == "lm_head.weight" or name == "model.lm_head.weight":
            logger.info("Skipping tied output layer 'lm_head.weight' - will duplicate from token_embd.weight")
            return []
Don't do this on conversion, do it on model load like every other model.
I have fixed it.
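For context, the usual load-time pattern in llama.cpp is to mark the output tensor optional and fall back to the token embedding when it is missing; roughly as below (treat the exact flag and helper spellings as assumptions):

    // sketch: tied embeddings handled at model load, not at conversion
    output = create_tensor(tn(LLM_TENSOR_OUTPUT, "weight"), {n_embd, n_vocab}, TENSOR_NOT_REQUIRED);
    if (output == NULL) {
        // no lm_head in the GGUF -> reuse token_embd as the output projection
        output = create_tensor(tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}, TENSOR_DUPLICATED);
    }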
    def add_vision_n_wa_pattern(self, value: int) -> None:
        self.add_uint32(Keys.ClipVision.N_WA_PATTERN, value)

    def add_vision_wa_layers(self, layers: Sequence[int]) -> None:
        self.add_array(Keys.ClipVision.WA_LAYERS, layers)
Revert this change; add a dedicated metadata key instead.
    USE_GELU     = "clip.use_gelu"
    USE_SILU     = "clip.use_silu"
    N_WA_PATTERN = "clip.vision.n_wa_pattern"  # used by qwen2.5vl
    WA_LAYERS    = "clip.vision.wa_layers"     # used by qwen2.5vl and utuvl
Revert this change; add a dedicated metadata key instead.
    # save window attention layers (full attention block indexes)
    fullatt_block_indexes = hparams.get("fullatt_block_indexes")
    assert fullatt_block_indexes is not None, "fullatt_block_indexes is required for qwen2_5_vl"
    n_wa_pattern = fullatt_block_indexes[0] + 1
    # validate n_wa_pattern
    for i in range(1, len(fullatt_block_indexes)):
        if fullatt_block_indexes[i] - fullatt_block_indexes[i - 1] != n_wa_pattern:
            raise ValueError(f"Invalid fullatt_block_indexes: {fullatt_block_indexes}")
    self.gguf_writer.add_vision_n_wa_pattern(n_wa_pattern)
    self.gguf_writer.add_vision_wa_layers(fullatt_block_indexes)
revert any changes to qwen code
    hparams.set_warmup_n_tokens(46*46); // avoid OOM on warmup
    const int warn_min_pixels = 1024 * hparams.n_merge * hparams.n_merge * hparams.patch_size * hparams.patch_size;
    if (hparams.image_min_pixels < warn_min_pixels) {
        LOG_WRN("%s: Youtu-VL models require at minimum 1024 image tokens to function correctly on grounding tasks\n", __func__);
Is this warning true, or just blindly copy-pasted?
        return std::max(align_size, aligned);
    };

    // Binary search with 0.02 step size
is this a binary or linear search? where is the binary part?
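For comparison, an actual binary search on the scale would bisect a bracket instead of stepping by a fixed 0.02; a hedged sketch (helper and variable names are assumptions):

    // sketch: bisect the largest scale whose patch grid still fits
    float lo = 0.0f, hi = 1.0f;
    for (int iter = 0; iter < 20; iter++) {      // ~1e-6 resolution after 20 halvings
        const float mid = 0.5f * (lo + hi);
        if (count_patches(img_w, img_h, mid) <= max_num_patches) {  // hypothetical helper
            lo = mid;  // fits -> try a larger scale
        } else {
            hi = mid;  // too many patches -> shrink
        }
    }
    const float scale = lo;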
Support for the large Youtu-VL model, which will be open-sourced soon.