Add files via upload #259

Open
youkuaz2 wants to merge 1 commit into FlagAI-Open:main from youkuaz2:main

Conversation

@youkuaz2

No description provided.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces task-specific prompt routing, Chain-of-Thought (CoT) prompts, and specialized example selection strategies (M05, M09, M10, M11, M19, M20) to optimize the long-context in-context learning annotation pipeline. It also implements a mixed context length strategy and task-differentiated configurations for LLM generation. However, several critical issues were identified: caching examples_str outside the loop in main.py defeats dynamic example selection and the mixed context length strategy; an incorrect HTML entity (&rt; instead of >) in the regex pattern prevents proper cleaning of the thinking process for Task 8; and repeatedly loading the tokenizer from disk across multiple example selection functions creates a severe performance bottleneck.

Comment on lines 78 to +80
  if examples_str is None:
-     examples_str = select_examples(icl_examples, task_description, text2annotate)
+     # M11 optimization: use the Task 7 Jeopardy clue-decomposition strategy
+     examples_str = select_examples_M11(icl_examples, task_description, text2annotate, task_id, sample_idx)

high

🚨 Logic Error: Caching examples_str defeats dynamic example selection

In the current implementation, examples_str is initialized to None outside the loop and cached after the first iteration:

if examples_str is None:
    examples_str = select_examples_M11(...)

Because of this, select_examples_M11 (or any other dynamic selection method) is only called once for the very first sample (sample_idx = 0). For all subsequent samples, the exact same examples are reused.

This completely breaks:

  1. Dynamic Retrieval: Examples are not selected based on the current sample's text2annotate similarity or keywords.
  2. Mixed Context Length Strategy: The sample_index < 50 check in select_examples_M11 will always evaluate to True (since it only runs for sample_idx = 0), meaning a 30k context is used for all samples, which is highly inefficient and defeats the 8k fallback optimization.
  3. Specialized Routing: The other specialized strategies (select_examples_M09 for Task 5, select_examples_M10 for Task 6, select_examples_M20 for other tasks) are imported but never used.
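
The effect can be reproduced in isolation. The following is a minimal sketch, not the PR's code: `select_examples_stub` is a hypothetical stand-in for the real selector, and only the `sample_idx < 50` threshold and the 30k/8k context sizes are taken from the description above.

```python
def select_examples_stub(text, sample_idx, calls):
    """Hypothetical stand-in selector: records the call and picks a context size."""
    calls.append(sample_idx)
    context = "30k" if sample_idx < 50 else "8k"
    return f"{context} examples for {text}"


samples = [f"sample-{i}" for i in range(100)]

# Cached pattern: the selector is guarded by `if examples_str is None`.
cached_calls = []
cached_prompts = []
examples_str = None
for sample_idx, text in enumerate(samples):
    if examples_str is None:
        examples_str = select_examples_stub(text, sample_idx, cached_calls)
    cached_prompts.append(examples_str)

# Per-sample pattern: the selector runs on every iteration.
dynamic_calls = []
dynamic_prompts = []
for sample_idx, text in enumerate(samples):
    dynamic_prompts.append(select_examples_stub(text, sample_idx, dynamic_calls))

print(len(cached_calls), len(dynamic_calls))
```

In the cached loop the stub runs only for `sample_idx = 0`, so every sample receives the same 30k-context string; in the per-sample loop samples 50 and above get the 8k variant.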

Recommendation: Remove the if examples_str is None: check and route the example selection dynamically based on task_id.

Suggested change
- if examples_str is None:
-     examples_str = select_examples(icl_examples, task_description, text2annotate)
-     # M11 optimization: use the Task 7 Jeopardy clue-decomposition strategy
-     examples_str = select_examples_M11(icl_examples, task_description, text2annotate, task_id, sample_idx)
+ # Dynamic example selection based on task type (M09, M10, M11, M20)
+ if task_id == 5:
+     examples_str = select_examples_M09(icl_examples, task_description, text2annotate, task_id, sample_idx)
+ elif task_id == 6:
+     examples_str = select_examples_M10(icl_examples, task_description, text2annotate, task_id, sample_idx)
+ elif task_id == 7:
+     examples_str = select_examples_M11(icl_examples, task_description, text2annotate, task_id, sample_idx)
+ else:
+     examples_str = select_examples_M20(icl_examples, task_description, text2annotate, task_id, sample_idx)
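
An equivalent way to express this routing is a lookup table, which may be easier to extend if more task-specific strategies are added. This is only a sketch: the four selectors below are stubs that reuse the names from the PR, and `SELECTORS` and `route_examples` are hypothetical names that do not exist in the codebase.

```python
def _stub(name):
    """Build a placeholder selector with the same signature as the PR's selectors."""
    def selector(icl_examples, task_description, text2annotate, task_id, sample_idx):
        return f"{name}:{text2annotate}"
    return selector


# Stand-ins for the real implementations (assumption: same call signature).
select_examples_M09 = _stub("M09")
select_examples_M10 = _stub("M10")
select_examples_M11 = _stub("M11")
select_examples_M20 = _stub("M20")

# Task-specific strategies; any task not listed falls back to M20.
SELECTORS = {
    5: select_examples_M09,
    6: select_examples_M10,
    7: select_examples_M11,
}


def route_examples(icl_examples, task_description, text2annotate, task_id, sample_idx):
    """Pick the selector for this task and run it for the current sample."""
    selector = SELECTORS.get(task_id, select_examples_M20)
    return selector(icl_examples, task_description, text2annotate, task_id, sample_idx)


print(route_examples([], "desc", "clue text", 7, 3))
```

Calling `route_examples` inside the loop, once per sample, keeps the selection dynamic while leaving the loop body a single line.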

Comment on lines +584 to +585
+ # Step 1: filter out the content of <think&rt; tags (strip the chain of thought)
+ cleaned_result = re.sub(r'<think&rt;.*?</think&rt;', '', whole_result, flags=re.DOTALL)

high

🐛 Bug: Incorrect HTML entity &rt; in regex pattern

The regex pattern uses <think&rt; and </think&rt; to clean the thinking process:

cleaned_result = re.sub(r'<think&rt;.*?</think&rt;', '', whole_result, flags=re.DOTALL)

However:

  1. The LLM output (whole_result) contains plain text tags like <think> and </think>, not HTML-escaped entities.
  2. Even if it were escaped, the correct HTML entity for > is &gt; (greater than), not &rt;.

Because of this, the regex will fail to match and remove the thinking process, which defeats the "M12 optimization" and can corrupt the extracted code.

Recommendation: Update the regex to match plain text <think> and </think> tags.

Suggested change
- # Step 1: filter out the content of <think&rt; tags (strip the chain of thought)
- cleaned_result = re.sub(r'<think&rt;.*?</think&rt;', '', whole_result, flags=re.DOTALL)
+ # Step 1: filter out the content of <think> tags (strip the chain of thought)
+ cleaned_result = re.sub(r'<think>.*?</think>', '', whole_result, flags=re.DOTALL)
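
The difference between the two patterns can be checked on a small input. This is a minimal sketch: the `sample` string is an invented model output, and `strip_thinking` is a hypothetical helper name, not a function from the PR.

```python
import re

# Plain-text tags, as they appear in the model output.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_thinking(whole_result: str) -> str:
    """Remove every <think>...</think> block, including multi-line ones."""
    return THINK_BLOCK.sub("", whole_result)


# Invented example output: a reasoning block followed by the answer.
sample = "<think>\nstep 1\nstep 2\n</think>\nprint('hello')"

# The pattern from the PR never matches, so the text comes back unchanged.
broken = re.sub(r"<think&rt;.*?</think&rt;", "", sample, flags=re.DOTALL)

cleaned = strip_thinking(sample)
print(repr(broken))
print(repr(cleaned))
```

`re.DOTALL` is what lets `.` span the newlines inside the block, and the non-greedy `.*?` stops at the first closing tag so that several separate blocks are removed independently. Note that this pattern leaves an unclosed `<think>` untouched, which may need separate handling if the model output can be truncated.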

Comment on lines 433 to +435
  # Initialize the Qwen3-4B tokenizer (automatically downloads/loads the Qwen3-4B tokenizer)
  # If the model is already downloaded locally, this can be replaced with a local path such as "./qwen3-4b"
- tokenizer = AutoTokenizer.from_pretrained("/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B", trust_remote_code=True)
+ tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True)

high

⚡ Performance Bottleneck: Repeatedly loading tokenizer from disk

In select_examples (and similarly in select_examples_M05, select_examples_M19, select_examples_M09, select_examples_M10, and select_examples_M11), the tokenizer is loaded from disk on every single function call:

tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True)

Since example selection is now dynamic and runs for every test sample, this will reload the tokenizer thousands of times, causing a massive performance bottleneck and extremely slow evaluation.

You already implemented a singleton helper get_tokenizer_m20() to cache the tokenizer:

def get_tokenizer_m20():
    global _tokenizer_m20
    if _tokenizer_m20 is None:
        _tokenizer_m20 = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True)
    return _tokenizer_m20

Recommendation: Use get_tokenizer_m20() across all select_examples functions to reuse the cached tokenizer instance.

Suggested change
- # Initialize the Qwen3-4B tokenizer (automatically downloads/loads the Qwen3-4B tokenizer)
- # If the model is already downloaded locally, this can be replaced with a local path such as "./qwen3-4b"
- tokenizer = AutoTokenizer.from_pretrained("/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B", trust_remote_code=True)
- tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True)
+ # Use the global singleton to avoid reloading the tokenizer repeatedly
+ tokenizer = get_tokenizer_m20()
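
The same load-once behavior could also be obtained with `functools.lru_cache` instead of a module-level global. The sketch below is an illustration under stated assumptions: `load_tokenizer_from_disk` is a hypothetical stand-in for `AutoTokenizer.from_pretrained` so that the example runs without the model files, and `get_tokenizer` is not a function from the PR.

```python
from functools import lru_cache

LOAD_CALLS = []  # records every simulated disk load


def load_tokenizer_from_disk(path):
    """Hypothetical stand-in for AutoTokenizer.from_pretrained(path, trust_remote_code=True)."""
    LOAD_CALLS.append(path)
    return {"path": path}  # placeholder object instead of a real tokenizer


@lru_cache(maxsize=None)
def get_tokenizer(path="/root/Qwen3-4B"):
    """Load the tokenizer on first use and return the cached instance afterwards."""
    return load_tokenizer_from_disk(path)


def select_examples_stub(text2annotate):
    """Stand-in selector: fetches the shared tokenizer on every call."""
    tokenizer = get_tokenizer()
    return tokenizer["path"], text2annotate


# Simulate a run over many samples; only the first call touches the "disk".
for i in range(1000):
    select_examples_stub(f"sample-{i}")

print(len(LOAD_CALLS))
```

One caveat with this variant: `lru_cache` keys on the arguments as passed, so `get_tokenizer()` and `get_tokenizer("/root/Qwen3-4B")` are cached separately. Calling it consistently without arguments avoids a second load.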
