Add files via upload #259
Code Review
This pull request introduces task-specific prompt routing, Chain-of-Thought (CoT) prompts, and specialized example selection strategies (M05, M09, M10, M11, M19, M20) to optimize the long-context in-context learning annotation pipeline. It also implements a mixed context length strategy and task-differentiated configurations for LLM generation. However, several critical issues were identified: caching examples_str outside the loop in main.py defeats dynamic example selection and the mixed context length strategy; an incorrect HTML entity (&rt; instead of >) in the regex pattern prevents proper cleaning of the thinking process for Task 8; and repeatedly loading the tokenizer from disk across multiple example selection functions creates a severe performance bottleneck.
| if examples_str is None: | ||
| examples_str = select_examples(icl_examples, task_description, text2annotate) | ||
| # M11 optimization: use the Task 7 Jeopardy clue-decomposition strategy | ||
| examples_str = select_examples_M11(icl_examples, task_description, text2annotate, task_id, sample_idx) |
🚨 Logic Error: Caching examples_str defeats dynamic example selection
In the current implementation, examples_str is initialized to None outside the loop and cached after the first iteration:

    if examples_str is None:
        examples_str = select_examples_M11(...)

Because of this, select_examples_M11 (or any other dynamic selection method) is called only once, for the very first sample (sample_idx = 0). For all subsequent samples, the exact same examples are reused.
This completely breaks:
- Dynamic Retrieval: Examples are not selected based on the current sample's text2annotate similarity or keywords.
- Mixed Context Length Strategy: The sample_index < 50 check in select_examples_M11 will always evaluate to True (since it only runs for sample_idx = 0), meaning a 30k context is used for all samples, which is highly inefficient and defeats the 8k fallback optimization.
- Specialized Routing: The other specialized strategies (select_examples_M09 for Task 5, select_examples_M10 for Task 6, select_examples_M20 for other tasks) are imported but never used.
Recommendation: Remove the if examples_str is None: check and route the example selection dynamically based on task_id.
| if examples_str is None: | |
| examples_str = select_examples(icl_examples, task_description, text2annotate) | |
| # M11 optimization: use the Task 7 Jeopardy clue-decomposition strategy | |
| examples_str = select_examples_M11(icl_examples, task_description, text2annotate, task_id, sample_idx) | |
| # Dynamic example selection based on task type (M09, M10, M11, M20) | |
| if task_id == 5: | |
|     examples_str = select_examples_M09(icl_examples, task_description, text2annotate, task_id, sample_idx) | |
| elif task_id == 6: | |
|     examples_str = select_examples_M10(icl_examples, task_description, text2annotate, task_id, sample_idx) | |
| elif task_id == 7: | |
|     examples_str = select_examples_M11(icl_examples, task_description, text2annotate, task_id, sample_idx) | |
| else: | |
|     examples_str = select_examples_M20(icl_examples, task_description, text2annotate, task_id, sample_idx) | |
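The same routing can also be written as a lookup table, which keeps the per-sample call in one place and makes it harder to cache the result by accident. The following is a minimal sketch, not the PR's code: the selectors are stand-in stubs that share the assumed signature (icl_examples, task_description, text2annotate, task_id, sample_idx), route_examples is a hypothetical name, and the 50-sample threshold with 30k/8k budgets only mimics the mixed context length strategy described in this review.

```python
from typing import Callable, Dict, List

Selector = Callable[[List[str], str, str, int, int], str]


def _stub(name: str) -> Selector:
    """Build a hypothetical selector that reports which strategy ran and on what."""
    def select(icl_examples, task_description, text2annotate, task_id, sample_idx):
        # Assumed mixed-context rule: large budget for early samples, small afterwards.
        budget = 30_000 if sample_idx < 50 else 8_000
        return f"{name}|budget={budget}|query={text2annotate}"
    return select


# Hypothetical stand-ins for select_examples_M09 / M10 / M11 / M20.
SELECTORS: Dict[int, Selector] = {5: _stub("M09"), 6: _stub("M10"), 7: _stub("M11")}
DEFAULT_SELECTOR: Selector = _stub("M20")


def route_examples(icl_examples, task_description, text2annotate, task_id, sample_idx):
    """Pick the selector for this task and call it for the current sample (no caching)."""
    selector = SELECTORS.get(task_id, DEFAULT_SELECTOR)
    return selector(icl_examples, task_description, text2annotate, task_id, sample_idx)


if __name__ == "__main__":
    # Each sample gets its own selection; the budget changes once sample_idx reaches 50.
    for idx in (0, 49, 50):
        print(route_examples([], "desc", f"sample-{idx}", 7, idx))
```

Because route_examples is called inside the loop for every sample, both the query text and the sample index reach the selector each time, which is what the cached version prevented.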
| # Step 1: strip the content of <think&rt; tags (remove the chain of thought) | ||
| cleaned_result = re.sub(r'<think&rt;.*?</think&rt;', '', whole_result, flags=re.DOTALL) |
🐛 Bug: Incorrect HTML entity &rt; in regex pattern
The regex pattern uses <think&rt; and </think&rt; to clean the thinking process:

    cleaned_result = re.sub(r'<think&rt;.*?</think&rt;', '', whole_result, flags=re.DOTALL)

However:
- The LLM output (whole_result) contains plain text tags like <think> and </think>, not HTML-escaped entities.
- Even if it were escaped, the correct HTML entity for > is &gt; (greater than), not &rt;.
Because of this, the regex will fail to match and remove the thinking process, which defeats the "M12 optimization" and can corrupt the extracted code.
Recommendation: Update the regex to match plain text <think> and </think> tags.
| # Step 1: strip the content of <think&rt; tags (remove the chain of thought) | |
| cleaned_result = re.sub(r'<think&rt;.*?</think&rt;', '', whole_result, flags=re.DOTALL) | |
| # Step 1: strip the content of <think> tags (remove the chain of thought) | |
| cleaned_result = re.sub(r'<think>.*?</think>', '', whole_result, flags=re.DOTALL) |
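To see why the original pattern never fires, it can be run side by side with the corrected one. This is a small sketch on a made-up model output; strip_thinking, BROKEN, and FIXED are hypothetical names used for illustration, not identifiers from the PR.

```python
import re

# Pattern as written in the PR: it looks for the literal text "&rt;", which never occurs.
BROKEN = re.compile(r'<think&rt;.*?</think&rt;', flags=re.DOTALL)
# Corrected pattern: matches the plain-text tags the model actually emits.
FIXED = re.compile(r'<think>.*?</think>', flags=re.DOTALL)


def strip_thinking(text: str, pattern: re.Pattern = FIXED) -> str:
    """Remove every block matched by `pattern` and trim surrounding whitespace."""
    return pattern.sub('', text).strip()


# Made-up output: a multi-line thinking block followed by the code to extract.
sample = "<think>\nplan the answer\n</think>\nprint('hello')"

if __name__ == "__main__":
    print(repr(strip_thinking(sample, BROKEN)))  # thinking block is still present
    print(repr(strip_thinking(sample, FIXED)))   # only the code remains
```

The re.DOTALL flag matters here because the thinking block usually spans several lines, and without it the dot would stop at the first newline.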
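| # Initialize the Qwen3-4B tokenizer (automatically downloads/loads the Qwen3-4B tokenizer) | ||
| # If the model is already downloaded locally, this can be replaced with a local path, e.g. "./qwen3-4b" | ||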
| tokenizer = AutoTokenizer.from_pretrained("/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B", trust_remote_code=True) | ||
| tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True) |
⚡ Performance Bottleneck: Repeatedly loading tokenizer from disk
In select_examples (and similarly in select_examples_M05, select_examples_M19, select_examples_M09, select_examples_M10, and select_examples_M11), the tokenizer is loaded from disk on every single function call:

    tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True)

Since example selection is now dynamic and runs for every test sample, this will reload the tokenizer thousands of times, causing a massive performance bottleneck and extremely slow evaluation.
You already implemented a singleton helper get_tokenizer_m20() to cache the tokenizer:

    def get_tokenizer_m20():
        global _tokenizer_m20
        if _tokenizer_m20 is None:
            _tokenizer_m20 = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True)
        return _tokenizer_m20

Recommendation: Use get_tokenizer_m20() across all select_examples functions to reuse the cached tokenizer instance.
| # Initialize the Qwen3-4B tokenizer (automatically downloads/loads the Qwen3-4B tokenizer) | |
| # If the model is already downloaded locally, this can be replaced with a local path, e.g. "./qwen3-4b" | |
| tokenizer = AutoTokenizer.from_pretrained("/share/project/wuhaiming/spaces/data_agent/OpenSeek-main/openseek/competition/LongContext-ICL-Annotation/src/Qwen3-4B", trust_remote_code=True) | |
| tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True) | |
| # Use a global singleton to avoid reloading the tokenizer repeatedly | |
| tokenizer = get_tokenizer_m20() |