Add files via upload #255
Code Review
This pull request introduces several optimization strategies for a long-context in-context learning (ICL) annotation system, including task-specific prompt routing, advanced example selection strategies (M05, M19, M20), and differentiated generation parameters for the Huawei Ascend backend. The code review identified several critical issues: a typo in the regex for stripping thinking processes (<think&rt; instead of <think>), a logical guard that prevents the dynamic retrieval strategy from executing per-sample, performance bottlenecks from repeatedly loading the tokenizer, potential bugs when parsing string-based outputs, and portability issues due to hardcoded absolute paths.
| # M12 optimization: strip chain-of-thought output for Task 8 | ||
| if task_id == 8: | ||
| # Step 1: filter out <think&rt; tag content (remove chain of thought) | ||
| cleaned_result = re.sub(r'<think&rt;.*?</think&rt;', '', whole_result, flags=re.DOTALL) |
There is a critical typo in the regular expression used to filter out the thinking process. You wrote <think&rt; and </think&rt; instead of <think> and </think> (or &gt; if HTML-escaped).
Because of this typo, the regex will fail to match the standard <think>...</think> tags output by reasoning models. This means the entire thinking process will remain in the final output, which will likely violate the strict format requirements of the competition and lead to parsing failures.
| cleaned_result = re.sub(r'<think&rt;.*?</think&rt;', '', whole_result, flags=re.DOTALL) | |
| cleaned_result = re.sub(r'<think>.*?</think>', '', whole_result, flags=re.DOTALL) |
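To illustrate the corrected pattern, here is a minimal sketch. The helper name strip_thinking and the trailing strip() call are assumptions for illustration and are not part of the pull request; only the regex itself comes from the suggestion above.

```python
import re

# Corrected pattern from the suggestion: non-greedy match, DOTALL so that
# "." also spans the newlines inside a multi-line thinking block.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove every <think>...</think> block (hypothetical helper).

    The final strip() is an extra assumption: it drops the blank lines
    that usually remain once the block is removed.
    """
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\nstep 1\nstep 2\n</think>\nfinal answer"

# The mistyped pattern never matches, so the text comes back unchanged.
typo_result = re.sub(r"<think&rt;.*?</think&rt;", "", raw, flags=re.DOTALL)

print(typo_result == raw)
print(strip_thinking(raw))
```

With the mistyped pattern the thinking block survives untouched, which is the failure described in this comment.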
| if examples_str is None: | ||
| examples_str = select_examples(icl_examples, task_description, text2annotate) | ||
| # M19 optimization: use the combined retrieval + retry + post-processing strategy | ||
| examples_str = select_examples_M19(icl_examples, task_description, text2annotate, task_id, sample_idx) |
There is a logical contradiction here. The newly introduced select_examples_M19 is designed as a dynamic retrieval strategy (as stated in its docstring: "每个样本都重新选择示例" / "re-selects examples for each sample"). However, because of the if examples_str is None: guard, select_examples_M19 is only called once during the first iteration of the loop. For all subsequent test samples, the same examples_str is reused, completely bypassing the dynamic retrieval logic.
If you remove the if examples_str is None: guard to make it truly dynamic, select_examples_M19 will be called on every iteration. This will introduce a severe performance bottleneck because select_examples_M19 currently loads the tokenizer from disk on every single call.
To fix both issues:
- Remove the if examples_str is None: guard so that examples are retrieved dynamically per sample.
- Update select_examples_M19 (and other retrieval functions) to use a cached tokenizer singleton (similar to get_tokenizer_m20()) instead of calling AutoTokenizer.from_pretrained on every invocation.
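# M19 optimization: use the combined retrieval + retry + post-processing strategy (dynamic retrieval for every sample)
examples_str = select_examples_M19(icl_examples, task_description, text2annotate, task_id, sample_idx)
| M19 optimization: example selection for the combined retrieval + retry + post-processing approach | ||
| Combined with the M02 dynamic retrieval strategy | ||
| """ | ||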
| tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True) |
Calling AutoTokenizer.from_pretrained inside select_examples_M19 (and similarly in select_examples_M05 and select_examples) is extremely inefficient. When dynamic retrieval is enabled, these functions are called for every test sample inside a loop. Loading the tokenizer from disk/network on every single call will cause a massive performance bottleneck.
You should reuse the cached tokenizer singleton pattern you implemented for select_examples_M20 (via get_tokenizer_m20()), or define a similar helper function.
| tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True) | |
| tokenizer = get_tokenizer_m20() |
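The body of get_tokenizer_m20() is not shown in this review, so the following is only a sketch of what a cached tokenizer helper might look like. The names get_cached_tokenizer, _TOKENIZER_CACHE, and the loader parameter are hypothetical; the default branch assumes the transformers package is installed.

```python
# Module-level cache: one tokenizer object per model path (hypothetical name).
_TOKENIZER_CACHE = {}


def get_cached_tokenizer(model_path, loader=None):
    """Return the tokenizer for model_path, loading it at most once.

    loader is an assumed injection point so the caching logic can be
    exercised without touching disk; by default it falls back to
    AutoTokenizer.from_pretrained, as in the code under review.
    """
    if model_path not in _TOKENIZER_CACHE:
        if loader is None:
            # Imported lazily so this module loads even without transformers.
            from transformers import AutoTokenizer

            def loader(path):
                return AutoTokenizer.from_pretrained(path, trust_remote_code=True)

        _TOKENIZER_CACHE[model_path] = loader(model_path)
    return _TOKENIZER_CACHE[model_path]


# Demonstration with a stand-in loader that only counts how often it runs.
load_calls = []


def fake_loader(path):
    load_calls.append(path)
    return object()


first = get_cached_tokenizer("demo-model", loader=fake_loader)
second = get_cached_tokenizer("demo-model", loader=fake_loader)
print(first is second, len(load_calls))
```

Every selection function could then call the helper once per sample, and only the first call would pay the loading cost.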
| for i, example in enumerate(all_examples): | ||
| try: | ||
| input_text = example['input'] | ||
| output_text = example['output'][0] |
In select_examples_M19 (and select_examples_M05), you extract the output text using output_text = example['output'][0].
If example['output'] is a string instead of a list (which can happen depending on the task dataset format), example['output'][0] will silently return just the first character of the string (e.g., "G" instead of "Good Review"). This will corrupt the few-shot examples constructed for the prompt.
You should handle this defensively, similar to how you did in select_examples_M20.
| output_text = example['output'][0] | |
| output_text = example['output'][0] if isinstance(example['output'], list) else example['output'] |
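The difference between the two cases can be seen in a small sketch. The helper name first_output and the sample dictionaries are made up for illustration; the isinstance check mirrors the suggestion above, extended to tuples as an extra assumption.

```python
def first_output(example: dict) -> str:
    """Return the reference output whether it is stored as a list or a string.

    Hypothetical helper. An empty list would still raise IndexError here,
    which is left unhandled in this sketch.
    """
    output = example["output"]
    if isinstance(output, (list, tuple)):
        return output[0]
    return output


list_style = {"input": "great phone", "output": ["Good Review"]}
string_style = {"input": "great phone", "output": "Good Review"}

print(first_output(list_style))
print(first_output(string_style))

# Unguarded indexing on the string form yields only the first character.
print(string_style["output"][0])
```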
| 1: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-1_closest_integers.json', | ||
| 2: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-2_count_nouns_verbs.json', | ||
| 3: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-3_collatz_conjecture.json', | ||
| 4: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-4_conala_concat_strings.json', | ||
| 5: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-5_semeval_2018_task1_tweet_sadness_detection.json', | ||
| 6: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-6_mnli_same_genre_classification.json', | ||
| 7: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-7_jeopardy_answer_generation_all.json', | ||
| 8: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-8_kernel_generation.json', |
Hardcoding absolute paths like /root/OpenSeek/... reduces the portability of the codebase, as it will fail on any environment where the directory structure is different. It is highly recommended to use relative paths (as was done previously) or construct paths dynamically relative to the project root.
| 1: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-1_closest_integers.json', | |
| 2: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-2_count_nouns_verbs.json', | |
| 3: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-3_collatz_conjecture.json', | |
| 4: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-4_conala_concat_strings.json', | |
| 5: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-5_semeval_2018_task1_tweet_sadness_detection.json', | |
| 6: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-6_mnli_same_genre_classification.json', | |
| 7: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-7_jeopardy_answer_generation_all.json', | |
| 8: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-8_kernel_generation.json', | |
| 1: './data/openseek-1_closest_integers.json', | |
| 2: './data/openseek-2_count_nouns_verbs.json', | |
| 3: './data/openseek-3_collatz_conjecture.json', | |
| 4: './data/openseek-4_conala_concat_strings.json', | |
| 5: './data/openseek-5_semeval_2018_task1_tweet_sadness_detection.json', | |
| 6: './data/openseek-6_mnli_same_genre_classification.json', | |
| 7: './data/openseek-7_jeopardy_answer_generation_all.json', | |
| 8: './data/openseek-8_kernel_generation.json', |
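As an alternative to relative string literals, the paths could be built from a single data directory. This is a sketch under assumptions: the names TASK_FILES and build_task_paths are hypothetical, and the file names are copied from the diff above.

```python
from pathlib import Path

# File names taken from the diff; only the directory part varies by machine.
TASK_FILES = {
    1: "openseek-1_closest_integers.json",
    2: "openseek-2_count_nouns_verbs.json",
    3: "openseek-3_collatz_conjecture.json",
    4: "openseek-4_conala_concat_strings.json",
    5: "openseek-5_semeval_2018_task1_tweet_sadness_detection.json",
    6: "openseek-6_mnli_same_genre_classification.json",
    7: "openseek-7_jeopardy_answer_generation_all.json",
    8: "openseek-8_kernel_generation.json",
}


def build_task_paths(data_dir) -> dict:
    """Map each task id to its data file under data_dir (hypothetical helper)."""
    data_dir = Path(data_dir)
    return {task_id: data_dir / name for task_id, name in TASK_FILES.items()}


# Inside a script, the directory could instead be anchored at the script's
# own location, e.g. Path(__file__).resolve().parent / "data".
task_paths = build_task_paths("data")
print(task_paths[8])
```

Anchoring the directory at the script location rather than the current working directory would keep the paths valid no matter where the script is launched from.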