Add files via upload #255

Open
satic197 wants to merge 1 commit into FlagAI-Open:main from satic197:main



@satic197


No description provided.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces several optimization strategies for a long-context in-context learning (ICL) annotation system, including task-specific prompt routing, advanced example selection strategies (M05, M19, M20), and differentiated generation parameters for the Huawei Ascend backend. The code review identified several critical issues: a typo in the regex for stripping thinking processes (<think&rt; instead of <think>), a logical guard that prevents the dynamic retrieval strategy from executing per-sample, performance bottlenecks from repeatedly loading the tokenizer, potential bugs when parsing string-based outputs, and portability issues due to hardcoded absolute paths.

# M12 optimization: strip the chain of thought for Task 8
if task_id == 8:
    # Step 1: filter out the content of <think&rt; tags (strip the chain of thought)
    cleaned_result = re.sub(r'<think&rt;.*?</think&rt;', '', whole_result, flags=re.DOTALL)


critical

There is a critical typo in the regular expression used to filter out the thinking process. You wrote <think&rt; and </think&rt; instead of <think> and </think> (or &gt; if HTML-escaped).

Because of this typo, the regex will fail to match the standard <think>...</think> tags output by reasoning models. This means the entire thinking process will remain in the final output, which will likely violate the strict format requirements of the competition and lead to parsing failures.

Suggested change
- cleaned_result = re.sub(r'<think&rt;.*?</think&rt;', '', whole_result, flags=re.DOTALL)
+ cleaned_result = re.sub(r'<think>.*?</think>', '', whole_result, flags=re.DOTALL)
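
As a minimal sketch of the corrected cleanup, the pattern can be compiled once and checked against a small input. The helper name `strip_thinking` and the trailing `.strip()` are assumptions made for illustration; they are not part of the PR.

```python
import re

# Matches one <think>...</think> block. DOTALL lets "." span newlines,
# and the non-greedy ".*?" stops at the first closing tag.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Remove every <think>...</think> block and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>step 1\nstep 2</think>\nfinal answer"
print(strip_thinking(raw))  # final answer
```

A unit test along these lines would have caught the `&rt;` typo, since the mistyped pattern leaves the input unchanged.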

Comment on lines 78 to +80
if examples_str is None:
    examples_str = select_examples(icl_examples, task_description, text2annotate)
    # M19 optimization: combined retrieval + retry + post-processing strategy
    examples_str = select_examples_M19(icl_examples, task_description, text2annotate, task_id, sample_idx)


high

There is a logical contradiction here. The newly introduced select_examples_M19 is designed as a dynamic retrieval strategy (as stated in its docstring: "每个样本都重新选择示例" / "re-selects examples for each sample"). However, because of the if examples_str is None: guard, select_examples_M19 is only called once during the first iteration of the loop. For all subsequent test samples, the same examples_str is reused, completely bypassing the dynamic retrieval logic.

If you remove the if examples_str is None: guard to make it truly dynamic, select_examples_M19 will be called on every iteration. This will introduce a severe performance bottleneck because select_examples_M19 currently loads the tokenizer from disk on every single call.

To fix both issues:

  1. Remove the if examples_str is None: guard so that examples are retrieved dynamically per sample.
  2. Update select_examples_M19 (and other retrieval functions) to use a cached tokenizer singleton (similar to get_tokenizer_m20()) instead of calling AutoTokenizer.from_pretrained on every invocation.
        # M19 optimization: combined retrieval + retry + post-processing strategy (retrieve dynamically for each sample)
        examples_str = select_examples_M19(icl_examples, task_description, text2annotate, task_id, sample_idx)
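
The effect of the guard can be reproduced with a toy loop. This is only a sketch: `collect_examples` and `toy_select` are hypothetical names, and `toy_select` stands in for `select_examples_M19`, whose real body is not shown in this review.

```python
from typing import Callable, List


def collect_examples(
    samples: List[str],
    select: Callable[[str, int], str],
    dynamic: bool,
) -> List[str]:
    """Return the examples string that would be used for each sample."""
    examples_str = None
    used = []
    for sample_idx, text in enumerate(samples):
        # With dynamic=False this reproduces the `is None` guard:
        # the selector only runs on the first iteration.
        if dynamic or examples_str is None:
            examples_str = select(text, sample_idx)
        used.append(examples_str)
    return used


def toy_select(text: str, sample_idx: int) -> str:
    # Hypothetical stand-in for select_examples_M19.
    return f"examples for sample {sample_idx}"


samples = ["a", "b", "c"]
print(collect_examples(samples, toy_select, dynamic=False))
print(collect_examples(samples, toy_select, dynamic=True))
```

With the guard in place every sample reuses the examples chosen for sample 0; without it each sample gets its own selection.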

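M19 optimization: example selection for the combined retrieval + retry + post-processing scheme
Combined with the M02 dynamic retrieval strategy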
"""
tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True)


high

Calling AutoTokenizer.from_pretrained inside select_examples_M19 (and similarly in select_examples_M05 and select_examples) is extremely inefficient. When dynamic retrieval is enabled, these functions are called for every test sample inside a loop. Loading the tokenizer from disk/network on every single call will cause a massive performance bottleneck.

You should reuse the cached tokenizer singleton pattern you implemented for select_examples_M20 (via get_tokenizer_m20()), or define a similar helper function.

Suggested change
- tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True)
+ tokenizer = get_tokenizer_m20()
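
If a shared helper is preferred over reusing `get_tokenizer_m20()`, one possible shape is a memoized loader. The implementation of `get_tokenizer_m20()` is not shown in this review, so the sketch below is an assumption about what such a helper could look like; the names `get_cached_tokenizer` and `_load_tokenizer` and the injectable `loader` parameter are hypothetical.

```python
from functools import lru_cache


def _load_tokenizer(model_path: str):
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)


@lru_cache(maxsize=None)
def get_cached_tokenizer(model_path: str = "/root/Qwen3-4B", loader=_load_tokenizer):
    """Load the tokenizer once per (model_path, loader) pair and reuse it afterwards."""
    return loader(model_path)
```

Each selection function would then call `get_cached_tokenizer()` instead of `AutoTokenizer.from_pretrained(...)`, so the disk load happens once per process no matter how many samples are processed. The `loader` argument exists only so the caching behavior can be tested without a real model.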

for i, example in enumerate(all_examples):
    try:
        input_text = example['input']
        output_text = example['output'][0]


high

In select_examples_M19 (and select_examples_M05), you extract the output text using output_text = example['output'][0].

If example['output'] is a string instead of a list (which can happen depending on the task dataset format), example['output'][0] will silently return just the first character of the string (e.g., "G" instead of "Good Review"). This will corrupt the few-shot examples constructed for the prompt.

You should handle this defensively, similar to how you did in select_examples_M20.

Suggested change
- output_text = example['output'][0]
+ output_text = example['output'][0] if isinstance(example['output'], list) else example['output']
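
The same check can be factored into a small helper so that every selection function shares it. This is a sketch, with `first_output` as a hypothetical name; returning an empty string for an empty list is an added assumption, not behavior taken from the PR.

```python
from typing import Any, Dict


def first_output(example: Dict[str, Any]) -> str:
    """Return the reference output whether it is stored as a string or a list."""
    output = example["output"]
    if isinstance(output, (list, tuple)):
        # Assumption: an empty list is treated as an empty output.
        return str(output[0]) if output else ""
    return str(output)


print(first_output({"output": ["Good Review", "Bad Review"]}))  # Good Review
print(first_output({"output": "Good Review"}))                  # Good Review
print("Good Review"[0])                                         # G
```

The last line shows the silent failure described above: indexing a string with `[0]` yields its first character rather than raising an error.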

Comment on lines +13 to +20
1: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-1_closest_integers.json',
2: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-2_count_nouns_verbs.json',
3: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-3_collatz_conjecture.json',
4: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-4_conala_concat_strings.json',
5: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-5_semeval_2018_task1_tweet_sadness_detection.json',
6: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-6_mnli_same_genre_classification.json',
7: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-7_jeopardy_answer_generation_all.json',
8: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-8_kernel_generation.json',


medium

Hardcoding absolute paths like /root/OpenSeek/... reduces the portability of the codebase, as it will fail on any environment where the directory structure is different. It is highly recommended to use relative paths (as was done previously) or construct paths dynamically relative to the project root.

Suggested change
- 1: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-1_closest_integers.json',
- 2: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-2_count_nouns_verbs.json',
- 3: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-3_collatz_conjecture.json',
- 4: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-4_conala_concat_strings.json',
- 5: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-5_semeval_2018_task1_tweet_sadness_detection.json',
- 6: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-6_mnli_same_genre_classification.json',
- 7: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-7_jeopardy_answer_generation_all.json',
- 8: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-8_kernel_generation.json',
+ 1: './data/openseek-1_closest_integers.json',
+ 2: './data/openseek-2_count_nouns_verbs.json',
+ 3: './data/openseek-3_collatz_conjecture.json',
+ 4: './data/openseek-4_conala_concat_strings.json',
+ 5: './data/openseek-5_semeval_2018_task1_tweet_sadness_detection.json',
+ 6: './data/openseek-6_mnli_same_genre_classification.json',
+ 7: './data/openseek-7_jeopardy_answer_generation_all.json',
+ 8: './data/openseek-8_kernel_generation.json',
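
Relative paths such as `./data/...` still depend on the current working directory. A sketch of the "construct paths dynamically" option, using `pathlib`, is shown below. The names `TASK_FILE_NAMES` and `task_file_paths` are hypothetical, and the assumption that the `data` directory sits next to the script is not confirmed by the PR.

```python
from pathlib import Path
from typing import Dict

# File names taken from the mapping under review; only the directory is computed.
TASK_FILE_NAMES: Dict[int, str] = {
    1: "openseek-1_closest_integers.json",
    2: "openseek-2_count_nouns_verbs.json",
    3: "openseek-3_collatz_conjecture.json",
    4: "openseek-4_conala_concat_strings.json",
    5: "openseek-5_semeval_2018_task1_tweet_sadness_detection.json",
    6: "openseek-6_mnli_same_genre_classification.json",
    7: "openseek-7_jeopardy_answer_generation_all.json",
    8: "openseek-8_kernel_generation.json",
}


def task_file_paths(data_dir: Path) -> Dict[int, Path]:
    """Map each task id to its JSON file inside data_dir."""
    return {task_id: data_dir / name for task_id, name in TASK_FILE_NAMES.items()}


# In a script, data_dir would typically be derived from the file location, e.g.
#   data_dir = Path(__file__).resolve().parent / "data"
paths = task_file_paths(Path("data"))
print(paths[8])
```

Anchoring on the script location keeps the mapping valid regardless of the directory the script is launched from.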
