Add files via upload by satic197 · Pull Request #255 · FlagAI-Open/OpenSeek

satic197 · 2026-05-29T03:03:45Z

No description provided.

gemini-code-assist

Code Review

This pull request introduces several optimization strategies for a long-context in-context learning (ICL) annotation system, including task-specific prompt routing, advanced example selection strategies (M05, M19, M20), and differentiated generation parameters for the Huawei Ascend backend. The code review identified several critical issues: a typo in the regex for stripping thinking processes (<think&rt; instead of <think>), a logical guard that prevents the dynamic retrieval strategy from executing per-sample, performance bottlenecks from repeatedly loading the tokenizer, potential bugs when parsing string-based outputs, and portability issues due to hardcoded absolute paths.

gemini-code-assist · 2026-05-29T03:05:18Z

+    # M12 optimization: strip the chain-of-thought for Task 8
+    if task_id == 8:
+        # Step 1: filter out <think&rt; tag content (strip the chain-of-thought)
+        cleaned_result = re.sub(r'<think&rt;.*?</think&rt;', '', whole_result, flags=re.DOTALL)


There is a critical typo in the regular expression used to filter out the thinking process. You wrote <think&rt; and </think&rt; instead of <think> and </think> (or &gt; in place of > if the tags are HTML-escaped).

Because of this typo, the regex will fail to match the standard <think>...</think> tags output by reasoning models. This means the entire thinking process will remain in the final output, which will likely violate the strict format requirements of the competition and lead to parsing failures.

Suggested change

cleaned_result = re.sub(r'<think&rt;.*?</think&rt;', '', whole_result, flags=re.DOTALL)

cleaned_result = re.sub(r'<think>.*?</think>', '', whole_result, flags=re.DOTALL)
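The difference between the two patterns can be checked in isolation. The following is a minimal sketch, not the PR's code: `strip_think` and the sample string are hypothetical, and the trailing `.strip()` is an added assumption to drop the whitespace left behind by the removed block.

```python
import re

# Corrected pattern: non-greedy match, DOTALL so "." also spans newlines.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove every <think>...</think> block, including multi-line ones."""
    return THINK_BLOCK.sub("", text).strip()


# Hypothetical model output with a multi-line reasoning block.
raw = "<think>\nstep 1\nstep 2\n</think>\nfinal answer"

# The pattern with the typo never matches, so the text is returned unchanged.
buggy = re.sub(r"<think&rt;.*?</think&rt;", "", raw, flags=re.DOTALL)

cleaned = strip_think(raw)
```

With the typo the reasoning block survives untouched, while the corrected pattern leaves only the answer text.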

gemini-code-assist · 2026-05-29T03:05:18Z

        if examples_str is None:
-            examples_str = select_examples(icl_examples, task_description, text2annotate)
+            # M19 optimization: use the combined retrieval + retry + post-processing strategy
+            examples_str = select_examples_M19(icl_examples, task_description, text2annotate, task_id, sample_idx)


There is a logical contradiction here. The newly introduced select_examples_M19 is designed as a dynamic retrieval strategy (as stated in its docstring: "每个样本都重新选择示例" / "re-selects examples for each sample"). However, because of the if examples_str is None: guard, select_examples_M19 is only called once during the first iteration of the loop. For all subsequent test samples, the same examples_str is reused, completely bypassing the dynamic retrieval logic.

If you remove the if examples_str is None: guard to make it truly dynamic, select_examples_M19 will be called on every iteration. This will introduce a severe performance bottleneck because select_examples_M19 currently loads the tokenizer from disk on every single call.

To fix both issues:

Remove the if examples_str is None: guard so that examples are retrieved dynamically per sample.

Update select_examples_M19 (and other retrieval functions) to use a cached tokenizer singleton (similar to get_tokenizer_m20()) instead of calling AutoTokenizer.from_pretrained on every invocation.

# M19 optimization: use the combined retrieval + retry + post-processing strategy (dynamic retrieval for each sample)
examples_str = select_examples_M19(icl_examples, task_description, text2annotate, task_id, sample_idx)
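To illustrate what per-sample retrieval means once the guard is removed, here is a small sketch. It does not reproduce `select_examples_M19`: `select_examples_stub`, `annotate_all`, and the word-overlap ranking are hypothetical stand-ins chosen only to make the control flow visible.

```python
def select_examples_stub(pool, query, k=2):
    """Hypothetical stand-in for a retrieval function: rank by shared words."""
    query_words = set(query.lower().split())
    ranked = sorted(
        pool,
        key=lambda ex: len(query_words & set(ex.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def annotate_all(samples, pool):
    """Build (sample, examples) pairs, re-selecting examples for each sample."""
    prompts = []
    for sample in samples:
        # No `if examples is None` guard: selection reruns on every iteration.
        examples = select_examples_stub(pool, sample)
        prompts.append((sample, examples))
    return prompts


pool = ["sort a list of numbers", "join two strings", "count nouns in text"]
prompts = annotate_all(["join strings with a comma", "count the nouns"], pool)
```

Each sample ends up with its own top-ranked example, which is the behavior the guard was preventing.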

gemini-code-assist · 2026-05-29T03:05:18Z

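+    M19 optimization: example selection for the combined retrieval + retry + post-processing scheme
+    Combined with the M02 dynamic retrieval strategy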
+    """
+    tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True)


Calling AutoTokenizer.from_pretrained inside select_examples_M19 (and similarly in select_examples_M05 and select_examples) is extremely inefficient. When dynamic retrieval is enabled, these functions are called for every test sample inside a loop. Loading the tokenizer from disk/network on every single call will cause a massive performance bottleneck.

You should reuse the cached tokenizer singleton pattern you implemented for select_examples_M20 (via get_tokenizer_m20()), or define a similar helper function.

Suggested change

tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True)

tokenizer = get_tokenizer_m20()
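The implementation of `get_tokenizer_m20()` is not shown in this review, so the following is only a sketch of one common way to build such a cached singleton, using `functools.lru_cache`. The names `get_cached_tokenizer` and `load_tokenizer` are hypothetical, and the loader is a placeholder standing in for the expensive `AutoTokenizer.from_pretrained` call.

```python
from functools import lru_cache

# Counts how often the expensive load actually runs.
LOAD_CALLS = {"count": 0}


def load_tokenizer(path: str):
    """Placeholder for an expensive load such as AutoTokenizer.from_pretrained(path)."""
    LOAD_CALLS["count"] += 1
    return {"path": path}  # stand-in object, not a real tokenizer


@lru_cache(maxsize=None)
def get_cached_tokenizer(path: str = "/root/Qwen3-4B"):
    """Load once per distinct path; later calls return the same object."""
    return load_tokenizer(path)


first = get_cached_tokenizer()
second = get_cached_tokenizer()
```

Both calls return the same object and the loader runs only once, so calling the accessor inside a per-sample loop adds negligible overhead.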

gemini-code-assist · 2026-05-29T03:05:18Z

+    for i, example in enumerate(all_examples):
+        try:
+            input_text = example['input']
+            output_text = example['output'][0]


In select_examples_M19 (and select_examples_M05), you extract the output text using output_text = example['output'][0].

If example['output'] is a string instead of a list (which can happen depending on the task dataset format), example['output'][0] will silently return just the first character of the string (e.g., "G" instead of "Good Review"). This will corrupt the few-shot examples constructed for the prompt.

You should handle this defensively, similar to how you did in select_examples_M20.

Suggested change

output_text = example['output'][0]

output_text = example['output'][0] if isinstance(example['output'], list) else example['output']
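The failure mode is easy to reproduce, since indexing a Python string returns a single character. A minimal sketch follows; the helper name `first_output` is hypothetical, and it also accepts tuples, which goes slightly beyond the suggested one-liner.

```python
def first_output(example: dict) -> str:
    """Return the first reference output, whether stored as a list or a plain string."""
    output = example["output"]
    if isinstance(output, (list, tuple)):
        return output[0]
    return output


as_list = first_output({"output": ["Good Review", "Positive"]})
as_string = first_output({"output": "Good Review"})

# What the unguarded indexing would produce for a string-valued output.
unguarded = {"output": "Good Review"}["output"][0]
```

Both dataset shapes now yield the full label, whereas the unguarded expression yields only the first character.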

gemini-code-assist · 2026-05-29T03:05:19Z

+    1: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-1_closest_integers.json',
+    2: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-2_count_nouns_verbs.json',
+    3: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-3_collatz_conjecture.json',
+    4: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-4_conala_concat_strings.json',
+    5: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-5_semeval_2018_task1_tweet_sadness_detection.json',
+    6: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-6_mnli_same_genre_classification.json',
+    7: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-7_jeopardy_answer_generation_all.json',
+    8: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-8_kernel_generation.json',


Hardcoding absolute paths like /root/OpenSeek/... reduces the portability of the codebase, as it will fail on any environment where the directory structure is different. It is highly recommended to use relative paths (as was done previously) or construct paths dynamically relative to the project root.

Suggested change

1: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-1_closest_integers.json',

2: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-2_count_nouns_verbs.json',

3: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-3_collatz_conjecture.json',

4: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-4_conala_concat_strings.json',

5: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-5_semeval_2018_task1_tweet_sadness_detection.json',

6: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-6_mnli_same_genre_classification.json',

7: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-7_jeopardy_answer_generation_all.json',

8: '/root/OpenSeek/openseek/competition/LongContext-ICL-Annotation/data/openseek-8_kernel_generation.json',

1: './data/openseek-1_closest_integers.json',

2: './data/openseek-2_count_nouns_verbs.json',

3: './data/openseek-3_collatz_conjecture.json',

4: './data/openseek-4_conala_concat_strings.json',

5: './data/openseek-5_semeval_2018_task1_tweet_sadness_detection.json',

6: './data/openseek-6_mnli_same_genre_classification.json',

7: './data/openseek-7_jeopardy_answer_generation_all.json',

8: './data/openseek-8_kernel_generation.json',
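As an alternative to paths relative to the working directory, the mapping can be built from a single base directory. This is a sketch under assumptions: `build_task_files` and `TASK_NAMES` are hypothetical names, only two of the eight entries are shown, and in a real script the base would more likely be derived from the script location, for example `Path(__file__).resolve().parent / "data"`.

```python
from pathlib import Path


def build_task_files(data_dir: Path, names: dict) -> dict:
    """Map each task id to a path under data_dir."""
    return {task_id: data_dir / name for task_id, name in names.items()}


# Two of the file names from the diff above; the rest follow the same pattern.
TASK_NAMES = {
    1: "openseek-1_closest_integers.json",
    8: "openseek-8_kernel_generation.json",
}

# Assumed base directory for illustration only.
task_files = build_task_files(Path("data"), TASK_NAMES)
```

Keeping the base directory in one place means a different checkout location needs a change in a single line instead of eight.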

Add files via upload

a4aea3d

satic197 wants to merge 1 commit into FlagAI-Open:main from satic197:main
