Skip to content

Add files via upload#247

Open
dujiahei wants to merge 1 commit into
FlagAI-Open:mainfrom
dujiahei:main
Open

Add files via upload#247
dujiahei wants to merge 1 commit into
FlagAI-Open:mainfrom
dujiahei:main

Conversation

@dujiahei

Copy link
Copy Markdown

No description provided.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a task-specific prompt routing system, specialized few-shot example selection strategies (M04 through M20), and a dynamic mixed-context length strategy to optimize Qwen3-4B data annotation. It also updates the pipeline to run on Huawei Ascend hardware and enhances answer extraction stability with multiple fallback parsing strategies. The code review feedback identifies several critical issues: a caching bug in main.py that defeats dynamic retrieval and context-length optimizations, a potential string truncation bug when indexing example['output'], a performance bottleneck caused by repeatedly loading the tokenizer inside selection functions, premature termination of example selection due to using break instead of continue, and mismatched return type hints in the annotation functions.

Comment on lines 84 to +86
if examples_str is None:
examples_str = select_examples(icl_examples, task_description, text2annotate)
# 使用动态选择的select_examples函数
examples_str = select_func(icl_examples, task_description, text2annotate, task_id, sample_idx)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Bug/Performance Issue: Caching examples_str via if examples_str is None: prevents the mixed context length strategy and any dynamic/similarity-based retrieval from working correctly.

  1. Mixed Context Length Strategy: Since examples_str is only computed once at sample_idx = 0, the sample_index < 50 check in select_examples is only evaluated once. Consequently, the 30k context is used for all subsequent samples, completely bypassing the 8k context efficiency optimization for samples 50+.
  2. Dynamic Retrieval: For similarity-based retrieval methods (e.g., select_examples_M05, select_examples_M19, select_examples_M20), the examples are selected based only on the first sample's input. All subsequent samples will reuse the same examples, defeating the purpose of dynamic retrieval.

We should remove the if examples_str is None: check and compute examples_str for every sample.

        # 使用动态选择的select_examples函数
        examples_str = select_func(icl_examples, task_description, text2annotate, task_id, sample_idx)

for i, example in enumerate(all_examples):
try:
input_text = example['input']
output_text = example['output'][0]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Correctness Bug: Accessing example['output'][0] directly assumes that example['output'] is always a list. If example['output'] is a string, this will retrieve only the first character of the string (e.g., 'G' instead of 'Good Review'), leading to truncated and incorrect few-shot examples.

We should check if example['output'] is a list before indexing it, similar to how it is done in select_examples_M06 and select_examples_M04. This issue also exists in select_examples_M19, select_examples_M09, select_examples_M10, and select_examples_M11.

Suggested change
output_text = example['output'][0]
output_text = example['output'][0] if isinstance(example['output'], list) else example['output']

return len(intersection) / len(union) if union else 0.0

# 初始化tokenizer
tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Critical Performance Bottleneck: Calling AutoTokenizer.from_pretrained inside the selection function loads the tokenizer from disk/cache every single time the function is called. Once the caching bug in main.py is fixed (so that selection runs per sample), this will cause a massive performance bottleneck, slowing down the evaluation significantly.

We should use a global tokenizer singleton pattern (similar to get_tokenizer_m20() used in M20) across all selection functions to ensure the tokenizer is only loaded once.

Suggested change
tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True)
tokenizer = get_tokenizer_m20()

Comment on lines +929 to +930
if current_length + example['length'] > target_length:
break

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Efficiency/Quality Issue: Using break when an example exceeds the remaining context budget prematurely terminates the selection process. Since example_scores is sorted by similarity, a highly similar example might be very long and exceed the budget, but other slightly less similar examples might be much shorter and could still fit. Using break can result in a prompt with very few or even zero examples.

Changing break to continue allows the algorithm to skip only the oversized examples and continue filling the context budget with other suitable examples. This applies to other selection methods as well (e.g., select_examples_M20, select_examples_M19, select_examples_M09, select_examples_M10, select_examples_M11).

Suggested change
if current_length + example['length'] > target_length:
break
if current_length + example['length'] > target_length:
continue

return prediction

def annotate_ascend(input_prompt:str)->list[str]:
def annotate_ascend(input_prompt:str, task_id:int=None)->list[str]:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Maintainability/Type Hint Mismatch: The return type annotation of annotate_ascend is list[str], but the function actually returns a single str (either the cleaned code for Task 8 or the extracted label-tagged content from count_answer). This mismatch can confuse developers and static analysis tools.

We should update the return type hint to str.

Suggested change
def annotate_ascend(input_prompt:str, task_id:int=None)->list[str]:
def annotate_ascend(input_prompt:str, task_id:int=None)->str:

return "unknown"


def annotate_nvidia(input_prompt:str)->list[str]:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Maintainability/Type Hint Mismatch: The return type annotation of annotate_nvidia is list[str], but the function actually returns a single str from count_answer.

We should update the return type hint to str.

Suggested change
def annotate_nvidia(input_prompt:str)->list[str]:
def annotate_nvidia(input_prompt:str)->str:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant