Add files via upload#247
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a task-specific prompt routing system, specialized few-shot example selection strategies (M04 through M20), and a dynamic mixed-context length strategy to optimize Qwen3-4B data annotation. It also updates the pipeline to run on Huawei Ascend hardware and enhances answer extraction stability with multiple fallback parsing strategies. The code review feedback identifies several critical issues: a caching bug in main.py that defeats dynamic retrieval and context-length optimizations, a potential string truncation bug when indexing example['output'], a performance bottleneck caused by repeatedly loading the tokenizer inside selection functions, premature termination of example selection due to using break instead of continue, and mismatched return type hints in the annotation functions.
| if examples_str is None: | ||
| examples_str = select_examples(icl_examples, task_description, text2annotate) | ||
| # 使用动态选择的select_examples函数 | ||
| examples_str = select_func(icl_examples, task_description, text2annotate, task_id, sample_idx) |
There was a problem hiding this comment.
Bug/Performance Issue: Caching examples_str via if examples_str is None: prevents the mixed context length strategy and any dynamic/similarity-based retrieval from working correctly.
- Mixed Context Length Strategy: Since
examples_stris only computed once atsample_idx = 0, thesample_index < 50check inselect_examplesis only evaluated once. Consequently, the 30k context is used for all subsequent samples, completely bypassing the 8k context efficiency optimization for samples 50+. - Dynamic Retrieval: For similarity-based retrieval methods (e.g.,
select_examples_M05,select_examples_M19,select_examples_M20), the examples are selected based only on the first sample's input. All subsequent samples will reuse the same examples, defeating the purpose of dynamic retrieval.
We should remove the if examples_str is None: check and compute examples_str for every sample.
# 使用动态选择的select_examples函数
examples_str = select_func(icl_examples, task_description, text2annotate, task_id, sample_idx)| for i, example in enumerate(all_examples): | ||
| try: | ||
| input_text = example['input'] | ||
| output_text = example['output'][0] |
There was a problem hiding this comment.
Correctness Bug: Accessing example['output'][0] directly assumes that example['output'] is always a list. If example['output'] is a string, this will retrieve only the first character of the string (e.g., 'G' instead of 'Good Review'), leading to truncated and incorrect few-shot examples.
We should check if example['output'] is a list before indexing it, similar to how it is done in select_examples_M06 and select_examples_M04. This issue also exists in select_examples_M19, select_examples_M09, select_examples_M10, and select_examples_M11.
| output_text = example['output'][0] | |
| output_text = example['output'][0] if isinstance(example['output'], list) else example['output'] |
| return len(intersection) / len(union) if union else 0.0 | ||
|
|
||
| # 初始化tokenizer | ||
| tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True) |
There was a problem hiding this comment.
Critical Performance Bottleneck: Calling AutoTokenizer.from_pretrained inside the selection function loads the tokenizer from disk/cache every single time the function is called. Once the caching bug in main.py is fixed (so that selection runs per sample), this will cause a massive performance bottleneck, slowing down the evaluation significantly.
We should use a global tokenizer singleton pattern (similar to get_tokenizer_m20() used in M20) across all selection functions to ensure the tokenizer is only loaded once.
| tokenizer = AutoTokenizer.from_pretrained("/root/Qwen3-4B", trust_remote_code=True) | |
| tokenizer = get_tokenizer_m20() |
| if current_length + example['length'] > target_length: | ||
| break |
There was a problem hiding this comment.
Efficiency/Quality Issue: Using break when an example exceeds the remaining context budget prematurely terminates the selection process. Since example_scores is sorted by similarity, a highly similar example might be very long and exceed the budget, but other slightly less similar examples might be much shorter and could still fit. Using break can result in a prompt with very few or even zero examples.
Changing break to continue allows the algorithm to skip only the oversized examples and continue filling the context budget with other suitable examples. This applies to other selection methods as well (e.g., select_examples_M20, select_examples_M19, select_examples_M09, select_examples_M10, select_examples_M11).
| if current_length + example['length'] > target_length: | |
| break | |
| if current_length + example['length'] > target_length: | |
| continue |
| return prediction | ||
|
|
||
| def annotate_ascend(input_prompt:str)->list[str]: | ||
| def annotate_ascend(input_prompt:str, task_id:int=None)->list[str]: |
There was a problem hiding this comment.
Maintainability/Type Hint Mismatch: The return type annotation of annotate_ascend is list[str], but the function actually returns a single str (either the cleaned code for Task 8 or the extracted label-tagged content from count_answer). This mismatch can confuse developers and static analysis tools.
We should update the return type hint to str.
| def annotate_ascend(input_prompt:str, task_id:int=None)->list[str]: | |
| def annotate_ascend(input_prompt:str, task_id:int=None)->str: |
| return "unknown" | ||
|
|
||
|
|
||
| def annotate_nvidia(input_prompt:str)->list[str]: |
There was a problem hiding this comment.
Maintainability/Type Hint Mismatch: The return type annotation of annotate_nvidia is list[str], but the function actually returns a single str from count_answer.
We should update the return type hint to str.
| def annotate_nvidia(input_prompt:str)->list[str]: | |
| def annotate_nvidia(input_prompt:str)->str: |
No description provided.