Fix MBPP task configuration and code extraction logic #3387
base: main
Hi! Thanks for the PR! I've been meaning to look into this, so I'm happy you got to it first. Would you be able to provide a couple of results from different models?
Thanks for the feedback! I ran a few evaluations using Qwen2.5-Coder-7B-Instruct on both MBPP and MBPP+ benchmarks. Results
    if text.startswith("```python"):
        end = text.find("```", len("```python"))
        if end != -1:
            return text[len("```python") : end].strip()
An issue I noticed in MBPP, both in this PR and in the current implementation, is that the gen_prefix is followed by a space, due to line 132 of lm-evaluation-harness/lm_eval/api/samplers.py (at commit 7ddb2b1):
    prefix = gen_prefix + " " if gen_prefix else ""
This causes the fewshot examples to have a space at the beginning of the code (e.g. " def is_not_prime(n)"). Models that follow this pattern and include the space in their output (i.e. " def") then get penalized for an indentation syntax error, even though they are just reproducing the syntax error in the fewshot examples.
The .strip() here removes that space but I don't know if that is the most appropriate fix. I think the more appropriate fix would be to avoid adding the space in the fewshot examples.
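To make the effect concrete, here is a minimal sketch of how a single leading space turns otherwise valid code into an indentation error, and how either fix avoids it. The gen_prefix value, the helper names (build_prefix, is_valid_python) and the function body are assumptions for illustration and are not taken from the harness.

```python
import ast


def is_valid_python(code: str) -> bool:
    """Return True if `code` parses as Python source."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


def build_prefix(gen_prefix: str, add_space: bool = True) -> str:
    """Hypothetical helper mirroring the quoted samplers.py line."""
    if not gen_prefix:
        return ""
    return gen_prefix + " " if add_space else gen_prefix


# Illustrative solution body (not the real MBPP reference solution).
solution = "def is_not_prime(n):\n    return n < 2\n"

# Assumed gen_prefix value, for illustration only.
gen_prefix = "```python\n"

fewshot_with_space = build_prefix(gen_prefix) + solution
fewshot_without_space = build_prefix(gen_prefix, add_space=False) + solution

# The code portion, as a model imitating each fewshot example would emit it.
code_with_space = fewshot_with_space[len(gen_prefix):]
code_without_space = fewshot_without_space[len(gen_prefix):]

print(is_valid_python(code_with_space))          # leading space: unexpected indent
print(is_valid_python(code_with_space.strip()))  # workaround: strip the output
print(is_valid_python(code_without_space))       # fix at the source: no space added
```

Stripping the extracted code hides the problem at scoring time, whereas dropping the space when the fewshot examples are built removes it from the prompt the model sees.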
Summary
This PR fixes issues in the MBPP task configuration and improves code extraction for evaluation.
Changes
- mbpp.yaml: Added missing filter_list definition for extract_code.
- mbpp_plus.yaml: Corrected target field.
- mbpp_plus_instruct.yaml: Updated to include all three assertions in doc_to_text and corrected target field.
- utils.py: Refactored extract_code_blocks to handle edge cases where code fences or language tags cause incorrect matches.
Motivation
The previous configuration caused MBPP and MBPP+ tasks to fail during evaluation due to:
This update ensures stable parsing and consistent evaluation across MBPP variants.
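As a rough illustration of the kind of edge cases a fence-aware extractor has to cover, the sketch below handles a complete fenced block (with or without a language tag), an opening fence that is never closed, and a closing fence with no opening one. It is not the extract_code_blocks implementation from this PR; the function name extract_code and the regular expression are assumptions made for this example.

```python
import re

# A complete fenced block: opening fence, optional language tag, newline,
# then the shortest body that ends at the next fence.
_FENCED_BLOCK = re.compile(r"```[ \t]*[A-Za-z0-9_+-]*[ \t]*\n(.*?)```", re.DOTALL)


def extract_code(text: str) -> str:
    """Hypothetical extractor: return the code from a model completion."""
    match = _FENCED_BLOCK.search(text)
    if match:
        return match.group(1).strip()

    body = text.strip()
    # Opening fence that is never closed: drop the fence line, keep the rest.
    if body.startswith("```"):
        _fence_line, sep, rest = body.partition("\n")
        if sep:
            body = rest
    # Closing fence only (e.g. the opening fence was part of the prompt):
    # keep everything before it.
    return body.split("```", 1)[0].strip()


print(extract_code("```python\ndef f():\n    return 1\n```"))
print(extract_code("Here you go:\n```\nx = 1\n```\nDone."))
print(extract_code("```python\nx = 1"))
print(extract_code("x = 1\n```\ntrailing text"))
```

Matching the language tag as part of the fence line, rather than searching for a fixed prefix such as "```python", keeps untagged or differently tagged fences from leaking the tag into the extracted code.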