@tanios13 tanios13 commented Nov 6, 2025

Summary

This PR fixes issues in the MBPP task configuration and improves code extraction for evaluation.

Changes

  • mbpp.yaml: Added missing filter_list definition for extract_code.
  • mbpp_plus.yaml: Corrected target field.
  • mbpp_plus_instruct.yaml: Updated to include all three assertions in doc_to_text and corrected target field.
  • utils.py: Refactored extract_code_blocks to handle edge cases where code fences or language tags cause incorrect matches.
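
As an illustration of the edge cases mentioned for utils.py, here is a minimal sketch of fence-aware extraction. It is not the PR's extract_code_blocks: the helper name extract_python_code, the accepted language tags, and the fallback order are all assumptions made for this example.

```python
import re

# Hypothetical helper for illustration only; not the PR's implementation.
# Matches an opening fence with an optional language tag, then the body up to
# the closing fence, or up to the end of the text if the fence is never closed.
_FENCE = re.compile(r"```[ \t]*([\w+-]*)[ \t]*\r?\n(.*?)(?:```|\Z)", re.DOTALL)

# Assumed set of tags that should count as Python.
_PYTHON_TAGS = {"python", "py", "python3"}


def extract_python_code(text: str) -> str:
    """Return the most plausible Python snippet from a model response."""
    blocks = _FENCE.findall(text)
    # 1. Prefer the first block explicitly tagged as Python.
    for lang, body in blocks:
        if lang.lower() in _PYTHON_TAGS:
            return body.strip()
    # 2. Otherwise fall back to the first fenced block of any kind.
    if blocks:
        return blocks[0][1].strip()
    # 3. No fences at all: treat the whole response as code.
    return text.strip()


if __name__ == "__main__":
    response = "Sure:\n```python\ndef f():\n    return 1\n```\nDone."
    print(extract_python_code(response))
```

Making the closing fence optional lets the sketch cope with generations that are cut off by a stop sequence before the fence is closed.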

Motivation

The previous configuration caused MBPP and MBPP+ tasks to fail during evaluation due to:

  • Missing or incorrect YAML fields.
  • Inconsistent extraction of Python code blocks from model outputs.

This update ensures stable parsing and consistent evaluation across MBPP variants.

@tanios13 tanios13 requested a review from baberabb as a code owner November 6, 2025 14:08

CLAassistant commented Nov 6, 2025

CLA assistant check
All committers have signed the CLA.

baberabb (Contributor) commented Nov 9, 2025

Hi! Thanks for the PR! I've been meaning to look into this, so happy you got to it first. Would you be able to provide a couple of results from different models?

tanios13 (Author) commented

Thanks for the feedback!

I ran a few evaluations using Qwen2.5-Coder-7B-Instruct on both MBPP and MBPP+ benchmarks.

Results

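| Task               | Dataset                       | pass@1 | stderr | #Samples |
|--------------------|-------------------------------|--------|--------|----------|
| mbpp_instruct      | google-research-datasets/mbpp | 0.833  | 0.023  | 257      |
| mbpp_plus_instruct | evalplus/mbppplus             | 0.696  | 0.024  | 378      |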
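
As a quick sanity check on the stderr column, the sketch below recomputes the standard error of a pass rate from pass@1 and the sample count. It assumes each sample is an independent pass/fail trial and uses the plain binomial formula sqrt(p(1 - p) / n); whether the harness computes its stderr exactly this way is an assumption, and binomial_stderr is a name invented for this example.

```python
import math


def binomial_stderr(pass_rate: float, n_samples: int) -> float:
    """Standard error of a proportion, assuming independent pass/fail samples."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_samples)


# (task, pass@1, #samples) copied from the results reported above.
reported = [
    ("mbpp_instruct", 0.833, 257),
    ("mbpp_plus_instruct", 0.696, 378),
]

if __name__ == "__main__":
    for task, pass_rate, n_samples in reported:
        print(f"{task}: stderr ~ {binomial_stderr(pass_rate, n_samples):.3f}")
```

Under that assumption, both values round to the reported 0.023 and 0.024.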

    if text.startswith("```python"):
        end = text.find("```", len("```python"))
        if end != -1:
            return text[len("```python") : end].strip()

AndreasMadsen commented Nov 19, 2025

An issue I noticed in MBPP, both in this PR and in the current implementation, is that the gen_prefix is followed by a space, due to this line:

    prefix = gen_prefix + " " if gen_prefix else ""

This causes the few-shot examples to have a space at the beginning of the code (e.g. " def is_not_prime(n)"). Models that follow this pattern and include the space in their output (i.e. " def") then get penalized for an indentation syntax error, even though they are just reproducing the syntax error in the few-shot examples.

The .strip() here removes that space but I don't know if that is the most appropriate fix. I think the more appropriate fix would be to avoid adding the space in the fewshot examples.
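
The effect described above can be reproduced with a short, self-contained sketch. The function body and the compiles helper are hypothetical and exist only for this illustration; only the leading space mirrors the issue in the few-shot examples.

```python
def compiles(source: str) -> bool:
    """Return True if `source` parses as a Python module."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# Illustrative completion; the body is made up for this example.
clean = "def is_not_prime(n):\n    return n < 2\n"

# What a model produces if it copies the leading space from the few-shot examples.
with_space = " " + clean

if __name__ == "__main__":
    print(compiles(clean))               # parses fine
    print(compiles(with_space))          # fails: unexpected indent on line 1
    print(compiles(with_space.strip()))  # parses again once the space is removed
```

Stripping works here because only the first line carries the extra space; it hides the symptom at extraction time but leaves the few-shot prompt itself unchanged.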
