Fix MBPP task configuration and code extraction logic #3387

tanios13 · 2025-11-06T14:08:45Z

Summary

This PR fixes issues in the MBPP task configuration and improves code extraction for evaluation.

Changes

mbpp.yaml: Added missing filter_list definition for extract_code.
mbpp_plus.yaml: Corrected target field.
mbpp_plus_instruct.yaml: Updated to include all three assertions in doc_to_text and corrected target field.
utils.py: Refactored extract_code_blocks to handle edge cases where code fences or language tags cause incorrect matches.
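
For readers who have not opened the diff, the kind of edge cases meant here (an optional language tag, a missing closing fence, no fence at all) can be sketched as follows. This is a minimal illustration, not the PR's actual extract_code_blocks: the name extract_first_code_block and the regex are assumptions made for the example.

```python
import re

FENCE = "`" * 3  # three backticks, built programmatically to keep the example readable

# Opening fence, optional "python"/"py" tag, newline, lazily matched body, closing fence.
_BLOCK = re.compile(
    FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE,
    re.DOTALL | re.IGNORECASE,
)


def extract_first_code_block(text: str) -> str:
    """Hypothetical helper: return the body of the first fenced block.

    Falls back to the text after an unterminated opening fence, and to the
    stripped text itself when no fence is present.
    """
    match = _BLOCK.search(text)
    if match:
        return match.group(1).strip()
    stripped = text.strip()
    if stripped.startswith(FENCE):
        # Unterminated fence: drop the opening fence line and keep the rest.
        _, _, rest = stripped.partition("\n")
        return rest.strip()
    return stripped


if __name__ == "__main__":
    reply = "Sure:\n" + FENCE + "python\ndef f():\n    return 1\n" + FENCE + "\nDone."
    print(extract_first_code_block(reply))
```

Note that the final strip() also removes any leading whitespace on the first line of the extracted code, which is relevant to the gen_prefix discussion further down the thread.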

Motivation

The previous configuration caused MBPP and MBPP+ tasks to fail during evaluation due to:

Missing or incorrect YAML fields.
Inconsistent extraction of Python code blocks from model outputs.

This update ensures stable parsing and consistent evaluation across MBPP variants.

CLAassistant · 2025-11-06T14:08:53Z

All committers have signed the CLA.

baberabb · 2025-11-09T22:42:23Z

Hi! Thanks for the PR! I've been meaning to look into this, so happy you got to it first. Would you be able to provide a couple of results from different models?

tanios13 · 2025-11-10T12:33:41Z

Thanks for the feedback!

I ran a few evaluations using Qwen2.5-Coder-7B-Instruct on both MBPP and MBPP+ benchmarks.

Results

Task	Dataset	pass@1	stderr	#Samples
mbpp_instruct	google-research-datasets/mbpp	0.833	0.023	257
mbpp_plus_instruct	evalplus/mbppplus	0.696	0.024	378
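
As a quick sanity check on the table, the reported stderr values are consistent with the standard error of a mean of 0/1 outcomes, sqrt(p(1-p)/n). The sketch below assumes that formula; the harness may compute stderr with a slightly different estimator (for example an n-1 denominator or bootstrapping), which would not change the result at three decimals here.

```python
import math


def binomial_stderr(p: float, n: int) -> float:
    """Standard error of a pass rate p estimated from n independent samples."""
    return math.sqrt(p * (1.0 - p) / n)


# (task, pass@1, number of samples) copied from the table above.
reported = [
    ("mbpp_instruct", 0.833, 257),
    ("mbpp_plus_instruct", 0.696, 378),
]

for task, p, n in reported:
    se = binomial_stderr(p, n)
    low, high = p - 1.96 * se, p + 1.96 * se  # rough 95% normal-approximation interval
    print(f"{task}: stderr ~ {se:.3f}, 95% interval ~ [{low:.3f}, {high:.3f}]")
```

Rounded to three decimals, both values match the stderr column (0.023 and 0.024).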

AndreasMadsen · 2025-11-19T22:32:03Z

lm_eval/tasks/mbpp/utils.py

+    if text.startswith("```python"):
+        end = text.find("```", len("```python"))
+        if end != -1:
+            return text[len("```python") : end].strip()


An issue I noticed in MBPP, both in this and the current implementation, is that the gen_prefix is followed by a space, due to line 132 of lm_eval/api/samplers.py (lm-evaluation-harness at 7ddb2b1):

prefix = gen_prefix + " " if gen_prefix else ""

This causes the few-shot examples to have a space at the beginning of the code (e.g. " def is_not_prime(n)"). Models that follow this pattern and include the space in their output (i.e. " def") are then penalized for an indentation syntax error, even though they are just reproducing the syntax error in the few-shot examples.

The .strip() here removes that space but I don't know if that is the most appropriate fix. I think the more appropriate fix would be to avoid adding the space in the fewshot examples.
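
The effect described above can be reproduced in isolation. The sketch below is an illustration only: the gen_prefix value and the function body are placeholders (the real MBPP prefix and few-shot targets are not reproduced here), and only the prefix expression is taken from the quoted samplers.py line.

```python
import ast


def parses(source: str) -> bool:
    """Return True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


gen_prefix = "PREFIX\n"  # placeholder, not the actual MBPP gen_prefix
target = "def is_not_prime(n):\n    return n < 2\n"  # hypothetical few-shot target

# Expression quoted from samplers.py: a space is appended after the prefix.
prefix = gen_prefix + " " if gen_prefix else ""
fewshot = prefix + target

# What follows the prefix in the few-shot example now starts with a space.
code = fewshot[len(gen_prefix):]

print(repr(code[:5]))        # shows the leading space before "def"
print(parses(code))          # leading space: unexpected indent
print(parses(code.strip()))  # stripping removes the problem
```

Stripping in the extractor hides the symptom at scoring time, while removing the space where the few-shot examples are built would stop the model from seeing the malformed code in the first place.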

tanios13 added 4 commits November 6, 2025 08:57

Fix: corrected extract_code_blocks function

3d86ba5

Fix: Add code extraction to mbpp

1a56f61

Fix: Use the correct testing column for MBPP+

6bfff50

Fix: Remove debug prints

73d94d9

tanios13 requested a review from baberabb as a code owner November 6, 2025 14:08

tanios13 added 2 commits November 10, 2025 07:24

Fix: Use sanitized dataset for mbpp

7e6e8a5

Fix: fix pre-commit formatting issues

a4dd373

AndreasMadsen reviewed Nov 19, 2025
