9 changes: 7 additions & 2 deletions lm_eval/tasks/mbpp/mbpp.yaml
@@ -1,16 +1,21 @@
 task: mbpp
 dataset_path: google-research-datasets/mbpp
-dataset_name: full
+dataset_name: sanitized
 unsafe_code: true
 output_type: generate_until
 test_split: test
-doc_to_text: "You are an expert Python programmer, and here is your task: {{text}} Your code should pass these tests:\n\n{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}\n[BEGIN]\n"
+doc_to_text: "You are an expert Python programmer, and here is your task: {{prompt}} Your code should pass these tests:\n\n{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}\n[BEGIN]\n"
 doc_to_target: "{% if is_fewshot is defined %}{{code}}\n[DONE]{% else %}{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}{% endif %}"
 target_delimiter: ""
 metric_list:
   - metric: !function utils.pass_at_1
     aggregation: mean
     higher_is_better: true
+filter_list:
+  - name: "extract_code"
+    filter:
+      - function: "custom"
+        filter_fn: !function utils.build_predictions
 generation_kwargs:
   until:
     - "[DONE]"
4 changes: 2 additions & 2 deletions lm_eval/tasks/mbpp/mbpp_instruct.yaml
@@ -1,10 +1,10 @@
 task: mbpp_instruct
 dataset_path: google-research-datasets/mbpp
-dataset_name: full
+dataset_name: sanitized
 unsafe_code: true
 output_type: generate_until
 test_split: test
-doc_to_text: "You are an expert Python programmer, and here is your task:\n{{text}}\nYour code should pass these tests:\n{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}"
+doc_to_text: "You are an expert Python programmer, and here is your task:\n{{prompt}}\nYour code should pass these tests:\n{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}"
 doc_to_target: "{% if is_fewshot is defined %}{{code}}\n```{% else %}{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}{% endif %}"
 gen_prefix: "\n```python\n"
 target_delimiter: ""
1 change: 1 addition & 0 deletions lm_eval/tasks/mbpp/mbpp_plus.yaml
@@ -3,3 +3,4 @@ task: mbpp_plus
 dataset_path: evalplus/mbppplus
 dataset_name: null
 doc_to_text: "You are an expert Python programmer, and here is your task: {{prompt if prompt is defined else text}} Your code should pass these tests:\n\n{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}\n[BEGIN]\n"
+doc_to_target: "{{test}}"
4 changes: 2 additions & 2 deletions lm_eval/tasks/mbpp/mbpp_plus_instruct.yaml
@@ -2,8 +2,8 @@ include: mbpp_instruct.yaml
 task: mbpp_plus_instruct
 dataset_path: evalplus/mbppplus
 dataset_name: null
-doc_to_text: "{{prompt if prompt is defined else text}} Your code should satisfy the following assertion:\n{{test_list[0]}}"
-doc_to_target: "{{test_list[0]}}"
+doc_to_text: "{{prompt if prompt is defined else text}} Your code should satisfy the following assertion:\n{{test_list[0]}}\n{{test_list[1]}}\n{{test_list[2]}}"
+doc_to_target: "{{test}}"
 gen_prefix: "Here is a solution to this programming problem:\n```python\n"
 num_fewshot: 0
 generation_kwargs:
45 changes: 33 additions & 12 deletions lm_eval/tasks/mbpp/utils.py
@@ -22,6 +22,7 @@ def pass_at_1(
         references = [references]
     if isinstance(predictions[0], str):
         predictions = [[p] for p in predictions]
+
     return pass_at_k.compute(
         references=references,
         predictions=predictions,
@@ -30,18 +31,38 @@ def pass_at_1(


 def extract_code_blocks(text: str) -> str:
-    # Pattern to match ```...``` blocks
-    pattern = r"```(?:\w+)?\n?(.*?)\n?```"
-    # (+ ```) as we add the opening "```python" to the gen_prefix
-    matches = re.findall(pattern, r"```" + text, re.DOTALL)
-    # if no matches, try to match ```...``` blocks (after removing the language)
-    if not matches:
-        text_without_lang = re.sub(r"```python", "```", text)
-        matches = re.findall(pattern, text_without_lang, re.DOTALL)
-    if not matches:
-        return ""
-    else:
-        return matches[0]
+    text = text.strip()
+
+    # 1. If starts with ```python → take everything until the next ```
+    if text.startswith("```python"):
+        end = text.find("```", len("```python"))
+        if end != -1:
+            return text[len("```python") : end].strip()
@AndreasMadsen commented on Nov 19, 2025:

One issue I noticed in MBPP, in both this PR and the current implementation, is that the gen_prefix is followed by a space because of this line:

    prefix = gen_prefix + " " if gen_prefix else ""

As a result, the code in the fewshot examples starts with a space (e.g. " def is_not_prime(n)"). A model that follows the examples and includes that space in its output (i.e. " def") is then penalized for an indentation error, even though it is only reproducing the syntax error in the fewshot examples.

The .strip() here removes that space, but I'm not sure it is the most appropriate fix. I think a better fix would be to avoid adding the space in the fewshot examples in the first place.
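
The effect described in this comment can be reproduced in isolation. The following is a minimal sketch and not code from the PR: it assumes the prefix is placed directly in front of the example code, the function body is a placeholder, and `compiles` and `after_fence` are hypothetical names introduced only for illustration.

```python
FENCE = "`" * 3  # three backticks, built this way so the marker can appear inside this block

gen_prefix = "\n" + FENCE + "python\n"  # value of gen_prefix in mbpp_instruct.yaml
prefix = gen_prefix + " " if gen_prefix else ""  # the expression quoted in the comment

# Placeholder body; only the first line matters for the indentation problem.
code = "def is_not_prime(n):\n    return n < 2\n"


def compiles(source: str) -> bool:
    """Return True if `source` parses as a Python module."""
    try:
        compile(source, "<generation>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# Text that follows the opening fence once the prefix has been prepended
# (assumption: prefix and code are simply concatenated).
after_fence = (prefix + code).split(FENCE + "python\n", 1)[1]

print(repr(after_fence[:5]))          # ' def '
print(compiles(after_fence))          # False: unexpected indent on the first line
print(compiles(after_fence.strip()))  # True: stripping removes the stray space
```

Stripping works here because the stray space only affects the first line; the remaining lines keep their relative indentation.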

+        return text[len("```python") :].strip()
+
+    # 2. If starts with ``` but not python → take until next ```
+    if text.startswith("```"):
+        end = text.find("```", 3)
+        if end != -1:
+            return text[3:end].strip()
+        return text[3:].strip()
+
+    # 3. If doesn’t start with ```
+    text = text.replace("```python", "```")
+    count_backticks = text.count("```")
+
+    # 4. If count is odd → take everything until the last ```
+    if count_backticks % 2 == 1 and count_backticks > 0:
+        last = text.rfind("```")
+        return text[:last].strip()
+
+    # 5. If count is even and >= 2 → take first complete block between ```
+    if count_backticks >= 2:
+        first = text.find("```")
+        second = text.find("```", first + 3)
+        return text[first + 3 : second].strip()
+
+    return text


 def build_predictions(resps: list[list[str]], docs: list[dict]) -> list[list[str]]:
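
To see which branch of the new `extract_code_blocks` handles which kind of model output, the function can be exercised on a few hand-written strings. The sketch below is a condensed restatement of the branch order in the diff, not the PR's code verbatim: the fence literal is factored into a `FENCE` constant so the example can sit inside a fenced block, branches 1 and 2 are folded into one loop, and the sample outputs are illustrative assumptions rather than real MBPP generations.

```python
FENCE = "`" * 3  # three backticks, built this way so the marker can appear inside this block
PY_FENCE = FENCE + "python"


def extract_code_blocks(text: str) -> str:
    """Condensed restatement of the five branches added in utils.py."""
    text = text.strip()
    # Branches 1 and 2: output opens with a fence, with or without the python tag
    for opener in (PY_FENCE, FENCE):
        if text.startswith(opener):
            end = text.find(FENCE, len(opener))
            body = text[len(opener):end] if end != -1 else text[len(opener):]
            return body.strip()
    # Branch 3: output does not open with a fence; normalise tagged fences first
    text = text.replace(PY_FENCE, FENCE)
    count = text.count(FENCE)
    # Branch 4: odd count, keep everything before the last fence
    if count % 2 == 1:
        return text[: text.rfind(FENCE)].strip()
    # Branch 5: even count >= 2, keep the first complete block
    if count >= 2:
        first = text.find(FENCE)
        second = text.find(FENCE, first + 3)
        return text[first + 3 : second].strip()
    return text


code = "def f():\n    return 1"
samples = {
    "tagged fence": PY_FENCE + "\n" + code + "\n" + FENCE,
    "unclosed tagged fence": PY_FENCE + "\n" + code,
    "bare fence": FENCE + "\n" + code + "\n" + FENCE,
    # The model continues after a gen_prefix that already opened the fence
    "continuation after gen_prefix": " " + code + "\n" + FENCE + "\nSome explanation.",
    "prose around a block": "Here is a solution:\n" + PY_FENCE + "\n" + code + "\n" + FENCE + "\nDone.",
    "no fence": code,
}
for name, raw in samples.items():
    assert extract_code_blocks(raw) == code, name
print("all samples reduce to the same code")
```

The "continuation after gen_prefix" sample is the case from the review comment above: the output carries a leading space and a single closing fence, so it falls through to branch 4 after the initial `strip()`.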