
Conversation

@ihebchaa

Summary

This PR adds support for repeats > 1 to aime24/aime25 and implements the pass@k and maj@k (majority voting) metrics commonly used for these reasoning benchmarks.
The current implementation evaluates only the first generation when repeats > 1 is specified, which prevents computing pass@k metrics.

Changes

  • Exposed the repeats argument
  • Added an identity filter that preserves all generations instead of returning only the first one
  • Modified process_results to handle multiple generations per sample
  • Implemented pass@k (pass@1 and pass@N) and maj@k metrics (see the sketch after this list)
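A minimal sketch of what maj@k computes over the k extracted answers for one problem (maj_at_k is a hypothetical helper, not the exact code added by this PR):

```python
from collections import Counter

def maj_at_k(answers: list[str], gold: str) -> float:
    """Majority-vote accuracy: 1.0 if the most frequent of the k
    extracted answers matches the gold answer, else 0.0."""
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return float(most_common_answer == gold)

# e.g. with --repeats 4: maj_at_k(["204", "210", "204", "204"], "204") == 1.0
```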

Validation

Successfully reproduced DeepSeek-R1-0528-Qwen3-8B results on AIME 2025.
Command:

```bash
lm_eval --model vllm \
  --model_args pretrained=deepseek-ai/DeepSeek-R1-0528-Qwen3-8B,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.85 \
  --tasks aime25 \
  --gen_kwargs temperature=0.6,top_p=0.95,max_gen_toks=65536 \
  --apply_chat_template \
  --output_path work/iheb/evals/ \
  --log_samples \
  --system_instruction '该助手为DeepSeek-R1, 由深度求索公司创造。\n今天是2025年5月28日,星期一。' \
  --repeats 32
```
Results Comparison:

  • Reported result: 76.3 (here); this implementation: 76.6

@ihebchaa
Author

@baberabb, could you please review?

@ihebchaa
Author

Hey @baberabb, could you please take a quick look? I just need your feedback to decide whether to keep the PR open or close it.

```python
    Returns:
        float: Estimated pass@k value.
    """
    return 1 - binom(n - ci, k) / binom(n, k)
```
Contributor


We should use the unbiased estimator from sec. 2.1 of the Codex paper.
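For reference, sec. 2.1 of the Codex paper gives pass@k = 1 - C(n-c, k) / C(n, k) and recommends evaluating it as a running product for numerical stability; a minimal sketch (not the PR's exact code):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a numerically stable running product."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. n=32 samples, c=24 correct: pass_at_k(32, 24, 1) == 0.75
```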

Author


done!

```python
answers.append(None)
retvals.append(retval)

mode, model_index = majority_voting(answers)
```
Contributor


So for maths I think checking the majority vote after we've validated equivalence would be better, as there might be different ways of writing the same equivalent answer (which the parser can check). But willing to be convinced otherwise!

Author


Good point, but in a production scenario where we want to use majority voting to increase consistency, we'd vote over the model's answers, right, since we don't have access to the ground truth?

Author


Wait, I think even in production without ground truth, we'd still want to group equivalent answers together before voting.
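Something like the sketch below, where equivalent answers vote together (grouped_majority_vote is a hypothetical helper, and it assumes Math-Verify's parse/verify API for the equivalence check):

```python
from collections import Counter

from math_verify import parse, verify  # assumed equivalence checker

def grouped_majority_vote(answers: list[str]) -> str:
    """Bucket each answer with the first previously seen answer it is
    mathematically equivalent to, so e.g. '1/2' and '0.5' vote together."""
    counts = Counter()
    representatives: list[str] = []
    for ans in answers:
        for rep in representatives:
            if verify(parse(rep), parse(ans)):
                counts[rep] += 1
                break
        else:
            representatives.append(ans)
            counts[ans] += 1
    return counts.most_common(1)[0][0]
```

Note that the ground truth is never consulted here; only the sampled answers are compared against each other.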

@baberabb
Contributor

Hi! Thanks for the PR, and sorry for the extreme tardiness. Left some comments, but the approach looks solid. Pinging @jannalulu to ask if they have any concerns that this relies solely on math_verify and has removed the old sympy logic. I think it generally makes things much neater, and from my experience math_verify is pretty accurate, but maybe there's a case for keeping sympy as a fallback?

@jannalulu
Contributor

Yeah, only using math_verify seems fine, especially since AIME answers are only integers.
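Since AIME answers are integers, a single math_verify equivalence check per generation is enough; a minimal usage sketch (assuming Math-Verify's parse/verify API, not the PR's exact code):

```python
from math_verify import parse, verify

gold = parse("204")
prediction = parse(r"... so the final answer is $\boxed{204}$.")
print(verify(gold, prediction))  # expected True: both reduce to the integer 204
```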

@ihebchaa
Author

Hi @baberabb, I’ve addressed the majority of the feedback. Let me know if anything else needs to be updated.
