[AIME24 | AIME25] Enable Multiple Generation Repeats with Pass@k and Majority@k Metrics #3351
base: main
Conversation
@baberabb could you please review?

Hey @baberabb, could you please take a quick look? I just need your feedback to decide whether to keep the PR open or close it.
lm_eval/api/metrics.py
Outdated
```python
    Returns:
        float: Estimated pass@k value.
    """
    return 1 - binom(n - ci, k) / binom(n, k)
```
We should use the unbiased estimator from sec 2.1 of the codex paper
done!
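For reference, the estimator in question is pass@k = 1 − C(n−c, k) / C(n, k); the Codex paper evaluates it in a numerically stable product form rather than with raw binomial coefficients. A minimal sketch of that form (illustrative only, not necessarily the exact helper added in this PR):

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021, Sec. 2.1).

    n: number of generations sampled for the problem
    c: number of those generations that are correct
    k: the k in pass@k
    """
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    # Numerically stable evaluation of 1 - C(n - c, k) / C(n, k).
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```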
lm_eval/tasks/aime/utils.py
Outdated
```python
            answers.append(None)
        retvals.append(retval)

    mode, model_index = majority_voting(answers)
```
So for maths I think checking the majority vote after we've validated equivalence would be better, as there might be different ways of writing the same equivalent answer (which the parser can check). But willing to be convinced otherwise!
Good point, but in a production scenario where we want to use majority voting to increase consistency, we vote over the model’s answers, right, since we don’t have access to the ground truth?
Wait, I think even in production without ground truth, we'd still want to group equivalent answers together before voting.
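A rough sketch of that grouping step, where `is_equiv` stands in for whatever equivalence check the task's parser provides (a placeholder, not this PR's actual helper):

```python
from typing import Callable, List, Optional


def majority_vote(
    answers: List[Optional[str]],
    is_equiv: Callable[[str, str], bool],
) -> Optional[str]:
    """Pick a representative of the largest class of mutually equivalent answers.

    Unparseable generations (None) are skipped; equivalent forms such as
    "1/2" and "0.5" count as votes for the same candidate.
    """
    buckets: List[List[str]] = []
    for ans in answers:
        if ans is None:
            continue
        for bucket in buckets:
            if is_equiv(bucket[0], ans):
                bucket.append(ans)
                break
        else:
            buckets.append([ans])
    if not buckets:
        return None
    return max(buckets, key=len)[0]
```

Voting after grouping means a split like `["1/2", "0.5", "3"]` still resolves to the 1/2 answer instead of a three-way tie.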
Hi! Thanks for the PR, and sorry for the extreme tardiness. Left some comments, but the approach looks solid. Pinging @jannalulu to ask if they have any concerns that this relies solely on

yeah only using

Hi @baberabb, I've addressed the majority of the feedback. Let me know if anything else needs to be updated.
Summary
This PR adds support for repeats > 1 on aime24/aime25 and implements pass@k and maj@k (majority voting) metrics commonly used for these reasoning benchmarks.
The current implementation only evaluates the first generation when repeats > 1 is specified, limiting the ability to compute pass@k metrics.
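To make that concrete, with repeats > 1 the per-sample scoring has to see every generation. The sketch below shows the rough shape of such a `process_results` (the boxed-answer extraction and k=4 are illustrative stand-ins, not the PR's exact parsing or configuration), reusing the `pass_at_k` helper sketched earlier:

```python
def process_results(doc: dict, results: list[str]) -> dict:
    """Illustrative sketch: score every generation returned for one problem.

    With --repeats n, `results` holds n generations rather than one.
    """
    gold = str(doc["answer"]).strip()
    # Naive stand-in for the task's real answer extraction / equivalence check.
    answers = [r.rsplit("\\boxed{", 1)[-1].split("}", 1)[0].strip() for r in results]
    n = len(answers)
    c = sum(a == gold for a in answers)
    mode = max(set(answers), key=answers.count) if answers else None
    return {
        "pass_at_k": pass_at_k(n, c, k=4),  # estimator sketched earlier
        "maj_at_k": float(mode == gold),
    }
```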
Changes
- `repeats` argument
- `process_results`: updated to handle multiple generations per sample

Validation
Successfully reproduced DeepSeek-R1-0528-Qwen3-8B results on AIME 2025:
command:
```bash
lm_eval --model vllm \
    --model_args pretrained=deepseek-ai/DeepSeek-R1-0528-Qwen3-8B,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.85 \
    --tasks aime25 \
    --gen_kwargs temperature=0.6,top_p=0.95,max_gen_toks=65536 \
    --apply_chat_template \
    --output_path work/iheb/evals/ \
    --log_samples \
    --system_instruction '该助手为DeepSeek-R1, 由深度求索公司创造。\n今天是2025年5月28日,星期一。' \
    --repeats 32
```

Results Comparison: