
Conversation

@ihebchaa

Summary

This PR adds support for repeats > 1 to aime24/aime25 and implements the pass@k and maj@k (majority voting) metrics commonly used for these reasoning benchmarks.
The current implementation evaluates only the first generation when repeats > 1 is specified, which prevents computing pass@k metrics.

Changes

  • Exposed the repeats argument
  • Added an identity filter that preserves all generations instead of returning only the first one
  • Modified process_results to handle multiple generations per sample
  • Implemented pass@k (pass@1 and pass@N) and maj@k metrics (see the sketch after this list)
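A minimal sketch of what maj@k computes over the k extracted answers for one problem (maj_at_k is a hypothetical helper, not the exact code added by this PR):

```python
from collections import Counter

def maj_at_k(answers: list[str], gold: str) -> float:
    """Majority-vote accuracy: 1.0 if the most frequent of the k
    extracted answers matches the gold answer, else 0.0."""
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return float(most_common_answer == gold)

# e.g. with --repeats 4: maj_at_k(["204", "210", "204", "204"], "204") == 1.0
```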

Validation

Successfully reproduced DeepSeek-R1-0528-Qwen3-8B results on AIME 2025.
Command:

```bash
lm_eval --model vllm \
  --model_args pretrained=deepseek-ai/DeepSeek-R1-0528-Qwen3-8B,tensor_parallel_size=4,data_parallel_size=2,gpu_memory_utilization=0.85 \
  --tasks aime25 \
  --gen_kwargs temperature=0.6,top_p=0.95,max_gen_toks=65536 \
  --apply_chat_template \
  --output_path work/iheb/evals/ \
  --log_samples \
  --system_instruction '该助手为DeepSeek-R1, 由深度求索公司创造。\n今天是2025年5月28日,星期一。' \
  --repeats 32
```
Results Comparison:

  • Reported result: 76.3 (here); this implementation: 76.6

@ihebchaa
Author

@baberabb, could you please review?

@ihebchaa
Author

Hey @baberabb, could you please take a quick look? I just need your feedback to decide whether to keep the PR open or close it.

```python
    Returns:
        float: Estimated pass@k value.
    """
    return 1 - binom(n - ci, k) / binom(n, k)
```
Contributor


We should use the unbiased estimator from sec. 2.1 of the Codex paper.
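For reference, sec. 2.1 of the Codex paper gives pass@k = 1 - C(n-c, k) / C(n, k) and recommends evaluating it as a running product for numerical stability; a minimal sketch (not the PR's exact code):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a numerically stable running product."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. n=32 samples, c=24 correct: pass_at_k(32, 24, 1) == 0.75
```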

Author


done!

```python
answers.append(None)
retvals.append(retval)

mode, model_index = majority_voting(answers)
```
Contributor


So for maths I think checking the majority vote after we've validated equivalence would be better, as there might be different ways of writing the same equivalent answer (which the parser can check). But willing to be convinced otherwise!

Author


Good point, but in a production scenario where we want to use majority voting to increase consistency, we'd vote over the model's answers, right, since we don't have access to the ground truth?

Author


Wait, I think even in production without ground truth, we'd still want to group equivalent answers together before voting.
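Something like the sketch below, where equivalent answers vote together (grouped_majority_vote is a hypothetical helper, and it assumes Math-Verify's parse/verify API for the equivalence check):

```python
from collections import Counter

from math_verify import parse, verify  # assumed equivalence checker

def grouped_majority_vote(answers: list[str]) -> str:
    """Bucket each answer with the first previously seen answer it is
    mathematically equivalent to, so e.g. '1/2' and '0.5' vote together."""
    counts = Counter()
    representatives: list[str] = []
    for ans in answers:
        for rep in representatives:
            if verify(parse(rep), parse(ans)):
                counts[rep] += 1
                break
        else:
            representatives.append(ans)
            counts[ans] += 1
    return counts.most_common(1)[0][0]
```

Note that the ground truth is never consulted here; only the sampled answers are compared against each other.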

@baberabb
Contributor

Hi! Thanks for the PR, and sorry for the extreme tardiness. Left some comments, but the approach looks solid. Pinging @jannalulu to ask if they have any concerns that this relies solely on math_verify and has removed the old sympy logic. I think it generally makes things much neater, and from my experience math_verify is pretty accurate, but maybe there's a case for keeping sympy as a fallback?

@jannalulu
Contributor

Yeah, only using math_verify seems fine, especially since AIME answers are only integers.
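Since AIME answers are integers, a single math_verify equivalence check per generation is enough; a minimal usage sketch (assuming Math-Verify's parse/verify API, not the PR's exact code):

```python
from math_verify import parse, verify

gold = parse("204")
prediction = parse(r"... so the final answer is $\boxed{204}$.")
print(verify(gold, prediction))  # expected True: both reduce to the integer 204
```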

@ihebchaa
Author

Hi @baberabb, I’ve addressed the majority of the feedback. Let me know if anything else needs to be updated.
