T2

Benchmarking Testing in Automated Theorem Proving — code, data, and experiment harness.

Jongyoon Kim, Hojae Han, Seung-won Hwang. ACL 2026 (Industry Track). arXiv:2604.23698

T proposes a test-based evaluation for formal theorem proving: a generated theorem is correct iff all its successor theorems compile. The benchmark covers 5 Lean 4 repositories — 2,206 problems, ~41 successor theorems each.

Setup

git clone --recurse-submodules <repo-url> T2 && cd T2
python3.10 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pip install -e third_party/expand_langchain

# For Lean evaluation (optional):
curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh -sSf | sh
(cd third_party/mathlib4 && lake exe cache get)

Configure LLM access by either pointing experiments/llms/litellm/env.yaml at your LiteLLM proxy or exporting provider API keys (ANTHROPIC_API_KEY, OPENAI_API_KEY, …).

Data

Shipped under data/final/ (ref2 + nosorry variant):

File	Records
`existing_datasets.{prop,noprop,all}.jsonl`	2 / 120 / 122	Numina, Kimina, CombiBench, FormalMATH
`research_papers.{prop,noprop,all}.jsonl`	57 / 1,817 / 1,874	math2001, Directed-Topology-Lean-4, adele-ring, Untangle, WhitneyGraustein

prop = paper's Hard subset (Prop-typed targets); noprop = non-Prop targets; all = union. Schema: id, target_code, formal_language, natural_language, target_code_name, before_target_code, after_target_code, header, dataset_name. natural_language is filled at the informalize step.

To regenerate from raw Lean repositories, run src/setups/download_datasets.py, then the per-track preprocessing scripts in src/preprocessing/{existing_datasets,research_papers}/.

Reproducing experiments

# 1. Informalize (canonical results already in results/informalize/.../20260125_*/)
bash scripts/run_informalize.sh <track> <prop_type> claude-sonnet-4-5

# 2. Autoformalize
bash scripts/run_autoformalize.sh <track> <prop_type> <model> <ablation>

# 3. Evaluate
bash scripts/run_evaluate_server.sh <track> <prop_type> <model>      # pass@k via Lean server
bash scripts/run_evaluate_beq.sh    <track> <prop_type> <model>      # local BEq / BEq⁺

<track> ∈ existing, research
<prop_type> ∈ prop, noprop
<model> = filename stem under experiments/configs/autoformalize/<track>/<prop_type>/
<ablation> ∈ standard, wo_nl, wo_ref, wo_both

Aggregate metrics (one evaluation.summary.json per model × ablation × track) are shipped under results/autoformalize*/.

License

Released under Creative Commons Attribution 4.0 International (CC BY 4.0)

Citation

@inproceedings{kim2026benchmarking,
  title     = {Benchmarking Testing in Automated Theorem Proving},
  author    = {Kim, Jongyoon and Han, Hojae and Hwang, Seung-won},
  booktitle = {Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Industry Track)},
  year      = {2026},
  publisher = {Association for Computational Linguistics},
  url       = {https://arxiv.org/abs/2604.23698}
}

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
data/final		data/final
experiments		experiments
results		results
scripts		scripts
src		src
.env.example		.env.example
.gitignore		.gitignore
.gitmodules		.gitmodules
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

T2

Setup

Data

Reproducing experiments

License

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

T2

Setup

Data

Reproducing experiments

License

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages