DuET: Testcase Output Prediction via Consistency across Prediction Modes

Paper: DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode, Findings of ACL 2026.

If you use this code, please cite the paper (see the Citation section).

DuET is a framework for testcase output prediction and code-candidate selection that combines multiple independent prediction modes and picks the prediction that the modes most agree on ("self-consistency across modes").

Given a program under test and a testcase input tc_input, DuET derives candidate outputs tc_output through four modes:

| Mode | How the output is obtained |
| --- | --- |
| `vanilla` | LLM predicts the output from the prompt/context only |
| `pseudo` | LLM writes pseudocode, then predicts the output from it |
| `code` | LLM writes code, then predicts the output from the code |
| `exec` | LLM writes code, the code is executed, the result is the output |
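
As a rough illustration of the `exec` row, the sketch below runs a generated program on one testcase input and reports the result as a dict with `status` and `value` keys. The function name `run_exec_mode`, the stdin/stdout convention, the default timeout, and what is stored in `value` on errors are all assumptions made for this example; the repository's own execution drivers live under duet/utils/ and may behave differently. No sandboxing is done here.

```python
import subprocess
import sys


def run_exec_mode(code: str, tc_input: str, timeout_s: float = 5.0) -> dict:
    """Run `code` in a fresh interpreter, feeding `tc_input` on stdin.

    Hypothetical helper: returns {"status": ..., "value": ...}.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=tc_input,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before the exception surfaces.
        return {"status": "timeout", "value": None}

    if proc.returncode != 0:
        # Assumption: keep only the last stderr line (usually the exception).
        lines = proc.stderr.strip().splitlines()
        return {"status": "error", "value": lines[-1] if lines else None}

    return {"status": "ok", "value": proc.stdout.strip()}
```

For example, `run_exec_mode("print(int(input()) * 2)", "21\n")` yields an `ok` observation whose value is the string `"42"`.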

Each candidate output is normalised into a canonical observation ({"status": "ok"|"error"|"timeout", "value": ...}). Observations are then pooled per mixture (a subset of modes), and the predicted output is the observation with the highest agreement within the chosen mixture.
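
The selection step described above can be sketched as a majority vote over normalised observations. This is a minimal sketch, not the repository's implementation: the names `normalise` and `predict_output`, the JSON-based normalisation rule, and the tie-breaking behaviour (first observation seen wins) are assumptions introduced for illustration.

```python
import json
from collections import Counter


def normalise(status: str, raw_value) -> dict:
    """Map a raw mode output to a canonical observation (assumed rule:
    parse JSON-like text so "[1, 2]" and "[1,2]" compare equal)."""
    value = raw_value
    if isinstance(raw_value, str):
        text = raw_value.strip()
        try:
            value = json.loads(text)
        except json.JSONDecodeError:
            value = text
    return {"status": status, "value": value}


def predict_output(observations_by_mode: dict, mixture: tuple):
    """Return the most frequent observation among the modes in `mixture`."""
    pooled = [
        obs
        for mode in mixture
        for obs in observations_by_mode.get(mode, [])
    ]
    if not pooled:
        return None
    # Serialise with sorted keys so equal observations share one hashable key.
    counts = Counter(json.dumps(obs, sort_keys=True) for obs in pooled)
    best_key, _ = counts.most_common(1)[0]
    return json.loads(best_key)


observations = {
    "vanilla": [normalise("ok", "3")],
    "pseudo": [normalise("ok", "4")],
    "code": [normalise("ok", " 4\n")],
    "exec": [normalise("ok", "4")],
}
prediction = predict_output(observations, ("vanilla", "pseudo", "code", "exec"))
```

Here three of the four modes agree on the value 4, so `prediction` is the `ok` observation with value 4; restricting the mixture to `("vanilla",)` would return the value 3 instead.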

Those predictions are then used as an oracle for ranking code candidates: each candidate is executed on the tc_inputs, and the score is how often its output matches DuET's predicted output. pass@k is computed over this ranking.
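
A minimal sketch of this ranking step, assuming each candidate's outputs and DuET's predictions are already available as dicts keyed by a testcase id. The names `score_candidate` and `rank_candidates` and the data layout are hypothetical; the repository's scoring code may store and compare observations differently.

```python
def score_candidate(candidate_outputs: dict, predicted_outputs: dict) -> float:
    """Fraction of testcases where the candidate's observation equals
    the predicted observation."""
    if not predicted_outputs:
        return 0.0
    matches = sum(
        1
        for tc_id, predicted in predicted_outputs.items()
        if candidate_outputs.get(tc_id) == predicted
    )
    return matches / len(predicted_outputs)


def rank_candidates(outputs_by_candidate: dict, predicted_outputs: dict):
    """Return (candidate names sorted best-first, score per candidate)."""
    scores = {
        name: score_candidate(outputs, predicted_outputs)
        for name, outputs in outputs_by_candidate.items()
    }
    ranking = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranking, scores


predicted = {
    "tc0": {"status": "ok", "value": 4},
    "tc1": {"status": "ok", "value": 9},
}
candidates = {
    "cand_a": {"tc0": {"status": "ok", "value": 4},
               "tc1": {"status": "error", "value": None}},
    "cand_b": {"tc0": {"status": "ok", "value": 4},
               "tc1": {"status": "ok", "value": 9}},
}
ranking, scores = rank_candidates(candidates, predicted)
```

In this toy input `cand_b` matches both predictions and `cand_a` matches one, so `cand_b` is ranked first.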

This repository contains the code and configurations used for all experiments in the paper.

Repository Layout

duet/
├── duet/
│   ├── configs/            # Pipeline configurations (per benchmark)
│   ├── datasets/           # Dataset loaders and mergers
│   ├── evaluators/         # Benchmark evaluators (BCB, DevEval, LiveCodeBench)
│   ├── tools/              # One-off analysis / re-scoring utilities
│   ├── tests/              # Unit tests for observation normalisation & scoring
│   └── utils/              # Execution drivers, logging, output parsers
├── third_party/            # Vendored dependencies (no submodules)
│   ├── bigcodebench/       # BigCodeBench benchmark
│   ├── LiveCodeBench/      # LiveCodeBench benchmark
│   ├── expand_langchain/   # Core pipeline framework
│   └── deveval_venv_addon/ # Per-repo venv setup helpers for DevEval
├── scripts/
│   └── fetch_deveval.sh    # One-time DevEval dataset fetch (optional)
├── run.py                  # Fire CLI entry point
├── pyproject.toml
├── uv.lock
└── LICENSE

Experiment outputs land in results/ and cached execution environments / datasets in data/. Both are gitignored.

Installation

Requirements: Python 3.10+, uv, git.

# 1. Clone
git clone <repo-url>
cd duet

# 2. Install Python deps (all third-party dependencies are vendored)
uv sync

# 3. (Optional) Set up env vars
cp .env.example .env
# edit .env with your OPENAI_API_BASE / OPENAI_API_KEY, etc.

# 4. (Only for DevEval experiments) Fetch the DevEval benchmark.
#    ~3.9 GB; skip if you don't plan to run DevEval.
bash scripts/fetch_deveval.sh

Each benchmark evaluator creates its own isolated execution venv under data/ the first time it runs (e.g. data/deveval_env_cache/, data/livecodebench_venv/, data/exec_venv/). These venvs are gitignored.

Configurations by Benchmark

The framework covers three benchmarks. Each config directory registers one or more pipeline names that are passed to run.py generator --config_name <name>.

| Directory | Benchmark | Representative `--config_name` |
| --- | --- | --- |
| `duet/configs/bcb_hard_codegen/` | BigCodeBench-Hard | `bcb_hard_codegen` |
| `duet/configs/20251219-livecodebench_codegen/` | LiveCodeBench (codegen) | `20251219-livecodebench_codegen-vanilla` |
| `duet/configs/livecodebench_tcoutpred/` | LiveCodeBench (TC output) | `livecodebench_tcoutpred_llama31_8b` |
| `duet/configs/deveval_codegen/` | DevEval (function-level) | `deveval_codegen`, `..._local_completion`, `..._local_infilling`, `..._without_context`, plus `smoke_n1`/`smoke_n20` variants |

duet/configs/original_qwq/ is kept for reference (early / exploratory config, not used in the paper).

Quick Start

1. BigCodeBench-Hard (codegen + DuET scoring)

# Generate + score + merge
uv run python run.py generator \
  --config_name bcb_hard_codegen \
  --dataset_name bcb_hard \
  --max_concurrency 16 \
  - run --n 10 - merge_json - exit

# Evaluate (BigCodeBench executes in an isolated venv)
uv run python run.py evaluator verify_ground_truth --use_eval_venv
uv run python run.py evaluator evaluate \
  results/<run_name>/results_merged.json \
  --use_eval_venv

Score-only configs (like bcb_hard_codegen) store per-candidate DuET scores in each result; the evaluator computes pass@k by ranking candidates with those scores (with probabilistic tie-breaking for unbiased pass@k).
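
One way to read "probabilistic tie-breaking" is to take the expectation of pass@k over a uniformly random ordering of candidates that share the same score. The sketch below computes that expectation in closed form. It is an assumption about what such a scheme could look like, not a description of the evaluator's code; `expected_pass_at_k` is a hypothetical name.

```python
from math import comb


def expected_pass_at_k(scores: list, passed: list, k: int) -> float:
    """Expected pass@k when the top-k candidates are chosen by score and
    ties at the cutoff are broken uniformly at random."""
    k = min(k, len(scores))
    if k <= 0:
        return 0.0

    cutoff = sorted(scores, reverse=True)[k - 1]  # score of the k-th best
    above = [ok for s, ok in zip(scores, passed) if s > cutoff]
    tied = [ok for s, ok in zip(scores, passed) if s == cutoff]

    # Candidates strictly above the cutoff are always selected.
    if any(above):
        return 1.0

    # The remaining slots are filled by a random subset of the tied group.
    slots = k - len(above)
    failing = len(tied) - sum(tied)
    # P(all chosen tied candidates fail) = C(failing, slots) / C(len(tied), slots)
    return 1.0 - comb(failing, slots) / comb(len(tied), slots)
```

For instance, with scores `[1.0, 0.5, 0.5, 0.5]`, pass flags `[False, True, False, False]`, and k = 2, the top candidate fails and one of three tied candidates fills the second slot, so the expected pass@2 is 1/3.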

2. LiveCodeBench (testcase output prediction)

uv run python run.py generator \
  --config_name livecodebench_tcoutpred_llama31_8b \
  --dataset_name livecodebench_tcoutpred \
  --max_concurrency 8 \
  - run - merge_json - exit

uv run python duet/evaluators/livecodebench_evaluator.py \
  results/<run_name>/results_merged.json \
  --start_date 2024-01-01 --end_date 2024-04-01

3. DevEval (function-level codegen)

# One-time setup (clones DevEval + downloads Source_Code / data archives)
bash scripts/fetch_deveval.sh

# Verify per-project venvs (caches under data/deveval_env_cache/)
uv run python duet/tools/deveval_gt_verify.py \
  --deveval_root third_party/DevEval

# Run one of the context-modes (local_completion / local_infilling / without_context)
bash duet/configs/deveval_codegen/scripts/run_deveval_100_local_completion.sh

End-to-end runs, including smoke tests, are shipped as reproducible shell scripts under duet/configs/deveval_codegen/scripts/.

Re-scoring From Existing Generations (no LLM calls)

If you already have a results_merged.json and want to re-run filtering / re-score with a different DuET mixture without re-calling the LLM:

uv run python run.py generator \
  --config_name <score_only_config> \
  --dataset_name reuse_results \
  --reuse_results_path results/<old_run>/results_merged.json \
  --max_concurrency 16 \
  - run - merge_json - exit

The original file is only read, never modified. Pass --run_name <name> to choose the output directory.

Evaluation

| Benchmark | Entry point |
| --- | --- |
| BigCodeBench | `run.py evaluator evaluate[_multiple_fields]` |
| LiveCodeBench | `duet/evaluators/livecodebench_evaluator.py` |
| DevEval | `duet/evaluators/deveval_evaluator.py` (has an `eval_loop` that watches a running generator and evaluates new tasks as they appear) |

All evaluators emit both a JSON summary and a CSV of per-candidate pass/fail results, which can be analysed post hoc with the utilities under duet/tools/.

Citation

@inproceedings{duet2026,
  title     = {DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode},
  author    = {Han, Hojae and Kim, Jaejin and Hwang, Seung-won and Kim, Yu Jin and Lee, Moontae},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026}
}

License

This repository is released under the terms of LICENSE. Each vendored third-party component under third_party/ retains its own license.
