Paper: DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode, Findings of ACL 2026.
If you use this code, please cite the paper — see the Citation section.
DuET is a framework for testcase output prediction and code-candidate selection that combines multiple independent prediction modes and picks the prediction that the modes most agree on ("self-consistency across modes").
Given a program under test and a testcase input tc_input, DuET derives candidate outputs tc_output through four modes:
| Mode | How the output is obtained |
|---|---|
| vanilla | LLM predicts the output from the prompt/context only |
| pseudo | LLM writes pseudocode, then predicts the output from it |
| code | LLM writes code, then predicts the output from the code |
| exec | LLM writes code, the code is executed, the result is the output |
Each candidate output is normalised into a canonical observation ({"status": "ok"|"error"|"timeout", "value": ...}) and aggregated into mixtures (subsets of modes). The predicted output is the highest-agreement observation within the chosen mixture.
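The normalise-then-vote step described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: the helper names `normalise` and `predict`, the use of a JSON string as the hashable form of an observation, and the shape of `observations_by_mode` are all assumptions made for the example.

```python
import json
from collections import Counter


def normalise(status, value=None):
    """Hypothetical helper: encode an observation as a canonical JSON string.

    Sorting the keys makes equal observations compare (and hash) equal.
    """
    return json.dumps({"status": status, "value": value}, sort_keys=True)


def predict(observations_by_mode, mixture):
    """Return the observation that the modes in `mixture` agree on most.

    `observations_by_mode` maps a mode name to a list of normalised
    observations (one per sample); this layout is assumed for the sketch.
    """
    counts = Counter()
    for mode in mixture:
        counts.update(observations_by_mode.get(mode, []))
    if not counts:
        return None  # no mode in the mixture produced an observation
    best, _ = counts.most_common(1)[0]
    return json.loads(best)


observations = {
    "vanilla": [normalise("ok", "3")],
    "pseudo": [normalise("ok", "4")],
    "code": [normalise("ok", "4")],
    "exec": [normalise("ok", "4"), normalise("timeout")],
}
prediction = predict(observations, ("vanilla", "pseudo", "code", "exec"))
```

Here three of the five observations agree on the value "4", so that observation becomes the prediction for the full four-mode mixture. Restricting `mixture` to a subset of modes changes only which observations are counted.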
Those predictions are then used as an oracle for ranking code candidates: each candidate is executed on the tc_inputs, and the score is how often its output matches DuET's predicted output. pass@k is computed over this ranking.
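A rough sketch of that scoring step, assuming each candidate's outputs and DuET's predicted outputs are available as parallel lists of already-normalised observations. The function names and the data layout are hypothetical and chosen only for illustration.

```python
def duet_score(candidate_outputs, predicted_outputs):
    """Fraction of testcase inputs on which the candidate matches the prediction."""
    if not predicted_outputs:
        return 0.0
    matches = sum(
        1 for got, want in zip(candidate_outputs, predicted_outputs) if got == want
    )
    return matches / len(predicted_outputs)


def rank_candidates(outputs_by_candidate, predicted_outputs):
    """Return candidate names ordered from highest to lowest score."""
    scores = {
        name: duet_score(outputs, predicted_outputs)
        for name, outputs in outputs_by_candidate.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


predicted = ["4", "9", "16"]      # DuET's predicted output per tc_input
outputs_by_candidate = {
    "cand_a": ["4", "9", "15"],   # matches 2 of 3 predictions
    "cand_b": ["4", "9", "16"],   # matches 3 of 3 predictions
    "cand_c": ["0", "0", "0"],    # matches none
}
ranking = rank_candidates(outputs_by_candidate, predicted)
```

The ranking places `cand_b` first, then `cand_a`, then `cand_c`; pass@k would then be evaluated over the top k entries of this ordering.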
This repository contains the code and configurations used for all experiments in the paper.
duet/
├── duet/
│ ├── configs/ # Pipeline configurations (per benchmark)
│ ├── datasets/ # Dataset loaders and mergers
│ ├── evaluators/ # Benchmark evaluators (BCB, DevEval, LiveCodeBench)
│ ├── tools/ # One-off analysis / re-scoring utilities
│ ├── tests/ # Unit tests for observation normalisation & scoring
│ └── utils/ # Execution drivers, logging, output parsers
├── third_party/ # Vendored dependencies (no submodules)
│ ├── bigcodebench/ # BigCodeBench benchmark
│ ├── LiveCodeBench/ # LiveCodeBench benchmark
│ ├── expand_langchain/ # Core pipeline framework
│ └── deveval_venv_addon/ # Per-repo venv setup helpers for DevEval
├── scripts/
│ └── fetch_deveval.sh # One-time DevEval dataset fetch (optional)
├── run.py # Fire CLI entry point
├── pyproject.toml
├── uv.lock
└── LICENSE
Experiment outputs land in results/ and cached execution environments / datasets in data/. Both are gitignored.
Requirements: Python 3.10+, uv, git.
# 1. Clone
git clone <repo-url>
cd duet
# 2. Install Python deps (all third-party dependencies are vendored)
uv sync
# 3. (Optional) Set up env vars
cp .env.example .env
# edit .env with your OPENAI_API_BASE / OPENAI_API_KEY, etc.
# 4. (Only for DevEval experiments) Fetch the DevEval benchmark.
# ~3.9 GB; skip if you don't plan to run DevEval.
bash scripts/fetch_deveval.sh
Each benchmark evaluator creates its own isolated execution venv under data/ the first time it runs (e.g. data/deveval_env_cache/, data/livecodebench_venv/, data/exec_venv/). These venvs are gitignored.
The framework covers three benchmarks. Each config directory registers one or more pipeline names that are passed to run.py generator --config_name <name>.
| Directory | Benchmark | Representative --config_name |
|---|---|---|
| duet/configs/bcb_hard_codegen/ | BigCodeBench-Hard | bcb_hard_codegen |
| duet/configs/20251219-livecodebench_codegen/ | LiveCodeBench (codegen) | 20251219-livecodebench_codegen-vanilla |
| duet/configs/livecodebench_tcoutpred/ | LiveCodeBench (TC output) | livecodebench_tcoutpred_llama31_8b |
| duet/configs/deveval_codegen/ | DevEval (function-level) | deveval_codegen, ..._local_completion, ..._local_infilling, ..._without_context, plus smoke_n1/smoke_n20 variants |
duet/configs/original_qwq/ is kept for reference (early / exploratory config, not used in the paper).
# Generate + score + merge
uv run python run.py generator \
--config_name bcb_hard_codegen \
--dataset_name bcb_hard \
--max_concurrency 16 \
- run --n 10 - merge_json - exit
# Evaluate (BigCodeBench executes in an isolated venv)
uv run python run.py evaluator verify_ground_truth --use_eval_venv
uv run python run.py evaluator evaluate \
results/<run_name>/results_merged.json \
--use_eval_venv
Score-only configs (like bcb_hard_codegen) store per-candidate DuET scores in each result; the evaluator computes pass@k by ranking candidates with those scores (with probabilistic tie-breaking for unbiased pass@k).
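One way to see why tie-breaking matters is to compute the expected pass@k for a single task when tied candidates are ordered uniformly at random. The sketch below does this in closed form. It is an illustration under that assumption only; the evaluator in this repository may handle ties differently (for example by sampling), and the function name and input layout are hypothetical.

```python
from itertools import groupby
from math import comb


def expected_pass_at_k(scored, k):
    """Expected pass@k for one task under uniformly random tie order.

    `scored` is a list of (score, is_correct) pairs, one per candidate.
    """
    ordered = sorted(scored, key=lambda pair: pair[0], reverse=True)
    slots = k  # positions still free in the top-k
    for _, group in groupby(ordered, key=lambda pair: pair[0]):
        flags = [is_correct for _, is_correct in group]
        size, correct = len(flags), sum(flags)
        if size <= slots:
            # The whole tie group fits inside the top-k.
            if correct:
                return 1.0
            slots -= size
            if slots == 0:
                return 0.0
            continue
        # Boundary group: only `slots` of its `size` members are selected.
        # P(no correct member selected) = C(size - correct, slots) / C(size, slots)
        return 1.0 - comb(size - correct, slots) / comb(size, slots)
    return 0.0  # fewer than k candidates and none of them correct


scored = [(1.0, False), (0.5, True), (0.5, False), (0.5, False)]
p_at_1 = expected_pass_at_k(scored, 1)
p_at_2 = expected_pass_at_k(scored, 2)
p_at_4 = expected_pass_at_k(scored, 4)
```

In this example the top-scored candidate is wrong and three candidates tie at 0.5, one of them correct. With k = 2, exactly one slot goes to the tied group, so the expected pass@2 is 1/3 rather than 0 or 1, which is what a fixed, arbitrary tie order would report.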
uv run python run.py generator \
--config_name livecodebench_tcoutpred_llama31_8b \
--dataset_name livecodebench_tcoutpred \
--max_concurrency 8 \
- run - merge_json - exit
uv run python duet/evaluators/livecodebench_evaluator.py \
results/<run_name>/results_merged.json \
--start_date 2024-01-01 --end_date 2024-04-01
# One-time setup (clones DevEval + downloads Source_Code / data archives)
bash scripts/fetch_deveval.sh
# Verify per-project venvs (caches under data/deveval_env_cache/)
uv run python duet/tools/deveval_gt_verify.py \
--deveval_root third_party/DevEval
# Run one of the context-modes (local_completion / local_infilling / without_context)
bash duet/configs/deveval_codegen/scripts/run_deveval_100_local_completion.sh
End-to-end runs, including smoke tests, are shipped as reproducible shell scripts under duet/configs/deveval_codegen/scripts/.
If you already have a results_merged.json and want to re-run filtering / re-score with a different DuET mixture without re-calling the LLM:
uv run python run.py generator \
--config_name <score_only_config> \
--dataset_name reuse_results \
--reuse_results_path results/<old_run>/results_merged.json \
--max_concurrency 16 \
- run - merge_json - exit
The original file is read-only. Pass --run_name <name> to choose the output directory.
| Benchmark | Entry point |
|---|---|
| BigCodeBench | run.py evaluator evaluate[_multiple_fields] |
| LiveCodeBench | duet/evaluators/livecodebench_evaluator.py |
| DevEval | duet/evaluators/deveval_evaluator.py (has an eval_loop that watches a running generator and evaluates new tasks as they appear) |
All evaluators emit both a JSON summary and a CSV of per-candidate pass/fail for post-hoc analysis under duet/tools/.
@inproceedings{duet2026,
title = {DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode},
author = {Han, Hojae and Kim, Jaejin and Hwang, Seung-won and Kim, Yu Jin and Lee, Moontae},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
year = {2026}
}
This repository is released under the terms of LICENSE. Each vendored third-party component under third_party/ retains its own license.