ldilab/DuET

DuET: Testcase Output Prediction via Consistency across Prediction Modes

Paper: DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode, Findings of ACL 2026.

If you use this code, please cite the paper (see the Citation section).

DuET is a framework for testcase output prediction and code-candidate selection that combines multiple independent prediction modes and picks the prediction that the modes most agree on ("self-consistency across modes").

Given a program under test and a testcase input tc_input, DuET derives candidate outputs tc_output through four modes:

| Mode    | How the output is obtained                                          |
|---------|---------------------------------------------------------------------|
| vanilla | LLM predicts the output from the prompt/context only                |
| pseudo  | LLM writes pseudocode, then predicts the output from it             |
| code    | LLM writes code, then predicts the output from the code             |
| exec    | LLM writes code, the code is executed, the result is the output     |

Each candidate output is normalised into a canonical observation ({"status": "ok"|"error"|"timeout", "value": ...}). Observations are then aggregated over a mixture, that is, a subset of the modes, and the predicted output is the observation that the most modes in the chosen mixture agree on.
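To make the aggregation step concrete, here is a minimal sketch. It is not the repository's implementation: the helper names `normalise` and `predict`, the whitespace-stripping rule, and the first-seen tie-breaking are assumptions made for illustration. Only the observation shape and the four mode names come from the description above.

```python
import json
from collections import Counter


def normalise(raw_output, status="ok"):
    """Map a raw mode output to a canonical observation (hypothetical rule)."""
    if status != "ok":
        # Assumption: errors and timeouts carry no value, so they compare equal.
        return {"status": status, "value": None}
    return {"status": "ok", "value": str(raw_output).strip()}


def predict(observations, mixture):
    """Return the observation that most modes in `mixture` agree on, plus its votes."""
    # Serialise each observation so that dicts become hashable and countable.
    keys = [
        json.dumps(observations[mode], sort_keys=True)
        for mode in mixture
        if mode in observations
    ]
    if not keys:
        return None, 0
    # Counter.most_common keeps first-seen order among equal counts.
    best, votes = Counter(keys).most_common(1)[0]
    return json.loads(best), votes


observations = {
    "vanilla": normalise("42\n"),
    "pseudo": normalise("41"),
    "code": normalise(" 42"),
    "exec": normalise(None, status="timeout"),
}
prediction, votes = predict(
    observations, mixture=("vanilla", "pseudo", "code", "exec")
)
print(prediction, votes)
```

Changing the `mixture` argument changes which modes get a vote, which is why one set of generations can be re-scored under several mixtures.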

Those predictions are then used as an oracle for ranking code candidates: each candidate is executed on the tc_inputs, and the score is how often its output matches DuET's predicted output. pass@k is computed over this ranking.
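The ranking step can be sketched as follows. This is an illustrative example under stated assumptions, not the project's scoring code: `agreement_score` and `rank_candidates` are hypothetical names, outputs are compared as plain strings, and the score is taken to be the fraction of testcase inputs on which a candidate matches the prediction.

```python
def agreement_score(candidate_outputs, predicted_outputs):
    """Fraction of testcase inputs where the candidate matches the predicted output."""
    if not predicted_outputs:
        return 0.0
    hits = sum(
        1
        for tc_input, predicted in predicted_outputs.items()
        if candidate_outputs.get(tc_input) == predicted
    )
    return hits / len(predicted_outputs)


def rank_candidates(outputs_by_candidate, predicted_outputs):
    """Sort candidates by descending agreement; ties keep their input order."""
    scored = [
        (agreement_score(outputs, predicted_outputs), name)
        for name, outputs in outputs_by_candidate.items()
    ]
    # sorted() is stable, so equal scores are not reordered.
    return sorted(scored, key=lambda pair: -pair[0])


# Predicted outputs act as the oracle (toy values for illustration).
predicted = {"1 2": "3", "5 5": "10", "0 0": "0"}
candidates = {
    "cand_a": {"1 2": "3", "5 5": "10", "0 0": "1"},
    "cand_b": {"1 2": "3", "5 5": "10", "0 0": "0"},
    "cand_c": {"1 2": "-1", "5 5": "0", "0 0": "0"},
}
ranking = rank_candidates(candidates, predicted)
print([name for _, name in ranking])
```

Because the oracle consists of predicted outputs rather than ground truth, a high score means agreement with DuET's prediction, not verified correctness.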

This repository contains the code and configurations used for all experiments in the paper.


Repository Layout

duet/
├── duet/
│   ├── configs/            # Pipeline configurations (per benchmark)
│   ├── datasets/           # Dataset loaders and mergers
│   ├── evaluators/         # Benchmark evaluators (BCB, DevEval, LiveCodeBench)
│   ├── tools/              # One-off analysis / re-scoring utilities
│   ├── tests/              # Unit tests for observation normalisation & scoring
│   └── utils/              # Execution drivers, logging, output parsers
├── third_party/            # Vendored dependencies (no submodules)
│   ├── bigcodebench/       # BigCodeBench benchmark
│   ├── LiveCodeBench/      # LiveCodeBench benchmark
│   ├── expand_langchain/   # Core pipeline framework
│   └── deveval_venv_addon/ # Per-repo venv setup helpers for DevEval
├── scripts/
│   └── fetch_deveval.sh    # One-time DevEval dataset fetch (optional)
├── run.py                  # Fire CLI entry point
├── pyproject.toml
├── uv.lock
└── LICENSE

Experiment outputs land in results/ and cached execution environments / datasets in data/. Both are gitignored.


Installation

Requirements: Python 3.10+, uv, git.

# 1. Clone
git clone <repo-url>
cd duet

# 2. Install Python deps (all third-party dependencies are vendored)
uv sync

# 3. (Optional) Set up env vars
cp .env.example .env
# edit .env with your OPENAI_API_BASE / OPENAI_API_KEY, etc.

# 4. (Only for DevEval experiments) Fetch the DevEval benchmark.
#    ~3.9 GB; skip if you don't plan to run DevEval.
bash scripts/fetch_deveval.sh

Each benchmark evaluator creates its own isolated execution venv under data/ the first time it runs (e.g. data/deveval_env_cache/, data/livecodebench_venv/, data/exec_venv/). These venvs are gitignored.


Configurations by Benchmark

The framework covers three benchmarks. Each config directory registers one or more pipeline names that are passed to run.py generator --config_name <name>.

| Directory | Benchmark | Representative --config_name |
|-----------|-----------|------------------------------|
| duet/configs/bcb_hard_codegen/ | BigCodeBench-Hard | bcb_hard_codegen |
| duet/configs/20251219-livecodebench_codegen/ | LiveCodeBench (codegen) | 20251219-livecodebench_codegen-vanilla |
| duet/configs/livecodebench_tcoutpred/ | LiveCodeBench (TC output) | livecodebench_tcoutpred_llama31_8b |
| duet/configs/deveval_codegen/ | DevEval (function-level) | deveval_codegen, ..._local_completion, ..._local_infilling, ..._without_context, plus smoke_n1/smoke_n20 variants |

duet/configs/original_qwq/ is kept for reference (early / exploratory config, not used in the paper).


Quick Start

1. BigCodeBench-Hard (codegen + DuET scoring)

# Generate + score + merge
uv run python run.py generator \
  --config_name bcb_hard_codegen \
  --dataset_name bcb_hard \
  --max_concurrency 16 \
  - run --n 10 - merge_json - exit

# Evaluate (BigCodeBench executes in an isolated venv)
uv run python run.py evaluator verify_ground_truth --use_eval_venv
uv run python run.py evaluator evaluate \
  results/<run_name>/results_merged.json \
  --use_eval_venv

Score-only configs (like bcb_hard_codegen) store per-candidate DuET scores in each result; the evaluator computes pass@k by ranking candidates with those scores (with probabilistic tie-breaking for unbiased pass@k).
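One way to compute pass@k over a score ranking without letting an arbitrary tie order bias the result is to take the expectation over uniformly random orderings of tied candidates. The sketch below shows that idea. The function name `expected_pass_at_k` and the closed-form treatment of ties at the cutoff are assumptions for illustration, and the evaluator's actual tie-breaking procedure may differ.

```python
from math import comb


def expected_pass_at_k(scores, passed, k):
    """Probability that a top-k selection by score contains a passing candidate,
    with ties at the cutoff score broken uniformly at random."""
    k = min(k, len(scores))
    if k == 0:
        return 0.0
    cutoff = sorted(scores, reverse=True)[k - 1]
    # Candidates strictly above the cutoff are always selected.
    above = [p for s, p in zip(scores, passed) if s > cutoff]
    # Candidates at the cutoff compete for the remaining slots.
    tied = [p for s, p in zip(scores, passed) if s == cutoff]
    if any(above):
        return 1.0
    slots = k - len(above)
    failing = len(tied) - sum(tied)
    # Chance that every tied candidate drawn into the remaining slots fails.
    all_fail = comb(failing, slots) / comb(len(tied), slots)
    return 1.0 - all_fail


scores = [0.9, 0.5, 0.5, 0.5, 0.1]
passed = [False, True, False, False, True]
for k in (1, 2, 4):
    print(k, expected_pass_at_k(scores, passed, k))
```

With k=2 in this toy example, the top-scored candidate fails and one of three tied candidates fills the second slot, so the expected pass@2 is 1/3, regardless of the order in which the tied candidates were stored.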

2. LiveCodeBench (testcase output prediction)

uv run python run.py generator \
  --config_name livecodebench_tcoutpred_llama31_8b \
  --dataset_name livecodebench_tcoutpred \
  --max_concurrency 8 \
  - run - merge_json - exit

uv run python duet/evaluators/livecodebench_evaluator.py \
  results/<run_name>/results_merged.json \
  --start_date 2024-01-01 --end_date 2024-04-01

3. DevEval (function-level codegen)

# One-time setup (clones DevEval + downloads Source_Code / data archives)
bash scripts/fetch_deveval.sh

# Verify per-project venvs (caches under data/deveval_env_cache/)
uv run python duet/tools/deveval_gt_verify.py \
  --deveval_root third_party/DevEval

# Run one of the context modes (local_completion / local_infilling / without_context)
bash duet/configs/deveval_codegen/scripts/run_deveval_100_local_completion.sh

End-to-end runs, including smoke tests, are shipped as reproducible shell scripts under duet/configs/deveval_codegen/scripts/.


Re-scoring From Existing Generations (no LLM calls)

If you already have a results_merged.json and want to re-run filtering or re-score with a different DuET mixture without calling the LLM again:

uv run python run.py generator \
  --config_name <score_only_config> \
  --dataset_name reuse_results \
  --reuse_results_path results/<old_run>/results_merged.json \
  --max_concurrency 16 \
  - run - merge_json - exit

The original file is only read, never modified. Pass --run_name <name> to choose the output directory.


Evaluation

| Benchmark | Entry point |
|-----------|-------------|
| BigCodeBench | run.py evaluator evaluate[_multiple_fields] |
| LiveCodeBench | duet/evaluators/livecodebench_evaluator.py |
| DevEval | duet/evaluators/deveval_evaluator.py (has an eval_loop that watches a running generator and evaluates new tasks as they appear) |

All evaluators emit both a JSON summary and a CSV of per-candidate pass/fail results, which can be analysed post hoc with the utilities under duet/tools/.


Citation

@inproceedings{duet2026,
  title     = {DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode},
  author    = {Han, Hojae and Kim, Jaejin and Hwang, Seung-won and Kim, Yu Jin and Lee, Moontae},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026}
}

License

This repository is released under the terms of LICENSE. Each vendored third-party component under third_party/ retains its own license.
