Merged
29 changes: 28 additions & 1 deletion examples/README.md
@@ -16,6 +16,7 @@ Explore runnable examples that show how to use Weco to optimize ML models, promp
- [🧠 Prompt Engineering](#-prompt-engineering)
- [📊 Extract Line Plot — Chart to CSV](#-extract-line-plot--chart-to-csv)
- [🛰️ Model Development — Spaceship Titanic](#️-model-development--spaceship-titanic)
- [🕵️ Fraud Detection — IEEE-CIS](#️-fraud-detection--ieee-cis)

### Prerequisites

@@ -35,6 +36,7 @@ pip install weco
| 🧠 Prompt Engineering | Iteratively refine LLM prompts to improve accuracy | `openai`, `datasets`, OpenAI API key | [README](prompt/README.md) |
| 📊 Agentic Scaffolding | Optimize agentic scaffolding for chart-to-CSV extraction | `openai`, `huggingface_hub`, `uv`, OpenAI API key | [README](extract-line-plot/README.md) |
| 🛰️ Spaceship Titanic | Improve a Kaggle model training pipeline | `pandas`, `numpy`, `scikit-learn`, `torch`, `xgboost`, `lightgbm`, `catboost` | [README](spaceship-titanic/README.md) |
| 🕵️ Fraud Detection | Optimize a fraud pipeline on IEEE-CIS (real Vesta transactions) | `pandas`, `numpy`, `scikit-learn`, `lightgbm`, `pyarrow`, `kaggle` | [README](fraud-detection/README.md) |

---

@@ -162,8 +164,33 @@ weco run --source train.py \
--log-dir .runs/spaceship-titanic
```

### 🕵️ Fraud Detection — IEEE-CIS

Optimize a tabular fraud-detection pipeline on real Vesta payment data.
Reproduces Weco's
[fraud-detection case study](https://weco.ai/blog/framing-the-problem)
(baseline AUC 0.914 → pooled 6-seed mean 0.9305 ± 0.0035 with full
instructions at 200 steps).

- **Prereqs**: Kaggle API token + [join the competition](https://www.kaggle.com/c/ieee-fraud-detection)
- **Install Dependencies**: `pip install -r requirements.txt`
- **Prepare data** (once, ~2-3 min): `python prepare_data.py`
- **Run**:
```bash
cd examples/fraud-detection
weco run --source train.py \
  --eval-command "python evaluate.py" \
  --metric auc_roc \
  --goal maximize \
  --steps 50 \
  --model gemini-3.1-pro-preview \
  --additional-instructions instructions.md \
  --eval-timeout 300 \
  --log-dir .runs/fraud-detection
```

---

If you're new to Weco, start with **Hello World**, then try **LangSmith ZephHR QA** for a realistic LangSmith optimization workflow, explore **Triton** and **CUDA** for kernel engineering, **Prompt Engineering** for optimizing an LLM's prompt, **Extract Line Plot** for optimizing agentic scaffolds, or **Spaceship Titanic** for model development.
If you're new to Weco, start with **Hello World**, then try **LangSmith ZephHR QA** for a realistic LangSmith optimization workflow, explore **Triton** and **CUDA** for kernel engineering, **Prompt Engineering** for optimizing an LLM's prompt, **Extract Line Plot** for optimizing agentic scaffolds, **Spaceship Titanic** for model development, or **Fraud Detection** for a production-scale tabular ML case study.


4 changes: 4 additions & 0 deletions examples/fraud-detection-loose/.gitignore
@@ -0,0 +1,4 @@
data/
.runs/
__pycache__/
*.pyc
175 changes: 175 additions & 0 deletions examples/fraud-detection-loose/README.md
@@ -0,0 +1,175 @@
# Fraud Detection (IEEE-CIS)

Optimize a tabular fraud-detection pipeline on the
[IEEE-CIS Fraud Detection](https://www.kaggle.com/c/ieee-fraud-detection) Kaggle
dataset (real Vesta payment transactions). Weco rewrites `train.py` — both
feature engineering and the LightGBM configuration — to maximize AUC-ROC on a
held-out, time-based validation split.

This example reproduces the setup from Weco's fraud-detection case study
([blog post](https://weco.ai/blog/framing-the-problem),
[code](https://github.com/WecoAI/fraud-detection-case-study)). The example's
baseline is **AUC ≈ 0.9102** (deterministic; verifiable via the SHA-256s
in `prepare_data.py`). The case study reported 0.914, which used a slightly
leaky `build_features` (concat-then-groupby on train+val); this example's
`train.py` fits all encoders on `train_df` only — no time-leakage. With the
bundled `instructions.md` and 200 steps of `gemini-3.1-pro-preview`, expect
AUC in the **0.928–0.933** range.

## Prerequisites

1. **Kaggle API token**. Put a valid `kaggle.json` at `~/.kaggle/kaggle.json`
(see [Kaggle API credentials](https://github.com/Kaggle/kaggle-api#api-credentials)),
then `chmod 600 ~/.kaggle/kaggle.json` to silence the permissions warning.
2. **Join the competition.** Visit
<https://www.kaggle.com/c/ieee-fraud-detection> and click "Late Submission" /
"Join Competition" to accept the rules. Without this,
`prepare_data.py` fails with `403 Forbidden` from the Kaggle API.
This is the most common first-time stumbling block.
3. **Weco API key** (free tier is fine). See the
[Weco docs](https://docs.weco.ai).

## Setup

```bash
cd examples/fraud-detection

# Virtualenv is strongly recommended — modern Python installs (Debian/Ubuntu,
# recent Homebrew) refuse `pip install` to the system site-packages under
# PEP 668. If you skip this step you'll hit
# `error: externally-managed-environment`.
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# After activation, `python` resolves to the venv's interpreter.

pip install --upgrade -r requirements.txt
# Always pull the latest weco-cli — never pin. Recent versions ship important
# fixes (e.g. 0.3.31 added queue-mode submit recovery that prevents transient
# network errors from prematurely terminating runs). `--upgrade` ensures you
# pick those up even if an older weco is already installed in the venv.

# Downloads ~120MB of CSVs, builds a small 100K/25K parquet split.
# Time-based split: last 20% of transactions by TransactionDT = validation.
# ~2-3 minutes on a modern laptop.
python prepare_data.py
```

After this you should have:

```
data/
  train_transaction.csv, train_identity.csv, test_*.csv   # raw
  base_train_small.parquet                                # 100K rows, time-ordered
  base_val_small.parquet                                  # 25K rows, later in time
```
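
The time-based split described in the setup comments (the last 20% of transactions by `TransactionDT` become validation) can be sketched as follows. This is an illustrative sketch only: the real `prepare_data.py` is not reproduced here, and the helper name `time_based_split` and the toy frame are assumptions.

```python
import pandas as pd


def time_based_split(df: pd.DataFrame, time_col: str = "TransactionDT", val_frac: float = 0.2):
    """Return (train, val), where val holds the latest `val_frac` of rows by `time_col`."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    n_val = int(round(len(ordered) * val_frac))
    cut = len(ordered) - n_val
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Toy frame standing in for the real transactions (hypothetical values).
toy = pd.DataFrame({"TransactionDT": [50, 10, 40, 20, 30], "isFraud": [0, 0, 1, 0, 0]})
train_df, val_df = time_based_split(toy)
```

Every validation row is at least as late as every training row, which is what makes the split a proxy for serving-time conditions.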

## Quick sanity check

Run the baseline once to confirm everything loads:

```bash
python evaluate.py
# → auc_roc: 0.910171 (deterministic, takes ~30s)
```

If you see an AUC in the 0.90-0.92 range, you're ready.

## Run Weco

The "default" run uses the full EDA + techniques instructions (recommended —
they contain the column semantics and known-good techniques for this dataset):

```bash
weco run --source train.py \
  --eval-command "python evaluate.py" \
  --metric auc_roc \
  --goal maximize \
  --steps 50 \
  --model gemini-3.1-pro-preview \
  --additional-instructions instructions.md \
  --eval-timeout 300 \
  --log-dir .runs/fraud-detection
```

Expected trajectory:

- Steps 1–10: Weco explores — tries log-amount, simple aggregations, category
encodings. AUC moves into 0.918-0.925.
- Steps 10–50: builds UID-style features (card1 + addr1 + account-creation
estimate via `D1`), target encoding with out-of-fold protection, velocity
features. AUC climbs to 0.928-0.933.
- Beyond step 50: diminishing returns; the pooled mean across 6 seeds in our
case study was 0.9305 ± 0.0035.

## Explanation

- `--source train.py` — the file Weco rewrites. Both `build_features` and
`train_and_evaluate` are fair game.
- `--eval-command "python evaluate.py"` — called after every proposed edit;
reimports `train.py`, runs the pipeline, prints `auc_roc: 0.xxxxxx`. Weco
parses the last line matching `--metric`.
- `--metric auc_roc --goal maximize` — Weco optimizes the metric printed by
the evaluator.
- `--additional-instructions instructions.md` — injects domain context into
  every optimization step. **This is the flag that matters most.** In the
  case study, EDA-level instructions (what each column means in this
  specific dataset) drove most of the gain, while Kaggle-classic techniques
  are typically already in the LLM's pretraining distribution. Feed the
  optimizer what it couldn't already know: dataset-specific semantics,
  proprietary heuristics, internal constraints.
- `--eval-timeout 300` — one eval takes ~30-60s; 300s gives headroom for
feature-heavy proposals.
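
To make the "last line matching `--metric`" behaviour concrete, here is a minimal sketch of how such a parser could work. It is not Weco's actual implementation; the function name `parse_last_metric` and the regular expression are assumptions for illustration.

```python
import re
from typing import Optional


def parse_last_metric(output: str, metric: str) -> Optional[float]:
    """Return the value from the last `<metric>: <number>` line, or None if absent."""
    pattern = re.compile(
        rf"^{re.escape(metric)}:\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*$"
    )
    value = None
    for line in output.splitlines():
        match = pattern.match(line.strip())
        if match:
            value = float(match.group(1))  # later matches overwrite earlier ones
    return value


# Hypothetical evaluator output.
sample_output = "loading data\nauc_roc: 0.910171\n"
parsed = parse_last_metric(sample_output, "auc_roc")
```

The practical consequence is that `train.py` can print whatever diagnostics it likes, as long as the final `auc_roc:` line keeps its format.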

## Things to try

1. **No instructions baseline**: remove `--additional-instructions` and watch
variance across seeds balloon (std ~0.008 vs ~0.002 with instructions).
Also watch for silently-leaky proposals (see below).
2. **EDA only**: keep only the column-meaning section of `instructions.md` —
the case study found this accounts for most of the mean gain.
3. **Scope restriction**: point Weco at `train.py`'s `build_features` only by
editing the file to expose just that function (or split the pipeline into
`features.py` + `model.py`). In our case study, features-only delivered
most of the improvement that full-pipeline did.

## Watch out for silent leakage

Two flavors of leakage show up in IEEE-CIS optimization runs.

**Target leakage** — `isFraud` ends up encoded into features. A plausible
idea like "count how many columns are zero per row" becomes leaky if the
dataframe still contains `isFraud`, because fraud rows contribute a
different count than non-fraud rows. The baseline `build_features` drops
`isFraud` and `TransactionID` up-front; don't let proposals reintroduce
aggregations on a dataframe that still has the label. The case study walks
through a real instance where this bug reported AUC 0.9591 that dropped to
0.9154 after a one-line fix — see
<https://weco.ai/blog/framing-the-problem>.
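
A toy illustration of the zero-count example, using made-up values: when `isFraud` is still in the frame, every non-fraud row picks up one extra zero, so the "feature" directly encodes the label.

```python
import pandas as pd

# Hypothetical mini-frame; f1 and f2 stand in for real feature columns.
df = pd.DataFrame({
    "f1": [0, 1, 0, 1],
    "f2": [1, 1, 0, 0],
    "isFraud": [0, 1, 0, 1],
})

# Leaky: the label column is still present, so isFraud == 0 adds 1 to the count.
leaky_zero_count = (df == 0).sum(axis=1)

# Safe: drop the label (and any ID column) before any row-wise aggregation.
features = df.drop(columns=["isFraud"])
safe_zero_count = (features == 0).sum(axis=1)

# The difference between the two counts is exactly 1 - isFraud.
label_contribution = leaky_zero_count - safe_zero_count
```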

**Time leakage** — validation-period statistics leak into train features.
This is a time-based split; at serving time you don't have the val period.
Any encoder, groupby aggregation, frequency count, or target encoding must
be **fit on `train_df` only** and then applied to both splits. The baseline
demonstrates the pattern — fit `card1_amt_mean` on train, `.join` it onto
both train and val, fill unseen val keys with a train-global default. If a
proposal does `pd.concat([train_df, val_df]).groupby(...)`, that's a leak
even if it drops `isFraud` first.
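
The fit-on-train pattern for `card1_amt_mean` can be sketched like this. The toy frames are assumptions, and the baseline `train.py` may differ in detail.

```python
import pandas as pd

# Hypothetical toy splits.
train_df = pd.DataFrame({"card1": [1, 1, 2], "TransactionAmt": [10.0, 30.0, 50.0]})
val_df = pd.DataFrame({"card1": [1, 3], "TransactionAmt": [5.0, 99.0]})

# Fit on train only: per-card mean amount plus a train-global fallback.
card1_amt_mean = (
    train_df.groupby("card1")["TransactionAmt"].mean().rename("card1_amt_mean")
)
global_mean = train_df["TransactionAmt"].mean()

# Apply to both splits; card1 values unseen in train get the global default.
train_df = train_df.join(card1_amt_mean, on="card1")
val_df = val_df.join(card1_amt_mean, on="card1")
val_df["card1_amt_mean"] = val_df["card1_amt_mean"].fillna(global_mean)
```

Note that the validation amounts (5.0 and 99.0) never enter the statistic: `card1 == 3` is unseen in train and falls back to the train-global mean.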

Signs a run has one of these leaks (AUC suspiciously high on this 100K/25K
subsample, e.g. > 0.95):

- Any `df.sum`/`df.mean`/`(df == x)` across all columns before the label is
dropped.
- Target encoding without out-of-fold protection (encoder fit on full train
then applied to train).
- Groupby / value-counts / target encoders fit on `pd.concat([train, val])`.
- Features computed using validation data at all — velocity features that
sort train + val together and take row-wise diffs, etc.

## Citing the case study

If you use this example, the underlying numbers come from
<https://github.com/WecoAI/fraud-detection-case-study>. Setup: 200 steps,
3 seeds per condition (6 for the Full pipeline + Full-instructions condition,
pooled since the two ablations share that configuration),
`gemini-3.1-pro-preview`.
35 changes: 35 additions & 0 deletions examples/fraud-detection-loose/evaluate.py
@@ -0,0 +1,35 @@
"""Evaluator Weco calls after each proposed edit.

Loads train.py fresh each run (Weco rewrites it in place), executes the
pipeline, and prints a single `auc_roc: 0.xxxxxx` line that Weco parses as
the metric.
"""

from __future__ import annotations

import importlib.util
import sys
from pathlib import Path


def load_module(path: str):
    spec = importlib.util.spec_from_file_location("train_under_test", path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod


def main() -> int:
    train = load_module(str(Path(__file__).parent / "train.py"))
    auc = train.run_pipeline()

    if not (0.0 <= auc <= 1.0):
        print(f"Constraint violated: AUC-ROC out of range ({auc})")
        return 1

    print(f"auc_roc: {auc:.6f}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
116 changes: 116 additions & 0 deletions examples/fraud-detection-loose/instructions.md
@@ -0,0 +1,116 @@
# Fraud Detection Optimization Instructions

## Task
Optimize `train.py` to maximize AUC-ROC for fraud detection on the IEEE-CIS dataset. You may modify both `build_features` (feature engineering) and `train_and_evaluate` (model config). Keep `run_pipeline`'s interface and the `auc_roc: 0.xxxxxx` print format unchanged so the evaluator can parse the metric.

## Dataset Details
- 100K train / 25K val, 3.5% fraud rate, time-based split
- Base data has 297 columns after V-feature correlation pruning
- Categoricals are already label-encoded as integers
- TransactionDT is in seconds (timedelta from reference date, NOT a timestamp)

## Column Meanings (from Kaggle community reverse-engineering)

### Raw columns
- **TransactionAmt**: USD amount. Heavy-tailed (median $68, max $4578). Log transform essential.
- **ProductCD**: Product type (5 categories: C, H, R, S, W). Each has a distinct V-feature NaN pattern and fraud rate (C=11%, W=2.1%).
- **card1**: Bank Identification Number (BIN) — first 6 digits of card. Top-3 importance.
- **card2**: Additional card info. 1.5% NaN. Top-3 importance.
- **card3/card5**: Card country/product type codes.
- **card4**: Card network (visa, mastercard, etc).
- **card6**: Card type (credit, debit).
- **addr1**: Billing zip code (anonymized). 11.5% NaN.
- **addr2**: Billing country.
- **P_emaildomain**: Purchaser email domain (gmail.com, yahoo.com, etc).
- **R_emaildomain**: Recipient email domain. Mismatch between P and R = fraud signal.
- **dist1/dist2**: Distance features.

### C-features (C1-C14): Entity occurrence COUNTS, no NaN
- **C1** (importance rank #2): Count of addresses associated with the payment card
- **C2**: Count of cards at the billing address
- **C5**: Count of email addresses seen with this card
- **C11**: Count of cards associated with a user identity
- **C12**: Count of addresses associated with a user identity
- **C13** (importance rank #4): Count of distinct email domains per entity — **one of the single most predictive raw features**. High values = fraud ring.
- **C14** (importance rank #3): Related count feature

### D-features (D1-D15): TIMEDELTA in days between events
- **D1** (0.2% NaN, median 1 day): Days since last transaction. Most important D-feature. `TransactionDT/86400 - D1` estimates the **account creation date** — this is the key insight for UID construction.
- **D2** (49% NaN, median 97 days): Days since card was first associated with the identity
- **D3** (46% NaN): Days since last similar transaction
- **D4** (29.5% NaN): Days since email association
- **D10** (14% NaN): Days since last device-linked transaction
- **D11** (52% NaN): Days since account was opened / account age
- **D15** (16.5% NaN, median 46 days): Days since last transaction (alternative)
- D-feature NaN rates themselves are informative — missingness patterns encode transaction type

### M-features (M1-M9): Binary MATCH indicators
Whether certain attributes match each other (name↔address, card↔billing, device↔historical, etc). Sum of True values, count of NaN, and the M-vector signature are all useful.

### V-features (V1-V339, ~202 after pruning): Vesta-engineered risk signals
Grouped by ProductCD — each product type uses a different subset of V-features (others are NaN). V258 is the #1 most important feature overall (gain=16703). Other important V-features: V283, V69, V130, V307, V294, V201.

## Top Winning Techniques (from 1st-3rd place solutions)

### 1. UID Construction (THE most impactful single technique)
```python
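import numpy as np  # assumes `df` is the transactions DataFrame
D1_start = np.floor(df["TransactionDT"] / 86400 - df["D1"])  # estimated account creation day
df["uid"] = df["card1"].astype(str) + "_" + df["addr1"].astype(str) + "_" + D1_start.astype(str)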
```
This creates a stable user fingerprint. All aggregation features should be computed on this UID.

### 2. UID-level aggregation features
For each UID, compute: mean, std, count of TransactionAmt. Then z-score and ratio for each transaction relative to user's history. This captures "is this transaction unusual for this user?"
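
A minimal sketch of these aggregations, assuming a `uid` column already exists and fitting the statistics on train only (see the leakage section). The helper name `add_uid_amount_features` and the output column names are illustrative, not part of `train.py`.

```python
import numpy as np
import pandas as pd


def add_uid_amount_features(train_df, val_df, uid_col="uid", amt_col="TransactionAmt"):
    """Per-UID mean/std/count fit on train, then z-score and ratio on both splits."""
    stats = train_df.groupby(uid_col)[amt_col].agg(["mean", "std", "count"])
    stats = stats.add_prefix("uid_amt_")
    enriched = []
    for df in (train_df, val_df):
        df = df.join(stats, on=uid_col)
        # Zero or missing std (single-transaction or unseen UID) yields NaN,
        # which tree models such as LightGBM handle natively.
        std = df["uid_amt_std"].replace(0.0, np.nan)
        df["uid_amt_zscore"] = (df[amt_col] - df["uid_amt_mean"]) / std
        df["uid_amt_ratio"] = df[amt_col] / df["uid_amt_mean"]
        enriched.append(df)
    return enriched[0], enriched[1]


# Hypothetical toy data.
train_df = pd.DataFrame({"uid": ["a", "a", "a", "b"], "TransactionAmt": [10.0, 20.0, 30.0, 40.0]})
val_df = pd.DataFrame({"uid": ["a", "c"], "TransactionAmt": [40.0, 5.0]})
train_out, val_out = add_uid_amount_features(train_df, val_df)
```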

### 3. Temporal centroid distance
Compute the user's typical time-of-day using cyclical hour_sin/hour_cos means. The Euclidean distance of the current transaction from the centroid = "is this at an unusual time for this user?"
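
One possible sketch of the centroid distance, assuming a `uid` column exists and that hour of day is derived from `TransactionDT` relative to the unknown reference date. The helper name and column names are illustrative.

```python
import numpy as np
import pandas as pd


def add_hour_centroid_distance(train_df, val_df, uid_col="uid", dt_col="TransactionDT"):
    """Distance of each transaction's cyclical hour from the UID's train-period centroid."""

    def with_hour(df):
        df = df.copy()
        hour = (df[dt_col] // 3600) % 24  # hour of day, relative to the reference date
        angle = 2 * np.pi * hour / 24
        df["hour_sin"] = np.sin(angle)
        df["hour_cos"] = np.cos(angle)
        return df

    train_df, val_df = with_hour(train_df), with_hour(val_df)
    # Centroid is fit on train only to avoid time leakage.
    centroid = train_df.groupby(uid_col)[["hour_sin", "hour_cos"]].mean()
    centroid.columns = ["uid_hour_sin_mean", "uid_hour_cos_mean"]
    out = []
    for df in (train_df, val_df):
        df = df.join(centroid, on=uid_col)
        df["hour_centroid_dist"] = np.sqrt(
            (df["hour_sin"] - df["uid_hour_sin_mean"]) ** 2
            + (df["hour_cos"] - df["uid_hour_cos_mean"]) ** 2
        )
        out.append(df)
    return out[0], out[1]


# Hypothetical user who always transacts at hour 6 in train.
train_df = pd.DataFrame({"uid": ["a", "a"], "TransactionDT": [6 * 3600, 6 * 3600]})
val_df = pd.DataFrame({"uid": ["a", "a"], "TransactionDT": [6 * 3600, 18 * 3600]})
train_out, val_out = add_hour_centroid_distance(train_df, val_df)
```

The distance ranges from 0 (same hour as the user's habit) to 2 (the opposite side of the clock), so the hour-18 transaction above sits at the maximum.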

### 4. D-feature lifecycle lags
D1 - D2, D1 - D4, D1 - D10, D1 - D15: Inconsistencies between these timestamps indicate synthetic identities or account takeovers.

### 5. Velocity features (sort by [uid, TransactionDT])
Time since last transaction per user. Amount change from previous transaction. High velocity + high amount = fraud signal.
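
A hedged sketch of per-user velocity features, assuming a `uid` column exists. It is meant to be applied to each split separately, since sorting train and val together would be a time leak. The helper and column names are illustrative.

```python
import pandas as pd


def add_velocity_features(df, uid_col="uid", dt_col="TransactionDT", amt_col="TransactionAmt"):
    """Seconds since, and amount change from, the same UID's previous transaction."""
    ordered = df.sort_values([uid_col, dt_col], kind="stable")
    grouped = ordered.groupby(uid_col, sort=False)
    return ordered.assign(
        secs_since_prev=grouped[dt_col].diff(),
        amt_change=grouped[amt_col].diff(),
    ).sort_index()  # restore the original row order


# Hypothetical toy split.
toy = pd.DataFrame({
    "uid": ["a", "b", "a", "a"],
    "TransactionDT": [100, 50, 400, 150],
    "TransactionAmt": [10.0, 5.0, 40.0, 25.0],
})
out = add_velocity_features(toy)
```

The first transaction of each UID has no predecessor, so both features are NaN there.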

### 6. Cross-entity cardinality (nunique)
How many unique addr1 values per card1? How many unique card1 per addr1? How many unique P_emaildomain per uid? High cardinality = suspicious.
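
As an illustration, cardinality can be fit on train and mapped onto both splits. The helper `fit_cardinality`, the generated column names, and the choice of 0 for unseen keys are assumptions of this sketch.

```python
import pandas as pd


def fit_cardinality(train_df, group_col, value_col):
    """Number of distinct `value_col` values per `group_col`, computed on train only."""
    name = f"{value_col}_per_{group_col}_nunique"
    return train_df.groupby(group_col)[value_col].nunique().rename(name)


# Hypothetical toy splits.
train_df = pd.DataFrame({"card1": [1, 1, 1, 2], "addr1": [10, 20, 20, 10]})
val_df = pd.DataFrame({"card1": [1, 3], "addr1": [10, 10]})

for group_col, value_col in [("card1", "addr1"), ("addr1", "card1")]:
    card = fit_cardinality(train_df, group_col, value_col)
    train_df = train_df.join(card, on=group_col)
    val_df = val_df.join(card, on=group_col)
    val_df[card.name] = val_df[card.name].fillna(0)  # key unseen in train
```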

### 7. NaN pattern signature
The binary NaN/not-NaN pattern across D+M columns encodes the transaction type. Compute a bitwise signature or just count NaN per feature group.
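
A small sketch of both variants (bitwise signature and plain NaN count). The helper name `nan_signature` is illustrative, and the int64 encoding assumes at most 62 columns, which covers D1-D15 plus M1-M9.

```python
import numpy as np
import pandas as pd


def nan_signature(df, cols):
    """Encode the NaN/not-NaN pattern over `cols` as one integer (bit i <-> cols[i])."""
    mask = df[cols].isna().to_numpy(dtype=np.int64)
    weights = 1 << np.arange(len(cols), dtype=np.int64)
    return pd.Series(mask @ weights, index=df.index, name="nan_signature")


# Hypothetical toy rows.
toy = pd.DataFrame({
    "D1": [1.0, np.nan, 3.0],
    "D2": [np.nan, np.nan, 1.0],
    "M1": [1.0, 0.0, np.nan],
})
cols = ["D1", "D2", "M1"]
toy["nan_signature"] = nan_signature(toy, cols)
toy["nan_count"] = toy[cols].isna().sum(axis=1)
```

Both features depend only on the row itself, so they carry no train/val leakage risk.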

### 8. Frequency encoding
For card1, card2, addr1, P_emaildomain, etc. — map each value to its frequency. Rare values (appearing once or twice) are fraud signals.

### 9. Interaction features
- amount_zscore × time_distance (unusual amount at unusual time)
- amount_zscore × C1_ratio (unusual amount with unusual address count)
- amount / (D1 + 1) = spending rate per day since last transaction

### 10. Row-wise missingness features
Count of NaN values across D-columns, M-columns, V-columns per row. Sum/mean of M-column values. The NaN pattern encodes the transaction profile.

## Important Constraints
- Keep code under 300 lines (Weco backend limit)
- Use n_jobs=4 for any model operations
- `train.py` loads `data/base_train_small.parquet` and `data/base_val_small.parquet` — don't change these paths
- Categoricals are already integer-encoded — treat them as numeric
- Keep the `run_pipeline() -> float` function signature and the `auc_roc: 0.xxxxxx` print format intact

## Avoiding silent leakage

Two distinct leaks to avoid. Both inflate reported AUC without improving the real pipeline.

**1. Target leakage (isFraud bleeding into features).** `isFraud` is the label. If you compute features that aggregate across all columns of the dataframe (e.g. `(df == 0).sum(axis=1)`, row-wise NaN counts over the entire frame), drop `isFraud` and `TransactionID` first. Otherwise the label signal encodes into the features and produces implausibly high AUC (> 0.95) that collapses the moment the fix is applied.

**2. Time leakage (validation distribution bleeding into features).** This is a time-based train/val split — val rows are transactions from a later period you wouldn't see at serving time. Any encoder, aggregation, frequency count, or target encoding MUST be fit on `train_df` only and then applied to both splits. Concatenating `train_df + val_df` before a `groupby` lets val-period statistics shape train features and lets each val row influence its own encoded values. Expected fallout: smaller inflation than target leakage, but still material (noticeable bump in val AUC that doesn't survive a real time cutoff).

Pattern to follow for any new group/frequency/target encoder:

```python
# Fit on train
freq = train_df[col].value_counts(normalize=True)
# Apply to both, unseen keys get 0 (or a sensible train-global default)
train_df[f"{col}_freq"] = train_df[col].map(freq).fillna(0)
val_df[f"{col}_freq"] = val_df[col].map(freq).fillna(0)
```

For target encoding specifically, even on train you need out-of-fold protection (fit encoder on K-1 folds, apply to the held-out fold) — otherwise you leak train labels into train features.
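
One way to sketch out-of-fold target encoding, assuming scikit-learn's `KFold` with shuffling; a time-ordered fold scheme would be a stricter alternative. The helper name `oof_target_encode` and the fallback to the train prior are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(train_df, val_df, col, target="isFraud", n_splits=5, seed=0):
    """Train rows are encoded from the other folds; val rows from the full train set."""
    prior = train_df[target].mean()
    train_enc = pd.Series(np.nan, index=train_df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, hold_idx in kf.split(train_df):
        fold_means = train_df.iloc[fit_idx].groupby(col)[target].mean()
        train_enc.iloc[hold_idx] = train_df.iloc[hold_idx][col].map(fold_means).to_numpy()
    full_means = train_df.groupby(col)[target].mean()
    val_enc = val_df[col].map(full_means)
    # Keys unseen in the fitting data fall back to the train-wide fraud rate.
    return train_enc.fillna(prior), val_enc.fillna(prior)


# Hypothetical toy data; key 9 appears exactly once in train.
train_df = pd.DataFrame({
    "card1": [1, 1, 1, 1, 2, 2, 2, 2, 9],
    "isFraud": [1, 0, 0, 0, 1, 1, 0, 0, 1],
})
val_df = pd.DataFrame({"card1": [1, 7]})
train_enc, val_enc = oof_target_encode(train_df, val_df, "card1", n_splits=3)
```

The singleton key shows why this matters: a naive encoder would assign it 1.0 (its own label), while the out-of-fold version can only give it the prior.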