OpenAdapt-ML is a model-agnostic, domain-agnostic ML engine for GUI automation agents.
It provides:
- Schemas for GUI interaction trajectories (screens + actions + goals).
- Synthetic semantic UI generation for bootstrapping datasets.
- Dataset builders that turn episodes into next-action SFT samples.
- VLM adapters (Qwen3-VL, Qwen2.5-VL) using Hugging Face + PEFT.
- A minimal supervised training loop for fine-tuning.
- A simple runtime policy API that predicts the next GUI action.
The design is described in detail in docs/design.md.
From the repository root:
uv sync

Run a fast, model-free smoke test:
uv run python -m openadapt_ml.scripts.demo_policy --backend dummy

On a machine with a suitable GPU, you can reproduce the Qwen3-VL synthetic login benchmark (train → eval base/FT → plot) with a single command:
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
--config configs/qwen3vl_synthetic_dev.yaml \
--out-dir experiments/qwen_login/2b_dev

This default invocation will:
- Train a LoRA adapter on the hardened synthetic login scenario.
- Evaluate both the base and fine-tuned Qwen3-VL models on fresh synthetic episodes.
- Write eval JSONs and a comparison plot under experiments/qwen_login/2b_dev/.
The qwen3vl_synthetic_dev config is sized for small development runs on Apple
Silicon / CPU, but will also run on CUDA GPUs.
To additionally compare against hosted API backends (Claude Sonnet 4.5 and
OpenAI GPT-5.1), first install the optional api extra and configure your API
keys:
uv sync --extra api
# Option 1: Use .env file (recommended)
cp .env.example .env
# Edit .env with your API keys
# Option 2: Export environment variables (for CI/containers)
export ANTHROPIC_API_KEY=... # for Claude Sonnet 4.5
export OPENAI_API_KEY=...    # for GPT-5.1

Then run:
uv run python -m openadapt_ml.scripts.run_qwen_login_benchmark \
--config configs/qwen3vl_synthetic_dev.yaml \
--out-dir experiments/qwen_login/2b_dev \
--include-all-apis

This will evaluate and plot Qwen3 base, Qwen3 FT, Claude Sonnet 4.5, and GPT-5.1 on the same synthetic login benchmark.
For more details on configs, adapters, and evaluation metrics, see the sections
below and docs/state_and_next_steps_qwen_login.md.
Key modules:
- openadapt_ml/schemas/ - Canonical dataclasses for Session, Episode, Step, Observation, Action.
- openadapt_ml/ingest/synthetic.py - Synthetic semantic UI generator (e.g. login screen) that produces PNG screenshots and scripted episodes.
- openadapt_ml/datasets/next_action.py - Converts episodes into goal-conditioned, chat-style next-action SFT samples suitable for VLM fine-tuning.
- openadapt_ml/models/base_adapter.py - BaseVLMAdapter abstraction shared by all VLM backends.
- openadapt_ml/models/qwen_vl.py - QwenVLAdapter implementing support for Qwen3-VL and Qwen2.5-VL.
- openadapt_ml/models/dummy_adapter.py - Tiny fake adapter used to validate training and runtime flows without loading a real VLM.
- openadapt_ml/training/trainer.py - Minimal supervised training loop (train_supervised) with gradient accumulation and logging.
- openadapt_ml/runtime/policy.py - AgentPolicy that formats inputs for a VLM and parses textual actions like CLICK(x=..., y=...) and DONE() into structured Actions.
- openadapt_ml/scripts/train.py - CLI entry point for running synthetic-data training with a chosen model/config.
- openadapt_ml/scripts/demo_policy.py - CLI demo showing how to use AgentPolicy with different backends (dummy, Qwen3-VL, Qwen2.5-VL).
Configs and docs:
- configs/qwen3vl_synthetic.yaml - Synthetic training config for Qwen3-VL-8B-Instruct.
- configs/qwen2_5vl_synthetic.yaml - Synthetic training config for Qwen2.5-VL-7B-Instruct.
- docs/design.md - High-level design document (scope, architecture, schemas, adapters, training, runtime, and evaluation strategy).
OpenAdapt-ML targets Python 3.12 and uses uv
for dependency management.
From the repository root:
# Ensure uv is installed (see uv docs for platform-specific install)
# Then:
uv sync

This will create a virtual environment (e.g. .venv/) and install all
packages declared in pyproject.toml.
Use uv run to execute Python modules and scripts with the synced
environment:
uv run python -m openadapt_ml.scripts.train --help

You can also run pytest or other tools via uv run.
The v1 pipeline is validated on synthetic, semantic UIs, starting with a simple login flow.
OpenAdapt-ML includes synthetic UI generators for structured GUI automation benchmarks. Currently two scenarios are supported:
Login: a simple login form with username, password, and login button.
Login SoM Evaluation Results:
| Metric | Qwen3-VL-2B FT |
|---|---|
| Action Type Accuracy | 100% |
| Element Accuracy | 100% |
| Episode Success Rate | 100% |
| Episodes / Steps | 32 / 192 |
Registration: a more complex registration form with first name, last name, email, password, confirm password, and register button.
Registration SoM Evaluation Results:
| Metric | Qwen3-VL-2B FT |
|---|---|
| Action Type Accuracy | 100% |
| Element Accuracy | 100% |
| Episode Success Rate | 100% |
| Episodes / Steps | 32 / 384 |
Synthetic data is generated on the fly by generate_synthetic_sessions in
openadapt_ml/ingest/synthetic.py and used internally by the training
scripts.
You can also call it directly from Python:
from openadapt_ml.ingest.synthetic import generate_synthetic_sessions
# Login scenario (default)
sessions = generate_synthetic_sessions(num_sessions=2, seed=123, output_dir="synthetic_login")
# Registration scenario
sessions = generate_synthetic_sessions(
    num_sessions=2,
    seed=123,
    output_dir="synthetic_registration",
    scenario="registration",  # "login" or "registration"
    use_som=True,             # Enable Set-of-Marks visual overlays
)

Each session contains episodes with:
- A goal (e.g. "Log in as demo user").
- A sequence of steps, each with:
  - An observation (screenshot path).
  - An action (e.g. CLICK, TYPE, DONE).
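For example, a minimal sketch of walking this structure (attribute names such as episodes, goal, steps, observation, and action follow the description above and are assumptions; see openadapt_ml/schemas/ for the canonical definitions):

```python
from openadapt_ml.ingest.synthetic import generate_synthetic_sessions

sessions = generate_synthetic_sessions(num_sessions=1, seed=0, output_dir="synthetic_login")
for session in sessions:
    for episode in session.episodes:
        print("Goal:", episode.goal)
        for step in episode.steps:
            # Each step pairs a screenshot observation with the action taken on it.
            print("  observation:", step.observation)
            print("  action:     ", step.action)
```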
Episodes are converted into SFT-style samples by
build_next_action_sft_samples in openadapt_ml/datasets/next_action.py.
Each sample has the form:
{
    "images": ["/path/to/screenshot.png"],
    "messages": [
        {"role": "system", "content": ...},
        {"role": "user", "content": "Goal: ...\nCurrent screen: ..."},
        {"role": "assistant", "content": "CLICK(x=..., y=...)"},
    ],
}

These samples are wrapped in a simple NextActionDataset for use with the
training loop.
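A minimal usage sketch; the flattening step and the exact call signature of build_next_action_sft_samples are assumptions:

```python
from openadapt_ml.datasets.next_action import (
    NextActionDataset,
    build_next_action_sft_samples,
)
from openadapt_ml.ingest.synthetic import generate_synthetic_sessions

sessions = generate_synthetic_sessions(num_sessions=1, seed=123, output_dir="synthetic_login")
episodes = [episode for session in sessions for episode in session.episodes]  # attribute name assumed

samples = build_next_action_sft_samples(episodes)  # call signature assumed
print(samples[0]["messages"][-1]["content"])       # e.g. CLICK(x=..., y=...)

dataset = NextActionDataset(samples)
```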
For the full, canonical definition of the action DSL (CLICK/TYPE/WAIT/DONE)
and its invariants, see docs/design.md §7.4.
Training is driven by openadapt_ml/scripts/train.py and YAML configs under
configs/.
The training script:
- Loads a config file (YAML).
- Generates synthetic sessions.
- Flattens to episodes and builds SFT samples.
- Wraps them in a NextActionDataset.
- Instantiates a VLM adapter (e.g. QwenVLAdapter).
- Runs train_supervised over the dataset (a condensed sketch of these steps follows below).
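The following sketch mirrors those steps; the adapter constructor and train_supervised call signatures are assumptions, so treat openadapt_ml/scripts/train.py as the source of truth:

```python
import yaml

from openadapt_ml.datasets.next_action import NextActionDataset, build_next_action_sft_samples
from openadapt_ml.ingest.synthetic import generate_synthetic_sessions
from openadapt_ml.models.qwen_vl import QwenVLAdapter
from openadapt_ml.training.trainer import train_supervised

with open("configs/qwen3vl_synthetic.yaml") as f:
    config = yaml.safe_load(f)

sessions = generate_synthetic_sessions(num_sessions=2, seed=0, output_dir="synthetic_login")
episodes = [episode for session in sessions for episode in session.episodes]  # attribute name assumed
dataset = NextActionDataset(build_next_action_sft_samples(episodes))

adapter = QwenVLAdapter(config["model"]["name"])  # constructor signature assumed
train_supervised(adapter, dataset, config)        # call signature assumed
```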
Config: configs/qwen3vl_synthetic.yaml
Key fields:
model:
  name: Qwen/Qwen3-VL-8B-Instruct
  load_in_4bit: false  # 4-bit quantization is disabled on macOS / Apple Silicon
  # LoRA config and training hyperparameters are also defined in the YAML.

Run:
uv run python -m openadapt_ml.scripts.train --config configs/qwen3vl_synthetic.yaml

This will:
- Download and load Qwen/Qwen3-VL-8B-Instruct.
- Generate a small synthetic dataset.
- Run a single-epoch supervised fine-tuning loop.
- Print loss values as training progresses.
Config: configs/qwen2_5vl_synthetic.yaml
Key fields:
model:
  name: Qwen/Qwen2.5-VL-7B-Instruct
  load_in_4bit: false

Run:
uv run python -m openadapt_ml.scripts.train --config configs/qwen2_5vl_synthetic.yaml

This exercises the Qwen2.5-VL path in QwenVLAdapter, using a
process_vision_info-style helper internally to pack image inputs in the
format expected by the Qwen2.5-VL processor.
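As an illustration (not the adapter's exact code), the standard Qwen2.5-VL packing pattern with the qwen_vl_utils helper looks roughly like this:

```python
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "/path/to/screenshot.png"},
            {"type": "text", "text": "Goal: Log in as demo user\nWhat is the next action?"},
        ],
    }
]

# Render the chat template, then pack image inputs the way the processor expects.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt")
```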
Note: Both configs are sized for small synthetic smoke runs, not large-scale production training.
OpenAdapt-ML ships a synthetic login benchmark backed by Qwen3-VL, used to compare base vs LoRA-fine-tuned models on a hardened synthetic environment (layout jitter + a decoy "Help" button).
FT = LoRA fine-tuned Qwen3-VL on synthetic login. Base = frozen pretrained Qwen3-VL.
Comprehensive Model Comparison (Login - 6 steps):
The plot compares all six evaluated models across four key metrics (action type accuracy, coordinate error, click hit rate, and episode success rate). The legend shows color coding for model types (Qwen 2B/8B vs API models) and hatching patterns for fine-tuned vs base models. It exposes step-level performance metrics, which let us visually answer the question: "Does fine-tuning a small local model outperform large API models?"
Comprehensive Results (all models on hardened synthetic login):
| Model | Type | Action Accuracy | Coord Error | Click Hit Rate |
|---|---|---|---|---|
| Qwen3-VL-2B base | Offline | 0.143 | N/A | N/A |
| Qwen3-VL-2B FT | Offline | 0.469 | 0.051 | 0.850 |
| Qwen3-VL-8B base | Offline | 0.143 | N/A | N/A |
| Qwen3-VL-8B FT | Offline | 0.286 | 0.004 | 1.000 |
| Claude Sonnet 4.5 | API | 0.121 | 0.757 | 0.000 |
| GPT-5.1 | API | 0.183 | 0.057 | 0.600 |
Key findings:
- Fine-tuning delivers massive gains: Both 2B and 8B models show 2-3x improvement in action accuracy after fine-tuning
- Small fine-tuned models beat large APIs: Qwen3-VL-2B FT (46.9% action accuracy) outperforms both Claude Sonnet 4.5 (12.1%) and GPT-5.1 (18.3%)
- Precision matters: Fine-tuned models have excellent click precision (85-100% hit rate, <0.05 coord error) while API models struggle with the action format
- Size vs specialization: The fine-tuned 2B model outperforms the general-purpose Claude Sonnet 4.5, showing that domain-specific fine-tuning trumps raw model size
With Set-of-Marks visual prompting, fine-tuned Qwen3-VL-2B achieves 100% accuracy on both login (6-step) and registration (12-step) scenarios:
| Scenario | Steps | Elements | Action Acc | Element Acc | Episode Success |
|---|---|---|---|---|---|
| Login | 6 | 3 | 100% | 100% | 100% |
| Registration | 12 | 6 | 100% | 100% | 100% |
Cost/Latency Comparison (SoM mode):
| Approach | Login Accuracy | Registration Accuracy | Cost | Latency |
|---|---|---|---|---|
| Claude API + SoM | 100% | 100%* | ~$0.01/step | ~500ms |
| GPT-4.1 API + SoM | 100% | 100%* | ~$0.01/step | ~500ms |
| Qwen 2B + SoM | 100% | 100% | Free (local) | ~50ms |
*API results on registration pending evaluation
How SoM works: Instead of predicting precise coordinates (CLICK(x=0.42, y=0.31)), the model selects numbered UI elements (CLICK([1])). This reduces spatial reasoning to element selection, which small models handle well.
To use SoM mode:
# Training with SoM
uv run python -m openadapt_ml.scripts.train --config configs/qwen3vl_synthetic_som.yaml
# Evaluation with SoM
uv run python -m openadapt_ml.scripts.eval_policy \
--config configs/qwen3vl_synthetic_som.yaml \
--backend qwen3 \
--dsl-mode som \
--overfit  # Check memorization

For the full SoM investigation report, see experiments/qwen_login/SOM_INVESTIGATION_REPORT.md.
OpenAdapt-ML includes a grounding module for locating UI elements on screenshots using natural language descriptions. This enables policy/grounding separation where the policy decides what to do and the grounder finds where to do it.
The GeminiGrounder uses Google's Gemini vision API to locate UI elements:
(Example: calculator button "2" located by GeminiGrounder with 99% confidence.)
from openadapt_ml.grounding import GeminiGrounder
grounder = GeminiGrounder() # Uses GOOGLE_API_KEY from .env
candidates = grounder.ground(screenshot, "the login button", k=3)
if candidates:
    best = candidates[0]
    print(f"Found at {best.centroid} with {best.confidence:.0%} confidence")

The grounding module includes functions for extracting all UI elements and overlaying numbered labels (Set-of-Marks):
from openadapt_ml.grounding import extract_ui_elements, overlay_element_marks
# Extract all interactive elements
elements = extract_ui_elements(screenshot)
# Returns: [{"id": 1, "label": "Login button", "bbox": [x1,y1,x2,y2], ...}, ...]
# Overlay numbered labels on screenshot
marked_screenshot = overlay_element_marks(screenshot, elements, style="compact")
marked_screenshot.save("screenshot_with_marks.png")

This enables element-based actions using indices instead of coordinates:
- Old: CLICK(x=0.487, y=0.328) - coordinate-based, brittle
- New: CLICK([1]) - element-based, robust
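As a sketch of how an element index can be resolved back to a click point (the regex and centroid computation below are illustrative, not library code):

```python
import re

def resolve_som_click(action_text: str, elements: list[dict]) -> tuple[float, float]:
    """Map an action like "CLICK([1])" to the centroid of the matching element's bbox."""
    match = re.match(r"CLICK\(\[(\d+)\]\)", action_text.strip())
    element_id = int(match.group(1))
    element = next(e for e in elements if e["id"] == element_id)
    x1, y1, x2, y2 = element["bbox"]
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# With elements in the format returned by extract_ui_elements:
print(resolve_som_click("CLICK([1])", [{"id": 1, "label": "Login button", "bbox": [40, 120, 200, 160]}]))
```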
See docs/gemini_grounding.md for full documentation and examples/test_gemini_grounding.py for a complete example.
| Grounder | Description | Latency | Use Case |
|---|---|---|---|
| GeminiGrounder | Google Gemini vision API | ~3s | Real UIs, zero-shot |
| OracleGrounder | Ground-truth bboxes | ~0ms | Evaluation |
| DetectorGrounder | Generic wrapper with backend selection | varies | Flexible |
The openadapt_ml.evals.grounding module provides metrics for evaluating grounding accuracy:
from openadapt_ml.evals import GroundingMetrics, evaluate_grounder
metrics = evaluate_grounder(grounder, test_cases, k=5)
print(metrics)
# Grounding Metrics (n=10):
# Mean IoU: 0.720
# Centroid Hit Rate: 0.900
# Oracle Hit @1: 0.800
# Mean Latency: 3150ms

All VLM backends implement the shared BaseVLMAdapter interface in
openadapt_ml/models/base_adapter.py (prepare inputs, compute loss, generate
text from a sample).
Current adapters include:
- QwenVLAdapter (openadapt_ml/models/qwen_vl.py) for Qwen3-VL and Qwen2.5-VL.
- DummyAdapter (openadapt_ml/models/dummy_adapter.py) for fast smoke tests without loading a real VLM.
- ApiVLMAdapter (openadapt_ml/models/api_adapter.py) for hosted VLM APIs (Anthropic Claude Sonnet 4.5 and OpenAI GPT-5.1). This adapter is inference-only and implements generate using the respective SDKs.
For full adapter internals and training-time vs runtime behavior, see
docs/design.md §8.
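As a rough sketch of what a new backend looks like (the abstract method names below are illustrative assumptions, not the exact BaseVLMAdapter signatures; see openadapt_ml/models/base_adapter.py):

```python
from openadapt_ml.models.base_adapter import BaseVLMAdapter

class EchoAdapter(BaseVLMAdapter):
    """Toy backend; method names are illustrative, not the real interface."""

    def prepare_inputs(self, sample: dict):
        # Tokenize sample["messages"] and load sample["images"] here.
        ...

    def compute_loss(self, sample: dict):
        # Forward pass with labels for supervised fine-tuning.
        ...

    def generate(self, sample: dict) -> str:
        # Return assistant text in the action DSL.
        return "DONE()"
```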
To use the API-backed adapter from Python, you can configure API keys via .env
file, environment variables, or pass them explicitly:
from openadapt_ml.models.api_adapter import ApiVLMAdapter
# Use .env file or environment variables (ANTHROPIC_API_KEY / OPENAI_API_KEY)
claude_adapter = ApiVLMAdapter(provider="anthropic")
gpt_adapter = ApiVLMAdapter(provider="openai")
# Or pass API keys explicitly from your application's config
claude_adapter = ApiVLMAdapter(provider="anthropic", api_key="...")
gpt_adapter = ApiVLMAdapter(provider="openai", api_key="...")

The existing CLI scripts scripts/demo_policy.py and
scripts/eval_policy.py expose these backends via the --backend flag
(claude / openai).
The runtime policy is implemented in openadapt_ml/runtime/policy.py as
AgentPolicy.
AgentPolicy is initialized with a VLM adapter (dummy or real). Given an
SFT-style sample, it:
- Calls adapter.generate(sample) to obtain assistant text.
- Parses actions from strings like CLICK(x=0.45, y=0.71) and DONE().
- Returns a structured Action plus an optional free-form thought.
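A minimal usage sketch with the dummy backend; the prediction method name and sample construction below are assumptions, not the exact API:

```python
from openadapt_ml.models.dummy_adapter import DummyAdapter
from openadapt_ml.runtime.policy import AgentPolicy

policy = AgentPolicy(DummyAdapter())  # constructor arguments assumed

sample = {
    "images": ["/path/to/screenshot.png"],
    "messages": [
        {"role": "system", "content": "You are a GUI automation agent."},
        {"role": "user", "content": "Goal: Log in as demo user\nCurrent screen: ..."},
    ],
}

action, thought = policy.predict(sample)  # method name assumed
print(action, thought)
```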
openadapt_ml/scripts/demo_policy.py demonstrates how to use
AgentPolicy with different backends.
Run with a dummy backend (fast, no model load):
uv run python -m openadapt_ml.scripts.demo_policy --backend dummy

Run with Qwen3-VL backend:

uv run python -m openadapt_ml.scripts.demo_policy --backend qwen3

Run with Qwen2.5-VL backend:

uv run python -m openadapt_ml.scripts.demo_policy --backend qwen2_5

Each invocation will:
- Generate a synthetic login episode and select one step.
- Build an SFT-style sample from that step.
- Use AgentPolicy to predict the next action.
- Print the raw messages and the parsed action/thought.
Basic tests are provided under tests/.
Run the test suite with:
uv run pytest

In particular:
- tests/test_training_dummy.py runs a smoke test over the training loop using DummyAdapter.
OpenAdapt-ML can train on real GUI recordings captured with openadapt-capture.
# Install openadapt-capture
uv pip install openadapt-capture
# Record a workflow (e.g., turning off Night Shift)
openadapt-capture record --output ~/captures/turn-off-nightshift

uv run python -m openadapt_ml.scripts.train \
--config configs/qwen3vl_capture_4bit.yaml \
--capture ~/captures/turn-off-nightshift \
--open  # Opens training dashboard in browser

The goal is automatically derived from the directory name (e.g., "Turn off nightshift").
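A purely illustrative sketch of that mapping (not the actual implementation):

```python
from pathlib import Path

# Hypothetical: derive a goal string from the capture directory name.
capture_dir = Path("~/captures/turn-off-nightshift").expanduser()
goal = capture_dir.name.replace("-", " ").capitalize()
print(goal)  # -> "Turn off nightshift"
```

Then compare the model's predictions against the recorded human actions: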
uv run python -m openadapt_ml.scripts.compare \
--capture ~/captures/turn-off-nightshift \
--checkpoint checkpoints/qwen3vl2b_capture_lora \
--open  # Opens comparison viewer

The comparison viewer shows:
- Side-by-side human actions vs model predictions
- Click position overlays on screenshots
- Accuracy metrics and distance calculations
- Navigation between training dashboard and comparison viewer
Train locally on your own GPU. Auto-detects CUDA or Apple Silicon (MPS).
# Train on a capture (auto-detects device and config)
uv run python -m openadapt_ml.cloud.local train \
--capture ~/captures/turn-off-nightshift \
--open  # Opens dashboard in browser

# Check device and training status
uv run python -m openadapt_ml.cloud.local status
# Train on a capture
uv run python -m openadapt_ml.cloud.local train --capture ~/captures/my-workflow --open
# Check training health (loss progression, convergence)
uv run python -m openadapt_ml.cloud.local check
# Start dashboard server
uv run python -m openadapt_ml.cloud.local serve --open
# Regenerate viewer
uv run python -m openadapt_ml.cloud.local viewer --open
# Run human vs AI comparison
uv run python -m openadapt_ml.cloud.local compare \
--capture ~/captures/my-workflow \
--checkpoint checkpoints/qwen3vl2b_capture_lora \
--open

For faster training on powerful GPUs, use Lambda Labs. Full documentation: docs/cloud_gpu_training.md.
# Set API key
export LAMBDA_API_KEY=your_key_here
# Launch, train, download, and terminate in one command
uv run python -m openadapt_ml.cloud.lambda_labs train \
--capture ~/captures/turn-off-nightshift \
--goal "Turn off Night Shift in System Settings"# List available instances and pricing
uv run python -m openadapt_ml.cloud.lambda_labs list
# Launch an A10 instance (~$0.75/hr)
uv run python -m openadapt_ml.cloud.lambda_labs launch --type gpu_1x_a10
# Check training status
uv run python -m openadapt_ml.cloud.lambda_labs train-status
# Check training health (loss progression, early stopping analysis)
uv run python -m openadapt_ml.cloud.lambda_labs check <instance_id>
# Download checkpoints and comparison results
uv run python -m openadapt_ml.cloud.lambda_labs download <instance_id>
# IMPORTANT: Terminate when done (billed by the hour!)
uv run python -m openadapt_ml.cloud.lambda_labs terminate <instance_id>

The training process generates:
- training_output/dashboard.html - Real-time training dashboard with loss curves
- training_output/viewer.html - Unified viewer for comparing human vs model predictions
Use the navigation tabs to switch between Training and Viewer.
To serve the dashboard:
uv run python -m openadapt_ml.cloud.local serve --port 8080 --open

Training Dashboard:
Shows training progress, loss curves, stats (current loss, min loss, avg step time), and ETA. It also displays the training configuration and evaluation samples, with visual overlays showing human (green) vs predicted (purple) click positions.
Comparison Viewer:
Compares human actions vs model predictions frame-by-frame, showing the action type, model reasoning output, and match/mismatch status, along with an event timeline, event details, a transcript, and video playback controls.
Keyboard shortcuts (Viewer):
- Space - Play/pause
- ←/→ - Previous/next frame
- Home/End - First/last frame
- O - Toggle click overlay
- Apple Silicon / bitsandbytes:
  - Example configs are sized for CPU / Apple Silicon development runs; see docs/design.md §9.4 for details on QLoRA and platform-specific considerations.
- Batching:
  - For v1, QwenVLAdapter is implemented assuming batch_size=1 for simplicity when handling multimodal inputs. The training configs are sized accordingly.
- Evaluation:
  - v1 focuses on smoke tests and qualitative behavior on synthetic data. More formal evaluation scripts and metrics are planned.
For deeper architectural details, see docs/design.md.
For the up-to-date, prioritized roadmap (including concrete implementation
targets and agent-executable acceptance criteria), see
docs/roadmap.md.