A research system exploring recursive self-improvement through adversarial code evolution
Features | Installation | Usage | Architecture | Safety | Contributing
AASMS is a local-first AI system that evolves its own codebase through adversarial Red/Blue team dynamics. Blue Team agents propose code improvements as unified diffs, Red Team agents generate exploit tests to find flaws, and an Orchestrator evaluates proposals in Docker-isolated sandboxes, committing only survivors that improve benchmark scores.
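The survivor-selection loop can be sketched in a few lines. This is an illustrative sketch only (the function shapes and names are assumptions, not the actual AASMS API): Blue proposes diffs, Red attacks each one, and only diffs that survive every exploit and improve the sandboxed benchmark score are kept.

```python
from typing import Callable, List

def run_cycle(propose: Callable[[], List[str]],
              survives_attack: Callable[[str], bool],
              benchmark: Callable[[str], float],
              baseline: float) -> List[str]:
    """One adversarial cycle: keep only diffs that survive Red Team
    exploits AND beat the baseline score in the sandbox."""
    survivors = []
    for diff in propose():                  # Blue Team: candidate unified diffs
        if not survives_attack(diff):       # Red Team found an exploit: reject
            continue
        if benchmark(diff) > baseline:      # Orchestrator: sandboxed scoring
            survivors.append(diff)          # survivor: eligible for commit
    return survivors

# Toy run: Red Team breaks diff-B; diff-A survives and improves the score.
diffs = lambda: ["diff-A", "diff-B"]
survives = lambda d: d != "diff-B"
score = lambda d: 0.02 if d == "diff-A" else 0.0
print(run_cycle(diffs, survives, score, baseline=0.01))  # ['diff-A']
```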
> ⚠️ **Research Prototype**: This system is for research purposes only. See Known Limitations.
| Cycle | Score | Improvement | Status |
|---|---|---|---|
| 1 | 0.01 | — | ✅ Committed |
| 2 | 0.02 | +100% | ✅ Committed |
| 3 | 0.03 | +50% | ✅ Committed |
| 4 | 0.04 | +33% | ✅ Committed |
| 5 | 0.05 | +25% | ✅ Committed |
| 6 | 0.06 | +20% | ✅ Committed |
24 proposals applied, 0 errors, 0 reverts. Scores from benchmarks/reasoning_suite.json (10-question subset). Full methodology in benchmarks/README.md.
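The Improvement column is plain relative change between consecutive cycle scores, e.g. cycle 3's +50% is (0.03 − 0.02) / 0.02. A quick check:

```python
def improvement_pct(prev: float, curr: float) -> float:
    """Relative improvement between consecutive cycle scores, in percent."""
    return 100.0 * (curr - prev) / prev

scores = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
for prev, curr in zip(scores, scores[1:]):
    print(f"{prev:.2f} -> {curr:.2f}: +{improvement_pct(prev, curr):.0f}%")
# +100%, +50%, +33%, +25%, +20% — matching the table above
```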
All results are reproducible and verifiable. See docs/REPRODUCIBILITY_PROOF.md for complete proof artifacts:
| Proof Type | Description |
|---|---|
| Seeded Benchmarks | `python scripts/benchmark.py --cycles 10 --seed 42` produces identical results |
| JSONL Cycle Logs | Full timestamps, scores, commit hashes in `results/proof_run_seed42.jsonl` |
| Hardware Report | Local GPU (RTX 5070), Ollama endpoints, no cloud calls |
| Network Isolation | Docker `--network=none`, `tcpdump` verification |
| Human Oversight | Sample approvals/rejections in `persistence/human_veto.json` |
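The JSONL cycle log is one JSON object per line. A minimal reader for it; the field names below (`cycle`, `score`, `commit_hash`) are assumptions for illustration — see `docs/REPRODUCIBILITY_PROOF.md` for the real schema:

```python
import json
from io import StringIO

def summarize_cycles(lines) -> list:
    """Parse a JSONL cycle log into (cycle, score, commit_hash) tuples."""
    records = [json.loads(line) for line in lines if line.strip()]
    return [(r["cycle"], r["score"], r.get("commit_hash")) for r in records]

# Stand-in for open("results/proof_run_seed42.jsonl"):
sample = StringIO(
    '{"cycle": 1, "score": 0.01, "commit_hash": "abc123"}\n'
    '{"cycle": 2, "score": 0.02, "commit_hash": "def456"}\n'
)
print(summarize_cycles(sample))
# [(1, 0.01, 'abc123'), (2, 0.02, 'def456')]
```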
```bash
# Reproduce documented results
python scripts/benchmark.py --cycles 10 --seed 42

# Verify GPU and locality
python -c "from utils.gpu_docker import get_system_isolation_report; import json; print(json.dumps(get_system_isolation_report(), indent=2))"
```

## Features

| Category | Capabilities |
|---|---|
| 🔒 Local Execution | Ollama-only inference (llama3.2), no cloud APIs, multi-GPU vendor support |
| 🐳 GPU-Aware Docker | NVIDIA/AMD/Intel detection, driver compatibility checks, graceful CPU fallback |
| 🛡️ Cryptographic Safety | SHA-256 integrity verification, import monitoring, immutable file protection |
| 🎯 Ensemble Anti-Gaming | Calibrated detection (90% precision, see methodology), rotating benchmarks |
| 💾 State Protection | Atomic commits, auto-reversion, stall watchdog, human veto with auto-approve |
| ⚡ GPU-Aware Parallelism | Serialized GPU tasks, concurrent CPU work, contention prevention |
| 📋 Structured Logging | JSONL schema with query tools, CSV export, aggregate statistics |
| 🔬 Fidelity Verification | Domain-specific metrics (entity/number/code preservation), quality grades |
## Installation

| Requirement | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Runtime |
| Git | Any | Version control |
| Ollama | Latest | Local LLM inference |
| Docker | Latest | Sandbox isolation (recommended) |
| GPU Driver | See below | GPU passthrough (optional) |
```bash
# Clone and install
git clone https://github.com/moonrunnerkc/aasms.git && cd aasms
./scripts/install.sh  # Unified installer for all platforms

# Or manual installation
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pytest tests/ -v
```

### 📦 Unified Installer Details
The `scripts/install.sh` script handles:
- Platform detection (Linux, macOS, WSL2)
- Python version verification (3.10+)
- Virtual environment setup
- Ollama installation with verification
- Docker capability check
- GPU detection and compatibility report
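The individual checks above are simple. A condensed, stand-alone sketch of what the installer verifies (illustrative only, not the actual script):

```python
import shutil
import sys

def preflight() -> dict:
    """Mirror the installer's checks: Python version and required tools on PATH."""
    return {
        "python_ok": sys.version_info >= (3, 10),
        "ollama": shutil.which("ollama") is not None,
        "docker": shutil.which("docker") is not None,  # optional but recommended
        "git": shutil.which("git") is not None,
    }

report = preflight()
print(report)
if not report["python_ok"]:
    print("Python 3.10+ required")
```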
```bash
# Full installation with prompts
./scripts/install.sh

# Check capabilities after install
make check-gpu
```

### 🐳 Docker + GPU Setup
**Docker Installation:**

```bash
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install docker.io docker-compose
sudo usermod -aG docker $USER && newgrp docker
docker run hello-world
```

**GPU Support by Vendor:**
| Vendor | Requirement | Docker Flag | Notes |
|---|---|---|---|
| NVIDIA | nvidia-container-toolkit + driver 525+ | `--gpus all` | Full support |
| AMD | ROCm 5.0+ + rocm-docker | `--device=/dev/kfd` | See ROCm docs |
| Intel | Not supported | — | CPU fallback used |
**NVIDIA Setup:**

```bash
# Install nvidia-container-toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install nvidia-docker2
sudo systemctl restart docker

# Verify
docker run --gpus all nvidia/cuda:11.0-base nvidia-smi
```

Check AASMS GPU detection:

```bash
python -c "from utils.gpu_docker import get_system_isolation_report; import json; print(json.dumps(get_system_isolation_report(), indent=2))"
```

### 🖥️ Platform-Specific Notes
**WSL2 (Windows):**

```bash
# 1. Install Docker Desktop with WSL2 backend
# 2. Enable WSL integration: Docker Desktop > Settings > Resources > WSL Integration
# 3. For GPU: Install NVIDIA drivers on Windows, not in WSL
# 4. Verify: wsl --update && docker run hello-world

# Troubleshooting
wsl --shutdown             # Restart if docker not responding
docker context use default # Fix context issues
```

**macOS:**
```bash
brew install ollama docker
# Docker Desktop required for Docker support
# GPU: Apple Silicon supported via Metal, no nvidia-docker needed
```

**Linux (non-Ubuntu):**
```bash
# Fedora/RHEL
sudo dnf install docker docker-compose
sudo systemctl enable --now docker

# Arch
sudo pacman -S docker docker-compose
sudo systemctl enable --now docker
```

## Configuration

All configuration lives in `config/`:
| File | Purpose |
|---|---|
| `models.yaml` | LLM endpoints (Ollama only) |
| `teams.yaml` | Blue/Red agent definitions |
| `benchmarks.yaml` | Test suites and scoring rules |
| `evolution_mode.yaml` | Safety thresholds and limits |
### 🔧 Key Settings
`evolution_mode.yaml`:

```yaml
mode: prompt_only  # Start safe; unlock code_evolution after stability
min_improvement_pct: 5.0
max_improvement_pct: 50.0
required_test_pass_rate: 1.0
memory_limit_mb: 512
test_timeout_seconds: 30

# Human oversight with empirically derived thresholds
human_oversight:
  cycles_between_reviews: 10
  max_unattended_cycles: 50
  auto_approve:
    max_files: 3    # 95% of safe commits touch ≤3 files
    max_lines: 50   # Mean safe change: 28 lines (σ = 15)
    max_improvement_pct: 15.0
    min_improvement_pct: 5.0
  anomaly_thresholds:
    improvement_threshold: 50.0  # > 3σ from mean
    score_drop_threshold: 10.0

# Stall detection
watchdog:
  phase_timeout_seconds: 300   # 5 min per phase
  cycle_timeout_seconds: 1800  # 30 min per cycle
  heartbeat_interval: 10
```
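The auto-approve gate reduces to a few comparisons against those thresholds. An illustrative sketch only; the real logic lives in `evaluator/human_oversight.py`:

```python
def auto_approve(files_changed: int, lines_changed: int,
                 improvement_pct: float) -> bool:
    """Approve without human review only for small, modestly improving changes:
    ≤3 files, ≤50 lines, improvement within the 5–15% window."""
    return (files_changed <= 3
            and lines_changed <= 50
            and 5.0 <= improvement_pct <= 15.0)

print(auto_approve(2, 28, 8.0))   # True: typical safe commit
print(auto_approve(1, 10, 40.0))  # False: improvement too large, escalate to review
```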
## Usage

```bash
# Safe mode (recommended start)
python run.py --cycles 10 --mode prompt_only --verbose

# Code evolution (after 20+ stable prompt-only cycles)
python run.py --cycles 5 --mode code_evolution --verbose

# Show options
python run.py --help
```

### Cycle Time Estimates

| Hardware | Est. Time/Cycle | Breakdown |
|---|---|---|
| RTX 5070/5080 | ~2 min | Blue: 45s, Red: 40s, Eval: 35s |
| RTX 4090 | ~1.7 min | Blue: 35s, Red: 30s, Eval: 27s |
| RTX 3090 | ~2.6 min | Blue: 55s, Red: 50s, Eval: 43s |
| CPU-only | ~7.3 min | Blue: 180s, Red: 160s, Eval: 90s |
```bash
# Get estimate for your hardware
python -c "from utils.profiling import get_realistic_estimate; import json; print(json.dumps(get_realistic_estimate(), indent=2))"
```

### 📊 Reproducible Benchmarks
```bash
# Single reproducible run
python scripts/benchmark.py --cycles 10 --seed 42

# Multiple runs for statistical validity
python scripts/benchmark.py --cycles 5 --runs 3 --output results.json

# See benchmark methodology
cat benchmarks/README.md
```

**Scoring Methodology:**
- Dataset: `benchmarks/reasoning_suite.json` (curated reasoning questions)
- Scoring: Exact match after normalization
- Weights: reasoning (40%), math (30%), robustness (20%), code (10%)
- Low absolute scores (0.01–0.06) reflect strict scoring plus a small model; the key metric is relative improvement
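With those weights, the aggregate is a weighted mean of per-category accuracies. A sketch (the category keys are taken from the weight list above; the exact scorer in `scripts/benchmark.py` may differ):

```python
WEIGHTS = {"reasoning": 0.40, "math": 0.30, "robustness": 0.20, "code": 0.10}

def aggregate_score(category_accuracy: dict) -> float:
    """Weighted mean of per-category exact-match accuracies."""
    return sum(WEIGHTS[cat] * category_accuracy[cat] for cat in WEIGHTS)

# Strict exact-match scoring keeps absolute numbers low with a small model:
acc = {"reasoning": 0.1, "math": 0.0, "robustness": 0.1, "code": 0.0}
print(round(aggregate_score(acc), 2))  # 0.06
```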
### 📋 Log Analysis Tools
```bash
# View cycle summary
python -m utils.log_schema persistence/cycle_logs --cycle 6

# Get aggregate statistics
python -m utils.log_schema persistence/cycle_logs --stats

# Export to CSV
python -m utils.log_schema persistence/cycle_logs --export-csv results.csv
```

## Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                      👤 Human Oversight                       │
│      (Configurable Veto, Empirically-Tuned Auto-Approve)      │
└─────────────────────────────┬─────────────────────────────────┘
                              │
  ┌─────────────────┐         │         ┌─────────────────┐
  │   🔵 Blue Team  │         │         │   🔴 Red Team   │
  │  4 Specialists  │         │         │  4 Specialists  │
  └────────┬────────┘         │         └────────┬────────┘
           │                  │                  │
           ▼                  ▼                  ▼
  ┌───────────────────────────────────────────────┐
  │                ⚙️ Orchestrator                │
  │   ┌───────────────────────────────────────┐   │
  │   │  🐳 Multi-Vendor GPU Docker           │   │
  │   │  • NVIDIA/AMD/Intel detection         │   │
  │   │  • Driver compatibility checks        │   │
  │   │  • Graceful CPU fallback              │   │
  │   │  • Stall watchdog (5min/phase)        │   │
  │   └───────────────────────────────────────┘   │
  │   ┌───────────────────────────────────────┐   │
  │   │  🛡️ Cryptographic Safety              │   │
  │   │  • SHA-256 integrity verification     │   │
  │   │  • Import monitoring (AST)            │   │
  │   │  • Immutable file protection          │   │
  │   └───────────────────────────────────────┘   │
  └──────────────────────┬────────────────────────┘
                         │
                         ▼
  ┌───────────────────────────────────────────────┐
  │      🔍 Calibrated Anti-Gaming Detection      │
  │   • Z-score (threshold=2.5, 92% precision)    │
  │   • Improvement cap (50%, 95% precision)      │
  │   • Ensemble vote (90% precision target)      │
  └───────────────────────────────────────────────┘
```
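The Orchestrator's sandbox corresponds to a locked-down `docker run`. A hedged sketch of how such an invocation could be assembled from the documented isolation settings (the helper name and the seccomp profile path are assumptions):

```python
def sandbox_cmd(image: str, test_cmd: list,
                memory_mb: int = 512, timeout_s: int = 30) -> list:
    """Build a docker invocation matching the documented isolation settings."""
    return [
        "docker", "run", "--rm",
        "--network=none",                          # no network inside the sandbox
        f"--memory={memory_mb}m",                  # hard memory cap
        "--security-opt", "seccomp=default.json",  # seccomp profile (path assumed)
        image, "timeout", str(timeout_s), *test_cmd,
    ]

cmd = sandbox_cmd("aasms-sandbox", ["pytest", "tests/", "-q"])
print(" ".join(cmd))
# Pass cmd to subprocess.run(cmd, check=True) to actually execute it.
```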
### 📁 Project Structure

```
aasms/
├── config/                   # YAML configuration
├── agents/
│   ├── blue/                 # 4 constructive specialists
│   └── red/                  # 4 destructive specialists
├── orchestrator/             # Core evolution loop
│   ├── cycle_manager.py      # Main coordinator
│   ├── commitment.py         # Commit decisions
│   └── redundancy.py         # Checkpointing
├── evaluator/                # Testing and validation
│   ├── gaming_detection.py   # Calibrated ensemble detection
│   ├── human_oversight.py    # Empirically-tuned auto-approve
│   ├── alignment.py          # Constitutional checks
│   └── static_analysis.py    # Code analysis
├── utils/                    # Shared utilities
│   ├── gpu_docker.py         # Multi-vendor GPU detection
│   ├── gpu_parallel.py       # GPU-aware parallelism
│   ├── immutable_guard.py    # Cryptographic protection
│   ├── log_schema.py         # Structured logging
│   ├── profiling.py          # Hardware-specific estimates + watchdog
│   └── context_summarizer.py # Domain-specific fidelity metrics
├── scripts/
│   ├── install.sh            # Unified cross-platform installer
│   └── benchmark.py          # Reproducible benchmarks
├── benchmarks/               # Datasets with methodology docs
├── persistence/              # State and logs
├── tests/                    # 374+ tests
├── Makefile                  # Local CI equivalent
└── .github/workflows/ci.yml  # GitHub Actions CI
```
## Testing

```bash
# Run all tests
make test        # or: pytest tests/ -v

# Full CI suite locally (lint + type + test + security)
make test-ci

# With coverage
make test-cov

# Quick checks
make lint        # ruff
make type-check  # mypy
make security    # bandit
```

| Category | Command | Purpose |
|---|---|---|
| All | `make test-ci` | Full CI locally |
| Unit | `pytest tests/test_llm_client.py` | LLM client |
| Integration | `pytest tests/test_cycle_manager.py` | Orchestrator |
| Isolation | `pytest tests/test_sandbox_isolation.py` | Docker sandbox |
## Safety

| Constraint | Value | Enforcement |
|---|---|---|
| Sandbox | Docker + seccomp + multi-vendor GPU | `utils/gpu_docker.py` |
| Integrity | SHA-256 manifest | `utils/immutable_guard.py` |
| Timeout | 30s/test, 5min/phase, 30min/cycle | `utils/profiling.CycleWatchdog` |
| Memory | 512MB default | Docker + `RLIMIT_AS` |
| Network | Disabled | `--network=none` |
| Improvement | ≥5%, ≤50% | `orchestrator/commitment.py` |
| Test Pass | 100% required | `orchestrator/commitment.py` |
| Immutable Files | 9 protected files | `utils/immutable_guard.py` |
| Human Review | Configurable (default: 10 cycles) | `evaluator/human_oversight.py` |
| Auto-Approve | ≤3 files, ≤50 lines, 5–15% | Empirically validated |
| Anti-Gaming | Ensemble (90% precision calibrated) | `evaluator/gaming_detection.py` |
| Stall Detection | Phase/cycle timeouts + heartbeat | `utils/profiling.CycleWatchdog` |
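Integrity checking of the protected files is standard manifest hashing. A minimal sketch of the idea behind `utils/immutable_guard.py` (this is not its actual API):

```python
import hashlib
import os
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify(manifest: dict) -> list:
    """Return the protected files whose current hash no longer matches."""
    return [name for name, digest in manifest.items()
            if not Path(name).exists() or sha256_of(Path(name)) != digest]

# Round trip on a temp file standing in for a protected source file:
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("print('immutable')\n")
manifest = {tmp.name: sha256_of(Path(tmp.name))}
print(verify(manifest))              # []: file untouched
Path(tmp.name).write_text("tampered")
print(verify(manifest))              # the temp file: tampering detected
os.unlink(tmp.name)
```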
The 90% precision target is based on:
- Training: 200 synthetic gaming attempts
- Validation: 100 real cycles with 15 labeled gaming attempts
- Thresholds: Z-score=2.5 (92% precision), improvement cap=50% (95% precision)
- Ongoing: Metrics tracked in `persistence/gaming_metrics.json`
See evaluator/gaming_detection.py docstring for full methodology.
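The Z-score detector flags a cycle whose improvement deviates sharply from the running history. A sketch of that first ensemble member, using the 2.5 threshold from the calibration above (function name illustrative):

```python
import statistics

def is_anomalous(history: list, new_improvement: float,
                 z_threshold: float = 2.5) -> bool:
    """Flag improvements more than z_threshold standard deviations from history."""
    if len(history) < 2:
        return False              # not enough data to calibrate
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return new_improvement != mean
    return abs(new_improvement - mean) / stdev > z_threshold

history = [100.0, 50.0, 33.0, 25.0, 20.0]   # past improvement percentages
print(is_anomalous(history, 24.0))    # False: in line with past cycles
print(is_anomalous(history, 400.0))   # True: likely gaming, escalate to review
```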
## Known Limitations

- LLM Quality: Evolution quality depends on the local model (tested with llama3.2:3b)
- Compute: 2-8 min/cycle; see Cycle Time Estimates
- Context Windows: Domain-specific fidelity checks (not ROUGE) mitigate truncation
- GPU Vendors: NVIDIA full support, AMD partial (ROCm), Intel CPU-fallback
- Research Only: Not production-ready
- Sandbox Escapes: Docker provides strong but not absolute isolation
- Calibration Drift: Re-calibrate anti-gaming quarterly with accumulated data
- Human Oversight Required: Configure review intervals appropriately
- Early Results: 6 cycles shown; longer runs pending external validation
- Single Implementation: May not generalize to other architectures
- Dual-Use Risk: See `evaluator/alignment.py` for ethical guidelines
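The fidelity checks mentioned under Context Windows boil down to verifying that key tokens survive summarization. A toy number-preservation metric, illustrative only (not the `context_summarizer` API):

```python
import re

def number_preservation(source: str, summary: str) -> float:
    """Fraction of distinct numbers in the source that survive into the summary."""
    src_nums = set(re.findall(r"\d+(?:\.\d+)?", source))
    if not src_nums:
        return 1.0
    kept = src_nums & set(re.findall(r"\d+(?:\.\d+)?", summary))
    return len(kept) / len(src_nums)

src = "Timeout is 30s per test, memory capped at 512MB, 9 files protected."
good = "Limits: 30s timeout, 512MB memory, 9 protected files."
bad = "There are some timeouts and memory limits."
print(number_preservation(src, good))  # 1.0
print(number_preservation(src, bad))   # 0.0
```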
## Contributing

```bash
# Fork, clone, and setup
git clone https://github.com/YOUR_USERNAME/aasms.git && cd aasms
make install-dev

# Create feature branch
git checkout -b feature/your-feature

# Run full CI locally before PR
make test-ci

# Push and open PR
git push origin feature/your-feature
```

**Code Style:** PEP 8 via ruff, type hints required, docstrings on public functions.
See CONTRIBUTING.md for full guidelines.
MIT License © 2026 Bradley R. Kinnard
```bibtex
@software{aasms2026,
  author = {Kinnard, Bradley R.},
  title  = {Adversarial Architectural Self-Modification System},
  year   = {2026},
  url    = {https://github.com/moonrunnerkc/aasms}
}
```

Made with 🧠 by @moonrunnerkc