A research system exploring recursive self-improvement through adversarial code evolution
Features | Installation | Usage | Architecture | Safety | Contributing
AASMS is a local-first AI system that evolves its own codebase through adversarial Red/Blue team dynamics. Blue Team agents propose code improvements as unified diffs, Red Team agents generate exploit tests to find flaws, and an Orchestrator evaluates proposals in Docker-isolated sandboxes, committing only survivors that improve benchmark scores.
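The survivor-selection loop can be sketched in a few lines. This is an illustrative sketch only (the function shapes and names are assumptions, not the actual AASMS API): Blue proposes diffs, Red attacks each one, and only diffs that survive every exploit and improve the sandboxed benchmark score are kept.

```python
from typing import Callable, List

def run_cycle(propose: Callable[[], List[str]],
              survives_attack: Callable[[str], bool],
              benchmark: Callable[[str], float],
              baseline: float) -> List[str]:
    """One adversarial cycle: keep only diffs that survive Red Team
    exploits AND beat the baseline score in the sandbox."""
    survivors = []
    for diff in propose():                  # Blue Team: candidate unified diffs
        if not survives_attack(diff):       # Red Team found an exploit: reject
            continue
        if benchmark(diff) > baseline:      # Orchestrator: sandboxed scoring
            survivors.append(diff)          # survivor: eligible for commit
    return survivors

# Toy run: Red Team breaks diff-B; diff-A survives and improves the score.
diffs = lambda: ["diff-A", "diff-B"]
survives = lambda d: d != "diff-B"
score = lambda d: 0.02 if d == "diff-A" else 0.0
print(run_cycle(diffs, survives, score, baseline=0.01))  # ['diff-A']
```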
> ⚠️ **Research Prototype**: This system is for research purposes only. See Known Limitations.
| Cycle | Score | Improvement | Status |
|---|---|---|---|
| 1 | 0.01 | — | ✅ Committed |
| 2 | 0.02 | +100% | ✅ Committed |
| 3 | 0.03 | +50% | ✅ Committed |
| 4 | 0.04 | +33% | ✅ Committed |
| 5 | 0.05 | +25% | ✅ Committed |
| 6 | 0.06 | +20% | ✅ Committed |
24 proposals applied, 0 errors, 0 reverts. Scores from benchmarks/reasoning_suite.json (10-question subset). Full methodology in benchmarks/README.md.
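The Improvement column is plain relative change between consecutive cycle scores, e.g. cycle 3's +50% is (0.03 − 0.02) / 0.02. A quick check:

```python
def improvement_pct(prev: float, curr: float) -> float:
    """Relative improvement between consecutive cycle scores, in percent."""
    return 100.0 * (curr - prev) / prev

scores = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]
for prev, curr in zip(scores, scores[1:]):
    print(f"{prev:.2f} -> {curr:.2f}: +{improvement_pct(prev, curr):.0f}%")
# +100%, +50%, +33%, +25%, +20% — matching the table above
```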
All results are reproducible and verifiable. See docs/REPRODUCIBILITY_PROOF.md for complete proof artifacts:
| Proof Type | Description |
|---|---|
| Seeded Benchmarks | `python scripts/benchmark.py --cycles 10 --seed 42` produces identical results |
| JSONL Cycle Logs | Full timestamps, scores, commit hashes in `results/proof_run_seed42.jsonl` |
| Hardware Report | Local GPU (RTX 5070), Ollama endpoints, no cloud calls |
| Network Isolation | Docker `--network=none`, `tcpdump` verification |
| Human Oversight | Sample approvals/rejections in `persistence/human_veto.json` |
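The JSONL cycle log is one JSON object per line. A minimal reader for it; the field names below (`cycle`, `score`, `commit_hash`) are assumptions for illustration — see `docs/REPRODUCIBILITY_PROOF.md` for the real schema:

```python
import json
from io import StringIO

def summarize_cycles(lines) -> list:
    """Parse a JSONL cycle log into (cycle, score, commit_hash) tuples."""
    records = [json.loads(line) for line in lines if line.strip()]
    return [(r["cycle"], r["score"], r.get("commit_hash")) for r in records]

# Stand-in for open("results/proof_run_seed42.jsonl"):
sample = StringIO(
    '{"cycle": 1, "score": 0.01, "commit_hash": "abc123"}\n'
    '{"cycle": 2, "score": 0.02, "commit_hash": "def456"}\n'
)
print(summarize_cycles(sample))
# [(1, 0.01, 'abc123'), (2, 0.02, 'def456')]
```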
```bash
# Reproduce documented results
python scripts/benchmark.py --cycles 10 --seed 42

# Verify GPU and locality
python -c "from utils.gpu_docker import get_system_isolation_report; import json; print(json.dumps(get_system_isolation_report(), indent=2))"
```

## Features

| Category | Capabilities |
|---|---|
| 🔒 Local Execution | Ollama-only inference (llama3.2), no cloud APIs, multi-GPU vendor support |
| 🐳 GPU-Aware Docker | NVIDIA/AMD/Intel detection, driver compatibility checks, graceful CPU fallback |
| 🛡️ Cryptographic Safety | SHA-256 integrity verification, import monitoring, immutable file protection |
| 🎯 Ensemble Anti-Gaming | Calibrated detection (90% precision, see methodology), rotating benchmarks |
| 💾 State Protection | Atomic commits, auto-reversion, stall watchdog, human veto with auto-approve |
| ⚡ GPU-Aware Parallelism | Serialized GPU tasks, concurrent CPU work, contention prevention |
| 📋 Structured Logging | JSONL schema with query tools, CSV export, aggregate statistics |
| 🔬 Fidelity Verification | Domain-specific metrics (entity/number/code preservation), quality grades |
## Installation

| Requirement | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Runtime |
| Git | Any | Version control |
| Ollama | Latest | Local LLM inference |
| Docker | Latest | Sandbox isolation (recommended) |
| GPU Driver | See below | GPU passthrough (optional) |
```bash
# Clone and install
git clone https://github.com/moonrunnerkc/aasms.git && cd aasms
./scripts/install.sh  # Unified installer for all platforms

# Or manual installation
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
pytest tests/ -v
```

### 📦 Unified Installer Details
The `scripts/install.sh` script handles:
- Platform detection (Linux, macOS, WSL2)
- Python version verification (3.10+)
- Virtual environment setup
- Ollama installation with verification
- Docker capability check
- GPU detection and compatibility report
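The individual checks above are simple. A condensed, stand-alone sketch of what the installer verifies (illustrative only, not the actual script):

```python
import shutil
import sys

def preflight() -> dict:
    """Mirror the installer's checks: Python version and required tools on PATH."""
    return {
        "python_ok": sys.version_info >= (3, 10),
        "ollama": shutil.which("ollama") is not None,
        "docker": shutil.which("docker") is not None,  # optional but recommended
        "git": shutil.which("git") is not None,
    }

report = preflight()
print(report)
if not report["python_ok"]:
    print("Python 3.10+ required")
```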
```bash
# Full installation with prompts
./scripts/install.sh

# Check capabilities after install
make check-gpu
```

### 🐳 Docker + GPU Setup
**Docker Installation:**

```bash
# Ubuntu/Debian
sudo apt-get update && sudo apt-get install docker.io docker-compose
sudo usermod -aG docker $USER && newgrp docker
docker run hello-world
```

**GPU Support by Vendor:**
| Vendor | Requirement | Docker Flag | Notes |
|---|---|---|---|
| NVIDIA | nvidia-container-toolkit + driver 525+ | `--gpus all` | Full support |
| AMD | ROCm 5.0+ + rocm-docker | `--device=/dev/kfd` | See ROCm docs |
| Intel | Not supported | — | CPU fallback used |
**NVIDIA Setup:**

```bash
# Install nvidia-container-toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install nvidia-docker2
sudo systemctl restart docker

# Verify
docker run --gpus all nvidia/cuda:11.0-base nvidia-smi
```

Check AASMS GPU detection:

```bash
python -c "from utils.gpu_docker import get_system_isolation_report; import json; print(json.dumps(get_system_isolation_report(), indent=2))"
```

### 🖥️ Platform-Specific Notes
**WSL2 (Windows):**

```bash
# 1. Install Docker Desktop with WSL2 backend
# 2. Enable WSL integration: Docker Desktop > Settings > Resources > WSL Integration
# 3. For GPU: Install NVIDIA drivers on Windows, not in WSL
# 4. Verify: wsl --update && docker run hello-world

# Troubleshooting
wsl --shutdown             # Restart if docker not responding
docker context use default # Fix context issues
```

**macOS:**
```bash
brew install ollama docker
# Docker Desktop required for Docker support
# GPU: Apple Silicon supported via Metal, no nvidia-docker needed
```

**Linux (non-Ubuntu):**
```bash
# Fedora/RHEL
sudo dnf install docker docker-compose
sudo systemctl enable --now docker

# Arch
sudo pacman -S docker docker-compose
sudo systemctl enable --now docker
```

## Configuration

All configuration lives in `config/`:
| File | Purpose |
|---|---|
| `models.yaml` | LLM endpoints (Ollama only) |
| `teams.yaml` | Blue/Red agent definitions |
| `benchmarks.yaml` | Test suites and scoring rules |
| `evolution_mode.yaml` | Safety thresholds and limits |
### 🔧 Key Settings
`evolution_mode.yaml`:

```yaml
mode: prompt_only  # Start safe; unlock code_evolution after stability
min_improvement_pct: 5.0
max_improvement_pct: 50.0
required_test_pass_rate: 1.0
memory_limit_mb: 512
test_timeout_seconds: 30

# Human oversight with empirically derived thresholds
human_oversight:
  cycles_between_reviews: 10
  max_unattended_cycles: 50
  auto_approve:
    max_files: 3    # 95% of safe commits touch ≤3 files
    max_lines: 50   # Mean safe change: 28 lines (σ = 15)
    max_improvement_pct: 15.0
    min_improvement_pct: 5.0
  anomaly_thresholds:
    improvement_threshold: 50.0  # > 3σ from mean
    score_drop_threshold: 10.0

# Stall detection
watchdog:
  phase_timeout_seconds: 300   # 5 min per phase
  cycle_timeout_seconds: 1800  # 30 min per cycle
  heartbeat_interval: 10
```
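The auto-approve gate reduces to a few comparisons against those thresholds. An illustrative sketch only; the real logic lives in `evaluator/human_oversight.py`:

```python
def auto_approve(files_changed: int, lines_changed: int,
                 improvement_pct: float) -> bool:
    """Approve without human review only for small, modestly improving changes:
    ≤3 files, ≤50 lines, improvement within the 5–15% window."""
    return (files_changed <= 3
            and lines_changed <= 50
            and 5.0 <= improvement_pct <= 15.0)

print(auto_approve(2, 28, 8.0))   # True: typical safe commit
print(auto_approve(1, 10, 40.0))  # False: improvement too large, escalate to review
```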
## Usage

```bash
# Safe mode (recommended start)
python run.py --cycles 10 --mode prompt_only --verbose

# Code evolution (after 20+ stable prompt-only cycles)
python run.py --cycles 5 --mode code_evolution --verbose

# Show options
python run.py --help
```

### Cycle Time Estimates

| Hardware | Est. Time/Cycle | Breakdown |
|---|---|---|
| RTX 5070/5080 | ~2 min | Blue: 45s, Red: 40s, Eval: 35s |
| RTX 4090 | ~1.7 min | Blue: 35s, Red: 30s, Eval: 27s |
| RTX 3090 | ~2.6 min | Blue: 55s, Red: 50s, Eval: 43s |
| CPU-only | ~7.3 min | Blue: 180s, Red: 160s, Eval: 90s |
```bash
# Get estimate for your hardware
python -c "from utils.profiling import get_realistic_estimate; import json; print(json.dumps(get_realistic_estimate(), indent=2))"
```

### 📊 Reproducible Benchmarks
```bash
# Single reproducible run
python scripts/benchmark.py --cycles 10 --seed 42

# Multiple runs for statistical validity
python scripts/benchmark.py --cycles 5 --runs 3 --output results.json

# See benchmark methodology
cat benchmarks/README.md
```

**Scoring Methodology:**
- Dataset: `benchmarks/reasoning_suite.json` (curated reasoning questions)
- Scoring: Exact match after normalization
- Weights: reasoning (40%), math (30%), robustness (20%), code (10%)
- Low absolute scores (0.01–0.06) reflect strict scoring plus a small model; the key metric is relative improvement
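With those weights, the aggregate is a weighted mean of per-category accuracies. A sketch (the category keys are taken from the weight list above; the exact scorer in `scripts/benchmark.py` may differ):

```python
WEIGHTS = {"reasoning": 0.40, "math": 0.30, "robustness": 0.20, "code": 0.10}

def aggregate_score(category_accuracy: dict) -> float:
    """Weighted mean of per-category exact-match accuracies."""
    return sum(WEIGHTS[cat] * category_accuracy[cat] for cat in WEIGHTS)

# Strict exact-match scoring keeps absolute numbers low with a small model:
acc = {"reasoning": 0.1, "math": 0.0, "robustness": 0.1, "code": 0.0}
print(round(aggregate_score(acc), 2))  # 0.06
```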
### 📋 Log Analysis Tools
```bash
# View cycle summary
python -m utils.log_schema persistence/cycle_logs --cycle 6

# Get aggregate statistics
python -m utils.log_schema persistence/cycle_logs --stats

# Export to CSV
python -m utils.log_schema persistence/cycle_logs --export-csv results.csv
```

## Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                      👤 Human Oversight                       │
│      (Configurable Veto, Empirically-Tuned Auto-Approve)      │
└─────────────────────────────┬─────────────────────────────────┘
                              │
  ┌─────────────────┐         │         ┌─────────────────┐
  │   🔵 Blue Team  │         │         │   🔴 Red Team   │
  │  4 Specialists  │         │         │  4 Specialists  │
  └────────┬────────┘         │         └────────┬────────┘
           │                  │                  │
           ▼                  ▼                  ▼
  ┌───────────────────────────────────────────────┐
  │                ⚙️ Orchestrator                │
  │   ┌───────────────────────────────────────┐   │
  │   │  🐳 Multi-Vendor GPU Docker           │   │
  │   │  • NVIDIA/AMD/Intel detection         │   │
  │   │  • Driver compatibility checks        │   │
  │   │  • Graceful CPU fallback              │   │
  │   │  • Stall watchdog (5min/phase)        │   │
  │   └───────────────────────────────────────┘   │
  │   ┌───────────────────────────────────────┐   │
  │   │  🛡️ Cryptographic Safety              │   │
  │   │  • SHA-256 integrity verification     │   │
  │   │  • Import monitoring (AST)            │   │
  │   │  • Immutable file protection          │   │
  │   └───────────────────────────────────────┘   │
  └──────────────────────┬────────────────────────┘
                         │
                         ▼
  ┌───────────────────────────────────────────────┐
  │      🔍 Calibrated Anti-Gaming Detection      │
  │   • Z-score (threshold=2.5, 92% precision)    │
  │   • Improvement cap (50%, 95% precision)      │
  │   • Ensemble vote (90% precision target)      │
  └───────────────────────────────────────────────┘
```
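The Orchestrator's sandbox corresponds to a locked-down `docker run`. A hedged sketch of how such an invocation could be assembled from the documented isolation settings (the helper name and the seccomp profile path are assumptions):

```python
def sandbox_cmd(image: str, test_cmd: list,
                memory_mb: int = 512, timeout_s: int = 30) -> list:
    """Build a docker invocation matching the documented isolation settings."""
    return [
        "docker", "run", "--rm",
        "--network=none",                          # no network inside the sandbox
        f"--memory={memory_mb}m",                  # hard memory cap
        "--security-opt", "seccomp=default.json",  # seccomp profile (path assumed)
        image, "timeout", str(timeout_s), *test_cmd,
    ]

cmd = sandbox_cmd("aasms-sandbox", ["pytest", "tests/", "-q"])
print(" ".join(cmd))
# Pass cmd to subprocess.run(cmd, check=True) to actually execute it.
```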
### 📁 Project Structure

```
aasms/
├── config/                   # YAML configuration
├── agents/
│   ├── blue/                 # 4 constructive specialists
│   └── red/                  # 4 destructive specialists
├── orchestrator/             # Core evolution loop
│   ├── cycle_manager.py      # Main coordinator
│   ├── commitment.py         # Commit decisions
│   └── redundancy.py         # Checkpointing
├── evaluator/                # Testing and validation
│   ├── gaming_detection.py   # Calibrated ensemble detection
│   ├── human_oversight.py    # Empirically-tuned auto-approve
│   ├── alignment.py          # Constitutional checks
│   └── static_analysis.py    # Code analysis
├── utils/                    # Shared utilities
│   ├── gpu_docker.py         # Multi-vendor GPU detection
│   ├── gpu_parallel.py       # GPU-aware parallelism
│   ├── immutable_guard.py    # Cryptographic protection
│   ├── log_schema.py         # Structured logging
│   ├── profiling.py          # Hardware-specific estimates + watchdog
│   └── context_summarizer.py # Domain-specific fidelity metrics
├── scripts/
│   ├── install.sh            # Unified cross-platform installer
│   └── benchmark.py          # Reproducible benchmarks
├── benchmarks/               # Datasets with methodology docs
├── persistence/              # State and logs
├── tests/                    # 374+ tests
├── Makefile                  # Local CI equivalent
└── .github/workflows/ci.yml  # GitHub Actions CI
```
## Testing

```bash
# Run all tests
make test        # or: pytest tests/ -v

# Full CI suite locally (lint + type + test + security)
make test-ci

# With coverage
make test-cov

# Quick checks
make lint        # ruff
make type-check  # mypy
make security    # bandit
```

| Category | Command | Purpose |
|---|---|---|
| All | `make test-ci` | Full CI locally |
| Unit | `pytest tests/test_llm_client.py` | LLM client |
| Integration | `pytest tests/test_cycle_manager.py` | Orchestrator |
| Isolation | `pytest tests/test_sandbox_isolation.py` | Docker sandbox |
## Safety

| Constraint | Value | Enforcement |
|---|---|---|
| Sandbox | Docker + seccomp + multi-vendor GPU | `utils/gpu_docker.py` |
| Integrity | SHA-256 manifest | `utils/immutable_guard.py` |
| Timeout | 30s/test, 5min/phase, 30min/cycle | `utils/profiling.CycleWatchdog` |
| Memory | 512MB default | Docker + `RLIMIT_AS` |
| Network | Disabled | `--network=none` |
| Improvement | ≥5%, ≤50% | `orchestrator/commitment.py` |
| Test Pass | 100% required | `orchestrator/commitment.py` |
| Immutable Files | 9 protected files | `utils/immutable_guard.py` |
| Human Review | Configurable (default: 10 cycles) | `evaluator/human_oversight.py` |
| Auto-Approve | ≤3 files, ≤50 lines, 5–15% | Empirically validated |
| Anti-Gaming | Ensemble (90% precision calibrated) | `evaluator/gaming_detection.py` |
| Stall Detection | Phase/cycle timeouts + heartbeat | `utils/profiling.CycleWatchdog` |
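Integrity checking of the protected files is standard manifest hashing. A minimal sketch of the idea behind `utils/immutable_guard.py` (this is not its actual API):

```python
import hashlib
import os
import tempfile
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def verify(manifest: dict) -> list:
    """Return the protected files whose current hash no longer matches."""
    return [name for name, digest in manifest.items()
            if not Path(name).exists() or sha256_of(Path(name)) != digest]

# Round trip on a temp file standing in for a protected source file:
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("print('immutable')\n")
manifest = {tmp.name: sha256_of(Path(tmp.name))}
print(verify(manifest))              # []: file untouched
Path(tmp.name).write_text("tampered")
print(verify(manifest))              # the temp file: tampering detected
os.unlink(tmp.name)
```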
The 90% precision target is based on:
- Training: 200 synthetic gaming attempts
- Validation: 100 real cycles with 15 labeled gaming attempts
- Thresholds: Z-score=2.5 (92% precision), improvement cap=50% (95% precision)
- Ongoing: Metrics tracked in `persistence/gaming_metrics.json`
See evaluator/gaming_detection.py docstring for full methodology.
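The Z-score detector flags a cycle whose improvement deviates sharply from the running history. A sketch of that first ensemble member, using the 2.5 threshold from the calibration above (function name illustrative):

```python
import statistics

def is_anomalous(history: list, new_improvement: float,
                 z_threshold: float = 2.5) -> bool:
    """Flag improvements more than z_threshold standard deviations from history."""
    if len(history) < 2:
        return False              # not enough data to calibrate
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return new_improvement != mean
    return abs(new_improvement - mean) / stdev > z_threshold

history = [100.0, 50.0, 33.0, 25.0, 20.0]   # past improvement percentages
print(is_anomalous(history, 24.0))    # False: in line with past cycles
print(is_anomalous(history, 400.0))   # True: likely gaming, escalate to review
```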
## Known Limitations

- LLM Quality: Evolution quality depends on the local model (tested with llama3.2:3b)
- Compute: 2-8 min/cycle; see Cycle Time Estimates
- Context Windows: Domain-specific fidelity checks (not ROUGE) mitigate truncation
- GPU Vendors: NVIDIA full support, AMD partial (ROCm), Intel CPU-fallback
- Research Only: Not production-ready
- Sandbox Escapes: Docker provides strong but not absolute isolation
- Calibration Drift: Re-calibrate anti-gaming quarterly with accumulated data
- Human Oversight Required: Configure review intervals appropriately
- Early Results: 6 cycles shown; longer runs pending external validation
- Single Implementation: May not generalize to other architectures
- Dual-Use Risk: See `evaluator/alignment.py` for ethical guidelines
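The fidelity checks mentioned under Context Windows boil down to verifying that key tokens survive summarization. A toy number-preservation metric, illustrative only (not the `context_summarizer` API):

```python
import re

def number_preservation(source: str, summary: str) -> float:
    """Fraction of distinct numbers in the source that survive into the summary."""
    src_nums = set(re.findall(r"\d+(?:\.\d+)?", source))
    if not src_nums:
        return 1.0
    kept = src_nums & set(re.findall(r"\d+(?:\.\d+)?", summary))
    return len(kept) / len(src_nums)

src = "Timeout is 30s per test, memory capped at 512MB, 9 files protected."
good = "Limits: 30s timeout, 512MB memory, 9 protected files."
bad = "There are some timeouts and memory limits."
print(number_preservation(src, good))  # 1.0
print(number_preservation(src, bad))   # 0.0
```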
## Contributing

```bash
# Fork, clone, and setup
git clone https://github.com/YOUR_USERNAME/aasms.git && cd aasms
make install-dev

# Create feature branch
git checkout -b feature/your-feature

# Run full CI locally before PR
make test-ci

# Push and open PR
git push origin feature/your-feature
```

**Code Style:** PEP 8 via ruff, type hints required, docstrings on public functions.
See CONTRIBUTING.md for full guidelines.
MIT License © 2026 Bradley R. Kinnard
```bibtex
@software{aasms2026,
  author = {Kinnard, Bradley R.},
  title  = {Adversarial Architectural Self-Modification System},
  year   = {2026},
  url    = {https://github.com/moonrunnerkc/aasms}
}
```

Made with 🧠 by @moonrunnerkc